Zero size chunks: how to solve it once and for all?

Great discussion. Keep it going!

In the case of zero-size chunks it’s even worse than that: even when duplicacy knows the file is useless, it simply ignores that fact.

Granted, I’m exaggerating a little bit, because it’s the check command that “knows” and the backup command that assumes, but you get the point.

Which brings us back to the fact that duplicacy is apparently not using existing hashing features of the backends it supports. (Or am I misunderstanding the meaning of transactional here?)

I wasn’t aware of this (I was aware of the command, but didn’t understand the use case). If this is so, shouldn’t something like -checkchunks be an option for backup? Or if commands should remain clearly distinguished from each other, shouldn’t this at least be part of the backup documentation as a kind of standard setup suggestion?
But I think I would prefer it as an option for backup because then it could act on it immediately and reupload whatever is broken.

I understand that this is the opposite of what @saspus wants in an ideal world, but my world, at least, is not ideal.

1 Like

You’re right - file transfers must be transactional; mutually so - it’s not the job of the backend alone to decide that a transaction was successful; it’s also up to the application that uses it. The backend storage handles it after having fully received the file - and here, it hasn’t received it.

Or, to put it another way, most backends in my view don’t provide enough functionality to ensure a file is atomically uploaded. Duplicacy is assuming too much (treating a fs like a database), and a universal solution is to put in an extra check (optionally, with a -verify flag) that ensures truncated files are very unlikely to happen - even in the case of a temporary lack of disk space, sudden disconnections, bugs in Duplicacy’s handling of API errors, etc…
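To make that concrete, here is a minimal sketch of what an opt-in post-upload check could look like. The Storage interface, function names, and the toy backend are all made up for illustration - this is not Duplicacy’s actual API - and the toy backend deliberately “loses” half of every upload, the way a remote that ran out of space mid-transfer might:

```go
package main

import "fmt"

// Hypothetical backend interface, for illustration only; this is not
// Duplicacy's actual storage API.
type Storage interface {
	UploadFile(path string, data []byte) error
	GetFileSize(path string) (int64, error)
}

// uploadWithVerify uploads a chunk and, when verify is set (think of an
// opt-in -verify flag), asks the backend for the stored size and hard-fails
// on any mismatch instead of silently carrying on.
func uploadWithVerify(s Storage, path string, data []byte, verify bool) error {
	if err := s.UploadFile(path, data); err != nil {
		return fmt.Errorf("upload of %s failed: %w", path, err)
	}
	if !verify {
		return nil
	}
	size, err := s.GetFileSize(path)
	if err != nil {
		return fmt.Errorf("cannot stat %s after upload: %w", path, err)
	}
	if size != int64(len(data)) {
		return fmt.Errorf("%s is truncated: sent %d bytes, backend reports %d",
			path, len(data), size)
	}
	return nil
}

// truncatingStorage is a toy backend that keeps only half of every upload yet
// reports success - standing in for a remote that ran out of space mid-transfer.
type truncatingStorage struct{ files map[string][]byte }

func (t *truncatingStorage) UploadFile(path string, data []byte) error {
	t.files[path] = data[:len(data)/2]
	return nil
}

func (t *truncatingStorage) GetFileSize(path string) (int64, error) {
	return int64(len(t.files[path])), nil
}

func main() {
	s := &truncatingStorage{files: map[string][]byte{}}
	err := uploadWithVerify(s, "chunks/ab/cdef", make([]byte, 2<<20), true)
	fmt.Println(err) // the size check catches the truncation immediately
}
```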

The explanation is right there. I’ve had to go through the process several times, so it works, is necessary, and contradicts your strange claim that there’s no recourse to repair the database. i.e. manually delete 0-byte chunks/snapshots, go straight to check -chunks (since a normal check doesn’t reveal truncated chunks), manually delete other truncated chunks and/or snapshots, re-run with check -files -r <latest>, prune with -exhaustive, etc…

Obviously, snapshot revisions where chunks/snapshots got truncated will be corrupted beyond repair - but they MUST be removed from the database lest they cause issues in the future.
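For the first of those manual steps, something as small as this sketch is enough to list the zero-byte files before deleting them by hand. It assumes the storage is reachable as a local or mounted path; the tool itself is hypothetical, not part of Duplicacy:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Walks a locally mounted storage directory and prints every zero-byte file
// under chunks/ and snapshots/, i.e. the candidates for manual deletion
// before re-running check -chunks.
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: findzero <storage-root>")
		os.Exit(1)
	}
	root := os.Args[1]
	for _, sub := range []string{"chunks", "snapshots"} {
		dir := filepath.Join(root, sub)
		err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
			if err != nil || d.IsDir() {
				return err
			}
			info, err := d.Info()
			if err != nil {
				return err
			}
			if info.Size() == 0 {
				fmt.Println(path) // zero-byte file: review and delete by hand
			}
			return nil
		})
		if err != nil {
			fmt.Fprintln(os.Stderr, "walk error:", err)
		}
	}
}
```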

The point is, if a backup fails, it’s evidently possible the backup storage can be left in a bad state - and Duplicacy will go on creating backups and pruning chunks and not even a regular check will tell you anything is amiss. The average Duplicacy user isn’t wise to this fact.

Which is expensive, and unnecessary, when the hashes are already available during upload and most backends support remote checksumming. Even just checking the file size would be a nice first step!

Actually, the normal check command doesn’t know that either. You have to use -chunks or otherwise re-download to highlight a problem that occurred during the initial upload. Which, of course, sounds ridiculous because it should be detected and prevented right then.

Edit: BTW, it might sound like I’m being super critical of Duplicacy. I do trust it with my backups (since I do 3-2-1). Yet I’m super paranoid these days when a backup fails - having seen how commonly truncated data happens - and run deep checks more regularly than perhaps should be necessary.

2 Likes

Support for transactions is a property of the service, not the client: the backend must ensure that the file is saved in its entirety, or not at all. Every protocol passes the expected file length when uploading, and integrity is guaranteed by the transport protocol. So it’s up to the backend not to drop the ball.

Yes! And how is the client supposed to determine if it was a success? By looking at what the “put/upload/store/whathaveyou” API returned. Because only the backend knows if it managed to save the file successfully, not the client.

So, my reasoning is simple: by suggesting that duplicacy should verify what has actually happened after a file got uploaded and the backend returned success, you are implying that at least the upload API of said backend cannot be trusted. Which means the whole backend cannot be trusted. Which means – stop using it.

  • D: Here, take this 2MB file A
  • B: Received A: Success
  • D: (Prepares to leave, then turns around abruptly, squints) – what is the size of file A I just gave?
  • B: 1.4MB
  • D: Aha! I got you evil bastard!

You suggest: have duplicacy retry until succeeded.
I suggest: have duplicacy do nothing. When the discrepancy is detected – hard fail, and tell the user to stop using this unreliable backend.

There is probably a middle ground somewhere – to detect some of the most common discrepancies early. Maybe what you are saying is OK – check the size after uploading. However, this will double the API cost for all users, and does not provide any guarantee – a bad backend can drop the ball somewhere else. It seems hardly worth it to penalize most users to marginally improve the chances with a couple of bad apples.

Right. It knows that the chunk file with the specified ID is present, and the [good] backend shall guarantee through mechanisms described above its integrity.

Exactly my point. It’s expensive and unnecessary. But if you don’t trust the backend to handle hashing during upload, why do you trust it to handle it later? What is different? And so, since the backend can lie (by saving the hash and returning it, as opposed to actually hashing the file, as described above), the only way to actually check is to download.

This is not a repair, it’s damage control. Data loss has already occurred, the version history has been lost. If the “repair” involves deleting versions and snapshots, it’s not a repair. You don’t heal a headache with a guillotine.

Train of thought: This backend caused me to lose data → I’m not using that backend anymore. That’s the only solution, not more code in duplicacy.

Good point. Does it mean you recommend scheduling a check by default? For example, when creating a backup schedule, also automatically create a check schedule? The user can always delete it if they don’t want it, but the default config will be safe.

However, I’m strongly against adding code to the backup procedure that is motivated by distrusting the remote.

It’s not common (in my anecdotal experience). Adding more checks marginally improves the chances of having a healthy backup on a bad remote, while costing all users more in API calls and performance. Switching to a remote that does not suck improves them drastically.

But there is no difference! Detected and prevented right then – means downloaded right after uploading. There is no other way to check. You need to do it with every chunk file if you don’t trust upload API.

It does not matter if you do it right after every chunk or in bulk (check -chunks) after the entire backup routine has completed.

The benefit of the second approach is that for remotes that are trusted (which is the majority) there is no need to double-check them. If you do it during backup – then you impose overhead on all users, most of whom don’t need that.

It looks like we are going in circles, but I think we are converging to something.

1 Like

I don’t understand this. The logs I posted above seem to suggest something else. Here is the beginning of the log file where you can see that I’m not using -chunks and yet it is telling me which chunks are zero size.

Running check command from /tmp/.duplicacy-web/repositories/localhost/all
Options: [-log check -storage pcloud -threads 4 -a -tabular]
2021-10-30 05:00:01.611 INFO STORAGE_SET Storage set to webdav://*********@webdav.pcloud.com/Backup/Duplicacy
2021-10-30 05:00:03.224 INFO SNAPSHOT_CHECK Listing all chunks
2021-10-30 05:16:41.861 WARN SNAPSHOT_CHECK Chunk d06b4e277f99a5ea0f0745a27e57d3e633b98a4849a6a0b885f7d2710f0d1b7f has a size of 0
2021-10-30 05:16:43.385 WARN SNAPSHOT_CHECK Chunk 51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644 has a size of 0
2021-10-30 05:21:53.387 INFO SNAPSHOT_CHECK 15 snapshots and 1730 revisions
2021-10-30 05:21:53.447 INFO SNAPSHOT_CHECK Total chunk size is 3038G in 2762586 chunks
1 Like

Apparently, duplicacy does check for zero file sizes:

since 16 months ago:

1 Like

So all except B2 and S3 cannot be trusted?

What I’m suggesting is that Duplicacy made a design choice that heavily relies on the assumption that a filesystem can substitute for a database, and that all such storages provide a robust enough API to allow it to function in that way. I don’t think they do, or necessarily can - without certain extra safeguards.

Never once said or implied that. I want Duplicacy to fail gracefully and not leave its ‘database’ in a corrupted state.


B: 1.4MB; I may, or may not, be able to inform the client I ran out of disk space, but at least I won’t leave my filesystem in a corrupted state
D: OK, let’s not leave this broken file lying around like a lego brick; don’t want my database corrupted after all.
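A sketch of what I mean by not leaving the broken file behind - again with a hypothetical Storage interface rather than Duplicacy’s real one - where any failed or short upload gets a best-effort delete so the ‘database’ stays consistent:

```go
package sketch

import (
	"errors"
	"fmt"
)

// Hypothetical backend interface, for illustration only; not Duplicacy's real API.
type Storage interface {
	UploadFile(path string, data []byte) error
	GetFileSize(path string) (int64, error)
	DeleteFile(path string) error
}

// UploadOrCleanUp tries not to leave the storage in a bad state: if the upload
// errors out, or the backend reports a different size than what was sent, the
// partial file is deleted (best effort) so no truncated chunk lingers.
func UploadOrCleanUp(s Storage, path string, data []byte) error {
	err := s.UploadFile(path, data)
	if err == nil {
		var size int64
		if size, err = s.GetFileSize(path); err == nil && size != int64(len(data)) {
			err = fmt.Errorf("short write: sent %d bytes, backend stored %d", len(data), size)
		}
	}
	if err != nil {
		// Don't leave the broken file lying around like a lego brick.
		if delErr := s.DeleteFile(path); delErr != nil {
			err = errors.Join(err, fmt.Errorf("cleanup of %s also failed: %w", path, delErr))
		}
	}
	return err
}
```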

API cost would be optional (via a -verify flag) and much cheaper than re-downloading the chunk.

Of course it provides a guarantee; that the file sent was received in full! The minimum I’d expect of an application that wishes to treat a fs as a transactional database. The mechanism isn’t meant to safeguard against bit rot or other issues after the fact, only that a transaction 100% completed.

It’s like arguing erasure codes are a waste of time because ‘nothing should go wrong’. In fact I’d argue the implementation of erasure codes is similar in desirability, and yet truncated chunks created during backup arise more frequently than bit rot ever would.

Dunno why you insist on this line of reasoning when I’ve already demonstrated why it came to be broken (temporary lack of disk space), why it had to be fixed quickly (because Duplicacy silently continues and pretends nothing is wrong), how (manually deleting chunks/snapshots), and that it was ultimately successful (disk space was freed up and most of the version history left intact). AKA - a repair.

But major hassle.

Do you suggest a filesystem should be left corrupted, when it runs out of disk space? That it should be abandoned? Throw out the drives and start again. :slight_smile:

If not, why then, should Duplicacy’s ‘database’ be so easy to corrupt, just by running out of space?

Again, an optional -verify flag.

Overhead is mostly on the backend, and the rest can be mitigated, if desired, with an extra thread or two (-threads).

Furthermore, it should be relatively straightforward to implement, without changing the database format.

You’re absolutely right, this is a relatively new addition, and wasn’t present when I last had to tidy up a broken storage due to 0-byte chunks.

However, it won’t catch truncated chunks that aren’t 0 bytes in size, which I’ve personally witnessed.

1 Like

I would not conflate the API specification with the implementation. For example, Minio running on a disintegrating array without checksumming will not be very robust, while Amazon’s hosted storage service behind the same API spec likely will be much better. Or the case at hand – webdav and pcloud. These issues likely have nothing to do with webdav, and everything to do with the lean, aggressively corner-cutting business pcloud seems to be running (I’m speculating, I have no idea what’s going on there).

Yes, that seems to be the case, and it is not an unreasonable assumption (backed by API documentation!) with most services; with others – duplicacy implements a workaround via .tmp files (only after the .tmp file is successfully uploaded is it renamed in place).
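Roughly this pattern - sketched here for a local filesystem target, purely for illustration and not a copy of duplicacy’s actual code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// saveAtomically illustrates the upload-to-.tmp-then-rename workaround:
// the final chunk path only ever comes into existence via rename, which is
// atomic on a POSIX filesystem, so readers never see a half-written chunk.
func saveAtomically(path string, data []byte) error {
	tmp := path + ".tmp"
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(tmp) // don't leave a truncated .tmp behind
		return err
	}
	if err := f.Sync(); err != nil { // flush to disk before the rename
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	err := saveAtomically(filepath.Join(os.TempDir(), "chunks", "ab", "cdef"), []byte("chunk data"))
	fmt.Println(err)
}
```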

But that is not enough. Your HTTPS PUT request also returned a status code that said the request succeeded. Why is that not enough? It’s just another API.

Yep. I’d still (passionately) argue it’s a misplaced and dangerously misleading feature: it gives users an excuse to continue relying on rotting media as a backup destination, justified by a slight reduction in the chance of data loss. I’d rather users move to better storage than implement a bandaid that does not change anything in the grand scheme of things (and by that I mean the design does not provide a guarantee, the way ZFS would). But it’s a different conversation.

I guess there are two separate things.

  1. High priority: Detecting flaky remotes and notifying users. A remote that runs out of space and cannot propagate the error back to the client is a bad remote and should not be trusted with other tasks either. (It may start overwriting or deleting existing files too, “to save the filesystem”; and if the architects think that saving the filesystem is more important than customer data – I’d like to know about that sooner rather than later, to never ever even come close to that “service”.)
  2. Low priority: Duplicacy shall be able to self-cleanup. Automatically. Without manual intervention. Data loss has already occurred, so all that find-affected-snapshots-and-delete-along-with-orphaned-chunks business should be done automatically, with user notification. There is no reason not to clean up. Yes, we can call it “repair”. I call it low priority because data loss has already occurred. The system failed: user data was lost. There is little consolation in duplicacy being able to proceed dumping more data onto an unstable remote that allowed data loss to occur. (Actually, maybe it’s a good thing that it does not auto-heal: better chance users will start over with different storage? Probably not.)

Zero-size chunks are the one very specific corner case that can be (and, as it turns out, already is) checked for quickly. For detecting any of the other possible issues (e.g. a chunk that is 1 byte shorter, a chunk corrupted in the middle, etc.), or this one:

– you need a full download. Asking the remote for a hash is pointless: (hey, do you still have my file with this hash? Yes, I do. You can’t deduce anything from this information. Remember the recent Backblaze bug – the data was correct, and yet the API returned crap.)
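In other words, the only verification worth anything is the one the client does itself: download the bytes and hash them locally against what the client recorded at upload time. Something along these lines - the Downloader interface is hypothetical, and a plain SHA-256 stands in for whatever hash the client recorded when the chunk was created:

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Hypothetical downloader interface, for illustration only.
type Downloader interface {
	DownloadFile(path string) ([]byte, error)
}

// VerifyChunk re-downloads a chunk and compares a locally computed SHA-256
// against a hash recorded by the client at upload time. The remote is never
// asked for a hash - the client does all the work, so a backend that "lies"
// about integrity cannot pass this check.
func VerifyChunk(d Downloader, path, expectedHex string) error {
	data, err := d.DownloadFile(path)
	if err != nil {
		return fmt.Errorf("download of %s failed: %w", path, err)
	}
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != expectedHex {
		return fmt.Errorf("%s is corrupt: expected %s, got %s", path, expectedHex, got)
	}
	return nil
}
```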

That’s why it’s imperative to pick reliable remotes (whatever that means), and detecting a bad remote is therefore more important than fixing the datastore after data has already been lost.

Nobody can know that. There can be a chunk file that affects all snapshots in the datastore. Data loss is data loss. 1 byte loss == fail. Might as well scrap the whole backup and start over. On a new remote.

So, to summarize:

  1. We want duplicacy to detect a corrupted datastore asap. Zero-chunk detection, which is already in place, plus an immediate check after each backup, covers the range of reasonable accommodations. Maybe do that for the first few months with a new remote.
  2. Users shall be coached to abandon remotes that fail that check even once, as being poorly designed and unreliable.
  3. This is with the understanding that relying on any means of verification served by the remote itself (e.g. asking for a hash or health status) that hinges on the remote being honest is pointless.
  4. On a completely unrelated note: yes, duplicacy shall be able to self-heal the state of the datastore, since the “repair” we discussed does not increase data loss, but merely allows duplicacy to proceed. Actually, I would consider the inability to do that today a high-priority bug.

Do you (partially?) agree?

1 Like

Indeed, this only happens on certain backends (local and sftp) - which begs the question, how are truncated chunks (without the .tmp extension) still happening on these?

During a check only, not during a backup.

There might be numerous backups and even a prune or several (if configured that way) before the next check is run.

A check is an expensive operation when you have a lot of chunks and an off-site remote, so it’s normally run much less regularly. Likewise, a regular check -chunks is just as expensive, since each run has to ListAllFiles(). This process alone takes 12 minutes on my G Suite storage. 30+ mins on various Vertical Backup stores on local sftp. Just to list files!

Not to mention, a prune can further ‘corrupt’ a storage when it’s already missing chunks (or has truncated metadata chunks) - leaving more and more bad snapshots around. (No .fsl for snapshots.)

Duplicacy isn’t responsible for bugs in the backend, nor bit rot, or any of that. It is responsible for using adequate workarounds (renaming from .tmp etc.) and putting in extra safeguards when that’s not possible. Basically, a design choice, in lieu of a real database.

Well yes, I can know that, since I have to run a full gamut of checks and/or restores to purge all bad chunks/snapshots. These are simply facts, and they come with extensive experience.

In the case of lack of disk space, this has always meant every backup revision created after the initial occurrence cannot be trusted and probably should be manually pruned. Everything before that is perfectly fine (one great aspect of Duplicacy’s design, at least).

We have different ideas about what asap means. :slight_smile:

This is not about a particular user’s individual remote, though. This is about each remote type which Duplicacy directly supports being suitable for the job. Is WebDAV unreliable? Where is the disclaimer, that a bad internet connection, may render it corrupt?

Self-heal would be nice, I agree, but not leaving the database in a corrupted state in the first place would be better. Isn’t that even more of a priority bug?

Do you agree there? :slight_smile:

What might be useful for everyone here is to extensively test each of their backend storages against specific failure scenarios - e.g. out of disk space (quota), temporary or permanent loss of connection - to more properly gauge the extent of the issue. IMO, the outcome will surprise many.

1 Like

Are they? The only possibility of this happening I can imagine is a buggy sftp implementation – one which reports success on a truncated upload. Are you aware of a reproducible case? It needs to be reported to the sftp server vendor. It would be interesting to try to reproduce – I’ll try to do that later.

Of course. The default workflow must be designed on the assumption that the software and services work. To catch exceptions there is check. (Of course, I’m not talking about cases where the operation fails and the backend returns failure; these are still normal. I’m talking about the operation failing and the backend returning success, which is the source of all the issues here.)

Ideally it should never need to run, so running it infrequently is not bad – it’s expected.

This is about right. Why is this a problem though?

Well, this involves a lot of random IO, and unless the cache is hot it will be ridiculously slow. I would slow it down further and let it take 8 hours instead, to minimize impact on performance.

The point being – a high performance cost for an infrequent and non-mandatory operation is preferable to even a minimal increase in overhead on the main execution path – which is backup.

How can prune corrupt anything with missing chunks? All it can do is delete more of the chunks that need to be deleted anyway.

Correct. It must be designed in such a way as to be able to tell, given the backend behaves per spec, whether the operation succeeded or not, and to properly handle failure, e.g. by retrying. As far as I understand, it is designed this way. Hence, failures are due to the clause in italics being violated.

If you are talking about the undetected full-disk scenario where subsequent backups are corrupted – then yes, previous ones will be unaffected. But in the case of flaky storage, it is not impossible to imagine that one very old chunk is referenced by all snapshots, and damaging that chunk kills the entire datastore.

Ok, for me asap means “at first check”.

No no, it is specifically about each individual remote, not the remote type. As I wrote somewhere above – there can be an excellent WebDAV remote and an absolute random number generator hiding behind S3. All protocols have error reporting built in, but not all implementations do a great job adhering to the spec.

No, that would be horrific. “We report an error when the operation fails, except when we don’t.” Web APIs usually return a server response – so when a request comes in – hey, take these 500 bytes – the only way the server shall return success is when it indeed received those 500 bytes and saved them successfully. Any disruption means either an error returned or a broken connection (another error, returned by a different subsystem). So yes, the WebDAV spec is rock solid. It says nothing about pcloud’s implementation of said spec.
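For illustration, this is all the client-side contract amounts to - a sketch of a bare HTTP PUT where anything other than a 2xx response (or a broken connection) is treated as a hard failure:

```go
package sketch

import (
	"bytes"
	"fmt"
	"net/http"
)

// PutChunk sends a chunk with a plain WebDAV-style HTTP PUT. The request
// carries Content-Length, TLS/TCP guarantee integrity in transit, and the
// server should only answer 2xx once it has received and stored the full
// body. Anything else - an error status or a broken connection - is treated
// as a hard failure by the client.
func PutChunk(client *http.Client, url string, data []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return err
	}
	// net/http sets Content-Length automatically for a bytes.Reader body,
	// so the server knows exactly how many bytes to expect.
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("PUT %s failed: %w", url, err) // e.g. connection dropped mid-upload
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("PUT %s: server answered %s", url, resp.Status)
	}
	return nil
}
```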

No, the bigger priority is not letting the datastore get into a corrupted state in the first place. The only way to do that reliably is through the magic of not using remotes that don’t adhere to the published API (regardless of the degree of disparity – a 0-tolerance policy; it does not matter whether the remote truncated a file or saved a copy from /dev/urandom instead). Which remotes are those? Those that failed check -chunks at least once.

It would be great if duplicacy had telemetry (with proper PII scrubbing, of course) to collect those statistics, even simply a graph of the count of check failures per invocation per remote.

For example, I posted yesterday about what seemed to be a google backend issue where duplicacy would misidentify an existing file as missing. As a result, I’m going to retest with my own service account, and if the issue persists (i.e. it is not related to the shared google cloud project that issued the credentials) I’ll be done with google drive and will move to google cloud.

Yes! Let’s do that. I volunteer to play with SFTP server in VM. Would be interesting to reproduce zero-chunks-on-full disk use case.

I can also report on google drive over the last few years: no data about quotas, but it tolerates abrupt connection changes very well. Not once did it fail in the past 3 (4?) years – and I carry my laptop between locations, ethernet/WiFi, and sleep-wake. There was never any data loss, but more than enough datastore corruption with prune, in the form of orphaned snapshots when prune is interrupted. That’s a bug in duplicacy, not google drive, and needs to be fixed.

I won’t be surprised if the issues turn out to be limited to small-scale providers – a Synology NAS (:face_vomiting:) in the closet, pcloud, b2, the defunct hubiC, and others who did not invest enough in QA to iron out corner cases.