Zero size chunks: how to solve it once and for all?

We have a number of posts about missing chunks and zero size chunks on the forum. The basic instructions for what to do are here, and a more in-depth discussion is here.

I’m raising this issue once again because I’m once again struggling with zero size chunks (as well as missing chunks) and I’m wondering how this can be solved once and for all. For me, backups are something I want to set up once and then never think about again until I need to restore something. This is clearly not the case with my current duplicacy setup, and I wonder whether that is due to how duplicacy works or because of my setup.

The problem is actually twofold: one is that missing or zero size chunks come into being every now and then, and the other is that duplicacy doesn’t do anything about it. The best moment to fix a missing or zero size chunk is obviously right after it was discovered…

How are other people dealing with this problem?

2 Likes

What storage back end is it with?

And to answer the question, I personally don’t deal with it, because zero size chunks never happen to me (it must be a backend failure — it should not keep incomplete files) and missing chunks only ever occurred on half-pruned snapshots where duplicacy failed to delete the snapshot file after getting rid of the chunks (either due to permissions or because it was simply interrupted).

There are at least two possible solutions:

  • for prune to first delete the snapshot file and only then start deleting chunks. gchen is against this because, in this scenario, if the process is interrupted before every last eligible chunk is removed, nobody will know and the database will keep accumulating a lot of orphan chunks. Most users won’t bother, or know, to run prune in exhaustive mode

  • hence, modifying the above approach slightly: for duplicacy to employ a “two-step snapshot deletion” approach: before pruning, mark the snapshot as a candidate for deletion and keep it around until all chunk processing is done, and only then delete the marked snapshot (a rough sketch follows below).
    On subsequent prunes: check for the presence of marked snapshots — that would mean their prune was interrupted. Start nuking those snapshots again until done.
    On subsequent checks: skip marked snapshots and therefore avoid the bogus “missing chunks” messages these ghost snapshots reference.
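A rough sketch of that two-step flow, emulated by hand against a local disk storage (the paths and the “.deleting” marker are made up for illustration; duplicacy does not do any of this today):

    # Hypothetical emulation of the proposed two-step snapshot deletion on a
    # local disk storage; paths and the ".deleting" suffix are illustrative.
    STORAGE=/mnt/backup/duplicacy-storage
    SNAPSHOT="$STORAGE/snapshots/NAS_config/325"

    # Step 1: mark the snapshot as a deletion candidate instead of removing it.
    mv "$SNAPSHOT" "$SNAPSHOT.deleting"

    # Step 2: delete the chunks referenced only by the marked snapshot
    # (omitted here -- this is the work prune already does).

    # Step 3: only after every eligible chunk is gone, drop the marker.
    rm "$SNAPSHOT.deleting"

    # A later prune could look for leftover markers to detect an interrupted
    # deletion, and a later check could skip them to avoid bogus warnings:
    find "$STORAGE/snapshots" -type f -name '*.deleting'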

1 Like

It’s pCloud via WebDAV.

Yes, I wouldn’t be surprised if this was a backend failure. But even if that is the case, I would expect my backup program not to ignore this and pretend that the chunk exists when it really doesn’t.

If I understand you correctly, zero size chunks should never exist, so why doesn’t duplicacy just delete or rename them? Ideally, that would already happen during backup, but I understand that might be pushing the boundaries of the backup command. So could check take care of it? It’d probably be stretching the check command even more (since “check” suggests that no files are modified), but I still think that adding a -fix option to the check command would make sense. With that option, check would delete any zero size chunks immediately.

If @gchen doesn’t want that, how could this be achieved with a script?
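Until something like that exists, here is a minimal sketch of such a script, assuming the storage can be reached as a filesystem (a local disk, an sftp mount, or a WebDAV mount such as pCloud via davfs2 — the mount path below is made up):

    # Find zero-size chunk files and review them before removing anything.
    STORAGE=/mnt/pcloud/duplicacy-storage
    find "$STORAGE/chunks" -type f -size 0 -print

    # Once satisfied, delete them (or move them aside first).
    find "$STORAGE/chunks" -type f -size 0 -delete

    # A following "duplicacy check" then reports which snapshot revisions
    # referenced the deleted chunks, so those revisions can be pruned.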

Your suggestion to introduce a “two-step snapshot deletion” sounds good to me. What does @gchen say?

It’s not just a backend problem - you can run into issues with 0 byte chunks just by running out of disk space.

It’s a pain to clean up, as you have to manually delete those 0 byte chunks (or snapshot files) before deciding what snapshot files to manually delete, or how to prune revisions to free up space (turns out, running -exclusive with -keep is not a good idea).

I certainly favour a two-step fossil deletion process for snapshot files but I’d really like Duplicacy to perform remote checksumming (can be done with most cloud providers and even sftp) to verify integrity of uploaded chunks in a final pass, before saving a snapshot.

2 Likes

… on the backend. In which case the backend is supposed to fail the upload call. If it does not – it’s a bug in the backend. Shall duplicacy work around backend bugs — that’s the big question.

Or are you referring to running out of space locally? If duplicacy proceeds with a partial chunk anyway, that’s a duplicacy bug; it needs to be reproduced and fixed. This is, however, a rather unusual scenario, and when the disk is full to the brim there are usually many other issues and plenty of warnings.

IMO it’s a matter of where to draw the line. Ultimately, Duplicacy, or any other client, has to trust the backend API not to lie. Zero size files are likely not the only issue. Taking care of them would be silencing the symptoms without addressing the root cause. (I.e. why only check for 0 size files? How about truncated ones? Corrupted?)

When duplicacy says: hey, take this blob of length N (with hash H, for some backends), and the backend says – yep! done, saved successfully! – duplicacy has to trust that it was indeed transferred and saved successfully.

Otherwise, shall it download the chunk immediately after and verify? Maybe download it again in an hour to check it’s still there? What if the backend develops rot? What if the backend, when asked to download immediately, returns data from a RAM cache while the data on disk is already bad or was never (fully) written? Shall it check again in a month? Shall we have to routinely run check --chunks --files after every backup and daily? Would this not negate all the benefits of using cheap backends by generating massive egress?

Most backends support passing a checksum/hash and/or length along with the put request or equivalent. For those that don’t, duplicacy uploads to a temp file, and only when the upload succeeds renames the file to the intended name. If the upload is uninterrupted, duplicacy must trust that the full file was uploaded.
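For illustration, on a plain filesystem destination that rename-after-upload pattern looks roughly like this (file names are made up):

    # Upload to a temporary name first; the chunk only becomes visible under its
    # final name once the transfer completed without interruption.
    SRC=/tmp/chunk.bin
    DST=/mnt/storage/chunks/51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644

    cp "$SRC" "$DST.tmp" && mv "$DST.tmp" "$DST"

    # If the copy is interrupted, only the .tmp file is left behind, so an
    # incomplete chunk should never appear under its final name.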

I don’t know what’s the deal with pcloud, and whether it ignores or does not report failures, but if the API is lying – not unlike the recent issue with OneDrive Business that allowed incomplete uploads – it either must be fixed at the backend or that backend should not be used.

WebDAV specifically is not the best protocol for handling bulk storage; my experience both with HubiC and Yandex Disk over WebDAV was horrendous, so maybe it’s not a problem with the cloud but rather an inherent symptom of misusing WebDAV? No idea.

If we start implementing workarounds for flaky backends duplicacy will become a bloated unsupportable mess with unpredictable non-deterministic performance and cost (e.g. for all those spurious downloads for verification) very quickly.

Of course, as a workaround you can do that; since duplicacy keeps a log of checked chunks you can routinely run check --chunks after every backup; every chunk will get downloaded and checked once. However, having to do that is a clear signal to abandon that backend: if you can’t trust your storage even in the short term, can you really trust it not to mess up the data in the long term? I wouldn’t. I’d rather throw 3x the money at the problem and be sure that it’s solid than get a discount on a demonstrably flaky service.
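In practice that workaround can be as simple as chaining the two commands in whatever scheduler you use (the repository path is illustrative; duplicacy remembers which chunks already passed -chunks verification, so each chunk is downloaded at most once):

    # Run from the repository; verify newly uploaded chunks right after backup.
    cd /path/to/repository
    duplicacy backup -stats && duplicacy check -chunks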

2 Likes

Thanks a lot for your detailed explanation. I do see the point of not working around faulty backends.

So you’re implying that if the temp file has zero bytes, it will still be renamed and thus create a faulty chunk? I see. This is really interesting, since I didn’t know those details of the upload process.

So, if checksum/hash/length is supported by the backend, duplicacy will use it, and it is therefore up to the backend to report if the checksum/hash/length doesn’t match. I suppose in that event the standard procedure is that the backend throws an error and duplicacy re-uploads, right? It would then seem to me that we can virtually rule out chunks getting corrupted during upload with such a backend, as it would be a pretty serious bug if the backend didn’t produce that error. (Alternatively, duplicacy could report the wrong size, for example, but I guess we can rule that out too, since the backend would then have to make the same error and report OK. Except, maybe, in the case of 0 bytes. Can we be 100 percent sure that duplicacy will never report a length of 0 bytes?)

When the backend does not support checksum/hash/length, duplicacy does its best by using a temp file, but that only safeguards against interrupted uploads, not other errors. Would WebDAV be such a backend? Or what kind of problems did you have with WebDAV?

Finally, one more suggestion regarding how duplicacy could perhaps simplify things for users with problematic backends, without working around those backend flaws: when check finds zero size chunks, it reports them first thing in the log, but when it goes on to check the snapshots, it is still happy to tick off those faulty chunks as existing whenever a snapshot points to them. Technically, this is of course correct, but wouldn’t it make sense if duplicacy took note of which snapshot revisions are using a zero size chunk and listed those at the end of the log?

The advantage would be that the user would not have to first remove the zero size chunks and then run check again to see which snapshots are affected, but could directly go ahead and delete those faulty snapshots. At least in my current case, that would have made things much easier for me, as it turned out that those two zero size chunks were used in only one snapshot (revision), which I had no problem deleting.
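For what it’s worth, the two-pass workflow I ended up with can be scripted roughly like this against the check log (the log path is made up; the messages match the log excerpts I quote later in this thread):

    # 1. Pull the IDs of the zero-size chunks out of the check log.
    grep 'has a size of 0' check.log | grep -o '[0-9a-f]\{64\}'

    # 2. Delete those chunk files on the storage, then re-run check; the
    #    affected revisions show up as "... referenced by snapshot <id> at
    #    revision <n> does not exist" warnings.
    duplicacy check 2>&1 | grep 'does not exist'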

Final question: assuming that WebDAV can work with checksum/hash/length, what exactly would I ask pCloud support about their WebDAV implementation, and what would I want them to do to make it more robust?

Since version 2.6.0, the CLI will report an error when encountering a chunk of size 0: Release Duplicacy Command Line Version 2.6.0 · gilbertchen/duplicacy · GitHub

I can’t agree more!

3 Likes

Yea, it should - because it’s not the backend that’s at fault. We’ve been over this before.

I cited out-of-disk-space as an example of how Duplicacy can get its knickers in a twist, but it’s not the only failure mode. Broken connections can do the same. However, out-of-disk-space is common, and fs APIs have calls that report it, so I don’t think it’s unreasonable to expect applications to fail gracefully when the storage itself is working fine and is simply out of space or disconnected.

Duplicacy doesn’t - at least not for sftp in my case, and now, it seems, for certain cloud backends in others’. This is a logic problem and rather worrying, because local and cloud supposedly upload files differently; with cloud, it’s in-place - with local, files are renamed after upload - yet we’re seeing these issues with both types - how? Maybe Duplicacy is assuming too much…

As far as I understand, Duplicacy doesn’t use any type of checksum verification on any of the storage types (maybe size verification in some cases) - despite most backends being more than capable. Conversely, Rclone uses checksums by default on all its transfers, where supported.

IMO, verification by checksum needs to be part of a backup tool. Especially when that backup tool is designed to rely merely on a chunk’s existence for subsequent backups.

0-byte chunks/snapshots are the easiest to detect, but truncated chunks are a problem too. See the thread I linked above.

2 Likes

Yes, it does, and that is great. But then it goes on to accept those very same 0-byte chunks as “existing” in whichever snapshot revision they are required.

Here are the details, starting from the end of my fixing process.

After I deleted my two zero-size chunks, check identifies a single snapshot revision (NAS_config at revision 325) that’s using these deleted chunks:

2021-11-01 00:16:54.857 WARN SNAPSHOT_VALIDATE Chunk 51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644 referenced by snapshot NAS_config at revision 325 does not exist

Here are some lines from the check-log from before I removed those chunks:

2021-10-30 05:16:41.861 WARN SNAPSHOT_CHECK Chunk  d06b4e277f99a5ea0f0745a27e57d3e633b98a4849a6a0b885f7d2710f0d1b7f has a size of 0
2021-10-30 05:16:43.385 WARN SNAPSHOT_CHECK Chunk 51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644 has a size of 0

So duplicacy is warning me about those chunks being zero in size. But further down, in the same check log, it reports that all is fine with snapshot NAS_config at revision 325:

2021-10-30 05:54:12.887 INFO SNAPSHOT_CHECK All chunks referenced by snapshot NAS_config at revision 325 exist

In my mind, that doesn’t make sense (even though it is technically correct). If duplicacy had flagged the snapshots affected by the zero size chunks already during the first check (i.e. NAS_config at revision 325), I would have just deleted that snapshot and moved on with other things.

Why would the check command tick off a revision as being complete, when it knows that some of its chunks are invalid?

1 Like

Duplicacy does checksum verification when uploading to S3 and B2, but not other storages. There are already two levels of built-in checksums (file-level and chunk-level), so I feel that it isn’t really necessary to have another checksum to just ensure the integrity of the upload step. Data corruption should not happen over https connections if my understanding is correct.

Rclone is different. It is a sync tool, so this kind of checksum is the only way to verify that files are transferred correctly.

1 Like

Well, that is the question, I guess.

If that were the case, this thread wouldn’t exist. Maybe it’s not https’s fault, but problems obviously exist.

1 Like

Could this not be expanded to other storages? Most cloud storages, and even sftp, do support remote hashing. We’re not talking about storing chunk- or file-level checksums - just using known data at the time of upload to verify that a chunk was actually uploaded in one piece.

Plus I’m not sure how a sync or backup tool should differ in wanting to make sure files were sent correctly(?).

1 Like

I’m just genuinely curious: why does Duplicacy perform checksum verification when uploading to S3 and B2, but not for other providers? Is checksum verification more resource/bandwidth-intensive than the file-level and chunk-level checksums?

S3 and a few others support server-side verification via the API. With the rest, the only way to check is to download the file back and verify. This download bit is unexpected and often expensive and time consuming.
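As an illustration of what that server-side verification looks like at the API level: S3-style uploads can carry the MD5 of the payload, and the server rejects the object if the received bytes don’t match (the bucket and key names below are made up, and this is the raw API, not literally what duplicacy’s S3 backend calls):

    # Compute the base64-encoded binary MD5 of the chunk and send it with the put.
    MD5=$(openssl md5 -binary chunk.bin | base64)
    aws s3api put-object \
        --bucket my-duplicacy-storage \
        --key "chunks/51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644" \
        --body chunk.bin \
        --content-md5 "$MD5"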

If you have to do it – you can run duplicacy check --chunks after every backup, and the chunks that have not yet been validated will get downloaded and checked. But this means you are generating egress equal to ingress, and that may be expensive and, in the vast majority of circumstances, not only unnecessary but also non-actionable, beyond perhaps telling you not to use that specific backend anymore.

1 Like

Ah, so it’s a case of S3 and B2 being superior to other backends and Duplicacy taking advantage of server-side verification. It might be time for me to switch to one of them.

1 Like

Define “superior”. Yes, API-based bulk cloud storage services will work much better for backup than file-based ones: S3 will be significantly more robust than WebDAV — those are designed for different purposes. WebDAV was never meant to handle millions of small files; it’s an HTTP extension for sharing documents for collaboration. But still, this is probably not the main reason to pick one versus the other.

Example: I use google drive as a backend. Zero issues (with an asterisk — meaning, zero issues I did not know how to handle). It’s built on the same google cloud service you can use directly yourself, but it is definitely not designed for bulk storage. It’s a collaboration service. I’m effectively abusing it by using it as a backup backend, and yet, it was and is solid.

B2, on the other hand, had some issues recently with returning bad data, and performance is not uniform — see the recent post of someone not being able to get any reasonable bandwidth from it. Yeah, it is 2x-4x cheaper. You get what you pay for.

It all boils down to company culture and how they handle development, testing, quality control, and maintenance. I would trust google and Amazon to keep my data healthy but not pcloud, hubic, iDrive, Backblaze, or other little companies with little side projects and three overworked QA engineers and flaky processes (judging by the outcomes).

Reality is — if you want reliable backup you should not really be using *Drive services as backends, regardless of vendor. It’s a minefield of gotchas — for example, the eventual consistency of virtually all those services means you can end up with duplicate folder names. Unless your backup tool is designed to handle it, you will see weird issues unless you are very deliberate in how you design your backup workflow (as a side note, rclone dedupe can fix those — guess why I know this). Then there is throttling, anti-abuse features, and various other performance limits that are OK for the intended use of sharing documents but are there on purpose to discourage abuse, or as a side effect of compromises taken. (For example, don’t even try to download with multiple threads from Dropbox or OneDrive.)

It just so happens that google figured out how to build reliable web services, and google drive happens to be a robust enough, ultra cheap solution that has historically worked well with minimal pitfalls. It’s a unicorn. Every single other *Drive service happens to be unsuitable as a backup target for any reasonably large dataset.

So yes, you are right, for a set-it-and-forget-it, rock solid approach you should go by market share and use Amazon S3, Google Cloud or Microsoft Azure (in that order). That would be expensive. Very. But how much is your data worth to you?

If cost is not an issue (whether because you have a small dataset or deep pockets), then that’s where the story ends.

Otherwise you can optimize further based on the type of data you store and your workflow. For example, my largest important dataset is a few TB of family photos and videos. They live in iCloud. Meaning they are managed by apple and stored on google cloud or Amazon s3. The chances of me losing data in iCloud are very slim — data there is backed up and replicated and can be recovered if attacked by ransomware. I’m backing it up elsewhere anyway out of sheer paranoia, but I never had to, nor ever expect to need to, restore from there. In this case archival services are great, like Google Archive or Amazon Glacier. They are ultra cheap for storage (under $0.001/GB/month), but quite expensive to restore from. (Duplicacy does not work out of the box with Glacier due to thawing requirements; not sure about Google Archive. I use them with another backup tool for a selective subset of data.)

My backup to google drive, in other words, is a toy. I know in the long run I’ll fully switch to an actual cloud storage service (and probably an archival one) — but it’s fun to see how far it can be stretched. (And it’s cheap, so why not.)

2 Likes

That’s not true, and it’s why I linked to this page — to highlight that most cloud providers - and even sftp (with a shell and md5sum/sha1sum) - support remote hashing and obviate the need to re-download the file. E.g. Google Drive also supports MD5.

Likely, the difference with B2 and S3 is that the hash is automatically computed when uploading the file. But most of these providers have API calls that can get the backend to return the hash of the file without having to download it.
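As a rough sketch of that idea for an sftp backend that also allows a shell (the host, user and paths are made up, and this is not something Duplicacy does today):

    # Hash the chunk locally and remotely, then compare -- no re-download needed.
    CHUNK=chunks/51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644
    LOCAL=$(md5sum /tmp/outgoing/chunk.bin | cut -d' ' -f1)
    REMOTE=$(ssh backup@nas "md5sum '/srv/duplicacy/$CHUNK'" | cut -d' ' -f1)

    if [ "$LOCAL" != "$REMOTE" ]; then
        echo "upload verification failed for $CHUNK" >&2
    fi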

IMO, considering the issues we’re running into with 0-byte and truncated files, Duplicacy needs a -verify flag for backends that support remote hashing, and to check the file size for all other cases at the very least.

1 Like

This highly depends on the implementation on the backend, and may not do anything to ensure data consistency.

For example, rehashing data repeatedly every time a user asks for it would be too expensive, so when returning hash data it will be derived from caches at various stages, not the actual file: from simply returning a hash stored along with metadata, to (best case) returning the hash of objects in the local cache of the server that is serving your data today, while the origin could have already rotted. For data just uploaded, that may even come from RAM, before they corrupted the file during the write to their dying array.

Success of that check would mean “the backend had the correct file at some point”, not that “the file is correct now”.

The only true verification is to actually restore some files. This approach is taken by a few tools – you specify a download window, say 30 min every week, and the tool attempts to restore random pieces from the dataset for the duration of that window. This puts a cap on cost and gives confirmation that at least some data is still recoverable and the whole thing does not go to /dev/null on the other end.
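Something similar can be approximated with duplicacy’s own restore command — a loose sketch, with the revision and pattern made up and the random sampling and time budget left to the reader:

    # Spot-restore a small sample into a scratch repository to confirm that the
    # data is actually recoverable; repeat with different revisions/patterns.
    cd /path/to/scratch-repository
    duplicacy restore -r 325 -overwrite -- "some/sampled/file.txt"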

I would not say “even” — though I don’t know what word with the opposite meaning would fit here. SFTP is a special case, as it’s a remote shell, so the admin can technically allow the user to run md5 or sha1 or whatever other reduction algorithm the user wants remotely on their data (including “duplicacy check -chunks” on a temporary repo local to the server). Neither of these is a good idea because:

  1. it’s a pure waste of resources: validation on a per-file basis is pointless when whole-storage consistency is being validated anyway (as it should be), and furthermore, as discussed before, it’s not the job of a backup tool to ensure data consistency of the backend, if anything because even if that were technically possible, there is no recourse once a discrepancy is detected
  2. Even home servers, let alone commercial appliances, can have a substantial local cache, and validating cache content (which is usually SSD and unlikely to rot) is not helpful when data on the main array may have already rotted (if the admin forgot to enable scrub).

Partially agree.

  1. I still think remote hashing is pointless because the meaning of it is inconsistent across remotes, and for properly optimized remotes a successful check will not mean what users might think it does and would be a false promise.

  2. For the few remotes that exhibit this zero-size thing (and I could only find mentions of pcloud and some sftp servers – not paragons of reliability or good design) – the only correct solution is to stop using that remote, not to add more fluff around the process. If you don’t trust the remote to write out a transferred file successfully, why would you trust any other promise it makes — including that it actually saved the hash and didn’t put your file on a rotting potato with no redundancy.

  3. As a temporary band-aid while users search for a replacement for said remote – duplicacy already supports incremental validation with check -chunks: after each backup the user can schedule check --chunks, and this will only result in egress equal to the amount of data just uploaded. This gives a similar degree of trustworthiness, but works with any backend, even those that don’t deserve trust, and does not require more per-remote code to be written, maintained, and documented.

1 Like

This is a big assumption, and even if it were true for a particular backend, I’m sure you’d agree most good cloud-based solutions will be using ECC memory, and their reliability at properly hashing the data they’ve been given should be very high. After all, things like the transport protocol have their own protections if anything gets corrupted. Do we not benefit from that, too?

But ECC or not, is irrelevant; we’re not talking about protecting against bit rot or cosmic rays - just that the backend did in fact receive the data we sent it.

The backend is patently not going to give a false positive where the chunk was empty or got truncated, because if it did, it would be magically constructing a hash from data it doesn’t have. Thus it would (should) tell us - immediately - whether it simply received the file we sent it - whether it’s using the cache or not doesn’t matter. Once it’s past that stage, it’s for the backend to do its job.

As you say, check -chunks and full or partial restores will test for other failure modes, but Duplicacy should be trying its best to safeguard against this seemingly common one.

As far as I know, the first part isn’t necessarily possible (the raw hash of a chunk wouldn’t be referenced in a snapshot), so it wouldn’t be possible to remotely verify checksums after the fact - the point is to do it while uploading.

And not all sftp backends will allow you to run Duplicacy remotely anyway. Even if they did, you could end up with several (successful, albeit corrupted) incremental backups till the next deep check ran. With upload verification, it would be instant.

You keep saying this, but it’s a real problem - for Duplicacy, not the backend - if it leaves broken files on a perfectly good backend, and subsequent incremental backups are broken because it makes assumptions about the existence of a file, a file that wasn’t checked to see if it was uploaded fully.

Again, you keep saying this, but I already previously explained that detecting errors as soon as possible, is critical to 1) avoiding an ever-widening window of data loss, while backups are broken, and 2) fixing a broken storage (which is always possible, and more to the point, desirable). So of course there’s recourse.

1 Like

Agree on the first bit.

Can you give me an example of a scenario on any “perfectly good backend” where this is possible? If the transfer is not complete – the backend shall send an error. If the transfer is complete – the backend should report OK. If the backend saves half of a file and says OK – it’s a backend problem, it’s not “perfectly good”, and shall be avoided.

File transfer must be transactional; backends that don’t handle this shall not be used in general, and with duplicacy specifically.

I have re-read that several times but I don’t see an explanation of how to repair the datastore. If the chunk is lost, the data in it (bits and pieces of versioned data) is lost. Where do you get the old versions of the users’ files to repair the corrupted chunk with?

My bigger point is that fixing those issues on duplicacy side is counterproductive: it makes flaky remote look and behave like a good one, while only addressing obvious blunders. Remote that cannot accept a file reliably shall fail ASAP so users can stop using it. Check --chunks immediately after backup accomplishes that.

The alternative you are suggesting is a cover-up of the obvious issues, but the quality of the remote is still shite, so other, less obvious issues will creep in much later, without an obvious fix.

For the FTP server that gets full and corrupts data – it’s a bad FTP server. The SFTP API has return codes, and if it successfully uploaded a truncated file – it’s shite and needs to be replaced. It’s not duplicacy’s or any other client’s fault, and therefore shall not be addressed at the client.

I have yet to see a convincing counter-argument, that’s all. Patchwork for some of the issues of flaky remotes is not a direction I’m ready to agree with.