Zero size chunks: how to solve it once and for all?

Duplicacy does checksum verification when uploading to S3 and B2, but not for other storages. There are already two levels of built-in checksums (file-level and chunk-level), so I feel it isn’t really necessary to add another checksum just to ensure the integrity of the upload step. Data corruption should not happen over HTTPS connections, if my understanding is correct.
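
For reference, this is roughly what that server-side verification looks like for S3 (a minimal sketch in Go using aws-sdk-go, not Duplicacy’s actual code): the client sends a Content-MD5 header with the PUT, and S3 rejects the request if the bytes it received hash to something else. B2’s native upload API similarly requires a SHA-1 of the content.

package sketch

import (
    "bytes"
    "crypto/md5"
    "encoding/base64"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

// uploadChunkWithMD5 asks S3 to verify the upload server-side: if the bytes S3
// receives don't hash to the supplied Content-MD5, the PUT fails with an error.
func uploadChunkWithMD5(bucket, key string, chunk []byte) error {
    sum := md5.Sum(chunk)
    contentMD5 := base64.StdEncoding.EncodeToString(sum[:])

    sess, err := session.NewSession()
    if err != nil {
        return err
    }
    _, err = s3.New(sess).PutObject(&s3.PutObjectInput{
        Bucket:     aws.String(bucket),
        Key:        aws.String(key),
        Body:       bytes.NewReader(chunk),
        ContentMD5: aws.String(contentMD5),
    })
    return err
}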

Rclone is different. It is a sync tool, so this kind of checksum is the only way to verify that files are transferred correctly.

1 Like

Well, that is the question, I guess.

If that were the case, this thread wouldn’t exist. Maybe it’s not https’s fault, but problems obviously exist.

1 Like

Could this not be expanded to other storages, since most cloud storages, and even SFTP, support remote hashing? We’re not talking about storing chunk- or file-level checksums - just using data already known at the time of upload to verify that a chunk was actually uploaded in one piece.

Plus, I’m not sure why a sync tool and a backup tool should differ in wanting to make sure files were sent correctly(?)

1 Like

I’m just genuinely curious: why does Duplicacy perform checksum verification when uploading to S3 and B2, but not for other providers? Is checksum verification more resource/bandwidth-intensive than the file-level and chunk-level checksums?

S3 and B2 support server-side verification via their APIs. With other backends, the only way to check is to download the file back and verify it. That download is unexpected, and often expensive and time-consuming.

If you have to do it – you can run duplicacy check --chunks after a backup, and the chunks that have not yet been validated will get downloaded and checked. But this means you are generating egress equal to your ingress, which may be expensive and, in the vast majority of circumstances, is not only unnecessary but also non-actionable – beyond perhaps telling you not to use that specific backend anymore.

1 Like

Ah, so it’s a case of S3 and B2 being superior to other backends and Duplicacy taking advantage of server-side verification. It might be time for me to switch to one of them.

1 Like

Define “superior”. Yes, API-based bulk cloud storage services will work much better for backup than file-based ones: S3 will be significantly more robust than WebDAV – they are designed for different purposes. WebDAV was never meant to handle millions of small files; it’s an HTTP extension for sharing documents for collaboration. But still, this is probably not the main reason to pick one over the other.

Example: I use Google Drive as a backend. Zero issues (with an asterisk – meaning, zero issues I did not know how to handle). It’s built on the same Google Cloud infrastructure you can use directly yourself, but it is definitely not designed for bulk storage. It’s a collaboration service. I’m effectively abusing it by using it as a backup backend, and yet it was and is solid.

B2, on the other hand, had some issues recently with returning bad data, and performance is not uniform – see the recent post from someone unable to get any reasonable bandwidth out of it. Yes, it is 2x–4x cheaper. You get what you pay for.

It all boils down to company culture and how they handle development, testing, quality control, and maintenance. I would trust Google and Amazon to keep my data healthy, but not pCloud, hubiC, iDrive, Backblaze, or other small companies with little side projects, three overworked QA engineers, and flaky processes (judging by the outcomes).

Reality is – if you want reliable backup, you should not really be using *Drive services as backends, regardless of vendor. It’s a minefield of gotchas – for example, the eventual consistency of virtually all those services means you can end up with duplicate folder names. Unless your backup tool is designed to handle that, you will see weird issues unless you are very deliberate in how you design your backup workflow (as a side note, rclone dedupe can fix those – guess why I know this). Then there is throttling, anti-abuse features, and various other performance limits that are OK for the intended use of sharing documents, but are there on purpose to discourage abuse or as a side effect of compromises taken. (For example, don’t even try to download with multiple threads from Dropbox or OneDrive.)

It just so happens that Google figured out how to build reliable web services, and Google Drive happens to be a robust-enough, ultra-cheap solution that has historically worked well with minimal pitfalls. It’s a unicorn. Every single other *Drive service happens to be unsuitable as a backup target for any reasonably large dataset.

So yes, you are right: for a set-it-and-forget-it, rock-solid approach, you should go by market share and use Amazon S3, Google Cloud, or Microsoft Azure (in that order). That would be expensive. Very. But how much is your data worth to you?

If cost is not an issue (whether because you have a small dataset or deep pockets), then that’s where the story ends.

Otherwise you can optimize further based on the type of data you store and your workflow. For example, my largest important dataset is a few TB of family photos and videos. They live in iCloud, meaning they are managed by Apple and stored on Google Cloud or Amazon S3. The chances of me losing data in iCloud are very slim – data there is backed up and replicated, and can be recovered if attacked by ransomware. I’m backing it up elsewhere anyway out of sheer paranoia, but I have never had to, nor ever expect to need to, restore from there. In this case archival services are great, like Google Archive or Amazon Glacier. They are ultra cheap for storage (under $0.001/GB/month), but quite expensive to restore. (Duplicacy does not work out of the box with Glacier due to thawing requirements; not sure about Google Archive. I use them with another backup tool for a selective subset of data.)

My backup to Google Drive, in other words, is a toy. I know in the long run I’ll fully switch to an actual cloud storage service (and probably an archival one) – but it’s fun to see how far it can be stretched. (And it’s cheap, so why not?)

2 Likes

That’s not true, and it’s why I linked to this page: to highlight that most cloud providers - and even SFTP (with a shell and md5sum/sha1sum) - support remote hashing, which obviates the need to re-download the file. Google Drive, for example, also supports MD5.

Likely, the difference with B2 and S3 is that the hash is automatically computed when uploading the file. But most of these providers have API calls that can get the backend to return the hash of the file without having to download it.
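
For example, a rough sketch of the kind of call I mean, against Google Drive (using the google.golang.org/api/drive/v3 Go client; the helper and paths are made up, this isn’t Duplicacy code): Drive reports an md5Checksum and size for uploaded binary files, so a chunk can be compared without downloading it back.

package sketch

import (
    "crypto/md5"
    "encoding/hex"
    "fmt"
    "os"

    "google.golang.org/api/drive/v3"
)

// verifyDriveChunk compares a local chunk file against the MD5 and size that
// Google Drive reports for the uploaded copy - no download required.
func verifyDriveChunk(srv *drive.Service, fileID, localPath string) error {
    data, err := os.ReadFile(localPath)
    if err != nil {
        return err
    }
    sum := md5.Sum(data)
    localMD5 := hex.EncodeToString(sum[:])

    remote, err := srv.Files.Get(fileID).Fields("md5Checksum", "size").Do()
    if err != nil {
        return err
    }
    if remote.Size != int64(len(data)) || remote.Md5Checksum != localMD5 {
        return fmt.Errorf("chunk %s mismatch: local %s (%d bytes), remote %s (%d bytes)",
            fileID, localMD5, len(data), remote.Md5Checksum, remote.Size)
    }
    return nil
}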

IMO, considering the issues we’re running into with 0-byte and truncated files, Duplicacy needs a -verify flag for backends that support remote hashing, and to check the file size for all other cases at the very least.

1 Like

This highly depends on the implementation on the backend, and may not do anything to ensure data consistency.

For example, rehashing the data every time a user asks for it would be too expensive, so when returning a hash it will be derived from caches at various stages, not from the actual file: from simply returning a hash stored along with the metadata, to (best case) returning the hash of the object in the local cache of the server that happens to be serving your data today, while the origin could have already rotted. For data just uploaded it may even come from RAM, before the backend corrupted the file while writing it to its dying array.

The success of that check would mean “the backend had the correct file at some point”, not “the file is correct now”.

The only true verification is to actually restore some files. This approach is taken by a few tools – you specify a download window, say 30 minutes every week, and the tool attempts to restore random pieces of the dataset for the duration of that window. This puts a cap on cost and confirms that at least some data is still recoverable and the whole thing does not go to /dev/null on the other end.
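
The idea is simple enough to sketch (with a hypothetical Backend interface – this is not any particular tool’s API): shuffle the chunk list and keep downloading until the time budget runs out.

package sketch

import (
    "fmt"
    "math/rand"
    "time"
)

// Backend is a stand-in for whatever storage API the backup tool talks to.
type Backend interface {
    ListChunks() ([]string, error)
    DownloadChunk(id string) ([]byte, error)
}

// spotCheck downloads random chunks until the time budget is spent, capping the
// egress cost while still proving that at least some data is recoverable.
func spotCheck(b Backend, budget time.Duration) error {
    deadline := time.Now().Add(budget)
    chunks, err := b.ListChunks()
    if err != nil {
        return err
    }
    rand.Shuffle(len(chunks), func(i, j int) { chunks[i], chunks[j] = chunks[j], chunks[i] })
    for _, id := range chunks {
        if time.Now().After(deadline) {
            break // budget exhausted; the rest gets sampled on a later run
        }
        data, err := b.DownloadChunk(id)
        if err != nil {
            return fmt.Errorf("chunk %s is not restorable: %w", id, err)
        }
        if len(data) == 0 {
            return fmt.Errorf("chunk %s came back empty", id)
        }
        // A real tool would also decrypt the chunk and verify its content hash here.
    }
    return nil
}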

I would not say “even”, though I don’t know what word with the opposite meaning would fit here. SFTP is a special case: it’s a remote shell, so the admin can technically allow the user to run md5, sha1, or whatever other reduction algorithm they want remotely on their data (including duplicacy check -chunks on a temporary repository local to the server) – see the sketch after this list. Neither is a good idea, because:

  1. It’s a pure waste of resources: validation on a per-file basis is pointless when the whole storage’s consistency is being validated anyway (as it should be). Furthermore, as discussed before, it’s not the job of a backup tool to ensure the data consistency of the backend, if only because, even if that were technically possible, there is no recourse once a discrepancy is detected.
  2. Even home servers, let alone commercial appliances, can have a substantial local cache, and validating the cache content (which is usually SSD and unlikely to rot) is not helpful when the data on the main array may have already rotted (if the admin forgot to enable scrubbing).
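
For concreteness, this is roughly what such remote hashing would look like (a sketch using golang.org/x/crypto/ssh; host, credentials and path are made up) – the digest is computed on the server, so nothing is downloaded, but per the points above it only tells you what the server happens to serve today:

package sketch

import (
    "fmt"
    "strings"

    "golang.org/x/crypto/ssh"
)

// remoteSHA1 runs sha1sum on the server so the chunk never has to be downloaded.
// Host key checking is skipped here purely to keep the sketch short.
func remoteSHA1(addr, user, password, remotePath string) (string, error) {
    config := &ssh.ClientConfig{
        User:            user,
        Auth:            []ssh.AuthMethod{ssh.Password(password)},
        HostKeyCallback: ssh.InsecureIgnoreHostKey(),
    }
    client, err := ssh.Dial("tcp", addr, config)
    if err != nil {
        return "", err
    }
    defer client.Close()

    session, err := client.NewSession()
    if err != nil {
        return "", err
    }
    defer session.Close()

    out, err := session.Output("sha1sum " + remotePath) // "<hex digest>  <path>"
    if err != nil {
        return "", err
    }
    fields := strings.Fields(string(out))
    if len(fields) == 0 {
        return "", fmt.Errorf("unexpected sha1sum output: %q", out)
    }
    return fields[0], nil
}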

Partially agree.

  1. I still think remote hashing is pointless, because its meaning is inconsistent across remotes, and for properly optimized remotes a successful check will not mean what users might think it does – it’s a false promise.

  2. For the few remotes that exhibit this zero-size thing (and I could only find mentions of pCloud and some SFTP servers – not paragons of reliability or good design), the only correct solution is to stop using that remote, not to add more fluff around the process. If you don’t trust the remote to write out a transferred file successfully, why would you trust any other promise it makes – including that it won’t just save the hash and put your file on a rotting potato with no redundancy?

  3. As a temporary band-aid while users search for a replacement for said remote – duplicacy already supports incremental validation with check -chunks: after each backup the user can schedule check -chunks, and this will only result in egress equal to the amount of data just uploaded. This gives a similar degree of trustworthiness, but works with any backend, even those that don’t deserve trust, and does not require more per-remote code to be written, maintained, and documented.

1 Like

This is a big assumption, and even if it were true for a particular backend, I’m sure you’d agree most good cloud-based solutions will be using ECC memory, so the reliability of properly hashing the data it’s been given should be very high. After all, things like the transport protocol have their own protections if anything gets corrupted. Do we not benefit from that, too?

But ECC or not is irrelevant; we’re not talking about protecting against bit rot or cosmic rays - just verifying that the backend did in fact receive the data we sent it.

The backend is patently not going to give a false positive where the chunk was empty or got truncated, because if it did, it would be magically constructing a hash from data it doesn’t have. Thus it would (should) tell us - immediately - whether it actually received the file we sent it; whether it’s using a cache or not doesn’t matter. Once it’s past that stage, it’s for the backend to do its job.

As you say, check -chunks and full or partial restores will test for other failure modes, but Duplicacy should be trying its best to safeguard against this seemingly common one.

As far as I know, the first part isn’t necessarily possible (the raw hash of a chunk wouldn’t be referenced in a snapshot), so it wouldn’t be possible to remotely verify checksums after the fact - the point is to do it while uploading.

And not all SFTP backends will allow you to run Duplicacy remotely anyway. Even if they did, you could end up with several (successful, albeit corrupted) incremental backups until the next deep check ran. With upload verification, detection would be instant.

You keep saying this, but it’s a real problem - for Duplicacy, not the backend - if it leaves broken files on a perfectly good backend, and subsequent incremental backups are broken because it makes assumptions about the existence of a file, a file that wasn’t checked to see if it was uploaded fully.

Again, you keep saying this, but I already previously explained that detecting errors as soon as possible is critical to 1) avoiding an ever-widening window of data loss while backups are broken, and 2) fixing a broken storage (which is always possible and, more to the point, desirable). So of course there’s recourse.

1 Like

Agree on the first bit.

Can you give me an example of a scenario on any “perfectly good backend” where this is possible? If the transfer is not complete – the backend shall send an error. If the transfer is complete – the backend should report OK. If the backend saves half of a file and says OK – it’s a backend problem, it’s not “perfectly good”, and it shall be avoided.

File transfer must be transactional; backends that don’t handle this shall not be used in general, and with duplicacy specifically.

I have re-read that several times, but I don’t see an explanation of how to repair the datastore. If the chunk is lost, the data in it (bits and pieces of versioned data) is lost. Where do you get the old versions of the user’s files to repair the corrupted chunk with?

My bigger point is that fixing those issues on duplicacy’s side is counterproductive: it makes a flaky remote look and behave like a good one, while only addressing the obvious blunders. A remote that cannot accept a file reliably shall fail ASAP, so users can stop using it. check --chunks immediately after backup accomplishes that.

The alternative you are suggesting is a cover-up of the obvious issues, but the quality of the remote is still shite, so other, less obvious issues will creep in much later, without an obvious fix.

As for the SFTP server that gets full and corrupts data – it’s a bad SFTP server. The SFTP API has return codes, and if it reports a truncated upload as successful – it’s shite, and needs to be replaced. It’s not duplicacy’s or any other client’s fault, and therefore shall not be addressed at the client.

I have yet to see a convincing counter-argument, that’s all. Patching over some of the issues of flaky remotes is not a direction I’m ready to agree with.

Great discussion. Keep it going!

In the case of zero-size chunks it’s even worse than that: even when duplicacy knows the file is useless, it simply ignores that fact.

Granted, I’m exaggerating a little bit, because it’s the check command that “knows” and the backup command that assumes, but you get the point.

Which brings us back to the fact that duplicacy is apparently not using existing hashing features of the backends it supports. (Or am I misunderstanding the meaning of transactional here?)

I wasn’t aware of this (I was aware of the command, but didn’t understand the use case). If this is so, shouldn’t something like -checkchunks be an option for backup? Or if commands should remain clearly distinguished from each other, shouldn’t this at least be part of the backup documentation as a kind of standard setup suggestion?
But I think I would prefer it as an option for backup because then it could act on it immediately and reupload whatever is broken.

I understand that this is the opposite of what @saspus wants in an ideal world, but my world, at least, is not ideal.

1 Like

You’re right - file transfers must be transactional; and mutual - it’s not the job of the backend alone to decide that a transaction was successful; the application using it decides too. The backend’s responsibility starts after it has fully received the file. Here, it hasn’t received it.

Or, to put it another way, most backends in my view don’t provide enough functionality to ensure a file is atomically uploaded. Duplicacy is assuming too much (treating a filesystem like a database), and a universal solution is to put in an extra check (optionally, with a -verify flag) that ensures truncated files are very unlikely to happen - even in the case of a temporary lack of disk space, sudden disconnections, bugs in Duplicacy’s handling of API errors, etc.

The explanation is right there. I’ve had to go through the process several times, so it works, is necessary, and contradicts your strange claim that there’s no recourse to repair the database. I.e.: manually delete 0-byte chunks/snapshots, go straight to check -chunks (since a normal check doesn’t reveal truncated chunks), manually delete other truncated chunks and/or snapshots, re-run with check -files -r <latest>, prune with -exhaustive, etc.

Obviously, snapshot revisions where chunks/snapshots got truncated will be corrupted beyond repair - but they MUST be removed from the database lest they cause issues in the future.

The point is, if a backup fails, it’s evidently possible the backup storage can be left in a bad state - and Duplicacy will go on creating backups and pruning chunks and not even a regular check will tell you anything is amiss. The average Duplicacy user isn’t wise to this fact.

Which is expensive, and unnecessary, when the hashes are already available during upload and most backends support remote checksumming. Even just checking the file size would be a nice first step!

Actually, the normal check command doesn’t know that either. You have to use -chunks or otherwise re-download to highlight a problem that occurred during the initial upload. Which, of course, sounds ridiculous because it should be detected and prevented right then.

Edit: BTW, it might sound like I’m being super critical of Duplicacy. I do trust it with my backups (since I do 3-2-1). Yet I’m super paranoid these days when a backup fails - having seen how commonly truncated data happens - so I run deep checks more regularly than perhaps should be necessary.

2 Likes

Support for transactions is a property of the service, not the client: the backend must ensure that the file is saved in its entirety, or not at all. Every protocol passes the expected file length when uploading, and integrity is guaranteed by the transport protocol. So it’s up to the backend not to drop the ball.

Yes! And how is the client supposed to determine whether it was a success? By looking at what the “put/upload/store/whathaveyou” API returned. Because only the backend knows whether it managed to save the file successfully, not the client.

So, my reasoning is simple: by suggesting that duplicacy should verify what actually happened after a file got uploaded and the backend returned success, you are implying that at least the upload API of said backend cannot be trusted. Which means the whole backend cannot be trusted. Which means – stop using it.

  • D: Here, take this 2MB file A
  • B: Received A: Success
  • D: (Prepares to leave, then turns around abruptly, squints) – what is the size of file A I just gave?
  • B: 1.4MB
  • D: Aha! I got you evil bastard!

You suggest: have duplicacy retry until it succeeds.
I suggest: have duplicacy do nothing. When a discrepancy is detected – hard fail, and tell the user to stop using this unreliable backend.

There is probably a middle ground somewhere – detecting some of the most common discrepancies early. Maybe what you are saying is OK – check the size after uploading. However, this will double the API cost for all users, and does not provide any guarantee – a bad backend can drop the ball somewhere else. It hardly seems worth it to penalize most users to marginally improve the chances with a couple of bad apples.
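
For the record, that size check is cheap to sketch (the Storage interface below is assumed for illustration, not Duplicacy’s actual one): one extra metadata call per chunk, and a hard failure on any discrepancy rather than a retry.

package sketch

import "fmt"

// Storage is an assumed minimal backend interface for this sketch.
type Storage interface {
    UploadFile(path string, content []byte) error
    GetFileInfo(path string) (exists bool, size int64, err error)
}

// uploadAndVerifySize uploads a chunk and then asks the backend for the file's
// size - one extra API call per chunk - and hard-fails on any discrepancy.
func uploadAndVerifySize(s Storage, path string, content []byte) error {
    if err := s.UploadFile(path, content); err != nil {
        return err
    }
    exists, size, err := s.GetFileInfo(path)
    if err != nil {
        return err
    }
    if !exists || size != int64(len(content)) {
        // The backend said "success" but kept something else: don't retry, report it.
        return fmt.Errorf("%s: sent %d bytes, backend reports %d (exists=%v)",
            path, len(content), size, exists)
    }
    return nil
}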

Right. It knows that the chunk file with the specified ID is present, and the [good] backend shall guarantee its integrity through the mechanisms described above.

Exactly my point. It’s expensive and unnecessary. But if you don’t trust the backend to handle hashing during upload, why do you trust it to handle it later? What is different? And so, since the backend can lie (by saving the hash and returning it, as opposed to actually hashing the file, as described above), the only way to actually check is to download.

This is not a repair, it’s damage control. Data loss has already occurred; version history has been lost. If the “repair” involves deleting versions and snapshots, it’s not a repair. You don’t heal a headache with a guillotine.

Train of thought: this backend caused me to lose data → I’m not using that backend anymore. That’s the only solution, not more code in duplicacy.

Good point. Does that mean you recommend scheduling check by default? For example, when creating a backup schedule, also automatically create a check schedule? The user can always delete it if they don’t want it, but the default config will be safe.

However, I’m strongly against adding code to the backup procedure that is motivated by distrusting the remote.

It’s not common (in my anecdotal experience). Adding more checks marginally improves the chances of having a healthy backup on a bad remote, while costing all users more in API calls and performance. Switching to a remote that does not suck improves them drastically.

But there is no difference! Detected and prevented right then – means downloaded right after uploading. There is no other way to check. You need to do it with every chunk file if you don’t trust the upload API.

It does not matter whether you do it right after every chunk or in bulk (check --chunks) after the entire backup routine has completed.

The benefit of the second approach is that, for remotes that are trusted (which is the majority), there is no need to double-check. If you do it during backup, you impose the overhead on all users, most of whom don’t need it.

It looks like we are going in circles, but I think we are converging to something.

1 Like

I don’t understand this. The logs I posted above seem to suggest something else. Here is the beginning of the log file where you can see that I’m not using -chunks and yet it is telling me which chunks are zero size.

Running check command from /tmp/.duplicacy-web/repositories/localhost/all
Options: [-log check -storage pcloud -threads 4 -a -tabular]
2021-10-30 05:00:01.611 INFO STORAGE_SET Storage set to webdav://*********@webdav.pcloud.com/Backup/Duplicacy
2021-10-30 05:00:03.224 INFO SNAPSHOT_CHECK Listing all chunks
2021-10-30 05:16:41.861 WARN SNAPSHOT_CHECK Chunk d06b4e277f99a5ea0f0745a27e57d3e633b98a4849a6a0b885f7d2710f0d1b7f has a size of 0
2021-10-30 05:16:43.385 WARN SNAPSHOT_CHECK Chunk 51c91de816f41ca233ed57c836fa95cd17f389937dc2e604cdd05b434b851644 has a size of 0
2021-10-30 05:21:53.387 INFO SNAPSHOT_CHECK 15 snapshots and 1730 revisions
2021-10-30 05:21:53.447 INFO SNAPSHOT_CHECK Total chunk size is 3038G in 2762586 chunks
1 Like

Apparently, duplicacy does check for zero file sizes:

since 16 months ago:

1 Like

So all except B2 and S3 cannot be trusted?

What I’m suggesting is that Duplicacy made a design choice that heavily relies on the assumption that a filesystem can substitute for a database, and that all such storages provide a robust enough API to allow it to function that way. I don’t think they do, or necessarily can - without certain extra safeguards.

Never once said or implied that. I want Duplicacy to fail gracefully and not leave its ‘database’ in a corrupted state.


B: 1.4MB; I may, or may not, be able to inform the client I ran out of disk space, but at least I won’t leave my filesystem in a corrupted state
D: OK, let’s not leave this broken file lying around like a Lego brick; don’t want my database corrupted, after all.

The API cost would be optional (via a -verify flag) and much cheaper than re-downloading the chunk.

Of course it provides a guarantee: that the file sent was received in full! That’s the minimum I’d expect of an application that wishes to treat a filesystem as a transactional database. The mechanism isn’t meant to safeguard against bit rot or other issues after the fact, only to verify that a transaction 100% completed.

It’s like arguing that erasure coding is a waste of time because ‘nothing should go wrong’. In fact, I’d argue the implementation of erasure codes is similar in desirability, and yet truncated chunks created during backup arise more frequently than bit rot ever would.

Dunno why you insist on this line of reasoning when I’ve already demonstrated why it came to be broken (temporary lack of disk space), why it had to be fixed quickly (because Duplicacy silently continues and pretends nothing is wrong), how (manually deleting chunks/snapshots), and that it was ultimately successful (disk space was freed up and most of the version history left intact). AKA - a repair.

But a major hassle.

Do you suggest a filesystem should be left corrupted, when it runs out of disk space? That it should be abandoned? Throw out the drives and start again. :slight_smile:

If not, why then, should Duplicacy’s ‘database’ be so easy to corrupt, just by running out of space?

Again, an optional -verify flag.

The overhead is mostly on the backend, and the rest can be mitigated, if desired, with an extra -threads or so.

Furthermore, it should be relatively straightforward to implement, without changing the database format.

You’re absolutely right, this is a relatively new addition, and wasn’t present when I last had to tidy up a broken storage due to 0-byte chunks.

However, it won’t catch truncated chunks that aren’t 0 bytes in size, which I’ve personally witnessed.

1 Like

I would not conflate the API specification with the implementation. For example, Minio running on a disintegrating array without checksumming will not be very robust, while Amazon’s hosted storage service behind the same API spec will likely be much better. Or the case at hand – WebDAV and pCloud. These issues likely have nothing to do with WebDAV, and everything to do with the lean, aggressively corner-cutting business pCloud seems to be running (I’m speculating; I have no idea what’s going on there).

Yes, that seems to be the case, and it is not an unreasonable assumption (backed by API documentation!) with most services; with others, duplicacy implements a workaround via .tmp files (only after the .tmp file is successfully uploaded is it renamed in place).
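
For readers following along, that workaround is essentially the classic write-then-rename pattern (sketched here for a local filesystem backend; the same idea applies over SFTP, where the rename is a separate protocol request):

package sketch

import "os"

// saveChunkAtomically writes to a temporary file first and renames it into place
// only after the write fully succeeded, so a crash or a full disk leaves behind
// a .tmp file rather than a truncated chunk under the final name.
func saveChunkAtomically(finalPath string, content []byte) error {
    tmpPath := finalPath + ".tmp"
    f, err := os.Create(tmpPath)
    if err != nil {
        return err
    }
    if _, err := f.Write(content); err != nil {
        f.Close()
        os.Remove(tmpPath)
        return err
    }
    if err := f.Sync(); err != nil { // make sure the bytes hit the disk, not just a buffer
        f.Close()
        os.Remove(tmpPath)
        return err
    }
    if err := f.Close(); err != nil {
        os.Remove(tmpPath)
        return err
    }
    // On POSIX filesystems the rename is atomic: the chunk either exists complete or not at all.
    return os.Rename(tmpPath, finalPath)
}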

But that is not enough. Your HTTPS PUT request also returned a status code that said the request succeeded. Why is that not enough? It’s just another API.

Yep. I’d still (passionately) argue it’s a misplaced and dangerously misleading feature: it gives users an excuse to continue relying on rotting media as a backup destination, justified by a slight reduction in the chance of data loss. I’d rather users move to better storage than implement a band-aid that does not change anything in the grand scheme of things (and by that I mean the design does not provide a guarantee, the way ZFS would). But that’s a different conversation.

I guess there are two separate things.

  1. High priority: detecting flaky remotes and notifying users. A remote that runs out of space and cannot propagate the error back to the client is a bad remote and should not be trusted with other tasks either. (It may start overwriting or deleting existing files too, “to save the filesystem”; and if its architects think that saving the filesystem is more important than customer data – I’d like to know about that sooner rather than later, so I never even come close to that “service”.)
  2. Low priority: duplicacy shall be able to clean up after itself. Automatically. Without manual intervention. Data loss has already occurred, so the whole find-affected-snapshots-and-delete-them-along-with-orphaned-chunks routine should be done automatically, with user notification. There is no reason not to clean up. Yes, we can call it “repair”. I call it low priority because data loss has already occurred. The system failed: user data was lost. There is little consolation in duplicacy being able to keep dumping more data onto the unstable remote that allowed the data loss to occur. (Actually, maybe it’s a good thing that it does not auto-heal: better chance users will start over with different storage? Probably not.)

Zero-size chunks are the one very specific corner case that can be (and, as it turns out, already is) checked quickly. For detecting any of the other possible issues (e.g. a chunk that is one byte short, a chunk corrupted in the middle, etc.), or this one:

– you need a full download. Asking the remote for a hash is pointless (“hey, do you still have my file with this hash?” “yes, I do” – you can’t deduce anything from that information; remember the recent Backblaze bug – the data was correct, and yet the API returned crap).

That’s why it’s imperative to pick reliable remotes (whatever that means), and detecting a bad remote is therefore more important than fixing the datastore after data has already been lost.

Nobody can know that. There can be a chunk file that affects all snapshots in the datastore. Data loss is data loss. A 1-byte loss == fail. Might as well scrap the whole backup and start over. On a new remote.

So, to summarize:

  1. We want duplicacy to detect a corrupted datastore ASAP. The zero-chunk detection already in place, plus an immediate check after each backup, covers the range of reasonable accommodations. Maybe do that for the first few months with a new remote.
  2. Users shall be coached to abandon remotes that fail that check even once, as being poorly designed and unreliable.
  3. This is with the understanding that relying on any means of verification served by the remote itself (e.g. asking for a hash or health status), which hinges on the remote being honest, is pointless.
  4. On a completely unrelated note: yes, duplicacy shall be able to self-heal the state of the datastore, since the “repair” we discussed does not increase data loss, but merely allows duplicacy to proceed. Actually, I would consider the inability to do that today a high-priority bug.

Do you (partially?) agree?

1 Like

Indeed, this only happens on certain backends (local and SFTP) - which begs the question: how are truncated chunks (without the .tmp extension) still happening on these?

During a check only, not during a backup.

There might be numerous backups and even a prune or several (if configured that way) before the next check is run.

A check is an expensive operation when you have a lot of chunks and an off-site remote, so it’s normally run much less regularly. Likewise, a regular check -chunks is just as expensive, since each run has to ListAllFiles(). This process alone takes 12 minutes on my G Suite storage, and 30+ minutes on various Vertical Backup stores on local SFTP. Just to list files!

Not to mention, a prune can further ‘corrupt’ a storage when it’s already missing chunks (or has truncated metadata chunks) - leaving more and more bad snapshots around. (No .fsl for snapshots.)

Duplicacy isn’t responsible for bugs in the backend, nor bit rot, nor any of that. It is responsible for using adequate workarounds (renaming from .tmp, etc.) and putting in extra safeguards where that’s not possible. Basically, a design choice, in lieu of a real database.

Well, yes, I can know that, since I had to run a full gamut of checks and/or restores to purge all the bad chunks/snapshots. Simple facts, which come with extensive experience.

In the case of lack of disk space, this has always meant every backup revision created after the initial occurrence cannot be trusted and probably should be manually pruned. Everything before that is perfectly fine (one great aspect of Duplicacy’s design, at least).

We have different ideas about what asap means. :slight_smile:

This is not about a particular user’s individual remote, though. This is about whether each remote type that Duplicacy directly supports is suitable for the job. Is WebDAV unreliable? Where is the disclaimer that a bad internet connection may render it corrupt?

Self-healing would be nice, I agree, but not leaving the database in a corrupted state in the first place would be better. Isn’t that even more of a priority bug?

Do you agree there? :slight_smile:

What might be useful for everyone here is to extensively test each of their backend storages for specific failure scenarios - e.g. out of disk space (quota), temporary or permanent loss of connection - to more properly gauge the extent of the issue. IMO, the outcome will surprise many.

1 Like

Are they? The only possibility of this happening that I can imagine is a buggy SFTP implementation – one which reports success on a truncated upload. Are you aware of a reproducible case? It needs to be reported to the SFTP server vendor. It would be interesting to try to reproduce – I’ll try to do that later.

Of course. The default workflow must be designed on the assumption that the software and services work. To catch exceptions, there is check. (Of course, I’m not talking about cases where the operation fails and the backend returns failure; those are still normal. I’m talking about the operation failing and the backend returning success, which is the source of all the issues here.)

Ideally, it would never need to be run at all, so running it infrequently is not bad; it’s expected.

This is about right. Why is this a problem though?

Well, this involves a lot of random IO, and unless the cache is hot it will be ridiculously slow. I would slow it down further and let it take 8 hours instead, to minimize impact on performance.

The point being – a high performance cost for an infrequent and non-mandatory operation is preferable to even a minimal increase in the overhead of the main execution path – which is backup.

How can prune corrupt anything with missing chunks? All it can do is delete more of the chunks that need to be deleted anyway.

Correct. It must be designed in such a way as to be able to tell, given that the backend behaves per spec, whether the operation succeeded or not, and to properly handle failure, e.g. by retrying. As far as I understand, it is designed this way. Hence, failures are due to the clause in italics being violated.

If you are talking about the undetected-full-disk scenario where subsequent backups are corrupted – then yes, previous ones will be unaffected. But in the case of flaky storage it is not impossible to imagine that one very old chunk is referenced by all snapshots, and damaging that chunk kills the entire datastore.

OK, for me ASAP means “at the first check”.

No, no, it is specifically about each individual remote, not the remote type. As I wrote somewhere above – there can be an excellent WebDAV remote, and an absolute random number generator hiding behind S3. All protocols have error reporting built in, but not all implementations do a great job of adhering to the spec.

No, that would be horrific: “we report an error when the operation fails, except when we don’t”. Web APIs return a server response – so when a request comes in saying “hey, take these 500 bytes”, the only way the server shall return success is if it indeed received those 500 bytes and saved them successfully. Any disruption means either an error is returned or the connection is broken (another error, returned by a different subsystem). So yes, the WebDAV spec is rock solid. It says nothing about pCloud’s implementation of said spec.
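
To make the “it’s just another API” point concrete, the entire client-side contract fits in a few lines (a bare net/http sketch with a made-up URL; WebDAV is essentially HTTP PUT plus a few extra methods) – the server only gets to say “success” after it has received the whole body:

package sketch

import (
    "bytes"
    "fmt"
    "net/http"
)

// putFile uploads a chunk with a plain HTTP PUT. The Content-Length tells the
// server how many bytes to expect; a 2xx status only comes back after the
// server has received (and supposedly stored) the entire body.
func putFile(url string, data []byte) error {
    req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
    if err != nil {
        return err
    }
    // ContentLength is set automatically for a bytes.Reader body; shown for clarity.
    req.ContentLength = int64(len(data))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err // broken connection, timeout, etc.
    }
    defer resp.Body.Close()

    if resp.StatusCode < 200 || resp.StatusCode > 299 {
        return fmt.Errorf("PUT %s: server refused the upload: %s", url, resp.Status)
    }
    return nil
}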

No, the higher priority is not letting the datastore get into a corrupted state in the first place. The only way to do that reliably is through the magic of not using remotes that don’t adhere to their published API (regardless of the degree of disparity – zero-tolerance policy; it does not matter whether the remote truncated the file or saved a copy of /dev/urandom instead). Which remotes are those? Those that have failed check -chunks at least once.

It would be great if duplicacy had telemetry (with proper PII scrubbing, of course) to collect those statistics – even simply a graph of the count of check failures per invocation, per remote.

For example, I posted yesterday about what seemed to be a Google backend issue where duplicacy would misidentify an existing file as missing. As a result, I’m going to retest with my own service account, and if the issue persists (i.e. it is not related to the shared Google Cloud project that issued the credentials), I’ll be done with Google Drive and will move to Google Cloud.

Yes! Let’s do that. I volunteer to play with an SFTP server in a VM. It would be interesting to reproduce the zero-chunks-on-full-disk case.

I can also report on Google Drive over the last few years: no data about quotas, but it does tolerate abrupt connection changes very well. Not once did it fail in the past 3 (4?) years – and I carry my laptop between locations, Ethernet/WiFi, and sleep/wake. There was never any data loss, but more than enough datastore corruption with prune, in the form of orphaned snapshots when prune is interrupted. That’s a bug in duplicacy, not Google Drive, and needs to be fixed.

I won’t be surprised if the issues turn out to be limited to small-scale providers – a Synology NAS (:face_vomiting:) in the closet, pCloud, B2, the defunct hubiC, and others who did not invest enough in QA to iron out corner cases.