Missing chunks shortly after initial backup to GCS

Posting here because while it’s a specific issue, I think it’s very much apropos of the thread above about testing backups.

Two days ago I did my first backup to Google Cloud Storage using "Autoclass." I uploaded 50 GB quickly with no errors. I've since done a few more backups, and last night I ran my first check. It showed all chunks present for revision 1, but each of the other 6 revisions had 6 chunks missing (different ones in each case).

Honestly, despite reading a number of the threads that @saspus and @Droolio kindly linked to, and doing some searching of my own, fixing these missing chunks doesn't seem like something I can do without devoting a ton of time to it.

Can I just delete all of the revisions after revision 1?

Edit: I’ve confirmed that the chunks the check command refers to are not present in the GCS bucket.

Good news is, missing chunks are generally easier to fix than bad ones.

Did you run a prune at all? (If it was interrupted, chunks would’ve been deleted before the revision file - in which case you’d have to manually delete the affected revisions.)
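
If it comes to that, removing a specific broken revision is just a prune with an explicit revision number. A sketch, with your own snapshot ID and revision number substituted (or you can delete the corresponding revision file under snapshots/<id>/ on the storage directly):

duplicacy prune -id <snapshot_id> -r <revision>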

Try deleting your .duplicacy/cache directory, then run a check -fossils -resurrect.
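
On macOS, from the repository root, that would be something like:

rm -rf .duplicacy/cache
duplicacy check -fossils -resurrect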


Well, that's a problem; something clearly went wrong. Missing chunks should never happen, outside of the known issue with an interrupted prune.

Did you run prune by any chance? If prune is interrupted or fails to delete snapshot files for some other reason, those snapshots remain in the storage, but since some of their chunks have already been deleted, check will report a lot of missing chunks.

Checking my logs, it looks like I ran a check command yesterday and it already reported missing chunks in revisions 2-5. At that point in time, I had never run prune on this backup.

I did run prune (as part of a once-a-month script) last night, but a) it didn’t actually remove any revisions and b) the missing chunks problem had already manifested.

So I’m guessing something went wrong on the initial upload? My best guess is that perhaps I made a mistake by copying over a cache folder from an earlier .duplicacy folder (I’m not sure I did that, but perhaps a 20% chance).

The only revision without missing chunks is revision 1, and that was an early test backup with a tiny bit of data, so I don't really have a good revision of my entire data set.

I guess I'm starting over here, which means I could do a couple of things differently (larger default chunk size, erasure coding on, the new compression algorithm???). But I went with GCS precisely to avoid this kind of problem, so now I'm really wary…

It should not matter (I believe data in the cache folder is only used if it also exists on the target), but it's worth deleting it and running check again.

But what "earlier .duplicacy folder"? Was it from the same storage? Maybe it's still running?

We need to figure out what went wrong; there are no known issues with chunks not being uploaded. Duplicacy uploads chunks first and the snapshot file last, so by the time the snapshot file exists, all of its chunks must exist. Furthermore, the "backup" command never deletes anything.

Are you sure there are no other tasks running, perhaps from an old instance of duplicacy targeting the same storage, perhaps running prune inadvertently?

GCS is solid. Let’s figure out what happened there.

I can say with certainty that there were no other instances of duplicacy running, targeting this storage (I'm 99% sure there were none running at all, but I definitely only init'd this storage the one time).

I had a .duplicacy folder for a local backup. It had some scripts and LaunchAgent .plist files, plus a filter file that I wanted to use for the new GCS setup. So it's conceivable that I copied over the cache folder from there. But if that was not the source of the problem with missing chunks, then I'm not sure what was.

Perhaps I moved too quickly, but I went ahead and deleted revisions 2-8, which had the missing chunks. So my check now succeeds, but it's a check on just the first revision, which was only about a 50 MB upload.

In case it’s relevant, the Google Cloud Console reports a 50% client error rate during my backups, and when I get more info it says this:

response_code: monitoring.regex.full_match(CANCELLED|INVALID_ARGUMENT|NOT_FOUND|ALREADY_EXISTS|PERMISSION_DENIED|UNAUTHENTICATED|RESOURCE_EXHAUSTED|FAILED_PRECONDITION|ABORTED|OUT_OF_RANGE)

project_id: duplicacybackup-455420

No need to start over. You’ll want to avoid re-uploading chunks on GCS.

Absolute worst case, use a fresh ID, then clean up afterwards with prune -exhaustive.

How many -threads are you using?

I was not limiting threads in the backup command, and I noticed in Activity Monitor that duplicacy seemed to be consistently using 17 threads.

Sorry to be slow, but can you spell out how to “use fresh IDs”? I assume this means run another backup to the same storage with a different ID, but since the ID is set when I init the repository, I’m not sure how to proceed. Do I add a new storage that is really the same storage, but give it a different ID?

Second question: after I back up with a new ID, would I run prune -exhaustive from the old ID or the new one?

Hmmm, I don’t think Duplicacy indicates clearly how many threads are used during backup, apart from in the first few log lines. But by default it should choose 1 if you don’t set it yourself.
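
If you want to set it explicitly, it's just a flag on the backup command (the count of 4 here is only an example):

duplicacy backup -threads 4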

While you can use add, the easiest way is to simply edit the .duplicacy/preferences file and change the ID there.
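
For reference, the entry in that file looks roughly like this - a sketch only, since your file will have more fields and the bucket name and ID here are placeholders; the "id" value is the only thing to change:

[
    {
        "name": "default",
        "id": "miniM4-fresh",
        "storage": "gcs://your-bucket",
        "encrypted": false
    }
]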

Remember, you shouldn’t need to do this if revision 1 is sound - just manually remove the broken revisions and cleanup.

After a backup with the new ID (if you go this route), I'd recommend running a check -id <id> (instead of -all), just to be sure it was successful. Then manually delete your old snapshots/<id> directory on the destination storage.

Finally, run prune -exhaustive to clean up. (You can also do this if you don't use a fresh ID and just remove the broken revisions instead.)

Note, it won’t remove chunks immediately, unless you use -exclusive along with it. You should avoid using that flag unless you’re absolutely sure nothing else is happening. Otherwise, chunks are removed in the two-step fossil collection process eventually.
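
Putting the whole sequence together, it would look roughly like this (snapshot IDs and bucket name are placeholders):

duplicacy backup
duplicacy check -id <new_id>
gcloud storage rm --recursive gs://<bucket>/snapshots/<old_id>
duplicacy prune -exhaustive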

Thanks so much for your help!

Perhaps there's a difference between how macOS Activity Monitor reports threads and how this is set in duplicacy. But when I have the duplicacy CLI running and I check the process in Activity Monitor, it reports "17" under the threads column. It stays there fairly consistently. When it first ramps up it's smaller, and I've seen it go to 18, but with a long backup it seems to sit on 17. It's not taking much CPU or memory, and I notice no hit to my system (M4).

I had already gone ahead – it seems prematurely, my mistake – and removed the revisions with missing chunks. This was an error because I forgot that the first revision (the only one without missing chunks) was a very small initial backup.

In any case, with those revisions removed, check reports no errors. So I'm now running another backup with about 1/5th of my data. I'll then run check immediately after that and see where we are.

OK, I get it – I just change it in preferences and then the new backup appears under a different snapshot ID. That makes sense – thanks.

EDIT to add: I still don't know what caused this, but I have to guess it was my mistake. I wonder if I had a backup running and then loaded a launchctl plist that had the "RunAtLoad" key set to "true." Somehow this tried to check or prune while a backup was running. I don't have evidence I did this, but it's not out of the realm of possibility.

UPDATE

Maybe it wasn’t me, because I seem to have replicated the problem. To be clear,

  1. Earlier today I removed all revisions other than revision 1.
  2. I ran a check on revision 1 and it reported all chunks exist.
  3. I then did a new backup of about 3 GB of files. It completed revision 2 without errors.
  4. I immediately ran a check and it reported this:
1 snapshots and 2 revisions
Total chunk size is 8,352M in 1894 chunks
All chunks referenced by snapshot miniM4 at revision 1 exist
Chunk 565cf03cc9969bfda8c33eeb08d401fee466843ac8b4c73f1a99a59b907afe8e referenced by snapshot miniM4 at revision 2 does not exist
Chunk 9ce8675d638b53ec808c7c42850678b621ccf460747c964368b57fe7933b4f64 referenced by snapshot miniM4 at revision 2 does not exist
Some chunks referenced by snapshot miniM4 at revision 2 are missing
Some chunks referenced by some snapshots do not exist in the storage

I can confirm that neither of those chunks is in the bucket (I searched using the Google Cloud Storage command line). But I don't understand why they are referenced in the snapshot.

I’m lost.

EDIT: just ran check -fossils -resurrect (after deleting the cache folder) and it ended with the same missing chunks reported.

Aha! Process threads are a little bit different to transfer threads, but it makes sense now. I don't recall what Windows or Linux shows, but it's probably similar.

Interesting. This reminds me of some recent-ish threads here and here.

Are you possibly re-using snapshot IDs from earlier runs?

EDIT:

Did you search for 5cf03cc9969bfda8c33eeb08d401fee466843ac8b4c73f1a99a59b907afe8e in bucket/56? (The first two characters of the chunk hash are its parent directory.)
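
i.e. something along these lines, assuming the default chunk layout (bucket name is a placeholder):

gcloud storage ls gs://<bucket>/chunks/56/ | grep 5cf03cc9969b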

I guess I must be, in that I didn't change the snapshot ID. I just deleted the revisions that had missing chunks, and then ran backup again.

Definitely delete your .duplicacy/cache then; there's no harm in doing that, except that Duplicacy will re-download a handful of metadata chunks from the destination.

I ran this:
gcloud storage ls gs://dup47/** | grep 565cf03cc9969b

MORE…

OK. I just changed my snapshot ID in preferences, ran backup again, and now check reports all is well in that revision. I’m not sure what that means I do next: keep running with this changed snapshot ID and assume all is well?

You should be fine. Just run check more regularly in the beginning to monitor things, then you can run it less often. Probably just weirdness with the initial setup. 🙂

OK, great. I’ve made a note (in my huge pile of notes) that if something goes wrong on check, change the snapshot ID before backing up again.

I also went ahead and used the prune command set to the OLD snapshot ID to remove the two revisions associated with that snapshot. After doing so, a check on the new snapshot ID still gives the all clear.
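
For anyone following along, that was roughly:

duplicacy prune -id <old_snapshot_id> -r 1-2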

I guess I must have done something wonky when I was getting it set up the first time. I'll go slower this time and not get the scripts running until things look OK.

Thanks again for your help – really appreciate it.

Generally you'd want to keep the same ID if you've been making good progress on backups, since you'll want to keep historic snapshots. I only suggested it as a worst-case fix, to get past first-time setup. Instead, deleting the affected revisions (and the cache) is always better.

Take a gander at the snapshots/<old_ID> directory anyway, just to make sure they're gone. Typically, Duplicacy won't remove the last revision through prune unless you force it with -exclusive.
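
Something like this will show whether anything is still there (bucket and ID are placeholders):

gcloud storage ls gs://<bucket>/snapshots/<old_id>/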

Brilliant. 🙂

np!

So my overnight backup of a larger chunk of data (30 GB) has again failed with missing chunks.

I now have a better guess at the cause: I was running a different instance of duplicacy for a local backup on an hourly schedule. The GCS backups that fail are the ones that take longer to run, so I'm assuming the launching of the local backup is the issue??? I guess this should have been obvious, but in my extensive reading I never saw it mentioned, and it really just didn't occur to me. To be clear, this second instance was not connected to the same storage and had a completely different .duplicacy folder (cache, etc.).

I’ll shut off that local backup now, and then I guess I may have to do the repair process by starting with yet another snapshot ID.

EDIT: further reading indicates that running multiple instances should only be an issue if they are backing up to the same storage. So I'm still very confused about what's going wrong with these GCS backups. I've now fixed the issue again, but this second time around I definitely didn't do anything weird (other than having a separate instance of duplicacy [in its own .duplicacy folder] backing up to a local hard drive).

Indeed. Even then, so long as the snapshot IDs are different, you can back up simultaneously to any storage - that's part of the design.

GCS is probably one of the least-tested backends, so it could be that it has issues. You may want to ask @gchen if there could be anything going on. Try running Duplicacy with the -d debug flag from now on; there may be API-level messages in the output that'll shed some light.
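
For example, something along these lines, saving the debug output to a file for later inspection (the log file name is only an example):

duplicacy -d backup -stats 2>&1 | tee gcs-backup-debug.log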

My only other suggestion would be to run a few more backups without doing any cleanup, and see if the same revisions get flagged with missing chunks after several runs.