How can a just-created snapshot revision be "missing chunks"?

I have a duplicacy repository that backs up my laptop’s home directory onto a B2 bucket. It has been running for a couple years; the revision numbers are up to 1500.

At some point along the way, something went wrong (probably prune being run from two places as I migrated from my old laptop to a new one), and now check reports “missing chunks”.

I can understand that if some chunks referenced by a snapshot revision were deleted from the storage, for whatever reason, then the check command would report “missing chunks” for that revision.

However, I’m seeing “missing chunks” reported for a snapshot revision that was created just now — without any prune operation since the revision was created.

How is this possible?

I thought that Duplicacy’s backup operation worked like this (a rough code sketch follows the list):

  1. Crawl the entire source tree (in this case, my laptop home directory).
  2. As directories and files are read, break this source data into “chunks”.
  3. When a chunk is created, calculate its hash; this becomes the name of the chunk.
  4. See if a chunk with this hash already exists in the storage; if not, upload it.
  5. Keep track of all the chunk hashes.
  6. When the whole tree has been traversed, record the list of chunks; this is what the “revision” actually contains.
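
If those steps were the whole story, a backup would look roughly like this sketch (all names are mine, not Duplicacy’s; fixed-size chunking stands in for the real content-defined chunker, and the storage is modeled as a map). Note that step 4 checks the storage for every single chunk, which is why this model could never produce a revision with missing chunks:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io/fs"
    "os"
    "path/filepath"
)

// naiveBackup sketches the six steps above. Because every chunk ID in the
// revision passed an existence check against the storage (step 4), a
// just-created revision could never reference a missing chunk.
func naiveBackup(storage map[string][]byte, root string, chunkSize int) ([]string, error) {
    var revision []string // steps 5-6: the revision is just this list of hashes
    err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
        if walkErr != nil || d.IsDir() {
            return walkErr // step 1: crawl the whole source tree
        }
        data, err := os.ReadFile(path)
        if err != nil {
            return err
        }
        for off := 0; off < len(data); off += chunkSize { // step 2: break into chunks
            end := off + chunkSize
            if end > len(data) {
                end = len(data)
            }
            sum := sha256.Sum256(data[off:end]) // step 3: the hash names the chunk
            id := hex.EncodeToString(sum[:])
            if _, exists := storage[id]; !exists { // step 4: upload only if absent
                storage[id] = append([]byte(nil), data[off:end]...)
            }
            revision = append(revision, id) // step 5: keep track of all chunk hashes
        }
        return nil
    })
    return revision, err
}

func main() {
    storage := map[string][]byte{}
    rev, err := naiveBackup(storage, ".", 1<<20)
    if err != nil {
        fmt.Println("backup failed:", err)
        return
    }
    fmt.Printf("revision references %d chunks; storage holds %d objects\n", len(rev), len(storage))
}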

I must be wrong about some aspect of the above, because if Duplicacy followed these steps on every backup, I don’t see how any chunks could be missing from a new revision immediately after creation.

Please help me understand.

This is not a full answer to my own question, but it’s information that helps clarify the situation a little more:

I ran backup again on the snapshot id with missing chunks, this time including the -hash option. The resulting revision is complete (no missing chunks), according to check. This indicates that backup -hash is closer to the algorithm I described above.

I’d still like some help understanding why chunks may be missing when the -hash option is not used.

Once a revision reports missing chunks, does it stay broken or does it eventually get fixed?

Does this reproduce after deleting the local cache? Perhaps its state is no longer consistent, so duplicacy may create a metadata chunk referencing a chunk that is actually missing from the storage. The cache is supposed to contain only chunks that exist on the storage (which is the source of truth), but this is a plausible corner case.

Here, in UploadSnapshot:

So a new revision can reference missing chunks when those chunk IDs were present in the previous snapshot’s metadata but the corresponding objects were deleted from the storage earlier for some reason (e.g. an interrupted prune?)
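
A toy illustration of that failure mode (made-up names and data, not the actual UploadSnapshot code):

package main

import "fmt"

func main() {
    // The previous snapshot's metadata references three chunks, but suppose
    // an interrupted prune already deleted the storage object for "chunk-b".
    prevRevisionChunks := []string{"chunk-a", "chunk-b", "chunk-c"}
    storageObjects := map[string]bool{"chunk-a": true, "chunk-c": true}

    // The new revision inherits the chunk IDs verbatim; nothing checks the
    // storage, so the dangling reference is carried forward.
    newRevisionChunks := append([]string(nil), prevRevisionChunks...)
    for _, id := range newRevisionChunks {
        if !storageObjects[id] {
            fmt.Println("check would report missing chunk:", id) // chunk-b
        }
    }
}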

It seems to have stayed broken for hundreds of revisions in a row, until I ran backup -hash.

Deleting the local cache did not fix it.

That does seem like it would lead to the problem I experienced.

To confirm: you deleted the local cache from under .duplicacy/repositories, ran backup to b2, immediately ran check, and the newly created revision fails check?

In my case, I ran

find ~/.duplicacy-web -name cache | while read d; do rm -rvf "$d"; done

which deleted the caches for all the Duplicacy Web UI backups.

Then, yes, I ran backup to b2, then check (both commands via the Web UI), and the newly created revision failed the check. The operations followed each other “immediately” in the sense that there were no prune operations in between them, although a few hours elapsed due to the schedule of the human running the commands :slightly_smiling_face:

Ok, this is very puzzling.

Did you verify that the cache subfolders are actually gone, in case there are permission issues?

Can you add another backup destination, e.g. in /tmp or on some sftp server, perhaps for a subset of the data, and test backups into there, to rule out B2 shenanigans? (The fact that it works with -hash may be some timing-related thing.)

Also, I assume there are no failures in the backup log from B2?

Yes.

I did not see any failures in the logs.

After reading the Backup function in duplicacy_backupmanager.go, I have a clearer understanding of how the process works.

Chunks from the previous revision are never (re)uploaded

The Backup function reads the list of chunk IDs referenced by the last snapshot revision into a variable chunkCache. It does not verify whether those chunks actually exist on the storage.

// This cache contains all chunks referenced by last snasphot. Any other chunks will lead to a call to
// UploadChunk.
chunkCache := make(map[string]bool)

Importantly for this discussion, the inverse of the code comment is also true: any chunk that is listed in chunkCache is never uploaded.

In other words, Duplicacy will not, under any circumstance, attempt to upload any chunk that the previous snapshot revision already refers to.
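
A minimal sketch of that skip rule as I read it (hypothetical names, not the actual upload path):

// shouldUpload mirrors the decision implied by the comment above: anything
// listed in chunkCache is skipped outright; existence on the storage is only
// consulted for chunks that the last revision did not reference.
func shouldUpload(chunkCache map[string]bool, storageHas func(id string) bool, id string) bool {
    if chunkCache[id] {
        return false // referenced by the last revision; assumed present, never verified
    }
    return !storageHas(id) // genuinely new chunk: upload only if the storage lacks it
}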

File metadata is compared with the previous revision

I also see that the function checks whether each local file has an exact match (by path, date, and size) in the last snapshot revision. If it does, it simply copies the chunk-ID references from the last revision, making no attempt to (re)upload those chunks or verify that they still exist on the storage.
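
Roughly, as a sketch (the type and function names are mine, not Duplicacy’s):

// fileEntry is a hypothetical stand-in for an entry in a snapshot's file list.
type fileEntry struct {
    Path     string
    ModTime  int64
    Size     int64
    ChunkIDs []string
}

// reuseOrChunk sketches the rule described above: an exact match on path,
// modification time, and size means the chunk references are copied from the
// last revision; the storage is never consulted.
func reuseOrChunk(local fileEntry, lastRevision map[string]fileEntry) []string {
    if prev, ok := lastRevision[local.Path]; ok &&
        prev.ModTime == local.ModTime && prev.Size == local.Size {
        return prev.ChunkIDs // inherited verbatim, even if those chunks are gone
    }
    return nil // nil means: this file's data must go through the chunker
}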

However, with the -hash option, the list of files in the last revision is ignored, and as a result all local files are considered “new”, i.e. suitable for backing up. Everything goes into the chunker, all at once. This may result in different chunking than previous revisions produced (see below).

TL;DR A new snapshot revision can be missing chunks, inherited from the previous revision

— unless the -hash option is used.

Read on for more of what I learned about trying to recreate missing chunks.

But can missing chunks be recreated?

The same file may be represented by different chunks in different contexts

My attempt at reading duplicacy_chunkmaker.go leads me to believe that a given chunk can contain data from more than one file. Files may be fed into the chunk maker in different orders and combinations each time a backup runs, depending on the state of the file system at the time. Therefore, the exact same file may end up being chunked differently at different times — particularly the data at the start and end of files, and small files.

Please correct me if I’m misinterpreting the chunk maker’s behavior.
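
A toy demonstration of the effect, if my reading is right (fixed-size chunking over a concatenated stream stands in for the real content-defined chunker):

package main

import (
    "crypto/sha256"
    "fmt"
)

// chunkStream cuts a byte stream into fixed-size chunks and returns short IDs.
func chunkStream(stream []byte, size int) []string {
    var ids []string
    for off := 0; off < len(stream); off += size {
        end := off + size
        if end > len(stream) {
            end = len(stream)
        }
        sum := sha256.Sum256(stream[off:end])
        ids = append(ids, fmt.Sprintf("%x", sum[:4]))
    }
    return ids
}

func main() {
    fileA := []byte("aaaaaaa")    // 7 bytes
    fileB := []byte("bbbbbbbbbb") // 10 bytes

    // Back up A followed by B, versus B alone: B's bytes land at different
    // offsets in the stream, so the chunks spanning the file boundary and
    // the tail end up with different hashes.
    fmt.Println(chunkStream(append(append([]byte{}, fileA...), fileB...), 8))
    fmt.Println(chunkStream(fileB, 8))
}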

Therefore it may not be possible to get Duplicacy to recreate and reupload missing chunks, even if the source data still exists

The way that duplicacy creates and uploads chunks depends not only on the current state of the files in the repository, but also on the complex ways the filesystem has changed throughout the backup history. A given chunk may encode parts of two or more files that happened to be eligible for uploading at a particular moment, and it may be practically impossible to recreate the exact state that led to that chunking.