System Migration Causes Full Backup

Long-time user, but I had to migrate to a new home system recently. I’m running Duplicacy in a Docker container, and the underlying system data was migrated (disk to disk via rsync) during the move. A couple of notes:

  • Duplicacy Web Edition 1.7.2
  • The Docker container volume mountpoint used for backups changed, but the underlying data is the same.
  • The Duplicacy app data, config, and cache were migrated; I did not rebuild the setup from scratch.
  • Storage is in Wasabi and contains approx. 500k chunks coming in at around 3 TB.

Checks and pruning jobs work fine, but what I am seeing is that when backups start, the job wants to back up every file as if every file has changed. I would have hoped that it recognized that it already had the content backed up in Wasabi and would just perform incrementals, backing up only changed files. I suspect something happened to make it think every file has changed (timestamps? chown permissions? backup endpoint change?). I thought I was careful with that when using rsync, but it’s not clear to me what kind of changes would trigger a need for Duplicacy to perform a full backup.

Unfortunately this will double my storage usage if I let it run. I would either need to make the backup look beyond whatever surface-level change is causing this to happen (I am confident that no file content has changed), or I will have to do some heavy pruning to make room for a full backup (something I’d like to avoid if I can).

Cheers.

Do you actually see upstream traffic? If your data did not change, the chunks that are generated will already be present on the storage and nothing will get uploaded. Even if you shuffle your files around, change paths, etc.

Thanks for the reply.

see upstream traffic

What I see is the estimator showing something like 7 hours for the backup to complete. Previously, before the move, the estimate was typically in minutes. I can try a -dry-run and -d to get more details if you think that would be helpful, but I suspect the result will likely be a massive log file.

(Maybe I should clarify: I assume it’s going to do a full transfer of all data to Wasabi, but it might still be in an indexing phase and end up not actually transferring anything when the time comes… it just looks like it will do everything, and I want to avoid that.)

your data did not change

Does that include things like file timestamps, permissions, owners, access times? Wondering if the metadata might have some impact here.

Yes, this is likely the case. Depending on how you copied the data, Duplicacy may perceive those files as changed. But it does not matter, because if the content did not change, the [vast, overwhelming majority of] chunks it generates during the backup will be the same and therefore already exist on the target. Even if you back up to a new snapshot ID. The chunk name is derived from the content. Content the same — chunk name is the same.
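(For anyone curious how that plays out, here is a minimal sketch of the idea in Go. It is not Duplicacy’s actual code; the hash choice and the in-memory “storage” map are stand-ins purely for illustration.)

    // Sketch of content-addressed chunk storage: the chunk "name" is a
    // hash of its bytes, so identical content always maps to the same name.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // chunkName derives the storage name purely from the chunk's content.
    func chunkName(chunk []byte) string {
        sum := sha256.Sum256(chunk)
        return hex.EncodeToString(sum[:])
    }

    // storage stands in for the remote bucket: chunk name -> already stored?
    var storage = map[string]bool{}

    // store uploads a chunk only if no chunk with the same name exists yet.
    func store(chunk []byte) (uploaded bool) {
        name := chunkName(chunk)
        if storage[name] {
            return false // content unchanged -> chunk already on target, skip
        }
        storage[name] = true
        return true
    }

    func main() {
        fmt.Println(store([]byte("same bytes"))) // true: uploaded the first time
        fmt.Println(store([]byte("same bytes"))) // false: deduplicated, nothing sent
    }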

Not sure which metadata specifically, but some of it, definitely. E.g. file ownership; IIRC that probably changed when you moved the data to the new system.

It will take quite some time to shred all that data into chunks again.

Dry-run will do absolutely everything besides uploading the chunks. But we don’t expect chunks to get uploaded anyway.

-d might be interesting.

Or just do the low tech solution - run backup and look at upstream network traffic for a few minutes.
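(If you want a slightly less manual version of the low-tech approach, something along the lines of the sketch below prints transmitted bytes per interval while a backup runs. This assumes Linux and an interface named eth0; any traffic monitor you already trust works just as well.)

    // Rough upstream-traffic watcher: samples transmitted bytes for one
    // interface from /proc/net/dev (Linux) and prints the delta every 5s.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
        "time"
    )

    // txBytes returns the total transmitted-byte counter for iface.
    func txBytes(iface string) uint64 {
        data, err := os.ReadFile("/proc/net/dev")
        if err != nil {
            return 0
        }
        for _, line := range strings.Split(string(data), "\n") {
            fields := strings.Fields(strings.ReplaceAll(line, ":", ": "))
            if len(fields) >= 10 && fields[0] == iface+":" {
                tx, _ := strconv.ParseUint(fields[9], 10, 64) // fields[9] = TX bytes
                return tx
            }
        }
        return 0
    }

    func main() {
        iface := "eth0" // assumption: change to your host/container interface
        prev := txBytes(iface)
        for {
            time.Sleep(5 * time.Second)
            cur := txBytes(iface)
            fmt.Printf("%s upstream: %.1f MB in the last 5s\n", iface, float64(cur-prev)/1e6)
            prev = cur
        }
    }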

I would not worry much, though. If, due to some crazy bug, Duplicacy ends up uploading the whole thing again, you can always prune that revision (and even if it didn’t complete, remove the orphaned chunks with the -exhaustive option) and investigate further. It’s not irreversible.

OK. So I am currently running with -v -d -dry-run -threads 4 -stats and it is estimating around 8 hours to complete. At the end of that, will we be able to determine whether it needed to upload anything? It would be good if the dry run did all the calculations and then at the end basically said “99% of the chunks were found in Wasabi and here is the number of new chunks that I need to upload”. Is that possible?

end up uploading the whole thing again

Yeah, it’s just the cost. It’s not a huge amount, but if it can be avoided… it would be nice. :)

I’m not sure. But maybe the dry run will end up updating caches and Duplicacy will change the estimate. Either way, it’s just an estimate.

I would not bother, personally. Run the backup with -d if you want, see if chunks are actually being uploaded, and look at upstream traffic. Content-addressable storage is at the core of Duplicacy’s architecture. It successfully deduplicates common pieces of data from different machines, let alone literally the same dataset.

We are talking, worst case, about one day’s worth of storing an extra few TB. Oh wait, Wasabi still charges for the whole month (or three?) anyway, don’t they?

The dry-run with debugging completed. I’m not exactly sure what the results mean. Hoping there is information detailing what will happen before I run it for real.

2023-11-01 17:59:40.802 INFO BACKUP_STATS Files: 541159 total, 2477G bytes; 540015 new, 2477G bytes
2023-11-01 17:59:40.802 INFO BACKUP_STATS File chunks: 524532 total, 2477G bytes; 3596 new, 22,524M bytes, 21,484M bytes uploaded
2023-11-01 17:59:40.805 INFO BACKUP_STATS Metadata chunks: 115 total, 132,608K bytes; 79 new, 93,057K bytes, 54,386K bytes uploaded
2023-11-01 17:59:40.805 INFO BACKUP_STATS All chunks: 524647 total, 2477G bytes; 3675 new, 22,615M bytes, 21,538M bytes uploaded
2023-11-01 17:59:40.805 INFO BACKUP_STATS Total running time: 06:32:14
  • Looking at the “All chunks” summary line, does this mean it found 21,538M bytes in 3675 new chunks that would have been uploaded (adding approx. 22 GB to the store)?

  • At the end of the run, a grep shows it logged 540003 lines that look like:

INFO UPLOAD_FILE Uploaded some/path/to/a file/blah.jpg (13826)

If this means each of those files would have been “uploaded”, that is a lot more than I would have expected. But we don’t upload files, right? Just chunks… so maybe I am misinterpreting that log line.

Yes

It’s “uploaded” in the sense that this file will be part of the backup stored on the destination.

The chunks for this file would be uploaded unless they are already present on the target.

This happens to be the case for 0.7% of total chunks, according to the output. 99.3% of all chunks will be re-used and not re-uploaded.
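(For reference, that percentage falls straight out of the summary lines above: 3675 new chunks out of 524,647 total is 3675 ÷ 524,647 ≈ 0.007, i.e. roughly 0.7% new, so about 99.3% of the chunk data already sits in Wasabi and is simply reused.)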

So it seems it will upload 22 GB of new chunks.

Thank you @saspus for your guidance here. All is well: a full backup finished (with -hash to be sure) and it uploaded only what was reported in the dry run. I learned a bunch, and how it works is a bit clearer to me now. A couple of observations for others to note… hopefully I got this right:

  • The estimate really reflects the indexing and chunk calculations performed on the files. Even if it is estimating hours to completion, it could be that 0 bytes actually get uploaded once each chunk is checked against what is already in storage.

  • The UPLOAD_FILE term in the logs is confusing. As stated above, it’s just an indicator that the file is a member of the backup, not that its data was actually uploaded.

  • It would be really helpful if there were some kind of indicator of actual chunk upload activity in the app. There is a per-second data rate provided, but I don’t think it is network-related.

  • Just because a file is flagged as changed and needs to be “evaluated”, it doesn’t mean there will be upload activity. The file first has to be checked for actual changes to its data, which determines whether any of its chunks need to be uploaded (this is done during the indexing/chunk calculation process; see the sketch below). That wasn’t completely clear to me before. I just thought that if a file was flagged for evaluation (due to chown/chmod/etc.), it was going to end up being sent to storage.
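Putting those observations together, the overall flow is roughly what the sketch below shows: a metadata change flags the file for evaluation, and only the chunk-level existence check decides what actually gets uploaded. This is illustrative only, not Duplicacy’s implementation; the metadata fields and helpers are simplified assumptions.

    // Two stages: (1) a file is flagged when its metadata changed;
    // (2) the flagged file is re-chunked, and only chunks whose
    // content-derived names are missing from storage get uploaded.
    package main

    import "fmt"

    // FileMeta holds the metadata an incremental backup typically compares.
    type FileMeta struct {
        Size  int64
        MTime int64
    }

    // flagged reports whether the file needs to be re-evaluated at all.
    // It says nothing about whether anything will actually be uploaded.
    func flagged(prev, cur FileMeta) bool {
        return prev != cur
    }

    // backupFile takes the chunks of a flagged file and counts how many
    // would really need uploading.
    func backupFile(chunks [][]byte, existsOnStorage func([]byte) bool) (uploads int) {
        for _, c := range chunks {
            if !existsOnStorage(c) {
                uploads++ // only genuinely new content costs bandwidth and storage
            }
        }
        return uploads
    }

    func main() {
        prev := FileMeta{Size: 100, MTime: 1}
        cur := FileMeta{Size: 100, MTime: 2} // e.g. the migration changed the mtime
        fmt.Println("flagged for evaluation:", flagged(prev, cur)) // true

        // Content is unchanged, so every chunk already exists remotely: 0 uploads.
        allExist := func([]byte) bool { return true }
        fmt.Println("chunks uploaded:", backupFile([][]byte{{1}, {2}}, allExist)) // 0
    }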

