Specify revision number

Thank you so much for confirming all of this for me! I think I will go with the second option. Every now and then I have as much as 100 GB written in one day; other times almost nothing at all. I can look at the statistics after my next remote backup and re-seed by creating r2 locally if I’m worried about the amount of data to be uploaded.

I was guessing that the software assumes all chunks from the latest previous rev already exist, given how fast it runs after the initial run; for me, with my internet speeds, most of the time is spent chunking rather than uploading.
This makes me question things like: “How does it know what’s new/changed from the previous rev?” “What are the chances it misses changes?” “How do file attributes affect the backup? If I do a setfacl to add a user to every file on my data drive, is it going to re-chunk the whole thing, or will it even notice a difference?” “Did I know the answers to all of this years ago and have forgotten? (lol) If not, how did I never think to question it?”

You’ve been a great help across all of my posts today. I will get out of your hair for now. This is like… my yearly check-in. lol

As a final check, I will run my check script on both storages when I’m done for good measure.

Not entirely certain what effect that might have on a Linux system, but even if it ‘touched’ all the files, I’m fairly certain Duplicacy would only need to re-hash the source data and would then determine that it only needs to upload the metadata chunks, not the file content chunks, since chunking is deterministic.
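A quick way to test that theory would be something along these lines (the path, user name, and the exact wording of the -stats output are just my assumptions, so adjust to your setup):

```
# Made-up path and user; apply the ACL change, then back up from the repository root
setfacl -R -m u:someuser:rX /data

cd /data
duplicacy backup -stats
# If chunking really is deterministic, the new-chunk count reported by -stats
# should stay tiny (mostly metadata chunks), even though every file was "touched".
```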

In theory, all the revision files for a repository contain enough information to know what chunks should exist in storage. But that excludes chunks for other repositories and backup clients, and it might be difficult to scale if it has to read every revision.

So perhaps it reads only the last revision, which contains the chunk list. And for everything else, it checks before upload to see whether the chunk exists? Not certain of that either; would need to check the source code.

Haha, no problem at all. :slight_smile:

Good idea, and don’t forget you can always -dry-run to make sure it’s not overly zealous with the downloads. (Actually! It just crossed my mind that it might still download content chunks?? I definitely need to check the source…).

Edit: Well OK, seems the copy command doesn’t have a -dry-run option!

:sweat_smile: It would be really expensive if it downloads the content chunks and then just skips them because they already exist. I think it would really depend on how it determines what needs to be copied.

Maybe it will compare the latest remote rev with the latest local, and copy what it thinks it needs that way. Maybe it will check every single file it thinks it needs by listing the remote storage and only copy the ones that are missing. Maybe it will only copy the files that were made in the latest revision… which would actually not contain all of the files, since the revisions aren’t currently in sync. :man_shrugging:

If that last scenario is true, I think I would be stuck with having to rsync the files up with the most current local revision initially, but then I’m back to having a broken revision number. :tired_face:

Eh, I’ll try what we talked about and verify it to see what happens.

What I mean is, by downloading the revision files, Duplicacy can get an idea of what chunk IDs already exist on the storage, since the revision files contain a complete chunk list (sequence) of what each revision uses. See here (damn, I struggled to find those docs again!).
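If I remember right, you can even dump one of those snapshot files yourself with the cat command, which prints the snapshot content when no file path is given (snapshot id and revision number below are placeholders):

```
# Prints the snapshot file for revision 566 of snapshot id "myrepo",
# chunk sequence included (adjust the placeholders to your repository)
duplicacy cat -id myrepo -r 566
```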

So it doesn’t need the chunk contents as such, only the chunk ID.

I just think it would be infeasible to download all the revision files (for other repositories as well) and combine that information, as it would be memory-intensive and still incomplete, so it likely only uses the last revision as that’s good enough. Then if Duplicacy has to hash and rechunk a file, it knows the chunk IDs (which are broadly deterministic) and can check them against the list stored in the previous revision.

What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.

Anyway, let us know how you get on. :slight_smile:

Ha. Wow. This does clear things up. I was wondering why the snapshot files were so small. All of the data in them is actually chunked and dedupe-uploaded with the rest of the data! It also provided some insight into the -hash option. I thought it might download each piece of data to verify, but if it’s just checking against the hashes in the repo, I might include the -hash once a month or so.
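If I do go that route, it would just be something like this in cron (schedule and paths are placeholders, of course):

```
# Run a full re-hash backup at 03:00 on the 1st of every month (made-up paths)
0 3 1 * * cd /data && /usr/local/bin/duplicacy backup -hash -stats
```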

> What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.

I seem to “remember” an old technical write-up, from before this forum existed, that mostly explained lock-free deduplication, along with some other things. I thought it suggested that a “copy” was performed for each new chunk of data during the backup, and when the copy failed because the file already existed, that chunk was considered skipped. This saves an API call and speeds up the upload at the same time by skipping a lookup. Things could have changed since then… including my memory. I wondered, though, whether copy worked similarly, and if it did, whether the copy would fail (due to existing files) before or after the actual download occurred, and whether it would check every single file in the repository in this manner or just the new ones, etc. Besides, that’s just for the backup command. There’s no telling with the copy command.

I decided to verify both backup locations before doing anything else, to make sure that if there are problems, they didn’t already exist… well, my latest backup [edit: latest 2 backups] is missing 5 or more chunks. I guess I will prune that backup first.
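For anyone following along, I assume that would look roughly like this (revision numbers are hypothetical):

```
# Confirm which revisions reference missing chunks (no chunk downloads involved);
# add -storage <name> to check the other storage as well
duplicacy check -all

# Then prune the broken revisions, e.g. if 565 and 566 are the bad ones
duplicacy prune -r 565-566
```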

I’ve had this problem once before, a long time ago, but that turned out to be a bug, which was then fixed. It would be nice if the “check” command had a flag to attempt to restore these chunks by re-chunking the files associated with them, to see whether they can all still be produced or whether the local data no longer supports restoring those chunks. Like an --attempt-reconstruction or something like that. Should I make this a feature request?

I suppose I could fake this by moving the snapshots out, running a backup, then putting them back. Either way, I think I could just delete them instead of pruning so I don’t have to re-upload all of the data.

[edit]
PS this worked. I was able to copy the broken revisions out, run the backup, copy the new conflicting revision out, copy the previously broken revisions back in and verify with a “check” that all of the previously missing chunks were there.
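In case it helps anyone later, the sequence was roughly this (an unencrypted local-disk storage at /mnt/backup, snapshot id “myrepo”, and revisions 565/566 as the broken ones are all placeholders; on a remote storage you’d shuffle the files with that provider’s own tools instead):

```
mkdir -p /tmp/broken /tmp/new

# 1. Move the broken revision files out of the storage
mv /mnt/backup/snapshots/myrepo/565 /tmp/broken/
mv /mnt/backup/snapshots/myrepo/566 /tmp/broken/

# 2. Run a backup, which re-uploads the chunks those revisions were missing
#    (add -storage <name> if the affected storage isn't the default one)
cd /data && duplicacy backup

# 3. Move the newly created (now conflicting) revision file out of the way
mv /mnt/backup/snapshots/myrepo/565 /tmp/new/

# 4. Put the original revision files back and verify
mv /tmp/broken/565 /tmp/broken/566 /mnt/backup/snapshots/myrepo/
duplicacy check -all
```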

The second strategy is not a feasible option. It is very obviously doing what I thought I remembered from the old documentation: it only tries to copy and skips if the chunk already exists. This performs a download of every chunk; it does not check whether the files already exist before downloading. I’m going to hand-download the latest revision and verify the data against it afterwards to see if it has everything. Right now the remote backup is the latest, so I will run the local one once more first.

This didn’t work either :frowning: I’m not sure what to do.

This operation would cost me over $400. It would be nice to be able to more easily add a local storage to the mix. Can I somehow manually decrypt a snapshot, spoof the revision number, then re-encrypt it?

Note: I didn’t read the whole topic, so I may talk stupid stuff.

If you wanna decrypt and re-encrypt, why not just make a new snapshot id? It’ll start again from revision 1, but since :d: is all about deduplication, it should all be OK in terms of how much new data is uploaded, right?

TL;DR, my main goal is to create a new local backup and start copying it to my remote. The remote is on revision 566 now and would have conflicting revisions if I tried that. I want my local copy to basically think it’s on 567 instead of 1 so I can copy up from now on… or any other workaround, but the others haven’t worked so far.

That shouldn’t be happening. Not if the same repository, with the exact same content, was backed up to a local storage that was created with add -copy. Most of your chunks are already present.

From your screenshot above, it’s saying that most chunks were “skipped at the destination”. That means it has no reason to download that chunk from remote - it merely knows of its existence from the revision you’re copying.

(But make sure you’re only copying the last revision of that specific repository with -id <snapshot id> -r 566, otherwise it’ll grab all the revisions from all of the repositories. :slight_smile: )
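i.e. something along these lines, assuming your B2 storage is named “default” and the new local one “local” (both names, and the snapshot id, are placeholders):

```
# Copy only revision 566 of this one repository from B2 down to the local storage
duplicacy copy -id myrepo -r 566 -from default -to local -threads 4
```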

Yes, they are skipped, but network traffic is very high. Are you sure they’re not skipped only after trying to copy, realizing the chunk exists, and choosing not to overwrite it? Even if it doesn’t download completely before deciding it already exists, how would a GET API call be counted? The copy log shows x/total, and that total matches the total number of chunks for my entire repository. I don’t know why there would be so much network traffic if it’s skipping all these files by comparing the latest snapshots from both locations in memory, or even just the latest remote snapshot against the files on disk. I might just have to crawl the code and tweak it to my needs if it’s simple enough. :man_shrugging:

The total is normal, it’s just showing the total number of chunks used for that revision. Skipped chunks should not use bandwidth or any API calls. I can only imagine the network usage is where it’s actually copying needed chunks, like metadata, in separate threads.

You’ll notice, from your screenshot, that the one at the bottom was “copied to the destination”, but that its chunk number (452274) is far earlier than all the others (i.e. 452943). That’s because a thread finished copying that chunk after iterating through. It also suggests to me that the vast majority of chunks are being skipped, and Duplicacy will sail through those 1.3m chunks pretty sharpish. :slight_smile:

Hmm… IF you ran a recent backup to both the local and remote storages from the same repository, those chunks - including the metadata chunks - should be mostly the same.

HOWEVER, I’ve just now realised it’s possible that when your local and remote backups were made, the data may have been chunked differently, which could explain the discrepancy. So, try running a fresh backup to both local and remote with the -hash option. The backup will take longer as it has to hash everything, and you may end up uploading a bit more to B2, but when it comes to copying back revision 567 (as it will be by then), it should have less to download and you’ll have less (or hopefully no) download cost to incur.
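i.e. something like this, run from the repository root (storage names “local” and “default” are placeholders for your local and B2 storages):

```
# Full re-hash backup to each storage so both end up with identically chunked data
duplicacy backup -hash -stats -storage local
duplicacy backup -hash -stats -storage default
```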

You can rename a snapshot file if the storage is not encrypted. So if an unencrypted local storage is acceptable to you, you can simply run a local backup at revision 1, and then rename it to whatever revision you want.
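For example (the local storage path /mnt/backup and snapshot id “myrepo” are placeholders):

```
# Back up to the unencrypted local storage (creates revision 1), then bump the
# snapshot file name past the remote's current revision 566
cd /data && duplicacy backup -storage local
mv /mnt/backup/snapshots/myrepo/1 /mnt/backup/snapshots/myrepo/567
```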

Otherwise, another option is to use a new repository id for the local backup, so when you copy from local to B2 there won’t be a revision number conflict.
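That would look roughly like this (storage names, snapshot id, and path are placeholders):

```
cd /data

# Add the local storage under a *new* snapshot id so its revision numbers
# can never clash with the existing id's revisions on B2
duplicacy add -copy default local myrepo-local /mnt/backup
duplicacy backup -storage local

# Later, copy that id's revisions up to B2 without any conflict
duplicacy copy -id myrepo-local -from local -to default
```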