Specify revision number

Not entirely certain what effect that might have on a Linux system, but even if it ‘touched’ all the files, I’m fairly certain Duplicacy would only need to rehash the source data and determine that just the metadata chunks, not the file content chunks, need uploading, since chunking is deterministic.
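If you wanted to test that, something like this should do it (the repository path is a placeholder):

```
# Touch every file, then back up. Timestamps change, so Duplicacy re-reads and
# re-hashes everything, but deterministic chunking means the content chunks
# dedupe and only new metadata chunks should be uploaded.
find /path/to/repository -type f -exec touch {} +
duplicacy backup -stats
```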

In theory, all the revision files for a repository contain enough information to know what chunks should exist in storage. But that excludes chunks for other repositories and backup clients, and it might be difficult to scale if it has to read every revision.

So perhaps it reads only the last revision, which contains the chunk list, and for everything else checks whether each chunk exists before uploading it? Not certain of that either; I’d need to check the source code.

Haha, no problem at all. :slight_smile:

Good idea, and don’t forget you can always -dry-run to make sure it’s not overly zealous with the downloads. (Actually! It just crossed my mind that it might still download content chunks?? I definitely need to check the source…).

Edit: Well OK, it seems the copy command doesn’t have a -dry-run option!

:sweat_smile: It would be really expensive if it downloads the content chunks and then just skips them if they already exist. I think it would really depend on how it determines what needs to be copied.

Maybe it will compare the latest remote revision with the latest local one, and copy what it thinks it needs that way. Maybe it will check every single file it thinks it needs by listing the remote storage and only copy the missing ones. Maybe it will only copy the files that were made in the latest revision… which would actually not contain all of the files, since the revisions aren’t currently in sync. :man_shrugging:

If that last scenario is true, I think I would be stuck with having to rsync the files up with the most current local revision initially, but then I’m back to having a broken revision number. :tired_face:

Eh, I’ll try what we talked about and verify it to see what happens.

What I mean is, by downloading the revision files, Duplicacy can get an idea of what chunk IDs already exist on the storage, since the revision files contain a complete chunk list (sequence) of what each revision uses. See here (damn, I struggled to find those docs again!).

So it doesn’t need the chunk contents as such, only the chunk ID.
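You can see it for yourself with the cat command, which prints the decoded snapshot file; the revision number here is just an example:

```
# Print the snapshot file for revision 566 - among other things it contains
# the sequences of chunk hashes that the revision references.
duplicacy cat -r 566
```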

I just think it would be infeasible to download all the revision files (for other repositories as well) and combine that information, as it would be memory-intensive and still incomplete, so it likely only uses the last revision as that’s good enough. Then if Duplicacy has to hash and rechunk a file, it knows the chunk IDs (which are broadly deterministic) and can check them against the list stored in the previous revision.

What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.

Anyway, let us know how you get on. :slight_smile:

Ha. Wow. This does clear things up. I was wondering why the snapshot files were so small. All of the data in them is actually chunked and dedupe-uploaded with the rest of the data! It also provides some insight into the -hash option. I thought it might download each piece of data to verify, but if it’s just checking against the hashes in the repo, I might include -hash once a month or so.

> What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.

I seem to “remember” an old technical write-up, from before this forum existed, that mostly explained lock-free deduplication, along with some other things. I thought it suggested that a “copy” was performed for each new chunk of data during the backup, and when the copy failed because the file already existed, that chunk was considered skipped. This saves an API call and speeds up the upload at the same time by skipping a lookup. Things could have changed since then… including my memory. I wondered, though, whether copy worked similarly, and if it did: would the copy fail due to existing files before or after the actual download occurred, and would it check every single file in the repository in this manner, or just new ones, etc.? Besides, that’s just for the backup command. There’s no telling with the copy command.

I decided to verify both backup locations before doing everything, to make sure that if there are problems, they didn’t already exist… well, my latest backup [edit: latest 2 backups] is missing 5+ chunks. I guess I will prune that backup first.
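Roughly what I ran, for reference (the storage names are placeholders for my local and B2 storages):

```
# Verify that every chunk referenced by every revision actually exists:
duplicacy check -storage local -a
duplicacy check -storage b2 -a
```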

I’ve had this problem once before, a long time ago, but that turned out to be a bug which was subsequently fixed. It would be nice if the “check” command had a flag to attempt to restore these chunks by re-chunking the files associated with them, to see whether they can all still be recreated or whether the local data no longer supports restoring those chunks. Like an --attempt-reconstruction or something like that. Should I make this a feature request?

I suppose I could fake this by moving the snapshots out, running a backup, then putting them back. Either way, I could just delete them instead of pruning, so I don’t have to re-upload all of the data, I think.

[edit]
PS this worked. I was able to copy the broken revisions out, run the backup, copy the new conflicting revision out, copy the previously broken revisions back in, and verify with a “check” that all of the previously missing chunks were there.
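Here’s a rough sketch of the shuffle, in case anyone else needs it - paths, snapshot id and revision numbers are all placeholders (snapshot files live at snapshots/&lt;snapshot id&gt;/&lt;revision&gt; inside the storage):

```
SNAPDIR=/path/to/local/storage/snapshots/myrepo
mkdir -p /tmp/stash

mv "$SNAPDIR/565" "$SNAPDIR/566" /tmp/stash/   # stash the broken revisions
duplicacy backup               # run from the repository; recreates the missing
                               # chunks and writes a new, conflicting revision
mv "$SNAPDIR/565" /tmp/        # move the conflicting new revision out
mv /tmp/stash/565 /tmp/stash/566 "$SNAPDIR/"   # restore the stashed revisions
duplicacy check                # previously missing chunks should now exist
```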

The second strategy is not a feasible option. It is very obviously doing what I thought I remembered from the old documentation: it only tries to copy, and skips if the chunk exists. This performs a download of every chunk. It does not check whether the files already exist before downloading. I’m going to hand-download the latest revision and verify the data against it afterwards to see if it has everything. Right now the remote backup is the latest, so I will run the local one once more first.

This didn’t work either. :frowning: I’m not sure what to do.

This operation would cost me over $400. It would be nice to be able to more easily add a local storage to the mix. Can I somehow manually decrypt a snapshot, spoof the revision number, then re-encrypt it?

Note: I didn’t read the whole topic, so I may say stupid stuff.

If you wanna decrypt and re-encrypt, why not just make a new snapshot id? It’ll start from revision 1, but since :d: is all about deduplication, the amount of new data uploaded should be OK, right?

TL;DR, my main goal is to create a new local backup and start copying it to my remote. The remote is on revision 566 now and would have conflicting revisions if I tried that. I want my local copy to basically think it’s on 567 instead of 1 so I can copy up from now on… or any other workaround, but the others haven’t worked so far.

That shouldn’t be happening. Not if the same repository, with the exact same content, was backed up to a local storage that was created with add -copy. Most of your chunks are already present.

From your screenshot above, it’s saying that most chunks were “skipped at the destination”. That means it has no reason to download that chunk from remote - it merely knows of its existence from the revision you’re copying.

(But make sure you’re only copying the last revision of that specific repository with -id <snapshot id> -r 566, otherwise it’ll grab all the revisions from all of the repositories. :slight_smile: )
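i.e. something like this, where the storage names are placeholders:

```
# Copy only revision 566 of this one repository from remote to local:
duplicacy copy -from remote -to local -id <snapshot id> -r 566
```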

Yes, they are skipped, but network traffic is very high. Are you sure it’s not skipped after trying to copy and then realizing it exists and choosing not to overwrite? Even if it doesn’t download completely before deciding it already exists, how would a GET API call be counted? The copy log shows x/total, and that total matches the total number of chunks for my entire repository. I don’t know why there would be so much network traffic if it’s skipping all these files by comparing the latest snapshots from both locations in memory, or even just the latest remote snapshot against the files on disk. I might just have to crawl the code and tweak it to my needs if it’s simple enough. :man_shrugging:

The total is normal; it’s just showing the total number of chunks used for that revision. Skipped chunks should not use bandwidth or any API calls. I can only imagine the network usage is from where it’s actually copying needed chunks, like metadata, in separate threads.

You’ll notice, from your screenshot, that the one at the bottom was “copied to the destination”, but that its chunk number (452274) is far earlier than all the others (e.g. 452943). That’s because a thread finished copying that chunk after the iteration had moved on. It also suggests to me that the vast majority of chunks are being skipped, and Duplicacy will sail through those 1.3m chunks pretty sharpish. :slight_smile:

Hmm… IF you ran a recent backup to both the local and remote storages from the same repository, those chunks - including the metadata chunks - should be mostly the same.

HOWEVER, I’ve just now realised it’s possible that when your local and remote repositories were hashed, the way your data was chunked may explain the discrepancy. So, try running a fresh backup to both local and remote with the -hash option. The backup will take longer as it has to hash everything, and you may end up uploading a bit more to B2, but when it comes to copying back revision 567 (as it will be then), it should have less to download and you’ll have lower (or hopefully no) download costs to incur.
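Something like this, with placeholder storage names:

```
# Force a full re-hash/re-chunk on a fresh backup to each storage:
duplicacy backup -hash -storage local
duplicacy backup -hash -storage b2
```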

You can rename a snapshot file if the storage is not encrypted. So if an unencrypted local storage is acceptable to you, you can simply run a local backup at revision 1, and then rename it to whatever revision you want.
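For example (the path and snapshot id are illustrative):

```
# Snapshot files are named after their revision number, so after the first
# local backup creates revision 1, bump it past the remote's numbering:
mv /path/to/local/storage/snapshots/myrepo/1 \
   /path/to/local/storage/snapshots/myrepo/567
```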

Otherwise, another option is to use a new repository id for the local backup, so when you copy from local to B2 there won’t be a revision number conflict.
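For example, something along these lines - the names and path are placeholders, and this assumes the B2 storage is the default:

```
# Add the local storage under a NEW snapshot id, copy-compatible with default:
duplicacy add -copy default local myrepo-new /path/to/local/storage

# Revisions are tracked per snapshot id, so the new id starts at revision 1
# and copying to B2 can't clash with the old id's revision numbers:
duplicacy backup -storage local
duplicacy copy -from local -to default -id myrepo-new -r 1
```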

My sync script runs a backup locally, then one remotely (to make sure the latest revision copied down has the latest data, though they should mostly be the same). It then runs a copy down. While it seemed unlikely that an API call would be made for skipped chunks during the copy down, I wanted to be sure. I suppose it is like you said @Droolio: there are enough threads running at once that the chunks that do need to be copied are what is spiking the network. For reassurance, I found the code where this is done, and there is in fact NOT an API call if the chunk exists at the destination according to the latest snapshot; the log message containing “skipped at the destination” implies that no API call was made, based on the relevant function.

Sorry for being so paranoid and thank you for your reassurances and suggestions.
