Specify revision number

So I started with an offsite backup, then later decided I wanted to add a local one. I did a copy so that all of the chunking, etc. would be the same. Then, because I didn’t want to incur charges from my offsite provider, I initialized the local backup by performing a regular backup command. This created revision 1.

From now on I want to copy from my local storage to the offsite one (presumably saving some API calls to the offsite storage, saving CPU, and keeping them in sync); however, the local revision is 1. The problem is that I could collide with a non-pruned, higher revision number on the offsite storage if I keep copying these (if the program will even let me copy a lower revision up to begin with).

Can I run the backup again and specify the revision number? Or perhaps spoof the “next” revision number somehow? Then I could prune the first revision. I considered changing the current local revision file directly, but it’s encrypted, so whatever the solution is, I feel like it has to be done through the program.

Of course any better suggestions are welcome. That was just the idea I had.

I realized the file name of the revision is not obfuscated. Can I just rename it?

[edit: ANSWER: NO. It breaks encryption and the repository fails to list]

I assume you mean add -copy, right?
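That is, something along these lines (the storage name, snapshot ID, and path are just placeholders; as I understand it, -bit-identical is optional but handy if you ever want to sync the two storages with external tools):

duplicacy add -copy default -bit-identical local <snapshot id> /path/to/local/storage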

What I’d do is find the last revision number for your offsite backup and then:

duplicacy copy -id <snapshot id> -r <revision> -from remote -to local

Most of your chunks will already be present and be skipped for download, but you’ll have a more ‘recent’ (at least numerically) revision file from which subsequent backups will increment, and which will be in sync when you copy from local to remote.

Then you can prune the local revision 1 or simply delete that revision file in /snapshots.
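For example (assuming the local storage was added under the name ‘local’ - a placeholder):

duplicacy prune -storage local -r 1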

Optionally, copy all of the remote revisions from remote to local if that’s what you desire. But the important thing is to get both your local and remote revision numbers up to the same highest value.

OR you can start from scratch and assign a new snapshot id to both local and remote, starting from revision 1. Again, most of your chunks are already de-duplicated across local and remote storages.

Yes. I meant “add -copy” (identical).

These suggestions make sense. I’m not sure how Duplicacy keeps track of which files to upload or skip between each backup. Is it by comparison to the last snapshot? If so, I think the following might work to keep most of the I/O on the local disks (rough commands after the list). Please correct me if I’m wrong.

Preface: the local backup r1 is newer than the remote backup r568 at the moment

  1. Download revision file 568 by hand (no other files)
  2. Re-run the backup locally, creating revision 569
  3. Duplicacy copy revision 569 from local to remote
  4. Prune local revisions 1 and 568 (568 since the local storage might be missing chunks pertaining to that backup revision)
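In commands, that would roughly be the following (assuming storages named ‘local’ and ‘remote’, snapshot ID ‘mybackup’, and a local storage that is a plain directory at /mnt/backup - all placeholders, including where the hand-downloaded revision file ends up):

cp ~/Downloads/568 /mnt/backup/snapshots/mybackup/568   # step 1: seed the remote revision file into the local storage
duplicacy backup -storage local -stats                  # step 2: creates revision 569 locally
duplicacy copy -from local -to remote -r 569            # step 3: push r569 up to the offsite storage
duplicacy prune -storage local -r 1                     # step 4: drop the out-of-sequence local revisions
duplicacy prune -storage local -r 568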

I feel like this may be overlooking something and might end up with missing files. Perhaps it would be cleaner to do the following instead, based on your recommendation (commands sketched after the list).

Preface: the local backup r1 is newer than the remote backup r568 at the moment

  1. Run the remote backup, creating r569, so it contains a full snapshot of the latest data
  2. Duplicacy copy r569 from remote to local (is this smart enough not to download a chunk if it already exists in the seeded data? Seems likely)
  3. Prune r1 from the local storage (not really necessary, but it keeps all future backups consistent and avoids confusion)
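Roughly (again with ‘remote’, ‘local’, and the revision numbers as placeholders):

duplicacy backup -storage remote -stats        # step 1: creates r569 on the offsite storage
duplicacy copy -from remote -to local -r 569   # step 2: copy only r569 down to the local storage
duplicacy prune -storage local -r 1            # step 3: drop the out-of-sequence local r1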

I was mostly trying to avoid downloading cold storage data, but I also don’t want to be missing chunks.

Yes, definitely apply your second strategy imo.

Your first strategy might work only so long as there are zero changes to your repository. r568 will probably reference chunks - at minimum, metadata chunks - which your local storage doesn’t have. And making another backup won’t guarantee that r569 will reference all of those chunks (because it doesn’t check that all chunks referenced in r568 actually exist; it’s assumed they already do).

For your second strategy, if your repository remains mostly unchanged between the time you ran the local backup (r1) and when you run the remote backup (r569), it should hardly download any chunks at all from remote. If the repository has changed, you can always run another local backup (r2) prior to step 1.

When a copy takes place, Duplicacy inventories the destination storage to work out what chunks exist and should be skipped.

Thank you so much for confirming this all for me! I think I will go with the second option. Every now and then I have as much as 100 GB written in one day; other times, almost nothing at all. I can look at the statistics after my next remote backup and reseed by creating r2 locally if I’m worried about the amount of data to be uploaded.

I was guessing that the software assumes all chunks in the latest previous rev exist, given how fast it runs after the initial run; with my internet speeds, most of the time isn’t spent uploading so much as chunking.
This makes me question things like “how does it know what’s new/changed from the previous rev?” “What are the chances it misses changes?” “How do file attributes affect the backup? If I do a setfacl to add a user to every file on my data drive, is it going to re-chunk the whole thing, or will it even notice a difference?” “Did I know the answers to all of this years ago and have forgotten? (lol) If not, how did I never question it?”

You’ve been a great help across all of my posts today. I will get out of your hair for now. This is like… my yearly check-in. lol

As a final check, I will run my check script on both storages when I’m done for good measure.
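By “check script” I just mean something along these lines (the storage names are placeholders):

duplicacy check -storage local -all
duplicacy check -storage remote -all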

Not entirely certain what effect that might have on a Linux system, but even if it ‘touched’ all the files, I’m fairly certain Duplicacy would only need to rehash the source data and determine that it only needs to upload the metadata chunks, not the file content chunks, since chunking is deterministic.

In theory, all the revision files for a repository contain enough information to know what chunks should exist in storage. But that excludes chunks for other repositories and backup clients, and it might be difficult to scale if it has to read every revision.

So perhaps it reads only the last revision, which contains the chunklist. And for everything else, before upload it checks to see if it exists? Not certain of that either, would need to check the source code.

Haha, no problem at all. :slight_smile:

Good idea, and don’t forget you can always -dry-run to make sure it’s not overly zealous with the downloads. (Actually! It just crossed my mind that it might still download content chunks?? I definitely need to check the source…).

Edit: Well OK, seems the copy command doesn’t have a -dry-run option!

:sweat_smile: It would be really expensive if it downloads the content chunks, then just skips them if they already exist. I think it would really depend on how it determines what needs to be copied.

Maybe it will compare the latest remote rev with the latest local one, and copy what it thinks it needs that way. Maybe it will check every single file it thinks it needs by listing the remote storage and only copy the ones that are missing. Maybe it will only copy the files that were made in the latest revision… which would actually not contain all of the files, since the revisions aren’t currently in sync. :man_shrugging:

If that last scenario is true, I think I’d be stuck having to rsync the files up along with the most current local revision initially, but then I’m back to having a broken revision number. :tired_face:

Eh, I’ll try what we talked about and verify it to see what happens.

What I mean is, by downloading the revision files, Duplicacy can get an idea of what chunk IDs already exist on the storage, since the revision files contain a complete chunklist (sequence) of what each revision uses. See here (damn, I struggled to find those docs again!).

So it doesn’t need the chunk contents as such, only the chunk ID.

I just think it would be infeasible to download all the revision files (for other repositories as well) and combine that information, as it would be memory-intensive and still incomplete, so it likely only uses the last revision as that’s good enough. Then if Duplicacy has to hash and rechunk a file, it knows the chunk IDs (which are broadly deterministic) and can check them against the list stored in the previous revision.

What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.

Anyway, let us know how you get on. :slight_smile:

Ha. Wow. This does clear things up. I was wondering why the snapshot files were so small. All of the data in them is actually chunked and dedupe-uploaded with the rest of the data! It also provided some insight into the -hash option. I thought it might download each piece of data to verify, but if it’s just checking against the hashes in the repo, I might include the -hash once a month or so.
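i.e. something like this as an occasional, heavier run (a sketch with a placeholder storage name; as I understand it, -hash just forces rehashing every source file instead of relying on size/timestamp):

duplicacy backup -storage local -stats -hash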

“What I wonder is if Duplicacy double-checks whether a genuinely new chunk exists before each upload - as it still may clash with another revision or repository, and overwriting would be wasteful and risky.”

I seem to “remember” an old technical write-up from before this forum existed that mostly explained lock-free deduplication, as well as some other things. I thought it suggested that a “copy” (upload) was attempted for each new chunk of data during the backup, and when the copy failed because the file already existed, that chunk was considered skipped. This saves an API call and speeds up the upload at the same time by skipping a lookup. Things could have changed since then… including my memory. I wondered, though, whether copy works similarly, and if it does, whether the copy would fail due to existing files before or after the actual download occurred, and whether it would check every single file in the repository in this manner or just the new ones, etc. Besides, that’s just for the backup command. There’s no telling with the copy command.

I decided to verify both backup locations before doing everything, to make sure that any problems didn’t already exist… well, my latest backup [edit: latest 2 backups] is missing about 5+ chunks. I guess I will prune that backup first.

I’ve had this problem before, a long time ago, but that time a bug was found and fixed. It would be nice if the “check” command had a flag to attempt to restore these chunks by re-chunking the files associated with them, to see whether they can all still be recreated or whether the local data no longer supports restoring those chunks. Like an --attempt-reconstruction or something like that. Should I make this a feature request?

I suppose I could fake this by moving the snapshots out, running a backup, then putting them back. Either way, I could just delete them instead of pruning so I don’t have to re-upload all of the data, I think.

[edit]
PS: this worked. I was able to copy the broken revisions out, run the backup, copy the new conflicting revision out, copy the previously broken revisions back in, and verify with a “check” that all of the previously missing chunks were there.
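In other words, something roughly like this (snapshot ID, paths, and the revision placeholders are all made up; this assumes the storage is a plain directory whose files you can move around):

mv /mnt/backup/snapshots/mybackup/<broken revs> /tmp/held/   # set the broken revision files aside
duplicacy backup -storage local -stats                       # re-chunk the source and upload whatever chunks are missing
mv /mnt/backup/snapshots/mybackup/<new rev> /tmp/held-new/   # set the new, conflicting revision aside
mv /tmp/held/<broken revs> /mnt/backup/snapshots/mybackup/   # put the originally broken revisions back
duplicacy check -storage local                               # confirm the previously missing chunks now exist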

The second strategy is not a feasible option. It is very obviously doing what I thought I remembered from the old documentation: it only tries to copy and skips if the chunk already exists. This performs a download of every chunk; it does not check whether the files already exist before downloading. I’m going to download the latest revision by hand and verify the data against it afterward to see if it has everything. Right now the remote backup is the latest, so I will run the local one once more first.

This didn’t work either :frowning: I’m not sure what to do.

This operation would cost me over $400. It would be nice to be able to more easily add a local storage to the mix. Can I somehow manually decrypt a snapshot, spoof the revision number, then re-encrypt it?

Note: I didn’t read the whole topic, so I may say stupid stuff.

If you wanna decrypt and re-encrypt, why not just make a new snapshot ID? It’ll start from revision 1, but since :d: is all about deduplication, all should be OK with how much new data is uploaded, right?
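A rough sketch of that idea (completely made-up names, paths, and URL; I think you’d also need a clean repository directory, since init won’t run where a .duplicacy folder already exists, and I believe both commands will just pick up the existing storage configs and ask for the storage password - worth double-checking):

duplicacy init -e -storage-name remote mybackup-v2 <remote storage url>   # point the new ID at the existing offsite storage
duplicacy add local mybackup-v2 /mnt/backup                               # and at the existing local storage
duplicacy backup -storage local -stats    # first backup under the new ID starts at revision 1
duplicacy copy -from local -to remote     # both storages stay on the same revision numbers from here on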