Block level delta restore from B2?

restore

#1

Does Duplicacy do block level delta restores from B2?

Use case:

  • 100 GB file that’s been backed up to B2 using Duplicacy
  • A few MB in the file have been locally changed or corrupted

If I restore this, will it download all the chunks (and cost me ~100 GB in B2 downloads)? Or will it download only the few MB of chunks that have changed?


#2

I can’t attest to how Duplicacy handles B2 specifically, but yes, Duplicacy should do delta restores if you restore over an existing file with the same name.

I’ve done restores with Duplicacy’s cousin backup program for ESXi hosts, Vertical Backup, so we’re talking large (several hundred GB) .vmdk files. It works exactly as you would expect and skips chunks that already exist.


#3

Doesn’t seem to work when I tried it on B2. Made a small change locally to the first few bytes of a randomly generated 50MB file. Looks like it downloaded the entire file to restore.

$ duplicacy restore -r 5 -stats -overwrite

Storage set to b2://<bucket>

Restoring /Users/<User>/duplicacy_test to revision 5

Downloaded chunk 1 size 9723346, 4.64MB/s 00:00:09 19.4%

Downloaded chunk 2 size 3262497, 4.13MB/s 00:00:09 25.9%

Downloaded chunk 3 size 15093412, 4.46MB/s 00:00:05 56.1%

Downloaded chunk 4 size 7507684, 4.85MB/s 00:00:03 71.1%

Downloaded chunk 5 size 9011405, 5.32MB/s 00:00:01 89.1%

Downloaded chunk 6 size 2866402, 5.03MB/s 00:00:01 94.9%

Downloaded chunk 7 size 2535267, 4.77MB/s 00:00:01 100.0%

Downloaded bigfile (50000013)

Restored /Users/<User>/duplicacy_test to revision 5

Files: 1 total, 47.68M bytes

Downloaded 1 file, 47.68M bytes, 7 chunks

Total running time: 00:00:10

#4

OK well that’s interesting.

I got the same results with a test to local storage, by changing a few bytes at the start of the file (it downloaded all 9 chunks). Then I modified a few bytes at the very end of the file - only 1 chunk downloaded.

This is probably due to the variable size chunking algorithm; early changes are likely altering where the boundaries are in future chunks (or the chunk sizes and offsets have already been predetermined for each file). It may not happen in every case, though.

I wonder if this behaviour can be improved, perhaps by resetting the chunking when new data comes in. I’m mindful this won’t be easy due to the fact Duplicacy can download in multiple threads. @gchen?


#5

If you only modify a few bytes at the start of the file (or any other places), then Duplicacy should be able to download only the chunk that is affected by the modification.

However, if you add or delete a few bytes, then Duplicacy would have a problem finding unmodified chunks in the existing file. This is not due to any inefficiency of the variable size chunking algorithm, but rather the way Duplicacy splits the existing file – it cuts the existing file at the same offsets as the chunks from the storage, so if there is an insertion or deletion, every chunk after that point is shifted and none of them matches its stored counterpart.
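To make the difference between overwriting and inserting concrete, here is a minimal Python sketch of the idea described above. It is not Duplicacy's actual code: the helper names (`chunk_hashes`, `chunks_to_download`), the chunk sizes, and the use of SHA-256 are all assumptions chosen for illustration. It cuts the local file at the same offsets as the stored chunks and counts which chunks no longer match.

```python
import hashlib

def chunk_hashes(data, sizes):
    """Hash `data` sliced at the fixed offsets implied by consecutive chunk sizes."""
    hashes, offset = [], 0
    for size in sizes:
        hashes.append(hashlib.sha256(data[offset:offset + size]).hexdigest())
        offset += size
    return hashes

def chunks_to_download(local, stored_hashes, sizes):
    """Return indices of chunks whose local hash differs from the stored hash."""
    local_hashes = chunk_hashes(local, sizes)
    return [i for i, (a, b) in enumerate(zip(local_hashes, stored_hashes)) if a != b]

# Hypothetical backed-up file made of 4 variable-size chunks.
sizes = [1000, 700, 1300, 1000]
original = bytes((i * 31 + 7) % 251 for i in range(sum(sizes)))
stored = chunk_hashes(original, sizes)

overwritten = b"XXXX" + original[4:]   # first 4 bytes replaced, length unchanged
inserted = b"XXXX" + original          # 4 bytes inserted, everything shifted

print(chunks_to_download(overwritten, stored, sizes))  # [0]
print(chunks_to_download(inserted, stored, sizes))     # [0, 1, 2, 3]
```

With an in-place overwrite only the first chunk differs; with an insertion every chunk is misaligned, so all of them have to be fetched again.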

An obvious improvement is to run the variable size chunking algorithm on the existing file. But, normally you don’t do insertion/deletion to large files, and the current implementation is much faster (no need to calculate the rolling hash one byte at a time over the entire file, just the hashes of chunks at specific offsets).
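As a rough illustration of that trade-off, the toy chunker below runs a rolling hash over every byte and cuts wherever the hash of the last few bytes hits a chosen pattern. This is only a sketch under assumed parameters (`WINDOW`, `MASK`, `BASE`, `MOD`, and the function name `cdc_chunks` are all hypothetical), not the algorithm Duplicacy uses, and it omits the minimum and maximum chunk sizes a real chunker would enforce.

```python
import random

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero, ~64-byte average
BASE = 257
MOD = (1 << 31) - 1

def cdc_chunks(data):
    """Split `data` at content-defined boundaries using a polynomial rolling hash."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just left the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Cut only once the window is full, so the hash depends on content alone.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
before = cdc_chunks(data)
after = cdc_chunks(b"XXXX" + data)     # 4 bytes inserted at the start

reusable = sum(1 for c in before[1:] if c in set(after))
print(len(before), "chunks originally,", reusable, "found again after the insertion")
```

Because the boundaries follow the content rather than fixed offsets, every chunk after the first reappears unchanged despite the insertion. The cost is the per-byte hash update in the loop, which is exactly the work the current offset-based comparison avoids.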


#6

Depending on the file type, a good solution is to use chunks with fixed sizes.

I’ve gotten good results with databases, the Evernote database, and Veracrypt containers.


#7

Thanks for the info, can confirm.

In my initial test I used a hex editor to write a few bytes at the beginning of the file and didn’t realise it was in insert mode (usually those things default to overwrite). I conducted a fresh test, and indeed only 1 chunk was downloaded during the restore. This makes complete sense.