Backing up Google Takeout

So I recently realized I don’t trust Google never to lose my stuff. LOL.

Google offers an easy way to download your stuff, called “Takeout”.
I followed that procedure and then downloaded the resulting 44GB zip file.

I stuck that zip in my Documents folder on my home PC, and Duplicacy happily backed it up to B2 for me.
BUT… This seems like an unsustainable method. Every time I download a new zip, I’m going to be adding 44+ GB to my B2 backup!

I’m thinking maybe the right answer is to unzip the downloaded Takeout and put the resulting normal folder tree in “my documents” instead. Then with every backup, I just replace that folder. Will Duplicacy see that the vast majority of the files in the unzipped Takeout folder are unchanged, and thus not duplicate them?

I guess my question is: how does Duplicacy determine that a file has changed? Can someone point me to a description of how duplicate files are recognized and handled?

That seems to be the best course of action in the circumstances, yes.

This post contains a link to an IEEE paper detailing the approach Duplicacy takes: Duplicacy paper accepted by IEEE Transactions on Cloud Computing

It’s interesting that you don’t trust Google but do trust Acrosync and Backblaze… both significantly smaller and less experienced companies with far fewer responsibilities and obligations…

If Google loses data, given how much of the internet depends on their datacenters and services, that would be a much bigger problem than a few of your photos and documents.

Which raises the question: why not back up the source data as you send it to Google, instead of fetching it back? I’d argue that if you don’t trust Google not to mess it up, why do you trust that Takeout contains uncorrupted data?

The issue is that I don’t trust google with the ONLY copy of the data. Backblaze is the “live remote” backup copy of our stuff on our home computers and NAS, one of 3 copies.

But we use Google apps, and those documents exist only in the Google cloud. So Google may be reliable, but they hold the only, single copy. This includes documents like my wife’s extensive writings and most of our photos in Google Photos.
Google is pretty “safe”, but not safe enough for us to hold the only copies of this stuff. Losing years of my wife’s work product is unimaginable. I should have realized this sooner.

OK, so I did a test. I guess this should have been obvious to me but:
When I unzip Google Takeout’s folder, all the files’ “last modified” dates are updated to the time of unzipping.
So on the surface, every file looks new, even if it is a duplicate.
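For what it’s worth, that timestamp loss comes from the extraction step, not the archive itself: zip entries carry their own recorded modification times. As a hypothetical workaround (the function name here is my own, not anything Duplicacy ships), the recorded times can be re-applied after extraction:

```python
import os
import time
import zipfile

def extract_preserving_mtimes(zip_path: str, dest: str) -> None:
    """Extract a zip and restore each file's modification time from the
    archive's own metadata, so unchanged files keep stable mtimes."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        for info in zf.infolist():
            if info.is_dir():
                continue
            # ZipInfo.date_time is (year, month, day, hour, minute, second);
            # pad it out to a struct_time for mktime (DST flag = unknown).
            mtime = time.mktime(info.date_time + (0, 0, -1))
            target = os.path.join(dest, info.filename)
            os.utime(target, (mtime, mtime))
```

If the Takeout zips record meaningful per-file modification times (I haven’t verified that they do), extracting this way would let a timestamp + size check recognize unchanged files. Note that zip timestamps only have 2-second resolution, so they may not match the originals exactly.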

Will “chunk level” deduplication prevent all the freshly dated file unzipped copies from being duplicated?

(It’s going to take me a while to wade through that paper.)

I think Duplicacy deduplicates by first dividing files into chunks with a variable-size chunking algorithm driven by a rolling hash; it then computes a hash of each chunk and uses that as the chunk’s file name.

If a file with that name already exists in the storage, the chunk won’t be uploaded.
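A toy sketch of that idea, with a simple rolling sum standing in for a real rolling hash (the window size, boundary mask, and minimum chunk size are made-up values for illustration, not Duplicacy’s actual parameters):

```python
import hashlib

WINDOW = 16           # bytes in the sliding window
MASK = (1 << 11) - 1  # cut when the low 11 bits of the rolling value are zero
MIN_CHUNK = 64        # don't cut chunks smaller than this

def chunk(data: bytes) -> list[bytes]:
    """Content-defined chunking: boundaries depend on local content,
    so identical data yields identical chunks even if it shifts position."""
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward
        if (rolling & MASK) == 0 and i + 1 - start >= MIN_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # final partial chunk
    return chunks

def chunk_ids(data: bytes) -> list[str]:
    # Each chunk is named by its hash; if that name already exists in
    # the storage, the chunk is not uploaded again.
    return [hashlib.sha256(c).hexdigest() for c in chunk(data)]
```

With this scheme, two files that share most of their content end up sharing most of their chunk IDs, which is what makes re-uploading a nearly identical dataset cheap.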

Also, I think Duplicacy uses a cache to speed up the process, recording which chunks are already present in the remote.

Not sure whether duplicacy records timestamps in the cache though.

OK, so I tested this…
At the start, the existing storage size, including an unzipped 9.43GB Takeout folder, was 129.00GB and 27,163 chunks.
Then I replaced the existing Takeout folder with a new, but essentially identical, one from a couple of days later, still 9.43GB. So the same data, but new timestamps, etc.
I was hoping the new backup would not increase much in size…

The new storage size was 134.39GB, 28,178 chunks (after pruning).
So running Duplicacy with a new, but essentially identical, unzipped 9.43GB Google Takeout increased the storage size by 5.39GB and about 1,000 chunks.

Does that sound reasonable? Any ideas to reduce that?

I am not so familiar with the detailed algorithm Duplicacy uses, but I think it might be because each new backup needs to record the IDs of the chunks (which I believe equal their hashes) that it contains.

I am not sure whether these 5GB are the metadata I described above or new backup data due to differences in how the chunks were cut.

I am no expert in this, so take my words with a grain of salt.

I may try it one more time to confirm the increased backup size.
Frankly, with this much increase, I’ll have to find a better way. This is not a Duplicacy problem, the issue is really that I am doing a strange thing.
If anyone has a good idea to avoid the size of the backup increase… I’d be curious to hear it!

Maybe try using a repository with a fixed chunk size instead of the default one?

It should be similar to backing up Thunderbird files?


I just checked the backup command details in the Duplicacy documentation; it seems that Duplicacy checks each file’s timestamp and size by default, which might cause it to back up these files again, but with different chunk boundaries.

You need to pass “-hash” to make it detect modifications by hashing file contents instead of comparing timestamp + size.

That will help if you want to keep variable size chunks.
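For reference, the invocation would look something like this (assuming the repository is already initialised; `-stats` is only there to print the deduplication numbers afterwards):

```shell
# Run from the repository root. -hash re-reads every file and detects
# changes by content hash rather than timestamp + size, so re-dated but
# otherwise unchanged files still deduplicate against existing chunks.
duplicacy backup -hash -stats
```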


Looks like I have some homework to do…

  • variable vs fixed chunks
  • rolling hash checks

I set the option for -hash on the backup and I did a new takeout and ran the backup.
The storage size did increase again, but by less: the entire Takeout was about 9.5GB, but the backup storage only grew by 2GB (compared to 5GB without -hash).
So it appears the -hash does help. Not as much as I’d like, but a clear positive effect…

Do you guys really think making the chunk size fixed would make a significant difference? And I’m unclear on what that would do… won’t it mess up all the previous chunk comparisons? Will I end up doubling my backup size on the first run with fixed chunks, because none of them will match the old chunks? Or does the change only affect new data being backed up?

IMHO that could definitely happen, so in that case it may be better to use a fixed chunk size from the point the repository is initialised.

I don’t think a fixed-size storage would work better for this case. Fixed-size chunking is more suitable for a few large files (think virtual machine disk images). I would suggest reducing the average chunk size to 1M bytes to see if that makes a noticeable difference.
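If you want to try that, note that chunk parameters are set when a storage is initialised, so a smaller average chunk size means starting a fresh storage (the repository id and bucket name below are placeholders):

```shell
# -c sets the average chunk size; chunks in the old storage can't be
# reused by the new one, so the first backup uploads everything again.
duplicacy init -c 1M mydocs b2://my-takeout-bucket
duplicacy backup -hash -stats
```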

The difficulty in deduplication is caused by the fact that the decompressed files lose their original timestamps, so Duplicacy can’t really tell which files are new and which are not. -hash helps because Duplicacy then packs all files regardless of their timestamps. If you are sure that no files changed between the two takeouts, then -hash should not introduce any new chunks, so I don’t know why the storage still increased by 2GB.

`duplicacy diff -r r1 -r r2` can compare two revisions and print out the files that differ by hash.
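Concretely, that would look something like this (the revision numbers and the file path are examples; as I read the docs, an optional trailing path narrows the comparison to a single file):

```shell
# Compare the file lists of two revisions by hash:
duplicacy diff -r 10 -r 11

# Compare a single file between the same two revisions:
duplicacy diff -r 10 -r 11 Takeout/archive_browser.html
```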


There were some changes between the two takeouts, just not that many. But maybe the “packing” ends up being a factor.
I’ll try a more careful test, and then the diff, to see what’s up.
This will have to wait until next week due to operational issues, but I will follow up!
Thanks gchen!