Options for backup integrity checking with limited egress bandwidth

leerspace · 13 November 2018 14:11

I’m using Duplicacy to back up around 1.5 TB to Backblaze B2 and am considering backing up roughly 10 TB to GSuite. B2 offers 1 GB of free egress bandwidth per day and GSuite offers a lot more, but also has limits.

I can run duplicacy check -all to make sure the chunks exist, but if I want to actually validate the integrity of the chunks in the backup, I think my only built-in option is to run duplicacy check -all -files, which would download everything and far exceed the free daily egress limits (or the hard limits in the case of GSuite).
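To put the scale in perspective, here is a rough sketch of how long a full download would take if it were spread out to stay within a daily egress allowance. The function name and the use of decimal units (1 TB = 1000 GB) are assumptions for illustration only, and the estimate ignores any snapshot or chunk overhead.

```python
import math

def days_to_verify(total_gb: float, daily_egress_gb: float) -> int:
    """Days needed to download total_gb while staying within daily_egress_gb per day."""
    if daily_egress_gb <= 0:
        raise ValueError("daily_egress_gb must be positive")
    return math.ceil(total_gb / daily_egress_gb)

# 1.5 TB on B2 with 1 GB of free egress per day (decimal units assumed)
print(days_to_verify(1500, 1))  # 1500 days, a little over four years
```

At that rate, a full check that stays inside the free B2 allowance is not practical, which is why sampling looks more attractive.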

I’ve tried limiting the bandwidth using trickle -s -d 10 duplicacy check -all -files, but trickle doesn’t seem to work with duplicacy and there’s no built-in option to limit bandwidth (like there is for backup).

The ideal solution in my mind would be to have something always slowly (based on a specified bandwidth limit) perform full integrity checks of random files from the storage, and then send emails and/or throw errors if it encounters a file with corrupted chunks.

To work around this, I’ve written a script that will copy the .duplicacy directory to a temporary directory, select a random snapshot and random files that fit within a specified daily egress limit, and then restore them to prove integrity (at least, of some random selection of files). Disclaimer: I wrote this for my specific setup and it probably has unhandled edge cases, so it probably won’t work for all use cases and shouldn’t be trusted. Depending on your files, the overhead of downloading the snapshot file and chunks might also be significant compared to the specified limit.
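The selection step could look something like the sketch below. This is not the script itself, just a minimal illustration of picking random files that fit within a byte budget. The function name and the (path, size) input format are hypothetical, and file sizes only approximate what actually gets downloaded, since chunk and snapshot overhead is not counted.

```python
import random
from typing import Iterable, List, Optional, Tuple

FileEntry = Tuple[str, int]  # (path, size in bytes); hypothetical input format

def pick_files_within_budget(
    files: Iterable[FileEntry],
    budget_bytes: int,
    rng: Optional[random.Random] = None,
) -> List[FileEntry]:
    """Walk the files in random order and keep each one that still fits in the budget."""
    rng = rng or random.Random()
    shuffled = list(files)
    rng.shuffle(shuffled)

    chosen: List[FileEntry] = []
    remaining = budget_bytes
    for path, size in shuffled:
        if size <= remaining:  # skip files that would exceed what is left
            chosen.append((path, size))
            remaining -= size
    return chosen

# Example: a 1 GB daily budget over a made-up file listing
listing = [("photos/a.jpg", 4_000_000), ("video/b.mkv", 2_500_000_000), ("docs/c.pdf", 800_000)]
for path, size in pick_files_within_budget(listing, 1_000_000_000):
    print(path, size)
```

The chosen paths would then be passed to a restore into the temporary directory. Files larger than the whole budget are never selected by this approach, so they would need separate handling.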

Is this kind of throttled, full integrity checking worth submitting as a feature request for Duplicacy? If it’s outside the scope of Duplicacy, then maybe someone else can find some use for the script I wrote.

towerbr · 13 November 2018 16:21

Looks interesting, but I looked there and could not see the code:

leerspace · 13 November 2018 17:29

Sorry, the permissions issue should be fixed now.

towerbr · 13 November 2018 18:43

Very interesting and complete. I’ll study it in detail later and do some testing.

The general idea is similar to this - primitive - batch that I use:

Your code is obviously much more complete and detailed.

And your solution is in line with my opinion:

leerspace · 13 November 2018 19:22

I’m glad I’m not the only one thinking about this kind of backup integrity checking. It looks like your batch script is doing almost exactly the same thing (in a lot fewer lines too!), except I’m counting on Duplicacy to throw an error if the downloaded chunks don’t match what they’re supposed to be rather than calculating the file hashes myself.
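For comparison, the "calculate the hashes myself" approach can be sketched roughly as below. This is not the batch script mentioned above, only a minimal illustration with hypothetical function names. It also assumes the local source file has not changed since the snapshot was taken, otherwise a mismatch does not necessarily mean corruption.

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]

def sha256_of(path: PathLike, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def restored_matches_source(source: PathLike, restored: PathLike) -> bool:
    """True if the restored copy is byte-for-byte identical to the source file."""
    return sha256_of(source) == sha256_of(restored)
```

Relying on Duplicacy to report errors during restore avoids needing the original file at all, which matters if the source has since been modified or deleted.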

aweber · 20 November 2018 19:23

Two related questions here:

1. Does check -all -files download the entire repository’s fileset and keep it, like a full “restore” would do? Or does it download a file at a time, check its integrity, and then discard/delete the validated file?
2. Does Google put a limit on egress if you were to run such a check from one of their Google Cloud virtual machines? (AWS, for example, allows unlimited traffic between the region’s EC2 and S3…IIRC.)

leerspace · 20 November 2018 20:16

According to this page, it’s more like the latter.

To verify the full integrity of a snapshot, you should specify the -files option, which will download chunks and compute file hashes in memory, to make sure that all hashes match.

That’s my understanding. There aren’t just daily bandwidth limits, though. There are also API limits; even if Google didn’t count bandwidth to/from their cloud I’d still expect them to count API calls since that’ll stress their infrastructure the same regardless of where they come from.

aweber · 20 November 2018 20:32

OK, I obviously didn’t look closely at that. I’m relatively sure that AWS doesn’t count anything within their overall cloud region. (Not a bias towards them, but I have a little more familiarity with their services…though I haven’t checked their fine print in some time either!)

Was just throwing out a possible workaround: spin up a tiny “Validate Backups” VM in their cloud.

Christoph · 21 November 2018 19:39

A quick off-topic hint: there is no need to refer to another post and then quote it via copy-and-paste and add > to format it as a quote. While composing your reply, you can simply navigate to any topic on the entire forum and quote it just like quoting the post you’re replying to.

aweber · 21 November 2018 20:08

FWIW: Some Google-fu leads me to believe that there is no charge for bandwidth between GCE and Google Drive (and they have a free tier that could probably run duplicacy in check mode).

However, I don’t see anything about removing the API limits, so those would probably be the same (it might run a little faster being in a reasonably close-proximity data center though).

This is really a “con” for chunking files. Many actions will require additional calls per file…in some cases many more API calls. It also makes me wonder if there are inherent limits to duplicacy’s scalability specific to some cloud providers. Some concrete metrics would be really helpful to those choosing a backup utility and platform.
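As a very rough back-of-the-envelope sketch of what such a metric might look like: the snippet below estimates how many chunk downloads a full check would need. It assumes an average chunk size of 4 MiB (which I believe is Duplicacy’s default, but treat it as an assumption) and one download call per chunk, and it ignores deduplication, compression, and any listing or metadata calls.

```python
def estimate_chunk_count(total_bytes: int, avg_chunk_bytes: int = 4 * 2**20) -> int:
    """Approximate number of chunks, and so download calls, for a given backup size."""
    if avg_chunk_bytes <= 0:
        raise ValueError("avg_chunk_bytes must be positive")
    return -(-total_bytes // avg_chunk_bytes)  # ceiling division on integers

# Roughly 10 TB (decimal) of backed-up data
print(estimate_chunk_count(10 * 10**12))  # on the order of 2.4 million chunks
```

Whether a couple of million calls is a problem depends entirely on the provider’s API quotas, which is exactly the kind of number that would be useful to have documented.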

leerspace · 21 November 2018 20:37

I think I was at least an order of magnitude off with my initial assumptions around GSuite’s daily egress limits; according to this reddit thread, the GSuite daily download limit is 10 TB. I was originally confusing the download limit with the upload limit (which is closer to 750 GB/day). As far as GSuite is concerned, I wouldn’t currently be limited by the daily download bandwidth limits, though someone probably is.