Why did you use variable chunks when we already know that those don’t work with databases? Or maybe I should ask: why did you use a repository with databases for the tests?
I tried to figure out what the two alternative versions of duplicacy do differently, but I couldn’t. Do you understand them?
But I noticed that the file-boundaries version has some recent “tweaks” added. You might want to check whether those affect your tests. In any case, it probably makes sense to link (also) to the exact version you used, rather than to the (current state of the) branch.
@gchen I’m wondering what might be a good way of handling duplicacy’s strong reactions to certain kinds of changes in the repository (i.e. the kind of changes that a database reindexing causes). What I mean here, as opposed to some of my previous questions, is how to detect and handle extreme situations, i.e. where a new snapshot has caused a significant increase in storage use.
At a very basic level, I wonder if it would be possible to make duplicacy automatically tag such snapshots for easy identification? The point would be to deliberately target the preceding snapshot with the prune command. In other words, what I’m heading for is to optimize pruning instead of (or in addition to) optimizing backup. Does that make any sense?
At a more advanced level, duplicacy could routinely compare the increase in repository size with the volume of uploaded chunks and issue a warning (or tag the snapshot) if that ratio becomes too big.
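Just to illustrate what I mean by that ratio check, here is a rough sketch (not duplicacy code; the function name and threshold are made up for the example). The idea: if a snapshot uploaded far more chunk data than the repository actually grew, most of the upload was re-chunked existing data, which is exactly the database-reindex situation:

```python
SPIKE_RATIO = 3.0  # made-up threshold: flag if uploads exceed 3x the repo growth

def is_storage_spike(repo_size_before, repo_size_after, uploaded_chunk_bytes,
                     threshold=SPIKE_RATIO):
    """Return True if this snapshot uploaded far more chunk data than the
    repository grew, i.e. it mostly re-uploaded shifted/rewritten data."""
    growth = max(repo_size_after - repo_size_before, 1)  # avoid division by zero
    return uploaded_chunk_bytes / growth > threshold

# e.g. a reindex: repository grew by ~1 unit but 10 units of chunks were uploaded
print(is_storage_spike(100_000, 101_000, 10_000))  # True -> warn / tag snapshot
```

A tagged snapshot like that would then be the natural target for the prune optimization I mentioned above (prune the revision preceding it).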