Combining fixed and variable size chunks (Thunderbird backup tests)

I finished the first test with the mbox files (soon I will post the results).

From the Evernote test we saw that fixed 1M chunks work well for databases, but variable chunks seem to work better for the rest.

I’m now setting up a new test to back up into two separate sets/jobs:

  • one storage with fixed 1M chunks for the single SQLite database (1 file)
  • another storage with variable 1M chunks for the rest of the repository

I have already run the init and add commands for the two storages.
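
For reference, the setup was roughly like this (the snapshot ID, storage name, and paths are illustrative; setting -min and -max equal to -c is what turns off the variable-size rolling hash):

    duplicacy init -c 1M -min 1M -max 1M thunderbird /mnt/backup/storage_fixed
    duplicacy add -c 1M storage_variable thunderbird /mnt/backup/storage_variable

(Without -min/-max, Duplicacy defaults them to a quarter and four times the average chunk size, which gives variable-size chunks.)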

I thought of doing an “include” with the database on the first job and an “exclude” with the same database on the second.

BUT, I realized that I have only one filters file…

How do I make a “conditional filter”? Or how do I use two filter files?
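
What I had in mind was something like the following, if each job could have its own filters file (the database file name is illustrative). For the fixed-chunk job, include only the database and exclude everything else:

    +global-messages-db.sqlite
    -*

And for the variable-chunk job, just exclude it:

    -global-messages-db.sqlite

(In Duplicacy filters the first matching pattern wins, and a filters file containing only exclude patterns includes any file that matches nothing.)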


You can create a new repository for the subdirectory where the SQLite database is. If the SQLite database is right under the root of the repository, then your best bet is to create a new repository in a different directory and symlink the current repository as a subdirectory.
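
A rough sketch of that workaround (paths illustrative; Duplicacy follows first-level symlinks under the repository root, and on Windows you would use mklink /D instead of ln -s):

    mkdir ~/tbird-db-repo
    cd ~/tbird-db-repo
    ln -s ~/.thunderbird/profile profile
    duplicacy init -c 1M -min 1M -max 1M thunderbird_db /mnt/backup/storage_fixed

A filters file in ~/tbird-db-repo/.duplicacy can then restrict this backup to just the database.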

A -filters option, as requested in https://github.com/gilbertchen/duplicacy/issues/314, would have been handy…

The “subfolder repository” solution worked fine.

I’m running test #6 (with some very interesting initial results) and I will post the results in a few days.

I posted test #5 results: Thunderbird (mbox + SQLite files) with 1M variable chunks (link).

It presented some interesting results, but I’m seeing even more interesting things in test #6, which I’m running now. Christoph, I think it will answer some of your questions about “merged settings”.

<shameless advertising for a feature request>

A nice example for why we need this feature: Folder exclusion based on contained file name · Issue #337 · gilbertchen/duplicacy · GitHub

</shameless advertising for a feature request>

Edit: no, sorry. What nonsense! Issue 337 won’t help at all here. Issue 314, the one Gilbert mentioned, will.

@towerbr: could I suggest that you label your tests on GitHub not just with a number but with something like “#4 - mixed files, fixed 1M chunks” or so? I’m finding it increasingly difficult to grasp what is actually being tested in each test and to navigate the tests accordingly.

It also took me a while to grasp the point of the first two charts (I guess because, combined, they show four data series, but really it’s only three, as one appears in both charts). Why not combine them into a single chart?

I also miss information on if and when you ran the prune command.

Your conclusion that

Deleting messages also represents an increase in storage, since a lot of new chunks seem to be generated.

is not surprising but I cannot relate it to this chart:

I assume the “duplicati” bit is a typo, so doesn’t this show a decrease in storage use? The labels for the green and blue data series seem to be reversed, either in the graph or in the table at the end.

for each increase in the repository, storage increases (on average) by 14 times the increment of the repository.

Yes, we are paying a high price compared with just uploading the encrypted files to, say, B2 and using the built-in versioning there. But I guess there is no other way.

It just occurred to me that an SQLite database can’t be directly backed up if it is still open by a different process: https://stackoverflow.com/questions/25675314/how-to-backup-sqlite-database

If you use -vss it might work, provided the owning process implements a VSS writer and presents a copy of the database in a consistent state…
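
As an alternative when -vss isn’t an option, one could take a consistent copy with SQLite’s own online backup before the backup job runs (file names illustrative):

    sqlite3 global-messages-db.sqlite ".backup 'global-messages-db.copy'"

The .backup dot-command uses SQLite’s online backup API, so it yields a consistent copy even while another process has the database open.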

could I suggest that you label your tests on GitHub not just with a number but with something like “#4 - mixed files, fixed 1M chunks” or so?

Is the description on the readme page not clear? Give me an example of what you think the description could be.

I’m finding it increasingly difficult to grasp what is actually being tested in each test and to navigate the tests accordingly.

In fact I’m testing more than one thing in each test. In the 5th I tested:

  • whether a 1M chunk size is suitable for a mixed repository (mbox and SQLite files)
  • what happens when deleting a large number of files (the account on day 3)
  • the impact of include/exclude patterns

is not surprising but I cannot relate it to this chart

day 2: 22.374.000 => day 3: 24.324.000

Why not combine them into a single chart?

I thought the look would be confusing, but that might be a good idea. In test 4 it was fine.

The purpose of the first graph is to show that even with a decreasing repository size (after the removal of files on the third day), the storage space is not immediately reduced without a prune, which may seem obvious, but not everyone realizes it.
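
(Only a prune actually reclaims that space; a minimal example with an illustrative retention policy:)

    duplicacy prune -keep 0:30

This deletes all revisions older than 30 days; the chunks only they referenced can then be removed.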

The purpose of the second chart is to show that the actual size (reported by Rclone) is not the same as that reported by the Duplicacy log, which is associated with chunks.
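
(The actual size here was measured with something like the following; the remote name is illustrative:)

    rclone size b2:my-duplicacy-storage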

But you’re right: I could be clearer. I’ll put these texts there.

I also miss information on if and when you ran the prune command.

No prune so far…

I assume the “duplicati” bit is a typo

Yes, the same as the other time; I will correct it, thank you!

Is the description on the readme page not clear?

I meant the title, where it now just says “Test #5”, which doesn’t tell you anything except that it’s a test and apparently the fifth one.

As for the ReadMe file: even if it is perfect, that won’t help a user who, for example, clicks on your link above and ends up directly on the test #5 page.

Give me an example of what you think the description could be.

Did you not like the one I gave? Something like “#4 - mixed files, fixed 1M chunks”. Or, even better: “duplicacy backup test #4: mixed files, fixed 1M chunks”.

Anyway, you get the idea.

In fact I’m testing more than one thing in each test. In the 5th I tested:

  • whether a 1M chunk size is suitable for a mixed repository (mbox and SQLite files)
  • what happens when deleting a large number of files (the account on day 3)
  • the impact of include/exclude patterns

I don’t see it like that. These three don’t stand on the same level. Rather, I would say your test was about the first one, and the other two were some of the means of testing the first. So that’s what I’d like to see in the title.

is not surprising but I cannot relate it to this chart
day 2: 22.374.000 => day 3: 24.324.000

But after the third day it just decreases…

But you’re right: I could be clearer. I’ll put these texts there.

Yes, that would help. Readers want guidance. Tell me what I’m supposed to see in this graph, and I’ll see it.

But after the third day it just decreases…

Because on the 4th day I took out the big account.

I would say your test was about the first one and the other two were some of the means of testing the first.

I see, that’s indeed a better way to see the tests. In fact, testing more than one aspect at the same time is not good test practice (I was just trying to save time…).

I’ll reorganize the repository when I post test #6, including the file names.

Because on the 4th day I took out the big account

Yes, which is why I didn’t understand how it fits with your conclusion that

Deleting messages also represents an increase in storage, since a lot of new chunks seem to be generated.

Surely deleting a big account means deleting many messages, right?

BTW: If you didn’t do any pruning, how can the storage decrease at all?

Surely deleting a big account means deleting many messages, right?

The operations are different. When you delete messages, they are deleted “inside” the mbox files, but the files remain there.

When you delete an account, all the mbox files are deleted.

BTW: If you didn’t do any pruning, how can the storage decrease at all?

See the text that I put there yesterday, about the second chart:

“This second chart shows that the actual size (reported by Rclone) is not the same as that reported by the Duplicacy log, which is associated with chunks.”

And thank you for your contributions. They are helping me to improve the descriptions.

I just put this there:

“It’s worth noting that deleting the messages and deleting the account are not equivalent operations. When the messages are deleted, they are deleted “inside” the mbox files, but they remain there; only the index changes. When the account is deleted, all related mbox files are effectively deleted.”

I published test #6 on GitHub… I think you’ll find it interesting: link

Finally! (I was really looking forward to this!) Thanks for all the testing work. Really well done!

Regarding your conclusion:

It is clear that for normal daily use it is better to have separate jobs/settings for database files (with fixed chunks) and for other files (with variable chunks).

While it is not wrong, I think it doesn’t entirely reflect the results. What is missing (and what is most important for me) is that there is no difference in terms of storage use. So I would suggest presenting two conclusion points: 1. regarding storage use, and 2. regarding speed and bandwidth use (or perhaps make it three points, whatever).

So, for me personally, the conclusion is that I can continue my initial backup with 1M variable chunks. But for Gilbert, I think it is also worth considering (at least somewhere down the road) whether duplicacy could/should be made to internally handle variable and fixed chunks within the same repository/storage, i.e. that you could specify (probably in the filters file) which files should be backed up with fixed chunks and which ones with variable chunks (instead of having to set up separate backups for each).

And for TheBestPessimist: how about integrating that into your scripts (i.e. letting the scripts split up what looks like one backup job into two)?

So I would suggest presenting two conclusion points

Good idea, I’ll change it to something separate.

Gilbert, I still have a very basic question about the logs.

In the line referring to “all chunks”, for example:

All chunks: 7428 total, 9,162M bytes; 401 new, 586,478K bytes, 295,944K bytes uploaded

If there are 586 MBytes of new chunks, why was the upload only 295 MBytes?

Does “new” refer to the local size, and is the difference due to compression?

Yes, the original size of these 401 new chunks is 586M, but after compression it is only 295M.
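
(Indeed, 295,944K / 586,478K ≈ 0.50, so compression cut the new chunk data roughly in half.)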

I published the results of the two new tests:

test_07_Thunderbird_kairasku_branches

and

test_08_Evernote_kairasku_branches

There’s a lot of data there. :wink:


@towerbr Thank you very much for your Thunderbird fixed-chunks research.
Yesterday I started a new backup job for Thunderbird with fixed 1MB chunks instead of the default settings.
With the default settings, every revision is 1-1.3 GB; with fixed chunks, revisions are 30-300 MB.
So the improvement is huge :slightly_smiling_face:


Glad to be useful!
