First backup/prune and chunks left/skipped

Hi,

I just performed a backup as a test, and later I ran duplicacy prune -exclusive -r 1

The revision was removed, but in the chunks dir on the storage I still see some empty chunk directories.

Is this intended? If yes, why?

This topic, Empty folders after prune, partially answers my question: “I’d definitely not tinker with the /chunks directory if I were you - if you still have any backup data in the storage, that is.” But here I do not have any backup data in the storage; it should be “empty” because I pruned the only revision present on it.

And also, when I perform the backup again, as if doing it for the first time, I see this before any Uploaded chunk line:

Skipped chunk 3 size 1048576, 1024KB/s 00:02:13 0.7%
Skipped chunk 4 size 1048576, 2.00MB/s 00:01:06 1.4%
Skipped chunk 13 size 1048576, 3.00MB/s 00:00:44 2.2%
Skipped chunk 14 size 1048576, 4.00MB/s 00:00:33 2.9%

This was also true the first time I performed the backup (maybe because a file was in use and it was not possible to back it up?).

From my first backup:

Loaded 0 include/exclude pattern(s)
Use 10 uploading threads
Skipped chunk 3 size 1048576, 1024KB/s 00:02:16 0.7%
Skipped chunk 4 size 1048576, 2.00MB/s 00:01:08 1.4%
Failed to open file for reading: open \\?\C:\Users\xxx: The process cannot access the file because it is being used by another process.
Skipped chunk 13 size 1048576, 3.00MB/s 00:00:45 2.1%
Skipped chunk 14 size 1048576, 4.00MB/s 00:00:34 2.9%

Why are these first chunks skipped?

Thanks in advance!

That is Duplicacy’s deduplication in action.

When the chunks to be uploaded are already present in the storage, they will be skipped.
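
For illustration, here is a minimal sketch of the idea in Go (hypothetical code, not Duplicacy’s actual implementation): each chunk is named by a hash of its content, so before uploading, the backup only has to check whether a file with that name already exists in the storage, and skip the upload if it does.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "os"
    "path/filepath"
)

// storeChunk writes a chunk into a local "storage" directory laid out like
// a Duplicacy chunks/ tree (first two hex characters as a subdirectory).
// If a file with the same content hash already exists, the chunk is skipped.
func storeChunk(storageDir string, data []byte) (skipped bool, err error) {
    hash := sha256.Sum256(data) // content-derived chunk ID
    name := hex.EncodeToString(hash[:])
    dir := filepath.Join(storageDir, "chunks", name[:2])
    path := filepath.Join(dir, name[2:])

    if _, err := os.Stat(path); err == nil {
        return true, nil // chunk already present: deduplicated, skip upload
    }
    if err := os.MkdirAll(dir, 0o755); err != nil {
        return false, err
    }
    return false, os.WriteFile(path, data, 0o644)
}

func main() {
    data := []byte("same content twice")
    for i := 0; i < 2; i++ {
        skipped, err := storeChunk("/tmp/storage", data)
        if err != nil {
            panic(err)
        }
        fmt.Printf("attempt %d: skipped=%v\n", i+1, skipped)
    }
    // Prints skipped=false on the first attempt, skipped=true on the second.
}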

As for why some of the chunks are not deleted, I am not sure.

Maybe you didn’t remove all snapshots, or some chunks are still referenced by another repository?

Thanks for your reply!

I don’t understand why the first chunks are skipped on the very first backup I do. The storage was empty.
On another device (my NAS) I’m performing daily backups, and I never see skipped chunks in the log that arrives via email. So I don’t understand: in that case, am I not deduplicating?

About the chunks not deleted, I don’t know; I was testing with a single repository and removing the single snapshot present :thinking:

From what I have observed by backing up the same dir twice with minor changes, Duplicacy will skip a chunk when it is already stored in the cloud; that is deduplication.

If you never see skipped chunks in the log, then your data must change very frequently, or the chunk-related settings (average, minimum, and maximum chunk size) may not be optimal.

About the emptied cloud storage, I really have no idea why certain chunks are left behind.

Do you use the Web UI? I see no skipped chunks in the log that is sent via email from the Web UI after a scheduled run. For example, I back up my docker-compose dir. My docker-compose files rarely change, so I would expect to see most chunks being skipped.

I saw the skipped chunks on the other device but from the CLI.

Well, I suppose it’s the same.

EDIT: Looking at the logs of my NAS, I see that indeed not everything is uploaded, only the files with changes, but I do not see any skipped chunks. Maybe the logs from the Web UI behave differently?

I use the CLI command “duplicacy backup -t tag -stats” for my backups and have never used the GUI.

In my CLI logs, skipped chunks are explicitly mentioned.

Only after all chunks are uploaded does Duplicacy mention the uploaded files.

It might be related to the -stats flag I used during backup, which prints this detailed information.

See this log for example; it’s a set of folders with different docker-compose files. Basically they never change.

Back up /source/DockerCompose@localhost to storage DK_Duplicacy_Backups, status: Completed, log file: backup-20210412-104351.log
--------------------------------------------------------------------------------------------------------------------------------
Running backup command from /cache/localhost/0 to back up /source/DockerCompose
Options: [-log backup -storage DK_Duplicacy_Backups -threads 10 -stats]
2021-04-12 10:43:52.287 INFO REPOSITORY_SET Repository set to /source/DockerCompose
2021-04-12 10:43:52.287 INFO STORAGE_SET Storage set to b2://xxx
2021-04-12 10:43:53.720 INFO BACKBLAZE_URL download URL is: https://xxx
2021-04-12 10:43:55.222 INFO BACKUP_START Last backup at revision 10 found
2021-04-12 10:43:55.222 INFO BACKUP_INDEXING Indexing /source/DockerCompose
2021-04-12 10:43:55.222 INFO SNAPSHOT_FILTER Parsing filter file /cache/localhost/0/.duplicacy/filters
2021-04-12 10:43:55.222 INFO SNAPSHOT_FILTER Loaded 0 include/exclude pattern(s)
2021-04-12 10:43:56.983 INFO BACKUP_END Backup for /source/DockerCompose at revision 11 completed
2021-04-12 10:43:56.983 INFO BACKUP_STATS Files: 62 total, 69K bytes; 0 new, 0 bytes
2021-04-12 10:43:56.983 INFO BACKUP_STATS File chunks: 3 total, 75K bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-12 10:43:56.983 INFO BACKUP_STATS Metadata chunks: 3 total, 15K bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-12 10:43:56.983 INFO BACKUP_STATS All chunks: 6 total, 91K bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-12 10:43:56.983 INFO BACKUP_STATS Total running time: 00:00:01

I see that no files are uploaded, but there are no skipped chunks.

It is quite different from my “duplicacy backup -t tag -stats” output.

Maybe you could add “-stats” to the CLI?

It seems I already have it: Options: [-log backup -storage DK_Duplicacy_Backups -threads 10 -stats]

I was also in doubt. Maybe something with the B2 bucket’s lifecycle configuration? Or with the permissions of the key used?

Did you set up the storage with a fixed chunk size? Because I noticed that they are all 1048576 bytes in size.

The logs report the new files, and these new files, due to their characteristics, are not similar to those already in storage, so there is no deduplication (which would be indicated by skipped chunks).

Yes, this is normal behavior. Nothing has changed, so it doesn’t even try uploading the chunks. See the log of a repository of mine that rarely has new files:

2021-04-11 20:00:08.780 INFO REPOSITORY_SET Repository set to ***************
2021-04-11 20:00:08.796 INFO STORAGE_SET Storage set to b2://***************
2021-04-11 20:00:12.667 INFO BACKBLAZE_URL download URL is: https://f001.backblazeb2.com
2021-04-11 20:00:17.737 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2021-04-11 20:00:18.516 TRACE BACKBLAZE_CALL [0] URL request 'HEAD https://f001.backblazeb2.com/file/***************/nesting' returned status code 404
2021-04-11 20:00:18.520 TRACE SNAPSHOT_DOWNLOAD_LATEST Downloading latest revision for snapshot ***************
2021-04-11 20:00:18.524 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot ***************
2021-04-11 20:00:21.152 INFO BACKUP_START Last backup at revision 21 found
2021-04-11 20:00:21.153 INFO BACKUP_INDEXING Indexing ***************
2021-04-11 20:00:21.153 INFO SNAPSHOT_FILTER Parsing filter file ***************
2021-04-11 20:00:21.153 INFO SNAPSHOT_FILTER Loaded 0 include/exclude pattern(s)
2021-04-11 20:00:23.424 TRACE LIST_FILES Listing snapshots/
2021-04-11 20:00:23.425 TRACE LIST_FILES Listing snapshots/***************/
2021-04-11 20:00:23.427 TRACE LIST_FILES Listing chunks/
2021-04-11 20:00:23.430 TRACE LIST_FILES Listing chunks/f7/
2021-04-11 20:00:23.432 TRACE LIST_FILES Listing chunks/df/
2021-04-11 20:00:23.432 TRACE LIST_FILES Listing chunks/bd/
2021-04-11 20:00:23.433 TRACE LIST_FILES Listing chunks/bc/
2021-04-11 20:00:23.434 TRACE LIST_FILES Listing chunks/ba/
2021-04-11 20:00:23.434 TRACE LIST_FILES Listing chunks/b4/
2021-04-11 20:00:23.435 TRACE LIST_FILES Listing chunks/b0/
2021-04-11 20:00:23.436 TRACE LIST_FILES Listing chunks/99/
2021-04-11 20:00:23.436 TRACE LIST_FILES Listing chunks/96/
2021-04-11 20:00:23.438 TRACE LIST_FILES Listing chunks/47/
2021-04-11 20:00:23.439 TRACE LIST_FILES Listing chunks/45/
2021-04-11 20:00:23.439 TRACE LIST_FILES Listing chunks/30/
2021-04-11 20:00:23.441 TRACE LIST_FILES Listing chunks/26/
2021-04-11 20:00:23.441 TRACE LIST_FILES Listing chunks/20/
2021-04-11 20:00:23.442 TRACE LIST_FILES Listing chunks/16/
2021-04-11 20:00:23.442 INFO BACKUP_END Backup for *************** at revision 22 completed
2021-04-11 20:00:23.442 INFO BACKUP_STATS Files: 1 total, 4,096M bytes; 0 new, 0 bytes
2021-04-11 20:00:23.443 INFO BACKUP_STATS File chunks: 4096 total, 4,096M bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-11 20:00:23.443 INFO BACKUP_STATS Metadata chunks: 3 total, 300K bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-11 20:00:23.443 INFO BACKUP_STATS All chunks: 4099 total, 4,096M bytes; 0 new, 0 bytes, 0 bytes uploaded
2021-04-11 20:00:23.443 INFO BACKUP_STATS Total running time: 00:00:05

What I meant was that after I deleted revision number 1 (the only one I had), I found that in the “chunks” folder on my storage there were some empty chunk folders.
Why these are left there, empty, I don’t know.

I see. And is there a way to see the “full” log?

Thanks!
So the CLI works differently: it tells you in the logs when it’s skipping files, while this doesn’t happen in the logs from the Web UI?

Those Skipped chunk log messages are filtered out of the log files by the web GUI. Chunks are skipped because they already exist in the storage, not because the corresponding files are skipped (in fact, when a file can’t be opened, there is no way to read its content and produce chunks).

It is possible for Duplicacy to skip some chunks in an initial backup even with an empty storage, if there are duplicate files in the source directory.
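
For example, here is a toy sketch in Go (hypothetical, not Duplicacy’s actual code) of how duplicates within a single backup job lead to skipped chunks even on an empty storage: the uploader remembers the hashes of chunks it has already produced in this job, so the second copy of an identical file yields nothing new to upload.

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    // Two identical source files produce identical chunks.
    files := [][]byte{
        []byte("identical file content"),
        []byte("identical file content"),
    }

    seen := map[[32]byte]bool{} // chunk hashes already handled in this job
    for i, data := range files {
        h := sha256.Sum256(data) // one chunk per file in this toy example
        if seen[h] {
            fmt.Printf("file %d: skipped chunk (duplicate within the job)\n", i+1)
            continue
        }
        seen[h] = true
        fmt.Printf("file %d: uploaded chunk\n", i+1)
    }
}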

When chunks are pruned from the storage, Duplicacy may leave some empty directories. Empty directories usually do not take much space, and it is not worth the effort to remove them.
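
If the storage truly holds no backup data anymore and you still want to tidy up, a small script along these lines could remove the leftover empty directories (an illustrative sketch, not a Duplicacy feature; the path is a placeholder, and it should only ever be run against a storage you know to be empty):

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// removeEmptyDirs deletes empty subdirectories directly under chunksDir.
// WARNING: only meant for a storage that no longer holds any backup data.
func removeEmptyDirs(chunksDir string) error {
    entries, err := os.ReadDir(chunksDir)
    if err != nil {
        return err
    }
    for _, e := range entries {
        if !e.IsDir() {
            continue
        }
        sub := filepath.Join(chunksDir, e.Name())
        contents, err := os.ReadDir(sub)
        if err != nil {
            return err
        }
        if len(contents) == 0 {
            fmt.Println("removing empty dir:", sub)
            if err := os.Remove(sub); err != nil {
                return err
            }
        }
    }
    return nil
}

func main() {
    // The path below is a placeholder for your storage location.
    if err := removeEmptyDirs("/path/to/storage/chunks"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}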

Thanks!

By this you mean similar files, so there is no need to upload a chunk if there is already an identical chunk on the storage?

This is the principle of deduplication.

Considering this possibility:

Allow me to ask again: did you set up the storage with a fixed chunk size? Because I noticed that they are all 1048576 bytes in size.

I forgot to mention that the B2 logs were from my NAS, where I use the Web UI. The logs with the skipped chunks were from my laptop, where I was trying the CLI with a WorkDocs attached drive as storage.

Actually no, I did not select any custom settings.

I recreated the backup as a test with the CLI just now. It’s my Thunderbird profile:

Loaded 0 include/exclude pattern(s)
Use 10 uploading threads
Skipped chunk 3 size 1048576, 1024KB/s 00:02:16 0.7%
Skipped chunk 4 size 1048576, 2.00MB/s 00:01:08 1.4%
Uploaded chunk 2 size 1048576, 3.00MB/s 00:00:45 2.1%
Uploaded chunk 1 size 4428181, 7.22MB/s 00:00:18 5.2%
Skipped chunk 11 size 1048576, 8.22MB/s 00:00:16 6.0%
Skipped chunk 12 size 1048576, 9.22MB/s 00:00:14 6.7%
Uploaded chunk 5 size 3914996, 12.96MB/s 00:00:10 9.4%
Uploaded chunk 10 size 4913836, 8.82MB/s 00:00:14 12.8%
Uploaded chunk 13 size 1550923, 9.56MB/s 00:00:13 13.9%
Uploaded chunk 6 size 13418698, 15.96MB/s 00:00:07 23.3%
Uploaded chunk 9 size 5603327, 24.67MB/s 00:00:04 36.0%
Uploaded chunk 8 size 12655375, 21.99MB/s 00:00:05 32.1%
Uploaded chunk 7 size 15593400, 32.10MB/s 00:00:03 46.9%
Uploaded chunk 15 size 2845337, 33.46MB/s 00:00:03 48.8%
Uploaded chunk 17 size 1249017, 34.05MB/s 00:00:03 49.7%
Uploaded chunk 19 size 2328451, 23.44MB/s 00:00:03 51.3%
Uploaded chunk 14 size 8839577, 26.25MB/s 00:00:03 57.5%
Uploaded chunk 21 size 1566033, 26.75MB/s 00:00:03 58.6%
Uploaded chunk 16 size 5997765, 28.66MB/s 00:00:02 62.8%
Uploaded chunk 20 size 1989691, 29.29MB/s 00:00:02 64.1%
Uploaded chunk 22 size 3321686, 30.35MB/s 00:00:02 66.5%
Uploaded chunk 23 size 2231021, 31.05MB/s 00:00:02 68.0%
Uploaded chunk 18 size 10105000, 34.27MB/s 00:00:01 75.1%
Uploaded chunk 26 size 9214231, 27.90MB/s 00:00:01 81.5%
Uploaded chunk 24 size 11199626, 30.57MB/s 00:00:01 89.3%
Uploaded chunk 25 size 15310484, 34.22MB/s 00:00:01 100.0%

@towerbr what do you think?

It is normal for the default variable-size chunking algorithm to produce all 1048576-byte chunks if the file to be backed up is a sparse file that contains a lot of zeros. The nature of the algorithm causes it to always split zero blocks at the minimum chunk size, which happens to be 1048576 with the default settings.
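
A rough illustration of why (a toy content-defined chunker in Go, not Duplicacy’s actual hash or parameters): over a run of zeros the rolling hash is constant, so the boundary test gives the same answer at every position; if that answer is “cut”, the chunker cuts at the first position where it is allowed to, which is exactly the minimum chunk size, over and over.

package main

import "fmt"

// Toy content-defined chunker, only to illustrate the behavior on zeros.
const (
    minSize = 1 << 20   // 1 MiB: the default minimum chunk size
    maxSize = 1 << 24   // 16 MiB: the default maximum chunk size
    mask    = 1<<22 - 1 // boundary when hash&mask == 0 (~4 MiB average)
)

func chunkSizes(data []byte) []int {
    var sizes []int
    var h uint64
    length := 0
    for _, b := range data {
        h = h*31 + uint64(b) // placeholder hash: stays 0 on all-zero input
        length++
        if length >= minSize && (h&mask == 0 || length >= maxSize) {
            sizes = append(sizes, length)
            length, h = 0, 0
        }
    }
    if length > 0 {
        sizes = append(sizes, length)
    }
    return sizes
}

func main() {
    sparse := make([]byte, 4<<20) // 4 MiB of zeros, like a sparse file
    fmt.Println(chunkSizes(sparse))
    // The hash stays 0 over zeros, so the boundary test succeeds as soon
    // as it is allowed: every chunk is exactly minSize, i.e. 1048576.
    // Output: [1048576 1048576 1048576 1048576]
}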

From your log, now with the other chunk sizes, I see that it is obviously not configured for fixed-size chunks.

I think @gchen’s explanation above is a possible reason for the 4 skipped chunks.

Combining it with this hypothesis:

I think you have some small empty files, and in that same first backup job, the chunks produced for the first of these files were reused later in the job.

Thank you both! This helped me a lot in understanding how Duplicacy works.

