Backup vs check stats confusion

Longtime duplicacy user here, but I’ve managed to confuse myself thoroughly interpreting the check and backup results recently, whilst taking a second look at how efficiently duplicacy is deduplicating my data. I’d appreciate someone clearing things up for me :slight_smile:

My setup:
2 Windows laptops and 1 Raspberry Pi 4 acting as my NAS with 4 separate drives attached.
1 of the NAS drives is my local backup storage. I have 7 repositories across the laptops and NAS that all back up to this storage (either directly or via SFTP). I then copy the local storage offsite using duplicacy copy. I only use the duplicacy CLI (no GUI, no web GUI). I have scheduled jobs set up to run backups, checks, prunes etc. regularly.

So, onto my confusion. Consider the output from the latest backup (revision 19) of my laptop to the local storage:

2019-12-10 19:06:12.080 INFO BACKUP_END Backup for C:\Users\martin at revision 19 completed
2019-12-10 19:06:12.080 INFO BACKUP_STATS Files: 592948 total, 121,256M bytes; 4641 new, 12,019M bytes
2019-12-10 19:06:12.080 INFO BACKUP_STATS File chunks: 24497 total, 121,626M bytes; 493 new, 2,510M bytes, 2,491M bytes uploaded
2019-12-10 19:06:12.080 INFO BACKUP_STATS Metadata chunks: 45 total, 199,887K bytes; 38 new, 169,857K bytes, 50,699K bytes uploaded
2019-12-10 19:06:12.080 INFO BACKUP_STATS All chunks: 24542 total, 121,821M bytes; 531 new, 2,676M bytes, 2,541M bytes uploaded

As I understand it, this says the backup added 531 new chunks (493 file chunks plus 38 metadata chunks), increasing storage usage by approx 2.5GB (2,676M bytes of new chunks, of which 2,541M bytes were actually uploaded).

Then over on the NAS I run the check command:

           snap | rev |                               |  files |    bytes | chunks |    bytes |  uniq |    bytes |   new |    bytes |
 martins-laptop |   1 | @ 2019-11-22 22:51 -hash -vss | 608322 | 153,659M |  30815 | 140,970M |    55 |  85,428K | 30815 | 140,970M |
 martins-laptop |   2 |      @ 2019-11-23 19:00  -vss | 603336 | 153,663M |  30801 | 140,976M |     4 |   4,619K |    41 |  91,638K |
<SNIP>
 martins-laptop |  17 |      @ 2019-12-08 20:14  -vss | 593012 | 119,512M |  23427 | 107,103M |     1 |   2,748K |     1 |   2,748K |
 martins-laptop |  18 |      @ 2019-12-10 12:41  -vss | 592068 | 119,510M |  23422 | 107,105M |    37 |  50,598K |    51 |  90,776K |
 martins-laptop |  19 |      @ 2019-12-10 19:00  -vss | 592948 | 121,256M |  23724 | 108,577M |    56 | 205,812K |  1469 |   7,575M |
 martins-laptop | all |                               |        |          |  36781 | 168,305M | 21312 |  97,687M |       |          |

For that same revision 19 this seems to tell a different tale - 56 unique chunks for the storage at 205MB (or 1469 chunks newly seen for this snapshot id at 7.5GB).

It’s at this point I get confused. I haven’t really paid much attention to the stats part of check before, but I’m not sure how to interpret these numbers (despite browsing through the forum and help) or why they seem so different from the backup report.

As a slight aside: if I want to know how much my storage usage is increasing, do I look at the uniq columns or the new columns in the check stats report? My gut says the uniq ones, but is that correct?

Take a look at the forum topic explaining the check command’s stats output.

Thanks for the reply - I’ve read that topic already a few times and am still confused by the following line:

“New” is the number of chunks that first appear in this revision.

Is my interpretation of ‘new’ as ‘new to this snapshot id series, but not necessarily new chunks in the entire storage’ accurate?

That is right. 1469 chunks don’t appear in any ‘martins-laptop’ snapshots before revision 19. However, some of them may already exist in other snapshot ids, so only 531 were uploaded.
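
If it helps, here is roughly how I would express the two columns in Go-like pseudocode (a simplified sketch with made-up types and names, not the actual implementation):

package main

import "fmt"

// Hypothetical types -- not Duplicacy's actual data structures.
type Revision struct {
	SnapshotID string
	Number     int
	Chunks     map[string]bool // set of chunk hashes referenced
}

// "new" column: chunks in rev that appear in no earlier revision
// of the same snapshot id.
func newChunks(rev Revision, all []Revision) int {
	seen := map[string]bool{}
	for _, r := range all {
		if r.SnapshotID == rev.SnapshotID && r.Number < rev.Number {
			for c := range r.Chunks {
				seen[c] = true
			}
		}
	}
	n := 0
	for c := range rev.Chunks {
		if !seen[c] {
			n++
		}
	}
	return n
}

// "uniq" column: chunks in rev referenced by no other revision of
// any snapshot id, as of the moment the check runs.
func uniqueChunks(rev Revision, all []Revision) int {
	others := map[string]bool{}
	for _, r := range all {
		if r.SnapshotID == rev.SnapshotID && r.Number == rev.Number {
			continue
		}
		for c := range r.Chunks {
			others[c] = true
		}
	}
	n := 0
	for c := range rev.Chunks {
		if !others[c] {
			n++
		}
	}
	return n
}

func main() {
	laptop1 := Revision{"martins-laptop", 1, map[string]bool{"c1": true, "c2": true}}
	laptop2 := Revision{"martins-laptop", 2, map[string]bool{"c2": true, "c3": true}}
	media1 := Revision{"media", 1, map[string]bool{"c3": true}}
	all := []Revision{laptop1, laptop2, media1}
	fmt.Println(newChunks(laptop2, all))    // 1: only c3 is new to this id
	fmt.Println(uniqueChunks(laptop2, all)) // 0: c2 and c3 are both shared
}

The important asymmetry: new is fixed once the revision exists, but uniq can shrink later, when a backup under another snapshot id starts referencing the same chunks.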

Thanks! That clears that bit up :slight_smile:

So in my case above, how much storage was consumed by revision 19? Was it 2.5GB or just 205MB?

And if it was 2.5GB, is there a way to determine that from just the check output (as I don’t keep the backup reports as long)?

The revision uploaded 2.5 GB. The size of its unique chunks would have been 2.5 GB if the check command had been run at that time. However, some backups from other snapshot ids were done later which shared some of those 2.5 GB, and thus reduced the unique size of revision 19 to 205 MB.

Oh, interesting. So here’s the only backup that was later:

                 snap | rev |                          |  files | bytes | chunks | bytes |   uniq |    bytes |    new |  bytes |
server-personal-media |   3 | @ 2019-12-11 14:11       | 179188 | 1540G | 306314 | 1426G |     30 | 200,769K |    507 | 2,544M |

And I expected this backup to contain mostly duplicate data, so your explanation makes sense. We also see that this backup reports only 200MB of unique data, which confirms it.

2019-12-11 14:26:06.975 INFO BACKUP_STATS Files: 179188 total, 1540G bytes; 4754 new, 12,015M bytes
2019-12-11 14:26:06.975 INFO BACKUP_STATS File chunks: 318028 total, 1540G bytes; 22 new, 179,182K bytes, 177,900K bytes uploaded
2019-12-11 14:26:06.975 INFO BACKUP_STATS Metadata chunks: 17 total, 69,977K bytes; 8 new, 47,978K bytes, 22,869K bytes uploaded
2019-12-11 14:26:06.975 INFO BACKUP_STATS All chunks: 318045 total, 1540G bytes; 30 new, 227,160K bytes, 200,769K bytes uploaded

But it now seems there isn’t a reliable way to figure out from the check output how much additional space a revision consumed when it was created (the 2.5GB for revision 19)?

2.5 GB is how much additional space revision 19 consumed when it was first created. 205 MB is how much additional space revision 19 consumes now. I think the latter is more relevant - that is the amount you’ll reclaim if revision 19 is deleted.
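
To make the reclaim point concrete, here is a toy sketch (a hypothetical helper with made-up sizes, loosely mirroring your numbers - not actual Duplicacy code):

package main

import "fmt"

// reclaimable returns the bytes freed by deleting a revision and
// then pruning: the sizes of its chunks that no surviving revision
// references. That is exactly the check command's "uniq" bytes.
func reclaimable(victim map[string]int64, survivors []map[string]int64) int64 {
	var freed int64
	for hash, size := range victim {
		referenced := false
		for _, s := range survivors {
			if _, ok := s[hash]; ok {
				referenced = true
				break
			}
		}
		if !referenced {
			freed += size
		}
	}
	return freed
}

func main() {
	// Sizes in MB, loosely mirroring the numbers in this thread.
	rev19 := map[string]int64{"a": 100, "b": 105, "c": 2400}
	media3 := map[string]int64{"c": 2400} // the later backup shares chunk "c"
	fmt.Println(reclaimable(rev19, []map[string]int64{media3})) // 205
	// Only ~205 MB comes back: the ~2.4 GB in "c" is still needed by media3.
}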


I think the latter is more relevant - that is the amount you’ll reclaim if revision 19 is deleted.

Agreed, that is useful information :slight_smile:

What I’d also like to be able to tell when I run the check command is how much storage usage is increasing. Right now that information is lost - there is no way to tell that 2.5GB of storage was consumed across those two related revisions. It reads like only 205MB was added, but we know that isn’t true :wink:

For now my main question is answered and I can work around this by keeping my own reports from the backup logs. Thanks!

But I wonder if we could do something with the check command as well? Below I’m adding a ‘shared’ column to account for the value of shared chunks (no idea if that is even possible); a rough sketch of the definition I have in mind follows the table.

           snap | rev |                          |  files |    bytes | chunks |    bytes |  uniq |    bytes |  shared |    bytes |   new |    bytes |
 martins-laptop |  19 | @ 2019-12-10 19:00  -vss | 592948 | 121,256M |  23724 | 108,577M |    56 | 205,812K |     465 |   2,471M |  1469 |   7,575M |
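
Conceptually, the column I’m imagining would be something like this (hypothetical Go, just to pin down the meaning - I’m not claiming the check code tracks this today):

package main

import "fmt"

// sharedChunks: chunks that are "new" to this snapshot id at this
// revision but are also referenced by snapshots under other ids.
// (Hypothetical definition for the proposed "shared" column.)
func sharedChunks(newSet map[string]bool, otherIDs []map[string]bool) int {
	n := 0
	for c := range newSet {
		for _, other := range otherIDs {
			if other[c] {
				n++
				break
			}
		}
	}
	return n
}

func main() {
	newSet := map[string]bool{"c1": true, "c2": true, "c3": true}
	media := map[string]bool{"c2": true, "c3": true}
	fmt.Println(sharedChunks(newSet, []map[string]bool{media})) // 2
}

With that definition, the uniq bytes plus the shared bytes would roughly add back up to what the revision contributed to storage when it was made (205,812K + 2,471M ≈ 2,676M in my mock row above).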

But if you think about it, the check output is correct.

When you finished backing up revision 19, if you “checked” it, it would report 2.5GB, and that really was the amount by which storage had increased.

After the backup of server-personal-media, the increase was only 200MB, because it took advantage of the previous chunks, etc.

That is, the check is reporting the size correctly at that time, isn’t it?

I think “correct” is very subjective here; it depends on what you expect.

For me, when martins-laptop revision 19 was backed up, it added 2.5GB to the consumed space in unique chunks. When server-personal-media revision 3 was backed up, it added 200MB to the consumed space (because most of it was deduplicated against revision 19). So from that I could see that 2.7GB was consumed in storage space since the timestamp of revision 19.

But the check command doesn’t give me that useful information in its current form, because it alters the past such that it looks like martins-laptop revision 19 only added 200MB when it actually added 2.5GB. And because server-personal-media revision 3 also reports only 200MB of unique chunks, those 2.5GB have now been hidden from the stats.

I guess I’d like to see check stats relative to the timestamp of each revision. So regardless of when check is run, I could see how many unique chunks revision 19 added at its timestamp, rather than just how many unique chunks it references now. Perhaps some additional fields in the output?
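
As a very rough sketch of the stat I mean (hypothetical Go with made-up types - not how Duplicacy actually stores revisions):

package main

import (
	"fmt"
	"time"
)

// uniqueAtCreation recomputes "uniq" against only the revisions that
// existed when this revision was created, so a check run today would
// still reproduce the 2.5GB that revision 19 added at its timestamp.
type Rev struct {
	ID      string
	Created time.Time
	Chunks  map[string]int64 // chunk hash -> size
}

func uniqueAtCreation(r Rev, all []Rev) int64 {
	earlier := map[string]bool{}
	for _, o := range all {
		if o.Created.Before(r.Created) { // later backups are ignored
			for c := range o.Chunks {
				earlier[c] = true
			}
		}
	}
	var bytes int64
	for c, size := range r.Chunks {
		if !earlier[c] {
			bytes += size
		}
	}
	return bytes
}

func main() {
	t := time.Now()
	// Sizes in MB, loosely mirroring this thread.
	rev19 := Rev{"martins-laptop", t, map[string]int64{"a": 100, "c": 2400}}
	media3 := Rev{"media", t.Add(time.Hour), map[string]int64{"c": 2400}}
	fmt.Println(uniqueAtCreation(rev19, []Rev{rev19, media3})) // 2500
	// media3 shares chunk "c", but it came later, so it no longer hides it.
}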

It’s been a while since I last looked through the Duplicacy code, but if someone can point me in the general direction, I’ll hack around a bit on the current check command stats and see if I can come to a solution.
