Odd Sizing Issue

I have a little over 2TB to back up. Size on backup is 3.6TB. Huh?
I’ve tried pruning with exclusive and it barely makes a dent?

You may want to run prune -exhaustive to get rid of any potentially orphaned chunks.

If still no dice, then that’s the difference between versions: perhaps you are backing up some high-turnover data.

You can run check with the -stats argument to see statistics.
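
If you are on the CLI, a minimal sketch (run from the initialized repository folder; nothing here beyond the standard flags):

    # remove any chunks no longer referenced by any snapshot
    duplicacy prune -a -exhaustive

    # print per-revision and total statistics for the storage
    duplicacy check -stats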

I don’t need a ton of versions here. As long as each file has one or two versions it’s good. It’s all music. I had run prune -a -exhaustive. My standard prune is -keep 0:45 -keep 7:30 -keep 1:7 -a, which might even be too much. I want to keep the backup size close to the actual size.
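
For reference, assuming Duplicacy’s documented -keep n:m semantics (keep one revision every n days for revisions older than m days; n = 0 means delete them entirely), that policy reads as:

    # delete revisions older than 45 days,
    # keep one revision every 7 days for revisions older than 30 days,
    # keep one revision per day for revisions older than 7 days
    duplicacy prune -a -keep 0:45 -keep 7:30 -keep 1:7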

Music is immutable. If you create multiple versions of files that don’t change, storage size does not increase: all data is already on the storage, only the file manifest will be uploaded.

In other words, regardless of the number of versions of static files backed up, the backup size will be very close to the source size.

So, something else is at play here.

What is your backup destination? One possibility I can think of is that the target is on a filesystem with a massive block (allocation unit) size. Duplicacy creates a lot of small files, 4 MiB on average, and there could be a lot of per-file overhead.
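
If you can get a shell on the destination, something like this would show the block size (this assumes GNU coreutils stat; the path is a placeholder):

    # look at "Fundamental block size" in the output: that is the allocation unit
    stat -f /path/to/duplicacy/datastore

With ~4 MiB chunks the average per-file waste is about half a block, so the block size would have to be in the megabytes to add up to anything close to a terabyte here.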

Destination is a NAS, dedicated folder.

Ah, plot thickens. What filesystem is on the volume the backup is located on?

Can you ssh to the NAS and run this:

du -hd 0 /path/to/duplicacy/datastore
du -hd 0 --apparent-size /path/to/duplicacy/datastore

to see the difference between the size of the data vs the size it takes on disk.

And you are sure that there is no possibility that you have backed up some files other than immutable media files?

du -sh was showing 3.6 TB
-hd 0 shows 3.6
--apparent-size is not recognized.

Try -A instead of --apparent-size.
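
With a BSD-style du that would be something like (path is a placeholder):

    du -Ahd 0 /path/to/duplicacy/datastore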

But yeah, if du shows 3.6 – then that’s what you have there.

Did duplicacy check -stats reveal anything interesting?

-A still shows 3.6 T
Check -stats shows up as not initialized? I’m primarily using the web interface, so I am within the backed-up folder where the command line would normally work.

Web interface sizes look normal for that specific backup job.

Prune -a -exhaustive is still running though.

IIRC the backup command is run from the /0/, /1/, etc. subpaths, but check, prune and the others from the /all/ subpath.

Or you can create a schedule, untick all days, add the -stats flag, and launch the schedule manually.
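
Or, if you prefer to run it by hand, the web UI keeps its CLI repositories in its own directory; the path below is an assumption (the usual default is under ~/.duplicacy-web/repositories/localhost/):

    # run check from the "all" repository that the web UI maintains
    cd ~/.duplicacy-web/repositories/localhost/all
    duplicacy check -a -stats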

Let’s see if it finds and removes orphans once it’s done.

Finished and nothing cleared :frowning:

Great, the fact that there were no orphans means it is all working correctly.

So, the size difference is then due to some data that has either been backed up in one of the backup runs and still sits there, or is maybe being constantly picked up, e.g. some hidden folder with cache data or other transient stuff.

Run duplicacy check -stats or check -tabular; among other things, it will show you the amount of new data between snapshots.
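
For example (a sketch, run from the initialized repository):

    # per-revision statistics, including new and unique chunk sizes
    duplicacy check -tabular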

If you only back up media, the amount of new data should only increase after you actually add new data.

You can also run list -files to see all the files that have been backed up. Confirm that these are only the files you have intended to have backed up.
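
For example (the revision number and the extension filter below are only placeholders):

    # list every file in revision 1 of this repository's snapshot
    duplicacy list -r 1 -files

    # or surface anything that does not look like an audio file
    duplicacy list -r 1 -files | grep -viE '\.(mp3|flac|m4a|ogg|wav)'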

Let’s see where it goes! Thanks. Running and waiting!

Total chunk size shows 2414G but size on disk shows 3.4TB. Huh?

Ok, so the 2.4T seems more or less reasonable, so we need to figure out where this extra terabyte goes :)

Since we ruled out the orphaned chunks (-exhaustive did not change anything) and a large block size on the filesystem (the actual size is close to the apparent size as reported by du), I see two possibilities:

  1. Filesystem corruption on your NAS. Check if you can run some sort of diagnostic/repair tool, like fsck or chkdsk. Is the disk you back up to part of an array or a USB disk? What filesystem is on it?

  2. Perhaps some other data got moved into the duplicacy folder? Accidental drag and drop? Stuff like that? You can check for extraneous files visually, or with a shell one-liner that finds files that don’t match the expected chunk-name pattern.

    find -E /path/to/duplicacy/folder/chunks  -type f ! -iregex '.*/[0-9a-f]{62}$'
    

    Explanation:

    • -E turns on extended regex
    • -type f looks only for files
    • ! negates the condition, i.e. finds files that don’t match the pattern
    • -iregex matches a case-insensitive regular expression against the full path
    • '.*/[0-9a-f]{62}$' matches exactly:
    • .* a sequence of arbitrary characters of arbitrary length
    • followed by the path separator /
    • followed by exactly 62 hexadecimal characters (digits or letters ‘a’ to ‘f’)
    • followed by the end of the string.

    This will find and print any files in the chunks folder that don’t look like Duplicacy chunks. This works on FreeBSD; on Linux the GNU version of find needs a slightly different invocation (see below).
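
    On GNU find (most Linux distributions) there is no -E flag; the equivalent is -regextype posix-extended, which has to appear before the -iregex test:

    find /path/to/duplicacy/folder/chunks -regextype posix-extended -type f ! -iregex '.*/[0-9a-f]{62}$'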

I would not be able to check for corruption since this is a cloud-based storage box.

For reference, the Duplicacy instance only has access to the folder it is in. The find command kicked back:

$ > find -E chunks -type f ! -iregex '.*/[0-9a-f]{62}$'
Command not found. Use 'help' to get a list of available commands.

This is weird. You can check what shell they are giving you with echo $SHELL. But even a bare sh environment normally has find available.

haha, yea. No go on echo or find.

To note, this folder is not used for anything other than this specific Duplicacy instance and is not accessed by any other user or process, so it would be weird if something were out of place.

Indeed. If you can’t even use any tools, how could you have copied anything there…

Since duplicacy thinks that all chunks are accounted for and are of reasonable size, and there are no unaccounted-for chunks, it must be something else. Maybe some internal "lost+found" folder, something server-side? Can you ask the provider to send you a list of the files in your folder? I’m all out of other ideas…
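
Since du did work in that restricted shell, one more thing worth trying is a per-directory breakdown, which would at least show whether the extra terabyte sits under chunks/, snapshots/, or something else entirely (path is a placeholder):

    du -hd 1 /path/to/duplicacy/datastore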