Why does backing up using -hash take so much more storage space?

I have always backed up using the additional -hash parameter.

I had assumed it was simply a better method of detecting whether a file has changed: if the hash of a file has changed, back it up. But it doesn't appear to work like that.

If you include the -hash parameter, the log lists every file as processed during the backup, not just the ones that have changed.

However, if you do not specify the -hash parameter, the logs indicate that only the files whose date/size has changed are included in the backup.
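To illustrate what I mean, here's a rough sketch of the two detection strategies as I understand them (this is my own Python illustration, not Duplicacy's actual code; the function names and the SHA-256 choice are mine):

```python
import hashlib
import os

def changed_by_metadata(path, last_mtime, last_size):
    """Cheap check used by a normal backup: compare the file's current
    timestamp and size with the values recorded in the last snapshot."""
    st = os.stat(path)
    return st.st_mtime != last_mtime or st.st_size != last_size

def changed_by_hash(path, last_digest):
    """What a -hash backup effectively has to do: read and hash the
    whole file, for every file, whether or not it looks changed."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() != last_digest
```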

Can anyone give a quick explanation of what is happening and when you should or shouldn’t be using -hash?

(For my particular data set, a daily backup with -hash uses on average double the storage space, so it's not an insignificant amount.)

Thanks


Not really sure if I understand what is being said in that thread.

In my experience, the storage space required by a daily backup without the -hash parameter is half what it is with -hash, which seems to contradict what is said in that thread.

It also doesn't explain why -hash uploads more data. A changed file is a changed file, so why does -hash process all the files every time, and re-upload them even when they haven't changed?

Thanks

IMO you shouldn't be using -hash for every backup, no. Its main purpose is to occasionally 'defragment' the storage: after a long stretch of incremental backups, where small files especially are constantly modified or moved about, the packing of chunks gradually becomes less efficient.

(A secondary purpose is to back up certain non-standard apps that change a file's content without modifying its timestamp. Excel used to do this, though I don't think it does any longer. And you can always, and maybe should, run such backups separately anyway.)

That said, if you run two consecutive -hash backups one after another and no files were modified, the second won't upload more chunks (apart from a small amount of metadata for that snapshot). It still has to read and hash every file to see whether its contents have changed, though - it treats every file as potentially modified, instead of only the files whose modification timestamps have changed.

The reason it produces more data is that it repacks data more often, resulting in multiple chunks referencing the same file data. Those older, inefficiently packed chunks won't get removed until they're pruned, so you end up with more of these stale chunks and bloated storage.
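Here's a toy model of the effect (my own simplified Python, using fixed-size chunks for clarity; real tools like Duplicacy use variable-size, content-defined chunking, but the churn is similar):

```python
import hashlib

def chunk_ids(data, chunk_size):
    """Toy fixed-size chunker: split the stream and identify each
    chunk by its content hash."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

storage = set()                      # chunks currently held in the backend
files_v1 = b"A" * 10 + b"B" * 10
storage.update(chunk_ids(files_v1, 8))

# A small edit shifts the packing, so most chunk boundaries change...
files_v2 = b"A" * 11 + b"B" * 10
snapshot2 = chunk_ids(files_v2, 8)
storage.update(snapshot2)

# ...and the superseded chunks stay in storage until a prune removes them.
print(len(storage), "chunks stored,", len(set(snapshot2)), "still live")
```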

IMO, you should probably run -hash only once a month or even just once a year. It's a useful tool for getting rid of stale chunks (and the space saving may not be as much as you think, not least because it'll take a while for the oldest chunks to be pruned), but it certainly shouldn't be run on every backup.
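If it helps, here's one way you might schedule that - a minimal sketch assuming the duplicacy CLI is on your PATH and the repository is already initialised (the wrapper and the first-of-the-month trigger are my own illustration):

```python
import datetime
import subprocess

def daily_backup():
    # Normal backups rely on timestamp/size detection; add -hash only on
    # the first of the month to re-read everything and repack stale chunks.
    cmd = ["duplicacy", "backup"]
    if datetime.date.today().day == 1:
        cmd.append("-hash")
    subprocess.run(cmd, check=True)

daily_backup()
```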


Thanks for the information.

I have been running a few tests to find the best chunk size for me, and thought I would include/exclude the -hash option in those tests.

It is interesting that, with exactly the same file changes and the only difference being hash vs. no hash, the -hash daily backup takes twice as much storage at a 4 MB average chunk size. At a 64 MB average chunk size, the -hash daily backup takes fifteen (15) times as much storage!
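My working theory for why the gap widens with chunk size (back-of-the-envelope only; these numbers are my own illustration, not measurements): a small edit inside a chunk forces the whole chunk to be rewritten, and the superseded chunk lingers until pruned, so the bigger the average chunk, the more bytes get churned per byte actually changed.

```python
# Rough amplification estimate for a 1 KB edit at two average chunk sizes.
changed_bytes = 1024                        # a 1 KB edit
for avg_chunk in (4 * 2**20, 64 * 2**20):   # 4 MB and 64 MB chunks
    overhead = avg_chunk / changed_bytes
    print(f"{avg_chunk >> 20} MB chunks: ~{overhead:,.0f}x the changed data")
```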