Duplicate files, different timestamps

scottw · 21 September 2018 16:20

Say I am backing up a bunch of different websites, all using WordPress. Each site has many of the same files, with the exact same content, but the timestamps on those files are different.

Is the end result that those “duplicate files” are really just stored as one file (chunks), even though they have different paths and different timestamps?

Christoph · 21 September 2018 22:22

Let me reformulate your question so that I can give a short and clear answer: given the files I have on different webservers, does it make a difference for deduplication if the (identical) files have different timestamps?

No, it makes no difference.

Why is the answer to your original question not as clear? Because you seem to ask whether only one copy of each file will be backed up, i.e. whether you will reach 100% deduplication. But that’s not quite how it works. Not every file is backed up as a separate chunk. In fact, Duplicacy currently does not “see” file boundaries at all. It just reads the data sequentially and applies cuts (i.e. starts a new chunk) based on its chunking algorithm. This means that if your files are small and stored in a different order on each server, you may have little or no deduplication at all. The bigger the files, the more deduplication you can expect. But in all these scenarios, the timestamp is irrelevant.
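
To make the effect of file size and order concrete, here is a toy simulation in Python. It is only a sketch and not Duplicacy's actual algorithm: the cut rule (a CRC32 over a 16-byte window), the sizes, and the names `chunk_hashes` and `dedup_ratio` are all assumptions made up for this illustration. Each "server" is modelled as one stream built from the file contents only, so timestamps never enter the picture, and the second server holds the same files in reverse order.

```python
import hashlib
import random
import zlib

WINDOW = 16     # bytes the cut decision looks at (toy value, an assumption)
AVG_CHUNK = 64  # roughly 1 in 64 positions becomes a cut (toy value)


def chunk_hashes(stream: bytes) -> set:
    """Split the stream at content-defined cut points and return the set of
    SHA-256 hashes of the resulting chunks. File boundaries are invisible."""
    hashes = set()
    start = 0
    for end in range(WINDOW, len(stream) + 1):
        # The decision depends only on the last WINDOW bytes before `end`.
        at_cut = zlib.crc32(stream[end - WINDOW:end]) % AVG_CHUNK == 0
        if at_cut or end == len(stream):
            hashes.add(hashlib.sha256(stream[start:end]).hexdigest())
            start = end
    return hashes


def dedup_ratio(files: list) -> float:
    """Fraction of the second server's chunks that the first server already
    produced. Only file contents are joined; timestamps are not part of it."""
    first = chunk_hashes(b"".join(files))
    second = chunk_hashes(b"".join(reversed(files)))
    return len(first & second) / len(second)


rng = random.Random(0)


def rand_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


# Same total amount of data, packed as many small files or a few big ones.
small_files = [rand_bytes(20) for _ in range(300)]
big_files = [rand_bytes(2000) for _ in range(3)]

small_ratio = dedup_ratio(small_files)
big_ratio = dedup_ratio(big_files)
print(f"small files, different order: {small_ratio:.0%} of chunks reused")
print(f"big files, different order:   {big_ratio:.0%} of chunks reused")
```

With small files, almost every chunk spans several files, so changing the order changes the chunks. With big files, the cut points inside each file line up again shortly after the file starts, and most chunks are reused.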

Now, if you are wondering: is there not room for improvement? Can’t we also deduplicate the small files? Then you are right, and the code for this solution has already been written. See here: