Server-side verification? Does anyone else worry about this?

So I wonder if I am the only one who is concerned about this. Does anyone else worry about the integrity of the chunk and snapshot files on the server?

I know there is the check command, but as I understand it, that doesn't actually check the integrity of the data itself.

I was hoping at some point to be able to perform a checksum check or something of that nature from the server side…

Am I the only one worried about this? I mean… how does everyone periodically guarantee or verify that once the data is uploaded… it doesn’t get corrupted in some way?

Comments invited! :slight_smile:

Later edit: there’s a discussion about this topic on GitHub: Adding a feature to help with server side data integrity verification · Issue #330 · gilbertchen/duplicacy · GitHub

Generally, data integrity is something that the storage backend should guarantee: the file you write today and the file you read in four years must contain the same bits. (Otherwise, where do you draw the line? Verify that RAM works? That the CPU adds 2 and 2 correctly? Etc.) In practice this is not so easy to achieve.

Since Duplicacy is not a client-server app but works through a file API, the only way it can verify chunk integrity in the general case is to read (download) all of those chunks (with some exceptions: some storage backend APIs do support server-side validation), and egress is usually expensive.

This means that if you are hosting your backend storage yourself, you do need to ensure you can detect and correct bit rot: use a checksumming filesystem, write data with redundancy, and run a periodic data scrub (that's your check). In other words, don't back up to a single USB drive with an exFAT filesystem.

You can check the integrity of the data itself with the check command if you use the -files option, but this will download all of your data at least once. The newer -chunks option might also be sufficient, depending on your needs.

I think a simple redundancy could be added to the system to allow some level of server-side verification. For example, every time a chunk is uploaded to the server, a log file could be updated with the chunk's name, file size, and perhaps an MD5 or similar checksum. Then, periodically, code on the server could check consistency between the chunk files and the data stored in the log file.

I know it would still not be perfect, but it would be one level of redundancy that would alleviate most concerns.

I spoke with @gchen about something like this some time ago, but I forget what became of the idea. I think there was some reason it would not work… but I sure would like something like that still. :slight_smile:

Duplicacy does not have a server-side component, so you would need some other piece of code to do that. Which means you can already do it yourself with a simple script: scan each new chunk file, compute its MD5, then compare periodically.
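To make the "simple script" idea concrete, here is a minimal sketch in Python. The storage path, the manifest file name, and the choice of MD5 are illustrative assumptions for this example, not anything Duplicacy provides or requires; it only relies on chunk files sitting under the storage's chunks/ subdirectory. You would run the scan pass after backups complete and the verify pass on a schedule (e.g. from cron):

```python
#!/usr/bin/env python3
"""Sketch of a server-side checksum manifest for a Duplicacy storage.

Assumptions (not part of Duplicacy itself): the storage lives at
STORAGE_PATH, chunk files sit under its 'chunks/' subdirectory, and
the manifest file name/location are arbitrary choices for this example.
"""
import hashlib
import json
import os
import sys

STORAGE_PATH = "/srv/duplicacy-storage"            # hypothetical storage location
MANIFEST = os.path.join(STORAGE_PATH, "chunk-manifest.json")
CHUNKS_DIR = os.path.join(STORAGE_PATH, "chunks")

def md5_of(path, bufsize=1 << 20):
    """Stream the file so large chunks never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(bufsize):
            h.update(block)
    return h.hexdigest()

def scan():
    """Record name, size, and MD5 for every chunk file currently in the storage."""
    manifest = {}
    for root, _, files in os.walk(CHUNKS_DIR):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, CHUNKS_DIR)
            manifest[rel] = {"size": os.path.getsize(path), "md5": md5_of(path)}
    with open(MANIFEST, "w") as f:
        json.dump(manifest, f, indent=2)

def verify():
    """Compare the chunks on disk against the recorded manifest."""
    with open(MANIFEST) as f:
        manifest = json.load(f)
    for rel, expected in manifest.items():
        path = os.path.join(CHUNKS_DIR, rel)
        if not os.path.exists(path):
            print(f"MISSING  {rel}")
        elif os.path.getsize(path) != expected["size"]:
            print(f"SIZE     {rel}")
        elif md5_of(path) != expected["md5"]:
            print(f"CHECKSUM {rel}")

if __name__ == "__main__":
    verify() if "--verify" in sys.argv else scan()
```

Note that a script like this can only tell you that a chunk changed; it cannot tell you which copy is correct or repair anything.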

But doing so would be reinventing the wheel: verifying file consistency is a subset of a more general task, the filesystem scrub, which you should schedule periodically on your storage server if you care about data integrity. In contrast to a manual contraption built on per-file MD5 computation, a scrub can actually repair the inevitable rot.

It would not. Let's say Duplicacy somehow magically detected that a chunk is corrupted. Now what? Ideally you would want to restore it from a redundant copy, which Duplicacy does not know (nor should know) about; those are storage backend implementation details.

The point is: let the backup tool focus on making coherent backups, and let the storage service focus on providing consistency guarantees. It would be counterproductive, and generally impossible, to mix the two.


I don't really agree with your position on this, but I do thank you for your input. It is definitely food for thought.

I have a Synology RAID server that I use as one of my storages… I'll have to see if it offers some sort of scrub routine… I have not seen that yet.

Thanks again.

It is this discussion: Adding a feature to help with server side data integrity verification · Issue #330 · gilbertchen/duplicacy · GitHub

It is still on my radar, but not at this moment. I’ll be working on the memory usage optimization first.

If you have RAID with fault tolerance and a Btrfs filesystem with the data consistency box checked, then yes, it does. Make sure to schedule a periodic data scrub (Storage Manager | Storage Pool | Data Scrubbing). I use Synology as one of the destinations (and sources) too.

Thank you for this info. My Synology is configured with Btrfs, but I didn't have scrubbing on, nor did I even know about it… so thank you again.

Hi @gchen, thanks for keeping this on the radar. And thanks for a good product.