Considering Duplicacy with Wasabi

@kevinvinv, @gchen, no, I think that your understanding is in line with mine. I suppose reliability was a very poor word choice. What I had in mind with that post was the potential for frequently (but very minimally) updated files. I have been concerned that some may occasionally slip through the cracks - especially with a frequent backup interval (perhaps every 15 min or so, to approximate continuous backup like CrashPlan). I had thought that a full hash computation on each file might offer some greater assurance that nothing was being missed. Perhaps this is just paranoia?

I guess there are other issues I am more concerned about reliability wise than this...

As a rock-solid, 100% reliable backup (and restore) solution is my ultimate concern as well, I’m very curious what you have in mind and what you may be doing to address it?

Hi skidvd

My personal opinion is that I just back up once per day. I did that with Crashplan too. I don’t personally want the thing crunching on my CPU all day, every 15 minutes. But I see why you do, and would not criticize that decision…

My reliability concern with duplicacy is server-side corruption. It doesn’t do anything to make sure the backups are restorable and the chunk files are not corrupted.

You can make sure all the chunk files for a given snapshot exist… and that is pretty good, but you can’t easily (from the server side… as in a local NAS etc.) make sure the individual chunks have not been corrupted. The only way you can verify a chunk’s integrity is to download it back and then verify it. That is too costly.

Crashplan could always verify backup integrity b/c it was running a server-side app that could always verify checksums or hashes or whatever… but duplicacy doesn’t do that… instead it basically trusts that the server won’t corrupt the stored backup… a pretty good assumption in general, but that is indeed what I worry about.

gchen is planning on adding some hooks to make server-side chunk integrity verification more feasible, but they haven’t arrived yet.

Hi kevinvinv,

Yes, you raise a very good point. This is one of the double-edged swords I have been considering in my evaluation process… Duplicati, for example, does download some random chunks with each backup for this express purpose. However, as you noted, this is a rather expensive option - especially if you are charged for egress and/or API calls, as is likely the case anywhere other than Wasabi.

I am anxious to learn more about gchen’s plans and timeline in this regard.

So, if one were to periodically run check commands (with the -files option), say perhaps weekly, this will ensure snapshots have all required chunks. However, I’m not clear on how a server-side-only solution (I think you are referring to the repository source here - correct?) could be made to verify that a chunk was in fact safely transported to and stored on the remote cloud storage. I suppose it could verify a chunk checksum/hash before transport to ensure it was created reliably - is this what you are getting at? However, doesn’t that leave an opening for errors to be introduced (and, more importantly, missed without any means to verify) during transport and storage on the remote destination? Do Wasabi or other cloud-based storage providers provide any mechanism to calculate and retrieve a checksum/hash of a file on their storage, for comparison to what was sent?

IMHO, we should think more about the integrity of the files, and not about the chunks, which would be a consequence.

If you have a file that is broken into 5 chunks, and you check (by sampling) 4 of these chunks and they are all OK, but the last one is not (and you didn’t check it), you’ve lost the file. But if you check the file (which is what matters), you will know for sure whether everything is OK - or not.

I, for example, set up a script to download a few random files and compare them with local files. It is not 100% safe (nothing is), but it lowers the risk. Since I use Wasabi, I have no problem with download or API call charges.

@towerbr, nice approach/idea. I’m curious though how you randomly determine file(s) and at approximately what frequency you do this? Would any of this change in your mind as the size of your repository grew? Would you mind sharing your script?

I run it daily after the prune script.
It’s a very simple / ugly / hardcoded Windows batch (CMD) script. I’m gradually modifying my scripts to something more parameterized and better coded, but time has been a problem ;-). Since my backups are working, I confess I’m not prioritizing this.

It basically has these steps:

(remember that everything between % and ! are variables or constants)

  1. Read the files from the local repository, storing name and size in an array:
for /f "tokens=1* delims=\" %%a in ('forfiles /s /m *.* /p %repository% /c "cmd /c echo @relpath"') do (
  for %%f in (^"%%b) do (
    call set /a cont+=1
    set FILES[!cont!].name=%%~f
    set FILES[!cont!].size=%%a
  )
)
  2. Mark random files in the array for testing
for /L %%a in (1,1,%num_tests%) do (
    set /a "num=!random! %%max"
    set FILES[!num!].random=YES
)

(num_tests is a parameter with the number of files to be tested)

  3. Retrieves the last revision from the last log file:
for /f "tokens=1,2,3,4,5* delims= " %%m in (!LOG_FILE!) do (
    if "%%p"=="revision" (
        call set "var_revision=%%q"
    )  
)
  4. Downloads selected files to a temp folder
for /L %%a in (1,1,%cont%) do (
  if !FILES[%%a].random! == YES (
    call set file_to_rest=%%FILES[%%a].name%%
    call set file_to_rest=!file_to_rest:\=/!
    call duplicacy restore -ignore-owner -stats -r %var_revision% -storage %storage_name% "!file_to_rest!"
  )
)
  5. Generates hashes of downloaded files
for /L %%a in (1,1,%cont3%) do (
      call md5sum.exe "!file_full_path!" > file_MD5_%%a.txt
)
  6. Compares the hashes of the downloaded files with the hashes of the repository files
  for /f "tokens=1,2* delims= " %%m in (file_MD5_%%a.txt) do (
    call set hash_downloaded=%%m
  )
  for /f "tokens=1,2* delims= " %%m in (repository_MD5_%%a.txt) do (
    call set hash_repository=%%m
  )
  if !hash_downloaded! == !hash_repository! (
    REM some code for everything ok
  )

I know there are more elegant and optimized ways to do this, but for now it’s working very well; in the future I’ll make a “2.0 version” using PowerShell, Python or something else.
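
For reference, a minimal sketch of what that kind of spot-check could look like in Python (the paths, storage name, revision handling and test count below are illustrative assumptions, not a drop-in replacement for the batch script above):

#!/usr/bin/env python3
# Sketch of a random restore-and-compare check; all paths and names are examples.
# Assumes TEMP_REPO is a second directory already initialized against the same storage.
import hashlib
import random
import subprocess
from pathlib import Path

REPO = Path(r"C:\data\repository")        # local source repository (assumption)
TEMP_REPO = Path(r"C:\data\restore-test")  # temp repository pointing at the same storage (assumption)
STORAGE = "wasabi"                         # storage name as configured in Duplicacy (assumption)
REVISION = "123"                           # latest revision, e.g. parsed from the backup log
NUM_TESTS = 5                              # number of random files to spot-check

def md5(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

# Pick a few random files from the local repository.
all_files = [p for p in REPO.rglob("*") if p.is_file()]
for src in random.sample(all_files, min(NUM_TESTS, len(all_files))):
    rel = src.relative_to(REPO).as_posix()  # Duplicacy patterns use forward slashes
    # Restore just that file into the temp repository...
    subprocess.run(["duplicacy", "restore", "-ignore-owner", "-r", REVISION,
                    "-storage", STORAGE, rel], cwd=TEMP_REPO, check=True)
    # ...and compare the restored copy against the local original.
    status = "OK" if md5(TEMP_REPO / rel) == md5(src) else "MISMATCH"
    print(f"{status}: {rel}")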

This is an interesting conversation!!

When I mentioned “server side” verification - I admit the terminology wasn’t the best. I am not using any sort of cloud service; instead I have a NAS box at a remote location that I back up to. This is what I call the “server”, and I want to be able to run a routine on this remote NAS to verify the integrity of all the chunks backed up… if that makes sense.

:slight_smile:

I do like the random download checker too… that is a cool idea.

Yes, but this idea has a weakness: it doesn’t work for verifying the backup integrity of large files (a database, for example). It only applies well to “common” files.

Both the random file download and the chunk verification are interesting and can go some way towards reliability. Yet they both feel like attempts to work around something I would hope to be a baked-in, central feature - one that is apparently not there and/or is being questioned at the root of this discussion. Ideally, in terms of backup, I’d like to be able to:

  • know at the time of backup (as a direct consequence of the backup operation, not as a follow-on operation) that it either succeeded or failed to transmit the files to storage, and that the storage received, confirmed and stored the same
  • know, if a failure occurred relative to the above, exactly which file(s) were unsuccessful

The random file download and chunk verification are both after the fact. While they may indeed point out problems before a potential future restore request would fail, they do not identify the problem as soon as it happens. Additionally, they are not checking each and every file/chunk as it is stored, so there is always a chance of something being missed.

What I am not clear on is whether or not the cloud storage APIs provide any means of requesting a cloud-side checksum of some sort. Perhaps @gchen can shed some light here - or perhaps much more than it appears is already happening? What I’d hope is that the backup operation would, in essence, not only look for errors during source-side operations and file transmission, but also take a further confirmation step to double-check that the cloud side actually has the correct contents, via a checksum or similar computation cross-checked with the source. Does this make sense?

I’ll just state the obvious I guess. CP could do everything we want in this regard b/c the backup receiver was running an app and could do continuous monitoring and “healing” and all of that.

I suspect that tools like this (and many, many others that back up to the cloud) can’t really do much on the receiver end b/c they can’t run a program on the receiving computer to calculate hashes and checksums and all of that.

The best that can be done is to download the file back and then check to be sure it was without error.

This is one reason I like not backing up to the “cloud” but instead backing up to my own remote computers so that I CAN EVENTUALLY check the backup integrity at the receiver… hopefully :slight_smile:

I believe that I have read that the Wasabi API is 100% compatible with the S3 API…

While I have not been able to locate something similar (with admittedly brief searches) for Wasabi, this link for S3 suggests there may actually be something possible checksum-wise: [https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/]. Excerpt… “To ensure that S3 verifies the integrity of the object and stores the MD5 checksum in a custom HTTP header you must use the --content-md5 and --metadata arguments with the appropriate parameters.”
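
For what it’s worth, here is a rough sketch of that mechanism using boto3 against an S3-compatible endpoint (the bucket, key and endpoint are placeholders, and this is not how Duplicacy itself uploads): if the bytes received by the service don’t match the client-computed MD5, the request is rejected and nothing is stored.

# Sketch: upload with a Content-MD5 header so the S3-compatible service verifies
# the received bytes against the client-side MD5. Names below are placeholders.
import base64
import hashlib

import boto3

def upload_with_md5(path: str, bucket: str, key: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    # Content-MD5 must be the base64-encoded binary MD5 digest of the payload.
    md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    s3 = boto3.client("s3", endpoint_url="https://s3.wasabisys.com")
    # The service recomputes the MD5 on receipt and returns BadDigest on mismatch.
    s3.put_object(Bucket=bucket, Key=key, Body=data, ContentMD5=md5_b64)

upload_with_md5("chunk.bin", "my-backup-bucket", "chunks/abc123")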

I’m curious what @gchen, @markfeit (or others) may be able to offer in terms of the present state of things and perhaps future plans in this regard. I don’t presently see any reference to ‘md5’ in either duplicacy_wasabistorage.go or duplicacy_s3storage.go (where most of the implementation appears to be shared), so I’m guessing that this is presently not being utilized - however, I’d be happy to learn I am incorrect, as I am certainly no expert on the Duplicacy code.

This Duplicacy issue appears to speak to this as well: [https://github.com/gilbertchen/duplicacy/issues/205]

See the notes about hashes here: Overview of cloud storage systems

If you don’t trust duplicacy, use rclone to sync your local backup storage to the cloud provider.

It’s not that I don’t trust duplicacy. There are just too many moving pieces so I feel that a “trust, but verify” approach would be advisable.

There is no 100% error-free software, neither Duplicacy nor Rclone, and I use both (for different file groups).

Replacing Duplicacy with Rclone does not mean that it will be more reliable.

I always adopt the practice of “trust, but verify” :wink:

“Replacing Duplicacy with Rclone does not mean that it will be more reliable.”

This will make sure that the cloud upload is more reliable. Having a backup both locally and in the cloud will make the backup more reliable and accessible.
It also means that you will be able to easily and cheaply verify your backup - first against the local backup, then the local copy against the cloud copy.

Duplicacy stores both file hashes and chunk hashes in backups so it is able to detect any corruption. The issue you referred to was more of a Wasabi bug (which was hard to believe at first, given the seriousness of the bug), but fortunately it was correctable by retrying a few times (because the chunk stored in Wasabi wasn’t corrupted).

The main reason we don’t use the MD5 hash provided by S3 is that it doesn’t extend to other cloud storages. The resulting disadvantage is that you’ll need to download a chunk in order to verify its content. On the other hand, the S3 MD5 hash isn’t reliable anyway, since it doesn’t work for multi-part files (Duplicacy doesn’t support multi-part uploads, but it would run into problems if the chunks were uploaded by other tools).
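
To illustrate that caveat with a rough sketch (placeholders throughout, not Duplicacy code): an S3-compatible service exposes the stored object’s MD5 only indirectly via the ETag, and only for single-part, non-KMS-encrypted uploads; for multi-part uploads the ETag is an MD5-of-MD5s with a "-N" suffix, so it can’t be treated as a general-purpose content hash.

# Sketch: compare a locally computed MD5 against the stored object's ETag.
# Only valid for single-part uploads without SSE-KMS; multi-part ETags are not plain MD5s.
import hashlib

import boto3

def etag_matches_local(path: str, bucket: str, key: str) -> bool:
    with open(path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    s3 = boto3.client("s3", endpoint_url="https://s3.wasabisys.com")
    etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    if "-" in etag:
        # Multi-part upload: the ETag is not a plain MD5, so this check is inconclusive.
        return False
    return etag == local_md5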

@gchen, thanks for the info!

Duplicacy stores both file hashes and chunk hashes in backups so it is able to detect any corruption.

I guess the main question regarding this is when the corruption detection would happen. Is it during the backup operation itself? Or not until/unless a subsequent restore, check, or some other operation triggers it? Ideally, it would happen during the backup and return a failure code so that it is known at that point.

Additionally, whenever the detection is triggered, exactly what is being compared? The hash of the original repository source file/chunk against the hash stored locally for the backup? Ideally, this would compare the hash of the original repository source file/chunk against a hash of what is actually in cloud storage (thereby verifying the full loop).

The main reason we don’t use the MD5 hash provided by S3 is that it doesn’t extend to other cloud storages.

From a quick look at the source, it appears as if you already have customized implementations for different providers. It appears that S3 and Wasabi are largely shared, but still separate from other non-S3 types. Wouldn’t this mean that both S3 and Wasabi would be in a position to be enhanced to use MD5 without affecting the others - especially since, as you said, Duplicacy does not support multi-part uploads?

The corruption detection happens when you run restore or check -v.

All the storage backends only implement a basic set of file operations. The verification logic is done solely in the higher layer. Of course, it is possible to augment the S3 backend to perform some MD5-based verification.
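
Purely as an illustration of that layering (hypothetical interfaces, not Duplicacy’s actual code), it might look something like this, with the MD5 check as an optional extra inside the S3 backend and the hash comparison living in the layer above:

# Hypothetical layering sketch - not Duplicacy's actual interfaces.
import base64
import hashlib
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Basic file operations that every backend implements."""
    @abstractmethod
    def upload(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def download(self, path: str) -> bytes: ...

class S3Backend(StorageBackend):
    def __init__(self, client, bucket: str):
        self.client, self.bucket = client, bucket

    def upload(self, path: str, data: bytes) -> None:
        # Backend-specific augmentation: the service rejects the upload
        # (BadDigest) if the bytes it receives don't match this MD5.
        md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
        self.client.put_object(Bucket=self.bucket, Key=path, Body=data,
                               ContentMD5=md5_b64)

    def download(self, path: str) -> bytes:
        return self.client.get_object(Bucket=self.bucket, Key=path)["Body"].read()

def verify_chunk(backend: StorageBackend, path: str, expected_sha256: str) -> bool:
    # Higher-layer verification (the restore / check -v style of check):
    # download the chunk and compare its hash to the one recorded in the snapshot.
    return hashlib.sha256(backend.download(path)).hexdigest() == expected_sha256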

I created https://github.com/gilbertchen/duplicacy/issues/455 to request an enhancement to make this an integral part of the backup operation (instead of needing to wait until a restore or check -v to identify potential errors).

Sorry I missed this; I wasn’t even aware there was a forum. :slight_smile:

I’ve just commented on #455 and will add a couple of things here:

First is that I don’t think a full, read-back-and-verify check during backup is necessary. As I pointed out in the ticket, the path between your host and the cloud storage is already very reliable, as is the storage itself (barring mismanagement and flawed engineering) and the return path. The probability of data loss in both places at the same time is unbelievably low. I’ve been with Wasabi since December and am happy so far. That includes doing some restores to see how it’s working.

Second is that you shouldn’t be dependent on one kind of storage for your backups, especially if it’s all under the same roof and expensive to do a full verification. The 3-2-1 strategy (3 copies, 2 on-site [original and one backup] and 1 off-site) is easy, inexpensive and reliable for most applications. If things reach the point where your on- and off-site backups are destroyed, there are probably much bigger things to worry about than the loss of your data.

Just from a cost perspective, it’s a bargain to maintain a local copy for restores so you don’t have to burn a lot of time and download fees to pull a few TB from the cloud. My setup for that is a 4 TB drive in the machine holding most of the data that receives local backups and those from my other systems (one of which is off-site) via SFTP. The drive itself is in a removable sled and my family knows where it is and how to remove it should they need to evacuate the house when I’m not home. My big qualm right now is that I live close enough to Ashburn, VA (Wasabi’s us-east-1) that a large-enough natural disaster will take out all three copies. They’ve just spun up another data center in Hillsboro, OR (us-west-1) but haven’t been clear about whether that will provide redundancy for Ashburn or if I’ll have to move my data there to get the geographic diversity I want. Even with all of that reliability, Wasabi is still my secondary backup.