Graceful Handling of Failed and Incomplete Snapshots

I have storage arrays that come online and go offline as needed. Sometimes weeks go by without them connecting to the server (they hold 20-24 spinning drives, over 100TB of files, and pull a lot of power while running). I have backups scheduled in the Web UI to run every few days that include source data living on those drives.

I noticed that if the schedule fires off a backup while the storage arrays are off, it results in a failed backup. Fair enough. However, when the next backup runs with the array connected, it scans things incredibly slowly (-hash?). Instead of the backup taking 20 minutes or so to scan contents and start uploading, it takes upwards of 12 hours just to scan.

If there is already a way to prevent this I would love to know about it. Otherwise, it would be great if :d: would more gracefully / intelligently handle this situation (along with incomplete snapshots, which seem to also force a full scan to see what has and hasn’t been uploaded).

I would hate to have to manually run all these backups in order to avoid all that scan time.

…thread with similar issues:


Are you using the -hash option? If so, you’re forcing Duplicacy to re-read and hash all files from disk, rather than just looking at their metadata to determine what has changed.

If you are using -hash, I would suggest dropping it for almost all backup runs. The initial backup implicitly behaves like -hash because there is no previous snapshot to compare against, but that doesn’t apply to subsequent backups.
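For reference, the CLI equivalents look roughly like this; the Web Edition drives the same CLI engine underneath, so treat the exact invocations as an illustration rather than what your schedule runs verbatim:

    # Normal incremental backup: only files whose size or modification time
    # changed since the last snapshot get re-read and uploaded.
    duplicacy backup -stats

    # With -hash, every file is re-read and re-hashed, which is what makes
    # the scan take hours on a 100TB array.
    duplicacy backup -hash -stats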


No. Sorry if I made that confusing; I was just speculating that the failed backup might be triggering a -hash-style backup, since it behaves like one (and only after a failed backup, or if a backup in progress is cancelled). I believe revisions are flagged somewhere in the logs as to whether a hash-level scan was enabled, so the next time it happens I’ll check.

I cannot recreate this behavior. Is this the affected workflow?

  1. Start backup
  2. Cancel it immediately
  3. Start it again
  4. Now the entire backup starts over from scratch?

If you have any check jobs running, you can see whether -hash was used for the backup in the tabular section of the log file. For example,

       snap | rev |                          | files |   bytes | chunks |   bytes |  uniq |   bytes |  new |     bytes |
 folderrrrr |   1 | @ 2019-01-05 06:36 -hash |    69 | 21,586K |      8 | 19,386K |     3 |      9K |    8 |   19,386K |
 folderrrrr |  39 | @ 2019-02-10 06:17       |    70 | 21,589K |     11 | 20,112K |     5 |    494K |    6 |      735K |

If you look at the backup log file, you’ll see all files when using the -hash option. Without it, you should only see new/changed files – as determined by file size and/or modification time.
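If it’s easier than reading the table by eye, you can also grep for it. This assumes the check output is saved to a file somewhere; check-log.txt below is just a placeholder for wherever your log ends up:

    # Produce the tabular summary, then list only the revisions that
    # were created in hash mode.
    duplicacy check -tabular > check-log.txt
    grep -- '-hash' check-log.txt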

Is it possible that some other application is updating all of the modification times on the files on your storage arrays between the last successful backup and the one that is taking a long time? This would cause the files to need to be re-hashed in order to detect any changes.

I think one possibility is that when the storage arrays are off, Duplicacy sees an empty directory, so rather than a failed backup it uploads a new backup that contains no files. The next backup then has to start from scratch, re-reading everything even though it won’t actually upload any new chunks (basically the equivalent of -hash).
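If that’s what’s happening, it should be visible in the revision itself: a snapshot taken while the arrays were off would record few or no files. One rough way to confirm from the CLI (revision 40 here is just an example number):

    # List the files recorded in a suspect revision; an "empty" snapshot
    # created while the array was offline should show little or nothing.
    duplicacy list -r 40 -files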


I also thought this could be the case, so here’s a probably dumb? solution: can we instruct a repository to “skip uploading a new revision if 0 files are available”?


If what you both are suggesting is true, not being able to find the volume that holds the repository seems like something that should immediately stop a backup/snapshot from being created, similar to what happens with an invalid option: it doesn’t say the backup “failed,” it just throws an error (“invalid options”) and stops without doing any damage.

I actually use that all the time in the Web UI, where I have special-case commands (like super aggressive pruning) set up in a schedule but with one additional “bad” option so they won’t run unless I remove it. :d: just skips that job and moves on to the next one in the schedule, no harm, no foul.

I will test at some point over the weekend to determine which situations produce which behaviors in :d: (so far everything has been anecdotal observation from using it for a few months). I’m just in the middle of a bunch of projects, so I can’t take that storage array offline at the moment.


Maybe a more elegant solution would be to compare the number of files being backed up to the previous snapshot? This is essentially how a human might notice anything amiss when running a check -stats and looking at the number of files column.

Perhaps return with a FAIL exit code if num_files = 0 and a WARN exit code if num_prev - num_files > X (where X may be 1000 or a configurable number)? After all, the storage could go down mid-backup.

Obviously, if Duplicacy can’t access files it should hard fail, but this might not always be possible if symbolic or hard links are involved.
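Until something like that is built in, a wrapper around the CLI could approximate the same idea. The sketch below assumes the repository was initialized in the source directory; the path, state file, and threshold are invented for illustration:

    #!/bin/sh
    # Skip the backup entirely if the source looks empty, and warn if the
    # file count dropped sharply since the last run. Everything below
    # (paths, threshold) is an example, not a recommendation.
    SRC=/mnt/array1
    STATE=/var/tmp/duplicacy-filecount
    THRESHOLD=1000

    count=$(find "$SRC" -type f 2>/dev/null | wc -l)
    prev=$(cat "$STATE" 2>/dev/null || echo 0)

    if [ "$count" -eq 0 ]; then
        echo "FAIL: no files found under $SRC, skipping backup" >&2
        exit 1
    fi

    if [ "$prev" -gt 0 ] && [ $((prev - count)) -gt "$THRESHOLD" ]; then
        echo "WARN: file count dropped from $prev to $count" >&2
    fi

    echo "$count" > "$STATE"
    cd "$SRC" && duplicacy backup -stats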


Where would it upload files if the destination is not accessible? Or rather, where would it record this “empty” backup? Is it kept in the snapshots folder in the cache? If so, maybe a temporary workaround would be to clean those files from the cache before each run?

I think by ‘storage arrays’ he’s referring to the repository source disks here - not the destination storage.

Oh, I see… In that case it’s quite possible to get an empty but valid backup…
What would help here is a pre-backup script that can fail the entire execution if the source is not available…
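For the CLI at least, something like this already exists: a script named pre-backup placed in the repository’s .duplicacy/scripts directory runs before each backup, and (if I’m reading the docs right) a non-zero exit code aborts the job. A minimal sketch, with /mnt/array1 standing in for the real mount point; I’m not sure how the Web UI exposes its .duplicacy directories, so treat this as CLI-only:

    #!/bin/sh
    # .duplicacy/scripts/pre-backup
    # Abort the backup before any snapshot is created if the storage
    # array isn't mounted. mountpoint is Linux (util-linux); adjust the
    # test for other platforms.
    if ! mountpoint -q /mnt/array1; then
        echo "Storage array not mounted, aborting backup" >&2
        exit 1
    fi
    exit 0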