How to Best Manage High Frequency Backups

ISSUE
Duplicacy is creating hundreds of new snapshots every day, with only a few actually containing new data (and thus a reason for existing).

PROPOSED SOLUTIONS
(1) Add a backup option (e.g. -skip-no-unique) which does not create a new snapshot if the scan of the repository reports no changes to the data.

(2) Enable sub-day pruning rules. E.g.: keep one revision per hour for revisions older than 1 day (where 1 day = a duration of 24 hours, not just “yesterday” on the calendar).

WHY
I recently started using Duplicacy on a few client computers to back up directly to the same “storage” on Google Drive. They back up very important project files which change often, so the interval between backups is short - currently every 5-10 minutes. I have already set up a script to run the backups which is working great (CLI 2.7.2), but my issue is the hundreds of snapshots that have already accumulated after just a few days.

I could adjust my pruning to be more aggressive earlier in the ageing of the snapshots, but that’s just more hammering on the storage with prune operations, which take a surprisingly long time on this storage (maybe because many IDs back up to the same storage, so more calculations are needed??), all while backups continue to accumulate in the background.

I also worry about more aggressive pruning because of what I would consider Duplicacy’s poor handling of backup plans with more than one backup per day. You really only have two options: save all of your backups from a single day, or save one. Also, as discussed in Prune -keep 1:1 keeps oldest revision, and as I am already seeing in my tests, Duplicacy chooses the oldest snapshot from each day (the one with the oldest data) as the one to save instead of the newest. Those two attributes are a nasty combination for sub-day backup intervals, and I don’t understand the logic in either of them. Does it stem from Duplicacy working at a calendar-day level when deciding what to prune, as opposed to a duration of time? To put it another way: if you run a prune at 12:01 AM, does it consider the backup taken two minutes earlier at 11:59 PM to be a day old or two minutes old?

What I would love is the ability for Duplicacy to simply skip the backup procedure when there are no new files to back up, and not produce a new snapshot. For me this would reduce the number of snapshots by more than 99%. I’d like to hear if there are downsides to doing this.

Another benefit of skipping empty snapshots would be for restores. Duplicacy is lacking in its ability to easily browse, search, and review the contents of snapshots (a frequently requested feature, most recently here: Feature Request: Search Backups for File / See File History). It would be much easier to go back through snapshots knowing that at each snapshot new data was added. Otherwise I have to run other commands to generate detailed snapshot lists, scroll through hundreds of snapshots looking for changes to “unique items,” note those revision numbers, and then go back and start restoring (I also can’t for the life of me make the history command play a helpful role in restores).

Regarding sub-day pruning, I do think it should be available, as others have expressed a need for it (e.g. Enhancement proposal for prune command) and I have found it useful with other backup systems (…pretty much all of them offer something comparable). I do worry about it as a solution for my needs, though, given my previously-mentioned issues with :d: pruning in general.

I would use my local backup system (Urbackup) to manage these high-frequency backups, but I want them to get to the cloud ASAP, and as discussed here, I cannot get Duplicacy to play nice with that local backup destination because of the way Urbackup structures its destination storage.

Am I being overly nit-picky? Should I just shut up and let :d: make hundreds (OK, as of posting I have crossed into the thousands) of snapshots?

Here’s a snapshot list after running for a few hours to give you an idea of what I’m talking about:

 9700_PRODUCTION_BACKUP_01 |  49 | @ 2021-02-07 21:26       |   460 | 1,774M |    334 | 1,319M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  50 | @ 2021-02-07 21:35       |  1295 | 1,889M |    352 | 1,358M |    3 |     149K |  21 | 39,972K |
 9700_PRODUCTION_BACKUP_01 |  51 | @ 2021-02-07 21:38       |  2075 | 1,890M |    353 | 1,358M |    4 |     515K |   4 |    515K |
 9700_PRODUCTION_BACKUP_01 |  52 | @ 2021-02-08 02:31       |   513 | 1,864M |    353 | 1,368M |    3 |      84K |  11 | 25,308K |
 9700_PRODUCTION_BACKUP_01 |  53 | @ 2021-02-08 02:34       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   4 |     86K |
 9700_PRODUCTION_BACKUP_01 |  54 | @ 2021-02-08 02:45       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  55 | @ 2021-02-08 02:55       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  56 | @ 2021-02-08 03:05       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  57 | @ 2021-02-08 03:15       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  58 | @ 2021-02-08 03:25       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  59 | @ 2021-02-08 03:35       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  60 | @ 2021-02-08 03:45       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  61 | @ 2021-02-08 03:55       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  62 | @ 2021-02-08 04:05       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  63 | @ 2021-02-08 04:15       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  64 | @ 2021-02-08 04:25       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  65 | @ 2021-02-08 04:35       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  66 | @ 2021-02-08 04:46       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  67 | @ 2021-02-08 04:56       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  68 | @ 2021-02-08 05:06       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  69 | @ 2021-02-08 05:16       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  70 | @ 2021-02-08 05:26       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  71 | @ 2021-02-08 05:36       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  72 | @ 2021-02-08 05:46       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  73 | @ 2021-02-08 05:56       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  74 | @ 2021-02-08 06:06       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  75 | @ 2021-02-08 06:16       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  76 | @ 2021-02-08 06:26       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  77 | @ 2021-02-08 06:36       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  78 | @ 2021-02-08 06:46       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  79 | @ 2021-02-08 06:57       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  80 | @ 2021-02-08 07:07       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  81 | @ 2021-02-08 07:17       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  82 | @ 2021-02-08 07:27       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  83 | @ 2021-02-08 07:37       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  84 | @ 2021-02-08 07:47       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  85 | @ 2021-02-08 07:57       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  86 | @ 2021-02-08 08:07       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  87 | @ 2021-02-08 08:17       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  88 | @ 2021-02-08 08:27       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  89 | @ 2021-02-08 08:38       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  90 | @ 2021-02-08 08:48       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  91 | @ 2021-02-08 08:58       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  92 | @ 2021-02-08 09:08       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  93 | @ 2021-02-08 09:18       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  94 | @ 2021-02-08 09:28       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  95 | @ 2021-02-08 09:38       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  96 | @ 2021-02-08 09:48       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  97 | @ 2021-02-08 09:58       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  98 | @ 2021-02-08 10:08       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 |  99 | @ 2021-02-08 10:18       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 100 | @ 2021-02-08 10:29       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 101 | @ 2021-02-08 10:39       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 102 | @ 2021-02-08 10:49       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 103 | @ 2021-02-08 10:59       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 104 | @ 2021-02-08 11:09       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 105 | @ 2021-02-08 11:19       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |
 9700_PRODUCTION_BACKUP_01 | 106 | @ 2021-02-08 11:30       |   514 | 1,864M |    354 | 1,368M |    0 |        0 |   0 |       0 |

This isn’t a problem for me at the moment since, on my own PC, I only do backups every 2 hours, and on my clients’ systems only once or twice a day (relying on shadow copies for finer resolution). However, now I’m realising this choice is partly down to my knowledge of how Duplicacy does its pruning…

IMO, sub-day pruning rules would actually be a nice thing to have. For me, I could then extend the period for how long to keep daily backups (currently -keep 1:14) to much more than 2 weeks, and I could run more regular backups without significantly bloating the number of snapshots for that time period. We could also have more levels: e.g. -keep 3h:60 -keep 15m:14.
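
To illustrate what I mean, here is a rough sketch (illustrative only - this is not Duplicacy’s actual prune code; the rule values and the “keep the newest revision per bucket” behaviour are my own choices) of how sub-day keep rules like those could decide which revisions to retain:

```python
from datetime import datetime, timedelta

# Illustrative only - NOT Duplicacy's actual prune logic. Each rule means:
# "for revisions older than `age`, keep one revision per `interval`",
# and the NEWEST revision in each interval bucket is the one retained
# (the opposite of what I currently observe with day-level pruning).
RULES = [                                    # (interval, applies to revisions older than)
    (timedelta(days=1),     timedelta(days=14)),
    (timedelta(hours=3),    timedelta(days=2)),
    (timedelta(minutes=15), timedelta(hours=24)),
]

def revisions_to_keep(timestamps, rules=RULES, now=None):
    """Return the subset of snapshot timestamps these rules would retain."""
    now = now or datetime.now()
    # Check the oldest cutoff first so each revision falls under the
    # coarsest rule that applies to it.
    rules = sorted(rules, key=lambda r: r[1], reverse=True)
    keep = set()
    newest_in_bucket = {}                    # (rule index, bucket number) -> timestamp

    for ts in timestamps:
        age = now - ts
        match = next(((i, iv) for i, (iv, cutoff) in enumerate(rules) if age >= cutoff), None)
        if match is None:
            keep.add(ts)                     # newer than every cutoff: keep it
            continue
        i, interval = match
        bucket = (i, int(ts.timestamp() // interval.total_seconds()))
        if bucket not in newest_in_bucket or ts > newest_in_bucket[bucket]:
            newest_in_bucket[bucket] = ts    # newest revision wins the bucket

    return keep | set(newest_in_bucket.values())
```

With these example rules, everything from the last 24 hours is kept, then the history thins out to one revision per 15 minutes, one per 3 hours, and finally one per day.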

Either way, I don’t expect this feature alone to improve the performance of prunes and maintenance checks, but I think it might be necessary if Duplicacy were to extend its feature set - e.g. having a local db for the web UI to more easily find files for restoration.

Furthermore, a -skip flag would also be useful, perhaps with greater integration with the web UI to flag such scheduled backups as ‘no change’. However, a time-based option might be required (e.g. -skip 1h) so that it still generates at least one backup per that interval - even if there were no changes.

Alternatively - and this is my preferred option, because even CrashPlan did it - implement a real-time fs watcher. With such a feature, you could enable near-continuous data protection and have it back up as soon as changes were detected in the repositories (with a suitable grace period to accumulate multiple changes). (Although the watcher might detect a change - a lock file, say - and there’d still be no changes by the time of the backup, so a -skip option would perhaps be required to complement a watcher.)

Just a comment in the middle of the discussion: maybe the “boundaries” of a backup service are being forced here, and in fact the ideal in this scenario would be synchronization + versioning (like Google Drive and others)?

Yea I guess that depends on the nature of the data you’re backing up.

It’s true, you don’t normally need so much revision history, but that’s kinda the point - this thread is about minimising those revisions while being able to back up more regularly, so you don’t lose much data since the last good backup.

A workaround is to parse the output of check -tabular and remove the revisions that have no unique or new chunks.

In the long run, I think a file system monitor should be the solution as @Droolio pointed out. In fact, I have been thinking about building the next generation of Duplicacy, which will go for ‘continuous backup’ instead of being snapshot-based, among other things.


It would be imperative to retain the current behavior too, and make live filesystem monitoring optional - for many reasons. Please don’t murder a good product.

One of the reasons I picked Duplicacy was specifically because it did not have a runtime component and did not do continuous backup (from a performance and maintenance perspective).

How is the user expected to bisect the history if it is not done in a predictable way, on a constant cadence? Picking up every change will generate a ton of noise for rapidly changing data; it’s never appropriate.

Live monitoring of any kind, no matter how lightweight, will absolutely murder performance for most use cases I have. I had to go out of my way to stop CrashPlan’s filesystem monitor (which was very efficient and simply collected a list of changed files to include in the next periodic backup run) because it would tax whatever I was doing by 60-100%. Besides, adding another runtime component is a no-go for laptop users. And finally, inotify tends to be horrendously unreliable, and you would need full-sweep support anyway to pick up missed bits and pieces - just like CrashPlan did.

Basically, the current implementation and approach are perfect. I back up every 15 minutes and don’t see an issue in having 2000 identical snapshots. It’s not a problem that needs solving, and it’s not a deficiency; it’s one of the killer features of the app.


This has splintered quickly!

I’m going to sidestep talks of a completely new implementation of Duplicacy as a real-time, fs-watching service, only to say that though it would help address my issues, I doubt I’d see any such tool in the near future and I have similar sentiments as @saspus toward other “always watching” solutions for my use cases.

Are you saying to run check -tabular and then manually note and remove each non-unique snapshot? I was referring to this process in my explanation of how horrible such a thing would be when you have hundreds of empty snapshots. Is there a way to fold the results of check -tabular into a prune command, or something to automate that?

I wouldn’t go as far as to say I am misusing :d:, but I understand your point. I use other similar tools in my larger workflow; however, the combination of all my specific needs rules many of them out or creates overly complex setups. The benefits and reasons anyone else would use Duplicacy are still also true for this use case: I can easily filter through terabytes of data to flag a few specific file types for backup, I can send those files to multiple storages, both local and offsite, I get multi-machine de-duplication, encryption, etc., all user-configurable via the command line. I also don’t consider my suggestions to be outside of the core use and mission of Duplicacy as a snapshot-based backup tool.

To boil it down…

  1. Is there a downside to the user or an issue in implementing an option such as -skip-no-unique?
  2. Could sub-day features be improved in Duplicacy? Specifically: (1) save the newest (as opposed to the oldest) snapshot when pruning at the day level, and (2) allow sub-day pruning rules?
  3. Are there legitimate reasons to worry about having Duplicacy constantly accumulate thousands of snapshots from multiple computers on the same cloud-based Duplicacy storage, while also running daily prune operations? Are there best practices for such a thing? For example, I noticed it’s not recommended to run prunes from different machines, but what if I want different pruning rules? Should I just use a different storage for each machine if I want that level of control?

…and regardless of the answer to #3, I see valid reasons for 1 & 2 and though it may be a bit naive, #1 seems easy and harmless :innocent:

Thanks for the great conversation on this.

I agree that this option would improve check and prune performance, and even the search for the latest revision at the beginning of each backup. As the revision file is generated at the end of the backup, after going through the files, IMHO I don’t see any problems.

I think, basically, just performance.

You must prevent a prune operation from being performed during a backup for the same snapshot ID.
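
If both jobs are launched from the same machine, a simple lock file around the commands is enough to avoid that overlap. A rough sketch (the lock and repository paths are placeholders for illustration, not anything Duplicacy itself provides):

```python
import os
import subprocess
import sys
import time

# Rough sketch: serialize duplicacy backup and prune runs on one machine with
# a lock file, so a prune never overlaps a backup of the same snapshot ID.
# LOCK and REPO are placeholder paths - adjust for your setup.
LOCK = "/tmp/duplicacy-9700.lock"
REPO = "/path/to/repository"

def run_exclusively(args, timeout=3600):
    """Wait up to `timeout` seconds for the lock, run duplicacy, release it."""
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break                                    # lock acquired
        except FileExistsError:
            if time.time() > deadline:
                sys.exit("another duplicacy job is still running; giving up")
            time.sleep(10)
    try:
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return subprocess.run(["duplicacy"] + list(args), cwd=REPO).returncode
    finally:
        os.remove(LOCK)

if __name__ == "__main__":
    # e.g.  python duplicacy_job.py backup -stats
    #       python duplicacy_job.py prune -keep 1:14
    sys.exit(run_exclusively(sys.argv[1:]))
```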

I think it will depend on your level of deduplication usage. If you separate the storages, you will lose the deduplication gains. But if the files in the different repositories are different anyway, you would not have had that deduplication gain either way.

You must have something wrong with your system, because I’ve never seen any such problems on numerous Windows and Linux systems.

In fact, the whole point is to reduce resource use and the need to rescan deep directory trees. All it does is detect changes in the fs - what the software does after that may impact performance, but that has nothing to do with the watcher. An accumulating grace period solves that easily.

Support for fs watchers is cross-platform, built into the OS, and plenty of software uses it to great effect (SyncThing I know for sure, as it’s open source, but also very likely the likes of OneDrive and other cloud sync tools pretty much have to use it). They all do regular scans to pick up anything that was missed.

I don’t see the problem here, so long as there are periodic full scans.

Plus, I’ve previously outlined why relying on indexing alone isn’t scalable. On my system, the fs cache gets emptied quite often after running certain games, so indexing takes long enough that reducing my backup interval from 2h to 15mins becomes difficult.

No worries. The snapshot-based design isn’t going away. I just feel that ‘continuous backup’ has an advantage in that the prune rule can be very simple (for example, delete backups older than n days and keep everything else).

You’ll need to write a script to do that.
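
For example, something along these lines might work (a rough sketch, untested; it assumes the column layout from the listing earlier in the thread, where the revision number is the second column and the unique/new chunk counts are the 8th and 10th, and it should be run from inside the initialised repository):

```python
import subprocess

# Rough sketch: prune every revision that `check -tabular` reports as having
# zero unique and zero new chunks. Column positions are assumed from the
# listing earlier in this thread; SNAPSHOT_ID is just the example ID from above.
SNAPSHOT_ID = "9700_PRODUCTION_BACKUP_01"

def empty_revisions():
    out = subprocess.run(["duplicacy", "check", "-tabular"],
                         capture_output=True, text=True, check=True).stdout
    revisions = []
    for line in out.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 11 or cols[0] != SNAPSHOT_ID or not cols[1].isdigit():
            continue                             # header, totals, or other IDs
        unique_chunks, new_chunks = cols[7], cols[9]
        if unique_chunks == "0" and new_chunks == "0":
            revisions.append(cols[1])
    return revisions

if __name__ == "__main__":
    revs = empty_revisions()
    if not revs:
        print("no empty revisions found")
    else:
        print("pruning revisions:", ", ".join(revs))
        cmd = ["duplicacy", "prune", "-id", SNAPSHOT_ID]
        for r in revs:
            cmd += ["-r", r]
        subprocess.run(cmd, check=True)
```

Obviously double-check the revision list it prints before letting anything like this loose on a real storage.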

Updating the “to boil it down” list:

1. No. Given that this seems like the lowest-hanging fruit, can someone comment on the likelihood of such a feature being implemented? Is there any way I can help make this happen?

2. No update. Do the Duplicacy powers-that-be have a take on implementing this, or on what a realistic roadmap would look like? If it’s probably not going to happen any time soon, that’s fine - I’m just trying to make an educated decision here and I need information to do that.

Here’s a peek at how Kopia implements pruning:

[screenshot of Kopia's snapshot retention policy settings]

With a 5-minute backup interval, the configuration shown saves the most recent 12 snapshots (an hour’s worth), then prunes down to one per hour for the next 48 hours, and so on. It does not skip empty backups, but by default it filters out identical snapshots from the snapshot list, which is a nice bonus. I was able to go from browsing their documentation to downloading and setting up something that worked for my needs in about 15 minutes. Then again, I still don’t know much about it, so I don’t trust it yet - I’m just sayin’…

3. No, outside of performance. Just don’t let prunes run while you’re backing up (and vice versa??). That’s news to me, so thanks.


Re: fs watchers, I’m sure when done correctly it would be all pros and no cons. I’ve had issues with conflicting states of files on sync and backup systems, but that was probably just poor implementation. For example, saving out a massive .psd file triggers a sync, but that save process is multi-step and slow, and the half-saved file is replicated before the save has completed. A silent, intelligent, low/no-resource fs watcher that allows lots of user configuration to suit specific use cases sounds like a great addition or alternate product. But ya, just don’t take anything away from our precious OG :d:
…also, this whole topic is only tangentially related to the OP, and its sexy, hot-button nature is stealing focus from my selfish needs! If it is in any way a seriously considered addition to Duplicacy, it would be more than deserving of its own thread as a feature request (and I will immediately follow said thread!).
