Change the order in which files are backed up

I had used Crashplan since they started, and recently switched to Duplicacy (after a months-long research and evaluation phase). Before Crashplan, I used some other popular backup products I don’t even remember anymore.

Crashplan and the others backed up newest files first. It wasn’t configurable, but that’s what I wanted anyway. (Crashplan’s ordering is supposedly a little more complex than that, also taking file size into account, but clearly weighted more towards modification date.)

Duplicacy seems to back up in alphabetical path order. (Or by node ID? Or in whatever order the filesystem returns entries to whatever APIs you’re calling?)

It would be fantastic if the file backup priority were configurable! Even simple options would do, such as a single choice among:

  • Date: (oldest|newest) first
  • Size: (largest|smallest) first
  • Alphabetical by path (asc|desc)
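
For illustration only, here’s a hypothetical CLI syntax (the -sort option below does not exist in Duplicacy; the flag name and values are entirely made up):

  duplicacy backup -storage b2 -sort mtime:desc   # hypothetical: newest files first
  duplicacy backup -storage b2 -sort size:asc     # hypothetical: smallest files first
  duplicacy backup -storage b2 -sort path:asc     # hypothetical: roughly today's apparent behavior, made explicit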

Why? Consider my use-case and requirements for example, which aren’t unique:

  • My backup set will take a year or two to complete to B2, assuming I can get it to go faster. (Or longer if not.)
  • The farther back in time you go in my dataset (mostly photos), the more ways the files are already backed up, both locally and in the cloud.
    • While I eventually want them all backed up to one place, there is increasingly less urgency the older they get.
    • Like many pros (with slight variations), I store my photos and videos in folders named YYYY/YYYYMMDD.
    • My newer data is simply “more important” than older data, for various reasons, taxes being one example.
  • Most importantly: I can’t risk waiting 12-24 months for my newest stuff to get backed up to the cloud.

So as it is now, I have to resort to various tricks with mount --bind and filters to approximate what I need (an example follows), which of course makes the problem described below worse.
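
To give a concrete (and entirely made-up) example of the kind of trick I mean: bind-mount only the newest photo folders into a staging directory and point the Duplicacy repository at that instead of the real tree:

  # hypothetical paths: expose only the 2019 photos to Duplicacy for now
  mkdir -p /backup-staging/photos/2019
  mount --bind /data/photos/2019 /backup-staging/photos/2019
  # the Duplicacy repository is initialized on /backup-staging,
  # so only the bind-mounted subset gets scanned and uploaded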

I realize that with the current chunking algorithm, sorting by newest first would likely cause problems for deduplication, resulting in an unnecessarily higher amount of data to upload and store (and pay for).

But that problem in turn wouldn’t exist if the issues noted, acknowledged, and objectively measured in the discussion “Inefficient Storage after Moving or Renaming Files? #334” were addressed - e.g. by the PR waiting to be merged to master, “Bundle or chunk #456”.

That single PR alone would likely save me thousands of dollars in cloud storage bills, over the long run.

Edits Sept 13 2019:

  • Clarified reasons for needing newest-first priority.
  • Referenced discussion #334 (inefficient deduplication with moved/renamed files).
  • Referenced PR #456 (“bundle or chunk” algorithm to solve that problem).

In my case Duplicacy is so much faster than Crashplan that I don’t care about file order.

Sorting based on an attribute that will change (dates, sizes) will add more churn, in that a given file may end up next to different neighbors on subsequent backups, which depending on file order may cause a lot more upload traffic. (Most of us update a file far more often than we move it.)


Yeah, I get that about churn - as I noted, doing it by date might make things worse. But that wouldn’t apply to all sorting options, and even then it’s only an issue because Duplicacy groups multiple files into chunks.

If fracai’s “chunk boundary” branch [or fork?] were merged - which ends chunks at the end of large files - then it would never be a problem, regardless of sorting algorithm. (I’m aware that the branch might introduce other issues elsewhere in the code path.)

Grouping by size should also mitigate the problem, though that would be less useful for general backup than newest-first. The key thing to keep in mind is that even with alphabetical or node-ID-based ordering, there’s already a known problem with deduplication efficiency; explicit sorting, such as by size, shouldn’t make it worse.

Also, in my experience so far, Duplicacy is objectively not faster than Crashplan - though that has nothing to do with the software. It’s my choice of Backblaze B2 for storage, specifically. I’m averaging about 750 KB/s, compared to Crashplan at ~2 MB/s for never-before-seen content, measuring goodput at the network interface rather than relying on the application’s self-reporting. (But don’t get me started on Crashplan’s other performance problems, which have been complained about forever on the web: memory consumption, crashing when the default JVM settings run out of memory, and so on.)

(I’m not trying to be argumentative, but I’m not sure we’re talking about the same things.)

FWIW I got about 400 KB/s to Crashplan (much slower in the last year or so than four or five years ago) and about 3 MB/s to Backblaze B2 (my internet service’s limit). Have you set the ‘-threads’ option to, say, 10? (I’m using the CLI; I don’t know if or how threads are exposed in a GUI.) It took me about 5 months to do my initial backup with Crashplan years ago, and I’m just over a month into backing up about twice as much data with Duplicacy.
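
For reference, with the CLI that’s something like this (the storage name “b2” is just an example, use whatever yours is called):

  duplicacy backup -storage b2 -threads 10 -stats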

But what I was talking about is that Duplicacy’s scan for changed files (on my NAS, 10 or 20 minutes) is orders of magnitude faster than Crashplan’s scan here (days for my NAS, sometimes a week or so) - in fact Duplicacy did the original complete backup (to a local disk) faster than Crashplan did a scan for changed files (I had them running in parallel). Duplicacy also scans my local files in a matter of seconds, compared to, say, 5 to 10 minutes for Crashplan.

I’m also reasonably sure you aren’t taking everything into account regarding the different sorting options, but then again, if it’s an option, people can choose sorts that work well for their habits; for me, alphabetical is a clear winner. Crashplan just never would get to the end of my larger files unless I moved a backup set with the large files to the top of its priority list and then put things back the way they should be after the big files got uploaded. With Duplicacy I can simply start a backup in whatever repository I want, even if other backups are running, so I get my 4 or 5 daily backups of my work done at the same time as it’s plowing through my 1000 or so 2-3 GB music files.


I honestly don’t see a benefit to changing the sort order. What possible benefit can ordering by newest first confer?

What problems are these?


Interesting, my observed speeds to Crashplan gradually sped up, not down. I’m still using it in fact, and will until initial Duplicacy backup is done. (Not on same machine or volume as Duplicacy.) My initial backup took 2 years, back in the very early days - I was one of their first customers.

Performance definitely dropped with deduplication set to “always”. But it improved with “automatic” or “never”, and with “always” it improved significantly with a drastically increased JVM memory allocation (and more memory to work with). All backup products have pros and cons that conspire to produce different results for different use-cases. One of Crashplan’s major flaws was needing to cache block hashes in memory, grinding if it didn’t have enough, and often crashing if it ran out.

But until Crashplan screwed the pooch, with all pros and cons bundled together, their Home product was the hands-down unrivaled winner for me and many people. Unlimited storage, unlimited versions, never deleting files that are deleted on the client or simply become unavailable (unlike most other all-in-one commercial cloud backup solutions, for which this alone is a deal-breaker for many people), an unbeatable regex filter design (even if poorly implemented in the GUI), flawless deduplication, and by far - nothing even comes close - the cheapest solution for, say, 5 TB and up of cloud backup data. Also, you could back up to your own remote hosts, which I also did. That was freaking amazing (for an all-in-one commercial cloud backup solution). Most of my friends and family used it on my recommendation, saving many of them from disaster. I also used their Home product for business.

Arguably their biggest flaw lately has been making really dumb business decisions, including but not limited to: canceling their Home product in favor of Business, with a grossly inferior GUI and “centralized management” that doesn’t come close to making up for it; deleting 5 TB of my backup when migrating me from the canceled Home plan to Small Business; then deleting all customers’ VM backups, including mine and my employer’s - to the point where I’m convinced they won’t survive much longer as a company. Which is why I’m switching.

No I haven’t, I’ll try that, thanks.

Since my backup set is large, takes so long to complete, and the farther back in time you go the more the files are already backed up in multiple ways both locally and in the cloud, it is absolutely essential that my newest files are backed up first. I can’t risk waiting 12 months for my newest stuff to get backed up. As a secondary goal, photos are most important.

So I have no choice but to get creative with varying mount --bind and include/exclude filters, in order to get Duplicacy to approximate my requirements. (Which further compounds the problem discussed and acknowledged here: “Inefficient Storage after Moving or Renaming Files? #334”.)

Also, I just learned about this feature, coded up and tested last year (but not yet merged to master): “Bundle or chunk #456”. It would completely obviate problems with different sorting algorithms, and could literally save me thousands of dollars in cloud storage bills over the long run.

I was a very happy Crashplan customer for years, but when my system started slowing down I worked with them for over a year and all they could come up with was that things were working as designed. Till then Crashplan seemed like the obvious choice.


You honestly not seeing a benefit has no necessary correlation with others seeing one.

Consider my use-case and requirements for example (or not):

  • My backup set will take a year or two to complete to B2, assuming I can get it to go faster.
  • The farther back in time you go in my dataset (mostly photos), the more ways the files are already backed up, both locally and in the cloud.
  • I can’t actually risk waiting 12-24 months for my newest stuff to get backed up to the cloud.

Possibly because prioritizing the newest files meets fairly common requirements, it’s the default sorting mechanism used by Crashplan, Carbonite, and other cloud-based products.

As a secondary goal, I need photos to be prioritized. (And yes, it’s possible to design a sorting mechanism that takes multiple hierarchical objectives into account without being oppressively complex to configure or code - Crashplan did this, though somewhat opaquely. To see how, think in terms of “groups” of criteria: e.g. the newest stuff from 0 to 10 days old, then 10-100, etc.; within those groupings, a definable regex first, or smallest first, etc. Or, if by file size first, groups of file sizes increasing by factors of 10. And so on.)
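
To sketch the grouping idea in shell terms (paths and thresholds are made up; this just shows that a tiered priority list isn’t hard to express, it’s not a proposed implementation):

  # group files into age buckets (0-10, 10-100, 100+ days), then order by
  # size (smallest first) within each bucket; output is a prioritized file list
  find /data/photos -type f -printf '%T@\t%s\t%p\n' \
  | awk -F'\t' -v now="$(date +%s)" '{
      age = (now - $1) / 86400                           # age in days
      bucket = (age <= 10) ? 0 : (age <= 100) ? 1 : 2    # age group
      printf "%d\t%012d\t%s\n", bucket, $2, $3           # group, padded size, path
    }' \
  | sort | cut -f3-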

As discussed in one of the largest threads on this forum, “Inefficient Storage after Moving or Renaming Files? #334”, the underlying problem has been acknowledged, objectively tested, and measured, and a solution has been coded up, tested, and is waiting to be merged to master: “Bundle or chunk #456”.

That single patch could save me thousands of dollars in cloud storage bills over the long run.

I seem to remember that, some time ago, I also suggested that newest files should be backed up first, but I can’t seem to find that discussion anywhere (but see Resuming interrupted backup takes hours (incomplete snapshots should be saved)), and so I’m not sure on what grounds @gchen turned down my request at the time.

Have you considered giving your newer files a repository of their own? (And more generally: dividing your huge repository up into smaller ones?) If you prefer a single repo, you can get rid of the one with the new files once you have everything else backed up.

Yes, good that you bring this up again. I am also wondering why this is not being implemented. But since the Bundle-or-chunk question is unrelated to the order of backing up files, I would like to ask everyone not to discuss this question here but in a separate topic, e.g. this one:

Changing the sort order to newest first won’t make those files safer just because they’re uploaded first.

Until your initial backup completes 100%, you do not have a backup. There is no snapshot, only chunks. So it doesn’t matter in what order the files are uploaded, because snapshots are atomic - they must complete in order for a restore to be possible.

I suggest an alternative methodology to solve this particular problem. Maybe Duplicacy can help in this regard, with a different feature (perhaps a timestamp-based filter) but I don’t think changing the sort order will benefit you if your internet connection isn’t fast enough to keep up with your data. Particularly when changing the sort order to timestamps may hamper deduplication and your ability to resume an initial backup.

Obviously, you can’t defy the laws of physics - if it’s gonna take over a year to do an initial backup, fair enough. But you can protect the data being uploaded by breaking it up, as @Christoph suggested.

If I were you, I’d exclude old files and let it complete an initial backup early, then gradually add in older and older files. With this, you have much greater control over what exactly is backed up and how often, and a much greater chance of recoverability.

To make it easier to manage such a filter, you could look into the nobackup_file setting. Might even be possible to write a script that would pepper a repository with such .nobackup files in directories which contain files that haven’t been modified in the last month or whatever.
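
A minimal sketch of such a script (assuming the nobackup_file option is set to “.nobackup”; the path and the 30-day threshold are arbitrary):

  # drop a .nobackup marker into every directory whose subtree contains
  # no file modified in the last 30 days, so Duplicacy skips it for now
  find /data/photos -mindepth 1 -type d -print0 |
  while IFS= read -r -d '' dir; do
      if [ -z "$(find "$dir" -type f -mtime -30 -print -quit)" ]; then
          touch "$dir/.nobackup"
      fi
  done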


The downside with this, however, is that it might take even longer to complete the first full backup, because Duplicacy keeps an incomplete file for the very first revision only.

I do something slightly squirrelly to get around this problem. I built one repository (with a link to the real files) whose filter excludes the problem files (for me, my multi-gig .iso SACDs) and got that backed up in about a week. I built another repository (linking to the real files) that only includes the problem files, and:

  1. keep a long running backup going, and
  2. periodically (about every day) change the filter to only refer to the files that are backed up and start that (in parallel with the big backup)

The second backup above skips all chunks and runs about 20 times faster per file than the real backup - it only needs to handle the new files.
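
Roughly, the two filters files look something like this (folder names are made up; the second file gets widened a little each day as the long-running backup makes progress):

  # repository A - everything except the problem files
  # (relies on Duplicacy including everything not matched when only exclude patterns are present)
  -sacd-isos-batch1/*
  -sacd-isos-batch2/*

  # repository B - only the problem files uploaded so far (extended each day)
  +sacd-isos-batch1/*
  -*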

I don’t really care whether the big backup overwrites the snapshot ID of one of the smaller incremental snapshots or not. When everything is backed up, I’ll have the original snapshot of the non-problem files and a snapshot of the problem files. Then I’ll merge the two filters and do a final real backup, which won’t take too long.

Running the backups that skip every chunk doesn’t slow down the big long backup, so I’m not losing time, and I have a good snapshot of my current progress as well as normally functioning backups of the non-problem files.

I like that things like Windows Update restarts, power failures, net outages, etc. don’t bother the process, and I don’t lose any real work getting things started again.

I love using the CLI! 🙂


That’s a pretty good idea!

It’s actually closely related, because I don’t think changing the sort order could be done - not without deduplication problems - without solving the chunk issue first. That PR is, I believe, a prerequisite. But fair enough on the request to discuss it in its original thread.

Ouch. I didn’t know that. That alone actually removes Duplicacy from consideration for my requirements. I’m really glad you pointed that out while I’ve only backed up a few dozen GB so far. No solution is perfect, I knew Duplicacy wasn’t either, but this would be a deal-breaker for me. I’ll look into it more to validate for sure, but I believe you. And I’ll add “Can restore from incomplete backup?” as a column in my backup requirements spreadsheet :-). (And “Can specify newest first without dedupe penalty?”)

Looks like I’m going to have to resort to a two-stage backup process after all. (First to local, then from there to cloud. Which opens up many more backup product options.)

I appreciate the feedback!

That’s actually a very clever idea! I’m afraid, though, that given what I just learned about not being able to restore from incomplete backups, Duplicacy is out of the running. Which kills me, because it took literally months of research and trials to get to this point, and no backup product meets all of my requirements. Up until now, I didn’t even know that “can restore from incomplete backup” was a requirement; I just assumed they all could. But suddenly a requirement I didn’t know existed this morning is the most important one.

I do make a point to read issues and forums, but obviously not well enough!

Duplicacy is so promising though. I’ll keep checking back.

Which is just my point: I don’t have incomplete backups - or more properly, all of my backups are current to the day for the problem files and to the moment for the rest of my files (those back up in no time).

Perhaps it isn’t obvious that you can be running multiple backups and/or copies at the same time. I have my normal scheduled backups running all the time, they work great and are independent of the long running backups.

I also have my problem files being backed up twice: once by a backup that runs “forever” and actually uploads the chunks, and again by another backup which doesn’t get ahead of the first and is complete to the day - it runs quite fast since it can skip all of the uploaded chunks and the files it has already backed up, it produces a complete backup (up to that point), and you can restore from it. In general the only files I can’t restore are the ones not backed up yet (minus part of a day’s worth of the problem files). Since I already have two copies of everything on site, I’m not too worried, and I just dumped my Crashplan backups entirely.

This may seem complicated, but in practice it’s a lot less work for me than Crashplan. Because the Crashplan scans would die a lot, Crashplan kept thinking I’d deleted all of my files. If a scan could finish, it would resurrect the files without re-uploading them, but any restores I did of files in limbo were a real pain - restoring deleted files that were really moved files, etc. My restores were several times as big as the data backed up. They never seemed to understand that a failed scan shouldn’t mark all files not yet looked at as deleted.

When the initial big backup is complete (in about 5 days) I can go back to the simple normal backups.

As @tedsmith alludes to, that wouldn’t necessarily be the conclusion I would draw.

Split up your backups into multiple repositories and have a tiered filter system for protecting new and important stuff first. Those backups will complete early and you can add more files with subsequent incremental backups by removing filters over time.

Once all your initial backups are complete, it’s incrementals from then on.

Also, I don’t know what kinda content you’re backing up with Duplicacy, but if you have any media, I’d honestly consider a different tool for that. Movie media generally doesn’t de-duplicate, nor does it really need a version history. Rclone is fantastic for copying media to cloud storage.
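
For example, a one-way copy to B2 with rclone looks something like this (the remote and bucket names are placeholders for your own):

  # copy a media folder to a configured B2 remote; 8 parallel transfers
  rclone copy /data/video b2remote:my-bucket/video --transfers 8 --progress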


A strategy I used in some large repositories to do the initial backup was to do a “piecemeal” backup with filters. This way, each backup that is completed is a complete revision (the Duplicacy name for “snapshot”), and with each execution you adjust the filters.

Something like this:

First run, filters file:

+folder1/*
-*

Second run, filters file:

+folder1/*
+folder2/*
-*
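
Between runs you just invoke a normal backup; each completed run is then a restorable revision of whatever the filters included at that point. Something like (the storage name is just an example):

duplicacy backup -storage b2 -threads 4 -stats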

This strategy just doesn’t apply well to large single files like databases, VMs, etc.

Well, as others have said, I think that is not a necessary conclusion. But I have nothing to gain from trying to persuade you to stay with Duplicacy. I would, in contrast, be interested to hear which better backup solution you are considering using instead. I spent well over a year testing different solutions before I ended up using Duplicacy, but I’m still interested in learning about good alternatives.


These are good ideas that I’ll experiment with, thank you!

I have two kinds of data. Standard daily office, dev, and management document type stuff - most of which already lives in a 2tb Dropbox folder (with encfs in the middle), so I’m not too worried about that. (But do need backups and versions, as Dropbox occasionally munges conflicts that I don’t notice until months later. It’s getting better though.)

The second is photos and video. Some (mostly video) is for projects, the rest personal. That’s what makes up most of my current 7 TB volume. And most of that, due to my run-and-gun workflow, is actually redundant, so efficient deduplication is critical. I even tolerate ZFS’s insanely slow write speeds with dedup=on for my local backup. (And more recently Btrfs with offline deduplication.)

I’ve been looking into that also. More on that below…

That’s interesting. Crashplan Home (actually family plan) was truly a “set and forget” solution for me. Well…at least after years of fine-tuning regex exclude expressions. Once I fixed the crashing problem due to not enough JVM memory allocated for deduplication, it never, ever failed or stopped scanning. On between 5 and 8 machines. I’ve restored countless times from it, flawlessly, but I’ll admit to being worried about what would happen if I needed to do a full restore. I also chose Small Business at work, and while the interface is absolutely atrocious, it has done very well.

In short, I slept VERY well with Crashplan. It wasn’t perfect, didn’t meet all requirements, but no solution does. Of course they completely f’ed things up by dropping Home, and screwing their customers over (including me) during their “migration” to Small Business. And then they deleted all VM images from all their customers with little (or no?) warning - including mine, both at home, and the dozens at work. Major PITA. (Now I just 7zip them up. Which isn’t too bad.)

I know there was always much controversy around Crashplan, even more so after Code42’s string of disastrous business decisions. But it worked mostly flawlessly for me. Even after all they’ve screwed up, I’d still stick with them if 1) they stopped doing stupid s**t [which they seem unable to do], and 2) if I believed they will still exist in a few years.

I can’t really just categorically say “no way”. Unless as an emotional decision, which this is not. (The decision to ditch Crashplan is, in part, emotional. For example, it took 2 years to finish first backup. Then they deleted most of it “migrating” me to Business Pro. Now it’s another 1-2 years to catch back up. Screw that! You bet your a$$ it’s emotional.)

As for Duplicacy, although I’ve already exhausted all research on direct-to-cloud solutions, there’s still a whole 'nother entire universe of options, if I’m willing to back up to a local server first, then from that to the cloud. And I am willing to do that. In that case, Borg so far is looking pretty good. (It has no direct cloud support, but I can handle that separately as a final step from a local backup server. Which others have also done.)

And having realized this not-insignificant drawback to Duplicacy, hell, I can’t honestly rule out Crashplan completely yet. As much as I hate to admit it, it still checks all the major requirements I have, in spite of literally going backwards on features and capabilities. (Well, except for “will surely still exist in a few years”. That’s a big one.)

So we’ll see. Need to do more research. These workarounds to Duplicacy’s drawbacks, while a pain and the fact that they are a thing, says alot about the product, but it seems pretty solid in other ways, and nothing is free of drawbacks and the need for workarounds. (I’ve have all kinds of workarounds in place with Crashplan for years, to manage their poor regex interface. The actual runtime implementation of regex is absolutely perfect…the UI to manage them, horrible. I have written helper scripts to just manage that.)

If anyone is interested, here’s my spreadsheet for comparing backup and sync solutions (which are two different things I’m implementing in parallel), many of which I’ve experimented with. And here’s my fairly high-level cloud storage comparison. The data isn’t complete - it was never intended to be published - but there’s no harm in sharing it. Feel free to copy it for your own use, if you think it’s worthwhile.
