Is it bad to move lots of files around during initial backup?

Hi, all. A month ago I got my first Synology NAS (DS1618+), and I’ve been slowly copying over every file off every old backup disk and USB drive I’ve accumulated over the years, deleting duplicates, and organizing them.

Almost immediately on purchase I set up @saspus’s Docker image and bought a Duplicacy license. I’m backing up a single repository to a single Backblaze B2 bucket. I have a truly pitiful 7.5 Mbps upload speed via my suburban cable ISP, so I knew it was going to take over a month to complete my first backup. But over that month I’ve done a lot of moving files around, adding a little new stuff, and deleting duplicates and things I don’t care about.

As the initial backup went on, the estimated time to completion got truly ridiculous, hitting almost 300 days at times. I’ve only filled up about 2 TB of this NAS, and even at 7.5 Mbps it shouldn’t take most of a year to upload everything on it. And when I started trying to solve this a few days ago, Backblaze reported that the bucket had about 40% more data usage than the Duplicacy check job reported. So right then I ran a prune job and stopped checking it for a while.

Tonight Backblaze reported much closer stats to what Duplicacy reported, but the backup estimate was still many months. So I aborted the backup job and restarted it. At the moment, Backblaze reports 812 GB of storage and 161,000 files. Duplicacy reports 783 GB and 157,000 chunks. And the backup process estimates 17 days, which seems about right to upload another terabyte or so.
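A quick sanity check on those numbers, as a rough sketch. It assumes the link sustains the full 7.5 Mbps with no protocol overhead, which it won't quite do in practice; the function name is just for illustration.

```python
def days_to_upload(num_bytes: float, upload_mbps: float) -> float:
    """Days needed to send num_bytes at a sustained upload_mbps (megabits/s)."""
    seconds = num_bytes * 8 / (upload_mbps * 1_000_000)
    return seconds / 86_400

one_tb = 1e12  # decimal terabyte (10^12 bytes)
print(f"1 TB at 7.5 Mbps: {days_to_upload(one_tb, 7.5):.1f} days")
print(f"2 TB at 7.5 Mbps: {days_to_upload(2 * one_tb, 7.5):.1f} days")
```

That works out to roughly 12 days per terabyte at full line rate, so 17 days for a bit more than a terabyte is plausible once overhead is included, and 300 days is not.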

So, after all that, MY QUESTIONS:

  • Did I cause all that weirdness by moving stuff around during initial backup?
  • Was it actual weirdness, or just messed up estimates that would have finished in 17 days anyway if I had left it?
  • And, generally, should I stop organizing things until the first backup is done? I would think that the whole point of deduplication would be for that not to matter so much. But… now I’m not so sure.

Thanks for your time.

Do you know anyone with a faster upload speed than yours? If I were you, I’d

  1. Back up to an external HDD
  2. Ask a friend to run duplicacy copy to B2
  3. Change the backup destination to B2
  4. Do additional backups

IIRC, there are server providers that will mount your own HDDs. You can ship your HDD to such a provider and run duplicacy copy in the datacenter (which should usually have 1 Gbps+ upload speeds).

Given my starting conditions, any of those options is far more cost and trouble than just waiting two and a half more weeks, if indeed that estimate holds. I have the important stuff backed up in other ways for now; if my house burns down in the next 18 days, the fraction of my data lost forever will be a pittance.

I do want to keep organizing my NAS, though, unless someone confirms that that’s a bad idea based on the issues I described above.

Regarding your question about whether moving files interferes: there is an impact. A small part of the chunks already in storage will not be reused, but in my experience most are reused and deduplication works well.
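To illustrate why most chunks survive a move, here is a toy sketch of content-defined chunking over a concatenated file stream. It is not Duplicacy's actual chunker: the window size, the mask, the use of CRC32 as the boundary test, and all function names are assumptions chosen only for the demonstration.

```python
import hashlib
import random
import zlib

WINDOW = 32    # bytes of trailing context that decide a boundary (assumed value)
MASK = 0x3FF   # cut when the low 10 bits are zero: ~1 KiB average chunk (assumed)

def chunk_hashes(stream: bytes) -> set:
    """Split the stream at content-defined boundaries; return the chunk hashes."""
    hashes = set()
    start = 0
    for i in range(len(stream)):
        window = stream[max(0, i - WINDOW + 1): i + 1]
        if (zlib.crc32(window) & MASK) == 0:
            hashes.add(hashlib.sha256(stream[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(stream):  # trailing partial chunk
        hashes.add(hashlib.sha256(stream[start:]).hexdigest())
    return hashes

def reuse_fraction(before: bytes, after: bytes) -> float:
    """Fraction of the chunks in `after` that already exist from `before`."""
    old = chunk_hashes(before)
    new = chunk_hashes(after)
    return len(new & old) / len(new)

# Three pseudo-random "files"; the second backup sees them in a different order,
# as if file C had been moved to an earlier directory.
rng = random.Random(0)
a, b, c = (bytes(rng.getrandbits(8) for _ in range(100_000)) for _ in range(3))
fraction = reuse_fraction(a + b + c, c + a + b)
print(f"{fraction:.0%} of chunks were already in storage after the move")
```

Because a boundary depends only on the bytes just before it, chunks that lie wholly inside a file come out identical wherever the file ends up. Only the chunks that straddle the old or new file edges differ, and those are the ones that become unused.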

About the difference in space reported by check and B2: I suggest configuring the bucket to keep only the last version of the files:

[image: bucket]

After finishing your first backup, you can “clean up” the unused chunks with the prune command.

Yes, if you move stuff around then the estimate by Duplicacy could be too optimistic (when uploading files that have already been uploaded) or too pessimistic (if files yet to be uploaded are in fact duplicate copies).
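A rough sketch of how that skew can arise. It assumes the estimate simply extrapolates the average pace so far over the bytes still to be scanned, and that skipping already-uploaded data costs no time; both are simplifying assumptions for illustration, not a description of Duplicacy's actual calculation, and the sizes are made up.

```python
UPLOAD_BYTES_PER_S = 7.5e6 / 8  # 7.5 Mbps expressed in bytes per second
DAY = 86_400

def extrapolated_eta(elapsed_s: float, processed: float, total: float) -> float:
    """Naive ETA: assume the remaining bytes go at the average pace so far."""
    return elapsed_s * (total - processed) / processed

def actual_remaining(new_bytes_left: float) -> float:
    """Time really needed: only bytes not yet in storage must be uploaded."""
    return new_bytes_left / UPLOAD_BYTES_PER_S

total = 2000e9      # 2 TB to scan in all (hypothetical)
processed = 500e9   # 500 GB scanned so far (hypothetical)

# Too pessimistic: everything so far was new, but 1000 GB of the remaining
# 1500 GB are duplicate copies that will be skipped.
elapsed = processed / UPLOAD_BYTES_PER_S
pessimistic_eta = extrapolated_eta(elapsed, processed, total)
pessimistic_actual = actual_remaining(500e9)

# Too optimistic: 400 GB of the first 500 GB were already uploaded and skipped,
# but all of the remaining 1500 GB are new.
elapsed = 100e9 / UPLOAD_BYTES_PER_S
optimistic_eta = extrapolated_eta(elapsed, processed, total)
optimistic_actual = actual_remaining(1500e9)

for label, eta, actual in [
    ("pessimistic", pessimistic_eta, pessimistic_actual),
    ("optimistic", optimistic_eta, optimistic_actual),
]:
    print(f"{label}: estimated {eta / DAY:.1f} days, actual {actual / DAY:.1f} days")
```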

But, I think you should use at least 2 threads if you haven’t done so. In my experience a single B2 server is capable of saturating 7.5 Mbps but the actual speed can vary a lot depending on the load. So to get a better estimate you should use 2-4 threads to connect to multiple servers.

Yes, it can cause problems; see Tarsnap - Tips for more details if you’re interested.

If you have to do a live system backup over a slow link, I suggest starting with a snapshot taken at some point in time. Once that finishes, it should be easier to catch up if the differences between that snapshot and the current state are small.