Slow backup speed, and a web ui that seems too simplified

Hello all. I’m testing out Duplicacy as an alternative to ARQ backup for a couple of reasons. The main one is that ARQ doesn’t actually honor or enforce its budget setting, so I have to start my backup from scratch every few months, which is annoying. I don’t see a comparable option in Duplicacy at first glance, but perhaps I’m missing it.

The other reason is that ARQ is extremely slow to scan a drive while backing up, meaning I can only effectively run one backup per day.

Anyhow, I’m running my initial backup over the LAN before I move the target machine back to a relative’s house, but it is going quite slowly. I initially saw 8 MB/s for a single drive over SFTP. I then switched to a network share and the speed decreased slightly.

I then changed the web UI’s IP address and saw a significant increase in speed: for a while, each of my drives (all running at the same time) was at around 16 MB/s. It slowly ramped down, though, and has now dropped to about 3 MB/s per drive.

I’m not running into a CPU or RAM bottleneck as far as I can tell. I have 24 threads at 2.93 GHz available and CPU usage is quite low, and only about 30% of my 72 GB of RAM is in use, mostly by another application.

I’m wondering what else I could look at as a potential cause of this slowdown, or if this is pretty much expected and has to do with the de-duplication or encryption?

My other note is that the web UI seems oversimplified. I feel like I should be able to access a lot more. In my searching I see references to settings for the number of threads used, and to tweaks people make to chunk sizes, but all of this appears to exist only in the command line version. I might be missing something, but it definitely feels like there should be an advanced mode toggle or something along those lines. Perhaps there is a way to access the command line, but I can’t seem to find it.

I was also curious whether the ability to search for a specific file or folder name is going to be added, as that’s one thing I really like about ARQ, even if it was dog slow to find things and load lists from the SFTP server.

I also think there could be so much more. For example, I am backing up to Google One and the only information I have is that it will take 2 days. I don’t know how much data will be uploaded (I need to calculate that myself, and it is important information because 2 TB is the limit), and I don’t know what has already been backed up. These are just examples.

When I first used Duplicacy, I was frustrated by seemingly slow speed. I discovered a few things to consider:

  1. The first backup is always much much much longer than any incremental backup. The first backup of my computer’s hard drive to a directly attached drive would take 24 hours or more, and subsequent incrementals only a few minutes (depending on the number of changes made)

  2. When I looked into this in late 2017 (and I don’t think it’s changed), setting more threads only affects the upload, not the internal compression and chunking. So let’s say you have a nice 4-core processor, Duplicacy will register as only using 25% of CPU but is actually maxed out in its throughput. This has to do with complexity of the data and so on. On older/slower machines (where a single core runs slowly), I’ve noticed this slows things down.

  3. One of the big things when you make your first full backup is having to reference each chunk, and create the new directory structure on your drive necessary to contain that. Multithreading does NOT help here if you’re using HDD drives, because more threads just means more head movement on the drive, which will actually slow things down (I’ve done tests to support this). On subsequent backups, once the directories for the hash are set up, and especially once you are able to start skipping duplicate chunks, it speeds way up. On my initial backups, sometimes I do only see 10-30MB/s. On subsequent (full) backups where there are duplicates, I often see 80MB/s or more.

I would guess that the biggest slowdown is that if you’re hitting a single drive with 24 threads, you’re probably overloading it unless it’s SSD (and even then, 24 threads is a lot).

As for the UI question, I long ago decided to just use the command line version. It was an investment of time, but is very flexible and was worth learning.

For the 2TB limit on Google - you can’t know how much compression you’ll achieve in advance. There is no way of predicting this, because it depends on how redundant your data is, how compressible it is, and so on. With my data - where there are lots of redundancies - I tend to get 50% reduction in size on the backup from the original.

You could try just backing up to a local drive to see how it shakes out, then decide whether to go to Google. You can even use “duplicacy copy” to copy the data from a local backup into the cloud.
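
For reference, the local-first approach might look roughly like this from the CLI. This is only a sketch under assumptions: the function name, the storage names, the snapshot id, and the storage URL are all placeholders, and it assumes you run it from a repository that is already initialized against the local storage. It prints the commands by default and only runs them if you set DRY_RUN=0.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper, not part of Duplicacy.
# $1 = name of the existing local storage
# $2 = name to give the new cloud storage
# $3 = snapshot id to use on the new storage
# $4 = storage URL of the cloud destination
seed_cloud_from_local() {
  # "add -copy" creates the second storage so that it is compatible with
  # the first one, which is what allows chunks to be copied between them.
  run duplicacy add -copy "$1" "$2" "$3" "$4"
  # Copy the existing snapshots from the local storage to the cloud storage.
  run duplicacy copy -from "$1" -to "$2"
}
```

Calling `seed_cloud_from_local default offsite my-pc STORAGE_URL` in dry-run mode just echoes the two commands, so you can check them before committing to a long upload.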

I actually have five hard drives to back up, four 2 TB and one 1 TB (though they aren’t all packed full).

I figured the CPU and network load could be split across them, but it actually just got slower and slower to the point where it froze, so I restarted the program and am doing the drives one by one. Most of this should be fairly large files, not thousands of tiny ones, so I was surprised that it’s still going so slowly.

I’m not afraid of using the command line, but I also really want the web UI for easier monitoring. I also have quite a few things to exclude, and entering all of that and keeping it up to date via the command line would be tedious whenever I need to make a change.

I think if there was a way that we could also access the command line, either on the local machine itself, or through the web page, this would add a lot of power to the web ui version.

It does sound like the web ui is relatively new so perhaps this is something the developer can take into consideration, where we pay the monthly fee for the addition of features, not trading features for others, especially since the web ui is just controlling the command line application somehow.

Hi, thanks. We all know the compressed size depends on the material, but knowing nothing is not good. We should at least know how many files (x of y) and how many GB (a of b) have been completed. As for how much data is included in the backup, Duplicacy knows that after the first indexing, I reckon.

I haven’t looked at the log level of the web version yet, but the CLI version shows a completion percentage at the end of each line, which AFAIK is based on the bytes to be backed up:

INFO UPLOAD_PROGRESS Uploaded chunk 2915 size 145873, 18.19MB/s 00:02:45 51.4%

When I need to, I “follow the tail” of the logs, so that’s my “monitoring screen” …
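
If you only want the numbers from a line like the one quoted above, something along these lines should work. It is a sketch that assumes the last three whitespace-separated fields are the speed, a time value, and the percentage, as in that sample; the function name is made up, and other Duplicacy versions may format the line differently.

```shell
#!/bin/sh
# Sample line copied from the post above.
line='INFO UPLOAD_PROGRESS Uploaded chunk 2915 size 145873, 18.19MB/s 00:02:45 51.4%'

# Hypothetical helper: for every UPLOAD_PROGRESS line on stdin, print the
# last three fields (speed, time value, percent) separated by spaces.
progress_fields() {
  awk '/UPLOAD_PROGRESS/ { print $(NF-2), $(NF-1), $NF }'
}

printf '%s\n' "$line" | progress_fields
# prints: 18.19MB/s 00:02:45 51.4%
```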

Yep, I use two or three "tail -f … | grep …" pipelines to build files of the lines I care about, and then display the last line of each one every second in a console window.
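
A minimal sketch of that kind of setup, for anyone who wants to try it. The log path, the output file, and the function names are assumptions for illustration, not the exact commands used above. Nothing runs until you call the functions.

```shell
#!/bin/sh
# Hypothetical paths; point LOG at wherever your Duplicacy log is written.
LOG="${LOG:-backup.log}"
OUT="${OUT:-progress.txt}"

# Keep only the progress lines. --line-buffered (GNU and BSD grep) makes
# grep write each match right away instead of waiting for a full buffer.
filter_progress() {
  grep --line-buffered 'UPLOAD_PROGRESS'
}

# Follow the log and append matching lines to $OUT (runs until killed).
collect_progress() {
  tail -f "$LOG" | filter_progress >> "$OUT"
}

# Print the newest collected line once per second (runs until interrupted).
show_latest() {
  while :; do
    tail -n 1 "$OUT"
    sleep 1
  done
}

# Usage, in a console window:
#   collect_progress &
#   show_latest
```

Using a different pattern in a second filter function and a second output file gives you another "channel", for example one for warnings or errors.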



You may be able to find the bottleneck with the benchmark command. To run a benchmark with the web GUI, enter these commands in a terminal:

cd ~/.duplicacy-web/repositories/localhost/all
../../../bin/duplicacy_osx_x64_2.1.2 benchmark -storage <storage name>

Tried that, but it states that the repository has not been initialized. I have been having better luck again with higher speeds: the second drive I’m backing up did slow down to 6 MB/s, but the other one is running at 31 MB/s, at least if the estimate is accurate. I’ve definitely noticed that the time estimates are a rough guess at best, as the backup does seem to choke at some points in the process.