Optimization of speed (was: one thread for local HD backup)

I have been doing some informal testing of Duplicacy to figure out how to speed up my backups to a local RAID 5 four-hard-drive array. At one point, with 2TB to back up, the -stats output was estimating 8 days.

Specs:
Backup Source: 3TB “hybrid” drive with spinning HD plus 128GB cache
Backup Destination: 4-bay USB 3 enclosure with 4x8TB drives in RAID 5, running SoftRAID on a Mac

I tried various -threads counts, all the way from -threads 64 down to -threads 1.
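
For reference, each run was just the same backup command with a different thread count:

    duplicacy backup -stats -threads <n>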

Here are informal results. By informal I mean that I would kill the ongoing backup, resume at the same spot with a different thread count, wait until it stabilized over a long stretch with minimal duplicate chunks, and then record the MB/s. It is not rigorous, but at least an indicator.

  • -threads 1 ~12MB/s - not great, but at least it will finish the initial backup in a day or two.
  • -threads 2 ~1.93MB/s - very slow; it was slated to take well over a week.
  • -threads 64 ~1.1MB/s - extremely slow; early on, before it stabilized, it was running at only 100KB/s.
  • -threads 16 ~619KB/s - seems even slower, but I did not let this one run very long, so it may not have stabilized at full speed.
  • -threads 32 ~2.0MB/s - a longer test than for 16 threads; it converged to a somewhat higher rate, though not significantly different from 2 threads.

The bottom line is that only one clear winner emerged: a single backup thread.

In thinking about it, this may make more sense than it seems for a spinning hard drive: writing each chunk requires a head seek, and with multiple threads there are a LOT more seeks going on. In fact, the way I pinpointed what was happening was to run a separate speed test on the drive while the multi-threaded backup was running. The array normally writes at 200MB/s, but with more than 4 backup threads running, it slowed to less than 9MB/s.

With only 1 backup thread running, the array still writes at over 130MB/s for separate processes.
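
If anyone wants to reproduce the effect without Duplicacy in the mix, here is a rough Go micro-benchmark of the idea: N goroutines each writing their own file, which forces the heads to seek between write targets. The /Volumes/RAID path and the sizes are placeholders for my setup.

    // Rough micro-benchmark: aggregate write throughput with N concurrent
    // writers. On a spinning array, raising writers forces head seeks
    // between the files and the total MB/s drops.
    package main

    import (
        "fmt"
        "os"
        "sync"
        "time"
    )

    func main() {
        const (
            writers   = 4       // compare 1 vs. 4
            chunkSize = 4 << 20 // 4MB writes, roughly chunk-sized
            perWriter = 256     // 1GB per writer
        )
        buf := make([]byte, chunkSize)

        start := time.Now()
        var wg sync.WaitGroup
        for i := 0; i < writers; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                // Placeholder path; point this at the array under test.
                f, err := os.Create(fmt.Sprintf("/Volumes/RAID/bench-%d.tmp", n))
                if err != nil {
                    panic(err)
                }
                defer f.Close()
                for j := 0; j < perWriter; j++ {
                    if _, err := f.Write(buf); err != nil {
                        panic(err)
                    }
                }
                f.Sync() // flush so the timing reflects the disk, not the cache
            }(i)
        }
        wg.Wait()

        elapsed := time.Since(start)
        totalMB := float64(writers) * float64(perWriter*chunkSize) / (1 << 20)
        fmt.Printf("%.0f MB in %v (%.1f MB/s)\n", totalMB, elapsed, totalMB/elapsed.Seconds())
    }

Running it with writers = 1 versus writers = 4 should show the same collapse in aggregate throughput on a spinning array.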

Based on this, I am guessing that an SSD would benefit far more from multiple threads, because there are no seeks involved.

However, for spinning disks locally, I’ll be doing all my backups with -threads 1, unless I’ve missed something here.


Yes, that makes total sense. For spinning disks we should always stick with -threads 1. Thanks for running the tests and sharing the data.


While my explanation of hard-drive seeks and multiple threads makes sense for running one thread, I’m still trying to figure out how to make this go faster.

E.g. I have 32TB to back up across two RAID arrays (my “repositories”). My backup of the first RAID is presently averaging only 6MB/s, and it has been running for 3 days.

At this rate, it will be 74 days for a full backup.

The problem is definitely not the speed of either array for normal use. Each one runs at around 200MB/s or more when tested with AJA System Test. When I do a “clone” using one of the file-level cloning tools, it takes 4-6 days, which is much closer to full speed.

How can Duplicacy be sped up for this kind of use case?

One thing I’ve noticed is that Vertical Backup to a Linux virtual machine seems to run much faster, even over the network. And even though those backups are being stored on a hard drive by the virtual machine, multiple threads do seem to help.

Thanks!

6MB/s is too slow for local disk storage. On my old Mac mini, with the repository and the storage on the same disk, the backup speed is usually more than 30MB/s.

Can you run some tests with smaller datasets and different disks to isolate the issue?

Okay, when I get a chance I’ll try some tests. It does seem very slow. Thanks.

Alright, I did further testing, and things grow more interesting.

I was able to obtain a >10X speed improvement, from ~10MB/s to over 140MB/s, backing up a directory of large Final Cut projects.

I did this by simply creating a fresh storage on the same RAID where I was already backing up.

Now at first I thought this rather strange. Then I noticed that the config parameters are quite different.

For the original (slow) backup they are:

{
    "file-key": "",
    "hash-key": "6475706c6963616379",
    "id-key": "6475706c6963616379",
    "min-chunk-size": 1048576,
    "max-chunk-size": 1048576,
    "average-chunk-size": 1048576,
    "chunk-key": "",
    "chunk-seed": "766572746963616c6261636b7570",
    "compression-level": 100
}

These were the defaults created when the storage was initialized by Vertical Backup.

For the new, much faster backup they are:

{
    "compression-level": 100,
    "average-chunk-size": 4194304,
    "max-chunk-size": 16777216,
    "min-chunk-size": 1048576,
    "fixed-nesting": true,
    "chunk-seed": "6475706c6963616379",
    "hash-key": "6475706c6963616379",
    "id-key": "6475706c6963616379",
    "chunk-key": "",
    "file-key": ""
}

These are the defaults created when I just did a duplicacy init to a fresh storage.
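
If I’m reading the init options right, those sizes correspond to the -c/-chunk-size, -max/-max-chunk-size, and -min/-min-chunk-size flags, so a storage with these parameters could also be created explicitly with something like:

    duplicacy init -c 4M -max 16M -min 1M <snapshot id> <storage url>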

I am stunned that it would make such a large difference in speed.

Can you elucidate how these parameters are best chosen for a mixed storage that will hold backups of various kinds of files - virtual machines (from Vertical Backup) as well as a very large amount of regular files, including video?

It looks like I may have to start over with a new storage, and I’d like to get it right this time.

Thanks!

I have done further digging.

I have observed two additional facts since the earlier post:

Difference in speed depending on processor.

I am using the same settings for the “faster backup” on two different machines now: a 2011 dual-core 3.1GHz i5 iMac, and a late-2014 quad-core 4.0GHz i7 iMac. Both are backing up from a local drive to a different fast RAID 5 array. Each of these arrays can handle well over 200MB/s write speed according to AJA System Test. One is Thunderbolt (old iMac), the other USB 3 (newer iMac).

The source drives are both spinning disks that benchmark at around 120MB/s. I have done my best to quiesce other disk-intensive processes on both machines, and I have turned off Spotlight for the backup drives.

Now, there is a surprising difference in the speed of these two setups that I can’t account for apart from processor speed. The 2014 iMac is averaging 4X faster over a long backup (54.4MB/s) than the older 2011 iMac (12.43MB/s).

I did some quick profiling of CPU usage while backing up. On both machines, the process hovers at about 100% (i.e. one core’s worth). This by itself suggests that the rate-limiting step may be the CPU.

The profiling supported this further. There were 10 threads in total: 7 of them spent 100% of their time in runtime.mach_semaphore_wait (i.e. waiting for other threads), one spent 100% in runtime.kevent (under runtime.semasleep), and another spent 100% in runtime.usleep.

The only thread actually doing work was in duplicacy.chunkmaker. Its time was divided among various subroutines, but in total 24%+20%+1%+46% = 91% was spent in compress_amd64, which appears to be the chunk compression routine.

Now, I tried running with more threads to see if I could speed up compression by using multiple cores. It made no difference. I perused the code, and it seems that the compression happens in only a single thread.

So, my first observation is that multi-threading the compression may speed things up, especially for older machines.
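
Just to illustrate the idea - this is not Duplicacy’s actual code, and the standard library’s zlib here is only a stand-in for whatever compressor Duplicacy uses - farming the chunks out to a pool of workers would look roughly like this:

    // Sketch: compress independent chunks on a worker pool instead of a
    // single thread. zlib stands in for Duplicacy's real compressor.
    package main

    import (
        "bytes"
        "compress/zlib"
        "fmt"
        "runtime"
        "sync"
    )

    func compressAll(chunks [][]byte) [][]byte {
        out := make([][]byte, len(chunks))
        jobs := make(chan int)
        var wg sync.WaitGroup
        for w := 0; w < runtime.NumCPU(); w++ { // one worker per core
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := range jobs {
                    var buf bytes.Buffer
                    zw := zlib.NewWriter(&buf)
                    zw.Write(chunks[i]) // chunks are independent, so this parallelizes cleanly
                    zw.Close()
                    out[i] = buf.Bytes()
                }
            }()
        }
        for i := range chunks {
            jobs <- i
        }
        close(jobs)
        wg.Wait()
        return out
    }

    func main() {
        chunks := [][]byte{
            bytes.Repeat([]byte("video frame "), 100000),
            bytes.Repeat([]byte("project file "), 100000),
        }
        for i, c := range compressAll(chunks) {
            fmt.Printf("chunk %d: %d -> %d bytes\n", i, len(chunks[i]), len(c))
        }
    }

Presumably the tricky part is keeping the compressed chunks in order for upload, but that seems tractable.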

After many chunks are put in the storage, some other limiting factor comes into play

On my fast iMac, initial upload rates were over 110MB/s with a brand-new storage, so no chunks were skipped at the start. However, over the course of four hours, the speed slowly dropped to 57MB/s, even though many more chunks are now being skipped as it progresses.

Since I am using -stats, it is hard to tell exactly which files are being backed up at any given time. There is a possibility that the speed decrease is due to the different types of files being backed up.

However, another possibility - one supported by the backups to my previous storage being so slow - is that the write speed slows down as more chunks are written.

This makes sense in a way. Accessing/indexing very large directories is much more time-consuming than small ones. With my current settings, Duplicacy is using a one-level, single-byte hash prefix to divide the chunks among 256 different directories. (Oddly, with the previous storage parameters, it was two-level.)

So, as of now, with 93,000 chunks written, each directory has 363 chunks on average, and growing.

This may be one source of the slowdown, because as those directories grow, listing them to check whether a chunk already exists is going to get more and more expensive.

It would seem that, as things grow, adding one or two more levels to the directory structure, to take advantage of the filesystem’s B+ tree, would help.
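
To make the nesting idea concrete, here is my reading of how a chunk’s hash maps to a path on the storage - illustrative only, since the exact layout is whatever Duplicacy actually does:

    // Sketch: derive a chunk's storage path from its hex hash, peeling off
    // one byte (two hex characters) of the hash per directory level.
    package main

    import "fmt"

    func chunkPath(hexHash string, levels int) string {
        path := "chunks"
        for i := 0; i < levels; i++ {
            path += "/" + hexHash[2*i:2*i+2]
        }
        return path + "/" + hexHash[2*levels:]
    }

    func main() {
        h := "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
        fmt.Println(chunkPath(h, 1)) // 256 directories, ~363 chunks each in my case
        fmt.Println(chunkPath(h, 2)) // 65,536 directories, a few chunks each
    }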

Now, I can live with 50MB/s - at that rate, my backup will take 3 days on the faster iMac.

However, if this creeps down toward what it was with my old storage directory, at <10MB/s, it becomes almost unusable. I will keep an eye on it and report back.

Morgan

Setting min-chunk-size, max-chunk-size, and average-chunk-size to the same value activates the fixed-size chunking algorithm: a file larger than the chunk size is split into chunks of that size, while a file smaller than the chunk size is not combined with other files. Therefore, there will be too many chunks if there are many small files. This is usually not a problem for virtual machines, but as you can see it doesn’t work well in the general case.
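
As a simplified illustration (not the actual implementation):

    // Fixed-size chunking: every file is cut into chunkSize pieces, and a
    // file smaller than chunkSize still produces its own single chunk -
    // it is never packed together with other files.
    package main

    import "fmt"

    func fixedChunks(file []byte, chunkSize int) [][]byte {
        var chunks [][]byte
        for len(file) > chunkSize {
            chunks = append(chunks, file[:chunkSize])
            file = file[chunkSize:]
        }
        return append(chunks, file) // the tail, or the whole small file
    }

    func main() {
        const chunkSize = 1048576 // 1M, as in the Vertical Backup config above
        vmDisk := make([]byte, 10*chunkSize)
        smallFile := make([]byte, 1000)
        fmt.Println(len(fixedChunks(vmDisk, chunkSize)))    // 10 chunks
        fmt.Println(len(fixedChunks(smallFile, chunkSize))) // 1 chunk per small file
    }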

So the recommendation is not to mix virtual machine files with other types of files in the same storage. They should be backed up to storages initialized differently.

Another factor may be "fixed-nesting": true, which enables the new chunk directory structure. This effectively reduces the number of file system calls, so it may contribute to the better performance.

Difference in speed depending on processor

Right. The chunk maker needs to run the variable-size chunking algorithm and then two hash operations on each chunk (one for the chunk hash and one for the file hash). These are very computation-intensive, but currently they run in a single thread, so this is something that could be improved by using multiple threads.
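
As a rough illustration of the per-chunk cost (plain SHA-256 here for simplicity; the real hashes are keyed, as the hash-key in the configs above suggests):

    // Each chunk is hashed on its own, and its bytes also feed a running
    // whole-file hash, so every byte passes through two hash computations.
    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    func main() {
        chunks := [][]byte{[]byte("chunk one"), []byte("chunk two")}
        fileHash := sha256.New()
        for _, c := range chunks {
            chunkHash := sha256.Sum256(c) // hash #1: identifies the chunk
            fileHash.Write(c)             // hash #2: accumulates the file hash
            fmt.Printf("chunk: %x\n", chunkHash[:8])
        }
        fmt.Printf("file:  %x\n", fileHash.Sum(nil)[:8])
    }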

So, as of now, with 93,000 chunks written, each directory has 363 chunks on average, and growing.

I don’t know if a directory of 363 files can make a noticeable difference in write performance, but if you want to try, you can increase the nesting level to 2 by creating a file named nesting next to the config file on the storage, with the following content:

{
    "read-levels": [2, 1],
    "write-level": 2
}

If this file is created correctly, you will see the following message when running duplicacy -d list:

Storage set to /Users/gchen/storage1
Chunk read levels: [2 1], write level: 2
...

However, due to a bug that was only fixed after the 2.0.10 release, you need to build from the latest source for this to work.

Thanks for the explanation.

Can “fixed-nesting” be turned on after a backup has been placed in storage, or is this a parameter that requires starting with a fresh storage?

So the recommendation is not to mix virtual machine files with other types of files in the same storage. They should be backed up to storages initialized differently.

This is really good to know now… but as far as I know it was never mentioned over on the Vertical Backup forums when you suggested using Duplicacy to manage Vertical Backup storage. I was using the same storage for both, to take advantage of deduplication, as is generally recommended with these programs. So please, please add that somewhere in the docs, so that other users don’t waste months struggling with this as I have.

I know that as the author you immediately recognize what each of the parameters does (I’m referring to the chunk-size example above - I hadn’t realized that Vertical Backup chose a fixed chunk size whereas Duplicacy didn’t). We users (even ones with a programming background) do not. These kinds of cautions, in big boldface, can sometimes save a lot of time.

Anyway, the two-level chunk write level must be active by default for Vertical Backup? It is for my Vertical Backup storage, but not for my Duplicacy one.

Morgan

Sorry for not mentioning this anywhere in the guide. We definitely need a user guide for users who run both Vertical Backup and Duplicacy.

This "fixed-nesting’ option and the nesting file are only recognized by Duplicacy (the latest version on the master branch), not by Vertical Backup.

The “fixed-nesting” option is automatically enabled when you use Duplicacy 2.0.10 to initialize the storage. It may be possible to turn it on for a storage initialized by an earlier version, by replacing the existing config file with a compatible one and then creating a nesting file that honors the existing read level; however, I haven’t tested this myself.


Thanks…

The backup on the faster machine stabilized around 63MB/s overnight, so that is reasonable. It is using the fixed-nesting feature, so perhaps my theory about large directories was wrong. Someday, if I have time, I’ll directly test 2- or 3-level nesting against this single-level nesting. But for now, I just want this to complete so I can move on to the next phase…

Morgan

Having revisited this recently, I want to clear up any confusion this post may have caused, because at the time I was using the fixed-chunk algorithm, similar to that of Vertical Backup.

My latest experiments using Duplicacy to a local USB3-attached HD clearly show that more threads, up to about 3-4, do improve speed. I don’t have the specific numbers written down, but I did a series of tests: 2 threads was faster than 1, and 3 was faster than 2. At about 4, it seemed to stop going any faster.

Also note that a limiting factor is compression. I believe the compression is still done in a single thread - no matter how many threads are selected for uploads - and on my machines that always ends up being the thing that limits speed, i.e. a single core hits 100%. On my machines - most of which are several years old - I seem to top out at about 40MB/s (source side) no matter what I do; that’s where compression maxes out a single core of an older 4GHz i7.

So for anyone wanting to improve speeds - even to local disks, the bottom line so far:

  1. A few threads seem better than 1 - as long as a single core of the processor is fast enough to compress the data at that rate, and
  2. If you need faster performance, get a faster processor.

I hope that helps.


Thanks for sharing this. I can confirm your findings: backing up from HDD1 to HDD2 in the same computer is fastest when using 4 threads, at least on Windows.

I have tried both 2.5-inch and 3.5-inch HDDs, with the same result.

Because of this, @gchen, I suggest you change the default number of threads in the web UI from 1 to 4 and see how everybody feels. (Of course, if anyone has already customised the number of threads, you shouldn’t change anything.)
