Duplicacy keeps stalling

I’m looking into using duplicacy + wasabi as a replacement for CrashPlan. I started with a small directory of about 2 GB, which backed up fine, and then added a larger directory of about 500 GB.

The first time I tried backing up this larger directory, duplicacy stalled after about 1400 chunks. There was no crash; it just stopped printing the output indicating that it was uploading chunks and stayed like that for at least half an hour. I aborted it, restarted it, and went to bed.

This morning I saw it had stopped after 4096 chunks. From the wasabi dashboard I could tell that the newest chunk had been uploaded about 4.5 hours before I checked, so it had been stalled for that long.

Duplicacy version is 2.0.7. OS is FreeBSD 11.0-STABLE.

It’s actually a FreeNAS box and I have duplicacy running in a jail. One of the things that I found appealing about duplicacy is that it has native binaries for FreeBSD. CrashPlan was not easy to get going on FreeNAS because the only binaries you can use are Linux binaries.

I am now running it again with the -debug and -log options. I’ll update this issue when I have more data.

Something that makes this kind of annoying is that when I hit Ctrl-C it doesn’t save the incomplete snapshot, so it has to rescan everything. I know the incomplete snapshot functionality is there because I have seen it working; now I’m scratching my head trying to figure out where. I think it was in another jail on this machine. I don’t know why it would have worked before but not now. Maybe whatever issue is causing the wasabi upload to hang is also preventing Ctrl-C from saving the incomplete snapshot.

I saw on another issue that you said in this case you have to “force a crash” and send a stack trace, but I don’t know how to do that. I’ll gladly do it if you give me a little guidance.


This might be new … the program stopped.

2017-09-05 11:34:14.644 INFO UPLOAD_PROGRESS Uploaded chunk 6796 size 3436484, 3.95MB/s 1 day 09:51:25 6.5%
2017-09-05 11:34:15.494 ERROR UPLOAD_CHUNK Failed to upload the chunk 1e7687fa046669d7dcf20956cc0baa1c9bdc4cf50d5654c50528771b5285fddd: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 0756E4C61BDEA13A, host id: JT6/IBwG98s1mQdou1/lS/78m5GH62T0UWWUSJ4etb0zIJiR2Yfb6xz06yl2b31qkwzLpCF9gyde

Was this a new error, or the same error that caused the process to hang? Is it reproducible?

You can also try s3c:// instead of s3://. The default s3 backend uses https://github.com/aws/aws-sdk-go, while s3c uses https://github.com/goamz/goamz.
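The URL format is the same for both; only the scheme changes. For Wasabi it would look something like this (the bucket name and path below are placeholders, and double-check the region/endpoint for your account):

    s3://us-east-1@s3.wasabisys.com/my-bucket/path       (default backend, aws-sdk-go)
    s3c://us-east-1@s3.wasabisys.com/my-bucket/path      (alternative backend, goamz)

You should be able to switch an existing storage by editing the storage URL in the .duplicacy/preferences file.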

There was no error printed when it hung before, but I wasn’t using -log or -debug the first couple of times.

I restarted the process and let it run to chunk ~9000 or so. It was still working fine at that point, but I interrupted it to see whether it would save the incomplete snapshot while things were healthy, or whether it would still fail to do so. It still failed to save the incomplete snapshot.

I thought the chunk upload error might essentially be a transmission checksum error, which is somewhat believable because the link to this computer on my LAN should be 1 Gbps but was running at less than 100 Mbps at best, which already had me suspecting the transmission medium was bad. This motivated me to finally troubleshoot that issue properly, and I replaced a faulty network cable.

I fired it up again and I’m going to let it run. I’ll let you know if I continue to have issues.

It happened again.

2017-09-05 21:56:36.347 INFO UPLOAD_PROGRESS Uploaded chunk 12835 size 6500322, 10.18MB/s 12:20:19 12.3%
2017-09-05 21:56:39.768 ERROR UPLOAD_CHUNK Failed to upload the chunk 495da1bf38db9b6c11d7e249762c7fe29e8d7936558eed15f7ac6e29328f8946: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 7ACFB04835F0BAA4, host id: m816YmLlkE0GLISNIhZte5YVJ5htdmNtDj7iQ5lyiLxSDzLcfxT+y1yOd3yDJMapZRplTKNzOi0I

I’m switching to s3c and I’ll see what happens.

s3c seems much slower than s3. The process of scanning through the first part of the backup (the part where the chunks are already uploaded) is going at ~12 MBps, whereas it was >35 MBps before.

Another user reported exactly the same error: https://github.com/gilbertchen/duplicacy/issues/160

Yeah, that’s the error I’m getting for sure. Maybe the issue is with Wasabi. I’m also considering Backblaze, but I’m going to bull ahead with this one for now. Wasabi seems to have more sophisticated access controls, which would be valuable for a use case I’m considering for work.

I didn’t want to let it keep going with the s3c storage, just because it seemed way slower than the s3 storage. It seems odd that it would make a material difference, but it definitely does, at least for the case where the chunk doesn’t need uploading. Once it starts uploading new chunks the upload calls probably begin to dominate.

I downloaded the source code and hacked in some extremely crude retry logic in the UploadFile method of the s3 storage. It’ll be sufficient for my purposes if it works.
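In case it’s useful, the idea is roughly this; the sketch below uses placeholder names and is not the actual duplicacy code, just the shape of the retry wrapper I bolted on:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // retryUpload retries the underlying upload a few times with a simple
    // linear backoff, printing a line whenever it has to retry so the log
    // shows whether the retry path ever fires.
    func retryUpload(upload func() error, maxRetries int) error {
        var err error
        for attempt := 0; attempt <= maxRetries; attempt++ {
            if err = upload(); err == nil {
                return nil
            }
            fmt.Printf("upload failed (attempt %d of %d): %v\n", attempt+1, maxRetries+1, err)
            time.Sleep(time.Duration(attempt+1) * time.Second)
        }
        return err
    }

    func main() {
        // Fake upload that fails twice and then succeeds, to exercise the wrapper.
        calls := 0
        err := retryUpload(func() error {
            calls++
            if calls < 3 {
                return errors.New("XAmzContentSHA256Mismatch")
            }
            return nil
        }, 3)
        fmt.Println("final result:", err)
    }

It just wraps the upload call itself, so the only log noise it adds is when a retry actually happens.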

I checked in an experimental fix: https://github.com/gilbertchen/duplicacy/issues/160

Please let me know if it works or not.

My backup has been running for a while now with my retry code in place. I’m over 68000 chunks now with no further issues. This is with code I cloned before you committed that fix (197d20f0e07284c9c8a35d9d0800b366d53b2f18). The thing is, it isn’t exercising my retry code at all: I added print statements in the retry path, and it hasn’t had to retry once with my build.

Since the furthest I got before hitting the error was about 4000 chunks, and I’ve now gone about 45000 without hitting it once, it looks like it has been fixed without any code changes. It seems one of the following is the case:

  1. The core issue was on the Wasabi side and they fixed it, or it was transient in some other way.
  2. I could be using different revisions of library code than you because I ran “go get” later.
  3. Maybe there’s something about my build environment (Go version?).
  4. The problem could still be there and I might just have had a very lucky streak.

Option 2 seems most likely to me, but I’ve only used Go on one small project at work and I might be confused about how go get works. If you pin the revisions it fetches, I’d be getting the same code as you. I have aws-sdk-go version 1.10.39.

I tend to believe 1 is the case. My aws-sdk-go version is 1.8.39, but I don’t see any changes since that version that could have fixed the issue.

It is interesting to see that the error comes from minio: https://github.com/minio/minio/blob/0d154871d50d9a88511b383bc4c96581bc46b659/cmd/api-errors.go#L575
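For reference, under AWS Signature V4 that header is just the hex-encoded SHA-256 of the request body for a normal signed PUT, so a mismatch means the payload the server received doesn’t hash to the value the client computed before sending. A minimal illustration (not duplicacy code):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // contentSHA256 computes the value a Signature V4 client puts in the
    // x-amz-content-sha256 header: the lowercase hex SHA-256 of the payload.
    func contentSHA256(payload []byte) string {
        sum := sha256.Sum256(payload)
        return hex.EncodeToString(sum[:])
    }

    func main() {
        chunk := []byte("example chunk body")
        fmt.Println("x-amz-content-sha256:", contentSHA256(chunk))
    }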

It was easy to test. I interrupted my transfer and started over with 2.0.7 again. Just over 600 chunks in:

2017-09-10 18:27:39.113 ERROR UPLOAD_CHUNK Failed to upload the chunk 8efc559e8803a7d8c054a333929c8f509ea6ca772314ddaf6809402967df68e8: XAmzContentSHA256Mismatch: The provided 'x-amz-content-sha256' header does not match what was computed.
        status code: 400, request id: 18AEDF290E162D59, host id: IMKfSFTr352x6oKqmLw6CTrUThgs7wjLHjVFcZ/ZUaHUWNwEhSZfAtFuWv0hyvMy+Z2xxnRnx3jO
goroutine 14 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc427507900)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:171 +0x13d
panic(0xc64bc0, 0xc427ad3050)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd0f4f4, 0xc, 0xd22597, 0x21, 0xc427507e48, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:154 +0x1ac
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd0f4f4, 0xc, 0xd22597, 0x21, 0xc427507e48, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:97 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Upload(0xc429492780, 0x0, 0xc429492870, 0x15c55, 0x61532c323a6c01)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:139 +0x553
github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Start.func1(0xc429492780, 0x0)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:60 +0x112
created by github.com/gilbertchen/duplicacy/src.(*ChunkUploader).Start
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkuploader.go:65 +0x54

Just to clarify, this is the download from GitHub (your build). I didn’t build it.

I saw that you released 2.0.9 with a later aws-sdk-go. I downloaded that one and I ran the last few thousand chunks with it. It worked. It’s hard to be completely sure because I didn’t have much of the backup left at that point, but it looks like this issue is fixed in 2.0.9.

Thanks.

Now I’m testing restore and I keep getting errors like this.

2017-09-19 08:31:25.444 ERROR UPLOAD_CHUNK Failed to decrypt the chunk a08c93db303e740a47dd6405918179199f2fa656866afb168e055da2da3f3070: cipher: message authentication failed
goroutine 46 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc4202756b8)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:177 +0x13d
panic(0xc6d3a0, 0xc4241ed350)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc420275d40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:160 +0x1bc
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc420275d40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:98 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkDownloader).Download(0xc428c78a00, 0x3, 0x0, 0x26, 0xc423757720, 0x20, 0x1c5622, 0x1, 0x1)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:336 +0x547
github.com/gilbertchen/duplicacy/src.CreateChunkDownloader.func1(0xc428c78a00, 0x3)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:81 +0x161
created by github.com/gilbertchen/duplicacy/src.CreateChunkDownloader
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:86 +0x23f

It’s a different chunk every time.

I’m giving up on Wasabi. Their service is interesting, but all these errors are problematic, and for me upload and download speeds are much, much lower than Backblaze. It doesn’t seem optimized for a backup workload, and it’s not worth it to save $0.0011/GB. I’m just posting this update in case others can benefit from my experience.

This is worrisome. I can’t completely rule out that this is Duplicacy’s fault.

As for the speed, my experience is quite the opposite: Wasabi is about 4x faster than B2, and even faster than Amazon S3. Maybe that is because my test machine is located on the east coast? What is your machine’s location?

If you are finding Wasabi faster, then maybe I’m doing something wrong. For uploads I can easily saturate my 15 Mbps connection with 3 threads. For downloads I couldn’t get it to use much more than 8 Mbps of my 150 Mbps download bandwidth. By contrast, I averaged 7.25 MBps (capital B) over my test restore from Backblaze, and that includes a period where I throttled the bandwidth to 1280 kBps for a couple of hours.

How many threads are you using to get 4x B2 performance, and/or are there other settings I should tune? I left the chunk size at the default.

My machine is in Calgary. It’s closer to the west coast than the east coast, but it doesn’t seem like that should make this much of a difference. Latency will be higher, but latency shouldn’t be a major factor. I did read in Backblaze’s documentation that their API lets you store a file with a single API call while the S3 API requires two, but that still shouldn’t matter much for files averaging 4 MB; transfer time should dominate.

I can increase the bandwidth utilization by adding more threads, but not by much: only up to ~3 MBps. The traffic graph shows that utilization is very spiky. It spikes up to 70 Mbps, which is about where it was going with Backblaze, but it isn’t keeping the pipe full. It should be doing readahead. Or, if it’s file oriented, maybe it brings downloads to a screeching halt while it’s extracting the files.

If I use copy instead of restore, the traffic graph is extremely flat, holding at around 1 MBps no matter how many threads are in play, which is surprising, at least to me.
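For reference, the commands I’m running look roughly like this (the revision number and storage names are placeholders, and there may be other flags worth tuning that I’m not aware of):

    duplicacy restore -r 1 -threads 16 -stats
    duplicacy copy -from default -to wasabi -threads 16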

I used the Linux code base as the backup dataset (https://github.com/gilbertchen/benchmarking).

Here are total running times (12 backups for each storage):

---------------------------- sftp ----------------------------
real	0m45.700s
real	0m6.542s
real	0m22.733s
real	0m3.544s
real	0m23.702s
real	0m6.393s
real	0m21.581s
real	0m22.677s
real	0m27.055s
real	0m26.547s
real	0m30.198s
real	0m31.033s
---------------------------- s3 ----------------------------
real	0m50.356s
real	0m7.203s
real	0m30.001s
real	0m4.991s
real	0m25.225s
real	0m9.322s
real	0m45.123s
real	0m43.682s
real	0m51.769s
real	0m47.605s
real	1m1.010s
real	1m4.767s
---------------------------- wasabi ----------------------------
real	0m34.644s
real	0m5.538s
real	0m16.971s
real	0m3.997s
real	0m20.551s
real	0m5.976s
real	0m21.438s
real	0m22.126s
real	0m28.537s
real	0m29.045s
real	0m34.536s
real	0m35.741s
---------------------------- b2 ----------------------------
real	2m11.109s
real	0m11.306s
real	2m0.587s
real	0m16.727s
real	2m16.777s
real	0m28.311s
real	1m22.846s
real	0m53.006s
real	1m35.601s
real	0m51.963s
real	1m33.964s
real	1m46.014s

The test machine is an Ubuntu 16.10 VM running on an ESXi server with a 200 Mbps uplink. The SFTP server was running on a different VM on the same ESXi host, so Wasabi is even faster than a local SFTP server.

Another test on a VirtualBox image (the other data set in https://github.com/gilbertchen/benchmarking) showed similar results.

Maybe I’m comparing apples and oranges. The first part of my backup consists of about 80,000 very small files; the rest is much larger files. I think performance might improve once I get past this first part, but I haven’t been letting it get that far. If it handles larger files faster, I’m comparing a slow start with a fast finish.

I switched to the s3c provider, cranked up the threads to 128, and let it run. It got about 600 chunks in and then crashed:

2017-09-20 12:50:45.539 ERROR UPLOAD_CHUNK Failed to decrypt the chunk ca9c51caa3e5cae3e48d72b65cd80618b7e27537b51732c54e884efcce9adc88: The storage doesn't seem to be encrypted
goroutine 134 [running]:
runtime/debug.Stack(0x0, 0x0, 0xc42a62d6b8)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
        /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/gilbertchen/duplicacy/src.CatchLogException()
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:177 +0x13d
panic(0xc6d3a0, 0xc42834a900)
        /usr/local/go/src/runtime/panic.go:489 +0x2cf
github.com/gilbertchen/duplicacy/src.logf(0x2, 0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc42a62dd40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:160 +0x1bc
github.com/gilbertchen/duplicacy/src.LOG_ERROR(0xd1868f, 0xc, 0xd2c6f6, 0x22, 0xc42a62dd40, 0x2, 0x2)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_log.go:98 +0x73
github.com/gilbertchen/duplicacy/src.(*ChunkDownloader).Download(0xc42876cf00, 0x47, 0x0, 0x27d, 0xc42abb77e0, 0x20, 0x331d30, 0x1, 0xc4265e4e01)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:336 +0x547
github.com/gilbertchen/duplicacy/src.CreateChunkDownloader.func1(0xc42876cf00, 0x47)
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:81 +0x161
created by github.com/gilbertchen/duplicacy/src.CreateChunkDownloader
        /Users/chgang/zincbox/go/src/github.com/gilbertchen/duplicacy/src/duplicacy_chunkdownloader.go:86 +0x23f

This is an interesting one because it had already successfully restored over 77,000 files at this point. Maybe the issue results from the large number of threads I was using.

I also saw a bunch of messages like this.

2017-09-20 12:50:25.182 WARN CONFIG_CHUNK 190 chunks have been allocated

I suppose these aren’t important? It has to hold the chunks in sequence, and they may arrive out of order when multiple threads are in use, but I don’t know why it would be holding more than 128 chunks.
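For what it’s worth, my mental model is something like the sketch below (purely illustrative, not duplicacy’s actual code): chunks finish downloading out of order, but the data has to be written in order, so completed chunks sit in a buffer until the next-needed one shows up.

    package main

    import "fmt"

    // flushInOrder buffers chunk indexes that complete out of order and
    // "writes" them as soon as the next-needed index is available.
    func flushInOrder(completed <-chan int) {
        pending := map[int]bool{}
        next := 0
        for index := range completed {
            pending[index] = true
            for pending[next] { // write out every chunk we can, in order
                fmt.Println("writing chunk", next)
                delete(pending, next)
                next++
            }
        }
        fmt.Println("chunks still buffered:", len(pending))
    }

    func main() {
        // Chunks finishing out of order: 2 and 3 have to wait for 0 and 1.
        completed := make(chan int, 5)
        for _, i := range []int{2, 3, 0, 1, 4} {
            completed <- i
        }
        close(completed)
        flushInOrder(completed)
    }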

I switched back to s3 and started over with 16 threads. I get the same error, on the same chunk.