Cannot "check -chunks" -- fatal error: concurrent map read and map write

I tried twice to run duplicacy check -chunks -threads 4 with Dropbox as my backend, but it does not seem to be working. Both times, after a few hundred thousand chunks, duplicacy gave up. When it fails this way, the verified_chunks file is not written, so I cannot pick up where I left off and am making no progress toward my goal of verifying all chunks.

I don’t know anything about Golang, but the output suggests to me there is a concurrency bug in Duplicacy. error.txt (32.7 KB)

Here is a snippet:

Verified chunk 06a1e399f234f68a1c05ad86178e4697b96c70472cd89024ef5bacae12d57d66 (591801/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Verified chunk 01f59a3a8977f93a2c807bf304f22f6b06abff7b3956d841f002300574302b14 (591802/3267815), 21.96MB/s 7 days 01:14:04 18.1%
Verified chunk 7c2f487864da3fab921d9c70c98fb041c8ef08de3e8a5ce87f198d14a03e6119 (591803/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Failed to download the chunk 117a57f2a70fed03aa255fb24fb2c5155730893bc4020a819455794732a0e58d: read tcp [2600:1700:8fd0:608f::d4c]:
55026->[2620:100:6017:14::a27d:20e]:443: read: connection reset by peer
Verified chunk 6543a5aca7ca8efea79ec518f29256aa96f33effb22b5a8ee01739c1fcf98831 (591804/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Verified chunk 5e2daf8a553ab70f06faad1c79a2b134b5d9d3581d2c464119edc183f27845d7 (591805/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Verified chunk 977705a6df268f37dfca6d7e7e3b7859df58d5f3af0d541065e05747712f3ab1 (591806/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Verified chunk 6ab616e0c35a1bea5fb01109988d34690325e2c0bc1f8950b6f094437614f882 (591807/3267815), 21.96MB/s 7 days 01:14:03 18.1%
Verified chunk 313c308327c2f7169c7d37fc1a2217840cbde5a23cfe84d1b7cbe7d765725bc1 (591808/3267815), 21.96MB/s 7 days 01:14:03 18.1%
fatal error: concurrent map read and map write
Verified chunk f35c7e323da808cca13e97dcc1f6965cfb8a68fab240e31bb557f72a3ad49462 (591809/3267815), 21.96MB/s 7 days 01:14:03 18.1%

goroutine 80 [running]:
runtime.throw(0x136de59, 0x21)
        /usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc001922da0 sp=0xc001922d70 pc=0x437f32
runtime.mapaccess2(0x1195a20, 0xc034b38000, 0xc002598320, 0xc002598320, 0xc0474f1f38)
        /usr/local/go/src/runtime/map.go:469 +0x255 fp=0xc001922de0 sp=0xc001922da0 pc=0x40f7f5
reflect.mapaccess(0x1195a20, 0xc034b38000, 0xc002598320, 0x135e83c)
        /usr/local/go/src/runtime/map.go:1318 +0x3f fp=0xc001922e18 sp=0xc001922de0 pc=0x46b3bf
reflect.Value.MapIndex(0x1195a20, 0xc034b38000, 0x15, 0x1125680, 0xc002598320, 0x98, 0x1124b80, 0xc005ded6f8, 0x86)
        /usr/local/go/src/reflect/value.go:1189 +0x16e fp=0xc001922e90 sp=0xc001922e18 pc=0x4a710e
encoding/json.mapEncoder.encode(0x139f6a8, 0xc005470000, 0x1195a20, 0xc034b38000, 0x15, 0x1190100)
        /usr/local/go/src/encoding/json/encode.go:813 +0x317 fp=0xc001923038 sp=0xc001922e90 pc=0x527557
encoding/json.mapEncoder.encode-fm(0xc005470000, 0x1195a20, 0xc034b38000, 0x15, 0x1c40100)
        /usr/local/go/src/encoding/json/encode.go:779 +0x6d fp=0xc001923078 sp=0xc001923038 pc=0x53392d
encoding/json.(*encodeState).reflectValue(0xc005470000, 0x1195a20, 0xc034b38000, 0x15, 0xc001920100)
        /usr/local/go/src/encoding/json/encode.go:360 +0x82 fp=0xc0019230b0 sp=0xc001923078 pc=0x524662
encoding/json.(*encodeState).marshal(0xc005470000, 0x1195a20, 0xc034b38000, 0x100, 0x0, 0x0)
        /usr/local/go/src/encoding/json/encode.go:332 +0xf9 fp=0xc001923110 sp=0xc0019230b0 pc=0x524259
encoding/json.Marshal(0x1195a20, 0xc034b38000, 0xc004636024, 0x180, 0xca, 0x0, 0x5d)
        /usr/local/go/src/encoding/json/encode.go:161 +0x52 fp=0xc001923188 sp=0xc001923110 pc=0x5236d2
*SnapshotManager).CheckSnapshots.func2()
        /Users/gchen/zincbox/duplicacy/src/duplicacy_snapshotmanager.go:1024 +0x9e fp=0xc001923248 sp=0xc001923188 pc=0x104dc7e
        /Users/gchen/zincbox/duplicacy/src/duplicacy_log.go:229 +0x7a fp=0xc0019232b8 sp=0xc001923248 pc=0x10075da
runtime.call16(0x0, 0x139fc80, 0x0, 0x0)
        /usr/local/go/src/runtime/asm_amd64.s:550 +0x3e fp=0xc0019232d8 sp=0xc0019232b8 pc=0x47035e
runtime.reflectcallSave(0xc001923410, 0x139fc80, 0x0, 0x0)
        /usr/local/go/src/runtime/panic.go:877 +0x58 fp=0xc001923308 sp=0xc0019232d8 pc=0x4374f8
runtime.runOpenDeferFrame(0xc000375980, 0xc01d21e410, 0x0)
        /usr/local/go/src/runtime/panic.go:851 +0x62d fp=0xc001923390 sp=0xc001923308 pc=0x4371cd
panic(0x12330c0, 0xc0022eb380)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9 fp=0xc001923458 sp=0xc001923390 pc=0x437779
(0x1353537, 0xe, 0x1370cd6, 0x23, 0xc001923970, 0x2, 0x2)
        /Users/gchen/zincbox/duplicacy/src/duplicacy_log.go:180 +0x93d fp=0xc0019235e8 sp=0xc001923458 pc=0x100725d
(0x1353537, 0xe, 0x1370cd6, 0x23, 0xc001923970, 0x2, 0x2)
        /Users/gchen/zincbox/duplicacy/src/duplicacy_log.go:114 +0xe8 fp=0xc001923638 sp=0xc0019235e8 pc=0x1006908
*ChunkOperator).DownloadChunk(0xc0002a03c0, 0x0, 0x0, 0xc005ed4900, 0x40, 0xc019fc4440, 0x20, 0x907bc, 0x0, 0x0, ...)
        /Users/gchen/zincbox/duplicacy/src/duplicacy_chunkoperator.go:415 +0x4ea fp=0xc001923c18 sp=0xc001923638 pc=0xfd722a
*ChunkOperator).Run(0xc0002a03c0, 0x0, 0x0, 0xc005ed4900, 0x40, 0xc019fc4440, 0x20, 0x907bc, 0x0, 0x0, ...)
        /Users/gchen/zincbox/duplicacy/src/duplicacy_chunkoperator.go:184 +0x1665 fp=0xc001923e70 sp=0xc001923c18 pc=0xfd6ce5
(0x0)
        /Users/gchen/zincbox/duplicacy/src/duplicacy_chunkoperator.go:92 +0xde fp=0xc001923fd0 sp=0xc001923e70 pc=0x104b61e
        /usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc001923fd8 sp=0xc001923fd0 pc=0x471c81
created by
        /Users/gchen/zincbox/duplicacy/src/duplicacy_chunkoperator.go:87 +0x1ba

For now I’ll try using one thread and see if that works, but this is quite painful because it’ll take weeks to run.

This is an extremely annoying issue, and it isn’t just related to Dropbox, bad chunks, or crashed clients.

I discovered this only recently: if you stop a check -chunks job in the Web UI, or with Ctrl-C in the CLI, it doesn’t bother to write verified_chunks, wasting hours of time.

@gchen Is there any way to change this behaviour?

In OP’s case this is clearly a client crash, but on a clean exit can it at least write what it has checked so far to verified_chunks? And for crashes like OP’s, could Duplicacy not write the file every 1 or 5 minutes, say? This should be easy to implement.

Your failure is actually one of the chunks failing to download, with the client exiting regardless. For some reason the final error logging throws another error (the concurrency one), but at that point it is no longer relevant. Using -persist will probably fix your check -chunks issue, at least the part where it does not continue on errors.

I don’t think the failed chunk download is the ultimate problem. Chunks commonly fail to download due to a variety of transient errors, and duplicacy does not usually give up after a single failure. The crash/exit here looks like a bug that may be related to or triggered by the failed download, but a failed download should not crash the client.

-persist says: continue processing despite chunk errors, reporting any affected (corrupted) files. This may or may not get me past my issue; I think it will not, because duplicacy did not stop because of a corrupted chunk. I suspect the bug will still cause it to crash.

This isn’t speculation; I am telling you what the stack trace shows. What you do with this information is up to you.

Can you point me to the part of the stack trace that shows duplicacy is giving up because a chunk did not download?

Duplicacy is supposed to retry in the event of a failure like this. Usually, it works. It did not work this time.

This is LOG_WERROR(operator.allowFailures, "DOWNLOAD_CHUNK", "Failed to download the chunk %s: %v", chunkID, err)

Because you’re not running with -persist, LOG_WERROR stops execution; this is a primary reason why -persist was added.
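For anyone unfamiliar with the pattern, here is a rough Go sketch of how a LOG_WERROR-style helper branches on allowFailures. The function below is illustrative only, not Duplicacy’s implementation (the real one aborts via the logging machinery rather than returning a bool).

```go
package main

import "fmt"

// logWError logs a warning when failures are allowed (the -persist
// case) and returns true so the caller keeps going; otherwise it logs
// an error and returns false, meaning the caller must stop.
func logWError(allowFailures bool, logID, format string, args ...interface{}) bool {
	msg := fmt.Sprintf(format, args...)
	if allowFailures {
		fmt.Printf("WARN %s %s\n", logID, msg) // -persist: warn and continue
		return true
	}
	fmt.Printf("ERROR %s %s\n", logID, msg) // no -persist: fatal to the run
	return false
}

func main() {
	ok := logWError(false, "DOWNLOAD_CHUNK",
		"Failed to download the chunk %s: %v", "117a5...", "connection reset by peer")
	fmt.Println("continue:", ok)
}
```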

This is probably part of the error handling for your reset connection:

Failed to download the chunk 117a57f2a70fed03aa255fb24fb2c5155730893bc4020a819455794732a0e58d: read tcp [2600:1700:8fd0:608f::d4c]: 55026->[2620:100:6017:14::a27d:20e]:443: read: connection reset by peer

This is not a graceful failure, and I am pretty sure Duplicacy doesn’t retry on errors like that.


Thanks! Are you referring to this line? duplicacy/duplicacy_chunkoperator.go at bc2d762e41c5ce11be01bffe32b39edc4d5b2eaa · gilbertchen/duplicacy · GitHub

If so, then I think there are two improvements that can be made here:

  1. A network failure like this should be a retryable failure, like the ones handled in the other branch of the if/else.
  2. That still doesn’t explain the apparent crash and the fatal error: concurrent map read and map write.

In that case, I retract my statement and I think -persist will help in my case. Thank you!

Well, that’s a funny one. It fails while doing exactly what you wanted it to do: writing verified_chunks on error. Except that writing this file on the error path doesn’t lock verifiedChunks with a mutex (as all the regular worker threads do), so most likely one of your threads failed, error-handling mode kicked in and tried to save verified_chunks, but other worker threads were still writing to the map, hence another panic.

Oh no… maybe that is also what prevents Ctrl-C from writing out the file. I get the same error with Ctrl-C. @gchen, what do you think?

@sevimo is right. I forgot to lock verifiedChunks, the map that contains all verified chunks, before reading it (for saving it to a file).
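A minimal Go sketch of the fix: take the same lock the worker threads use before json.Marshal reads the map. The names verifiedChunks and verifiedChunksLock follow this thread; the surrounding struct and signatures are simplified for illustration, not Duplicacy’s actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

type SnapshotManager struct {
	verifiedChunksLock sync.Mutex
	verifiedChunks     map[string]int64 // chunk ID -> verification timestamp
}

// Worker threads call this as they verify chunks.
func (m *SnapshotManager) addVerifiedChunk(chunkID string, when int64) {
	m.verifiedChunksLock.Lock()
	defer m.verifiedChunksLock.Unlock()
	m.verifiedChunks[chunkID] = when
}

// saveVerifiedChunks must hold the lock too: json.Marshal iterates the
// map, and an unlocked read racing with a worker's write is exactly the
// "fatal error: concurrent map read and map write" seen above.
func (m *SnapshotManager) saveVerifiedChunks() ([]byte, error) {
	m.verifiedChunksLock.Lock()
	defer m.verifiedChunksLock.Unlock()
	return json.Marshal(m.verifiedChunks)
}

func main() {
	m := &SnapshotManager{verifiedChunks: map[string]int64{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // writers racing with the save below
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				m.addVerifiedChunk(fmt.Sprintf("chunk-%d-%d", id, j), int64(j))
			}
		}(i)
	}
	if _, err := m.saveVerifiedChunks(); err != nil { // safe even mid-write
		panic(err)
	}
	wg.Wait()
	fmt.Println("chunks recorded:", len(m.verifiedChunks))
}
```

With the unlocked version, running this under the race detector (or just repeatedly) reproduces the same fatal error as in the original report.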

Thanks for confirming! What do you think about also making the connection reset by peer error retryable?

Pull request for verifiedChunksLock: Acquire verifiedChunksLock in saveVerifiedChunks by northnose · Pull Request #649 · gilbertchen/duplicacy · GitHub


I ran into the exact same connection reset by peer just now, and with the PR mentioned above applied, duplicacy was able to write out verified_chunks successfully.