Chunks/other files with identical names created with Google Drive as the backend

baip · 14 November 2020 20:58

Please describe what you are doing to trigger the bug:
I’ve been using Duplicacy with Google Drive as the backend. When I tried to use rclone to copy the remote repository to another place, it complained about duplicate file names. This is when I found out about the problem.

Please describe what you expect to happen (but doesn’t):
Each chunk / revision file has a unique name.

Please describe what actually happens (the wrong behaviour):
Duplicate files are not created on every backup, so this may be correlated with network problems. Because the problem is only occasional, I had already deleted the cron job log files from the runs where it occurred.

Some duplicate files have identical MD5 hashes, so I can use rclone dedupe to remove the (identical) duplicates. Others have identical names but different MD5 hashes. This is what confuses me and makes it impossible to move the remote storage with confidence: I am not sure whether I should keep the older or the newer file (the newer file would make more sense if, say, the duplicate is due to Duplicacy retrying a failed operation).

    $ rclone dedupe GoogleDrive1:duplicacy Remote2:duplicacy
    2020/11/14 15:56:26 NOTICE: chunks/64/83e08e3ed30ac6ec83ad385957b7ef5e4b6af5733116a8eda66b0d9c3793dd: Found 2 duplicates - deleting identical copies
    chunks/64/83e08e3ed30ac6ec83ad385957b7ef5e4b6af5733116a8eda66b0d9c3793dd: 2 duplicates remain
      1:      3399462 bytes, 2020-06-27 15:17:49.693000000, MD5 072f233da8842048739f2d98923ec841
      2:      3399462 bytes, 2020-06-27 15:17:49.334000000, MD5 ffda4e78df88b4a3de1312413692d1f6
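
To make the two cases concrete, here is a rough Python sketch that splits duplicated names into "same MD5" and "different MD5" groups. It assumes you already have a list of (path, MD5) pairs from some listing of the remote; the function name and the `example-` entries are made up for illustration, and only the first two entries come from the log above.

```python
from collections import defaultdict


def classify_duplicates(entries):
    """Group (path, md5) pairs by path and split duplicated paths into
    byte-identical copies and copies whose content differs."""
    by_path = defaultdict(list)
    for path, md5 in entries:
        by_path[path].append(md5)

    identical, differing = [], []
    for path, hashes in by_path.items():
        if len(hashes) < 2:
            continue  # unique name, nothing to classify
        if len(set(hashes)) == 1:
            identical.append(path)  # every copy has the same MD5
        else:
            differing.append(path)  # same name, different content
    return sorted(identical), sorted(differing)


CHUNK = "chunks/64/83e08e3ed30ac6ec83ad385957b7ef5e4b6af5733116a8eda66b0d9c3793dd"
entries = [
    (CHUNK, "072f233da8842048739f2d98923ec841"),  # from the log above
    (CHUNK, "ffda4e78df88b4a3de1312413692d1f6"),  # from the log above
    ("chunks/aa/example-identical", "0" * 32),    # made-up entry
    ("chunks/aa/example-identical", "0" * 32),    # made-up entry
    ("chunks/bb/example-unique", "1" * 32),       # made-up entry
]

identical, differing = classify_duplicates(entries)
print("identical copies:", identical)
print("differing copies:", differing)
```

Only the "differing" group is the one I am unsure about.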

towerbr · 14 November 2020 21:48

Google Drive allows multiple files with the same name in one folder, and this always causes headaches.

But to solve your problem, just use duplicacy copy instead of rclone copy.

baip · 14 November 2020 22:03

Thanks for the reminder about the duplicacy copy command! Two clarifying questions: 1) if copying between two Google Drive accounts, will this perform a server-side copy as rclone --drive-server-side-across-configs sync does? and 2) will it still work when copying to a remote that does not support duplicate names?

Also, what is the root cause of duplicates with different hash values? Is there anything we can do to minimize them?

towerbr · 15 November 2020 12:33

Nope. First of all, it depends mainly on the configuration of the two storages (encrypted / unencrypted, copy-compatible, bit-identical, etc.), but AFAIK it will download chunks from the first storage and re-upload them to the new storage, even if the storages are bit-identical (@gchen, can you confirm?).

I think the copy command will “get” only the necessary chunks, but I really don’t know; I have never tested it (@gchen, help again).

baip · 15 November 2020 14:32

But the question is: how would Duplicacy know which of two identically named chunk files to copy? Does it use MD5 hash values in addition to file names to select the correct chunk files?

Droolio · 15 November 2020 19:30

duplicacy copy will intelligently copy only the necessary chunks, whether you use -bit-identical or not.

The purpose of -bit-identical is so you can use third-party tools, like rsync and Rclone, to synchronise backup storage based on filename.

Basically, the flag copies the ID Key (used to generate the chunk ID) and all the other encryption keys stored in the config file from the original storage to the second storage. Without it, those keys are unique to each storage. Chunk IDs are still deterministic, though: when copying from one storage to another (bit-identical or not), the chunk IDs for the destination are computed first, before deciding whether to upload a chunk.

See: Encryption of the storage
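
A rough sketch of why that works, assuming the chunk ID is a keyed hash (HMAC-SHA256 here) of the chunk's hash under the storage's ID Key. The function name and the fixed keys are hypothetical and this is not Duplicacy's actual code; it only illustrates that the same key gives the same name, while a different key gives a different name.

```python
import hashlib
import hmac


def chunk_id(id_key: bytes, chunk_hash: bytes) -> str:
    """Derive a chunk file name from the chunk hash using the storage's ID Key.
    Assumption: HMAC-SHA256; hypothetical helper, not Duplicacy's real code."""
    return hmac.new(id_key, chunk_hash, hashlib.sha256).hexdigest()


# Stand-in for the hash of one chunk's content.
chunk_hash = hashlib.sha256(b"some chunk content").digest()

id_key_source = b"A" * 32       # ID Key of the original storage (made up)
id_key_bit_identical = b"A" * 32  # same key, as if copied by -bit-identical
id_key_independent = b"B" * 32  # a storage with its own ID Key (made up)

name_source = chunk_id(id_key_source, chunk_hash)
name_bit_identical = chunk_id(id_key_bit_identical, chunk_hash)
name_independent = chunk_id(id_key_independent, chunk_hash)

print(name_source == name_bit_identical)  # same key: file names match
print(name_source == name_independent)    # different key: file names differ
```

With matching names, a filename-based tool can sync the two storages. With different keys, the destination name has to be recomputed from the chunk hash, which is something only the copy command can do.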

Bear in mind, if you use a third-party tool, you miss out on the ability to choose subsets of repositories or snapshots. With duplicacy copy, it’ll still do it incrementally, but you can pick individual ids and revisions.

gchen · 16 November 2020 02:41

Every time a chunk is encrypted the encrypted content will be different, because a random Initialization Vector is generated for the AES-GCM encryption algorithm. So it is perfectly fine for two chunk files with the same name to have different content, if they are uploaded in parallel.
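
As an illustration only, here is a toy Python sketch of the effect. It does not use AES-GCM: the cipher is a stand-in keystream built from SHA-256 with no authentication, and all names are hypothetical. It only shows that a fresh random nonce per encryption makes two uploads of the same chunk differ in content (and MD5) while having the same size and decrypting to the same data.

```python
import hashlib
import os

NONCE_LEN = 12  # same length as a typical GCM nonce; the cipher itself is a toy


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with a fresh random nonce, stored in front of the ciphertext."""
    nonce = os.urandom(NONCE_LEN)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    """Split off the nonce, regenerate the keystream, and undo the XOR."""
    nonce, body = blob[:NONCE_LEN], blob[NONCE_LEN:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


key = b"k" * 32                    # made-up key
chunk = b"the same chunk content"  # made-up chunk data

upload_1 = toy_encrypt(key, chunk)
upload_2 = toy_encrypt(key, chunk)

print("same size:   ", len(upload_1) == len(upload_2))
print("same MD5:    ", hashlib.md5(upload_1).digest() == hashlib.md5(upload_2).digest())
print("same content:", toy_decrypt(key, upload_1) == toy_decrypt(key, upload_2) == chunk)
```

This matches the listing in the first post, where both copies have the same byte count but different MD5 values.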

The copy command should copy over only one file and ignore the other. You can also manually delete either one.

baip · 16 November 2020 03:01

Thanks for all the replies and the good information!

This is very reassuring: it means I can use rclone dedupe --dedupe-mode newest GoogleDrive1:duplicacy to remove the duplicate files and then complete the rclone sync. In the future, I’ll remember duplicacy copy before doing any migration.

towerbr · 16 November 2020 18:38

Wouldn’t it be easier to execute duplicacy copy and - after checking the integrity of the new storage - simply delete the old storage?

baip · 16 November 2020 18:57

I have already transferred a lot of files using rclone sync, so I want to complete it. Will duplicacy copy reuse files copied by rclone? Since I’m transferring between Google Drive accounts, an additional benefit of using rclone is server-side copying, which matters for very large datasets.

towerbr · 16 November 2020 20:24

Sure! But the server-side operation is an advantage in this case. I understand your strategy.