'history' command takes an inordinate amount of time?

The history command seems to take a veeeeeeeery long time. My repository has on the order of 1M files (almost all files have only one version) and 7TB of data.

$ time duplicacy_linux_x64_2.6.1 history path/to/file
Storage set to b2://<redacted>
download URL is: https://f001.backblazeb2.com
    170:        24300027 2015-08-02 19:59:46 f38d2dae0ffbec73444acb2749404dd9cbc068c3ea59ca51936d530ac27b34e2 path/to/file
    171:        24300027 2015-08-02 19:59:46 f38d2dae0ffbec73444acb2749404dd9cbc068c3ea59ca51936d530ac27b34e2 path/to/file
    172:        24300027 2015-08-02 19:59:46 f38d2dae0ffbec73444acb2749404dd9cbc068c3ea59ca51936d530ac27b34e2 path/to/file
(many lines omitted)
    206:        24300027 2015-08-02 19:59:46 f38d2dae0ffbec73444acb2749404dd9cbc068c3ea59ca51936d530ac27b34e2 path/to/file
    207:        24300027 2015-08-02 19:59:46 f38d2dae0ffbec73444acb2749404dd9cbc068c3ea59ca51936d530ac27b34e2 path/to/file
current:        24300027 2015-08-02 19:59:46                                                                  path/to/file

real    412m39.805s
user    325m38.754s
sys     4m38.362s

7 hours to get the history of one file that hasn’t changed seems like a very long time! Is this a quirk of the Backblaze B2 backend or a general problem? Also, how much $$ did I just add to my B2 bill by doing that?

Related: is there a better way to find a file or probe the existence of a file?
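
The only workaround I can think of (untested, and assuming I'm reading the list command's options correctly) is to grep the file listing of a single revision instead of walking all of them:

$ duplicacy list -r 207 -files | grep path/to/file

but that still has to fetch the whole file list for that revision, so I'd be interested in a cheaper way.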

Duplicacy needs to download the metadata of every snapshot to find out how a file has changed over time. All downloaded metadata chunks are saved to the .duplicacy/cache directory, so the next time you check the history of the same or a different file it won't need to download those chunks again.
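
If you only care about recent history, you may be able to reduce the work by restricting the revisions with the -r option, assuming history accepts the same revision/range syntax as the other commands (worth confirming with duplicacy history -h first):

$ duplicacy history -r 200-207 path/to/file

Then only the metadata chunks of those revisions have to be downloaded and parsed.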

Thanks for the quick response. I do see lots of files in .duplicacy/cache (80GB worth); however, the cache doesn’t seem to speed up the history command much. I ran the same command twice in succession, with no other operations happening in between, and the second run still took a long time:

real    275m0.218s
user    288m29.637s
sys     1m54.497s

You’ll also notice that a full CPU core was occupied for basically the entire run. (On the first run the wall-clock time was longer, perhaps because it was downloading chunks, but the CPU time was similar.) It sounds like this shouldn’t be happening; am I doing something wrong?

Edited to add: even check -a doesn’t take this long:

$ time duplicacy check -a
Storage set to b2://<redacted>
download URL is: https://f001.backblazeb2.com
Listing all chunks
1 snapshots and 39 revisions
Total chunk size is 8554G in 1828558 chunks
All chunks referenced by snapshot tank at revision 170 exist
All chunks referenced by snapshot tank at revision 171 exist
All chunks referenced by snapshot tank at revision 172 exist
...

real    11m48.802s
user    9m48.020s
sys     0m6.646s

Can you check the memory usage while the history command is running? I first suspected a memory leak similar to the one reported in “Memory consumption grows steadily during most operations”, but it turns out the history command doesn’t leak snapshot contents. So it could be that each snapshot is so big that loading it into memory causes a lot of virtual memory swapping.

The check command is faster because it doesn’t load the snapshot content.

Swapping isn’t the issue: this machine has no swap space configured, so the process would simply die if it ran out of memory.

I ran the history command again (on the same file as above, which has never been modified) while running free in a once-per-second loop. The “used” number was 740MB before duplicacy started, climbed to ~8.5GB within the first hour, and eventually maxed out at ~9.5GB. The machine has 32GB of physical RAM, so there was always plenty of memory available. Total time was 21556 seconds (about 6 hours).
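
For the record, the monitoring loop was nothing more elaborate than something along these lines (exact formatting aside):

$ while true; do date +%T; free -m | grep '^Mem:'; sleep 1; done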

(attached screenshot: Screenshot_20200724_151942)