Limiting disk usage during `duplicacy check -chunks`

Preface: I’ve read both Cache usage details and Cache folder is extremely big! 😱

Both seem to imply that cache usage only grows through subsequent runs, as the cache is not cleared until a prune is run. This thread is asking how to limit cache usage within a single run.

Running duplicacy 2.5.1 (51CBF7)

I recently ran a duplicacy check -chunks to verify that my backups had made it to the remote storage without corruption.

duplicacy check -chunks -threads 8
Repository set to /foo
Storage set to gcd://backups/duplicacy/foo
Listing all chunks
2 snapshots and 5 revisions
Total chunk size is 5173G in 1100701 chunks
All chunks referenced by snapshot foo at revision 1 exist
All chunks referenced by snapshot bar at revision 1 exist
All chunks referenced by snapshot bar at revision 2 exist
All chunks referenced by snapshot bar at revision 3 exist
All chunks referenced by snapshot bar at revision 4 exist
Verifying 1097648 chunks

However, after a few hours I noticed that disk usage was growing at an alarming rate. It’s currently sitting around 100GB.

Digging a bit deeper it looks like duplicacy is caching every chunk to disk.

/foo/.duplicacy # du -sh cache/
101.1G  cache/
/foo/.duplicacy/cache/default # find chunks -type f | wc -l
20312

At the current growth rate, it looks like running duplicacy check -chunks requires enough disk to store the whole backup (or will at least attempt to cache the whole backup on disk).

My assumption is that chunk verification shouldn’t need to store more than n threads’ worth of chunks on disk at a time, and might not need to store them on disk at all if they’re sufficiently small (the default chunk size is <16M, which should fit in memory easily).

Is there any way to stop Duplicacy from caching the chunks (and therefore my whole backup) on disk? Or at least a way to ask it to clean up as soon as it no longer needs the files, to minimize the amount of storage needed to run the command?

I think it’s somewhat unreasonable to expect a single system to have enough disk to store the entirety of the backup storage, especially if many systems are backing up to the same location to leverage the deduplication functionality.

Duplicacy only caches metadata chunks on disk, so subsequent check commands will be much faster.

You can create a post-check script to delete everything in the cache after a check.

> Duplicacy only caches metadata chunks on disk, so subsequent check commands will be much faster.

Is this completely true? Or does it only apply to check -chunks?

I ran a duplicacy check (without -chunks) right before this and it did not use a noticeable amount of disk space (<1GB).

Is it normal to have >100GB of metadata chunks for a ~5TB backup?

> You can create a post-check script to delete everything in the cache after a check.

Sure, but that only works if you have enough storage to cache all the chunks to begin with.

At the current rate Duplicacy is using storage, I don’t think I’ll have enough to complete a check -chunks.

Is there any way (or are there any plans to support a way) of preventing or limiting the caching? i.e., a -no-cache option to minimize the amount of disk needed for a check -chunks run?

Looking at the code, it looks like the chunkDownloader will blindly cache any chunk it’s asked to download.

And the check -chunks routine adds all chunks (even non-metadata chunks, it looks like) to the download queue, since it seems to rely on the chunkDownloader's integrity checking for chunk verification.

Which seems to line up with what’s happening here, where all chunks in the backup storage are being cached during check -chunks. A bug perhaps? I don’t see any options that could disable this behavior.

As an aside, this also seems to imply that doing a full restore from backup will require at least 2x the storage of the backup (1x for the chunks, 1x for the actual files being restored).

Your suspicion is right – every chunk being checked is saved to the cache due to a bug. I just pushed a commit that fixes this bug.

This bug only affects the check command when the -chunks option is specified. A full restore should only store metadata chunks to the cache even without this fix.

Thanks for the finding. I’ll upload a new release tomorrow.

Thanks for the quick fix!

This bug has been fixed by CLI 2.5.2, which has been released at Release Duplicacy Command Line Version 2.5.2 · gilbertchen/duplicacy · GitHub

This topic was automatically closed after 50 days. New replies are no longer allowed.