Bypass scanner with list of files that changed/added/deleted

Hi, you might know me from such posts as Duplicacy preparing for backup relatively slow with filter applied, where on 5400RPM rotational ZFS zpools scanning a zvol can take upwards of an hour…

Well, you know what doesn’t take an hour on said system?

 time zfs diff Main-Volume/subvol-102-disk-1@autosnap_2020-04-06_20:36:32_daily Main-Volume/subvol-102-disk-1@autosnap_2020-04-07_00:05:20_daily > /dev/null
 real    1m15.925s

Diffing 2 snapshots! If I could turn the output of that diff into a list of the files that have been modified, added, or deleted within the repository, then my backup time would be reduced by an order of magnitude or more!
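For illustration, here is a minimal sketch (in Python, not part of Duplicacy) of sorting that diff into the three lists, assuming the tab-separated format that `zfs diff -H` produces per its man page: a change-type field (`+`, `-`, `M`, or `R`), then the path, with renames carrying the old and new paths in separate fields.

```python
# Sketch: classify `zfs diff -H old@snap new@snap` output into
# added/modified/deleted lists. Assumes the tab-separated -H format:
# change type (+, -, M, R), then path(s).

def parse_zfs_diff(lines):
    added, modified, deleted = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        change, path = fields[0], fields[1]
        if change == "+":
            added.append(path)
        elif change == "-":
            deleted.append(path)
        elif change == "M":
            modified.append(path)
        elif change == "R" and len(fields) >= 3:
            # Treat a rename as a delete of the old name plus an add of the new.
            deleted.append(path)
            added.append(fields[2])
    return added, modified, deleted

if __name__ == "__main__":
    sample = [
        "M\t/datastore/subvol-102-disk-1/etc/passwd",
        "+\t/datastore/subvol-102-disk-1/root/testfile",
        "-\t/datastore/subvol-102-disk-1/var/tmp/old.log",
    ]
    print(parse_zfs_diff(sample))
```

Feeding `zfs diff -H snap1 snap2 | python3 this_script.py` through something like this would yield exactly the file set a backup run needs to look at.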

Any chance of that being added? I don’t know Go at all, but it looks like the Scanner and the file list are pretty tightly coupled in the code.

I don’t think diffing 2 snapshots is slow. Instead, it is loading 2 snapshots into memory that is slow if you don’t have enough memory. See a similar issue here: WebEdition Scheduled Backup Index runs longtime

I’ll start working on the memory usage optimization next week.


My issue, though, was walking the inodes to check which files exist, either to act on or to filter. From what I pasted in the other thread, it hadn’t even gotten to looking at the snapshots yet, correct?

Run duplicacy -d backup and the debug level log messages will tell you which step is the slowest.
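The gaps between consecutive timestamps point at the slow step. A rough sketch of that analysis (plain Python, assuming the `[YYYY-MM-DD HH:MM:SS]` prefix the debug log uses):

```python
# Sketch: find the largest gap between consecutive timestamps in a
# `duplicacy -d backup` log, to identify the slowest step. Assumes each
# relevant line starts with a "[YYYY-MM-DD HH:MM:SS]" prefix.

from datetime import datetime

def slowest_step(lines):
    stamped = []
    for line in lines:
        if line.startswith("[") and len(line) > 20 and line[20] == "]":
            ts = datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S")
            stamped.append((ts, line))
    worst_gap, worst_line = None, None
    # Compare each timestamped line with the one before it.
    for (t0, _), (t1, line) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if worst_gap is None or gap > worst_gap:
            worst_gap, worst_line = gap, line
    return worst_gap, worst_line
```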

I had, in the other thread linked above. Scanning the directories for the filter step was taking 30+ minutes with an empty ARC cache on my ZFS system.
I can privately send you the entire output, including timestamps.
[2020-04-16 14:01:15] Storage set to
[2020-04-16 14:01:19] top: /datastore/subvol-102-disk-1, quick: true, tag:
[2020-04-16 14:01:19] Downloading latest revision for snapshot gorillaz
[2020-04-16 14:01:19] Listing revisions for snapshot gorillaz
[2020-04-16 14:01:21] Loaded file snapshots/gorillaz/52 from the snapshot cache
[2020-04-16 14:01:40] Last backup at revision 52 found
[2020-04-16 14:01:40] Indexing /datastore/subvol-102-disk-1
[2020-04-16 14:01:40] Parsing filter file /datastore/subvol-102-disk-1/.duplicacy/filters
[2020-04-16 14:01:40] There are 0 compiled regular expressions stored
[2020-04-16 14:01:40] Loaded 0 include/exclude pattern(s)
[2020-04-16 14:01:40] Listing
[2020-04-16 14:01:40] Listing bin/
[2020-04-16 14:01:40] Listing boot/
[2020-04-16 14:01:40] Listing dev/
[2020-04-16 14:01:40] Listing etc/
[2020-04-16 14:01:40] Listing etc/X11/
[2020-04-16 14:01:40] Listing etc/X11/Xreset.d/
… 44 minutes go by
[2020-04-16 14:45:01] Listing var/spool/postfix/saved/
[2020-04-16 14:45:01] Listing var/spool/postfix/trace/
[2020-04-16 14:45:01] Listing var/spool/postfix/usr/
[2020-04-16 14:45:01] Listing var/spool/postfix/usr/lib/
[2020-04-16 14:45:01] Listing var/spool/postfix/usr/lib/sasl2/
[2020-04-16 14:45:01] Listing var/spool/postfix/usr/lib/zoneinfo/
[2020-04-16 14:45:01] Listing var/spool/rsyslog/
[2020-04-16 14:45:01] Listing var/tmp/
[2020-04-16 14:45:01] Packing root/testfile
[2020-04-16 14:45:05] Chunk 449688b8f8adab6dca2e389ca718ecacd06b6436a47ba6d139a286e533c21aaf has been uploaded
[2020-04-16 14:45:05] Uploaded chunk 1 size 8816076, 2.10MB/s 01:26:21 0.0%

That 44-minute chunk of listing files is what I’d like to replace with a zfs diff, which tells me exactly what has changed.

I would love to see the GUI implement a real-time fs watcher.

I was thinking that as well, and newer Linux kernels even have a useful fanotify API. But then I realized that, since I can already generate the list with the built-in zfs commands, a much smaller feature would be something like Duplicati’s --changed-files option. It’d probably be fairly trivial for someone who knows the system and the language, but as I said upthread, I know neither, and after seeing how entangled the scanner is I didn’t feel comfortable digging further.

Yup, but both cases - a fs watcher or zfs diff - would require the underlying CLI backup command to support additional parameters, such as a list of patterns, or a filename that contains a list of patterns.

Once that’s in place, the GUI can implement whichever system works best (or you could script it outside of the GUI). With such a system, you could effectively have near-continuous data protection.
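To make that concrete, here is a rough sketch (Python, names hypothetical) of what a GUI or cron script could feed to such a pattern-file parameter: converting the absolute paths that zfs diff emits into repository-relative include patterns. The `+path` syntax mirrors Duplicacy’s existing filters file; the backup-side parameter itself does not exist yet.

```python
# Sketch: turn absolute changed-file paths (e.g. from `zfs diff`) into
# repository-relative "+path" include patterns of the kind a hypothetical
# pattern-file parameter might accept. Paths outside the repository root
# are skipped.

import os

def to_patterns(paths, repo_root):
    patterns = []
    for path in paths:
        rel = os.path.relpath(path, repo_root)
        if rel.startswith(".."):
            continue  # outside the repository, skip
        patterns.append("+" + rel)
    return patterns
```

Writing the result to a file once per snapshot pair, then invoking the backup with that file, is the whole pipeline; everything else stays in scripting outside the CLI.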