Watch file system in real-time aka File System Notifications


#1

A while ago I made a rambling github issue post in completely the wrong thread about a potential new feature…

Watch file system in real-time / File System Notifications

(For some reason, this idea seems to have caused some befuddlement but I blame this on my poor explanation, and the fact that Duplicacy has both a CLI and a GUI version, where core functionality in the GUI is traditionally driven by the raw CLI commands. Attempts to wrap my head around the technical implementation - first, necessarily, in the CLI for under-the-hood functionality - may have clouded the vision of an important, and imho big feature idea, that would ultimately benefit the GUI version which not all people here use, but I bet would force people to reconsider - on their desktops, at least - and especially after GUI 2.0 arrives. So please keep an open mind when you read on :slight_smile: and I’ll try not to get bogged down with technical details… yet :stuck_out_tongue: )

Duplicacy is so very close to my idea of a perfect backup system: De-duplication, multi-storage, multi-machine, compression, encryption, core engine is open source, does local and cloud storage, snapshots, volume shadow copy, incremental, last backup is still a ‘full’ backup, the list goes on.

But it does lack one of those essential backup ‘client’ features that would put the icing on the cake for me…

File System Notifications, sometimes known as ‘watched folders’, is a mechanism for monitoring file changes in real time.

A program e.g. background service or icon tray, can ask the OS to be immediately notified of any changes within specified folders. Duplicacy GUI could keep a list of these changes, and when it comes to the next scheduled backup, instead of scanning the entire repository, uses this much smaller list of files when considering what has changed. The last snapshot revision is used for the state of all other unchanged files.
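To make the "keep a list of these changes" part concrete, here is a minimal sketch in Go, assuming nothing about how Duplicacy would actually do it. `ChangeJournal`, `Record` and `Drain` are made-up names for illustration. The OS hookup is left out on purpose; in Go a third-party library such as fsnotify is one common way to receive such notifications.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"sync"
)

// ChangeJournal collects paths reported as changed between two backups.
// Hypothetical helper for illustration; not part of Duplicacy.
type ChangeJournal struct {
	mu    sync.Mutex
	dirty map[string]struct{}
}

// NewChangeJournal returns an empty journal.
func NewChangeJournal() *ChangeJournal {
	return &ChangeJournal{dirty: make(map[string]struct{})}
}

// Record notes one changed path; repeated events for the same path collapse
// into a single entry.
func (j *ChangeJournal) Record(path string) {
	j.mu.Lock()
	defer j.mu.Unlock()
	j.dirty[filepath.Clean(path)] = struct{}{}
}

// Drain returns the sorted list of changed paths and resets the journal,
// so the next backup interval starts from an empty list.
func (j *ChangeJournal) Drain() []string {
	j.mu.Lock()
	defer j.mu.Unlock()
	paths := make([]string, 0, len(j.dirty))
	for p := range j.dirty {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	j.dirty = make(map[string]struct{})
	return paths
}

func main() {
	j := NewChangeJournal()
	// In a real watcher these calls would come from OS notifications.
	j.Record("docs/report.txt")
	j.Record("docs/report.txt") // duplicate event, stored once
	j.Record("src/main.go")
	fmt.Println(j.Drain())      // the two unique paths
	fmt.Println(len(j.Drain())) // journal is empty again
}
```

At the next scheduled backup, the drained list would be what gets examined instead of the whole tree.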

CrashPlan uses it to stage backups as often as every 15 minutes, without having to rescan a set of folders for changes by itself. Syncthing uses it to watch for changes to be synchronised to other devices, as soon as possible - again, without having to rescan the entire directory tree.

Duplicacy does already have impressive performance credentials doing rescans, but I argue that rescanning isn’t very scalable (my relatively small Windows user profile repository still takes upwards of 3 minutes per backup run, even on an SSD), is a waste of CPU resources, and leaves it some distance from that near continuous backup protection ideal that some people, including myself, may find important in a backup tool.

In my opinion, this is a feature that would enhance the GUI to such an extent that it would rival the CrashPlan client in terms of features. Add in CPU throttling (CrashPlan does this also), a sweet web based GUI and you have a sexy backup client made for desktops.

</rambling>


#2

Quickly run, the Pessimist steps in!

THIS! (i have the feeling you forgot to mention that in the github issue hijack :stuck_out_tongue:)


But i do have a workflow question which comes out of this: with this feature you intend to have every minute backups? Or every few seconds to make a new revision? <- that would totally kill the storage costs (storage + api calls) even with something as cheap as backblaze.


#3

Ok, briefly - I thought that FS monitor idea was silently ignored but it’s apparently back :slight_smile: (No offense intended here in any way – it’s a great idea – but with caveats).

Duplicacy GUI could keep a list of these changes, and when it comes to the next scheduled backup, instead of scanning the entire repository, uses this much smaller list of files when considering what has changed

Admittedly, with the clarification that the filesystem events will be used only as an optimization to reduce the frequency of periodic indexing (as opposed to turning duplicacy into a realtime backup), there is something to discuss. But still, let me explain why I think both approaches are a very bad idea.

In fact, the lack of a runtime monitor component was one of the advantages of Duplicacy for me. If the number of files that are changing is large (compiling a huge source tree as opposed to editing a single Word file), constant filesystem filter overhead is way too high a price to pay to avoid waiting 40 seconds every hour for a full rescan. I remember having to shut down Dropbox and OneDrive because they had filesystem filters (perhaps improperly written) that would hog CPU when I was building stuff (clang, nothing fancy) in folders even outside of the Dropbox. Same with CrashPlan – I had to kill their stupid filesystem monitor to get some work done.

Granted, CrashPlan did it because they had to – their performance sucked to begin with: indexing was taking over 6 hours on my dataset. Of course they had to implement a filesystem watcher. (Which, by the way, does not work on some mounted drives. On some it does work. It depends on the version of SMB/NFS, the remote OS and the enabled features. And there is a filesystem watcher limit that likes to get reset out of nowhere to 8k files. Silently. It’s a can of worms I’d rather not open. Even with a filesystem monitor we would need periodic full scans because of that – just like CrashPlan had to do – and therefore it would be unusable due to being unreliable).

All those technical reasons for having runtime watchers do not apply to Duplicacy because it does not suck: it takes 40 seconds to re-index the same dataset and therefore it does not need crutches in the form of filesystem monitoring.

I’m against realtime backup for other reasons besides performance (some workflows may not be affected) or cost (Wasabi UE cost the same as B2 but with zero API cost). It’s about managing the backup snapshots and attempting to find the right one to restore. This is not supposed to be a replacement for git; this is a backup. It neither needs to nor should log every change. Doing so would lead to a complete revamp of the whole restore workflow, which would now be based on what: time? Branches? Infinite versioning?

Anecdotal example - Synology CloudStation Backup does precisely that. It sits in memory, uploads every single change and stores it in BTRFS snapshots on the server. I used it for a while, and when I needed to restore something I felt much more confident restoring from Time Machine instead, which kept hourly snapshots as opposed to the unstructured mess of versions at random times (because pruning!) that CloudStation created. Now CloudStation is uninstalled and I have Duplicacy hourly snapshots instead. A limited number of structured, deterministic versions is better than a linear infinite history, partly due to the paradox of choice.

I would love to hear benefits of realtime backup as such, and then maybe discuss further why we should stay far far away from watching filesystem.

But, if you folks decide to implement that still – PLEASE PLEASE, for the sake of your loyal user(s) provide an option to TURN IT OFF!!!


#4

Yes, I did forget, and it’s a pretty important point. :slight_smile:

No I agree that would be overkill… I mentioned near continuous backup when linking to the Continuous Data Protection wiki, which talks about the ‘near’ distinction, as to be honest it may not be possible to achieve ‘true’ CDP anyway…

But I don’t see why backups every 5 or 15 minutes can’t be feasible. My main motivation would be to lower system resources, while allowing ever larger repo sizes and more frequent backups.

On my main desktop at the moment, I run GUI backups on several repos only every 3 hours because I don’t want to think about quitting Duplicacy before gaming, for example. (CPU throttling and game detection might help here, but rescans - while performant - are wasteful on a desktop when there’s a proven alternative.)

Never had a problem with CrashPlan wrt real-time monitoring, but could this be an OS thing? I did have issues where CrashPlan’s VSS would often freeze my entire machine while doing a backup. That was a PITA.

Yup, that’s expected, and a fair compromise imo. Even Syncthing does periodic full rescans.

CrashPlan’s “verify selection” defaulted to every 1 day at 3am. I propose something similar, but Duplicacy wouldn’t need a separate rescan pass that performs no backup - you could simply tell it to do a normal backup with a full rescan on the first backup after 3am (say) each day.

Agreed, it doesn’t suck, but 40 seconds is your dataset, not mine. :slight_smile: Takes 3 minutes on my Windows user profile alone. That’s too long if I want higher backup frequencies or significantly more files.

I don’t quite understand this reasoning or the segue into git and branches. It’s really about making more regular backups to avoid data loss, not keeping a bajillion versions. In most cases for me, the one to restore is the last successful backup - I’ll be using prune -keep 1:7 at most anyway.

Improved performance is my main goal here.

Oh definitely, wouldn’t have it any other way!


#5

Curious about this: i checked my machine as well: windows 10 machine with i5-6400 cpu some 6yo ssd more than half filled and 168GB of data out of which only 15GB actually get to be backed up (my userprofile + other folders on a HDD), as the rest are ignored by the filters file ( i use a symlink repo ).

2018-10-16 01:00:21.579 =================================================================
2018-10-16 01:00:21.587 ==== Starting Duplicacy backup process ==========================
2018-10-16 01:00:21.596 ======
2018-10-16 01:00:21.623 ====== Start time is: 2018-10-16 01:00:21
2018-10-16 01:00:21.633 ====== logFile is: C:\duplicacy repositories\tbp-pc\.duplicacy\tbp-logs\2018-10-16 Tuesday\backup-log 2018-10-16 01-00-21_636752484214107081.log
2018-10-16 01:00:21.642 =================================================================
2018-10-16 01:00:21.961 Zipping (and then deleting) the folder: C:\duplicacy repositories\tbp-pc\.duplicacy\tbp-logs\2018-10-15 Monday to the zipFile: C:\duplicacy repositories\tbp-pc\.duplicacy\tbp-logs\2018-10-15 Monday.zip
2018-10-16 01:00:40.656 ===
2018-10-16 01:00:40.664 === Now executting  .\z.exe  -log  -d   backup -stats -threads 18
2018-10-16 01:00:40.674 ===
2018-10-16 01:00:43.146 INFO STORAGE_SET Storage set to G:/My Drive/backups/duplicacy
2018-10-16 01:00:43.155 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2018-10-16 01:00:43.170 DEBUG PASSWORD_KEYCHAIN Reading password from keychain/keyring
2018-10-16 01:00:43.185 TRACE CONFIG_ITERATIONS Using 16384 iterations for key derivation
2018-10-16 01:00:43.204 DEBUG STORAGE_NESTING Chunk read levels: [1], write level: 1
2018-10-16 01:00:43.205 INFO CONFIG_INFO Compression level: 100
2018-10-16 01:00:43.205 INFO CONFIG_INFO Average chunk size: 4194304
2018-10-16 01:00:43.205 INFO CONFIG_INFO Maximum chunk size: 16777216
2018-10-16 01:00:43.205 INFO CONFIG_INFO Minimum chunk size: 1048576
2018-10-16 01:00:43.205 INFO CONFIG_INFO Chunk seed: 047e223d47da826b44a58bf72c9f701ded4e04e3afb10e39c3a5eb30c93b1e07
2018-10-16 01:00:43.205 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_PASSWORD
2018-10-16 01:00:43.219 DEBUG BACKUP_PARAMETERS top: C:\duplicacy repositories\tbp-pc, quick: true, tag:
2018-10-16 01:00:43.219 TRACE SNAPSHOT_DOWNLOAD_LATEST Downloading latest revision for snapshot tbp-pc
2018-10-16 01:00:43.219 TRACE SNAPSHOT_LIST_REVISIONS Listing revisions for snapshot tbp-pc
2018-10-16 01:00:43.288 DEBUG DOWNLOAD_FILE Downloaded file snapshots/tbp-pc/3123
2018-10-16 01:00:43.319 DEBUG CHUNK_DOWNLOAD Chunk e1c2e8de2264eb5807a8ca4e841f7b734c9abd829b1b2e557cad7c2ba8e50f6a has been downloaded
2018-10-16 01:00:43.520 DEBUG DOWNLOAD_FETCH Fetching chunk 427602ae10b276099c86ab60973236250bfc37c960a57822812c996e2a5e794f
[..]
2018-10-16 01:00:45.687 DEBUG DOWNLOAD_FETCH Fetching chunk ffc387899375df5cd538fa89846d3fc6edca9ecb7c6081463cec78cfd0f26084
2018-10-16 01:00:45.692 DEBUG CHUNK_DOWNLOAD Chunk ffc387899375df5cd538fa89846d3fc6edca9ecb7c6081463cec78cfd0f26084 has been downloaded
2018-10-16 01:00:45.760 INFO BACKUP_START Last backup at revision 3123 found
2018-10-16 01:00:45.760 INFO BACKUP_INDEXING Indexing C:\duplicacy repositories\tbp-pc
2018-10-16 01:00:45.763 DEBUG REGEX_STORED Saved compiled regex for pattern "/iTunes/Album Artwork/Cache/", regex=&regexp.Regexp{regexpRO:regexp.regexpRO{expr:"/iTunes/Album Artwork/Cache/", prog:(*syntax.Prog)(0xc044048b10), onepass:(*regexp.onePassProg)(nil), prefix:"/iTunes/Album Artwork/Cache/", prefixBytes:[]uint8{0x2f, 0x69, 0x54, 0x75, 0x6e, 0x65, 0x73, 0x2f, 0x41, 0x6c, 0x62, 0x75, 0x6d, 0x20, 0x41, 0x72, 0x74, 0x77, 0x6f, 0x72, 0x6b, 0x2f, 0x43, 0x61, 0x63, 0x68, 0x65, 0x2f}, prefixComplete:true, prefixRune:47, prefixEnd:0x0, cond:0x0, numSubexp:0, subexpNames:[]string{""}, longest:false}, mu:sync.Mutex{state:0, sema:0x0}, machine:[]*regexp.machine(nil)}
[...]
2018-10-16 01:00:45.764 DEBUG REGEX_STORED Saved compiled regex for pattern ".Library/Saved Application State/", regex=&regexp.Regexp{regexpRO:regexp.regexpRO{expr:".Library/Saved Application State/", prog:(*syntax.Prog)(0xc044048bd0), onepass:(*regexp.onePassProg)(nil), prefix:"", prefixBytes:[]uint8(nil), prefixComplete:false, prefixRune:0, prefixEnd:0x0, cond:0x0, numSubexp:0, subexpNames:[]string{""}, longest:false}, mu:sync.Mutex{state:0, sema:0x0}, machine:[]*regexp.machine(nil)}
2018-10-16 01:00:45.764 DEBUG REGEX_DEBUG There are 155 compiled regular expressions stored
2018-10-16 01:00:45.764 INFO SNAPSHOT_FILTER Loaded 155 include/exclude pattern(s)
2018-10-16 01:00:45.764 TRACE SNAPSHOT_PATTERN Pattern: e:(?i)/\.git/
[...]
2018-10-16 01:00:45.784 TRACE SNAPSHOT_PATTERN Pattern: e:.Library/Saved Application State/
2018-10-16 01:00:45.784 DEBUG LIST_ENTRIES Listing
2018-10-16 01:00:45.785 DEBUG PATTERN_INCLUDE C__Users_link is included
2018-10-16 01:00:45.786 DEBUG PATTERN_INCLUDE C__all_link is included

2018-10-16 01:01:34.034 TRACE PACK_START Packing [bla]
[...]
2018-10-16 01:01:34.097 INFO PACK_END Packed [bla]

2018-10-16 01:01:36.248 INFO BACKUP_END Backup for C:\duplicacy repositories\tbp-pc at revision 3124 completed
2018-10-16 01:01:36.248 INFO BACKUP_STATS Files: 118642 total, 14,914M bytes; 174 new, 79,947K bytes
2018-10-16 01:01:36.249 INFO BACKUP_STATS File chunks: 11043 total, 18,647M bytes; 8 new, 65,611K bytes, 19,240K bytes uploaded
2018-10-16 01:01:36.249 INFO BACKUP_STATS Metadata chunks: 12 total, 52,673K bytes; 6 new, 22,476K bytes, 5,721K bytes uploaded
2018-10-16 01:01:36.249 INFO BACKUP_STATS All chunks: 11055 total, 18,699M bytes; 14 new, 88,088K bytes, 24,961K bytes uploaded
2018-10-16 01:01:36.249 INFO BACKUP_STATS Total running time: 00:00:53
2018-10-16 01:01:36.288 ===

On my setup listing starts @
2018-10-16 01:00:45.784 DEBUG LIST_ENTRIES Listing
and the packing starts (so listing ends) @
2018-10-16 01:01:34.034 TRACE PACK_START Packing [bla]
so that would be about 48 seconds of work.

I see here that for 160GB of many files it takes less than 30% of your time for the listing part. I also don’t remember ever noticing that the backup runs. There’s no slowdown, no high fan, nothing.

And this is no ultra-powerful computer and i’m using the intel cpu fan (so the cooling is decent at best).


I’m curious what’s going on with the listing on your machine, or how could it be so slow and especially resource intensive. Do you have a HUGE repository?


#6

Fair enough, provided watching the filesystem is lightweight enough to not degrade performance of other tasks.

Perhaps we need data from larger dataset about average resource requirements of constantly watching huge number of files vs periodically scanning the same huge number of files. There should be a break-even point at some backup frequency – that could be different depending on CPU, FS cache size, filesystem, OS and storage media. (I suspect that frequency will be very high – as with the sufficiently large filesystem cache all backup runs after the first one shall be effectively never touching disk – as all the metadata will be cached.)
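As a back-of-the-envelope version of that break-even point, assuming the only thing compared is CPU time: if a watcher costs W CPU-seconds per hour and one scan costs S CPU-seconds, the two approaches cost the same at W/S scans per hour. The Go sketch below uses made-up inputs; `breakEvenScansPerHour` is a hypothetical name and the numbers are not measurements.

```go
package main

import "fmt"

// breakEvenScansPerHour returns how many scans per hour cost the same CPU
// time as running a watcher for that hour. Below this rate periodic
// scanning is cheaper; above it watching is cheaper.
func breakEvenScansPerHour(watcherCPUSecPerHour, scanCPUSec float64) float64 {
	return watcherCPUSecPerHour / scanCPUSec
}

func main() {
	// Made-up inputs for illustration only, not measurements:
	// a watcher costing 60 CPU-seconds per hour, a scan costing 20.
	fmt.Println(breakEvenScansPerHour(60, 20)) // 3
}
```

Real data would have to replace both inputs, and they would differ per CPU, cache size, filesystem, OS and media as noted above.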

Perhaps something that can be collected by telemetry with user permission?

Same with CrashPlan – I had to kill their stupid filesystem monitor to get some work done.

Never had a problem with CrashPlan wrt real-time monitoring, but could this be an OS thing? I did have issues where CrashPlan’s VSS would often freeze my entire machine while doing a backup. That was a PITA.

That was consistently the case on a number of versions of macOS for as long as I used CrashPlan (at least 4-5 years, perhaps more, time flies so fast).

Indeed, this is a suspiciously huge amount of time. I forget whether there is a tool similar to spindump on Windows (part of the performance analyzer toolkit perhaps) – but it would be interesting to see where all that time is spent.

Also, how many files are there? My 40 seconds on the first pass and 30 seconds on subsequent passes are for a dataset just short of a million files on an APFS volume on a fairly old MacBook Pro 2015, out of which about 200k are picked up:

mymbp:~ me$ find ~ |wc -l
  992558
mymbp:~ me$ duplicacy  list -r 128 -files |wc -l
  207222
mymbp:~ me$ time caffeinate -s duplicacy backup
Running script /Users/me/.duplicacy/scripts/pre-backup
Storage set to sftp://me@redacted//Backups/duplicacy
Last backup at revision 128 found
Indexing /Users/me
Loaded 49 include/exclude pattern(s)
Backup for /Users/me at revision 129 completed

real	0m28.684s
user	0m12.214s
sys     0m9.875s

Also, does it not go faster on the second run for you? How much RAM is in that machine and how much of it is the filesystem cache allowed to occupy? What is the filesystem? What is the storage media?

And the million dollar question – are there any antivirus/antimalware/dropbox/onedrive/google drive/other filesystem watching apps (ahem-ahem) running? What happens to backup speed if you stop them temporarily?


#7

I also wonder about the dataset.

My backup is about 400k files (150GB, but size doesn’t matter with regards to the scan time) and a typical run requires about 30 seconds on a PC that’s a half decade old.


#8

Slight clarification.

This is a typical backup length of my Windows user profile taken every 3 hours - so including indexing, VSS, packing - not a huge amount of data at all. So in truth, it can take as little as a minute or as much as 5, but 3 mins is more typical for incrementals throughout the day, and depending if the file cache has been cleared by heavy gaming, reboots etc…

The point I was making is that it doesn’t scale well. In fact, my Windows user profile probably isn’t the best example, since it sits nicely on a Samsung 970 Evo M.2.

I have a repo on a 7200rpm drive, same system, which takes around 55 seconds to index after a fresh reboot. No new data to back up; it completes in almost exactly 1 minute - the majority of that time, indexing. Only 212K files, 79K folders. That, imo, is still a bit long and a concern for future data growth / more frequent backups.

If you double the number of files and folders, you can expect the time to index to double as well (though I grant some proper testing is very much warranted :slight_smile: )

Oh definitely faster but the file cache does tend to get cleared out often when I game or unarchive large files, even with 16GB RAM.

(Duplicacy incidentally is very reasonable with memory usage during backups - very much unlike CrashPlan!)

Only Windows 10’s built-in AV and Syncthing (it does watch folders for changes, but not in the repos that take a long time to backup).

Need to do some further testing…

Might add some debugging to the code to get a clearer idea on what’s exactly happening.


#9

How long does it take if you right click the repository folder and check the properties in Windows Explorer? How many seconds?


#10

Interesting test, didn’t think of that. 50 seconds without rebooting (I usually leave the desktop PC on 24/7). This was just before a backup run, nearly 3 hours after the last. Ten minutes after the backup, properties on that repo takes around 10 seconds.

So the metadata doesn’t tend to stay for long in the cache, with only ~35% physical usage out of 16GB.


#11

So at least here duplicacy is as fast as it could be (eg as fast as windows normally is). I like that.
For a moment i had the feeling that duplicacy was slower than windows reading the modified date for all the files in that specific folder tree.


#12

I had already asked for such a feature 2 or 3 years ago when duplicacy was introduced, but it didn’t find much support back then.

I’ve restated this request recently, here: Let me limit the folders that are checked during a backup

Now I see I’m not alone with this any more. And I see others also refer to Crashplan, which did a great job at that.

I’m primarily a Mac developer, so I’d be happy to provide help implementing this on macOS. I’ve used FSEvents before, so I know well how it works. The linked post requests a feature in the cmdline tool that would allow us to provide a list of changed folders (or files), which would then mean that the backup tool would ONLY scan those files/folders, instead of the entire backup’s source tree.

With that feature added to the cmdline tool, Linux and Windows developers could, like me, implement their own little background “daemon” that will record file system changes and regularly invoke the cmdline tool, passing the list of changes to it. Once that has been tested by us developers, the feature could then be directly added to the UI tools Acrosync provides, or we’ll keep providing our own separate tools.

Either way, by adding the option to provide a set of folders/files to the backup command, which I believe is easily added, we can take this out of the hands of the duplicacy developer(s), since they’re not eager to work on that currently, it seems :slight_smile:


#13

Understood, completely agree. The intermediate workaround would then be to add the -filters option as a command line parameter, not just the filters file. I think this is a very interesting idea, and it would have other benefits, such as enabling the centralization of configuration parameters.

I’m a Rclone user and it allows these two forms of parameters for filters, by file and by command line, it is very useful / flexible.

I think this modification (passing the filter parameters by command line) would be easier to implement than monitoring in real time.

Again, you’ll find some interesting discussions here:

Some of the above discussions are about using a -filters parameter to point to specific filters files, but I think using -filter directly as a parameter on the command line may be more practical in some use cases


#14

Doesn’t using the filters mean that it’ll affect the stored files that are filtered out this way? That would not be good.

An example:

The repo has two folders, A and B, both with files inside.
I back up both.

Now I detect that only A has changed contents, so I’d modify the filter to only scan A.
When I run the backup now, won’t it believe that B has been removed and will delete them from the store’s last snapshot?


#15

If by “last snapshot” you are referring to the one just created, then yes, you are correct.


#16

I don’t think it’s useful to think of this feature the same as filters - yes, it is a filters of sorts, but in the context of Duplicacy a filter excludes file(s)/folder(s) from snapshots. What we want is to include unchanged files from the previous snapshot and scan changed files only.

My suggestion would be to allow the backup command to specify a [pattern] - exactly like the syntax for the restore command:

USAGE:
   duplicacy restore [command options] [--] [pattern] ...

Furthermore, if a single pattern isn’t enough - for both backup and restore commands (as I suspect it might be useful for GUI 2.0 especially, to be able to restore a selection of things) - perhaps allow multiple patterns and/or a -pattern option to specify this.


#17

I see, you’re saying that we could have the parameters in backup command

func (manager *BackupManager) Backup(top string, quickMode bool, threads int, tag string,
	showStatistics bool, shadowCopy bool, shadowCopyTimeout int, enumOnly bool)

similar to what we have for restore (patterns at the end):

func (manager *BackupManager) Restore(top string, revision int, inPlace bool, quickMode bool, threads int, overwrite bool, deleteMode bool, setOwner bool, showStatistics bool, patterns []string)

Looks interesting!

Source:


#18

We have to evaluate if we are not trying to use a backup tool as a synchronization tool.

Time machine (eg) saves file versions.

A backup tool (such as Duplicacy) saves snapshots of file sets.


#19

Isn’t that the same to the user? After all, when you browse a Time Machine backup, you browse something that looks very close to snapshots. And you browse entire folders by date & time. And if you look at the backup directory directly (via a file browser like the Finder), you’ll see them at the root as snapshots, each with a timestamp:


#20

Maybe. It depends on the use case.

Consider your own example above:

time folders
time 1 backup folders A and B
time 2 backup folder A

Use case 1: “My file has been corrupted, I’ll restore it from backup”

No problem. Just check the dates / times of the backups and locate the file with list command (or another way) and restore it.

Use case 2: “My HDD died.”

Just locate the last full backup (with all folders and files) and restore it. Wait … which one was the last full backup?

Because:

That’s why I’m saying that it’s different to deal with files and sets of files.

From what you describe (I have little experience with Macs), Time Machine generates a full snapshot even when it detects that only some files changed. Duplicacy - to my knowledge - will only do the same thing if it runs a full scan of the repository. If we limit the scope using some kind of filter in the repository, it will consider this new “range” as a full version of the backup in this run.