Let me limit the folders that are checked during a backup


#1

Consider a user (like me) who uses a huge directory tree as the backup source.
Now, if I let the tool run a backup every hour, this may cause a lot of scanning every time, slowing down my hard disks.

Since I’m using macOS, I could write a tool that watches the source tree through the “FSEvents” feature. This will tell me which folders’ contents have changed.

Now, knowing which folders have changed, it would be great if I could pass this information to the tool, so that it only looks into the provided list of folders for changes instead of checking the entire tree.

I’d think the best way would be to simply pass the list of directory paths to the tool after the “backup” command, or to pass the path to a file that contains the list.
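For illustration, a hypothetical invocation could look like this (both flags are made up; neither exists in Duplicacy today):

```
# pass the changed directories directly after the backup command
duplicacy backup -changed-dirs "Documents/Projects,Pictures/Holiday"

# or point to a file listing one changed directory per line
duplicacy backup -changed-dirs-file /tmp/changed-dirs.txt
```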

Can that be done?


#2

I think the best way would be to pass an “include” pattern for the desired folders:

Your script would have to update the filters file.

I have some very large folders and scanning them does not take long. My daily backup (of multiple repositories) takes 10 to 15 minutes.


#3

“include” makes little sense, since the entire repo is already included in the search, so adding a subfolder of said repo would have zero effect. Unless “include” works differently from what its name suggests.

It should rather be called “testOnly” or something like that.

10 minutes is way too much time for an hourly backup, which is what I’m used to with macOS Time Machine. The check should only take a few seconds if only a few files have changed, and that’s how TM does it, based on FSEvents, as I suggested (I’ll write the daemon process for that).


#4

It’s an include pattern. As you mentioned:

so your script would just have to change the filters file to include something like:

```
+path/to/changed_folder1/*
+path/to/changed_folder2/*
...
e:.*
```

(the last line excludes everything else)

But I really think this would not be a proper way to use the filters file. I think the best way would be to simply let Duplicacy scan the folders.

On the other hand, I understand your concern about running hourly, and I wonder: would this be a case for using a synchronization tool instead of a backup tool?

In my case, for example, I make a daily backup of my folders that are synced to Google Drive. If I need a file version from a few hours ago, I download it from Drive. Since Drive only keeps versions from the last 30 days, if I need something older, I go to the Duplicacy backups.

Obviously, I don’t know the details of your use case, but hourly backups on a workstation only make sense to me if you need to keep all versions / changes in the long run.

EDIT: I just remembered, you’ll find an interesting discussion here:


Watch file system in real-time aka File System Notifications
#5

> But I really think this would not be a proper way to use the filters file.

Right, because the filters file is more of a permanent setting, whereas the feature I need is specific to every call to the backup command.

> I think the best way would be simply let Duplicacy scan the folders.

We already established that this can be very inefficient.

I am trying to get duplicacy to do the same as Time Machine does, i.e. back up changed files every 30 minutes, with a history, but to a remote location instead of a local disk. Time Machine is a great tool for keeping an automatic file history on a file system that isn’t auto-versioning. I’ve relied on TM for years, and if duplicacy can be a replacement for TM, with the additional option to save to “Cloud” storage, it can be quite successful on the Mac.

Ever since I got to know duplicacy, about 2 or 3 years ago, I’ve been looking for something with the feature set that TM offers, but nothing comes close (Crashplan did, but it has since dropped support for non-business users). There’s Arc for macOS, which is rather popular and, like duplicacy, able to store on cloud servers, but it also has no FSEvents support, so it’s basically as inefficient as duplicacy when backing up a folder tree with more than a few tens of thousands of files, though it has a better user interface than duplicacy. To beat Arc, duplicacy needs to be more efficient, and adding the feature I’m looking for would make a big difference. Sure, the UI is still not as user friendly (i.e. not Mac-like), but with the web service about to come out, it’ll look great. You could even wrap that into a Mac container app, thereby replacing the current Qt(?) app - I assume that’s what you’re planning anyway.

So, scanning a million files every 30 minutes is inefficient and unnecessary when I can provide the location of the changes.

In fact, to be more suitable for the use with FSEvents, I’d need two modifiers for the backup command:

  1. I specify a set of dirs that have had their immediate contents changed
  2. I specify a set of dirs that have had any of their contents changed (-> deep scan)

I’d expect the backup scanner to dive directly into these provided paths, skipping any others. That should be easy to add, as I’m sure there’s already a function that scans a dir (and its subdirs). So, all that needs adding is a parameter that tells that function whether to look into deeper dirs or not, and then to call it with the provided paths.

If you don’t want to do that yourself, would I be allowed to modify the source accordingly, and would you accept it into the main branch, provided I break nothing? Then I’ll write the FSEvents daemon and we’ll see how it performs.


#6

I am having a look at the tool’s code right now. I’m not fluent in Go, but it’s well readable regardless :slight_smile:

It seems to me that the scanner (CreateSnapshotFromDirectory) always needs to create a complete list of all files in the source, and this list is then compared against the store’s contents. I’d hoped it would take the store’s contents as input and deliver only a diff, which would have made implementing my feature easier.

Now, with the way it works, I can’t just have CreateSnapshotFromDirectory look at a smaller set of dirs, because then the following comparison against the store’s contents would mark all the unscanned files as deleted, I guess.

How do I best approach this? I can think of two ways:

  1. After calling CreateSnapshotFromDirectory(), I’d parse the store’s contents, adding all files from it while excluding the dirs / files I had already scanned.
  2. Or, the code that compares the store to the snapshot would only look at the dirs that were scanned.

I’ll try the second approach. I don’t like either way too much, because both lead to two places in the code that have to be kept in sync (the comparison and the snapshot creation), as both have to interpret the list of scanned dirs correctly.


Working on the source code, I found that the last import in the main file refers to gchen’s repo directly - it should instead refer to “duplicacy/src”, or a forked project won’t use the local copy of the code in the src directory.


Huh, I now realize that the snapshot created by CreateSnapshotFromDirectory() is exactly what gets stored as the next snapshot in the storage. So it has to be complete. Which means that if I only scan a few dirs, I’ll have to add the others from the previous snapshot, so that the snapshot remains complete.


#7

Had a look at the code too (am a complete novice to Go unfortunately) and came to the same conclusion regarding CreateSnapshotFromDirectory(). At first thought, what we need is a CreateSnapshotFromPreviousSnapshotAndDirectories()… a bit long-winded maybe, but gets the point across. :slight_smile:

The problem I see with CreateSnapshotFromDirectory(), though, is that its only job is to iterate through the entire directory structure - not to check what’s changed; another piece of code does that. So simply adding another code path that does a CreateSnapshotFromPreviousSnapshotAndDirectories() instead doesn’t solve the issue: either way, the entire snapshot then needs to be scanned for changes, and that defeats the point; it’s not efficient.

So your second approach sounds like the best, but perhaps there’s a way to unify the two code paths such that CreateSnapshotFromDirectory() also scans for changes?

Or, do away with CreateSnapshotFromDirectory() completely and approach both scanning methods with the same steps: 1) load the snapshot from the previous backup (if it exists), 2) recursively scan for changes within the specified directories (if no directories are specified, use the repository root) - comparing for changes along the way. Basically, reversing the steps and making them part of the same process.

Another consideration to make is how fine grained the [pattern(s)] should be…

I believe most File System Notification APIs return changes at the file level. Obviously GUI 2.0 or a third-party tool will monitor this, then pass the changes to the CLI. However, is it best to group them up at the folder level, considering Duplicacy is bundle- and chunk-based? Or should individual files be allowed to be picked out?

Also, I would encourage any third-party tools (and hopefully GUI 2.0) which handle File System Notifications to use this Go library for cross-platform support. The main issue with this library, but also with any custom implementation, is support for recursively monitoring changes. So that needs to be solved as well, preferably within the library.


#8

Thanks for having a look at the source code and giving it some thought.

Well… macOS can provide file-level monitoring as well, at the lowest API level, but that can quickly get overwhelming, e.g. if the user starts to copy large amounts of files. That’s why the higher-level FSEvents monitoring tries to be smarter: it usually only tells you about changes to individual directories. And if even that becomes too much, it may choose to notify you only about the highest affected dir level, with a flag indicating that you need to scan all its subdirs as well. That makes it possible for the notifications to remain somewhat persistent even across reboots, where your watcher process may have missed some events - if it didn’t miss too much, it will just be given a broader summary if necessary, which avoids the user having to rescan everything.

So, it would be nice if the command-line option for this had at least the ability to look into particular dirs only, or optionally into all their subdirs as well, and - for other systems - to even consider only single files rather than entire dirs.