Improving the instructions for include/exclude patterns (filters)

I think the instructions for the include/exclude patterns could be improved, but I am myself too unsure of things to make any edits in the wiki. So I’m starting this thread to clarify some things first and once I (or anyone else) has a clear view of things, I/they can edit the wiki.

So, let’s start with the first example under “1. Wildcard matching”:

+foo/bar/*
-*

It is introduced by saying

the following pattern list doesn’t do what is intended, since the foo directory will be excluded so the foo/bar will never be visited

It remains unclear what actually is intended, but it seems that the aim is to include only foo/bar/* (I don’t understand what difference it makes to say foo/bar/* instead of foo/bar/). We are then told that the reason it won’t work (I assume that means: nothing at all will be backed up) is that “the foo directory will be excluded so that foo/bar will never be visited”.

But why is this so? Further up, we are told that

the order of the patterns is significant. If a match with an include pattern is found, the path is said to be included without further comparisons.

If that is the case, then that means, the first thing that duplicacy finds in the filter file is +foo/bar/*, which means, the first thing it will do is to include that path “without further comparisons”. Only then will it find the -* and exclude everything, i.e. everything else. So, it is not clear to me why “the foo directory will be excluded so that foo/bar will never be visited”.

Anyway, if I accept that there is a problem with the first example, I nevertheless don’t understand why the second example is the (best) solution:

+foo/bar/*
+foo/
-*

Looking at that example, it strikes me that the first and last line end with an asterisk but not the second one. My hunch is that this is because we only want to include the foo directory, but not everything in it (though it sounds strange to say you want a directory but not its contents). But if the idea is that foo needs to be included so that duplicacy will visit foo/bar at all, then why does +foo/ not come first, i.e. before +foo/bar/*?

Ironically, the third example makes sense to me:

-foo/bar/
+foo/*
-*

Perhaps it’s because the only thing we are told about that example is that it “includes only files under the directory foo/ but not files under the subdirectory foo/bar”, which means there is not so much room for discovering incoherences…

The way I interpret each line is like this:

Line 1: -foo/bar/ We exclude foo/bar/ before including anything else, because whatever comes first takes priority over what comes later. So if foo were included first, it would be “included without further comparisons”, which means that we would not be able to exclude any of its subdirectories.

Line 2: +foo/* Now that we have made sure that foo/bar is excluded once and for all, we can include foo with all its contents (except the ones excluded earlier)

Line 3: -* Finally, we exclude “everything”, which means “everything except for what has previously been explicitly included” so that we get a backup of everything in foo/, except foo/bar/.

I’d love to hear where I’m wrong because if I continue building my filter file with my current mindset, I’d probably have to completely rework it once I learn how things really work…

Duplicacy borrows the include/exclude model from rsync. First, the indexing starts at the root of the repository and does a non-recursive listing. Every file or folder is matched against the patterns in order. If an exclude pattern is matched, the file or folder will not be backed up. If it is a folder, Duplicacy will not descend into the excluded folder to list its files/subdirectories. Therefore, it is possible to exclude an entire subdirectory tree at once without wasting time checking every file/folder under it.

Note that when indexing a subdirectory, what is being matched against the patterns is the path relative to the repository root.
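To make the "first match wins" rule concrete, here is a toy model in Python. It is only a sketch, not Duplicacy's actual code: the function name `decide` is made up, directories are written with a trailing /, and Python's `fnmatchcase` stands in for the wildcard matcher because its * also matches the / separator.

```python
from fnmatch import fnmatchcase

def decide(path, patterns):
    """Return (included, matching_pattern) for one path, first match wins.

    Directories carry a trailing "/" and paths are relative to the
    repository root. fnmatchcase's "*" matches any characters, "/" included.
    """
    for pattern in patterns:
        sign, body = pattern[0], pattern[1:]
        if fnmatchcase(path, body):
            return sign == "+", pattern
    # Assumption: never reached below, because "-*" matches every path.
    return False, None

example_1 = ["+foo/bar/*", "-*"]
example_2 = ["+foo/bar/*", "+foo/", "-*"]

# Listing the root yields the folder "foo/", never "foo/bar/..." directly.
print(decide("foo/", example_1))      # (False, '-*')
print(decide("foo/", example_2))      # (True, '+foo/')
print(decide("foo/bar/", example_2))  # (True, '+foo/bar/*')
```

In this model the first example fails because the only item ever compared at the root is foo/, which +foo/bar/* does not match, so -* excludes it and nothing below it is listed.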


I’m afraid I still don’t follow. Could you give a step by step description of how the matching process works? Like: duplicacy finds the file foo/bar/file.txt, what does it do next to decide whether it should be backed up?

Assuming foo/bar/file.txt is the only file you want to back up, the correct patterns are:

+foo/bar/file.txt
+foo/bar/
+foo/
-*

Duplicacy lists the root of the repository. The foo/ folder matches the third pattern, while all others will match the fourth one and thus get excluded. In the next step Duplicacy lists the foo/ folder and only foo/bar/ will be included. Finally, Duplicacy lists foo/bar/ and finds foo/bar/file.txt which is the only file with a matching include pattern while all others will be excluded by -*. Overall, Duplicacy only needs to list 3 directories to locate the only file to be included without iterating through the entire directory tree.
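The walk described above can be sketched in Python to count which directories get listed. This is a simplified model under stated assumptions, not the real implementation: the tree contents other than foo/bar/file.txt are invented, `walk` and `included` are hypothetical names, and `fnmatchcase` approximates the wildcard matching.

```python
from fnmatch import fnmatchcase

def included(path, patterns):
    """First matching pattern decides; "+" includes, "-" excludes."""
    for pattern in patterns:
        if fnmatchcase(path, pattern[1:]):
            return pattern[0] == "+"
    return False  # not reached here: "-*" matches every path

def walk(tree, patterns):
    """tree is a nested dict: a dict value is a folder, None is a file."""
    listed, backed_up = [], []
    pending = [("", tree)]  # (path relative to repository root, folder contents)
    while pending:
        prefix, folder = pending.pop(0)
        listed.append(prefix or "<root>")
        for name, child in folder.items():
            is_dir = child is not None
            path = prefix + name + ("/" if is_dir else "")
            if not included(path, patterns):
                continue  # an excluded folder is never listed
            if is_dir:
                pending.append((path, child))
            else:
                backed_up.append(path)
    return listed, backed_up

# Invented repository contents for illustration.
tree = {
    "foo": {
        "bar": {"file.txt": None, "other.txt": None},
        "baz": {"deep": {"x.txt": None}},
    },
    "docs": {"readme.md": None},
    "top.txt": None,
}
patterns = ["+foo/bar/file.txt", "+foo/bar/", "+foo/", "-*"]

listed, backed_up = walk(tree, patterns)
print(listed)     # ['<root>', 'foo/', 'foo/bar/']
print(backed_up)  # ['foo/bar/file.txt']
```

Only three listings happen; docs/ and foo/baz/ are dropped as soon as they are seen, so their contents are never examined.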


Okay, let me try to make the process even more explicit:

  1. Duplicacy lists the root of the repository.
  2. D takes the first item on the list (let’s say it’s bar/foo) and compares it to the first pattern -> doesn’t match
  3. D compares it to the second pattern -> doesn’t match
  4. D compares it to the third pattern -> doesn’t match
  5. D compares it to the fourth pattern -> matches!
  6. Since the matching pattern is an exclude pattern, D ignores bar/foo (i.e. does not back it up, and - important in the present context: does not list its contents either).
  7. D proceeds in the same way with every item in the list until it reaches the foo/ directory:
  8. D compares it to the first pattern -> doesn’t match
  9. D compares it to the second pattern -> doesn’t match
  10. D compares it to the third pattern -> bingo!
  11. Since the matching pattern is an “include folder” pattern (and not an include-folder-with-all-its-contents pattern, which would be +foo/*), it lists the contents of the folder and proceeds as above by comparing each item with each pattern. And so on.

Right?

Bonus question: if the third pattern had been +foo/* D would have acted differently in step 11. It would simply have backed up all the content in foo/, including all subdirectories. Right? That means that there would be no point in having the first two patterns, as they are redundant. Right?

Basically right, but there is a slight error in step 2:

  2. D takes the first item on the list (let’s say it’s bar/foo) and compares it to the first pattern -> doesn’t match

It would be bar instead of bar/foo, since when you list the root of the repository you’ll only see bar.

Bonus question: if the third pattern had been +foo/* D would have acted differently in step 11. It would simply have backed up all the content in foo/, including all subdirectories. Right? That means that there would be no point in having the first two patterns, as they are redundant. Right?

Exactly.


Oh yes, “seems”… I struggled here and fell… I decided to first finish the world formula, then I’ll continue to think about this. What is intended and why will foo be excluded.

What I fear is: I will understand the filter system in a year or two. But what if I do not create filters for some weeks… I may have to study the whole thing again :expressionless:

I submitted a few updates to the wiki to better explain the part causing confusion. Please read, if it is still not clear, I can work on it a bit more. I, too, struggled with understanding what was going on until I realized that gchen had implemented it as what CS people call a ‘breadth first search’. Once I got that, it all started making much more sense.


I don’t know if this will help clear things up or muddy the waters:

The thing that seems to work for me when thinking about it is to separate the patterns into those that match directories from those that match files. The ones that match directories control the search of the tree. The ones that match files control which of those in the pruned tree are selected. The one weird thing is that a top level symbolic link to a directory can only be matched without a trailing /.

So to get only my .iso files below my Music/SACD ISOs/ directories (where Music is a symbolic directory link to my music files), I need to use

+Music/SACD ISOs/*/
+Music/SACD ISOs/
+Music
i:(?i)\.iso$

I think all of those lines are necessary and together sufficient.
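For anyone wanting to check what the regex line does on its own, here is a quick Python sketch. Assumptions: Python's `re` module stands in for the regex engine Duplicacy actually uses (this particular expression reads the same in both), the search is unanchored, and the album and file names are made up.

```python
import re

# The part after "i:" in the pattern list above.
iso = re.compile(r"(?i)\.iso$")

# Hypothetical paths, relative to the repository root.
paths = [
    "Music/SACD ISOs/Album/disc.iso",
    "Music/SACD ISOs/Album/DISC.ISO",
    "Music/SACD ISOs/Album/Disc.Iso",
    "Music/SACD ISOs/Album/cover.jpg",
    "Music/SACD ISOs/",
]

for path in paths:
    print(f"{path:34} {bool(iso.search(path))}")
```

The three .iso spellings match and cover.jpg does not. The directory Music/SACD ISOs/ does not match either, since it ends in /, which is why the directory patterns are still needed to reach the files.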

I have a folder with subfolders for each month

folder/
├── 2019-07/
│   ├── file1.axx
│   ├── file2.txt
│   ├── file3.txt
├── 2019-08/
│   ├── file4.axx
│   ├── file5.txt
│   ├── file6.txt

and I back up just one file type (axx) in subfolders (there are others that don’t need backup).

This filter works well:

+*/
+*.axx
-*
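Here is a small Python check of which pattern decides each path in that tree. It is a sketch only: I am assuming folder/ sits directly in the repository root, `first_match` is a made-up helper, and `fnmatchcase` approximates the wildcard matching (its * also matches /).

```python
from fnmatch import fnmatchcase

patterns = ["+*/", "+*.axx", "-*"]

def first_match(path):
    """Return the first pattern whose body matches the path, or None."""
    for pattern in patterns:
        if fnmatchcase(path, pattern[1:]):
            return pattern
    return None

# Directories carry a trailing "/"; paths are relative to the repository root.
paths = [
    "folder/",
    "folder/2019-07/",
    "folder/2019-07/file1.axx",
    "folder/2019-07/file2.txt",
    "folder/2019-08/file4.axx",
    "folder/2019-08/file5.txt",
]

for path in paths:
    print(f"{path:28} {first_match(path)}")
```

Every directory is kept by +*/ so the whole tree gets listed, the .axx files are picked up by +*.axx, and everything else falls through to -*.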

Maybe the equivalent for your case is something like this?

+Music/SACD ISOs/
+Music/
+*.iso
-*

Hi,
I think since @tedsmith’s Music is a symbolic link (which is a file, not a ‘real’ folder), the filter still needs the link name without trailing /:

+Music

This detail would be nice to add to the Filters How-to!

Here is more discussion:

Besides the symbolic link issue, my tree of files is unbalanced. In some sections there are only a few levels of directories and in others many more. The pattern matching gets more complicated when you need to deal with that too. It took a lot of experimentation to get a minimal set of patterns to do the job needed. I’d be happy if I could simplify it a little more, but it’s not bad now. My only point was that IMO it’s best to think about the directory selections (ending in /'s) separately from the file selections (which don’t end in a /, but with the caveat that the top level symbolic links to directories can’t have trailing /'s).
So I need

+Music

for the top level symbolic link and to allow seeing the SACD ISOs directory, and

+Music/SACD ISOs/

to only work on that subtree and to see that subtree at all, and

+Music/SACD ISOs/*/

to get to arbitrary levels below that, and

i:(?i)\.iso$

to get the iso files (which don’t all have the same case of I, S and O’s for other silly reasons)
In this case no trailing -* is needed.


I would say this entire include/exclude mechanism is unintuitive and needs a redesign, not just the documentation.
.gitignore syntax has some good ideas.

Some concepts:

  • * should not include directory separators.
  • ** can be a string with directory separators, e.g. aa/bb/cc.
  • aa/bb/cc/* should imply that we also include the parent directories aa/, aa/bb/ and aa/bb/cc/.

For backwards compatibility, there could be a declaration at the top of a file to choose the new filter syntax. Otherwise the old syntax will be used.
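To illustrate the three concepts, here is a rough Python sketch of how such a syntax could behave. Everything in it is hypothetical: it is neither Duplicacy's current behaviour nor a faithful .gitignore implementation, and the names `to_regex`, `matches` and `implied_parents` are invented for this example.

```python
import re

def to_regex(pattern):
    """Translate a proposed-style pattern: "*" stays within one path
    component, "**" may span directory separators."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # may cross "/"
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # stops at "/"
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))

def matches(pattern, path):
    return to_regex(pattern).fullmatch(path) is not None

def implied_parents(pattern):
    """Parent directories that a pattern like aa/bb/cc/* would imply."""
    parts = pattern.split("/")[:-1]
    return ["/".join(parts[:n]) + "/" for n in range(1, len(parts) + 1)]

print(matches("foo/*", "foo/a.txt"))       # True
print(matches("foo/*", "foo/bar/a.txt"))   # False
print(matches("foo/**", "foo/bar/a.txt"))  # True
print(implied_parents("aa/bb/cc/*"))       # ['aa/', 'aa/bb/', 'aa/bb/cc/']
```

With implied parents, a single line such as aa/bb/cc/* would no longer need the extra +aa/ and +aa/bb/ lines that the current syntax requires.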