Symbolic link repository and filters

I have read a number of topics and posts on this subject (e.g. Move duplicacy folder/Use symlink repository, Filters/Include exclude patterns, Improving the instructions for include/exclude patterns (filters)), and I think I understand, but I want to double-check.

I have a directory at /backup that I intend to be my symlink repository. My first attempt at setting this up looked like this (please don’t make fun of my default rpi user haha):

/backup
├── boot
│   ├── cmdline.txt -> /boot/cmdline.txt
│   └── config.txt -> /boot/config.txt
├── etc -> /etc
├── home
│   └── pi -> /home/pi
└── var
    ├── backups
    │   ├── group.bak -> /var/backups/group.bak
    │   ├── gshadow.bak -> /var/backups/gshadow.bak
    │   ├── passwd.bak -> /var/backups/passwd.bak
    │   └── shadow.bak -> /var/backups/shadow.bak
    ├── log -> /var/log
    └── www -> /var/www

There are only a couple of specific files I want out of /boot, so I created a directory at /backup/boot and symlinked those two files in there. All of /etc can come (this is obviously not the greatest plan, but I didn’t feel like sifting through everything while I’m evaluating Duplicacy). There is only one specific user home directory that I want to back up, so once again I created /backup/home and symlinked /home/pi in there. The /var situation is even more complicated, because there’s a ton of stuff in there that I know I don’t care about, but there is also a pretty scattered collection of stuff that I do want. So, I ended up with the symlink structure you see in the tree above.
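For the record, recreating that layout takes only a handful of commands. Here I use a scratch directory as a stand-in for /backup, so this is safe to run anywhere (the link targets are the paths from my tree; `ln -s` happily creates dangling links if a target doesn’t exist on your machine):

```shell
# Recreate the symlink-repository layout above.
# A scratch directory stands in for /backup so this is safe to run anywhere.
DEST=$(mktemp -d)

mkdir -p "$DEST/boot" "$DEST/home" "$DEST/var/backups"

# Only two specific files out of /boot:
ln -s /boot/cmdline.txt "$DEST/boot/cmdline.txt"
ln -s /boot/config.txt  "$DEST/boot/config.txt"

# All of /etc, and one user home directory:
ln -s /etc     "$DEST/etc"
ln -s /home/pi "$DEST/home/pi"

# The scattered bits of /var that I care about:
for f in group gshadow passwd shadow; do
    ln -s "/var/backups/$f.bak" "$DEST/var/backups/$f.bak"
done
ln -s /var/log "$DEST/var/log"
ln -s /var/www "$DEST/var/www"

echo "$DEST"
```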

However, as I’m sure you all well know, this kind of multi-level symlink structure does not work with Duplicacy. When I run a duplicacy backup job on this repository, the etc files are the only files that get successfully captured in a revision. From the documentation on symbolic links:

I’m curious about that "by default", but for now I’m just taking it as gospel that there is no way to get Duplicacy to follow symlinks that exist at any depth of the directory structure other than the root level of the repository. So, with that in mind, I recreated my /backup directory as follows:

/backup
├── boot -> /boot
├── etc -> /etc
├── home -> /home
└── var -> /var

Obviously Duplicacy will follow these root-level symbolic links. However, now I need to create a complex .duplicacy/filters file in order to curate my repository so that it matches my original plan. I haven’t tested it yet (because, frankly, I’m finding this a little frustrating), but I believe this set of filters should work:

+boot/cmdline.txt
+boot/config.txt
+boot/
+etc/*
+etc/
+home/pi/*
+home/pi/
+home/
+var/backups/group.bak
+var/backups/gshadow.bak
+var/backups/passwd.bak
+var/backups/shadow.bak
+var/backups/
+var/log/*
+var/log/
+var/www/*
+var/www/
+var/
-*

However! Even if those filters do work, I’m not sure I’d be particularly thrilled with this setup. I was very optimistic about the original (but doomed-to-fail) symlink repository structure, particularly because it was so easy to survey exactly which files would be backed up, simply by browsing the directory structure of /backup. Under the new “correct” symlink structure, one must carefully reconcile the rules in .duplicacy/filters against the repository directory structure to work out which files will actually be backed up. Am I missing something, or is this pretty much the shape of it? Is there some secret method for coercing Duplicacy into following symbolic links that are encountered deeper in the repository than the root? I can’t find any mention of one in the documentation. Or are filters like this the only option?
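As a sanity check on my reading of the rules (patterns are evaluated top to bottom, first match wins), here is a toy matcher. To be clear, this is a simplification of what I understand Duplicacy’s semantics to be, not its actual implementation; it leans on the fact that shell `case` globs let `*` cross `/` boundaries, which matches my understanding of Duplicacy’s `*`:

```shell
# Toy first-match-wins filter evaluation -- a simplification of my
# understanding of Duplicacy's matching rules, not its real implementation.
decide() {
    path=$1; shift
    for pat in "$@"; do
        case $pat in
            +*) verdict=include glob=${pat#+} ;;
            -*) verdict=exclude glob=${pat#-} ;;
        esac
        # In a 'case' pattern, '*' matches across '/' as well.
        case $path in
            $glob) echo "$verdict"; return ;;
        esac
    done
    echo exclude   # unmatched; moot here, since the trailing -* catches everything
}

# The filter list from my post, in order:
set -- '+boot/cmdline.txt' '+boot/config.txt' '+boot/' \
       '+etc/*' '+etc/' '+home/pi/*' '+home/pi/' '+home/' \
       '+var/backups/group.bak' '+var/backups/gshadow.bak' \
       '+var/backups/passwd.bak' '+var/backups/shadow.bak' \
       '+var/backups/' '+var/log/*' '+var/log/' '+var/www/*' \
       '+var/www/' '+var/' '-*'

decide 'etc/ssh/sshd_config' "$@"   # include (via +etc/*)
decide 'boot/cmdline.txt'    "$@"   # include (exact match)
decide 'boot/initrd.img'     "$@"   # exclude (falls through to -*)
decide 'var/cache/apt'       "$@"   # exclude (falls through to -*)
```

The `boot/initrd.img` case is the one that reassured me: `+boot/` matches only the directory itself, so unlisted files inside boot still fall through to `-*`.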

Edit: I should add that in all other aspects of my evaluation, Duplicacy is proving to be a fantastic and delightful tool, so I don’t mean to sound negative on it overall. Duplicacy has successfully backed up multiple terabytes of my most sensitive and irreplaceable data (family photos, mostly) to B2, and the data deduplication approach has cut the storage cost to a fraction of what it would be if I simply copied my data up to a B2 bucket via rsync or similar. Very impressive!

Thinking more on this, I wonder whether it would be possible to use filters in combination with the first /backup tree I posted. If /backup is the root of the repository, and /backup/home/pi is a symbolic link (which Duplicacy would not follow, since it is one directory too deep in the hierarchy), could a combined filter of +home/pi/* and +home/pi/ “force” Duplicacy to follow that link? If this functionality does not already exist in Duplicacy, I would request it. It seems like a reasonable, if redundant, compromise between the flexibility of arbitrary repository symlink topologies and the explicitness of the filter system.

Having slept on it, I realize that for this type of backup — where the repository is less of a directory, and more of a host-wide (but sparse) snapshot spanning across one or more entire filesystems — it’s probably best to abandon the idea of organizing the repository within a /backup directory using symbolic links at all. Instead, I should leverage the -repository flag during init to point Duplicacy at /, and rely solely on the .duplicacy/filters file to define the map of files for backup.
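If I go this route, I believe the init invocation would look something like the following. The snapshot ID and storage URL are made-up placeholders, and I’m only assembling and printing the command here rather than running it; the real invocation would be made from whatever directory is meant to hold the preferences:

```shell
# Hypothetical names: 'rpi-root' snapshot ID, 'b2://my-bucket' B2 storage.
# The command is assembled as a string and printed for illustration only;
# -repository points the backup root at / while the preferences live elsewhere.
SNAPSHOT_ID='rpi-root'
STORAGE='b2://my-bucket'
CMD="duplicacy init -repository / $SNAPSHOT_ID $STORAGE"
echo "$CMD"
```

With the repository root at /, the .duplicacy/filters file then carries the entire map of what gets backed up.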

From the thread Newbie: Recommended usage of -repository and -pref-dir:

I will make another attempt following this model.

I personally don’t like symlinks, I’ve had some problems with them in the past (not related to backups).

Regarding the above solution, I have been using it for over a year without problems. Today I use a slightly better version, in which the filter file is referenced from within the preferences file:

"filters": "D:/...../filter_job1.txt"

that is, I don’t use the default filters file in the .duplicacy folder.

I think this approach is interesting because I can define specific filters for different jobs that use the same repository.

I’m afraid I don’t quite follow what the benefit is here. Could you elaborate?

If you don’t use the "filters": key in the preferences file, Duplicacy will use the default filters file, which is unique per repository.

On the other hand, if you use the "filters": key, you can specify several filters per repository, depending on the jobs / configuration you use.

For example, I have a large repository, in which only a few subfolders are updated daily, so I don’t need to back up the entire repository every day, but only these subfolders (I back up the full repository weekly).

So I set up two IDs in the preferences file:

    {
        "name": "repo1-full--B2",
        "id": "repo1-full",
        "repository": "D:/repo1",
        "storage": "b2://bucket-repo1",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "nobackup_file": "",
        "keys": null,
        "filters": "D:/filters/filter_repo1_full.txt"
    },
    {
        "name": "repo1-subfolder1--B2",
        "id": "repo1-subfolder1",
        "repository": "D:/repo1",
        "storage": "b2://bucket-repo1",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "nobackup_file": "",
        "keys": null,
        "filters": "D:/filters/filter_subfolder1.txt"
    }

(note the different filters)

I have a script configured to perform daily backup of repo1-subfolder1--B2 and weekly backup of repo1-full--B2.

Note that the storage (bucket) is the same, so the two snapshot IDs take full advantage of deduplication.
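If this were on Linux, the scheduling script could be as simple as a crontab fragment like the one below. The times and paths are made up, and I’m assuming the -storage option is used to select the named entry from the preferences file:

```shell
# Hypothetical crontab fragment (times and paths are assumptions).
# -storage selects the named storage entry from the preferences file.
# Daily at 02:00: subfolder-only job
0 2 * * *  cd /mnt/d/repo1 && duplicacy backup -storage repo1-subfolder1--B2
# Weekly on Sunday at 03:00: full job
0 3 * * 0  cd /mnt/d/repo1 && duplicacy backup -storage repo1-full--B2
```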


Another way to do this would be to create two different repositories: one for the complete folder and another for the subfolder:

        "name": "repo1-full--B2",
        "id": "repo1-full",
        "repository": "D:/repo1",
        ...

        "name": "repo1-subfolder1--B2",
        "id": "repo1-subfolder1",
        "repository": "D:/repo1/subfolder1",
        ...

As you can see, there are several ways to configure it, but I personally prefer to keep only one repository and work with filters (the first option).