Can Duplicacy backup directories outside of the repository?

I am thinking of purchasing but want to make sure it can do what I need.

For example, some of the files I need backed up are in %appdata%, some are on other hard drives, etc. Can I use one repository to back up everything?

If you create a symbolic link directly under the repository directory, Duplicacy will follow the symbolic link and scan the directory it refers to. For symbolic links residing in a subdirectory of the repository, Duplicacy does not follow them and simply backs them up as symbolic links.

On Windows, to create a symbolic link, open a Command Prompt window and enter the following commands:

cd \path\to\repository
mklink /d appdata \path\to\appdata

On macOS, enter the following commands in Terminal:

cd /path/to/repository
ln -s /path/to/appdata appdata

Please verify that it works for you before you make the purchase.

I’m trying to implement this solution, but I am having trouble with folders that aren’t at the root of the drive. Any that are at the root of the C drive work fine, but my profile folder, for instance, behaves differently or doesn’t work at all depending on how I write the command.

mklink /d chads \Users\chads

This creates a symlink in c:\DuplicacyRepo whose target is c:\DuplicacyRepo\Users\chads. Sometimes the link works and sometimes it says c:\DuplicacyRepo\Users\chads could not be found (because it doesn’t exist).
This post sort of describes the issue; there are a few other posts that describe it too, but I haven’t seen a solution.
https://www.tutcity.com/windows-7/root-relative-symbolic-links-not-resolving-properly-properties-window.60780.html

mklink /d “c:\DuplicacyRepo\chads” “c:\Users\chads”

This command creates a symlink with a target of c:\Users\chads, unlike the others, where the target becomes c:\DuplicacyRepo\symname.
This may end up working, but I’m hesitant: when I try to exclude a folder under this symlink only, I get a complaint that the directory to be excluded must be a subdirectory inside the repository. I’m not sure whether the program will choke on this link, and if it doesn’t, what my filter pattern should look like, since this link works differently from the other symlinks.


I realize these are Windows issues (I’ve never had issues with symlinks in Linux), but I am at a loss as to what to do here.

Edit: I’m going to try it with the second command in place and see what happens. Filters as follows: -chads/AppData/

The target of the symlink must be an absolute path – i.e., “c:\Users\chads”. So your second command should work.

The include/exclude patterns should always contain only paths relative to the repository, and Duplicacy treats directories pointed to by first-level symlinks as if they were first-level subdirectories. Therefore, “c:\Users\chads\AppData” becomes “c:\DuplicacyRepo\chads\AppData”, and “-chads/AppData/” should be the right pattern to exclude it.
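Putting the two together, a minimal sketch using the paths from this thread (the repository location c:\DuplicacyRepo is just the one used above; adjust for your own setup):

cd c:\DuplicacyRepo
mklink /d chads "c:\Users\chads"

and then a single filter line, written relative to the repository root:

-chads/AppData/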

This did indeed work. I was thrown off by the fact that the other targets I created were not absolute paths.

For instance, I ran

mklink /d Access \Access

The target according to the properties is c:\Duplicacy\Access. It all worked out in the end.

Scratch that. I was just looking in the restore list, and it only backed up the one with the absolute path. It seems to have skipped the rest. I will recreate the other symlinks and run it again.

I also forgot a few exclusions. On that note, if I wanted to essentially prevent deletion of files less than 90 days old, should I just use “Keep 1 snapshot every [1] days if older than [90] days”?
My reasoning is that I see no point in deleting anything within 90 days because of Google Cloud Storage Coldline’s early-deletion charges. Deleting data less than 90 days after upload incurs a $0.014 charge (the exact charge varies depending on when you delete it). Since it costs about $0.005 per month to store, keeping it for the full 90 days comes to roughly $0.015 ($0.005 × 3), plus the API call to delete, so you are essentially paying for it whether it is stored or not.
https://cloud.google.com/storage/pricing

PS
I have to say, I read the entire design documentation as well as some of the other documentation like the guide, etc. It was a very interesting read and your responses to these forum posts are so thoughtful and quick. Everything I have read about the design is captivating and has had me rethinking limitations of backups all day. The fact that every file is its own hash, the way deletions are handled, even the encryption is just so darn clever and simple.

My only concerns are

  1. How slow is it going to be to load a file list when I try this on my home server with ~5 TB of data and 1,858,775 files (not sure if this number is accurate)? I imagine databases are a lot faster than writing to and parsing a file.
  2. How many API calls is it going to make to look for existing files (a concern for both speed and cost)? Hopefully this is cached.

I am excited to see how this does. I’m almost tempted to buy a commercial license if this fits my needs just to have a look at how it works.
(sorry, I know much of this is off topic. I can edit this post and remove irrelevant information if you want).

if I wanted to essentially prevent deletion of files less than 90 days old, should I just use “Keep 1 snapshot every [1] days if older than [90] days”?

The “Keep 1 snapshot every [1] days if older than [90] days” policy doesn’t apply to snapshots less than 90 days old, so it won’t delete any snapshots less than 90 days old.
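For reference, if you end up using the CLI, the same retention policy would presumably be expressed with the prune command’s -keep option (this is an assumption about the CLI syntax; please check the guide for the exact form):

duplicacy prune -keep 1:90

That is, keep one snapshot per day for snapshots older than 90 days, and leave snapshots newer than 90 days untouched.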

How slow is it going to be to load a file list when I try this on my home server with ~5 TB of data and 1,858,775 files (not sure if this number is accurate)? I imagine databases are a lot faster than writing to and parsing a file.

The snapshot files are in JSON format. I don’t know whether loading them is faster or slower than loading from a database, but I’m pretty sure this won’t be the bottleneck. The more likely bottleneck is your upload bandwidth and the actual speed allowed by your cloud storage. In addition, Duplicacy needs to load the entire list of those 1,858,775 files into memory, but my guess is that you have enough memory for that.

How many API calls is it going to make to look for existing files (a concern for both speed and cost)? Hopefully this is cached.

Duplicacy uses the set of chunks referenced by the last snapshot as the known-chunk cache. Any chunk found in this cache is skipped. Any other chunk is treated as a new chunk and requires one API call to check whether the chunk already exists in the remote storage and another API call to actually upload it. So it is mostly two API calls for new chunks, and none for chunks existing in the last snapshot.

Your answers only make me more excited to test this against my larger dataset.

Only two API calls for newly uploaded chunks is perfectly fine.

A database would typically be faster to read from disk without having to keep everything in memory, especially with sorted indexes on the search criteria/columns, but since you are loading the list into memory anyway, I’m not sure it would be (as long as the search algorithm is efficient). One thing I noticed with the restore list is that it takes a long time to load with just 100 GB of files; I am guessing this is due to parsing the JSON file.

If a database were included alongside the snapshot files, it could be used to load the snapshot data initially, making everything much faster. It should also make adding multithreading a little easier, since the database would handle locking. What’s more, you could run searches directly against the local database without loading everything into memory, and the searches would still be fast. If someone wanted to install the software from scratch and download an existing repository, the database could be rebuilt from the last JSON snapshot stored at the storage location.

Of course, you know what is best for your application. If this idea or some variation of it can be implemented to improve the program, great. If not, that’s fine too. If you like the idea, I wouldn’t mind discussing it further or even collaborating on it some (I code database-driven applications for a living).

This is an interesting idea. We plan to make the source of the CLI version publicly available within a month, likely under the Fair Source License, so maybe by that time you can start working on this feature if the license works for you.

That is an awesome license; I don’t think I’ve seen it before. I can’t believe I hadn’t heard of this software before this week. I’ve been looking for a while, and there are so many backup tools that it’s easy to overlook even the good ones.
The upload looks like it is going to be fast, and it wasn’t using too much RAM when I was able to check on it (<1 GB). If the connection issue can be fixed soon, this will definitely be what I go with. Heck, maybe I’ll just switch to Backblaze if it takes longer than I’d like; the prices look similar to GCP Coldline anyway. I do have a $300 free trial credit with GCP, though.

Am I right that I can only back up a folder from an external drive by creating a symlink in the repository folder on the internal hard drive?

If you set up a repository on the internal hard drive, and you want to back up an external drive as well using the same repository, then you’ll need to create a symlink.
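For example, a minimal sketch, assuming the repository is at c:\DuplicacyRepo and the external folder to back up is E:\Data (both paths are placeholders, not taken from your setup):

cd c:\DuplicacyRepo
mklink /d external "E:\Data"

The contents of E:\Data will then be backed up as if they were in a first-level subdirectory named external, and any filter patterns for it should be written relative to the repository root (e.g. -external/SomeFolder/).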

We’re planning a major rewrite of the GUI version to support multiple repositories.

Is there an ETA on the version to support multiple repositories?

I’ll start working on this one after finishing the fast-resuming feature; it should be ready in a month at most. I’m still contemplating whether the GUI version should be rewritten in Go using a Go GUI library such as https://github.com/therecipe/qt, but it is probably a good idea to implement multiple repositories before rewriting the GUI version.