Mount duplicacy snapshot as a virtual filesystem

Droolio · 20 October 2022 20:45

Windows Server network share (UNC), and list is pretty quick. It just hangs Windows Explorer while listing the snapshots, and then the console refers to revision numbers I didn’t even click on.

david.rios.gomes · 20 October 2022 21:05

Thanks for the valuable insights.

An option to specify a single revision makes sense to optimize access on large backups, will add that.

Some issues are due to the way Windows explorer works. If you open a day folder for instance, Windows explorer will try to list the contents of all children, so if you have 12 revisions a day, it will cause the program to initialize all 12 revisions. At 366k files each, that’s 4M+ file descriptions loaded in memory already. It may be useful to add another level to prevent that as an optimization.
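
To make the cost concrete: 12 revisions × 366k files is roughly 4.4M file entries if everything is loaded as soon as the day folder is listed. A minimal sketch of deferring that work until a revision is actually opened, assuming a hypothetical `lazyRevision` type and `load` callback (neither comes from the duplicacy code base):

```go
package main

import (
	"fmt"
	"sync"
)

// lazyRevision defers loading a revision's file list until it is first needed.
// All names here are hypothetical; they are not taken from the duplicacy code base.
type lazyRevision struct {
	number int
	load   func(number int) ([]string, error) // stand-in for the expensive metadata download
	once   sync.Once
	files  []string
	err    error
}

// Files loads the file list on the first call and returns the cached result afterwards.
func (r *lazyRevision) Files() ([]string, error) {
	r.once.Do(func() {
		r.files, r.err = r.load(r.number)
	})
	return r.files, r.err
}

func main() {
	loads := 0
	loader := func(n int) ([]string, error) {
		loads++
		return []string{fmt.Sprintf("file-from-rev-%d", n)}, nil
	}

	// Twelve revisions in one "day folder"; creating the entries loads nothing.
	day := make([]*lazyRevision, 12)
	for i := range day {
		day[i] = &lazyRevision{number: i + 1, load: loader}
	}
	fmt.Println("loads after listing the day folder:", loads)

	// Opening one revision twice triggers a single load.
	day[2].Files()
	day[2].Files()
	fmt.Println("loads after opening revision 3 twice:", loads)
}
```

This only helps if the file manager does not itself descend into every revision, which is the behaviour described above for Windows Explorer; an extra directory level is one way to keep it from doing so.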

Some big memory usage comes from the internal APIs. When I was testing with my modest repository, some memory spikes would occur while accessing internal APIs to download chunks and other things. I’m probably doing some things wrong, as I’m not familiar with the code base and internal APIs.

I’ll try a couple of things and provide another test binary as soon as I can.

david.rios.gomes · 20 October 2022 21:19

Unfortunately I don’t have any real personal backups to test with yet, given that I heard about this tool for the first time 5 days ago while looking for a backup solution for my sister. While evaluating it, I found the clunky restore interface and the lack of live mounting to be deal breakers, but luckily it’s (kind of) open source…

It’s also basically my first ever Go code, so I’m definitely doing some things wrong performance-wise.

sevimo · 21 October 2022 16:05

OK, so I tried the latest build on Linux, and here are my observations/suggestions, in no particular order:

It mostly worked!
Tested storage was remote (OneDrive) running in parallel with something else, so I did hit some 429 (rate limit) retry-able errors - as expected. As such, while it did take some time to get initial listing, it was about the same time as running list command
Caching directories seems to be working fine, subsequent access was basically instantaneous
File access to individual files in snapshots worked, this is great! However, I suspect there is no support for threaded loads, which is not ideal for remote storages (see below)
I played around with a 12TB repository, and as I suspected there are no issues with large storages per se; a large number of revisions and/or a huge number of files in a single directory might be a different story (not tested)
I did not notice excessive memory consumption; saw perhaps 1-1.5GB at most, certainly not more than running the same storage/snapshots. Having said that, I did not run extensive tests, just poked around for a bit.
Not sure if the mount command is supposed to fork into the background and exit (it didn’t) or keep running while the mount is active. But when I killed it with Ctrl-C it did not unmount; I had to run umount separately

So based on the (very) limited testing, I’d say that core functionality works well. However, there are several things that can be improved, primarily on the customization/parametrization part.

Support multiple snapshots in the mounted tree. Basically, instead of top-level folder being years of a particular snapshot, make the top level out of all the snapshot names in the storage, with each having existing folder structure underneath.
Support multiple storages (-storage flag). I believe right now mount always picks the first storage from the preference file
Support multithreaded chunk download (-threads flag). This is important for remote storages as single threaded downloads could significantly underutilize available bandwidth.
Some optimizations to reduce the scope of the mount; this is a nice-to-have. Basically, instead of using all snapshots and all revisions, the ability to use only 1 snapshot/all revisions or even 1 snapshot/1 revision. This might be important for large/slow storages and applications that pre-load more than is needed (e.g. Windows Explorer)

Again, it worked well for me so far. I hope more people can test it so we can see if there are any edge cases that are not covered.

mirallo.sebastien · 22 October 2022 12:22

Sorry, I am not a developer. I am on Windows 10; my backed-up files are in C:\SourceDuplicacy1
I have installed WinFsp and replaced duplicacy.exe with davidrios’s version
command:

CD /D C:\SourceDuplicacy1
duplicacy mount Z:

This shows “mounting travail on Z:” and “Found xxx revisions”, but Z: doesn’t appear in Explorer

mirallo.sebastien · 22 October 2022 12:23

PS: I wrote a tutorial in French on using duplicacy and sending emails: Sauvegarder avec duplicacy et le stockage en ligne Storj – PCsoleil Informatique

towerbr · 22 October 2022 15:10

Congrats @david.rios.gomes , it works great! A long-awaited feature…

I really liked the implementation of the file system, with the cascading dates and then the revision number, very clever.

I noticed the same behaviors identified by @sevimo (memory, cache, etc).

I did two tests, with two buckets in Wasabi:

first:
1 snapshot and 35 revisions
Total chunk size is 17,947K in 1160 chunks

second:
7 snapshots and 316 revisions
Total chunk size is 174,466M in 26531 chunks

The only error I ran into was this “invalid syntax”, when trying to access any revisions folder (but even showing the error it normally opens the folder):

The service duplicacy has been started.
INFO MOUNTING_FILESYSTEM initRevision 35
INFO SNAPSHOT_VERSION snapshot **************** at revision 35 is encoded in an old version format
INFO MOUNTING_FILESYSTEM initRevision failed: strconv.Atoi: parsing "ini": invalid syntax
INFO MOUNTING_FILESYSTEM initRevision failed: strconv.Atoi: parsing "ini": invalid syntax
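
The thread does not say where the "ini" comes from. One guess, not confirmed here, is that Windows Explorer probes each folder for desktop.ini and part of that name reaches the revision parser. Under that assumption, a lookup could treat any non-numeric name as "not a revision" instead of logging a failure. `parseRevisionName` is a hypothetical helper; the real naming scheme in the mount code may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRevisionName reports whether a directory entry name is a revision number.
// Hypothetical helper: the real naming scheme in the mount code may differ.
func parseRevisionName(name string) (int, bool) {
	n, err := strconv.Atoi(name)
	if err != nil || n < 1 {
		return 0, false // not a revision; the caller can answer "no such file"
	}
	return n, true
}

func main() {
	for _, name := range []string{"35", "ini", "desktop.ini", ""} {
		rev, ok := parseRevisionName(name)
		fmt.Printf("%q -> %d %v\n", name, rev, ok)
	}
}
```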

Worked fine with Wasabi read-only keys.

And I endorse the same suggestions as @sevimo:

sevimo:

Support multiple snapshots in the mounted tree. Basically, instead of top-level folder being years of a particular snapshot, make the top level out of all the snapshot names in the storage, with each having existing folder structure underneath.

Support multiple storages (-storage flag). I believe right now mount always picks the first storage from the preference file

Support multithreaded chunk download (-threads flag). This is important for remote storages as single threaded downloads could significantly underutilize available bandwidth.

Some optimizations on reducing scope of the mount, this is a nice-to-have. Basically, instead of using all snapshots and all revisions, ability to use only 1 snapshot/all revisions or even 1 snapshot/1 revision. This might be important for large/slow storages and applications that pre-load more than is needed (e.g. Windows Explorer)

Plus:

Need an option to specify the RSA key.

towerbr · 22 October 2022 15:26

One more test:

5 snapshots and 307 revisions
Total chunk size is 696,630M in 105463 chunks

Worked perfectly…

david.rios.gomes · 22 October 2022 16:11

Uploaded new binaries with suggestions: Release v3.0.1-mount · davidrios/duplicacy · GitHub

Check the help command.

Added a new empty level to each revision dir to prevent Windows explorer from loading revisions too early, should work even with -flat and several revisions.

Empty folders in the backup will error if you try to open them, known issue, I’ll fix it later, but shouldn’t affect anything.

sevimo · 22 October 2022 16:42

OK, this one did not quite work. Even when not using -disk-cache option, it is still looking for sqinn, and segfaults otherwise:

Failed to init: exec: "sqinn": executable file not found in $PATH
runtime error: invalid memory address or nil pointer dereference
goroutine 1 [running]:
runtime/debug.Stack(0x0, 0x0, 0x0)
/home/david/go/src/runtime/debug/stack.go:24 +0xa5
runtime/debug.PrintStack()
/home/david/go/src/runtime/debug/stack.go:16 +0x25
github.com/gilbertchen/duplicacy/src.CatchLogException()
/home/david/work/duplicacy/src/duplicacy_log.go:233 +0x225
panic(0x1619ae0, 0x1f693f0)
/home/david/go/src/runtime/panic.go:971 +0x4c7
github.com/cvilsmeier/sqinn-go/sqinn.(*Sqinn).Terminate(0x0, 0x0, 0x0)
/home/david/go/pkg/mod/github.com/cvilsmeier/sqinn-go@v1.1.2/sqinn/sqinn.go:694 +0x42
github.com/gilbertchen/duplicacy/src.(*BackupFS).cleanup(0xc0003b0d90)
/home/david/work/duplicacy/src/duplicacy_mount.go:888 +0x36
github.com/gilbertchen/duplicacy/src.MountFileSystem(0x7fff7c3d1869, 0x7, 0xc0003b0cb0, 0xc00057b480)
/home/david/work/duplicacy/src/duplicacy_mount.go:1014 +0x3a7
main.mountBackupFS(0xc0003c25a0)
/home/david/work/duplicacy/duplicacy/duplicacy_main.go:1468 +0x11d0
github.com/gilbertchen/cli.Command.Run(0x17c81aa, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x17ee7f1, 0x22, 0x0, …)
/home/david/go/pkg/mod/github.com/gilbertchen/cli@v1.2.1-0.20160223210219-1de0a1836ce9/command.go:160 +0xd76
github.com/gilbertchen/cli.(*App).Run(0xc0003c2120, 0xc0000c4000, 0x5, 0x5, 0x0, 0x0)
/home/david/go/pkg/mod/github.com/gilbertchen/cli@v1.2.1-0.20160223210219-1de0a1836ce9/app.go:179 +0x1145
main.main()
/home/david/work/duplicacy/duplicacy/duplicacy_main.go:2335 +0x7da6
Command exited with non-zero status 101

EDIT: Actually, it fails the same way even if I have working sqinn in path. I have tried to place it into PATH, into current repository folder, into the same folder as duplicacy executable - it still can’t find it for some reason.
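
The trace shows Terminate being called with a nil receiver (0x0) from the cleanup path after initialization failed. A minimal sketch of the usual guard, using hypothetical stand-in types (`helper`, `backupFS`) rather than the real sqinn-go or duplicacy ones:

```go
package main

import (
	"errors"
	"fmt"
)

// helper stands in for the external sqinn process handle (hypothetical type).
type helper struct{ running bool }

// Terminate stops the helper. Calling it on a nil pointer would panic
// when it touches h.running, which mirrors the crash in the trace.
func (h *helper) Terminate() error {
	h.running = false
	return nil
}

// startHelper simulates launching the helper; it fails when the binary is missing.
func startHelper(found bool) (*helper, error) {
	if !found {
		return nil, errors.New("helper executable not found in $PATH")
	}
	return &helper{running: true}, nil
}

// backupFS is a stand-in for the mounted file system state.
type backupFS struct{ cache *helper }

// cleanup only terminates the helper if it was actually started.
func (fs *backupFS) cleanup() error {
	if fs.cache == nil {
		return nil
	}
	return fs.cache.Terminate()
}

func main() {
	fs := &backupFS{}
	h, err := startHelper(false)
	if err != nil {
		fmt.Println("failed to init:", err)
	}
	fs.cache = h // nil when the helper could not be started
	fmt.Println("cleanup error:", fs.cleanup())
}
```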

david.rios.gomes · 22 October 2022 16:57

That’s strange, it should work with the binary in the PATH at least. I’ll fix it later today.

david.rios.gomes · 23 October 2022 00:19

Ok, new fixed binaries uploaded.

sevimo · 23 October 2022 11:46

OK, so this improved things quite a bit. sqinn was not found because I was running under sudo with a different environment, so now it works without sqinn when -disk-cache is not specified, and with sqinn in the PATH when -disk-cache is specified. However, it still segfaults when -disk-cache is specified but sqinn is not found. It really should fail more gracefully, but that’s an easy fix.

By the way, how does -disk-cache work? Where/how/what is cached?

Filtering by revision numbers seems to work. I haven’t done any stress testing on that (i.e. non-existent/malformed revision lists), but it works for core cases. It seems that it doesn’t do extensive reading (if any) of filtered out revisions as it feels faster with single revision specified. Can you confirm what happens with revisions that are not displayed?

-flat is nice. I’d probably use purely numerical revision names (without timestamps in front) for cases when you’re looking for a specific revision number, but there is no end to how these lists could be presented/customized; it is certainly workable as the core functionality is there.

-storage seems to work as well, cool!

This leaves the inability to (easily) access multiple snapshot names in the same storage, and threaded downloads. Multiple snapshots probably have a workaround (creating separate repositories that reference different snapshots; I haven’t tried it but it should work, though it’s kinda ugly). Single-threaded download doesn’t affect functionality per se, but can make actual restores via mount unbearably slow on some storages.

All in all, great work so far!

david.rios.gomes · 23 October 2022 14:43

Fixed crash when sqinn not on path and -disk-cache used.

-disk-cache creates SQLite databases in the OS temp folder with file information from each revision and reads them from there. Chunks are still cached in memory. Duplicacy already has an internal chunk cache, but it doesn’t work fast enough for these use cases. The databases are deleted when the program ends, but not while it is running, so if you open thousands of revisions with millions of files, you could potentially run out of disk space, but I’m not worried about that TBH.

Handling of specified revisions is mostly graceful, with useful messages for invalid specifications. It is indeed faster because it only downloads metadata from revisions that will be listed.

I feel like the formatting of -flat is the right one. Revision numbers are still appended, so they are visible at a glance, and revisions are chronological in nature, so they’ll always be shown in revision number order, provided you sort files by name in your browsing tool. To me it’s also what makes the most sense when you’re just mounting without specifying a revision or listing them beforehand, because you’re usually more interested in when the revision was created.

The chunk downloader internal API does have an option to specify how many threads and I increased it to 3, so in theory it’s running 3 in parallel.

I’m planning on adding a mount-storage command for the last remaining use case.

Automatic unmounting on Linux doesn’t work no matter what I do, for some reason. I need to investigate that further.

sevimo · 23 October 2022 17:33

I am somewhat worried about running out of disk space, as I have run into problems like this before. In many cases this would be running on some appliance device, or even a server with a minimal primary drive; VMs also usually have minimal disk space allocated. VFS caching can easily clog /tmp, which may crash the whole system. A real-life example was running rclone on a mini server with a couple of GB available on a small primary SSD: rclone’s VFS caching tried to place a 5GB file there, which did not end well. There are reasons why rclone allows the VFS cache to be configured in terms of both location and maximum size. I am not familiar with sqinn, but hopefully there is a way to specify a destination and/or a maximum capacity for the disk cache.

I am thinking more about script access, where you don’t look at the list manually. But as I said, this is a non-issue, as a script can certainly find the right revision in the list; it will simply take marginally more work on filtering. If anything, I’d prefer to have some special marker for the revision list (e.g. -revisions last) that always selects the single most recent revision. Again, a nice-to-have that can certainly be worked around.

I really think you should just expose -threads argument to mount, the same way it works with other commands. Fixed number of threads is not ideal, as for some storages it will still not be enough, and for others it might be too much and you’ll be bumping into rate limiters. But if the code supports multithreaded downloads already, this should be a fairly trivial thing to implement.

Is there a reason to split mount-storage into a separate command? Not that I see a particular problem from a user perspective, but it sounds more like another parameter to mount than its own command. But I haven’t looked at the implementation; perhaps it makes more sense there.

david.rios.gomes · 23 October 2022 18:09

Yeah, -threads is already done, and also rate limiting, those will be in the next release.

Adding more features is not something I’m willing to put in time now. Maybe in the future, or maybe the maintainers can add them to make it feature complete.

A new command is needed because it has some semantic differences on arguments. It uses the same underlying code, it’s just a matter of parameter parsing.

david.rios.gomes · 23 October 2022 21:21

Latest binaries uploaded.

New command: mount-storage. You can now mount a storage directly without being in a repository directory. The root level consists of folders named after the snapshot IDs in the storage, followed by one empty level to prevent Windows Explorer from loading too early, and then the same structure as regular mount. You may need to specify a repository dir anyway depending on your storage backend, for instance to validate SSH hosts.
New options added
Fixed unmount on Linux

Unless there’s a breaking bug, this is probably my last release, as it’s already working well enough for me. The PR remains open; the official maintainers can pick up from there if they want to improve upon the base implementation.

Droolio · 24 October 2022 00:54

Your efforts are appreciated and I hope some form of mount eventually gets into a full release… I’ve yet to test your latest version but I’m hoping the memory use is improved.

However, I would point out that the choice of parametrisation will probably have to change to match the rest of Duplicacy’s CLI interface - i.e. working from the current repository. I agree with @sevimo that a mount-storage is probably unnecessary, when mount -all, and -storage, -id and -r flags already exist for other commands that might operate on the non-default storage, and matching that nomenclature would be preferred. I’m also unsure if a separate disk cache, stored in a completely separate location, is desirable.

(TBH, Duplicacy is a bit lacking in regards to allowing multiple revisions and/or IDs to be referenced in a single command; it’d be nice if -r supported ranges or a list, for example.)

But that’s something for @gchen to ultimately decide…

david.rios.gomes · 24 October 2022 02:11

The parametrization already matches the standard ones. mount follows list and related commands, and mount-storage follows info, which is the other command that works directly on storages.

Droolio · 24 October 2022 02:56

That’s not quite true - most operations, like list, operate on just the current storage unless you use -all or -id. There is no list-storage, for example, because it’s handled by just list. That’s the way it should work with mount IMO - mount on just the current storage, with mount -all doing all.

info is different because it was meant to be used by the GUI only.