Query regarding retrieving file info from snapshots

I’ve been putting together a prototype duplicacy fuse file-system (don’t get too excited, it’s mostly unusable) using https://github.com/billziss-gh/cgofuse

It’s at the stage where I can mount a repository, list all snapshot IDs on the specified storage, and then browse down into each revision and the directory structure underneath.

Opening files is not yet implemented and any operations that make changes are deliberately not implemented so it’s basically read only.

Where I’ve got concerns is how to retrieve file and directory data for the following FUSE file system operations, which are needed to satisfy the FUSE interface.

Readdir: getting a list of files/dirs in a particular path
Getattr: get the attributes of a particular path (i.e. stat info)

I’m currently doing this using the following:

manager := duplicacy.CreateBackupManager(snapshotid, ...)
manager.SetupSnapshotCache(...)
snap := manager.SnapshotManager.DownloadSnapshot(snapshotid, revision)
manager.SnapshotManager.DownloadSnapshotContents(snap, nil, true)

At that point I cache the contents of the “snap” slice for later, then loop over the “snap” slice to find the entries for the current path.

This seems incredibly inefficient as I’ve now got info for every file in the snapshot in memory for potentially listing the contents of an empty directory.

Then for a “Getattr” operation, which is to retrieve the file info for a single file/dir it’s a case of potentially traversing the entire “snap” slice to return some basic stat info like file size, modification time, mode etc…
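To illustrate the lookup cost described above, here is a minimal sketch of indexing the flat entry list once so that Readdir and Getattr become map lookups instead of full scans. The Entry and Index types are hypothetical stand-ins, not Duplicacy's own types, and the sketch assumes directory paths end in "/". It only addresses lookup time; the whole listing is still held in memory.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Entry is a hypothetical stand-in for one file or directory record in a
// snapshot. It is not Duplicacy's own type.
type Entry struct {
	Path string // relative to the snapshot root; directories end in "/" (assumed)
	Size int64
}

// Index maps a path to its entry and a directory to its direct children.
type Index struct {
	byPath   map[string]*Entry
	children map[string][]*Entry
}

// BuildIndex walks the flat entry list exactly once.
func BuildIndex(entries []Entry) *Index {
	idx := &Index{byPath: map[string]*Entry{}, children: map[string][]*Entry{}}
	for i := range entries {
		e := &entries[i]
		clean := strings.TrimSuffix(e.Path, "/")
		idx.byPath[clean] = e
		parent := path.Dir(clean)
		if parent == "." {
			parent = "" // the snapshot root is stored under the empty key
		}
		idx.children[parent] = append(idx.children[parent], e)
	}
	return idx
}

// Getattr returns the entry for a single path, if it exists.
func (idx *Index) Getattr(p string) (*Entry, bool) {
	e, ok := idx.byPath[strings.Trim(p, "/")]
	return e, ok
}

// Readdir returns the direct children of a directory.
func (idx *Index) Readdir(dir string) []*Entry {
	return idx.children[strings.Trim(dir, "/")]
}

func main() {
	idx := BuildIndex([]Entry{
		{Path: "docs/"},
		{Path: "docs/a.txt", Size: 12},
		{Path: "empty/"},
	})
	for _, e := range idx.Readdir("/docs") {
		fmt.Println(e.Path, e.Size)
	}
}
```

Listing an empty directory then costs one map lookup rather than a pass over every entry in the revision.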

I’ve got a background goroutine that deletes cached data to keep memory use down, but it’s not unusual for my “duplicacy-fuse” process to eat up almost a GB of RAM just by browsing two or three revisions from one snapshot ID on the storage.

Is there a better way to handle this?

Some way to only retrieve info for a particular path in a backup?

Currently the repo I’m working on is set to private as to be honest I’m not sure I want the world to see what is pretty terrible code…but if it helps I can share the URL.

4 Likes

First of all, I’m really excited someone is taking a shot at this, since this is a feature that would satisfy some use cases I don’t currently have good solutions for. For example, using some of the files I back up to cloud storage on other machines without having to upload a second copy.

Unless I’m mistaken, Duplicacy maintains an on-disk cache of snapshot chunks (but not file chunks AFAIK). Is your implementation also making use of an on-disk chunk cache or is it only caching them in memory?

Is there a reason you pass in nil instead of a pattern to DownloadSnapshotContents? I haven’t tried digging through the code yet (and I’m also not a Golang developer), but I’d expect filtering on the relevant path or directory for the given operation to be more memory efficient than not filtering at all.

1 Like

Right, all metadata chunks are stored in the local cache directory. Technically you can parse these chunks to retrieve the directory listing and file attributes on the fly, as together they are just one large JSON file split into chunks. However, the parser is not easy to implement, because splitting happens at arbitrary places, so an individual chunk is not a valid JSON object.
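One way to picture the on-the-fly approach is to concatenate the chunks into a single stream and let a streaming decoder deal with the boundaries. The sketch below uses only the Go standard library (io.MultiReader and json.Decoder). It assumes the reassembled metadata is a JSON array of entry objects and that the chunks are already decrypted and decompressed; the Entry type and its field names are hypothetical, not Duplicacy's real format.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Entry is a hypothetical record. The field names are illustrative only.
type Entry struct {
	Path string `json:"path"`
	Size int64  `json:"size"`
}

// FindEntry streams a JSON array that was split into chunks at arbitrary
// byte offsets and returns the first entry whose path matches, or nil if
// there is none. Only one entry is decoded into memory at a time.
func FindEntry(chunks [][]byte, want string) (*Entry, error) {
	readers := make([]io.Reader, len(chunks))
	for i, c := range chunks {
		readers[i] = bytes.NewReader(c)
	}
	dec := json.NewDecoder(io.MultiReader(readers...))

	// Consume the opening '[' of the array.
	if _, err := dec.Token(); err != nil {
		return nil, err
	}
	for dec.More() {
		var e Entry
		if err := dec.Decode(&e); err != nil {
			return nil, err
		}
		if e.Path == want {
			return &e, nil
		}
	}
	return nil, nil
}

func main() {
	// The second object is deliberately cut in the middle of a key.
	chunks := [][]byte{
		[]byte(`[{"path":"docs/","size":0},{"pa`),
		[]byte(`th":"docs/a.txt","size":12}]`),
	}
	e, err := FindEntry(chunks, "docs/a.txt")
	fmt.Println(e, err)
}
```

This keeps memory use low but still reads every chunk up to the match, so a single Getattr can be slow on a large revision.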

I think a simpler method is to use an on-disk database (or key-value store) to hold the directory listings and file attributes for all revisions. For each revision you’ll need to load the full metadata into the database once. Directory listings then become database queries, so they should be fairly fast. The downside is that you basically store two copies of the metadata locally, one in raw chunks and the other in the database.
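As a rough sketch of how the keys in such a store might be laid out, the example below builds each key from snapshot ID, revision, parent directory and name, so that a directory listing is a prefix scan and a Getattr is a single key lookup. The key layout and every name here are assumptions made for illustration. A sorted in-memory slice stands in for the ordered on-disk store so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Store is an in-memory stand-in for an ordered on-disk key-value store.
type Store struct {
	keys []string         // always kept sorted
	vals map[string]int64 // key -> file size
}

func NewStore() *Store { return &Store{vals: map[string]int64{}} }

// Key joins the parts with NUL bytes, which cannot occur in file names, so
// the children of "docs" never share a prefix with those of "docs2".
func Key(snapshot string, revision int, dir, name string) string {
	return fmt.Sprintf("%s\x00%08d\x00%s\x00%s", snapshot, revision, dir, name)
}

// Put inserts or updates a key while keeping the key slice sorted.
func (s *Store) Put(key string, size int64) {
	if _, ok := s.vals[key]; !ok {
		i := sort.SearchStrings(s.keys, key)
		s.keys = append(s.keys, "")
		copy(s.keys[i+1:], s.keys[i:])
		s.keys[i] = key
	}
	s.vals[key] = size
}

// Getattr is one exact key lookup.
func (s *Store) Getattr(snapshot string, revision int, dir, name string) (int64, bool) {
	size, ok := s.vals[Key(snapshot, revision, dir, name)]
	return size, ok
}

// Readdir is a prefix scan over the sorted keys.
func (s *Store) Readdir(snapshot string, revision int, dir string) []string {
	prefix := Key(snapshot, revision, dir, "")
	var names []string
	for i := sort.SearchStrings(s.keys, prefix); i < len(s.keys) && strings.HasPrefix(s.keys[i], prefix); i++ {
		names = append(names, strings.TrimPrefix(s.keys[i], prefix))
	}
	return names
}

func main() {
	s := NewStore()
	s.Put(Key("laptop", 3, "", "docs"), 0)
	s.Put(Key("laptop", 3, "docs", "b.txt"), 7)
	s.Put(Key("laptop", 3, "docs", "a.txt"), 12)
	fmt.Println(s.Readdir("laptop", 3, "docs"))
}
```

The same layout should carry over to any embedded store that iterates keys in sorted order; only Put, Getattr and the scan loop would change.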

Implementing a small db or similar is a great idea to reduce memory usage, thanks for the idea.

I have basic functionality working so I thought I should share the code as is:

https://gitlab.com/andrewheberle/duplicacy-fuse

What works

  • Mounting - possible to specify which storage, snapshot ID and revision to mount
  • Listing snapshot IDs and revisions on the mounted storage
  • Listing files and directories within a revision with sizes and modification times
  • Opening files from a revision

What Doesn’t Work

  • Mounting SFTP storage that uses key-based auth
  • Possibly everything else that doesn’t match my config at home (Windows 10 Professional, Go 1.13, Backblaze storage)

What Will Never Work

  • Writes or changes to the mounted filesystem

I’d be happy for any improvements, or if any of this code can be used natively in Duplicacy at some time in the future once this improves quite a bit.

5 Likes