About Duplicacy nomenclature

This is an example of the nomenclature used in Duplicacy. It considers two computers (“John” and “Mary”) backing up to the same storage (a single bucket in B2). Since both computers have “documents” folders, deduplication can save storage space and speed up backups if the documents used by John and Mary are similar.

This example covers only one configuration use case. Separate storages and different configurations may be used. The purpose is just to clarify how the following terms are used in Duplicacy:

  • repository is the location/folder of the source/original files on your own computer;
  • snapshot-id is an ID used to distinguish different repositories connected to the same storage. Each repository must have a unique snapshot ID, and snapshot IDs must be unique across machines. See the init command description for details.
  • storage is where backups will be saved, i.e., the folder or bucket on the remote cloud provider;
  • each time a backup is performed, a revision is generated;






John’s computer preferences file (main lines):

"name": "john-docs--b2",              -->  <storage name>
"id": "john-docs",                    -->  <snapshot id>
"repository": "D:/documents",
"storage": "b2://all-files-bucket",

Mary’s computer preferences file (main lines):

"name": "mary-docs--b2",              -->  <storage name>
"id": "mary-docs",                    -->  <snapshot id>
"repository": "D:/documents",
"storage": "b2://all-files-bucket",
...
"name": "all-dbs--b2",                -->  <storage name>
"id": "all-dbs",                      -->  <snapshot id>
"repository": "D:/databases",
"storage": "b2://all-files-bucket",

Note that "name" (storage name) is not the name of the storage itself, but a reference to be used to execute commands (backup, check, etc.).

And "storage" (storage path) represents the real path where Duplicacy stores all its data.
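To make the difference between "name", "id" and "storage" concrete, here is a minimal Python sketch. It assumes the preferences file is a JSON array of entries carrying the four keys shown above; the `PREFERENCES` text and the `summarize` helper are made up for illustration and are not part of Duplicacy.

```python
import json

# Hypothetical, trimmed-down preferences content for Mary's computer,
# keeping only the four keys shown above (assumed to form a JSON array).
PREFERENCES = """
[
  {"name": "mary-docs--b2", "id": "mary-docs",
   "repository": "D:/documents", "storage": "b2://all-files-bucket"},
  {"name": "all-dbs--b2", "id": "all-dbs",
   "repository": "D:/databases", "storage": "b2://all-files-bucket"}
]
"""

def summarize(preferences_text):
    """Return one (storage name, snapshot id, storage path) tuple per entry."""
    entries = json.loads(preferences_text)
    return [(e["name"], e["id"], e["storage"]) for e in entries]

for name, snapshot_id, storage in summarize(PREFERENCES):
    print(f"{name}: snapshot id {snapshot_id!r} -> {storage}")
```

Both entries have different storage names and snapshot IDs but resolve to the same storage path, which is the situation described in the note above.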


Thanks a lot for writing this little nomenclature guide, it’s incredibly clear and useful!

I do have a specific question regarding snapshot IDs, though!

According to your nomenclature and to the init command guide:

But in practice, it is possible for a single repository to possess multiple snapshot IDs, right?

While the init command asks for an initial snapshot ID to initialise the very first storage, the add command allows any further storage to be added with a completely different snapshot ID.

One concrete case where this may prove useful is explained in the guide to back up to multiple storages, more specifically when backing up to “Multiple cloud storage without a local storage”:


Overall, my understanding is that:

  1. This rule of uniqueness (1 unique repository possesses 1 unique snapshot ID) is actually a slight simplification of reality for clarity’s sake?

  2. Repositories are not actually identified by a snapshot ID. Nor are they identified by any ID at all, as a matter of fact.
    A repository is just a specific directory on a given machine. A repository is always connected to one or several storages, which allows data transfer to and from these storages, but that’s it.
    Storages are actually fully unaware of what repositories are, of where they are located. A repository is really just a physical “place” connected to one or more storages.

  3. While repositories are physical, snapshot IDs are more abstract, and linked to how we organise our backups across all our machines. What a snapshot ID uniquely identifies is a “backup chain”.
    Put differently, a snapshot ID identifies a succession of incremental revisions for specific data files that we want to preserve. Taken together, these successive revisions give us an ever-evolving “snapshot” of these same data files, but at different points in time. A snapshot ID is the unique name we give to each chain of revisions to distinguish it from other revision chains.
    For instance, to back up our music, pictures and work documents separately, we would need to use 3 distinct snapshot IDs.

  4. Distinct repositories can upload and download revisions from the same storage under the same snapshot ID. In other words, multiple repositories on different locations can share and contribute to the same chain of backups.
    This can typically be used in multi-machine setups to have several machines synchronise and back up the same data, for instance to have work projects be constantly synchronised between a laptop and a home computer: both of the 2 repositories just need to be connected to the same cloud storage under the same snapshot ID. This basically achieves the same effect as simpler “file synchronisation” programs like rsync, Dropbox, Google Drive and OneDrive (but with the added efficiency of Duplicacy): multiple locations sharing the same incremental backups/revisions.

  5. Conversely, a single repository can totally be connected to multiple storages under multiple snapshot IDs, thereby receiving backups & uploading revisions from completely different backup chains.
    I could have a music repository which backs up the entirety of my music to Google Drive under the snapshot ID all_my_music… And at the same time, this same repository could back up specific files of royalty-free music I collect to an SFTP storage under the snapshot ID royalty_free_music. This SFTP server could be accessible read-only to everyone on the Internet, which would make it a public archive of royalty-free music accessible to everyone through Duplicacy.
    Another concrete case of single repository with multiple snapshot IDs would be the one from the guide I quote above, to back up to multiple cloud storages without a local storage efficiently.

  6. Overall, points 4) and 5) make it clear that snapshot IDs are wholly distinct from repositories. A snapshot ID abstractly represents a backup chain, but the backups could come from multiple machines, and a single directory could contribute to or restore from completely different backup chains.
    I assume that these subtle distinctions are the reason why “snapshot IDs” are named as such, and no longer “repository IDs”, as they used to be here and there in a few older forum posts.


Would that be a correct understanding of snapshot IDs as they exist in Duplicacy at the moment (e.g in version 2.2.3, for future readers’ sake)?

Sorry for the long and redundant explanation, at any rate! I’m pretty sure these distinctions are clear to a lot of people here, I was mostly trying to spell them out in writing just to make sure I’m understanding everything myself, ahaha.

This isn’t a suggestion to amend any of the guides either, I can definitely see how they benefit from going straight to the point while leaving these explanations more implicit.
But I admit I was always a little bit puzzled over what snapshot IDs concretely represented before delving into them, so if this summary is correct, I hope it can help anyone else.


Just remembering that the above text

Having said this,

Not only possible, but very useful. A real example I use: I have a folder (repository) with this structure:

folder/
├── folder1/
├── folder2/
│   ├── file-a.txt
│   ├── file-b.txt
│   └── file-c.txt
├── folder3/

I need to back up folder2 much more often than the entire folder, so I have two snapshot IDs in the same preferences file:

"name": "all-files--b2",
"id": "all-files",
"repository": "D:/folder",
"storage": "b2://bucket",
...
"name": "folder2--b2",
"id": "folder2",
"repository": "D:/folder/folder2",
"storage": "b2://bucket",

So I can run a script using folder2--b2 much more often than the script that uses all-files--b2.

Note that I could do this with filters too, but the filter file is unique per repository. If I could use specific filters for each type of script execution, the above arrangement would not be necessary.
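As a rough sketch of what such scripts could look like, the Python below builds a `duplicacy backup -storage <storage name>` command line for each of the two entries and picks which ones are due. The hourly/daily intervals and the `SCHEDULE` and `due_commands` names are assumptions made up for illustration; the commands are only built and printed, not executed.

```python
def backup_command(storage_name):
    """Build the argument list for backing up via one named storage entry."""
    return ["duplicacy", "backup", "-storage", storage_name]

# Hypothetical schedule: how often (in hours) each entry is backed up.
SCHEDULE = {"folder2--b2": 1, "all-files--b2": 24}

def due_commands(hours_elapsed):
    """Return the backup commands that are due after the given number of hours."""
    return [backup_command(name)
            for name, every in SCHEDULE.items()
            if hours_elapsed % every == 0]

print(due_commands(1))   # only the folder2--b2 backup
print(due_commands(24))  # both backups
```

In a real setup each list could be handed to something like `subprocess.run`, started from the repository's directory.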

Yes. Let’s call this a “basic setup”

Yes!

Yes. Putting it more simply: we could say that snapshot-id will be used to identify backups (revisions) of a particular set of files (repository) in a storage.

If what you call a “chain” is the snapshot-id, no, it is not recommended. Even if machines have similar repositories (such as “documents” in the original post), if backups run concurrently you will have problems with revision numbering and chunking. It is recommended that you have a “characterizer” of which machine (or user) that repository refers to (such as john-docs and mary-docs in the example above).
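One way to catch this kind of clash early is to compare the snapshot IDs configured on each machine. This is only a sketch: the `MACHINES` mapping and the `shared_snapshot_ids` helper are hypothetical, and in practice the IDs would have to be collected from each machine's preferences file.

```python
from collections import defaultdict

# Snapshot IDs per machine, following the John/Mary example above.
MACHINES = {
    "john": ["john-docs"],
    "mary": ["mary-docs", "all-dbs"],
}

def shared_snapshot_ids(machines):
    """Return {snapshot id: sorted machine names} for IDs used on more than one machine."""
    owners = defaultdict(set)
    for machine, snapshot_ids in machines.items():
        for snapshot_id in snapshot_ids:
            owners[snapshot_id].add(machine)
    return {sid: sorted(ms) for sid, ms in owners.items() if len(ms) > 1}

print(shared_snapshot_ids(MACHINES))  # empty: every ID belongs to one machine
print(shared_snapshot_ids({"laptop": ["docs"], "desktop": ["docs"]}))
```

The second call reports "docs" as shared by both machines, which is the setup advised against here.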

Nope, completely different. A sync would keep all locations with the same set and state of files. A backup stores various states (relative to different times) of the file set.

Yes!


Thanks for your detailed post with these considerations, it is always helpful to enrich the knowledge base.


Thanks a lot for taking the time to read and reply, this clears up everything for me!


Wow, I hadn’t actually realised that this was the true purpose of the -repository option for the init and add commands.

I thought it could be used to initialise elsewhere the repository’s “root” (the directory that will contain the .duplicacy subfolder, and therefore, the one from which Duplicacy commands must be run).
In other words, I thought this option could be used to make Duplicacy create a repository elsewhere, while completely ignoring the current working directory.

I now realise that -pref-dir is the option to use to set a different location for the preferences directory… and that even -pref-dir puts a .duplicacy file in the current working directory, which means the directory from which we must run Duplicacy commands is always the current working directory anyway.

This makes much more sense, thanks for clearing that up! I was failing to see how both options differed.
And yeah, I can see how useful multiple snapshot IDs can be to manage specific subfolders with more granularity while backing up to the same storage.


This clears up everything, thank you!
So after all, distinct repositories do need to have distinct snapshot IDs to prevent trouble down the line.

It doesn’t prevent a single repository from using multiple snapshot IDs, but on the other hand, a given snapshot ID must always identify revisions from one specific repository, not from several.

Thanks again so much for taking the time to clarify everything!


I would say this is a bonus :sunglasses:

and this is the main function :point_up_2:


It seems to me that what the CLI refers to as a Snapshot is referred to on the Duplicacy Web Edition pages as a Backup.

Is that the case or have I misunderstood?

I note the storage names for the two computers are:
“john-docs--b2”
“mary-docs--b2”
“all-dbs--b2”

All these “storage names” point to the same bucket: “b2://all-files-bucket”
I assume they are deduplicated.
So… I assume this would work the same if they all used the same storage name. Wouldn’t that make more sense? Why name each “storage name” differently if they use the same bucket?

The nomenclature leads to this kind of confusion, it’s not the storage name:

Ahh… I’m still not sure that answered my questions… I saw what you said earlier… I think it implies some answer… but I was hoping to get some explicit replies? Please?
Would this work the same if they all used the same storage name? (implied “yes”… but help me be sure?)
Why name each “storage name” differently if they use the same bucket? (Curiosity… is there a use for this?)

You can use a single snapshot name for everything in a repository, but then all operations will have to be on everything. Splitting a repository into multiple snapshots (which could be overlapping) lets you run different schedules, different pruning policies, etc. on subsections of your data.

Ah, now that makes some sense to me! Thank you.

Does deduplication work across multiple repositories that have the same storage bucket?

In the terminology of the web edition, it would be multiple backups in the same storage.

The diagram above seems to indicate this, but I couldn’t find a statement in the search that specifically states this.

The same target, yes.
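For anyone wondering why sharing one storage helps, here is a toy Python model of content-addressed deduplication. It is a deliberate simplification and not Duplicacy's actual chunking algorithm: chunks are tiny and fixed-size, the "bucket" is a plain dict, and `chunks` and `backup` are made-up helper names.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only

def chunks(data):
    """Split data into consecutive CHUNK_SIZE-byte pieces."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def backup(storage, data):
    """Upload only chunks whose hash is not in storage yet; return the count uploaded."""
    uploaded = 0
    for chunk in chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in storage:
            storage[key] = chunk
            uploaded += 1
    return uploaded

storage = {}            # stands in for the shared bucket
john = b"AAAABBBBCCCC"  # John's documents
mary = b"AAAABBBBDDDD"  # Mary's documents, mostly the same content

print(backup(storage, john))  # 3: every chunk is new
print(backup(storage, mary))  # 1: only the last chunk differs
```

Because chunks are keyed by their content rather than by the repository or snapshot ID they came from, the second backup reuses whatever the first one already uploaded.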