Snapshot file format

A snapshot file is a file that the backup procedure uploads to the file storage after it finishes splitting files into
chunks and uploading all new chunks. It mainly contains metadata for the backup overall, metadata for all the files,
and chunk references for each file. Here is an example snapshot file for a repository containing 3 files (file1, file2,
and dir1/file3):

{
  "id": "host1",
  "revision": 1,
  "tag": "first",
  "start_time": 1455590487,
  "end_time": 1455590487,
  "files": [
    {
      "path": "file1",
      "content": "0:0:2:6108",
      "hash": "a533c0398194f93b90bd945381ea4f2adb0ad50bd99fd3585b9ec809da395b51",
      "size": 151901,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "file2",
      "content": "2:6108:3:7586",
      "hash": "f6111c1562fde4df9c0bafe2cf665778c6e25b49bcab5fec63675571293ed644",
      "size": 172071,
      "time": 1455590487,
      "mode": 420
    },
    {
      "path": "dir1/",
      "size": 102,
      "time": 1455590487,
      "mode": 2147484096
    },
    {
      "path": "dir1/file3",
      "content": "3:7586:4:1734",
      "hash": "6bf9150424169006388146908d83d07de413de05d1809884c38011b2a74d9d3f",
      "size": 118457,
      "time": 1455590487,
      "mode": 420
    }
  ],
  "chunks": [
    "9f25db00881a10a8e7bcaa5a12b2659c2358a579118ea45a73c2582681f12919",
    "6e903aace6cd05e26212fcec1939bb951611c4179c926351f3b20365ef2c212f",
    "4b0d017bce5491dbb0558c518734429ec19b8a0d7c616f68ddf1b477916621f7",
    "41841c98800d3b9faa01b1007d1afaf702000da182df89793c327f88a9aba698",
    "7c11ee13ea32e9bb21a694c5418658b39e8894bbfecd9344927020a9e3129718"
  ],
  "lengths": [
    64638,
    81155,
    170593,
    124309,
    1734
  ]
}
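The mode values in the example are easier to read once decoded. The sketch below assumes that "mode" holds a Go os.FileMode value, where bit 31 marks a directory and the low 9 bits are the Unix permission bits. The post does not state this, so treat it as an assumption; describe_mode is a hypothetical helper name.

```python
# Assumption: "mode" is a Go os.FileMode value (bit 31 = directory,
# low 9 bits = Unix permissions). This is not confirmed by the post.
GO_MODE_DIR = 1 << 31


def describe_mode(mode: int) -> str:
    """Return a short 'kind permissions' string for a mode value."""
    kind = "dir" if mode & GO_MODE_DIR else "file"
    permissions = mode & 0o777
    return f"{kind} {permissions:04o}"


print(describe_mode(420))         # file 0644
print(describe_mode(2147484096))  # dir 0700
```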

When Duplicacy splits a file into chunks using the variable-size chunking algorithm, if the end of the file is reached before the boundary marker for terminating a chunk
has been found, the next file, if there is one, is read in and the chunking algorithm continues. It is as if all
files were packed into a big tar file which is then split into chunks.
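The "one big stream" view can be checked against the example snapshot. The following is a minimal sketch, not Duplicacy's code: it takes the file sizes and chunk lengths from the example, lays the files end to end, and derives a start and end position for each file in the "chunk:offset:chunk:offset" form used by the content field. The function name content_refs is hypothetical, and the handling of a file that ends exactly on a chunk boundary is inferred only from the last file in the example.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def content_refs(file_sizes, chunk_lengths):
    """Locate each file inside the concatenated stream of chunks."""
    # Exclusive end position of every chunk within the stream.
    chunk_ends = list(accumulate(chunk_lengths))
    refs = []
    position = 0
    for size in file_sizes:
        start, end = position, position + size
        # Start: first chunk whose end lies strictly after the start position.
        start_chunk = bisect_right(chunk_ends, start)
        # End: first chunk whose end is at or after the end position, so a
        # file ending on a boundary stays in the earlier chunk (assumption).
        end_chunk = bisect_left(chunk_ends, end)
        start_offset = start - (chunk_ends[start_chunk] - chunk_lengths[start_chunk])
        end_offset = end - (chunk_ends[end_chunk] - chunk_lengths[end_chunk])
        refs.append(f"{start_chunk}:{start_offset}:{end_chunk}:{end_offset}")
        position = end
    return refs


sizes = [151901, 172071, 118457]                  # file1, file2, dir1/file3
lengths = [64638, 81155, 170593, 124309, 1734]    # "lengths" from the example
print(content_refs(sizes, lengths))
# ['0:0:2:6108', '2:6108:3:7586', '3:7586:4:1734']
```

The directory entry dir1/ has no content field in the example, so it is left out of the stream here.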

The content field of a file gives the indexes of the starting and ending chunks and the corresponding offsets. For
instance, file1 starts at chunk 0, offset 0 and ends at chunk 2, offset 6108, where file2 immediately begins.
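Going the other way, a content value plus the lengths array is enough to recover a file's size. This is a small sketch under the assumption that the end offset is exclusive, which is what the numbers in the example suggest; parse_content and content_length are hypothetical helper names.

```python
def parse_content(content):
    """Split 'startChunk:startOffset:endChunk:endOffset' into four ints."""
    start_chunk, start_offset, end_chunk, end_offset = map(int, content.split(":"))
    return start_chunk, start_offset, end_chunk, end_offset


def content_length(content, lengths):
    """Number of bytes covered by a content reference."""
    start_chunk, start_offset, end_chunk, end_offset = parse_content(content)
    if start_chunk == end_chunk:
        return end_offset - start_offset
    # Tail of the first chunk + whole chunks in between + head of the last chunk.
    tail = lengths[start_chunk] - start_offset
    middle = sum(lengths[start_chunk + 1:end_chunk])
    return tail + middle + end_offset


lengths = [64638, 81155, 170593, 124309, 1734]
print(content_length("0:0:2:6108", lengths))     # 151901
print(content_length("2:6108:3:7586", lengths))  # 172071
print(content_length("3:7586:4:1734", lengths))  # 118457
```

The three results match the size fields of file1, file2, and dir1/file3 in the example snapshot.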

The backup procedure can run in one of two modes. In the default quick mode, only modified or new files are scanned. Chunks
referenced only by old versions of modified files are removed from the chunk sequence, and then chunks referenced by new
files are appended. Because removing chunks shifts the positions of the remaining ones, indices for unchanged files need to be updated too.
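One way to picture the quick-mode bookkeeping is shown below. This is only an interpretation of the paragraph above, not Duplicacy's implementation: quick_mode_update is a hypothetical function, files are reduced to a (start chunk, end chunk) pair, offsets are ignored because they do not change, and chunk hashes are replaced by short placeholder strings.

```python
def quick_mode_update(old_chunks, old_files, modified, new_chunks):
    """Drop chunks used only by modified files, remap indexes, append new chunks."""
    # Chunk indexes still referenced by at least one unchanged file.
    keep = set()
    for path, (start_chunk, end_chunk) in old_files.items():
        if path not in modified:
            keep.update(range(start_chunk, end_chunk + 1))

    # Build the new sequence and remember where each retained chunk moved to.
    remap, chunks = {}, []
    for index, chunk_hash in enumerate(old_chunks):
        if index in keep:
            remap[index] = len(chunks)
            chunks.append(chunk_hash)

    # Unchanged files keep their chunks but need their indexes updated.
    files = {
        path: (remap[start_chunk], remap[end_chunk])
        for path, (start_chunk, end_chunk) in old_files.items()
        if path not in modified
    }

    # Chunks produced for new or modified files go at the end.
    first_new_index = len(chunks)
    chunks.extend(new_chunks)
    return chunks, files, first_new_index


old_chunks = ["c0", "c1", "c2", "c3", "c4"]
old_files = {"file1": (0, 2), "file2": (2, 3), "dir1/file3": (3, 4)}
chunks, files, first_new = quick_mode_update(old_chunks, old_files, {"file1"}, ["n0", "n1"])
print(chunks)     # ['c2', 'c3', 'c4', 'n0', 'n1']
print(files)      # {'file2': (0, 1), 'dir1/file3': (1, 2)}
print(first_new)  # 3
```

Chunk c2 survives even though file1 used it, because file2 still references it.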

In the safe mode (enabled by the -hash option), all files are scanned and the chunk sequence is regenerated.

The length sequence stores the lengths of all chunks, which are needed when calculating statistics such as the total
length of chunks. For a repository containing a large number of files, the snapshot file can become very large.
To make matters worse, a large snapshot file would have to be uploaded in full on every backup, even if only a few files have changed since
the last backup. To save space, the variable-size chunking algorithm is also applied to the three dynamic fields of a snapshot
file: files, chunks, and lengths.

Chunks produced during this step are deduplicated and uploaded in the same way as regular file chunks. The final snapshot file
contains only the sequences of chunk hashes and the other fixed-size fields:

{
  "id": "host1",
  "revision": 1,
  "start_time": 1455590487,
  "tag": "first",
  "end_time": 1455590487,
  "file_sequence": [
    "21e4c69f3832e32349f653f31f13cefc7c52d52f5f3417ae21f2ef5a479c3437"
  ],
  "chunk_sequence": [
    "8a36ffb8f4959394fd39bba4f4a464545ff3dd6eed642ad4ccaa522253f2d5d6"
  ],
  "length_sequence": [
    "fc2758ae60a441c244dae05f035136e6dd33d3f3a0c5eb4b9025a9bed1d0c328"
  ]
}
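The effect of chunking the metadata can be illustrated with a toy model. The sketch below makes several simplifying assumptions that are not taken from Duplicacy: the field is serialized as JSON, it is cut into fixed-size pieces rather than variable-size chunks, SHA-256 stands in for the chunk hash, and a plain dict plays the role of the storage. store_sequence is a hypothetical name.

```python
import hashlib
import json


def store_sequence(value, storage, piece_size=64):
    """Split a serialized field into pieces, upload unseen ones, return their hashes."""
    data = json.dumps(value, sort_keys=True).encode("utf-8")
    hashes = []
    uploaded = 0
    for offset in range(0, len(data), piece_size):
        piece = data[offset:offset + piece_size]
        piece_hash = hashlib.sha256(piece).hexdigest()
        if piece_hash not in storage:
            # Only pieces the storage has never seen are uploaded.
            storage[piece_hash] = piece
            uploaded += 1
        hashes.append(piece_hash)
    return hashes, uploaded


storage = {}
files = [{"path": f"file{i}", "size": i} for i in range(50)]

first_hashes, first_uploaded = store_sequence(files, storage)
second_hashes, second_uploaded = store_sequence(files, storage)

print(first_uploaded > 0)             # True
print(second_uploaded)                # 0
print(first_hashes == second_hashes)  # True
```

With fixed-size pieces, an insertion near the start of the data would shift every later piece and defeat deduplication, which is presumably why a variable-size algorithm is used for the real metadata chunks.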

In the extreme case where the repository has not been modified since the last backup, a new backup will not create any new chunks,
as shown by the following output from a real use case:

$ duplicacy backup -stats
Storage set to sftp://gchen@192.168.1.100/Duplicacy
Last backup at revision 260 found
Backup for /Users/gchen/duplicacy at revision 261 completed
Files: 42367 total, 2,204M bytes; 0 new, 0 bytes
File chunks: 447 total, 2,238M bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 6 total, 11,753K bytes; 0 new, 0 bytes, 0 bytes uploaded
All chunks: 453 total, 2,249M bytes; 0 new, 0 bytes, 0 bytes uploaded
Total running time: 00:00:05

Thank you very much for this useful information. However, when I open any file in the snapshots folder of the storage corresponding to a given revision, I find that the file is in a binary format. How can I decode the binary into something human-readable? I am asking because I keep getting messages such as

Chunk 45037963ded2b8840b40aab521258b69b64f4c32dcd893e5bf7be43ba1f41373 can’t be found

when running the prune command with the option ‘exhaustive’. This chunk was in fact deleted earlier, and correctly so:

The chunk 45037963ded2b8840b40aab521258b69b64f4c32dcd893e5bf7be43ba1f41373 has been permanently removed

So I don’t understand why it now complains that the file is no longer there. I have already cleaned the cache folder, but it did not help.

Without the option exhaustive, the prune command returns no error, but also does nothing.

This is a design decision in prune (with which I personally disagree): Duplicacy first deletes the chunks and only then the snapshots. So, if it is interrupted for whatever reason, the snapshot file remains and points to missing chunks, and the next check will complain about these ghost snapshots.

Another solution that would neither leave trash in the datastore nor ghost snapshots was proposed here: Zero size chunks: how to solve it once and for all? - #2 by saspus


This seems to be no longer (fully) accurate. I documented some of the differences I saw in this other thread.