Snapshot format documentation no longer correct?

dpccy · 6 June 2024 16:38

The snapshot format, as documented in the GitHub wiki, doesn’t seem to match the current behavior. For example, here’s a simple backup I made of a folder containing a single file called “test”, with contents “asdf”.

Is there more up-to-date documentation somewhere? Or can someone at least tell me where to start looking in the source code?

My efforts to reverse-engineer it are below:

Firstly, there’s no mention of compression, but I figured out the LZ4 prefix pretty easily. After decompression, the JSON is:

{
  "chunks": [
    "5fc6d07e4150163ce5085a61560807f8dbbd3a352323dfcad1756ecf43a9f275"
  ],
  "end_time": 1717355775,
  "file_size": 4,
  "files": [
    "0d55bb119c0420a3382697baf86169cb5aface7e5e8e9d9a296b8f6ab6584929"
  ],
  "id": "test-src",
  "lengths": [
    "bdec3d2844cf0e87ac57b681be6d96e016691b93962fae6ebfbd14ae0a7a92e7"
  ],
  "number_of_files": 1,
  "options": "-hash",
  "revision": 1,
  "start_time": 1717355775,
  "tag": "",
  "version": 1
}

Secondly, a few of the fields aren’t mentioned in the wiki.

The chunk file appears to be a list of numbers; in this case it’s [4]. That matches the size of the file in bytes, but maybe that’s a coincidence.

The files 0d55... and bdec... don’t exist in the repo.

There are two other files: 877f649..., which appears to be the file contents, and 47428..., which doesn’t appear to be JSON, though I can see the name of the file in it, along with a reference to b913..., which I don’t see anywhere either.

My .duplicacy/preferences file is:

[
    {
        "name": "default",
        "id": "test-src",
        "repository": "",
        "storage": "REDACTED\\test-src\\..\\dest\\",
        "encrypted": false,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "nobackup_file": "",
        "keys": null,
        "filters": "",
        "exclude_by_attribute": false
    }
]

And the repo config is:

{
    "compression-level": 100,
    "average-chunk-size": 4194304,
    "max-chunk-size": 16777216,
    "min-chunk-size": 1048576,
    "fixed-nesting": true,
    "DataShards": 0,
    "ParityShards": 0,
    "chunk-seed": "6475706c6963616379",
    "hash-key": "6475706c6963616379",
    "id-key": "6475706c6963616379",
    "chunk-key": "",
    "file-key": "",
    "rsa-public-key": ""
}

gchen · 11 June 2024 21:14

You can run duplicacy cat -r <revision> to see the actual content of a snapshot. When you run this command, a new field named chunk_sequence is created by reading the contents of the chunks listed in the chunks field. The chunks in chunk_sequence store the file content.

Similarly, the length_sequence field will be created from chunks listed in lengths. It is used to track the length of each chunk in chunk_sequence.
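As a rough sketch of the two steps above (not Duplicacy's actual code), the expansion could look like the following. It assumes that the concatenated contents of the listed chunks form a JSON array, which is consistent with the [4] seen earlier in the thread but is not confirmed here. The fetch_chunk callback is hypothetical: it stands in for whatever is needed to turn a hash from the snapshot into decoded chunk bytes (locating the chunk in storage, decompressing, decrypting).

```python
import json


def load_sequence(chunk_hashes, fetch_chunk):
    """Concatenate the decoded chunks and parse the result as JSON.

    Assumption: the concatenated bytes are a JSON array.
    fetch_chunk is a hypothetical callback: hash -> decoded bytes.
    """
    data = b"".join(fetch_chunk(h) for h in chunk_hashes)
    return json.loads(data)


def expand_sequences(snapshot, fetch_chunk):
    """Return a copy of the snapshot with the two derived fields added."""
    expanded = dict(snapshot)
    # "chunks" -> chunk_sequence: the chunks that store the file content
    expanded["chunk_sequence"] = load_sequence(snapshot["chunks"], fetch_chunk)
    # "lengths" -> length_sequence: the length of each chunk in chunk_sequence
    expanded["length_sequence"] = load_sequence(snapshot["lengths"], fetch_chunk)
    return expanded
```

Keeping the storage lookup behind a callback means the sketch makes no claim about how a hash in the snapshot maps to a file name in the storage.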

The files field works slightly differently but still follows the same principle. First, a file_sequence field is created which contains the hashes of all metadata chunks. These chunks are then downloaded and concatenated together, and decoded into a list of Entries, each of which represents the metadata for a file.
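The two-stage handling of the files field could be sketched like this, again only as an illustration. It assumes file_sequence is stored as a JSON array of hashes, like the other sequences. Both callbacks are hypothetical: fetch_chunk maps a hash to decoded chunk bytes, and decode_entries stands in for the entry decoder, whose format is not described in this thread.

```python
import json


def load_file_entries(snapshot, fetch_chunk, decode_entries):
    """Resolve the "files" field into a list of file metadata entries.

    fetch_chunk and decode_entries are hypothetical callbacks:
      fetch_chunk(hash) -> decoded chunk bytes
      decode_entries(bytes) -> list of entries (format not covered here)
    """
    # Stage 1: the chunks listed in "files" yield file_sequence,
    # the hashes of all metadata chunks (assumed to be a JSON array).
    raw = b"".join(fetch_chunk(h) for h in snapshot["files"])
    file_sequence = json.loads(raw)

    # Stage 2: download the metadata chunks, concatenate them,
    # and decode the combined bytes into entries.
    metadata = b"".join(fetch_chunk(h) for h in file_sequence)
    return file_sequence, decode_entries(metadata)
```

The concatenation happens before decoding because, as described above, the metadata chunks are joined together first and only then decoded into entries.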

dpccy · 18 June 2024 17:34

Thank you for your response!

For context, I’m building a file sync tool. I’d love it if it could integrate with a Duplicacy repo.

How does duplicacy derive the file, chunk, and length sequence fields?

For example, you said "a new field named chunk_sequence will be created, by reading the contents of chunks listed in chunks". In my example, the 5f file just contained [4]. How does it turn that into ["e9..."]?