How to identify which chunks represent a particular file

To solve a specific problem, I’m looking for a way to identify which chunks represent a particular file.

I have looked at what is described in this topic about snapshot format:

But when I checked the snapshot files, they don’t contain per-file information, just a summary of the revision:

{"chunks":["7b8d4a30d31191b39c058a8f312422fe8af94b74a03192a1cadac6c4b107047b"],"end_time":1560434471,"file_size":23822761786,"files":["545ad57c1d3dae9df0d7f635b21a99f3023e6645a5e402803cb0c97ed8b9098d","fa65f94cfae38e96728b70ac24424e7994a6d33ae9e789f83c9f7e4825183bf5","a7540a7f52248925450f35cbc4edbfbbf8e921d519d7642b3495ab52f0d14891","99133bc7482f64d55d91e2cd53320618dad1b7a29e17c37db14119357497ffb1","6d52b527bcdeac26ff1946488758d64f7aaa0df1eee22974b0693301d11dfc06","b0d5e4b69e66e26d58fe4229f591f584b9af9207b5f6094ea3d8d23bc9cfd33d","46729f8b77d6c212b82efd7fa28010f25129b87191434ffc74f96a5cc2a2f926","4cd0aca9385ccbdfdc521878882e62f285215feddfd95ec982e439681e16b0c7","b849912c532bde7aa0def5e06f01d6f51aee74b813f2736a4b0ee7f06afc79f5"],"id":"[redacted]","lengths":["ac206108d0118b88877d37fd417246e421a9aa1621ee82fddbfc84058f33a31f"],"number_of_files":28772,"options":"","revision":103,"start_time":1560432276,"tag":""}

Am I looking at the correct file?

Yep, but there’s a bit more to it. Further down that page it explains that:

So from what I understand, the files entry in your example is a list of metadata chunks which, once decoded, expands into the more detailed file list shown further up the how-to page. They’re probably just raw text in JSON format, but chunked, compressed, and encrypted like all other chunks.

The same goes for the chunks entry, and for lengths.

Once decoded, each file should have a content field which gives its starting and ending chunk, plus the offsets within those chunks where the file’s data starts and ends.
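To illustrate the idea, here is a minimal sketch of that decoding step in Python. fetch_decoded_chunk is a hypothetical stand-in for the download/decrypt/decompress work the client does internally:

    import json

    def fetch_decoded_chunk(storage, chunk_hash):
        # Hypothetical helper: download the chunk file for this hash,
        # then decrypt and decompress it, returning the raw bytes.
        raise NotImplementedError

    def load_file_list(storage, snapshot):
        # Concatenate the decoded metadata chunks listed under "files";
        # together they presumably form one JSON document: the detailed
        # per-file list shown further up the how-to page.
        data = b"".join(fetch_decoded_chunk(storage, h) for h in snapshot["files"])
        return json.loads(data)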

I agree, but it also has:

:interrobang:

The snapshot file I showed above has 9 hashes in the “files” section:

"files":
["545ad57c1d3dae9df0d7f635b21a99f3023e6645a5e402803cb0c97ed8b9098d",
"fa65f94cfae38e96728b70ac24424e7994a6d33ae9e789f83c9f7e4825183bf5",
"a7540a7f52248925450f35cbc4edbfbbf8e921d519d7642b3495ab52f0d14891",
"99133bc7482f64d55d91e2cd53320618dad1b7a29e17c37db14119357497ffb1",
"6d52b527bcdeac26ff1946488758d64f7aaa0df1eee22974b0693301d11dfc06",
"b0d5e4b69e66e26d58fe4229f591f584b9af9207b5f6094ea3d8d23bc9cfd33d",
"46729f8b77d6c212b82efd7fa28010f25129b87191434ffc74f96a5cc2a2f926",
"4cd0aca9385ccbdfdc521878882e62f285215feddfd95ec982e439681e16b0c7",
"b849912c532bde7aa0def5e06f01d6f51aee74b813f2736a4b0ee7f06afc79f5"]

But I did not find (for example) the corresponding chunk file:

chunks/54/5ad57c1d3dae9df0d7f635b21a99f3023e6645a5e402803cb0c97ed8b9098d

The backup log shows that I have 8 metadata chunks:

File chunks: 18450 total, 20,750M bytes; 1582 new, 1,783M bytes, 1,783M bytes uploaded
Metadata chunks: 8 total, 8,043K bytes; 8 new, 8,043K bytes, 3,114K bytes uploaded
All chunks: 18458 total, 20,758M bytes; 1590 new, 1,791M bytes, 1,786M bytes uploaded

What would these 8 chunks be?

Probably, but the problem is: how do I do this?

I just remembered this:

The output of the cat command for the above storage shows:

{
  "chunk_sequence": [
    "025325de81ee0e7cf0e6280daaf1671eabddc78e4bc065419f71493762485d2f"
  ],
  "chunks": [
    "3d2b6cda2c2ae3e8378674cb6e39ff5876e992d64e7c57375ff0ec138c74a779",
    "8ba431ffa12cf15854107aafd2d37a034b0660165f9cce69411cdc0d9e1ea1af",
	
	... [thousands of lines]
	
	"4998219e1b99f28fb0ede308d34a841a66ba46a5b273e27d3c3d59c75b37d1b1",
    "d3ae6601e69593cbe04b18625da27f99dc9cac45daa6c93d305e8fa6c07185bd"
  ],
  "end_time": 1560970712,
  "file_sequence": [
    "2e3bdc4f4a8a28353f2ca7f0f84c683e5bb6a388eeb78009f3b5da5715ae5307",
    "000331566b4b191b16bc3e6b605905d209f5ebc11747a5b56bb15f0febb80a5c",
    "dc9661db52560a92aafe55b26c80359b427254e2463053bc1b343fc9287524c4",
    "dd85e32c64eedb8fb14d2b3b4319ebedfa184669660c502de1f6de1485b989a9",
    "417c041234584e3c1259bc2b24fd768886e0f2367e06abc5dc2702d11b5caabc",
    "aca8ab0c0a96300e40304259ebfbef601a737bf2e77156bfa8d3b4441427d154"
  ],
  "files": [
    {
      "content": "6884:0:6884:188",
      "hash": "d7206f987d976f0fba2ef658c3805ff8ab358dd5b8cbdb511c25797019bca626",
      "mode": 438,
      "path": "[file 1]",
      "size": 188,
      "time": 1540320503
    },
    {
      "content": "6884:188:6884:19750",
      "hash": "7b8b9ef315866fa4d6a7cab137c2f61c1713fbea4a3350de272a28a4751a12b8",
      "mode": 438,
      "path": "[file 2]",
      "size": 19562,
      "time": 1540320498
    },
    {
      ...

Okay, that sheds a bit more light on the subject.

Now, considering this:

Which chunks are associated with file 2?

      "content": "6884:188:6884:19750",
      "path": "[file 2]",

That means file 2 starts at offset 188 and ends at offset 19750 within chunk 6884, where 6884 is the index of the chunk in the chunks array, not a chunk hash. Note that 19750 - 188 = 19562, which matches the file’s size field.
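As a quick sketch (in Python, using the field names from the cat output above), resolving which chunk hashes a file spans then looks like this:

    def chunks_for_file(entry, chunks):
        # "content" has the form start_chunk:start_offset:end_chunk:end_offset,
        # where start_chunk and end_chunk are indexes into the "chunks" array.
        start_chunk, start_offset, end_chunk, end_offset = (
            int(part) for part in entry["content"].split(":")
        )
        return chunks[start_chunk:end_chunk + 1]

    # File 2 has "content": "6884:188:6884:19750", so it lies entirely
    # within the single chunk at index 6884, from byte 188 up to 19750.
    entry = {"content": "6884:188:6884:19750", "path": "[file 2]"}
    # chunks_for_file(entry, chunks) -> [chunks[6884]]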


This was the part of the puzzle that was missing…

A curious detail is that the file that starts from the “zero position” is the oldest file in the whole folder:

      "content": "0:0:0:17899",
      "path": "[very old file]",
      "time": 1387212469

1387212469 = Mon, 16 Dec 2013 16:47:49

That is just a coincidence. Files are always sorted alphabetically.

Could this be because they are stored as chunk hashes and not chunk IDs (chunk file names)?

If you think about it: when you copy snapshots between different storages (i.e. copy-compatible, but not bit-identical), those storages will have a different ID key. You’ll have the same data, but encrypted differently, and the chunk file names will be different. The data inside the snapshots, though, has to stay immutable.
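That would also explain why the hash from the snapshot didn’t show up directly as a file name under chunks/. A minimal sketch of the naming step, assuming the chunk file name is the HMAC-SHA256 of the chunk hash keyed with the storage’s ID key (my reading of the snapshot-format documentation), combined with the two-character directory nesting seen above:

    import hmac
    import hashlib

    def chunk_path(chunk_hash_hex, id_key):
        # Assumption: file name = HMAC-SHA256(chunk hash, ID key).
        name = hmac.new(id_key, bytes.fromhex(chunk_hash_hex), hashlib.sha256).hexdigest()
        # The first two hex characters become a subdirectory, so a name
        # like 545ad57c... is stored as chunks/54/5ad57c...
        return "chunks/{}/{}".format(name[:2], name[2:])

So with a different ID key per storage, the same chunk hash maps to a different file name on each storage, while the hashes recorded inside the snapshot stay the same across copies.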

Makes sense :thinking: