When restoring, backups from S3 appear empty

As per this thread:

  • I’ve been testing a workflow of sending off Duplicacy backups to S3 Glacier via a lifecycle policy.

What I did is:

  1. Copied a single project backup from the repository to a dedicated folder in the bucket.

  2. Froze everything in the ‘chunks’ folder and left snapshots and config hot.

Expected behaviour:

  • To be able to browse the content index of the backup via the Restore tab, but not be able to recall anything.

What I got:

  • All revisions show up, but they appear empty inside.

Can the good people of this forum explain whether it’s possible to achieve what I want without unfreezing the whole backup? Or is this a glitch and the backup is lost?

Currently unfreezing the chunks folder to see if it helps…

Snapshots are partially contained in the chunks folder.

To be precise, the data file that represents a snapshot is quite large, so it is chunked and stored inside the chunks folder just like normal files would be. The file in the snapshots folder simply contains references to those chunks.

If you freeze the chunks folder, then you need to go through the two-step process we discussed in the other thread: first defrost the chunks referred to in the snapshot file to unpack the full snapshot, and then defrost the chunks that the full snapshot refers to in order to restore the actual files.

I tried looking for 8973350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830 from:

LZ4 … òm{"chunks":["8973350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830"],"end_time":1631791350,"file_size":75083311361 { ð1c57b7ded54390e852650802db5966c877de3017d8315caae02759b33f2e9a32d{ ñ
id":"BACKUPID","lengthe ð1388bdb015fabeb0a5918e1fd01340488ecc51cb967b9cde79d02ba2d9c171f8fe number_of_Ô øs":197,"options":"-hash","revision":1,"startð 89091,"tag":""}

But it does not exist?

8973350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830 will be stored as chunks/89/73/350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830. The number of sub-folder levels is configurable and the default depends on the backend. On most backends it’s just one, i.e. chunks/89/73350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830
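
In other words, something like this (a quick sketch; the helper name is mine):

def chunk_path(chunk_id: str, nesting_levels: int = 1) -> str:
    # Split the chunk file name into 2-character sub-folder names, one per nesting level.
    parts = [chunk_id[2 * i:2 * i + 2] for i in range(nesting_levels)]
    return "chunks/" + "/".join(parts) + "/" + chunk_id[2 * nesting_levels:]

# chunk_path("8973350d...", 1) -> "chunks/89/73350d..."
# chunk_path("8973350d...", 2) -> "chunks/89/73/350d..."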

There is no 73350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830 in chunks/89

No sub-folders either.

Am I missing something else?

You just need to click the folder icon next to Revision to expand the listing.

8973350d4fd2e01c0b5ec389a4597ca4ac1283b039f3008a6c6ecda826b26830 is the chunk hash. The name of the chunk file (which is called the chunk ID) is derived from the chunk hash.

I understand that much, but when I open the folder it is empty. What am I supposed to do here?
How do I get the chunk ID from the chunk hash?

Ah, indeed. This is how to do that
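
Roughly like this, as far as I can tell (a Python sketch and an assumption on my part: I believe the chunk file name is the HMAC-SHA256 of the chunk hash, keyed with the ID key from the storage’s config file, though I am not certain whether the HMAC is taken over the raw hash bytes or over the hex string):

import hashlib
import hmac

def chunk_id_from_hash(chunk_hash_hex: str, id_key_hex: str) -> str:
    # Assumption: chunk file name = hex(HMAC-SHA256(id_key, chunk_hash)).
    # The snapshot file stores hashes hex-encoded; this sketch assumes the
    # HMAC is computed over the decoded raw bytes.
    return hmac.new(bytes.fromhex(id_key_hex),
                    bytes.fromhex(chunk_hash_hex),
                    hashlib.sha256).hexdigest()

# Combined with the chunk_path() sketch above, this gives the object key to look for.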

But at this point it’s probably easier to add that functionality to duplicacy instead of reimplementing stuff around it – maybe add a command, similar to the existing cat command, that would dump the list of snapshot chunks from a snapshot file, and submit a PR :slight_smile:

You don’t need to worry about the actual chunk IDs unless Duplicacy complains about missing chunks. If you really want to find them out, you can run the CLI with the cat command to dump the snapshot content:

cd ~/.duplicacy-web/repositories/localhost/1
~/.duplicacy-web/bin/duplicacy_osx_x64_2.7.2 cat -storage storage_name -r 1

You can also see the file list in the output. If there are files in the file list and you’re still seeing an empty list, then it is likely that you didn’t choose the right backup ID to list.

This requires data from the chunks folder to be defrosted: the data that the cat command dumps comes from the chunks that the snapshot file is split into. The goal of this effort is to find the chunks that make up a snapshot and defrost them, so that the cat command works, which will then allow fetching the list of chunks for the files the snapshot refers to.

Is my understanding not correct?

In other words, if the duplicacy datastore is frozen, then to restore a specific revision of specific files the following needs to be done (a thawing sketch for steps 4 and 7 follows the list):

  1. Defrost the config file.
  2. Defrost the snapshot/revision_id file.
  3. Parse that file for chunk hashes and, using the keys from the config file, find the chunk filenames.
  4. Defrost those chunks.
  5. Now the cat command will work – dump the snapshot contents.
  6. Parse it for the chunks needed for the specific list of files to restore.
  7. Defrost those chunks.
  8. Do the actual restore, which will now also work.
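
For steps 4 and 7, the thawing itself can be scripted. A rough boto3 sketch follows; the bucket name and chunk keys are placeholders, and the tier and days values are just examples:

import boto3

s3 = boto3.client("s3")

# Placeholder inputs: the bucket and the chunk object keys produced by steps 3 and 6.
bucket = "my-duplicacy-bucket"
chunk_keys = [
    "project-backup/chunks/ab/cdef0123",  # placeholder, not a real chunk ID
]

for key in chunk_keys:
    # Ask S3 to make a temporary hot copy of each Glacier object.
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={
            "Days": 7,
            "GlacierJobParameters": {"Tier": "Bulk"},  # or "Standard" / "Expedited"
        },
    )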

What I want is to be able to browse the index of the backup/revision without having to defrost the entire repo first.

That is basically right. One thing to note is that there is one more level of indirection. For instance, to get the file list, you’ll need to first defrost the chunks referenced in the snapshot/revision files as file_sequence. What these chunks contain is not the file list, but the list of chunks that made up the file list, so you’ll need to defrost these chunks too. The same applies to chunk_sequence and length_sequence.
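
To make that concrete, here is a rough sketch of collecting those sequence chunks from an already-decompressed (and, if encrypted, decrypted) revision file. The script is hypothetical, and the key names files/chunks/lengths are my guess based on the dump earlier in this thread:

import json

# Hypothetical: 'revision.json' is the decoded snapshot revision file; each
# sequence is a list of chunk hashes (hex strings).
with open("revision.json") as f:
    snapshot = json.load(f)

sequence_hashes = []
for key in ("files", "chunks", "lengths"):  # i.e. file/chunk/length_sequence
    sequence_hashes.extend(snapshot.get(key, []))

# These are the hashes whose chunks need thawing first; map each one through
# the hash -> chunk ID -> path steps above to get the object keys.
for h in sequence_hashes:
    print(h)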


I’ve been thinking about this for a bit… what’s the advantage of keeping the metadata chunks amongst the regular data in the chunks/ directory?

There wouldn’t be any chance of deduplication (between meta and data anyway), so if there’s no real advantage, why not have them stored in a separate e.g. index/ directory instead?

That way, this directory could easily be targeted by a separate policy, for Glacier etc… (In my case, I might duplicate that particular directory across my DrivePool, knowing the extra space isn’t huge and would offer extra protection for really important chunks.)


Oops. In this case this endeavor quickly becomes pointless – too many steps, each involving a 12-hour wait, for this approach to be useful.

I second that, and would take it a step further and ensure that only directly addressable file data is located in chunks (or some other easily identifiable folder), while all metadata – snapshots, file lists, etc.; stuff that is small by definition – goes elsewhere.

Then an additional command like restore-prepare might be useful, to dump the list of chunk files that would be required to restore a specified, filtered list of files in a specified revision. That list could be fed into automation to defrost those files as part of the restore workload. Or maybe it could even be supported automatically later, for endpoints that support thawing.

This seems like a fairly easy and safe change, yet it would instantly make duplicacy compatible with archival storage. Apparently there is some demand for that, which is honestly surprising to me.


Just realised! AWS S3 has a lifecycle archiving feature for moving unused files into Glacier. What if I were to set it up for the whole bucket, let’s say “anything not used for 30 days gets sent to Glacier”, and then somehow query the index more than once every 30 days? Then the index and everything required for it would stay hot, while the unused chunks would automatically become cold. And since I’m already copying each backup out to a dedicated folder, all I would have to do is unfreeze the whole folder if I actually need to recall something.
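
Something along these lines, I imagine (a rough boto3 sketch with a made-up bucket name and rule ID). One caveat: standard lifecycle transitions go by object age rather than last access, so “anything not used for 30 days” would need S3 Intelligent-Tiering instead of a plain rule like this:

import boto3

s3 = boto3.client("s3")

# Hypothetical rule: transition everything in the bucket to Glacier 30 days
# after it was written.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-duplicacy-bucket",  # made-up name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "freeze-after-30-days",  # made-up rule ID
                "Filter": {"Prefix": ""},      # whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)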

Would that be a viable strategy?

Not sure this will help anything: the snapshots are not the problem, because these are small files that can be kept in hot storage all the time. It’s the chunk files – the ones you need to fetch data from in order to restore – that are large and can benefit from being in cold storage. The whole ordeal is to determine which chunks to defrost, and the complication is that the information required to make that determination is contained in other chunk files. There is no easy way to distinguish those metadata chunks from actual data chunks without reading their contents, and the number of levels of indirection times 12 hours makes this not feasible.

If there were an easy way to tell – “these chunks should not be frozen” – without the need to look inside, ideally by consolidating them in a separate folder, then only one thawing pass would be needed and that would be tolerable.

What I meant by snapshots is that I am putting every backup in a separate folder on S3, not into a single repository.