Listing revision contents, need chunk lists

So I’ve been experimenting heavily with putting Duplicacy backups on Amazon S3 and transitioning them to S3 Glacier Deep Archive, which is extremely affordable.

But I ran into an interesting problem today. I wanted to list the files in a revision of one of my backups from the web GUI, but it failed to list them. From looking at the web log, this is because it is trying to download a chunk that is in Deep Archive, so I need to restore that object to standard access in S3… which I’m doing…

My fear is that there will be more chunks that need to be restored before I can see the file list, and each restore takes about 12 hours… so I can’t do it one file at a time.

The main super important question here is… is there a way I can get a list of which chunks contain the file lists for a given revision, so I can bulk restore those??

Then I guess the next question would be whether it’s possible to list the chunks needed to restore a particular file, or a whole revision… this way I can script the S3 retrieval process and only restore the chunks I need, to avoid much larger costs from Amazon… and hopefully only have to do 1 or 2 retrieval requests??
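For illustration, this is roughly the kind of bulk-restore script I’d want to write once I have that chunk list (the bucket name, prefix, and chunk keys below are placeholders; Deep Archive only offers the Standard and Bulk retrieval tiers):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-duplicacy-bucket"   # placeholder
PREFIX = "chunks/"               # duplicacy keeps every chunk under /chunks/

def bulk_restore(chunk_keys, days=7):
    """Kick off a temporary restore for every chunk in one pass."""
    for key in chunk_keys:
        s3.restore_object(
            Bucket=BUCKET,
            Key=PREFIX + key,
            RestoreRequest={
                "Days": days,
                # Deep Archive supports "Standard" (~12h) or "Bulk" (~48h)
                "GlacierJobParameters": {"Tier": "Standard"},
            },
        )

# bulk_restore(["ab/cdef0123...", "ab/4567890..."])  # the chunk keys I still need to find
```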

I think the right approach here would be to keep the snapshot chunks in a separate folder from data chunks.

Then the whole data-chunks folder can be moved to Deep Archive and the snapshot-chunks folder left in standard storage. This way all the information required to list the chunks to be thawed (to restore the whole snapshot or just a few files) will reside in the standard tier and can be fetched instantly and cheaply.

This sounds like a pretty simple and safe change: when generating chunks, save them to the appropriate folder, and when reconstructing data, look in both folders. Adding AWS support would then just add an extra step: wait for data availability after reading the snapshot chunks and before accessing the data chunks.
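If the chunks were split like that, the transition policy on the S3 side becomes a plain prefix-scoped lifecycle rule. A rough boto3 sketch, with the bucket and folder names made up for illustration:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical layout: data chunks under "chunks/", snapshot (metadata)
# chunks under "metadata/". Only the data chunks ever get archived.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-duplicacy-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-data-chunks",
                "Status": "Enabled",
                "Filter": {"Prefix": "chunks/"},
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
            }
            # No rule for "metadata/", so snapshot chunks stay in Standard.
        ]
    },
)
```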


How would I configure that… there doesn’t seem to be any way to separate them. All the chunks just go in the /chunks/ folder… which is the only thing I move to Deep Archive; everything else stays on standard S3, including the config file and the /snapshots/ directory.

So I’m able to list the storage metrics and see how many revisions/snapshots exist, but when I try to list the contents of one to do a single-file restore, the web log shows that it’s trying to grab a regular chunk just to provide the file list.

Here is a sample of that log:

2022/02/14 11:53:07 192.168.1.22:55336 POST /list_repositories
2022/02/14 11:53:07 Running /root/.duplicacy-web/bin/duplicacy_linux_x64_2.7.2 [-log -d info -repository /cache/localhost/all s3://redacted]
2022/02/14 11:53:07 Set current working directory to /cache/localhost/all
2022/02/14 11:53:07 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_S3_ID
2022/02/14 11:53:07 DEBUG PASSWORD_ENV_VAR Reading the environment variable DUPLICACY_S3_SECRET
2022/02/14 11:53:07 INFO STORAGE_ENCRYPTED The storage is encrypted with a password
2022/02/14 11:53:07 INFO STORAGE_SNAPSHOT ownCloud_Backup_AWS
2022/02/14 11:53:09 redacted:54934 POST /list_revisions
2022/02/14 11:53:09 Running /root/.duplicacy-web/bin/duplicacy_linux_x64_2.7.2 [-log list -storage redacted -id redacted]
2022/02/14 11:53:09 Set current working directory to /cache/localhost/all
2022/02/14 11:53:09 INFO STORAGE_SET Storage set to s3://redacted
2022/02/14 11:53:11 redacted:55336 POST /list_files
2022/02/14 11:53:11 Created listing session mi45cq
2022/02/14 11:53:11 Running /root/.duplicacy-web/bin/duplicacy_linux_x64_2.7.2 [-log list -id redacted -r 4 -storage redacted -files]
2022/02/14 11:53:11 Set current working directory to /cache/localhost/all
2022/02/14 11:53:11 INFO STORAGE_SET Storage set to s3://redacted
2022/02/14 11:53:11 INFO SNAPSHOT_INFO Snapshot redacted revision 4 created at 2022-02-14 04:00
2022/02/14 11:53:11 ERROR DOWNLOAD_CHUNK Failed to download the chunk redacted: InvalidObjectState: The operation is not valid for the object’s storage class
2022/02/14 11:53:11 ERROR DOWNLOAD_CHUNK Failed to download the chunk redacted InvalidObjectState: The operation is not valid for the object’s storage class
2022/02/14 11:53:11 CLI: status code: 403, request id: HZ907RJQAQ3AJZNQ, host id: oRokx/SEEIvq4J2B5tSUX0iKiux9NtU82p/SxTndBfyrx83WgT/u22C6cLLWxhQLquJaBafmJqY=
2022/02/14 11:53:11 Failed to list files for backup redacted revision 4 in the storage redacted: Failed to download the chunk redacted: InvalidObjectState: The operation is not valid for the object’s storage class
2022/02/14 11:53:19 Deleted listing session 7cxck

Right, today there isn’t; everything is piled up into the same chunks folder indiscriminately. Even using prefixes to differentiate snapshot chunks from data chunks would have gone a long way to facilitate support of archival storage.

Once this is done, supporting Glacier will be fairly straightforward.

Since this is not on the roadmap, you could implement it in your fork (and submit a PR!).

If I had even the remotest level of coding skill to get something like that done, I would jump on it in a heartbeat.

What I really need to know is… how does duplicacy know which chunks to pull in order to get the file list? It seems to give up after it fails on the first one. Is there any way I could get the full list of the chunks I want… then I could just manually promote those all at once in S3, and my listings would work much better.

So is there a way to see a list of the chunks it’s going to try to request? Or is there a file somewhere that I could download and decrypt to find them myself?

If I can get that, I might be able to script a little Python utility that could do this all automatically, and it would make Glacier support not only easier, but VERY VERY usable! It could even tag those chunks and set a policy in S3 to not transition them down to Deep Archive again.

So yeah… how do I see a list of chunks containing the file list in that backup revision?

My bad! For some reason I got the impression that you were looking to implement support for archival storage in duplicacy.

You can search for "IEEE PDF" here on the forum; there is a PDF of an article describing the inner workings.

In a nutshell: the snapshot file contains the information on how to recreate the files, plus the list of chunks that hold the bits and pieces needed for that. But because this snapshot file is itself huge, it too is shredded into chunks that are uploaded to the same pile. The actual file in the snapshots folder then only contains instructions on how to recreate the full snapshot file, which in turn contains the instructions on how to recreate the files.

That’s why, if it fails to fetch a chunk that is part of a snapshot, it can’t proceed.

As you can see, the problem is that snapshot chunks are mixed in with the rest of the chunks, even though they are semantically different and should be kept separately. Separating them would also allow further differentiation in how data chunks and snapshot chunks are handled: for example, it would be easy to duplicate snapshot chunks for redundancy. Losing a data chunk can affect a few files, but losing a snapshot chunk takes out an entire snapshot (potentially millions of files).
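Schematically, listing a revision has to go through two hops before a single file name appears. The helper functions and the "files" key below are placeholders and assumptions on my part, just to show the indirection:

```python
import json

def fetch_object(path: str) -> bytes:
    """Placeholder: download `path` from the storage backend."""
    raise NotImplementedError

def decode(blob: bytes) -> bytes:
    """Placeholder: decrypt (if needed) and decompress a stored object."""
    raise NotImplementedError

def chunk_path(chunk_hash: str) -> str:
    """Placeholder: map a chunk hash to its file name under chunks/."""
    raise NotImplementedError

def list_files(snapshot_id: str, revision: int) -> list:
    # Hop 1: the small file under snapshots/<id>/<revision> only names the
    # metadata chunks (the shredded snapshot file), not the actual files.
    snapshot = json.loads(decode(fetch_object(f"snapshots/{snapshot_id}/{revision}")))
    metadata_chunk_hashes = snapshot["files"]   # the 'file sequence' (assumed key)

    # Hop 2: every one of those chunks must be downloadable (i.e. thawed)
    # before the real file list can be reassembled.
    blob = b"".join(decode(fetch_object(chunk_path(h))) for h in metadata_chunk_hashes)
    return json.loads(blob)
```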

Ok… now I think we’re actually getting somewhere great… I’m going to search for that doc ASAP and see what I can learn, but you mentioned something very important.

I do have my snapshot files on standard S3… so I can grab one and download it right from the S3 console. The key here, though, is that I do use duplicacy backup encryption, so I’m assuming that snapshot file is encrypted as well.

I can download that to my local machine… and of course I have my encryption key… could I download that file, decrypt it manually with some command-line tool, or with Python using a crypto library, and get this list of chunks that way??

I think that, armed with that list of chunks, I should be able to promote all of them back up to standard storage and get the files to list. It’s also something I could write a Python utility to do automatically, and then always mark those chunks as exempt from the lifecycle rule that pushes things down to Deep Archive.
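For the “mark them exempt” part, I’m picturing something like this (the tag key/value are made up, and I’d still have to work out exactly how the lifecycle rule should key off the tag):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-duplicacy-bucket"   # placeholder

def pin_metadata_chunks(chunk_keys):
    """Tag the metadata chunks so lifecycle rules can treat them differently."""
    for key in chunk_keys:
        s3.put_object_tagging(
            Bucket=BUCKET,
            Key=key,
            Tagging={"TagSet": [{"Key": "duplicacy-metadata", "Value": "true"}]},
        )
```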

Also, thanks so much for your support here… fantastic! I think we can do this! And I’d LOVE to help contribute to the project more. I’m loving it so far!

That would be reimplementing (and maintaining!) half of the duplicacy functionality in Python :slight_smile: Might as well do it right in the duplicacy codebase in Go, where all the data is already available and the only change needed is to store snapshot chunks in a different folder. Should be a one-line change :slight_smile:

(As a side note: personally, having a long history with C/C++, I find Go much more palatable than Python. I have to use Python at work occasionally, and I have yet to grow to enjoy it. Go, however, is absolutely awesome. But I don’t get to use it :slight_smile:)

You can use the cat command to find chunks needed by a backup:

duplicacy cat -r revision

Specifically, file_sequence contains the list of metadata chunks. If you copy only these chunks, you should be able to list the files in a backup.
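For example, a small wrapper around that command could dump those hashes to feed into a restore script. This is a rough sketch: the backup ID and revision are placeholders, the exact key names in the cat output are an assumption, and the metadata chunks of course have to still be retrievable for cat to run at all:

```python
import json
import subprocess

# Run from an initialized repository; id/revision are hypothetical.
out = subprocess.run(
    ["duplicacy", "cat", "-id", "mybackup", "-r", "4"],
    capture_output=True, text=True, check=True,
).stdout

# The snapshot JSON may be preceded by log lines; skip to the first brace.
snapshot = json.loads(out[out.index("{"):])

# Collect every list of chunk hashes referenced at the top level; the key
# names ("file_sequence"/"files", etc.) are a guess and may differ by version.
metadata_chunks = set()
for key in ("file_sequence", "chunk_sequence", "length_sequence",
            "files", "chunks", "lengths"):
    value = snapshot.get(key)
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        metadata_chunks.update(value)

print("\n".join(sorted(metadata_chunks)))
```

Note that these are chunk hashes; mapping them to the actual object names in the chunks/ folder is a separate step.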

Doesn’t this require the snapshot chunks to already be available? And today there is no way to differentiate a snapshot chunk from a data chunk. Or is there?

I’m also trying to figure out how to do this inside a Docker container… if I run it outside the container, it can’t find any of the storage configurations or anything.

I kind of need a full command or better way to get at that data…

I think he was referring to just reading the revision files that are in the /snapshots/ folder under each backup ID… I can’t view mine in the clear when I download them because my backups are encrypted. So I’m assuming that doing this through the command-line utility will decrypt it as it grabs it.


I’ve been working on this for a while now… I created a directory on another machine, got duplicacy installed, did an init and pointed it at the same storage location; it now comes up in the list of available storages, but it keeps telling me there are no revisions, even though I can see the backup ID directory under “snapshots” and I see files for each of the 4 revisions… what am I missing?

This is all I’m getting now:

steve@MacProDesktop REDACTED % duplicacy cat -id REDACTED -r 4
Failed to download the chunk REDACTED: InvalidObjectState: The operation is not valid for the object’s storage class

I still can’t seem to figure out how to get the exact list of chunks needed to assemble the file list for that revision.

It seems this is exactly what we discussed: the chunk that contains a portion of the snapshot is not defrosted and hence can’t be downloaded, is it not?

For that command to work, the full snapshot file must be retrievable. For that, all of its member chunks need to be thawed… it’s a loop.

So we’re basically saying there is no way to chase down, through the chain of files, which chunks actually need to be thawed?? That can’t be possible… I mean, how does the code even start? It must start with a single file and then build up a list of the other chunks it needs to grab, no??

No, of course it’s possible; it’s just that, as far as I understand, today there is no external API to get that data. I.e., there is no way to ask duplicacy to “dump the chunks needed to construct this specific snapshot”.

Not looking for an API right now… just need to know how to get at the data the hard way, get it back to my machine unencrypted so I can build the list of chunks I need to thaw and then see how it works…

From there I can start coding, learning Go, and testing to see how this might really work in the future… but I have to be able to do it manually first.

I’m still working on this… I did a test backup to a local drive with no encryption so I could read the snapshot file… which I got, but there is one more problem.

LZ4 áÚj{“chunks”:[“92c262a75b378fa3f07d310d2a7abcc2b0ad0a658d9409109bd5d544288eb6a5”],“end_time”:1644936577,“file_size”:48900293x11028785bbac2ff6c7e426012d3aa47a2093b8b4a0cf39e276046847c609e5f4dxÒid":“Backup_FBExportsTest”,“lengthk647205a1bddc69036ec4f173f431b853a73a1eb3!ff327c99c9d50303e5eck†number_of_◊¸s”:71,“options”:"-hash",“revision”:1,“start∞6,“tag”:”"}

It looks as if the file is also compressed, seemingly with LZ4… but when I try to decompress it manually, it tells me the file can’t be decoded. Is there some trick to decompressing this file? I still can’t get any discernible chunk list out of it. The only hash it seems to mention at the top is not a chunk in my backup.
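For the record, this is the direction I’m poking at. It’s pure guesswork on my part: the file seems to start with duplicacy’s own “LZ4 ” magic followed by a raw LZ4 block (with a length prefix), rather than the frame format the lz4 command-line tool expects, which would explain the decode error. The path is just my local test storage:

```python
# Guess: "LZ4 " magic + classic LZ4 block with a 4-byte little-endian
# uncompressed-size prefix (the python-lz4 block default), NOT the frame
# format that the `lz4` CLI speaks.
import json
import lz4.block   # pip install lz4

raw = open("snapshots/Backup_FBExportsTest/1", "rb").read()
if raw[:4] != b"LZ4 ":
    raise SystemExit("not the header I expected: %r" % raw[:4])

data = lz4.block.decompress(raw[4:])      # reads the stored size, inflates the block
snapshot = json.loads(data)

# If the guess holds, these should be the metadata chunk hash lists.
for key in ("chunks", "files", "lengths"):
    print(key, snapshot.get(key))
```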

Those are chunk hashes: How to identify which chunks represent a particular file

The chunk file names are derived from a hash: When restoring, backups from S3 appear empty - #8 by saspus