Listing revision contents, need chunk lists

This is all I’m getting now:

steve@MacProDesktop REDACTED % duplicacy cat -id REDACTED -r 4
Failed to download the chunk REDACTED: InvalidObjectState: The operation is not valid for the object’s storage class

I still can’t seem to figure out how to get the exact list of chunks needed to assemble the file list for that revision.

It seems it is exactly what we discussed: the chunk that contains a portion of the snapshot is not defrosted and hence can't be downloaded, is it not?

For that command to work, the full snapshot file must be retrievable. For that, all of its member chunks need to be thawed… it's a loop.

So we’re basically saying there is no way to chase down through a chain of files which files actually need to be thawed?? That can’t be possible… I mean, how does the code even start? It must start with a single file and then build up a list of the other chunks it needs to grab, no??

No, of course it's possible; it's just that, as far as I understand, today there is no external API to get that data. I.e. there is no way to ask duplicacy to "dump the chunks needed to construct this specific snapshot".

Not looking for an API right now… just need to know how to get at the data the hard way, get it back to my machine unencrypted so I can build the list of chunks I need to thaw and then see how it works…

From there I can start coding, learning Go, and testing to see how this might really work in the future… but I have to be able to do it manually first.

I’m still working on this… I tried to do a backup test just to a local drive with no encryption so I could read the snapshot file… which I got, but there is one more problem.

LZ4 áÚj{“chunks”:[“92c262a75b378fa3f07d310d2a7abcc2b0ad0a658d9409109bd5d544288eb6a5”],“end_time”:1644936577,“file_size”:48900293x11028785bbac2ff6c7e426012d3aa47a2093b8b4a0cf39e276046847c609e5f4dxÒid":“Backup_FBExportsTest”,“lengthk647205a1bddc69036ec4f173f431b853a73a1eb3!ff327c99c9d50303e5eck†number_of_◊¸s”:71,“options”:"-hash",“revision”:1,“start∞6,“tag”:”"}

It looks as if the file is also compressed, seemingly with LZ4 compression… but when I try to decompress it manually it tells me the file can't be decoded. Is there some trick to uncompressing this file? I still can't get any discernible chunk lists out of it. The only hash it seems to mention at the top is not a chunk in my backup.

Those are chunk hashes: How to identify which chunks represent a particular file

The chunk file name is derived from the hash: When restoring, backups from S3 appear empty - #8 by saspus
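To make that mapping concrete, here is a rough sketch of how I understand the derivation (my reading, not anything official): the chunk file name is hex(HMAC-SHA256(id key, raw hash bytes)), and with the default nesting the object sits under chunks/<first two hex characters>/<rest>. The "duplicacy" default ID key for an unencrypted storage is an assumption worth verifying against duplicacy_config.go:

package main

// chunk_id_from_hash.go: rough sketch of mapping a chunk hash from the
// snapshot json to the chunk file name in storage. Assumptions: the id key
// defaults to "duplicacy" for an unencrypted storage (for an encrypted one it
// comes from the decrypted config file), and the storage uses the default
// one-level chunk nesting.

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	hexHash := os.Args[1] // a 64-character hash copied from the snapshot json
	rawHash, err := hex.DecodeString(hexHash)
	if err != nil {
		panic(err)
	}
	idKey := []byte("duplicacy") // assumed default for unencrypted storage
	mac := hmac.New(sha256.New, idKey)
	mac.Write(rawHash)
	chunkID := hex.EncodeToString(mac.Sum(nil))
	fmt.Printf("chunks/%s/%s\n", chunkID[:2], chunkID[2:]) // default nesting
}

If that is right, it would explain why the hash printed at the top of the dump does not show up verbatim as a chunk file name in the bucket.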

Ok… so I'm trying to follow, but for some reason, no matter what I do, even when referring to the storage by name and including the backup ID, it still keeps telling me the repository hasn't been initialized…

~ # /config/bin/duplicacy_linux_x64_2.7.2 cat -storage REDACTED -id REDACTED -r 5
Repository has not been initialized

And this is being done from inside the container. I've tried from the home directory, from the config directory, even from the source directory for the backups… always the same thing.

What am I missing?

Look at any backup log – in the first few lines you'll see a path. cd there and then run the command.

Still no good… here’s my log:

/logs # cat backup-20220215-040001.log
Running backup command from /cache/localhost/3 to back up /source/REDACTED/data
Options: [-log backup -storage REDACTED -threads 8 -stats]
2022-02-15 04:00:01.197 INFO REPOSITORY_SET Repository set to /source/REDACTED/data
2022-02-15 04:00:01.200 INFO STORAGE_SET Storage set to s3://REDACTED
2022-02-15 04:00:14.872 INFO BACKUP_START Last backup at revision 4 found
2022-02-15 04:00:14.872 INFO BACKUP_INDEXING Indexing /source/REDACTED/data
2022-02-15 04:00:14.872 INFO SNAPSHOT_FILTER Parsing filter file /cache/localhost/3/.duplicacy/filters
2022-02-15 04:00:14.872 INFO SNAPSHOT_FILTER Loaded 2 include/exclude pattern(s)
2022-02-15 04:33:36.121 INFO BACKUP_THREADS Use 8 uploading threads

From there it lists the files being uploaded, then confirms revision 5 has been created.

However, when I try to run the cat command again… still the same thing:

/source/REDACTED/data # /config/bin/duplicacy_linux_x64_2.7.2 cat -storage REDACTED -id REDACTED -r 5
Repository has not been initialized

I even tried it with almost no options, as mentioned before… it still comes up with nothing.

cd <wherever /cache is mounted at>/localhost/3

Or if running in the container — cd /cache/localhost/3

There must be a .duplicacy folder there. If it is not there, it's the wrong location.

YES!!! You nailed it! I did it from the cache directory and it dumped nearly 310MB of data on me. It’s actually 7,624,312 lines of data!!! Hahahaha

Now the real question is… what can I search for in this giant file to tell me which chunks I need to thaw to get the file list working for this particular revision?!

And if I haven’t said it already… thank you SO much for the help! You’re killin it!! Expect some donations, project work, anything I can contribute, I’m in!!!

I think there is a problem hiding here already: to produce that output duplicacy had to construct the full snapshot, and that involves downloading the snapshot chunks, which may themselves be frozen.

We need another command line command that will dump the references for the snapshot file – so we know which chunks to thaw to get the snapshot itself, which can then be used to figure out which file chunks are needed.

But then it becomes a 2-step process, which is not good in itself.

This, ultimately, would be the proper solution.

This would allow even disk pooling software (DrivePool, mergerfs) to duplicate such a meta directory for increased redundancy.

I agree, I’d love to see that happen… and help where I can!

The thinking in my brain is to follow the process the code takes to get the file listings… I’m just trying to get it to list for me the files that are in each revision. So in the web GUI when you go to restore, select your backup, your revision, etc… it goes to populate the file list, and in doing so it obviously attempts to download chunks. I see the first one it downloaded… so I thawed it, now it gets that one and I see the second one it’s trying to download… so now I’m thawing that.

So in the code, how does it know which chunks it’s going to need to get to give you that file list?

If I can step through this manually, I can develop a logical process of what would need to be done to make that part of the GUI function properly with Glacier Deep Archive… then we can move on from there… I'm even taking some online coding classes so I'll be able to develop some of these things myself and submit a PR… but I'm a little too junior to understand what the code is doing right now.

A snapshot/revision file is just a json serialization of this Snapshot struct:

This file is compressed even if the storage is not encrypted, so you can't print it in plaintext.
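For reference, here is a rough sketch of one way to decompress it manually, assuming the format matches what I believe duplicacy uses for unencrypted storage: a 4-byte "LZ4 " banner followed by an lz4 block encoded with github.com/bkaradzic/go-lz4 (which itself prefixes the uncompressed length). The "LZ4 " at the start of the dump earlier in the thread is consistent with this, but treat the whole thing as a guess to verify against duplicacy_chunk.go:

package main

// decompress_snapshot.go: rough sketch that strips the assumed "LZ4 " banner
// and decodes the rest as a go-lz4 block, then prints the snapshot json.

import (
	"bytes"
	"fmt"
	"os"

	lz4 "github.com/bkaradzic/go-lz4"
)

func main() {
	raw, err := os.ReadFile(os.Args[1]) // e.g. snapshots/<id>/<revision> copied from the local storage
	if err != nil {
		panic(err)
	}
	if !bytes.HasPrefix(raw, []byte("LZ4 ")) {
		panic("no LZ4 banner; the file may be encrypted or use a different compressor")
	}
	decoded, err := lz4.Decode(nil, raw[4:])
	if err != nil {
		panic(err)
	}
	fmt.Println(string(decoded)) // should be readable snapshot json
}

If that format assumption holds, it would also explain why a stock lz4 tool refuses the file: it expects the standard lz4 frame format, not a bare block with a custom banner.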

You can't find the file list directly in this file. Instead, there is one extra level of indirection involving FileSequence, ChunkSequence, and LengthSequence. For example, FileSequence is a list of metadata chunk hashes. To get the file list, you must download all the chunks in this FileSequence, concatenate them together, and then deserialize the result from json into a list of Entry structs.
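So, for the earlier question about which chunks to thaw first: once the snapshot file is decompressed, the hashes listed under its "files", "chunks" and "lengths" keys (visible in the dump earlier in the thread) should be exactly the metadata chunks that have to be thawed before the file list can be rebuilt. A rough sketch, with the json key names inferred from that dump rather than taken from any spec:

package main

// list_snapshot_chunks.go: rough sketch that reads an already-decompressed
// snapshot json and prints the metadata chunk hashes it references. Unknown
// json fields are ignored; the key names are an assumption based on the dump.

import (
	"encoding/json"
	"fmt"
	"os"
)

type snapshotFile struct {
	ID       string   `json:"id"`
	Revision int      `json:"revision"`
	Files    []string `json:"files"`   // chunks holding the file list (Entry structs)
	Chunks   []string `json:"chunks"`  // chunks holding the content chunk hash list
	Lengths  []string `json:"lengths"` // chunks holding the chunk length list
}

func main() {
	data, err := os.ReadFile(os.Args[1]) // the decompressed snapshot json
	if err != nil {
		panic(err)
	}
	var s snapshotFile
	if err := json.Unmarshal(data, &s); err != nil {
		panic(err)
	}
	fmt.Printf("snapshot %s revision %d\n", s.ID, s.Revision)
	for _, seq := range [][]string{s.Files, s.Chunks, s.Lengths} {
		for _, hash := range seq {
			fmt.Println(hash) // metadata chunk hashes; thaw these first
		}
	}
}

Each printed hash would still have to be run through the hash-to-file-name derivation sketched earlier to find the actual object in the bucket.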

Ok… this somewhat makes sense to me… so how do I start? Is being able to cat the revision into that 7 million line file helpful at all? How do I figure out which chunks I need to thaw, download, concatenate, etc…?

I’m really struggling here to get this done on AWS S3 Deep Archive, but deep down I know we can make it work, I just need a little more help. Would showing you parts of the “duplicacy cat” contents help you point me in the right direction… ?

Another question that might help maybe… how does the duplicacy software know which chunks to grab for the metadata it needs to populate the file list? I just need to get that list in advance so I can thaw them before trying to list the files in the revision… the code stops after the first error in downloading… if it would just pop out all the chunks I need and not stop after the first error, I'd get the list that way… now I'm trying to do that manually.

The problem is, without that… let's say it takes 6 chunks to get the backup metadata… I have to run it, get an error, thaw that chunk… run it again, get the next error, thaw the next chunk… etc. With 6 chunks it could end up taking me 3 days just to be able to list the files in a backup revision.
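For what it's worth, once such a list exists, the restore requests could all be issued in one pass instead of one failed download at a time. Here is a rough sketch using the AWS SDK for Go; the bucket name, 7-day retention and Bulk tier are just example values, nothing duplicacy-specific:

package main

// thaw_chunks.go: rough sketch that reads chunk object keys (one per line,
// e.g. "chunks/ab/cdef...") from stdin and requests a Glacier restore for
// each of them.

import (
	"bufio"
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := "my-duplicacy-bucket" // example bucket name
	svc := s3.New(session.Must(session.NewSession()))

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		key := scanner.Text()
		_, err := svc.RestoreObject(&s3.RestoreObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			RestoreRequest: &s3.RestoreRequest{
				Days: aws.Int64(7), // keep the thawed copy for a week
				GlacierJobParameters: &s3.GlacierJobParameters{
					Tier: aws.String("Bulk"), // cheapest; "Standard" is faster
				},
			},
		})
		if err != nil {
			fmt.Fprintf(os.Stderr, "restore request for %s failed: %v\n", key, err)
			continue
		}
		fmt.Println("restore requested:", key)
	}
}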

Ok… any chance you can help me here… all of a sudden now even my new backups are failing and I have no idea where to start…

What file can I start downloading? How can I uncompress it, and how can I concatenate things together to start getting the list…

My trial license expired, I bought one, applied it… and now it won’t restart any of the backups or do checks… it seems to want to start with downloading these chunks… and it’s going to take forever to do it. Just a little more help would be great.

Is there any way to identify metadata chunks when they’re uploaded?

I'm really stuck at a standstill here. Is there more logging I can turn on somehow to see the differences?