Show difference between repository and storage

Christoph · 29 November 2019 13:59

What is the easiest (=least time consuming) way of comparing a list of files in a local repository with a list of files in the backup storage? (I’m on Windows, so I guess I’m asking for a PowerShell or command line script.)

In order to avoid an XY problem, here is the more specific challenge I’m facing: A couple of months ago, I moved a number of directories from one directory tree to another and something seems to have gone wrong in the moving process because I recently noticed that quite a number of files seem to be missing in the new location. I’m not sure what exactly happened (and it is irrelevant here), but it looks like it might have something to do with the path length, i.e. that files whose path exceeded a certain length got lost. But that is just an aside here.

So, before I go ahead and start restoring the missing files, I would like to get a better picture of which files are actually affected, i.e. I basically want a list of files that are available in the backup storage but missing in the local repository. The file dates and times can safely be ignored because none of the files has been modified in the past months.

Note that the local paths are no longer fully identical with the paths on the storage but the differences are only at the beginning of the path. Something like this:

Current local paths:

c:\users\myself\project1\abc.txt
c:\users\myself\old\def.txt
c:\users\myself\project2\admin\ghi.txt
c:\users\myself\templates\admin\ghi.txt

Paths in snapshots (corresponding to previous local file locations):

users/myself/Box Sync/projects/project1/abc.txt
users/myself/Box Sync/projects/old/def.txt
users/myself/Box Sync/projects/project2/admin/ghi.txt
users/myself/Box Sync/templates/admin/ghi.txt
users/myself/Box Sync/templates/admin/comm/xyz.txt

(The duplicate file name is intentional.)

So the starting point would be the latter list of paths (though duplicacy’s list command doesn’t output them as neatly as that when I use duplicacy list -id ALPHA_C -files | select-string -pattern "/Box Sync/") and then check for each path whether that file is currently present locally, somewhere.

So in the above example file lists, the result would be

users/myself/Box Sync/templates/admin/comm/xyz.txt

Which is the file that is present in the storage but not locally.

The solution does not have to be perfect and fully automatic. I can handle a number of edge cases manually. And neither do I expect anyone to write an entire script. I’d be happy to learn about strategies for tackling this more generally.

towerbr · 29 November 2019 22:14

Well … since you said:

I have a proposal, which doesn’t involve scripts, but … Excel is your best friend ( don’t kill me!).

Of course it depends on the number of files and folders, but anyway here it is:

Generate both file lists and paste into Excel:
B2:B5 and B8:B12;
Extract the characters relative to the desired folder and file names, disregarding the initial characters:
C2:C5
formula in C2: =MID(B2;17;1000)
Do something similar in the second list, but replace / with \ and adjust the start parameter if you have different intermediate folders:
C8:C12
formula in C8: =SUBSTITUTE(MID(B8;32;1000);"/";"\")
formula in C11: =SUBSTITUTE(MID(B11;23;1000);"/";"\") (notice the “23”)

This adjustment above is because in the first list all files are on the same “level”, but in the second list some files have one more “level” (“projects” in blue in the Excel above).

Then create a formula to look for a particular name in the other list:
D8:D12
formula in D8: =VLOOKUP(C8;$C$2:$C$5;1;0)

(in a little while someone shows up here with a script involving regex, etc. and will humiliate my poor solution, but hey, it works!)
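For comparison, the same trim-and-look-up steps can be sketched in Python. This is only a sketch under stated assumptions: the sample paths are the ones from the opening post, the prefixes play the role of the MID start positions, and lower-casing is added on the assumption that the Windows file system treats names case-insensitively. All names in the code are made up for illustration.

```python
# Sample paths copied from the opening post.
local_paths = [
    r"c:\users\myself\project1\abc.txt",
    r"c:\users\myself\old\def.txt",
    r"c:\users\myself\project2\admin\ghi.txt",
    r"c:\users\myself\templates\admin\ghi.txt",
]
snapshot_paths = [
    "users/myself/Box Sync/projects/project1/abc.txt",
    "users/myself/Box Sync/projects/old/def.txt",
    "users/myself/Box Sync/projects/project2/admin/ghi.txt",
    "users/myself/Box Sync/templates/admin/ghi.txt",
    "users/myself/Box Sync/templates/admin/comm/xyz.txt",
]

LOCAL_PREFIXES = ["c:\\users\\myself\\"]
# Longest prefix first, so ".../projects/" is tried before ".../Box Sync/".
SNAPSHOT_PREFIXES = [
    "users/myself/Box Sync/projects/",
    "users/myself/Box Sync/",
]


def strip_prefix(path, prefixes):
    """Drop the first matching prefix (the role MID plays in the sheet)."""
    for prefix in prefixes:
        if path.lower().startswith(prefix.lower()):
            return path[len(prefix):]
    return path


def normalize(path):
    """Use backslashes and lower case so both lists compare equal."""
    return path.replace("/", "\\").lower()


# The set lookup below replaces the VLOOKUP column.
local_keys = {normalize(strip_prefix(p, LOCAL_PREFIXES)) for p in local_paths}

missing = [
    p for p in snapshot_paths
    if normalize(strip_prefix(p, SNAPSHOT_PREFIXES)) not in local_keys
]

for path in missing:
    print(path)
```

With the sample data this prints only the xyz.txt path, which matches the expected result in the opening post.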

gchen · 1 December 2019 04:04

Isn’t this the exact use case of the diff command?

Droolio · 1 December 2019 14:51

Personally, I’d restore from backup to a temporary directory and use Beyond Compare directly on the two paths, adjusting the base folders as necessary.

Or you could compare the two file lists as .csv files with BC, but that would require a little tinkering with the file list output beforehand. BC can be told to ignore X characters on each line for one side of the compare.

(Incidentally, this doesn’t answer the question, but could the missing files have something to do with them residing in the Box Sync directory? My first thought is that Box(?) behaves similarly to OneDrive on Windows 10 in that it can treat these files as offline and stream them on-demand. Just a thought.)

Christoph · 1 December 2019 20:25

Okay, thanks everyone for your suggestions. I’m happy to see that I didn’t ask a trivial question

Or did I?

That’s what I had hoped but then I read that

and since I’m neither comparing files nor snapshots, I figured diff is not for me. Am I missing something? (I don’t think so but I certainly hope I do. Perhaps I should make a feature request to add snapshot-repository comparison?)

I like Excel (although I’m increasingly running against its various limitations, most recently the impossibility to have a third y-axis in charts) and this may indeed be a viable way forward. I already have the file list from the snapshot, but how do I get such a list from my local directory tree branch?

I remember looking up that piece of software when you mentioned it in some other context but shied away from the price (which, I’m sure, is justified, but for someone who will probably use it once a year or so (if I then still remember I have it), it’s still a lot). That said, for the present case, the free trial version would do (and maybe I’ll discover some more use-cases) and the advantage would be that I’m analyzing and fixing the problem at the same time, which is more than I had dared to ask for. I think I’ll give it a try.

In terms of aesthetics, however, I would still be curious to see a scripted solution. If we ignore the windows requirement, does someone have an idea?

No, luckily, Box does not behave like that. I’m not sure what caused the loss, but I suspect that it’s related to the fact that I used the Box integration of Citrix Sharefile, which allegedly allows you to migrate your files from Box to Sharefile… I already had a bad feeling when I used it because it was not very talkative, but I said to myself: “Come on, stop distrusting everything that’s trying to make your life easier. Your employer paid a lot of money to both Citrix and Box, so it must be good.” I’m going back to trusting my gut feeling again…

towerbr · 1 December 2019 21:09

Easy: dir /s /b >list.txt
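Note that dir /s /b prints absolute paths and lists directories as well as files. Where paths relative to the tree root are more convenient for the comparison, a small Python sketch along these lines could produce them. The function name is hypothetical, and the demo builds a throwaway directory so that it runs anywhere; in practice the root would be the local tree.

```python
import os
import tempfile


def list_relative_files(root):
    """Return every file under root as a sorted list of root-relative paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            # Forward slashes, to match the style of the snapshot listing.
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Demo on a temporary directory (stand-in for the real tree).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "templates", "admin"))
    for rel in ("abc.txt", os.path.join("templates", "admin", "ghi.txt")):
        with open(os.path.join(demo_root, rel), "w") as handle:
            handle.write("demo")
    demo_listing = list_relative_files(demo_root)

print("\n".join(demo_listing))
```

Writing the result to a file (one path per line) gives a list that can be pasted into Excel or fed to a diff tool.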


gchen · 2 December 2019 02:25

The doc isn’t clear, but you can provide only one revision:

duplicacy diff -r n

This will compare the local repository with the specified revision.

Christoph · 2 December 2019 08:05

Aah, that is great. But if I understand correctly, there are two reasons why this won’t work for me:

Although the files are still in the same repository, their path has changed. I’m assuming duplicacy won’t recognise the files as moved. But maybe I can tell it to ignore parts of the path and then handle duplicate file names (like ghi.txt in the OP) manually?
I’m not interested in a specific snapshot but the entire storage. Or is it possible to not specify any revision at all, like in the list command?
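One way to "ignore parts of the path" outside of duplicacy is to compare file names only and set aside every name that occurs more than once for manual checking. A minimal Python sketch, assuming both lists are already available as plain path strings (sample data from the opening post; the function names are made up for illustration):

```python
from collections import defaultdict
import posixpath


def group_by_name(paths):
    """Map lower-cased file name -> list of full paths carrying that name."""
    groups = defaultdict(list)
    for path in paths:
        name = posixpath.basename(path.replace("\\", "/")).lower()
        groups[name].append(path)
    return groups


def classify(snapshot_paths, local_paths):
    """Split snapshot paths into missing, found, and ambiguous lists."""
    snap_groups = group_by_name(snapshot_paths)
    local_groups = group_by_name(local_paths)
    missing, found, ambiguous = [], [], []
    for name, paths in snap_groups.items():
        local_hits = local_groups.get(name, [])
        if not local_hits:
            missing.extend(paths)      # name does not exist locally at all
        elif len(paths) == 1 and len(local_hits) == 1:
            found.extend(paths)        # unique on both sides
        else:
            ambiguous.extend(paths)    # duplicate name: check by hand
    return missing, found, ambiguous


snapshot_paths = [
    "users/myself/Box Sync/projects/project1/abc.txt",
    "users/myself/Box Sync/projects/old/def.txt",
    "users/myself/Box Sync/projects/project2/admin/ghi.txt",
    "users/myself/Box Sync/templates/admin/ghi.txt",
    "users/myself/Box Sync/templates/admin/comm/xyz.txt",
]
local_paths = [
    r"c:\users\myself\project1\abc.txt",
    r"c:\users\myself\old\def.txt",
    r"c:\users\myself\project2\admin\ghi.txt",
    r"c:\users\myself\templates\admin\ghi.txt",
]

missing, found, ambiguous = classify(snapshot_paths, local_paths)
print("missing:", missing)
print("ambiguous:", ambiguous)
```

Here xyz.txt is reported as missing and the two ghi.txt entries land in the ambiguous list. Matching on names alone is deliberately coarse, so a file that exists locally under the same name but in an unrelated folder would count as found.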

austin.france · 4 December 2019 10:59

It can’t report those files that were moved, but missing, because it never knew they existed.

You will have to list the snapshot, list the file system, and then compare the broken tree with the original, either manually or by running diff on the two listings after some cleanup.

To diff, you need to build two file lists (sorry, this uses unix/linux commands, but it’s just for illustration)

cd /root/of/broken/tree
find . -type f -print | sed 's|^\./||' | sort > list1 # sed removes the ./ prefix on each line; sort so that diff lines up

duplicacy list -files -r N | awk '/match-root-path-of-original-tree/ { print substr($0,999); }' | sort > list2

where N is the revision with the original files, and 999 is a placeholder for a start position that prunes the unwanted initial info, including the root path.

You should end up with two files that look like

subdir/a/b/c
subdir/b/c/d

that can be directly compared using diff or windiff.
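For Windows users without sed and awk, the same two-list idea can be sketched in Python. Instead of a fixed column number, this version cuts each listing line at a root marker, so it does not depend on the exact column layout of the listing. The sample listing lines are made up (the "<leading columns>" part stands in for whatever precedes the path in the real output), and the marker and names are assumptions for illustration.

```python
import difflib


def paths_after_marker(lines, marker):
    """Keep lines containing marker and return the text that follows it."""
    result = []
    for line in lines:
        position = line.find(marker)
        if position != -1:
            result.append(line[position + len(marker):].strip())
    return sorted(result)


# Hypothetical listing lines, not real duplicacy output.
listing_lines = [
    "<leading columns> users/myself/Box Sync/templates/admin/ghi.txt",
    "<leading columns> users/myself/Box Sync/templates/admin/comm/xyz.txt",
    "<leading columns> users/myself/Documents/unrelated.txt",
]

# list2: snapshot paths relative to the original root.
list2 = paths_after_marker(listing_lines, "/Box Sync/templates/")
# list1: local paths relative to the new root (here typed in by hand).
list1 = sorted(["admin/ghi.txt"])

diff_lines = list(
    difflib.unified_diff(
        list1, list2, fromfile="local", tofile="snapshot", lineterm="", n=0
    )
)
# Skip the two header lines; "+" lines exist only in the snapshot.
only_in_snapshot = [line[1:] for line in diff_lines[2:] if line.startswith("+")]

print("\n".join(only_in_snapshot))
```

Both lists are sorted before the comparison so that files present on both sides line up, which is also what diff or windiff need to give a readable result.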