File Format / Accessibility


I apologize if my question sounds dumb since I’m not quite familiar with this piece of software.

So I’ve tried out Duplicacy, and it seems that it creates three items at the destination: chunks, snapshots, and config. I was wondering whether my backups are accessible through means other than Duplicacy, in case this software is not available in the future?

I’ve tried searching really hard for an answer, but to no avail. In Duplicacy demolishes duplicati and rclone for large backups it’s mentioned:

And has another advantage: an open format. If you decide to change your backup provider/storage tomorrow, simply move your existing files. With Crashplan and others you get stuck in a proprietary format.

but it seems that the files are a bunch of unintelligible text files? Is there any standard the format is based on so it can be read through other means, or is there existing alternative software that can read the files?

Thank you so much!

Edit: I would also like to ask whether there is a way to verify, once the backup is done, that the backup files are intact and usable. I’ve experienced backup corruption in the past, and let’s just say it’s a terrifying experience. On a normal file system that could be done by comparing checksums, yet I’m not sure how it can be done with this chunk format. Or please tell me if my concern is not a valid one: that the copying process is really secure against errors, or perhaps that if there is corruption the directory won’t show correctly on the restore page, so that check alone should be safe enough?


Yes. The format is open source and Duplicacy itself is open source. You can compile it yourself from sources if needed. The language it’s written in is Go, an open-source language originally developed by Google for their own use, and in fact used extensively across many Google services. Google is not going anywhere any time soon. If it were to disappear — the source is hosted on GitHub. You can compile the Go compiler using the previous Go compiler, and bootstrap the whole process with Go version 1.1, which can be built with a C compiler. It does not get any more bare-bones than that.

In practice that means if you have a system with a GNU C compiler (a.k.a. any Unix/Linux/BSD/macOS/Windows system) you can (likely easily) compile Duplicacy for that system. Even if Acrosync disappears tomorrow, your files will still be recoverable for many years, and as architectures evolve there will be people who fork the project and keep it alive for themselves. (Like myself, for example: I keep a copy of the Duplicacy source code together with my backup encryption keys.)

Now about the format. Yes, Duplicacy, like many other backup tools, organizes data in a way that facilitates deduplication and enables great performance and features, and happens to be opaque. This is not a bad thing. In fact, this already happens on many other levels across your system. What you think of as a text file with “Hello World” written in it that you can read anywhere — that thing does not exist. Data on disk is just ones and zeroes. There is disk hardware, firmware, and many other logical abstractions like partitions and filesystems that allow these bits scattered across the magnetic surface to be interpreted as files, and there are text viewers that interpret the bit streams in those files to show you “hello world” in a chosen font, in the form of LEDs lighting up in a certain pattern on another piece of hardware called a display (don’t get me started on driver software specifics…). And if it wasn’t a text file? If it was a PDF or DOCX? It does not make much sense in a text viewer; you need yet another layer of abstraction to interpret PDF or DOCX and get readable text.

And yet, we don’t worry about the exact same concern there: how would I read my files if, say, Microsoft disappears tomorrow and stops supporting NTFS? Your drives and OS won’t turn into pumpkins — the existing filesystem drivers will still work. (And NTFS isn’t even open source.) Or what happens if PDF software disappears?

Duplicacy is just another layer of abstraction above the filesystem. Just like a filesystem uses blocks on disk to store and retrieve files, Duplicacy uses files to store and retrieve your backup history.

So, don’t be concerned about the opaque container — your data has been in opaque containers since day one. It’s just one more layer in the stack.

Now about verification.

Duplicacy supports many levels of verification, ranging from simple checks that all the data files needed to restore your data are present, to actually verifying the content of those chunks and comparing it against a precomputed hash.

This brings up another topic: the storage you host your backup on must provide data consistency guarantees. In practice this means it has to be bit-rot aware. Yes, Duplicacy can tell you if your backup has rotted — but that does not help you. Instead, use redundant storage with a checksumming filesystem that supports periodic scrubs and bit-rot detection and recovery. Your former backups likely got corrupted for the exact same reason. For example, if you keep your datastore on a single HDD, it is not a question of whether the backup will rot but when: this year, or two months from now.

Let me know if I missed any of your questions — this wall of text is high enough for one comment :slight_smile:


Thanks for the question @AgnosticCup.

Besides @saspus’s thorough answer, I suggest reading :d: Design and Implementation for a thorough description of :d:'s inner workings.


@saspus I appreciate you so much for explaining things to me thoroughly, down to the core, and you haven’t missed my questions. :slightly_smiling_face: And wow, that’s a lot to digest. Short disclaimer: as a non-programmer, what I say might not make full sense. I feel like I’m venturing into some deep territory, and it’s not easy to clear out these things by simply researching online, or maybe I’m not looking in the right places.

The format is opensource and the Duplicacy itself is opensource. You can compile it yourself from sources if needed. […] I keep a copy of Duplicacy source code together with my backup encryption keys.

I understand now how the widely available C is supposed to build up to this software with the right tools. I’ve taken a look at the GitHub repo (GitHub - gilbertchen/duplicacy: A new generation cloud backup tool) and downloaded the directory, and according to (Installation · gilbertchen/duplicacy Wiki · GitHub), just to be sure, the duplicacy_main.go should be the source code file that I should keep?
You’ve also mentioned that you keep a copy of your backup encryption keys; is there some particular way those keys should be stored?
Can I also assume that the code should compile well even on a newer OS without an update, since the code should be the same and it’s the compiler’s job to compile the code to fit the OS? Either way though, as you said, this project would probably be forked and updated.

Onto a problem concerning the file format: I suppose you might also have used other backup software and had to transfer the entire set of backups across time to Duplicacy (instead of the backup at one particular time). If I one day need to export everything, is there technically a way to export the entire backup history to something compatible with other similar software? It seems recursively restoring every version and then backing it up from scratch would be too slow to be feasible.

Indeed, the broken-up files are intuitively worrying, but what you say about opaque containers being everywhere is so true. I feel like the reason I don’t think too much about NTFS, PDF, etc. is because they are widely implemented, which of course is a cognitive bias. Or in this case, it seems intuitively that using the Finder copy function is a lot safer, though lacking so many features. (Or blocks on disks storing data.)

And conversely, Duplicacy is an independently developed piece of software I came across when researching, due to the inability of native backup software to copy files properly (this software and the developers here are amazing, don’t get me wrong). Also, it’s not just a matter of whether a piece of software executes the way it should, but years of accumulated data, so I’m a little over-worried. Maybe I should be more confident after having gone through Design and Implementation, and seeing so many people here who definitely know what they are doing have really good experiences with this?

I’ve taken a look at this document: Check command details. I’d like to ask whether the check task can be queued up to run automatically after each backup task? Is it overkill to run a check of all chunks each time the program does a backup?

Besides, I’d like to know whether the backed-up storage requires any kind of periodic maintenance, so to speak, or whether the files themselves are secure.

Thanks for that info! Could you provide some directions on how I should set this up (redundant storage -> more backup copies, i.e. more hard drives/cloud; checksumming filesystems supporting periodic scrubs + bit-rot detection and recovery)?
Does that mean the filesystem I back up to should not be NTFS/HFS+ but instead ZFS (it’s mentioned everywhere)? I hope the filesystem can perform the scrub, detection, and recovery automatically, otherwise upkeep might be a problem. Also, it seems there might be compatibility issues transferring from HFS+ (my Mac) to ZFS, or at least a significant speed decrease; is that the case? If so I might have to give up on macOS just to be more secure with the data.
On the other hand, is there a way to perform the necessary safety measures you’ve mentioned using just an HFS+/NTFS external hard drive or cloud storage?

Again, I know I sound like a newbie, and thank you so much for helping me out on this!

Thanks @TheBestPessimist, I’ve been reading the Design and Implementation and it’s really interesting.


Go is pretty much a cross-platform language and Duplicacy already has the OS-specific parts written. So you could compile it on Mac, make backups on Mac, then compile it on Windows, restore from the same backup storage on Windows.

But really, the main gist is, you can compile it yourself if you wanted/needed to - and the source code details how those chunks are encoded/decoded - but it’s enough to just hang onto the CLI executables in order to restore your data, because they’re standalone.

The well-known CrashPlan backup software/service also has proprietary chunk-based backup storage. Except it’s not open source, and they heavily rely on a database to properly ‘index’ backup data.

When they ended their Home-tier service (including their free PC-to-PC option), bam: all that data became inaccessible, because they linked their backup storage to cloudy accounts and basically deactivated everyone’s clients. The problem wasn’t the chunk-based storage; it was the reliance on proprietary data formats, cloud-reliance, and zero source code. Had there been source code, local backups would have been restorable.

Duplicacy gives you the source code, doesn’t rely on cloudy accounts (except perhaps the Web Edition - though the backup engine certainly doesn’t), and it’s designed in such a way as to not require a complicated database to index stuff.

So long as the integrity of your backup storage is good, and you have the decryption passphrase, you can restore long-term backups without relying on anybody.

You don’t have to use ZFS but, as with any backup solution, you should test your backups. Duplicacy has pretty good integrity checks but storage isn’t infallible. Just don’t rely on a single NTFS/HFS+ drive as your only other copy.

Duplicacy lets you copy backup storage to multiple destinations, so it’s good practice to have multiple copies but also multiple types (i.e. not just Duplicacy): the 3-2-1 rule. By copying backups from one Duplicacy storage to another - say from local to cloud - Duplicacy effectively validates the data as it’s being decrypted and re-encrypted onto the destination. Set up email alerts etc., conduct regular test restores, and Duplicacy can be a very robust solution.

If you back up directly to your own cloud, you’ll have less to worry about regarding bit rot (in theory), but you should always have more than one copy.

This comes down to your initial worries about being locked into Duplicacy’s chunk format, and why the 2 in 3-2-1 is really all about having different methods - not just media.

My advice is to employ another backup method in addition to Duplicacy, such as disk-image style backups. I use Veeam Agent for Windows, for example. They have a client for Linux; it doesn’t look like they have one for Mac at present. However, I gather there’s Time Machine and things like Acronis True Image. It wouldn’t be for long-term archival, but it could complement Duplicacy in case of emergency. Remember, Duplicacy is backup software, and backups don’t necessarily make good archives. Consider selectively archiving relevant data using appropriate means. (For instance, I don’t use Duplicacy to keep copies of large media files - they get Rclone-copied to cloud and checksummed along with lots of metadata. Add as many remote copies as you can muster.)