Avoiding new uploads

Given how important the contents of my Evernote are to me, and because I do not trust myself never to accidentally delete the entire Evernote database, I run a script every night that syncs a local copy of the Evernote database and then exports it into .enex files, one file per notebook.

I’d like to add these notebooks to my offsite backup. The problem is that most notebooks never change, yet each daily export gets a new date stamp AND one line inside the file changes:

% diff Travel\ Archive.enex ~/Travel\ Archive_jan23.enex
3c3
< <en-export export-date="20220123T184655Z" application="Evernote" version="10.10.5">
---
> <en-export export-date="20220122T184657Z" application="Evernote" version="10.10.5">

These .enex files can get quite big, so I’m looking to avoid huge uploads happening every single night.

To better understand Duplicacy’s inner workings, my questions are:

  1. How does Duplicacy decide when a file needs to be updated at the backup destination – based on the last-modified stamp, or based on content changes?
  2. And if it decides a file needs updating, will it re-upload the entire file or just the lines that have changed?

By last-modified time, unless the -hash flag is passed, in which case by content.

It depends on the file, but the aim is to upload only the differences.

From the Duplicacy paper accepted by IEEE Transactions on Cloud Computing:

The variable-size chunking algorithm, also called Content-Defined Chunking, has become well-known in the industry [3], [4]. The idea is to use a rolling hash, similar to that of Rsync, but only to identify breakpoints whose hash values follow a specific pattern, for example, ending with a given number of zeros [12]. These breakpoints serve as the boundaries between chunks, and the selected pattern controls the expected size of chunks. The main advantage is that a lookup to check duplicates is performed only after a breakpoint has been identified (which indicates the end of a chunk), rather than one per byte as in the case of Rsync.

In some extreme cases, e.g. when backing up virtual machine images – where small changes occur inside a huge file – a fixed-size chunking approach may work better. This is configurable, but Evernote files are so small that I would not worry about possible deduplication inefficiency whatsoever.
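
To make the quoted idea a bit more concrete, here is a toy Python sketch of content-defined chunking with a rolling hash. It only illustrates the mechanism – the hash, window size, and mask are arbitrary choices and this is not Duplicacy’s actual code:

# Toy content-defined chunking: a Rabin-Karp style rolling hash over a small
# window, with a chunk boundary wherever the hash's low bits are all zero.
# Illustrative only -- not Duplicacy's implementation.
BASE = 257
MOD = (1 << 31) - 1                 # keeps the modular arithmetic simple
WINDOW = 48                         # bytes contributing to the rolling hash
MASK = (1 << 20) - 1                # 20 zero bits -> ~1 MiB expected chunk size
POW = pow(BASE, WINDOW - 1, MOD)    # weight of the byte leaving the window

def chunks(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD   # drop the outgoing byte
        h = (h * BASE + b) % MOD                     # add the incoming byte
        # Boundary: the window hash matches the chosen pattern.
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

Because a boundary depends only on the bytes around it, editing one line near the top of a file shifts at most a chunk or two; the remaining chunks hash to the same values and deduplicate against what is already in the storage.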

Thank you for your clear reply. Much appreciated.

Some of my notebooks are larger than 1GB and the total is 7.4GB, which is not something I’d like to upload each night if not needed.

When using -hash, does Duplicacy figure out chunk sizes itself, or have sensible defaults, so that re-uploads are reduced to a few megabytes or so?

That’s still quite small :), but it definitely won’t be uploading that much – the chunk size by default is variable and on the order of a few MB, so that is the granularity/overhead.

Let’s not conflate the way to detect change with chunking and deduplication.

The -hash flag will result in a scan of the file contents to detect changes. Without -hash, the timestamp will be used, which should be enough. (Using -hash has some other side effects as well that may affect deduplication efficiency.)

Rolling-hash chunking will take care of minimizing the upload to the new data plus overhead.

You can (and should) test it easily though: back up to a local folder, check the size of the datastore, change something in a notebook, back up again, and check the size of the datastore again.
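
A rough sketch of that test in Python – the storage path is a placeholder, and it assumes the duplicacy CLI is on your PATH and the repository holding the exports has already been initialized against a local storage folder:

import os
import subprocess

STORAGE = "/tmp/duplicacy-local-storage"   # placeholder: your local storage folder

def storage_size(path: str) -> int:
    # Total bytes of everything currently in the storage folder.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

before = storage_size(STORAGE)
# Run from inside the initialized repository; -stats prints chunk statistics.
subprocess.run(["duplicacy", "backup", "-stats"], check=True)
after = storage_size(STORAGE)
print(f"Datastore grew by {(after - before) / 1024 / 1024:.1f} MB")

Change something in a notebook, run it again, and the printed growth is roughly what a nightly run would upload.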

I was rather curious, so I decided to run the trial you suggested.

Letting Duplicacy figure everything out by itself and simply using the -hash option, the difference in size of the backup destination after a second backup was 270MB. Wondering if this could be reduced any further, I ran the experiment a second time with a fixed 1MB chunk size, and this ended up with a 39MB difference.

While not a crazy amount of data, given that only 48 lines actually changed this is still a chunky delta. But perfectly fine for practical use!

Thanks very much for your suggestions!

My Evernote database is also very precious to me, so I run two types of backup:

  1. I back up the database directly with Duplicacy, using fixed-size chunks. It works very well! It’s fast and efficient.

  2. I export the enex files to a temporary folder and back up these enex files also with Duplicacy, in this case with the default chunk configuration. It also works very well.

Contrary opinion: fiddling with Duplicacy isn’t the right solution to this problem. The files and their timestamps are, in fact, changing and Duplicacy has a duty to reflect those changes.

I looked at the source for evernote-backup and found the code that writes the differing export-date text. The program is committing the twin sins of not exporting the data in an idempotent way and not setting the timestamp on the file to reflect when the note was last changed. (There’s a good argument that it shouldn’t do the latter because it’s a backup tool rather than a sync tool.)

The path of least resistance would be to simply remove the line from the program. Duplicacy will (again, correctly) note in its metadata that the file’s timestamp has changed but won’t upload new chunks because the content hasn’t.

Wow I’m loving the engagement on this forum! Thanks everyone for the comments!

@mfeit_duplicacy you raise a very interesting point and I can absolutely see where you’re coming from. Just removing the line doesn’t prevent the file from being rewritten with a changed last-modified stamp, I think. And if I’m not using the chunk/hash approach, I think that means the whole file gets re-uploaded?

Making sure the last-modified stamp matches the notebook’s last-modified time would indeed be the most elegant solution.
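
Something along these lines is what I have in mind – a rough post-processing sketch rather than a patch to the tool, assuming each note in the .enex carries an <updated> (or at least <created>) timestamp in the usual 20220123T184655Z format; the export folder is a placeholder:

import calendar
import os
import re
import time
from pathlib import Path

STAMP = re.compile(r"<(?:updated|created)>(\d{8}T\d{6}Z)</(?:updated|created)>")

def normalize(enex: Path) -> None:
    text = enex.read_text(encoding="utf-8")
    # Blank the only line that differs between otherwise identical exports.
    text = re.sub(r'export-date="[^"]*"', 'export-date=""', text, count=1)
    enex.write_text(text, encoding="utf-8")
    # Set the file's mtime to the newest note timestamp found inside it.
    stamps = STAMP.findall(text)
    if stamps:
        mtime = calendar.timegm(time.strptime(max(stamps), "%Y%m%dT%H%M%SZ"))
        os.utime(enex, (mtime, mtime))

for f in Path("/backups/evernote/enex").glob("*.enex"):   # placeholder folder
    normalize(f)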

That’s correct. If the file was rewritten by a new export, I’d want Duplicacy to reflect that in the backup set so that it’s correct when restored.

The default behavior (detecting by size and timestamp) is fine because even though the timestamp has changed, the contents will still become the same file chunk. Duplicacy will see that it already has that file chunk in the repository and won’t upload a fresh copy. It will upload a new metadata chunk reflecting the new timestamp, but that’s less than a megabyte in aggregate if the backup set includes a few thousand snapshots.

You can see this in action on a small scale using my Duplicacy Sandbox. See my comments, which begin with //:

// get set up
$ make build
mkdir -p storage/default
mkdir -p root

// Create a file and back it up.
$ echo "foo" > ./root/foo

$ make backup
#
# Backing up to default
#

Storage set to /tmp/duplicacy-sandbox/storage/default
Downloading latest revision for snapshot sandbox
Listing revisions for snapshot sandbox
No previous backup found
Indexing /tmp/duplicacy-sandbox/root
Parsing filter file /tmp/duplicacy-sandbox/root/.duplicacy/filters
Loaded 0 include/exclude pattern(s)
Packing foo
Packed foo (4)
// New file chunk
Uploaded chunk 1 size 4, 4B/s 00:00:01 100.0%
Uploaded foo (4)
Listing snapshots/
Listing snapshots/sandbox/
Listing chunks/
Backup for /tmp/duplicacy-sandbox/root at revision 1 completed
Files: 1 total, 4 bytes; 1 new, 4 bytes
// One new file chunk and three new metadata chunks
File chunks: 1 total, 4 bytes; 1 new, 4 bytes, 13 bytes uploaded
Metadata chunks: 3 total, 331 bytes; 3 new, 331 bytes, 359 bytes uploaded

All chunks: 4 total, 335 bytes; 4 new, 335 bytes, 372 bytes uploaded
Total running time: 00:00:01

// Rewrite the file with the same contents, which will have a different timestamp
$ echo "foo" > ./root/foo

$ make backup
#
# Backing up to default
#

Storage set to /tmp/duplicacy-sandbox/storage/default
Downloading latest revision for snapshot sandbox
Listing revisions for snapshot sandbox
Last backup at revision 1 found
Indexing /tmp/duplicacy-sandbox/root
Parsing filter file /tmp/duplicacy-sandbox/root/.duplicacy/filters
Loaded 0 include/exclude pattern(s)
Packing foo
Packed foo (4)
// This chunk is already in the backup set, so no upload:
Skipped chunk 1 size 4, 4B/s 00:00:01 100.0%
// This is a little misleading:
Uploaded foo (4)
Listing snapshots/
Listing snapshots/sandbox/
Listing chunks/
Backup for /tmp/duplicacy-sandbox/root at revision 2 completed
Files: 1 total, 4 bytes; 1 new, 4 bytes
// No new file chunks were uploaded but one new metadata chunk was:
File chunks: 1 total, 4 bytes; 0 new, 0 bytes, 0 bytes uploaded
Metadata chunks: 3 total, 331 bytes; 1 new, 260 bytes, 269 bytes uploaded
All chunks: 4 total, 335 bytes; 1 new, 260 bytes, 269 bytes uploaded
Total running time: 00:00:01

Thank you for the added comments.

This morning I decided to do two more experiments with my actual data:

  1. No changes to the export script, running Duplicacy without -hash explicitly specified and without the 1MB chunk size configured. My original fear that full files would be uploaded was unfounded: the difference was 270MB, exactly the same as when -hash is explicitly added.

  2. Edited the export script. I left the line you linked to in there, but removed the dynamic {export_date} element so that it writes export-date="" into the file; I was worried that removing the whole line would break file compatibility. A new export of the file is now exactly identical to the previous version, and so the backup revision was an excellent 6KB, as per your prediction.

I may submit a feature request to the repository, asking for a --no-export-date option to be implemented. Sadly, I do not have the skills to submit such a change myself.

Thanks for the recommendations!!!

Note: The script checks whether prior export files are present and, if so, creates new files with an incremental counter appended to the filename. There does not seem to be a way to disable this, so I’ve had to create a wrapper script that removes the old files before a new export.
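
For what it’s worth, the wrapper is nothing fancy – roughly the following, with the export folder as a placeholder and assuming evernote-backup export takes the output directory as its argument (which is how I call it):

import subprocess
from pathlib import Path

EXPORT_DIR = Path("/backups/evernote/enex")   # placeholder export folder

# Clear yesterday's exports so the tool doesn't start appending counters.
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
for old in EXPORT_DIR.glob("*.enex"):
    old.unlink()

# Refresh the local database, then export one .enex per notebook.
subprocess.run(["evernote-backup", "sync"], check=True)
subprocess.run(["evernote-backup", "export", str(EXPORT_DIR)], check=True)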

I had another look at the documentation this morning and noticed that the process is evernote-backup sync followed by evernote-backup export. The docs imply that sync has a local database that’s updated only when the source material at Evernote changes. Backing up the directory holding that database will result in new or changed chunks being backed up, which makes backing up your exports redundant.

If you want to keep daily exports of your notes around but don’t want them backed up because they change, put them someplace that your Duplicacy filters exempt. I do this at a global level by establishing a standard directory name to be skipped:

# Don't back up anything named 'nobackup' at the top of the tree or anywhere below it
-nobackup/*
-*/nobackup/*

You’ll still have the internal database in your backup set and the worst case is that you have to do an export once after a restore.

Thanks for the suggestion. I get what you are saying, but I prefer keeping the backup in a format that other tools can use, which .enex is. Plenty of tools can read those files and import them into other notes programs. I know I could always recreate them from the database, but why not save them like that in the first place?

I did. And it works a treat!

