Race condition between backups and prunes can lead to a corrupted snapshot

Policy 2 in the paper states:

A backup procedure cannot access any fossils. That is, it must upload a chunk if it cannot find the chunk, even if a corresponding fossil exists.

So the second backup will still upload a new chunk even if the same chunk has been turned into a fossil by the prune operation.

Moreover, Policy 3 guarantees that fossils collected won’t be deleted before the second backup is finished:

For each backup client, there is a new backup that was not seen by the fossil collection step.

Of course, this assumes that the second backup is using a different snapshot id so it is counted as a unique backup client.

This is not a bug. There is really no need to support multiple simultaneous backups with the same snapshot id. If that happens, then it is a configuration error.

  • It is possible to corrupt a backup
  • It is possible to avoid it by fixing the code
  • It is not a misconfiguration - you might run 2+ backups with the same ID because of automation tooling, etc.

Ergo, this is a bug and it can be fixed


And taking the problem the other way around, isn’t there a way to avoid this kind of misconfiguration?

I can’t think of any scenario where you are forced to use the same snapshot ID for multiple backups. Even if you need to run multiple backup clients to back up the same source directory at the same time, you can still assign a unique snapshot ID to each client.

Running multiple backups with the same snapshot ID at once is undefined behavior. Likewise, running multiple prune operations at once also leads to undefined behavior.

Yes, there may not be an obvious valid scenario to do so, but it may happen by mistake: for example, when running backups from cron. A less obvious one is when the backup duration exceeds the backup cadence.
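
For example, a perfectly innocent hourly cron entry (hypothetical, with placeholder paths matching the script below) overlaps itself as soon as one run takes longer than an hour:

# hypothetical crontab entry: if the 02:00 run is still uploading at 03:00,
# a second backup with the same snapshot ID starts alongside it
0 * * * *  cd /path/repo && /path/duplicacy/duplicacy backup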

Duplicacy shall prevent user mistakes from causing corruption whenever possible, especially when that scenario is just a user running a plain, innocent backup command.

The fix is very simple, and there is no good reason not to implement it.

#!/bin/bash

PROC_NAME='DUPLICACY_BACKUP'

# abort if a backup tagged with this comment is already running
pgrep -f "$PROC_NAME" > /dev/null && exit 1

# otherwise, do the backup
cd /path/repo || exit 1
/path/duplicacy/duplicacy backup -comment "$PROC_NAME"

Is it a simple fix, though?

We’ve established that only the local file and SFTP backends use a .tmp file before uploading. So it remains to be seen whether this technique can be done with other backends and cloud APIs.

(In fact I’m curious now… does the above test script fail with local backends?)

Not really, you may have multiple duplicacy processes backing up to different destinations.

Also you have a race condition there. So it’s not a good solution.

Furthermore, users shall not be required to concoct workarounds to prevent data corruption when using the published CLI of a fully concurrent backup program.

It is, but even if it wasn't, it would not matter. It's a bug that causes data loss using CLI commands and nothing else. It must be a top priority.

I don’t think this is related to backend implementation details. The problem seems to be that two instances end up backing up the same revision.

Duplicacy shall “claim” the next revision at the start (by uploading a marker/placeholder; this can be done atomically), or use GUIDs as revision identifiers instead of numbers. This will eliminate the possibility of two backup jobs uploading the same revision.

The same mechanism is already used elsewhere (fossilization), so it's not really bleeding-edge tech.
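
To illustrate the “claim” idea, here is a very rough sketch against a local-disk storage, assuming the usual snapshots/<snapshot id>/<revision> layout; the paths and behaviour are assumptions, not what the CLI does today:

#!/bin/bash
# Hypothetical sketch: atomically claim the next revision number on a
# local-disk storage before doing any real backup work.
STORAGE=/path/storage            # assumed storage root
SNAPSHOT_ID=my-repo              # assumed snapshot ID
DIR="$STORAGE/snapshots/$SNAPSHOT_ID"

mkdir -p "$DIR"
set -o noclobber                 # '>' now fails if the target already exists

# highest existing revision number, or 0 if there is none yet
last=$(ls "$DIR" 2>/dev/null | sort -n | tail -1)
next=$(( ${last:-0} + 1 ))

# atomic create: of two concurrent backups, only one can win this revision
if : > "$DIR/$next"; then
    echo "claimed revision $next"
else
    echo "revision $next already claimed by another backup" >&2
    exit 1                       # or retry with next+1
fi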

This was a simple example, and trivially easy to append a snapshot ID (and other stuff) to $PROC_NAME.

Quite frankly, I consider this the preferred method to avoid two backups running on the same ID.
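
For example (a sketch; SNAPSHOT_ID here is just whatever ID your wrapper script already knows about):

# hypothetical variation of the script above: guard per snapshot ID
SNAPSHOT_ID='my-repo'
PROC_NAME="DUPLICACY_BACKUP_${SNAPSHOT_ID}"

# only aborts if a backup for this particular ID is already running
pgrep -f "$PROC_NAME" > /dev/null && exit 1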

Let’s say you ‘detect’ another backup - what are you going to do other than abort the whole backup anyway? It’d be undesirable to allow both backups to complete - even if they save to different revision numbers - so why waste resources while they run 'til the end?

I’d want it to stop at the earliest opportunity and run one instance. -comment lets you do exactly that.

So you’re talking about a lock file?

Duplicacy randomises .tmp file names to allow concurrency - doing anything else forces you to run a second backup, because you can’t be sure a placeholder wasn’t left there by an aborted run.

Then what to do about other backends?

What to do about multiple prune runs?

There could be an argument made about using a local lock file for all such occurrences - say one for each ID, perhaps with a timeout, or even an internal ps check built-in. Fine. But I’m just not convinced changing the behaviour on the backend is necessary or as easily achievable as is claimed, without fucking things up.


Oops. Indeed. We don’t want that.

The lock-less solution would be to use GUIDs for revision file names, thus letting both backups proceed.

With GUIDs it will “just work”: both backups will succeed. The “version number” will be meaningless, but nobody cares about it anyway; the time of the snapshot is what’s important.

I don’t think revision names are used anywhere outside of revision management, so I don’t think anything will get screwed.
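
As a minimal sketch of the idea (uuidgen is just for illustration, and the storage variables are assumptions carried over from the sketch above; the actual naming scheme would be up to Duplicacy):

# hypothetical: name the revision file with a GUID instead of a sequence number
REV_ID=$(uuidgen)                                   # e.g. 3f2b9c1e-...
touch "$STORAGE/snapshots/$SNAPSHOT_ID/$REV_ID"
# two concurrent backups generate different GUIDs, so neither can collide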

You still wouldn’t want two backups running at the same time on the same ID.

GUIDs sound like they’d need some kind of database, which you’d have to read the contents of a file (or all files) for.

Revision numbers are dead simple to understand and behave (correctly so) as a sequence number; you only have to list the directory contents to know how to do the next backup.

(Duplicacy also allows revision ranges as a flag - although I’d love it if it wouldn’t error out when individual revisions in a specified range don’t exist :) )

Sure you could replace this number with a date in the format YYYYMMDDHHmmss but that doesn’t necessarily solve anything unless you add microseconds as well (ssssss), and even that doesn’t completely eliminate the problem. If I have to manually triage or diagnose issues on a storage, I’d much rather see simple version numbers than long ass timestamps or random IDs!

If it were me, I’d rather not take the risk. This so-called bug is incredibly rare to trigger, and avoidable through current means, so it doesn’t warrant tinkering with all the different types of backends at this stage.

IMO this whole issue can be fixed by creating (locally) a classic .duplicacy\cache\<ID>\.lock.

This is FAR preferable to having a placeholder on a remote backend, or some complicated manipulation of revision naming. A second backup is stopped instantly and the bug is prevented.
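
For what it’s worth, here is a rough sketch of that lock using flock on a Unix-like system (the cache path mirrors the one above; none of this is current CLI behaviour):

#!/bin/bash
# hypothetical per-ID local lock, along the lines of .duplicacy/cache/<ID>/.lock
SNAPSHOT_ID='my-repo'
LOCK_FILE="/path/repo/.duplicacy/cache/${SNAPSHOT_ID}/.lock"

mkdir -p "$(dirname "$LOCK_FILE")"

# open the lock file on descriptor 9 and try to take an exclusive lock;
# -n makes flock fail immediately instead of waiting
exec 9> "$LOCK_FILE"
if ! flock -n 9; then
    echo "another backup for ${SNAPSHOT_ID} is already running" >&2
    exit 1
fi

cd /path/repo || exit 1
/path/duplicacy/duplicacy backup
# the lock is released automatically when this process exits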