Missing chunks (ZFS volume)


#1

I have read through this post, along with the full post here: Lots of missing chunks with pCloud. I appear to be having the same issue, but am not sure what to do.

I am backing up locally to ZFS volume (so I don’t think it is related to some other posts I saw related to cloud storage providers). I am using duplicacy-util as a wrapper around duplicacy in case that matters.

As part of each backup, prune and check commands are also run. The check process is returning exit code 100, and thousands of lines like the one below are written to the log file (with the chunk hash differing, obviously).

06:02:24 Chunk d4cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3 referenced by snapshot firewall-local-backup-01 at revision 39 does not exist

This particular backup is revision (snapshot) 39, so it is somewhat strange that something that was just backed up is having difficulty finding a chunk.

I did perform a restore of a single (random) file and it worked successfully, but I am not sure whether one file is a valid test of the backup.

Also, the log file contains similar messages for revision 1, then skips to revision 8, and continues with similar messages for revisions 8-39. Makes me wonder what was unique about revisions 2-7?

If I try to find one of those chunks on the local disk using the “find” command, no such file exists.

In the prune logs there are lines similar to this:

Deleted fossil 2460eb32016e3c20cda2e72770c16e582319429912de9cb77b8a262dfa6af0d2 (collection 5)

then a bunch of these:

Marked fossil 1b703505f5966b059d7897cc5d63589bd29669f5f9f169754ea314bfdf0bb5ac

and finally this:
Fossil collection 6 saved
Deleted cached snapshot firewall-local-backup-01 at revision 7

Nothing about “exclusive” mode in the prune file.

I am not sure what is causing this, as I have several other backups running that are not doing this. I have tried some of the things listed in the link provided above, but I am still getting a massive number of log lines about chunks that do not exist.

I do not understand what is happening, or how to fix the error.

Are there any other suggestions on resolving this issue?

Thank You.


Fix missing chunks
#2

All topics from the old forum have been migrated to this forum here.

The topic referenced above (the bit.ly link to the old forum) can be found here in the forum:


#3

A clarification on the nomenclature used by Duplicacy, which sometimes causes confusion: the snapshot ID identifies the set of files (the repository), and revisions are the successive backups of that repository.

So, in your case:

snapshot = firewall-local-backup-01
revision = 39

So it is the 39th backup of the set of files named “firewall-local-backup-01”.

Returning to your main question:

It’s really strange. And the backup ended without errors?


#4

Does the chunk d4cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3 appear in any of those prune logs?

If you’re running an old version then this bug may cause missing chunks.


#5

This does not return any results:
find /opt/duplicacy/prefs/logs/ -type f -print0 | xargs -0 grep d4cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3

I am running duplicacy version 2.1.1 (e8b892) on CentOS 7.


#6

Thank you for the clarification.


#7

The backup does not return with errors. It completes successfully. It is the “check” that returns with error code 100.


#8

Can you make sure that chunk file doesn’t exist in the storage? It should be named cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3 in the d4 folder.
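The path layout described above can be derived mechanically: the first two hex characters of the chunk ID become the sub-directory, and the remainder becomes the file name. A minimal sketch (the storage root here is a placeholder, not a path confirmed in this thread):

```shell
#!/usr/bin/env bash
# Derive the on-disk path of a Duplicacy chunk from its ID.
# STORAGE is a placeholder; substitute your actual storage root.
STORAGE=/path/to/storage
CHUNK=d4cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3

DIR=${CHUNK:0:2}    # first two hex characters -> sub-directory ("d4")
FILE=${CHUNK:2}     # remaining characters     -> file name
echo "$STORAGE/chunks/$DIR/$FILE"
```

If I recall correctly, a chunk that has been marked as a fossil is renamed with a `.fsl` suffix, so it is worth checking for both names.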

Does duplicacy-util delete prune log files? If you keep all prune log files and still can’t find this chunk then it means it is not the prune command that deletes this chunk.


#9

It looks like all the prune logs are kept. Many of them are zero bytes in length. Since October 4, all prune logs have data in them. I have a prune log for every day starting on August 24.

I have run a find command for a file named cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3 in every d4 chunk folder that exists and no such file is found.

The d4 chunk directory, the one I believe is in question, currently contains 1779 files.

Everything seemed to be working just fine. Every day I would get an email stating everything was successful. Then all of a sudden, one day I started getting failure emails instead.

The backup seems to complete–

05:38:43 Backup for / at revision 40 completed
05:38:43 Files: 883292 total, 1774G bytes; 862 new, 2,537M bytes
05:38:43 Files: 883292 total, 1774G bytes; 862 new, 2,537M bytes
05:38:43 File chunks: 420096 total, 1777G bytes; 263 new, 1,339M bytes, 825,124K bytes uploaded
05:38:43 Metadata chunks: 87 total, 368,474K bytes; 31 new, 156,110K bytes, 47,423K bytes uploaded
05:38:43 All chunks: 420183 total, 1777G bytes; 294 new, 1,491M bytes, 872,548K bytes uploaded
05:38:43 All chunks: 420183 total, 1777G bytes; 294 new, 1,491M bytes, 872,548K bytes uploaded
05:38:43 Total running time: 00:16:41
05:38:43 17 files were not included due to access errors
05:38:43 Duration: 16:41
05:38:43 ######################################################################

Then the prune command appears to be OK–

05:38:43 Keep no snapshots older than 365 days
05:38:43 Keep 1 snapshot every 30 day(s) if older than 180 day(s)
05:38:43 Keep 1 snapshot every 7 day(s) if older than 30 day(s)
05:38:43 Keep 1 snapshot every 1 day(s) if older than 7 day(s)
. . .
05:41:26 Fossil collection 7 saved
05:41:26 The snapshot firewall-local-backup-01 at revision 8 has been removed
05:41:29 ######################################################################

Then the check command runs and a whole lot of lines like the ones mentioned earlier appear, with the last lines in the log being –

05:54:40 Some chunks referenced by snapshot firewall-local-backup-01 at revision 40 are missing
05:54:40 Some chunks referenced by some snapshots do not exist in the storage
05:54:40 Error executing command: exit status 100

Outside of changing the password for the email account used for receiving backup status emails, no other changes have been made to the configuration.

There were updates to the ZFS kernel modules and the server has been rebooted since this backup process was put in place. A “scrub” of the ZFS pool was done yesterday, and no errors were found.

I can’t really think of what might be causing the error. I am backing up a couple of Windows desktops to this same ZFS filesystem and so far am not seeing similar errors. I am also performing a more limited backup of this same server (the one producing the errors) to Backblaze cloud storage, and again I have not seen this problem with that backup. So far, this one particular server being backed up to the local ZFS volume is the only place where I have come across this issue.

Not sure if any of this information is helpful, but thought I would try to provide as much information as possible.

Thank You.


#10

Are you backing up multiple machines to the same storage location? If so, does any other machine run the prune command as well? The prune logs are only kept locally.

Which revision is the earliest one that contains this missing chunk?
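The earliest affected revision can be pulled out of the check logs mechanically. A sketch, assuming the log location quoted in post #5 (the chunk ID is the one from post #1; adjust the path and glob to wherever duplicacy-util actually writes the check output):

```shell
#!/usr/bin/env bash
# Find the lowest revision number whose check output references the
# missing chunk. The log directory is an assumption from this thread.
CHUNK=d4cb825614b7f401fd50a523efc93b4b1fddef7fff0bc595c7342a61025f08e3
grep -rh "$CHUNK" /opt/duplicacy/prefs/logs/ \
  | grep -o 'revision [0-9]*' \
  | awk '{print $2}' \
  | sort -n \
  | head -n 1
```

This works because every "does not exist" line contains the phrase `revision N`, so extracting and numerically sorting those numbers yields the earliest affected revision first.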


#11

Yes. As mentioned, I have multiple computers backing up to the same storage location: 2 x Windows laptops using SFTP connections, and 1 x Linux server which is directly attached to the storage. The 2 x Windows machines use the exact same process (the duplicacy-util wrapper), which calls the prune process nightly. The Windows backups are completing successfully and are not exiting with error code 100 during the check process.

Today’s duplicacy log looks like this (this is from the prune section)

<SNIP>
05:36:30 Keep no snapshots older than 365 days
05:36:30 Keep 1 snapshot every 30 day(s) if older than 180 day(s)
05:36:30 Keep 1 snapshot every 7 day(s) if older than 30 day(s)
05:36:30 Keep 1 snapshot every 1 day(s) if older than 7 day(s)
05:36:31 Fossil collection 5 found
05:36:31 Fossils from collection 5 is eligible for deletion
05:36:31 Snapshot desktop-d-drive revision 45 was created after collection 5
05:36:33 Snapshot firewall-local-backup-01 revision 47 was created after collection 5
05:36:36 The chunk f018e6af4ee0d271e9dd3bc685a466020caa65fb1a555ff835f33170326f48d7 has been permanently removed
</SNIP>

From the check section:

<SNIP>
05:39:15 Listing all chunks
05:49:43 Chunk adf23527a1f3a17af7267190525616db438d969a72263efb6d91cd0cdd993873 referenced by snapshot firewall-local-backup-01 at revision 1 does not exist
</SNIP>

The first few lines of today’s prune log looks like this:

<SNIP>
Snapshot desktop-d-drive revision 45 was created after collection 5
Snapshot firewall-local-backup-01 revision 47 was created after collection 5
Deleted fossil f018e6af4ee0d271e9dd3bc685a466020caa65fb1a555ff835f33170326f48d7 (collection 5)
</SNIP>

#12

If all prune logs are kept on all machines, and you still can’t find the missing chunk in any of those prune log files, then it may not be Duplicacy that deleted the chunk. The prune command is the only command that deletes chunks – all other commands can only add new chunks.

However, I don’t know if duplicacy-util would remove prune logs that are too old.


#13

Yes. I was able to find the following in the prune logs from one of the other Windows servers.
adf23527a1f3a17af7267190525616db438d969a72263efb6d91cd0cdd993873

This particular backup, where the chunk was found in the prune logs, does actually back up to the same file system, whereas the other Windows servers back up to their own file systems on the same storage.


#14

Can you post here the relevant lines in the prune logs as well as what other backups were running at the same time? A good example is Pruning Vertical Backup storage gone awry · Issue #465 · gilbertchen/duplicacy · GitHub. The detailed timing information there allowed me to quickly identify where the bug is.


#15

I dropped the ball on this, but am now coming back to it. I still have not figured out the problem, but I have another question, which may or may not be related.

I have a Linux server which hosts the physical disks where backups are being stored. I have a backup job that runs as “root” on that server and stores files here: /zfsstore/backups/mydir/snapshots. Beneath this directory are two sub-directories: one from the backup job that runs as root on this Linux server, and another created by a job that runs from a Windows server and uses SFTP to perform the backup. These two sub-directories have completely different file permissions, as the two jobs run as completely different users: one runs as root on the local server, and the other has the file permissions of the SFTP user that is used to make the SFTP connection.

The job that runs as root created a sub-directory which has a mode of 0750 and is owned by root:root, whereas the other sub-directory (the directory created by the SFTP connection) has a mode of 0770 and is owned by myuser:myuser.

I am wondering whether these file permissions have something to do with the issue I am having.
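One quick way to check the permissions theory is to compare the two sub-directories side by side. A sketch, assuming GNU `stat` (present on CentOS 7) and the path described above:

```shell
#!/usr/bin/env bash
# Print mode, owner:group, and name for each snapshot sub-directory.
# The path is the one described in this post.
stat -c '%a %U:%G %n' /zfsstore/backups/mydir/snapshots/*
```

If the root-owned 0750 directory turns out to be unreadable by the SFTP user (or vice versa), a shared group with group-read/execute on both sub-directories would let each job at least list the other's snapshot files.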

I have upgraded duplicacy to the latest version since this original post, and it has not made a difference.