[Cold Storage] Compatibility with OVH Cloud Public Archive

Rastagong · 25 April 2019 11:13

Hello everyone,

Being interested in using OVH’s Public Cloud Archive together with Duplicacy, I did a bit of research on the topic.
Since it is a kind of so-called “cold” storage, it has a few particularities, some of which may not work so well with Duplicacy. I tried to find out to what extent it could still be usable.

Please note that this was originally posted in issue #41 in Duplicacy’s GitHub repository, but @TheBestPessimist invited me to resume the discussion here on the forum instead, to reduce clutter on GitHub. (I edited a few passages from the original for the sake of clarity.)

This is probably for the best since much of this discussion is purely speculative —this is not a feature request per se, nor a tutorial, just a few thoughts on using cold storage together with Duplicacy.

Anyway, here’s what I found out!

Cold Storage in a nutshell

As a reminder and for clarity’s sake, cold storage schemes provide long-term storage of data which have no need to be frequently/rapidly accessed.

While the actual technologies differ, cold storage usually implies using cheaper and more compact technologies (like magnetic tapes), or technologies which do not need to be networked and running 100% of the time. This reduces storage costs, at the price of higher costs/latency when you actually need to retrieve data.

In other words, cold storage tends to be cheaper per volume of data than regular storage, provided you don’t need to access your data very often.
Hence its potential use to store backups generated by Duplicacy!

Cold Cloud Storage Providers

Several cloud storage providers have started offering cold storage tiers. These include Amazon Glacier, OVH Public Cloud Archive, Google Coldline, Microsoft Cool Blob Storage and Online C14.
For almost all of these vendors, cold storage is offered as adjacent to their mainline cloud storage tier (Amazon S3, OVH Object Storage, etc).
Some of these vendors in fact offer the possibility of programming rules so that older data located in “hot” storage containers may be automatically transferred to cold storage containers at some point in their lifetime.

In practice, all of these vendors rely on different technologies, and provide very different interfaces to upload and retrieve data to and from the storage. In addition, their actual pricing schemes tend to differ significantly from each other.
All this to say, it’s probably hard —if not outright impossible— to provide universal interfaces for cold cloud storage providers.

OVH’s Public Cloud Archive (PCA) definitely stands out of the crowd, though.
Like their regular Object Storage service, PCA relies on OpenStack Swift, an open protocol that Duplicacy has supported since version 2.1.0 (March 2018). In addition, OVH provides an SFTP bridge, which Duplicacy can also use… but which might not fully work, for reasons detailed in the next section.

As such, OVH’s Public Cloud Archive is probably one of the only cold storage solutions that Duplicacy may possibly use out of the box.
I’ll focus on OVH PCA in the remainder of this summary for these reasons!

Key Facts on OVH Public Cloud Archive

Useful links:

Developer Guide
The entire table of contents of OVH’s cloud storage documentation (the PCA section is at the bottom right)
API Particularities of the OpenStack Swift protocol when used with PCA
Tutorial to unfreeze objects from the admin panel

The upload of data is priced, the download of data is priced (at the same rate), the monthly storage of data is also priced (at a very low rate).

The upload of data is straightforward. When using the OpenStack Swift protocol, you simply need to create an object inside a container. When using SFTP, you simply need to upload files.

The tricky part is the retrieval of data. Since this is a cold storage service, data is not meant to be accessed often, and you first have to “unfreeze” any object that you want to download. This can be performed either:

Manually, by clicking on “unfreeze” in the admin panel of the service.
Automatically, by attempting to download the object through an OpenStack Swift request (I’m not sure it works through SFTP, so OpenStack Swift may be mandatory). The server will respond with error code 429, but actually begin the unfreezing process (per the API particularities page).

Once unfreezing has begun, it takes some time for the object to be made available for download. The first time I requested unfreezing on a file, it took 4 hours.
The remaining time until unfreezing is visible in the admin panel. When you request a download (and thus unfreezing) through OpenStack Swift, it’s also provided in a header of the response.

Once unfrozen, an object remains available for download for 24 hours (per the developer guide), before becoming frozen again.

Contrary to other cloud services, OVH PCA does not make you pay anything for unfreezing an object (just for actually downloading, as explained above).

However, per the developer guide, it is “designed for seldom consulted data: the less frequently an archive unsealing operation is requested, the smaller the retrieval latency”.
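To make the unfreezing flow described above concrete, here is a minimal Python sketch of a "request, wait, retry" loop. It is not taken from Duplicacy or from OVH's documentation: the `fetch` callable is a hypothetical stand-in for one Swift download attempt, and the header name `Retry-After` (holding the remaining delay in seconds) is an assumption, since the text above only says the delay is "provided in a header of the response".

```python
import time
from typing import Callable, Mapping, Tuple

# (status_code, headers, body) as returned by one download attempt.
Response = Tuple[int, Mapping[str, str], bytes]


def download_when_unfrozen(
    fetch: Callable[[], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
    default_wait: float = 600.0,
) -> bytes:
    """Retry a download until the object has been unfrozen.

    Assumption: a frozen object answers 429 and reports the remaining
    delay, in seconds, in a `Retry-After` header.
    """
    for _ in range(max_attempts):
        status, headers, body = fetch()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected status {status}")
        # The first 429 is what starts the unfreezing; wait as long as asked.
        wait = float(headers.get("Retry-After", default_wait))
        sleep(wait)
    raise TimeoutError("object still frozen after max_attempts tries")
```

Since frequent unsealing requests may lengthen the latency, the loop deliberately sleeps for the full announced delay instead of polling at short intervals.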

How well does Duplicacy play with OVH PCA?

It manages to initialise a PCA storage easily!
For making backups, it works almost perfectly. Since its chunks are immutable, and since it can make quick lookups through the name of a chunk alone, Duplicacy manages to upload backups!
… That is, it manages to perform backups, except for the existence of the config file. Since Duplicacy has a databaseless approach, it downloads the config file from the storage every time, even to perform backups. This is problematic since the config file needs to be unfrozen prior to download.
For any operation which involves as much as reading the contents of a single chunk (typically, to restore data, but also probably for pruning), Duplicacy will be heavily dependent on unfreezing a lot of objects, and probably inadequate.
One aspect I’m unsure and worried about is whether Duplicacy needs to read metadata chunks to perform simple backups. My understanding is that metadata chunks are created only when a snapshot file gets large enough, and that they are needed only during restoration of data. If they’re actually needed to perform further backups, then unfreezing will be a major obstacle for backups too.

In other words, Duplicacy works almost perfectly with PCA for uploading data; the only roadblock is the download of the config file (and potentially metadata chunks, I sure hope I’m not wrong!!).

For any recovering and pruning, Duplicacy will be pretty inefficient, and will need to wait for unfreezing.

So technically, Duplicacy is already usable with OVH PCA, provided we connect the storage through OpenStack Swift. Duplicacy will automatically trigger unfreezing of the files it needs by attempting to download them (it could also be triggered manually), but a lot of waiting time will be involved before actually backing up or restoring.
Coupled with the fact that frequent unfreezing requests may actually lengthen this delay… this irrevocably makes restoring data from OVH Public Cloud Archive very inefficient.

Potential ways to still use OVH PCA with Duplicacy

Using it as a secondary storage only

This inadequacy for data restoration is probably in line with the purpose of cold storage in the first place?
I mean that you should probably not attempt to make cold storage your “main” storage, since data retrieval from it will always be slow and costly.

On the other hand, since Duplicacy supports multiple storages, it could be feasible to back up to and restore from a primary storage, and to simply back up to PCA as a secondary storage without ever restoring from it. This would grant additional long-term preservation of data, without the time cost of restoring from PCA.
The only problem with this approach is again the download of the config file, which must be unfrozen before any backup.

Using rclone

A potential counter to this config download problem is to manually duplicate the chunks from the primary storage to the PCA storage (provided the PCA storage is made copy-compatible and bit-identical with the primary storage).

In fact, rclone could be used to duplicate these chunks; since it doesn’t need to download a config file, it will be able to upload data to the storage easily.
In addition, since version 1.47.0 (released about 10 days ago), rclone is compatible with OVH PCA: it can now download from PCA too by waiting for the appropriate unfreezing time (see issue #341 of rclone).

A potential workflow with rclone would therefore be:

Use Duplicacy to make backups to your primary storage
Use rclone to duplicate the chunks from the primary storage to the PCA storage
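The two steps above could be scripted roughly as follows. This is only a sketch: the storage name `default` and the rclone remotes `primary:backups` and `pca:backups` in the usage note are hypothetical placeholders, and the script assumes it is started from the Duplicacy repository directory with both tools on the PATH.

```python
import subprocess
from typing import List


def build_commands(storage_name: str, primary: str, archive: str) -> List[List[str]]:
    """Return the two commands of the workflow, in order."""
    return [
        # Step 1: back up to the primary storage with Duplicacy.
        ["duplicacy", "backup", "-storage", storage_name],
        # Step 2: copy new files to the archive. `rclone copy` only adds
        # files at the destination; it never deletes anything there.
        ["rclone", "copy", primary, archive],
    ]


def run_workflow(storage_name: str, primary: str, archive: str) -> None:
    """Run both steps, stopping at the first command that fails."""
    for command in build_commands(storage_name, primary, archive):
        subprocess.run(command, check=True)
```

For example, `run_workflow("default", "primary:backups", "pca:backups")` would back up first and then mirror the new chunks. `rclone copy` is used here rather than `rclone sync` so that nothing is ever removed from the archive side.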

Can other tools deal with OVH PCA?

In their documentation, OVH recommends using PCA together with Duplicity.
Duplicity seemingly generates different chunks for the metadata and the actual data to be backed up. And while it lacks a lot of features of Duplicacy, it can be used in a multibackend configuration, where some files are backed up to a storage, and other files to another storage. This can be used to back up only actual data to the Public Cloud Archive, while metadata (which are needed for many operations) are backed up to the regular OVH Object Storage, from which they can be easily recovered. This basically sidesteps the config download problem of Duplicacy.
Not fully confirmed, but Duplicati 2.0 will be configurable to work with a local database instead of a remote one, thus it should be able to back up to OVH PCA without needing to download anything before. The beta is already available.

Both of these tools have obvious downsides compared to Duplicacy, but I thought I’d list those which can potentially deal with PCA anyway.

Potential evolutions of Duplicacy to deal with PCA?

Storing the config file in another storage

If Duplicacy could download the config file from a special storage, while uploading backups to another, “real” storage, then it would be able to freely back up to PCA.

To test this, I vaguely thought of creating a fork of Duplicacy myself with a new kind of storage: a metastorage (or parent storage) which could contain several child storages.
When asked to download a chunk, this metastorage would freely choose to which child storage to defer the download call.
In practice, we could use this metastorage to encompass an OVH Object Storage where the config file would be stored, and an OVH Public Cloud Archive where everything else would be stored.

This way, Duplicacy would be able to download the config file all the time from the Object Storage, while actually backing up to PCA. It wouldn’t facilitate restoration from PCA, but hopefully, it should be enough to back up?
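The metastorage idea can be illustrated with a small Python sketch. It does not reflect Duplicacy's actual (Go) storage interface: `MemoryStorage` and `MetaStorage` are hypothetical names, the in-memory dict stands in for real backends, and the routing rule (only the file named `config` goes to the hot child) is an assumption for illustration.

```python
from typing import Dict


class MemoryStorage:
    """Stand-in for a real backend; keeps files in a dict."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def upload(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def download(self, path: str) -> bytes:
        return self.files[path]


class MetaStorage:
    """Parent storage: config goes to the hot child, the rest to the cold one."""

    def __init__(self, hot: MemoryStorage, cold: MemoryStorage) -> None:
        self.hot = hot
        self.cold = cold

    def _child_for(self, path: str) -> MemoryStorage:
        # Hypothetical routing rule: only the config file is kept hot.
        return self.hot if path == "config" else self.cold

    def upload(self, path: str, data: bytes) -> None:
        self._child_for(path).upload(path, data)

    def download(self, path: str) -> bytes:
        return self._child_for(path).download(path)
```

With an Object Storage container as the hot child and a PCA container as the cold child, reading `config` would never touch a frozen object, while every chunk would still land in the archive.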

[Not Possible] Storing metadata chunks in another storage, like Duplicity

In issue #537 in February, someone asked if Duplicacy could use the Duplicity method of storing metadata chunks and content chunks across different storages, but it was said to be “almost impossible without redesign the chunk structure”.

This makes me hope that metadata chunks are truly not needed to perform backups, otherwise, even creating a metastorage would not be enough.

Conclusion

That’s it!
I hope I did not write any factual errors —please feel free to point them out if that is the case.
In short, I don’t think cold cloud storage will ever be a sound choice as primary storage for Duplicacy. Hopefully, it will eventually be possible to use one as a secondary, write-only storage (and in that case, OVH Public Cloud Archive will most likely be the easiest one to use).

As for myself, I’m still considering whether to just use OVH Object Storage instead, or to incorporate rclone into my backup workflow to manually transfer chunks to PCA from a local storage.
Alternatively, and if it isn’t conceptually unfeasible, I’d like to see if I could write a metastorage to simply store the config file elsewhere.

Rastagong · 25 April 2019 11:32

Ah, one significant addition: I did not clarify what I had actually tried myself.

In particular, I said that save for the config file, Duplicacy could already back up to PCA, but I only tested it by using the copy command, which successfully copied chunks from a primary storage.
In other words, I’m not actually sure that the config file is the only roadblock for the backup command to work; there may be other files to unfreeze.

While doing more reading on the forum, I stumbled on this post by Gilbert Chen from a related discussion on Amazon Glacier (emphasis is mine):

So, to perform a backup, Duplicacy would in fact need to download not only the config file, but also the snapshot file (and hence metadata chunks?), as well as several content chunks. Suffice to say, that’s a lot to unfreeze, and this makes PCA potentially much less usable.

On the other hand, it’s probable that when using the copy command, it is in fact sufficient for Duplicacy to have access to the config file, and to nothing else? At least, it worked for me as long as the config file was unfrozen. This cements my suspicion that cold storage will be practically usable only as a secondary storage.

Sorry for the confusion, I should have been clearer on this point in the original post.

towerbr · 26 April 2019 00:06

My two cents…

From time to time I evaluate cold storages, and I always come to the same conclusion: given the price difference, it’s not worth it.

Hot storage is around $0.005/GB. The cheapest cold storage is around $0.003/GB.

It’s a very small price difference to justify all the headache you will have with restore time, download fees, etc. At most providers, you will have to make the transition to a hot storage and then download.

For the price difference to be worth it, you would have to use hundreds of terabytes.
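As a rough check of that estimate, the storage saving can be computed directly. The sketch below assumes the two prices quoted above are per GB per month, takes 1 TB as 1000 GB, and ignores upload, download and retrieval fees.

```python
def monthly_saving(size_gb: float, hot_per_gb: float, cold_per_gb: float) -> float:
    """Monthly storage saving from picking the cold tier (transfer fees ignored)."""
    return size_gb * (hot_per_gb - cold_per_gb)


for size_tb in (1, 10, 100):
    saving = monthly_saving(size_tb * 1000, 0.005, 0.003)
    print(f"{size_tb:>4} TB: ${saving:.2f} saved per month")
```

Under these assumptions the saving is about $2 per month for 1 TB and about $200 per month for 100 TB, before any retrieval cost is counted.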

So, I’m still happy with my B2 …

Rastagong · 26 April 2019 07:33

You’re very probably right, yeah! It certainly seems like it’s going to remain a fairly marginal option.

That being said, I’m used to OVH for hosting services, so I’ll probably roll with their mainline Object Storage, which is actually 0.01€ per GB, while their cold storage is 0.002€ per GB (5 times cheaper).
I’m sure there are cheaper “hot” storage options out there (Backblaze that you mention seems to be the leader in that matter?), but I don’t necessarily think these prices are an industry standard!

At least personally, I’m fine with ultimately paying more to stay with a company I trust and already use extensively (I also enjoy the number of open protocols that OVH provides for data upload and retrieval). I’m sure someone extensively using Amazon’s cloud might have similar reasoning with Glacier and S3.
So Public Cloud Archive was always one alternative OVH formula out of 2 for me, cold storage in itself was not a primary requirement.

But yeah, given the technical constraints, cold storage does look very unwieldy in the long term when coupled with Duplicacy.
Another potential use for it could be storing occasional full system snapshots without Duplicacy, just for the sake of having different backup formats?

towerbr · 26 April 2019 18:27

I think so, closely followed by Wasabi. Cloud Storage Pricing Comparison: Amazon S3 vs Azure vs B2

Maybe… It would be one more backup location in another format, something like a “4” step in the 3-2-1 backup strategy.

Extra security is never too much; you should only consider the cost-benefit and how much extra “administration” this will require.

gchen · 26 April 2019 19:08

I can confirm the config file is the only roadblock for running backups. The metadata chunks aren’t really needed if you use the -hash option, as Duplicacy downloads the last snapshot as an optimization to avoid too many lookups (all chunks referenced by the last snapshot are assumed to exist). If you don’t need this optimization and instead list all existing chunks in the storage before the backup, then you can avoid downloading the last snapshot and any of its metadata chunks.
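The "list all existing chunks" alternative can be pictured with a simplified Python sketch. It is not Duplicacy's actual code: `chunks_to_upload` is a hypothetical helper, and `existing` is assumed to come from a plain listing of the chunks directory, which needs no download and therefore no unfreezing.

```python
from typing import Iterable, List, Set


def chunks_to_upload(new_chunks: Iterable[str], existing: Set[str]) -> List[str]:
    """Return the chunk IDs of this backup that the storage does not hold yet."""
    seen = set(existing)
    missing = []
    for chunk_id in new_chunks:
        if chunk_id not in seen:
            missing.append(chunk_id)
            seen.add(chunk_id)  # do not upload the same chunk twice in one run
    return missing
```

The trade-off is the one described above: listing every chunk costs more requests than trusting the last snapshot, but it never reads the content of a frozen object.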

So maybe a solution is to allow a cached copy of the config file to be placed under the .duplicacy\cache directory. When Duplicacy sees this local copy, it won’t attempt to download it from the storage.
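A minimal sketch of that idea in Python, assuming the cached copy is simply a file named `config` inside the cache directory. The file name, the `load_config` helper and the `download` callable are hypothetical; this describes a proposal, not an existing Duplicacy feature.

```python
from pathlib import Path
from typing import Callable


def load_config(cache_dir: Path, download: Callable[[], bytes]) -> bytes:
    """Return the storage config, preferring a local cached copy."""
    cached = cache_dir / "config"
    if cached.is_file():
        # Cache hit: no request is sent to the (possibly frozen) storage.
        return cached.read_bytes()
    data = download()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data
```

Only the very first call reaches the storage; every later backup reads the local copy, which is what makes "write-only" backups to cold storage possible.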

Rastagong · 27 April 2019 08:19

I see, thanks a lot for this clarification!

Then yeah, the option to use a cached copy of the config file would basically be enough to perform “write-only” backups to cold storage.

torokalpar · 7 October 2019 07:41

Hello,

I know it’s been a while since the initial discussion.
I have been experimenting with this myself. My provider of choice is Microsoft Azure.
The cost is €0.00084 per GB per month.
The way I have configured it is that everything goes to hot storage by default, and I use a storage life-cycle policy in Azure to move things to archive.
The config always stays hot, and the life-cycle policy moves everything in chunks to the archive tier after one day (the policy doesn’t allow doing so any sooner).
There’s free credit on signup so that helps with the initial cost of everything staying in hot storage initially.
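For readers who want to reproduce this setup, a lifecycle rule of roughly the following shape could express it. This is a sketch and not the policy actually used in this post: the container name `backups` and the rule name `archive-chunks` are hypothetical, and the dictionary follows the JSON layout of Azure Blob Storage lifecycle management policies as far as I know it.

```python
import json


def archive_policy(container: str, days: int = 1) -> dict:
    """Build a lifecycle rule that archives blobs under <container>/chunks/."""
    return {
        "rules": [
            {
                "name": "archive-chunks",
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        # Only the chunks directory matches; the config
                        # file and the snapshots stay in the hot tier.
                        "prefixMatch": [f"{container}/chunks/"],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToArchive": {"daysAfterModificationGreaterThan": days}
                        }
                    },
                },
            }
        ]
    }


print(json.dumps(archive_policy("backups"), indent=2))
```

The printed JSON is what would be supplied as the storage account's lifecycle policy; restricting the prefix to the chunks directory is what keeps the config file hot.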

This works well for backups, even without the -hash option suggested here.
Things might have changed since the initial writing but it seems that this option applies to local file detection only.

Prune and check, however, do not work:

2019-10-05 00:02:46.415 INFO SNAPSHOT_CHECK Total chunk size is 1599G in 339788 chunks
2019-10-05 00:02:49.975 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 1 exist
2019-10-05 00:02:53.046 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 2 exist
2019-10-05 00:02:55.870 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 3 exist
2019-10-05 00:02:59.000 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 4 exist
2019-10-05 00:03:01.816 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 5 exist
2019-10-05 00:03:04.672 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 6 exist
2019-10-05 00:03:07.506 INFO SNAPSHOT_CHECK All chunks referenced by snapshot 1 at revision 7 exist
2019-10-05 00:03:07.673 ERROR DOWNLOAD_CHUNK Failed to download the chunk d0802d2bd831fc10243ca095d05f7af72858843435a6796f57f2e94967a46c6f: storage: service returned error: StatusCode=409, ErrorCode=BlobArchived, ErrorMessage=This operation is not permitted on an archived blob.
RequestId:9d485689-801e-00d5-6010-7beefb000000

I kept snapshots in hot storage hoping it would be OK (and their size is not significant either), but it looks like some chunks still have to be downloaded. It is the same chunk for both check and prune. Note that the download is expected to fail as the chunks are all archived.

Droolio · 7 October 2019 13:30

Metadata chunks - those that detail the file list and directory tree structure, chunk sequences and chunk lengths - will also be stored in the /chunks directory. See: Snapshot file format

You’re gonna have a hard time trying to separate those chunks into hot, cool, archive. IMO, cold storage probably isn’t suitable for this type of application. Perhaps the Azure storage code could be modified to put such chunks in a /meta structure instead, but I can’t imagine that being easy.