Hello everyone,
Being interested in using OVH’s Public Cloud Archive together with Duplicacy, I did a bit of research on the topic.
Since it is a kind of so-called “cold” storage, it has a few particularities, some of which may not work so well with Duplicacy. I tried to find out to what extent it could still be usable.
Please note that this was originally posted in issue #41 in Duplicacy’s GitHub repository, but @TheBestPessimist invited me to resume the discussion here on the forum instead, to reduce clutter on GitHub. (I edited a few passages from the original for the sake of clarity.)
This is probably for the best since much of this discussion is purely speculative —this is not a feature request per se, nor a tutorial, just a few thoughts on using cold storage together with Duplicacy.
Anyway, here’s what I found out!
Cold Storage in a nutshell
As a reminder and for clarity’s sake, cold storage schemes provide long-term storage of data which have no need to be frequently/rapidly accessed.
While the actual technologies differ, cold storage usually implies using cheaper and more compact technologies (like magnetic tapes), or technologies which do not need to be networked and running 100% of the time. This reduces storage costs, at the price of higher costs/latency when you actually need to retrieve data.
In other words, cold storage tends to be cheaper per volume of data than regular storage, provided you don’t need to access your data very often.
Hence its potential use to store backups generated by Duplicacy!
Cold Cloud Storage Providers
Several cloud storage providers have started offering cold storage tiers. These include Amazon Glacier, OVH Public Cloud Archive, Google Coldline, Microsoft Cool Blob Storage and Online C14.
For almost all of these vendors, cold storage is offered alongside their mainline cloud storage tier (Amazon S3, OVH Object Storage, etc).
Some of these vendors in fact offer the possibility of programming rules so that older data located in “hot” storage containers may be automatically transferred to cold storage containers at some point in their lifetime.
In practice, all of these vendors rely on different technologies, and provide very different interfaces to upload and retrieve data to and from the storage. In addition, their actual pricing schemes tend to differ significantly from each other.
All this to say, it’s probably hard —if not outright impossible— to provide universal interfaces for cold cloud storage providers.
OVH’s Public Cloud Archive (PCA) definitely stands out from the crowd, though.
Like their regular Object Storage service, PCA relies on OpenStack Swift, an open protocol that Duplicacy has supported since version 2.1.0 (March 2018). In addition, they provide an SFTP bridge, which is also usable by Duplicacy… but which might not fully work for reasons detailed in the next section.
As such, OVH’s Public Cloud Archive is probably one of the only cold storage solutions that Duplicacy may possibly use out of the box.
I’ll focus on OVH PCA in the remainder of this summary for these reasons!
Key Facts on OVH Public Cloud Archive
Useful links:
- Developer Guide
- The entire table of contents of OVH’s cloud storage documentation (the PCA section is at the bottom right)
- API Particularities of the OpenStack Swift protocol when used with PCA
- Tutorial to unfreeze objects from the admin panel
Uploading data is priced, downloading data is priced at the same rate, and monthly storage is also priced (at a very low rate).
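To make this pricing structure concrete, here is a minimal sketch of the arithmetic. The rates below are hypothetical placeholders picked for readability, not OVH's actual prices, and `pca_cost` is just an illustrative helper name.

```python
def pca_cost(uploaded_gb, downloaded_gb, stored_gb, months,
             transfer_rate, storage_rate):
    """Estimate a bill: transfers are charged per GB at one shared rate,
    storage is charged per GB per month."""
    transfer = (uploaded_gb + downloaded_gb) * transfer_rate
    storage = stored_gb * months * storage_rate
    return transfer + storage


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
TRANSFER_RATE = 0.01   # currency units per GB uploaded or downloaded
STORAGE_RATE = 0.002   # currency units per GB stored per month

# One year of keeping 500 GB, with and without a single full restore.
write_only = pca_cost(500, 0, 500, 12, TRANSFER_RATE, STORAGE_RATE)
one_restore = pca_cost(500, 500, 500, 12, TRANSFER_RATE, STORAGE_RATE)
print(write_only, one_restore)
```

With this structure, a storage that is only ever written to pays the transfer rate once, while every full restore adds the same amount again.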
The upload of data is straightforward. When using the OpenStack Swift protocol, you simply need to create an object inside a container. When using SFTP, you simply need to upload files.
The tricky part is the retrieval of data. Since this is a cold storage service, data is not meant to be accessed often, and you first have to “unfreeze” any object that you want to download. This can be performed either:
- Manually, by clicking on “unfreeze” in the admin panel of the service.
- Automatically, by attempting to download the object through an OpenStack Swift request (I’m not sure it works through SFTP, so OpenStack Swift may be mandatory). The server will respond with error code 429, but actually begin the unfreezing process (per the API particularities page).
Once unfreezing has begun, it takes some time for the object to be made available for download. The first time I requested unfreezing on a file, it took 4 hours.
The remaining time until unfreezing is visible in the admin panel. When you request a download (and thus unfreezing) through OpenStack Swift, it’s also provided in a header of the response.
Once unfrozen, an object remains available for download for 24 hours (per the developer guide), before becoming frozen again.
Contrary to other cloud services, OVH PCA does not make you pay anything for unfreezing an object (just for actually downloading, as explained above).
However, per the developer guide, it is “designed for seldom consulted data: the less frequently an archive unsealing operation is requested, the smaller the retrieval latency”.
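As a rough illustration of the automatic unfreezing flow described above, the sketch below interprets the response to a download attempt. It assumes the remaining delay is carried in a `Retry-After` header expressed in seconds; the header name is my assumption, so check the API particularities page before relying on it. `seconds_until_unfrozen` is a hypothetical helper, not part of any library.

```python
def seconds_until_unfrozen(status_code, headers):
    """Return 0 if the object can be downloaded now, the announced delay in
    seconds if it is still frozen, or None if the delay cannot be determined."""
    if status_code == 200:
        return 0
    if status_code == 429:
        # HTTP header names are case-insensitive, so normalise them first.
        lowered = {name.lower(): value for name, value in headers.items()}
        value = lowered.get("retry-after")  # assumed header name
        if value is not None and value.strip().isdigit():
            return int(value.strip())
    return None


# A frozen object announcing a 4-hour wait (14400 seconds).
print(seconds_until_unfrozen(429, {"Retry-After": "14400"}))
```

A client could use the returned value to schedule a single later retry instead of polling, which matters given the developer guide's remark about frequent unsealing requests.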
How well does Duplicacy play out with OVH PCA?
- It manages to initialise a PCA storage easily!
- For making backups, it works almost perfectly. Since its chunks are immutable, and since it can make quick lookups through the name of a chunk alone, Duplicacy manages to upload backups!
- … That is, it manages to perform backups, except for the existence of the config file. Since Duplicacy has a databaseless approach, it downloads the config file from the storage every time, even to perform backups. This is problematic since the config file needs to be unfrozen prior to download.
- For any operation which involves as much as reading the contents of a single chunk (typically, to restore data, but also probably for pruning), Duplicacy will be heavily dependent on unfreezing a lot of objects, and probably inadequate.
- One aspect I’m unsure and worried about is whether Duplicacy needs to read metadata chunks to perform simple backups. My understanding is that metadata chunks are created only when a snapshot file gets large enough, and that they are needed only during restoration of data. If they’re actually needed to perform further backups, then unfreezing will be a major obstacle for backups too.
In other words, Duplicacy works almost flawlessly with PCA for uploading data; the only roadblock is the download of the config file (and potentially metadata chunks, though I sure hope I’m not wrong!!).
For any recovering and pruning, Duplicacy will be pretty inefficient, and will need to wait for unfreezing.
So technically, Duplicacy is already usable with OVH PCA, provided we connect the storage through OpenStack Swift. Duplicacy will automatically trigger unfreezing of the files it needs by attempting to download them (it could also be triggered manually), but a lot of waiting time will be involved before actually backing up or restoring.
Coupled with the fact that frequent unfreezing requests may actually lengthen this delay… this inevitably makes restoring data from OVH Public Cloud Archive very inefficient.
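The "attempt, wait, retry" behaviour described above can be sketched as a small loop. This is not how Duplicacy is implemented; `fetch` is a hypothetical callable standing in for one download attempt, and it is assumed to return a `(status_code, delay_seconds, data)` tuple.

```python
import time


def download_when_unfrozen(fetch, sleep=time.sleep, max_attempts=5):
    """Call fetch() until it yields data, sleeping for the delay it reports.

    fetch is a hypothetical callable returning (status_code, delay_seconds, data).
    """
    for _ in range(max_attempts):
        status, delay, data = fetch()
        if status == 200:
            return data
        if status != 429:
            raise RuntimeError(f"unexpected status {status}")
        # The first 429 is what starts the unfreezing; later ones only report
        # the remaining time, so wait that long before asking again.
        sleep(delay)
    raise TimeoutError("object still frozen after all attempts")
```

Sleeping for the full announced delay, rather than polling at short intervals, keeps the number of requests low, which seems prudent if frequent requests can lengthen the retrieval latency.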
Potential ways to still use OVH PCA with Duplicacy
Using it as a secondary storage only
This inadequacy for data restoration is probably in line with the purpose of cold storage in the first place?
I mean that you should probably not attempt to make cold storage your “main” storage, since data retrieval from it will always be slow and costly.
On the other hand, since Duplicacy supports multiple storages, it could be feasible to back up to and restore from a primary storage, and to simply back up to PCA as a secondary storage without ever restoring from it. This would grant additional long-term preservation of data, without the time cost of restoring from PCA.
The only problem with this approach is, again, the download of the config file, which must be unfrozen prior to any backup.
Using rclone
A potential counter to this config download problem is to manually duplicate the chunks from the primary storage to the PCA storage (provided the PCA storage is made copy-compatible and bit-identical with the primary storage).
In fact, rclone could be used to duplicate these chunks; since it doesn’t need to download a config file, it will be able to upload data to the storage easily.
In addition, since version 1.47.0 (released about 10 days ago), rclone is compatible with OVH PCA: it can now download from PCA too by waiting for the appropriate unfreezing time (see issue #341 of rclone).
A potential workflow with rclone would therefore be:
- Use Duplicacy to make backups to your primary storage
- Use rclone to duplicate the chunks from the primary storage to the PCA storage
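The two steps above can be sketched as a small wrapper script. The storage path `/mnt/backups/duplicacy` and the rclone remote `pca:duplicacy` are hypothetical names; the sketch also assumes the primary storage is a local directory and that copying the whole storage directory (not only the chunks) is acceptable.

```python
import subprocess


def build_workflow(primary_path, pca_remote, storage_name="default"):
    """Return the two command lines of the workflow, in order."""
    # Step 1: regular Duplicacy backup to the primary storage.
    backup = ["duplicacy", "backup", "-storage", storage_name]
    # Step 2: mirror the primary storage directory to the PCA remote.
    mirror = ["rclone", "copy", primary_path, pca_remote]
    return [backup, mirror]


def run_workflow(commands, repository_dir):
    """Run each command from the repository directory, stopping on failure."""
    for command in commands:
        subprocess.run(command, cwd=repository_dir, check=True)


# Hypothetical local storage path and rclone remote name.
commands = build_workflow("/mnt/backups/duplicacy", "pca:duplicacy")
for command in commands:
    print(" ".join(command))
```

Calling `run_workflow(commands, "/path/to/repository")` would execute both steps. Note that `rclone copy` never deletes files on the destination, so chunks pruned from the primary storage would remain in PCA.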
Can other tools deal with OVH PCA?
- In their documentation, OVH recommends using PCA together with Duplicity.
Duplicity seemingly generates different chunks for the metadata and the actual data to be backed up. And while it lacks a lot of features of Duplicacy, it can be used in a multibackend configuration, where some files are backed up to a storage, and other files to another storage. This can be used to back up only actual data to the Public Cloud Archive, while metadata (which are needed for many operations) are backed up to the regular OVH Object Storage, from which they can be easily recovered. This basically sidesteps the config download problem of Duplicacy.
- Not fully confirmed, but Duplicati 2.0 will be configurable to work with a local database instead of a remote one, thus it should be able to back up to OVH PCA without needing to download anything before. The beta is already available.
Both of these tools have obvious downsides compared to Duplicacy, but I thought I’d list those which can potentially deal with PCA anyway.
Potential evolutions of Duplicacy to deal with PCA?
Storing the config file in another storage
If Duplicacy could download the config file from a special storage, while uploading backups to another, “real” storage, then it would be able to freely back up to PCA.
To test this, I vaguely thought of creating a fork of Duplicacy myself with a new kind of storage: a metastorage (or parent storage) which could contain several child storages.
When asked to download a chunk, this metastorage would choose which child storage to delegate the download call to.
In practice, we could use this metastorage to encompass an OVH Object Storage where the config file would be stored, and an OVH Public Cloud Archive where everything else would be stored.
This way, Duplicacy would be able to download the config file all the time from the Object Storage, while actually backing up to PCA. It wouldn’t facilitate restoration from PCA, but hopefully, it should be enough to back up?
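To illustrate the routing idea, here is a minimal sketch in Python. Duplicacy itself is written in Go and its real storage interface is different, so every class and method name here (`MemoryStorage`, `MetaStorage`, `upload`, `download`) is hypothetical; the sketch also assumes the config file lives at the path `config` in the storage root.

```python
class MemoryStorage:
    """Stand-in for a real backend; keeps files in a dict."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def upload(self, path, data):
        self.files[path] = data

    def download(self, path):
        return self.files[path]


class MetaStorage:
    """Parent storage that sends the config file to one child and
    everything else to another."""

    def __init__(self, hot, cold, hot_paths=("config",)):
        self.hot = hot      # e.g. Object Storage, always readable
        self.cold = cold    # e.g. PCA, needs unfreezing before reads
        self.hot_paths = set(hot_paths)

    def _child_for(self, path):
        return self.hot if path in self.hot_paths else self.cold

    def upload(self, path, data):
        self._child_for(path).upload(path, data)

    def download(self, path):
        return self._child_for(path).download(path)


# Hypothetical child storages and chunk path, for illustration only.
object_storage = MemoryStorage("object-storage")
pca = MemoryStorage("pca")
storage = MetaStorage(object_storage, pca)
storage.upload("config", b"config bytes")
storage.upload("chunks/ab/cdef", b"chunk bytes")
print(sorted(object_storage.files), sorted(pca.files))
```

Routing on an explicit set of paths keeps the rule easy to extend, for example if other small files also turned out to be read during backups.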
[Not Possible] Storing metadata chunks in another storage, like Duplicity
In issue #537 in February, someone asked if Duplicacy could use the Duplicity method of storing metadata chunks and content chunks across different storages, but it was said to be “almost impossible without redesign the chunk structure”.
This makes me hope that metadata chunks are truly not needed to perform backups, otherwise, even creating a metastorage would not be enough.
Conclusion
That’s it!
I hope I did not write any factual errors —please feel free to point them out if that is the case.
In short, I don’t think cold cloud storage will ever be a sound choice as primary storage for Duplicacy. Hopefully, it will eventually be possible to use one as a secondary, write-only storage (and in that case, OVH Public Cloud Archive will most likely be the easiest one to use).
As for myself, I’m still considering whether to just use OVH Object Storage instead, or to incorporate rclone into my backup workflow to manually transfer chunks to PCA from a local storage.
Alternatively, and if it isn’t conceptually unfeasible, I’d like to see if I could write a metastorage to simply store the config file elsewhere.