(Newbie) Backup entire machine?

Hi guys,
I’m currently searching for a cloud backup solution for my personal server.
It’s a debian machine that runs a wordpress (with database), my gitlab and a few other services.
I’ve always pushed aside creating regular scheduled backups, but want to tackle it now, since losing all of it would be really frustrating.
I’m a software dev, but not into ops stuff, so I wanted something simple, and creating backups directly into my Dropbox seems like a good way to finally use all that Dropbox storage that I pay for.

Now to my question:
The Quick Start tutorial says to “choose a directory to backup”, but I was thinking about backing up my entire filesystem (so “/”). Is this bad practice? Or is it a completely valid thing to do?
My alternative would be to create separate repositories for all the stuff I want to back up, correct?
If I had to choose I would probably take /var, /opt and /home, since they should contain all the relevant data like Docker images, my MySQL data and WordPress files.
So in that case I would go into each folder and create a separate repository for each one. Is this the way to go? Or do I have to go even smaller (e.g. one repository for MySQL, one for Docker, etc.)?

Having “/” as the backup root would have the benefit that I don’t need to think about any of that. I don’t have a lot of data lying around (around 40 GB on “/”), but I’m not sure if Duplicacy is meant to be used for “filesystem dumps”.

Thanks a lot :slight_smile:

If you want to lump files from multiple locations into the same backup-ID (sometimes called snapshot-ID, particularly in the command-line reference), you would indeed use the filesystem root “/” as your backup directory. However, the next step would then be to use include-/exclude-filters to ensure that you only back up the directories that actually contain your data, because (presumably) the world doesn’t need another backup of Debian. :wink:

If you use the Web-UI of Duplicacy, please be aware that the UI for maintaining include-/exclude-filters is… aehm… “suboptimal”. However, since Duplicacy allows filters to be defined in an external file, you’d just use that UI to point at such a file and maintain all your actual filters there.
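
To make this concrete, here is a rough sketch of what such a filters file could look like for your setup (the paths are just guesses based on the directories you mentioned, and the exact wildcard semantics are described in the filters documentation, so treat this as illustrative rather than copy-paste ready):

```
# keep the data directories (the entries ending in "/" let Duplicacy descend into them)
+home/
+home/*
+opt/
+opt/*
+var/
+var/www/
+var/www/*
+var/lib/
+var/lib/mysql/
+var/lib/mysql/*
# exclude everything else (i.e. the rest of the OS)
-*
```

In the Web-UI you would then, if I remember the syntax correctly, reference that file with a single @/path/to/filters line instead of fighting the filter editor.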

Now sometimes it makes sense to logically separate certain files in your backup, and the best way to do this is by backup-ID (aka snapshot-ID): perhaps one backup-ID for your web server, one for your personal documents and one for your media files. This is just an idea; how you split this is up to you. You would need a separate backup job for each backup-ID, but you could do things like back up each ID at a different interval or keep versions on a different retention schedule. And of course you can - and will likely need to - have different filter files for each backup-ID, since a filter file in most cases only makes sense for a certain backup root.
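
As a rough sketch of how that could look on the command line (the IDs, paths and the storage URL below are made up):

```
# one backup-ID per logical group, all pointing at the same storage
cd /srv/webserver && duplicacy init webserver b2://my-backup-bucket
cd /home/me/documents && duplicacy init documents b2://my-backup-bucket

# each repository then gets its own backup job / cron entry
cd /srv/webserver && duplicacy backup -stats
```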

You could then back up all backup-IDs into the same Duplicacy storage or - if you prefer - into separate Duplicacy storages. The advantage of using the same storage is that you might get some deduplication between backup-IDs (though that’s probably unlikely between very different types of data); the (minor IMHO) disadvantage is that you would need Duplicacy tools (the “copy” and “prune” commands) to separate the data out again. But those tools are there and work well, so no big deal IMHO.
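
For example (again just a sketch with made-up retention numbers), both prune and copy accept an -id option, so per-ID retention and moving one ID into its own storage would look roughly like this:

```
# different retention per backup-ID within the same storage
duplicacy prune -id webserver -keep 0:360 -keep 30:90 -keep 7:30 -keep 1:7
duplicacy prune -id documents -keep 0:180 -keep 7:30 -keep 1:7

# and if one ID should later live in its own storage:
duplicacy add -copy default offsite documents b2://my-other-bucket
duplicacy copy -id documents -from default -to offsite
```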

Please note that I didn’t use the term “repository”, because it seems ambiguous to me. It could mean the Duplicacy backup storage (essentially the physical location) or the backup-ID (the logical grouping).

You’re right it isn’t meant for “filesystem dumps”. But I hope that the use of include-/exclude-filters can help you avoid doing that.


I also prefer a separate configuration of repositories and storages.

In addition to the reasons mentioned above by @tangofan, I can add:

  • With separate configurations, you can choose different storage settings for each one. VMs, DBs and Veracrypt volumes (just to name a few examples) seem to work best with fixed-size chunks (setting min, max and avg chunk size all to the same value; see the sketch after this list).

  • You can choose different backup frequencies for each repository. In my case, my work files are backed up once a day, but my music once a week.

  • You will be able to more easily configure different prune policies for each backup.
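
Regarding the fixed-size chunks from the first bullet, a minimal sketch of such an init call could look like this (the ID, bucket and the 1M value are placeholders; if your CLI version doesn’t accept the M suffix, use the byte value 1048576 instead):

```
# fixed-size chunks: average, minimum and maximum all set to the same value
duplicacy init -c 1M -min 1M -max 1M vm-images b2://my-vm-bucket
```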

So it is better to make separate configurations than to back up / and use filters.

This is a controversial point, but I personally don’t think Dropbox (or Google Drive, etc.) is a good choice for storing backups. They were not made for this, but for syncing and accessing files across multiple devices.

Look for storage such as Amazon S3, Backblaze B2, Google Cloud Storage and others.


This is what I’m currently doing. Does anyone know how to make the backups incremental? So far it’s backing up the entire system every time instead of only the files that have changed since the last backup, so it’s quickly eating up space. I’m aware of the pruning option, but that doesn’t shrink the upload time.

Is there a particular shortcoming to using services like Dropbox or Google Drive, etc for storing backups?

Yes. As an example, enumerating directories with many files inside (think the chunks subfolder) may be throttled or otherwise rate limited, either on purpose to discourage bulk storage or simply because the service isn’t optimized for such a use case. Dropbox is the worst in this regard.

Even without the need to get into technical details, you can just go by the general principle that it is always best when your goals and incentives are aligned with those of your service provider.

Services like Dropbox and OneDrive already get your money regardless of whether you stress their servers or never use them. So their incentive is to slow you down to conserve resources. Or, if that happens with no malice, they don’t have any incentive to fix it. Hey, the OneDrive client does sync your files eventually, so what’s the problem?

On the contrary, services like Amazon S3 or Backblaze B2, where you pay for what you use, have every incentive to make your use case as fast as humanly possible. The faster the upload, the sooner you start paying for storage. The faster the download, the more you pay for egress. The quicker the API works, the more you pay for calls. See? When you say “hey, AWS, the list operation is somewhat slow today”, all they hear is “hey guys, you are preventing me from hauling carts of cash to you as fast as I’d like”. You bet they fix it instantly, likely even before you notice the slowdown in the first place…

My radical personal opinion is that Duplicacy should remove support for those file storage services, including WebDAV (oh, don’t get me started), but it will never happen because it’s a great marketing story: it indeed supports so many endpoints, and on small datasets they even work fine. Google even works on relatively large ones. So far.


Backup is always incremental and differential; only changed file blocks are picked up. Maybe it’s eating up space because you have selected a lot of transient and temp data? You can exclude and include stuff using the filters file.
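
If you want to verify that, the backup command has a -stats option; on a mostly-unchanged repository the summary should show that only a handful of new chunks were actually uploaded (the exact wording of the output varies by version):

```
# run inside the repository root
duplicacy backup -stats
```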


That’s good to know. It could be an issue with macOS. I’ll create another thread.

That’s not an issue with macOS… you have selected to back up data with a lot of turnover.


Thanks for the detailed post. It’s very helpful and your explanations make sense.

I admit I was surprised to see webdav included in the supported storage backends. It seems like it can barely do what it was designed to do under normal circumstances. In my case, I’m tied to Google Drive for the time being so I’m glad it’s not being removed. Fortunately it does seem to work, so far. I’ll keep my eye on things.

What is the approximate size of a small dataset and a relatively large one, per your comment about Google?

Also, thank you for the page you wrote about unattended setup for the Mac. It was quite helpful. I added a comment there about a few typos. Hope it’s useful.

Thanks for the detailed explanation and posts! I decided to select a few folders that I need to back up instead of backing up the entire system, for all the reasons mentioned.

I’ve got two more questions; maybe you could give me an idea here as well:

  1. Backing up MySQL:
    Can I just back up the whole “/var/lib/mysql” folder? Or should I write a script that dumps the entire DB to a file and then create a repository for that one file?
  2. Backing up Gitlab:
    Same question here:
    Gitlab can create backups on demand and puts them in the “/var/opt/gitlab/backups” folder. Mine currently contains multiple .tar files (one for each backup). Is the “intended” way to move the most recent backup into a folder, rename it to e.g. “gitlab_backup.tar” and then create a repository there so that the newest one always gets backed up? Or should I just back up the entire “/var/opt/gitlab/backups” folder? That seems a bit overkill to me, since it contains numerous backup files.

EDIT:
Is it okay to run duplicacy as sudo?

Hiya, I’m not an expert, but I’ve read somewhere that it’s not a good idea to back up a running database.

I believe you’d have to run it with sudo / as root to be able to back up ‘/’.

DB:
Dump the DB with a pre-backup script.
Do not compress the dump.
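
A minimal sketch of such a pre-backup script, assuming mysqldump with credentials coming from ~/.my.cnf and a dump directory that sits inside your backup root (both are assumptions, adjust to your setup):

```
#!/bin/sh
# dump all databases into a plain, uncompressed .sql file that Duplicacy
# picks up on the next backup run; an uncompressed dump deduplicates well
# between runs because most of its content stays the same
mkdir -p /var/backups/mysql
mysqldump --single-transaction --all-databases \
    --result-file=/var/backups/mysql/all-databases.sql
```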

Gitlab:

See if you can back up or dump to an external directory without compression.
Otherwise, stop the service and then back up.
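
For GitLab (Omnibus) something along these lines could serve as the pre-backup step; the find line is just one way to keep only the newest tar around, you could equally use GitLab’s own backup_keep_time setting:

```
#!/bin/sh
# create a fresh GitLab backup tar in /var/opt/gitlab/backups
gitlab-backup create
# optionally remove tars older than a day so Duplicacy only sees the newest one
find /var/opt/gitlab/backups -name '*.tar' -mtime +0 -delete
```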


If you compress the dumps, Duplicacy’s compression and dedup won’t work.
