Adding support for pruning empty revisions

It would be great if there were a way to prune empty (constant) revisions. I have several snapshots that look like this:

          snap | rev |                          | files |    bytes | chunks |    bytes |  uniq |    bytes |   new |    bytes |
 NAS1-Software |   1 | @ 2022-08-10 13:04 -hash | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 | 43962 | 211,002M |
 NAS1-Software |   2 | @ 2022-08-12 01:03       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   3 | @ 2022-08-12 13:14       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   4 | @ 2022-08-14 12:27       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   5 | @ 2022-08-14 22:54       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   6 | @ 2022-08-16 04:36       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   7 | @ 2022-08-16 12:23       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   8 | @ 2022-08-18 23:22       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |    22 |  85,432K |
 NAS1-Software |   9 | @ 2022-08-19 13:43       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  10 | @ 2022-08-19 13:57       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  11 | @ 2022-08-19 14:25       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  12 | @ 2022-08-19 14:34       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  13 | @ 2022-08-20 01:08       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  14 | @ 2022-08-20 17:47       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  15 | @ 2022-08-21 02:51       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  16 | @ 2022-08-22 21:18       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  17 | @ 2022-08-25 21:41       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  18 | @ 2022-08-27 17:55       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  19 | @ 2022-08-28 06:10       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  20 | @ 2022-08-29 12:35       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  21 | @ 2022-08-31 00:42       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  22 | @ 2022-09-01 20:55       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  23 | @ 2022-09-04 20:20       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  24 | @ 2022-09-05 20:52       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  25 | @ 2022-09-07 06:10       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  26 | @ 2022-09-07 07:11       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  27 | @ 2022-09-08 05:35       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  28 | @ 2022-09-08 06:35       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  29 | @ 2022-09-09 04:39       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  30 | @ 2022-09-10 00:37       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  31 | @ 2022-09-10 20:58       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  32 | @ 2022-09-11 17:28       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  33 | @ 2022-09-12 04:11       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  34 | @ 2022-09-12 06:00       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  35 | @ 2022-09-12 15:22       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  36 | @ 2022-09-14 00:15       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  37 | @ 2022-09-14 12:12       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
 NAS1-Software |  38 | @ 2022-09-15 21:12       | 21257 | 226,931M |  43963 | 210,975M |     0 |        0 |    27 |  80,889K |
 NAS1-Software |  39 | @ 2022-09-17 01:08       | 21257 | 226,931M |  43963 | 210,975M |     0 |        0 |     0 |        0 |
 NAS1-Software |  40 | @ 2022-09-17 10:42       | 21257 | 226,931M |  43963 | 210,975M |     0 |        0 |     0 |        0 |

i.e. the snapshot is mostly constant, except that once in a while there are some changes. If I am to restore something from such a snapshot, I am offered a list of 40 revisions, though in reality there are only 3 distinct ones (#1, #8 and #38), and I don’t know that during restore. So if I am looking for something, instead of checking just 3 revisions at most, I may have to poke through a lot more (as duplicate revisions are not obvious during restore).

I assume I can manually delete duplicate revisions from /snapshots, or even attach some scripting to check log processing, but I think it would be better if these revisions could be dropped on a flag. Possibly even as part of check rather than prune, as check should have all the information needed to identify duplicate revisions, and no actual chunks are supposed to be pruned during this operation.
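
As a rough illustration of the "scripting to check log processing" idea, here is a minimal Python sketch that reads rows in the tabular layout pasted above and reports which revisions look distinct. It is only a heuristic: a revision is treated as a duplicate when it adds no new chunks and its file count, sizes, and chunk count match the previous row. The function names and the `SAMPLE` excerpt are made up for the example, and nothing here touches the storage.

```python
SAMPLE = """\
          snap | rev |                          | files |    bytes | chunks |    bytes |  uniq |    bytes |   new |    bytes |
 NAS1-Software |   1 | @ 2022-08-10 13:04 -hash | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 | 43962 | 211,002M |
 NAS1-Software |   2 | @ 2022-08-12 01:03       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   7 | @ 2022-08-16 12:23       | 21255 | 226,943M |  43962 | 211,002M |     0 |        0 |     0 |        0 |
 NAS1-Software |   8 | @ 2022-08-18 23:22       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |    22 |  85,432K |
 NAS1-Software |   9 | @ 2022-08-19 13:43       | 21255 | 226,850M |  43940 | 210,900M |     0 |        0 |     0 |        0 |
"""


def parse_rows(text):
    """Parse data rows of the listing; the header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows carry a numeric revision in the second column.
        if len(cells) < 11 or not cells[1].isdigit():
            continue
        rows.append({
            "snapshot": cells[0],
            "revision": int(cells[1]),
            # files, bytes, chunks, chunk bytes: used to compare with the previous row
            "signature": (cells[3], cells[4], cells[5], cells[6]),
            "new_chunks": int(cells[9].replace(",", "")),
        })
    return rows


def distinct_revisions(rows):
    """Return revision numbers that differ from the row directly before them."""
    distinct = []
    previous = None
    for row in rows:
        unchanged = (
            previous is not None
            and row["new_chunks"] == 0
            and row["signature"] == previous["signature"]
        )
        if not unchanged:
            distinct.append(row["revision"])
        previous = row
    return distinct


if __name__ == "__main__":
    print(distinct_revisions(parse_rows(SAMPLE)))
```

Comparing the totals as well as the new-chunk count is deliberate: a revision that only removes files would also show zero new chunks, so the new-chunk column alone is not enough to call two revisions equal.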

Thoughts?

I think it should be a restore time option in the UI:

[ ] collapse equivalent snapshots 

Or

[ ] only show unique snapshots 

Or something along those lines.

This will preserve the original backup cadence and keep prune schedules straightforward with predictable counts, and yet provide the additional feature of seeing only different snapshots.
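
To make the "collapse equivalent snapshots" view concrete, here is a small Python sketch that groups consecutive revisions with the same content into one display row while leaving every revision in place. The `fingerprint` field is an assumption for the example (some identifier that is equal when two revisions have the same content); the thread does not say that such a value is available at restore time.

```python
from itertools import groupby
from typing import NamedTuple


class Revision(NamedTuple):
    number: int
    created: str       # timestamp as shown in the listing
    fingerprint: str   # hypothetical content identifier


class Group(NamedTuple):
    first: Revision
    last: Revision
    count: int


def collapse(revisions):
    """Group consecutive revisions sharing a fingerprint; nothing is deleted."""
    groups = []
    for _, run in groupby(revisions, key=lambda rev: rev.fingerprint):
        run = list(run)
        groups.append(Group(first=run[0], last=run[-1], count=len(run)))
    return groups


def describe(group):
    """One line of the collapsed view for a group of equivalent revisions."""
    if group.count == 1:
        return f"revision {group.first.number} @ {group.first.created}"
    return (f"revisions {group.first.number}-{group.last.number} "
            f"({group.count} backups, {group.first.created} to {group.last.created})")


if __name__ == "__main__":
    history = [
        Revision(1, "2022-08-10 13:04", "a"),
        Revision(2, "2022-08-12 01:03", "a"),
        Revision(8, "2022-08-18 23:22", "b"),
    ]
    for group in collapse(history):
        print(describe(group))
```

Because the grouping happens only in the view, the backup cadence and prune counts stay exactly as they were.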

Well, restore doesn’t know which snapshots are equivalent (at least right now).

Backup cadence is unaffected by this; regular prune schedules should still work the same, with the added benefit of not keeping duplicates instead of changed revisions. Also, removing duplicate revisions has more value the more often backups are taken (e.g. hourly). Once there are thousands of revisions, with only a few that actually changed, it would be very inefficient to figure out duplicates every time it is needed (not to mention the redundant work for all check/prune/etc. operations).

Backups should be immutable, and future revisions should not affect past ones.

So there are three options:

  • keep all, including equivalent revisions
  • if the new revision is the same as the previous one, don’t save it. This is a problem: I just made a backup, where is it?!
  • if the new revision is the same as the previous one, delete the old one and keep the new one (i.e. update the timestamp). This is also a problem: I had a backup done at 03.03.2021 3PM and now it’s gone!!!

I don’t think saving a few bytes of space by deleting snapshot files helps anything, or that keeping them presents a problem that needs solving. If a few users want to see unique snapshots only, the software can provide that view. But it shall not mess with the specified cadence. If I specified backup every hour, there better be 24 backups by tomorrow.

Thus there is no value (let alone “more value”) in auto-deleting snapshots. Each snapshot, even an equivalent one, carries information: a backup was done, and this snapshot is all you need to restore data to the state at that point in time, without looking up the previous one or the next one. This information is independent of any other backups. There is no need to link snapshots together, implicitly or explicitly.

In other words, it’s Ok to present filtered view but it’s not ok to auto-collapse data.

Even as an option it is highly counterproductive, and such an option should not be provided: it will only increase support volume and increase complexity. Both are bad things.

You seem to be vehemently opposed to users having choices. I never even said it should be a default or automatic setting.

By that logic, prune shouldn’t be supported either (which you seem to advocate as well elsewhere). Prune throws away actual non-trivial data! That’s on top of the information that a backup was done. Yes, it’s OK to throw information away; users can make that choice.

This is not about saving space. The fact is, with thousands or tens of thousands of snapshots Duplicacy slows down significantly; we’ve seen examples of that before. With mostly static snapshots, there is an easy way to deal with that by only throwing out minimal redundant information, while retaining non-trivial data and performance.

And just to preempt an argument that you shouldn’t be running tens of thousands of snapshots - not everybody is using the tool the same way you do.

Yes, I am. I strongly believe that every single option, flag, checkbox, or other choice that the user is forced to make is one the developer did not, could not, was unable to, or just decided not to make. It’s a cop-out; it is shifting the responsibility to the user. You know what: I, the problem area expert, don’t know what to do here, so let’s add this option; users will decide, they’ll definitely make the right choice, and it’s the best use of their miserable time.

I’m strongly against this. If there is a choice to be made, the vendor of the solution is in the best position to make these informed choices. Not the user.

They don’t need to be involved in these trivial choices. Just keep all the information. Or prune on some predictable fixed (and very simple) schedule, until performance issues are fixed.

In the world of backup – the paragon of usability is Time Machine on macOS. I’m yet to see anything that even comes close. There are no configuration options. It “just works”.

Bingo number one!

Bingo number two!

So, the software sucks at handling its own data efficiently, and instead of having it fixed, we are now justifying deleting data and forcing users to make choices they never wanted to know about in the first place, for no benefit in return, just to work around software bugs!

Or don’t touch the datastore, keep it immutable, and present the information in a robust and palatable way, fixing the performance issues by something other than throwing away the data.

This nonsense of “loading revisions” for 4 hours when attempting to restore should have never made it past QA. The list of revisions is a log of a few KB; it shall not take 4 hours to transfer that data. Oh, it traverses the whole snapshot structure to collect it. Well, that is dumb. It should have maintained a cached representation, ready to show to the user instantly.
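
A cached representation could be as simple as the following Python sketch: an append-only index file, written once per backup, that the restore screen reads instead of walking the snapshot structure. The `RevisionIndex` class, its file layout, and its field names are all hypothetical and only illustrate the idea; this is not how Duplicacy stores anything today.

```python
import json
from pathlib import Path


class RevisionIndex:
    """Append-only summary of revisions, one JSON object per line (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, snapshot_id, revision, created, files):
        """Append one summary line; intended to be called once after each backup."""
        entry = {
            "snapshot": snapshot_id,
            "revision": revision,
            "created": created,
            "files": files,
        }
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")

    def revisions(self, snapshot_id):
        """Return the cached summaries for one snapshot ID without touching the storage."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as handle:
            entries = [json.loads(line) for line in handle if line.strip()]
        return [entry for entry in entries if entry["snapshot"] == snapshot_id]
```

Listing revisions then costs one read of a small local file, regardless of how many chunks the snapshots reference. A real implementation would also need a way to rebuild the index from the storage when it is missing or stale.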

And then, since we both agree that it’s not about saving space, there is no need to prune. And once you don’t prune – you get a lot of freebies in the form of immutability, reliability, lack of user-driven decision making, and won’t need to fix that prune bug that is being reported periodically for the past two years. That last one is a pure bonus.

On the contrary, 10s of thousands of snapshots should not be presenting a performance issue. This needs to be fixed. I should be able to go back to October 15, 1999 and restore the file I edited between 2 and 3PM.

The only reason we are forced to prune is because software sucks at handling this modest amount of revisions. I refuse to have to discard data because of software deficiency.

Asking the user to throw away data so that the software can run fast is so obnoxious and backwards that I’m not even sure now how we ended up here.

Ah, yes, the Apple approach: the users cannot be trusted with making choices, so big brother will make them for them. Welcome to the walled garden with guardrails. No thanks. I mean, some people like for someone else to decide for them, but others would much rather make their own choices according to their circumstances.

And thinking that a single vendor can make the best choices for all users out there who use their product all day every day, really? Whether it is power tools or software tools, most vendors do not even use their own products on a day-to-day basis, much less have a view of all the use cases that exist in the user world.

Not all users need to or want to be “protected” from making choices. But to each his own.

You missed the point here. I’ll try to convey my thoughts one last time in a blunter way.

These “choices” we are talking about here are those that are forced on the user to work around issues and deficiencies with the software.

Users don’t need to be burdened with them. The vendor needs to make those choices and own up to them.

But you seem to think that it’s OK to release crap and have users concoct workarounds and make “choices” to make a workable solution.

It’s not freedom of choice. It’s unpaid forced labor under the guise of “choice”. Not to mention that this may have saved time for one developer but wastes the time of every single one of their hundreds or thousands of users.

Yes, really. Microsoft Word does not ask you menial questions about where to store and how to manage its data. Time Machine picked a fixed prune schedule that works for everyone, and most people don’t even know it’s there and have no idea how it works in general. Nor should they. Adobe Premiere does not ask you how to slice the streams and what texture and compute workgroup size is best on your GPU. Somehow they found time to figure this out without the user having to become an expert on GPU compute.

But no, this backup tool will have you learn its internals, layout, chunks, prune, etc. Just back up my damn data and don’t ask questions. Can duplicacy do this? Today, apparently not. A few thousand revisions is too much for some reason, in 2022.

The sooner you and others stop making excuses and actually start thinking about user experience, the sooner the world will become a better place. I understand, many are traumatized by years of using horribly designed software on an OS that encourages that (yes, Windows is a paragon of how not to design interfaces) and think that it is normal, but unless we make a conscious effort to throw away that legacy and start making the right decisions, nothing will change.

Since you brought up Apple: yes, they are one of the very few that understand user experience design, and there is a lot to be learned from that.

And nothing here has anything to do with the walled garden you mentioned. It’s more like when I buy a car: I’ll make choices about where to go, but don’t force me into freedom of choice for picking engine timings or temperature curves.

And definitely don’t ask me to delete my data to make a poorly designed application (that only exists to protect my data) not choke. I noticed you did not argue against this; glad we are on the same page.

Since you brought up the closest example to Duplicacy: so, should Duplicacy now go with a fixed prune schedule that will work for “everyone”? You know, to improve user experience. Let’s also remove the ability to customize chunk size; clearly this is not something that users need. Too many backends, users will get confused, so why don’t we go with a single “sensible option”. Etc.

Yes, it is freedom of choice. The difference between our viewpoints is that I am fine with the tool supporting different ways of doing things, including yours, e.g. minimum customization required for base case scenarios. All that is needed is sensible defaults; users don’t need to choose anything different if they don’t want to. But you seem to be dead against the tool supporting other people’s way of doing things, which is the ability to customize settings when and if desired.

Speaking of cars, some people really want to drive full auto family sedans, but some people with offroad-capable or high-performance cars want to be able to select gears on their own if needed (or even ECU profiles indeed). You don’t even need to give up automatic gear selection for when you don’t need fine tuning. Not all cars need to be the same, not all backup solutions need to be Time Machine.

So the bottom line is, we want very different backup tools.

Yes. The default should be sensible, and it shall be enabled by default. Right now it’s off, which leads to performance issues and forum posts to figure out what that sensible prune schedule should be. Am I saying that the fix here is a different default for a checkbox? Yes, I am. Until the performance is fixed so prune is no longer needed.

No, I’m not saying that the user should not be able to do this. I’m saying that the user should not have to. In this case, the app shall pick the correct chunking for the remote and file type. Yes, different chunking for different files. Encounter vhd? Do fixed chunking. See .mkv? Do massive 256MB chunks. See .mkv but the backend is Storj? Do 64MB chunks. (Not strictly correct, but as an example.) Then the user will not have a reason to mess with chunking at all (which is now impossible to do at per-file granularity) nor need to learn to do it in the first place, because the developer spent time teaching the tool to do this reasonably well.
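
Spelled out as code, the rule of thumb above might look like this Python sketch. It only encodes the three cases from the example (which, as noted, are not strictly correct); `ChunkPolicy`, `pick_chunk_policy`, the backend name strings, and the 4 MiB fallback are assumptions made up for the illustration, and per-file chunking is not something the tool offers today.

```python
from dataclasses import dataclass
from pathlib import PurePath

MIB = 1024 * 1024
DEFAULT_CHUNK_BYTES = 4 * MIB  # assumed fallback size, not taken from the thread


@dataclass(frozen=True)
class ChunkPolicy:
    fixed_size: bool   # True for fixed-size chunking, False for variable-size
    chunk_bytes: int   # target chunk size in bytes


def pick_chunk_policy(filename: str, backend: str) -> ChunkPolicy:
    """Choose a chunking policy from the file extension and the storage backend."""
    extension = PurePath(filename).suffix.lower()
    if extension == ".vhd":
        # Disk images: fixed chunking, per the example.
        return ChunkPolicy(fixed_size=True, chunk_bytes=DEFAULT_CHUNK_BYTES)
    if extension == ".mkv":
        # Large media: big chunks, smaller ones when the backend is Storj.
        size = 64 * MIB if backend.lower() == "storj" else 256 * MIB
        return ChunkPolicy(fixed_size=False, chunk_bytes=size)
    return ChunkPolicy(fixed_size=False, chunk_bytes=DEFAULT_CHUNK_BYTES)


if __name__ == "__main__":
    print(pick_chunk_policy("movie.mkv", "storj"))
```

The point of the sketch is that the decision table lives in the tool, so the user never has to see it.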

Very true. Pick the best one for me: analyze my data, ask questions, and recommend a remote. Why not? Do I need to learn about 700 different cloud storages and pricing? I have better things to do with my life, honestly.

Yes, this is necessary, but not sufficient. See below.

No, there is no “other people’s” way of doing things. There is a correct way and an incorrect way, and that varies depending on specific circumstances. Or, to put it differently, there is an optimal and a suboptimal way of doing things in each specific circumstance.

All the people who use duplicacy want is to have backup done. This is a crucial point. They don’t want specific chunk size, or prune schedule, or whatever else duplicacy internals they need to learn to tweak just to make the tool usable. They can live happily not knowing that.

The reason you seem to be so adamant in wanting to be able to micromanage duplicacy internals is because you feel this is the only way forward: you have to be able to fix the tool deficiency or inability to handle your usecase. But your usecase is not unique. And your requirements are not unreasonable.

I’d argue there is more than one person with requirements similar to yours. And the outcome you want is also likely very reasonable and logical for your circumstances. (Nobody wants a horrible deduplication ratio, long upload time, huge chunking overhead, and long restore. Most people want the opposite.)

So why doesn’t duplicacy detect this use case and do the right thing out of the box on its own (like picking chunking or compression parameters based on file types and remote capabilities, as alluded to above)? Why do you need to change things yourself to suit your case?

See my point?

I anticipated this response, and that’s fine; however, your minivan should not need its ECU tweaked to survive the camping trip. Today this is what we have to do with duplicacy to make it work. And your feature request is akin to this: let’s remove the rear seats to make it lighter so it does not overheat, and thus we don’t have to fix the insufficient throughput of the radiator fan. We likely won’t need the rear seats anytime soon anyway. Let’s finish with analogies; I think the point is clear without them.

I think we want the same thing, but you are not seeing the forest for the trees; let’s just step back a few miles and look at the big picture:

I want a tool that works correctly out of the box and does not need tweaking.

You want a tool that you can customize the shit out of, because you gave up waiting for the developer to do the right thing or even come up with sensible defaults. Years of using crappy products wore you down and taught you to set expectations very low; you no longer expect any software, including this one, to handle your use case properly. So instead of wanting a quality product, you gave up and now want access to the internals so you can put work into it yourself to try and improve the quality somewhat, or at least steamroll it into doing better in your specific scenario. To do the job you didn’t even expect the developer to do anymore. And everyone like you ends up doing the same thing, wasting hundreds of man-hours redoing the same tweaks and reading through the same forum posts. This is infuriating.

I want the app to behave correctly for you out of the box, and while you may still be able to tweak it further or even get the source and recompile to squeeze out the last 2% of performance, you would not have an incentive to do so because it works well enough as is. That’s the goal. And it’s totally possible. But for that to happen people need to stop tolerating … eh… incomplete software.

I don’t think I have much to add here. Anyone reading this can pick which philosophy they subscribe to.