Your question about understanding deduplication rates and identifying redundant files within and across snapshots is very insightful. Duplicacy's deduplication is indeed one of its most powerful features, but finer-grained reporting of deduplication statistics, especially at the file level, is not currently built in. However, you can get much of this information with some additional tools and scripting.
Deduplication Statistics
To obtain detailed deduplication information, you can use Duplicacy’s logging and scripting capabilities to analyze the backup process. While Duplicacy does not natively provide split deduplication statistics between “within snapshot” and “within storage,” you can parse the logs to extract this information.
Helper Scripts and Projects
There are a few community projects and scripts that can help extend Duplicacy’s functionality:
Duplicacy-Utils: A set of utilities and scripts developed by the Duplicacy community to help with various tasks, including monitoring and reporting. You can find more information and download these tools from the Duplicacy-Utils GitHub repository.
Custom Scripts: Writing custom scripts to parse Duplicacy logs can help you achieve the specific insights you’re looking for. For example, you can use Python or another scripting language to process the logs and generate reports on file-level deduplication.
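As a rough sketch of that approach, the snippet below parses the chunk summary that a `duplicacy backup -stats` run prints. The exact wording of these summary lines varies between Duplicacy versions, so both the sample log text and the regular expression are assumptions you would adapt to your own logs; chunks that were not "new" are the ones deduplicated against existing storage.

```python
import re

# Hypothetical Duplicacy -stats log excerpt; the exact wording varies by
# version, so treat these lines and the pattern below as assumptions.
LOG = """\
File chunks: 5211 total, 24,594M bytes; 404 new, 1,903M bytes, 1,012M bytes uploaded
Metadata chunks: 9 total, 11,721K bytes; 9 new, 11,721K bytes, 2,853K bytes uploaded
"""

def parse_chunk_stats(log_text):
    """Extract total vs. new chunk counts per category from a -stats summary."""
    pattern = re.compile(r"(\w+) chunks: (\d+) total.*?; (\d+) new")
    stats = {}
    for kind, total, new in pattern.findall(log_text):
        total, new = int(total), int(new)
        stats[kind.lower()] = {
            "total": total,
            "new": new,
            # Chunks that already existed in storage were deduplicated.
            "deduplicated": total - new,
        }
    return stats

stats = parse_chunk_stats(LOG)
print(stats["file"])  # {'total': 5211, 'new': 404, 'deduplicated': 4807}
```

This only splits chunks into "new" versus "already in storage"; attributing deduplication to individual files still requires the hash-based post-processing discussed elsewhere in this thread.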
Post-Processing with File Hashes: As you suggested, another approach is to perform post-processing by analyzing file hashes. By comparing hashes of files in the current backup against previous backups, you can determine redundancy and generate detailed reports.
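A minimal sketch of that idea, assuming you hash the source (or restored) files yourself; the names `build_manifest` and `redundancy_report` are my own for illustration, not part of Duplicacy:

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file's relative path under `root` to its SHA-256 digest."""
    manifest = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def redundancy_report(current, previous):
    """Split the current manifest into content seen before vs. genuinely new."""
    previous_hashes = set(previous.values())
    unchanged = {p: h for p, h in current.items() if h in previous_hashes}
    new = {p: h for p, h in current.items() if h not in previous_hashes}
    return unchanged, new

# Toy manifests: "a.txt" carries content already present in the last backup.
unchanged, new = redundancy_report(
    current={"a.txt": "h1", "b.txt": "h2"},
    previous={"old.txt": "h1"},
)
```

Note this compares whole-file hashes, which is coarser than Duplicacy's chunk-level deduplication, but it is often exactly the file-level view people want in a report.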
Developing New Tools
Your idea of developing a tool to identify and present duplicates and monitor the disappearance of important files is excellent. Here are some suggestions for implementing this:
File Hash Comparison: Implement a system that calculates and stores file hashes during each backup. You can then compare these hashes across backups to identify duplicates and files that have disappeared.
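For illustration, here is one way such a comparison could group identical content within and across snapshots. It assumes you already have a per-snapshot `{path: hash}` manifest from your own hashing pass; `find_duplicates` is a hypothetical helper, not an existing tool.

```python
from collections import defaultdict

def find_duplicates(snapshots):
    """Group identical file contents within and across snapshots.

    `snapshots` maps a snapshot ID to a {path: sha256} manifest,
    obtained however your backup workflow records file hashes.
    """
    by_hash = defaultdict(list)
    for snapshot_id, manifest in snapshots.items():
        for path, digest in manifest.items():
            by_hash[digest].append((snapshot_id, path))
    # Only hashes seen more than once represent redundant content.
    return {h: locs for h, locs in by_hash.items() if len(locs) > 1}

snapshots = {
    "rev1": {"a.txt": "h1", "b.txt": "h2"},
    "rev2": {"a.txt": "h1", "copy/b.txt": "h2", "c.txt": "h3"},
}
dups = find_duplicates(snapshots)
# "h1" and "h2" each appear in both revisions; "h3" is unique to rev2.
```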
Alert System: Create an alert system that triggers notifications when important files are no longer found in the current backup. This could be implemented using a combination of hash comparison and file monitoring.
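A minimal sketch of such a check, again assuming path-to-hash manifests; `missing_file_alerts` and the sample paths are hypothetical:

```python
def missing_file_alerts(previous, current, important):
    """Return important paths present in the previous backup but absent now."""
    disappeared = set(previous) - set(current)
    return sorted(p for p in disappeared if p in important)

prev = {"docs/tax-2023.pdf": "h1", "notes.txt": "h2", "tmp/cache.bin": "h3"}
curr = {"notes.txt": "h2"}

alerts = missing_file_alerts(prev, curr, important={"docs/tax-2023.pdf"})
for path in alerts:
    # Hook this into email, a webhook, or whatever notification you prefer.
    print(f"ALERT: important file missing from latest backup: {path}")
```

Here `tmp/cache.bin` also disappeared but is ignored because it is not on the important list, which keeps the alerts focused on files you actually care about.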
Visualization: Develop a user-friendly interface to visualize duplicates and changes in the backup sets. This can help users quickly identify and address redundancy and potential data loss.
What I Have Discovered So Far from Testing Other Projects
Duplicacy is a powerful backup tool, and while it does not natively provide all the detailed deduplication statistics you are looking for, with some scripting and additional tools, you can achieve these insights. Your initiative to develop further tools and features will undoubtedly be valuable to the community.
If you decide to develop these tools, consider sharing them on the Duplicacy forums or GitHub to benefit other users who might have similar needs.
Best of luck with your development, and feel free to share any progress or ask for further assistance!
You might be wondering: do I work here?
No! But I did stay at a Holiday Inn Express last night.