Your question about understanding deduplication rates and identifying redundant files within and across snapshots is very insightful. Duplicacy's deduplication is indeed one of its most powerful features, but finer-grained reporting of deduplication statistics, especially at the file level, is not currently built in. However, you can get much of this information with some additional tools and scripting.
Deduplication Statistics
To obtain detailed deduplication information, you can use Duplicacy’s logging and scripting capabilities to analyze the backup process. While Duplicacy does not natively provide split deduplication statistics between “within snapshot” and “within storage,” you can parse the logs to extract this information.
Helper Scripts and Projects
There are a few community projects and scripts that can help extend Duplicacy’s functionality:
Duplicacy-Utils: A set of utilities and scripts developed by the Duplicacy community to help with various tasks, including monitoring and reporting. You can find more information and download these tools from the Duplicacy-Utils GitHub repository.
Custom Scripts: Writing custom scripts to parse Duplicacy logs can help you achieve the specific insights you’re looking for. For example, you can use Python or another scripting language to process the logs and generate reports on file-level deduplication.
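As a rough sketch of that approach, the snippet below parses the chunk summary that a `duplicacy backup -stats` run prints. The exact wording of these summary lines varies between Duplicacy versions, so both the sample log text and the regular expression are assumptions you would adapt to your own logs; chunks that were not "new" are the ones deduplicated against existing storage.

```python
import re

# Hypothetical Duplicacy -stats log excerpt; the exact wording varies by
# version, so treat these lines and the pattern below as assumptions.
LOG = """\
File chunks: 5211 total, 24,594M bytes; 404 new, 1,903M bytes, 1,012M bytes uploaded
Metadata chunks: 9 total, 11,721K bytes; 9 new, 11,721K bytes, 2,853K bytes uploaded
"""

def parse_chunk_stats(log_text):
    """Extract total vs. new chunk counts per category from a -stats summary."""
    pattern = re.compile(r"(\w+) chunks: (\d+) total.*?; (\d+) new")
    stats = {}
    for kind, total, new in pattern.findall(log_text):
        total, new = int(total), int(new)
        stats[kind.lower()] = {
            "total": total,
            "new": new,
            # Chunks that already existed in storage were deduplicated.
            "deduplicated": total - new,
        }
    return stats

stats = parse_chunk_stats(LOG)
print(stats["file"])  # {'total': 5211, 'new': 404, 'deduplicated': 4807}
```

This only splits chunks into "new" versus "already in storage"; attributing deduplication to individual files still requires the hash-based post-processing discussed elsewhere in this thread.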
Post-Processing with File Hashes: As you suggested, another approach is to perform post-processing by analyzing file hashes. By comparing hashes of files in the current backup against previous backups, you can determine redundancy and generate detailed reports.
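A minimal sketch of that idea, assuming you hash the source (or restored) files yourself; the names `build_manifest` and `redundancy_report` are my own for illustration, not part of Duplicacy:

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file's relative path under `root` to its SHA-256 digest."""
    manifest = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def redundancy_report(current, previous):
    """Split the current manifest into content seen before vs. genuinely new."""
    previous_hashes = set(previous.values())
    unchanged = {p: h for p, h in current.items() if h in previous_hashes}
    new = {p: h for p, h in current.items() if h not in previous_hashes}
    return unchanged, new

# Toy manifests: "a.txt" carries content already present in the last backup.
unchanged, new = redundancy_report(
    current={"a.txt": "h1", "b.txt": "h2"},
    previous={"old.txt": "h1"},
)
```

Note this compares whole-file hashes, which is coarser than Duplicacy's chunk-level deduplication, but it is often exactly the file-level view people want in a report.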
Developing New Tools
Your idea of developing a tool to identify and present duplicates and monitor the disappearance of important files is excellent. Here are some suggestions for implementing this:
File Hash Comparison: Implement a system that calculates and stores file hashes during each backup. You can then compare these hashes across backups to identify duplicates and files that have disappeared.
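For illustration, here is one way such a comparison could group identical content within and across snapshots. It assumes you already have a per-snapshot `{path: hash}` manifest from your own hashing pass; `find_duplicates` is a hypothetical helper, not an existing tool.

```python
from collections import defaultdict

def find_duplicates(snapshots):
    """Group identical file contents within and across snapshots.

    `snapshots` maps a snapshot ID to a {path: sha256} manifest,
    obtained however your backup workflow records file hashes.
    """
    by_hash = defaultdict(list)
    for snapshot_id, manifest in snapshots.items():
        for path, digest in manifest.items():
            by_hash[digest].append((snapshot_id, path))
    # Only hashes seen more than once represent redundant content.
    return {h: locs for h, locs in by_hash.items() if len(locs) > 1}

snapshots = {
    "rev1": {"a.txt": "h1", "b.txt": "h2"},
    "rev2": {"a.txt": "h1", "copy/b.txt": "h2", "c.txt": "h3"},
}
dups = find_duplicates(snapshots)
# "h1" and "h2" each appear in both revisions; "h3" is unique to rev2.
```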
Alert System: Create an alert system that triggers notifications when important files are no longer found in the current backup. This could be implemented using a combination of hash comparison and file monitoring.
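A minimal sketch of such a check, again assuming path-to-hash manifests; `missing_file_alerts` and the sample paths are hypothetical:

```python
def missing_file_alerts(previous, current, important):
    """Return important paths present in the previous backup but absent now."""
    disappeared = set(previous) - set(current)
    return sorted(p for p in disappeared if p in important)

prev = {"docs/tax-2023.pdf": "h1", "notes.txt": "h2", "tmp/cache.bin": "h3"}
curr = {"notes.txt": "h2"}

alerts = missing_file_alerts(prev, curr, important={"docs/tax-2023.pdf"})
for path in alerts:
    # Hook this into email, a webhook, or whatever notification you prefer.
    print(f"ALERT: important file missing from latest backup: {path}")
```

Here `tmp/cache.bin` also disappeared but is ignored because it is not on the important list, which keeps the alerts focused on files you actually care about.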
Visualization: Develop a user-friendly interface to visualize duplicates and changes in the backup sets. This can help users quickly identify and address redundancy and potential data loss.
What I Have Discovered So Far from Testing Other Projects
Duplicacy is a powerful backup tool, and while it does not natively provide all the detailed deduplication statistics you are looking for, with some scripting and additional tools, you can achieve these insights. Your initiative to develop further tools and features will undoubtedly be valuable to the community.
If you decide to develop these tools, consider sharing them on the Duplicacy forums or GitHub to benefit other users who might have similar needs.
Best of luck with your development, and feel free to share any progress or ask for further assistance!
You might be wondering: do I work here?
No! But I did stay at a Holiday Inn Express last night.