What do you do when you have too many backups? That’s a problem I’ve been working on for the past two years and I thought it might be valuable to put some of this in an article for others who face similar problems.
First of all, how many backups is too many? For me, it was when I started stacking backup hard drives like cordwood. Since I started making 300MB film scans in the late 1990s, I’ve had a series of ever-larger drives holding my files. When I upgraded to a larger hard drive, I always kept the old drive “just in case”. This has led to a lot of drives. Just this week I found four drives I didn’t even remember I had, all from the 2007-2009 time frame. Last week I pulled files from some truly ancient drives that I hadn’t spun up in about 15 years.
Why did I keep all these drives? I wanted to preserve the ability to go back in time in case I found a file had been corrupted at some point. The only way to fix a corrupted file is to go back to a copy made before the corruption happened and pull that file. That is the purpose these drives served.
Consolidating ten or fifteen ancient, space-hogging drives onto one large drive seemed like a better option, and would help me reduce some of my ever-growing clutter.
Adding to my problem of excess backups is how Carbon Copy Cloner (my backup software of choice) backs up files. CCC works in a very simple way that is very safe, but that simplicity makes it easy to end up with duplicate files when you move folders around. So on top of my duplicate “archive” drives, I had several terabytes of CCC files to deal with.
My goal for this project was threefold. First, I wanted to consolidate all my archives on one drive. Second, I wanted to deduplicate (dedupe) those files to remove unnecessary duplicates so that they would take up the least space possible. Third, I needed to do all this in a way that ensured that I precisely copied every file, bit for bit, and that any duplicates were truly duplicates with no differences at the bit level.
The tool that has made this possible is IntegrityChecker from Lloyd Chambers at https://diglloydtools.com and diglloyd.com.
IntegrityChecker does a number of very interesting things. Foremost, it will compute a cryptographic hash for every file on a drive. This hash serves as a checksum that can show if a file has been altered in even the slightest way, down to the bit level. This is very useful when copying files to another drive to ensure they copied exactly. It also lets me compare them to the hash at a later date to detect corruption. It does some other cool things too as I’ll explain in a moment.
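The hashing idea can be sketched in a few lines of Python. This is only an illustration of the principle, not IntegrityChecker’s actual implementation; I’m using SHA-256 here as an assumption, and IC may use a different hash internally:

```python
import hashlib

def file_hash(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a file, reading it in 1 MB chunks
    so that multi-gigabyte scans never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Change a single bit anywhere in the file and the digest changes completely, which is what makes a stored hash useful for detecting corruption long after the copy was made.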
So my consolidation process looked like this.
1. Use Carbon Copy Cloner to copy from my old drive to a folder on a new drive.
2. Use IntegrityChecker to compute hashes for both copies.
3. Use the “compare” function of IntegrityChecker to compare the copy to the original.
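The verify steps above (hash both copies, then compare) can be sketched in Python. Again, this is a stand-in for the idea rather than how IntegrityChecker itself works, with SHA-256 assumed as the hash:

```python
import hashlib
import os

def hash_tree(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests

def compare_trees(original, copy):
    """Return (missing, changed): files absent from the copy, and files
    present in both trees whose contents differ at the bit level."""
    a, b = hash_tree(original), hash_tree(copy)
    missing = sorted(set(a) - set(b))
    changed = sorted(p for p in a if p in b and a[p] != b[p])
    return missing, changed
```

An empty result from both lists is what “absolute assurance” means here: every file is present, and every byte matches.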
This process let me make a copy of the old drives with absolute assurance that I had copied every file correctly. In over 20TB of files copied for this project, I only found one file that did not copy correctly for whatever reason. Not bad for pulling data off vintage hard drives.
Goal two was to dedupe the drive where I had consolidated all my archives and backups. IntegrityChecker helped with this too. IC can use the hashes it creates to look for duplicates. If a pair of hashes matches, you can be sure with an extremely high level of confidence that the two files are exactly the same. This is a much better way to identify duplicates than methods that rely on file size, name, and date, because those values will not detect bit-level differences from file corruption. Hash comparison will, so if IC says two files are duplicates, they really are.
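Hash-based duplicate detection boils down to grouping files by content digest. Here is a simplified Python sketch of the idea (not IC’s code; SHA-256 is my assumption):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_groups(root):
    """Group every file under root by its content hash. Any group with
    more than one path is a set of bit-identical duplicates, regardless
    of file name, date, or location."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            groups[h.hexdigest()].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

In practice a size pre-filter speeds this up enormously (files of different sizes can never match), but the content hash is what makes the final verdict trustworthy.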
IntegrityChecker lets you deal with dupes in two ways. First, you can use a unique feature of drives formatted with APFS on Macs to create a clone. When a clone is made, the two files are reduced to one at the disk level, but you still see two files in the Finder. If you open one of these files and modify it, it becomes a separate copy again. Cloning files allows you to reclaim disk space from duplicates without messing up your directory structure. This is very safe, but would not help me with some of my other goals, as you will see.
I decided to go a more aggressive route. I wanted to remove every duplicate file, so I used the “--emit rm” option to create a list of duplicate files along with the command-line code to erase them. This removes them from the hard drive permanently, leaving only one copy.
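The effect of that option can be approximated in Python: for each group of bit-identical duplicates, keep one copy and emit an `rm` command for the rest. Treat my `emit_rm` helper and its keep-first rule as illustrative only, not a description of IC’s exact behavior:

```python
import shlex

def emit_rm(duplicate_group):
    """Given a list of bit-identical duplicate paths, keep the first in
    sort order and emit shell 'rm' commands for the rest, with each path
    quoted so spaces in file names can't break the command."""
    keep, *extras = sorted(duplicate_group)
    return [f"rm {shlex.quote(p)}" for p in extras]
```

Emitting commands rather than deleting directly is the safety net: nothing is removed until you have reviewed the script and chosen to run it.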
As part of this process, I realized I could delete any of the consolidated files that were also part of my current, up-to-date working drive and backups. After all, I didn’t need copies of files already in my master working archive, so why not get rid of those too?
To do that, I copied the files from my current “master” drive (the drive where I access my photos when I’m working on them) to the drive I was using for consolidation. I put them in a folder labeled “a” and put the old backup copies into a folder named “z”, because I learned that IntegrityChecker uses the topmost directory to decide which duplicate to keep. By doing this, I could make IntegrityChecker delete the old files that matched my current files. At the end of the process, I could delete folder “a” and be left with only the files that did not exist on my current master drive.
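The “a”/“z” naming trick works because a plain sort puts paths under “a/” ahead of paths under “z/”. Here is a small sketch of that keep-the-topmost rule, with simple lexicographic ordering standing in for IC’s actual ranking (an assumption on my part; IC’s rule may differ in detail):

```python
def choose_keeper(duplicate_paths):
    """Rank bit-identical duplicates lexicographically, keep the first,
    and mark the rest for deletion. Naming the current-master folder 'a'
    and the old backups 'z' ensures the master copy always wins."""
    ranked = sorted(duplicate_paths)
    return ranked[0], ranked[1:]
```

Deleting folder “a” afterward then leaves only the old files that had no match on the master drive.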
This project let me distill terabytes of files down to about 300GB, which is a very manageable size to keep and maintain. I consider it a success: a dozen or so hard drives are out of my life and my space, and I still have an absolutely exact copy of every one of my files.
This process has worked for me, but be forewarned: IntegrityChecker is very powerful, and it is very easy to delete files you don’t intend to. You need to take the time to learn how it works and understand its behavior. I did a lot of testing to practice and understand it, and I am careful to think through the plan every time I use it, in addition to working only when I have a clear mind (always a good idea when doing big things with your data!).
If you have the same problems I do, I hope this gives you some ideas for how to solve them. Courteous questions are always welcome.