Too Many Backups! Here is how I consolidated my old archives.

What do you do when you have too many backups? That’s a problem I’ve been working on for the past two years, and I thought it might be valuable to put some of this in an article for others who face similar problems.

First of all, how many backups is too many? For me it was when I started stacking backup hard drives like cordwood. Since I started making 300MB film scans in the late 1990s, I’ve had a series of ever-larger drives holding my files. When I upgraded to a larger hard drive, I always kept the old drive “just in case”. This has led to a lot of drives. Just this week I found four drives I didn’t even remember I had, all from the 2007-2009 time frame. Last week I pulled files from some truly ancient drives that I hadn’t spun up in about 15 years.

Why did I keep all these drives? I wanted to preserve the ability to go back in time in case I found that a file had been corrupted at some point. The only way to fix a corrupted file is to go back to a copy made before it was corrupted and pull that file. That is the purpose these drives served.

Consolidating those ten or fifteen ancient, space-hogging drives onto one large drive seemed like a better option, and it would help me reduce some of my ever-growing clutter.

Adding to my problem of excess backups is how Carbon Copy Cloner (my backup software of choice) backs up files. CCC works in a simple way that is very safe, but that simplicity makes it easy to create duplicate files when you move folders around. So on top of my duplicate “archive” drives, I had several terabytes of CCC files to deal with.

My goal for this project was threefold. First, I wanted to consolidate all my archives on one drive. Second, I wanted to deduplicate (dedupe) those files to remove unnecessary copies so that they would take up the least space possible. Third, I needed to do all this in a way that ensured I copied every file precisely, bit for bit, and that any duplicates were truly duplicates with no differences at the bit level.

The tool that has made this possible is IntegrityChecker from Lloyd Chambers at https://diglloydtools.com and diglloyd.com. 

IntegrityChecker does a number of very interesting things. Foremost, it will compute a cryptographic hash for every file on a drive. This hash serves as a checksum that can show if a file has been altered in even the slightest way, down to the bit level. This is very useful when copying files to another drive to ensure they copied exactly. It also lets me compare files against their hashes at a later date to detect corruption. It does some other cool things too, as I’ll explain in a moment.
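To make the idea concrete, here is a rough sketch of what a checksum tool does under the hood: hash every file under a folder and save the results so they can be re-checked later. This is only my illustration of the concept in Python, not IntegrityChecker itself, and the folder and output file names are made up.

```python
import hashlib
import json
from pathlib import Path

def hash_file(path: Path) -> str:
    """Compute a SHA-256 hash of one file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its hash."""
    return {
        str(p.relative_to(root)): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Hash a hypothetical archive folder and save the results for later verification.
hashes = hash_tree(Path("/Volumes/Archive2009"))
Path("archive2009-hashes.json").write_text(json.dumps(hashes, indent=2))
```

Re-running the same hashing later and comparing against the saved results is what tells you whether anything has silently changed.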

Consolidation

So my consolidation process looked like this:

1. Use Carbon Copy Cloner to copy from my old drive to a folder on a new drive.

2. Use IntegrityChecker to compute hashes for both copies.

3. Use the “compare” function of IntegrityChecker to compare the copy to the original.

This process let me make a copy of the old drives with absolute assurance that I had copied every file correctly. In over 20TB of files copied for this project, I found only one file that did not copy correctly, for whatever reason. Not bad for pulling data off vintage hard drives.
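Step 3 is conceptually just a hash-by-hash comparison of the copy against the original, flagging anything missing or different. Here is a rough sketch of that comparison, reusing the hash_tree helper from the sketch above; the drive paths are made up, and IntegrityChecker’s own compare function does far more than this.

```python
from pathlib import Path

# Both paths are hypothetical; hash_tree() is the helper from the earlier sketch.
original = hash_tree(Path("/Volumes/OldDrive"))
copied = hash_tree(Path("/Volumes/Consolidated/OldDrive"))

for rel_path, digest in original.items():
    if rel_path not in copied:
        print(f"MISSING in copy: {rel_path}")
    elif copied[rel_path] != digest:
        print(f"MISMATCH (bad copy?): {rel_path}")
```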

Deduplication

Goal two was to dedupe the drive where I had consolidated all my archives and backups. IntegrityChecker helped with this too. IC can use the hashes it creates to look for duplicates. If a pair of hashes matches, you can be sure with an extremely high level of confidence that the two files are exactly the same. This is a much better way to identify duplicates than methods that rely on file size, name, and date, because those values will not detect bit-level differences caused by file corruption. IC can, so if IC says two files are duplicates, they really are.
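In principle, duplicate detection by hash boils down to grouping files by their digest and flagging any group with more than one member. Here is a rough sketch of that idea (again my own illustration, not how IntegrityChecker is implemented):

```python
from collections import defaultdict
from pathlib import Path

# hash_tree() is the helper from the earlier sketch; the volume path is hypothetical.
hashes = hash_tree(Path("/Volumes/Consolidated"))

by_digest = defaultdict(list)
for rel_path, digest in hashes.items():
    by_digest[digest].append(rel_path)

# Any digest shared by two or more files marks a set of true bit-level duplicates.
for digest, paths in by_digest.items():
    if len(paths) > 1:
        print(f"{len(paths)} copies: {paths}")
```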

IntegrityChecker lets you deal with dupes in two ways. First, you can use a unique feature of drives formatted with APFS on Macs to create a clone. When a clone is made, the two files are reduced to one at the disk level, but you will still see two files in the Finder. If you open one of these files and modify it, it becomes a separate copy again. Cloning files allows you to reclaim disk space from duplicates without messing up your directory structure. This is very safe, but it would not help me with some of my other goals, as you will see.
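On macOS you can make an APFS clone yourself from the command line with cp -c, which asks the filesystem to share the existing data blocks instead of writing a second copy. Here is a hedged sketch of replacing one file of a duplicate pair with a clone of the other; the file names are hypothetical, and this illustrates the filesystem feature, not how IntegrityChecker performs its cloning.

```python
import os
import subprocess

# A hypothetical pair of bit-identical duplicates found by hashing.
keep = "/Volumes/Consolidated/2009/scan-0421.tif"
dupe = "/Volumes/Consolidated/backup-2009/scan-0421.tif"

# Replace the duplicate with an APFS clone of the keeper. On macOS, cp's -c flag
# requests a clonefile(2) copy, so both paths share the same data blocks on disk
# until one of them is modified.
os.remove(dupe)
subprocess.run(["cp", "-c", keep, dupe], check=True)
```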

I decided to go a more aggressive route. I wanted to remove every duplicate file, so I used the “--emit rm” option to create a list of duplicate files along with the command-line code to erase them. This would remove them from the hard drive permanently, leaving only one copy.

Distillation

As part of this process, I realized I could also delete any of the consolidated files that were part of my current, up-to-date working drive and backups. After all, I didn’t need copies of files that were already in my master working archive, so why not get rid of those too?

To do that, I made a copy of the files from my current “master” drive (the drive where I access my photos when I’m working on them) and copied them to the drive I was using for consolidation. I put them in a folder labeled “a” and put the old backup copies into a folder named “z”, because I learned that IntegrityChecker will use the topmost directory to decide which duplicate to keep. By doing this, I could make IntegrityChecker delete the old files that matched my current files. At the end of the process, I could delete folder “a” and be left with only the files that did not exist on my current master drive.
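The effect I was relying on can be sketched like this: if the keeper in each duplicate group is the path that sorts first, then everything placed under “a” survives and its matches under “z” get slated for removal. This is only an illustration of the ordering trick under that assumption, not IntegrityChecker’s actual logic, and it builds on the by_digest grouping from the earlier sketch.

```python
# by_digest maps each hash to the list of file paths sharing it (see earlier sketch).
for digest, paths in by_digest.items():
    if len(paths) > 1:
        keeper, *extras = sorted(paths)  # paths under "a/..." sort before "z/..."
        for extra in extras:
            print(f"rm '{extra}'  # duplicate of {keeper}")
```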

This project let me distill terabytes of files down to about 300GB, which is a very manageable size to keep and maintain. I consider it a success to get a dozen or so hard drives out of my life and my space while ensuring that I have an absolutely exact copy of every one of my files.

This process has worked for me, but be forewarned: IntegrityChecker is very powerful, and it is very easy to delete files you don’t intend to. You need to take the time to learn how it works and understand its behavior. I did a lot of testing to practice and understand it, and I am careful to think through the plan every time I use it, in addition to working only when I have a clear mind (always a good idea when doing big things with your data!).

If you have the same problems I do, I hope this gives you some ideas for how to solve them. Courteous questions always welcome.

My hard drives didn’t let me down!

This is a follow up to my April project to check the integrity of my backups as I was moving my files to a larger hard drive. 

My objective was to make sure that every single file (about one million) copied exactly to the new drive, and that there were no errors that would prevent me from accessing my data. 

To do this I used a software app from Lloyd Chambers called IntegrityChecker, which is the most efficient tool I’ve found for this unique job. It’s a command-line tool that runs in the Mac Terminal. That in itself was a learning experience, as previously I’d been very afraid of how badly the wrong command in Terminal could muck things up.

Thanks to IntegrityChecker, I was able to confirm that my two main backup copies are exact duplicates of the “master” hard drive. That’s a very good thing, because it means I really do have a usable backup for when my main drive fails. (All drives fail; it’s just a matter of when.)

My secondary objective was to verify some bare drives I had been using in the past for backups. I had stopped using them because they were throwing errors in Carbon Copy Cloner. I suspected that these errors were due to the drive dock I was using them in, but had no way to be sure, so I didn’t trust them. They got shoved into a drawer and were just sitting there as “worst case” backups, a hail Mary play in case things ever got really ugly.

To try to bring these orphaned drives back into my active backup rotation, I put them into a known-good drive enclosure. Then, using IntegrityChecker, I was able to verify that every file on them matches my “master” and that the drives are trustworthy. That gives me confidence to use them again for backing up new data, and lets them be useful as part of my backup strategy.

The one thing that surprised me as I completed this project is that everything actually worked. Terabytes upon terabytes of data and multiple copies of a million files were hashed and read multiple times, and it all worked. Even digital photos from the mid 1990s were still there and readable. I think I found a dozen files that threw an error, but they were all still readable, so the errors were insignificant, and they were mostly XMP files. That has made me much more trusting of the process I use to back up my data. A sigh of relief, but I’ll still remain vigilant.

Another surprise was how many files I had duplicated on the drives. For a myriad of reasons, I had multiple folders with the same files that had built up over the last twenty-ish years of managing my archive. One terabyte of duplicates, to be precise. It would be a nightmare to reconcile all those files manually, but IntegrityChecker came to the rescue again. One of its functions allows you to identify duplicate files… that’s how I discovered the 1TB of duplicates in the first place.

But just as valuable was IntegrityChecker’s ability to “clone” the duplicates and regain that wasted space if you are using an APFS-formatted drive.

APFS is a format for storage drives used with a Mac. It’s designed for solid state drives, not spinning disks. It will work with a spinning hard drive, but it can cause a slowdown in transfer speed. That’s something I could tolerate for backups if it let me get back a terabyte of space, so one by one I converted my backups to APFS, re-verified that all the files would read back correctly, then used IntegrityChecker to “de-dupe” the drives and reclaim that 1TB of space.

The unexpected benefit of this de-duping is that I now have a whole new set of tricks up my sleeve to manage my storage more efficiently.

The end result is that I now know that every copy of my data is good, and I know how to check it as I go forward to ensure it stays good. This gives me more confidence that my files will be there when I need them, which was the whole point of this adventure… and something I wish I had done a lot sooner.

My next adventure is to move one of my offsite backups into the cloud using a Synology DiskStation and Backblaze cloud storage… more on that in a future post.

Until then, keep backing up those bits!

Hard Drive Costs Late January 2020

Current hard drive costs at a glance, with links to purchase from Amazon. I recommend Seagate hard drives because they continue to test as some of the longest-lasting drives at backblaze.com.

Highlights for January include a minor price increase on 6TB and 10TB external drives, as well as slight changes to internal drives as noted. The days of storage prices dropping quickly seem to be over as drive capacities become so large. Also of note is that 2TB external drives are now all “portable”, meaning they are 2.5″ laptop drives that are bus-powered. For my main storage I prefer external 3.5″ drives that are plugged into an external power source, so that means buying a 4TB drive or larger.

10TB external drives are still a big savings over 10TB internal drives. Also, on a cost-per-TB basis, 10TB drives are getting close enough to the sweet spot of pricing to make them attractive if you need that kind of storage. But I generally don’t recommend buying more than a year’s capacity at a time, to protect yourself from price changes. Also remember that a properly backed up “storage set” requires three drives, so buying more than you reasonably need (over-provisioning) can suck up a lot of money.

Sometimes external drives are less expensive than internal drives. Advanced users may want to explore “shucking” external drives to save money as the external drives are often, but not always, SATA drives that can be used as an internal drive.

EXTERNAL

2TB $59.99 ($30 per TB) 2.5″ USB-powered portable drive
4TB $89.99 ($22.50 per TB)
6TB $109.99 ($18.33 per TB) +$10 change
8TB $139.99 ($17.50 per TB)
10TB $199.99 ($20 per TB) +$20 change

INTERNAL

2TB $49.99 ($25 per TB)
4TB $79.99 ($19.99 per TB) -$10 change
6TB $131.99 ($22 per TB)
8TB $149.99 ($18.75 per TB)
10TB $252.98 ($25.29 per TB) +$12 change
12TB $327 ($27.25 per TB) +$15 change
14TB $439.99 ($31.40 per TB)
16TB $484.99 ($30.31 per TB) +$6 change

I’m an Amazon affiliate so I receive a small commission from each sale.

A Cheaper Storage Upgrade

Seagate 2TB External Drive

If you are sick of my articles on Drobo/NAS/DAS/RAID storage solutions because they are just overkill for your needs, you are in luck. I’m laid up with the flu, which is a perfect time to write up some different storage solutions, because it doesn’t require the same part of my brain that the creative photography content does.

I was talking with a friend yesterday about some upgrades for his Mac that was running slow, and we got around to his current storage shortage. (Yes, I have a lot of photographer friends, a side effect of this incurable disease I have called photography. 😉)

After he spent about $300 on a RAM and SSD boot drive upgrade for his 2015 iMac, the budget was tight for storage. He wanted to set up a new Storage Set that would be dedicated to RAW files and include his existing archive of 700GB of RAWs. (See my Freemium Backup and Storage Plan article for an explanation of what a Storage Set is.)

He settled on buying three 2TB external drives for a total cost of about $179. One would be the master, and two would be exact clones made with Carbon Copy Cloner. This would let him transfer his existing 700GB of RAWs to the new storage set and leave maybe a year’s worth of space for new RAWs from his 45MP camera. The $179 price is an easy bill to afford, and way less than film and processing used to cost, so even if it ends up being a little undersized, it gets him through until his high season for photo sales.

Putting all your RAW files on a separate drive is a great way to segment your data. Since these files will never be modified directly, the backup needs for that master volume are greatly minimized. Your edited versions can live on a volume set aside for more active files in the case of Photoshop, or in your catalog for DAM (digital asset management) programs like Lightroom.

So why not a RAID in this case? While RAID is very nice to have, it’s not always a need as long as you are very diligent about doing regular backups. This solution keeps the data safe and accessible for very little money.

My storage articles over the last few weeks weren’t meant to say you need RAID, but rather to explore what RAIDs do and how to manage them, based on my experience managing a lot of spinning disks in mirrored RAIDs and Synology NAS systems. I used to be able to heat my office with the three Mac servers and forty-odd hard drives West Coast Imaging required, so to say I’m very close to this subject is an understatement… lol.

Sometimes inexpensive solutions are the best solutions, and as I shared with my friend, there are always more things to spend money on in photography. Saving money for him means more days on the road having more adventures and making more photographs. So “just enough” is always the right size. Owning spinning disks is not our goal in life.

Drobo volume/partition size recommendations

A Drobo is like a regular external hard drive in that it can be split up into smaller volumes or partitions. Just because you have 20TB of storage doesn’t mean you actually want a 20TB volume.

I recommend basing your volume size on the size of the external drive you are using for backup, so that each volume can be easily backed up to a single external drive.

For example, if you are using 8TB drives for backup, you might want to make 6TB volumes on your Drobo. You want the backup to be bigger than the “master” volume on the Drobo because backup software can keep snapshots of old data, which allow you to go back in time if you accidentally delete, erase, or overwrite a file, or if it becomes corrupted. The more extra storage on your backup drive, the further “back in time” you can go (no DeLorean or flux capacitor needed).

This means you’ll likely have multiple volumes/partitions on your Drobo, and you should consider how you use them.

I recommend separating “hot” or frequently changed data from “cold” or seldom accessed data. Hot data is things you are regularly using, like your Lightroom catalog, your latest photo shoots, etc. Cold data might be older shoots that you are no longer accessing, as well as files that don’t need frequent modification. For example, “2019 Raw Captures” might get its own volume.

Really cold data could be moved to an external drive to free up space on the Drobo.

There are lots of ways to play this, but I’ll leave those details up to you.

How long do hard drives last?

Hard drives don’t last forever. Eventually the precision parts that let them rotate at thousands of RPM, with read heads that float just above the disk surface, wear out and fail. SSDs fail too, just in different ways. Modern technology has made drives so reliable that we can be lulled into thinking they won’t give us problems, but that is a false sense of security.

The truth is, an HDD or SSD can fail at any time. The best data we have, from online backup provider Backblaze, proves it. Having a new drive is no insurance; it can fail just as easily at 100 hours as it can at 20,000 hours.

Here’s the part where I remind you that luck is not a strategy for keeping your data safe, multiple backup copies are, before I return to the main subject.

So really you shouldn’t ever feel safe about a drive, and should have a disaster plan in place to fix things when, not if, they fail. Because given enough time, it is a when.

Drives record how long they have been powered up in internal logging called SMART data. Some drive cases allow us to read this data and see how many hours drives have been turned on. And while drives can die at any time, they become more likely to die as they log more hours.
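One common way to read those numbers is the smartctl command from the free smartmontools package. Here is a rough sketch of pulling the power-on hours in Python, assuming smartctl is installed and the drive shows up as /dev/disk2 (adjust both for your own setup; the exact output format also varies by drive and interface):

```python
import re
import subprocess

# Ask smartctl for the drive's SMART attributes and look for the power-on hours.
out = subprocess.run(
    ["smartctl", "-A", "/dev/disk2"], capture_output=True, text=True
).stdout

match = re.search(r"Power_On_Hours.*?(\d+)\s*$", out, re.MULTILINE)
if match:
    print(f"Power-on hours: {int(match.group(1))}")
else:
    print("Power-on hours not found; output format varies by drive and interface.")
```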

Knowing a drive’s age in hours can be useful. I feel pretty comfortable using an HDD for 20,000-25,000 hours, based on my experience maintaining the servers for my printing companies and my use of hundreds of drives.

25,000 hours ÷ 24 = 1,041 days, or about 2.85 years of continuous use.

But chances are you only need access to your data for a few hours each day, so you can make your drives last a lot longer by powering them down or having them spin down when not in use. Some drive cases, like Drobo and Synology, let you specify when drives should “sleep.” Others need the USB or power cable to be unplugged or switched off.

At 8 hours a day, 5 days a week, 52 weeks a year, you’ll log about 2,000 hours a year on your drives, and it will take you over ten years to get to 25,000 hours. By then we should have some Star Trek-like crystal storage technology, or at least a much cheaper cost per TB, and you’ll probably have replaced your current drives with much larger versions for convenience. Fewer hours spinning means less chance they will fail.
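If you want to check the math yourself, here’s the quick version (the usage pattern is just the example above):

```python
# How long 25,000 power-on hours lasts if a drive never spins down.
threshold_hours = 25_000
print(threshold_hours / 24)              # ~1,041 days
print(threshold_hours / 24 / 365)        # ~2.85 years of continuous use

# Light use: 8 hours a day, 5 days a week, 52 weeks a year.
hours_per_year = 8 * 5 * 52              # 2,080 hours per year
print(threshold_hours / hours_per_year)  # ~12 years to reach the threshold
```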

But assuming you really are using your drives a lot, once they approach 30,000 hours I like to rotate them out of my master-level storage and downgrade them to backup duty, where they will see fewer hours of use per year. While I’ve had drives work for over 50,000 hours, I wouldn’t want a drive that old to be used for anything but backup, for a host of reasons.

All this applies to HDDs, traditional hard drives with spinning disks. SSDs, or flash drives, have their own issues, not the least of which is losing data if not powered on regularly. As of yet, there are few ways to safely put your data on a shelf for a long time and just forget about it. Your data needs to be kept alive on fresh drives and properly backed up, and only you can do that.

So how long do hard drives last? Until they don’t! And thanks to Murphy’s law, that will be at the worst possible time.

Hard Drive Costs November 2019

Current hard drive costs at a glance, with links to purchase from Amazon. I recommend Seagate hard drives because they continue to test as some of the longest-lasting drives.

The highlight for November is that 10TB external drives are a big savings over 10TB internal drives. Also, on a cost-per-TB basis, 10TB drives are getting close enough to the sweet spot of pricing to make them attractive. But I generally don’t recommend buying more than a year’s capacity at a time, because 10TB drives could be $100 by next November, which would erase any “savings” from buying more than you need now. Also remember that a properly backed up “storage set” requires three drives, so buying more than you reasonably need (over-provisioning) can suck up a lot of money.

Sometimes external drives are less expensive than internal drives. Advanced users may want to explore “shucking” external drives to save money as the external drives are often, but not always, SATA drives that can be used as an internal drive.

EXTERNAL

2TB $59.99 ($30 per TB)
4TB $89.99 ($22.50 per TB)
6TB $99.99 ($16.60 per TB)
8TB $139.99 ($17.50 per TB)
10TB $179.99 ($18 per TB)

INTERNAL

2TB $49.99 ($25 per TB)
4TB $89.99 ($22.50 per TB)
6TB $131.99 ($22 per TB)
8TB $149.99 ($18.75 per TB)
10TB $249.99 ($25 per TB)
12TB $312.99 ($26 per TB)
14TB $439.99 ($31.40 per TB)
16TB $476.99 ($29.80 per TB)

I’m an Amazon affiliate so I receive a small commission from each sale.

Long Term Photo Storage on Glass

Digital photography has a big problem with long-lasting storage. Hard drives and SSDs fail and degrade with time, with a service life of 3-5 years in most cases before the drive fails or the data degrades, a fact I think most photographers are oblivious to because, for the most part, digital photography works well… until it doesn’t.

If you think you take storage seriously, you might want to compare your efforts to those of Warner Bros. They’ve partnered with Microsoft on Project Silica to create a truly archival form of digital storage. Engadget just did a great writeup on it:

https://www.engadget.com/2019/11/04/microsoft-archived-superman-project-silica/

Project Silica encodes digital data onto stable quartz glass that is resistant to many forms of damage and degradation. The article provides a fascinating look into the challenges of preserving both film and digital data. I hope we still photographers can someday reap the rewards of this technology.

Interestingly, glass has a long history in photography, as it provides a dimensionally stable base for film emulsion. Ansel Adams’ famous Monolith, the Face of Half Dome was made on a glass plate. Glass plates were widely used by astronomers for most of the 20th century because they allowed precise measurements of star positions. There is a certain poetic beauty in photography coming full circle as we again store our photos on glass.