So You Decided to Build a NAS - Storage Spaces War Stories

A quick article to spook future NAS owners and maintainers. I’m writing this mostly so that I can link to it instead of explaining it all again.

You’ve probably experimented a little with Intel RST on Windows, or bought an off-the-shelf solution such as a WD My Cloud or a multi-disk Synology unit. You know you can run a mirror or a stripe with your two HDDs, so you’ve ordered more. You want to take your data-hoarding game to a whole new level. Instead of running two 4TB drives in a mirror, you can order one extra drive, set the three up with parity and gain an additional 4TB of usable space while retaining protection! Brilliant! Regardless of your IT background, a NAS is available and accessible to anyone these days. You don’t need a sysadmin career to run one. But you do need to know at least a little to run it successfully, that is, without data loss. Remember this:

There are two types of data. Data that has been backed up and data that has yet to be lost.
RAID is not a backup.
If you depend on a backup to get your data, it is no longer a backup.

So now you have your 3 disks on the table. You’re building your new home NAS from consumer-grade hardware, using a gaming motherboard you had lying around from a build you retired a few years ago. There’s nothing wrong with this setup; it will work. Eventually, you’re going to set up your volume through Intel RST, or Windows Storage Spaces, or Windows RAID through the Disk Management console, or DrivePool, or TrueNAS, or some other means. Whichever suits you best. After you do that and move your data onto your new NAS, you’ll be left with the other storage media that held the data while the NAS was being built. “But hey,” you think, “I have my new NAS, I have parity, I don’t need this additional copy of my data. In fact, I can probably take that media, add it to my pool of disks and expand my little NAS operation here!”

The inevitable doom

Eventually, that NAS is going to fail. With parity protecting against one disk failure, two disks are going to fail. Or ransomware will encrypt the data. Or the PSU will crap out during a power outage and take some disks with it. Or a Windows update will make changes during a reboot that invalidate your array entirely. Or maybe Windows will glitch out, forget about its BitLocker volume and eventually tell you that your data is lost. Did you ever stop to think about everything that has to go right in order for you to access your data?

Some day, you or your users are not going to use the NAS the way you intended. They are not going to upload 4.3 MiB worth of PDFs, or download 12 GiB worth of 4K movie. No, instead, your friend is going to bring over a 14TB external drive to get a copy of your entire movie library, so they can watch it at home. You’ll plug it in, open your movie folder, press Ctrl+A and Ctrl+C, then Ctrl+V on the external drive. And you’ll leave the poor thing to its fate for several hours. Did you think about a cooling strategy for your disks when you were building your NAS? When disks overheat, their failure rate goes up rapidly.

Let’s imagine a more realistic scenario: eventually, your NAS is going to experience some issue and you’ll be left wondering what the safest way to recover is without losing your data. And as you’re sitting there, contemplating the options, you begin to regret not having a complete backup of your NAS. If you did have a complete backup, you could handle the issue with a lot more confidence, knowing that if anything you do results in unrecoverable data loss, you have the backup to go back to. This is what’s going to be going through your head when it happens. Depending on the value of the data to you or your users, you’ll be reevaluating whether having the backup was worth it.

And as you’re scrolling through yet another forum post about data recovery at 3:15 AM, you’ll finally arrive at the opinion that it would’ve been worth spending a little extra cash on that backup.

I am finally getting to the core point of this article. Running a disk array adds another layer of things that can fail. If you store all your data on independent disks, you can take each of them and use it in a different computer as-is. If one of them fails, you have lost some of your data, but not all of it. The other disks still work. If, however, you have migrated all your data to a single disk array, you’ve taken on a whole lot of additional responsibilities and limitations:

Disks are no longer independent. You need all of them or almost all of them in order to access any of your data.
You can’t move individual disks to different machines in order to access the data on them. You can only migrate the entire array, and only to systems that understand how to read it.
Each disk has platters with data. Above that is the operating system’s driver that communicates with the disk. Above that is the implementation that operates the volume(s) on the drive. Above that is a partition manager that declares how the data is actually organized. All of this has to work together flawlessly in order for your data to be read and written. If any of those layers fails, or hits a bug, your data is at risk. Now you came along and introduced another layer, a pool layer, to the topology. Also, your data is now stored in pieces across several devices, another change that can screw everything up if not managed properly.

Just because you’ve discovered the wonders of RAID-5 and parity doesn’t mean you’re protected against everything. In fact, the decision to manage a disk array comes with a lot of responsibility and work if it is to be done successfully. Even if you set everything up properly and the NAS works, did you prepare a step-by-step plan for what to do when a drive fails? Do you even know how to replace a failed drive in your specific setup? It isn’t as easy as unplugging the drive and replacing it with a new one.

Let’s say one of your drives does fail, so you go into your Storage Spaces management console and select the “Remove” option. After all, you want to properly remove the disk before you replace it with a new one. Bam. Error. Your little 3-disk parity array cannot remove a disk, because it can’t run a parity array with just 2 units! What now? If you’re running a 4-disk parity array, you might not be any better off. If your little operation is filled with data and sits at 80% of capacity, how are you going to store all that data on 3 units? You can’t, and Windows will tell you the drive cannot be removed. It simply needs 4 disks to store your data and its parity. What now? If you tried this when you first set up the Storage Spaces volume, it probably worked, because you didn’t have any data on it. Now you do, and the same approach doesn’t work anymore.

You get nervous. You dive into the NAS drive bay to pull the broken drive out. That can’t fail, no error message can stop you from doing that. It’s already broken; the array cannot get more broken if you remove the failed drive. You open the chassis and find out you actually don’t know which drive has failed. They all look the same, they all sit right next to each other, and even though you can read the failed drive’s serial number off the screen somewhere, the way the drives are stacked in the case means you can’t read the stickers that carry the serial numbers. Now you don’t know which one to pull and you begin to feel desperate. It’s long after midnight at this point, you’ve been at this for 10 hours straight and your entire digital wealth is still at risk.

You power down the NAS, then unplug and unmount every disk until you find the one whose sticker matches the serial number from the screen. You plug everything back in, maybe even replace the failed disk, boot up and begin planning how to properly introduce the new disk to the pool and retire the now-missing broken one. Your train of thought is suddenly interrupted, because your Storage Space, or whatever other tool you manage your array with, is telling you that your array is no longer degraded. It’s now unhealthy, failed, broken. Lost. During your hasty unplugging and plugging of drives, you managed to leave one of the drives disconnected from SATA power. Or the connector wasn’t seated properly. Or the SATA cable just decided to fail. It doesn’t matter now: your entire array is lost, and the data can no longer be recovered. You’ve lost 2 units in an array protected against 1 unit failing. My point?
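Before I get to it, here is why Windows refuses that “Remove” click. The arithmetic is simple enough to sketch in a few lines of Python. Treat this as a simplified model, not as how Storage Spaces actually decides: it assumes single parity behaves like textbook RAID-5 (one disk’s worth of capacity goes to parity), that all disks are the same size, and that three disks is the minimum for parity. Real Storage Spaces layouts also depend on column count and other settings, and the function names are made up for illustration.

```python
MIN_PARITY_DISKS = 3  # assumption: single parity needs at least three disks


def usable_tb(disk_count: int, disk_tb: float) -> float:
    """Usable space of a single-parity array in an idealised RAID-5 model."""
    if disk_count < MIN_PARITY_DISKS:
        raise ValueError("single parity needs at least 3 disks")
    # One disk's worth of capacity is spent on parity.
    return (disk_count - 1) * disk_tb


def can_remove_one_disk(disk_count: int, disk_tb: float, used_fraction: float) -> bool:
    """True if the stored data would still fit after permanently removing one disk."""
    remaining = disk_count - 1
    if remaining < MIN_PARITY_DISKS:
        # A 3-disk parity array cannot shrink to 2 disks at all.
        return False
    stored_tb = used_fraction * usable_tb(disk_count, disk_tb)
    return stored_tb <= usable_tb(remaining, disk_tb)
```

With the numbers from the story, `can_remove_one_disk(3, 4, 0.1)` is false because two disks cannot hold a parity layout at all, and `can_remove_one_disk(4, 4, 0.8)` is false because 9.6TB of data does not fit into the 8TB that three 4TB disks would offer in this model.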

Redundancy is not a backup

When I originally built my home NAS, I thought it was pretty well built. It had 4 disks in RAID-5, good performance, and it had all of my data on it. Movies, projects, music, documents, everything. All was well, until it wasn’t. Any time I was in doubt about my array’s health, I was worried sick. If anything went wrong, I would lose everything. Redundancy against 1 disk failure gives you just that, nothing more. It doesn’t protect against the machine going up in flames from a faulty PSU. It doesn’t protect against the motherboard dying and taking the array with it if you set up RAID through the board’s BIOS. It doesn’t protect against you accidentally deleting the data. It gives you nothing but a way out of one specific scenario: a single disk has failed, the failure was successfully recognized, the array was marked as degraded and you were notified.

If your data is not worth anything, then it’s not worth building an array around it that you need to manage, maintain and worry about. Simply store the data on independent disks and swallow the annoyance of having the data in two places. If your data is worth something, back it up. The backup has to be completely independent of the NAS computer. There is an industry-standard method for this, the 3-2-1 rule (three copies of your data, on two different types of media, with one copy off-site), and you should follow it, but the main message I want to drive home is that you need to back up your NAS if the data it stores has any value at all. If you’re stuck at a recovery screen at 3 AM, you want to have the option to say: “Fuck this, I’ll do this in the morning.” With an existing backup, you will get a good night’s sleep and you will see things more clearly the next day. If you don’t have the backup, you won’t sleep well, and the moment when you inevitably fuck up your array permanently will just be delayed to a later date.
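If you like checklists, the 3-2-1 rule (three copies, two different types of media, one copy off-site) fits in a few lines of Python. This is only a sketch of the rule as it is commonly stated; the `DataCopy` class and the media labels are hypothetical names I picked for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One complete copy of the data (hypothetical helper type)."""
    name: str
    medium: str    # e.g. "nas", "external-hdd", "cloud"
    offsite: bool  # stored somewhere other than next to the NAS


def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """Check the 3-2-1 rule: 3 copies, 2 kinds of media, 1 copy off-site."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


# A parity array is still just ONE copy, no matter how many disks it spans.
only_nas = [DataCopy("main NAS", "nas", offsite=False)]

full_setup = [
    DataCopy("main NAS", "nas", offsite=False),
    DataCopy("weekly external drive", "external-hdd", offsite=False),
    DataCopy("copy at a friend's place", "external-hdd", offsite=True),
]
```

Here `satisfies_3_2_1(only_nas)` is false and `satisfies_3_2_1(full_setup)` is true. The important modelling choice is that the whole array counts as a single entry: parity adds disks, not copies.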

The backup

It’s pricey. You’ve thrown together a functional NAS PC from parts you had lying around and you bought a used disk on eBay, so now you have a working setup with 3 disks in a parity configuration for all of the 40 bucks you spent on the disk. Some schmuck on the internet is now telling you that you need to build another PC, with all the parts and additional disk(s), just to run the backup. Your original budget of 40 bucks is now dwarfed by the hundreds of bucks required to build the backup unit. This is outrageous! Unnecessary even! Yeah, you do you, until you find yourself tearing your hair out at 3 AM. Then you’ll think the couple hundred bucks would have been worth it to avoid this mess.

I ran an external HDD for backups as long as I could. It’s cheap, it’s independent, it checks all the boxes. Once a week I would plug it in and leave it to its duty. As my data array grew, I eventually had to make cuts in what was being backed up. The movie and TV library did not make the cut. After all, I can always download those again. Later on, even the VM folder had to be excluded due to its growing size. And when a drive failed later on, I found out that my “backup” did not contain even half of the data I needed to recover. By then I had spent enough hours curating my movie library that, converted at an hourly wage, they would have paid for the additional HDDs.

And so I eventually did the inevitable. I bought a second NAS and now run two completely independent units, just for the sake of backing up the first one to the second. Did anything happen to the main NAS since I started doing that? Oh yes. SATA cables failed, drives failed, Windows corrupted itself, a USB port on the case died, I accidentally deleted data that I needed later, VeraCrypt containers got corrupted from being incorrectly unmounted during a power outage… A lot of stuff happened and I didn’t have to worry once. Everything is backed up, and the worst-case scenario is that the backup will be 5 days old. I don’t think I ever actually needed the backups on the secondary NAS, but I remember fixing whatever was broken on the primary NAS every single time with a smile on my face and a cup of coffee in my hand. I went to bed at a reasonable time and I was never worried about losing dozens of terabytes of data I spent my entire lifetime generating, managing and curating.
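The slow erosion I described, where one folder after another drops out of the backup, is easy to miss until the day you need it. A minimal sketch of how you could keep an eye on it, assuming you know roughly how big each top-level folder is and which ones your backup job skips. The folder names and sizes below are invented for the example, not my real numbers.

```python
def backup_coverage(folder_sizes_tb: dict[str, float], excluded: set[str]) -> float:
    """Fraction of the stored data (by size) that the backup actually contains."""
    total = sum(folder_sizes_tb.values())
    if total == 0:
        return 1.0  # nothing stored, so nothing is missing from the backup
    covered = sum(
        size for name, size in folder_sizes_tb.items() if name not in excluded
    )
    return covered / total


# Hypothetical NAS contents in TB.
folders = {"documents": 1.0, "projects": 2.0, "movies": 8.0, "vms": 3.0}

# Backing up everything except movies and VMs covers 3 of 14 TB.
coverage = backup_coverage(folders, excluded={"movies", "vms"})
```

In this made-up example the “backup” holds a bit over a fifth of what is on the NAS. If that number surprises you, it is better to be surprised now than during a recovery.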

Invest in a solid backup solution. Include the cost of a backup solution in the budget for your first NAS. Right from the start, back up everything. Do not underestimate this: you will inevitably need that backup, and you will thank me for the advice.
