I wouldn’t dare touch a production NAS server without its data being backed up. This gave me a headache a few years back when I built the original NAS. All was great until I found out I had no place to back the data up to if I needed to do maintenance on the machine. It eventually all came crashing down in the data disaster of 2020. People asking me for advice on building their first home NAS often overlook this: one day that thing will crap out and you need to have backups. How do you back up a medium that holds more data than any consumer hard drive available today?
Back then, I realized the best way to protect my data moving forward was to stick to the 3-2-1 backup scheme: three copies of the data, on two different types of media, with one copy off-site. Given that my NAS held about 14TB of data at the time, I did not have the option to run out and buy a large-capacity external HDD to back all of it onto.
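For readers who have not met the 3-2-1 rule before, here is a minimal Python sketch of what it checks. The `Copy` class, the media labels and the example layout are all hypothetical and are not a description of my own setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str      # hypothetical label for where the copy lives
    medium: str    # e.g. "nas-hdd", "external-hdd", "cloud"
    offsite: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies):
    """Return True if there are 3+ copies, on 2+ media types, with 1+ off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical layout for illustration only.
example = [
    Copy("primary NAS", "nas-hdd", False),
    Copy("secondary NAS", "nas-hdd", False),
    Copy("off-site copy", "cloud", True),
]
print(satisfies_3_2_1(example))
```

Dropping any one of the three example copies makes the check fail, which is the whole point of the rule.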
Alternatively, I could’ve bought some consumer-grade solution, such as Synology or ASUS MyCloud with multiple disks. I opted not to do that. I used to own a D-Link DNS-320 back in the day with two drives in its bays set up as a mirror. This thing worked well for all of about 4 months, while I slowly moved more and more important data onto it as it gained my trust. Then it crapped out and kept crapping out for the rest of its lifespan. It would keep reporting issues with the mirror and it would keep rebuilding, to the point that I had to shut it down in the evenings because it would wake me up with its poor, tiny, cheap, bearing-less fan. Even though the disks had clean S.M.A.R.T. tables, D-Link’s NAS would never be satisfied and would continuously rebuild its little array. D-Link’s support would not accept it for a warranty claim, nor would the seller. I tried replacing the disks, upgrading from 2x500G to 2x1000G drives, but it didn’t change anything. Figuring out what was wrong was impossible, because consumer-grade NAS units use closed-source *nix distros as their operating system and don’t allow users to fiddle with them. I never lost any data I put on that NAS, but I didn’t trust it, and it didn’t provide me with any value since the mirror was always being rebuilt and thus didn’t provide any actual resiliency. Given this experience, I didn’t want to throw large amounts of money at another one of those a few years later. Fool me once…
The last option I could think of was to build another, secondary NAS. For practical reasons, I chose the same case model, so I have two identical boxes neatly stacked in the corner of my living room, and they also serve as a conversation topic for many of my guests. People do not typically stack ~130TB of hard drives in two machines in their living rooms, but when they find out what the machines do and why there is more than one, it gives them something to think about. Anyway, here comes our strategy for extended NAS maintenance without data loss, or even the risk of data loss.
In my last article, I talked about how I upgraded my secondary NAS to TrueNAS, a different redundancy scheme, and much more. I could afford to do a complete reinstall and reconfiguration because this NAS is only ever used as a secondary backup warehouse. The data it holds is also present elsewhere, so deleting all of its contents and doing a complete rebuild doesn’t put me at any risk. That’s all been done now and this NAS is a happy little camper. Did I start it up and start moving data onto it right away? No.
Never hurry
TrueNAS is a tool I had never used before and, what’s more, it’s open-source and free. That rarely ever translates to “plug and play.” With open-source, free software, it’s always “some assembly required,” but I have to admit, I did not encounter any bugs in the time I’ve operated it. All I encountered was missing knowledge on my part. Before I decided to trust this thing with my data, I drove it around for a while. I played with TrueNAS for about a week, trying out its Jails, its VMs and its ACLs. I went as far as using the broken AKASA slim SATA cables to test its ability to sense intermittent issues with disks. I tried pulling 3 disks from a raidz2 pool (which tolerates two disk failures) to see what a broken pool looks like. I tried to import an unknown TrueNAS pool back into it, and much more. I don’t want to experience an issue down the road and only start learning how to resolve it when it happens. That’s too late, and I would be much more prone to accidentally causing catastrophic data loss by doing something dumb. TrueNAS actually handles these situations quite well: it provides email notifications (something Storage Spaces still does not offer many years after release), it periodically runs S.M.A.R.T. tests, and ZFS itself has many features to prevent data loss. I was very surprised by how effortless this week was. TrueNAS will not hold my hand when managing a degraded pool, but the TrueNAS community will. Their forum is surprisingly responsive and competent. I was able to understand any feature and how it works within one evening’s study, so figuring out how I want to run my file system ACLs, my share ACLs, pools, vdevs, jails and plugins was all a breeze. I eventually decided the box had passed its trial period and was fit for service, and I deployed it back into its living room corner. My main NAS was able to successfully run its first regular backup onto it and I was able to move on to my next step.
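The reason pulling a third disk breaks a raidz2 pool can be shown with a tiny Python sketch. It is a simplified model of a single raidz vdev, and the state names are my own labels rather than the exact status strings ZFS reports.

```python
def raidz_vdev_state(parity: int, missing_disks: int) -> str:
    """Classify a single raidz vdev by how many of its disks are missing.

    parity is 1 for raidz1, 2 for raidz2, 3 for raidz3. The returned labels
    are illustrative and are not the literal output of any ZFS tool.
    """
    if parity < 1 or missing_disks < 0:
        raise ValueError("parity must be >= 1 and missing_disks >= 0")
    if missing_disks == 0:
        return "healthy"
    if missing_disks <= parity:
        return "degraded"      # data still readable, redundancy reduced or gone
    return "unavailable"       # more disks lost than parity can reconstruct


# raidz2: two missing disks are survivable, the third one is not.
for missing in range(4):
    print(missing, raidz_vdev_state(parity=2, missing_disks=missing))
```

With two disks missing the vdev still serves data but has no redundancy left, which is exactly the situation I wanted to see and practice on before it happens for real.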
Primary NAS maintenance
Now that the secondary NAS is back in service after a week of downtime, I can start thinking about touching my primary NAS and planning some maintenance. The most important thing it holds is all of my data, so that needs to be backed up first in its entirety. The secondary NAS will, temporarily, not only hold redundant backup copies, but will actually hold all of my data. That’s no joke, and it is also why I ran TrueNAS for a week, trying to punish it every which way, so that I can trust its ability to not crap out in the short period during which I will depend on it for all the data I have.
What kind of maintenance are we talking about here? Will I also be moving my main NAS to TrueNAS? Maybe in the future, but not today. Changing the platform of the main user-facing server is not something I have the confidence to execute. The primary NAS is running Windows 10 Pro and although this thing is painfully unfriendly to server use, it works. The purpose of the maintenance is as follows:
- Scale down the HDD pool. After two years I found out the machine will be completely sufficient with six 10TB drives instead of nine. That will improve airflow and temperatures, as well as remove the need for the AliExpress Marvell SATA controller that provides the additional SATA connections.
- Upgrade the GPU. The GTX960 4G was a reliable workhorse, but it can’t cope with the load my current user base is putting on it. People using Plex want to watch 4K streams on their TVs or 1080p streams on their tablets, and this poor old gaming card is simply not up to the task anymore. I would’ve upgraded much sooner, but the current GPU climate doesn’t allow for any reasonable pricing options. I ultimately gave up and bought an RTX2060 6G rev. 2 for about $500 in my country. It only works out to that price because I can claim the VAT back and I used customer points with the retailer.
- Upgrade the OS drive from a SATA SSD to an NVMe M.2 SSD.
- Clean up all the dust, check that all of the AKASA slim SATA cables have been replaced, and make sure all the fans are in working order.
Should everything work out as planned, the Storage Spaces pool will not have to be rebuilt and there will be no data loss on the primary NAS that would require me to move the data back from the secondary NAS’ backups. Since we’re keeping Windows as the OS, I can just remove 3 HDDs from the pool and have Storage Spaces take care of it. During HDD removal, Storage Spaces will simply move the data from the drive being removed onto the remaining drives and I will never even lose redundancy during the procedure. Neat!
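Shrinking the pool only works if the remaining drives have room for everything. A rough Python sanity check of that arithmetic could look like the sketch below. It is a simplified model and not how Storage Spaces actually accounts for space: the `used_raw_tb` figure (raw space consumed including redundancy) and the 10% headroom are assumptions made up for illustration.

```python
def removal_plan(disk_tb, disks_now, disks_to_remove, used_raw_tb, headroom=0.10):
    """Return one (disks_left, usable_raw_tb, fits) tuple per single-disk removal.

    used_raw_tb is the raw space consumed in the pool, redundancy included.
    headroom is the fraction of raw capacity deliberately left free.
    """
    steps = []
    for removed in range(1, disks_to_remove + 1):
        disks_left = disks_now - removed
        usable_raw_tb = disks_left * disk_tb * (1 - headroom)
        steps.append((disks_left, usable_raw_tb, used_raw_tb <= usable_raw_tb))
    return steps


# Nine 10TB drives going down to six; 50TB of raw usage is a made-up number.
for disks_left, usable, fits in removal_plan(10, 9, 3, used_raw_tb=50):
    status = "OK" if fits else "NOT ENOUGH SPACE"
    print(f"{disks_left} disks left: {usable:.0f} TB usable raw -> {status}")
```

If any step reports that the data does not fit, that disk should stay in the pool until enough data has been cleaned up or moved elsewhere.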
The procedure then is as follows:
- Copy all of the data from the primary NAS to the secondary NAS in order to have effortless access to the backed-up data at all times. If the primary NAS burns up in a fire, I will not have lost any data. This step doesn’t require downtime and the primary NAS stays fully accessible to users.
- Start removing the desired HDDs from the Storage Spaces pool. This will require a 3-step approach, because we’re removing 3 disks one at a time. It will take some time, but it also doesn’t require downtime. The primary NAS will be accessible to all users at this point.
- Take the primary NAS offline and perform the maintenance and upgrades. This does constitute downtime. I expect to communicate this to my users at least a week in advance and to perform the maintenance and upgrades during low-usage hours.
- After the maintenance is done, power the box back on and make sure everything works. The downtime schedule will have at least 1 hour reserved for this step alone in case I encounter any issues.
Since I was able to extensively punish TrueNAS and get familiar with it in the week before, I am confident in its ability to handle the load. During the data copy, speeds are holding steady at ~125MB/s, which is about the limit of a 1Gbps network connection. I did not have to go through the command-line ordeal with TrueNAS that I had to go through with Storage Spaces. Once the whole upgrade is finished, I expect I’ll write another post about it.
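To put that speed into perspective, here is the back-of-the-envelope math as a small Python sketch. The 14TB figure is simply the amount my NAS held a few years back, used here as an illustrative input, and the calculation ignores protocol overhead and small-file slowdowns.

```python
def line_rate_mb_per_s(link_gbps):
    """Raw line rate in MB/s: 1 Gbps = 1000 Mbit/s = 125 MB/s (8 bits per byte)."""
    return link_gbps * 1000 / 8


def copy_time_hours(data_tb, rate_mb_per_s):
    """Hours needed to move data_tb terabytes (decimal) at a sustained rate."""
    data_mb = data_tb * 1_000_000
    return data_mb / rate_mb_per_s / 3600


rate = line_rate_mb_per_s(1)          # 125.0 MB/s on a 1Gbps link
hours = copy_time_hours(14, rate)     # illustrative 14TB data set
print(f"{rate:.0f} MB/s -> about {hours:.1f} hours for 14 TB")
```

In other words, a full copy over gigabit Ethernet is a matter of days rather than hours once the data set grows, which is why this step has to run without downtime.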