This is a bit of a long one, so I’ll put this first.
Opinion #1: OmniOS and Napp-IT are nice, but there’s a good chance they’re not worth the headache.
There is little to no support, and even for some enterprise hardware that is only around 4 years old, the driver support isn’t there.
Opinion #2: Don't buy 10TB Seagate IronWolf drives. They will probably not work in most ZFS setups, unless you are running ZFS on Linux. I’m pretty sure both the regular and Pro versions have broken firmware when used in ZFS pools.
FreeNAS and OmniOS are both affected with no real resolution in sight, but at least so far, Linux seems stable. And Seagate support isn’t much help.
A few months ago, after a few weeks of testing, I migrated all my ZFS pools over to OmniOS CE using Napp-IT for a GUI.
As a long-time ZFS user, I’ve grown to loathe FreeNAS, and the big ZFS GUI for Linux is in OMV, and that’s not great.
For the most part, I’ve been reasonably satisfied with the OmniOS setup. My biggest complaint (prior to my problems) is that Solaris is just so different from Linux/FreeBSD. There’s just not a lot of documentation, so any minor issues take a lot of trudging to figure out.
I really do love a few things about it. The way NFS and SMB are configured is glorious compared to something like FreeNAS. And COMSTAR is an incredible piece of software for managing iSCSI.
A major annoyance is that it will sometimes fail to complete booting, necessitating a reboot. Apparently this is an ancient Solaris bug that’s still hanging out in illumos.
So a simple reboot for a software update often results in needing to reboot two or three times. And this happened both in VM installs that I’ve tried and on multiple different hardware platforms.
Other than that, all was fine until about a month ago when I added four 10TB Seagate IronWolf drives as two additional mirrored vdevs. Specifically model number ST10000VN0004, but some searching tells me the 10TB IronWolf Pro is probably impacted too.
Four brand new drives, and within a few days they ALL started to throw timeouts, which eventually resulted in them getting offlined and kicked out of the pool until a reboot. I came dangerously close to losing the whole pool more than once.
Fortunately, the last OmniOS CE update introduced a new killer feature, vdev removal. Using this feature, I was able to remove the two new vdevs and get my pool back to stability. Or so I thought.
The problem was that my pool was now anything but stable. Random checksum errors, transport errors on the drives, basically everything scary that you don’t want to see. To rule out hardware, here’s what I’ve swapped out and checked so far, with no change:
- Three different HBAs, one brand new
- Different SFF-8088 cables
- Different SA120s and enclosures with breakout cables
- This is an array of 10TB Seagate Enterprise drives and a mix of 6/8TB IronWolf drives, and every one of them passes SMART. The errors aren’t even tied to specific drives, just to the pool in general.
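For anyone chasing the same kind of errors, here is a rough sketch of pulling the per-device READ/WRITE/CKSUM counters out of `zpool status` text so they can be tracked over time. The sample output is made up (pool name, device names, and counts are all hypothetical), and the parser assumes plain integer counters; large counts may be abbreviated in real output, which this does not handle.

```python
import re

# Hypothetical sample in the general shape of `zpool status` output.
# The pool name, device names, and error counts are invented for illustration.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     7
          mirror-1  DEGRADED     0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  FAULTED      3    12     0

errors: No known data errors
"""

# A config row: indented name, upper-case state, then three integer counters.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for every row that is
    not ONLINE or has at least one non-zero error counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank, and "pool:"/"state:" lines fall through here
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE_STATUS):
        print(row)
```

To run it against a live system, the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout` instead of the sample string.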
So I have two issues. The unstable original pool and the broken new 10TB Seagates.
In trying to troubleshoot the Seagates, I hooked them up to different SAS expanders, HBAs, even directly to the SATA ports on the motherboard. The dropouts continued under every ZFS implementation except Linux. And since I’m not using them in a “supported” NAS, there’s not any help from Seagate Support.
Note that I’ve been using almost exclusively IronWolf drives for years, 6/8TB Pro/non-Pro, and I’ll say they’ve been some of the most reliable drives I’ve ever used. So it’s disappointing that Seagate support won’t do anything with the 10TB drives when it’s clearly a firmware problem. I’m running almost half a petabyte of Seagate storage, and these drives are the ONLY ones that have issues.
There are a few posts about these 10TB drives on the FreeNAS forums and on Reddit. The fix currently is to buy Western Digital drives (ewww), or run ZoL.
Anyway, after almost a month of messing around with it (and wasting a pile of cash on different HBAs, expanders and cables, to try and troubleshoot both problems), I’ve basically come to the conclusion that the vdev device removal is buggy and triggered all my stability issues.
So I did what any sane person would do.
I grabbed a spare R330 SFF from production, loaded it up with a bunch of SSDs for VM storage, a brand new Intel X540 10Gb card, and a brand new 9207-8e. All with the intention of migrating my data over to a whole new pool, before giving the loaner hardware back.
One word. FAIL. With OmniOS, and the H330 that is in the R330, any attempt to migrate VMs off the server results in the “mr_sas” driver causing a kernel panic. This is a year-old bug that’s been reported with no fix in sight.
So now I had a terabyte of VMs stuck on a datastore and I couldn’t get them off. The second a migration in vCenter was attempted, the box kernel-panicked.
After three days of working on it, I finally figured out I could copy them manually to a single-spinner 2TB datastore over a 1Gbps link, which kept the load low enough that it didn’t trigger the kernel panic. Almost a literal week later...
So here I am. I have:
- a previously rock-solid pool of drives exhibiting failure symptoms on OmniOS CE
- Four brand new 10TB drives connected to the loaner R330 that have been running a test pool on Ubuntu. I’ve been subjecting this pool to load for the last 5 days with precisely zero dropouts.
- Four gallon freezer bags full of SAS/SATA cards/expanders/adapters/parts/cables purchased in the last month.
That leads me to the simple conclusion that it’s time to go back to ZFS on Linux. At this point I’m stuck with the new 10TB drives, and I need them to be stable.
This morning, I started the migration back. Unfortunately the OmniOS CE ZFS version has a ton of feature flags that haven't/possibly won't make it back to Linux and FreeNAS.
Because of vdev removal on the source pool, I’m able to copy a bunch of data, remove a mirror from the original pool, and add that mirror to the new pool on Linux.
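Roughly, one round of that shuffle looks like the sketch below. It only builds the command strings for review and runs nothing. The pool, dataset, snapshot, host, vdev, and disk names are all placeholders, and using `zfs send` piped over SSH into `zfs receive` for the copy step is an assumption; any copy method would do. Whether a given send stream is accepted on the Linux side depends on the feature flags involved, which this sketch does not check.

```python
import shlex


def migration_round(src_pool, dst_pool, dst_host, dataset, snap, src_vdev, disks):
    """Build the shell commands for one copy / remove / add round.

    Every argument is a caller-supplied placeholder. Nothing is executed;
    the commands are returned as strings so they can be read over first.
    """
    q = shlex.quote
    snapshot = f"{src_pool}/{dataset}@{snap}"
    target = f"{dst_pool}/{dataset}"
    return [
        # 1. Snapshot the dataset (and its children) on the old pool.
        f"zfs snapshot -r {q(snapshot)}",
        # 2. Stream it to the new pool on the Linux box without mounting it.
        f"zfs send -R {q(snapshot)} | ssh {q(dst_host)} zfs receive -u {q(target)}",
        # 3. Remove one mirror from the old pool. The evacuation runs in the
        #    background, so wait until `zpool status` shows it has finished
        #    before pulling the disks.
        f"zpool remove {q(src_pool)} {q(src_vdev)}",
        # 4. After moving the disks, add them to the new pool as a mirror
        #    (run this one on the Linux side).
        "zpool add " + " ".join([q(dst_pool), "mirror"] + [q(d) for d in disks]),
    ]


if __name__ == "__main__":
    # All names below are hypothetical.
    for cmd in migration_round(
        "tank", "newtank", "linuxbox", "media", "migrate1", "mirror-5", ["sdx", "sdy"]
    ):
        print(cmd)
```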
This is almost a 300TB pool, so unfortunately this is the only reliable way to migrate the pool without investing in a ton more drives. The downside is that it's going to greatly unbalance the disks in the pool. The early ones are going to have most of the data on them, which will kill some performance.
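To get a feel for how lopsided that ends up, here is a toy simulation. It assumes each batch of copied data is spread across the mirrors present at the time in proportion to their free space, which is only a rough stand-in for what the ZFS allocator really does. The sizes and batch amount are made-up numbers, not my actual pool.

```python
def simulate_fill(initial_vdevs_tb, later_vdevs_tb, batch_tb):
    """Copy one batch of data before each later mirror is added.

    Returns (sizes, used): the size of every mirror in the new pool and
    how much data landed on each, in TB.
    """
    sizes = list(initial_vdevs_tb)
    used = [0.0] * len(sizes)
    for new_vdev_tb in later_vdevs_tb:
        free = [s - u for s, u in zip(sizes, used)]
        total_free = sum(free)
        if batch_tb > total_free:
            raise ValueError("batch does not fit in the mirrors present so far")
        # Assumption: writes are split in proportion to each mirror's free space.
        for i, f in enumerate(free):
            used[i] += batch_tb * f / total_free
        # The mirror freed from the old pool joins the new pool empty.
        sizes.append(new_vdev_tb)
        used.append(0.0)
    return sizes, used


if __name__ == "__main__":
    # Hypothetical: start with two 10TB mirrors, then add three 8TB mirrors,
    # copying 6TB before each one arrives.
    sizes, used = simulate_fill([10, 10], [8, 8, 8], 6)
    for i, (s, u) in enumerate(zip(sizes, used)):
        print(f"mirror-{i}: {u:.1f} of {s} TB used ({100 * u / s:.0f}%)")
```

In this model the first mirrors carry most of the data and the last one added sits empty until new writes arrive, which is the imbalance I’m expecting.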
Since I’m running the new server on bare metal Ubuntu, I’m probably going to end up writing at least some sort of simple reporting GUI for the server.