
The ZFS walk-of-shame with Seagate and OmniOS CE.

Posted on July 7, 2018 (updated August 29, 2019) by Kroy

This is a bit of a long one, so I’ll put this first.

The TL;DR

Opinion #1: OmniOS and Napp-IT are nice, but there’s a good chance they’re not worth the headache.

There is little to no support, and even with enterprise hardware that’s only about four years old, the driver support isn’t there.

Opinion #2: Don’t buy 10TB Seagate IronWolf drives. They will probably not work in most ZFS setups, unless you are running ZFS on Linux. I’m pretty sure both the regular and Pro versions have broken firmware when used in ZFS pools.

FreeNAS and OmniOS are both affected with no real resolution in sight, but at least so far, Linux seems stable. And Seagate support isn’t much help.


A few months ago, after a few weeks of testing, I migrated all my ZFS pools over to OmniOS CE, using Napp-IT as a GUI.

As a long-time ZFS user, I’ve grown to loathe FreeNAS, and the main ZFS GUI option on Linux is OMV, which isn’t great either.

For the most part, I’ve been reasonably satisfied with the OmniOS setup. My biggest complaint (prior to my problems) is that Solaris is just so different from Linux/FreeBSD. There’s just not a lot of documentation, so any minor issues take a lot of trudging to figure out.

I really do love a few things about it. The way NFS and SMB shares are configured is glorious compared to something like FreeNAS, and Comstar is an incredible piece of software for managing iSCSI.
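To give a feel for why, here’s roughly what the CLI side looks like on OmniOS, from memory. The dataset and zvol names are made up, and Napp-IT wraps most of this in its GUI anyway:

    # Share a dataset over NFS and SMB with nothing but ZFS properties
    zfs set sharenfs=on tank/media
    zfs set sharesmb=name=media tank/media

    # Comstar iSCSI: enable the target service, carve a zvol,
    # register it as a logical unit, and expose a target
    svcadm enable -s network/iscsi/target
    zfs create -V 500G tank/vm-lun
    stmfadm create-lu /dev/zvol/rdsk/tank/vm-lun
    stmfadm add-view 600144f0...     # the GUID printed by create-lu
    itadm create-target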

A major annoyance is that it will sometimes fail to complete booting, necessitating a reboot. Apparently this is an ancient Solaris bug that’s still hanging out in illumos.


So a simple reboot for a software update often results in needing to reboot two or three times. And this happened both in VM installs that I’ve tried and on multiple different hardware platforms.

Other than that, all was fine until about a month ago when I added four 10TB Seagate IronWolf drives as two additional mirrored vdevs. Specifically model number ST10000VN0004, but some searching tells me the 10TB IronWolf Pro is probably impacted too.

Four brand new drives, and within a few days they ALL started to throw timeouts, which eventually resulted in them getting offlined and kicked out of the pool until a reboot. I came dangerously close to losing the whole pool more than once.

Fortunately, the last OmniOS CE update introduced a killer new feature: vdev removal. Using it, I was able to remove the two new vdevs and get my pool back to stability. Or so I thought.
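For reference, the removal itself is a one-liner per vdev. The pool name “tank” and the vdev name below are placeholders; the real names come straight out of zpool status:

    # Find the top-level vdev to evacuate (e.g. mirror-4)
    zpool status tank

    # Evacuate it; its data gets copied onto the remaining vdevs
    zpool remove tank mirror-4

    # Progress shows up under the "remove:" line afterwards
    zpool status tank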

The problem was that my pool was now anything but stable. Random checksum errors, transport errors on the drives, basically everything scary that you don’t want to see.

I tried:

  • Three different HBAs, one brand new
  • Different SFF-8088 cables
  • Different SA120s and enclosures with breakout cables
  • Checking SMART on every drive. This is an array of 10TB Seagate Enterprise drives plus a mix of 6/8TB IronWolf drives, and every single one of them passes. The errors aren’t even tied to specific drives, just to the pool in general (rough commands below).
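For anyone retracing those steps, the checks boil down to roughly this. Device and pool names are examples, and on OmniOS the disks show up as cXtYdZ names rather than /dev/sdX:

    # Long SMART self-test, then the full attribute/health dump, per drive
    smartctl -t long /dev/sda
    smartctl -a /dev/sda

    # Per-vdev read/write/checksum counters, plus any files with errors
    zpool status -v tank

    # Reset the counters, scrub, and see whether they come back
    zpool clear tank
    zpool scrub tank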

So I have two issues. The unstable original pool and the broken new 10TB Seagates.

In trying to troubleshoot the Seagates, I hooked them up to different SAS expanders, HBAs, even directly to the SATA ports on the motherboard. The dropouts continued under every ZFS implementation except Linux. And since I’m not using them in a “supported” NAS, there’s no help from Seagate support.


Note that I’ve been using almost exclusively IronWolf drives for years, 6/8TB Pro/non-Pro, and I’ll say they’ve been some of the most reliable drives I’ve ever used. So it’s disappointing that Seagate support won’t do anything with the 10TB drives when it’s clearly a firmware problem. I’m running almost half a petabyte of Seagate storage, and these drives are the ONLY ones that have issues.

There are a few posts about these 10TB drives on the FreeNAS forums and on Reddit. The current fix is to buy Western Digital drives (ewww), or run ZoL.

Anyway, after almost a month of messing around with it (and wasting a pile of cash on different HBAs, expanders, and cables trying to troubleshoot both problems), I’ve basically come to the conclusion that the vdev removal feature is buggy and triggered all my stability issues.


So I did what any sane person would do.

I grabbed a spare R330 SFF from production, loaded it up with a bunch of SSDs for VM storage, a brand new Intel X540 10Gb card, and a brand new 9207-8e, all with the intention of migrating my data over to a whole new pool before giving the loaner hardware back.

One word: FAIL. With OmniOS and the H330 that’s in the R330, any attempt to migrate VMs off the server results in the “mr_sas” driver causing a kernel panic. This is a year-old bug that’s been reported with no fix in sight.

So now I had a terabyte of VMs stuck on a datastore and I couldn’t get them off. The second a migration in vCenter was attempted, the box kernel-panicked.

After three days of working on it, I finally figured out I could copy them manually to a single-spinner 2TB datastore over a 1Gbps link, keeping the load low enough not to trigger the kernel panic. Almost a literal week later…
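I won’t pretend it was a clean procedure, but from the ESXi shell the manual route looks roughly like this; the datastore paths and VM name are examples:

    # Clone each disk onto the slow 2TB datastore, thin-provisioned
    mkdir -p /vmfs/volumes/slow2tb/vm1
    vmkfstools -i /vmfs/volumes/omnios-ds/vm1/vm1.vmdk \
        /vmfs/volumes/slow2tb/vm1/vm1.vmdk -d thin

    # Copy the config over and re-register the VM from its new home
    cp /vmfs/volumes/omnios-ds/vm1/vm1.vmx /vmfs/volumes/slow2tb/vm1/
    vim-cmd solo/registervm /vmfs/volumes/slow2tb/vm1/vm1.vmx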


So here I am. I have:

  • A previously rock-solid pool of drives exhibiting failure symptoms on OmniOS CE
  • Four brand new 10TB drives connected to the loaner R330, running a test pool on Ubuntu. I’ve been subjecting that pool to load for the last 5 days with precisely zero dropouts (roughly the setup sketched after this list).
  • Four gallon-size freezer bags full of SAS/SATA cards/expanders/adapters/parts/cables purchased in the last month
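The test setup on the R330 is nothing fancy. Roughly this, with the by-id paths standing in for the four drives’ real serial numbers:

    # Throwaway pool: two mirrored vdevs out of the four 10TB IronWolfs
    zpool create -o ashift=12 testpool \
        mirror /dev/disk/by-id/ata-ST10000VN0004_AAAA /dev/disk/by-id/ata-ST10000VN0004_BBBB \
        mirror /dev/disk/by-id/ata-ST10000VN0004_CCCC /dev/disk/by-id/ata-ST10000VN0004_DDDD

    # Hammer it with mixed random I/O and watch for timeouts/offlines
    zfs create testpool/bench
    fio --name=soak --directory=/testpool/bench --rw=randrw --bs=128k \
        --size=50G --numjobs=4 --time_based --runtime=86400 --group_reporting
    watch -n 60 zpool status testpool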

That leads me to the simple conclusion that it’s time to go back to ZFS on Linux. At this point I’m stuck with the new 10TB drives, and I need them to be stable.


This morning, I started the migration back. Unfortunately the OmniOS CE ZFS version has a ton of feature flags that haven’t made it (and possibly never will) to Linux and FreeNAS, so a straight export/import of the pool isn’t an option.

Because of vdev removal on the source pool, I’m able to copy a chunk of data over, remove a mirror from the original pool, and then add those freed-up drives as a new mirror on the Linux pool, repeating until everything has moved.

This is almost a 300TB pool, so unfortunately this is the only reliable way to migrate the pool without investing in a ton more drives. The downside is that it’s going to greatly unbalance the disks in the pool. The early ones are going to have most of the data on them, which will kill some performance.
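The leapfrog itself is plain ZFS plumbing. Each round looks something like this, with the pool, host, dataset, and drive names all placeholders:

    # 1. Replicate a batch of datasets from the OmniOS box to the Linux pool
    zfs snapshot -r oldpool/media@migrate1
    zfs send -R oldpool/media@migrate1 | ssh newbox zfs recv -u newpool/media

    # 2. With that space freed up, evacuate one mirror from the old pool
    zpool remove oldpool mirror-3

    # 3. Hand those two drives to the new pool as another mirror
    zpool add newpool mirror /dev/disk/by-id/DRIVE_A /dev/disk/by-id/DRIVE_B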

Since I’m running the new server on bare metal Ubuntu, I’m probably going to end up writing at least some sort of simple reporting GUI for the server.
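Until that exists, a dumb health check out of cron covers the basics. A minimal sketch, with the mail recipient as a placeholder:

    #!/bin/sh
    # Minimal ZFS health check for cron: only mails when something is wrong
    STATUS=$(zpool status -x)
    if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mail -s "ZFS problem on $(hostname)" admin@example.com
    fi

Drop it in /etc/cron.hourly and it stays quiet until something actually breaks.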

