26 Nov 2022 -
Last update 26 Nov 2022
TL;DR: SMR disks are evil in any ZFS zpool or below other copy-on-write filesystems such as btrfs. Never ever use them there. If you have some in your pools, replace them as soon as possible; do not wait for a failure or some later moment. They’re OK for partitions with ufs, ext4, NTFS, FAT, etc. and for applications that write only on really rare occasions, but not for ZFS, btrfs and other modern filesystems or in RAID settings.
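So what is SMR anyways? It’s a way - or pattern - in which magnetic hard disks record their data. Currently one sees two categories of hard disks that write data in different ways onto their magnetic platters: conventional magnetic recording (CMR), where tracks do not overlap, and shingled magnetic recording (SMR), where neighboring tracks partially overlap like roof shingles so that more tracks fit onto a platter.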
Why do manufacturers use SMR instead of CMR when it’s so bad? As one can imagine, denser tracks mean less head movement in many cases, but most importantly you can fit more data onto smaller platters. Sometimes you can even reduce the number of rotating platters by one and thus reduce costs, which adds up to huge profit gains when one looks at the volume of disks manufactured. Thus you can sell a disk of the same capacity for a lower price or with a larger profit. Because of this, many manufacturers didn’t even tell customers which disks used SMR. This has improved over the last few months, since manufacturers started to release lists of SMR disks after realizing that they cause failures in important settings, which didn’t make users really happy. It didn’t help that users saw great performance for the first period after they bought the disk, while afterwards performance was crippled and even major data loss happened.
So why is SMR bad? Basically because of the long latency whenever the disk has to move data from its cache into the SMR region: this can cause many rewrites and thus latencies up to the tens of seconds range. When using SMR disks everything seems to work fine in the beginning, and with usual workloads and with sequential access (or workloads that mainly cause reads) everything appears perfect. Also, when only writing in bursts and then pausing for a longer duration, everything seems fine, since all data is written to the CMR cache and then transferred into the shingled zones during idle time.
When running a modern copy-on-write filesystem everything looks great as well - until some disk in the pool fails. Then a resilver process will start. Resilvering a modern filesystem causes many different random writes while replaying the journals and rewriting nearly all sections of the disk. This rapidly fills up the write cache in the CMR region of the disk and then triggers rewrites that take many seconds and, if you’re unlucky, up to tens of seconds. In case the disk gets too slow, the RAID logic of hardware RAID controllers and also the pool management of ZFS (and of course other volume managers too) mark the disk as dead, since this is one of the behaviors a failing disk shows. Thus they eject the disk that you want to resilver onto as dead, and as long as there is enough redundancy they do not report any error.
The problem arises when there are multiple SMR disks in the system that are written to as usual during a resilver (and from the fact that resilvering will take really long or, with periodic faults, will never finish). Then the next disk might fail while the process continues, and so on. This will lead to an unavailable pool that falls into this condition reboot after reboot, over and over.
And if you have bad luck and some failing sectors or a faulty disk, this might even lead to total data loss, even with modern filesystems that are usually surprisingly robust - there are some tricks one can try though.
And here’s the catch: You won’t notice this problem until you experience major disk problems. The first visible symptom, a drop in bandwidth, only shows up once the disks have been filled nearly completely for the first time, since the disk then has to rewrite more neighboring sectors and is less likely to land on an empty sector while writing into the SMR region. And once the problems are there, a rebooting machine may hang for many hours while mounting the filesystem and replaying the most necessary parts of the intent log - and even weeks to months while resilvering or scrubbing the whole pool. It may even take days to mount the pools again (the longest I’ve ever seen was 9 days to reboot a production machine with a pool of just 12 TByte, just for mounting; the resilver dropped to less than 4 KByte/sec after about 20% and thus would have taken roughly 82 years to finish due to the speed decrease on the SMR drive).
In case it’s urgent you then just have to buy CMR disks as replacements, mirror the old ones sector-wise and throw out the SMR disks all at once. This works reasonably well and comparably fast (a few days for ordering and cloning the disks) as long as there are no defective sectors on the disks, since one can clone them in huge batches, which exploits the fast sequential reads and writes of the disks. When there are defective sectors it gets a little bit more tricky - then one has to copy the disks sector by sector, which is painfully slow (let’s say around 3-4 days for a 4 TByte disk as of today - and one has to do this with every SMR disk in one’s pool). Or you wait until the machine has started up again and replace the disks the old-fashioned way during runtime, which basically works as long as the SMR disks are not written to - so in case any defect happens one cannot do this in a reasonable way.
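The 82 years quoted above follow from simple arithmetic. The following is only a sketch of that estimate, assuming binary units (12 TByte taken as 12 * 1024^4 bytes, 4 KByte as 4096 bytes), a constant throughput and 80% of the pool still to be written; the function name `estimate_days` is made up for this illustration.

```shell
#!/bin/sh
# estimate_days REMAINING_BYTES BYTES_PER_SECOND
# Prints the whole number of days needed to move REMAINING_BYTES at a
# constant BYTES_PER_SECOND (integer division, rounds down).
estimate_days() {
    echo $(( $1 / $2 / 86400 ))
}

# Assumed figures: 12 TByte pool, 20% already resilvered, 4 KByte/sec.
pool=$(( 12 * 1024 * 1024 * 1024 * 1024 ))
remaining=$(( pool * 80 / 100 ))
days=$(estimate_days "$remaining" 4096)
echo "$days days, about $(( days / 365 )) years"
```

With these assumptions the result lands at a bit over 81 years, which matches the rough figure above. Decimal units give a somewhat smaller number, but the order of magnitude stays the same.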
All in all - as soon as problems start you will have days to weeks of downtime on those machines, which is something that you wanted to avoid in the first place. After all, there is a reason to use RAID-like solutions (and that’s not to replace backups but to keep the systems available and up even in case of hardware failure).
So short story even shorter: If you have SMR disks in your pools, replace them. If you’re building pools, don’t use SMR disks. Check whether your disks are SMR before you buy them - or now, if you don’t know. And if a manufacturer does not explicitly state which recording technique a disk uses (note that this might even differ between capacities within the same disk family of the same manufacturer), assume it’s SMR and don’t buy it.
So is there something one can do during recovery (resilvering) when one is unable to get rid of the SMR disks or while doing the replacement? Indeed - one can increase the disk timeout threshold at which the operating system decides a disk is dead in case one uses a JBOD and not a hardware RAID controller.
For FreeBSD for example one would set:
kern.cam.ada.default_timeout for ATA direct access devices
kern.cam.da.default_timeout for SCSI direct access devices
Usually those are set to values like 30 seconds for ada and 60 seconds for da - far too short for a full SMR disk. It’s a good idea to increase the timeout values during recoveries to values as high as 5 minutes (300 seconds) to prevent the disks from being ejected. Then one can run a long scrub, resilver and replace procedure. This will take time (depending on the disk up to the range of 80 to 100 years - unfortunately not exaggerating), but it’s still one of the fastest ways to recover from SMR disks, taking just a few weeks up to a few months. No other device is allowed to fail during this time, though, and performance will be crippled. One can usually only take this route to replace the SMR disks in a useful way as long as no device has shown the performance degrading effects yet. But as soon as resilvering
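As a rough sketch of the timeout change described above: the helper below only prints the sysctl(8) calls for both CAM timeouts so they can be reviewed before anything is changed. The function name `cam_timeout_cmds` is made up for this illustration, and that both values can be changed on a running system is an assumption worth checking on your FreeBSD release.

```shell
#!/bin/sh
# cam_timeout_cmds SECONDS
# Prints (does not run) the sysctl(8) calls that would set both CAM
# default timeouts named in the text to SECONDS.
cam_timeout_cmds() {
    for oid in kern.cam.ada.default_timeout kern.cam.da.default_timeout; do
        printf 'sysctl %s=%s\n' "$oid" "$1"
    done
}

# 300 seconds is the recovery value suggested in the text.
cam_timeout_cmds 300
```

Running the printed lines as root on the FreeBSD machine would apply them. It’s worth noting the old values first (for example with `sysctl kern.cam.ada.default_timeout`) so they can be restored after the recovery.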
There is only one really good way to recover once the problem has already shown up: buy a CMR disk of equal or larger size (sector-wise) for each SMR disk in your storage pool. Take your systems offline and clone the disks one by one. This is pretty simple:
dd if=/dev/adaXX of=/dev/adaYY bs=1G
Run this for each and every disk in succession. For a typical 3 TB disk this will take 7 to 12 hours per disk. If you have multiple disks and are able to attach them all to the machine, launch multiple instances of dd in parallel - you are most likely limited by the disks and not by the controllers during replacement.
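Launching several dd instances at once could look roughly like the sketch below. The helper names `clone_disk` and `clone_all` and the `SRC:DST` pair notation are made up for this illustration, and the block size is given as 1 MiB in plain bytes (instead of the 1G used above) so the call does not depend on which size suffixes the local dd accepts.

```shell
#!/bin/sh
# clone_disk SRC DST
# Raw copy of SRC onto DST in 1 MiB blocks.
clone_disk() {
    dd if="$1" of="$2" bs=1048576
}

# clone_all "SRC:DST" ...
# Starts one dd per pair in the background and waits for all of them.
# Returns non-zero if any of the copies failed.
clone_all() {
    pids=""
    for pair in "$@"; do
        clone_disk "${pair%%:*}" "${pair#*:}" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}
```

A call would then look like `clone_all /dev/adaXX:/dev/adaYY /dev/adaZZ:/dev/adaWW`, with the placeholders replaced by the real source and target devices. Double check the direction of every pair before starting, since dd overwrites the target without asking.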
So when are SMR disks an acceptable choice? Personally I’d say: Never. Avoid them.
But basically they are fine whenever not many writes are performed in close succession, latency is not a problem and the disk is not used inside a disk array. Thus when one archives really slowly produced data on some disks, or runs only a single disk in one’s desktop workstation or notebook where one just writes small amounts of data, etc., and neither runs a robust copy-on-write filesystem nor wants to write larger bursts of data (like performing a full disk backup, etc.), SMR disks are a perfectly valid choice. So for many home and some non-critical office uses this won’t be a problem. But for server applications, reliable workstations, storage applications or as a backup target they’re usually a really bad idea.
Dipl.-Ing. Thomas Spielauer, Wien (firstname.lastname@example.org)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/