Why you should not use SMR disks for ZFS
26 Nov 2022 -
Last update 26 Nov 2022
12 mins
TL;DR: SMR disks are evil in any ZFS zpool or underneath copy-on-write filesystems such as btrfs.
Never ever use them there. If you have some in your pools, replace them as soon as possible - do not wait
for a failure or some later moment. They're fine in other scenarios like UFS, ext4, NTFS or FAT partitions
and for applications that write only on rare occasions - but not for ZFS, btrfs and other modern filesystems,
and not in RAID settings.
What is shingled magnetic recording?
So what is SMR anyway? It's a method - or pattern - by which magnetic hard disks record their data.
Today one encounters two categories of hard disks that differ in how they write data onto their platters:
- Conventional Magnetic Recording (CMR) is the traditional recording method. As of today this usually
means perpendicular magnetic recording (PMR), which magnetizes the bits perpendicular to the disk surface
(in contrast to the older longitudinal magnetic recording (LMR) method that magnetized them parallel to the
surface of older disks). These disks write data directly to the sectors requested by the operating system
(apart from bad sector remapping, which the disk controller usually handles transparently to the host, and
offsets that hide firmware-reserved areas of the disk) without having to rewrite other regions. The write head
only ever overwrites a single track on the disk and never destroys or overwrites other data.
- Shingled Magnetic Recording (SMR) on the other hand writes data on overlapping neighboring tracks. This works
because the write head is larger (or at least can be imagined to be) than the read head - the tracks are placed
more densely than with CMR, so the write head writes onto more than one track while the read head is still small
enough to read each track independently. Whenever the write head destroys a neighboring track, the disk also
has to rewrite that track. A single write can therefore cause a cascade of rewrites of the neighboring
sectors and their neighbors until it reaches an unused area or the edge of the shingle zone, where overwriting
is no longer necessary. While the disk rewrites it basically hangs, and the writes take much longer to finish. In
addition the disk of course needs to know which sectors it has to rewrite, so like SSDs these drives rely on
periodic TRIM commands sent by the operating system (i.e. SMR disks work really badly with legacy operating
systems that do not support TRIM). Write performance degrades the more disk sectors are in use (or have ever
been in use, if the operating system never sends TRIM commands).
Instead of a few milliseconds, writes may take up to a few seconds (which will be the problem in the end) - on
some occasions even tens of seconds.
To hide this rewriting, which can take far more time than writing a single track or sector, these disks usually
employ an outer region that uses CMR as a cache. These regions are usually at least 20 GBytes in size - data
is first written into this cache and later moved into the SMR region, either when the host requests it, when
the cache is full or (on drive-managed SMR disks) when the disk has some idle time.
Until the cache is full the write performance of an SMR disk appears comparable to that of a
CMR disk - so one doesn't notice anything. Once the cache is full the disk has to flush all its data from
the cache into the SMR region - which can take tens of seconds during which the disk stalls before
finishing the next write. Bandwidth then drops into the kilobytes per second range
and the disk may even appear to fault.
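If you want to see this behavior for yourself, a rough way to watch for it on FreeBSD is to keep an eye on the per-device write latency while the pool is busy; once the CMR cache is exhausted, service times jump from milliseconds into the seconds range. A minimal sketch using stock FreeBSD utilities - the pool name tank is just a placeholder:
# watch per-disk service times; ms/w exploding into the thousands is the tell-tale sign
gstat -p
# alternative view with extended device statistics, refreshed every second
iostat -x 1
# per-vdev view from the ZFS side
zpool iostat -v tank 5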
Why do manufacturers use SMR instead of CMR when it's so bad? As one can imagine, denser tracks mean on one
hand less head movement in many cases, but most importantly more data fits onto smaller disks - sometimes
the number of rotating platters can even be reduced by one, which cuts costs and, at the volumes in which
disks are manufactured, leads to huge profit gains. A disk of the same capacity can thus be sold at a lower
price or with a larger margin. Because of this, many manufacturers did not even tell customers which disks
used SMR. This has improved in the last few months: after realizing that SMR causes failures in important
settings, manufacturers started to publish lists of their SMR disks - which did not exactly make users happy
(it did not help that the disks showed good performance during the first period after purchase, only for
performance to collapse and, in some cases, major data loss to follow - surprisingly).
What's the problem?
Basically the problem is the long latency whenever the disk has to rewrite data from the cache into the SMR
region: if this triggers many rewrites, latencies climb into the tens of seconds range. In the beginning
everything seems to work fine with SMR disks - usual workloads, sequential access and read-heavy workloads
all appear perfect. Writing only in bursts followed by longer pauses also seems fine, since all data lands in
the CMR cache and is transferred into the shingle zones during idle time. Running a modern copy-on-write
filesystem also looks great - until some disk in the pool fails. Then a resilver process starts. Resilvering
a modern filesystem causes many different random writes while replaying the journals and rewriting nearly
every section of the disk. This rapidly fills up the write cache in the CMR region - and then triggers
rewrites that take many seconds, if you are unlucky even tens of seconds. If a disk gets too slow, the RAID
logic of hardware RAID controllers as well as the pool management of ZFS (and other volume managers, of
course) marks it as dead - since this is one of the behaviors a failing disk shows. They therefore eject the
very disk you want to resilver onto as dead and, as long as there is enough redundancy, report no error. The
problem arises when there are multiple SMR disks in the system, all of which are written to during a resilver
(and by the fact that resilvering will take extremely long or, with periodic faults, never finish). The next
disk might then fail while the process continues, and so on. This leads to an unavailable pool that falls
back into this condition reboot after reboot, over and over. And if you have bad luck with some failing
sectors or a faulty disk, this can even end in total data loss despite modern filesystems being surprisingly
robust - there are some tricks one can try though.
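If you suspect this is happening to one of your pools, the resilver status and the kernel log usually tell the story: the resilver speed collapses and shortly afterwards devices start getting ejected after command timeouts. A hedged sketch of where to look on FreeBSD - the pool name tank is a placeholder:
# resilver progress and estimated completion time; watch the scan speed collapse
zpool status -v tank
# kernel messages about command timeouts that precede a device being thrown out
dmesg | grep -i timeout
grep -i timeout /var/log/messages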
And here's the catch: you won't notice this problem until you experience major disk problems - not even the
bandwidth reduction that first shows up once the disks have filled up nearly completely, because the disk
then has to rewrite more neighboring sectors and is less likely to hit an empty sector while flushing writes
into the SMR region. And once the problems are there, a rebooting machine may hang for many hours while
mounting the filesystem and replaying the most necessary parts of the intent log - and for weeks to months
while resilvering or scrubbing the whole pool. Even just mounting the pools again may take days (the longest
I've ever seen was 9 days to reboot a production machine with just a 12 TByte pool, just for mounting; the
resilver dropped to less than 4 KByte/sec after about 20% and would thus have taken roughly 82 years to
finish due to the speed decrease on the SMR drive). If it's urgent you then simply have to buy CMR disks as
replacements, mirror the old disks sector-wise and throw out the SMR disks all at once. This works reasonably
well and comparably fast (a few days for ordering and cloning the disks) as long as there are no defective
sectors on the disks, since one can clone them in huge blocks, which exploits fast sequential reads and
writes. With defective sectors it gets a bit more tricky - then one has to copy sector by sector, which is
painfully slow (let's say around 3 to 4 days for a 4 TByte disk as of today - and one has to do this with
every SMR disk in one's pool). Or you wait until the machine starts up again and replace the disks the old
fashioned way during runtime, which basically works only as long as the SMR disks are not written to - so in
case of any defect this is not a reasonable option. All in all: as soon as the problems start you are looking
at days to weeks of downtime for those machines, which is exactly what you wanted to avoid in the first
place - there is a reason to use RAID-like solutions after all (and that reason is not to replace backups but
to keep the systems available and up even in case of hardware failure).
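For the sector-by-sector copies of drives with defective sectors, plain dd gives up on the first read error, so a tool that skips and retries bad blocks is a better fit. A hedged sketch - recoverdisk(1) ships with the FreeBSD base system, GNU ddrescue is an alternative from ports, and the device names are placeholders:
# copy an SMR disk with bad sectors onto its CMR replacement, retrying problem areas
recoverdisk /dev/ada1 /dev/ada5
# roughly equivalent with GNU ddrescue (the map file lets an interrupted copy resume)
ddrescue -f /dev/ada1 /dev/ada5 rescue.map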
So the short story even shorter: if you have SMR disks in your pools, replace them. If you're building pools,
don't use SMR disks. Check beforehand whether your disks are SMR - or check now if you don't know. And if a
manufacturer does not explicitly state which recording technique a disk uses (note that this may even differ
between different capacities of the same disk family from the same manufacturer), assume it's SMR and don't
buy it.
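A quick way to at least gather hints is to look at the disk's identification data: host-aware and host-managed SMR disks advertise themselves as zoned devices, while drive-managed SMR disks often report nothing and can only be identified via the model number and the manufacturer's published CMR/SMR lists. A hedged sketch for FreeBSD - smartctl comes from the sysutils/smartmontools port and the device names are placeholders:
# ATA identification data: note the exact model number and check it against the
# manufacturer's CMR/SMR lists; zoned capabilities show up here if the disk reports them
camcontrol identify ada0
# newer smartctl versions may report zoned-device capabilities for host-aware/host-managed
# disks; drive-managed SMR disks usually stay silent about it
smartctl -i /dev/ada0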
Some tips during recovery
So is there something one can do during recovery (resilvering) when one is unable to get rid of
the SMR disks, or while doing the replacement? Indeed - one can increase the timeout after which
the operating system decides that a disk is dead, at least if one uses a JBOD and not a hardware RAID controller.
On FreeBSD for example one would set:
kern.cam.ada.default_timeout
for ATA direct access devices
kern.cam.da.default_timeout
for SCSI direct access devices
Usually these default to values like 30 seconds for ada and 60 seconds for da - way too slow
for a full SMR disk. It's a good idea to raise the timeouts during recoveries to values as
high as 5 minutes (300 seconds) to prevent the disks from being ejected. Then one can run a long scrub,
resilver and replace procedure. This will take time (depending on the disk up to the range
of 80 to 100 years - unfortunately not an exaggeration) - but it is still one of the fastest ways to recover
from SMR disks, usually over just a few weeks up to a few months - but no other device is allowed to fail
during this time and performance will be crippled.
One can usually only take this route to replace the SMR disks in a useful way as long as no device has
shown the performance-degrading effects yet. But as soon as resilvering stalls on a saturated SMR disk,
the only realistic option left is the full clone described in the next section.
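A minimal sketch of how raising the timeouts could look on FreeBSD - the 300 second value follows the recommendation above; both knobs are tunables, so setting them in /boot/loader.conf and rebooting is the safe route, while writing them with sysctl at runtime may or may not be possible on your release:
# /boot/loader.conf - takes effect at the next boot
kern.cam.ada.default_timeout="300"
kern.cam.da.default_timeout="300"
# or, if writable at runtime on your release:
sysctl kern.cam.ada.default_timeout=300   # ATA direct access devices
sysctl kern.cam.da.default_timeout=300    # SCSI direct access devices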
The best way to recover?
There is only one really good way to recover once the problem has already shown itself: buy a CMR disk of
equal or larger size (in sectors) for each SMR disk in your storage pool, take your systems
offline and clone the disks one by one. This is pretty simple:
- Boot the system from some other medium, without importing the pool
- Figure out the serial numbers of all disks and their matching device names
- Figure out the device names of all replacement disks
- Use a command like
dd if=/dev/adaXX of=/dev/adaYY bs=1G
for each and every disk in succession.
For a typical 3 TB disk this will take 7 to 12 hours per disk - if you have multiple disks and are able
to attach them to the machine, launch multiple instances of dd
- you are most likely limited
by the disks and not the controllers during replacement (see the sketch after this list).
- Remove the SMR disks and re-import your pools again. Everything should work smoothly from now on
- Get rid of the SMR disks (shredder, sell on eBay for the next person to make the same mistake
or use for slow archiving without a copy on write filesystem and without RAID, etc.)
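A hedged sketch of the whole procedure on FreeBSD - the device names are placeholders and absolutely have to be double-checked against the serial numbers before running dd, since the target disk is overwritten without asking:
# match serial numbers ("ident") to device names
geom disk list | grep -E 'Geom name|ident'
# clone each SMR disk onto its CMR replacement; independent disk pairs can run in parallel
dd if=/dev/ada2 of=/dev/ada6 bs=1G &
dd if=/dev/ada3 of=/dev/ada7 bs=1G &
wait
# send SIGINFO (kill -INFO <dd pid>) to make a running dd print its progress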
When are SMR disks perfectly ok?
Personally I'd say: never. Avoid them.
But basically they are acceptable whenever writes do not come in close succession, latency is not a problem
and the disk is not used inside a disk array. So if one archives slowly produced data on
some disks, runs only a single disk in a desktop workstation or notebook that only writes
small amounts of data, does not run a robust copy-on-write filesystem, or only wants to
write larger bursts of data now and then (like a full disk backup), SMR disks are a perfectly
valid choice. For many home and some non-critical office uses this won't be a problem. But for
server applications, reliable workstations, storage applications or as a backup target they're
usually a really bad idea.