08 Nov 2024 - tsp
Last update 12 Nov 2024
Since I prefer FreeBSD as my main operating system, on servers as well as on desktop systems, I have been running it on many different hardware configurations. Usually everything works smoothly, with far fewer problems than on other operating systems. It is robust, easy to configure, consistent and efficient. Lately, though, a few systems newly built from low-cost components started to freeze randomly. First some applications stopped working while the rest of the system was still usable, and at some point the machine froze completely. At first glance the freezes seemed to be related to browser usage (Firefox or Chromium): they never happened when no browser was running but were reliably reproducible during browsing, even on systems with plenty of memory, though the frequency of the freezes seemed to correlate with the available RAM.
Since I also use FreeBSD at work, and the hangs on my workstation there got a little too annoying, I decided to investigate this further. After a reboot the log did not show anything. Because the GUI usually froze completely, I logged in via ssh from another machine and kept displaying the contents of /var/log/messages in real time using tail -f:
sudo tail -f /var/log/messages
This also works when data is no longer flushed to the disk but only kept in RAM, so one sees kernel messages as long as the filesystem driver itself still works. Across all freezes this finally and consistently showed timeouts on my very cheap SSDs:
kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Retrying command, 0 more tries remain
kernel: (ada0:ahcich1:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
kernel: (ada0:ahcich1:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
kernel: (ada0:ahcich1:0:0:0): RES: 51 04 00 00 00 40 00 00 00 00 00
kernel: (ada0:ahcich1:0:0:0): Error 5, Retries exhausted
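When the log is long or several disks are involved, it can help to count the CAM error lines per device instead of reading them one by one. The following is only a minimal sketch: the function name cam_error_summary is hypothetical, and it assumes that the lines look like the ones above, with the device name directly after the opening parenthesis.

```shell
#!/bin/sh
# Count "CAM status:" lines per device in kernel log text read from stdin.
# Hypothetical helper, assuming lines such as:
#   kernel: (ada0:ahcich1:0:0:0): CAM status: ATA Status Error
cam_error_summary() {
    awk '
        /CAM status:/ {
            # The device name is the text between "(" and the following ":".
            start = index($0, "(")
            if (start == 0) next
            rest = substr($0, start + 1)
            dev = substr(rest, 1, index(rest, ":") - 1)
            count[dev]++
        }
        END { for (dev in count) print dev, count[dev] }
    ' | sort
}
```

It could be fed from the log file, for example with `tail -n 1000 /var/log/messages | cam_error_summary`, and prints one line per device with the number of CAM errors seen.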
This seemed to indicate problems with the SSD, though inspecting the SMART parameters yielded no further insight. The behavior was also consistent across a small set of machines. In case the SSDs were just incredibly slow, as one knows from SMR hard disks (which you should not use for ZFS anyway), I increased the default timeout for the device, first to 60 seconds and in a last attempt even to 300 seconds:
kern.cam.ada.default_timeout=60
Unfortunately this did not resolve the problem - it just took longer till the system totally froze.
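For anyone who wants to repeat the timeout experiment, the value has to be written somewhere. The sketch below assumes the setting is kept in /etc/sysctl.conf, which the text above does not state, and that the value can also be changed on a running system with sysctl. The helper name set_sysctl_conf is hypothetical; it replaces an existing entry instead of appending duplicates.

```shell
#!/bin/sh
# Set KEY=VALUE in a sysctl.conf-style file, replacing any existing entry for KEY.
# Hypothetical helper for illustration, not part of FreeBSD.
set_sysctl_conf() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Copy every line that does not start with "KEY=" (plain prefix match, no regex).
    if [ -f "$file" ]; then
        awk -v k="$key=" 'index($0, k) != 1' "$file" > "$tmp"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (as root, the file location is an assumption):
#   set_sysctl_conf /etc/sysctl.conf kern.cam.ada.default_timeout 300
#   sysctl kern.cam.ada.default_timeout=300   # apply to the running system
```

Calling it twice with different values leaves only the last value in the file, which makes stepping from 60 to 300 seconds less error-prone.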
In the end it turned out to be a problem with flushing the cache buffers in conjunction with native command queuing (NCQ). On many devices, such as cheap SSDs and SMR (Shingled Magnetic Recording) hard disks, NCQ can lead to multiple flush commands being enqueued that together exceed the timeout; this would be solvable by increasing the timeout even further. My devices still caused the same trouble, however, which may for example be caused by firmware bugs. On my particular system with my particular SSDs the problem could be solved by disabling NCQ, of course with all consequences such as reduced throughput:
camcontrol negotiate ada0 -T disable
This disabled tagged queuing (native command queuing) on the disk and immediately solved the freezing issues.
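To apply the same command to several disks, or to see beforehand what would be executed, it can be wrapped in a small function. This is only a sketch under assumptions: disable_ncq and the CAMCONTROL variable are hypothetical names for illustration, and the name check simply rejects anything that is not made of lowercase letters and digits.

```shell
#!/bin/sh
# Program used to talk to the disks. Override it for a dry run,
# e.g. CAMCONTROL=echo prints the arguments instead of running camcontrol.
: "${CAMCONTROL:=/sbin/camcontrol}"

# Disable tagged queuing (NCQ) on every disk given as an argument.
# Returns non-zero if any name was rejected or any command failed.
disable_ncq() {
    rc=0
    for dsk in "$@"; do
        case "$dsk" in
            ''|*[!a-z0-9]*)
                echo "skipping invalid device name: '$dsk'" >&2
                rc=1
                continue
                ;;
        esac
        "$CAMCONTROL" negotiate "$dsk" -T disable || rc=1
    done
    return "$rc"
}
```

With `CAMCONTROL=echo` set, `disable_ncq ada0 ada1` only prints the two negotiate commands; without the override it runs them, which requires root.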
rc.d script to disable NCQ
Since this change does not persist across reboots I decided to write a small hacked rc.d script in /usr/local/etc/rc.d/disablencq:
#!/bin/sh

# PROVIDE: disablencq
# REQUIRE: NETWORKING SERVERS

# Disable tagged queuing (NCQ) on the listed disks.
#
# Settings for /etc/rc.conf:
#   disablencq_enable="YES"            Execute this script
#   disablencq_disks="ada0 ada1 ada2"  List the disks to disable NCQ on

. /etc/rc.subr

name="disablencq"
rcvar=disablencq_enable
desc="Disable NCQ on some disks"
start_cmd="disablencq_start"

disablencq_start()
{
	# Turn off tagged queuing on every configured disk
	for dsk in ${disablencq_disks}; do
		camcontrol negotiate "${dsk}" -T disable
	done
}

# Read rc.conf first, then fall back to defaults for unset variables
load_rc_config $name
: ${disablencq_enable:="NO"}
: ${disablencq_disks:="ada0"}

run_rc_command "$1"
Once the script is executable (chmod +x /usr/local/etc/rc.d/disablencq), the problematic SSDs can be configured in /etc/rc.conf:
disablencq_enable="YES"
disablencq_disks="ada0 ada1"
This is of course a hack, but it seems to work around problems in the firmware of some SSDs. Usually it is a good idea to stay away from such SSDs, just as one should stay away from SMR hard disks.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/