11 Jul 2020 - tsp
Last update 26 Jul 2020
I have used RabbitMQ as the message broker on my smaller deployments for more than ten years, and I usually don’t experience any problems with the broker itself. RabbitMQ is a solid implementation that offers AMQP and MQTT connectivity, an easy-to-handle management interface and easy clustering - and it is simply highly reliable.
Until recently I had never experienced any problems while using RabbitMQ on
multiple deployments as the backbone of microservice architectures and IoT deployments.
But one day, after an unexpected restart of one of the servers, RabbitMQ seemed to
start - but clients did not connect. The status page reported that vhost / is down,
which sounded rather strange - and a quick web search did not offer any solution
either.
A quick dive into the logfiles (which are not really easy to read though since they simply dump Erlang objects) brought up the following:
CRASH REPORT Process <0.428.0> with 0 neighbours crashed with reason: no match of right hand value {error,{not_a_dets_file,"/var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/recovery.dets"}} in rabbit_recovery_terms:open_table/1 line 197
and the following
[warning] <0.318.0> Unable to initialize vhost data store for vhost '/'. The vhost will be stopped for this node. Reason: {shutdown,{failed_to_start_child,rabbit_vhost_process,{badmatch,{error,},[{rabbit_recovery_terms,open_table,1,[{file,"src/rabbit_recovery_terms.erl"},{line,197}]},{rabbit_recovery_terms,init,1,[{file,"src/rabbit_recovery_terms.erl"},{line,177}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{child,undefined,rabbit_recovery_terms,{rabbit_recovery_terms,start_link,[<<"/">>]},transient,30000,worker,[rabbit_recovery_terms]}}}}}}
As one can see, it is the Erlang storage layer (the DETS file format, in a file
that lives inside RabbitMQ’s Mnesia directory) complaining that the recovery.dets
file is in fact corrupt. This means that the crash left behind a corrupt data file,
despite all write caches being disabled as usual for a database and MQ machine.
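Since the logs are hard to read by eye, the path of the damaged file can also be pulled out of them mechanically. The following is only a minimal sketch in Python: the function name find_corrupt_dets_files is made up for this illustration, and the regular expression assumes the error term is formatted exactly as in the crash report shown above.

```python
import re

# Matches the {not_a_dets_file,"<path>"} term as it appears in the crash
# report above. The exact formatting is an assumption taken from that line.
NOT_A_DETS_FILE = re.compile(r'\{not_a_dets_file,\s*"([^"]+)"\}')


def find_corrupt_dets_files(log_text):
    """Return the distinct paths reported as not_a_dets_file, in log order."""
    seen = []
    for match in NOT_A_DETS_FILE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

Feeding the contents of the broker logfile into this function should yield the recovery.dets path of every affected vhost, which is handy if more than one vhost is down.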
For this machine the solution was quite simple since none of the messages carried
any really important content - this was just data collection for statistics
anyway. So I simply stopped the message broker, backed up everything, deleted the
/var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/recovery.dets
database file and started the broker again. As expected, everything came up
normally. Unfortunately all queued and not yet delivered messages vanished, which
of course is a huge problem since the whole point of using a message broker that
offers reliable delivery is to not lose any messages after committing the
transaction.
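The backup and delete step described above can be sketched as a small Python helper. This is an illustration under assumptions, not the exact procedure I used: the function name quarantine_recovery_file and the backup directory are hypothetical, and stopping and starting the broker is deliberately left to the operating system's own service tooling.

```python
import shutil
import time
from pathlib import Path


def quarantine_recovery_file(dets_path, backup_dir):
    """Copy the damaged file into backup_dir, then delete the original.

    Only call this while the broker is stopped. Returns the backup path.
    """
    src = Path(dets_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps earlier backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}.{stamp}.bak"

    shutil.copy2(src, dest)  # keep a copy for later investigation
    src.unlink()             # remove the damaged file before restarting
    return dest
```

Keeping the damaged file around instead of just deleting it makes it possible to inspect it later. Keep in mind that this does not bring back the queued messages of the affected vhost.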
This of course leads to the question of whether Mnesia is - in its current state - really the durable database system it should be, or whether this was just a problem caused by the consumer-grade, commercial off-the-shelf hard disks inside the machine I worked on. Using such hardware is a principle of mine when building systems (including reliable ones). If the current Mnesia implementation does have problems under some circumstances, that would have a huge impact on other Mnesia-based applications besides RabbitMQ - for example ejabberd, which powers a huge number of Jabber/XMPP servers, and a lot of other software in the telecommunications market.
In case I get some time for further investigation of the damaged database files I’ll update this blog post later on.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/