Repairing corrupted RabbitMQ instance (VHost / experienced error, not a dets file)

11 Jul 2020 - tsp
Last update 26 Jul 2020
Reading time 2 mins

I have been using RabbitMQ as the message broker on my smaller deployments for more than ten years, and I usually don't experience any problems with the broker itself. RabbitMQ is a solid implementation that offers AMQP and MQTT connectivity, an easy-to-handle management interface and easy clustering - and it is simply highly reliable.

Until recently I had never experienced any problems while using RabbitMQ on multiple deployments as the backbone of microservice architectures and IoT deployments. But one day, after an unexpected restart of one of the servers, RabbitMQ seemed to start - but clients could not connect. The status page reported that vhost / was down, which sounded rather strange - and a quick web search did not offer any solution either.

A quick dive into the logfiles (which are not really easy to read, since they simply dump Erlang terms) brought up the following:

CRASH REPORT Process <0.428.0> with 0 neighbours crashed with reason: no match of right hand value {error,{not_a_dets_file,"/var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/recovery.dets"}} in rabbit_recovery_terms:open_table/1 line 197

and the following

[warning] <0.318.0> Unable to initialize vhost data store for vhost '/'. The vhost will be stopped for this node.  Reason: {shutdown,{failed_to_start_child,rabbit_vhost_process,{badmatch,{error,},[{rabbit_recovery_terms,open_table,1,[{file,"src/rabbit_recovery_terms.erl"},{line,197}]},{rabbit_recovery_terms,init,1,[{file,"src/rabbit_recovery_terms.erl"},{line,177}]},{gen_server,init_it,2,[{file,"gen_server.erl"},{line,374}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,342}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{child,undefined,rabbit_recovery_terms,{rabbit_recovery_terms,start_link,[<<"/">>]},transient,30000,worker,[rabbit_recovery_terms]}}}}}}

As one can see, it's the Erlang storage layer complaining that the recovery.dets file (a DETS table inside RabbitMQ's Mnesia directory) is in fact corrupt. This means that the crash - despite all write caches being disabled, as usual for a database and MQ machine - left a corrupt data file behind.
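To avoid digging through the Erlang terms by hand, the path of the damaged file can be pulled out of the log with a few lines of Python. This is only a minimal sketch: the function name find_corrupt_dets_files is made up for illustration, and the pattern assumes that the error tuple looks exactly like the one in the crash report above.

```python
import re

# Matches the Erlang tuple {not_a_dets_file,"<path>"} as it appears in the
# crash report. The exact log layout is an assumption based on that one line.
NOT_A_DETS_FILE = re.compile(r'\{not_a_dets_file,"([^"]+)"\}')


def find_corrupt_dets_files(log_text):
    """Return the distinct paths reported as not_a_dets_file, in order of appearance."""
    paths = []
    for match in NOT_A_DETS_FILE.finditer(log_text):
        path = match.group(1)
        if path not in paths:
            paths.append(path)
    return paths


# Usage example with the crash report line from the broker log.
sample_line = (
    'CRASH REPORT Process <0.428.0> with 0 neighbours crashed with reason: '
    'no match of right hand value {error,{not_a_dets_file,'
    '"/var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/'
    '628WB79CIFDYO9LJI6DKMI09L/recovery.dets"}} '
    'in rabbit_recovery_terms:open_table/1 line 197'
)
for corrupt_path in find_corrupt_dets_files(sample_line):
    print(corrupt_path)
```

Feeding the whole logfile into the function instead of a single line works the same way and lists every affected file once, which helps if more than one vhost is down.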

For this machine the solution was quite simple, since none of the messages carried any really important content - this was just data collection for statistics anyway. So I simply stopped the message broker, backed up everything, deleted the /var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/recovery.dets database file and restarted the broker. As expected, everything came up again. Unfortunately all queued and undelivered messages vanished, which of course is a huge problem, since the whole point of using a message broker that offers reliable delivery is to not lose any messages after committing the transaction.
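The backup-then-delete step can be sketched in Python as follows. This is an illustration under assumptions, not the exact procedure I ran: the function name quarantine_recovery_file and the backup directory are hypothetical, the script only saves the single recovery.dets file instead of everything, and the broker has to be stopped beforehand and started afterwards with whatever service manager the platform uses.

```python
import shutil
import time
from pathlib import Path


def quarantine_recovery_file(dets_path, backup_dir):
    """Copy the damaged file into backup_dir, then delete the original.

    Returns the path of the backup copy. Only call this while the broker
    is stopped.
    """
    source = Path(dets_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps earlier backups from being overwritten.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = target_dir / f"{source.name}.{stamp}.bak"
    shutil.copy2(source, backup)

    # Remove the original only if the copy has the same size.
    if backup.stat().st_size != source.stat().st_size:
        raise OSError(f"backup of {source} is incomplete, original kept")
    source.unlink()
    return backup
```

A call would look like quarantine_recovery_file("/var/db/rabbitmq/mnesia/rabbit@store01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/recovery.dets", "/root/rabbitmq-backup"), where the second path is just an example location. Keeping the damaged file around instead of deleting it outright also leaves something to inspect later.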

This of course leads to the question of whether Mnesia is - in its current state - really as durable a database system as it should be, or whether this was just a problem caused by the consumer-grade commercial off-the-shelf hard disks inside the machine I worked on. Using such hardware is a principle of mine when building systems (including reliable ones). If the current Mnesia implementation did have problems under some circumstances, that would have a huge impact on other Mnesia-based applications besides RabbitMQ - for example ejabberd, which powers a large number of Jabber/XMPP servers, and a lot of other software in the telecommunications market.

In case I get some time for further investigation of the damaged database files I’ll update this blog post later on.