Simple DHCP failover with ISC-DHCP

02 Oct 2022 - tsp
Last update 02 Oct 2022

We have all seen this. You have different automation components in your network: various systems for data processing, embedded devices based on the ESP8266 and ESP32 in your wireless network, other MCUs hooked up to Ethernet controllers such as the ENC28J60, and embedded computers such as Raspberry Pis that control different machines and experimental components. Or you are just running a bunch of machines and get mad when they cannot talk to each other because your DHCP server failed.

With IPv6 the standard way to configure your components is SLAAC, which automatically configures address prefixes (in automation networks one can even rely entirely on link-local addresses, which makes IPv6 even more appealing for such setups), sometimes combined with DHCPv6 to distribute DNS servers and other parameters where needed. For legacy IP (IPv4) one usually uses DHCP to stay sane (static allocation in such a network is in my opinion a total no-go). The standard DHCP server is usually the ISC DHCP implementation, which is also available on FreeBSD via the isc-dhcp44-server port or package. Configuring it is out of scope for this blog post, even though it is rather simple. A short introduction can be found in my blog post about running an LTE/UMTS gateway on FreeBSD.

So everything runs smoothly, but one day your DHCP server just silently dies. The clients keep their IP addresses until the lease expires, then lose their interface configuration and vanish from the network. You lose all your automation capabilities and, under some circumstances, have a hard time recovering. To prevent this it’s a good idea to run DHCP servers in a redundant fashion. This is pretty simple with ISC’s DHCP implementation, which implements its own failover protocol that runs over TCP on a separate port one can configure freely.

In the following example configuration I’m assuming there is one primary server that hands out DHCP leases as usual - and a secondary failover server that takes over when the primary one fails. The servers communicate with each other using the failover protocol and thus both keep track of all leases. One can choose the ports oneself - I’m assuming here that:

- The primary listens on port 519
- The secondary listens on port 520

This of course has to be accounted for in the firewall configuration.

The first step in configuring failover is simply copying the existing configuration from the primary server to the secondary. Keep in mind that from now on you have to change both files whenever you make changes (including static leases, etc.). Before launching the secondary one has to configure the failover itself. There are a number of configuration options that change the behavior of the failover and that one has to decide on:

Configured on both sides of the failover:
- max-response-delay is the number of seconds a server waits without receiving a message from its peer before it assumes that the communication session is dead and the protocol connection has to be re-established. Usually 60 seconds is a sane value.
- max-unacked-updates determines how many binding update messages may be outstanding towards a server without having been acknowledged. Once this number of updates is in flight, the remaining updates have to be queued. Usually set to something around 10.
- load balance max seconds configures a cutoff after which load balancing is bypassed. It is compared against the number of seconds a client reports it has been trying to get an address (the secs field of its DISCOVER or REQUEST messages). Once that value exceeds the cutoff, either server answers the client regardless of the split. Usually this is set pretty low (around 3 to 5 seconds) so that a server which no longer answers its clients is covered by the other one nearly transparently. Setting it too high means clients have to retransmit their DISCOVER messages more often before they get an answer. Setting it too low makes both servers answer clients whose first request was merely delayed or dropped on a busy network, which causes unnecessary traffic.
Configured only on the primary (and distributed to the peer by the primary):
- The maximum client lead time (mclt) defines the maximum time by which a server can extend the lease for a client’s binding beyond the time known by the other server. This parameter essentially defines how fast recovery from a failed primary happens and is not easy to calculate, even though there is an example in RFC 8156, section 4.4.1 (which describes DHCPv6 failover but uses the same concept). The value to choose depends on the number of new clients expected and the default lifetime of a lease. If in doubt, a value of around 3600 seconds (i.e. 1 hour) should be sane for most small scale or home networks. Basically one can assume this is the maximum lease time that the failover peer can hand out while the primary has failed.
- split configures load balancing between the primary and the failover peer, which allows both servers to serve clients during normal operation. Setting it to 256 disables load balancing so only the primary handles DHCP requests, 128 splits the load so that primary and peer each handle about 50% of the requests, and 0 makes the peer handle all requests as long as it has not failed.
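
To make the effect of mclt a bit more concrete, here is a small sketch of how it relates to the ordinary lease times. The values of default-lease-time and max-lease-time are purely illustrative assumptions and not part of the setup described in this post. The comments describe the usual behaviour of the failover protocol as I understand it, so treat them as a rough guide rather than an exact specification:

```
# Global lease times (illustrative values, assumed for this sketch)
default-lease-time 86400;       # 24 hours for a normal lease
max-lease-time 172800;          # upper bound of 48 hours

# With "mclt 3600;" in the failover declaration of the primary, a lease may
# only run 3600 seconds past what the other server has acknowledged.
# A new client therefore typically receives a first lease of at most
# 3600 seconds and only gets the longer default-lease-time on a later
# renewal, after the peer has been told about the binding.
```

A smaller mclt thus shortens the time until a surviving server can safely reuse addresses, at the price of more frequent renewals from fresh clients.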

Primary configuration

So now one can configure the primary:

failover peer "failover-example" {
        primary;
        address 192.0.2.1;
        port 519;
        peer address 192.0.2.2;
        peer port 520;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 3600;
        split 256;
        load balance max seconds 3;
}
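
With split 256 the primary answers all clients during normal operation. If both servers should share the load instead, only the split value has to change. A minimal sketch of such a variant, assuming the same addresses and ports as in the example (the value 128 is the 50% setting mentioned in the parameter list):

```
failover peer "failover-example" {
        primary;
        address 192.0.2.1;
        port 519;
        peer address 192.0.2.2;
        peer port 520;
        max-response-delay 60;
        max-unacked-updates 10;
        mclt 3600;
        split 128;                      # primary and peer each serve about half of the clients
        load balance max seconds 3;     # after 3 seconds of retries either server may answer
}
```

The declaration on the secondary does not need to change for this, since split is only configured on the primary.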

To tell dhcpd which address ranges the failover applies to, one has to reference the failover peer inside a pool declaration of the subnet:

subnet 192.0.2.0 netmask 255.255.255.0 {
        # range 192.0.2.100 192.0.2.200;
        option routers 192.0.2.5;
        option subnet-mask 255.255.255.0;
        option ntp-servers 192.0.2.6;
        option time-offset 7200;        # offset from UTC in seconds (2 hours)
        option broadcast-address 192.0.2.255;

        pool {
                failover peer "failover-example";
                range 192.0.2.100 192.0.2.200;
        }
}

Peer configuration

On the other side one also has to configure the failover peer (the secondary). Note that mclt and split are missing here since they are only set on the primary:

failover peer "failover-example" {
        secondary;
        address 192.0.2.2;
        port 520;
        peer address 192.0.2.1;
        peer port 519;
        max-response-delay 60;
        max-unacked-updates 10;
        load balance max seconds 3;
}

Of course the pool declaration is also required on the failover peer side:

subnet 192.0.2.0 netmask 255.255.255.0 {
        # range 192.0.2.100 192.0.2.200;
        option routers 192.0.2.5;
        option subnet-mask 255.255.255.0;
        option ntp-servers 192.0.2.6;
        option time-offset 7200;        # offset from UTC in seconds (2 hours)
        option broadcast-address 192.0.2.255;

        pool {
                failover peer "failover-example";
                range 192.0.2.100 192.0.2.200;
        }
}
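
Since the subnet declaration is identical on both servers, one way to keep the two files from drifting apart is to move everything shared into a separate file and pull it in with the include statement of dhcpd.conf. The following is only a sketch: the file name /usr/local/etc/dhcpd.common.conf is a hypothetical choice, and the file would have to be copied unchanged to both machines:

```
# /usr/local/etc/dhcpd.conf on the secondary. The primary looks the same
# except that it carries its own failover peer declaration.
failover peer "failover-example" {
        secondary;
        address 192.0.2.2;
        port 520;
        peer address 192.0.2.1;
        peer port 519;
        max-response-delay 60;
        max-unacked-updates 10;
        load balance max seconds 3;
}

# Shared part: subnets, pools and static host entries.
# The file name is an assumption made for this sketch.
include "/usr/local/etc/dhcpd.common.conf";
```

With this layout only the shared file has to be synchronized after a change, while the host-specific failover declaration stays untouched on each machine.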

Restarting primary and secondary

After applying the configuration one has to enable the service in rc.conf on both machines (dhcpd_enable="YES" for the FreeBSD package) and restart it on the primary as well as on the failover peer (service isc-dhcpd restart). This is everything that’s required to get some redundancy into your DHCP setup.