02 Oct 2022 - tsp
Last update 02 Oct 2022
We have all seen this. You have different automation components in your
network - different systems for data processing, embedded devices based
on ESP8266 and ESP32 in your wireless network, other MCUs hooked up to Ethernet
controllers such as the ENC28J60, and embedded computers such as Raspberry Pis
that control different machines and experimental components. Or you’re just running
a bunch of machines and get mad when they cannot communicate with each other because
your DHCP servers failed. When using IPv6 the standard way to configure your components
is SLAAC, which automatically configures IP address prefixes (one can rely entirely
on link-local addresses in automation networks though, which makes using IPv6 for
such stuff even more appealing), sometimes combined with DHCPv6 to configure DNS servers
and other parameters where needed. For legacy IP (IPv4) one usually uses DHCP to
stay sane (doing static allocation in such a network is in my opinion a
total no-go). The standard solution for running a DHCP server is usually the ISC DHCP
implementation, which is also available on FreeBSD via the isc-dhcp44-server
port or package. Configuring it is out of scope for this blog post, even though
it’s rather simple. A short introduction
can be found in my blog post about running an LTE/UMTS gateway on FreeBSD.
So everything runs smoothly, but one day your DHCP server just silently dies. The clients keep their IP addresses until the lease expires, then lose their interface configuration and vanish from the network. You lose all your automation capabilities and under some circumstances have a hard time recovering. To prevent this from happening it’s a good idea to run DHCP servers in a redundant fashion. This is pretty simple with ISC’s DHCP implementation: it implements a custom failover protocol that runs over a separate, freely configurable port.
In the following example configuration I’m assuming there is one primary server that hands out DHCP leases as usual - and a secondary failover server that takes over when the primary one fails. The servers communicate with each other using the failover protocol and thus both keep track of all leases. One can configure the ports oneself - I’m assuming here that:
The primary listens on port 519
The failover peer (secondary) listens on port 520
This of course has to be accounted for in the firewall configuration.
The first step in configuring failover is simply copying the existing configuration from the primary server to the secondary - keep in mind that from now on you have to edit both files whenever you make changes (including static leases, etc.). Before launching the secondary one has to configure the failover itself though. There are a bunch of configuration options that change the behavior of the failover and that one has to decide on:
max-response-delay is the number of seconds without a message from the peer after which a server decides that its communication session is dead and that it has to re-establish the protocol connection. Usually 60 seconds is a sane value.
max-unacked-updates determines how many update messages the given server can transmit to the other side (peer or primary) in parallel without waiting for confirmation. After reaching this number of updates in flight, remaining updates have to be queued. Usually this is set to something around 10.
load balance max seconds configures the cutoff after which a server ignores the load balancing split and answers a client itself: once a client has been trying for this many seconds (according to the secs field of its request) without getting a reply, the server assumes that its peer is not answering that client. Usually this is set pretty low (around 3 to 5 seconds) so failover happens nearly transparently for all clients. Setting it too high leads to outages of DHCP services while a peer is down, so clients have to retransmit their DISCOVER messages more often. Setting it too low leads to excessive network traffic, since both servers may answer the same client on networks that have some larger loads from time to time and start dropping packets.
mclt (maximum client lead time) defines the maximum time a server can extend a lease for a client’s binding beyond the time known by the other server. This parameter essentially defines how fast recovery from a failed primary happens - and is not easy to calculate, even though there is a sample in RFC8156, section 4.4.1. The value to choose is determined by the number of new clients expected and the default lifetime of a lease. In case one does not have much of a clue it’s a good idea to set this to a default value around 3600 seconds (i.e. 1 hour), which should be a sane value on most small scale or home networks. Basically one can assume this is the maximum lease time that can be handed out by the failover peer while the primary has failed.
split configures load balancing between the failover peer and the primary. This allows one to have both servers answering clients at the same time. Setting it to 256 disables load balancing so only the primary handles DHCP requests, 128 would configure 50% load balancing so both peer and primary would handle 50% of the requests, and 0 would configure the peer to handle all requests as long as it didn’t fail.
So now one can configure the primary:
failover peer "failover-example" {
    primary;
    address 192.0.2.1;
    port 519;
    peer address 192.0.2.2;
    peer port 520;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 3600;
    split 256;
    load balance max seconds 3;
}
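If both servers should answer clients during normal operation, only the split value in the primary declaration above has to change. The following variant is a minimal sketch of that, assuming the same addresses and ports as in this example and the 50% setting described in the parameter list:

```
failover peer "failover-example" {
    primary;
    address 192.0.2.1;
    port 519;
    peer address 192.0.2.2;
    peer port 520;
    max-response-delay 60;
    max-unacked-updates 10;
    mclt 3600;
    split 128;                      # primary and peer each answer about half of the clients
    load balance max seconds 3;     # after 3 seconds without a reply either server answers
}
```

The rest of this post keeps split 256, so the secondary only answers once the primary has failed.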
To tell the dhcpd service which subnets should use the given failover, one has to add the failover peer statement to a pool inside the subnet declaration:
subnet 192.0.2.0 netmask 255.255.255.0 {
    # range 192.0.2.100 192.0.2.200;
    option routers 192.0.2.5;
    option subnet-mask 255.255.255.0;
    option ntp-servers 192.0.2.6;
    option time-offset 2;
    option broadcast-address 192.0.2.255;
    pool {
        failover peer "failover-example";
        range 192.0.2.100 192.0.2.200;
    }
}
On the other side one also has to configure the failover peer. The mclt and split settings are only set on the primary and are therefore left out here:
failover peer "failover-example" {
    secondary;
    address 192.0.2.2;
    port 520;
    peer address 192.0.2.1;
    peer port 519;
    max-response-delay 60;
    max-unacked-updates 10;
    load balance max seconds 3;
}
Of course the pool declaration is also required on the failover peer side:
subnet 192.0.2.0 netmask 255.255.255.0 {
    # range 192.0.2.100 192.0.2.200;
    option routers 192.0.2.5;
    option subnet-mask 255.255.255.0;
    option ntp-servers 192.0.2.6;
    option time-offset 2;
    option broadcast-address 192.0.2.255;
    pool {
        failover peer "failover-example";
        range 192.0.2.100 192.0.2.200;
    }
}
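Since the subnet declaration is identical on both servers, one way to reduce the risk of the two files drifting apart is the include statement of dhcpd.conf. The following is only a sketch for the failover peer side: the file name /usr/local/etc/dhcpd-shared.conf is a hypothetical choice, and that file would hold the subnet declaration shown above and be copied unchanged to both machines.

```
# /usr/local/etc/dhcpd.conf on the failover peer
# Only this block differs between the two servers.
failover peer "failover-example" {
    secondary;
    address 192.0.2.2;
    port 520;
    peer address 192.0.2.1;
    peer port 519;
    max-response-delay 60;
    max-unacked-updates 10;
    load balance max seconds 3;
}

# Shared part (hypothetical file name), identical on primary and peer.
# It has to come after the failover peer block because the pool refers to it by name.
include "/usr/local/etc/dhcpd-shared.conf";
```

With such a layout, changes to ranges, options or static leases only have to be made in one file that is then distributed to both servers.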
After applying the configuration one has to enable both services in rc.conf
and restart or reload the service on the primary and the failover peer’s side.
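For the FreeBSD port mentioned above, enabling the service could look like the following rc.conf fragment on both machines. This is a minimal sketch: the interface name em0 is an assumption and has to be replaced by the interface that faces the DHCP clients.

```
# /etc/rc.conf on the primary and on the failover peer
dhcpd_enable="YES"      # start the ISC DHCP server at boot
dhcpd_ifaces="em0"      # assumption: interface facing the DHCP clients
```

Afterwards the service can be restarted on each machine, for example with service isc-dhcpd restart.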
This is everything that’s required to get some redundancy on your DHCP configuration.
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)