Please note: Pastry is not my design or invention, it's well documented in a
number of papers. There is also a quick overview available from the authors
of Pastry.
What is a DHT?
A distributed hash table (DHT) is a distributed database that provides key-value
store semantics. Basically it's a distributed system that allows one to store
values under a given key and retrieve them again using that key. This is called a
hash table since the key is usually calculated as a hash of the stored object.
A hash function is a one-way function that should yield an evenly distributed key
value that is guaranteed to change whenever the object changes. It's also often
desirable for it to be collision free - but of course that's impossible whenever
the keyspace is smaller than the value space; hash tables are usually capable
of dealing with such collisions.
The distributed behavior is achieved by assigning nodes IDs from the same keyspace
and assigning objects to the numerically nearest node(s). Redundancy can be achieved
by storing each object on the $n$ closest nodes.
Since node IDs are assigned randomly (either by using random number generators,
being derived via a hash function from the address of the node, or being the hash
of a credential like a public key) they're statistically evenly distributed in the
keyspace if everything is done right.
Since they operate on some really basic principles, DHTs scale well, are self-organizing,
remain robust even when nodes fail, are fault tolerant and distribute load evenly
over the system. Pastry also reduces load on the network by employing a proximity
metric while building some of its tables. In addition, Pastry differs a little from
most other DHTs - it includes an overlay routing network that allows one to route messages
to nodes that one doesn't have information about. Routing tables are usually updated
whenever a node fails or joins the network.
One can build a huge number of services on top of a DHT:
- Distributed object storage (see for example Ceph - that's not
built using Pastry though)
- Filesystems (see for example again Ceph, PAST)
- Publish-subscribe systems (see for example SCRIBE)
- Distributed caches (see SQUIRREL)
- Content distribution systems (see SplitStream)
- Generic messaging (see POST)
- Peer to peer resource sharing systems (see Scrivener)
There is a nearly unlimited number of applications - depending on the environment they
run in, one has to take some security considerations into account though.
Network parameterization
First of all, Pastry uses 128 bit node IDs (i.e. 16 bytes). There is a number of
tunables though (summarized in the sketch after this list):
- The number of bits per digit or subnet inside the keyspace. This is termed $b$
and is usually set to $4$, so each digit in hexadecimal representation corresponds
to one routing step.
- The number of nodes is termed $N$.
- The size of the neighborhood set $|N|$ is usually set to $2 \cdot 2^b$
- The size of the leaf set $|L|$ is usually set to $2^b$
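To make the relationships between these tunables concrete, here is a minimal sketch
with the typical values as constants (the constant names are mine, not from the
Pastry paper):

```python
# Typical Pastry parameterization as described above; names are illustrative.
b = 4                          # bits per digit
ID_BITS = 128                  # node ID length in bits
ROWS = ID_BITS // b            # number of routing table rows: 32
NEIGHBORHOOD_SIZE = 2 * 2**b   # |N| = 32 for b = 4
LEAF_SET_SIZE = 2**b           # |L| = 16 for b = 4
```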
First let's think about the information required to contact other nodes:
- Its IP address(es). Depending on the way the system works this might be
a single IPv4 or IPv6 address or a collection of potential addresses used
during an interactive connectivity establishment process. Usually a single
IPv4 and a single IPv6 address are sufficient; sometimes a hostname
might also be needed (for example when supporting Tor hidden services,
which requires connections to be made using SOCKS4a and hostnames)
- Its node ID - usually a 128 bit number
- Depending on signatures: a public signature key (for example a DSS key)
- Depending on encryption: a public key (for example for RSA or ECC) or a
shared secret if all nodes are trusted
Additionally the node might locally keep (all of this is combined in the sketch after this list):
- Proximity information
- Timestamp of last contact
- Error counters
- A network socket handle for an open connection to the given node (depending
on the implementation).
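Putting both lists together, a per-node contact record could look like the following
sketch (the field names are illustrative, not part of the Pastry specification):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeContact:
    node_id: int                          # 128 bit node ID
    ipv4: Optional[str] = None            # IPv4 address, if known
    ipv6: Optional[str] = None            # IPv6 address, if known
    hostname: Optional[str] = None        # e.g. for Tor hidden services (SOCKS4a)
    sign_key: Optional[bytes] = None      # public signature key, if signatures are used
    enc_key: Optional[bytes] = None       # public encryption key or shared secret handle
    # Locally tracked bookkeeping:
    proximity: Optional[float] = None     # value of the proximity metric (e.g. RTT)
    last_contact: Optional[float] = None  # timestamp of the last successful contact
    error_count: int = 0                  # consecutive communication failures
```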
Node state
Neighborhood set ($N$)
The neighborhood set is not directly used during message routing but is usually
used for locality considerations. It contains the $|N|$ nearest nodes
according to the network metric - one could use hop count, latency or other cost-based
information. Note that the original paper doesn't specify which metric
one should use, and the reference implementation allowed external tools like traceroute
and ping to be used to determine the proximity value. Usually the application keeps
track of the nodes by simply measuring the proximity of known nodes and updates this
set accordingly - it's advisable to store them in an ordered fashion for faster access.
The size is usually set to $2 \cdot 2^b$, i.e. for the often used $b=4$ this
leads to 32 entries (in this case a linear scan might also be suitable during access).
When joining the network a node initializes its neighborhood set to the
neighborhood set of its initially contacted seed node.
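As a sketch of this bookkeeping, one could keep the neighborhood set as a bounded
list sorted by proximity; the interface below is my own illustration, assuming the
proximity metric is a single scalar such as a measured round trip time:

```python
import bisect

class NeighborhoodSet:
    """Keeps the |N| proximally closest known nodes, ordered by proximity."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.entries = []  # sorted list of (proximity, node_id) tuples

    def offer(self, proximity: float, node_id: int) -> bool:
        """Consider a node for inclusion; returns True if it was added."""
        # Fast discard: the worst entry is the last one in the sorted list.
        if len(self.entries) >= self.max_size and proximity >= self.entries[-1][0]:
            return False
        bisect.insort(self.entries, (proximity, node_id))
        if len(self.entries) > self.max_size:
            self.entries.pop()  # drop the most distant node
        return True
```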
Routing table ($R$)
The routing table is the main fabric used during message routing. Basically it
contains a row for every digit (i.e. group of $b$ bits in the node ID space).
For example, for 128 bit node IDs and a value of $b=4$ it would contain 32 rows.
In the paper describing Pastry the number of rows required in practice is written
as $\log_{2^b} N$. Each row contains $2^b-1$ possible entries - due to implementation
constraints one might implement it with $2^b$ entries instead. Each entry contains
the (topologically) closest known node whose ID shares the first row-index digits
with the current node and has the column's digit in the next position.
For a node ID of $0123$ (using decimal digits here purely for illustration) the table would start with:
| none | 1xxx | 2xxx | 3xxx | 4xxx | 5xxx | 6xxx | 7xxx | 8xxx | 9xxx |
| 00xx | none | 02xx | 03xx | 04xx | 05xx | 06xx | 07xx | 08xx | 09xx |
| 010x | 011x | none | 013x | 014x | 015x | 016x | 017x | 018x | 019x |
| 0120 | 0121 | 0122 | none | 0124 | 0125 | 0126 | 0127 | 0128 | 0129 |
Note that each entry might be empty, occupied by the node itself or contain an
arbitrary node that satisfies the given constraint. On average only $\log_{2^b} N$
rows are populated though.
The typical network using $b=4$ and 128 bit node IDs would require 32 rows with
15 entries each (or 16 if not optimized to exclude the own node in each row), thus
480 entries, which is totally acceptable even for applications on platforms like
the ESP8266/ESP32 - depending on the stored information.
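The row and column for a given node ID follow directly from its digits. Here is a
small sketch of the digit arithmetic (the helper names are mine):

```python
B = 4                  # bits per digit ($b$ in the text)
ID_BITS = 128          # node ID length in bits
ROWS = ID_BITS // B    # 32 rows for B = 4
COLS = 2 ** B          # 16 slots per row (one of them is the node's own digit)

def digit(node_id: int, row: int) -> int:
    """Return the digit of node_id examined at routing step `row`
    (most significant digit first)."""
    shift = ID_BITS - (row + 1) * B
    return (node_id >> shift) & ((1 << B) - 1)

def shared_prefix_len(id_a: int, id_b: int) -> int:
    """Number of leading digits shared by two node IDs."""
    for row in range(ROWS):
        if digit(id_a, row) != digit(id_b, row):
            return row
    return ROWS
```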
Leaf set ($L$)
The leaf set works somewhat like the neighborhood set and contains the $\frac{|L|}{2}$
closest nodes with IDs larger and the $\frac{|L|}{2}$ closest nodes with IDs smaller than
the current node ID - this time not in proximity space but in key space. Nodes
should be stored in sorted order to allow fast binary search. The size of the leaf set
is also configurable and is usually set to $2^b$ entries (for a typical $b=4$ i.e.
to $16$ entries, which would be small enough even for unordered access).
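A sketch of such a leaf set, kept as two sorted halves so membership and range checks
can use binary search (the interface is again my own illustration):

```python
import bisect

class LeafSet:
    """The |L|/2 numerically closest node IDs below and above our own ID."""

    def __init__(self, own_id: int, half_size: int):
        self.own_id = own_id
        self.half_size = half_size
        self.smaller = []  # sorted ascending, IDs below own_id
        self.larger = []   # sorted ascending, IDs above own_id

    def offer(self, node_id: int) -> None:
        """Consider a node for inclusion, evicting the farthest one if full."""
        if node_id == self.own_id:
            return
        side = self.smaller if node_id < self.own_id else self.larger
        pos = bisect.bisect_left(side, node_id)
        if pos < len(side) and side[pos] == node_id:
            return  # already known
        side.insert(pos, node_id)
        if len(side) > self.half_size:
            if side is self.smaller:
                side.pop(0)  # farthest-below entry is the smallest ID
            else:
                side.pop()   # farthest-above entry is the largest ID

    def all_nodes(self) -> list:
        return self.smaller + self.larger

    def covers(self, key: int) -> bool:
        """True if `key` lies within the range spanned by the leaf set."""
        lo = self.smaller[0] if self.smaller else self.own_id
        hi = self.larger[-1] if self.larger else self.own_id
        return lo <= key <= hi
```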
Message routing
Note that message routing requires a node to have already joined and joining requires
message routing to be working on other nodes, so this is somewhat of a chicken-and-egg
problem. The same routing is used during key-value storage to determine which
node is responsible for a given key. First I'll describe how the message routing works.
Basically the node receives a message from some other node that carries a destination
node ID (and usually should also carry the source node's information).
The node first checks if the destination key is contained inside the range of its
leaf set. It simply checks if the destination key $D$ is larger than or equal to the
smallest and smaller than or equal to the largest known node ID in the leaf set. If this
is the case the message is forwarded to the node with the minimum distance to the
destination key.
If the leaf set could not be used, the node tries to use the routing table. This is done
by first determining the length of the prefix that the current node and the destination
node share ($l$). Then it looks up row $l$ of the routing table at the position
corresponding to the next digit. This can be accomplished by a simple shift and
binary AND if $2^b$ entries are stored per row instead of $2^b-1$. If a node is
found in the routing table the message gets forwarded to that node.
In case this also didn't work out - which is said to be a rare case in the original
paper - the message is forwarded to any known node (from routing table, leaf set or
neighborhood set) that:
- shares an equally long or longer prefix with the destination
- has the minimum known distance to the destination node ID
If no such node exists the message has arrived at its destination node. Note that
during join this is used to determine one of the immediate neighbors of a node. This
routing procedure is guaranteed to converge except when $\frac{|L|}{2}$ of the nearest
nodes have failed simultaneously. The sketch below combines all three cases.
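Putting the three cases together, one hop of the routing decision could look like
the following. It reuses the digit helpers and the LeafSet sketch from above; the
`node` object and its `known_nodes()` accessor are illustrative assumptions, not
part of Pastry itself:

```python
def route_next_hop(node, dest: int):
    """Decide where to forward a message addressed to `dest`.
    Returns a node ID, node.own_id if we are responsible, or None."""
    # Case 1: destination falls within the leaf set range -> forward to the
    # numerically closest node among the leaves and ourselves.
    if node.leaf_set.covers(dest):
        candidates = node.leaf_set.all_nodes() + [node.own_id]
        return min(candidates, key=lambda n: abs(n - dest))

    # Case 2: use the routing table entry for the first differing digit.
    l = shared_prefix_len(node.own_id, dest)
    entry = node.routing_table[l][digit(dest, l)]
    if entry is not None:
        return entry

    # Case 3 (rare): any known node that shares an equally long or longer
    # prefix and is numerically closer to the destination than we are.
    best = None
    for n in node.known_nodes():  # union of leaf set, routing table, neighborhood set
        if shared_prefix_len(n, dest) >= l and abs(n - dest) < abs(node.own_id - dest):
            if best is None or abs(n - dest) < abs(best - dest):
                best = n
    return best  # None: the message has arrived at its destination node
```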
Node join procedure
In case a new node wants to join the DHT ring it first generates - if this has not
happened before - its own credentials (signature and encryption keys) as well
as its random node ID. The random node ID is sometimes calculated as the SHA-1 hash
of the node's IP address or as the hash of its public key. After that it requires at
least one seed node that has to be supplied by some external mechanism - like being
supplied by a central authority (which would introduce a single point of failure),
using IP multicast (which only works in local networks), being supplied manually, etc.
After the seed node has been located it gets contacted by the newly joining node. A join
message is transmitted to the seed node with its destination set to the new node ID.
The message is routed according to the standard routing protocol - all nodes that
are encountered during forwarding transmit all of their tables (neighborhood set,
routing table, leaf set) to the newly joining node.
The neighborhood set is initially copied from the seed node since it's assumed to
be in proximity to the newly joining node. From this point on the neighborhood set
is tracked as it will be during the whole lifetime of the node - i.e. in case
a node is discovered that is closer by the proximity metric than the currently known
ones, it's included in the neighborhood set and possibly replaces the most distant
one that's currently included in the set.
The last node that the join message arrives at is used as the seed for leaf set
information. The leaf set of this node is copied into the newly joining node's leaf
set - and from this moment on it's again tracked to contain the $|L|$ closest nodes
ever seen by this node, replacing more distant nodes if required.
The most problematic table to construct is the routing table. First of all, all
fields corresponding to the current node (i.e. sharing a prefix of $\text{row} \cdot b$ bits
with the current node ID) are occupied automatically by the node itself (see the example
above). The first row can be copied directly from any node since its entries don't
share a common prefix with the current node at all and are thus independent of the
current node ID. The other entries have to share a prefix with the current node and
can initially be filled with information received from the nodes that the join message
has been routed through. Note that the length of the shared prefix gets longer and
longer for every node encountered on the join message's path.
Note that this is only the initial build of the routing table. The new node may also
request additional information (i.e. state tables) from other nodes encountered during
this process: it asks all of the encountered nodes for their state tables and updates
its own state in case it locates closer nodes. The neighborhood set is pretty useful
for this process since it keeps the tables of nodes in the vicinity consistent.
After its own tables have been initialized the node transmits a joined message to
all applicable nodes to inform them of its presence so they include the new node
in their routing tables. A sketch of the initial routing table construction follows below.
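Here is that sketch, under the assumption (as described above) that the $i$-th node
the join message passed through shares at least $i$ digits with the new node ID. It
reuses the `ROWS`, `COLS` and `digit` helpers from earlier; the node interface is
illustrative:

```python
def build_initial_routing_table(own_id: int, path_nodes: list) -> list:
    """path_nodes[i] is the i-th node the join message was routed through
    and is assumed to expose its routing table as `routing_table`."""
    table = [[None] * COLS for _ in range(ROWS)]
    for i, n in enumerate(path_nodes):
        if i >= ROWS:
            break
        # Row i of the i-th node on the path is usable as our own row i,
        # since that node shares (at least) the first i digits with us.
        for col, entry in enumerate(n.routing_table[i]):
            if entry is not None and table[i][col] is None:
                table[i][col] = entry
    # The slot matching our own digit in each row refers to ourselves.
    for row in range(ROWS):
        table[row][digit(own_id, row)] = own_id
    return table
```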
Keeping track of the tables during runtime
Leaf set and neighborhood set
These two sets are the easiest ones to track during runtime. For the leaf set $L$
one can simply track the distance of all active (not currently joining) nodes encountered
during any communication process and check whether a closer node has been encountered,
in which case a currently known node gets replaced by the closer one.
The neighborhood set $N$, which is built according to a proximity metric, can be tracked
by evaluating the proximity of all encountered nodes on a periodic basis and doing the
same thing - just storing nodes that are in the proximity of the current node. It's a good
idea to keep track of the worst proximity value for some fast discarding (as in the
neighborhood set sketch above).
Routing table
The routing table might get updated periodically with newly arriving nodes, failed
nodes can be removed, and nodes usually exchange updated routing information with
their neighbors.
When any node provides state information to another node it attaches a timestamp value.
The receiving node then updates its own routing table according to the state information
and eventually notifies the original node - for example - of its arrival. The arriving
node attaches the original timestamp so the receiving node can check whether its own
routing information has changed since then - and might signal a restart of the routing
table update to the now joined node.
Node leave procedure
Failure and departure are handled the same way. In case a node is discovered to be dead
- because no keep-alive is exchanged or a request times out, or because the node
signals the end of its connections in connection-oriented setups - the node gets
removed from the network.
To replace a node inside the leaf set, a node contacts the most distant existing
entry in the leaf set and asks for its leaf set to update its own information about
neighboring nodes.
To repair a failed node entry in the routing table, a node contacts another node
from the same row (i.e. with the same shared prefix length) and asks for its entry for the
failed prefix and digit. In the event that none of the nodes from the same row has a
satisfying live node, the net gets cast wider - i.e. the node contacts
nodes sharing an even longer prefix from the rows below. This is likely to find an entry
in case one exists.
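As a sketch, this repair strategy could look as follows; `ask_for_entry` stands in
for whatever RPC mechanism the implementation uses and is purely illustrative, and
the `ROWS` constant is reused from the digit helpers above:

```python
def repair_routing_entry(node, row: int, col: int):
    """Replace a failed entry at (row, col) by asking nodes from the same
    row first and then from rows with longer shared prefixes."""
    for r in range(row, ROWS):
        for peer in node.routing_table[r]:
            if peer is None or peer == node.own_id:
                continue
            # Any node in row r >= row shares at least `row` digits with us,
            # so its entry at (row, col) is also valid for our table.
            replacement = node.ask_for_entry(peer, row, col)  # illustrative RPC
            if replacement is not None:
                node.routing_table[row][col] = replacement
                return replacement
    return None  # no live replacement found anywhere
```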
Network resilience
One of the questions that arises when designing DHTs is that of resilience
against malicious behavior of nodes. Pastry tries to counter malicious nodes by introducing
some randomness. Each routing step requires selecting a node that lies closer to the
target in keyspace - but the choice among the nodes that fulfill this property can be made
arbitrarily, with some randomness. It's entirely possible to collect possible routing hops
first and then select one of them. In case of a failed query the query might have to be
repeated a number of times, but this process of randomly selecting suboptimal
steps allows one to circumvent failed or malicious nodes. Note that of course the probability
distribution of the random number generator used during this process should be biased
towards the optimal routing decision anyway.
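One simple way of biasing such a random choice towards the best hop while still
occasionally taking a suboptimal one might look like this (the `bias` parameter is
an illustrative knob, not something from the paper):

```python
import random

def pick_next_hop(candidates: list, dest: int, bias: float = 0.75):
    """Pick a forwarding candidate, preferring hops closer to `dest`.
    Assumes `candidates` contains at least one valid node ID."""
    ranked = sorted(candidates, key=lambda n: abs(n - dest))
    for hop in ranked:
        if random.random() < bias:  # usually take the best remaining hop
            return hop
    return ranked[-1]  # fell through every coin flip: take the last candidate
```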
Network partitions
Underlying network failure (e.g. Internet routing failure) may lead to DHT network
partitioning. Pastry is usually capable of coping with situations in which a node
is only reachable by a subset of nodes as long as the underlying network is not
fully partitioned - but in case a network segment gets separated from the Internet (or
from the local network in which the DHT is used) the routing tables might be updated
in a way that forms isolated islands of the DHT. The DHT itself is not capable
of recovering from this situation.
To circumvent such situations one has to bypass the overlay routing network for
node localization. In local networks one might use periodic multicast queries for neighboring
nodes, which might then be introduced into the leaf set so islands get reintegrated
periodically - the same information might be used to update routing tables at each step.
One might also use some other kind of centralized coordinator to join separated islands
on a random periodic basis.
This article is tagged: Programming, Tutorial