What is this blog entry about?
As everyone who has worked with servlet containers like Apache Tomcat
knows, these containers are capable of deploying new servlets by simply copying
a new version of the web application archive into a folder. The container then
terminates the currently running version of the servlet, redeploys the application
from the newly copied archive and instantiates the new servlet. This allows
for easy upgrading of web applications with a minimal amount of downtime. The
upgrade process is also pretty simple because it just requires access to
a file transfer method like rsync or scp - as long as one doesn't
want to upgrade the servlet container itself (note that today the servlet
container is often deployed together with the web application using a
mechanism like Docker - the approach described in this blog post isn't
suited for that scenario).
On the other hand the approach taken by most servlet containers is still
somewhat problematic since only a single version of the servlet can be
deployed at the same time. This means that first all requests to the old version
have to be completed, the old version has to be shut down and the new version
has to be deployed; it is only reachable for new connections after the deployment
has succeeded. This leads to a (hopefully short) downtime during which connections
are either dropped or delayed. Also most servlet containers return errors while
the old servlet has been stopped and the new version hasn't been deployed yet.
To solve that problem other systems like the legendary Erlang
support keeping two versions of a module loaded and active at the same time.
Control flow stays inside the current version of the module and might jump
into the new version whenever the programmer decides to do so (normally this is
done during some tail recursive calls) or after all old lightweight threads
that used the old version have terminated. This is a feature that allows for example
runtime upgrading of telecommunication routers without interrupting any
network connections. There have also been experiments with using Erlang
for robotic control systems - for example there has been a demonstration of
replacing the control algorithms of a quadcopter in flight. New calls to
module functions from the outside are also directed to the new version of
the module; calls from the inside go either to the version they originate from
or to the new one.
In Erlang there exists a limit of a maximum of two module versions - if one
tries to load a third one the VM simply kills the application. The feature
is heavily supported by the BEAM virtual machine and the Erlang language
itself. The fact that it's a (non-pure) functional language is of course
also helpful since global state is minimized with this programming style - and
it's heavily encouraged to handle for example different network connections
using different lightweight threads.
The approach described in this blog post tries to provide the foundation for
similar behavior for applications coded in ANSI C. Note that in this case
the application modules have to support runtime upgrading in an explicit
way.
Basics: Loading modules
First one has to know how modules can be loaded into the current process.
The basic idea is to use dynamic link libraries (DLL) or shared objects (SO)
depending on the operating system. On all major operating systems they can
be opened (dlopen on POSIX systems, LoadLibrary on Windows). After
a module has been opened one can query a pointer (function or data) to
symbols exported from these modules - relative to the module handle
returned by the previous functions. This is normally done using dlsym
and dlfunc on POSIX or GetProcAddress on Windows. After an
application is finished using the DLL/SO it gets closed by dlclose
on POSIX and FreeLibrary on Windows.
There is one drawback in case one simply wants to do file alteration monitoring
on a module directory and open changed DLLs/SOs in place. During the first
deployment this would work - and one is even capable
of unlinking the DLLs/SOs so their inodes get released after the
modules are closed using dlclose or FreeLibrary. Unfortunately
copying a new version on top of the existing one replaces the content of the
file and doesn't create a new inode, so the code of the module gets replaced
in place (especially if it's only mmap'ed), the old version gets overwritten
and applications might crash - or the write access is simply denied.
To solve that problem a rather simple approach can be used: Whenever the
file alteration monitor or a periodic scanner detects a new module version,
this module is first copied to a temporary location using a unique filename
with suitable permissions. Then the module gets opened using dlopen
or LoadLibrary. In case signature verification is required it should
be done on the new temporary copy of the file that's inaccessible by any
entity except the application itself. This closes an often encountered
vulnerability that allows injecting code via a correctly signed binary: after
the loader has calculated the hash of the module an external application
overwrites the plugin. The signature check is done against the original
correctly signed binary - the loader then opens the injected code.
Then the file gets immediately unlink-ed or deleted using DeleteFile
so there is no chance of files staying inside the cache without being needed
any more or being overwritten. After that access to the symbols is done as
usual. After the library has been closed the inode is released immediately.
So the basic flow is:
- Detect module modification
- Copy the module to a new random temporary location that's only accessible by the application
- Perform signature checking on the temporary copy
- Open the library using dlopen or LoadLibrary
- Unlink or delete the temporary file (unlink, DeleteFile)
- Query required symbols during the lifetime of the module (dlsym and GetProcAddress)
- As soon as all operations using the old version have finished it simply gets closed using dlclose or FreeLibrary
Basic switching
The simplest method of upgrading for short lived network services
like webservers or similar systems is to just keep a
reference to the newest version of the loaded library inside
the core application and hand each and every new connection
to the newest module. Old modules still handle old connections. There just
has to be synchronization when accessing shared data stores or global
state. In case all modules are reference counted they'll be closed
and released automatically after the old connections have been dropped.
Pros:
- Really easy and simple implementation.
- Jumping around between arbitrary revisions might be possible.
- Downgrading is possible most of the time.
Cons:
- Multiple versions might have to exist at the same time and perform
consistent access to storage backends.
Runtime upgrading
This is a more advanced idea. It works by using an event callback based
approach. When a module gets loaded for the first time it registers event
callback handlers inside the main container or some event handling
framework. For example it registers a function to be called in case
new incoming connections have been accepted - or it registers a
callback that will be called whenever data is received from a client.
The new module will now register filtering event callbacks at the
same points where the old module has registered its own. During
the registration step the new module will simply call the previously registered
functions again. This puts the new module transparently in place. Then
the new module will start to transfer state via a module implementation
specific method into its own instance and applies the messages passing
through the filter to its internal state. This allows runtime state transfer
from the old module into the new module - it's easiest when using
a pattern like event sourcing and, for example, caching incoming messages
during partial state transfers. As soon as the module is capable of
taking over connections or processes from the old module the filter functions
don't call back into the old module any more. This allows transferring running
connections to a new version.
Pros:
- Runtime upgrading also works for long lasting jobs and connections.
- Simpler to access backends in a consistent way.
Cons:
- Multi-stage process.
- Really complex.
- Not generalizable - it has to be implemented for each and every module.
Details: Directory change notification
So after one knows how to load modules one has to know how directory change
notifications work. This is an operating system dependent part - there currently
is no portable way of performing such detection.
Note that there is another caveat - file system change notifications are not
reliable on many systems (like Windows for example) and do not work on all
types of filesystems, for example network filesystems (NFS). To circumvent
this situation one should only use change notifications as an immediate indication
of change and then perform a scan based on well known metadata of files
like last changed time, creation time and/or file size - my implementation
uses all of them and detects a change whenever any of the attributes changed.
Depending on the OS, attributes like owner and group are used as well.
Since it's possible on most systems that event notifications are missed, all
implementations that I've written also run a periodic scan over the specific
watched directories and check modification of attributes independent of
any notifications. This of course induces some overhead - especially in case
directories get large - but it's inevitable. One should only use this kind
of watching for rather small directories not containing tens of thousands or
even millions of files - for those one might use hashed directory storage and
large timeouts instead.
FreeBSD (kqueue)
On FreeBSD the most efficient way to monitor a directory for changes is simply
opening a directory handle using open:
int hDirectory;
hDirectory = open(lpDirectory, O_RDONLY|O_SHLOCK|O_DIRECTORY|O_CLOEXEC);
if(hDirectory < 0) {
// Error handling
}
In this case the flags:
- Open the directory in read only mode (O_RDONLY)
- Request a shared lock (O_SHLOCK) so deletion is not possible while the directory is open
- Require the resource to be a directory (O_DIRECTORY)
- Request that the handle will be closed on exec (O_CLOEXEC) so it won't be inherited.
Then one can simply subscribe to the EVFILT_VNODE watching filter.
This filter triggers on different supported conditions on all supported filesystems:
- NOTE_ATTRIB is triggered in case attributes of the file descriptor such as owner, group, permissions or size have changed.
- NOTE_CLOSE triggers when a file descriptor that had been opened with read only permissions has been closed.
- NOTE_CLOSE_WRITE is the same as NOTE_CLOSE but for a file descriptor that had write permissions.
- NOTE_DELETE signals that an unlink call has been executed.
- NOTE_EXTEND reports for a directory that an entry was added or removed as the result of a rename operation.
- NOTE_LINK notifies about a changed link count - for example when a subdirectory has been created inside a directory.
- NOTE_OPEN notifies about an open against the referenced node.
- NOTE_READ triggers whenever a read against the node has been executed.
- NOTE_RENAME signals the object has been renamed.
- NOTE_REVOKE reports that access to the node has been revoked using revoke - for example in case of unmount.
- NOTE_WRITE signals that the object has been written to.
Note that immediately after enabling the filter a directory scan operation should
be started. This should also happen after each re-arming of the notification.
This is required to not miss any modifications but might trigger scanning
twice - which is most of the time acceptable.
Note: Keep in mind that this only watches for directory modifications - the
filter EVFILT_VNODE does not trigger in case a member file simply gets written
to, but it still triggers if a member gets replaced by another file atomically.
Windows (IO completion ports and ReadDirectoryChangesW)
Windows works - as usual - a little bit differently. There are multiple ways
to subscribe to directory change notifications. The most flexible and powerful
one is to use ReadDirectoryChangesW in conjunction with the excellent IO completion
ports (IOCP). IO completion ports are the method of choice to perform asynchronous
overlapped operations on Windows. One assigns a file handle to an IO completion
port, executes an overlapped I/O operation (i.e. an operation with all required
buffers already attached so data can be read or written directly by the specific
driver) and gets a notification enqueued in a scaling task queue. The task
queue itself can be used by an arbitrary number of threads but is capable of
controlling the concurrency limit - i.e. it can control how many threads are
used to process events in parallel.
The flow to use IOCP is somewhat different from kqueue on FreeBSD:
First one has to open the directory. This is done using CreateFile as
usual - one should at least specify the GENERIC_READ access permission
as well as OPEN_EXISTING to prevent creating a new file. The
flags FILE_FLAG_OVERLAPPED and FILE_FLAG_BACKUP_SEMANTICS have to
be specified. Overlapped I/O is required to be used with IOCP,
the backup semantics are required to use ReadDirectoryChangesW.
After that the directory handle gets assigned to the IOCP that's going
to be used for directory watching using CreateIoCompletionPort as usual.
Since I normally use a single set of threads for all directory watching operations
I designate a single IO completion port to directory watching - all watching
threads are attached to the same IOCP. Usually I'm also using just one
watching thread since change notifications from directories are normally not
the highest priority in the applications I'm developing.
Then one has to start a read operation using the ReadDirectoryChangesW
function. This operation already requires a target buffer to write into which
has to be pre-allocated. This is usually done on a per directory basis and
stored together with the directory handle.
One can specify which type of events one wants to receive:
- FILE_NOTIFY_CHANGE_FILE_NAME watches rename, creation or deletion of files.
- FILE_NOTIFY_CHANGE_DIR_NAME is caused in case child directories are modified.
- FILE_NOTIFY_CHANGE_ATTRIBUTES is raised on any attribute change in the directory or its subdirectories.
- FILE_NOTIFY_CHANGE_SIZE is triggered on resize. This notification might be heavily delayed due to caching.
- FILE_NOTIFY_CHANGE_LAST_WRITE signals the modification of the last written time.
- FILE_NOTIFY_CHANGE_LAST_ACCESS is raised on modification of the last access time. Note that the last accessed time is not written on all filesystems.
- FILE_NOTIFY_CHANGE_CREATION signals that the creation time has been changed on any of the files.
- FILE_NOTIFY_CHANGE_SECURITY signals that security attributes like the DACLs have changed.
After starting the operation a scan should also be triggered immediately so
no modification is missed. The same should be done on every
re-arming of the function. Always first enqueue the ReadDirectoryChangesW
operation and then enqueue the scan operation - this might trigger two
consecutive scans but doesn't miss any change events.
Detecting change
Now that one gets change notifications one could be tempted to fully trust
the notifications received - this would be a major mistake, especially
on Windows, since there are many conditions under which one might miss change
notifications, and file alteration monitoring is simply not supported on
some filesystems like NFS on most major operating systems. Because of this,
change notifications should only be seen as a hint that something (highly likely)
has happened - but not be relied upon.
To detect changes one might then walk the watched directory or the whole
directory hierarchy and keep a record of all known files. As usual this
approach is not suited for every application - in case one has thousands
or millions of files periodic scanning would not be a good idea (for example
when monitoring image or media galleries) - but in case of runtime loadable
modules this is totally feasible. If it's not feasible because there are too
many modules one should think about a different approach for injecting new
components.
What can be used to detect change?
- The time the file has been modified last (this is an easy to get property). One
should note that this time doesn't really have to be updated by every
filesystem, but in my opinion it's a sane assumption to require filesystems
that should support immediate action on module updates to write this property.
- Owner and group (on Unix) or ACL (Windows) properties. These should be
watched too to prevent missing modules that have been copied
into the directory with invalid permissions and have been modified
later on.
- File size. This can detect an ongoing copy process but not modifications inside
the file that do not increase or decrease the file size.
- A file hash. This is the safest approach - but also the most expensive
one since one would have to recalculate the hash on each and every iteration.
Calculating the hash will of course be required for any signature verification
after a modification has been detected.
In my current implementations I use a tuple of last modified time, access permissions
and file size. The hash and signature are only used after the file has
been copied to a different location and its integrity should be verified.
This information can be gathered using:
- stat on Unixoid operating systems like FreeBSD and Linux
- GetFileAttributes, GetFileSize and GetSecurityInfo together with LookupAccountSid on Windows
Note: To be continued