09 Jul 2020 - tsp
Last update 14 Aug 2020
As everyone who has worked with servlet containers like Apache Tomcat knows, these containers are capable of deploying new servlets by simply copying a new version of the web application archive into a folder. The container then terminates the currently running version of the servlet, redeploys the application from the newly copied archive and instantiates the new servlet. This allows for easy upgrading of web applications with minimal downtime. The upgrade process is also pretty simple because it only requires access via a file transfer method like rsync or scp - as long as one doesn’t want to upgrade the servlet container itself (note that today the servlet container is often deployed together with the web application using a mechanism like Docker - the approach described in this blog post isn’t suited for that scenario).
On the other hand the approach taken by most servlet containers is still somewhat problematic since only a single version of the servlet can be deployed at any given time. This means that first all requests to the old version have to be completed, the old version has to be shut down and the new version has to be deployed; it only becomes reachable for new connections after the deployment has succeeded. This leads to a (hopefully short) downtime during which connections are either dropped or delayed. Most servlet containers also return errors in the window between the old servlet being stopped and the new version being deployed.
To solve that problem other systems like the legendary Erlang support keeping two versions of a module loaded and active at the same time. Control flow stays inside the current version of the module and might jump into the new version whenever the programmer decides to do so (normally during some tail recursive calls) or after all old lightweight threads that used the old version have terminated. This is a feature that allows, for example, runtime upgrading of telecommunication routers without interrupting any network connections. There have also been experiments with using Erlang for robotic control systems - for example there has been a demonstration of replacing the control algorithms of a quadcopter in flight. New calls to module functions from the outside are directed to the new version of the module; calls from the inside go either to the version they originate from or to the new one.
In Erlang there is a hard limit of two module versions - if one tries to load a third one the VM simply kills any processes still running the oldest version. The feature is heavily supported by the BEAM virtual machine and the Erlang language itself. The fact that it’s a (non-pure) functional language is of course also helpful since this programming style minimizes global state - and it’s heavily encouraged to handle, for example, different network connections using different lightweight threads.
The approach described in this blog post tries to provide the foundation for similar behavior for applications written in ANSI C. Note that in this case the application modules have to support runtime upgrading explicitly.
First one has to know how modules can be loaded into the current process. The basic idea is to use dynamic link libraries (DLLs) or shared objects (SOs) depending on the operating system. On all major operating systems they can be opened at runtime (dlopen on POSIX systems, LoadLibrary on Windows). After a module has been opened one can query pointers (function or data) to symbols exported from the module - relative to the module handle returned by the previous functions. This is normally done using dlsym and dlfunc on POSIX or GetProcAddress on Windows. After an application is finished using the DLL/SO it gets closed by dlclose on POSIX or FreeLibrary on Windows.
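As a minimal sketch, the whole cycle on a POSIX system might look like this - the module path ./module.so and the exported function moduleInit are made-up names for illustration:

#include <stdio.h>
#include <dlfcn.h>

/* Signature the example module is assumed to export */
typedef int (*lpfnModuleInit)(void);

int main(void) {
	void* hModule;
	lpfnModuleInit fnInit;

	/* Load the shared object into the current process */
	hModule = dlopen("./module.so", RTLD_NOW|RTLD_LOCAL);
	if(hModule == NULL) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return 1;
	}

	/* Resolve an exported symbol relative to the module handle */
	fnInit = (lpfnModuleInit)dlsym(hModule, "moduleInit");
	if(fnInit == NULL) {
		fprintf(stderr, "dlsym failed: %s\n", dlerror());
		dlclose(hModule);
		return 1;
	}

	printf("moduleInit returned %d\n", fnInit());

	/* Release the module again */
	dlclose(hModule);
	return 0;
}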
There is one drawback in case one simply wants to do file alteration monitoring on a module directory and open changed DLLs/SOs as the new version. During the first deployment this would work - and one is even capable of unlinking the DLLs/SOs so their inodes get released after the modules are closed using dlclose or FreeLibrary. Unfortunately, copying a new version on top of the existing one replaces the file contents and doesn’t create a new inode, so the code of the module gets replaced in place (especially if it’s only mmap’ed); the old version gets overwritten and applications might crash - or the write access is simply denied.
To solve that problem a rather simple approach can be used: Whenever the file alteration monitor or a periodic scanner detects a new module version, this module will first be copied to a temporary location using a unique filename with suitable permissions. Then the module gets opened using dlopen or LoadLibrary. In case signature verification is required it should be done on the new temporary copy of the file that’s inaccessible to any entity except the application itself. This avoids an often encountered time-of-check to time-of-use bug that allows injecting code: an attacker supplies a correctly signed binary and then, after the loader has calculated the hash of the module, an external application overwrites the plugin. The signature check succeeds against the original, correctly signed binary - but the loader then opens the injected code.
Then the file gets immediately unlink-ed (or deleted using DeleteFile) so there is no chance of files staying inside the cache without being needed any more or being overwritten. After that, access to the symbols is done as usual. After the library has been closed the inode is immediately released.
So the basic flow is:
1. Open the module (dlopen or LoadLibrary)
2. Immediately delete the temporary file (unlink, DeleteFile)
3. Resolve the required symbols (dlsym and GetProcAddress)
4. Close the module when it’s no longer needed (dlclose or FreeLibrary)
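A minimal sketch of this flow on POSIX, assuming the copy target /tmp and the helper name loadModuleCopy (both made up for illustration; the signature check is only hinted at):

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <dlfcn.h>

static void* loadModuleCopy(const char* lpModulePath) {
	char szTemp[] = "/tmp/moduleXXXXXX";
	char buf[4096];
	ssize_t n;
	int hSource, hTemp;
	void* hModule;

	/* Copy the module to a private temporary file (mkstemp uses mode 0600) */
	if((hSource = open(lpModulePath, O_RDONLY)) < 0) { return NULL; }
	if((hTemp = mkstemp(szTemp)) < 0) { close(hSource); return NULL; }
	while((n = read(hSource, buf, sizeof(buf))) > 0) {
		if(write(hTemp, buf, (size_t)n) != n) { n = -1; break; }
	}
	close(hSource);
	close(hTemp);
	if(n < 0) { unlink(szTemp); return NULL; }

	/* An optional signature check would run here - on the private copy */

	hModule = dlopen(szTemp, RTLD_NOW|RTLD_LOCAL);

	/* Drop the directory entry immediately; the inode stays alive
	   until dlclose releases the last mapping */
	unlink(szTemp);
	return hModule;
}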
The simplest method of upgrading for short lived network services like webservers or similar systems: just keep a reference to the newest version of the loaded library inside the core application and hand each new connection to the newest module. Old modules still handle old connections; there just has to be synchronization when accessing shared data stores or global state. In case all modules are reference counted they’ll be closed and released automatically after the old connections have been dropped (a sketch of such reference counting follows after the pros and cons below).
Pro:
- Very simple to implement - no state transfer between module versions is required.
Cons:
- Old connections keep running on the old code until they terminate, so long lived connections delay the release of old modules.
- Shared data stores and global state still require synchronization between versions.
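A minimal sketch of such reference counting - all names (moduleHandle, moduleAcquireNewest, etc.) are assumptions, and the counters would need atomics or a lock in real code:

#include <stddef.h>
#include <dlfcn.h>

struct moduleHandle {
	void*         hLibrary;  /* handle returned by dlopen */
	unsigned long refCount;  /* currentModule itself holds one reference */
};

static struct moduleHandle* currentModule; /* newest loaded version */

/* Called for every new connection: pin the newest version */
struct moduleHandle* moduleAcquireNewest(void) {
	currentModule->refCount++;
	return currentModule;
}

/* Called when a connection finishes */
void moduleRelease(struct moduleHandle* lpModule) {
	if((--lpModule->refCount) == 0) {
		/* Last user gone - only happens for replaced versions since
		   currentModule always holds one reference itself */
		dlclose(lpModule->hLibrary);
	}
}

/* Called after a new version has been loaded */
void moduleInstall(struct moduleHandle* lpNewModule) {
	struct moduleHandle* lpOldModule = currentModule;
	lpNewModule->refCount = 1; /* reference held by currentModule */
	currentModule = lpNewModule;
	if(lpOldModule != NULL) { moduleRelease(lpOldModule); }
}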
This is a more advanced idea. It uses an event callback based approach. As a module gets loaded for the first time it registers event callback handlers inside the main container or some event handling framework. For example it registers a function to be called in case new incoming connections have been accepted - or it registers a callback that will be called whenever data is received from a client.
The new module will now register filtering event callbacks at the same points where the old module has registered its own. During the registration step the new module will simply call the old registered functions again. This puts the new module transparently in place. Then the new module starts to transfer state via a module implementation specific method into its own instance and applies the messages passing through the filter to its internal state. This allows runtime state transfer from the old module into the new module - it’s easiest when using a pattern like event sourcing and, for example, caching incoming messages during partial state transfers. As soon as the module is capable of taking over connections or processes from the old module, the filter functions stop calling back into the old module. This allows transferring running connections to a new version (see the sketch after the pros and cons below).
Pros:
- Running connections and internal state can be transferred to the new version without interruption.
Cons:
- Considerably more complex - every module has to implement its own state transfer method.
- State transfer is only straightforward with patterns like event sourcing.
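The following sketch illustrates the filtering idea for a single "data received" event; all structure and function names are assumptions for illustration:

#include <stddef.h>

typedef void (*lpfnOnData)(void* lpContext, const char* lpData, size_t dwLen);

/* One slot per event type inside the core application */
struct eventSlot {
	lpfnOnData callback;
	void*      context;
};

static struct eventSlot slotOnData;

/* State of the new module: it remembers the handler it replaced */
struct newModuleState {
	struct eventSlot oldSlot;
	int              takeoverComplete;
};

static void newModuleOnData(void* lpContext, const char* lpData, size_t dwLen) {
	struct newModuleState* lpState = (struct newModuleState*)lpContext;

	/* Observe the event to update the new module's own state
	   (e.g. apply it event-sourcing style during state transfer) */

	if(!lpState->takeoverComplete) {
		/* Still transferring state: forward to the old module */
		lpState->oldSlot.callback(lpState->oldSlot.context, lpData, dwLen);
	} else {
		/* Takeover done: handle the event in the new version */
	}
}

/* Called when the new module version is loaded: install the filter */
void newModuleRegister(struct newModuleState* lpState) {
	lpState->oldSlot = slotOnData;          /* keep the old handler */
	lpState->takeoverComplete = 0;
	slotOnData.callback = &newModuleOnData; /* put the filter in place */
	slotOnData.context  = lpState;
}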
So after one knows how to load modules, one has to know how directory change notifications work. This is an operating system dependent part - there is currently no portable way of performing such detection.
Note that there is another caveat - file system change notifications are not reliable on many systems (like Windows for example) and do not work on all types of filesystems (for example network filesystems like NFS). To work around this, one should only use change notifications as an immediate hint of change and then perform a scan based on well known file metadata like last modified time, creation time and/or file size - my implementation uses all of them and detects a change whenever any of these attributes changed. Depending on the OS, attributes like owner and group are used as well.
Since it’s possible on most systems that event notifications are missed, all implementations that I’ve written also run a periodic scan over the watched directories and check for attribute modifications independent of any notifications. This of course induces some overhead - especially in case directories get large - but it’s inevitable. One should only use this kind of watching for rather small directories, not ones containing tens of thousands or even millions of files - for those one might use hashed directory storage and large timeouts.
On FreeBSD the most efficient way to monitor a directory for changes is to simply open a directory handle using open:
#include <fcntl.h>

/* lpDirectory contains the path of the watched directory */
int hDirectory;
hDirectory = open(lpDirectory, O_RDONLY|O_SHLOCK|O_DIRECTORY|O_CLOEXEC);
if(hDirectory < 0) {
	// Error handling
}
In this case the following flags are used:
- O_RDONLY opens the directory read only.
- O_SHLOCK applies a shared lock so deletion is not possible while the directory is open.
- O_DIRECTORY makes the call fail if the path does not refer to a directory.
- O_CLOEXEC marks the descriptor close-on-exec so it won’t be inherited by child processes.
Then one can simply subscribe to the EVFILT_VNODE watching filter.
This filter triggers on different supported conditions on all supported filesystems:
- NOTE_ATTRIB is triggered in case the attributes of the file descriptor such as owner, group, permissions or size have changed.
- NOTE_CLOSE triggers when the file descriptor has been closed and the descriptor had been opened with read only permissions.
- NOTE_CLOSE_WRITE is the same as NOTE_CLOSE but for a file descriptor that had write permissions.
- NOTE_DELETE signals that an unlink call has been executed.
- NOTE_EXTEND reports for a directory that an entry was added or removed as the result of a rename operation.
- NOTE_LINK notifies about a changed link count - for example when a subdirectory has been created inside a directory.
- NOTE_OPEN notifies about an open against the referenced node.
- NOTE_READ is triggered whenever a read against the node has happened.
- NOTE_RENAME signals that the object has been renamed.
- NOTE_REVOKE reports that access to the node has been revoked via revoke, for example in case of unmount.
- NOTE_WRITE signals that the object has been written to.
Note that immediately after enabling the filter a directory scan operation should be started. This should also happen after each re-arming of the notification. This is required to not miss any modifications but might trigger scanning twice - which is acceptable most of the time.
Note: Keep in mind that this only watches for directory modifications - the EVFILT_VNODE filter does not trigger in case a member file is simply written to, but it still triggers if the file is atomically replaced by another one.
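A minimal sketch of arming this filter with kqueue, assuming the directory descriptor from above, a hypothetical scanDirectory routine and a flag selection that would have to be adjusted to taste:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>

int watchDirectory(int hDirectory) {
	struct kevent change, event;

	int kq = kqueue();
	if(kq < 0) { return -1; }

	/* Register the directory descriptor for vnode events */
	EV_SET(&change, hDirectory, EVFILT_VNODE,
	       EV_ADD|EV_ENABLE|EV_CLEAR,
	       NOTE_WRITE|NOTE_EXTEND|NOTE_ATTRIB|NOTE_LINK,
	       0, NULL);
	if(kevent(kq, &change, 1, NULL, 0, NULL) < 0) { return -1; }

	/* Scan once right after arming the filter (see the note above) */
	/* scanDirectory(); */

	for(;;) {
		int n = kevent(kq, NULL, 0, &event, 1, NULL);
		if(n < 0) { return -1; }
		if(n > 0) {
			printf("Change detected (fflags 0x%x), rescanning\n",
			       (unsigned)event.fflags);
			/* scanDirectory(); */
		}
	}
}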
Windows works - as usual - a little differently. There are multiple ways to subscribe to directory change notifications. The most flexible and powerful one is to use ReadDirectoryChangesW in conjunction with the excellent I/O completion ports (IOCP). I/O completion ports are the method of choice for performing asynchronous overlapped operations on Windows. One assigns a file handle to an I/O completion port, executes an overlapped I/O operation (i.e. an operation with all required buffers already attached so data can be read or written directly by the specific driver) and gets a notification enqueued in a scalable task queue. The task queue itself can be used by an arbitrary number of threads but is capable of enforcing a concurrency limit - i.e. it can control how many threads process events in parallel.
The flow for using IOCP is somewhat different from kqueue on FreeBSD: First one has to open the directory. This is done using CreateFile as usual - one should at least specify the GENERIC_READ access permission as well as OPEN_EXISTING to prevent creating a new file. The flags FILE_FLAG_OVERLAPPED and FILE_FLAG_BACKUP_SEMANTICS have to be specified. Overlapped I/O is required for use with IOCP, the backup semantics are required to use ReadDirectoryChangesW.
After that the directory handle gets assigned to the IOCP that’s going to be used for directory watching using CreateIoCompletionPort as usual. Since I normally use a single set of threads for all directory watching operations I designate a single I/O completion port to directory watching - all watching threads are attached to the same IOCP. Usually I’m also using just one watching thread since change notifications from directories are normally not the highest priority in the applications I’m developing.
Then one has to start a read operation using ReadDirectoryChangesW. This operation already requires a pre-allocated target buffer to write into. This buffer is usually allocated on a per directory basis and stored together with the directory handle.
One can specify which types of events one wants to receive:
- FILE_NOTIFY_CHANGE_FILE_NAME watches renaming, creation or deletion of files.
- FILE_NOTIFY_CHANGE_DIR_NAME is raised in case child directories are modified.
- FILE_NOTIFY_CHANGE_ATTRIBUTES is raised on any attribute change in the directory or its subdirectories.
- FILE_NOTIFY_CHANGE_SIZE is triggered on changes of the file size. This notification might be heavily delayed due to caching.
- FILE_NOTIFY_CHANGE_LAST_WRITE signals the modification of the last written time.
- FILE_NOTIFY_CHANGE_LAST_ACCESS is raised on modification of the last access time. Note that the last accessed time is not written on all filesystems.
- FILE_NOTIFY_CHANGE_CREATION signals that the creation time has been changed on any of the files.
- FILE_NOTIFY_CHANGE_SECURITY signals that security attributes like the DACLs have changed.
After starting the routine a scan should also be triggered immediately to be able to not miss any modifications. The same should be done on every re-arming of the function. Always first enqueue the ReadDirectoryChangesW operation and then enqueue the scan operation - this might trigger two consecutive scans but doesn’t miss any change events.
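Putting these steps together, a minimal sketch might look like this - the buffer size, the notify flag selection and the omitted record parsing and re-arming are assumptions:

#include <windows.h>
#include <stdio.h>

static BYTE notifyBuffer[64 * 1024];
static OVERLAPPED ovDirectory; /* one per outstanding request, zero-initialized */

int watchDirectory(LPCSTR lpDirectory) {
	DWORD dwBytes; ULONG_PTR dwKey; LPOVERLAPPED lpOv;

	/* Open the directory with overlapped I/O and backup semantics */
	HANDLE hDirectory = CreateFileA(
		lpDirectory, GENERIC_READ,
		FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
		NULL, OPEN_EXISTING,
		FILE_FLAG_BACKUP_SEMANTICS|FILE_FLAG_OVERLAPPED, NULL);
	if(hDirectory == INVALID_HANDLE_VALUE) { return -1; }

	/* Attach the handle to a (new) completion port; the completion
	   key identifies this directory when packets are dequeued */
	HANDLE hIocp = CreateIoCompletionPort(hDirectory, NULL,
		(ULONG_PTR)hDirectory, 1);
	if(hIocp == NULL) { CloseHandle(hDirectory); return -1; }

	/* Arm the first overlapped read */
	if(!ReadDirectoryChangesW(hDirectory, notifyBuffer,
		sizeof(notifyBuffer), FALSE,
		FILE_NOTIFY_CHANGE_FILE_NAME|FILE_NOTIFY_CHANGE_SIZE|
		FILE_NOTIFY_CHANGE_LAST_WRITE,
		NULL, &ovDirectory, NULL)) {
		CloseHandle(hIocp); CloseHandle(hDirectory); return -1;
	}
	/* Scan once immediately after arming (see above) */

	for(;;) {
		if(!GetQueuedCompletionStatus(hIocp, &dwBytes, &dwKey,
			&lpOv, INFINITE)) { break; }
		/* Parse the FILE_NOTIFY_INFORMATION records in notifyBuffer
		   here, re-arm ReadDirectoryChangesW and trigger a rescan */
		printf("Change packet received (%lu bytes)\n",
			(unsigned long)dwBytes);
	}
	return 0;
}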
Now that one gets change notifications one could be tempted to fully trust them - this would be a major mistake, especially on Windows, since there are many conditions under which one might miss change notifications, in addition to the missing support for file alteration monitoring on some filesystems like NFS on most major operating systems. Because of this, change notifications should only be seen as a hint that something (highly likely) has happened - but not be relied upon.
To detect changes one might then walk the watched directory or the whole directory hierarchy and keep a record of all known files. As usual this approach is not suited for every application - in case one has thousands or millions of files, periodic scanning would not be a good idea (for example when monitoring image or media galleries) - but in case of runtime loadable modules it’s totally feasible. If it isn’t because there are too many modules, one should think about a different approach of injecting new components.
What can be used to detect change?
In my current implementations I use a tuple of last modified time, access permissions and file size. Hash and signature are only used after the file has been copied to a different location and its integrity should be verified.
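A sketch of such a comparison on a Unixoid system, with a made-up record structure caching the attributes mentioned above:

#include <sys/stat.h>

/* Cached attributes of a known file */
struct fileRecord {
	time_t mtime;
	mode_t mode;
	off_t  size;
	uid_t  uid;
	gid_t  gid;
};

/* Returns 1 if the file changed compared to the cached record
   (and updates the record), 0 if unchanged, -1 on error */
int checkChanged(const char* lpPath, struct fileRecord* lpRecord) {
	struct stat sb;
	if(stat(lpPath, &sb) != 0) { return -1; }

	if((sb.st_mtime != lpRecord->mtime) || (sb.st_mode != lpRecord->mode)
	   || (sb.st_size != lpRecord->size) || (sb.st_uid != lpRecord->uid)
	   || (sb.st_gid != lpRecord->gid)) {
		lpRecord->mtime = sb.st_mtime;
		lpRecord->mode  = sb.st_mode;
		lpRecord->size  = sb.st_size;
		lpRecord->uid   = sb.st_uid;
		lpRecord->gid   = sb.st_gid;
		return 1;
	}
	return 0;
}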
This information can be gathered using:
- stat on Unixoid operating systems like FreeBSD and Linux
- GetFileAttributes, GetFileSize and GetSecurityInfo together with LookupAccountSid on Windows
Note: To be continued