09 Jul 2020 - tsp
Last update 14 Aug 2020
As everyone who has worked with servlet containers like Apache Tomcat knows, these containers are capable of deploying new servlets by simply copying a new version of the web application archive into a folder. The container then terminates the currently running version of the servlet, redeploys the application from the newly copied archive and instantiates the new servlet. This allows for easy upgrading of web applications with minimal downtime. The upgrade process is also pretty simple because it only requires access via a file transfer method like rsync or scp - as long as one doesn’t want to upgrade the servlet container itself (note that today the servlet container is often deployed together with the web application using a mechanism like Docker - the approach described in this blog post isn’t suited for that scenario).
On the other hand the approach taken by most servlet containers is still somewhat problematic since only a single version of the servlet can be deployed at any given time. This means that first all requests to the old version have to be completed, the old version has to be shut down and the new version has to be deployed; it only becomes reachable for new connections after the deployment has succeeded. This leads to a (hopefully short) downtime during which connections are either dropped or delayed. Most servlet containers also return errors in the window between the old servlet being stopped and the new version being deployed.
To solve that problem other systems like the legendary Erlang support keeping two versions of a module loaded and active at the same time. Control flow stays inside the current version of the module and might jump into the new version whenever the programmer decides to do so (normally during some tail recursive calls) or after all old lightweight threads that used the old version have terminated. This is a feature that allows, for example, runtime upgrading of telecommunication routers without interrupting any network connections. There have also been experiments with using Erlang for robotic control systems - for example there has been a demonstration of replacing the control algorithms of a quadcopter in flight. New calls to module functions from the outside are directed to the new version of the module; calls from the inside go either to the version they originate from or to the new one.
In Erlang there is a hard limit of two module versions - if one tries to load a third one the VM simply kills any processes still running the oldest version. The feature is heavily supported by the BEAM virtual machine and the Erlang language itself. The fact that it’s a (non-pure) functional language is of course also helpful since this programming style minimizes global state - and it’s heavily encouraged to handle, for example, different network connections using different lightweight threads.
The approach described in this blog post tries to provide the foundation for similar behavior for applications written in ANSI C. Note that in this case the application modules have to support runtime upgrading explicitly.
First one has to know how modules can be loaded into the current process. The basic idea is to use dynamic link libraries (DLLs) or shared objects (SOs) depending on the operating system. On all major operating systems they can be opened at runtime (dlopen on POSIX systems, LoadLibrary on Windows). After a module has been opened one can query pointers (function or data) to symbols exported from the module - relative to the module handle returned by the previous functions. This is normally done using dlsym and dlfunc on POSIX or GetProcAddress on Windows. After an application is finished using the DLL/SO it gets closed by dlclose on POSIX or FreeLibrary on Windows.
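As a minimal sketch, the whole cycle on a POSIX system might look like this - the module path ./module.so and the exported function moduleInit are made-up names for illustration:

#include <stdio.h>
#include <dlfcn.h>

/* Signature the example module is assumed to export */
typedef int (*lpfnModuleInit)(void);

int main(void) {
	void* hModule;
	lpfnModuleInit fnInit;

	/* Load the shared object into the current process */
	hModule = dlopen("./module.so", RTLD_NOW|RTLD_LOCAL);
	if(hModule == NULL) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return 1;
	}

	/* Resolve an exported symbol relative to the module handle */
	fnInit = (lpfnModuleInit)dlsym(hModule, "moduleInit");
	if(fnInit == NULL) {
		fprintf(stderr, "dlsym failed: %s\n", dlerror());
		dlclose(hModule);
		return 1;
	}

	printf("moduleInit returned %d\n", fnInit());

	/* Release the module again */
	dlclose(hModule);
	return 0;
}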
There is one drawback in case one simply wants to do file alteration monitoring on a module directory and open changed DLLs/SOs as the new version. During the first deployment this would work - and one is even capable of unlinking the DLLs/SOs so their inodes get released after the modules are closed using dlclose or FreeLibrary. Unfortunately, copying a new version on top of the existing one replaces the file contents and doesn’t create a new inode, so the code of the module gets replaced in place (especially if it’s only mmap’ed); the old version gets overwritten and applications might crash - or the write access is simply denied.
To solve that problem a rather simple approach can be used: Whenever the file alteration monitor or a periodic scanner detects a new module version, this module will first be copied to a temporary location using a unique filename with suitable permissions. Then the module gets opened using dlopen or LoadLibrary. In case signature verification is required it should be done on the new temporary copy of the file that’s inaccessible to any entity except the application itself. This avoids an often encountered time-of-check to time-of-use bug that allows injecting code: an attacker supplies a correctly signed binary and then, after the loader has calculated the hash of the module, an external application overwrites the plugin. The signature check succeeds against the original, correctly signed binary - but the loader then opens the injected code.
Then the file gets immediately unlink-ed (or deleted using DeleteFile) so there is no chance of files staying inside the cache without being needed any more or being overwritten. After that, access to the symbols is done as usual. After the library has been closed the inode is immediately released.
So the basic flow is:
1. Open the module (dlopen or LoadLibrary)
2. Immediately delete the temporary file (unlink, DeleteFile)
3. Resolve the required symbols (dlsym and GetProcAddress)
4. Close the module when it’s no longer needed (dlclose or FreeLibrary)
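A minimal sketch of this flow on POSIX, assuming the copy target /tmp and the helper name loadModuleCopy (both made up for illustration; the signature check is only hinted at):

#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <dlfcn.h>

static void* loadModuleCopy(const char* lpModulePath) {
	char szTemp[] = "/tmp/moduleXXXXXX";
	char buf[4096];
	ssize_t n;
	int hSource, hTemp;
	void* hModule;

	/* Copy the module to a private temporary file (mkstemp uses mode 0600) */
	if((hSource = open(lpModulePath, O_RDONLY)) < 0) { return NULL; }
	if((hTemp = mkstemp(szTemp)) < 0) { close(hSource); return NULL; }
	while((n = read(hSource, buf, sizeof(buf))) > 0) {
		if(write(hTemp, buf, (size_t)n) != n) { n = -1; break; }
	}
	close(hSource);
	close(hTemp);
	if(n < 0) { unlink(szTemp); return NULL; }

	/* An optional signature check would run here - on the private copy */

	hModule = dlopen(szTemp, RTLD_NOW|RTLD_LOCAL);

	/* Drop the directory entry immediately; the inode stays alive
	   until dlclose releases the last mapping */
	unlink(szTemp);
	return hModule;
}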
The simplest method of upgrading for short lived network services like webservers or similar systems: just keep a reference to the newest version of the loaded library inside the core application and hand each new connection to the newest module. Old modules still handle old connections; there just has to be synchronization when accessing shared data stores or global state. In case all modules are reference counted they’ll be closed and released automatically after the old connections have been dropped (a sketch of such reference counting follows after the pros and cons below).
Pro:
- Very simple to implement - no state transfer between module versions is required.
Cons:
- Old connections keep running on the old code until they terminate, so long lived connections delay the release of old modules.
- Shared data stores and global state still require synchronization between versions.
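A minimal sketch of such reference counting - all names (moduleHandle, moduleAcquireNewest, etc.) are assumptions, and the counters would need atomics or a lock in real code:

#include <stddef.h>
#include <dlfcn.h>

struct moduleHandle {
	void*         hLibrary;  /* handle returned by dlopen */
	unsigned long refCount;  /* currentModule itself holds one reference */
};

static struct moduleHandle* currentModule; /* newest loaded version */

/* Called for every new connection: pin the newest version */
struct moduleHandle* moduleAcquireNewest(void) {
	currentModule->refCount++;
	return currentModule;
}

/* Called when a connection finishes */
void moduleRelease(struct moduleHandle* lpModule) {
	if((--lpModule->refCount) == 0) {
		/* Last user gone - only happens for replaced versions since
		   currentModule always holds one reference itself */
		dlclose(lpModule->hLibrary);
	}
}

/* Called after a new version has been loaded */
void moduleInstall(struct moduleHandle* lpNewModule) {
	struct moduleHandle* lpOldModule = currentModule;
	lpNewModule->refCount = 1; /* reference held by currentModule */
	currentModule = lpNewModule;
	if(lpOldModule != NULL) { moduleRelease(lpOldModule); }
}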
This is a more advanced idea. It uses an event callback based approach. As a module gets loaded for the first time it registers event callback handlers inside the main container or some event handling framework. For example it registers a function to be called in case new incoming connections have been accepted - or it registers a callback that will be called whenever data is received from a client.
The new module will now register filtering event callbacks at the same points where the old module has registered its own. During the registration step the new module will simply call the old registered functions again. This puts the new module transparently in place. Then the new module starts to transfer state via a module implementation specific method into its own instance and applies the messages passing through the filter to its internal state. This allows runtime state transfer from the old module into the new module - it’s easiest when using a pattern like event sourcing and, for example, caching incoming messages during partial state transfers. As soon as the module is capable of taking over connections or processes from the old module, the filter functions stop calling back into the old module. This allows transferring running connections to a new version (see the sketch after the pros and cons below).
Pros:
- Running connections and internal state can be transferred to the new version without interruption.
Cons:
- Considerably more complex - every module has to implement its own state transfer method.
- State transfer is only straightforward with patterns like event sourcing.
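The following sketch illustrates the filtering idea for a single "data received" event; all structure and function names are assumptions for illustration:

#include <stddef.h>

typedef void (*lpfnOnData)(void* lpContext, const char* lpData, size_t dwLen);

/* One slot per event type inside the core application */
struct eventSlot {
	lpfnOnData callback;
	void*      context;
};

static struct eventSlot slotOnData;

/* State of the new module: it remembers the handler it replaced */
struct newModuleState {
	struct eventSlot oldSlot;
	int              takeoverComplete;
};

static void newModuleOnData(void* lpContext, const char* lpData, size_t dwLen) {
	struct newModuleState* lpState = (struct newModuleState*)lpContext;

	/* Observe the event to update the new module's own state
	   (e.g. apply it event-sourcing style during state transfer) */

	if(!lpState->takeoverComplete) {
		/* Still transferring state: forward to the old module */
		lpState->oldSlot.callback(lpState->oldSlot.context, lpData, dwLen);
	} else {
		/* Takeover done: handle the event in the new version */
	}
}

/* Called when the new module version is loaded: install the filter */
void newModuleRegister(struct newModuleState* lpState) {
	lpState->oldSlot = slotOnData;          /* keep the old handler */
	lpState->takeoverComplete = 0;
	slotOnData.callback = &newModuleOnData; /* put the filter in place */
	slotOnData.context  = lpState;
}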
So after one knows how to load modules, one has to know how directory change notifications work. This is an operating system dependent part - there is currently no portable way of performing such detection.
Note that there is another caveat - file system change notifications are not reliable on many systems (like Windows for example) and do not work on all types of filesystems (for example network filesystems like NFS). To work around this, one should only use change notifications as an immediate hint of change and then perform a scan based on well known file metadata like last modified time, creation time and/or file size - my implementation uses all of them and detects a change whenever any of these attributes changed. Depending on the OS, attributes like owner and group are used as well.
Since it’s possible on most systems that event notifications are missed, all implementations that I’ve written also run a periodic scan over the watched directories and check for attribute modifications independent of any notifications. This of course induces some overhead - especially in case directories get large - but it’s inevitable. One should only use this kind of watching for rather small directories, not ones containing tens of thousands or even millions of files - for those one might use hashed directory storage and large timeouts.
On FreeBSD the most efficient way to monitor a directory for changes is to simply open a directory handle using open:
#include <fcntl.h>

/* lpDirectory contains the path of the watched directory */
int hDirectory;
hDirectory = open(lpDirectory, O_RDONLY|O_SHLOCK|O_DIRECTORY|O_CLOEXEC);
if(hDirectory < 0) {
	// Error handling
}
In this case the following flags are used:
- O_RDONLY opens the directory read only.
- O_SHLOCK applies a shared lock so deletion is not possible while the directory is open.
- O_DIRECTORY makes the call fail if the path does not refer to a directory.
- O_CLOEXEC marks the descriptor close-on-exec so it won’t be inherited by child processes.
Then one can simply subscribe to the EVFILT_VNODE watching filter.
This filter triggers on different supported conditions on all supported filesystems:
- NOTE_ATTRIB is triggered in case the attributes of the file descriptor such as owner, group, permissions or size have changed.
- NOTE_CLOSE triggers when the file descriptor has been closed and the descriptor had been opened with read only permissions.
- NOTE_CLOSE_WRITE is the same as NOTE_CLOSE but for a file descriptor that had write permissions.
- NOTE_DELETE signals that an unlink call has been executed.
- NOTE_EXTEND reports for a directory that an entry was added or removed as the result of a rename operation.
- NOTE_LINK notifies about a changed link count - for example when a subdirectory has been created inside a directory.
- NOTE_OPEN notifies about an open against the referenced node.
- NOTE_READ is triggered whenever a read against the node has happened.
- NOTE_RENAME signals that the object has been renamed.
- NOTE_REVOKE reports that access to the node has been revoked via revoke, for example in case of unmount.
- NOTE_WRITE signals that the object has been written to.
Note that immediately after enabling the filter a directory scan operation should be started. This should also happen after each re-arming of the notification. This is required to not miss any modifications but might trigger scanning twice - which is acceptable most of the time.
Note: Keep in mind that this only watches for directory modifications - the EVFILT_VNODE filter does not trigger in case a member file is simply written to, but it still triggers if the file is atomically replaced by another one.
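A minimal sketch of arming this filter with kqueue, assuming the directory descriptor from above, a hypothetical scanDirectory routine and a flag selection that would have to be adjusted to taste:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>

int watchDirectory(int hDirectory) {
	struct kevent change, event;

	int kq = kqueue();
	if(kq < 0) { return -1; }

	/* Register the directory descriptor for vnode events */
	EV_SET(&change, hDirectory, EVFILT_VNODE,
	       EV_ADD|EV_ENABLE|EV_CLEAR,
	       NOTE_WRITE|NOTE_EXTEND|NOTE_ATTRIB|NOTE_LINK,
	       0, NULL);
	if(kevent(kq, &change, 1, NULL, 0, NULL) < 0) { return -1; }

	/* Scan once right after arming the filter (see the note above) */
	/* scanDirectory(); */

	for(;;) {
		int n = kevent(kq, NULL, 0, &event, 1, NULL);
		if(n < 0) { return -1; }
		if(n > 0) {
			printf("Change detected (fflags 0x%x), rescanning\n",
			       (unsigned)event.fflags);
			/* scanDirectory(); */
		}
	}
}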
Windows works - as usual - a little differently. There are multiple ways to subscribe to directory change notifications. The most flexible and powerful one is to use ReadDirectoryChangesW in conjunction with the excellent I/O completion ports (IOCP). I/O completion ports are the method of choice for performing asynchronous overlapped operations on Windows. One assigns a file handle to an I/O completion port, executes an overlapped I/O operation (i.e. an operation with all required buffers already attached so data can be read or written directly by the specific driver) and gets a notification enqueued in a scalable task queue. The task queue itself can be used by an arbitrary number of threads but is capable of enforcing a concurrency limit - i.e. it can control how many threads process events in parallel.
The flow for using IOCP is somewhat different from kqueue on FreeBSD: First one has to open the directory. This is done using CreateFile as usual - one should at least specify the GENERIC_READ access permission as well as OPEN_EXISTING to prevent creating a new file. The flags FILE_FLAG_OVERLAPPED and FILE_FLAG_BACKUP_SEMANTICS have to be specified. Overlapped I/O is required for use with IOCP, the backup semantics are required to use ReadDirectoryChangesW.
After that the directory handle gets assigned to the IOCP that’s going to be used for directory watching using CreateIoCompletionPort as usual. Since I normally use a single set of threads for all directory watching operations I designate a single I/O completion port to directory watching - all watching threads are attached to the same IOCP. Usually I’m also using just one watching thread since change notifications from directories are normally not the highest priority in the applications I’m developing.
Then one has to start a read operation using ReadDirectoryChangesW. This operation already requires a pre-allocated target buffer to write into. This buffer is usually allocated on a per directory basis and stored together with the directory handle.
One can specify which types of events one wants to receive:
- FILE_NOTIFY_CHANGE_FILE_NAME watches renaming, creation or deletion of files.
- FILE_NOTIFY_CHANGE_DIR_NAME is raised in case child directories are modified.
- FILE_NOTIFY_CHANGE_ATTRIBUTES is raised on any attribute change in the directory or its subdirectories.
- FILE_NOTIFY_CHANGE_SIZE is triggered on changes of the file size. This notification might be heavily delayed due to caching.
- FILE_NOTIFY_CHANGE_LAST_WRITE signals the modification of the last written time.
- FILE_NOTIFY_CHANGE_LAST_ACCESS is raised on modification of the last access time. Note that the last accessed time is not written on all filesystems.
- FILE_NOTIFY_CHANGE_CREATION signals that the creation time has been changed on any of the files.
- FILE_NOTIFY_CHANGE_SECURITY signals that security attributes like the DACLs have changed.
After starting the routine a scan should also be triggered immediately to be able to not miss any modifications. The same should be done on every re-arming of the function. Always first enqueue the ReadDirectoryChangesW operation and then enqueue the scan operation - this might trigger two consecutive scans but doesn’t miss any change events.
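Putting these steps together, a minimal sketch might look like this - the buffer size, the notify flag selection and the omitted record parsing and re-arming are assumptions:

#include <windows.h>
#include <stdio.h>

static BYTE notifyBuffer[64 * 1024];
static OVERLAPPED ovDirectory; /* one per outstanding request, zero-initialized */

int watchDirectory(LPCSTR lpDirectory) {
	DWORD dwBytes; ULONG_PTR dwKey; LPOVERLAPPED lpOv;

	/* Open the directory with overlapped I/O and backup semantics */
	HANDLE hDirectory = CreateFileA(
		lpDirectory, GENERIC_READ,
		FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
		NULL, OPEN_EXISTING,
		FILE_FLAG_BACKUP_SEMANTICS|FILE_FLAG_OVERLAPPED, NULL);
	if(hDirectory == INVALID_HANDLE_VALUE) { return -1; }

	/* Attach the handle to a (new) completion port; the completion
	   key identifies this directory when packets are dequeued */
	HANDLE hIocp = CreateIoCompletionPort(hDirectory, NULL,
		(ULONG_PTR)hDirectory, 1);
	if(hIocp == NULL) { CloseHandle(hDirectory); return -1; }

	/* Arm the first overlapped read */
	if(!ReadDirectoryChangesW(hDirectory, notifyBuffer,
		sizeof(notifyBuffer), FALSE,
		FILE_NOTIFY_CHANGE_FILE_NAME|FILE_NOTIFY_CHANGE_SIZE|
		FILE_NOTIFY_CHANGE_LAST_WRITE,
		NULL, &ovDirectory, NULL)) {
		CloseHandle(hIocp); CloseHandle(hDirectory); return -1;
	}
	/* Scan once immediately after arming (see above) */

	for(;;) {
		if(!GetQueuedCompletionStatus(hIocp, &dwBytes, &dwKey,
			&lpOv, INFINITE)) { break; }
		/* Parse the FILE_NOTIFY_INFORMATION records in notifyBuffer
		   here, re-arm ReadDirectoryChangesW and trigger a rescan */
		printf("Change packet received (%lu bytes)\n",
			(unsigned long)dwBytes);
	}
	return 0;
}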
Now that one gets change notifications one could be tempted to fully trust them - this would be a major mistake, especially on Windows, since there are many conditions under which one might miss change notifications, in addition to the missing support for file alteration monitoring on some filesystems like NFS on most major operating systems. Because of this, change notifications should only be seen as a hint that something (highly likely) has happened - but not be relied upon.
To detect changes one might then walk the watched directory or the whole directory hierarchy and keep a record of all known files. As usual this approach is not suited for every application - in case one has thousands or millions of files, periodic scanning would not be a good idea (for example when monitoring image or media galleries) - but in case of runtime loadable modules it’s totally feasible. If it isn’t because there are too many modules, one should think about a different approach of injecting new components.
What can be used to detect change?
In my current implementations I use a tuple of last modified time, access permissions and file size. Hash and signature are only used after the file has been copied to a different location and its integrity should be verified.
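A sketch of such a comparison on a Unixoid system, with a made-up record structure caching the attributes mentioned above:

#include <sys/stat.h>

/* Cached attributes of a known file */
struct fileRecord {
	time_t mtime;
	mode_t mode;
	off_t  size;
	uid_t  uid;
	gid_t  gid;
};

/* Returns 1 if the file changed compared to the cached record
   (and updates the record), 0 if unchanged, -1 on error */
int checkChanged(const char* lpPath, struct fileRecord* lpRecord) {
	struct stat sb;
	if(stat(lpPath, &sb) != 0) { return -1; }

	if((sb.st_mtime != lpRecord->mtime) || (sb.st_mode != lpRecord->mode)
	   || (sb.st_size != lpRecord->size) || (sb.st_uid != lpRecord->uid)
	   || (sb.st_gid != lpRecord->gid)) {
		lpRecord->mtime = sb.st_mtime;
		lpRecord->mode  = sb.st_mode;
		lpRecord->size  = sb.st_size;
		lpRecord->uid   = sb.st_uid;
		lpRecord->gid   = sb.st_gid;
		return 1;
	}
	return 0;
}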
This information can be gathered using:
- stat on Unixoid operating systems like FreeBSD and Linux
- GetFileAttributes, GetFileSize and GetSecurityInfo together with LookupAccountSid on Windows
Note: To be continued