24 Aug 2020 - tsp
Last update 24 Aug 2020
18 mins
This article tries to provide an overview of some of the most common generic data serialization techniques. Data serialization is an often neglected but rather important aspect of application development that happens at the border of applications during import and export of data. Nearly everyone who has learned programming in native languages the classic way starts out by writing binary data structures directly to disk - an approach that brings many problems: the binary layout of a data structure is in many cases compiler specific (which is also why one cannot simply map a structure onto a device dependent memory location) as well as architecture specific (an integer on one platform might have a different encoding than on another platform). Languages such as C/C++ only provide their own abstract machine model; they do not guarantee a given layout in memory or on hardware. If one writes binary data structures directly onto the disk, one will in general not be able to read the same data on a different architecture - even when using exactly the same source code.
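A minimal Python sketch of the underlying problem: with native layout the byte order, sizes and padding depend on the platform, while an explicitly specified layout is portable. The concrete sizes in the comment are typical values and depend on the platform:

```python
import struct

# Native layout ("@"): the platform decides byte order, integer sizes and
# padding - the resulting byte sequence differs between architectures.
native = struct.pack("@hi", 1, 2)

# Explicit layout ("<" = little endian, no padding): the byte sequence is
# fully specified and therefore portable between architectures.
portable = struct.pack("<hi", 1, 2)

print(len(native), len(portable))  # typically 8 vs. 6 due to alignment padding
```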
This is where serialization methods become relevant. They specify an exact encoding on byte and bit level and may even come with their own description languages. Many of them provide tooling that automatically generates serialization and deserialization code for different programming languages, so one is capable of accessing the same data from applications developed on a variety of architectures and in different programming languages.
There are basically two approaches to serialization formats: text based formats and binary formats.
Both of these approaches have their place. Text based formats are often human readable and can be passed over nearly all carrier techniques used by humans such as e-mail, chat systems, etc. Since they contain all information in human readable form it's easy to inspect data read and written by applications during debugging, and it's also possible to write such serialized formats by hand. On the other hand they have rather relaxed syntax rules (for example they usually tolerate flexible whitespace), which makes them harder to parse in the first place. Parsing text based formats is usually a huge burden, and parsers are among the parts of applications most prone to errors and security relevant bugs - the more complexity ends up inside a parser, the more likely a security bug becomes. Writing a parser for a serialization format such as XML is so cumbersome that a single developer might require months to develop a correct solution.
If one really needs to develop one's own implementation, it's a good idea to use static analysis and proof assistants (for example Frama-C / ACSL when using ANSI C, JML for Java, etc.) to verify the validity of one's parser. This is of course not easy, since such parsers are usually specified to process input of arbitrary length and therefore cannot simply be shown to terminate.
Note that this blog post only looks at serialization formats that are at least somewhat generic and not used for a single specific purpose.
Name | Text/Binary | Complexity | Tools required | Own implementation? | Libraries | Spec |
---|---|---|---|---|---|---|
ASCII text files | Text | Lowest | No | Simple (Seconds) | No | No |
Rolling your own binary format | Binary | Lowest | No | Simple (Minutes) | No | No |
CSV/TSV text files | Text | Low | No | Simple (Hours) | Yes | RFC4180 |
JSON | Text | Low/Medium | No | Simple (Hours to days) | Yes | RFC7159 |
XML | Text | High | No | Time consuming (months, years) | Yes | W3C recommendation |
ASN.1 | Binary | High | Yes | Time consuming (months, years) | Most commercial | ITU-T X.690 |
XDR | Binary | Medium | Yes | Doable (hours, days) | Not many | RFC4506 |
Protocol buffers | Binary | Medium | Yes | Doable (days, weeks) | Yes | Google |
SQLite | Binary | Low | No | No (Library) | Single | No |
HDF | Binary | Medium | No | Doable (days, weeks) | Yes | HDF working-group |
CERN ROOT | Binary | Medium | No | Doable | Yes | CERN |
DICOM | Binary | High | No | No (really time consuming) | Yes | NEMA |
This might be one of the approaches used during the early stages of learning programming, and it's also suited if one just wants to read or write a sequence of independent numbers or strings. In case each line has the same meaning and the same datatype - and contains only a single value - this might be a good and justifiable approach. Just be sure to specify the meaning of the numbers as well as the units used. In case one uses floating point values, for example, one has to specify the range, the precision and how the digits are to be interpreted.
ASCII text files are of course the way to go when encoding long-lived, archivable text or text that will be used during data analysis.
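A minimal sketch of this approach (file name and values are made up for illustration) - one value per line, with the meaning and unit documented only by convention:

```python
# Write one measurement per line; the meaning (temperature) and the unit
# (degrees Celsius) have to be documented separately - the file itself
# carries no such metadata.
temperatures = [21.5, 22.0, 21.8]
with open("temperatures_celsius.txt", "w") as f:
    for t in temperatures:
        f.write(f"{t:.1f}\n")

# Reading the file back is just as simple - one float per line.
with open("temperatures_celsius.txt") as f:
    restored = [float(line) for line in f]
```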
It's of course possible to roll one's own binary serialization format specification. One has to take care of details like text encoding (Unicode is not that simple, ASCII rather simple), byte order of integers, the interpretation of floating point numbers, etc.
It's a viable solution for rather simple data objects that won't change over time. Usually it's done for data structures that should not be passed between different applications but should be stored persistently by one's own application.
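A minimal sketch of such a hand-rolled format (the magic value, version byte and field layout are made up for illustration), with the byte order fixed explicitly so the data stays portable:

```python
import struct

# Hypothetical minimal format: 4 byte magic, 1 byte version, then a length
# prefixed list of little endian 64 bit floats.
MAGIC = b"MYF1"

def serialize(values):
    payload = struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}d", *values)
    return MAGIC + bytes([1]) + payload

def deserialize(blob):
    if blob[:4] != MAGIC or blob[4] != 1:
        raise ValueError("unknown format or version")
    (count,) = struct.unpack_from("<I", blob, 5)
    return list(struct.unpack_from(f"<{count}d", blob, 9))

data = serialize([1.0, 2.5])
assert deserialize(data) == [1.0, 2.5]
```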
This is another really simple method of serializing tables. Each line in a CSV or TSV file - and note that there is already a formal specification for CSV that one should follow to allow portable access - contains a single row of data with the same number of attributes or columns. Columns are separated by a comma or, in similar formats, by a semicolon or tab character. When following the specification, commas are always used and one has to implement a parser that's capable of processing quoted strings to allow for multi-line data and data that contains commas itself. It's a rather simple parser to write as a finite state machine, but even in this case it's not possible to parse CSV using regular expressions due to the quoting. One really has to write a parser for that or use a library.
CSV files might also have a heading line attached. This line is allowed to contain other datatypes - usually descriptive strings - than all other lines and allows one to specify the meaning of each column. The number of columns of course has to match the remaining data lines. It's implementation specific how one determines whether a heading line is present (for example one has to ask the user; there is no reliable way of detecting a heading line). And there is of course no comment syntax inside CSV files.
A last pitfall when using CSV: the specification even mandates line endings using a CR, LF sequence, so one has to be careful when transferring such files between different systems, as transfer tools sometimes tend to convert between Unix and Windows line endings - a CSV file is only valid using the latter.
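A short sketch using Python's csv module (the file name is made up), which by default produces the RFC 4180 style CR, LF line endings and handles quoting of embedded commas, quotes and newlines:

```python
import csv

rows = [
    ["name", "comment"],                   # heading line
    ["Alice", 'said "hi", then left'],     # value containing quotes and a comma
    ["Bob", "first line\nsecond line"],    # multi-line value
]

# The csv module defaults to the RFC 4180 style \r\n line terminator and
# quotes values that contain commas, quotes or newlines.
with open("example.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("example.csv", newline="") as f:
    restored = list(csv.reader(f))

assert restored == rows
```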
JSON is the JavaScript Object Notation. It's called that way because the specification for the JSON serialization format has been extracted from the JavaScript language specification. This has been done since it's a rather simple basic format that's also easily parseable (note: again not by regular expressions). It allows for easy encapsulation of associative arrays and ordered lists - and supports some basic data types such as strings, numbers, booleans and null. It's so simple that there exists a full set of railroad diagrams for its grammar on the webpage dedicated to JSON.
Contrary to the previous formats, JSON allows one to specify objects containing an arbitrary and varying number of elements. It's easily human readable and follows a strict specification. On the other hand it's schema-less, which means there is no built-in way of verifying the validity of a data file before processing it. Each application has to decide by itself whether a file is valid or not.
Implementing a JSON parser is a little bit more work - one might require 2 to 5 days at maximum when designing one from the ground up. But there exist many implementations for a variety of programming languages that allow easy serialization and deserialization of JSON objects.
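For example, using Python's standard json module (the data object is made up for illustration):

```python
import json

measurement = {
    "sensor": "thermo-01",
    "values": [21.5, 22.0, None],   # null is representable; NaN is not portable
    "calibrated": True,
}

text = json.dumps(measurement, indent=2)   # serialize to a human readable string
restored = json.loads(text)                # parse it back into dicts and lists
assert restored == measurement
```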
XML is one of the most complex text based serialization and data description languages currently available. It's really hard to parse - don't even think about using regular expressions or rolling your own parser in a few days. One is usually required to use some kind of EBNF compiler to translate all rules into a working parser, and even then there is much work to do to support various features of XML such as namespaces, schema validation, etc. It's easy to read by humans and allows one to fully specify the structure as tags and associated attributes as well as their datatypes, required and optional objects and their order. One usually writes an XML schema definition (XSD) to describe the data format. This yields a strict definition of what's allowed and what's required, against which a data file can be verified.
Parsing XML is hard. Really. On the other hand there is a huge number of libraries available that mainly fall into one of three categories: event based (SAX style) parsers, pull parsers and tree based (DOM style) parsers.
One should carefully think about using XML due to its large parsing requirements, but it's rather convenient when one wants to manually inspect data files and possibly modify them while still requiring a strict format specification.
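A small tree based example using Python's xml.etree.ElementTree (element and attribute names are made up; note that ElementTree itself performs no XSD validation - that requires an external library such as lxml):

```python
import xml.etree.ElementTree as ET

# Build a small document: structure is carried by tags and attributes.
root = ET.Element("measurements", attrib={"unit": "celsius"})
ET.SubElement(root, "sample", attrib={"time": "2020-08-24T12:00:00"}).text = "21.5"

text = ET.tostring(root, encoding="unicode")
# -> <measurements unit="celsius"><sample time="...">21.5</sample></measurements>

parsed = ET.fromstring(text)
for sample in parsed.findall("sample"):
    print(sample.get("time"), float(sample.text))
```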
Abstract Syntax Notation One is one of the oldest serialization families used on the Internet. It's still used for some major applications such as SSL certificates and telecommunication systems. It's usually used in combination with external tools and compilers for a huge variety of programming languages. It consists of two major parts: the abstract description language itself and a set of encoding rules - the basic encoding rules (BER), the canonical encoding rules (CER) and the distinguished encoding rules (DER). DER is one of the formats that provides a unique and implementation independent mapping of objects to binary form and is therefore often used for cryptographic applications. The other formats (BER and CER) are implementation specific in the sense that there is a number of ways one can serialize a given message, but all implementations are capable of reading all of the possible forms anyway. Also in current use are the XML encoding rules (XER) that allow mapping ASN.1 structures onto XML - of course with some loss of information. On mobile networks the packed encoding rules (PER) are often used; they ditch the extensibility features of ASN.1 and provide a really compact representation of messages.
Since ASN.1 usually requires the usage of an external compiler as well as a runtime library to parse and serialize messages inside the application, it's one of the heavy-weight serialization methods and not often seen in hobby projects - but it's often used in major data processing applications from security to medicine and network equipment.
Doing one's own implementation of the full ASN.1 specification is a major multi-month to multi-year task. There are readily available implementations, but most of them are sold commercially.
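To give an impression of the DER wire format, here is a tiny Python sketch that encodes a non-negative INTEGER by hand as a tag-length-value triple. Real applications should use an existing library such as pyasn1 rather than a hand-rolled encoder:

```python
def der_encode_uint(value: int) -> bytes:
    # Content octets: big endian, minimal length, with a leading zero byte
    # if the most significant bit is set (DER integers are two's complement).
    body = value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")
    if body[0] & 0x80:
        body = b"\x00" + body
    if len(body) > 127:
        raise ValueError("long form length encoding not implemented in this sketch")
    return bytes([0x02, len(body)]) + body   # 0x02 = universal tag INTEGER

assert der_encode_uint(65537).hex() == "0203010001"
assert der_encode_uint(128).hex() == "02020080"
```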
The External Data Representation is a less known serialization technique that's still in major use for things like the Network File System (NFS) and other RPC based applications. It's specified in RFC4506 (formerly RFC1832) and consists of a simple description language as well as a binary encoding of types. It's easy to implement in comparison to ASN.1 and XML, but still requires preprocessing of the provided description language to auto generate code that serializes and deserializes data structures.
The description language consists of 13 EBNF rules, so it's rather simple to implement but a little bit more complex to parse than JSON - and the description language is only used by the code generator, which assembles a series of serialization and deserialization function calls in auto-generated code. Usually these tools also emit a language specific data wrapping object (struct, class), so an XDR compiler specific to the programming language used is required.
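Python ships an XDR packer in its standard library (xdrlib; note that it is deprecated in newer Python releases), which is enough to sketch the encoding itself without the description language:

```python
import xdrlib  # standard library (deprecated in newer Python releases)

packer = xdrlib.Packer()
packer.pack_int(42)            # 4 byte big endian signed integer
packer.pack_string(b"hello")   # length prefixed, padded to a 4 byte boundary
wire = packer.get_buffer()

unpacker = xdrlib.Unpacker(wire)
assert unpacker.unpack_int() == 42
assert unpacker.unpack_string() == b"hello"
```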
Protocol buffers are Google's answer to encodings such as ASN.1. They decided that one really needs a serialization format and description language for communication inside distributed systems - but without the complexity of ASN.1 or the huge parsing requirements of XML. They consist, again, of a description language as well as encoding rules. The description language is rather complex and is used to describe the encoded data objects. A protocol buffer compiler is then used to auto generate code that serializes and deserializes objects into a rather simple binary format. Currently libraries exist for a variety of programming languages and they're easy to use.
Protocol buffers are used for data exchange in many different modern systems and are sometimes even used as a file based data exchange format. For example the OpenStreetMap project supplies a dump of its whole database as XML (1270 GByte) and as a protocol buffer file (52.2 GByte). As one can see, the impact on required storage size when using protocol buffers instead of XML is really huge - but of course one loses the ability to manually edit the data files.
The complexity of protocol buffers usually rules out writing one's own implementation as a weekend project, though it's possible as a longer term effort. Existing implementations are robust and openly available.
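To illustrate how simple the wire format itself is, the following sketch hand-encodes a single integer field the way generated code would (field number and value are arbitrary; real code would use protoc-generated classes instead):

```python
# Protocol buffer wire format sketch: each field is a key/value pair where
# the key is a varint holding (field_number << 3 | wire_type).

def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_uint_field(field_number: int, value: int) -> bytes:
    key = (field_number << 3) | 0     # wire type 0 = varint
    return encode_varint(key) + encode_varint(value)

# Field number 1 set to 300 encodes to the three bytes 08 ac 02.
assert encode_uint_field(1, 300).hex() == "08ac02"
```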
Although not an exact serialization format or binary format specification, SQLite should be mentioned when talking about exchange formats and structured storage. SQLite uses an internally consistent binary storage format - there is no formal specification available, though. It allows full SQL based access to structured storage and is fully embeddable inside applications. It's most interesting when one has to store relational information and wants to perform database-like queries. It's for example used in nearly all Android applications, is supported inside browsers as part of WebSQL and is used by many applications as their local storage format. Though not a formally specified data exchange or serialization format, it's convenient to use - one can even download SQLite as a single ANSI C source file that's embeddable into one's own application.
SQLite is available on its own webpage under a public domain license.
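A short example using Python's built-in sqlite3 module (table layout and file name are made up for illustration):

```python
import sqlite3

# The database lives in a single file that can be copied between systems.
con = sqlite3.connect("measurements.db")
con.execute("CREATE TABLE IF NOT EXISTS samples (time TEXT, value REAL)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("2020-08-24T12:00:00", 21.5), ("2020-08-24T12:01:00", 22.0)],
)
con.commit()

for time, value in con.execute("SELECT time, value FROM samples ORDER BY time"):
    print(time, value)
con.close()
```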
HDF is an openly available hierarchical data storage file format. It's often seen in science communities and basically supports hierarchical data structures as well as data sets. Data is accessed in a way similar to POSIX filesystems using the provided libraries. It supports indices over data, implemented using B-trees, and thus allows fast data access even for large data sets. Note that there are two different HDF versions that are not compatible (HDF4, HDF5). A specification is available.
HDF is essentially a write-only (append-only) file format, so it's not possible to reclaim space from content once it has been written to a file unless one regenerates the whole data file.
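A small sketch using the third party h5py binding to the HDF5 library (group, dataset and attribute names are made up for illustration):

```python
import h5py              # third party binding to the HDF5 C library
import numpy as np

# Groups behave like directories, datasets hold the typed, n-dimensional data.
with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run0001")
    run.create_dataset("temperatures", data=np.array([21.5, 22.0, 21.8]))
    run.attrs["unit"] = "celsius"    # small metadata attached as attributes

with h5py.File("experiment.h5", "r") as f:
    data = f["run0001/temperatures"][:]
    print(f["run0001"].attrs["unit"], data)
```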
Since I mentioned HDF I also have to mention CERN's ROOT file format. This is the format used by the ROOT data analysis framework, which is popular among high energy physicists. It's also a write-only structured file format that allows indexed access to huge amounts of data. It allows one to embed various object based data types and to define custom types that are stored in the tree-like data structure. The on-disk data structure is rather simple but usually not handled directly - the ROOT framework provides libraries that allow direct serialization of data structures into a ROOT tree. There are viewers available that allow one to explore the data and display metadata as well as data tables without writing an application - all data is also accessible using C/C++ scripting from the cint tool.
As already mentioned, this file format is mostly found in the high energy physics community. Most of the libraries that use ROOT files make heavy use of modern C++ features - as does the whole ROOT data analysis toolkit. It takes a while to get used to the conventions, but then these tools are rather simple to use.
The file format description is rather hard to read since it focuses on the C++ implementation; one basically has to understand the implementation's internal storage handling before being able to understand the specification itself.
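For read-only access from Python the third party uproot package can be used without installing the ROOT toolkit itself (the file and tree names below are made up and have to be adjusted to the actual data set):

```python
import uproot    # third party, pure Python reader for ROOT files

with uproot.open("analysis.root") as f:
    print(f.keys())                      # list the objects stored in the file
    tree = f["events"]                   # a TTree behaves like a table
    arrays = tree.arrays(["pt", "eta"], library="np")
    print(arrays["pt"][:10])
```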
Digital Imaging and Communications in Medicine (DICOM) is a set of specifications that aims at standardizing image, data and communication exchange for medical applications. It's for example heavily used with X-ray, tomography and PET scanning equipment. It's the basis for digital image archives in medical areas such as the picture archiving and communication system (PACS). DICOM files are usually built around a hierarchical tree and provide the ability to store different subtypes of data - for example images are stored in a subset of TIFF or JPEG. There exists a schema language, the information object definition, that defines the structure and the attributes contained inside a DICOM file. Despite its hierarchical index and its support for storing images, DICOM is mainly a store for objects containing arbitrary attributes.
The DICOM standard is highly complex (split across more than 20 parts) and specifies data storage as well as network based interchange of information, capability querying by different nodes, etc.
Note that DICOM is highly complex and often refers to external specifications. It's really hard to implement, but if one wants to interface with modern medical equipment it's unavoidable.
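For simply reading existing files the third party pydicom library is usually sufficient (the file name is made up; decoding pixel data requires NumPy and, for compressed images, additional handlers):

```python
import pydicom   # third party pure Python DICOM reader/writer

# A DICOM file as exported by a scanner or a PACS node.
ds = pydicom.dcmread("slice0001.dcm")

# Attributes are addressed either by keyword or by their (group, element) tag.
print(ds.PatientID, ds.Modality)
print(ds[0x0008, 0x0060].value)      # same element as ds.Modality

pixels = ds.pixel_array              # decoded image data as a NumPy array
print(pixels.shape)
```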