This article tries to provide an overview of some of the most common generic data
serialization techniques. Data serialization is an often neglected but rather
important aspect of application development that happens at the border of
applications during import and export of data. As everyone who has learned
programming in native languages the classic way knows, nearly everyone starts with
writing binary data structures directly to disk - an approach that brings many
problems. The binary layout of a data structure is in many cases compiler specific
(which is also why one cannot simply map a structure onto a device dependent
memory location) as well as architecture specific (an integer on one platform
might have a different encoding than on another platform).
Languages such as C/C++ only define their own abstract machine model; they do
not guarantee a particular memory layout or access method to memory or hardware.
If one writes binary data structures directly onto the disk, one will in general
not be able to read the same data on a different architecture - even when using
exactly the same source.
This is where serialization methods become relevant. They specify an exact
encoding on the byte and bit level and may even come with their own description
languages. Many of them provide tooling to automatically generate serialization
and deserialization routines for different programming languages, so one is
capable of accessing the data from different applications developed on a variety
of architectures in different programming languages.
There are mainly two basic approaches to serialization formats:
- Binary formats that specify binary encodings for all basic datatypes and
usually allow combining these basic types into structured types
- Text based formats that assume that plain ASCII or Unicode text is passed
between applications and that add some human readable metadata structure
Both of these approaches have their place. Text based formats, for example, are
human readable and can be passed over nearly all carrier techniques used by
humans such as e-mail, chat systems, etc. Since they contain all information in
human readable form it's easy to inspect the data read and written by
applications during debugging, and it's also possible to write such serialized
formats by hand. On the other hand they have more relaxed syntax rules (for
example they usually tolerate flexible white space), which makes them harder to
parse in the first place. Parsing text based formats is usually a huge burden,
and parsers are among the parts of applications most prone to errors and
security relevant bugs - the more complexity one puts into a parser, the more
likely a security bug becomes. Writing a parser for serialization formats such
as XML is so cumbersome that a single developer might require months to develop
a correct solution.
If one really needs to develop one's own implementation, it's a good idea to use
static analysis and proof assistants (for example Frama-C / ACSL when using
ANSI C, JML for Java, etc.) to verify the validity of one's parsers. This is of
course not easy, since such parsers are usually specified to process data of
arbitrary length, which makes properties such as termination hard to establish.
Comparison table
Note that this blog post only looks at serialization formats that are at least
somewhat generic and not used for a single specific purpose.
| Name | Text/Binary | Complexity | Tools required | Own implementation? | Libraries | Spec |
| --- | --- | --- | --- | --- | --- | --- |
| ASCII text files | Text | Lowest | No | Simple (seconds) | No | No |
| Rolling your own binary format | Binary | Lowest | No | Simple (minutes) | No | No |
| CSV/TSV text files | Text | Low | No | Simple (hours) | Yes | RFC4180 |
| JSON | Text | Low/Medium | No | Simple (hours to days) | Yes | RFC7159 |
| XML | Text | High | No | Time consuming (months, years) | Yes | W3C recommendation |
| ASN.1 | Binary | High | Yes | Time consuming (months, years) | Mostly commercial | ITU-T X.690 |
| XDR | Binary | Medium | Yes | Doable (hours, days) | Not many | RFC4506 |
| Protocol buffers | Binary | Medium | Yes | Doable (days, weeks) | Google | Google |
| SQLite | Binary | Simple | No | No (library) | Single | No |
| HDF | Binary | Medium | No | Doable (days, weeks) | Yes | HDF working group |
| CERN ROOT | Binary | Medium | No | Doable | Yes | CERN |
| DICOM | Binary | High | No | No (really time consuming) | Yes | NEMA |
Using simple ASCII text files
- No specification
- Single data values
- Text based
- Schema less
This might be one of the approaches that's used during the early stages of
learning programming, and it's also suited if one just wants to read or write
a sequence of independent numbers or strings. In case each line has the
same meaning and the same datatype - and contains only a single value - this
might be a good and justifiable approach. Just be sure to specify the meaning
of the numbers as well as the units used. In case one uses floating point
values, for example, one has to specify the range, the precision and how the
characters are to be interpreted.
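As a minimal sketch of this approach (file name, values and the unit are made up for illustration), writing and reading such a one-value-per-line file might look like this:

```python
# One temperature sample (in degrees Celsius) per line, nothing else.
samples = [21.5, 21.7, 22.0]

with open("temperatures_degC.txt", "w") as out:
    for value in samples:
        out.write(f"{value:.2f}\n")

# Reading it back: every non-empty line is expected to hold a single float.
with open("temperatures_degC.txt") as inp:
    restored = [float(line) for line in inp if line.strip()]

print(restored)  # [21.5, 21.7, 22.0]
```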
ASCII text files are of course the way to go when encoding long lived, archivable
text or text that will be used during data analysis.
Rolling your own binary format
It's of course possible to roll one's own binary serialization format specification.
One has to take care of things like text encoding (Unicode is not that simple, ASCII
is rather simple), the byte order of integers, the interpretation of floating point
numbers, etc. It's a solution for rather simple data objects that won't change over
time. Usually it's done for data structures that should not be passed between
different applications but should be stored persistently by one's own application.
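A sketch of such a hand-rolled binary record (the field layout is purely an assumed example): the point is to pin down byte order and field widths explicitly instead of relying on whatever the compiler or CPU would do natively.

```python
import struct

# Assumed record layout: unsigned 32 bit id followed by an IEEE754 double,
# both stored little-endian regardless of the host architecture.
RECORD = struct.Struct("<Id")

def write_record(fh, record_id, value):
    # '<' forces little-endian with no implicit padding.
    fh.write(RECORD.pack(record_id, value))

def read_record(fh):
    return RECORD.unpack(fh.read(RECORD.size))

with open("records.bin", "wb") as fh:
    write_record(fh, 42, 3.1415)

with open("records.bin", "rb") as fh:
    print(read_record(fh))  # (42, 3.1415)
```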
Comma or tab separated values (CSV, TSV)
- Specification: RFC4180
- Used for tabular data
- Text based
- Schema less
This is another really simple method of serializing tables. Each line in a CSV or
TSV file - and note that there is already a formal specification
for CSV that one should follow to allow portable access - contains a single row
of data with the same number of attributes or columns. Each column is separated
by a comma or, in similar formats, by a semicolon or a tab character. When following
the specification commas are always used, and one has to implement a parser that's
capable of processing quoted strings to allow for multi line data and data that
contains commas itself. It's a rather simple parser to write as a finite state
machine, but even in this case it's not possible to parse CSV using regular expressions
due to the quoting. One really has to write a parser for that or use a library.
CSV files might also have a heading line attached. This line is allowed to contain
other datatypes - usually descriptive strings - than all other lines and allows
one to specify the meaning of each column. The number of columns of course has to
match the remaining data lines. It's implementation specific how one determines
whether a heading line is present or not (for example one has to ask the user;
there is no reliable way of detecting a heading line). And there is of course no
comment type inside CSV files.
A last pitfall when using CSV: the specification even mandates CR, LF line ending
sequences, so one has to be careful when transferring such files
between different systems, as transfer tools sometimes tend to convert between
Unix and Windows line endings - a CSV file is strictly only valid with the latter.
JSON
- Specification: RFC7159
- Text based
- Schema less
- Supports associative arrays, ordered lists, strings, numbers, booleans and null
JSON is the JavaScript Object Notation. It's called that way since the specification
for the JSON serialization format has been extracted from the JavaScript language
specification. This has been done since it's a rather simple basic format
that's also easily parseable (note: not by regular expressions - again). It allows
for easy encapsulation of associative arrays and ordered lists - and supports
some basic data types such as strings, numbers, booleans and null. It's so simple
that a full railroad diagram of its grammar fits on a webpage dedicated to
JSON.
Contrary to the previous formats, JSON allows one to specify objects containing an
arbitrary and varying number of elements. It's easily human readable and follows a
strict specification. On the other hand it's schema less, which means there is no
automatic way of verifying the validity of a data file before processing it. Each
application has to decide by itself whether a file is valid or not.
Implementing a JSON parser is a little bit more work - one might require two to
five days at most when designing one from the ground up. But there exist many
implementations for a variety of programming languages that allow easy
serialization and deserialization of JSON objects.
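Using an existing library is usually the better choice; here is a small sketch with Python's bundled json module (the document structure is just an assumed example):

```python
import json

# Serialize an associative array with a nested list and mixed basic types.
measurement = {
    "sensor": "probe-1",
    "values": [21.5, 21.7, None],
    "calibrated": True,
}
text = json.dumps(measurement, indent=2)
print(text)

# Deserialize again; the application still has to validate the structure itself,
# since JSON carries no schema.
restored = json.loads(text)
print(restored["values"][0])  # 21.5
```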
XML
XML is one of the most complex text based serialization and data description languages
currently available. It's really hard to parse - don't even think about using
regular expressions or rolling your own parser in a few days. One is usually required
to use some kind of EBNF compiler to translate all rules into a working parser,
and even then there is much work to do to support various features of XML such as
namespaces, schema validation, etc. It's easy to read by humans and allows one to
fully specify the structure as tags and associated attributes as well as their
datatypes, required and optional objects and their order. One usually writes an
XML schema definition (XSD) to describe the data format. This yields a strict
definition of what's allowed and what's required, against which the data file can
be verified.
Parsing XML is hard. Really. On the other hand there is a huge number of libraries
that mainly fall into one of three categories:
- Event based parsers that parse an XML file and raise an event for each discovered
start and end tag as well as for every attribute and text node. This allows
one to add state machine based processing logic behind this type of parser
(a sketch follows after this list).
- Document object model (DOM) based parsers. These read the whole XML file into
memory and represent it as a tree of tags with attributes attached to them
as properties. This is the approach usually used in browsers to process XHTML
and XML based formats.
- Entity mapping. In this case one specifies which attributes of objects from an
object oriented programming language should be mapped onto which tags and attributes,
and the runtime or precompiler does its thing, serializing and deserializing
using either statically generated code or features such as reflection
or RTTI. This is of course language dependent but allows one to read XML files
that are specified with an XSD in a DOM like fashion in an easy way.
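As a small sketch of the first two approaches using Python's standard library (the XML snippet and tag names are invented for illustration):

```python
import io
import xml.etree.ElementTree as ET

DOC = b'<samples unit="degC"><value t="0">21.5</value><value t="1">21.7</value></samples>'

# Event style: iterparse reports start/end events while reading the input.
for event, elem in ET.iterparse(io.BytesIO(DOC), events=("start", "end")):
    print(event, elem.tag)

# DOM style: parse the whole document into an in-memory tree and navigate it.
root = ET.fromstring(DOC)
print(root.attrib["unit"])                   # degC
print([v.text for v in root.iter("value")])  # ['21.5', '21.7']
```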
One should carefully think about using XML due to its large parsing requirements,
but it's rather convenient when one wants to manually inspect data files
and possibly modify them while still requiring a strict format specification.
ASN.1
- Specification: ITU-T X.690
- Binary (BER, CER, DER, PER) or Text (XER) encoding
- Schema based
- Heavy weight
Abstract Syntax Notation One is one of the oldest serialization families used on the
Internet. It's still used for some major applications such as SSL certificates
and telecommunication systems. It's usually used in combination with external
tools and compilers for a huge variety of programming languages. It consists of
two major parts:
- A specification language that describes the format and datatypes of objects
- Binary encodings such as the basic encoding rules (BER), canonical encoding
rules (CER) or the distinguished encoding rules (DER). DER is one of the formats
that provides a unique and implementation independent mapping of objects to
binary form and is thus often used for cryptographic applications (see the
sketch after this list). BER is implementation specific in the sense that there
is a number of ways one can serialize a given message, but all implementations
are capable of reading all of the possible encodings anyway. The last currently
used encoding rules are the XML encoding rules (XER) that allow mapping of ASN.1
structures onto XML; of course there is some loss of information in this case.
On mobile networks the packed encoding rules (PER) are often used, which ditch
the extensibility features of ASN.1 and provide a really compact representation
of messages.
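To give a feeling for how simple the underlying TLV (tag, length, value) scheme of BER/DER is at its core, here is a hand-written sketch that DER-encodes two INTEGERs wrapped in a SEQUENCE (non-negative values and short-form lengths only; a real implementation covers far more):

```python
def der_integer(value: int) -> bytes:
    # Tag 0x02 = INTEGER; content octets are minimal big-endian two's complement.
    if value < 0:
        raise ValueError("negative values omitted in this sketch")
    content = value.to_bytes((value.bit_length() + 8) // 8, "big")
    return bytes([0x02, len(content)]) + content

def der_sequence(*members: bytes) -> bytes:
    # Tag 0x30 = SEQUENCE; the short length form suffices below 128 content bytes.
    content = b"".join(members)
    assert len(content) < 128
    return bytes([0x30, len(content)]) + content

# SEQUENCE { INTEGER 65537, INTEGER 255 }
print(der_sequence(der_integer(65537), der_integer(255)).hex())
# -> 30090203010001020200ff
```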
Since ASN.1 usually requires the usage of an external compiler as well as a runtime
library to parse and serialize messages inside the application, it's one of the
heavy-weight serialization methods and not often seen in hobby projects - but it's
often used in major data processing applications from security to medicine and
network equipment.
Doing one's own implementation of the full ASN.1 specification is a major multi-month
to multi-year task. There are readily available implementations, but most of them
are sold commercially.
XDR
- Specification: RFC1832 (obsoleted by RFC4506)
- Binary encoding
- Schema based
The external data representation is a less known serialization technique that's still
in major use for things like the network file system (NFS) and other RPC based
applications. It's specified in RFC1832 and
consists of a simple description language as well as a binary encoding of types. It's
easy to implement in comparison to ASN.1 and XML, but still requires preprocessing
of the provided description language to auto generate the code that serializes
and deserializes data structures.
The description language consists of 13 EBNF rules, so it's rather simple to implement
but a little bit more complex to parse than JSON - and the description language is
only used by the code generator that assembles a series of serialization and
deserialization function calls in auto-generated code. Usually these tools also
emit a language specific data wrapping object (struct, class), so an XDR compiler
specific to the programming language used is required.
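A sketch of the two most common XDR primitives, hand-encoded with Python's struct module (XDR prescribes big-endian 4-byte units and strings padded with zero bytes up to a multiple of four):

```python
import struct

def xdr_int(value: int) -> bytes:
    # Signed 32 bit integer, big-endian ("network order").
    return struct.pack(">i", value)

def xdr_string(text: str) -> bytes:
    # Length prefix, then the raw bytes, zero-padded to a multiple of 4.
    data = text.encode("ascii")
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding

message = xdr_int(42) + xdr_string("hello")
print(message.hex())
# -> 0000002a 00000005 68656c6c6f 000000 (spaces added for readability)
```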
Protocol buffers
Protocol buffers have been Google's answer to encodings such as ASN.1. They
decided that one really needs a serialization format and description language
for communication inside distributed systems - but one does not want the complexity
of ASN.1 or the huge parsing requirements of XML. They again consist of a description
language as well as encoding rules. The description language is rather complex and
is used to describe the encoded data objects. A protocol buffer compiler is then
used to auto generate code that serializes and deserializes objects into a
rather simple binary format. Currently libraries exist for a variety of programming
languages and they're easy to use.
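To give a feeling for that binary format, here is a hand-written sketch of the base-128 varint encoding protocol buffers use for integer fields (the field number and value are just an example; real projects use the generated code instead):

```python
def varint(value: int) -> bytes:
    # 7 payload bits per byte, most significant bit set on all but the last byte.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def field_key(field_number: int, wire_type: int) -> bytes:
    # Every field is preceded by a varint key: (field_number << 3) | wire_type.
    return varint((field_number << 3) | wire_type)

# Encode a message with a single varint field: field 1 = 150.
encoded = field_key(1, 0) + varint(150)
print(encoded.hex())  # 089601
```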
Protocol buffers are used for data exchange in many different modern systems and
are sometimes even used as a data exchange format on a file basis. For example
the OpenStreetMap project supplies a dump of its whole database as XML (1270 GByte)
and as a protocol buffer file (52.2 GByte). As one can see, the impact on required
storage size when using protocol buffers instead of XML is really huge, but of course
one loses the ability to manually edit the data files.
The complexity of protocol buffers usually rules out writing one's own implementation
as a weekend project, but it's possible to do so as a longer running project anyway.
Existing implementations are robust and openly available.
SQLite
- Structured storage
- Required library
- Database access
Although not exactly a serialization format or binary format specification, SQLite
should be mentioned when talking about exchange formats and structured storage. SQLite
uses an internally consistent binary storage format - there is no formal specification
available, though. It allows full SQL based access to structured storage and is fully
embeddable inside applications. It's most interesting when one has to store relational
information and wants to perform database like queries. It's for example used in nearly
all Android applications, is supported inside browsers as part of WebSQL
and is used by many applications as their local storage format. Though not a formally
specified data exchange or serialization format, it's convenient to use - one can even
download SQLite as a single ANSI C source file that's embeddable into one's own
application.
SQLite is available at its own webpage under a public domain license.
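A small sketch with Python's bundled sqlite3 module (table and file names are invented for illustration); the whole "serialized" data set ends up in a single database file:

```python
import sqlite3

con = sqlite3.connect("measurements.db")
con.execute("CREATE TABLE IF NOT EXISTS samples (t REAL, value REAL)")
con.executemany("INSERT INTO samples VALUES (?, ?)", [(0.0, 21.5), (1.0, 21.7)])
con.commit()

# Relational queries come for free, unlike with a plain serialization format.
for row in con.execute("SELECT t, value FROM samples ORDER BY t"):
    print(row)
con.close()
```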
HDF
- Structured storage
- Required library
- Schema based
- Binary
HDF is an openly available hierarchical data storage file format. It's often seen in
science communities and basically supports hierarchical data structures as well
as data sets. Data is accessed in a similar way to POSIX filesystems using the
provided libraries. It supports indices over data, implemented using B-trees,
and thus allows fast data access even for large data sets. Note that there are two
different HDF versions that are not compatible (HDF4, HDF5). A specification
is available.
HDF is essentially a write-once format, so it's not possible to really delete content
after it has been written to a file unless one regenerates the whole data file.
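A sketch using the third-party h5py binding for HDF5 (group and dataset names are made up; the HDF5 library and h5py have to be installed):

```python
import h5py

# Hierarchical layout: groups behave like directories, datasets like files.
with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run_001")
    ds = run.create_dataset("temperatures", data=[21.5, 21.7, 22.0])
    ds.attrs["unit"] = "degC"  # small metadata attached directly to the dataset

with h5py.File("experiment.h5", "r") as f:
    ds = f["run_001/temperatures"]
    print(ds[...], ds.attrs["unit"])
```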
ROOT
- Structured storage
- Required library
- Binary
Since I mentioned HDF I also have to mention CERN's ROOT file format. This is used
by the ROOT data analysis framework often used by high energy physicists. It's also
a write-once structured file format that allows indexed access to huge amounts of
data. It allows one to embed various object based data types and to define custom
types that will be stored in the tree like data structure. The on disk data structure
is rather simple but usually not used directly - the ROOT framework provides libraries
that allow direct serialization of data structures into a ROOT tree. There are viewers
available that allow one to explore the data and display metadata as well as data
tables without writing one's own application - all data is also accessible via
C/C++ scripting using the cint interpreter.
As already mentioned, this file format is mostly found in the high energy
physics area. Most of the libraries that use ROOT files are built on heavy usage
of modern C++ features - as is the whole ROOT data analysis toolkit. It takes
a while to get used to the conventions used, but then these tools are rather simple.
The file format specification
is rather hard to read since it focuses on the C++ implementation, and one has to
understand the internal storage format before being able to understand the
specification itself.
Digital Imaging and Communications in Medicine (DICOM)
Digital Imaging and Communications in Medicine is a set of specifications that aims
at standardizing image, data and communication exchange for medical applications. It's
for example heavily used with X-ray, tomography and PET scanning equipment. It's the
basis for digital image archives in medical areas such as the picture archiving
and communication system (PACS). DICOM files are usually built around a hierarchical
tree and provide the ability to store different subtypes of data - for example images
are stored in a subset of TIFF or JPEG. There exists a schema language, the information
object definition, that defines the structure and the attributes contained inside
a DICOM file. Despite its hierarchical index and its support for storing images, DICOM
is mainly a store for objects containing arbitrary attributes.
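A sketch with the third-party pydicom library (the file name is made up; decoding compressed pixel data may require additional handler packages):

```python
import pydicom

# Read a DICOM object; attributes are addressed by their standard keywords.
ds = pydicom.dcmread("scan_0001.dcm")
print(ds.PatientName, ds.Modality, ds.Rows, ds.Columns)

# Underneath, the object is a set of (group, element) tagged attributes.
for element in ds:
    print(element.tag, element.keyword, element.VR)
```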
The DICOM standard is highly complex (split over more than 20 parts) and specifies
data storage as well as network based interchange of information, capability querying
by different nodes, etc.
Note that DICOM is highly complex and often refers to external specifications. It's
really hard to implement, but if one wants to interface with modern medical equipment
it's unavoidable.