The API used on most Unixoid operating systems (e.g. Linux, FreeBSD, etc.) is
Video4Linux. It basically consists of a specification for device naming (i.e.
the /dev/videoN devices) as well as:
Capability querying
Defined data formats
Audio and Video I/O operations
These are realized using the standard Unix read / write and ioctl
APIs. V4L does not only support webcams but also tuners, video capture cards,
satellite receivers, etc. - this page focuses only on cameras, though most of the
operations are the same for other video capture devices.
For webcams there are three different methods that can be used to read or stream
frames from the camera:
A simple interface based around the read syscall, indicated by the
capability flag V4L2_CAP_READWRITE. Using this API no metadata is
passed besides the image data itself (i.e. no frame counters, timestamps, etc.) which
would be required when synchronizing with other streams or detecting frame drops.
This is the simplest I/O method.
Mapping stream buffers via shared memory regions using mmap. This mode is
supported whenever the V4L2_CAP_STREAMING flag is set and the mmap
mode is supported by VIDIOC_REQBUFS. This has been one of the most
efficient streaming modes and is usually widely supported. The application can
provide multiple buffers to allow seamless streaming.
A way for kernel mode drivers to write directly into usermode memory using
a usermode memory pointer. This mode is only supported if V4L2_CAP_STREAMING
is set and the user pointer mode is supported by VIDIOC_REQBUFS.
The main difference to mmap is that the application allocates the buffers
itself, so they can for example easily be shared with different processes
or swapped out - the application just passes a pointer to the driver, the
driver then locks the buffer if required and reads data into the application's
memory space. Metadata is passed in an extra structure.
To my knowledge USB webcam drivers currently only support the mmap mode,
so this is what this blog post will look into first. Note that the V4L2
specification does not specify any mandatory interface, so for a truly portable
application it would be a good idea to support both streaming methods as well
as the method based on read/write.
Header files used
All Video4Linux2 methods and data types are defined in a single header file
that's usually contained in linux/videodev2.h
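A minimal sketch of the includes used by the snippets in this post - the exact set besides linux/videodev2.h is an assumption derived from the syscalls that appear below:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>      /* calloc */
#include <errno.h>
#include <fcntl.h>       /* open, O_RDWR, O_NONBLOCK */
#include <unistd.h>      /* close */
#include <sys/stat.h>    /* stat, S_ISCHR */
#include <sys/ioctl.h>   /* ioctl */
#include <sys/mman.h>    /* mmap, PROT_READ, MAP_SHARED */
#include <sys/select.h>  /* select */

#include <linux/videodev2.h>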
Getting the frames
Opening the device
The first thing is obviously opening the device file. The naming is specified by
the Video4Linux specification, but it's a good idea to allow the user to override
the device path anyway - since one usually has to support systems with multiple
capture devices, this is not much extra effort.
The devices are usually named:
/dev/video0 to /dev/video63 for video capture devices. There might
also be a /dev/video device for the default capture device though this
doesn't always exist.
For video capture from DVB and analog tuner cards there might be /dev/bttv0
as well as /dev/vbi0 to /dev/vbi31
Radio receivers use /dev/radio0 up to /dev/radio63 and the optional
default device /dev/radio
Teletext decoders use /dev/vtx0 up to /dev/vtx31 and the optional
default device /dev/vtx
Before one opens the device it's a good idea to check if the file exists and
is really a device file:
enum cameraError deviceOpen(
    int* lpDeviceOut,
    char* deviceName
) {
    struct stat st;
    int hHandle;

    if(lpDeviceOut == NULL) { return cameraE_InvalidParam; }
    (*lpDeviceOut) = -1;
    if(deviceName == NULL) { return cameraE_InvalidParam; }

    /* Check if the device exists */
    if(stat(deviceName, &st) == -1) {
        return cameraE_UnknownDevice;
    }

    /* Check if it's a device file */
    if(!S_ISCHR(st.st_mode)) {
        return cameraE_UnknownDevice;
    }

    hHandle = open(deviceName, O_RDWR | O_NONBLOCK, 0);
    if(hHandle < 0) {
        switch(errno) {
            case EACCES: return cameraE_PermissionDenied;
            case EPERM: return cameraE_PermissionDenied;
            default: return cameraE_Failed;
        }
    }

    (*lpDeviceOut) = hHandle;
    return cameraE_Ok;
}
Since we opened the device using open we have to close it again using
close when we're done:
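A matching cleanup routine - this is the deviceClose function used by the later examples - could look like the following minimal sketch, reusing the same error enumeration as above:

enum cameraError deviceClose(
    int hHandle
) {
    if(hHandle < 0) { return cameraE_InvalidParam; }

    if(close(hHandle) != 0) {
        return cameraE_Failed;
    }
    return cameraE_Ok;
}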
The next step is to query the capabilities of the opened device. This is first done
via the VIDIOC_QUERYCAP ioctl. This call fills a struct v4l2_capability
structure. This structure contains:
Human readable strings:
16 characters of driver information (driver)
32 characters of card information (card)
32 characters of bus information (bus_info)
A 32 bit version field (version)
A 32 bit capability bitmask (capabilities)
A 32 bit device capability bitmask (device_caps)
Some reserved bytes (12)
The most important field is the capabilities field. This bitmask can be checked
against a number of interesting flags:
V4L2_CAP_VIDEO_CAPTURE identifies a capture device - which is what one's looking
for when looking for a webcam.
Flags indicating the I/O interfaces supported:
V4L2_CAP_READWRITE is set if read and write syscalls are supported to
read and write data
V4L2_CAP_ASYNCIO signals support for asynchronous I/O mechanisms. Since
this is usually not supported by drivers it is rarely of interest.
V4L2_CAP_STREAMING is required to support streaming input and
output which includes userspace buffer pointers and memory mapping.
V4L2_CAP_VIDEO_OUTPUT and V4L2_CAP_VIDEO_OVERLAY would identify
video output and overlay devices, V4L2_CAP_VBI_CAPTURE and V4L2_CAP_VBI_OUTPUT
raw VBI devices. In the same category are V4L2_CAP_SLICED_VBI_CAPTURE
and V4L2_CAP_SLICED_VBI_OUTPUT.
V4L2_CAP_RDS_CAPTURE devices allow one to capture RDS packets, V4L2_CAP_RDS_OUTPUT
is an RDS encoder
V4L2_CAP_VIDEO_OUTPUT_OVERLAY signals that the device supports video
output overlay
V4L2_CAP_HW_FREQ_SEEK supports hardware frequency seeking
V4L2_CAP_VIDEO_CAPTURE_MPLANE and V4L2_CAP_VIDEO_OUTPUT_MPLANE signal
input and output support for multiplanar formats.
V4L2_CAP_VIDEO_M2M_MPLANE indicates multi planar format support on
memory to memory devices.
V4L2_CAP_VIDEO_M2M identifies a memory to memory device.
V4L2_CAP_TUNER for tuner support, V4L2_CAP_AUDIO for audio
as well as V4L2_CAP_RADIO for radio and V4L2_CAP_MODULATOR for
modulator support.
The first thing to check for when capturing from a webcam or video camera is
that the device really supports V4L2_CAP_VIDEO_CAPTURE and either
the V4L2_CAP_READWRITE mode for single frame capture or V4L2_CAP_STREAMING
for mmap or userptr mode.
Since ioctl calls can be interrupted - which is indicated by an EINTR
error code - libraries usually supply an xioctl wrapper that retries the ioctl
until it either succeeds or fails with a different error:
static int xioctl(int fh, int request, void *arg) {
    int r;

    do {
        r = ioctl(fh, request, arg);
    } while ((r == -1) && (errno == EINTR));

    return r;
}
To fetch the capability flags one simply uses this xioctl method and
checks for the required flags:
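A minimal sketch of such a check - the error constants reuse the enumeration from deviceOpen above:

struct v4l2_capability cap;

memset(&cap, 0, sizeof(cap));
if(xioctl(hHandle, VIDIOC_QUERYCAP, &cap) == -1) {
    return cameraE_Failed; /* Query failed - probably not a V4L2 device */
}

if((cap.capabilities & V4L2_CAP_VIDEO_CAPTURE) == 0) {
    return cameraE_Failed; /* Not a video capture device */
}
if((cap.capabilities & V4L2_CAP_STREAMING) == 0) {
    return cameraE_Failed; /* Streaming I/O (mmap / userptr) is not supported */
}

printf("Driver: %s, Card: %s, Bus: %s\n", (const char*)cap.driver, (const char*)cap.card, (const char*)cap.bus_info);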
The next step is to query cropping capabilities and pixel aspects. This is done using
the VIDIOC_CROPCAP call. This call requires a pointer to a struct v4l2_cropcap
that's initialized to the requested stream type and then filled by the driver. Since this
blog post describes video capture, the buffer type will be V4L2_BUF_TYPE_VIDEO_CAPTURE.
Now one can simply call the driver:
struct v4l2_cropcap cropcap;

memset(&cropcap, 0, sizeof(cropcap));
cropcap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

if(xioctl(hHandle, VIDIOC_CROPCAP, &cropcap) == -1) {
    return cameraE_Failed; /* failed to fetch crop capabilities */
    /*
        Note that some applications simply ignore this error
        and simply don't set any cropping rectangle later on
        since there are drivers that don't support cropping.
    */
}
The v4l2_cropcap structure contains three interesting members:
bounds is a struct v4l2_rect that specifies the boundary of the
window in which cropping is possible - this is the maximum possible window size.
defrect is the default cropping rectangle that would cover the whole
image. For a pixel aspect ratio of 1:1 this would for example be 640 × 480 for NTSC
images.
The last interesting value is pixelaspect which is a struct v4l2_fract.
This specifies the aspect ratio (y/x) when no scaling is applied. This is the ratio
required to get square pixels.
Each rect contains left, top, width and height.
Initializing device
Setting cropping region
After querying one can initialize cropping - for example to the default cropping
rectangle that should usually cover the whole image. This is done using
the VIDIOC_S_CROP call supplying a struct v4l2_crop. Usually this
should not be required, but since there are drivers that do not initialize to
the default cropping rectangle it's a good idea anyway. The structure basically
only contains the stream type and a cropping rectangle c.
struct v4l2_crop crop;

/*
    Note that this should only be done if VIDIOC_CROPCAP was successful
*/
crop.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
crop.c = cropcap.defrect;

if(xioctl(hHandle, VIDIOC_S_CROP, &crop) == -1) {
    /* Failed. Maybe only not supported (EINVAL) */
}
Format negotiation
To be able to negotiate a format one should usually query the formats supported
by the device to locate one supported by the application. The code sample
accompanying this blog post does not perform this negotiation but simply
assumes the webcam supports the YUYV color model and at least 640x480
resolution to make the code easier to read. But I'll cover the format negotiation
here - it's rather simple.
The first thing one has to know is that there are two major basic representations
for colors used:
A single value per primary color (red-green-blue or RGB models)
Luma and chroma based models that use relative luminance (Y) and
chrominance channels (usually for red and blue, called Cr and Cb). In general
one might associate these two chrominance channels with different wavelengths,
out of which the generic names U and V emerged - in most use cases UV equals CrCb
but technically that would not be required.
The main advantage of luma and chroma based models is that one immediately has
a grayscale image available when just looking at the luma channel. This is also
how these encoding schemes emerged historically - YUV models just added two
subcarrier encoded chroma channels to transmit color information in addition
to the backwards compatible grayscale image for TV usage.
RGB models on the other hand are usually easier to use on modern input and output
devices.
All color models basically carry the same information but, depending on their
encoding, at different resolutions and scales. Nearly all models allow one
to add an optional alpha channel that covers transparency. Since we're interested
in video capture, alpha channels usually don't play a role.
The biggest difference between the color formats is the way they encode the data.
Again there are two major encoding methods:
Planar, where there is a separate buffer (plane) for each channel
Interleaved, where all information is encoded per pixel (or pixel group). For
RGB888 for example there are 3 bytes per pixel that encode the red, green
and blue channels, followed by the next 3 bytes for the next pixel and so on.
Depending on the chosen format the information for each channel may be of the
same amount or there may be different amounts of information per pixel. For
the commonly used YUYV format (that's also selected by the example and is
often called YUV 4:2:2) there are, for every two pixels, two luminance values
but only one U and one V value shared by both. The idea is that the human eye
is more sensitive to luminance changes than to chroma changes so one has to encode
far less chromatic information. These four values then occupy - for YUYV - four
bytes in a specific pattern that has to be decoded.
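For illustration, this is the byte layout of two horizontally adjacent pixels in a YUYV buffer:

/*
    YUYV (YUV 4:2:2) byte layout for two adjacent pixels:

    byte 0: Y0 - luminance of pixel 0
    byte 1: U  - blue chrominance, shared by pixel 0 and pixel 1
    byte 2: Y1 - luminance of pixel 1
    byte 3: V  - red chrominance, shared by pixel 0 and pixel 1

    A 640x480 YUYV frame therefore occupies 640 * 480 * 2 bytes.
*/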
There is a huge number of supported formats - the usual way to handle this inside
media processing libraries is to decide on one or two internally supported formats
and to decode as well as re-encode at the application boundaries. For example I
personally usually decide to support:
RGB888 with 3 interleaved bytes per pixel encoding R, G and B as 8 bit values
YUV888 encoding a luma and two chroma channels per pixel in an interleaved
way.
A grayscale Y only format. This is particularly interesting in case one
wants to do CV. It's of course possible to access a YUV image with a stride
of 3 but having a more compact representation is often useful.
For more specialized algorithms I personally also use:
RGB with double precision values. This is also encoded interleaved and I usually
use it when doing HDR reconstruction or calculations. Since file formats and
output devices usually do not support such numeric ranges one has to tone-map
again in the end.
Grayscale with double precision values. Again this is used for some specialized
applications - like for example integral images of luminance plots (which
are especially interesting for classifier cascades built on top of wavelets)
To determine which formats a capture device supports one can use the VIDIOC_ENUM_FMT
function call. This is built around the struct v4l2_fmtdesc structure:
The basic idea is that the application just fills the index and type
fields, calls the VIDIOC_ENUM_FMT function and the driver fills the fields
with available information. To query information about our capture device
one will iterate the index value from 0 and count upwards till the
driver fails with an error code of EINVAL. The type has to be set
to V4L2_BUF_TYPE_VIDEO_CAPTURE:
for(int idx = 0;; idx = idx + 1) {
    struct v4l2_fmtdesc fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.index = idx;

    if(xioctl(hHandle, VIDIOC_ENUM_FMT, &fmt) < 0) {
        /* Failed, usually one should check the error code (EINVAL means no more formats) ... */
        break;
    }

    /* We got some format information. For demo purposes just display it */
    printf(
        "Detected format %08x (is compressed: %s): %s\n",
        fmt.pixelformat,
        ((fmt.flags & V4L2_FMT_FLAG_COMPRESSED) != 0) ? "yes" : "no",
        fmt.description
    );
}
Setting the format
The next step is setting the desired format. There are three calls involved with
setting, trying or getting the format:
VIDIOC_G_FMT queries the current format
VIDIOC_S_FMT sets the format (but might change the width and height)
VIDIOC_TRY_FMT passes a format to the driver like S_FMT but does not
change the driver state. It fails if the format is not supported and might change width/height
just like S_FMT does. Note that drivers are not required to implement this call so it
might also fail every time.
Setting the format usually requires negotiating it with the device, but most webcams
support the YUYV color format and an interlaced field layout. This can be set
in a struct v4l2_format:
struct v4l2_format fmt;
unsigned int width, height;

/*
    Select 640 x 480 resolution (you should use dimensions
    as previously determined while setting cropping parameters),
    YUYV color format and interlaced field order
*/
memset(&fmt, 0, sizeof(fmt));
fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
fmt.fmt.pix.width = 640;
fmt.fmt.pix.height = 480;
fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
fmt.fmt.pix.field = V4L2_FIELD_INTERLACED;

if(xioctl(hHandle, VIDIOC_S_FMT, &fmt) == -1) {
    /* Failed to set format ... */
}

/* Now one should query the real size the driver selected ... */
width = fmt.fmt.pix.width;
height = fmt.fmt.pix.height;
In some code like v4l2grab there is some additional handling of buggy
drivers. Since webcams are usually cheap products there are some buggy
drivers, so on Linux these tools check that fmt.fmt.pix.bytesperline is at least
two times fmt.fmt.pix.width and that fmt.fmt.pix.sizeimage
is at least 2 * fmt.fmt.pix.width * fmt.fmt.pix.height.
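The corresponding fix-up - modeled on the workaround found in such capture examples - might look like this:

unsigned int min;

/* Work around buggy drivers that report too small values */
min = fmt.fmt.pix.width * 2;
if(fmt.fmt.pix.bytesperline < min) {
    fmt.fmt.pix.bytesperline = min;
}
min = fmt.fmt.pix.bytesperline * fmt.fmt.pix.height;
if(fmt.fmt.pix.sizeimage < min) {
    fmt.fmt.pix.sizeimage = min;
}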
Capturing frames
Streaming I/O using mmap
The interface supported for most webcams is streaming I/O using memory mapped
buffers. This has been the most efficient streaming method for a long time - allowing
an application to virtually map device memory areas (for example memory contained
on an PCI capture card) directly into application memory. Later on a second method
using userptr has been added that allows one also to exploit DMA transfer
into real main memory when using devices supporting busmastering. For cheap USB
webcams this usually doesnāt make a difference though and userptr streaming I/O
mode is usually not supported by most hardware anyways.
Note that there is no way for a driver to indicate which streaming methods
it supports other than actually requesting the allocation of buffers.
The basic idea is:
The application requests a number of buffers to be allocated inside the
driver's address space. Buffers for use with the mmap method have
to be allocated using the V4L2_MEMORY_MMAP memory type via
the VIDIOC_REQBUFS ioctl. Note that though the buffer descriptors
seem to contain real memory offsets these are just some kind of magic cookie
that is used by the driver to recognize the allocated buffers (for example
they might be real addresses or just arbitrary identifiers).
After a given number of buffers has been allocated successfully they can be
mapped into the application's address space using virtual memory mapping.
There is either a single buffer per frame in single-planar mode or multiple
buffers per frame in multi-planar mode.
Buffers are allocated in the dequeued state, i.e. the device won't write
data into them. To allow writing into a buffer it has to
be enqueued using VIDIOC_QBUF. Whenever a buffer has been filled
successfully it has to be dequeued using VIDIOC_DQBUF.
The buffer state (mapped, enqueued, full, empty) can be queried using VIDIOC_QUERYBUF
The dequeue ioctl can be executed synchronously, which is the default behavior, or
asynchronously - the device file descriptor is then used like a network socket with the
select, poll or kqueue event notification mechanisms to determine when new frames are ready.
Streaming can be started and stopped using VIDIOC_STREAMON and VIDIOC_STREAMOFF
There is a common structure used by the queue and dequeue operations that's
called struct v4l2_buffer. This structure contains:
An index. This is a linear index into a sequence of allocated buffers - used only
with memory mapped buffers.
The type which identifies either input (V4L2_BUF_TYPE_VIDEO_CAPTURE) or
output (V4L2_BUF_TYPE_VIDEO_OUTPUT) buffers.
The size in bytes (length). The size of the allocated buffer has to
be able to contain a full frame of the requested data. After dequeueing
a capture buffer the driver has also set bytesused which might be equal to or
smaller than length. For output buffers bytesused is set by
the application to indicate the size of the data really used.
field indicates the field order of the captured frame (relevant for interlaced video).
timestamp might be set to indicate when the buffer had been captured. For
output the timestamp can specify at which point in time the buffer should
be transmitted by the output device.
timecode is another method to determine the position inside the data
stream.
sequence allows tracking of lost frames. It's a monotonically increasing
sequence number.
memory indicates the type of the buffer (memory mapped or userptr)
userptr and offset, contained in the same union, provide either a pointer
into the application's user mode memory range (for userptr I/O) or a cookie
to pass as the offset to mmap (for memory mapped I/O).
input would allow switching between multiple supported data sources on
the same device.
flags can be a combination of:
V4L2_BUF_FLAG_MAPPED indicates that a buffer is mapped into the application
address space.
V4L2_BUF_FLAG_QUEUED indicates a buffer is currently enqueued for the
device driver to be used. The application should not modify the buffer. The
buffer is said to be in the driver incoming queue.
V4L2_BUF_FLAG_DONE indicates that a buffer is already processed by
the driver and is waiting to be dequeued by the application.
V4L2_BUF_FLAG_KEYFRAME signals that a buffer contains a keyframe - which is
interesting when resynchronizing within compressed streams.
V4L2_BUF_FLAG_TIMECODE is set whenever the timecode field is valid.
V4L2_BUF_FLAG_INPUT is only set if the input field is valid.
As shown in the outline above the first step is to request buffers from the
device driver. One can request multiple buffers - the driver itself determines
the lower (!) and upper bound on the number of buffers that can be requested.
It's a good idea to support a variable number of buffers in case the driver
requires one to use more or fewer.
To request buffers one can use the VIDIOC_REQBUFS ioctl, which on the driver side
resembles the function call int (*vidioc_reqbufs) (struct file *file, void *private_data, struct v4l2_requestbuffers *req);
The struct v4l2_requestbuffers structure contains:
The number of requested buffers count. This is an input and output field
that might be increased or decreased arbitrarily by the driver. Note that setting
count to 0 has the special meaning of releasing all buffers.
The type (V4L2_BUF_TYPE_VIDEO_CAPTURE or V4L2_BUF_TYPE_VIDEO_OUTPUT)
of the buffers
A memory specifier. This identifies how the buffer memory is passed. For buffers
that are mapped into userspace the V4L2_MEMORY_MMAP constant is used;
if one wanted to use userptr style DMA transfers one would set the
constant to V4L2_MEMORY_USERPTR.
If the driver does not support mmap (or userptr mode, if that has been requested)
it will return EINVAL. This is the only way to determine the supported
streaming data transfer mode.
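A sketch of such a buffer request - here asking for 4 buffers and accepting whatever count the driver settles on; the resulting bufferCount is used by the mapping loop below:

struct v4l2_requestbuffers reqBuffers;
unsigned int bufferCount;

memset(&reqBuffers, 0, sizeof(reqBuffers));
reqBuffers.count = 4; /* Wish for 4 buffers, the driver may adjust this */
reqBuffers.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
reqBuffers.memory = V4L2_MEMORY_MMAP;

if(xioctl(hHandle, VIDIOC_REQBUFS, &reqBuffers) == -1) {
    if(errno == EINVAL) {
        /* Memory mapped streaming is not supported by this device */
    }
    return cameraE_Failed;
}
if(reqBuffers.count < 2) {
    /* Not enough buffers granted for seamless streaming */
    return cameraE_Failed;
}
bufferCount = reqBuffers.count;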
After the buffers have been requested they have to be mapped into memory. To do
so one has to query each buffer using VIDIOC_QUERYBUF to determine the parameters that
will be passed to mmap, in the same way as when mapping a memory mapped file.
On entry into QUERYBUF one just has to pass type and index.
struct imageBuffer* lpBuffers;
{
    lpBuffers = calloc(bufferCount, sizeof(struct imageBuffer));
    if(lpBuffers == NULL) {
        printf("%s:%u Out of memory\n", __FILE__, __LINE__);
        deviceClose(hHandle);
        return 2;
    }

    int iBuf;
    for(iBuf = 0; iBuf < bufferCount; iBuf = iBuf + 1) {
        struct v4l2_buffer vBuffer;
        memset(&vBuffer, 0, sizeof(struct v4l2_buffer));

        /*
            Query a buffer identifying magic cookie from the driver
        */
        vBuffer.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        vBuffer.memory = V4L2_MEMORY_MMAP;
        vBuffer.index = iBuf;

        if(xioctl(hHandle, VIDIOC_QUERYBUF, &vBuffer) == -1) {
            printf("%s:%u Failed to query buffer %d\n", __FILE__, __LINE__, iBuf);
            deviceClose(hHandle);
            return 2;
        }

        /*
            Use the mmap syscall to map the driver's buffer into our
            address space at an arbitrary location.
        */
        lpBuffers[iBuf].lpBase = mmap(NULL, vBuffer.length, PROT_READ|PROT_WRITE, MAP_SHARED, hHandle, vBuffer.m.offset);
        lpBuffers[iBuf].sLen = vBuffer.length;
        if(lpBuffers[iBuf].lpBase == MAP_FAILED) {
            printf("%s:%u Failed to map buffer %d\n", __FILE__, __LINE__, iBuf);
            deviceClose(hHandle);
            return 2;
        }
    }
}
Then one has to enqueue all buffers that one wants to provide to the driver (typically
all of them before starting the processing loop) by using the VIDIOC_QBUF
function. One just has to supply type, memory and index when using memory
mapped buffers.
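Enqueueing all previously mapped buffers might look like this:

{
    int iBuf;
    for(iBuf = 0; iBuf < bufferCount; iBuf = iBuf + 1) {
        struct v4l2_buffer vBuffer;
        memset(&vBuffer, 0, sizeof(struct v4l2_buffer));
        vBuffer.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        vBuffer.memory = V4L2_MEMORY_MMAP;
        vBuffer.index = iBuf;

        if(xioctl(hHandle, VIDIOC_QBUF, &vBuffer) == -1) {
            /* Failed to enqueue buffer iBuf ... */
        }
    }
}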
Whenever the device is ready the processing loop will use VIDIOC_DQBUF to
pop the oldest filled buffer from the output queue. This is a blocking call - it
can also be combined with the standard select, epoll or kqueue
asynchronous processing functions in case O_NONBLOCK had been set during
open. Usually one wants to re-enqueue the buffer after having
finished processing it or after having copied the data for further processing.
The last two important functions start and stop stream processing. These
are VIDIOC_STREAMON and VIDIOC_STREAMOFF. Of course one should
start streaming before running the event processing loop, as shown in the sketch below.
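Putting these pieces together, a minimal capture loop could look like the following sketch. The processImage function is a hypothetical placeholder for whatever the application does with a frame - for example the YUYV to RGB conversion and JPEG writing described below:

enum v4l2_buf_type bufType = V4L2_BUF_TYPE_VIDEO_CAPTURE;

if(xioctl(hHandle, VIDIOC_STREAMON, &bufType) == -1) {
    /* Failed to start streaming ... */
}

for(;;) {
    fd_set fds;
    struct timeval tv;
    struct v4l2_buffer vBuffer;

    /* Wait (with timeout) until the driver signals a filled buffer */
    FD_ZERO(&fds);
    FD_SET(hHandle, &fds);
    tv.tv_sec = 2;
    tv.tv_usec = 0;
    if(select(hHandle + 1, &fds, NULL, NULL, &tv) <= 0) {
        continue; /* Timeout or interrupted system call - just retry */
    }

    /* Dequeue the oldest filled buffer */
    memset(&vBuffer, 0, sizeof(vBuffer));
    vBuffer.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    vBuffer.memory = V4L2_MEMORY_MMAP;
    if(xioctl(hHandle, VIDIOC_DQBUF, &vBuffer) == -1) {
        if(errno == EAGAIN) { continue; } /* Nothing ready yet (non blocking mode) */
        break; /* Real error */
    }

    /* vBuffer.index identifies the mapped buffer containing the frame */
    processImage(lpBuffers[vBuffer.index].lpBase, vBuffer.bytesused);

    /* Hand the buffer back to the driver so it can be reused */
    if(xioctl(hHandle, VIDIOC_QBUF, &vBuffer) == -1) {
        break;
    }
}

if(xioctl(hHandle, VIDIOC_STREAMOFF, &bufType) == -1) {
    /* Failed to stop streaming ... */
}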
The usage of the read/write interface will (hopefully) be added in the near future. Note
that it's usually not supported by webcams on FreeBSD anyway.
Writing frames into a JPEG file using libjpeg
The process of writing a raw image into a JPEG file has been discussed
in a previous blog post. The major remaining
task is to convert the captured image into the format accepted by libjpeg. In
my application I had to convert the YUYV (YUV 4:2:2) format into RGB888. In YUYV there
are always two luminance values as well as a single set of chroma values per
sample - two pixels share the chroma values but have different luminance values.
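A minimal conversion sketch, assuming the common integer BT.601 approximation (the exact coefficients differ slightly between implementations):

/*
    Convert a YUYV (YUV 4:2:2) buffer into a packed RGB888 buffer.
    Every 4 input bytes (Y0 U Y1 V) describe two output pixels (6 bytes).
*/
static void yuyvToRgb888(
    const unsigned char* lpIn,
    unsigned char* lpOut,
    unsigned int width,
    unsigned int height
) {
    unsigned int i;
    for(i = 0; i < (width * height) / 2; i = i + 1) {
        int y0 = lpIn[4*i+0];
        int u  = lpIn[4*i+1] - 128;
        int y1 = lpIn[4*i+2];
        int v  = lpIn[4*i+3] - 128;
        int j;

        for(j = 0; j < 2; j = j + 1) {
            int y = (j == 0) ? y0 : y1;
            int r = y + ((351 * v) >> 8);
            int g = y - ((179 * v + 86 * u) >> 8);
            int b = y + ((443 * u) >> 8);

            /* Clamp each channel to the valid 0..255 range */
            lpOut[6*i+3*j+0] = (unsigned char)((r < 0) ? 0 : ((r > 255) ? 255 : r));
            lpOut[6*i+3*j+1] = (unsigned char)((g < 0) ? 0 : ((g > 255) ? 255 : g));
            lpOut[6*i+3*j+2] = (unsigned char)((b < 0) ? 0 : ((b > 255) ? 255 : b));
        }
    }
}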
Simple sample (FreeBSD, streaming mmap)
External references
One really great resource that I've found while writing this article
has been the Video4Linux2 API introduction on
LWN.