This article is work in progress
Introduction
Neuronal networks are often called the solution to every hard to tackle problem
today. Theyāre a powerful method to perform data analysis and data exploration,
but they should not be over-estimated in their importance. They mostly compete
with other algorithms like support vector machines (which are most of the
time more performant than neuronal networks) and systems like bag of words
classifiers. They also compete with basic Bayesian predictors, etc.
The basic idea of a neuronal network is to provide a general data structure
that can modify itās own mappings from inputs to outputs, some kind of
feedback and push as much training data inside the neuronal network as possible
to calculate weights. One tries to find a (complex but mathematical) model
function that identifies some kind of pattern on the input data and
does a prediction (either binary or continuous) on output data.
Note that neuronal networks provide a tradeoff between development effort,
accuracy, required training data and comutational / memory requirements.
Supervised learning approach
In supervised learning the neuronal network is first presented with a training
dataset. This training dataset normally consists of some sample data (it doesnāt
matter if these are images, text, datasamples from some data aquisition system,
etc.) as well as a set of output values (labels, weights, etc.). The data is
presented to the neuronal networks inputs and the output is compared to the
expected one. Then one calculates the squared error to determine deviation from
the expected output and uses the backpropagation algorithm that simply calculates
the influence of each weight of each neuron onto the output (in terms of a
gradient) to calculate the correction applied to the weights - which is modified
using a so called learning rate.
At the end the network is normally tested against a independent test dataset
to verify the training worked.
Note that neuronal networks are similar to statistical analysis in this
application but provide the advantage that one can use a generic model
composed of multiple interconnected functions to describe the problem. The
more complex the model gets the more input data is required (and no - this
does of course not scale linear).
Unsupervised learning approach
This is an approach often used in clustering or segmentation. The training also
works by presenting the network a huge bunch of training data - but this time
the data is not pre-labeled but the network is used to determine some kind
of segmentation of the input data. The normal approach for unsupervised learning
is to use either principal component analysis (PCA) or cluster analysis instead
of neuronal networks. In the world of neuronal networks one might see
self-organizing maps or usage of the adaptive resonance theory.
Normally the outcome of unsupervised learning is not known in advance and
inspection of the generated neuronal network is required. In practice most of
the time PCA or cluster analysis are more performant and more accurate than
neuronal networks in this field.
Reinforcement learning
Another method of learning is called reinforcement learning.
For reinforcement learning one doesnāt provide known input and output pairs
to train the network but (sporadically) provides feedback via a feedback
function - i.e. tells the network if itās performance is good or bad. These
informations then are used to recalculate weights. The initial network might
do random action and check the effect judged by an external feedback function.
One might even introduce randomness into the network to provide additional
learning disturbance to leave local minima. Often a simulation is used to
provide feedback on initial networks - i.e. the network provides some
output, the simulation checks if the outcome is positive or negative (not in
mathematical terms but for example if it leads to catastrophic failure in
control systems, achieves higher scores in games, etc.) and provides positive
or negative feedback in this case.
This method is heavily inspired by the way the human brain learns.
Basics
A neuron
The basic building block of neuronal networks is a neuron. So whatās a
neuron? Itās a mathematical object composed of:
- An weight $w_{ij}$ that maps the weight of the input of the previous layer
or input data $i$ into the neuron $j$. This will end up inside a mapping
matrix.
- A transmission function $z = \Sigma(x_1, \ldots, x_i; w_0j, \ldots, w_ij)$
thatās normally really just the sum of all weighted input values. When
built as sum this is normally called the activation potential.
- An activation function $o_j = \sigma(z)$. This is the main function used
to decide how to react to a given input and itās the main challenge of using
backpropagation as well as performance tuning for neuronal networks.
- Sometimes a threashold value $\theta_j$ is added to the input as a
constant value. This is called bias.
As one can see the input to a single neuron with index $j$ is calculated as
$ z_j = \Sigma_{i = 1}^{I} w_{ij} * x_i $
Then the activation function is applied to calculate the output:
$ o_j = \sigma(z_j + \theta_j) $
Network evaluation
Evaluation of the network response for a given input is normally done on
graphics processing units (GPU) or tensor processing units (TPU) these days
but of course itās entirely possible to be done on CPU based systems. There
are essentially only two basic operations requires:
- Matrix multiplication
- Applying a function to each matrix element
For some layer types on might also be required to sum vectors - or is capable
of reducing storage by applying the same matrix to a larger number of
overlapping or non overlapping zones (for example for convolutional layers).
To evaluate a dense layer one requires:
- The vector $\vec{x}$ contains the input data with elements $x_i$. This
vector is assumed to be a column vector with $I$ elements.
- A weight matrix $w$ that contains the weights $w_{ij}$. This matrix
is assumed to contain all weights for the first neuron in itās first
row, all weigts for the second neuron in itās second, etc. This encodes
the weights that map all input elements into the corresponding activation
potentials $z_j = w_{ij} * x_i + \theta_j$.
- This leads to an intermediate value that should be stored when one wantās
to do training but one doesnāt have to store when doing evaluation, the
vector $\vec{z}$ of activation potentials $z_j$.
- An output vector $\vec{o}$ that contains the result of network evaluation,
i.e. the application of the activation function to each and every element
of the activation vector $\vec{o} = \sigma(\vec{z})$
- One can further reduce the complexity with some additional computational
and storage overhead by including the bias $\theta_j$ into the weight matrix.
This could be realized by adding an additional $1$ element at the end of
each input vector and adding a further column to the matrix containing the
bias values for each neuron.
$\vec{z} = w * \vec{x}$
[
\left(\begin{matrix}z_{1} \\ z_{2} \\ \ldots \\ z_J \\ 1\end{matrix}\right) = \left(\begin{matrix}w_{11} & w_{21} & \ldots & w_{I1} & \theta_1 \\ w_{12} & w_{22} & \ldots & w_{I2} & \theta_2 \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ w_{1J} & w_{2J} & \ldots & w_{IJ} & \theta_J \\ 0 & 0 & \ldots & 0 & 1 \end{matrix}\right) * \left(\begin{matrix}x_{1} \\ x_{2} \\ \ldots \\ x_I \\ 1\end{matrix}\right)
]
[
\left(\begin{matrix}o_{1} \\ o_{2} \\ \ldots \\ o_{J} \\ 1 \end{matrix}\right) = \left(\begin{matrix}\sigma(z_{1}) \\ \sigma(o_{2}) \\ \ldots \\ \sigma(o_{J}) \\ 1 \end{matrix}\right)
]
Training methods
Backpropagation (Error minimization via gradient descent)
One of the most prominent methods to determine weights for the Neurons is
the backpropagation algorithm. This algorithm is used during supervised
learning. In this case on has a huge number of input $\vec{x}$ and expected
target $\vec{t}$ pairs. One can then calculate an squared error value
$E = \sum_{m=0}^{M} (\vec{t_m} - \vec{o_m})^2$
For the following calculations we assume a neuronal network consisting of
input data and two dense neuronal layers. The output data is directly
taken from the output of the second layer.
Note that the following indices are used:
- $m$ is the index into the training set. If you train your network
with 1000 different samples $m$ runs from $0$ to $999$. The total
number of training samples is termed $M$
- $i$ is an index into the input data for the layer. If an layer
has $500$ different possible inputs $i$ runs from $0$ to $499$.
The total number of input values is called $I$. For the first
layer this is the index into the input vector.
- $j$ is the index into the output values for the input layer. For the
last (output) layer this is the result data. The total number
of output values of a given layer is called $J$.
- $j$ is also the index into the input values for the second layer as
both layers are chained at each other.
- $k$ is the index into neurons at the second layer as well as into the
output layer. The number of output variables is called $K$
In this case $t_{k,m}$ and $o_{k,m}$ compose all input values and expected output
values for the $m$-th training sample. The square function guarantees that
every error contributes independent of the direction (positive or negative).
An error of $0$ would mean that the network perfectly reproduces all presented
samples (which normally is an indication for overfitting). This error will
also be calculated during validation.
The forward equations for the input layer in this network can be expressed as:
[
z_j = \theta_j + \sum_{i=0}^{N-1} x_i * w_{ij}
]
[
y_j = \sigma(z_j)
]
[
y_j = \sigma(\theta_j + \sum_{i=0}^{N-1} x_i * w_{ij})
]
The second layer is built on top of the results of the first layer:
[
a_k = \psi_k + \sum_{j=0}^{J-1} y_j * w_{jk}
]
[
o_k = \sigma(a_k)
]
[
o_k = \sigma(\psi_k + \sum_{j=0}^{J-1} y_j * w_{jk})
]
[
o_k = \sigma(\psi_k + \sum_{j=0}^{J-1} \sigma(\theta_j + \sum_{i=0}^{N-1} x_i * w_{ij}) * w_{jk})
]
The idea of backpropagation is now to minimize the error function. First one
has to determine the direction and slope (i.e. the gradient) of the error. One
simply computes the derivative of the error $E$ in relation to the weights $w_{ij}$.
First weāll only look at the output layer where one can directly specify the
error deviation:
[
E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(o_{k;m} - t_{k;m}\right)^2
]
[
E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(\sigma(\psi_k + \sum_{j=0}^{J-1} y_j * w_{jk}) - t_{k;m}\right)^2
]
To make oneās live easier weāll introduce the bias as explained above inside the
weight matrix. This eliminates $\psi$ and $\theta$ from the equations:
[
E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(\sigma(\sum_{j=0}^{J-1} y_j * w_{jk}) - t_{k;m}\right)^2
]
Now the derivatives can easily be calculated:
[
\frac{\partial E}{\partial w_{jk}} = \sum_{m} \sum_{k=0}^{K} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * \frac{\partial a_k}{\partial w_{jk}}
]
[
\frac{\partial E}{\partial w_{jk}} = \sum_{m} \sum_{k=0}^{K} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * y_j
]
Using this derivative for a single neuron $E_k$ one can simply determine:
[
\frac{\partial E_k}{\partial w_{jk}} = \sum_{m} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * y_j
]
For an intermediate node the approach is similar:
[
\frac{\partial E}{\partial w_{ij}} = \sum_{k} \left( \sigma(a_k) - t_k \right) * \frac{\partial \sigma}{\partial z}(a_k) * \left(\sum_j \frac{\partial \sigma}{\partial z}(z_j) w_{jk} * x_i \right)
]
[
\frac{\partial E_j}{\partial w_{ij}} = \sum_{k} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * \frac{\partial \sigma}{\partial z}(z_j) * w_{jk} * x_i
]
As one can see the structure of both formulas is quite similar:
[
\frac{\partial E_k}{\partial w_{jk}} = \underbrace{\left( \sigma(a_k) - t_k \right) * \frac{\partial \sigma}{\partial z}(a_k)}_{\delta_k} * y_j
]
[
\frac{\partial E_j}{\partial w_{ik}} = \underbrace{\left( \sum_{k} \delta_k * w_{jk} \right) * \frac{\partial \sigma}{\partial z}(z_j)}_{\delta_j} * x_i
]
From this expressions one can see why the algorithm is called backpropagation.
It simply transfers the error from the output layer toward the input of the
output layer, then uses the weights of all nodes that are attached to the next
node upstream to transfer the error upwards and uses the same approach again.
This is repeated for every layer inside the network.
With these derivates one can calculate the adjustment of all weights in all
layers:
[
\Delta w_{jk} = - \eta * \frac{\partial E_k}{\partial w_{jk}} = -\eta * \delta_k * y_j
]
[
\Delta w_{ij} = - \eta * \frac{\partial E_j}{\partial w_{ij}} = -\eta * \delta_j * x_i
]
Here the learning rate $\eta$ was used to specify the speed with which the
algorithm should step into the direction of the gradient. A large learning rate
leads to a high convergence speed but the algorithm may also miss local minima
in itās current vicinity. A low learning rate would require a huge number of
iterations but might converge more reliable. Note that one also has to take
care of numerical instabilities (multiplying large numbers - for example from
exploiding gradients of some activation functions - with small numbers is
especially a huge problem with floating point arithmetic) - most of a time
a hybrid approach is most useful where one also reduces the learning rate
during a number of epochs.
Online method
The online backpropagation method does this calculation for each and
every training sample specified. This might be problematic if there are
training samples that would lead to large steps into a given direction
wheras the majority would not - i.e. an outlier. The online method
is rather simple to implement and straightforward. One presents the
intput, propagates it to the tensors till one has calculated a predicted
output, calculates the error and all weight differences for a given learning
rate and applies the weight modifications. Then the next sample is
used and so on.
Batch method
The batch method is used to reduce the effect of outliers. One calculates
the weight differences for a whole bunch of training samples (this step is
called an epoch). Then one calculates the average error for all nodes
and applies that average to the backpropagation algorithm instead of
the error of one sample. In this case the effect of outliers is reduced.
One might also eliminate outliers using different methods but one has to
remember that removing outliers is just fighting symptoms of bad measurement,
way to small sample size or a wrong assumption, a too high error rate for labeling
the input data set, etc.
Of course the batch method can also be used as a simple parallelization method.
Each node uses the same network weights and calculates itās own averaged gradients.
Then the average over all gradients of a single batch is calculated at
a central node, the weights are adjusted and broadcasted to all nodes and then
the process continues. Note that this may leak information about the data
that the nodes are processing.
There is a number of research papers being published about federated learning
(like for example Towards Federated Learning at Scale: System Design)
and also ideas on how to preserve privacy while aggregating information
about gradients (see for example Practical Secure Aggregation for Privacy-Preserving Machine Learning)
Some Neuron types
The following is just a collection of some basic most commonly used neuron
types:
Step function

This is one of the most simple functions available and used in many introduction
classes to neuronal networks:
$\sigma(z) = 1 \forall z \geq 0$
$\sigma(z) = 0 \forall z < 0$
The main problem of this function is that itās a non analytical function. It
simply splits the input-space by a hyperplane and is used in Perceptrons
or at the output stage of binary classifiers - itās derivative would
be the Dirac delta function $\delta(z)$ which takes the value $\infty$
at $z=0$ and zero elsewhere. There is no useable gradient for the backpropagation
algorithm available when using this function.
Sigmoid

The sigmoid function solves the problem of the diverging gradient by having
a finite slope. Itās way more computationally intensive but itās one of
the most easiest functions to deal with during backpropagation.
$\sigma(z) = \frac{1}{1 + e^{-z}}$
Derivative of the Sigmoid

The derivative is given by
$\frac{\partial \sigma(z)}{\partial z} = \frac{e^{-z}}{(1 + e^{-z})^2}$
Hyperbolic tangent

$\sigma(z) = \frac{e^{2z}-1}{e^{2z}+1}$
Derivative of the tanh

The derivative is given by
$\frac{\partial \sigma(z)}{\partial z} = \frac{4*e^{2z}}{(e^{2z}+1)^2}$
Rectifier (ReLU)

The rectifier is a partially defined function. Itās linear for positive
input values but zero for negative ones - it combines the simple evaluation
of the function as well as the existence of the derivative. Note that this
function is not continuous any more but normally this doesnāt pose any
problem. Also the derivative is really easy to calculate because of two
cosntant sloped areas:
$\sigma(z) = z \forall z \geq 0$
$\sigma(z) = 0 \forall z < 0$
An additional weight meight be used to specify the slope:
$\sigma_w(z) = z * \theta_j \forall z \geq 0$
$\sigma_w(z) = 0 \forall z < 0$
Derivative of ReLU

The derivative of the rectifier is rather easy to calculate:
$\frac{\partial \sigma(z)}{\partial z} = 1 \forall z \geq 0$
$\frac{\partial \sigma(z)}{\partial z} = 0 \forall z < 0$
The same with weights:
$\frac{\partial \sigma_w(z)}{\partial z} = \theta_j \forall z \geq 0$
$\frac{\partial \sigma_w(z)}{\partial z} = 0 \forall z < 0$
This easy definition makes the summation during backpropagation really simple
which gives a huge speedup when exploited in code. Unfortunately the non
continuous derivative makes a recursive usage of this node type challenging.
Softmax
The softmax function is a weighted version of the exponential function.
$\sigma_j(Z) = \frac{e^{z_j}}{\sum_{p=0}^{J} e^{z_p}}$
The normalization used here is simply the sum of all exponentials $e^{z_j}$
inside the softmax layer without normalization. The softmax function is mostly
used for categorical distributions - the results of a specifically constructed
neuronal network can be interpreted as the probabilities of an object
belonging to a given class. The same function is also often used in
statistics (Bayesian methods, multi-class regression, etc.)
Linear function
One might also be tempted to use a simple linear activation function - and might
of course do so for some specific cases:
$\sigma(z) = a * z + b$
This might be done with some freely chooseable constants $a$ and $b$. In any
case this will not be a good choice for a real neuronal network since the
linear combination of all these linear functions will collaps into a
multidimensional linear function (i.e. into a single layer) at the end. One
cannot build a plain simple linear function based neuronal networks consisting
of multiple layers.
Different common network topologies
There is a really huge number of possible network topologies. In the following
section some of the most common network types are shortly mentioned.
Feed forward networks (FFN)
This is one of the most basic network structures. The input data is presented
to the first layer in the network as input, the output of the first layer
as input to the second, etc. The last layer presents the output. Normally
all layers are fully connected (i.e. each neuron on the first layer is
connected to each neuron on the second, each neuron on the second layer is
connected to each neuron on the third, etc.). There are no cyclical connection
paths. Such a network is directly representable by a chain of tensor operations.
Note that there is no natural way to describe memory like in a recurring
neuronal network - but recurring neuronal networks can be unrolled into
a more complex feed forward network when cycles have a finite depth.
Recurring neuronal networks (RNN)
RNNs use cycles from layers at lower stages back to upper layers
or from outputs of neurons back to their inputs. In that way they
contain internal state and have to model a propagation delay for these
internal loops. Normally the most basic RNNs are unrolled into feed forward
networks for easier computation. They are capable of processing variable length
inputs instead of only constant length inputs like the feed forward
network (in this case input is shifted through the input areas instead
of loaded at a fixed position)
Long short term memory (LSTM)
To solve problems of exploding or vanishing gradients some networks
introduce long short term memory (LSTM) that is capable of remembering
events a longer time ago without having to expand the network to thousands
of layers and without propagating an infinitely growing error gradient.
Recursive neuronal networks
In recursive neuronal networks multiple layers are interconnected by the
same weight matrix for every layer (i.e. every layer shares the same
weight matrix). They are often used for natural language processing (word
embeddings, etc.). Training is only really possible when using some
function like logistic or hyperbolic tangent with non vanishing and continuous
gradient.
Convolutional neuronal network (CNN)
A convolutional neuronal network includes layers that calculate the convolution
of the input data into a hidden layer. Normally this means that a sliding window
of fixed size is mapped into the next layer and summation is not done
over all outputs of the nodes above. Convolutional neuronal networks are
most often used to identify features in image classification networks to detect
spatial features.
Some layer types
Networks in many machine learning frameworks are built using different
types of layers. Some of them are listed below.
Fully connected layer
This is one of the most classic layers. All outputs from a given layer are
connected to each input of the layer below using a weight matrix as
described above.
Convolutional layer, Locally connected layer
A convolutional layer is convoluting the input data (for example one might
imagine a fixed size sliding window to generate all the input data for a following
layer at the next stage). This means that not all neurons are fully interconnected.
One might imagine the first 3x3 window of data to be mapped to the first neuron
on the next layer, the same window slided by one element to the second neuron
and so forth.
A convolutional layer can be either implemented by modifying the sum or
forcing some weights to be zero.
The difference between a convolutional layer and a locally connected layer
is the sharing of weights. In a convolutional layer each zone that gets
convoluted shares the same weights with every other zone. In a locally
connected layer connections are locally like in the convolutional layer
but each connection in each zone shares itās own weights.
Pooling layer, Merging layer
A pooling layer is normally used below a convolutional layer to discard
unnecessary information. It selects information from the zones of convolutional
layers - for example using a function like the maximum or minimum inside
the zone of a given window. This helps to speed up computation and to reduce
overfitting.
A pooling layer has to be performed using a tensor operation that might operate
on multiple elements of the given tensor.
More generally a merging layer can merge values of either the whole
upper level network or only given zones of the input network. It might
merge using a set of different functions (addition, subtraction, multiplication,
averaging, maximum, minimum, concatenation, dot products, etc.).
Flattening layer
A flattening layer just re-arranges the result of a layer above (mostly
used after a pooling layer) to return a single feature matrix instead of
the multi dimensional result of a pooling layer.
Noise layers
Noise layers allow the introduction of random white noise into the model.
Normally they use gaussian noise model centered around zero. This introduction
of noise allows one to leave local minima and prevent overfitting.
Normalization layer
A normalization layer is just a layer inside the processing chain, not really
a layer that gets evaluated every time the input is changed like the layers
above. Itās used to re-normalize the outputs for every batch sequence in the
sense that one tried to get the average zero with standard deviation 1.
Lambda layer
Again this is not a traditional layer as the layers above but is present
in some machine learning frameworks. It allows the user applciation to call
one function that will be evaluated for every element inside the input
tensor and yield the value in the output tensor (i.e. the function is applied
to every element in the input data).
This article is tagged: Programming, Tutorial, Machine learning, Statistics, Data Mining