Why does MNIST only have 2 channels

Entry into convolutional neural networks with Keras

# AI workshop

What's this about

In my last blog, Getting Started with Neural Networks with Keras, I described quite comprehensively how to build a "workbench" for working with simple neural networks. As an example I used the MNIST database of handwritten digits.
The prediction accuracy of around 96.5% was impressive, but useless in practice: humans reach about 99.7% on this data set. So there is still a lot of room for improvement.
In this blog I want to reach an accuracy of 99.1% with very simple means. With a little luck, a single simple CNN architecture will get us there.

What are CNNs?

Convolutions are one of the two common architectural elements that make a network deep (the other is the RNN, the Recurrent Neural Network, which I will write about in another blog). Depth refers to the number of layers in a network, where several layers are grouped into functional blocks. Here, the special functional block is the convolution of images.
Simple, fully connected neural networks are a very powerful tool, but they are sensitive to hypothesis spaces that are too large (a small reminder: a data record is called a sample, the attributes of a sample are called features, the number of features determines the number of input dimensions of the NN, and this space is known as the hypothesis space).
Colloquially, one could say that too many descriptive attributes confuse the net. By the way, we humans function in a very similar way. Try to derive the emotion from the following sentence: "Oh, I don't know either. It's raining, the sky is gray, and today I somehow got up on the wrong foot." It works, but the following sign conveys the same emotion and can be grasped much faster and more unambiguously: 🙁
This is exactly what the convolution layers do before the fully connected layers. They recognize and emphasize unique patterns in a large number of features.
Let's take the picture of a cat as an illustration. Convolution layers work their way up from very fine structures, i.e. a small line, a point or a color, to increasingly larger patterns such as a cat's ear, a cat's nose and a cat's eyes. The fully connected layers then no longer see a lot of pixels, but are told: "There is a cat's head in the picture." This reduces the number of features from which the fully connected part of the network has to derive its answer, which makes its work easier.
There is a really fantastic YouTube video that clearly introduces the math behind CNNs. A must see: it turns many a ? into a ! in your head: A friendly introduction to Convolutional Neural Networks and Image Recognition

From getting data to feature engineering

In my last blog, I described in great detail how to "clamp the samples into your workbench" for training and testing. So I can skip that here and just list the code blocks:

Boiler plate

Using TensorFlow backend.
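The imports for this boilerplate might look like the following sketch; the exact list depends on your setup, and importing Keras 1/2 with the TensorFlow backend is what prints the line above:

```python
import numpy as np
import matplotlib.pyplot as plt   # used later for the figures

# importing Keras prints "Using TensorFlow backend." (Keras 1/2)
from keras import models, layers, regularizers
from keras.utils import to_categorical
```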

Get data
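MNIST ships with Keras, so getting the data is a single call (it downloads the data set on first use):

```python
from keras.datasets import mnist

# 60,000 training and 10,000 test images, each 28 x 28 gray values (0-255)
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
```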

Feature engineering

One thing is a little different here than last time.
I convert the 3D image tensor into a 4D tensor, because Keras needs this as input for its convolutional layers:
* 1st dimension: The number of the sample
* 2nd dimension: the picture line
* 3rd dimension: The picture column
* 4th dimension: The color channels (here we only have one, since these are grayscale images; otherwise there are three, for red, green and blue)
I divide all gray values by their maximum value (255) in order to squeeze them into the range between zero and one. This range of values is easier for the activation functions to digest.
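Both steps fit in two lines of numpy. A minimal sketch on mock data (a stand-in for the real MNIST arrays):

```python
import numpy as np

# Mock batch of two 28x28 grayscale images, values 0-255 like MNIST
images = np.random.randint(0, 256, size=(2, 28, 28)).astype('float32')

# 3D -> 4D: append the channel axis Keras expects (samples, rows, cols, channels)
images = images.reshape((images.shape[0], 28, 28, 1))

# Scale the gray values from [0, 255] into [0, 1]
images /= 255.0

print(images.shape)   # (2, 28, 28, 1)
```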


Configure the model

I have opted for an extremely lean configuration of my network. On Kaggle, for example, you can find much more powerful, but also larger configurations.
However, I benefit from the reduced complexity later when I want to highlight a few interesting details of the network. In addition, the network has all the important components that you need for your experiments.

The Convolutions

I use Keras to get a fresh model and then add the first convolutional layer. The 32 and the (3, 3) say that I want to train 32 different filters of size 3 by 3 pixels. In action, a 3 x 3 window moves over the image and applies all 32 filters to each section. Each of these filters focuses on one specific small detail, such as a horizontal black line. If the filter finds this detail, it reports a strong activation; if it does not find it, the filter remains dark. The attribute padding indicates whether the margins of the image should be taken into account or not; 'same' means that they should. As the activation function I have used relu. It is a little more efficient than sigmoid and scales linearly over the value range, which sigmoid precisely does not. You should try different functions when tuning.
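To get a feeling for what a single filter does, here is a minimal numpy sketch. The kernel values are illustrative, not learned weights: a 3 by 3 "horizontal line" detector slides over a toy image with 'same' padding, followed by relu:

```python
import numpy as np

# A 7x7 toy image containing a single horizontal line of bright pixels
image = np.zeros((7, 7))
image[3, 1:6] = 1.0

# One of the 32 filters might learn to look for exactly such a line:
# bright center row, dark rows above and below
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 2.0,  2.0,  2.0],
                   [-1.0, -1.0, -1.0]])

# 'same' padding: one pixel of zeros so the output keeps the input size
padded = np.pad(image, 1)
feature_map = np.zeros_like(image)
for r in range(7):
    for c in range(7):
        window = padded[r:r + 3, c:c + 3]
        feature_map[r, c] = np.maximum(0.0, np.sum(window * kernel))  # relu

# The filter "lights up" along the line and stays dark everywhere else
print(feature_map[3])   # [2. 4. 6. 6. 6. 4. 2.]
```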

On pooling

Then comes the first pooling layer, which reduces the result of the filters by a factor of 2 in each direction. Traditional CNNs use pooling layers, but the trend is increasingly toward normalizing the result instead of condensing it. This often gives better predictions, but is a lot slower to train. You can swap the two approaches and compare the results.
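The condensing step itself is simple enough to sketch in numpy: 2 x 2 max pooling keeps only the strongest activation in each patch, halving the size in each direction:

```python
import numpy as np

# A 4x4 feature map, e.g. the output of a convolution layer
feature_map = np.array([[1., 3., 0., 2.],
                        [4., 2., 1., 0.],
                        [0., 1., 5., 1.],
                        [2., 0., 1., 3.]])

# Split into 2x2 patches and take the maximum of each one
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4. 2.]
                #  [2. 5.]]
```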

One pair is not enough

I stack a total of four convolution layers on top of each other and reduce the number of filters from 32 to 16 at the end. That is rather unusual. Normally you inflate the number of filters toward the back: for example, you start with 32 filters and then expand to 64. But I am doing exactly the opposite, for two reasons. On the one hand, the filters should afterwards carve out clear characteristics for ten different digits, and a space of 16 three-by-three matrices should be more than sufficient for that. On the other hand, I would like to visualize afterwards how the convolutions encode the individual digits, and that is easier to see with a manageable number.
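Putting the pieces together, the whole stack can be reconstructed from the model summary further down. Two details are assumptions on my part, since the post does not name them explicitly: the l2 factor of 0.001 on the first Dense layer, and using the rate 0.4 for all three dropout layers:

```python
from keras import models, layers, regularizers

model = models.Sequential()
# four convolution layers, the last one reduced to 16 filters
model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                        input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(16, (3, 3), padding='same', activation='relu'))
model.add(layers.Dropout(0.4))
# transition to the fully connected part
model.add(layers.Flatten())
model.add(layers.Dropout(0.4))
model.add(layers.Dense(128, activation='relu',
                       kernel_regularizer=regularizers.l2(0.001)))  # factor: assumption
model.add(layers.Dropout(0.4))
model.add(layers.Dense(10, activation='softmax'))

model.summary()
```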

Bending straight at the border

At the transition from the CNN to the fully connected neural network, the Flatten layer bends the 3D output tensor of the convolutions into a 1D tensor: 16 filters x 3 by 3 pixels makes a vector of length 144.
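The arithmetic can be checked in one line of numpy:

```python
import numpy as np

# Output of the last convolution layer: 3 x 3 pixels x 16 filters
conv_output = np.arange(3 * 3 * 16).reshape(3, 3, 16)

# Flatten bends the 3D tensor into a 1D feature vector
flat = conv_output.reshape(-1)
print(flat.shape)   # (144,)
```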


I have added Dropout layers to the model in a few places. This is a very simple means of avoiding overfitting. The 0.4 means that 40% of the results per iteration are selected at random and discarded. Since the same samples are used for training over and over again in each epoch, this prevents the decisions (aka weightings) of the network from being tied to a few prominent features. Neural networks behave similarly to us humans: they always choose the easiest way. The dropout layers block the easy paths and force the network to look for minima in other dimensions as well.
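A numpy sketch of what a Dropout(0.4) layer does to a feature vector during one training step (Keras additionally rescales the surviving values by 1/(1 - rate), which the sketch imitates):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones(144)   # stand-in for the flattened feature vector

# 40% of the values are zeroed at random on each training step
mask = rng.random(144) >= 0.4
dropped = activations * mask / (1 - 0.4)   # survivors are scaled up by 1/(1 - rate)

print(dropped.shape)   # (144,)
```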


To the first Dense layer I added a regularizer parameter. Regularizers approach the overfitting problem from the other side: they prevent the network from overinterpreting individual weights. Voices that are too loud are turned down. There is an analogy here too: in Harry Potter's class, Hermione always answers. Since she usually knows the correct answer, she is no longer called on as often; she is turned down so that the others also get a chance. The factor indicates how strongly you want to turn things down.
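In Keras this is the kernel_regularizer argument of the Dense layer. What an l2 regularizer adds to the loss can be sketched in numpy; the factor 0.001 is an illustrative value, the post does not name one:

```python
import numpy as np

factor = 0.001   # illustrative regularization factor

def l2_penalty(weights):
    # the extra cost an l2 kernel regularizer adds to the loss
    return factor * np.sum(weights ** 2)

quiet = np.full(10, 0.3)   # many moderately loud weights
loud = np.zeros(10)
loud[0] = 3.0              # one dominant "Hermione" weight

# the single loud voice costs ten times more than the balanced choir
print(l2_penalty(quiet) < l2_penalty(loud))   # True
```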

What else could be done

I have left out one means of improving the model: augmenting the data. The individual samples are distorted in order to obtain more training data. Actually, that is not difficult to add, and it pushes the solution far ahead; to be precise, the top MNIST models on Kaggle all augment their data. However, it inflates the Python code with a few generators. I would like to leave out this additional complexity here and concentrate on the essentials.
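The core idea of augmentation fits into a few lines of numpy: a shifted copy of a digit is a perfectly valid new sample with the same label. (Keras offers ImageDataGenerator for this, which also handles rotations and zooms via generators.)

```python
import numpy as np

# A toy 5x5 "digit": a vertical stroke
digit = np.zeros((5, 5))
digit[1:4, 2] = 1.0

# Simplest possible augmentation: shift the image one pixel to the right
shifted = np.roll(digit, 1, axis=1)

print(digit[2])     # [0. 0. 1. 0. 0.]
print(shifted[2])   # [0. 0. 0. 1. 0.]
```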

Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 28, 28, 32)        320
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 32)        0
conv2d_2 (Conv2D)            (None, 14, 14, 32)        9248
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 32)          0
conv2d_3 (Conv2D)            (None, 7, 7, 32)          9248
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 32)          0
conv2d_4 (Conv2D)            (None, 3, 3, 16)          4624
dropout_1 (Dropout)          (None, 3, 3, 16)          0
flatten_1 (Flatten)          (None, 144)               0
dropout_2 (Dropout)          (None, 144)               0
dense_1 (Dense)              (None, 128)               18560
dropout_3 (Dropout)          (None, 128)               0
dense_2 (Dense)              (None, 10)                1290
=================================================================
Total params: 43,290
Trainable params: 43,290
Non-trainable params: 0
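The Param # column can be recomputed by hand, which makes a nice sanity check: a Conv2D layer has (kernel height x kernel width x input channels + 1 bias) x filters parameters, a Dense layer (inputs + 1) x units:

```python
def conv_params(in_channels, filters, k=3):
    # each filter: k*k weights per input channel, plus one bias
    return (k * k * in_channels + 1) * filters

def dense_params(inputs, units):
    # each unit: one weight per input, plus one bias
    return (inputs + 1) * units

counts = [
    conv_params(1, 32),      # conv2d_1: 320
    conv_params(32, 32),     # conv2d_2: 9248
    conv_params(32, 32),     # conv2d_3: 9248
    conv_params(32, 16),     # conv2d_4: 4624
    dense_params(144, 128),  # dense_1: 18560
    dense_params(128, 10),   # dense_2: 1290
]
print(sum(counts))   # 43290
```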

In the summary you can see very clearly how the pairs of convolution and pooling layers work together. The first convolution layer, for example, produces 32 filtered images of size 28 by 28 pixels, and the pooling layer condenses this to half in each direction, i.e. to 14 by 14 pixels.
The Flatten layer ultimately bends the 3 x 3 x 16 tensor into a vector of length 144 in order to pump the extracted features into the fully connected part of the network.

Train the model
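The training call might look like the following sketch. It assumes the `model`, `train_images` and `train_labels` from the sections above; the optimizer and the batch size are my assumptions, since the post does not state them. The validation_split of 0.1 matches the 54,000/6,000 split in the log below:

```python
model.compile(optimizer='rmsprop',              # assumption: any standard optimizer works here
              loss='categorical_crossentropy',  # labels are one-hot encoded
              metrics=['accuracy'])

history = model.fit(train_images, train_labels,
                    epochs=15,
                    batch_size=64,              # assumption: not stated in the post
                    validation_split=0.1)       # 54,000 training / 6,000 validation samples
```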

Train on 54000 samples, validate on 6000 samples
Epoch 1/15
54000/54000 [==============================] - 6s 112us/step - loss: 1.0720 - acc: 0.7441 - val_loss: 0.3139 - val_acc: 0.9468
Epoch 2/15
54000/54000 [==============================] - 4s 71us/step - loss: 0.3658 - acc: 0.9182 - val_loss: 0.1452 - val_acc: 0.9757
Epoch 3/15
54000/54000 [==============================] - 4s 68us/step - loss: 0.2454 - acc: 0.9421 - val_loss: 0.1299 - val_acc: 0.9738
Epoch 4/15
54000/54000 [==============================] - 4s 69us/step - loss: 0.2036 - acc: 0.9523 - val_loss: 0.0965 - val_acc: 0.9822
Epoch 5/15
54000/54000 [==============================] - 4s 69us/step - loss: 0.1733 - acc: 0.9592 - val_loss: 0.0880 - val_acc: 0.9830
Epoch 6/15
54000/54000 [==============================] - 4s 71us/step - loss: 0.1543 - acc: 0.9645 - val_loss: 0.0770 - val_acc: 0.9870
Epoch 7/15
54000/54000 [==============================] - 4s 69us/step - loss: 0.1407 - acc: 0.9677 - val_loss: 0.0819 - val_acc: 0.9855
Epoch 8/15
54000/54000 [==============================] - 4s 69us/step - loss: 0.1307 - acc: 0.9707 - val_loss: 0.0685 - val_acc: 0.9880
Epoch 9/15
54000/54000 [==============================] - 4s 69us/step - loss: 0.1209 - acc: 0.9729 - val_loss: 0.0648 - val_acc: 0.9892
Epoch 10/15
54000/54000 [==============================] - 4s 73us/step - loss: 0.1107 - acc: 0.9760 - val_loss: 0.0722 - val_acc: 0.9875
Epoch 11/15
54000/54000 [==============================] - 4s 70us/step - loss: 0.1061 - acc: 0.9760 - val_loss: 0.0655 - val_acc: 0.9875
Epoch 12/15
54000/54000 [==============================] - 4s 71us/step - loss: 0.1023 - acc: 0.9778 - val_loss: 0.0621 - val_acc: 0.9880
Epoch 13/15
54000/54000 [==============================] - 4s 70us/step - loss: 0.0992 - acc: 0.9784 - val_loss: 0.0706 - val_acc: 0.9878
Epoch 14/15
54000/54000 [==============================] - 4s 72us/step - loss: 0.0940 - acc: 0.9796 - val_loss: 0.0567 - val_acc: 0.9902
Epoch 15/15
54000/54000 [==============================] - 4s 70us/step - loss: 0.0896 - acc: 0.9802 - val_loss: 0.0595 - val_acc: 0.9887
It took: 59.30845069885254

Test the model

10000/10000 [==============================] - 1s 93us/step
Loss: 0.04145347746014595
Accuracy: 0.9926
There is one thing to keep in mind when training the model: we have invited chance to the party. The initial filters and weights are selected randomly, the dropout layers strike randomly, and the samples are reshuffled before each epoch. That is why every run differs from the others.
Still, one shouldn't hope for a miracle. In this case I got a test accuracy of 99.26%. Even if I trained the model a thousand times, no model with 99.5% would come out, because the configuration simply does not allow for it.

Save the model

Saved model to disk
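The classic way to produce the "Saved model to disk" line is to store the architecture as JSON and the weights as HDF5 (a sketch, assuming the trained `model` from above; a single `model.save('model.h5')` works just as well):

```python
# architecture as JSON ...
with open('model.json', 'w') as json_file:
    json_file.write(model.to_json())

# ... and the learned weights as HDF5
model.save_weights('model.h5')
print('Saved model to disk')
```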

Interpret training progress

You can see that the curve over the course of the training is quite erratic in this case. This is an indication that the individual epochs apparently optimized into very different minima, which can be taken as a hint that the learning rate of the optimization function was not chosen well. Keras presets the learning rate with a balanced default that fits many problems, but of course not every one. You can either readjust the learning rate manually or work with a Keras optimizer.
But since I want to concentrate on the convolutions here, I'll leave the curve as it is. By the way, there is a very well-structured blog on this topic that can serve as a good aid for choosing the right optimizer: How to pick the best learning rate for your machine learning project

Fig 1: Increase in accuracy over the epochs

Fig 2: Decrease in the error over the epochs

Evaluate the model

Visualization of the filters

The question arises as to how one visualizes a filter. Actually, there isn't much more to see than a small tensor of weights. But you can show how a filter works: for example, one can pass the image of a digit through the filter and show what the filter makes of it. I don't find the result particularly illuminating, but it is still worth seeing. I have activated the first nine filters per layer for one digit.
It is important to know what a bright point means: a bright point means that the filter has been activated there. The filter effectively says: "Here, on this 3 by 3 pixel area, I have found my pattern, and I mark the spot with a bright point."

Layer index 0

Layer index 2

Layer index 4

Layer index 6

Fig 3: The result of nine different filters per layer when you look at a 4.
I find the result misleading, because one could intuitively assume that the result gets worse and worse from one layer to the next. In the first layer you can still imagine that the filters find contours, shadows and areas, and then the result fades away until only a few points remain.
In reality, however, it is exactly the other way around. In the first layer you can still see the digit relatively precisely, because a lot of very small filter features are found. In the deeper layers the images get darker and darker, as the filters look for larger and larger patterns and cannot find them. Again: a bright point only means that the filter has found its pattern, not what its pattern is.

Dream in white noise

That's why I resort to a trick here. I don't take the picture of a digit, but just noise: a 28 by 28 pixel area with randomly chosen gray pixels. It's like writing on white paper with a white pen. All possible digits are hidden in the noise, since a filter only checks whether its pattern is there, not whether it is there exclusively. (Exclusive here would mean: "If I am in the room, there is no space for others.")
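Generating such a noise input is a one-liner (the seed is arbitrary); it is shaped as the 4D tensor the network expects, so it can be fed straight into the model:

```python
import numpy as np

rng = np.random.default_rng(42)

# One 28x28 "image" of pure noise: randomly chosen gray values in [0, 1),
# shaped as (samples, rows, cols, channels)
noise = rng.random((1, 28, 28, 1)).astype('float32')

print(noise.shape)   # (1, 28, 28, 1)
```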

Layer index 0
Conv2D (1, 28, 28, 32)

Fig. 3: Layer Index 2 Conv2D (1, 14, 14, 32)

Fig. 4: Layer Index 4 Conv2D (1, 7, 7, 32)

Fig. 5: Layer Index 6 Conv2D (1, 3, 3, 16)

Fig. 6: Visualization of what the different filters of a layer filter out of a noise

In the first layer you can now clearly see how evenly distributed small patterns are recognized. This creates the structure that the filters of the next layer take up.
In layer 4 you can see impressively how more complex structures are assembled from smaller ones. There is an entire research direction built around this effect, called DeepDream.
In the last layer, what has been recognized is now coded in such a way that it can be easily classified by a fully connected network.
Attention, nerds: interestingly, the trained model recognizes the noise as an 8. This illustrates that we only look for what is there, not for what should not be there. This is a problem that is currently barely solved and one that we encounter very often. For example, if you search Google for "threads on screws that are not metric", the top results are articles about metric screws. If, on the other hand, I tell my son, "Don't tidy up your room", it works really well 😉

0 ==> 0.45%
1 ==> 0.01%
2 ==> 0.14%
3 ==> 0.01%
4 ==> 0.02%
5 ==> 0.38%
6 ==> 1.20%
7 ==> 0.00%
8 ==> 97.77%
9 ==> 0.02%

What does the fully connected layer get?

What comes out of the last convolution layer is pushed as a feature vector into the fully connected part of the DNN.
To show that the complexity has now become much lower, I picked out two examples each of two different digits and printed all the filters of the last layer for them. As a reference, the first column shows the result for the noise.
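To print the filters of the last layer, one can build a helper model that stops at that layer. A sketch, assuming the trained `model` from above and a `samples` array shaped like the training images; the layer name `conv2d_4` is taken from the model summary:

```python
from keras import models

# Helper model: same input as the trained network, but the output is taken
# from the last convolution layer instead of the softmax layer
encoder = models.Model(inputs=model.input,
                       outputs=model.get_layer('conv2d_4').output)

codes = encoder.predict(samples)   # shape: (n_samples, 3, 3, 16)
```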

Fig 6: Two examples of coded digits

I think you can see very clearly that the pairs of digits were encoded almost identically, even though they were written slightly differently. The CNN has practically translated the handwritten digits into its own QR code. - Great -


With this blog, the introduction to machine vision and hearing is done. A few techniques and advanced configurations are still missing before you can stand on a solid technical foundation with your knowledge and skills.
The next blog will, after the fully connected and the convolutional NN, deal with the third major deep neural network architecture: the recurrent NN.