Entry into convolutional neural networks with Keras
# AI workshop
What's this about
In my last blog, Getting Started with Neural Networks with Keras, I described in some detail how to build a “workbench” for working with simple neural networks. As an example I used the MNIST database of handwritten digits.
The prediction accuracy of around 96.5% was impressive, but not good enough for real-world use. Humans reach about 99.7% on this data set, so there is still a lot of room for improvement.
In this blog I want to try to reach an accuracy of 99.1% with very simple means. With a little luck, a single CNN architecture is enough for that.
What are CNNs?
Convolutions are one of two common architectures that make a network deep (the other is the RNN, the Recurrent Neural Network, which I will write about in another blog). Depth refers to the number of layers in a network, where several layers are combined into functional blocks. Here the special functional block is the convolution of images.
The very simple, fully connected neural networks are a very powerful tool, but they are sensitive to hypothesis spaces that are too large (small reminder: a single data point is called a sample, the attributes of a sample are called features, and the number of features determines the number of input dimensions of the network, which spans the hypothesis space).
Colloquially, one can say that too many descriptive attributes confuse the net. By the way, we humans function in a very similar way. Try to derive the emotion from the following sentence: “Oh, I don't know either. It's raining, the sky is gray and today I somehow got up on the wrong side of the bed.” It works, but the following two characters express the same emotion and can be grasped many times faster and more unambiguously 🙁.
This is exactly what the convolution layers do before the fully connected layers. They recognize and emphasize unique patterns in a large number of features.
Let's take the picture of a cat as an illustration. Convolution layers work their way up from very fine structures, i.e. a small line, a point or a color, to increasingly larger patterns, such as a cat's ear, a cat's nose and a cat's eyes. The fully connected layers then no longer see a heap of pixels, but are told: "There is a cat's head in the picture." This reduces the number of features from which the fully connected part of the network has to derive its answer, which makes its job easier.
There is a really fantastic YouTube video that clearly introduces the math behind CNNs. A must-see: it turns many a ? into a ! in your head. A friendly introduction to Convolutional Neural Networks and Image Recognition
From getting data to feature engineering
In my last blog, I described in great detail how to "clamp the samples into your workbench" for training and testing. So I can skip that here and just list the code blocks:
Using TensorFlow backend.
There is one small difference from last time.
I convert the 3D image tensor into a 4D tensor, because Keras needs this as input for its convolutional layers:
* 1st dimension: the sample index
* 2nd dimension: the image row
* 3rd dimension: the image column
* 4th dimension: the color channels (here only one, since these are grayscale images; otherwise there would be three, for red, green and blue)
I divide all gray values by their maximum value (255) in order to squeeze them into the range between zero and one. This range of values is easier for the activation functions to digest.
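The reshaping and scaling described above can be sketched in a few lines. A small random batch stands in for the uint8 array that the Keras MNIST loader returns, so the sketch is self-contained:

```python
import numpy as np

# Stand-in for the (60000, 28, 28) uint8 tensor that
# keras.datasets.mnist.load_data() returns.
x_train = np.random.randint(0, 256, size=(10, 28, 28), dtype=np.uint8)

# 3D (samples, rows, cols) -> 4D (samples, rows, cols, channels);
# a single channel, because the images are grayscale.
x_train = x_train.reshape(-1, 28, 28, 1)

# Squeeze the gray values from [0, 255] into [0, 1].
x_train = x_train.astype("float32") / 255.0

print(x_train.shape)  # (10, 28, 28, 1)
```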
Configure the model
I have opted for an extremely lean configuration of my network. On Kaggle, for example, you can find much more powerful, but also larger configurations.
However, I benefit from the reduced complexity later when I want to highlight a few interesting details of the network. In addition, the network has all the important components that you need for your experiments.
I use Keras to get a fresh model. Then I add the first convolutional layer. Its parameters say that I want to train 32 different filters of size 3 by 3 pixels. In action, a 3 x 3 window moves over the image and applies all 32 filters to each section. Each of these filters focuses on a specific small detail such as a horizontal black line. If the filter finds this line, it reports the result; if it does not find it, the filter remains dark. The padding attribute indicates whether the margins should be taken into account or not; 'same' means that they are. As activation function I use relu. It is a little more efficient than sigmoid and scales linearly over the value range, which sigmoid deliberately does not. You should try different functions when tuning.
Then comes the first pooling layer, which reduces the filter output by a factor of 2 in each direction. The traditional CNN uses pooling layers, but the trend is moving more and more towards normalizing the result instead of condensing it. This often gives better predictions, but is a lot slower to train. You can swap them and compare the results with each other.
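The mechanics of one filter plus pooling can be sketched in plain NumPy (the function names and the horizontal-edge kernel are mine, for illustration only, not from the blog's code):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Apply one 3x3 filter with 'same' (zero) padding -- the margins
    are kept, so the output has the same height and width."""
    padded = np.pad(image, 1)
    out = np.zeros_like(image, dtype=np.float32)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + 3, c:c + 3] * kernel)
    return out

def max_pool_2x2(image):
    """Keep only the strongest activation in each 2x2 window,
    halving the resolution in both directions."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return image[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

image = np.random.rand(28, 28).astype(np.float32)
# A hypothetical filter that responds to horizontal edges.
kernel = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=np.float32)

feature_map = conv2d_same(image, kernel)   # still 28 x 28
pooled = max_pool_2x2(feature_map)         # 14 x 14
print(feature_map.shape, pooled.shape)
```

A Conv2D layer trains 32 such kernels at once; MaxPooling2D does exactly this window-wise maximum per channel.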
One pair is not enough
I stack a total of four convolution layers on top of each other and reduce the number of filters at the end from 32 to 16. That is rather unusual. Normally you inflate the number of filters towards the back: for example, you start with 32 filters and then expand to 64. But I'm doing exactly the opposite, for two reasons. On the one hand, the filters at the end only have to encode clear characteristics for ten different digits, and a space of 16 three-by-three matrices should be more than sufficient for that. On the other hand, I would like to visualize afterwards how the convolutions encode the individual digits, and that is easier to see with a manageable amount.
Flattening at the boundary
At the transition from the CNN to the fully connected network, the Flatten layer bends the 3D output tensor of the convolutions into a 1D tensor: 16 filters x 3 by 3 pixels makes a vector of length 144.
I have added Dropout layers at a few places in the model. This is a very simple means of avoiding overfitting. The 0.4 means that 40% of the activations are selected at random and discarded in each iteration. Since the same samples are used for training over and over again in each epoch, this prevents the decisions (aka weights) of the network from being tied to a few prominent features. Neural networks behave similarly to us humans: they always choose the easiest way. The dropout layers block the easy paths and force the network to look for minima in other dimensions as well.
To the first Dense layer I added a regularizer. Regularizers approach the overfitting problem from the other side: they prevent the network from overinterpreting individual weights. Voices that are too loud are turned down. There is an analogy here, too: in Harry Potter's class, Hermione always answers. Since she also usually knows the correct answer, she is no longer called on as often; she is turned down so that the others also get a chance. The factor indicates how strongly you want to regulate down.
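Putting the pieces together, a model definition that reproduces the summary further down might look like this. The l2 factor of 0.001 is my assumption (the blog does not state it), as is using 0.4 for all three dropout layers:

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    # Four convolution layers; the filter count shrinks from 32 to 16.
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
    layers.Dropout(0.4),
    # Bend the 3x3x16 tensor into a vector of length 144.
    layers.Flatten(),
    layers.Dropout(0.4),
    # The l2 factor is an assumption; tune it for your own runs.
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.4),
    layers.Dense(10, activation='softmax'),
])
print(model.count_params())  # 43290
```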
What else could be done
I have left out one means of improving the model: augmenting the data. The individual samples are distorted in order to obtain more training data. It's actually not that difficult to add, and it moves the solution far ahead; the top MNIST models on Kaggle all augment their data. However, it inflates the Python code with a few generators. I would like to leave out this additional complexity here and concentrate on the essentials.
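To give at least an idea of augmentation: distorting a sample, e.g. shifting it by one pixel, already multiplies the training data. Keras provides generators for this; the `shift` helper below is just my own minimal illustration of the principle:

```python
import numpy as np

def shift(image, dr, dc):
    """Shift a 2D image by (dr, dc) pixels, padding with zeros --
    one simple distortion used to augment training data."""
    out = np.zeros_like(image)
    src = image[max(0, -dr):image.shape[0] - max(0, dr),
                max(0, -dc):image.shape[1] - max(0, dc)]
    out[max(0, dr):max(0, dr) + src.shape[0],
        max(0, dc):max(0, dc) + src.shape[1]] = src
    return out

image = np.random.rand(28, 28)
# One sample becomes nine: every combination of -1/0/+1 pixel shifts.
augmented = [shift(image, dr, dc)
             for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
print(len(augmented))  # 9
```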
Layer (type) Output Shape Param #
conv2d_1 (Conv2D) (None, 28, 28, 32) 320
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 32) 0
conv2d_2 (Conv2D) (None, 14, 14, 32) 9248
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 32) 0
conv2d_3 (Conv2D) (None, 7, 7, 32) 9248
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 32) 0
conv2d_4 (Conv2D) (None, 3, 3, 16) 4624
dropout_1 (Dropout) (None, 3, 3, 16) 0
flatten_1 (Flatten) (None, 144) 0
dropout_2 (Dropout) (None, 144) 0
dense_1 (Dense) (None, 128) 18560
dropout_3 (Dropout) (None, 128) 0
dense_2 (Dense) (None, 10) 1290
Total params: 43,290
Trainable params: 43,290
Non-trainable params: 0
In the summary you can see very clearly how a pair of convolution layer and pooling layer work together. The first convolution layer, for example, produces 32 filtered images of size 28 by 28 pixels. The pooling layer then concentrates this by half in each direction, to 14 by 14 pixels.
The Flatten layer ultimately bends the 3x3x16 tensor into a vector of length 144 in order to pump the extracted features into the fully connected part of the network.
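The parameter counts in the summary can be checked by hand: a convolution layer has kernel_rows x kernel_cols x input_channels weights per filter plus one bias per filter, and a dense layer has one weight per input-output pair plus one bias per output:

```python
def conv_params(kernel, in_channels, filters):
    # (kernel * kernel * in_channels) weights + 1 bias, per filter
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, outputs):
    # one weight per input-output pair + 1 bias per output
    return (inputs + 1) * outputs

print(conv_params(3, 1, 32))    # 320   -> conv2d_1
print(conv_params(3, 32, 32))   # 9248  -> conv2d_2 and conv2d_3
print(conv_params(3, 32, 16))   # 4624  -> conv2d_4
print(dense_params(144, 128))   # 18560 -> dense_1
print(dense_params(128, 10))    # 1290  -> dense_2

total = (conv_params(3, 1, 32) + 2 * conv_params(3, 32, 32)
         + conv_params(3, 32, 16) + dense_params(144, 128)
         + dense_params(128, 10))
print(total)                    # 43290
```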
Train the model
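The training call itself is not shown here, so this is a hedged sketch: a tiny stand-in model and random data keep it self-contained, and the optimizer, batch size and epoch count are my assumptions. The `validation_split=0.1` reproduces the 54,000 / 6,000 split visible in the log below:

```python
import numpy as np
from tensorflow.keras import layers, models, utils

# Tiny stand-in; in the blog, `model` is the CNN defined above and
# x_train/y_train are the prepared MNIST tensors.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
x_train = np.random.rand(100, 28, 28, 1).astype("float32")
y_train = utils.to_categorical(np.random.randint(0, 10, 100), 10)

model.compile(optimizer='adam',               # assumption
              loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    epochs=2, batch_size=32,  # assumptions
                    validation_split=0.1, verbose=0)
print(sorted(history.history.keys()))
```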
Train on 54000 samples, validate on 6000 samples
54000/54000 [==============================] - 6s 112us/step - loss: 1.0720 - acc: 0.7441 - val_loss: 0.3139 - val_acc: 0.9468
54000/54000 [==============================] - 4s 71us/step - loss: 0.3658 - acc: 0.9182 - val_loss: 0.1452 - val_acc: 0.9757
54000/54000 [==============================] - 4s 68us/step - loss: 0.2454 - acc: 0.9421 - val_loss: 0.1299 - val_acc: 0.9738
54000/54000 [==============================] - 4s 69us/step - loss: 0.2036 - acc: 0.9523 - val_loss: 0.0965 - val_acc: 0.9822
54000/54000 [==============================] - 4s 69us/step - loss: 0.1733 - acc: 0.9592 - val_loss: 0.0880 - val_acc: 0.9830
54000/54000 [==============================] - 4s 71us/step - loss: 0.1543 - acc: 0.9645 - val_loss: 0.0770 - val_acc: 0.9870
54000/54000 [==============================] - 4s 69us/step - loss: 0.1407 - acc: 0.9677 - val_loss: 0.0819 - val_acc: 0.9855
54000/54000 [==============================] - 4s 69us/step - loss: 0.1307 - acc: 0.9707 - val_loss: 0.0685 - val_acc: 0.9880
54000/54000 [==============================] - 4s 69us/step - loss: 0.1209 - acc: 0.9729 - val_loss: 0.0648 - val_acc: 0.9892
54000/54000 [==============================] - 4s 73us/step - loss: 0.1107 - acc: 0.9760 - val_loss: 0.0722 - val_acc: 0.9875
54000/54000 [==============================] - 4s 70us/step - loss: 0.1061 - acc: 0.9760 - val_loss: 0.0655 - val_acc: 0.9875
54000/54000 [==============================] - 4s 71us/step - loss: 0.1023 - acc: 0.9778 - val_loss: 0.0621 - val_acc: 0.9880
54000/54000 [==============================] - 4s 70us/step - loss: 0.0992 - acc: 0.9784 - val_loss: 0.0706 - val_acc: 0.9878
54000/54000 [==============================] - 4s 72us/step - loss: 0.0940 - acc: 0.9796 - val_loss: 0.0567 - val_acc: 0.9902
54000/54000 [==============================] - 4s 70us/step - loss: 0.0896 - acc: 0.9802 - val_loss: 0.0595 - val_acc: 0.9887
It took: 59.30845069885254
Test the model
10000/10000 [==============================] - 1s 93us/step
There is one thing to keep in mind when training the model: we have invited chance to the party. The initial filters and weights are selected randomly, the dropout layers strike randomly, and the samples are reshuffled before each epoch. That is why every run differs from the others.
Still, one shouldn't hope for a miracle. In this case, I got a test accuracy of 99.26%. Even if I train the model a thousand times, no model with 99.5% will come out of it, because the configuration simply does not allow that.
Saved model to disk
Interpret training progress
You can see that the curve over the course of the training is very unsteady in this case. This is an indication that the individual epochs have apparently optimized themselves into very different minima, and it can be read as a hint that the learning rate of the optimization function was chosen poorly. Keras presets the learning rate with a balanced default that fits many, but of course not every, problem. You can either readjust the learning rate manually or work with a Keras optimizer.
But since I want to concentrate on the convolutions here, I'll leave the curve as it is. By the way, there is a very well structured blog on this, which can serve as a good aid for the correct choice of optimizer: How to pick the best learning rate for your machine learning project
Fig 1: Increase in accuracy over the epochs
Fig 2: Decrease in the error over the epochs
Evaluate the model
Visualization of the filters
The question arises as to how to visualize a filter. There really isn't much more to see than a small tensor of weights. But you can show how a filter works: for example, one can pass the image of a digit through the filter and show what the filter does with it. I don't find the result particularly illuminating, but it is still worth seeing. I have activated the first nine filters per layer for one digit.
It is important to know what a bright point means: a bright point means that the filter has been activated there. The filter effectively says: “Here on this 3 by 3 pixel area I have found my pattern, and I mark the spot with a bright point.”
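Such activation images can be produced by wrapping the trained network in a second model that exposes an intermediate layer's output. The stand-in model below is untrained and reduced to the first layer pair, just to show the technique:

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in; in the blog this is the trained CNN.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
])

# A second model whose output is the first convolution layer's
# output, so its 32 feature maps can be plotted directly.
activation_model = models.Model(inputs=model.input,
                                outputs=model.layers[0].output)

digit = np.random.rand(1, 28, 28, 1).astype("float32")
activations = activation_model.predict(digit, verbose=0)
print(activations.shape)  # (1, 28, 28, 32): one 28x28 map per filter
```

Bright points in these maps are large values in `activations`; relu guarantees they are never negative.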
Layer index 0
Layer index 2
Layer index 4
Layer index 6
Fig 3: The result of nine different filters per layer when you look at a 4.
I find the result misleading because one could intuitively assume that the result gets worse and worse from one layer to the other. In the first layer you can still assume that the filter will find contours, shadows and areas and then the result disappears more and more until only a few points remain.
In reality, however, it is exactly the other way around. In the first layer you can still see the number relatively precisely because a lot of very small filter features are found. In the deeper layers, the images get darker and darker as the filters are looking for larger and larger patterns and cannot find them. Again: A bright point only means that the filter has found its pattern, but not what its pattern is.
Dream in white noise
That's why I'm resorting to a trick here: I'm not taking the picture of a digit, just noise, i.e. a 28 by 28 pixel area with randomly chosen gray values. It's like writing on white paper with a white pen. All possible digits are hidden in the noise, since each filter only checks whether its pattern is there, but not whether it is there exclusively. (Exclusive here would mean: "When I am in the room, there is no room for others.")
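The noise input is just a tensor of random gray values in the same shape as a digit image; the prediction call is only sketched as a comment because it needs the trained model:

```python
import numpy as np

# A 28x28 patch of randomly chosen gray values -- "white paper,
# white pen" -- in the 4D shape the network expects.
noise = np.random.rand(1, 28, 28, 1).astype("float32")

# Feeding it through the trained CNN would yield one probability
# per digit class:
# probs = model.predict(noise)[0]

print(noise.shape)  # (1, 28, 28, 1)
```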
Layer index 0: Conv2D (1, 28, 28, 32)
Layer index 2: Conv2D (1, 14, 14, 32)
Layer index 4: Conv2D (1, 7, 7, 32)
Layer index 6: Conv2D (1, 3, 3, 16)
Fig 4: Visualization of what the different filters of a layer filter out of the noise
In the first layer you can now clearly see how evenly distributed small patterns are recognized. This creates the structure that the filters of the next layer take up.
In layer index 4 you can see impressively how more complex structures are assembled from smaller ones. There is even an entire research movement around this called DeepDream.
In the last layer, what has been recognized is now coded in such a way that it can be easily classified by a fully connected network.
Attention, nerds: interestingly, the trained model recognizes the noise as an 8. This illustrates that we only look for what is there, not for what should not be there. This is a problem that is largely unsolved at the moment and one that we encounter very often. For example, if you search Google for “threads on screws that are not metric”, you will find articles about metric screws right at the top. If, on the other hand, I say to my son: "Don't tidy up your room", it works really well 😉
0 ==> 0.45%
1 ==> 0.01%
2 ==> 0.14%
3 ==> 0.01%
4 ==> 0.02%
5 ==> 0.38%
6 ==> 1.20%
7 ==> 0.00%
8 ==> 97.77%
9 ==> 0.02%
What does the fully connected layer get?
What comes out of the last convolution layer is pushed as a feature vector into the fully connected part of the network.
To show that the complexity has now become much lower, I picked out two examples each of two different digits and printed out all the filters of the last layer for them. As a reference, you can see the result for the noise in the first column.
Fig 6: Two examples of coded digits
I think you can see very clearly that the pairs of digits were encoded almost identically, although they were written slightly differently. The CNN has practically translated the handwritten digits into its own QR code. Great!
With this blog, the introduction to machine vision has been made. A few techniques and advanced configurations are still missing before you can stand on a solid technical foundation with your knowledge and skills.
The next blog will, after the fully connected and the convolutional NN, deal with the third major deep neural network architecture: the recurrent NN.