Understanding Neural Networks – Part Three

Training Neural Networks

We talked previously about how an artificial neuron (from now on let's call them Perceptrons, like the cool kids do) generates its output from its inputs using its activation function. We also mentioned briefly that the inputs are scaled by the weights on the input synapses. Now we are going to look at how those weights are at the heart of how a neural network learns.
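
As a quick refresher, here is a minimal sketch of that idea in Python. The two inputs, the weight values and the choice of a sigmoid activation are all made up purely for illustration.

```python
import math

def sigmoid(x):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(inputs, weights, bias):
    # Each input is scaled by the weight on its synapse, the results are
    # summed, and the sum is passed through the activation function
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

print(perceptron_output([0.5, 0.2], [0.4, -0.6], 0.1))  # roughly 0.545
```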

Neural networks are most often used for supervised learning. That means that before we can use them to make useful predictions we have to train them with a lot of complete data. By complete we mean that we have to show them both the input values and the expected output values. This is very important: if you don't already have a lot of this example data then you are going to struggle to use neural networks, or any other supervised machine learning approach. You can use neural networks for unsupervised learning, and we talk about doing this in So you want to build your own Deep Learning Solution?, but mostly we do supervised learning.

Let's assume that we are lucky enough to have a lot of this data with which to train our network. The first thing we need to do is separate it into two unequal groups. The first and largest group, normally around 80%, we will use to train the network; the remaining data we will use to test and verify that our training is on track. You absolutely can't test the learning on data that the network has already seen and used to learn, and without testing you don't know when your network has learned.
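
As a rough sketch of that split, assuming your observations and expected outputs are already sitting in a pair of arrays, scikit-learn's train_test_split does the job. The toy data and the 80/20 ratio below are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 observations with 3 input features each,
# plus the expected output for every observation
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# Hold back 20% so we can test on data the network has never seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```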

What we are going to do is present the input values, one set of observations at a time, to the network and then measure the results at the output layer. We compare each actual output value with the expected value and measure the difference. We use an error function, more usually called a loss function, to calculate the difference between the expected output and the one the network actually gave us. Those of you who have come from conventional machine learning or statistics will recognise this: it is how we fit a line in linear regression. We may not be trying to do regression with the network, however, so we tend to have to select a loss function in a similar way to how we selected the activation function.
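
To make that concrete, here is a small sketch of one of the simplest loss functions, mean squared error, the same measure of difference used when fitting a line by least squares. The sample numbers are invented for illustration.

```python
import numpy as np

def mean_squared_error(expected, actual):
    # Average of the squared differences between what the network
    # should have produced and what it actually produced
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean((expected - actual) ** 2)

print(mean_squared_error([1.0, 0.0, 1.0], [0.9, 0.2, 0.6]))  # 0.07
```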

Loss functions come in many varieties and the one which is right for you will depend on what your network is going to predict. If you have already chosen an activation function this should narrow the field remarkably for your loss function, as they tend to go together. If you are doing categorical prediction (predicting which category an observation falls into) then you may well have used the sigmoid activation function, and now Binary Cross Entropy would be a great choice for a loss function. This probably sounds quite difficult, but often once you have made the first decision the rest naturally fall into place.
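
Binary cross entropy itself is only a few lines. This sketch assumes the network's output is a sigmoid value between 0 and 1 and the expected value is either 0 or 1, with the sample numbers again made up for illustration.

```python
import numpy as np

def binary_cross_entropy(expected, predicted, eps=1e-12):
    # Penalises confident wrong predictions far more heavily
    # than a squared-error loss would
    expected = np.asarray(expected, dtype=float)
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    return -np.mean(expected * np.log(predicted)
                    + (1 - expected) * np.log(1 - predicted))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))  # roughly 0.28
```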

We present the input data and measure the loss in batches, and after each batch we adjust the weights using an algorithm called Gradient Descent (more usually Stochastic Gradient Descent) to try to reduce the loss. We run through multiple iterations of this process, called epochs, as our network becomes better and better at generating the correct outputs, and we use the test data set to verify that this process is working. The idea is that by using data the network hasn't seen before we have a good independent test of its accuracy.
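
The heart of gradient descent is simply nudging each weight a small step in the direction that reduces the loss. Here is a bare-bones sketch of one such update; the weights, gradients and learning rate are invented, and in a real network the gradients come from back propagation, which we cover next.

```python
def sgd_update(weights, gradients, learning_rate=0.01):
    # Move each weight a small step against its gradient,
    # i.e. in the direction that reduces the loss
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.4, -0.6, 0.1]
gradients = [0.05, -0.02, 0.10]   # made-up gradients for illustration
weights = sgd_update(weights, gradients, learning_rate=0.1)
print(weights)  # [0.395, -0.598, 0.09]
```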

If we just had a single perceptron this process would be very easy, as there would just be one set of weights to adjust, connected directly to our output perceptron. Because we have multiple layers of perceptrons, we can only measure the error, and therefore directly adjust the weights, at the output layer. We then use a process called back propagation to cascade changes to the weights back through the network, in the opposite direction to the calculations the network performs when it is generating outputs.
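
A full derivation will have to wait for the advanced series, but the sketch below shows the idea on a tiny two-layer network with sigmoid activations: the error is measured at the output layer and the chain rule carries the weight adjustments back through the hidden layer. The XOR data, layer sizes and learning rate are all arbitrary choices for illustration, and a constant bias column is bolted onto each layer so the toy problem is actually learnable.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(a):
    # Append a constant column of ones so each layer also learns a bias weight
    return np.hstack([a, np.ones((a.shape[0], 1))])

# Toy data: the four observations of the XOR problem and their expected outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(3, 4))   # 2 inputs + bias -> 4 hidden perceptrons
W2 = rng.normal(size=(5, 1))   # 4 hidden + bias -> 1 output perceptron
lr = 0.5

for epoch in range(10000):
    # Forward pass: generate the network's output for every observation
    hidden = sigmoid(add_bias(X) @ W1)
    output = sigmoid(add_bias(hidden) @ W2)

    # Loss gradient at the output layer (squared-error loss)
    grad_output = (output - y) * output * (1 - output)

    # Back propagation: carry the gradient back to the hidden layer
    # (dropping the bias column, which has no incoming weights)
    grad_hidden = (grad_output @ W2[:-1].T) * hidden * (1 - hidden)

    # Gradient descent step on both layers' weights
    W2 -= lr * add_bias(hidden).T @ grad_output
    W1 -= lr * add_bias(X).T @ grad_hidden

print(np.round(output, 2))   # with a bit of luck, close to [[0], [1], [1], [0]]
```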

I'll explain all of these concepts in an advanced series of articles in the future, but for now this should give you an idea of how the network is learning.

Whilst we run multiple epochs of training we must guard against over-training; this is where the network we create (let's call it a model now that it is trained) appears to have very good accuracy, but when we try it on new data the accuracy suddenly drops off dramatically. What we have done is built a model which fits the training data so closely that it has lost its ability to generate generalised answers for broader data. This is called overfitting and we will talk about it more in the advanced series.

Having trained a model we normally save it and use it in some production system to make predictions. How you save and reload the model will depend on which language and deep learning framework you have chosen to use. Training the model is hard, time consuming and requires a lot of data. We often use Graphics Processing Units (GPUs) to do this work, as the calculations we need are particularly suited to GPUs: they can be run in parallel and each calculation is easily converted into the linear algebra these cards take in their stride. Once trained, however, the model can often be run on much lower-spec hardware and still perform considerably faster than other approaches.
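
As one hedged example, if you happened to be using a recent version of Keras, saving and reloading looks roughly like this. The tiny untrained model and the file name below are stand-ins for your real trained network.

```python
import numpy as np
from tensorflow import keras

# A tiny stand-in model; in practice this would be your trained network
model = keras.Sequential([
    keras.Input(shape=(3,)),
    keras.layers.Dense(4, activation="sigmoid"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Save the model to disk once training is finished...
model.save("my_model.keras")

# ...then, in the production system, reload it and make predictions
restored = keras.models.load_model("my_model.keras")
print(restored.predict(np.random.rand(2, 3)))
```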
