In a previous post I talked about my project to use a Convolutional Neural Network (CNN) as a pre-parser for security logs. I recently got a dump of anonymised data from a friend. It arrived in directories named for the type of data they contained, and each file carried two additional headers with further information. This amounted to around 1GB of files, enough to get me started. Now the hard work began…
The problem with most Deep Learning tutorials is that they leap straight into building the model. They use existing data sets like MNIST, which are built right into the APIs, so getting well-formatted data is just a matter of loading it from a remote source in a line or two of Python. In the real world things are not like this, and most of the work of doing good analytics consists of acquiring and preparing your data. So, for a peek into the real, as opposed to the ideal, world of data science I thought I’d take you through building my data pipeline.
Stage 1
To begin with I needed to keep in mind what I was trying to achieve. I’m not trying to do any form of natural language processing on this data; I’m simply interested in the elements of the structure which identify what the data source is. For that reason I wanted to normalise my data at the start of the pipeline. I decided to use Scala for this, partly because I like using Scala but also because I recognised that I could parallelise the process for a significant speed-up, and I found that much easier in Scala than in any other language I knew.
My plan was to take the type information given by the directory name and the headers and encode it into the file name in a standardised way that I could access later in the pipeline. At the same time I wanted to process every line of data to remove things like dates, times, IP addresses, host and domain names and other easily recognisable data, and replace them with keywords, so a date would be replaced by <DATE> for instance. I expected that this would significantly improve training speed and accuracy. I decided to read a set of regex strings and associated tags from a JSON file and process each line with each regex. I was familiar with processing JSON in Scala so this was pretty quick to implement. I coded the flow to run each file in a given directory in parallel using Scala Futures, which are Scala’s abstraction for running work asynchronously on a pool of threads. Each file would be transformed and written out to an output directory in its new format with a newly encoded filename.
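The normalisation step can be sketched in a few lines. This is a Python illustration rather than the Scala code described above, and the layout of the JSON rules, the patterns themselves and the names load_rules and normalise_line are all assumptions made for the example. Only the idea of reading regex and tag pairs from JSON and applying each one to every line comes from the post.

```python
import json
import re

# Hypothetical rules file content. The tags follow the post's <DATE> example;
# the patterns are simplified illustrations, not the project's real ones.
RULES_JSON = r"""
[
  {"tag": "<DATE>", "pattern": "\\d{4}-\\d{2}-\\d{2}"},
  {"tag": "<TIME>", "pattern": "\\d{2}:\\d{2}:\\d{2}"},
  {"tag": "<IP>",   "pattern": "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"}
]
"""


def load_rules(text):
    """Compile each pattern once so it can be reused for every line."""
    return [(re.compile(rule["pattern"]), rule["tag"]) for rule in json.loads(text)]


def normalise_line(line, rules):
    """Apply every rule in order, replacing each match with its tag."""
    for pattern, tag in rules:
        line = pattern.sub(tag, line)
    return line


rules = load_rules(RULES_JSON)
print(normalise_line("2021-02-04 10:15:00 accepted connection from 192.168.0.12", rules))
# prints: <DATE> <TIME> accepted connection from <IP>
```

The order of the rules matters in a scheme like this, because a later pattern only sees what the earlier ones have left behind.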
The whole thing worked perfectly until it hit some particular files which contained email logs. It then crashed with a heap overflow error from the JVM. I increased the heap size and added a Try to catch the exception, but it turned out there was a problem in the Java regex library which backed the clever Scala code. The error didn’t propagate up, probably because of the Scala Futures, so I couldn’t see exactly what was causing it, but I isolated one particular regular expression and modified it, and the whole thing then ran to completion.
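One way to stop failures disappearing inside parallel tasks is to record, for every file, either its result or the exception it raised. The sketch below shows the idea with Python’s concurrent.futures rather than Scala Futures. The process_file function is a hypothetical stand-in that fails on purpose for email logs, and it does not reproduce the JVM error described above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(name):
    """Hypothetical per-file transform that fails for one kind of file."""
    if "email" in name:
        raise RuntimeError(f"regex failed on {name}")
    return name.upper()


def run_all(names, workers=4):
    """Run every file in parallel and keep failures alongside successes."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_file, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            error = future.exception()  # None when the task succeeded
            if error is None:
                results[name] = future.result()
            else:
                failures[name] = error
    return results, failures


results, failures = run_all(["firewall.log", "email.log"])
for name, error in failures.items():
    print(f"{name}: {error!r}")
```

Keeping the file name next to each error makes it much quicker to find which input, and from there which regex, is responsible.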
Stage 2
I now needed to vectorise the data, that is to turn words into numbers, so I could process it in a CNN model. I decided to switch to Python and leverage the TensorFlow APIs for this. I could have written something very fast in Scala, but I recognised that I would need to provide the dictionary to friends for use in an API with a trained model, as they would have to normalise and vectorise before running the inference. Their vectorisation would have to use the same dictionary as mine, so doing it in Python removed the challenge of creating data structures in one language and then reading them in another, with all the inherent problems that causes. I’d had some bad experiences going the other way, from Python to Scala, and got stung by moving from a dynamically typed to a statically typed language. I have been using Python for much longer than Scala but I always feel like I’m losing something going back to Python, even though it is pretty quick to prototype something in the language. I can always tell quite quickly whether my Scala code is correct or not, but in Python I always feel that I have written bad code, even when I haven’t.
I wrote a program to take two passes at the new data directory. The first pass built the dictionary from each file and the second then used the dictionary to vectorise the data and write it out to files in yet another directory. The first pass also wrote out the “pickled” dictionary object. Parallelising the first pass would have required a map-reduce process, so I decided to just go with a single thread. It actually ran reasonably fast. The second pass was much slower. I decided to limit the vectorisation to only the most common 10,000 words. I could probably have limited that still further, and I need to do some analysis of the vectorisation output to know if that was a good choice. I also padded short lines with zeros and truncated long lines so that every line was 1,000 numbers long. Again, that was probably overkill, but at this stage I wanted to get the code working and worry about tuning it later. I need to go back and multi-thread this part to make it much faster, but that can wait as I’m not expecting a new data dump any time soon.
Finally I wrote out a CSV file for each input file. Each file contained exactly 1,000 numbers per row, with most rows having lots of trailing zeros.
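The two passes can be illustrated with a plain-Python sketch. It deliberately uses collections.Counter instead of the TensorFlow APIs mentioned above, so it is not the code from the project. Splitting on whitespace, reserving 0 for both padding and unknown words, and the function names are all assumptions made for the example.

```python
import csv
import io
import pickle
from collections import Counter

MAX_WORDS = 10_000  # keep only the most common words, as in the post
ROW_LENGTH = 1_000  # every line becomes exactly this many numbers


def build_dictionary(lines, max_words=MAX_WORDS):
    """First pass: count words and number the most common ones from 1."""
    counts = Counter(word for line in lines for word in line.split())
    # Assumption: 0 is reserved for padding and for words outside the dictionary.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def vectorise(line, dictionary, row_length=ROW_LENGTH):
    """Second pass: map words to numbers, truncate long lines, pad short ones."""
    ids = [dictionary.get(word, 0) for word in line.split()][:row_length]
    return ids + [0] * (row_length - len(ids))


lines = ["<DATE> <TIME> login ok", "<DATE> <TIME> login failed for <IP>"]
dictionary = build_dictionary(lines)
pickled = pickle.dumps(dictionary)  # what would be written to disk for later reuse

# Write the vectorised rows as CSV; a short row length keeps the demo readable.
buffer = io.StringIO()
csv.writer(buffer).writerows(vectorise(line, dictionary, row_length=8) for line in lines)
print(buffer.getvalue())
```

Anyone running inference later would need to load the same pickled dictionary and apply the same padding rule, which is the reason for keeping both passes in one language.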
Stage 3
By now I was pretty close. The last thing I needed to do was to read in each file as a numpy array and add an additional first column with a value indicating which type of data it was. I get this information from the filename, and I need to save the dictionary which maps file types to their number representation for later use. I probably could have done this in the preceding stage, but I also wanted to combine all files into one huge numpy array and then randomise it so that the data is well and truly mixed up. The first entry in each row still tells me what the row type is, but now the types are mixed together throughout the array. I then write this data out into two files of different sizes. One is a random selection of data which I save for training and the other is the remaining data to use for testing. A command line option allows me to set the proportions of these two files, but by default the split is 80/20. Because of my poor judgement earlier these two files now add up to 117GB. I probably need to take that 1,000 entry row down quite a way. It will certainly make for smaller models and faster training if I do.
I plan to sample from both files in the first instance to get a model which trains well, and then use the full data-set in final training and testing. It is much too large a data-set for me to use in the initial trial and error training needed to get the model and hyper-parameters set. I’ll talk more about that in a later post.
What you can see here is that building a pipeline is quite a complex and time-consuming task. I made some decisions which made it harder for me and I’ll probably go back and correct some of those. At present I only have a single data dump and the data is now in a final form which works for me, but building a pipeline ensures that if I receive further data dumps I can quickly process them and feed them to a model for training.
Things like unexpected changes in format can easily wreck your pipeline, but more subtle changes might feed through unnoticed and have an impact on how accurately the model trains. I will probably add more output from my pipeline, such as average line length and number of words found, so I can spot when things change and investigate why.
Actually understanding your data as a subject matter expert is pretty important in this pipeline building stage. Someone who knows what they are looking at is much more likely to see when it is wrong and also more likely to have an insight about how to structure the model. Building a data science or analytics system is a team affair. It doesn’t just mean hiring a data scientist and asking him or her to build you a model. It needs data engineers and subject matter experts, and it needs the cooperation of those teams who have access to the data in the first place. Fortunately I have some friends who were willing to give me good tagged data in an easily digested format I could use. My data pipeline actually started with them; I have simply built the middle part. And as usual it is the end part, the trained model, which will get all of the glory.
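The labelling, shuffling and splitting described here might look something like the sketch below. It assumes everything fits in memory, which would not hold for the full 117GB, and the function name, the fixed seed and the way type numbers are assigned (alphabetical order) are choices made for the example rather than details from the project.

```python
import numpy as np


def label_and_split(arrays_by_type, train_fraction=0.8, seed=0):
    """Prefix each row with its type number, shuffle all rows, then split."""
    # Map each data type name to a number; this mapping must be saved for later use.
    type_ids = {name: i for i, name in enumerate(sorted(arrays_by_type))}
    labelled = []
    for name, rows in arrays_by_type.items():
        labels = np.full((rows.shape[0], 1), type_ids[name], dtype=rows.dtype)
        labelled.append(np.hstack([labels, rows]))
    combined = np.vstack(labelled)
    # Shuffle whole rows in place so each label stays attached to its data.
    np.random.default_rng(seed).shuffle(combined)
    cut = int(len(combined) * train_fraction)
    return combined[:cut], combined[cut:], type_ids


# Tiny stand-in arrays; the real rows would hold 1,000 numbers each.
data = {
    "firewall": np.ones((6, 4), dtype=np.int64),
    "email": np.zeros((4, 4), dtype=np.int64),
}
train, test, type_ids = label_and_split(data)
print(train.shape, test.shape, type_ids)
```

Because the shuffle moves entire rows, the first column remains a reliable label after the types have been mixed together.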
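A first version of that extra output could be as simple as the sketch below. It assumes normalised lines split on whitespace and a word-to-number dictionary like the one built in Stage 2, and the statistic names are made up for the example.

```python
def summarise(lines, dictionary):
    """Collect a few cheap statistics that can be compared between data dumps."""
    line_count = 0
    word_count = 0
    known_count = 0
    for line in lines:
        words = line.split()
        line_count += 1
        word_count += len(words)
        known_count += sum(1 for word in words if word in dictionary)
    return {
        "lines": line_count,
        "mean_words_per_line": word_count / line_count if line_count else 0.0,
        "known_word_fraction": known_count / word_count if word_count else 0.0,
    }


stats = summarise(["a b c", "a d"], {"a": 1, "b": 2})
print(stats)
```

A sudden drop in the fraction of known words, or a jump in mean line length, would be a prompt to look at the new dump before training on it.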
Edit 04th Feb 2021
Interestingly enough I have realised that training will not only require me to build different models and try different hyper-parameters, it will also require me to alter my pipeline several times to provide different observation data. I want to experiment with reducing the number of observations presented to the model. I could do that in the model training code itself of course, but I also want to experiment with reducing the normalisation I am doing, to see if the CNN can identify dates, IP addresses etc. without me having to explicitly call those out. To do that I probably need to modify the code which builds the dictionary to ensure it recognises punctuation as well, or perhaps not?
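If punctuation does turn out to matter, one option is to change how lines are split into tokens before the dictionary is built. The sketch below is one possible approach, not a decision: the regular expression and the tokenise name are assumptions, and it emits each punctuation character as its own token.

```python
import re

# Runs of word characters become one token; any other non-space character
# becomes a token of its own.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenise(line):
    """Split a line into words and individual punctuation marks."""
    return TOKEN.findall(line)


print(tokenise("10.0.0.1 at 10:15"))
# prints: ['10', '.', '0', '.', '0', '.', '1', 'at', '10', ':', '15']
```

With tokens like these the dots in an IP address and the colons in a time reach the model as repeated symbols, at the cost of longer rows for every line.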
Lots of research still to do…