Improving Security Log Parsing with Neural Networks

I always have a few projects on the go, but the main one I’m working on at the time of writing concerns the parsing of security logs.

Any data pipeline for security monitoring starts with the ingestion of large amounts of log data. In order for this data to be useful it needs to be parsed and later normalised. For those of you not familiar with this process: if I have a stream of data from Windows, let’s say, I want to pick out the useful information in it, such as users logging into systems. Picking that information out of the extra data Microsoft sees fit to provide is called parsing, a term borrowed from linguistics.

Now if I also have a data stream from Unix I might want similar information, but my challenge is that I might want to look at all logins across both types of platform. Logins look quite different on each, and things like taking on additional rights or creating accounts or groups differ even more; often these actions span more than a single log entry. Being able to ask questions across very different systems is called normalisation. We need to avoid losing any information when we normalise, or very specific questions about a particular data stream might become impossible because we have generalised too much.
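To make that concrete, here is a minimal sketch in Python of what parsing and normalising a Windows and a Unix login might look like. The field names, the simplified log lines and the "common schema" are my own inventions for illustration; real parsers for Windows security events and syslog are considerably more involved.

```python
# Illustrative only: field names and the normalised schema are made up.
import re

def parse_windows_logon(line):
    # Pick out the fields we care about from a (simplified) Windows security event.
    m = re.search(r"EventID=(\d+).*Account Name:\s*(\S+).*Source Network Address:\s*(\S+)", line)
    if not m:
        return None
    return {"event_id": m.group(1), "user": m.group(2), "src_ip": m.group(3)}

def parse_unix_logon(line):
    # Pick out the equivalent fields from a (simplified) sshd syslog line.
    m = re.search(r"Accepted \w+ for (\S+) from (\S+)", line)
    if not m:
        return None
    return {"user": m.group(1), "src_ip": m.group(2)}

def normalise(parsed, source):
    # Map both sources onto one schema so "show me all logins" works everywhere,
    # but keep the raw parsed fields so source-specific questions stay possible.
    return {
        "action": "logon",
        "user": parsed["user"],
        "src_ip": parsed["src_ip"],
        "source": source,
        "raw_fields": parsed,
    }

win = parse_windows_logon("EventID=4624 ... Account Name: alice ... Source Network Address: 10.0.0.5")
nix = parse_unix_logon("sshd[311]: Accepted password for alice from 10.0.0.5 port 51515 ssh2")
events = [normalise(win, "windows-security"), normalise(nix, "linux-auth")]
```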

As you can see, the ingest phase is very challenging and gets more so each time you add a new data source. What we tend to do is build a specific parser for each type of data. That way the parser knows exactly what to expect from its input and can process it efficiently and accurately. Because the data from many systems tends to arrive all jumbled up in a single feed, we first send it to a router or pre-parser to ensure that each parser only gets the right data, i.e. the Microsoft-Security parser only gets Microsoft-Security data. The job of this element is simply to identify the source type of the data and route it to the right parser.

The pre-parser needs to be lightweight because it is the first stage of the ingestion pipeline, and any problems here translate into a less efficient parsing stage. To be efficient the pre-parser tends to be quite simple and looks for key elements in a line of data. This can cause problems, however: a new type of data from a new system which contains something similar can cause chaos by being sent to the wrong parser. The order in which we try to match data sources in the pre-parser then becomes important. As we add data sources the pre-parser becomes more and more complex, and more and more prone to issues as we deploy it in different environments. Making a pre-parser that can handle any type of data means making it very complex and difficult to alter.
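As a toy illustration, a traditional pre-parser boils down to an ordered list of match rules. The keywords and parser names below are invented, but they show why rule order matters and why a new source containing a familiar-looking string can end up at the wrong parser.

```python
# Sketch of a traditional pre-parser: an ordered list of (test, parser) rules.
# The substrings and parser names are made up. Because rules are tried in order,
# a new source that happens to contain "sshd" would be routed to the wrong parser
# unless a more specific rule is added above it.

RULES = [
    (lambda line: "Microsoft-Windows-Security-Auditing" in line, "windows_security_parser"),
    (lambda line: "sshd[" in line,                               "linux_auth_parser"),
    (lambda line: "%ASA-" in line,                               "cisco_asa_parser"),
]

def route(line, rules=RULES):
    for test, parser_name in rules:
        if test(line):
            return parser_name
    return "unparsed_queue"   # anything we don't recognise goes to a fallback

print(route("Jan  1 00:00:01 host sshd[42]: Failed password for root from 203.0.113.9"))
# -> linux_auth_parser
```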

Herein lies another issue. The data you see on one site may not look quite the same as the data you see on another, even if it comes from the same kinds of systems. Log formats tend to be configurable, and administrators may have added or removed fields for whatever purpose. Approaching a new site to deploy an ingestion pipeline can therefore be a step into the unknown.

And this brings us to my little project. I’m currently looking at developing a Deep Learning based pre-parser which will recognise a data type and correctly route it to the appropriate parser. To do this I’m experimenting with Convolutional Neural Networks (CNNs), more normally used in computer vision, to recognise elements in a data stream. CNNs are much faster than some of the Deep Learning architectures used for natural language processing, but because I’m just building the pre-parser I’m only really interested in the structure of the data and not its meaning. This allows me to use a simpler type of network.

A CNN pre-parser should not suffer from the issues of more traditional pattern-matching pre-parsers, because there is no rule order to worry about and the performance of the pre-parser should be constant regardless of the number of data sources. In addition to identifying the correct parser, it should also be able to provide a confidence factor, which will allow me to spot log lines which may not parse and to preprocess them before they are sent to the parser. At the very least I’ll be able to identify parsing problems and their causes much more easily.
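A rough sketch of how that confidence factor might be used, assuming the model ends in a softmax over the known parsers; the parser names, the 0.9 threshold and the routing logic are illustrative rather than anything final.

```python
# Assumed: `model` is a trained classifier whose output is a softmax over parsers,
# and `vectorised_line` is a fixed-length integer vector for one log line.
import numpy as np

PARSER_NAMES = ["windows_security", "linux_auth", "cisco_asa"]  # one per class (illustrative)

def route_with_confidence(model, vectorised_line, threshold=0.9):
    probs = model.predict(vectorised_line[np.newaxis, :], verbose=0)[0]
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        # Low confidence: the line probably won't parse cleanly, so flag it
        # for preprocessing or investigation rather than sending it straight on.
        return "needs_review", float(probs[best])
    return PARSER_NAMES[best], float(probs[best])
```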

In order to complete this work I need tagged data; that is to say, I need log data together with an indication of which system it comes from. I’m currently collecting this from various internet sources and designing the model to process it. I’m also designing my own input pipeline: a CNN doesn’t want text, it needs numbers, so I need to vectorise the text in the log lines in a way that maintains the structure I want to analyse. That means doing my own parsing to recognise dates, IP addresses and so on. I am not interested in what an IP address is, just that there is a field with an IP address in it. This will give me two things: a dictionary which maps text to numbers, and post-processed log lines to train my model.
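Something along these lines is what I have in mind for the input pipeline. The regexes and special tokens below are simplified stand-ins, but they show the idea of collapsing dates, times and IP addresses into structural tokens and building up a dictionary as lines are vectorised.

```python
# Simplified sketch: collapse structural elements into tokens, then map tokens
# to integers. The patterns and token names here are placeholders.
import re

STRUCTURAL = [
    (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),     "<DATE>"),
    (re.compile(r"\b\d{2}:\d{2}:\d{2}\b"),     "<TIME>"),
    (re.compile(r"\b\d+\b"),                   "<NUM>"),
]

vocab = {"<PAD>": 0, "<UNK>": 1}   # the dictionary that maps text to numbers

def tokenise(line):
    for pattern, token in STRUCTURAL:
        line = pattern.sub(token, line)
    return line.split()

def vectorise(line, max_len=200):
    ids = []
    for tok in tokenise(line)[:max_len]:
        if tok not in vocab:
            vocab[tok] = len(vocab)
        ids.append(vocab[tok])
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))   # fixed-length output

vec = vectorise("2021-09-17 10:22:01 sshd[42]: Accepted password for alice from 10.0.0.5")
```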

I’ll update this post with progress, and ultimately I’d like to make the model available to anyone who needs it. The model will only be good at analysing log data types that it has seen before, so getting very broad data in the early stages is my chief concern right now.

Technical Note

Traditional pre-parsers use the Aho–Corasick algorithm, whose complexity is essentially linear in the combined length of the patterns plus the length of the searched text plus the number of output matches. What I am hoping to do is build a pre-parsing solution whose per-line cost is essentially constant rather than linear. This may mean that for a small number of conditions in the pre-parser it underperforms the Aho–Corasick approach, but at some number of patterns and matches it will begin to outperform it. There may be challenges in actually achieving this, and the overhead of retraining the model each time a new data source is added may also be a limiting factor, but it remains an interesting project.
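For comparison, here is what a keyword-based Aho–Corasick pre-parser looks like using the pyahocorasick package (my choice of library for illustration, not something the project depends on). Matching a line is roughly linear in the line length plus the number of matches once the automaton is built, but the automaton grows with every keyword added, whereas a fixed-size network costs the same per line however many sources it has learned.

```python
# Minimal Aho-Corasick routing sketch using pyahocorasick (pip install pyahocorasick).
# The keywords and parser names are invented for illustration.
import ahocorasick

KEYWORDS = {
    "Microsoft-Windows-Security-Auditing": "windows_security_parser",
    "sshd[": "linux_auth_parser",
    "%ASA-": "cisco_asa_parser",
}

automaton = ahocorasick.Automaton()
for keyword, parser_name in KEYWORDS.items():
    automaton.add_word(keyword, parser_name)
automaton.make_automaton()

def route_ac(line):
    # iter() yields (end_index, value) for every keyword found in the line.
    for _end_index, parser_name in automaton.iter(line):
        return parser_name          # first match wins in this toy version
    return "unparsed_queue"
```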

Update 17th September 2021

Having completed the data pipeline I finally got around to building the model. I used a tensorflow.keras Sequential model which relied upon Conv1D layers to identify the data source. My examples included around 50 different data sources, and I used fixed-length tokenised data with 200 tokens per line (so longer lines were truncated). My first stab at the model gave me 98% accuracy against the test data set, which was pretty impressive. The model also converged very quickly, which is a bonus.
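For the curious, the model was broadly of this shape. The layer sizes, vocabulary size and other hyper-parameters below are placeholders rather than the real configuration, and the embedding layer is my own assumption about how the integer tokens feed the convolutions; only the general structure (a tensorflow.keras Sequential model built around Conv1D layers, 200 tokens per line, a softmax over roughly 50 source classes) reflects the description above.

```python
# Sketch only: sizes and hyper-parameters are guesses, not the trained model.
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed size of the token dictionary
MAX_LEN = 200        # tokens per log line
NUM_SOURCES = 50     # one class per data source

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),          # assumed token embedding
    tf.keras.layers.Conv1D(128, 5, activation="relu"),  # convolutions over token structure
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_SOURCES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_tokens, train_labels, validation_split=0.1, epochs=10)
```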

The next step for me is to remove the feature engineering I did earlier in the pipeline, where I did things like use regex to identify dates, times, IP addresses and the like. I’m pretty sure I will lose very little accuracy if I skip that stage, and it will make the model much easier to use in the field. I will probably have to do a little hyper-parameter tuning to get the accuracy back up, and I may need to extend the tokenisation beyond 200 values, but that will only impact training and not inference.

Training it was also pretty simple: it used all of the memory of an Nvidia 2080 Ti but only about 25% of its cores.
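On the memory point: TensorFlow allocates all available GPU memory by default, so using the whole card is expected behaviour rather than the model necessarily needing that much. Enabling memory growth, if wanted, makes TensorFlow allocate on demand instead; this uses the standard tf.config API and nothing project-specific.

```python
# Ask TensorFlow to allocate GPU memory on demand rather than grabbing it all up front.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```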

