Build your own Deep Learning UEBA system?

At the time of writing this article I lead the Solutions Architecture team at Exabeam, a UEBA-based SIEM company. As such, I get into some pretty interesting conversations with customers about both security monitoring and data science. My favourite conversation, by subject, is definitely when customers tell me they want to build their own analytics platform. Perhaps they have read a paper on the subject, perhaps they have just hired a data science team into security, or perhaps something else makes them want to explore the technology at a deeper level than just using what we provide out of the box.

If “I want to do data science for myself” is my favourite conversation, then “I want to do Deep Learning” is a special treat. It is my own area of research and, whilst we don’t do deep learning in our platform, there is no reason why we shouldn’t make deep learning for security analytics a reality for those customers who are ready to take that step. If you are that customer – or perhaps you aren’t even a customer yet – buckle in and I’ll talk you through the dos and don’ts of deep learning for security monitoring.

Often the first conversation I have on this subject is when a customer, new to the field, starts challenging me on some aspect of the Exabeam platform. Normally I’m being told that I am missing some trick, or just plain doing it wrong, and the customer intends to prove this through some grand plan of their own. These grand plans range from “Deep learning will just automagically find the bad guys”, through “I have just learnt about Jupyter notebooks and want to find the bad guys with that”, to “We built a data lake five years ago and we now plan to start analysing that data”; and my response to all of these is “Let me stop you right there”…

All of these approaches make the same set of mistakes, just at different levels. The mistakes generally fall into the “don’t start with raw data” category. Data science, no matter what area you work in, has a data pipeline before you get to the point where you are doing actual analytics – long before, in most cases. And the truth of the matter is that raw data tends to be horrible. I mean really bad – useless in most cases. If you look at a typical SIEM vendor, the first two stages in their pipeline are data collection and data parsing, and they are also the first two stages a data science team needs to consider. They need to collect the right data in a consistent manner, ideally without too many gaps, and they need to isolate the data features – to use the data science parlance – or to extract the fields, as we InfoSec people like to think of it.
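To make “extract the fields” concrete, here is a toy sketch of field extraction in Python; the log line and the regex are illustrative assumptions of mine, not a production parser:

import re

# Toy field extraction: turn one raw sshd log line into discrete fields.
# The line format and regex are assumptions for illustration only.
line = "Jan 12 03:14:07 web01 sshd[2211]: Failed password for alice from 10.0.0.5 port 51514 ssh2"

match = re.search(r"(Failed|Accepted) password for (\S+) from (\S+)", line)
if match:
    fields = {
        "outcome": match.group(1).lower(),   # "failed"
        "user": match.group(2),              # "alice"
        "src_ip": match.group(3),            # "10.0.0.5"
    }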

For this reason I tend to suggest that budding security analytics teams start from their existing SIEM and pull data that is already parsed and normalised. What do I mean by normalised? Well, if you want to analyse logins then you probably want to do that regardless of the technology being logged into. A login on a Cisco router needs to be analysable in the same way as a login on a Linux server or a Windows workstation. Normalisation is transforming the data so that features or fields are made the same based upon equivalence of action. Your SIEM will do all of that for you (or it should), so save yourself a lot of hassle and take post-ingestion data. If you built your own data lake some time ago, my bet is that the normalisation layer (and possibly even the parsing layer) got missed out. If not, then you are one of the lucky few. Data lakes are often built on the schemaless (or schema-on-read) principle. That makes it easy to get data into them but defers the work of normalisation and feature extraction to later in the pipeline. One way or another that work needs to get done.
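As a minimal sketch of what normalisation looks like in code, assuming hypothetical parser output field names, each function below maps a vendor-specific login record onto one common schema so that downstream analytics can treat all logins identically:

def normalise_windows_logon(raw: dict) -> dict:
    # Windows 4624 logon events carry the account in TargetUserName
    return {
        "event_type": "logon",
        "user": raw["TargetUserName"].lower(),
        "host": raw["Computer"].lower(),
        "outcome": "success",
    }

def normalise_linux_sshd(raw: dict) -> dict:
    # sshd records parsed from syslog into user/host/result fields
    return {
        "event_type": "logon",
        "user": raw["user"].lower(),
        "host": raw["host"].lower(),
        "outcome": "success" if raw["result"] == "Accepted" else "failure",
    }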

So now you have all of your data, you need a theory about what it can show you. Many people new to the field of Machine Learning believe that they can just throw all of the data at a Machine Learning program and it will spit out revelations and insights. Unfortunately, that is not the case. You need to develop a hypothesis, then you need to develop a model for testing that hypothesis, and lastly you need to evaluate the success of the model so you know whether it is worth deploying in production, i.e. giving its insights to the overstretched SOC team to use on a daily basis. Trust me, giving them a whole lot more false positives to look at every day isn’t going to make you popular. For more information on the approach I suggest you read: The Scientific Method

There are any number of things you might choose to analyse, but here is one I tried, and yes, it uses deep learning. The Exabeam Advanced Analytics system identifies anomalies and attaches risk to each one based upon a wide range of criteria, including context associated with the person or machine. Risk accumulates for a user (or machine) but it also decays. Exabeam can provide a list of notable users whose risk has risen above a threshold. This works very well in practice and we have seen countless breaches halted in their tracks by this system. My hypothesis was that, looking over a longer time period, users would generally all generate some risk, because anomalies are simply that: atypical behaviour. Corporate users are always accessing new systems, being put into new groups or forgetting their passwords. However, I postulated that these normal anomalies might form a pattern for each group of users based upon their normal role, and that these patterns would differ from the pattern of anomalies generated by a bad actor; someone seeking to compromise systems would generate many of the same anomalies as a normal user, but they might also fail to generate some of the more typical ones. In short, I wanted to look at anomalous anomalies.

This is an excellent example of using data from as late in the processing pipeline as possible. I planned to use, as my input, the list of users who generate risk together with the anomalies contributing to that risk, but I deliberately ignored the level of risk for each anomaly as irrelevant. As my output I needed to generate a list of candidate notable users, to replace the list generated simply by adding risk scores. I had no tagged data to work with, as I had to assume that for any given organisation there might be good and bad users already present. I therefore knew I needed an unsupervised learning approach – one which would find patterns in my data without needing data already tagged as good or bad behaviour on which to train first. This is one of the biggest problems in Machine Learning for Information Security: there just isn’t much tagged bad-behaviour data around, and where there is, it tends to be specific to a single organisation or technology. Supervised learning techniques tend to be a little niche for Information Security use cases as a result.

After some investigation I decided to try a Deep Learning technique called Self-Organising Maps, or SOMs. I’m not going to explain SOMs to you because there are many excellent articles which do that very well. In short, though, the data I had was high-dimensional. That is to say, each anomaly type in my data formed a dimension (also called a feature) and each user recorded their anomalies as a row in this data. My first run against real data over a month gave me 9,180 users (rows) with 115 different anomalies triggered: over nine thousand rows of data and 115 dimensions (columns of features). The great thing about a SOM is that it effectively reduces all those dimensions down to a two-dimensional map – something much easier to work with.
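As a rough sketch of how that input matrix might be built with pandas (the column names and sample values here are my own illustrative stand-ins):

import pandas as pd

# Hypothetical input: one row per (user, rule) anomaly pulled from the SIEM.
events = pd.DataFrame({
    "user": ["svc-proxy", "svc-proxy", "griffinb"],
    "rule": ["DC23", "RA-UH-F", "PA-NoIT"],
})

# Pivot to a binary user-by-anomaly matrix: 1 if the user triggered the
# rule at least once in the period, 0 otherwise. Risk scores are
# deliberately ignored, in line with the hypothesis.
matrix = pd.crosstab(events["user"], events["rule"]).clip(upper=1)
data = matrix.to_numpy(dtype=float)    # rows = users, columns = anomaly types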

I used Python to build my program. If you don’t know Python then Machine Learning probably isn’t for you: Python has become the de facto standard for ML over the last few years and there are lots of packages available to make your life easier. This is where I had an advantage, as I pulled the data directly from Mongo – I have access to parts of the system that normal users don’t – but I could have used the Exabeam APIs to do this with just a bit more work. I then built a pandas data frame containing this data and used an open source implementation of Self-Organising Maps called MiniSom (https://pypi.org/project/MiniSom/) to analyse it. MiniSom essentially makes the data gravitate together, or clump, based upon its similarity. The idea is to form clusters of similar behaviour and then to look at the outliers within a two-dimensional grid. I needed good clumping, so I needed a lot of training iterations for the SOM model, and after playing around with the size of the grid used to arrange the data, the number of iterations, and some other hyperparameters (parameters which control the analysis – not the ones which come from your data) I was able to get some very interesting results. Having confirmed that everything was working as expected, I also excluded some low-risk but high-frequency risk rules, as they made it harder to find outliers.
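For the curious, a training run with MiniSom looks something like the sketch below; the grid size, iteration count and hyperparameter values are illustrative guesses rather than the settings I finally landed on:

from minisom import MiniSom

# 'data' is the binary user-by-anomaly matrix from earlier:
# ~9,000 rows (users) by 115 columns (anomaly types).
grid = 30                                   # 30x30 map, illustrative size
som = MiniSom(grid, grid, data.shape[1],
              sigma=1.5, learning_rate=0.5,
              neighborhood_function="gaussian",
              random_seed=42)
som.random_weights_init(data)
som.train_random(data, 50000)               # many iterations for good clumping

# Each user lands on a winning cell of the 2D grid; users sharing a cell
# behave similarly, and sparsely populated cells hold the outliers.
winners = [som.winner(row) for row in data]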

I decided to add a visualisation of the SOMs I was generating, to get a better understanding of how they were training, and I have shown a couple here. The dark areas are high-density groups where users are forming common patterns, and the very light areas are less dense and contain smaller clusters. The trick is to find the least dense parts of the diagram for the most anomalous users. Don’t forget there are over 9,000 users in this grid. I also decided to run multiple epochs (an epoch here means one complete run through the training, so multiple epochs means running the training multiple times with different random starting positions) and then look for users who were outliers in multiple epochs – these truly were anomalous. The reason for doing this is that the map starts with random values, and it is possible for a clump to form close to a real anomaly and obscure it; running multiple epochs reduces this chance and gives significantly better results. Some epochs would simply identify very small groups of normal users, so at least ten epochs tended to give me the best chance of finding the real bad actors without also generating those annoying false positives. A sketch of this multi-run approach follows below.
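Here is that sketch, again using MiniSom; the sparsity threshold and run count are illustrative choices of mine, and the distance-map plot at the end is one common way to visualise a trained map:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from minisom import MiniSom

def repeated_outliers(data, users, runs=10, grid=30, iters=50000,
                      max_cell_size=3):
    # Train several SOMs from different random seeds and return the users
    # who land in sparse cells in at least half the runs. max_cell_size is
    # an illustrative threshold for calling a cell "sparse".
    hits = np.zeros(len(users), dtype=int)
    for seed in range(runs):
        som = MiniSom(grid, grid, data.shape[1], sigma=1.5,
                      learning_rate=0.5, random_seed=seed)
        som.random_weights_init(data)
        som.train_random(data, iters)
        cells = [som.winner(row) for row in data]
        counts = Counter(cells)                 # users per winning cell
        for i, cell in enumerate(cells):
            if counts[cell] <= max_cell_size:
                hits[i] += 1                    # outlier in this run
    # Visualise the last trained map via its distance map (U-matrix):
    # with the "bone" colormap, dark cells sit in dense regions and
    # light cells are sparse, matching the plots shown above.
    plt.pcolor(som.distance_map().T, cmap="bone")
    plt.colorbar()
    plt.show()
    return [u for u, h in zip(users, hits) if h >= runs // 2]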

Having found the anomalies, I went back to the original pandas data frame and identified which users they were. This is because my SOM operates on purely numeric data, so anomalies need to be encoded categorically, with a 1 for triggered and a 0 for not triggered for each anomaly type. I could have considered using a sliding scale for how frequently an anomaly was triggered, but that would have been irrelevant to my hypothesis (stick to the plan – don’t be tempted to throw in irrelevant data). My final trick was to take the anomalies each user had triggered in the period and map them against MITRE ATT&CK to get a feel for their activity. Clearly, if they were all in the same MITRE Tactics area they were less interesting than activity spread across a possible entire kill-chain.
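The mapping itself is a simple lookup. A sketch, with a hypothetical rule-to-MITRE table of the kind that can be built from the rule metadata shown in the results below:

# Hypothetical lookup from rule ID to (technique, tactic) pairs, built
# from rule metadata like that shown in the output below.
RULE_TO_MITRE = {
    "DC23":      [("T1124", "TA0007")],  # System Time Discovery / Discovery
    "SEQ-UH-16": [("T1110", "TA0006"),   # Brute Force / Credential Access
                  ("T1078", "TA0001")],  # Valid Accounts / Initial Access
}

def tactics_employed(rules):
    # Collect the distinct MITRE tactics touched by a user's anomalies;
    # spread across many tactics hints at kill-chain-like activity.
    return sorted({tactic
                   for rule in rules
                   for (_technique, tactic) in RULE_TO_MITRE.get(rule, [])})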

Here are some of my results. Is this better than what Exabeam Advanced Analytics does by default? Who knows; it’s a different view into the user activity and avoids assigning fairly arbitrary risk scores. It also looks for patterns over long periods of time and so can find low-and-slow attacks. You could find all of this activity using the Threat Hunter feature, if you knew what to look for. I set out to see if there were different approaches to finding notable users to investigate, which might provide different results as a starting point for an investigation. What this does demonstrate are the thought processes which might lead you to develop your own Deep Learning approach to Information Security. Start with preprocessed data. Have a theory to test. Play around with the parameters and the model, and finally work out whether your model is producing something real or just random noise. The jury is still out on that last one. Here are some of the outputs for you to judge:

Notable Users
svc-proxy  - Risk: 10.0
Rules
DC23 : Abnormal session start time
    - T1124 (System Time Discovery)
WPA-OH-F : First execution of critical windows command using privileged access on this host
    - T1082 (System Information Discovery)
SEQ-UH-16 : Exceeded number of failed logons for the user
    - T1110 (Brute Force)
    - T1078 (Valid Accounts)
A-EPA-HP-F : First execution of process on asset
    - T1204 (User Execution)
EPA-HP-F : First execution of process on host
    - T1204 (User Execution)
AL-RT : Risk transfer from account lockout activities
WPA-UH-F : First privileged access event on host for user
    - T1068 (Exploitation for Privilege Escalation)
RA-UH-F : First access to asset
    - T1078 (Valid Accounts)
EPA-OP-F : First execution of process in this organization
    - T1204 (User Execution)
A-EPA-OP-F : First execution of process for the asset in this organization
    - T1204 (User Execution)
EPA-UP-F : First execution of process for user
    - T1204 (User Execution)
EPA-USequenceSize-WC : Abnormal number of critical windows command executions by the user
    - T1059 (Command-Line Interface)
EPA-OH-F : First execution of critical windows command on this host
    - T1059 (Command-Line Interface)
DC24 : Abnormal day of week
    - T1124 (System Time Discovery)
Tactics Employed
TA0001 :  Initial Access
TA0002 :  Execution
TA0003 :  Persistence
TA0004 :  Privilege Escalation
TA0005 :  Defense Evasion
TA0006 :  Credential Access
TA0007 :  Discovery


bracer-admin  - Risk: 20.0
Rules
AM-OG-F : First member addition to this group for the organization
    - T1098 (Account Manipulation)
RA-UH-F : First access to asset
    - T1078 (Valid Accounts)
EPA-UP-F : First execution of process for user
    - T1204 (User Execution)
A-EPA-HP-F : First execution of process on asset
    - T1204 (User Execution)
EPA-OP-F : First execution of process in this organization
    - T1204 (User Execution)
A-EPA-OP-F : First execution of process for the asset in this organization
    - T1204 (User Execution)
EPA-PU-PS-F : First execution of powershell process for user
    - T1086 (PowerShell)
AE-UA-F : First activity type for user
    - T1078 (Valid Accounts)
AM-UA-MA-F : First account group management activity for user
    - T1078 (Valid Accounts)
WPA-UH-F : First privileged access event on host for user
    - T1068 (Exploitation for Privilege Escalation)
Tactics Employed
TA0001 :  Initial Access
TA0002 :  Execution
TA0003 :  Persistence
TA0004 :  Privilege Escalation
TA0005 :  Defense Evasion
TA0006 :  Credential Access


griffinb , Beatrice Griffin - Risk: 0.0
Rules
A-NET-HCountry-Outbound-F : First outbound connection to this country from asset
    - T1071 (Standard Application Layer Protocol)
PA-NoIT : Badge access without IT presence
    - T1078 (Valid Accounts)
EPA-HP-F : First execution of process on host
    - T1204 (User Execution)
A-EPA-HP-F : First execution of process on asset
    - T1204 (User Execution)
A-EPA-OP-F : First execution of process for the asset in this organization
    - T1204 (User Execution)
AE-UA-F : First activity type for user
    - T1078 (Valid Accounts)
AE-UA-F-VPN : First VPN connection for user
    - T1133 (External Remote Services)
PA-UTi-A : Badge access at abnormal time
    - T1078 (Valid Accounts)
A-NET-HdPort-Outbound-F : First outbound connection on port for asset
    - T1065 (Uncommonly Used Port)
Tactics Employed
TA0001 :  Initial Access
TA0002 :  Execution
TA0003 :  Persistence
TA0004 :  Privilege Escalation
TA0005 :  Defense Evasion
TA0011 :  Command and Control

I don’t know about you, but I think I’d be tempted to take a look at the timelines for those users, though the last one looks like it could just be a user travelling; worth a check though, I’d say. That is the last take-away for security analytics: it doesn’t provide definitive answers most of the time. Its real value is that it points you in the right direction, identifies things worthy of your time to investigate, and gives you something better to work on than chasing the same set of alerts all the time. Machine learning is just another tool in your kitbag. The best tool is still a skilled security professional looking at a timeline.
