Build your own SIEM – Part II

Previously I introduced the idea of building your own UEBA system, but I mainly talked about a specific data science project of mine, one designed to avoid the problem of simply adding risk scores to individuals or systems. In this article I want to take a step back and look at the bigger picture of developing your own SIEM or UEBA platform from scratch.

One of the points I made in my previous article bears repeating. Building a SIEM or security analytics platform is not just a case of hiring a few data scientists and letting them loose on your data lake. I have seen that tried many times and have not seen it succeed even once. A SIEM is about taking raw data – mostly log data – and turning it into security insights. Not just once, but repeatedly, every single day, over and over again, for many different systems and users, across many different authentication systems and networks. Yes, a data lake might have some useful information in it, and quite possibly you can turn that into insights as a one-off, but a SIEM it is not!

For an insight into how a SIEM actually works and what might be involved in building your own, read on…

The SIEM Pipeline

If you are thinking of building your own SIEM, UEBA system or analytics platform, you first need an architecture, not just for one part of the system but for the end-to-end pipeline. I call it a pipeline for a good reason: essentially a SIEM is a data pipeline where we steadily transform the data from raw logs and alerts at one end into usable security insights and actions at the other. Each transformation will take some processing, and possibly contextual data and rules, to do its job. As the pipeline progresses, more of the data should be discarded, but what remains will normally become more and more enriched. One complicating factor, however, is the need to search against (and sometimes retain for legal and compliance reasons) the original raw data, often for an extended period of time.

As SIEMs have developed, the amount of transformation in the pipeline has increased. Generation-one SIEMs did little more than parse and, to a small extent, normalise the raw data. Modern analytics-based SIEMs tend to build baselines, identify clusters and patterns, and build “transactions” from multiple events to establish individual workflows for users and systems, often inferring missing events.

The pipeline, then, is the key to developing a working SIEM. Data engineers will need to establish the transformations in the data flow, and architects will be required to solve the problems of resilient, continuous flows of data through the system. Viewing your SIEM as a pipeline also gives you the opportunity to compartmentalise certain aspects of the SIEM and to plan and extend functionality easily by inserting new stages.

Let’s start by looking at each stage in turn.

Data Acquisition and Ingestion

The first challenge we face is actually capturing the data into the system and getting it into some sort of common format. Think of this as the funnel which feeds data into the open end of your pipeline. SIEMs must handle both passive and active data acquisition, and each comes with its own challenges. Passive acquisition involves receiving data sent to us by end systems or by data acquisition systems. A good example of this is receiving syslog data either directly or from upstream syslog servers. Here we have no control over the data rates, and we must deal with systems adding or changing some of the data. Each syslog server, for instance, will add its own timestamp and hostname to every record. This may create problems further down your pipeline, particularly where the native format changes, say a Windows collector which then forwards events as syslog. Your parsing components will have to deal with this format change; but more on that later.

If data is being sent to you then you almost certainly need to load balance your collectors, both for resilience and to spread the load between them. Most systems which send you data will not store it for long, if at all, so data loss becomes a significant problem if your collectors are down for upgrades, failures and so on. Collectors may also have times when they are unable to send data to the pipeline (which may be on another network), so they need to be able to buffer data, possibly for a significant period of time.
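To make that buffering idea concrete, here is a minimal sketch of a passive collector: a plain UDP syslog listener which spools records to local disk whenever the onward pipeline is unreachable. The `forward_to_pipeline` function and the spool path are placeholders for whatever transport and storage you actually use, so treat this as an illustration rather than a finished collector.

```python
import socket

SPOOL_PATH = "/var/spool/collector/syslog.spool"  # illustrative local buffer location


def forward_to_pipeline(record: bytes) -> None:
    """Placeholder: in practice this would publish to your transport (e.g. a Kafka topic)."""
    raise NotImplementedError


def run_collector(bind_addr: str = "0.0.0.0", port: int = 514) -> None:
    # Plain UDP syslog listener (port 514 needs elevated privileges on most systems).
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    while True:
        data, _peer = sock.recvfrom(65535)
        try:
            forward_to_pipeline(data)
        except Exception:
            # Pipeline unreachable: spool locally so nothing is lost while we are down.
            # A separate job would drain the spool once the pipeline is healthy again.
            with open(SPOOL_PATH, "ab") as spool:
                spool.write(data.rstrip(b"\n") + b"\n")
```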

If receiving data is problematic, then pulling data is even more so. Getting data from databases, cloud services, applications, data lakes and many other sources may require you to pull rather than passively receive a data stream. Now you have a plethora of APIs, protocols and authentication mechanisms to deal with. Not only that, but failover becomes more difficult and load balancing can be a nightmare. If two systems pull, how do you know they aren’t fetching the same data, or that there isn’t data that neither of them pulled? And if you stop pulling for a period, is it your responsibility to remember where you were up to, or can the API answer that question for you?
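A checkpoint (or cursor) is the usual answer to the “where was I up to” problem. The sketch below assumes the source API hands back a cursor with each batch; `fetch_events` and `deliver` are hypothetical callables standing in for your API client and your pipeline hand-off.

```python
import json
import pathlib
from typing import Callable, List, Optional, Tuple

CHECKPOINT = pathlib.Path("collector_checkpoint.json")  # remembers where we got up to


def load_checkpoint() -> Optional[str]:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text()).get("cursor")
    return None


def save_checkpoint(cursor: str) -> None:
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps({"cursor": cursor}))
    tmp.replace(CHECKPOINT)


def poll_source(fetch_events: Callable[[Optional[str]], Tuple[List[dict], str]],
                deliver: Callable[[dict], None]) -> None:
    """fetch_events(cursor) -> (events, next_cursor); deliver(event) hands off into the pipeline."""
    cursor = load_checkpoint()
    events, next_cursor = fetch_events(cursor)
    for event in events:
        deliver(event)
    save_checkpoint(next_cursor)  # persist only after successful delivery
```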

And once you have the data, however you receive it, how do you get it into your pipeline? You will almost certainly need to change formats to do that, say to JSON as you send it into a Kafka topic. You are going to have to add your own metadata, at the very least to record where you got the data from. You may also need to add timestamps, particularly if you are dealing with data which has none, or which has them but without time zone labels.
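As a sketch of that hand-off, the snippet below wraps each raw record in a small JSON envelope (raw payload, source type, collector name and a UTC receipt timestamp) and publishes it to a Kafka topic using the kafka-python client. The broker address and the "siem-raw" topic name are purely illustrative.

```python
import json
import socket
from datetime import datetime, timezone

from kafka import KafkaProducer  # kafka-python; assumes a reachable Kafka cluster

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                       # illustrative broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def envelope(raw: bytes, source_type: str) -> dict:
    # Wrap the raw record with our own metadata: what it is, where and when we received it.
    return {
        "raw": raw.decode("utf-8", errors="replace"),
        "source_type": source_type,                        # e.g. "syslog", "winevent"
        "collector": socket.gethostname(),
        "received_at": datetime.now(timezone.utc).isoformat(),  # always stamp in UTC
    }


def ingest(raw: bytes, source_type: str) -> None:
    producer.send("siem-raw", envelope(raw, source_type))  # topic name is illustrative
```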

One thing to note here is that the data entering your pipeline will be larger than the data originally generated on the monitored systems. The extra timestamps, metadata and format changes will see to that. It is often useful to add filtering to your ingestor to stop some data entering the pipeline in the first place if, for instance, it is not security related. This is particularly valuable when you have a common log infrastructure which you want the SIEM to feed from and where the data is intended for many uses.

Parsing and Normalising

Parsing and normalising is the process of unpacking the raw data you received and making some sense of the fields of data contained therein. Strictly speaking we are tokenising rather than parsing, but the industry has decided to call it parsing, so who am I to fight against that? Parsing is essential if you are going to do anything useful with the data, even if that is just to search it, as you will want to search by field. Normalising the data means ensuring you can search consistently and across data sources. A login to a Windows machine will look different to a login on a Unix machine, and different again to a login to a custom web-based application. Your security team, however, will at least on occasion wish to search for logins by a particular user regardless of the data source. Making them query every possible variant will make them inefficient at the very least. Normalising therefore means adding extra fields computed from your data (as you will still wish to look specifically for the Windows login). In this way parsing is going to increase the data volume over what was originally ingested.
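As a small illustration of the idea, the sketch below maps a successful Windows logon (event 4624) and an OpenSSH “Accepted” message onto a shared set of `norm_*` fields while keeping every original field, so analysts can query either the native or the normalised view. The `norm_event`/`norm_user`/`norm_host` names are an assumed internal convention, not a standard.

```python
def normalise_login(parsed: dict, source_type: str) -> dict:
    """Add normalised fields for login events while preserving the source-specific ones."""
    out = dict(parsed)  # keep every original field
    if source_type == "windows" and parsed.get("EventID") == "4624":
        out.update({
            "norm_event": "login",
            "norm_user": parsed.get("TargetUserName"),
            "norm_host": parsed.get("Computer"),
        })
    elif source_type == "linux_auth" and "Accepted" in parsed.get("message", ""):
        out.update({
            "norm_event": "login",
            "norm_user": parsed.get("user"),
            "norm_host": parsed.get("host"),
        })
    return out
```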

The next problem you face will send you back to your ingestion tier: processing timestamps. It is vital that you understand what order things happened in. The timestamp in almost every log is therefore high on the list of things you parse out. But wait. Each log will have several timestamps, ranging from the time the log entry was created to the time it was ingested (and probably one for every intermediary step), and to make things worse, they may all be in different formats – the great thing about standards is that there are so many of them! This however is not the real problem. The real issue here is the Earth, or more specifically that we have different time zones as you travel around it. 09:15 in New Jersey is not the same thing as 09:15 in London, or Paris, or Kiev. Many of these logs will omit the time zone information, so somehow you need to figure it out, or later when searching you will have no clue about the order of events. Having the collectors figure this out for you may help, but you are definitely likely to get this wrong more than once if you work for a global company; particularly when you have logs that do contain time zone information, but it is wrong!
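One workable approach is to convert every timestamp to UTC as early as possible, falling back to a per-source default time zone when the log carries no offset. The sketch below leans on python-dateutil for the messy format handling; the site-to-zone mapping is illustrative and would normally come from collector configuration.

```python
from datetime import timezone
from zoneinfo import ZoneInfo

from dateutil import parser as dtparser  # tolerant of the many timestamp formats in the wild

# Fallback zones for sources that omit offsets (illustrative; normally configuration-driven).
DEFAULT_TZ = {
    "dc-newark": ZoneInfo("America/New_York"),
    "dc-london": ZoneInfo("Europe/London"),
}


def to_utc(raw_ts: str, source_site: str) -> str:
    ts = dtparser.parse(raw_ts)
    if ts.tzinfo is None:
        # No offset in the log line: assume the source site's local zone.
        ts = ts.replace(tzinfo=DEFAULT_TZ.get(source_site, timezone.utc))
    return ts.astimezone(timezone.utc).isoformat()  # one canonical UTC timestamp per event
```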

Parsing itself is a two-step process. The first step, called pre-parsing, is about recognising the data source and selecting the correct parser for it. If your ingestion tier is able to do this then the parsing tier has an easier job. The work still needs to be done, but where you do it has a lot to do with managing the load on your architecture. And load, when it comes to parsing, is an issue. Regular expression matching is at the heart of a configurable parser. As with most things in computer science there are several ways to do anything; with regular expressions it is possible to write computationally expensive patterns or simpler ones which put far less load on your system. Often you find in commercial SIEMs that the out-of-the-box parsers are very efficient but the ones configured for you by professional services are much less so. The issue is that the out-of-the-box parsers seldom work with your data sources, so everything ends up being hand crafted for you.
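The sketch below shows the two steps in miniature: a cheap substring check does the pre-parsing and picks a parser, then a compiled, anchored regular expression does the tokenising. The sshd pattern is a simplified illustration, not a production-grade grammar.

```python
import re
from typing import Optional

# Compiled once at start-up; anchored at the start of the line so non-matches fail fast.
SSHD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\ssshd\[\d+\]:\s"
    r"Accepted\s(?P<method>\S+)\sfor\s(?P<user>\S+)\sfrom\s(?P<src_ip>\S+)"
)


def select_parser(raw: str) -> Optional[re.Pattern]:
    # Pre-parsing: a cheap substring test before any regex work is done.
    if " sshd[" in raw:
        return SSHD_RE
    return None


def parse(raw: str) -> Optional[dict]:
    pattern = select_parser(raw)
    if pattern is None:
        return None                      # unknown source: route to a catch-all handler
    match = pattern.match(raw)           # anchored match is cheaper than an unanchored search
    return match.groupdict() if match else None
```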

When first getting your parsers working you will almost certainly discover that some of the originating systems are sending different formats, or are not sending the fields that you need and expect. There is often a lengthy period where you work with the managers of the various geographies or platforms to get their logging to a level where you can use it.

Once the parser has done its job and tokenised the input, discarding unneeded fields and adding new fields for normalised data, you can start to move on to adding the context, which is where your SIEM starts to add some value! Don’t forget that you will also need to preserve the original raw data, either as a field or separately with a reference. This is particularly important from a non-repudiation perspective and also in some regulated industries where auditors and regulators may wish to see the originals.

Context and Enrichment

We add context to the data for several reasons, most notably because the security operations workflow often involves looking up that context when investigating potential incidents. Context takes many forms. It can be as simple as mapping IP addresses to domain names and vice versa, or it might include adding real names and departments for user accounts. The latter is very important, particularly for analytics-based SIEMs. Users often have multiple accounts, and it becomes important to join up all of those account names and associate them with a real user; in much the same way that one physical machine might have many real and virtual network interfaces and thus many IP addresses. We often wish to query what a user did or what a machine was involved in, and joining the dots when the links are not included in the data can be arduous, particularly where DHCP and other systems mean that an IP address might be associated with one machine at one time and another machine at another time. Without knowing which machine had that IP address at the time of the logs you may be helpless to continue your investigation.
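Resolving “which machine had this IP at that time” is essentially a time-window lookup against lease history. The sketch below assumes an in-memory lease table keyed by IP and sorted by lease start time; in reality it would be fed from DHCP server logs or an IPAM system.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional

# Assumed in-memory lease table: ip -> [(lease_start, hostname), ...] sorted by start time.
LEASES = {
    "10.1.2.3": [
        (datetime(2024, 3, 1, 8, 0), "laptop-jsmith"),
        (datetime(2024, 3, 1, 17, 30), "laptop-akhan"),
    ],
}


def host_for_ip(ip: str, event_time: datetime) -> Optional[str]:
    """Return the host that held this IP at the time of the event, not the one that holds it now."""
    leases = LEASES.get(ip, [])
    starts = [start for start, _ in leases]
    idx = bisect_right(starts, event_time) - 1   # latest lease starting at or before the event
    return leases[idx][1] if idx >= 0 else None
```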

Here we also see that the decision to embed context or to look it up becomes important. Lookups save you storage space but may increase complexity and latency in your pipeline compared with embedded context. They may also be impractical where context changes frequently such as in the DHCP example above.

The context and enrichment part of your pipeline has another important role; that of disambiguation. In the real world that most organisations must live in, there is a problem with the data. IP addresses and account names (amongst other things) might be duplicated across different parts of the organisation. Take for example an organisation which has grown through acquisition. It is very likely that it will have used the same IP addresses across its different divisions and so two (or more) servers may well have the same IP address. Similarly, you may have two smithJ account names in two different domains. Organisations can deal with this from an operational point of view in several ways, but when the logs arrive it might create ambiguity that your SIEM needs to handle. Typically this is done by tagging data upon ingest (adding additional fields to the data) so that we can disambiguate based upon where the data entered our pipeline for instance. It may also be that there is already data in the feed which can help, such as fully qualified domain names etc.

Your contextual information is likely to reside in several places. You may need to pull user data from LDAP-based repositories as well as from HR systems where you need richer organisational context. Data about machines will come from DNS, DHCP and probably a CMDB for richer system data. You can’t imagine how much time your SecOps teams will save if they can see immediately what applications a server hosts and who owns it!

Another source of enrichment is external data such as threat intelligence. This is seldom as valuable as people think, but used correctly it can help to drive decisions further down the pipeline. Beware of adding too much enrichment. Vulnerability and patch data may seem like a good idea, but does knowing a server has a vulnerability actually change your decision making that much? Given that each server is likely to have lots of vulnerabilities, you can quickly bottleneck your pipeline with data you can’t really make use of. Even without these bloating data sources you will already have more than doubled your ingest volume when it comes to storage and indexing. If you wish to use these things then a better approach may be to have them as lookups further down the pipeline or as part of your search facility.

Whilst context is valuable in the SecOps workflow and so must be available in search, it is also vital in analytics as you are likely to want to build models based not just on users and machines but on groups, teams and other collections.

Indexing and Search

The previous stages of the SIEM pipeline tend to be common regardless of the SIEM you are using. Where things start to diverge is where we make use of our enriched data. The first thing you will want to do is index your data for search. The second generation of SIEMs focused much more heavily on search than the previous generation, which had majored on alerting functions, believing that real time was the thing to aim for. With the volume of data you are dealing with, put real time out of your mind. Even if you have a SOAR (incident response automation) system, you are not going to have it take actions in anything approaching real time.

So search is where it is at for second-generation SIEMs, and all that enriched data is going to make for a far more pleasant search experience. If you are building your own SIEM then your best bet here is going to be to select a search environment from the world of open source, something you can live with, and which might get developed in parallel with your SIEM efforts. Elastic is a popular one, though beware: these large open-source projects have a habit of becoming closed faster than you’d expect. Elastic themselves did this by mixing proprietary code into the open-source base and then ensuring there was no development except on the closed-source path. Other solutions exist, such as Solr, but do not be tempted to reach for any sort of general-purpose database here; it simply is not designed for this sort of thing. Another option might be a search or query layer on top of Hadoop, the traditional data lake approach; at least now your data is more structured. These tend not to be very good with time-domain data, which is after all what we have.

When it comes to search there is a definite trade-off between cost and performance. Most SIEM service offerings will try to limit your “hot” searchable data and give you more “warm” search. This means that they have a small amount of very fast but expensive-to-operate search capability and a much larger, less performant but cheaper “warm” tier. Generally, the way this works is that data is written to both of these systems in parallel but queries are routed intelligently. If you have 14 days of hot search and 90 days of warm search, then a query which targets 14 days or less is sent to just the fast (hot) search. A query without a time bound, or which specifically targets a longer period, is sent to both and the results are assembled on the fly as the data is returned. This is actually a really effective approach as it helps with the user experience: users will not accept a long wait for the first results to appear, but they are happy if the first results are there quickly and the rest streams in over seconds or minutes.
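The routing logic itself is simple enough to sketch. Below, queries entirely inside the hot window go to fast storage only; anything older (or unbounded) streams hot results first and then lets the warm tier fill in behind. `search_hot` and `search_warm` are hypothetical callables wrapping whatever backends you choose, and the 14-day split is just the example from above.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional

HOT_DAYS = 14  # illustrative split: 14 days hot, everything older answered by warm storage


def route_query(query: str,
                start: Optional[datetime],
                end: Optional[datetime],
                search_hot: Callable[..., Iterable[dict]],
                search_warm: Callable[..., Iterable[dict]]) -> Iterable[dict]:
    hot_floor = datetime.now(timezone.utc) - timedelta(days=HOT_DAYS)
    if start is not None and start >= hot_floor:
        # Whole window sits inside the hot tier: answer from fast storage only.
        yield from search_hot(query, start, end)
    else:
        # Stream hot results first so the analyst sees something quickly,
        # then let the warm tier fill in the older part of the window.
        yield from search_hot(query, hot_floor, end)
        yield from search_warm(query, start, hot_floor)
```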

Data is rolled out of the hot and warm storage as it exceeds its age limit. With warm storage you may wish to consider more involved roll-off strategies, particularly if you are in a high-compliance environment where some data must be retained for search over a set period of time and other data does not have this restriction placed upon it. This warrants considerable thought, as your roll-off strategy will dictate your index structure for many systems, so changing it later may not be an option. If the long-retained data is just a small part of the total, however, the savings in storage costs and search performance may be huge.

Warm storage comes in many forms and can leverage cloud-native storage solutions (for those of you building in the cloud) as well as Hadoop components. There are also newer technologies such as Snowflake which can be cost effective and performant, but which lack security-specific capabilities. These tools tend to be more suited to business intelligence and require more from the SIEM architect than simpler search technologies.

You may also be aware of the concept of cold or frozen storage. Here the data is taken offline and is “rehydrated” when needed. Some organisations have a need to query data going back many months or even years. Frozen data can be reindexed to make it searchable again, either automatically or as a manual reload process. Saving the index with the data may speed up rehydration at additional storage cost.

At this stage you may well be thinking that something like Splunk might be the answer. You get advanced and performant search, and for free you get the early part of the pipeline such as ingestion and parsing. Leaving aside the insane cost of using Splunk, the real problem is that integrated pipeline, which you have very limited access to. Splunk will work for you provided you want to do what Splunk can do, in a Splunk sort of a way. The Splunk sales team may point out the extensions to the search language which allow you to build alerting and analytics, but these are all built on search as their underlying capability, which limits what you can do and will struggle at typical SIEM volumes. Splunk also has very limited state management, something we will be talking about later when we get on to alerting and analytics.

I have glossed over the indexing part: getting your data out of your pipeline and into the search subsystem. If you have used Elastic or similar technology then you probably have a way of doing that from the technology stack of your chosen search engine. One of the things you will need to consider is whether you need to save the original raw data as a field in your searchable index data, or whether you can simply store it separately in a cheaper storage system and reference it.

In summary then, you need to think about the structure of your indexes when you store this data. How will you search it, and perhaps as importantly, how will you make it unavailable? Variable roll-off can save valuable storage and also improve search performance, but it may require separate indexes for each data source (data source being the usual criterion for variable roll-off). Pick the wrong index structure and this may be difficult or even impossible to achieve. Get it right and you can probably automate the whole thing. Index structure also matters if you want to add lookups for non-log data. Search index construction could probably take an article of its own. Make sure you have thought about this before you start, as changing index structure after your SIEM is live may not be feasible without deleting indexes and starting again.
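One common way to make variable roll-off workable is to write one index per data source per day, so retention becomes a matter of deleting whole indexes rather than rewriting documents. The naming convention and retention figures below are purely illustrative.

```python
from datetime import datetime, timedelta

# Illustrative retention policies, in days, per data source.
RETENTION_DAYS = {"winevent": 90, "netflow": 14, "sox_app": 2555}
DEFAULT_RETENTION = 30


def index_name(source_type: str, event_time: datetime) -> str:
    # e.g. "siem-winevent-2024.03.01": one index per source per day.
    return f"siem-{source_type}-{event_time:%Y.%m.%d}"


def should_roll_off(source_type: str, index_date: datetime, today: datetime) -> bool:
    """A scheduled job would call this for each existing index and delete those past retention."""
    keep = RETENTION_DAYS.get(source_type, DEFAULT_RETENTION)
    return index_date < today - timedelta(days=keep)
```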

One subject that comes up regularly with stored log data is non-repudiation: proving that the logs have not been changed since they were received. This is one reason you need to retain the original raw data. Typically, non-repudiation also requires you to create a hash of the original data and store it somewhere which cannot be changed. Elastic, by contrast, has a simple system where any change to an indexed “document” increments its version number, so it is not about preventing change but about showing whether a change has happened to that data. There are, then, different ways of achieving the same thing. If you are going to implement non-repudiation then make sure it is robust, as you may need to defend it in court!
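If you do roll your own, a hash chain over the raw records is a reasonable starting point: each record’s digest incorporates the previous digest, so altering or removing any record breaks everything after it. This is a minimal sketch; the chain head would still need to be anchored somewhere the SIEM itself cannot rewrite (WORM storage, an external timestamping service, and so on).

```python
import hashlib
import hmac
from typing import List


def record_digest(raw: bytes, previous_digest: bytes) -> bytes:
    # Chain each record's hash to the one before it.
    return hashlib.sha256(previous_digest + raw).digest()


def verify_chain(records: List[bytes], digests: List[bytes], seed: bytes = b"") -> bool:
    """Recompute the chain and compare against the stored digests."""
    prev = seed
    for raw, stored in zip(records, digests):
        prev = record_digest(raw, prev)
        if not hmac.compare_digest(prev, stored):
            return False  # this record, or an earlier one, has been altered or removed
    return True
```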

Lastly, think ahead to how the data might be searched. If you are going to want to generate reports for just a subset of something, say just the Sarbanes-Oxley related servers, then you will either need an indexed field within the data itself which identifies those systems, or you will need a way of searching against an external list. The latter is often far more compute intensive, but the former takes up more storage and also requires that you anticipate every possible subdivision which might be searched.

Alerting

When people think of SIEM they think of alerting. Alerts were one of the earliest features which drove the adoption of SIEM for security monitoring. If you are building a traditional SIEM (rather than an analytics-based one) then alerts will be at the heart of your Security Operations Centre workflows. Alerts are how we reduce the mass of raw data to a manageable stream of security events worthy of some human attention or action.

Alerts generally come in two flavours: simple alerts and correlated alerts. In a nutshell, simple alerts are essentially flags which say “that thing you were looking for, well I just saw it”. They look for a specific value in a specific field, and when you see one you raise the flag. Correlated alerts are a different kettle of fish, however. Not because they provide next-level functionality, but because they are the first element of our pipeline (aside from context) which requires us to maintain state. With correlated alerting you are not looking for a single thing; in terms a database person might use, you are looking for a transaction: a sequence of events, two or more values appearing within some set timeframe. In a good correlated alerting system the second thing you are looking for links to some aspect of the first thing you saw. An example would be to alert when two logins for the same user happen within a set period, where one of them was a local login and the other was a VPN login. SecOps professionals will recognise this as a simple “time of flight” test: you can’t be local and remote at the same time.

The correlation system needs to watch for either type of login, and when it sees one, the SIEM needs to check if it has already seen the other type for that user. For that user is the critical thing here. We are tracking on a per-user basis! It is no good sending me an alert when user one logs in locally and user two uses the VPN. So we now need to handle maintaining state and some form of variable timeout, because if they log in on the VPN today and in an office next week, that probably isn’t something I want to worry about. Correlated alerts were a big thing when they were first developed. Suddenly SIEMs became really useful, far more so than with just simple alerts, but this also marked the start of SIEMs becoming big and complex to install, configure and maintain. The rule base got incrementally more complex with correlated alerting, so you need to think about the configuration of rules, and you also need to start thinking about management tooling for your SIEM so that the security technology team running your SIEM (as opposed to using it) can see which rules are firing and how often. We will talk more about this aspect of the SIEM in a later article.

The nature of the state store for your SIEM will depend very much on your architecture. Correlated alerting tends not to be demanding in terms of latency or bandwidth and scales with the ingest volume, so you may be able to get away with a simple SQL database. Alternatively, Redis or a similar high-performance cache may be better where your volumes are high and you expect to rely heavily on correlated alerting.
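Here is a minimal sketch of the time-of-flight rule using Redis as the state store, via the redis-py client. Each login writes a per-user key with a TTL, so the correlation window doubles as the timeout and stale state simply expires; the key names and the one-hour window are illustrative.

```python
import json
from typing import Optional

import redis  # redis-py; assumes a reachable Redis instance for correlation state

r = redis.Redis(host="localhost", port=6379)
WINDOW_SECONDS = 3600  # illustrative correlation window: opposite login type within an hour


def check_time_of_flight(user: str, login_type: str, event: dict) -> Optional[dict]:
    """login_type is 'local' or 'vpn'; returns an alert when both are seen for the same user."""
    other = "vpn" if login_type == "local" else "local"
    previous = r.get(f"login:{other}:{user}")
    # Remember this login for the window; the TTL is our variable timeout.
    r.setex(f"login:{login_type}:{user}", WINDOW_SECONDS, json.dumps(event))
    if previous is not None:
        return {
            "rule": "time_of_flight",
            "user": user,
            "events": [json.loads(previous), event],
        }
    return None
```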

Once generated, you need to route your alerts to your users in some way. This may be via an internal case management system, or they may be sent to an external system. You may also wish to have a way of damping down alerts. Poorly configured alerts may generate vast numbers of events which can easily swamp downstream systems. This is where alert fatigue comes from: SecOps personnel just end up turning off alerts because there are far too many for them to handle. If this happens your SIEM will not be creating any real value, as it will have stopped you looking for a needle in a haystack only to leave you looking for a needle in a stack of needles.
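A simple suppression window per rule and entity goes a long way towards keeping the alert volume sane. The sketch below keeps that state in memory for clarity; in a real deployment it would live alongside your correlation state, and the 15-minute window is just an example.

```python
import time
from collections import defaultdict
from typing import DefaultDict

SUPPRESS_SECONDS = 900  # at most one alert per rule and entity every 15 minutes (illustrative)
_last_sent: DefaultDict[str, float] = defaultdict(float)


def should_emit(rule_id: str, entity: str) -> bool:
    """Drop repeats of the same alert for the same entity inside the suppression window."""
    key = f"{rule_id}:{entity}"
    now = time.monotonic()
    if now - _last_sent[key] < SUPPRESS_SECONDS:
        return False  # suppressed: downstream systems already know about this condition
    _last_sent[key] = now
    return True
```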

Beware of sending alerts directly to automated response systems. This can have a severe impact on your operating environment and may even be used against you by attackers, who will see it as a way of degrading your defences by generating false positive alerts.

Enough for now

In this article we have introduced the idea of the SIEM as a pipeline and looked at some of the key early stages of that pipeline. In the next article we will explore the later stages and talk about SIEM performance, as well as ancillary functions such as role-based access for SIEMs, audit logging for SIEM users, and SIEM resilience and disaster recovery.

