Sizing for SIEM

Security Information and Event Management (SIEM) is a valuable tool for gaining insight into what is happening, from a security perspective, in your environment. It allows you to react to developing threats and gives you the ability to report upwards to management in a way they can understand. But buying and deploying that first SIEM solution can be very daunting for any security manager, not least because of the challenge of sizing the platform itself. Particularly with cloud-based deployments, costs will be linked to the amount of data you send and possibly to the amount of work your SIEM does on that data, often measured in concurrent searches or similar.

Of course, if you already have a SIEM, you have access to great information about the volume and type of data you are sending it, as well as the queries you run. If you have a SIEM and you still don’t know this stuff, you definitely need a new SIEM!

Before we start to look at how you should go about this, it is important to think about what you want to send to the SIEM. Security managers, particularly those who have come up through technical disciplines, tend to start out by thinking of their favourite data types. If you are a network person, then you start thinking about NetFlow and firewall data, for instance. But before you rush off to throw terabytes of network traffic at your SIEM, let me give you one piece of advice:

The value of data increases as you move up the stack.

Yes, you can get some really good insights by looking at network data, but you have to do a great deal more work to get them. If you want to understand what a user is doing over a web connection, you have to capture all of the session’s traffic and stitch it back together. That data may be encrypted, of course, making life even more difficult. The same information is readily available as a log stream direct from the web server itself, or from a proxy server, provided you can get access. Think about what data you are trying to get and what value it will give you. There are often different ways to get what you need.

Most SIEM technology expects to receive data in standard formats and then uses parsers to break each data stream into fields. After that, some form of normalisation happens so that you can ask for a user’s login without having to create a query that recognises every form of login on every possible system. Systems that can assemble streams of data into transactions tend to be a little more specialised and much more expensive. Don’t make life harder for yourself, or your chosen SIEM vendor, than it needs to be. If you have access to insights from log files and other application-layer sources, use them. In short: plan your deployment, have some idea what you want to achieve, at least in the early stages, and most importantly grab the low-hanging fruit early in your project! Leave the difficult-to-achieve edge cases for later; there is a chance you’ll always have more important things to do than implement them anyway.
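The normalisation idea can be sketched in a few lines. The source names and field names below are invented for illustration; real SIEM parsers are far richer, but the principle is the same: map each source’s field names onto one canonical schema so that a single query works across every feed.

```python
# Hypothetical per-source field mappings onto a canonical schema.
# Real products ship large, maintained mapping libraries; these three
# entries are illustrative only.
FIELD_MAP = {
    "windows":    {"TargetUserName": "user"},
    "linux_auth": {"acct": "user"},
    "vpn":        {"login_id": "user"},
}

def normalize(source: str, event: dict) -> dict:
    """Rename known per-source fields to their canonical names."""
    mapping = FIELD_MAP.get(source, {})
    return {mapping.get(key, key): value for key, value in event.items()}

# One query term ("user") now matches logins from any source.
print(normalize("windows", {"TargetUserName": "alice", "LogonType": 2}))
print(normalize("vpn", {"login_id": "bob"}))
```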

Now – back to figuring out the sizing. You will get into this conversation pretty early on with your vendor; certainly, if you ask for budgetary pricing you will quickly be talking about some sort of volume assessment. And here, let’s take another quick aside and talk about volume! Some vendors will want to talk in GB per day and some will want to talk in Events Per Second (EPS). Be cautious here, as the conversion is not as simple as you might think. Each of your data sources (each type of log file or data feed) will have its own average event size, and this can vary hugely. Low-level traffic can have pretty small events, often down at just three or four hundred bytes per event. High-level data, such as Windows event data, can have very large events, often exceeding a couple of KB per event, and cloud-based data streams can be larger still.
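The conversion itself is simple arithmetic; the trap is the average event size you plug into it. A minimal sketch, with illustrative figures:

```python
# GB/day <-> EPS conversion. The only input that matters is the average
# event size, and the example sizes below are illustrative, not measured.
SECONDS_PER_DAY = 86_400

def gb_per_day_to_eps(gb_per_day: float, avg_event_bytes: float) -> float:
    """Convert a daily byte volume to a sustained events-per-second rate."""
    return gb_per_day * 1e9 / (avg_event_bytes * SECONDS_PER_DAY)

def eps_to_gb_per_day(eps: float, avg_event_bytes: float) -> float:
    """Convert a sustained EPS figure back to a daily byte volume."""
    return eps * avg_event_bytes * SECONDS_PER_DAY / 1e9

# The same 100 GB/day is roughly 2,894 EPS at 400-byte firewall events,
# but only about 579 EPS at 2 KB Windows events.
print(round(gb_per_day_to_eps(100, 400)))
print(round(gb_per_day_to_eps(100, 2000)))
```

The factor-of-five gap between those two EPS figures, for the same byte volume, is exactly why quoting one number without the other misleads.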

What is worse, some SIEM platforms handle this data differently. UEBA (User and Entity Behavioral Analysis) systems, for example, need to build sessions from things like Windows data and perform a great deal more processing on it than on, say, firewall logs. If you want to make your vendor really happy, give them a list of the volume of each data source and let them worry about any conversion. If you do need to convert between byte volume and EPS yourself, perhaps because you are keeping copies of all the data locally, then use a weighted average event size methodology.

Weighted Average Event Size (WAES)

WAES = Σi (Vi × Si)

This formula may look intimidating but it is actually very simple. Vi is the fraction of the total feed contributed by any given data source i, and Si is the average event size for that data source. All this is doing is weighting each feed’s contribution to the overall average event size by how much of the total data that feed represents. In that way, for instance, feeds with very large average event sizes but very low volumes have less of an impact on the WAES than their huge size alone would suggest. This is important because a very small average event size will generate a high EPS for a given volume, whilst a very large event size will generate a much smaller EPS for the same volume; and if you are doing the calculation in reverse, then clearly your storage may be exhausted earlier than expected if your average event size is much bigger than assumed for a given EPS.
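As a quick sketch, here is the same calculation in code. The feed names, shares and event sizes are invented for illustration:

```python
# Weighted Average Event Size: WAES = sum(V_i * S_i), where V_i is each
# feed's share of total event volume and S_i its average event size.
# All figures below are illustrative, not measurements.
feeds = {
    # name: (share of total events, average event size in bytes)
    "firewall":    (0.70, 400),
    "windows":     (0.25, 2000),
    "cloud_audit": (0.05, 4000),
}

waes = sum(share * size for share, size in feeds.values())
print(waes)  # 0.70*400 + 0.25*2000 + 0.05*4000 = 980 bytes
```

Note how the 4 KB cloud feed, despite having by far the largest events, only adds 200 bytes to the weighted average because it is just 5% of the event volume.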

Getting those raw figures can be a problem, however, if you aren’t already collecting the data. It is often here that lots of security engineering managers give up and start guessing. Don’t. Seriously – just don’t! Go to the heads of the departments whose technologies will generate this raw data. A good part of it will come from security tooling in your environment, so you should have no issue there. Much of it, however, will come from the Windows team, or the networking team, or another operational team. These teams will be busy supporting the business and will try to push you away, particularly if they are not already collecting this data: it means work for them to give you estimates that are not guesses. Be cautious of per-machine “guesswork”; it never works like that. The only answer is to turn on logging and see how much data is generated. It doesn’t need to be stored for very long at this stage; you are just interested in the EPS and the average event size. Another challenge may be flexible logging, where you can turn particular fields on and off. Take a look at the documentation and decide what fields you need. Don’t turn them all on, except perhaps while you are unsure what data those fields contain, and even then quickly tune them down to just what you need.
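Once you have a raw sample of a feed, extracting the two numbers you care about is trivial. A minimal sketch, assuming one event per line and a known capture window (the function name and shape are my own, not from any particular tool):

```python
# Estimate EPS and average event size from a log sample captured over a
# known time window. Assumes one event per line of the file.
def sample_stats(path: str, capture_seconds: float) -> tuple[float, float]:
    """Return (events per second, average event size in bytes)."""
    events = 0
    total_bytes = 0
    with open(path, "rb") as fh:
        for line in fh:
            events += 1
            total_bytes += len(line)  # includes the newline delimiter
    eps = events / capture_seconds
    avg_event_size = total_bytes / events if events else 0.0
    return eps, avg_event_size
```

Run it against even an hour’s capture from each source and you have defensible figures instead of guesses.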

The teams in question will tell you that logging will impact performance. Remind them that there are operational benefits in it for them too, and get them to measure the impact of logging. They have probably not had the chance to capture this data themselves before, so get help from your vendor to talk them through what they will see in it – assuming, of course, your vendor deals in IT Ops use cases as well as security. But beware: things like Windows data come with lots of logging facilities. Many Windows event logs can be turned on or off, and these will significantly change the log volume from this high-volume, high-value feed source. Only log what you need, and ensure the right things are turned on when you run these volume tests.

Bear one thing in mind: if you can’t even get good tests of data volumes now, you will never get the full data feed once you have bought your SIEM platform. In short, if you can’t get the volumes, then your SIEM project is going to fail anyway, so you may as well stop here. Do not guess this. It is part of your project: it doesn’t just get you accurate pricing, it also starts to build the relationships you will need with the operational managers across your enterprise, and it gives you a chance to think about your log collection architecture. Don’t take the easy road here – it doesn’t lead anywhere!

Once you have a big list of feeds with their average event sizes (the vendor should be able to supply AES figures, but you are best off measuring them yourself for your environment), you will have a value for both EPS and data volume. Now you need to think about growth. No organisation stands still, and growth compounds: if you think your organisation’s IT estate will grow at 10% per year, then in three years’ time it will NOT be 30% bigger than it is now! Use a proper compound growth calculation to get a more accurate figure, though I accept that this is just more accurate guesswork! Still, I find that going back to ask for more budget nine months after your SIEM project, to handle more storage or ingest, never pleases senior management. If your organisation has a habit of growing by acquisition, then consider that too, and at least have a range of growth possibilities priced. Obviously, if there is a specific acquisition target on the horizon, then factor in the size of that organisation. I have written some articles about security in acquisition situations you might want to find whilst you are at it… Try this one: https://www.financedigest.com/maximising-the-cybersecurity-of-mergers-and-acquisitions.html
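The compound calculation itself is one line; the point is that growth multiplies rather than adds:

```python
# Compound growth: volume * (1 + g) ** years, not volume * (1 + g * years).
def projected_volume(gb_per_day: float, annual_growth: float, years: int) -> float:
    """Project a daily data volume forward under compound annual growth."""
    return gb_per_day * (1 + annual_growth) ** years

# 100 GB/day growing at 10% per year is about 133.1 GB/day after three
# years, not the 130 GB/day a simple 3 x 10% estimate would suggest.
print(round(projected_volume(100, 0.10, 3), 1))
```

The gap widens quickly: over five years the same feed reaches roughly 161 GB/day against the naive estimate of 150.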

The next thing to think about is data retention. If your vendor can’t give you variable retention (often called variable roll-off), then you might need to find a new vendor, particularly if you are in a regulated industry (I’m thinking of financial services here in particular) where you may have some long data retention periods, but only for some data types. The reality is that some data you may need to retain for a year or more, while other data, most particularly high-volume/low-value feeds such as firewall or network data, you may wish to flush out of your system in 30 days or less. This has a big impact on cost but should also significantly improve the performance of searches and reporting. Trust me, variable retention should be at the top of your list of non-functional requirements!
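A back-of-envelope comparison, with invented figures and ignoring compression or replication overheads, shows why variable roll-off matters so much to cost:

```python
# Online storage under per-feed ("variable roll-off") retention versus a
# flat one-year policy. Feed names and figures are illustrative only.
feeds = [
    # (name, GB per day, retention days)
    ("firewall", 80, 30),   # high volume, low value: flush quickly
    ("windows",  15, 365),  # high value: keep for a year
    ("proxy",    25, 90),
]

variable_gb = sum(gb * days for _, gb, days in feeds)
flat_gb = sum(gb for _, gb, _ in feeds) * 365

# Per-feed retention needs a fraction of the flat-policy storage.
print(variable_gb, flat_gb)
```

With these illustrative numbers, per-feed retention needs about 10 TB online against nearly 44 TB for a flat one-year policy, which is the difference the pricing conversation turns on.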

You may be worrying about GDPR and the so-called right to be forgotten. Never fear: there is an exemption available for security information. You can read more about that in my GDPR and Security article.

Next, go back and think about the use cases you want to implement. Try not to bring data in just because it is easy to collect. There is always a temptation to think “bring it in – we will get some value from it”. There may be an argument for doing this as a short-term initiative, but if you aren’t getting value from the data, then cut the feed. It costs you money, warms the planet and tends to obscure the real insights.

And lastly, SIEM deployments are big projects, and you are often still bringing in data feeds years after the initial go-live. Talk to your vendor to ensure you are only paying for the data you actually ingest, not for a volume you may not hit until year three.

Good information about your ingested data volumes is a critical first step when preparing for a new SIEM project. Data volumes and EPS drive price and impact internal data collection architectures. There is value in the data not just for security but for operational teams as well, but they may need data that the security team has no use for. There are many ways to approach a new SIEM project but they all start with the data.
