
It is tradition to start any conversation about graphs by explaining that they are nothing to do with pretty pictures but are in fact a mathematical construct consisting of nodes and edges. I’m not going to explain graphs here, there are any number of great Youtube videos or websites which will do that. What we are going to do here is explain why they are particularly suited to certain Information Security problems.
Before we look at Graph Databases lets start by looking at traditional database types and where they struggle. In a relational database columns (we like to call them tables) of data (each row is an entry) can be related to one another in a number of different ways. One-to-many and many-to-one relationships are quite simple and work by allowing you to SELECT X WHERE X.y = Z.y or in other words where a field in my row equals the value of a field in a row for a different column which can even be in a separate database in the Relational Data Base Management System (RDBMS). Many to many relationships are a bit more complicated and require the use of a lookup column – A table which is simple there to relate entries from other tables together. Many to many queries are more expensive (they require more RDBMS resource to execute) than simpler queries and they are also harder to express in SQL, the problem really gets sticky when you are looking for more complex relationships. Perhaps your query is Many to Many to Many or worse?
In case you are thinking that this is a contrived problem which doesn’t really exist in the real world, think about groups. If you want to find the members of a group or to check if a single item is a member of a particular group then you can do that with a simple query. But as soon as other Groups can be members of a Group you have introduced arbitrary depth Many to Many relationships. A group could be a member of a group which is a member of another group and if you want to find all users who are in a Group even if it is because they are in a Group in that Group… And one group could be in several other groups, and let’s hope that there is no circular nesting of Groups – Group A is in Group B and Group B is in Group C which is in Group A. Relational databases, despite the name, were not designed to deal with these complex relationships.
NoSQL database don’t help us in this situation either but there is a database type which takes such problems in its stride. Yes, you guessed it, the Graph Database. But you may still be wondering if this is a real problem and if so, where it impacts Information Security. Actually this is a problem that almost every large corporate has because they all rely upon Active Directory (or similar technology) and use AD Groups in order to provision users to resources. That is to say that rather than say User X can access Resource Y, they configure an AD Group to have permissions for Resource Y and Users get added to the Group if they need that resource. Except what tends to happen is that users are already in Groups and so the Group of users is put in the Group for the resource – Nesting level 1 and here is our Many to Many relationship. Likely every department will have its own group so the Resource Group will have many Groups in it. Then you create a new resource and realise that everyone who can access the first Group also needs access to the second group, but some will only need access to the second, so you create a new group, add users to that group and add the group for the first resource – You are now at nesting level 2 some Many to Many to Many relationships.
This situation tends to get more and more convoluted as time goes by and pretty soon you have no idea who can access what. Worse, you don’t know who gets to decide who goes into a group so you end-up with a recertification and provisioning nightmare. In fact resource recertification often eats up a considerable amount of compliance or security resource, particularly for regulated industries like Finance. Add to the problem that lazy system administrators will also add groups like EVERYONE or DOMAIN USERS to a group rather than figure out who should actually have access and you have both a compliance and a Security catastrophe. Checking where the EVERYONE group is used is often the first thing an auditor will do!
So getting back on track, why is it that Graph databases are the solution to this problem? Well if you drew out by hand the links between groups and people and resources you’d see that it actually forms a Graph structure. Graph databases store the information in a form which matches the natural form of our data – that has to be a good thing. In addition, normal databases just store values in fields – or key value pairs if your are using a NoSQL type data store. With Graph databases the values of some of the data actually create and support the structure making queries of the time we want to perform very fast indeed.
Whilst we are on the subject of queries, it is worth noting that that the way we query a data store reflects that nature of relationships of objects within that sore. For a relational database, SQL is a natural declarative approach to retrieving the data we want. SQL however does not have the language structure to effectively query a graph database. The most common form of query for a graph DB is a system known as CYPHER. The reason that CYPHER is so effective is that it allows you to easily encode a pattern to search for in your query. Complex patterns which express multiple relationships can be built and easily understood.
For-instance:
match (m)-[:NESTED*0..]->(s:Group)-[:MEMBER]->(a:Account)<-[:ACCOUNT]-(p:Person) return p
is a query which easily allows us to solve our earlier problem or finding all of the effective members of a group, and:
match (m)-[:NESTED*1..]->(s:Group) return count(s)
gives us the number of subgroups regardless of nesting level.
It is not possible to do the same thing in SQL and so you would be forced to create subqueries which might well grow out of control to the point where the database would become unresponsive. This would happen even at relatively shallow levels of nesting such as three or four. A graph database is unconcerned with nesting levels as it was designed to follow edges to nodes. In Graph theory – yes, there is such a thing, we’d call that a Graph Walk and the Graph database encapsulates some Graph theory to identify such things as circular queries. It will also be able to do things like easily give you the shortest path between two nodes.
I used this technique a few years ago to build a compliance tool for a small company providing services mainly to large banks. It was able to return answers to questions such as “Who has access to this resource” and “What does this person have access to” in milliseconds for a company of 140,000 users with AD groups nested to arbitrary depths and even where there was circular nesting at depth twelve or above. Most of the problems I faced in building this system were solved simply by choosing a Graph database as the most sensible storage structure for the data. Creating the queries seemed pretty natural after that.
I used Neo4j for the database and wrote the whole thing in python using the django framework to provide a web interface. The tool was actually far more effective than some commercial products the company had been using that cost upwards of $1MM and required over night processing to pull data. A simple LDAP export to an LDIF file will give you all the data you need and a python script can process even the biggest LDIFF file in minutes to create the database and from there queries are blisteringly fast, even when run on a laptop!
Checkout Neo4j, there is a free version which has all of the functionality you will need to build your own graph databases to answer complex relationship queries. It comes with a great browser based query and visualisation tool.