This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.
We came up with a methodology for clustering nodes in a graph database according to concrete parameters.
We started by developing a Proof of Concept (POC) to test an approximation of the bank’s profiling data, using the following technologies:
The POC began as a 2-month long project in which we rapidly discovered a powerful solution to the bank’s needs.
We decided to use Neo4j, along with Cypher, Neo4j’s query language, because relationships are a core aspect of their data model. Their graph databases can manage highly connected data and complex queries. We were then able to make node clusters thanks to GraphX, an Apache Spark API for running graph and graph-parallel compute operations on data.
Along the way, we decided to challenge another of the more well-known issues faced by banks: Detecting data redundancy in a massive database. Our client needed a function that could help them detect nodes that already had the same set of nodes (or a way to get an equality measure between them) and the possibility to create more complex queries to test their database through this information. For example, taking all the departments in a company and detecting which of them has exactly the same related users.
The above image shows all the possible situations that we can face in a graph database. Departments with exactly the same shared users, department that share part of their users and finally departments which manage totally different groups of user. How can we measure the difference between sets of users to detect redundant departments?
With this in mind, we can move on to our fancy solution!
The first step was the research of a math solution to get a type of measure which could help us with the set differentiation. After a couple of hours, we came across our our new best friend Paul Jaccard, who created a Similarity coefficient with the following formula.
This method allowed us to detect departments with exactly the same users → J(A,B) = 0 or with totally different sets → J(A,B) = 1. Using this method, we developed a solution with Spark to compare all nodes.
As you can see in the above image, at the end of the process we obtain a RDD with the Jaccard Index between every department. This content is then uploaded to Neo4j through a batch process. During the ingestion, we create a relationship between every department node. Each of these relationships is based on the famous Jaccard index.
Chief Analytics Officer Spring 2017
15% off with code MP15
Big Data and Analytics for Healthcare Philadelphia
$200 off with code DATA200
10% off with code 7WDATASMX
Data Science Congress 2017
20% off with code 7wdata_DSC2017
20% off with code AIP17-7WDATA-20