Profiling and segmentation: A graph database clustering solution

This post is about an exciting journey that starts with a problem and ends with a solution. One of the top banks in Europe came to us with a request: they needed a better profiling system.

We came up with a methodology for clustering nodes in a graph database according to concrete parameters.

We started by developing a Proof of Concept (POC) to test an approximation of the bank's profiling data, using the technologies described below.

The POC was a two-month project in which we quickly arrived at a powerful solution to the bank's needs.

We decided to use Neo4j, along with Cypher, Neo4j's query language, because relationships are a core aspect of the bank's data model: a graph database can manage highly connected data and complex queries. We were then able to build node clusters thanks to GraphX, an Apache Spark API for running graph and graph-parallel compute operations on data.

Along the way, we decided to tackle another well-known issue faced by banks: detecting data redundancy in a massive database. Our client needed a function that could detect nodes related to exactly the same set of nodes (or, failing that, an equality measure between those sets), plus the ability to build more complex queries on top of that information. For example: take all the departments in a company and detect which of them have exactly the same related users.

The above image shows all the situations we can face in a graph database: departments with exactly the same users, departments that share part of their users, and finally departments that manage totally different groups of users. How can we measure the difference between sets of users to detect redundant departments?
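As an illustration of the redundancy check, here is a minimal Python sketch (the department and user names are invented) that groups departments by their exact set of related users and flags the fully redundant ones:

```python
# Illustrative data only -- names are hypothetical.
dept_users = {
    "Sales":   {"alice", "bob", "carol"},
    "Support": {"alice", "bob", "carol"},   # same users as Sales -> redundant
    "IT":      {"dave", "erin"},
    "Legal":   {"erin", "frank"},           # partial overlap with IT
}

# Group departments by their exact user set (frozenset makes the set hashable).
groups = {}
for dept, users in dept_users.items():
    groups.setdefault(frozenset(users), []).append(dept)

# Any group with more than one department is fully redundant.
redundant = [sorted(depts) for depts in groups.values() if len(depts) > 1]
print(redundant)  # → [['Sales', 'Support']]
```

This exact-match grouping only catches identical sets; the Jaccard coefficient below generalizes it to a graded measure of overlap.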

With this in mind, we can move on to our fancy solution!

The first step was to find a mathematical measure for quantifying the difference between two sets. After a couple of hours, we came across our new best friend, Paul Jaccard, who defined a similarity coefficient with the following formula: J(A,B) = |A ∩ B| / |A ∪ B|.

This coefficient allowed us to detect departments with exactly the same users → J(A,B) = 1, or with totally different sets → J(A,B) = 0. Using this measure, we developed a solution with Spark to compare all nodes.
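A minimal sketch of the coefficient in Python (the production solution ran on Spark; the user names here are invented):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

print(jaccard({"alice", "bob"}, {"alice", "bob"}))  # 1.0 (identical sets)
print(jaccard({"alice", "bob"}, {"carol"}))         # 0.0 (disjoint sets)
print(jaccard({"alice", "bob"}, {"bob", "carol"}))  # 1/3 (one shared user)
```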

As you can see in the above image, at the end of the process we obtain an RDD with the Jaccard index between every pair of departments. This content is then uploaded to Neo4j through a batch process. During the ingestion, we create a relationship between every pair of department nodes, each carrying its Jaccard index.
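A simplified sketch of the pairwise comparison and the shape of the Neo4j ingestion (plain Python stands in for the Spark RDD job; the department names and the SIMILAR relationship type are assumptions, not the bank's actual schema):

```python
from itertools import combinations

# Illustrative data -- in the real pipeline this lived in an RDD.
dept_users = {
    "Sales":   {"alice", "bob"},
    "Support": {"alice", "bob"},
    "IT":      {"carol"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

# Compare every pair of departments once, producing (d1, d2, index) rows.
pairs = [
    (d1, d2, jaccard(dept_users[d1], dept_users[d2]))
    for d1, d2 in combinations(sorted(dept_users), 2)
]
print(pairs)

# Each row could then be batch-loaded into Neo4j with a parameterised
# Cypher statement along these lines (relationship type is hypothetical):
cypher = """
UNWIND $rows AS row
MATCH (a:Department {name: row.d1}), (b:Department {name: row.d2})
MERGE (a)-[s:SIMILAR]->(b)
SET s.jaccard = row.index
"""
```

Storing the index as a relationship property means later queries can filter on a similarity threshold (e.g. `WHERE s.jaccard = 1.0` for fully redundant departments) without recomputing anything.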
