Resolving 3 Crucial Bottlenecks of Data Processing in BI Software

I have been working with databases and business intelligence systems for close to 20 years now, and what always struck me as odd was how slow and cumbersome these systems tend to be. Waiting hours or days for answers used to be a regular affair, and each new query would often require weeks of advance work and careful scripting to create indexes and denormalize the data.
Over the years computers have become faster and RAM has become cheaper; however the field of business intelligence has also evolved, and what would have been acceptable, or even exceptional in the past, simply won’t cut it today: data is much bigger in size and more complex; it comes from new and unpredictable sources, both structured and unstructured; the questions organizations ask about their data are more complicated; and the answers need to be delivered immediately or they will become irrelevant in today’s rapid decision-making processes.
Has business intelligence software kept up with the demands of the modern business world? Only partially: while modern tools provide great performance for smaller and simpler datasets, they tend to buckle under the pressure of dealing with big data, disparate data sources, or many concurrent users. To meet these needs, companies are usually required to spend significantly more resources on massive hardware and IT infrastructures, or to use a patchwork of software products – data warehouses, ETL, and front-end visualization tools – in order to ensure reasonable performance.
But this doesn’t have to be the case: essentially the problem boils down to the speed at which computers access, scan and process data. There are three major bottlenecks that affect this, and below I present each obstacle and how it can be overcome in order to deliver high-performance, low-latency business analytics software.
1. Tables, Columns and Rows: The I/O Bottleneck
Traditional relational databases – SQL Server, Oracle, DB2 and so on – are row-based. This means that they take data organized in tables and physically represent it, in terms of the way it is stored on disk or in memory, as a single block of serialized rows.
Relational databases are very efficient for retrieving a specific record within the data – e.g., in a table of spending by country and department, to answer the question: how much did the company spend in India? However, once aggregations and comparisons need to be made – e.g., to answer the question: where did the company spend the most on marketing? – there is a significant slowdown in performance, since the system needs to scan the entire table in order to find the matching records. And the wider the table, the more time must be spent scanning it.
A columnar database, on the other hand, would store the same table as columns. Thus, to answer the question of the country with the highest marketing spend, the database only needs to scan a single column, and then retrieve the relevant record. While the differences are of course minor with such a small dataset, when we’re looking at tables with millions of rows and dozens of columns – the difference in performance becomes significant.
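The difference between the two scan patterns can be sketched in a few lines of code. This is a toy illustration with made-up spending figures, not the internals of any actual database engine:

```python
# Toy comparison of row-oriented vs. column-oriented scans.
# The table (country, department, spend) is hypothetical example data.

rows = [
    ("India", "Marketing", 120_000),
    ("India", "R&D", 300_000),
    ("Brazil", "Marketing", 95_000),
    ("Germany", "Marketing", 210_000),
]

# Row store: answering "where did we spend the most on marketing?"
# forces a pass over every full row.
best_row = max((r for r in rows if r[1] == "Marketing"), key=lambda r: r[2])

# Column store: the same table kept as parallel columns.
countries = [r[0] for r in rows]
departments = [r[1] for r in rows]
spend = [r[2] for r in rows]

# Only two narrow columns are scanned; the country is looked up last.
best_i = max(
    (i for i, d in enumerate(departments) if d == "Marketing"),
    key=lambda i: spend[i],
)
assert countries[best_i] == best_row[0]
```

With four rows both approaches are instant, but the columnar scan touches only the bytes of the two relevant columns, which is exactly the property that pays off at millions of rows and dozens of columns.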
This makes a world of difference for business intelligence because modern BI is all about ad-hoc, exploratory analysis. This means giving users the ability to look at the data from many angles and to ask new questions about it – all of which essentially boils down to aggregating data, comparing records and performing calculations. To return to our example: “how much did we spend in India?” is perhaps interesting, but “where did we spend the most money?” takes us that much closer to an actionable insight – the ‘holy grail’ of contemporary business intelligence.
Columnar storage enables us to bypass this first bottleneck. And indeed, after years of relying on legacy RDBMS systems, the second wave of business intelligence tools began leveraging columnar databases. However, it is at this stage that we encounter…
2. RAM, CPU, and Disk: The Memory Bottleneck
To understand this problem, we first need to briefly explain how computers process data. Raw data is usually stored on the computer’s hard drive – the cheapest and most readily available type of storage. However, to actually “crunch” the data – scan the relevant records and perform calculations on them – the data must be transferred from disk to RAM and from there to the CPU, where the actual crunching takes place. This flow of data between memory layers involves latencies.
Advancements in computing technology have made CPUs much faster, and RAM much cheaper and more abundant. This brought about the second wave of business intelligence tools, often known as “in-memory”: these tools move all of the data into RAM, thus ostensibly ensuring fast performance by removing the need to constantly shift data from its storage place on the hard drive.
However, as data continued to grow in size and complexity, the limitations of this approach became apparent: while RAM is not as expensive as it was twenty years ago, it is still not a cost-effective solution for terabyte-scale data, and each new query eats up more valuable computational resources; furthermore, there is still a significant latency involved in transferring the data from RAM to the CPU.
The solution to this problem is technically complex, but in simple terms boils down to optimizing the way computational resources are distributed. Instead of cramming the entire dataset into memory, we can continue to apply ‘columnar’ logic – identify the data that is actually needed to answer the business questions at hand, and move these specific columns into RAM while the rest of the data remains “at rest” on the hard drive.
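The idea of keeping most data “at rest” on disk and pulling only the queried columns into RAM can be sketched as follows. The file-per-column layout and the function names here are assumptions for illustration, not a description of any specific product:

```python
import json
import os
import tempfile

# Sketch of 'columns at rest on disk, hot columns in RAM'.
# Assumption: each column is persisted in its own file.
workdir = tempfile.mkdtemp()

def write_column(name, values):
    with open(os.path.join(workdir, name + ".col"), "w") as f:
        json.dump(values, f)

write_column("country", ["India", "Brazil", "Germany"])
write_column("spend", [120_000, 95_000, 210_000])

ram_cache = {}  # columns currently held in memory

def load_column(name):
    # Only the columns a query actually touches are pulled into RAM;
    # everything else stays on disk.
    if name not in ram_cache:
        with open(os.path.join(workdir, name + ".col")) as f:
            ram_cache[name] = json.load(f)
    return ram_cache[name]

# A query over total spend loads one column; 'country' never leaves disk.
total = sum(load_column("spend"))
```

The point of the sketch is the access pattern: the query touches `spend`, so only `spend` occupies RAM, leaving memory free for other columns, other queries and other users.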
Additional optimization is achieved in the transfer of data from RAM to CPU — again, taking only the parts that are strictly necessary, compressing the data in RAM and then decompressing it in the CPU. This is the basis of In-Chip technology.
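The compress-in-RAM, decompress-at-use idea can be illustrated with a toy example. Real in-chip engines rely on vectorized, cache-friendly encodings decompressed inside the CPU; generic zlib over a JSON-encoded column is merely a stand-in to show the trade-off:

```python
import json
import zlib

# A column of one thousand spend values, held compressed in RAM.
spend = list(range(1_000))
raw = json.dumps(spend).encode()
compressed = zlib.compress(raw)  # this is what actually occupies RAM

def scan_sum(blob):
    # Decompress only at the moment of computation, then discard,
    # so the working set stays small.
    return sum(json.loads(zlib.decompress(blob)))

total = scan_sum(compressed)

# The compressed blob is far smaller than the raw encoding it replaces.
assert len(compressed) < len(raw)
```

The win is that more columns fit in fast memory at once; the cost is the decompression work, which modern CPUs can do faster than they can wait on extra memory traffic.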
In technological terms this is not simple to achieve, as it requires the BI system to bypass the operating system and directly access the CPU’s instruction set. However, all of this happens ‘under the hood’, and the end result is that the same hardware infrastructure can bear much heavier loads – in terms of the total size of the data, the complexity of the data model, or the number of concurrent users. Which leads us to…
3. Scaling Users: The Concurrency Bottleneck
As I mentioned earlier, modern data analytics is focused on exploratory analysis. This means that each user has the ability to ask ad-hoc questions, which the system cannot predict in advance yet still needs to answer on the spot. But what happens when there are dozens, hundreds or thousands of concurrent users, each posing his or her own ad-hoc questions to the data?
This is where many BI systems, even more advanced ones, will stutter. Queries take longer to return results, memory resources run low and the data must be simplified, denormalized or pre-indexed, which limits the scope of analysis and does not fully solve the performance issues.
But here too this is not an irredeemable state of affairs. In-Chip technology offers a solution in the form of query recycling. This is essentially a learning algorithm that breaks queries down into reusable chunks, which are then “recycled” in order to answer future queries.
New queries will often contain components that also appeared in older ones. When a new query is made, the In-Chip engine fetches the results of previously computed subqueries and uses them as a stepping stone to perform the new calculation. Thus, as more queries are made, the system accumulates more previously answered subqueries – meaning it will actually be able to answer new queries faster as the number of users increases.
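Query recycling can be sketched as a cache of sub-results keyed by the filter they answer. The decomposition scheme below (caching the set of matching row indices) is an illustrative assumption, not the actual In-Chip algorithm:

```python
# Sketch of query recycling: break aggregate queries into reusable
# sub-results and serve repeated components from a cache.

departments = ["Marketing", "R&D", "Marketing", "Marketing"]
spend = [120_000, 300_000, 95_000, 210_000]

subquery_cache = {}  # (column, filter value) -> matching row indices

def matching_rows(dept):
    key = ("departments", dept)
    if key not in subquery_cache:
        # Cache miss: scan the column once and remember the result.
        subquery_cache[key] = [
            i for i, d in enumerate(departments) if d == dept
        ]
    return subquery_cache[key]

def total_spend(dept):
    return sum(spend[i] for i in matching_rows(dept))

def max_spend(dept):
    # A different aggregate, but it recycles the same cached row set.
    return max(spend[i] for i in matching_rows(dept))

total_spend("Marketing")  # first query: computes and caches the row set
max_spend("Marketing")    # second query: skips the scan entirely
```

The second query never rescans the `departments` column, which is why a busier system, counter-intuitively, can grow faster over time: every answered query leaves behind reusable work.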
For BI tools to stay relevant in an age of growing data and an endless stream of new data sources, they will have to rely on a strong technological back-end in order to provide the fast, actionable insights business users have come to expect.