Chris Snow is a data and application architect at IBM, who loves helping customers with their data architectures. With over 20 years of IT experience working with business stakeholders at all levels, Snow’s experience spans banking, government, insurance, manufacturing, retail and telecommunications. He is currently focused on IBM Cloud Data Services and emerging technologies such as big data streaming architectures. In his spare time, Snow leads and contributes to an open source project that provides executable examples for IBM BigInsights for Apache Hadoop. The examples kick-start your BigInsights projects, allowing you to move at warp speed on your big data use cases. The project can be found on GitHub.
One of the key roles of today’s data scientists and other data practitioners is exploratory analysis: taking a data set you don’t know much about and working out what it contains and what might be valuable in it. New data scientists and practitioners sometimes have a tendency to jump straight in and run their data through a machine learning algorithm. However, the first and most important thing to do when you get hold of a new data set is to make some plots and visualize the data.
A famous example is Anscombe’s quartet: four data sets with almost identical summary statistics, such as mean, variance, x/y correlation and linear regression line. But when you plot the data, you realize that the four data sets are completely different.
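The quartet is small enough to check directly. A quick sketch using only the Python standard library (the data values below are Anscombe’s published ones) prints near-identical statistics for all four sets:

```python
from statistics import mean, variance

# Anscombe's quartet: four (x, y) data sets with nearly identical
# summary statistics but very different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for name, (xs, ys) in quartet.items():
    print(name, round(mean(xs), 2), round(mean(ys), 2),
          round(variance(xs), 2), round(variance(ys), 2),
          round(pearson(xs, ys), 2))
# Every set: x mean 9.0, y mean ~7.5, x variance 11.0, correlation ~0.82
```

Run the same four sets through a scatter plot, though, and you get a straight line, a curve, an outlier-skewed line and a vertical column of points.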
That’s why visualization is so important; if you can see the shape of the data, you can start drawing useful conclusions about what it means. And that’s the first reason why the Jupyter Notebook is such a great tool. Notebooks give people who work with data the ability to plot and visualize data sets quickly and easily, see whether they are skewed or have other problems, and decide how to work with the data to turn it into something valuable.
Notebooks have the potential to change the whole culture around the way data practitioners and data scientists report their results, because they help you prove that your methods are sound and your work is reproducible. The code itself is embedded in the notebook, where it can be audited and rerun against your data by anyone who needs to work with the data, not just data scientists. The notebooks are also self-documenting: you can add narrative sections explaining what you did and why, making it easy for readers to follow your logic and check your methods and assumptions.
Taking a slightly longer-term view, notebooks are very likely to spread far beyond the traditional domain of data science and data engineering. In the future, I think we’re going to see all kinds of business users benefiting from some type of Jupyter Notebook application.
The problem with some of the standard programming tools used in Jupyter Notebooks, such as Python and R, is that they don’t scale well over multiple cores or processors. If you’re running a notebook against a data set that fits on your laptop, that limitation is not really a problem. But if you’re working with a larger data set, or doing something that involves a more sophisticated algorithm such as training a machine learning model, you’ll quickly run into the limits of the hardware, whether in storage, RAM, CPU speed or all three.
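The core-scaling limit is easy to see in standard CPython, where the global interpreter lock prevents threads from executing Python bytecode in parallel. This sketch (timings will vary by machine) splits a CPU-bound sum across four threads and gets the same answer with little or no speedup:

```python
import threading
import time

def busy_sum(start, stop, out, idx):
    # CPU-bound work: no I/O, so CPython's GIL serializes the threads.
    out[idx] = sum(i * i for i in range(start, stop))

N = 2_000_000
results = [0] * 4
chunk = N // 4

t0 = time.perf_counter()
threads = [threading.Thread(target=busy_sum,
                            args=(i * chunk, (i + 1) * chunk, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - t0

t0 = time.perf_counter()
single = sum(i * i for i in range(N))
single_time = time.perf_counter() - t0

print(f"threads: {threaded_time:.2f}s  single: {single_time:.2f}s")
# On standard CPython the two times are typically similar: there are
# four threads, but only one can execute Python bytecode at a time.
```

Multiprocessing can work around this on a single machine, but once the data itself outgrows one machine, you need a framework that distributes both the data and the computation across a cluster.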
Spark solves that problem.