Accelerating data applications with Jupyter Notebooks, Hadoop and Spark

Chris Snow is a data and application architect at IBM, who loves helping customers with their data architectures. With over 20 years of IT experience working with business stakeholders at all levels, Snow’s experience spans banking, government, insurance, manufacturing, retail and telecommunications. He is currently focused on IBM Cloud Data Services and emerging technologies such as Big Data streaming architectures. In his spare time, Snow is the leader and a contributor on an open source project that provides executable examples for IBM BigInsights for Apache Hadoop. The examples kick-start your BigInsights projects, allowing you to move at warp speed on your big data use cases. The project can be found on GitHub.

One of the key roles of today’s data scientists and other data practitioners is exploratory analysis—handling data sets that you don’t know much about and understanding what they are and what’s potentially valuable about them. Sometimes new data scientists and practitioners have a tendency to jump straight in and start running their data through a machine learning algorithm. However, the first and most important thing to do when you get hold of a new data set is to make some plots and visualize the data.

A famous example is Anscombe’s quartet—four data sets that have almost identical statistical properties when you look at their standard summary statistics such as mean, variance, x/y correlation and linear regression. But when you plot the data, you realize that the four data sets are completely different.
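As a quick illustration, here is a minimal sketch (assuming pandas, seaborn and matplotlib are installed; seaborn happens to ship Anscombe’s quartet as a built-in sample dataset) that prints the near-identical summary statistics and then plots the four very different shapes:

import seaborn as sns
import matplotlib.pyplot as plt

# seaborn bundles Anscombe's quartet as a sample dataset
df = sns.load_dataset("anscombe")

# The per-dataset means and standard deviations are almost identical...
print(df.groupby("dataset").agg(["mean", "std"]))

# ...but a simple plot shows four completely different shapes
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, height=3)
plt.show()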

That’s why visualization is so important; if you can see the shape of the data, you can start drawing some useful conclusions about what it means. And that’s the first reason why Jupyter Notebooks are such a great tool. Notebooks give people who work with data the ability to plot and visualize data sets very quickly and easily, see whether they are skewed or have other problems, and decide how to work with the data to turn it into something valuable.
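In practice, that first look can be a couple of notebook cells. Here is a rough sketch, assuming pandas and matplotlib, where the file name measurements.csv and the value column are hypothetical stand-ins for whatever data set you have just been handed:

import pandas as pd
import matplotlib.pyplot as plt

# "measurements.csv" and the "value" column are hypothetical placeholders
df = pd.read_csv("measurements.csv")

print(df.describe())      # summary statistics for every numeric column
print(df.isna().sum())    # missing values per column

# A histogram makes skew, outliers and gaps obvious at a glance
df["value"].hist(bins=50)
plt.show()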

Notebooks have the potential to change the whole culture around the way data practitioners and data scientists report their results, because they help you prove that your methods are sound and your work is reproducible. The code itself is embedded in the notebook, where it can be audited and rerun against your data by other data users, and that includes anyone who needs to work with the data, not just data scientists. The notebooks are also self-documenting, because you can add narrative sections to explain what you did and why, which makes it easy for readers to follow your logic and check your methods and assumptions.

Since the notebooks themselves are just JSON (JavaScript Object Notation) documents, tools such as GitHub can render previews of them, so you can browse other people’s notebooks and find interesting analyses that you might want to download, try and tinker with for yourself. So notebooks are really helping people to share knowledge and democratize data science as a discipline.
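To make the "just JSON" point concrete, here is a rough sketch of the structure GitHub renders when it previews a notebook (the cell contents and the analysis.ipynb file name are made up for illustration), written out from Python:

import json

# A minimal notebook in the nbformat 4 layout: top-level metadata plus a list of cells
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["## Why the 2016 records were filtered out\n"]},
        {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [],
         "source": ["df = df[df.year < 2016]\n"]},
    ],
}

with open("analysis.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)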

Looking at a slightly longer-term view, notebooks are very likely going to spread far beyond the traditional domain of data science and data engineering. In the future, I think we’re going to see all kinds of business users benefiting from some type of Jupyter Notebooks application.

The problem with some of the standard programming tools used in Jupyter Notebooks, such as Python and R, is that they don’t scale very well over multiple cores or processors. If you’re running a notebook against a data set that fits on your laptop, that’s not really a problem. But if you’re working with a larger data set, or doing something that involves a more sophisticated algorithm such as training a machine learning model, you’re quickly going to run into the limitations of the hardware, whether that limitation is storage, RAM, CPU speed or all three.

Spark solves that problem.
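Here is a minimal PySpark sketch of that idea, assuming a running Spark installation and a hypothetical events.parquet data set: the same kind of group-and-aggregate exploration you would do in pandas, but executed in parallel across the cluster rather than on the notebook's own machine.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("notebook-exploration").getOrCreate()

# The read and the aggregation below are distributed across the executors
events = spark.read.parquet("events.parquet")

summary = (events
           .groupBy("country")
           .agg(F.count("*").alias("events"),
                F.avg("duration").alias("avg_duration")))

summary.show()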

 


