Which freaking big data programming language should I use?

You have a big data project. You understand the problem domain, you know what infrastructure to use, and maybe you've even decided on the framework you will use to process all that data, but one decision looms large: What language should I choose? (Or perhaps more pointedly: What language should I force all my developers and data scientists to suffer?) It's a question that can be put off for only so long.

Sure, there's nothing stopping you from doing big data work with, say, XSLT transformations (a good April Fools' suggestion for tomorrow, simply to see the looks on everybody's faces). But in general, there are three languages of choice for big data these days -- R, Python, and Scala -- plus the perennial stalwart enterprise tortoise of Java. What language should you choose and why ... or when?

Here's a rundown of each to help guide your decision.

R is often called "a language for statisticians built by statisticians." If you need an esoteric statistical model for your calculations, you'll likely find it on CRAN -- it's not called the Comprehensive R Archive Network for nothing, you know. For analysis and plotting, you can't beat ggplot2. And if you need to harness more power than your machine can offer, you can use the SparkR bindings to drive Spark from R.

However, if you are not a data scientist and haven't used MATLAB, SAS, or Octave before, it can take a bit of adjustment to be productive in R. While it's great for data analysis, it's weaker for general-purpose work. You'd construct a model in R, but you would consider translating the model into Scala or Python for production, and you'd be unlikely to write a clustering control system using the language (good luck debugging it if you do).

If your data scientists don't do R, they'll likely know Python inside and out. Python has been very popular in academia for more than a decade, especially in areas like Natural Language Processing (NLP). As a result, if you have a project that requires NLP work, you'll face an embarrassing number of choices, including the classic NLTK, topic modeling with Gensim, or the blazing-fast and accurate spaCy. Similarly, Python punches well above its weight when it comes to neural networks, with Theano and TensorFlow; then there's scikit-learn for machine learning, as well as NumPy and pandas for data analysis.

There's Jupyter/IPython too -- the Web-based notebook server that allows you to mix code, plots, and, well, almost anything, in a shareable logbook format. This was long one of Python's killer features, although these days the concept has proved so useful that it has spread across almost all languages that have a Read-Evaluate-Print Loop (REPL), including both Scala and R.

Python tends to be supported in big data processing frameworks, but at the same time, it tends not to be a first-class citizen. For example, new features in Spark will almost always appear first in the Scala/Java bindings, and it may take a few minor versions for those updates to be made available in PySpark (especially true for the Spark Streaming/MLlib side of development).

Unlike R, Python is a traditional object-oriented language, so most developers will be fairly comfortable working with it, whereas first exposure to R or Scala can be quite intimidating. A slight issue is that Python requires correct whitespace in your code. This splits people between those who say "this is great for enforcing readability" and those of us who believe that in 2016 we shouldn't need to fight an interpreter to get a program running because a line has one character out of place (you might guess where I fall on this issue).
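
To make the whitespace complaint concrete, here is a minimal sketch using only Python's built-in compile(). The snippet strings and the helper name compiles_cleanly are made up for illustration; the two snippets differ by a single space of indentation on their last line, and only one of them gets past the interpreter.

```python
# Two hypothetical snippets that differ by one space on the last line.
SNIPPET_OK = (
    "for i in range(3):\n"
    "    total = i\n"
    "done = True\n"
)
SNIPPET_BAD = (
    "for i in range(3):\n"
    "    total = i\n"
    " done = True\n"  # one stray leading space: matches no outer indentation level
)


def compiles_cleanly(source):
    """Return True if Python accepts the source, False on an indentation error."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except IndentationError:
        return False


if __name__ == "__main__":
    print("ok snippet compiles: ", compiles_cleanly(SNIPPET_OK))
    print("bad snippet compiles:", compiles_cleanly(SNIPPET_BAD))
```

Whether that strictness is a readability win or a nuisance is exactly the split described above.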

Ah, Scala -- of the four languages in this article, Scala is the one that leans back effortlessly against the wall with everybody admiring its type system.


