Let’s be honest: there are two good reasons to learn a new programming language. The first is that you need it for your daily job; the second is that it’s fun.
If you work in Data Science, Scala is a language you will likely want to learn by the end of this post. Why? Because it is a distribution-ready language, it is Open Source, it runs on the JVM, it is interactive, and Apache Spark, which can process billions of records with good performance, is written almost entirely in Scala.
First, a bit of history. Scala was created by Martin Odersky in 2003. It is Open Source, which means, among other things, high interoperability with other Open Source tools written in Java. Scala runs on the Java Virtual Machine (JVM) and interoperates with Java: you can call Java code from Scala, and a Scala class can extend a Java class. I assume we can agree that no single tool covers the whole data-analysis process, so integration with other tools is key.
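To make that concrete, here is a minimal sketch of both sides of the interop claim: calling a Java standard-library API from Scala, and a Scala class extending a Java class. The names `CountingStream` and `InteropDemo` are illustrative, not from the original post.

```scala
// Calling Java code from Scala, and extending a Java class in Scala.
import java.time.LocalDate            // a plain Java API, used from Scala
import java.io.ByteArrayOutputStream  // the Java class we will extend

// A Scala class extending a Java class and adding a Scala-style method.
class CountingStream extends ByteArrayOutputStream {
  def sizeInKb: Double = size() / 1024.0
}

object InteropDemo extends App {
  // Calling Java looks exactly like calling Scala.
  val today: LocalDate = LocalDate.now()
  println(s"Today is $today")

  val stream = new CountingStream
  stream.write("hello from Scala".getBytes("UTF-8"))
  println(f"Buffered ${stream.sizeInKb}%.3f KB")
}
```

Because Scala compiles to JVM bytecode, the reverse also holds: Java code can instantiate and use `CountingStream` like any other Java class.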
Let’s agree that, these days, the way to get more processing power is to scale out (add more cores and machines to the infrastructure) rather than to scale up (make individual cores faster). In this scenario, parallelization is how you get good performance. Scala is a distribution-ready language, meaning the same code can run on a single-core machine or on as many cores as are available for the task. This matters if you want machine learning workloads to be optimized and perform well: the language and its runtime take care of spreading the work over the infrastructure. “Once you have distributed computing available the next step is to do Data Science” (Andy Petrella).
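The “same code, more cores” idea can be sketched with Scala’s parallel collections (on Scala 2.13+ these live in the separate `scala-parallel-collections` module; on 2.12 and earlier they are built in). The only change between the sequential and the parallel version below is the call to `.par`; the runtime distributes the work over the available cores. `ParallelDemo` is an illustrative name, not from the original post.

```scala
// "Same code, more cores" with Scala parallel collections.
// Requires the scala-parallel-collections module on Scala 2.13+.
import scala.collection.parallel.CollectionConverters._

object ParallelDemo extends App {
  val numbers = (1 to 1000000).toVector

  // Sequential version.
  val seqSum = numbers.map(n => n.toLong * n).sum

  // Parallel version: adding .par is the only change;
  // the map and sum are spread over the available cores.
  val parSum = numbers.par.map(n => n.toLong * n).sum

  assert(seqSum == parSum)
  println(s"Sum of squares: $parSum")
}
```

This is parallelism on a single machine; for computation distributed across many machines, the same functional style (`map`, `reduce`, and friends) carries over to Spark’s RDD and DataFrame APIs.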