Why Python (IT Best Kept Secret Is Optimization)

Why are you recommending Python? That’s the question a colleague of mine asked when I was pitching Python for data science work. It is a fair question, and I tried to answer with facts, not opinions. Indeed, answering a question about why one language is better than others can quickly turn into a religious war. So let me try to avoid that with some disclaimers. First of all, I don’t think one size fits all: Python is not going to become THE programming language. Depending on the task, other languages are a much better fit. For instance, Java suits enterprise applications solving well-defined problems. Fortran, C, and C++ are great for HPC. C is dominant for systems programming. JavaScript with Node.js, or PHP, are de facto standards for web site implementation. I could go on forever, as many languages fit a niche. But when it comes to data science, Python has taken the lead. Let’s look at facts before you start arguing with me.

I am not the only one saying Python has the lead. Here is a first fact supporting this: the job trends for data science related topics on

These job trends are for three queries, one per language, each combining the language name with the same set of topics: Python and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), Scala and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), and R and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”).
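
Since the three queries differ only in the language name, they can be generated rather than typed by hand. Here is a minimal sketch of that; the names `TOPICS`, `LANGUAGES`, and `build_query` are mine, for illustration only, and I am assuming the job site accepts plain `and`/`or` with quoted phrases, which you should check against whatever search syntax it actually supports.

```python
# Topic phrases shared by all three queries, as listed in the text.
TOPICS = [
    "data science",
    "big data",
    "statistical analysis",
    "data mining",
    "machine learning",
]

# The three languages being compared.
LANGUAGES = ["Python", "Scala", "R"]


def build_query(language, topics=TOPICS):
    """Build '<language> and ("topic 1" or "topic 2" or ...)'.

    Hypothetical helper: the and/or syntax is an assumption,
    not a documented API of any particular job site.
    """
    topic_clause = " or ".join(f'"{topic}"' for topic in topics)
    return f"{language} and ({topic_clause})"


# One query string per language, keyed by language name.
queries = {language: build_query(language) for language in LANGUAGES}

for query in queries.values():
    print(query)
```

Keeping the topic list in one place guarantees the three languages are compared on exactly the same terms, which is the whole point of the comparison.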

I selected R, Python, and Scala for this comparison because they are the most popular open source languages for data science. R has long been the dominant open source language for statisticians and, by extension, for data science. But we see that Python has been taking over for a couple of years. Scala is a recent contender because of its link to Spark and Spark ML, but it is still a quite distant follower.

What about commercial software? I do think that SPSS Modeler, for instance, is here to stay as well. But its target is a bit different from that of R, Python, or Scala. Indeed, SPSS Modeler is point-and-click software aimed at non-programmers. With SPSS Modeler one draws the machine learning pipeline, whereas one programs it in Python, R, or Scala. It is because of this difference that I did not include SPSS Modeler in the comparison, as it would be comparing apples to oranges.

Back to open source, here are other signs of Python’s popularity. The table below includes the number of questions on Stack Overflow, the number of packages in the main package repository for the language, and the programming community index. For Scala, to be fair, one should count all Java libraries. I did not find a simple way to evaluate their number, hence I left it blank.

These numbers measure the strength and popularity of the ecosystems built around these languages. Indeed, when comparing languages, one should not just do a feature-by-feature comparison or run efficiency benchmarks. Having a vibrant community that can help newcomers, and that can further advance the language, is key.

There are probably additional ways to evaluate the importance of an ecosystem, and I welcome suggestions.

We can also get facts about the main data science IDE for each language: IPython/Jupyter for Python notebooks, RStudio for R scripts, and Apache Zeppelin for Scala notebooks. I look at the number of Stack Overflow questions, at the number of GitHub repositories using these languages, then at the stars, forks, commits, and contributors for the main GitHub repository: Jupyter/IPython, RStudio, and Zeppelin.
