I am not the only one saying Python has the lead. A first piece of supporting evidence is the job trends for data science related topics on indeed.com.
These job trends are for three queries: Python and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), Scala and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”), and R and (“data science” or “big data” or “statistical analysis” or “data mining” or “machine learning”).
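The three queries share the same topic clause and differ only in the language name. A minimal sketch of how they can be assembled is given here in Python; the names `TOPICS` and `build_query` are hypothetical helpers introduced for illustration, and the snippet only builds the search strings, it does not call indeed.com.

```python
# Topic phrases shared by all three queries (taken from the text above).
TOPICS = [
    "data science",
    "big data",
    "statistical analysis",
    "data mining",
    "machine learning",
]


def build_query(language, topics=TOPICS):
    """Return a boolean search string of the form: language and ("t1" or "t2" ...)."""
    quoted = " or ".join(f'"{topic}"' for topic in topics)
    return f"{language} and ({quoted})"


# One query per language compared in this post.
queries = {lang: build_query(lang) for lang in ("Python", "Scala", "R")}

for query in queries.values():
    print(query)
```

Keeping the topic list in one place guarantees that the three languages are compared against exactly the same set of job postings criteria.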
I selected R, Python, and Scala for this comparison because they are the most popular open source languages for data science. R has long been the dominant open source language for statisticians and, by extension, for data science. But we see that Python has been taking over for a couple of years. Scala is a recent contender because of its link to Spark and Spark ML, but it is still a rather distant follower.
What about commercial software? I do think that SPSS Modeler, for instance, is here to stay as well. But its target is a bit different from that of R, Python, or Scala. Indeed, SPSS Modeler is a point-and-click tool aimed at non-programmers. With SPSS Modeler one draws the machine learning pipeline, whereas one programs it in Python, R, or Scala. It is because of this difference that I did not include SPSS Modeler in the comparison, as it would be comparing apples to oranges.
Back to open source, here are other signs of Python's popularity. The table below includes the number of questions on Stack Overflow, the number of packages in the main package repository for the language, and the Programming Community Index on tiobe.com. For Scala, to be fair, one should also count all Java libraries. I did not find a simple way to count them, hence I left that cell blank.
These figures measure the strength and popularity of the ecosystems built around these languages. Indeed, when comparing languages, one should not stop at a feature-by-feature comparison or at efficiency benchmarks. Having a vibrant community that can help newcomers, and that can further advance the language, is key.
There are probably additional ways to evaluate the importance of an ecosystem, and I welcome suggestions.
We can also get facts about the main IDEs data scientists use for these languages: IPython/Jupyter for Python notebooks, RStudio for R scripts, and Apache Zeppelin for Scala notebooks. I look at the number of Stack Overflow questions, at the number of GitHub repositories using these languages, then at the stars, forks, commits, and contributors for the main GitHub repository of each: Jupyter/IPython, RStudio, and Zeppelin.