Familiarity with Hadoop, Java, Python, SQL, Hive, and Pig, and the ability to use them, are core essentials. Programming, and computer science in general, is the starting point for gathering data: understanding how to “get” data and piece it together. Just moving data is its own specialty, reserved for ETL (extract, transform, load) specialists. ETL tools include Informatica, MS SSIS, and Teradata bulk loading tools, among others. If you can’t GET data, you sure can’t analyze it. And you sure can’t expect somebody else to capture it for you.
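To make the extract, transform, load idea concrete, here is a minimal sketch using only Python's standard library. The CSV content, the `orders` table, and the function names are all made up for illustration; a real pipeline built on Informatica, SSIS, or similar would look quite different, but the three stages are the same.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file or an API response.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, convert amounts, skip rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine this row
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a target table (illustrative schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text):
    """Run all three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn


if __name__ == "__main__":
    conn = run_etl(RAW_CSV)
    print(conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall())
```

Notice that the third row never reaches the table because its amount can't be parsed. Deciding what to do with rows like that is a large part of what makes moving data its own specialty.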
Understanding the business data itself is its own domain expertise, and it only comes with working in that data domain. Medical data is different from ecological data, which is different from all the varieties of business data. You build this understanding by studying and asking lots of questions while working in that particular field.
Knowing the difference between a fact table that is put together well and one that is faulty, with semi-structured, unconstrained keys, makes all the difference in how easily you can trust and massage the data you’re trying to capture. Knowing the validity and proper use of each dimension is also key to leveraging any star-schema data structure. Unstructured data is another story: you may have to design and build a staging layer yourself before the data is even useful. If you can’t get through these things, you can’t begin making propositional sense of the data you want to analyze.
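One quick way to find out whether a fact table's keys can be trusted is to look for rows that point at no dimension row at all. The sketch below assumes a tiny, made-up star schema (`fact_sales` and `dim_product`, with no foreign key constraint declared) and uses SQLite only because it ships with Python; the same `LEFT JOIN` check should carry over to most SQL databases.

```python
import sqlite3


def build_demo(conn):
    """Create a hypothetical fact and dimension table with unconstrained keys."""
    conn.executescript(
        """
        CREATE TABLE dim_product (product_key INTEGER, name TEXT);
        CREATE TABLE fact_sales (sale_id INTEGER, product_key INTEGER, amount REAL);
        INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
        INSERT INTO fact_sales VALUES
            (10, 1, 9.5), (11, 2, 3.0), (12, 99, 7.25), (13, NULL, 1.0);
        """
    )


def orphan_fact_rows(conn):
    """Return fact rows whose product_key is NULL or matches no dimension row."""
    return conn.execute(
        """
        SELECT f.sale_id, f.product_key
        FROM fact_sales AS f
        LEFT JOIN dim_product AS d ON d.product_key = f.product_key
        WHERE d.product_key IS NULL
        ORDER BY f.sale_id
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn)
    for sale_id, product_key in orphan_fact_rows(conn):
        print(f"sale {sale_id} has unusable product_key {product_key!r}")
```

In this demo, sales 12 and 13 come back as orphans: one references a product that doesn't exist and the other has no key at all. Any aggregate that joins to the dimension would silently drop both.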
Using R, Excel, SAS, or other tools to piece together your propositions and discover potential patterns and correlations through statistics is the heart of working with data and applying your creativity. This is where true genius can shine, but learning the tools is the first essential grind. If you can’t use the tools, you can’t analyze the data. That said, you could use paper and pencil, or even a fancy calculator, if you’ve got the math skills down cold.
Understanding correlation, multivariate regression, and all the ways of massaging data together to look at it from different angles for predictive and prescriptive modeling is the backbone knowledge, and really step one of revealing intelligence. Nothing more to say. If you don’t have this, all the data collection and presentation polishing in the world is meaningless.
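For anyone who wants to see what correlation and a regression with more than one predictor boil down to, here is a rough, standard-library-only sketch. The function names and the toy numbers are invented for illustration, and the regression solves the normal equations directly, which is fine for a demo but not what R or SAS do internally for numerical stability.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(features, target):
    """Least squares fit of target ~ intercept + features via (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in features]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, target)) for i in range(k)]
    return solve(xtx, xty)


if __name__ == "__main__":
    # Toy data constructed so that y = 1 + 2*x1 + 3*x2 exactly.
    features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
    target = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]
    print("coefficients:", fit_ols(features, target))
    print("corr(x1, y):", pearson([f[0] for f in features], target))
```

Because the toy target is built from known coefficients, the fit should recover an intercept close to 1 and slopes close to 2 and 3. Real data will not be that cooperative, which is where the judgment comes in.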
A potential list of tools includes Flare, HighCharts, AmCharts, D3.js, Processing, Google Visualization API, Tableau, Excel, PowerPoint, and Raphael.js. Most of those I admittedly don’t know. Tableau and Excel should give you all the basic tools you need. Heck, if you’re good, MS Paint will work just fine.
This is that special set of soft skills that nobody can quite pin down. It’s the art-and-communication, holistic, human side of the complete data scientist package. This is what makes the difference between a geek scientist and a business-savvy Data Scientist of the sexy bent, the kind who is valued highly, with the pay and executive respect to match. When you can come into a meeting and throw up a PowerPoint presentation with an introduction, a proposition, and a revelation in business terms that tells the business what’s wrong, what’s right, and how money is being made and lost, you’ve earned your income. The trick, and the value, is that elusive, almost lost art of storytelling.
Go sit on the porch with Grandpa and get him to tell some stories. Listen to how he sets them up, builds on them, and then delivers the punch lines. You can still learn if you can put your analytic mind aside for a while. It’s the ART in the holistic art of Data Science. Without it, you might as well just wear a lab coat. With it, you can wear your sunglasses at night.