Fighting fraud, reducing customer churn, improving the bottom line – these are just a few of the promises of data science. Today, we have more data to work with than ever before, thanks to new data-generating technologies like smart meters, vehicle telemetry, RFID, and intelligent sensors.
But with all that data, are we driving equivalent value? Many data scientists say they spend most of their time as “data janitors” – combining data from many sources, dealing with complex formats, and cleaning up dirty data.
Data scientists also say they spend a lot of time serving as “plumbers” – handling DevOps and managing the analytics infrastructure. Time devoted to data wrangling and DevOps is a dead loss; it reduces the amount of time data scientists can spend delivering real value to clients.
Small data tools. Data analytics software introduced before 2012 runs on single machines only; this includes most commercial software for analytics as well as open source R and Python. When the volume of data exceeds the capacity of the computer, runtime performance degrades or jobs fail. Data scientists working with these tools must invest time in workarounds, such as sampling, filtering, or aggregating. In addition to taking time, these techniques reduce the amount of data available for analysis, which affects quality.
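To make the sampling workaround concrete, here is a minimal Python sketch (with simulated data; the function and its parameters are illustrative, not from any particular product) of reservoir sampling, one common way to cut a stream that is too large for memory down to a fixed-size subset a single machine can handle:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream that is
    too large to hold in memory, replacing earlier picks with
    decreasing probability as the stream grows."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Sample 100 rows from a simulated 1,000,000-row stream.
subset = reservoir_sample(range(1_000_000), k=100)
```

The sample is unbiased, but as the text notes, any analysis now sees only 100 of a million rows, which is exactly the quality trade-off these workarounds impose.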
Complex and diverse data sources. Organizations use a wide variety of data management platforms to manage the flood of Big Data, including relational databases; Hadoop; NoSQL data stores; cloud storage; and many others. These platforms are often “siloed” from one another. The data in those platforms can be structured, semi-structured and unstructured; static and streaming; cleansed and uncleansed. Legacy analytic software is not designed to handle complex data; the user must use other tools, such as Hive or Pig, or write custom code.
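A small sketch of the “data janitor” work this implies, using only the Python standard library: joining a structured customer table (as it might arrive from a relational export) with semi-structured event records (as they might arrive from a NoSQL store or log stream). All field names and values here are hypothetical.

```python
import csv
import io
import json

# Structured data: a customer table from a relational-style export.
customers_csv = "customer_id,region\n1,EMEA\n2,APAC\n"

# Semi-structured data: one JSON document per line, log-stream style.
events_jsonl = (
    '{"customer_id": 1, "event": "login"}\n'
    '{"customer_id": 1, "event": "purchase"}\n'
    '{"customer_id": 2, "event": "login"}\n'
)

# Build a lookup from the structured side.
customers = {int(row["customer_id"]): row["region"]
             for row in csv.DictReader(io.StringIO(customers_csv))}

# Enrich each semi-structured event with the customer's region.
enriched = []
for line in events_jsonl.splitlines():
    doc = json.loads(line)
    doc["region"] = customers[doc["customer_id"]]
    enriched.append(doc)
```

Even this toy join needs type coercion and per-format parsing; multiply that across siloed platforms and formats, and the “janitorial” time sink described above follows.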
Single-threaded software. Legacy software scales up, not out: if you want more computing power, you have to buy a bigger machine. Besides limiting the amount of data you can analyze, this means tasks run serially, one after the other. For a complex task, that can take days or even weeks.
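The scale-out alternative can be sketched in a few lines: split the work into independent partitions, compute a partial result for each, then combine. In this toy Python version everything runs on one machine, but because each partition's work is independent, the same pattern lets a distributed engine run the partitions concurrently on separate nodes.

```python
from functools import reduce

def partition(data, n):
    """Split data into n roughly equal partitions by striding."""
    return [data[i::n] for i in range(n)]

def partial_sum_of_squares(part):
    # Each partition is independent of the others, so in a scale-out
    # system these calls could run in parallel on different machines.
    return sum(x * x for x in part)

data = list(range(1000))
partials = [partial_sum_of_squares(p) for p in partition(data, 4)]
total = reduce(lambda a, b: a + b, partials)  # combine partial results
```

A serial tool must walk all 1,000 rows in one pass; the partitioned version divides the same work four ways, and adding machines shortens the wall-clock time rather than requiring a bigger box.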
Complex infrastructure. Jeff Magnusson, Director of Algorithms Platform at the online retailer Stitch Fix, notes that data science teams typically include groups of engineers who spend most of their time keeping the infrastructure running. Data science teams often manage their own platforms because clients have urgent needs, the technology is increasingly sophisticated, and corporate IT budgets are lean.
It doesn’t make sense to hire highly paid employees with skills in advanced analytics, then put them to work cleaning up data and managing clusters. Visionary data scientists seek tools and platforms that are scalable; interoperable with Big Data platforms; distributed; and elastic.
Scalability. Some academics question the value of working with large datasets. For data scientists, however, the question is moot; you can’t escape large datasets even if you agree with the academics. Why? Because the data you need for your analysis comes from a growing universe of sources, and if you build a predictive model, your organization will need to score large volumes of data. You don’t have a choice; large datasets are a fact of life, and your tools must reflect this reality.
Integrated with Big Data platforms. As a data scientist, you may have little or no control over the structure of the data you need to analyze or the platforms your organization uses to manage data.