Amid the hype and rapid growth of Big Data Engineering and Data Science, many companies and practitioners have become so absorbed in hiring, building infrastructure, fashionable models and shiny technology that one crucial part of the field has gone missing: delivery. I hear countless stories, in both small and large companies, where teams are built, clusters bought, prototype algorithms written and software installed, yet it then takes months or even longer to deliver working data-driven applications, or for insights to be acted on. Hype is thick in the air, but delivery is thin on the ground.
The review and related blogs correctly point out that we should focus on AI applications, that is, automation. My addition is that in many domains these applications cannot easily be bought in. In such cases they should be built in-house, and the builders ought to be Agile Big Data Engineers and Data Scientists who understand the importance of weekly or fortnightly iteration. The title of Data Scientist is not dead, but keeping Data Science alive will mean shifting the Data Scientist's focus away from hacking, ad hoc analysis and prototyping and towards high-quality code, automation, applications and Agile methodologies. Let's remember that the technology industry has a habit of automating the jobs of those who lack the imagination to become automators themselves, i.e. those who cannot be scaled.
Why Agile methodologies are so lacking in Data Science and Big Data is puzzling. Perhaps it's the age of the industry? To be frank, I believe a smidgen of elitism aims to distinguish these industries from regular software development, as though the practices and principles are beneath the concerns of mighty data minds. Another issue is the widespread misconception that "exploratory" work precludes frequent iteration on automated end-to-end applications; that is, Data Scientists claim they need to "explore" for a month or two before they can deliver. I find this ironic, since the tension between exploratory work and continuous delivery is exactly what Agile resolves. A final recurring misconception is that the day-to-day practices of Agile, like tests, automation, clean code and clean structure, are "time consuming" and will slow down "exploratory" work. This too is ironic, since Agile aims to make exploratory work faster and less laborious. Hopefully the details of my posts will flesh out why these objections are misconceptions.
Automated tests are absolutely critical to practicing Agile correctly, and TDD has spawned more acronyms and terms than many Data Scientists have written tests: TDD, BDD, DDD, ATDD, SDD, EDD, CDD, unit tests, integration tests, black-box tests, end-to-end tests, system tests, acceptance tests, property-based tests, example-based tests, functional tests, contract-based tests, etc. At a glance, things like interactive work, long-running jobs, unclear objectives and peculiar development environments seem to preclude *DD approaches. Nevertheless, if one strips away the unnecessary verbosity of *DD, the remaining core can easily accommodate such problems.
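To make that stripped-down core concrete, here is a minimal sketch of what example-based and property-based testing look like for a hypothetical data-cleaning step (the `normalise` function and its contract are my own illustration, not from any particular library) -- no framework or ceremony required, just assertions you can run interactively:

```python
import random

def normalise(values):
    """Scale a list of numbers into [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example-based tests: pin down specific known inputs and outputs.
assert normalise([0, 5, 10]) == [0.0, 0.5, 1.0]
assert normalise([3, 3, 3]) == [0.0, 0.0, 0.0]

# Property-based test: for random inputs, every output lies in [0, 1].
for _ in range(100):
    sample = [random.uniform(-1e6, 1e6) for _ in range(random.randint(1, 50))]
    assert all(0.0 <= x <= 1.0 for x in normalise(sample))
```

The same assertions slot straight into a test runner later, so exploratory code hardens into an automated suite without being rewritten.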
Sometimes Data Science can feel like academia, except much better paid. So until the bubble bursts, which still doesn't seem to be any time soon, should we just have as much fun as possible?