“The first rule of data science is: don’t ask how to define data science.” So says Josh Bloom, a UC Berkeley professor of astronomy and a lead principal investigator (PI) at the Berkeley Institute for Data Science (BIDS). If this approach seems problematic, that’s because it is—data science is more of an emerging interdisciplinary philosophy, a wide-ranging modus operandi that entails a cultural shift in the academic community. The term means something different to every data scientist, and in a time when all researchers create, contribute to, and share information that describes how we live and interact with our surroundings in unprecedented detail, all researchers are data scientists.
We live in a digitized world in which massive amounts of data are harvested daily to inform actions and policies for the future. We build sophisticated systems to collect, organize, analyze, and share data. We each have unlimited access to huge amounts of information and the tools to interpret it. We are more aware than ever how molecules and cells move, how inflation fluctuates, and how the flu travels, all in real time. We can efficiently distribute bus stations and plan transit schedules. With the right tools, we can predict how proteins misfold in our brains, or what our galaxy might look like in a thousand years. In a society driven by data, knowledge is a commodity that is created and shared transparently all over the world. It connects causes with effects, familiar places with distant locales, the past with the future, people with one another.
In this rapidly changing world, universities face the challenge of adapting to increasingly data-driven research agendas. At UC Berkeley and elsewhere, scientists and administrators are working together to reshape how we do research and ultimately restructure the culture of academia.
Researchers in all disciplines find themselves wading through ever more kinds of data. Frequently, there is no standard system for storing, organizing, or analyzing these data. Data often never leave the lab; the students graduate, the computers are upgraded, and records are simply lost. This makes research in the social, physical, and life sciences difficult to reproduce and develop further. To make matters worse, it’s no easy task to build tools for general scientific computing and data analysis. Doing so requires a set of skills that researchers must largely learn independently, and a timeframe that extends beyond the length of the average PhD.
Historically, no single practice described the simultaneous use of so many different skill sets and bases of knowledge. However, in recent years data science has emerged as the field that exists at the intersection of math and statistics knowledge, expertise in a science discipline, and so-called “hacking skills,” or computer programming ability. While these skills are changing the way that science is practiced, they’re also changing other aspects of society, such as business and technology startups. In a world where rapidly advancing technology is forcibly changing data science practices, universities are struggling to keep up, often losing good researchers to industries that place a high value on their computational skills.
Despite its increasing importance and relevance, data science is almost impossible to pin down, and data scientists hate trying to define it. Bloom describes data science as a context-dependent way of thinking about and working around data: a set of skills derived from statistics, computer science, and physical and social sciences. Cathryn Carson, the associate dean of Social Science who is heavily involved in BIDS and the new Social Science Data Laboratory (D-Lab), is more interested in how we can use the idea of data science to do more interesting science. This involves bringing people from different areas of expertise together to work on multifaceted problems. “It’s a kind of social engineering,” Carson says.