Introduction to Open Science: Why data versioning and data care practices are key for science and social science.

A significant shift in how researchers approach their data is needed if transparent and reproducible research practices are to be broadly advanced. Carly Strasser has put together a useful guide to embracing open science, pitched largely at graduate students. But the tips shared will be of interest far beyond the completion of a PhD. If time is spent up front thinking about file organization, sample naming schemes, backup plans, and quality control measures, many hours of heartache can be averted.

I’m guessing that first-year graduate students are knee-deep in courses, barely considering their potential thesis project. But for those who can multitask, I have compiled this list of 10 things that you should undertake in your first year as a grad student. These aren’t just any 10 things… they are 10 steps you can take to make sure you contribute to a culture shift towards open science. Some are big steps, and others are small, but they will all get you (and the rest of your field) one step closer to reproducible, transparent research.

1. Learn to code in some language. Any language.

Here’s the deal: it’s easier to use black-box applications to run your analyses than to create scripts. Everyone knows this. You put in some numbers and out pop your results; you’re ready to write up your paper and get that H-index headed upwards. But this approach will not cut the mustard for much longer in the research world. Researchers need to know how to code. Growing amounts and diversity of data, more interdisciplinary collaborators, and increasing complexity of analyses mean that black-box models, software, and applications can no longer be used in research. The truth is, if you want your research to be reproducible and transparent, you must code. In a 2013 article “The Big Data Brain Drain: Why Science is in Trouble”, Jake Vanderplas argues that

I learned MATLAB in graduate school, and experimented with R during a postdoc. I wish I’d delved into this world earlier, and had more skills and knowledge about best practices for scientific software. Basically, I wish I had attended a Software Carpentry bootcamp.

The growing number of Software Carpentry (SWC) bootcamps is more evidence that researchers are increasingly aware of the importance of coding and reproducibility. These bootcamps teach researchers the basics of coding, version control, and similar topics, with the potential for customizing the course’s content to the primary discipline of the audience. I’m a big fan of SWC – read more in my blog post on the organization. Check out SWC founder Greg Wilson’s article on some insights from his years of teaching bootcamps: Software Carpentry: Lessons Learned.

2. Stop using Excel. Or at least stop ONLY using Excel.

Most seasoned researchers know that Microsoft Excel can be problematic for data management: there are loads of ways to manipulate, edit, reorder, and change your data without really knowing exactly what you did. In nerd terms, the trail of dataset changes is known as provenance; generally Excel is terrible at documenting provenance. I wrote about this a few years ago on the blog, and we mentioned a few of the more egregious ways people abuse Excel in our F1000Research publication on the DataUp tool. More recently, guest blogger Kara Woo wrote a great post about struggles with dates in Excel.

Of course, everyone uses Excel. In our surveys for the DataUp project, about 88% of the researchers we interviewed used Excel at some point in their research. And we can’t expect folks to stop using it: it’s a great tool! It should, however, be used carefully. For instance, don’t manipulate the sole copy of your raw data in Excel; keep your raw data raw. Use Excel to explore your data, but use other tools to clean and analyze it, such as R, Python, or MATLAB (see #1 above on learning to code). For more help with spreadsheets, see our list of resources and tools: UC3 Spreadsheet Help.
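As a rough sketch of what "keep your raw data raw" can look like in a script, the Python below reads a raw CSV, writes a cleaned copy to a separate file, and records each step in a plain-text log so the provenance is not lost. Everything specific here is an assumption made for illustration: the function names, the `value` column, and the two cleaning rules are hypothetical, and real data will need its own rules.

```python
import csv
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file (used to check the raw file is unchanged)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def clean_copy(raw_path, clean_path, log_path):
    """Write a cleaned copy of raw_path to clean_path and log every step to log_path.

    The raw file is only ever opened for reading. Returns the list of logged steps.
    """
    if Path(raw_path).resolve() == Path(clean_path).resolve():
        raise ValueError("clean_path must differ from raw_path: keep your raw data raw")

    steps = [f"read {raw_path} (sha256 {file_sha256(raw_path)})"]
    with open(raw_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    cleaned = []
    dropped = 0
    for row in rows:
        # Hypothetical rule 1: strip stray whitespace from every field.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Hypothetical rule 2: drop rows whose 'value' column is empty.
        if row["value"] == "":
            dropped += 1
            continue
        cleaned.append(row)
    steps.append("stripped whitespace from all fields")
    steps.append(f"dropped {dropped} row(s) with empty 'value'")

    with open(clean_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
    steps.append(f"wrote {len(cleaned)} row(s) to {clean_path}")

    # The log is the provenance trail: what was done, in what order.
    Path(log_path).write_text("\n".join(steps) + "\n")
    return steps
```

Calling something like `clean_copy("raw/measurements.csv", "clean/measurements.csv", "clean/measurements.log")` (hypothetical paths) leaves the raw file byte-for-byte identical, which can be confirmed by comparing `file_sha256` before and after. Rerunning the script reproduces the cleaned file from scratch, which is exactly what hand edits in a spreadsheet cannot offer.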

3.

 


