A year ago, I worked with large amounts of data stored on a server and on my laptop. Most of it was text or binary files linked together with lookup tables. This approach worked because I was the only one using the data, but it also meant I had to be very careful not to make any mistakes in the data analyses.
Since then a lot has changed. First of all, I changed jobs, from research fellow in academia to developer advocate for IBM Cloud Data Services. I made this change because I wanted to work with, and learn more about, data science in the cloud. As a developer advocate, I now get to play with all the new tools that IBM offers and to write and talk about them.
The basics of data science have not changed, but the tools I use have. I still use Python most of the time, but where I store my data has become much simpler. For me, a typical workflow goes through the following steps:
The first step, defining the question, for me means coming up with examples that show how to use new techniques and tools, or how to work with interesting data sets. This is different when you work for a company that needs to solve problems or wants insights from its data, for instance to increase sales or to understand customer behavior. But the workflow is pretty much the same.
Once you know what question you want to answer, it is time to look for the right data, which can come in any format and size and from many different sources. After a first, quick exploration of the data, you have to decide how to use it and, if needed, where to store it. I often use an application programming interface (API) to collect data and then, if the data is structured, store it in a data warehouse such as dashDB.
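As a minimal sketch of that collect-and-store step, the Python snippet below fetches JSON records from an API and loads them into a SQL table. The endpoint, the `fetch_records` helper, and the field names (`station`, `timestamp`, `value`) are hypothetical, and the standard-library `sqlite3` module stands in for a cloud warehouse like dashDB, which you would normally reach through its own database driver:

```python
import json
import sqlite3
import urllib.request


def fetch_records(url):
    """Collect records from a (hypothetical) JSON API endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def store_records(records, db_path="observations.db"):
    """Store structured records in a SQL table.

    sqlite3 is used here as a stand-in for a data warehouse
    such as dashDB; only the connection setup would differ.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(station TEXT, timestamp TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?)",
        [(r["station"], r["timestamp"], r["value"]) for r in records],
    )
    conn.commit()
    return conn


# A first, quick exploration on sample records (no network needed):
sample = [
    {"station": "A", "timestamp": "2016-01-01T00:00", "value": 3.2},
    {"station": "B", "timestamp": "2016-01-01T00:00", "value": 4.7},
]
conn = store_records(sample, db_path=":memory:")
count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
print(count)  # → 2
```

Once the data sits in a SQL table like this, the exploration and analysis steps can run ordinary queries against it instead of juggling linked text and binary files.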