In any data analytics project, once the business understanding phase is complete, understanding the data and selecting the right data format and ETL tools are very important tasks. In this article, Silvia, an experienced data engineer from Silicon Valley, shares useful and practical guidelines on the data understanding, data format, and ETL phases of the project lifecycle.
It’s easy to become overwhelmed when it comes time to choose a data format. Picture it: you have just built and configured your new Hadoop Cluster. But now you must figure out how to load your data. Should you save your data as text, or should you try to use Avro or Parquet? Honestly, the right answer often depends on your data. However, in this post I’ll give you a framework for approaching this choice, and provide some example use cases.
There are different data formats available for use in the Hadoop Distributed File System (HDFS), and your choice can greatly impact your project in terms of performance and space requirements. The findings provided here are based on my team’s past experiences, as well as tests that we ran comparing read and write execution times with files saved in text, Apache Hadoop’s SequenceFile, Apache Avro, Apache Parquet, and Optimized Row Columnar (ORC) formats.
More details on each of these data formats, including characteristics and an overview on their structures, can be found here.
Where to start?
There are several considerations to take into account when determining which data format to use in your project. Here, we discuss the most important ones you will encounter: system specifications, data characteristics, and use case scenarios.
System specifications
Start by looking at the technologies you’ve chosen to use and their characteristics; this includes the tools used for ETL (Extract, Transform, and Load) processes as well as the tools used to query and analyze the data. This information will help you figure out which formats you’re able to use.
Not all tools support all of the data formats, and writing additional data parsers and converters will add unwanted complexity to the project. For example, as of the writing of this post, Impala does not offer support for the ORC format; therefore, if you are planning on running the majority of your queries in Impala, ORC would not be a good candidate. You can, instead, use the similar RCFile format, or Parquet.
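One way to make this compatibility check explicit is to write down which formats each tool in your stack can handle and take the intersection. The sketch below is only illustrative: the support table is a placeholder, not an authoritative matrix. The `impala` entry reflects only what is said above, while `etl_tool` and the `common_formats` helper are hypothetical names. Fill the table in from the documentation of the tool versions you actually run.

```python
# Illustrative support table, NOT an authoritative compatibility matrix.
# Replace the entries with what your own tool versions actually support.
SUPPORT = {
    "impala": {"text", "rcfile", "parquet"},         # no ORC, per the note above
    "etl_tool": {"text", "avro", "parquet", "orc"},  # hypothetical ETL tool
}

def common_formats(tools, support=SUPPORT):
    """Return the set of formats that every tool in `tools` can handle."""
    sets = [support[tool] for tool in tools]
    return set.intersection(*sets) if sets else set()

candidates = common_formats(["impala", "etl_tool"])
print(sorted(candidates))
```

Any format missing from the result would need an extra parser or conversion step somewhere in the pipeline, which is exactly the complexity you want to avoid.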
You should also consider the reality of your system. Are you constrained on storage or memory? Some data formats can compress more than others. For example, datasets stored as Parquet and ORC with Snappy compression can shrink to a quarter of the size of their uncompressed text counterparts, and Avro with deflate compression can achieve similar results. However, writing into any of these formats will be more memory-intensive, and you might have to tune your system's settings to allocate more memory. There are many options that can be tweaked to adapt your system, and it is often desirable to run some tests on your system before fully committing to a format. We will talk about some of the tests that you can run in a later section of this post.
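Before committing to a format, a quick check of how well your data compresses can help set expectations. The sketch below applies deflate from Python's standard `zlib` module to a small synthetic CSV sample. It does not write Avro, Parquet, or ORC files, and the sample rows and the `make_sample_csv` and `compression_ratio` helpers are assumptions made purely for illustration. Real ratios depend on your data and on the format and codec you choose.

```python
import csv
import io
import zlib

def make_sample_csv(rows=10_000):
    """Build a synthetic CSV payload in memory (hypothetical data)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "country", "status", "amount"])
    for i in range(rows):
        country = ("US", "DE", "JP")[i % 3]
        status = "ok" if i % 7 else "error"
        writer.writerow([i, country, status, i * 0.5])
    return buf.getvalue().encode("utf-8")

def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return compressed size divided by raw size, using deflate (zlib)."""
    return len(zlib.compress(raw, level)) / len(raw)

raw = make_sample_csv()
ratio = compression_ratio(raw)
print(f"raw: {len(raw)} bytes, deflate ratio: {ratio:.2f}")
```

Running the same measurement on a representative slice of your own data, rather than on synthetic rows, gives a far more useful estimate of the storage you would save.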
Characteristics and size of the data
The next consideration is the data you want to process and store in your system. Let’s look at some of the aspects of your data that can impact a format's performance.
How is your raw data structured? Maybe you have plain text or CSV files and you are considering storing them as such.