Big data management is essential for experimental science, and the technologies used in various science communities often predate those in the big data industry and, in many cases, continue to develop independently. This post highlights some of these technologies, focusing on those used by several projects supported by the National Energy Research Scientific Computing Center (NERSC).
Large-scale data management is essential for experimental science and has been for many years. Telescopes, particle accelerators and detectors, and gene sequencers, for example, generate hundreds of petabytes of data that must be processed to extract secrets and patterns in life and in the universe.
The data technologies used in these science communities often predate those in the rapidly growing big data industry and, in many cases, continue to develop independently, occupying a parallel big data ecosystem for science (see Figure 1). This post highlights some of these technologies, focusing on those used by several projects supported by the National Energy Research Scientific Computing Center (NERSC).
This post originally appeared on oreilly.com, organizers of Strata Hadoop World. Republished with permission.
Across these projects we see a common theme: data volumes are growing, and there is an increasing need for tools that can effectively store and process data at such a scale. In some cases, the projects could benefit from big data technologies being developed in industry; in others, the research itself will lead to new capabilities.
The Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) in Geneva is the world’s largest scientific instrument, designed to collide protons at the highest energies ever achieved. The resulting spray of particles is observed in detectors the size of buildings, in an attempt to discover one-in-a-billion events that have the potential to uncover new fundamental particles and, ultimately, secrets of the universe. The extreme rate of data produced, together with the overall volume of data and the rarity of interesting events, has made the research with the LHC one of the original examples of big data. LHC experiments require smart data ingestion, efficient data storage formats that allow for fast extraction of relevant data, powerful tools for transfer to collaborators around the world, and sophisticated statistical analysis.
The LHC enables protons to collide 40 million times per second in detectors packed with instruments that take hundreds of millions of measurements during each collision (Figure 2).
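To get a feel for these numbers, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above and makes two loud assumptions: "hundreds of millions" of measurements is taken at its lower bound of 100 million, and "one-in-a-billion" is read literally as a rate. The function and constant names are illustrative, not part of any LHC software.

```python
# Figures quoted in the text.
COLLISIONS_PER_SECOND = 40_000_000

# Assumption: lower bound for "hundreds of millions" of measurements per collision.
MEASUREMENTS_PER_COLLISION = 100_000_000

# Assumption: "one-in-a-billion" read literally as 1 interesting event per 1e9 collisions.
RARE_EVENT_ODDS = 1_000_000_000


def measurements_per_second(collisions_per_s, measurements_per_collision):
    """Total raw measurements produced each second."""
    return collisions_per_s * measurements_per_collision


def seconds_between_rare_events(collisions_per_s, odds):
    """Average wait, in seconds, between events that occur once per `odds` collisions."""
    return odds / collisions_per_s


if __name__ == "__main__":
    total = measurements_per_second(COLLISIONS_PER_SECOND, MEASUREMENTS_PER_COLLISION)
    wait = seconds_between_rare_events(COLLISIONS_PER_SECOND, RARE_EVENT_ODDS)
    print(f"Measurements per second (lower bound): {total:.1e}")
    print(f"Average seconds between one-in-a-billion events: {wait}")
```

Under these assumptions, the detectors take on the order of quadrillions of measurements every second, while an interesting event turns up only about once every 25 seconds, which is why smart data ingestion matters so much.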