The 3Vs that define Big Data
As I studied the subject, the following three terms stood out in relation to Big Data.
Variety, Velocity and Volume.
In marketing, the 4Ps sum up the whole field in only four terms:
Product, Promotion, Place, and Price.
I claim that the 3Vs above totally define big data in a similar fashion.
These three properties describe how a data set expands along several fronts until it merits being called big data, an expansion that is accelerating and generating yet more data of various types.
The plot above, using three axes, helps to visualize the concept.
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For some companies, the data is also generated by machines. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources, each producing larger amounts of data, combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and exabyte data sets are not far away.
Large Synoptic Survey Telescope (LSST).
72 hours of video are uploaded to YouTube every minute.
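To get a feel for these magnitudes, the short sketch below converts between the units mentioned above and works out what the YouTube figure implies per day. It assumes decimal (power-of-ten) units, and the 4 GB movie size is my own stand-in for "a few gigabytes", chosen only for illustration.

```python
# Decimal (SI) units are assumed: 1 kB = 10**3 bytes, 1 MB = 10**6 bytes, and so on.
KB, MB, GB, TB, PB, EB = (10 ** (3 * n) for n in range(1, 7))

MOVIE_BYTES = 4 * GB  # assumed size of one full-length movie ("a few gigabytes")

movies_per_petabyte = PB // MOVIE_BYTES
petabytes_per_exabyte = EB // PB

# 72 hours of video uploaded per minute, as quoted above.
upload_hours_per_minute = 72
upload_hours_per_day = upload_hours_per_minute * 60 * 24

print(movies_per_petabyte)    # 250000
print(petabytes_per_exabyte)  # 1000
print(upload_hours_per_day)   # 103680
```

Under these assumptions, one petabyte holds about a quarter of a million movies, and a day of uploads adds up to close to twelve years of footage.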
There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.”
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server, and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With new sources of data such as social and mobile applications, the batch process breaks down. The data now streams into the server continuously, in real time, and the result is only useful if the delay is very short.
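The two conditions above can be illustrated with a minimal sketch. The first function shows how a backlog builds up once data arrives faster than it can be processed. The second shows a result that is updated with every incoming record rather than after a full batch. The function names, the constant rates and the sample numbers are all assumptions made for illustration, not a description of any particular system.

```python
from typing import Iterable, Iterator


def backlog_after(arrival_rate: float, processing_rate: float, seconds: int) -> float:
    """Records still waiting after `seconds`, assuming constant rates in records/second."""
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrival_rate                   # new records arrive
        backlog -= min(backlog, processing_rate)  # the server drains what it can
    return backlog


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield an updated mean after every record instead of waiting for a full batch."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    print(backlog_after(100, 150, 60))           # 0.0: the server keeps up
    print(backlog_after(200, 150, 60))           # 3000.0: 50 records pile up each second
    print(list(running_mean([2.0, 4.0, 6.0])))   # [2.0, 3.0, 4.0]
```

When the arrival rate exceeds the processing rate, the backlog grows without bound, which is the point where the batch scheme breaks down. The running mean is one simple case of a computation that can be answered at any moment while the data keeps arriving.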
140 million tweets per day on average (more in 2012).
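To translate that daily figure into a rate a server has to sustain, the arithmetic below converts it to tweets per second and to the time available per tweet. It assumes the tweets are spread evenly over the day, which an average by itself does not guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 140_000_000  # average quoted above
tweets_per_second = tweets_per_day / SECONDS_PER_DAY

# Time available per tweet if a single consumer handled them one by one.
budget_ms = 1000 / tweets_per_second

print(round(tweets_per_second))  # 1620
print(round(budget_ms, 2))       # 0.62
```

In other words, a single sequential consumer would have well under a millisecond per tweet just to keep pace with the average.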
I have not yet determined how data velocity may continue to increase, since real time is as fast as it gets. The delay in delivering results and analysis will continue to shrink until it, too, reaches real time.
Data has moved from Excel tables and databases to lose its structure and take on hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, and so on. One no longer has control over the input data format. Structure can no longer be imposed, as it was in the past, in order to keep control over the analysis. As new applications are introduced, new data formats come to life.
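One way to picture the loss of control over input formats is a receiver that has to guess what each incoming record is before it can do anything with it. The sketch below is only an illustration: the `normalize` function and its three-way guess (JSON, comma-separated, free text) are my own assumptions, and a real system would face far more formats than these.

```python
import csv
import io
import json


def normalize(raw: str) -> dict:
    """Wrap one incoming record in a common envelope, whatever its format.

    The detection is deliberately naive: anything that parses as JSON is JSON,
    anything else containing a comma is treated as one CSV row (so ordinary
    prose with a comma is misclassified), and the rest is kept as plain text.
    """
    try:
        return {"format": "json", "fields": json.loads(raw)}
    except json.JSONDecodeError:
        pass
    if "," in raw:
        row = next(csv.reader(io.StringIO(raw)))
        return {"format": "csv", "fields": row}
    return {"format": "text", "fields": raw}


if __name__ == "__main__":
    for record in ['{"lat": 1.5}', "a,b,c", "hello world"]:
        print(normalize(record))
```

Even this toy version has to fall back to "plain text" for anything it does not recognize, which is the situation described above: the analysis has to adapt to the data rather than the other way around.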
Google uses smartphones as sensors to determine traffic conditions.
In this application, they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.
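I do not know how Google actually does this, but the general idea of turning many individual readings into a traffic picture can be sketched in a few lines. Everything below is hypothetical: the road segment identifiers, the sample readings and the 30 km/h congestion threshold are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reports: (road_segment_id, speed_km_h) sent by phones in moving cars.
reports = [
    ("A1", 92.0), ("A1", 88.0),
    ("B7", 12.0), ("B7", 18.0), ("B7", 15.0),
]


def average_speed_by_segment(reports):
    """Group speed readings by road segment and average them."""
    speeds = defaultdict(list)
    for segment, speed in reports:
        speeds[segment].append(speed)
    return {segment: mean(values) for segment, values in speeds.items()}


def congested(averages, threshold_km_h=30.0):
    """Segments whose average speed falls below an assumed threshold."""
    return sorted(s for s, v in averages.items() if v < threshold_km_h)


if __name__ == "__main__":
    averages = average_speed_by_segment(reports)
    print(averages)             # {'A1': 90.0, 'B7': 15.0}
    print(congested(averages))  # ['B7']
```

A routing service could then prefer segments that are not flagged as congested. The value comes from the collective scale: a single reading says little, while millions of them describe the whole road network.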
The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.
So what is one to do about this?
So far, I have seen two approaches.