According to IBM, 90% of the world’s data has been created in the last two years (note: Nicki Minaj got a Twitter account 2 years ago. Coincidence? I think not). Of this ever-expanding sea of data, 80% is “unstructured” (well, I’ll say!). Sheer size alone makes analysis of petabytes of information difficult, but it's harder still due to the complexity of unstructured data. In the age of big data we’re faced with not only storage issues but also computational power restrictions. The easy solution would be to throw out some of Minaj’s less-erudite tweets, but companies and their data scientists have concluded that there is gold in them thar hills; they just need to devise ways to store and process it.
Today, many data scientists agree that the best way to extract value from Nicki Minaj’s and the other 100 million users' tweets is by using Apache Hadoop (or one of its commercial variants). The result of a beautiful marriage between infrastructure and programming model, Hadoop is the solution to our big data woes! In 2004 Google published a paper entitled “MapReduce: Simplified Data Processing on Large Clusters” (great beach read, btw), which outlined the MapReduce programming model that allows users to process and generate data sets from distributed file systems. A MapReduce job has two phases, unexpectedly named "Map" and "Reduce." In the Map phase, the dataset to be queried (e.g., from Minaj's tweets, pull all strings of consonants longer than 8 consonants) is collected, chopped up, and assigned to the fleet of plebeian servers, each of which emits key-value pairs from its slice of the data. In the Reduce phase, those little Minaj analyses are reunited as one big output. There are: 2557 strings of consonants longer than 8.
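To make the two phases concrete, here's a minimal single-machine sketch of the consonant-run query in Python. It is not Hadoop itself, just the shape of the programming model: the map step emits key-value pairs, a shuffle step groups them by key (the framework does this for you in real Hadoop), and the reduce step sums them. The sample tweets are hypothetical stand-ins for the Minaj firehose.

```python
import re
from collections import defaultdict

# Match runs of more than 8 consecutive consonants, i.e. length 9+.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]{9,}", re.IGNORECASE)

def map_phase(tweet):
    """Map: emit a ("long_run", 1) pair for each qualifying consonant run."""
    for _ in CONSONANT_RUN.findall(tweet):
        yield ("long_run", 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse each key's values into a single count."""
    return key, sum(values)

# Hypothetical mini-corpus standing in for 100 million users' tweets.
tweets = [
    "chkchkchkchk is a band name",  # 12-consonant run
    "just vowels aeiou here",       # no long runs
    "brrrrrrrrr it's cold",         # 10-consonant run
]

mapped = [pair for t in tweets for pair in map_phase(t)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)  # → {'long_run': 2}
```

In real Hadoop the map tasks run in parallel on whichever servers hold the data blocks, and only the small intermediate pairs travel over the network to the reducers; that's the whole trick for petabyte-scale inputs.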
MapReduce paired with the Hadoop Distributed File System (aka HDFS, aka the file system that spreads data across the fleet of plebeian servers) is among the fastest and most efficient ways for people, companies, and governments to store and analyze massive datasets.
Little-known companies like “Facebook” and “Yahoo” are now utilizing Hadoop to warehouse their data. Keeping in step with the ostensible value of data, one might be wise to keep an eye out for this “Facebook,” as they’re cranking out ½ a petabyte daily! With that much “valuable” data, chances are they’re making a killing for their shareholders!