Wednesday, March 28, 2012

BIG Data...

Some interesting facts...

  • It currently costs about $600 for a drive that would store all the music of the world
  • 5 billion mobile phones in use in 2010
  • 30 billion pieces of content shared on Facebook in a month
  • 235 terabytes of data collected by the US Library of Congress by April 2011



So What is Big Data?

You may or may not have heard about Big Data before, but like it or not, you will probably run into it in the near future (especially if you work in IT).  The amount of data the world generates has exploded, and we must come up with ways to manage it.  The volume will only increase, with each year bringing more data than the last.

Big Data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualization.  I actually deal with large data sets at work (maybe not large enough to be considered Big Data, but large nonetheless).  Sometimes I am working with spreadsheets that have millions and millions of records in them.  I know how ungainly and slow working with this data can be.  These files take forever to save, they are hard to pass along to others, and they are just cumbersome to analyze quickly.  I'm not even working with truly Big Data, but I can imagine the need to manage these large datasets quickly and efficiently.

There are a few key points regarding Big Data:

1)  Big Data is showing up in all industries
2)  Big Data can create value
3)  Using Big Data effectively will become key for competition
4)  There is a shortage of talent necessary to take advantage of Big Data
5)  There are still obstacles to overcome before Big Data can be fully realized
6)  Relational databases are not always the best fit for data at this scale

Point number 6 brings us to the next question...

What are "NoSQL" databases?

These are data management systems that differ from the typical relational database model in that they don't use SQL as their primary query language.  These systems were popularized by companies such as Google, Amazon, Twitter, and Facebook, which have to manage extremely large volumes of data.  These databases are built for speed and scale, and many of them give up relational features such as joins and rich queries in exchange for simple, fast storage and retrieval.
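To make the difference a little more concrete, here is a minimal sketch in Python. It is not how any particular NoSQL product works. The `KeyValueStore` class and the sample user record are made up for illustration, and an in-memory SQLite table stands in for a relational database.

```python
import sqlite3

# Relational style: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ann', 'Austin')")
sql_row = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (1,)
).fetchone()


class KeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema: the value can be any shape.
        self._data[key] = value

    def get(self, key, default=None):
        # No query language: lookups are by key only.
        return self._data.get(key, default)


# Key-value style: store the whole record under one key.
store = KeyValueStore()
store.put("user:1", {"name": "Ann", "city": "Austin"})
kv_row = store.get("user:1")

print(sql_row)
print(kv_row)
```

The key-value version has no way to ask "which users live in Austin?" without scanning everything, which is the kind of feature that gets traded away for simplicity and speed.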

Why are these things important?

As I stated earlier, the world's data is growing at a rapid pace.  There is a huge need to manage all of this data quickly.  Being able to do so puts a company way ahead of the curve.  In time it will allow the company to make faster decisions based on a whole host of different data elements.  There is a shortage of professionals with Big Data experience, and it's only going to get worse as the data stores keep growing.

What is Hadoop?

Apache Hadoop is a software framework, released under a free license, that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project written in the Java programming language, and it is built and used by a global community of contributors. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
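To give a feel for the MapReduce idea that Hadoop builds on, here is a toy word count in plain Python. This is only a single-machine sketch of the pattern, not Hadoop's actual Java API. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample sentences are my own, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Map every line, then group by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each line is mapped independently and each word is reduced independently, the work can be split across as many computers as you have, which is where the scale comes from.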

What are Pig and Hive?

Pig and Hive are two different tools for working with data on top of the Hadoop framework.  Pig uses a more procedural language (called Pig Latin), whereas Hive uses a more SQL-like language.  You could use one or the other or both with a Hadoop install.  Hive was created by Facebook.  Both are now Apache projects, as is Hadoop.
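The procedural versus SQL-like distinction can be sketched in plain Python. To be clear, this is neither Pig Latin nor HiveQL. The visit records are made-up sample data, a list comprehension and a loop stand in for a step-by-step script, and an in-memory SQLite table stands in for a SQL-style query engine.

```python
import sqlite3

# Hypothetical sample data: (visitor, page) records.
visits = [("ann", "home"), ("bob", "home"), ("ann", "cart"), ("ann", "home")]

# Procedural style: spell out each step (filter, then group, then count).
home_visits = [v for v in visits if v[1] == "home"]
grouped = {}
for visitor, page in home_visits:
    grouped.setdefault(visitor, []).append(page)
procedural = {visitor: len(pages) for visitor, pages in grouped.items()}

# SQL style: describe the result you want and let the engine plan the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
declarative = dict(conn.execute(
    "SELECT visitor, COUNT(*) FROM visits WHERE page = 'home' GROUP BY visitor"
))

print(procedural)
print(declarative)
```

Both halves produce the same counts. The difference is whether you write out the pipeline yourself or state the result and leave the steps to the tool.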

Why are these important?

These tools are important because they provide the user with easy and standardized methods for working with Big Data.  They are also open source, so they are free to use.

Conclusion:

Big Data is here to stay and it's getting a lot of buzz lately.  I'm going to start learning Hadoop so I can make myself more marketable.  I mean, who doesn't want to put "Big Data Expert" on their resume...I know I do.