Introduction to Big Data
Big data is a buzzword, or catch-phrase, used to describe massive volumes of structured and unstructured data that are difficult to process with traditional relational databases and software techniques on an organization's existing hardware and infrastructure.
Big data is more than simply a matter of size; according to IBM, it has three major attributes:
Variety – Different types of data, including text, audio, video, click streams, log files, and more, which can be structured, semi-structured, or unstructured.
Volume – Hundreds of terabytes and petabytes of information.
Velocity – The speed at which data must be analyzed, often in real time, to maximize its business value.
Sample data sizes from a few leading companies
Data Generated by NYTimes in one day
50 GB of uncompressed log files
10 GB of compressed log files
0.5 GB of processed log files
4-6 million unique users
7,000 unique pages with more than 100 hits
2 GB index size
Pre-processing and indexing time
10 minutes on a workstation (4 cores, 32 GB RAM)
1 hour on EC2 (2 cores, 16 GB RAM)
Data Generated by Facebook in one month
30 billion pieces of content are shared on Facebook every month.
Global data generation is projected to grow 40% per year, versus 5% growth in global IT spending.
Let us look at what big data analysis is, why it is needed, and how we can perform it in an optimized way through different approaches.
Big Data Analysis = Big Data + Analysis
We need to store and clean the data, apply mathematical and algorithmic models to analyze it, and then present the results as a clear, visualized story that helps the senior management team make decisions.
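The store-clean-analyze-visualize pipeline described above can be sketched in a few lines of Python. The records, column names, and figures here are invented purely for illustration; a real pipeline would read from a database or log store.

```python
import statistics

# Hypothetical raw records; one entry has a missing revenue value.
raw_rows = [
    {"region": "north", "revenue": "1200"},
    {"region": "south", "revenue": ""},      # malformed row to be dropped
    {"region": "north", "revenue": "1350"},
    {"region": "south", "revenue": "980"},
]

# Clean: keep only rows with a parseable revenue figure.
clean = [dict(r, revenue=float(r["revenue"])) for r in raw_rows if r["revenue"]]

# Analyze: a simple model -- mean revenue per region.
by_region = {}
for row in clean:
    by_region.setdefault(row["region"], []).append(row["revenue"])
summary = {region: statistics.mean(vals) for region, vals in by_region.items()}

# "Beautify": render a tiny text chart for the management team.
for region, mean in sorted(summary.items()):
    print(f"{region:>6} | {'#' * int(mean // 100)} {mean:.0f}")
```

Every real-world pipeline is some elaboration of these four steps, whatever the tooling.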
Can you imagine how your organization handles Big Data during daily operations? Just to give you an idea, consider the following scenarios:
What if, on a Monday morning, your VP suddenly asks, “Hey, can you provide me a quick sentiment analysis of the ABC news story published yesterday?”
Or, “Can you quickly draw a graph of sentiment analysis for the ABC news story published yesterday for the XYZ region?”
What would you need in place so that you can respond to your VP quickly by scanning your big data?
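To make the sentiment-analysis scenario concrete, here is a minimal lexicon-based scorer in Python. The word lists and sample sentences are invented for illustration; a production system would use a trained model or a full sentiment lexicon.

```python
# Toy positive/negative word lists -- assumptions for this sketch only.
POSITIVE = {"good", "great", "strong", "growth", "win"}
NEGATIVE = {"bad", "weak", "loss", "decline", "crisis"}

def sentiment(text: str) -> str:
    """Classify text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Strong growth reported in the region"))    # positive
print(sentiment("Company posts a heavy loss amid crisis"))  # negative
```

Even this crude approach answers the VP's question in seconds once the news text is accessible; the hard part of big data is making that text scannable at scale.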
Previously, it was statisticians who played with data and came up with models to support decisions. Now it is data scientists who deliver such solutions. A data scientist is a blend of a database expert, a statistician, and a storyteller. To make their lives easier we have the R language, in which we can either store data or use existing data (from a database such as SQL Server or Oracle) and then perform our analysis using predefined packages within R.
R can handle big data using packages such as ff, ffbase, RODBC, and RHadoop.
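The core idea behind out-of-memory packages such as ff is to process data in chunks rather than loading it all at once. A minimal Python analogue of that idea, aggregating a log stream chunk by chunk (the log format and sizes are invented for this example):

```python
import io

def total_bytes_served(log_file, chunk_lines=2):
    """Sum the last field (response size) of each log line, chunk by chunk,
    so only chunk_lines lines are held in memory at a time."""
    total = 0
    while True:
        chunk = [log_file.readline() for _ in range(chunk_lines)]
        chunk = [line for line in chunk if line]
        if not chunk:
            break
        total += sum(int(line.split()[-1]) for line in chunk)
    return total

# A tiny in-memory stand-in for a multi-gigabyte log file.
fake_log = io.StringIO("GET /a 512\nGET /b 1024\nGET /c 256\n")
print(total_bytes_served(fake_log))  # 1792
```

The same chunked pattern is what lets a single workstation summarize log files far larger than its RAM.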