
Overview of Big Data, Hadoop and Predictive Analytics


BIG DATA:

Summary:

Big Data has three defining attributes, commonly called the three Vs: Data Volume, Data Variety and Data Velocity. Together they constitute a comprehensive definition of Big Data. So Big Data is not just about data volume, but also about the variety of data (mostly unstructured) and the velocity with which the data is generated and needs to be analyzed. Given the three Vs of Big Data, namely Volume, Variety and Velocity, the challenge before large and medium-sized companies is how to unlock the potential of Big Data and productively leverage its value in running the business.

Big data represents data sets that can no longer be easily managed or analyzed with traditional or common data management tools, methods and infrastructure. At its core, big data carries certain characteristics that add to this challenge, including high velocity, high volume and in some cases a variety of data structures. These characteristics bring new challenges to data analysis, search, data integration, information discovery and exploration, reporting and system maintenance.

Because of the business requirement of analyzing vast amounts of ever-changing structured and unstructured Big Data almost instantaneously, companies will be hard pressed to do this on their own. But because Big Data stored in the cloud can be accessed from anywhere the internet is available and can be analyzed almost instantaneously by third-party service providers, outsourcing companies can offer their clients value-added services in Big Data analytics without the heavy client investments in specialized hardware and software that were required for ‘traditional’ data analytics. This significantly brings down the costs (especially fixed costs) of building and maintaining an analytics infrastructure and solution center.

Big Data with Predictive Enterprise Solutions:

In “traditional” Data Analytics or Business Intelligence, the focus is mostly on analysis and reporting of “historic” or past data stored in the database. Take, for example, how most organizations use data from their CRM or ERP applications: almost all the reports that are generated pertain to past or “historic” information. Running a business on “historic” data is like driving a car while looking in the rear-view mirror, and it is not going to work. Instead, companies must analyze all the available information in real time, apply statistical modeling techniques to that information in order to predict future outcomes, and take action and run the business based on the predicted outcome rather than on analysis of historic data, as is done currently. Since Big Data is characterized not only by Volume but also by Velocity and Variety, it is very important that Big Data is analyzed in real time to predict the future and that corrective action is taken based on that analysis. For example, predicting churn or customer attrition in the telecom industry and taking corrective action to prevent it, rather than analyzing “historic” attrition rates, call volumes or average response times as is done currently. The real value is in using predictive analytics and taking corrective action before it is too late, rather than just reporting historical information.

Techniques like multiple-regression analysis coupled with factor analysis, cluster analysis and causal path analysis can be used very effectively with Big Data, now that we have many variables and multiple observations for each variable at the customer level to generate statistically significant differences in the analysis.
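
As a rough illustration of these techniques, the sketch below segments customers with k-means cluster analysis and fits a multiple regression on the same customer-level variables to score attrition risk. It is a minimal example using scikit-learn; the file name and column names (monthly_calls, complaints, tenure_months, churn_score) are hypothetical and not taken from any particular system.

# Minimal sketch: cluster analysis plus multiple regression on hypothetical
# customer-level data. All file and column names are illustrative.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customer_metrics.csv")   # one row per customer (hypothetical file)
features = customers[["monthly_calls", "complaints", "tenure_months"]]

# Cluster analysis: group customers into segments on standardized variables.
scaled = StandardScaler().fit_transform(features)
customers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)

# Multiple regression: relate the same variables to an observed churn score,
# so that high-risk customers can be flagged before they defect.
model = LinearRegression().fit(features, customers["churn_score"])
customers["predicted_risk"] = model.predict(features)

print(customers.sort_values("predicted_risk", ascending=False).head(10))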

In the future, no ERP or CRM system will be complete without predictive analytics functionality that enables companies to take preventive (rather than reactive) steps in real time. For example, rather than analyzing “historic” attrition rates, a predictive CRM application will make it possible for companies to identify the critical incidents leading to customer attrition, so that steps can be taken to retain the customer before he or she defects.


Big Data Sources:

The scope of big data is growing beyond niche sources to include sensor and machine data, transactional data, metadata, social network data and consumer-authored information.

An example of sensor and machine data is found at the Large Hadron Collider at CERN, the European Organization for Nuclear Research. CERN scientists can generate 40 terabytes of data every second during experiments. Similarly, Boeing jet engines can produce 10 terabytes of operational information for every 30 minutes of operation. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic crossing.

Social network data is a new and exciting source of big data that companies would like to leverage. The microblogging site Twitter serves more than 200 million users who produce more than 90 million "tweets" per day, or 800 per second. Each of these posts is approximately 200 bytes in size. On an average day, this traffic equals more than 12 gigabytes and, throughout the Twitter ecosystem, the company produces a total of eight terabytes of data per day. In comparison, the New York Stock Exchange produces about one terabyte of data per day.

In July of this year, Facebook announced they had surpassed the 750 million active-user mark, making the social networking site the largest consumer-driven data source in the world. Facebook users spend more than 700 billion minutes per month on the service, and the average user creates 90 pieces of content every 30 days. Each month, the community creates more than 30 billion pieces of content ranging from Web links, news, stories, blog posts and notes to videos and photos.

Hadoop, Big Data, and Enterprise Business Intelligence:

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes. It is:

Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
Distributed: Handles replication and offers a massively parallel programming model, MapReduce.

Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing. And it’s part of a larger framework of related technologies:

HDFS: Hadoop Distributed File System.
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable. Promises “random, real-time read/write access to Big Data”.
Hive: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data.
Pig: A platform for manipulating and analyzing large data sets. High-level language for analysts.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
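
To make the MapReduce programming model mentioned above concrete, here is a minimal word-count sketch written for Hadoop Streaming, where the mapper and reducer are plain Python scripts that read standard input and write standard output. The script names and input/output paths are illustrative.

# mapper.py - emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - Hadoop Streaming sorts mapper output by key, so identical words
# arrive on consecutive lines and can simply be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The pair would typically be launched with the Hadoop Streaming jar that ships with the distribution, along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact jar location varies by distribution).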

How Is Hadoop Being Used in Relation to Traditional BI and EDW?

Currently, Hadoop has carved out a clear niche next to conventional systems. Hadoop is good at handling batch processing of large sets of unstructured data, reliably and at low cost. It does, however, require scarce engineering expertise, real-time analysis is challenging, and it is much less mature than traditional approaches. As a result, Hadoop is not typically being used for analyzing conventional structured data such as transaction data, customer information and call records, where traditional RDBMS tools are still better adapted:

“Hadoop is real, but it’s still quite immature. On the ‘real’ side, Hadoop has already been adopted by many companies for extremely scalable analytics in the cloud. On the ‘immature’ side, Hadoop is not ready for broader deployment in enterprise data analytics environments…”

To considerably over-simplify: if we consider what are called the 3 ‘V’s of the data challenge, “Volume, Velocity, and Variety” (and there’s a fourth, Validity), then traditional data warehousing is great at Volume and Velocity (especially with the new analytic architectures), while Hadoop is good at Volume and Variety. Today, Hadoop is being used as a:

Staging layer: The most common use of Hadoop in enterprise environments is as “Hadoop ETL” - preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (a minimal sketch of this pattern follows the three layers below).

Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, click stream data, etc.

Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
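
As a small, hypothetical sketch of the staging-layer (“Hadoop ETL”) use described above, the mapper below reads semi-structured JSON event lines, drops malformed or irrelevant records, and emits flat tab-separated rows ready for bulk loading into a warehouse. The event type and field names (user_id, ts, duration_sec) are invented for illustration.

# etl_mapper.py - illustrative Hadoop Streaming mapper for a staging/ETL step.
import json
import sys

for line in sys.stdin:
    try:
        event = json.loads(line)
    except ValueError:
        continue                       # skip malformed records
    if event.get("event") != "call":   # keep only the event type of interest
        continue
    # Emit a flat, tab-separated record suitable for a warehouse load.
    print("\t".join([str(event.get("user_id", "")),
                     str(event.get("ts", "")),
                     str(event.get("duration_sec", 0))]))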

“The bottom line is that Hadoop is the future of the cloud EDW, and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade. Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts.”

Hadoop is particularly useful when:

Complex information processing is needed
Unstructured data needs to be turned into structured data
Queries can’t be reasonably expressed using SQL
Heavily recursive algorithms
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
Machine learning
Data sets are too large to fit into database RAM, discs, or require too many cores (TB up to PB)
Data value does not justify the expense of constant real-time availability, such as archives or special-interest info, which can be moved to Hadoop and remain available at lower cost
Results are not needed in real time
Fault tolerance is critical
Significant custom coding would be required to handle job scheduling

References:
http://hkotadia.com/archives/4687
http://timoelliott.com/blog/2011/09/hadoop-big-data-and-enterprise-business-intelligence.html