•Big Data is the amount of data that is beyond the storage and processing capabilities of a single machine
•Big Data: huge volume of data, comes from variety of sources, in variety of formats, with high velocity.
•Big Data is similar to ‘small data’, but bigger
•Bigger data requires different approaches: techniques, tools and architecture
• Volume: data quantity
• Velocity: data speed
• Variety: data types
• Veracity: accuracy. Big Data – Veracity = incorrect inferences?
• Validity: logic or fact? Volume – Validity = worthlessness?
• Value: usefulness. Big Data = Data + Value?
• Visibility: Big Data – Visibility = black hole?
Market Size
Source: Wikibon, Taming Big Data
By 2015, 4.4 million IT jobs in Big Data; 1.9 million of them in the US alone
MENA – Big Data
• Gaining traction
• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%)
• Current market size for the GCC is 135.7 million; by 2020 it is expected to reach 635.5 million
• The opportunity for MENA service providers lies in offering services around Big Data implementation and analytics for global multinationals
Why Big Data became possible
•Key enablers of the emergence and growth of Big Data are:
–Increase of storage capacities
–Increase of processing power
–Availability of data
•Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO
Healthcare
• 80% of medical data is unstructured and is clinically relevant
• Data resides in multiple places, such as individual EMRs, lab and imaging systems, physician notes, medical correspondence, claims, etc.
• Leveraging Big Data:
• Build sustainable healthcare systems
• Collaborate to improve care and outcomes
• Increase access to healthcare
NoSQL: non-relational or at least non-SQL database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others.
Hadoop: an ecosystem of software packages, including MapReduce, HDFS, and a whole host of others.
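To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps applied to a word count. It is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine in a real cluster.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, chosen only to show the flow of data.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on many machines; the sketch only shows how the data flows between the steps.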
Hadoop, MapReduce, Hive, Pig, Cascading,
HBase, Hypertable, Cassandra, Flume, Sqoop,
MongoDB, Voldemort, Storm, Kafka, Drill, Dremel,
Impala, ZooKeeper, Ambari, Oozie, YARN, Redis,
Rajak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
Mahout, Weka, …
• Google: 24 PB of data processed daily
• Twitter: 340 mln daily tweets + 1.6 bln search queries + 7 TB added daily
• Facebook: 750 mln users + 12 TB of daily content + 2.7 bln “likes” and comments daily
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
–Social Network, Semantic Web (RDF), …
• Streaming Data
–You can only scan the data once
• Unstructured Data (Documents)
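The "scan once" constraint on streaming data can be illustrated with a small single-pass sketch: statistics are updated as each value arrives, and no value is stored or revisited. The function `running_stats` and the generator `sensor_readings` with its sample values are hypothetical, introduced only for this example.

```python
def running_stats(stream):
    """Compute count, mean and maximum in one pass, without storing the stream."""
    count = 0
    mean = 0.0
    maximum = None
    for value in stream:
        count += 1
        # Incremental mean update: no need to keep earlier values around.
        mean += (value - mean) / count
        if maximum is None or value > maximum:
            maximum = value
    return count, mean, maximum


def sensor_readings():
    """Hypothetical stream; a generator can only be consumed once."""
    yield from [21.5, 22.0, 19.5, 23.0]


count, mean, maximum = running_stats(sensor_readings())
print(count, mean, maximum)
```

Because the generator is exhausted after one pass, any quantity of interest has to be maintained incrementally like this rather than computed from a stored copy of the data.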
• RFID
• Web logs
• User interaction logs
• User transaction history
• Social Network, Semantic Web (RDF), …
• Climate sensors
• Internal
–Transactions
–Emails
–Log data
• External
–Social Networks
–Web
–Media
What to do with this data?
Analyze it
Why Big Data Analytics?
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue
• Vital Information discovery
• Trends detection and prediction
• Personalized user services
• Identify the most important customers
• Identify the best time to perform maintenance based on usage patterns
• Analyze brand reputation in social media
How can such a huge amount of data be processed?
Distributed systems
[Figure: Architecture with three application servers, four storage servers, and a Storage Area Network]
Problems
• Dependency on the network and high demand for network bandwidth
• Scaling up and down is not smooth
• Partial failure is problematic
• Transferring data consumes processing power
• Data synchronization is a headache
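A rough back-of-the-envelope calculation shows why network bandwidth becomes a bottleneck when data must travel between storage and application servers. The sketch below assumes a hypothetical 10 Gbit/s link, decimal units (1 TB = 10^12 bytes), and no protocol overhead; the 12 TB volume echoes the daily content figure quoted earlier in these slides.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized time to move size_bytes over a link, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_second


TB = 10**12                    # decimal terabyte, in bytes
LINK_SPEED = 10 * 10**9        # assumed 10 Gbit/s link (hypothetical)

seconds = transfer_seconds(12 * TB, LINK_SPEED)
hours = seconds / 3600
print(f"{seconds:.0f} s (about {hours:.1f} h)")
```

Even under these idealized assumptions, moving one day of such content takes hours on a single link, which is one motivation for processing data where it is stored.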
Big Data revolution comes to the stage
Big data revolution
• Google: GFS, MapReduce, BigTable
• Yahoo: Hadoop
• Amazon: DynamoDB
• Facebook: Cassandra, HBase
• Twitter: FlockDB, Storm
• LinkedIn: Voldemort, Kafka
• Machine Learning
• Data Mining
• Statistics
• Software Engineering
• Hadoop/MapReduce/HBase/Hive/Pig
• Java, Python, C/C++, SQL
“By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of big
data to make effective decisions.”