Big Data: Introduction. Dr. Akram Alkouz, Princess Sumaya University for Technology

PSUT Big Data Class, introduction



Data → Information → Understanding

• Big Data is data whose volume is beyond the storage and processing capabilities of a single machine

• Big Data: a huge volume of data that comes from a variety of sources, in a variety of formats, at high velocity

• Big Data is similar to "small data", but bigger

• Bigger data requires different approaches: techniques, tools, and architecture

• Volume: data quantity

• Velocity: data speed

• Variety: data types

• Veracity: accuracy. Big Data - Veracity = incorrect inferences?

• Validity: logic or fact? Volume - Validity = worthlessness?

• Value: usefulness. Big Data = Data + Value?

• Visibility: Big Data - Visibility = black hole?

• High trend

• Real data problems

Market Size

Source: Wikibon, Taming Big Data

By 2015, 4.4 million IT jobs in Big Data; 1.9 million of them in the US alone

MENA – Big Data

• Gaining traction

• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%)

• Current market size for the GCC is 135.7 million; by 2020 it is projected to reach 635.5 million

• The opportunity for MENA service providers lies in offering services around Big Data implementation and analytics for global multinationals

Why Big Data became possible

• Key enablers of the emergence and growth of Big Data are:

–Increase of storage capacities

–Increase of processing power

–Availability of data

•Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone
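To put the daily figure above in more familiar units, here is a minimal arithmetic sketch. It assumes decimal (SI) units and reads "quintillion" as 10^18; the helper names and the 1 TB drive size are illustrative assumptions, not part of the slides.

```python
# Decimal (SI) byte units; 1 quintillion = 10**18 (short scale).
BYTES_PER_DAY = 25 * 10**17  # 2.5 quintillion bytes, the figure quoted above

UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18}


def convert(num_bytes, unit):
    """Express a byte count in the given SI unit."""
    return num_bytes / UNITS[unit]


def drives_needed(num_bytes, drive_tb=1):
    """Hypothetical helper: number of drives of `drive_tb` terabytes
    needed to hold `num_bytes` (ignores replication and overhead)."""
    return num_bytes / (drive_tb * UNITS["TB"])


print(convert(BYTES_PER_DAY, "EB"))  # exabytes per day
print(convert(BYTES_PER_DAY, "PB"))  # petabytes per day
print(drives_needed(BYTES_PER_DAY))  # 1 TB drives filled per day
```

Under these assumptions the daily volume is 2.5 EB, far more than any single machine stores, which is the point of the definition given earlier.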

Applications for Big Data Analytics

Homeland Security

Finance

Smarter Healthcare

Multi-channel sales

Telecom

Manufacturing

Traffic Control

Trading Analytics Fraud and Risk

Log Analysis

Search Quality

Retail: Churn, NBO

Healthcare

• 80% of medical data is unstructured and is clinically relevant

• Data resides in multiple places, such as individual EMRs, lab and imaging systems, physician notes, medical correspondence, and claims

• Leveraging Big Data

– Build sustainable healthcare systems

– Collaborate to improve care and outcomes

– Increase access to healthcare

NoSQL: non-relational, or at least non-SQL, database solutions such as HBase (also a part of the Hadoop ecosystem), Cassandra, MongoDB, Riak, CouchDB, and many others.

Hadoop: an ecosystem of software packages, including MapReduce, HDFS, and a whole host of other software packages.
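The MapReduce model mentioned above can be sketched in a few lines. This is an in-memory illustration in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data is everywhere"]
counts = word_count(docs)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across machines; only the shuffle step needs to move data between them.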

Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, Mongo, Voldemort, Storm, Kafka, Drill, Dremel, Impala, ZooKeeper, Ambari, Oozie, YARN, Redis, Rajak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka

• Google: 24 PB of data processed daily

• Twitter: 340 million tweets daily, 1.6 billion search queries, 7 TB added daily

• Facebook: 750 million users, 12 TB of content daily, 2.7 billion "likes" and comments daily

• Relational Data (Tables/Transaction/Legacy Data)

• Text Data (Web)

• Semi-structured Data (XML)

• Graph Data

– Social Network, Semantic Web (RDF), …

• Streaming Data

– You can only scan the data once

• Unstructured Data (Documents)
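The "scan once" constraint on streaming data listed above means statistics have to be updated incrementally as each item arrives. A minimal sketch, assuming a numeric stream; `running_stats` and the `sensor_readings` generator are hypothetical names used only for illustration.

```python
def running_stats(stream):
    """Compute count, mean, and maximum in a single pass.

    Each item is seen exactly once and never stored, so memory use
    stays constant no matter how long the stream is.
    """
    count = 0
    mean = 0.0
    maximum = None
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental mean update
        if maximum is None or x > maximum:
            maximum = x
    return count, mean, maximum


def sensor_readings():
    """Hypothetical stream: a generator can be consumed only once."""
    for value in [3.0, 5.0, 4.0, 8.0]:
        yield value


print(running_stats(sensor_readings()))
```

A generator is used here because, like a real stream, it cannot be rewound: a second loop over the same generator object would see no data.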

• RFID

• Web logs

• User interaction logs

• User transaction history

• Social Network, Semantic Web (RDF), …

• Climate sensors

• Internal

– Transactions

– Emails

– Log data

• External

– Social Networks

– Web

– Media

What to do with this data?

Analyze it

Why Big Data Analytics?

• Examining large amounts of data

• Appropriate information

• Identification of hidden patterns, unknown correlations

• Competitive advantage

• Better business decisions: strategic and operational

• Effective marketing, customer satisfaction, increased revenue

• Vital Information discovery

• Trends detection and prediction

• Personalized user services

• Identify the most important customers

• Identify the best time to perform maintenance based on usage patterns

• Analyze brand reputation in social media

How can such a huge amount of data be processed?

Distributed systems

Architecture: application servers (×3) and storage servers (×4) connected through a storage area network

Problems

• Dependency on the network and high demand for network bandwidth

• Scale up and down is not that smooth

• Partial failure is problematic

• Transferring data consumes processing power

• Data synchronization is a headache
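A rough back-of-the-envelope sketch of the bandwidth problem listed above. All numbers are assumptions chosen for illustration (a 1 TB dataset, a 1 Gbit/s link, three application servers sharing it), and protocol overhead is ignored.

```python
ONE_TB = 10**12   # bytes (decimal terabyte), assumed dataset size
GIGABIT = 10**9   # bits per second, assumed link speed


def transfer_seconds(num_bytes, link_bits_per_second):
    """Ideal time to move num_bytes over a link, ignoring overhead."""
    return num_bytes * 8 / link_bits_per_second


def shared_link_seconds(num_bytes, link_bits_per_second, servers):
    """Time until `servers` application servers have each pulled
    num_bytes over the same shared link."""
    return transfer_seconds(num_bytes * servers, link_bits_per_second)


single = transfer_seconds(ONE_TB, GIGABIT)
shared = shared_link_seconds(ONE_TB, GIGABIT, servers=3)
print(single / 3600)  # hours for one server
print(shared / 3600)  # hours when three servers share the link
```

Under these assumptions a single server waits more than two hours before it can start processing, and the wait grows linearly with the number of servers sharing the link.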


Big Data revolution comes to the stage

Big data revolution

• Google: GFS, MapReduce, Bigtable

• Yahoo: Hadoop

• Amazon: DynamoDB

• Facebook: Cassandra, HBase

• Twitter: FlockDB, Storm

• LinkedIn: Voldemort, Kafka

• Machine Learning

• Data Mining

• Statistics

• Software Engineering

• Hadoop/MapReduce/HBase/Hive/Pig

• Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

Thank you