1 Prof. Kai Hwang, USC, March 24, 2014
Big-Data Analytics, Smart Clouds and Intelligent IoT Applications
Kai Hwang
University of Southern California currently visiting Wuhan University
Presentation in Wuhan, March 17, 2016
What is Data Science?
Data Science is the extraction of actionable knowledge directly from data through a process of discovery, hypothesis formulation, and hypothesis analysis.
A Data Scientist is a practitioner who has sufficient knowledge of the overlapping regimes of expertise in business needs, domain knowledge, analytical skills and programming expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle.
Big Data refers to digital data whose volume, velocity and/or variety requires management that scales across coupled horizontal resources.
The Five Vs of Big Data
Figure: Data Science at the intersection of Domain Expertise, Math and Statistics, and Programming Skills. Labeled areas: Analytics, Algorithms, Models, Distributed Computing (Hadoop, Spark), Statistics, Machine Learning, Deep Learning (Neural Networks), Natural Language Processing, Data Mining, Data Visualization, Social Network & Graph Analysis, Linear Algebra & Programming, Medical Engineering & Science.
How Big Is the Data Industry Today?
All rights reserved, Kai Hwang, July 2015
The Evolution of Scalable Parallel Computing on Clouds: From MapReduce to Hadoop and Spark in the last 10 years
• Google MapReduce paradigm, written in C++: from search engine to Google AppEngine
• Hadoop library for MapReduce programming in a Java environment
• Extending Hadoop from MapReduce batch processing over a bipartite (map, then reduce) workflow using distributed disks
• Using Spark for in-memory processing in streaming mode over any DAG computing paradigm
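The map, shuffle, and reduce steps behind the MapReduce paradigm listed above can be sketched in plain Python. This is a single-process illustration only, not Google's or Hadoop's implementation; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the pairs."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input (not from the slides)
docs = ["big data on clouds", "big clouds", "data analytics on big data"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data between them through distributed disks, which is the batch-oriented cost that in-memory engines try to avoid.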
Distributed and Cloud Computing, by Kai Hwang, Geoffrey Fox, and Jack Dongarra, published by Morgan Kaufmann, 2012 (648 pages), covering clouds, grids, social networks, P2P systems, Internet of Things (IoT), and Big Data and Security studies. Second edition to appear in late 2016.
Data Deluge Enabling New Challenges
Architecture of The Internet of Things
Figure: Three-layer IoT architecture.
Application Layer: Merchandise Tracking, Environment Protection, Intelligent Search, Telemedicine, Intelligent Traffic, Smart Home.
Network Layer: Cloud Computing Platform, Mobile Telecom Network, The Internet, Information Network.
Sensing Layer: RFID (RFID Label), Sensor Network (Sensor Nodes), GPS (Road Mapper).
Four Major IoT Components:
• Wireless Sensor Network (WSN): Spatially distributed sensors that monitor physical or environmental conditions. WSNs emphasize information perception through all kinds of sensor nodes, a basic scenario of the IoT.
• Machine-to-Machine (M2M) Communication: Typically, M2M refers to data communication with little or no human intervention among terminal devices such as computers, embedded processors, smart sensors/actuators, and mobile devices.
• Body-Area Network (BAN): Lightweight, small-size, ultra-low-power, intelligent wearable sensors that continuously monitor a person's physiological condition for health status and motion control.
• Cyber-Physical System (CPS): A system of collaborating computational elements controlling physical entities.
Body-Area Networks (BAN) for Health-Care and Other Personal Applications
12 Prof. Kai Hwang, USC, Nov. 25, 2013
51 Use Cases of Big Data: from TBs to PBs (NIST 2013)
1. Government Operation: National Archives and Records Administration, Census Bureau
2. Commercial: Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping
3. Defense: Sensors, Image surveillance, Situation Assessment
4. Healthcare and Life Sciences: Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
5. Deep Learning and Social Media: Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
51 Big-Data Use Cases: from TBs to PBs (Continued)
6. The Ecosystem for Research: Metadata, Collaboration, Language Translation, Light source experiments
7. Astronomy and Physics: Sky Surveys, Large Hadron Collider at CERN and Belle Accelerator in Japan
8. Earth, Environmental and Polar Science: Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
9. Energy: Smart grid
Example: 2012 US Election
Progression from Simple Analysis of Small Data To Cloud Analytics for Big Data
Cloud Analytics on Big Data
Bayesian Classifiers, ANN, SVM
Analytics Process Model
Execution Layers of Cloud for Big-Data Mining and Analytics Apps.
Cloud Mining of Big Data
Traditional Database/Data Warehouse:
• Structured data; low processing efficiency of unstructured data
• Expensive hardware; poor compatibility
• High scalability at the cost of vendor lock-in
Distributed architecture: Server supercluster (Data Warehouse) + Hadoop
• Hybrid analysis capabilities for structured/unstructured data
• x86 servers; good compatibility
• High scalability, with deployments of over 10,000 nodes
Data scale: TB, PB, EB, ZB
Evolution of Hadoop Programming On Internet Clouds In 10 Years
Machine Learning Approaches (1)
1. Decision Tree Learning: Using a multi-way tree to make categorical decisions.
2. Association Rule Learning: Discovering interesting relations between variables in large databases.
3. Artificial Neural Networks (ANN): Learning algorithms inspired by the structure and functional aspects of biological neural networks.
4. Support Vector Machines (SVM): Supervised learning methods for classification and regression.
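As a small illustration of association rule learning from the list above, the sketch below computes the two standard measures of a rule X -> Y: support (the fraction of transactions containing both X and Y) and confidence (support of X and Y together divided by support of X). The basket data and the function names are hypothetical, chosen only for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets (not from the slides)
baskets = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is usually reported as interesting only when both measures exceed thresholds chosen by the analyst.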
Machine Learning Approaches (2)
5. Clustering Analysis: Grouping sample data into clusters with similar properties or by some predefined criteria.
6. Bayesian Networks: A belief network that represents a set of random variables and their conditional independencies.
7. Representation Learning: Preserving the input information while transforming it, as a preprocessing step for other classification algorithms.
8. Genetic Algorithms (GA): A search heuristic that mimics the process of natural selection and uses methods such as mutation and crossover to generate new genotypes that lead to better decisions.
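To make the clustering item above concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The slides do not name a specific clustering algorithm; k-means is assumed here because it is a common choice, and the sample points and starting centroids are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data (not from the slides)
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, [0.0, 5.0])
print(centroids, clusters)
```

The result depends on the starting centroids, so practical implementations usually try several initializations.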
Bayesian Classifiers (1)
• Consider each attribute and the class label as random variables.
• Given a record with attributes $(A_1, A_2, \ldots, A_n)$, the goal is to predict class $C$.
• Specifically, we want to find the value of $C$ that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$.
• Can we estimate $P(C \mid A_1, A_2, \ldots, A_n)$ directly from data?
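Bayesian Classifiers (2)
• Compute the posterior probability $P(C \mid A_1, A_2, \ldots, A_n)$ for all values of $C$ using Bayes' theorem:

$$P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n \mid C)\, P(C)}{P(A_1, A_2, \ldots, A_n)}$$

• Choose the value of $C$ that maximizes $P(C \mid A_1, A_2, \ldots, A_n)$.
• This is equivalent to choosing the value of $C$ that maximizes $P(A_1, A_2, \ldots, A_n \mid C)\, P(C)$, since the denominator does not depend on $C$.
• How to estimate $P(A_1, A_2, \ldots, A_n \mid C)$?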
Naïve Bayes Classifier (3)
• Assume independence among attributes $A_i$ when the class is given:

$$P(A_1, A_2, \ldots, A_n \mid C_j) = P(A_1 \mid C_j)\, P(A_2 \mid C_j) \cdots P(A_n \mid C_j)$$

• We can estimate $P(A_i \mid C_j)$ for all $A_i$ and $C_j$.
• A new point is classified to $C_j$ if $P(C_j) \prod_{i=1}^{n} P(A_i \mid C_j)$ is maximal.
Example of Naïve Bayes Classifier (4)

Name            Give Birth  Can Fly  Live in Water  Have Legs  Class
human           yes         no       no             yes        mammals
python          no          no       no             no         non-mammals
salmon          no          no       yes            no         non-mammals
whale           yes         no       yes            no         mammals
frog            no          no       sometimes      yes        non-mammals
komodo          no          no       no             yes        non-mammals
bat             yes         yes      no             yes        mammals
pigeon          no          yes      no             yes        non-mammals
cat             yes         no       no             yes        mammals
leopard shark   yes         no       yes            no         non-mammals
turtle          no          no       sometimes      yes        non-mammals
penguin         no          no       sometimes      yes        non-mammals
porcupine       yes         no       no             yes        mammals
eel             no          no       yes            no         non-mammals
salamander      no          no       sometimes      yes        non-mammals
gila monster    no          no       no             yes        non-mammals
platypus        no          no       no             yes        mammals
owl             no          yes      no             yes        non-mammals
dolphin         yes         no       yes            no         mammals
eagle           no          yes      no             yes        non-mammals

Record to classify:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes
M: mammals
N: non-mammals

$$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$$

$$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$$

$$P(A \mid M)\, P(M) = 0.06 \times \frac{7}{20} = 0.021$$

$$P(A \mid N)\, P(N) = 0.004 \times \frac{13}{20} = 0.0027$$

P(A|M)P(M) > P(A|N)P(N)
=> Mammals
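The worked example above can be reproduced with a short Python sketch. It uses the 20 training records from the table and plain frequency estimates without smoothing, as the slide does; the function names `train` and `score` are assumptions for illustration, not part of any library.

```python
from collections import Counter, defaultdict

# (give_birth, can_fly, live_in_water, have_legs, class), from the table above
RECORDS = [
    ("yes", "no", "no", "yes", "mammals"),             # human
    ("no", "no", "no", "no", "non-mammals"),           # python
    ("no", "no", "yes", "no", "non-mammals"),          # salmon
    ("yes", "no", "yes", "no", "mammals"),             # whale
    ("no", "no", "sometimes", "yes", "non-mammals"),   # frog
    ("no", "no", "no", "yes", "non-mammals"),          # komodo
    ("yes", "yes", "no", "yes", "mammals"),            # bat
    ("no", "yes", "no", "yes", "non-mammals"),         # pigeon
    ("yes", "no", "no", "yes", "mammals"),             # cat
    ("yes", "no", "yes", "no", "non-mammals"),         # leopard shark
    ("no", "no", "sometimes", "yes", "non-mammals"),   # turtle
    ("no", "no", "sometimes", "yes", "non-mammals"),   # penguin
    ("yes", "no", "no", "yes", "mammals"),             # porcupine
    ("no", "no", "yes", "no", "non-mammals"),          # eel
    ("no", "no", "sometimes", "yes", "non-mammals"),   # salamander
    ("no", "no", "no", "yes", "non-mammals"),          # gila monster
    ("no", "no", "no", "yes", "mammals"),              # platypus
    ("no", "yes", "no", "yes", "non-mammals"),         # owl
    ("yes", "no", "yes", "no", "mammals"),             # dolphin
    ("no", "yes", "no", "yes", "non-mammals"),         # eagle
]


def train(records):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(r[-1] for r in records)
    # value_counts[cls][attribute_index][value] = number of matching records
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for *attrs, cls in records:
        for i, value in enumerate(attrs):
            value_counts[cls][i][value] += 1
    return class_counts, value_counts


def score(query, class_counts, value_counts):
    """Return P(A | C) * P(C) for every class C (no smoothing)."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        likelihood = 1.0
        for i, value in enumerate(query):
            likelihood *= value_counts[cls][i][value] / n
        scores[cls] = likelihood * n / total
    return scores


class_counts, value_counts = train(RECORDS)
scores = score(("yes", "no", "yes", "no"), class_counts, value_counts)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because no smoothing is applied, any attribute value never seen with a class drives that class's score to zero; practical implementations commonly add Laplace smoothing to avoid this.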
Decision Tree Training Process
What is Apache Spark? Very Powerful in Big Data Processing
• It is a cluster computing platform designed to be fast for general-purpose applications, compared with MapReduce on Hadoop.
• On the speed side, Spark extends the MapReduce model to support interactive queries and stream processing.
• Spark offers the ability to run computations in memory, which is more efficient than MapReduce running on disks for complex applications.
• Spark is highly accessible, offering simple APIs in Python, Java, Scala, SQL, etc. Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra.
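The benefit of in-memory computation can be illustrated with a toy, single-process sketch in plain Python. This is not Spark's API: the class `LazyDataset`, the function `load_from_disk`, and the counter are hypothetical stand-ins that only mimic two ideas, lazy transformations and caching an intermediate result in memory so that later computations do not go back to disk.

```python
class LazyDataset:
    """Toy stand-in for a dataset with lazy transformations and caching."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that builds the data
        self._keep = False        # whether to hold the result in memory
        self._cached = None

    def map(self, fn):
        # Nothing runs yet: the work is deferred until collect() is called.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._keep:
            self._cached = result
        return result


stats = {"loads": 0}


def load_from_disk():
    """Hypothetical expensive read; counts how many times it runs."""
    stats["loads"] += 1
    return list(range(10))


base = LazyDataset(load_from_disk)
squares = base.map(lambda x: x * x).cache()

# Two separate computations reuse the cached squares.
evens = squares.filter(lambda x: x % 2 == 0).collect()
total = sum(squares.collect())
print(evens, total, stats["loads"])
```

Without the `cache()` call the second computation would trigger another read from the source, which is the repeated disk access that makes iterative workloads slow under disk-based MapReduce.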
Key Spark Libraries by 2015
• Spark SQL deals with structured data.
• Spark Streaming handles live streams of data.
• The MLlib library contains common machine learning functionality.
• GraphX is for manipulating social network graphs.
• Spark can run on several cluster managers:
  • Hadoop YARN
  • Apache Mesos
  • Spark's own Standalone Scheduler
The SMACT Technologies
Services-Oriented Architecture (SOA)
Gartner's 2015 Hype Cycle of Emerging New Technologies
SMACT Technologies
Centralized Cloud Control and Processing
SMACT Technologies by Basic Theories, Typical Hardware, Software Tooling, Networking and Service Providers

Cloud Computing
  Theoretical Foundations: Virtualization, Parallel and Distributed Computing
  Hardware Advances: Server clusters, Clouds, Virtual Machines, Interconnection networks
  Software Tools and Libraries: OpenStack, GFS, HDFS, MapReduce, Hadoop, Spark, Storm, Cassandra
  Networking Enablers: Virtual Networks, OpenFlow Networks, Software-Defined Networks
  Representative Service Providers: AWS, GAE, IBM, Salesforce, GoGrid, Apache, Azure, Rackspace, DropBox

Internet of Things (IoT)
  Theoretical Foundations: Sensing Theory, Cyber Physics, Navigation, Pervasive Computing
  Hardware Advances: Sensors, RFID, GPS, Robotics, Satellites, ZigBee, Gyroscope
  Software Tools and Libraries: TinyOS, WAP, WTCP, IPv6, Mobile IP, Android, iOS, WPKI, UPnP, JVM
  Networking Enablers: Wireless LAN, PAN, MANET, WLAN Mesh, VANET, Bluetooth
  Representative Service Providers: IoT Council, IBM, Health-Care, Smart Grid, Social Media, Smart Earth, Google, Samsung
Concluding Remarks:
• We must leverage computing clouds and data analytics for the storing, processing and mining of big data, which changes continually in time and space.
• Apache Hadoop has made big-data processing possible on large server clusters and on elastic clouds over the past decade.
• Since 2009, Berkeley Spark has lifted many constraints of MapReduce and Hadoop programming for general-purpose static or streaming big-data applications.
• Big data and cloud growth demand a major overhaul of our educational programs in computer science and technology.
• Clouds, mobile, IoT and social networks are changing our world, reshaping human relations, promoting the global economy and triggering societal reforms on a global scale.