NYC Data Science Academy
Hadoop Application Development with Real Cases
Multi-layer Model
Data Pyramid and Roles
Business personnel
ETL engineer
Data warehouse engineer
Analyst
Data visualization engineer
IT support: operations and maintenance, programmers
Data Analysis
Analyze collected data with statistical methods for a defined purpose, then
interpret the results and act on them
Data Mining
Data mining is a technique for retrieving hidden information from data. It is a
process that applies knowledge-discovery algorithms to large databases and shows the
resulting associations to users.
Origins: hypothesis testing, pattern recognition, artificial intelligence, machine
learning
Common data mining projects: association rules, clustering, outlier analysis
Case: beer and diapers
Science: Detecting Novel Associations in Large Data Sets
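The association-rule idea behind the beer and diaper case can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original study: the baskets are invented, and `pair_rules` and `min_support` are hypothetical names. It counts item pairs, keeps those whose support reaches a threshold, and reports the confidence of each rule in both directions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = P(b in basket | a in basket)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


if __name__ == "__main__":
    for antecedent, consequent, support, confidence in pair_rules(baskets):
        print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting every pair in memory works only for small data; on large data sets the same counting step is what a Map-Reduce job would distribute.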
Business Intelligence
BI = Data Warehouses (Storage) + Data Analysis and Data Mining
(Analysis) + Reporting (Presentation)
Our course
Data Analysis Algorithms
Popular Algorithms
Regression
Time Series Analysis
Classifier
Clustering
Association Rules
Data Analysis
Data Analysis Tools
Popular Data Analysis Tools Ranking
Data Analysis Stages
Stage 1: dominated by business personnel
Stage 2: dominated jointly by business personnel and analysts
Stage 3: dominated by analysts
Data Analysis in Stage 1
Business staff set all the requirements and most of the analysis plans.
Drawing on experience, business staff select features and set
thresholds; IT staff search for and integrate the data; analysts produce
the reports.
Feature selection and the choice of thresholds are based on experience
and personal knowledge.
Suitable for simple cases; the analysis technique is equivalent to the
simplest decision tree.
Business staff have valuable experience and are hard to replace;
analysts only produce charts and are easily replaced.
This is common in traditional industries.
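To make the "simplest decision tree" comparison concrete, here is a small sketch of a one-level rule with a single hand-picked threshold. Everything in it is hypothetical: the customer records, the `monthly_spend` feature, the threshold value, and the label names stand in for whatever business staff would choose from experience.

```python
# Hypothetical customer records; field names and values are invented.
customers = [
    {"id": "c1", "monthly_spend": 120.0},
    {"id": "c2", "monthly_spend": 35.5},
    {"id": "c3", "monthly_spend": 100.0},
]

# Hypothetical threshold, standing in for a value set from experience
# rather than learned from data.
SPEND_THRESHOLD = 100.0


def label(customer, threshold=SPEND_THRESHOLD):
    """One-level decision tree: a single feature compared against one threshold."""
    return "high-value" if customer["monthly_spend"] >= threshold else "regular"


if __name__ == "__main__":
    for customer in customers:
        print(customer["id"], label(customer))
```

The rule is easy to explain and to report on, but both the feature and the cut-off come from personal judgment, which is exactly the limitation this stage describes.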
Data Analysis in Stage 2
The problems are more complex. Business staff can analyze a small
number of data records but cannot work out all the
features and the relationships among them. They have no
experience with large numbers of samples.
Analysts clean the data, select features, and finally
build a suitable model to solve the problem.
Business staff and analysts can evaluate the results
together, so the work is very likely to succeed. Analysts prefer this stage
because their ability and value are recognized.
Spammer in Wordpress
Data Analysis in Stage 3
Business staff have no experience with
the case and cannot offer any useful
prior knowledge.
Analysts use various tools and
models to mine the data, hoping to
make interesting discoveries.
This is the analyst's ideal world, but it is
likely to fail.
Business staff cannot get involved, and
they dislike this stage.
Step Forward
The first stage (gold on the ground) -> the
second stage (gold beneath the ground) -> the
third stage (gold deeply buried)
If analysts act recklessly, business staff will be
reluctant to help.
Data analysis is rooted in the business
background. The goal of analysis is to increase
profit. Successful analysis cannot be separated from
the business.
An interesting topic is more important than the
model.
What is Big Data
Features of Big Data
Challenges for Analysts
Insertion and queries both become bottlenecks as the amount of
data grows.
The trend toward integrating analysis results into user-facing applications demands
faster real-time computation and shorter response times.
More complex models require more expensive computation.
Dilemma of Traditional Data Analysis Tools
R, SAS, and SPSS are experimental tools.
The data size they can handle is limited by memory size.
An Oracle database can hold large volumes of data, but it lacks specialized,
fast analysis capabilities.
Sampling is a limited solution; it is not useful for clustering or
recommendation systems.
Solution: a Hadoop cluster and Map-Reduce parallel computing
Case 1: Analysis and monitoring for a telecommunications company
Configuration of the original database server: HP minicomputer, 128 GB
memory, 48-core CPU, RAC with two nodes, one node for insertion and the
other for queries
Storage: HP virtual storage, over 1000 disks
Architecture: Oracle RAC with two nodes
Bottlenecks: 1. Insertion 2. Query
Case 2: DNA database
Case 3: Social analysis, activity fingerprint detection
28 | April 11, 2023 |
Public voice mail intersect
              IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI   20%      12%      ……   5%       365
User B IMSI   15%      13%      ……   2%       310

Public SMS intersect
              IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI   50%      10%      ……   5%       200
User B IMSI   20%      13%      ……   2%       260

Public base station
              CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI   20%      12%      ……   5%       20%
User B IMSI   15%      13%      ……   2%       5%

Public fingerprint (eigenvector)
                       User A                      User B
Voice mail intersect   (0.2, 0.12, …, 0.05)        (0.15, 0.13, …, 0.02)
SMS intersect          (0.5, 0.1, …, 0.05)         (0.2, 0.13, …, 0.02)
Base station           (0.2, 0.12, …, 0.05, 0.2)   (0.15, 0.13, …, 0.02, 0.05)
Let θ be the angle between two fingerprint vectors:
When θ equals 90°, the two vectors are independent
When θ equals 0, the two vectors are perfectly dependent
The closer θ is to 0, the more dependent the vectors are
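The dependence test above can be sketched as the angle between two fingerprint vectors, assuming the measure is the usual cosine-based angle (the slide does not give the formula). The function name `angle_degrees` is hypothetical, and the sample vectors keep only the three components visible on the slide; the elided components are not reconstructed, so the printed angle is illustrative only.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two equal-length vectors, via cosine similarity."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Truncated, illustrative fingerprints: only the components shown on the slide.
    user_a = (0.2, 0.12, 0.05)
    user_b = (0.15, 0.13, 0.02)
    print(f"angle between the truncated fingerprints: {angle_degrees(user_a, user_b):.1f} degrees")
```

A small angle suggests the two IMSIs share contacts and locations in similar proportions, which is the signal used to flag them as likely belonging to the same user.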
Case 3: Social analysis, VIP detection
The Solution Analysts Look Forward To
Eliminates the bottlenecks for the foreseeable future
Lets existing techniques, for example SQL and R, be migrated smoothly
Cost of the new platform: hardware and software, re-development, skills
training, maintenance
Path to Big Data
Idea of Hadoop
Map-Reduce Programming
Map-Reduce program for meteorological data analysis
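The slide does not state what the meteorological job computes. As a minimal sketch, assume it finds the maximum temperature per year, a common introductory Map-Reduce exercise. The input format ("year,temperature" per line), the sample readings, and the names `map_record`, `reduce_max`, and `run_job` are all hypothetical, and the shuffle phase is simulated in memory rather than run on Hadoop.

```python
from collections import defaultdict

# Hypothetical input: one "year,temperature" reading per line.
lines = ["1949,111", "1950,0", "1950,22", "1949,78", "1950,-11"]


def map_record(line):
    """Map phase: parse one line and emit a (year, temperature) pair."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def reduce_max(year, temperatures):
    """Reduce phase: collapse all readings for one year into the maximum."""
    return year, max(temperatures)


def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)  # shuffle: collect values under their key
    return dict(reduce_max(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(lines))
```

On a real cluster the map calls run in parallel on separate input splits and the framework performs the grouping, but the two user-written functions keep the same shape.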
Map-Reduce implementation for popular algorithms
Why not Hadoop?
Java?
Hard to control?
Hard to integrate data?
Hadoop vs Oracle
Analysis on a Hadoop System
Mainstream: Java programs
Lightweight scripting language: Pig
Smooth migration from SQL: Hive
NoSQL: HBase
Family of Hadoop
Pig
Pig can be treated as client
software for Hadoop: it connects
to Hadoop and runs analyses.
Pig is convenient for users
unfamiliar with Java. It uses Pig Latin,
a SQL-like language for working
with data flows.
Pig Latin can perform sorting,
filtering, summing, grouping, and
joining, and supports user-defined
functions. It is a lightweight
scripting language for data manipulation
and analysis.
Pig can be treated as a
mapping from Pig Latin to Map-Reduce.
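The data-flow style described above can be imitated in plain Python to show what such a script does step by step. This is not Pig Latin and does not use Pig; the records and the `run_flow` name are hypothetical. The comments name the Pig Latin operators (FILTER, GROUP, FOREACH ... GENERATE, ORDER) that each step loosely corresponds to.

```python
from collections import defaultdict

# Hypothetical call records: (user, call_type, minutes).
records = [
    ("alice", "voice", 12),
    ("bob", "voice", 3),
    ("alice", "sms", 1),
    ("bob", "voice", 5),
    ("carol", "voice", 20),
]


def run_flow(records):
    """Filter, group, sum, and sort: one named intermediate result per step."""
    # Roughly FILTER: keep only voice calls.
    voice = [record for record in records if record[1] == "voice"]

    # Roughly GROUP ... BY user: collect minutes under each user.
    grouped = defaultdict(list)
    for user, _call_type, minutes in voice:
        grouped[user].append(minutes)

    # Roughly FOREACH ... GENERATE with SUM: one total per group.
    totals = [(user, sum(minutes)) for user, minutes in grouped.items()]

    # Roughly ORDER ... BY total DESC.
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for user, total in run_flow(records):
        print(user, total)
```

Each intermediate name plays the role of a relation in a data-flow script; Pig's job is to turn such a chain of steps into Map-Reduce jobs so the user does not write them by hand.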
Hive
A data warehouse tool that can turn
the primary data structures in
Hadoop into Hive tables
Supports HiveQL, a language
almost the same as SQL; its
functionality matches SQL
except for updating and indexing, and it
can be treated as a mapping
from SQL to Map-Reduce
Offers interfaces for the
shell, JDBC/ODBC, Thrift, and the Web
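The "mapping from SQL to Map-Reduce" can be illustrated by running one aggregate query two ways and checking that the answers agree. This sketch does not use Hive: SQLite (from the Python standard library) stands in for the SQL side, and the table name `calls`, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (caller, minutes).
rows = [("alice", 12), ("bob", 3), ("alice", 7), ("bob", 5)]


def sql_totals(rows):
    """SQL side: a GROUP BY query, run here on in-memory SQLite instead of Hive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (caller TEXT, minutes INTEGER)")
    conn.executemany("INSERT INTO calls VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT caller, SUM(minutes) FROM calls GROUP BY caller")
    )
    conn.close()
    return result


def mapreduce_totals(rows):
    """Map-Reduce side: the GROUP BY column becomes the key, SUM becomes the reducer."""
    shuffled = defaultdict(list)
    for caller, minutes in rows:          # map: emit (caller, minutes)
        shuffled[caller].append(minutes)  # shuffle: group values by key
    return {caller: sum(values) for caller, values in shuffled.items()}  # reduce


if __name__ == "__main__":
    print(sql_totals(rows))
    print(mapreduce_totals(rows))
```

The correspondence shown here (grouping column as key, aggregate as reducer) is the general idea; how Hive actually plans and optimizes a query is more involved.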
Features of Mahout
Mahout provides scalable machine
learning algorithms (Map-Reduce
implementations), but a Hadoop
platform is not required. The core
library also includes efficient
single-machine algorithms.
Mature and popular algorithms:
1. Frequent Itemset Mining
2. Clustering
3. Classifier
4. Recommendation System
5. Frequent Subgraph Mining
Reference Textbooks
Typical Experiment Environment (with a server)
Server: ESXi, capable of hosting multiple virtual machines and running
3 of them at the same time
PC: Linux or Windows + Cygwin; Linux can run standalone or in a virtual
machine
SSH: use the ssh command under Linux, or SecureCRT or PuTTY under
Windows, to connect to the remote Linux server
VMware client: management of ESXi
Hadoop: use version 1.x or 2.x
Typical Experiment Environment (with only a PC or laptop running Windows)
At least 4 GB of memory. 64-bit Windows is preferred, because a 32-bit machine
can use only a little more than 3 GB of memory.
Install VMware Workstation or VirtualBox.
Deploy 3 virtual machines and run them at the same time. If only two VMs
can run, treat the host as a node (via Cygwin) and use bridged networking for
the virtual network.
Install Linux and Java.
On older computers, consider a pseudo-distributed environment.
Experiment Environment
Deploy Pig
Deploy Hive
Deploy Mahout
List of Cases in the Course
Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)
LBS application for a telecommunications company; analysis of users' mobile
phone traces (Map-Reduce)
User analysis for a telecommunications company; labeling duplicate users by
call fingerprints (Map-Reduce)
Recommendation system for an e-commerce company (Map-Reduce)
Complex recommendation system application (Mahout)
Social network; distance between users; community detection (Pig)
Importance of nodes in a social network (Map-Reduce)
Application of clustering algorithms; VIP analysis (Map-Reduce, Mahout)
Financial data analysis; retrieving reverse repurchase information from historical
data (Hive)
Setting stock strategies with data analysis (Map-Reduce, Hive)
GPS application; sign-in data analysis (Pig)
Implementation and optimization of sorting on Map-Reduce
Middleware development; cooperation among multiple Hadoop clusters