Hadoop dev 01

NYC Data Science Academy: Hadoop Application Development with Real Cases


Multi-layer Model

Data Pyramid and Roles

Business personnel

ETL Engineer

Data Warehouse Engineer

Analyst

Data Visualization Engineer

IT support: Operations and Maintenance, Programmer

Data Analysis

Analyze collected data with statistical methods for a defined purpose, then interpret the results and act on them

Data Mining

Data mining is a technique for retrieving hidden information from data. It is a process that applies knowledge-discovery algorithms to large databases and shows the resulting associations to users.

Original ideas: Hypothesis Testing, Pattern Recognition, Artificial Intelligence, Machine Learning

Common Data Mining Projects: Association Rules, Clustering, Outlier Analysis

Case: Beer and Diaper

Science: Detecting Novel Associations in Large Data Sets
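The beer-and-diaper case is usually told in terms of association-rule measures. A minimal sketch of support, confidence, and lift follows; the baskets are made up for illustration and are not data from the course.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
rule_lift = lift({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the two items appear together more often than they would if they were bought independently.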

Business Intelligence

BI = Data Warehouses (Storage) + Data Analysis and Data Mining (Analysis) + Reports (Presentation)

Our course

Data Analysis Algorithms

Popular Algorithms

Regression

Time Series Analysis

Classifier

Clustering

Association Rules

Data Analysis

Data Analysis Tools

Popular Data Analysis Tools Ranking

Data Analysis stages

Stage 1: Dominated by business personnel

Stage 2: Dominated by both business personnel and analysts

Stage 3: Dominated by analysts

Data Analysis in stage 1

Business staff set all the requirements and most of the analysis plans

Drawing on experience, business staff select features and set thresholds; IT staff search for and integrate the data; analysts produce the reports

Feature selection and the choice of thresholds are based on experience and personal knowledge

Suitable for simple cases; the analysis technique is equivalent to the simplest decision tree

Business staff have valuable experience and are hard to replace, while analysts only draw charts and are easily replaced

This is common in traditional industries
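The remark that stage 1 analysis is equivalent to the simplest decision tree can be sketched as one feature compared against one hand-set threshold. The feature name, threshold, and records below are hypothetical stand-ins for choices business staff would make from experience.

```python
# Hypothetical feature and threshold, standing in for values that
# business staff would choose from experience.
FEATURE = "monthly_spend"
THRESHOLD = 100.0


def stage1_rule(record):
    """A depth-1 decision tree: one feature, one hand-set threshold."""
    return "high value" if record[FEATURE] >= THRESHOLD else "regular"


# Hypothetical customer records for illustration.
customers = [
    {"id": 1, "monthly_spend": 250.0},
    {"id": 2, "monthly_spend": 40.0},
    {"id": 3, "monthly_spend": 100.0},
]

labels = {customer["id"]: stage1_rule(customer) for customer in customers}
print(labels)
```

Nothing is learned from the data here: the split point comes entirely from personal knowledge, which is why this approach only suits simple cases.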

Stage 2 cases are more complex. Business staff can analyze a small number of data records but cannot work out all the features and the relationships among them. They have no experience with large numbers of samples.

Analysts step in to clean the data, select features, and finally build a suitable model to solve the problem.

Business staff and analysts can evaluate the results together, so the work is very likely to succeed. Analysts prefer this stage because their ability and value are recognized.

Spammer in Wordpress

In stage 3, business staff have no experience with the case and cannot offer any useful prior knowledge

Analysts use various tools and models to mine the data, hoping to make interesting discoveries

This is the analyst's ideal world, but it is likely to fail

Business staff cannot get involved, and they dislike this stage

Step Forward

The first stage (gold on the ground) -> the second stage (gold beneath the ground) -> the third stage (gold deeply buried)

If analysts are reckless, business staff will be reluctant to help

Data analysis is rooted in the business background. The goal of analysis is to increase profit. Successful analysis cannot be separated from the business

Interesting topic is more important than the

What is Big Data

Features of Big Data

Challenges for Analysts

Bottlenecks in both insertion and query due to the increasing amount of data

The trend of integrating analysis results into users' applications demands faster real-time computation and shorter response times

More complex models require more expensive computation

Dilemma of Traditional Data Analysis Tools

R, SAS, and SPSS are experimental tools

The data size they can handle is limited by memory size

An Oracle database can hold large volumes of data, but it lacks specialized, fast analysis capabilities

Sampling is a limited solution; it is not useful for clustering or recommendation systems

Solution: Hadoop cluster and Map-Reduce parallel computing
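To make the Map-Reduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word counting. It only imitates the programming model and is not Hadoop code; the input splits are hypothetical, and on a real cluster each split would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit (word, 1) for every word in one input split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values that share one key."""
    return key, sum(values)


def run_job(splits):
    emitted = []
    for split in splits:  # each split is independent, so mappers can run in parallel
        emitted.extend(map_phase(split))
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


# Hypothetical input, already divided into two splits.
splits = [
    ["big data big cluster"],
    ["data node", "big node"],
]
counts = run_job(splits)
print(counts)
```

Because mappers never share state and reducers only see values for their own keys, adding machines raises throughput without changing the program.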

Case 1: Analysis and monitoring for a telecommunications company

Configuration of the original database server: HP minicomputer, 128 GB memory, 48-core CPU, RAC with two nodes, one node for insertion and the other for query

Storage: HP virtual storage, over 1000 disks

Architecture: Oracle RAC with two nodes

Bottleneck: 1. Insertion 2. Query

Case 2: DNA database

Case 3: Social analysis, activity fingerprint detection


Public Voice mail

Intersect       IMSI 1   IMSI 2   ……   IMSI n   Total call duration
User A IMSI     20%      12%      ……   5%       365
User B IMSI     15%      13%      ……   2%       310

Public SMS

Intersect       IMSI 1   IMSI 2   ……   IMSI n   Monthly SMS count
User A IMSI     50%      10%      ……   5%       200
User B IMSI     20%      13%      ……   2%       260

Public base station

                CGI 1    CGI 2    ……   CGI n    Shutdown
User A IMSI     20%      12%      ……   5%       20%
User B IMSI     15%      13%      ……   2%       5%

Public Fingerprint

                User A                       User B
Voice mail      (0.2, 0.12, …, 0.05)         (0.15, 0.13, …, 0.02)
SMS             (0.5, 0.1, …, 0.05)          (0.2, 0.13, …, 0.02)
Base station    (0.2, 0.12, …, 0.05, 0.2)    (0.15, 0.13, …, 0.02, 0.05)

Feature vector

Let θ be the angle between two fingerprint vectors

When θ equals π/2, the two vectors are independent

When θ equals 0, the two vectors are perfectly dependent

The closer θ is to 0, the more dependent the vectors are
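One way to read the comparison above is as the angle between two fingerprint vectors, which is an assumption here since the slide's symbols did not survive. Under that assumption, a minimal sketch computes the angle from the cosine of the two vectors. The example vectors keep only the components visible on the slide, so they are illustrative, not the full fingerprints.

```python
import math


def angle_between(u, v):
    """Angle in radians between two vectors, via their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


# Illustrative vectors: only the components shown on the slide,
# with the elided middle entries left out.
user_a = [0.2, 0.12, 0.05]
user_b = [0.15, 0.13, 0.02]

theta = angle_between(user_a, user_b)
print(theta)
```

With non-negative usage shares the angle stays between 0 and π/2, so a small value suggests that two IMSIs may belong to the same person.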


Case 3: Social analysis, VIP detection

The solution analysts are looking for

Eliminates the bottleneck completely for the foreseeable future

Migrates existing techniques smoothly, for example SQL and R

The cost of a new platform: hardware and software, re-development, skills training, maintenance

Path to Big Data

Idea of Hadoop

Map-Reduce Programming

Map-Reduce program for meteorological data analysis

Map-Reduce implementation for popular algorithms

Why not Hadoop?

Hard to control?

Hard to integrate data?

Hadoop vs Oracle

Analysis under Hadoop system

Mainstream: Java programs

Lightweight scripting language: Pig

Smooth migration from SQL: Hive

NoSQL: HBase

Family of Hadoop

Pig can be treated as client software for Hadoop: it connects to Hadoop and runs the analysis

Pig is convenient for users unfamiliar with Java; it uses a SQL-like language, Pig Latin, to work with data flows

Pig Latin can perform sorting, filtering, summing, grouping, and joining, and lets users define custom functions. It is a lightweight scripting language for data operations and analysis

Pig can be viewed as a mapping from Pig Latin to Map-Reduce

Hive is a data warehouse tool that can turn raw data structures in Hadoop into Hive tables

It supports HiveQL, a language almost the same as SQL; its functionality matches SQL except updating, indexing and

Hive can be treated as a mapping from SQL to Map-Reduce

It offers interfaces for shell, JDBC/ODBC, Thrift, W
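To illustrate what a mapping from SQL to Map-Reduce can look like, the sketch below expresses a grouped sum, roughly `SELECT user_id, SUM(duration) FROM calls GROUP BY user_id`, as a mapper and a reducer. The table, column names, and rows are hypothetical, and this is plain Python imitating the idea rather than Hive's actual execution plan.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a hypothetical table calls(user_id, duration).
calls = [
    ("A", 30),
    ("B", 45),
    ("A", 15),
    ("B", 5),
]


def mapper(row):
    """Emit (group-by key, value to aggregate) for one row."""
    user_id, duration = row
    return user_id, duration


def reducer(user_id, durations):
    """Aggregate all values that share one group-by key."""
    return user_id, sum(durations)


def group_by_sum(rows):
    # Sorting by key plays the role of the shuffle-and-sort step.
    pairs = sorted(map(mapper, rows), key=itemgetter(0))
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


result = group_by_sum(calls)
print(result)
```

The GROUP BY column becomes the key emitted by the mapper and the aggregate function becomes the reducer body, which is why familiar SQL queries carry over with little change.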

Features of Mahout

Mahout provides scalable machine learning algorithms (Map-Reduce implementations), but the Hadoop platform is not required: the core library also includes efficient single-machine algorithms

Mature and popular algorithms are:

1. Frequent Itemset Mining

2. Clustering

3. Classifier

4. Recommendation System

5. Frequent Subgraph Mining

Reference Textbooks

Typical Experiment Environment (with a server)

Server: ESXi, capable of hosting multiple virtual machines and running 3 of them at the same time

PC: Linux or Windows + Cygwin; Linux can be a standalone installation or a virtual machine

SSH: Use the ssh command under Linux, or SecureCRT or PuTTY under Windows, to connect to the remote Linux server

VMware client: Management of ESXi

Hadoop: Use version 1.x or 2.x

Typical Experiment Environment (with only a PC or laptop running Windows)

At least 4 GB of memory; 64-bit Windows is preferred, because a 32-bit system can use only a little more than 3 GB of memory

Install VMware Workstation or VirtualBox

Deploy 3 virtual machines and run them at the same time. If only two VMs can run, treat the host as a node (via Cygwin), and use bridged networking for the virtual network

Install Linux and Java

Older computers can use a pseudo-distributed environment

Experiment Environment

Deploy Pig

Deploy Hive

Deploy Mahout

List of Cases of the Course

Analysis of a high-volume website log system; retrieving KPI data (Map-Reduce)

LBS application for a telecommunications company; analysis of users' mobile phone traces (Map-Reduce)

User analysis for a telecommunications company; labeling duplicate users by call fingerprints (Map-Reduce)

Recommendation system for an e-commerce company (Map-Reduce)

Complex recommendation system application (Mahout)

Social network; distance between users; community detection (Pig)

Importance of nodes in a social network (Map-Reduce)

Application of clustering algorithms; analysis of VIPs (Map-Reduce, Mahout)

Financial data analysis; retrieving reverse repurchase information from historical data (Hive)

Setting stock strategies with data analysis (Map-Reduce, Hive)

GPS application; sign-in data analysis (Pig)

Implementation and optimization of sorting on Map-Reduce

Middleware development; cooperation of multiple Hadoop clusters
