
Myths and Reality of Big Data

Artificial Intelligence Laboratory, Department of Computer Science, Chungbuk National University

이건명

ailab.cbnu.ac.kr

CBNU AI Lab.

1. Big Data

2. Storage for Big Data : HDFS

3. Processing of Big Volume : MapReduce

4. Analysis of Big Data

5. Visualization of Big Data

6. Myths to Big Data

7. Conclusions

Contents


14:02


Big data is a collection of data so large that ordinary software cannot capture, store, curate, search, share, and analyze it within a tolerable time

- Term origin: John Mashey, mid-1990s, SG

The 3V issues

-Volume • giga, tera, peta, exa

-Variety • Data types: numeric, set-valued, categorical, mixed

• Kinds of tasks: curate, search, share, analyze, …

-Velocity (processing speed) • Service within a tolerable time


1. Big Data


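Twitter: 200 million users, more than 900 billion tweets per day, 12 GB per day

Facebook: 1 billion active users (Jan. 2013), 100 billion photos (summer 2011), 6 billion uploads per month

Number of web pages: 644 million websites (Mar. 2012), at least 1 trillion web pages (indexed pages: at least 16 billion)

Wal-Mart: 2.5 PB per hour, about 4,000 stores

Examples of Big Data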


Big Volume: Distributed File Systems

-HDFS

Heavy Computation: Parallel & Distributed Processing

-MapReduce, MPI, OpenMP, CUDA

Meaningful Analysis: Data Analysis

-General purpose / tailored analysis

-R, Python, Mahout

Persuasive Report: Visualization

-R, D3.js


Big Data Issues and Tools

Big data storage

-Manage big data at tera-, peta-, and exabyte scale

-Acquire the storage system at an affordable cost

Economical distributed file system: Hadoop Distributed File System

-Cluster of commodity PCs

-Robustness & fault tolerance


2. Storage for Big Data

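Hadoop: a distributed computing framework written in Java

-One of the Apache projects (Apache Hadoop)

HDFS (Hadoop Distributed File System)

-Distributed file system for managing very large files

MapReduce

-Programming paradigm for distributed parallel processing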


Hadoop


Big data processing examples on a Hadoop cluster: sorting

-Sorting 1 PB on 3,658 nodes at Yahoo: 16.25 h

-Sorting 9 TB on 900 nodes: 1.8 h


Hadoop


Hadoop cluster

-Connects up to 4,000 commodity PCs (limit recently extended)

-Supports a large-scale distributed file system

-Supports distributed parallel processing


Hadoop


Yahoo Hadoop cluster


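HDFS (Hadoop Distributed File System)

-Uses many PC servers

-Stores files in a distributed, replicated manner

-Improves the efficiency of distributed processing

-Robustness to failures, dynamic recovery, and load balancing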


HDFS


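[Figure: a Namenode and four Datanodes. File1 consists of blocks 1 2 3 4; the Datanodes hold blocks (1 2 4), (2 1 3), (1 4 3), and (3 2 4), so each block is stored three times.]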
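The replication idea in the figure can be sketched in plain Java. This is only a toy illustration: the round-robin placement rule and the names `place` and `allBlocksAvailable` are assumptions made for this sketch, not HDFS's actual placement policy or API. It places 4 blocks on 4 nodes with 3 copies each and checks that every block survives the loss of one node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ReplicaPlacementSketch {

    // Toy policy (an assumption, not HDFS's real rack-aware policy):
    // block b is copied to `replication` consecutive nodes starting at node b % nodeCount.
    static List<Set<Integer>> place(int blockCount, int nodeCount, int replication) {
        List<Set<Integer>> nodes = new ArrayList<>();
        for (int n = 0; n < nodeCount; n++) {
            nodes.add(new TreeSet<>());
        }
        for (int b = 1; b <= blockCount; b++) {
            for (int r = 0; r < replication; r++) {
                nodes.get((b + r) % nodeCount).add(b);
            }
        }
        return nodes;
    }

    // True if every block 1..blockCount is still stored on at least one surviving node.
    static boolean allBlocksAvailable(List<Set<Integer>> nodes, int blockCount, int failedNode) {
        Set<Integer> seen = new TreeSet<>();
        for (int n = 0; n < nodes.size(); n++) {
            if (n != failedNode) {
                seen.addAll(nodes.get(n));
            }
        }
        return seen.size() == blockCount;
    }

    public static void main(String[] args) {
        List<Set<Integer>> nodes = place(4, 4, 3);
        for (int n = 0; n < nodes.size(); n++) {
            System.out.println("datanode " + n + " holds blocks " + nodes.get(n));
        }
        System.out.println("survives loss of datanode 0: " + allBlocksAvailable(nodes, 4, 0));
    }
}
```

With three copies of each block, any single node can fail without losing data, which is the robustness property the HDFS slide refers to.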

Processing of Big Data

-Parallel and distributed computation

-Use distributed nodes and/or multicores/graphics processors

MapReduce

-Works on HDFS

MPI

OpenMP

CUDA

Dryad


3. Processing of Big Volume

MapReduce

-Data-parallel programming model for clusters of commodity machines

-Based on the functional programming model

-Applies a given operation to each data element

-e.g. (map square '(1 2 3 4)) ==> (1 4 9 16)

-Splits a large job into a combination of map and reduce tasks

map task (mapper)

- map(key, value) → list(key', value')

reduce task (reducer)

- reduce(key, list(values)) → list(value')
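The functional idea behind the two operations can be tried in plain Java before moving to Hadoop. The sketch below is only an illustration with the standard `java.util.stream` API. The names `mapSquare` and `reduceSum` are hypothetical, and nothing here is distributed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {

    // map: apply the same function to every element independently
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // reduce: fold a list of values into one result (here, their sum)
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> squares = mapSquare(Arrays.asList(1, 2, 3, 4));
        System.out.println(squares);            // [1, 4, 9, 16]
        System.out.println(reduceSum(squares)); // 30
    }
}
```

Because each element is squared independently of the others, the map step can be split across machines without coordination. That independence is what makes the model data-parallel.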


MapReduce


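MapReduce example: what is the highest temperature in each city?

Input data

Yesterday Seoul was 30 degrees..

The temperature in Incheon is 31 degrees..

The temperature in Daejeon is 24 degrees..

Cheongju is 29 today and…

Daejeon is sweltering at 32 degrees..

In Gwangju's 33-degree weather…

The temperature in Daegu is 24….

Today's weather in Seoul is 28 degrees..

Cheongju is a cool 23 degrees..

Last week Daejeon averaged 32

In Seoul's 33-degree heat wave…

Incheon's high was 26 degrees…

Incheon's morning temperature was 18…

Tomorrow Gwangju will be 24 degrees…

Busan is 32 degrees today and…

Seoul's morning temperature was 19 degrees...

mapper output (Key, Value)

(Seoul, 30) (Incheon, 31) (Daejeon, 24) (Cheongju, 29) (Daejeon, 32) (Gwangju, 33) (Daegu, 24) (Seoul, 28) (Cheongju, 23) (Daejeon, 32) (Seoul, 33) (Incheon, 26) (Incheon, 18) (Gwangju, 24) (Busan, 32) (Seoul, 19)

Grouped by key (reducer input)

(Seoul, (30, 28, 33, 19)) (Incheon, (31, 26, 18)) (Daejeon, (24, 32, 32)) (Cheongju, (29, 23)) (Gwangju, (33, 24)) (Daegu, (24)) (Busan, (32))

reducer output

(Seoul, 33) (Incheon, 31) (Daejeon, 32) (Cheongju, 29) (Gwangju, 33) (Daegu, 24) (Busan, 32)

MapReduce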
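The three stages of this example (map, group by key, reduce) can be imitated in a single Java process. The sketch below is a minimal, non-distributed illustration and does not use the Hadoop API. The city list (romanized here), the naive "first number in the line" parsing, and the method names `map`, `shuffle`, `reduce`, and `run` are all assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaxTemperatureSketch {

    // Hypothetical city list; real text would need more careful parsing.
    static final List<String> CITIES = Arrays.asList(
            "Seoul", "Incheon", "Daejeon", "Cheongju", "Gwangju", "Daegu", "Busan");
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // map phase: one input line -> at most one (city, temperature) pair
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        for (String city : CITIES) {
            if (line.contains(city) && m.find()) {
                out.add(new SimpleEntry<>(city, Integer.parseInt(m.group())));
                break;
            }
        }
        return out;
    }

    // shuffle phase: group all values that share a key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce phase: (city, list of temperatures) -> highest temperature
    static int reduce(String city, List<Integer> temps) {
        return Collections.max(temps);
    }

    // Runs map, shuffle, and reduce one after another in this process.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Yesterday Seoul was 30 degrees",
                "The temperature in Incheon is 31 degrees",
                "Today's weather in Seoul is 28 degrees",
                "In Seoul's 33-degree heat wave",
                "Incheon's high was 26 degrees",
                "Cheongju is 29 today",
                "Cheongju is a cool 23 degrees");
        System.out.println(run(lines)); // {Cheongju=29, Incheon=31, Seoul=33}
    }
}
```

In a real cluster the framework performs the shuffle step between nodes. Only the `map` and `reduce` logic is written by the programmer.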

MapReduce job pipeline

-Many mapper and reducer processes run in a distributed fashion


MapReduce


Master node

-Runs the JobTracker instance • Receives and handles job requests from clients

Slave nodes

-Run TaskTracker instances • Spawn a separate Java process for each task instance


MapReduce Engine


MapReduce Programming

// MapReduce program that counts the occurrences of each word
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // mapper: emits (word, 1) for every token in an input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // reducer: sums the counts collected for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job and submits it
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // new Job(...) is deprecated in Hadoop 2.x; Job.getInstance(conf, "word count") is the newer form
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the reducer doubles as a local combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

MapReduce programming languages

-MapReduce programs can be written in many languages

Java

R / Python / Ruby

Streaming

Pig / Hive / Cascading: higher-level abstraction


MapReduce Languages

Representative examples

Google

-Index building for Google Search

-Article clustering for Google News

-Statistical machine translation

Yahoo!

-Index building for Yahoo! Search

-Spam detection for Yahoo! Mail

Facebook

-Data mining

-Ad optimization

-Spam detection


MapReduce Application Areas

Representative examples (cont.)

Research

– Analyzing Wikipedia conflicts (PARC)

– Natural language processing (CMU)

– Bioinformatics (Maryland)

– Astronomical image analysis (Washington)

– Ocean climate simulation (Washington)

Machine learning tools

– Mahout


MapReduce Application Areas

Data Analysis Tools

R

Python

MATLAB

SPSS, SAS

Mahout


4. Analysis of Big Data

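R

-A programming language and software environment for data analysis and visualization

-Open source, with a wide range of libraries

-Core packages are installed together with R, and packages are distributed through CRAN (the Comprehensive R Archive Network)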


Analysis of Big Data


Mahout: a scalable machine learning and data mining library

Classification

-Logistic regression, Naïve Bayes/Complementary Naïve Bayes, Random forest, HMM, Online Passive Aggressive

Clustering

-Canopy, k-means, fuzzy k-means, Mean-shift, Dirichlet, LDA, MinHash, kMeans++, hierarchical, spectral clustering

Dimension reduction

-SVD, Stochastic SVD, PCA, GDA

Evolutionary Algorithms

Other

-Distance computation, collocations


Analysis of Big Data

Popularity of Hadoop

Hadoop ecosystem

- various supporting tools developed


Why Hadoop is so popular


Hadoop Ecosystem

Effective Reporting of Big Data Analysis Results

-Use visualization tools

-Make data visible to the users

R

D3.js

Visualization

-Placing numbers in space so that their patterns can be perceived

-Uses the excellent human ability to recognize patterns to help identify the patterns that statistical analysis would reveal


5. Visualization of Big Data

Provides a variety of visualization methods


R-based Visualization

D3.js (D3, Data-Driven Documents)

-A JavaScript library for creating dynamic, interactive graphics in the web browser

-Supports data visualization using the Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS) standards


Visualization with D3.js

Not all big data is suited to Hadoop processing.

Applications not suited to MapReduce

-Processing thousands of small files (each smaller than one HDFS block, typically 128 MB)

-Processing very large data sets with a small HDFS block size

-Applications with a large number (thousands) of maps, each with a very short runtime (e.g., 5 s)

-Applications processing large data sets with very few reduces (e.g., 1)

-Applications with so many reduces that each one processes less than 1-2 GB of data

-Applications writing out multiple small output files from each reduce
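A rough calculation shows why many small files hurt. The sketch below assumes, as a simplification, that each input split becomes one map task, that splits never span files, and that the split size equals the HDFS block size. The method name `mapTasks` and the 10 GB workload are made up for illustration.

```java
public class MapTaskEstimate {

    static final long MB = 1024L * 1024L;

    // Assumes one map task per input split, splits never span files,
    // and split size equals the HDFS block size (a simplification).
    static long mapTasks(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long splitsPerFile = Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
        return fileCount * splitsPerFile;
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // The same 10 GB of input, stored two ways.
        long manySmall = mapTasks(10_240, 1 * MB, block); // 10,240 files of 1 MB each
        long oneLarge = mapTasks(1, 10_240 * MB, block);  // one 10 GB file
        System.out.println(manySmall); // 10240
        System.out.println(oneLarge);  // 80
    }
}
```

Under these assumptions the small-file layout launches 128 times as many map tasks for the same amount of data, so the fixed startup cost of a task dominates the useful work.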


6. Myths to Big Data

Is this big data?

Is this required to be processed in real time?

Which is the right tool for the job?


Right Tool for the Job

image : http://logicalsysinc.wordpress.com/2012/09/24/plcs-versus-pcs-for-control-is-there-still-a-debate/

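MPI (Message Passing Interface)

-A standard that describes how information is exchanged in distributed and parallel processing

-Specifies the basic functions, syntax, and programming API needed to exchange information in parallel processing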


MPI

image : http://ainkaboot.co.uk/cluster-architecture.php

OpenMP (Open Multi-Processing)

-A shared-memory multiprocessing programming API

-Supports C, C++, and Fortran on many platforms, including Unix and Microsoft Windows


OpenMP

http://en.wikipedia.org/wiki/OpenMP

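CUDA (Compute Unified Device Architecture)

-A GPGPU technology that lets (parallel processing) algorithms that run on graphics processing units (GPUs) be written in C and other languages

-Runs on the GeForce 8 series (G8x GPUs) and later

-GPUs have a parallel many-core architecture, and each core can run thousands of threads concurrently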


CUDA

image : http://govardhant.com/research.html

MapReduce

-Provides a fault-tolerant mechanism

-Handles big volumes of data

-Communicates between nodes through disk I/O

MPI

-Controls parallel processes at a finer granularity

-Communicates by message passing



OpenMP

-Shared-memory architecture

-Multi-threading on a single node (host)

CUDA

-A "SIMD" architecture

-Works well when a similar operation is applied to a large dataset on a single node
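The single-node, shared-memory style can be approximated in Java with parallel streams. This is only an analogue chosen to stay in the deck's language; it uses neither OpenMP nor CUDA. The class and method names are hypothetical.

```java
import java.util.stream.LongStream;

public class SharedMemoryParallelSketch {

    // Sum of squares 1..n, computed by several threads of one JVM that share memory.
    static long sumOfSquaresParallel(long n) {
        return LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum();
    }

    // The same computation in a single thread, for comparison.
    static long sumOfSquaresSequential(long n) {
        long total = 0;
        for (long x = 1; x <= n; x++) {
            total += x * x;
        }
        return total;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println(sumOfSquaresParallel(n) == sumOfSquaresSequential(n)); // true
    }
}
```

The same operation is applied to every element and no data leaves the machine, which is the situation where the single-node tools on this slide tend to fit better than a cluster framework.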



Data Processing

Handle the big data itself

- develop new algorithms or use the existing ones

Sample or reduce the big data into a manageable data set, and use conventional methods

- wise, efficient, and effective
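One common way to sample a data set too large to hold in memory is reservoir sampling. The slide does not name a technique, so the choice of Algorithm R here is an assumption, as are the class and method names. The sketch keeps a uniform random sample of k items while reading the data once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

public class ReservoirSampleSketch {

    // Keeps a uniform random sample of k items from a stream of unknown length (Algorithm R).
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill the reservoir with the first k items
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set((int) j, item); // replace a slot with probability k / seen
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        Iterator<Integer> big = IntStream.range(0, 1_000_000).boxed().iterator();
        List<Integer> small = sample(big, 10, new Random(42));
        System.out.println(small.size()); // 10
    }
}
```

The resulting sample fits in memory, so conventional analysis tools can be applied to it directly.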



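Requirements for a big data expert

System operation

-System installation and management

Programming

-Java, MapReduce programming

-MapReduce design patterns

Use of Hadoop ecosystem tools

-HBase, Hive, Pig, ZooKeeper

Data analysis skills

-R/Python

-Study of diverse cases (text, stream data, visualization), tera-scale data analysis

-Understanding the business domain

-Statistical literacy

Machine learning and data mining

-Acquire advanced analysis skills

-Mahout

Right tools for the job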


7. Conclusions