MapReduce in R

@holidayworking

August 28, 2010

About me

Hidekazu Tanaka
Twitter: @holidayworking
Occupation: programmer
From: Ebetsu, Hokkaido
Hobbies: listening to music, watching F1
Languages:
  Java, PL/SQL: at work
  Python, Ruby, R: in my own time


http://maps.google.co.jp/maps?ie=UTF8&ll=43.074405,141.532059&spn=0.35461,0.53009&z=11&brcurrent=3,0x5f74aaa710c44cc9:0x742cf679f08fa68f,1

MapReduce

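A programming model devised by Google for processing large-scale data efficiently in parallel
By defining just a pair of functions, map and reduce, you can solve a wide range of computational problems over large-scale data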


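MapReduce processing phases

1. Map phase: generates intermediate data from each record of the input data. Intermediate data are key/value pairs.
2. Shuffle phase: groups intermediate data that share a key, producing each key with a list of its values.
3. Reduce phase: generates output data from each key and the list of values for that key.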
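
The three phases can be sketched in plain R without Hadoop. This is only an in-memory illustration: the `records` list and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this example and are not part of any library.

```r
# Hypothetical in-memory records (type, orders); not the real data set.
records <- list(
  list(type = "single malt", orders = 10L),
  list(type = "blended",     orders = 3L),
  list(type = "blended",     orders = 6L),
  list(type = "single malt", orders = 4L)
)

# Map phase: one key/value pair per input record.
map_phase <- function(records) {
  lapply(records, function(r) list(key = r$type, value = r$orders))
}

# Shuffle phase: group the values by key, giving key -> vector of values.
shuffle_phase <- function(pairs) {
  keys   <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, integer(1))
  split(values, keys)
}

# Reduce phase: collapse each key's values into a single output value.
reduce_phase <- function(grouped) {
  vapply(grouped, sum, integer(1))
}

totals <- reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

On a real cluster the shuffle is done by the framework, so only the map and reduce steps have to be written by hand.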


MapReduce diagram

Reproduced from reference [1]


What MapReduce can do

Counting
Distributed grep
Distributed sort
Building inverted indexes for search engines
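
As one illustration of the list above, an inverted index fits the same map/shuffle/reduce shape. A minimal single-machine sketch, assuming two tiny made-up documents (`docs`, `d1`, `d2` are hypothetical):

```r
# Hypothetical documents keyed by document ID.
docs <- c(d1 = "peaty islay malt", d2 = "smooth blended malt")

# Map: emit one (word, doc_id) pair for every word in every document.
pairs <- do.call(rbind, lapply(names(docs), function(id) {
  words <- strsplit(docs[[id]], " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = id, stringsAsFactors = FALSE)
}))

# Shuffle + reduce: for each word, the sorted unique list of documents.
index <- lapply(split(pairs$value, pairs$key),
                function(ids) sort(unique(ids)))

print(index[["malt"]])
```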


Hadoop

An open-source implementation of Google File System and MapReduce
Hadoop is implemented in Java
  MapReduce jobs are therefore basically written in Java as well
Hadoop Streaming
  Lets you write MapReduce jobs in any language that supports standard input and output
  R is one such language
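
The contract a streaming script has to honor is small: read lines, print "key<TAB>value" lines. A rough sketch of the mapper side, tried locally with an in-memory connection standing in for stdin. The names `stream_map` and `count_once` are made up for this example.

```r
# Apply map_fn to each line read from con and print "key<TAB>value" lines.
# Under Hadoop Streaming con would be file("stdin"); any connection works.
stream_map <- function(con, map_fn) {
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    kv <- map_fn(line)
    cat(kv$key, "\t", kv$value, "\n", sep = "")
  }
}

# Hypothetical map function for a counter: every input line counts once.
count_once <- function(line) list(key = line, value = 1L)

# Local dry run: a text connection plays the role of stdin,
# and capture.output collects what would go to stdout.
con <- textConnection(c("blended", "single malt", "blended"))
emitted <- capture.output(stream_map(con, count_once))
close(con)
print(emitted)
```

Keeping the per-line logic in a function like `map_fn` makes it possible to check a script on a laptop before submitting it to a cluster.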



Implementing MapReduce in R

Analyze Scotch whisky order data from a certain bar
Scotches that appear
  Single malt
    Ardbeg 10 Years Old
    Bowmore 12 Years Old
    Talisker 10 Years Old
    The Glenlivet 12 Year Old
    The Macallan 12 Years
  Blended
    Ballantine 12 Years Old
    Ballantine 17 Years Old
    Johnnie Walker Gold Label 18 Years Old
    Johnnie Walker Swing


Data used

Created with Numbers from iWork
250 records

Date        Brand                                   Type         Orders
2010/07/01  The Macallan 12 Years                   single malt  10
2010/07/01  Ballantine 12 Years Old                 blended      3
2010/07/01  Ballantine 17 Years Old                 blended      6
2010/07/01  Johnnie Walker Gold Label 18 Years Old  blended      6
2010/07/02  The Glenlivet 12 Year Old               single malt  4
2010/07/02  Ardbeg 10 Years Old                     single malt  2
2010/07/02  Ballantine 12 Years Old                 blended      8
2010/07/02  Ballantine 17 Years Old                 blended      7
2010/07/02  Johnnie Walker Swing                    blended      3
(omitted)
2010/07/31  Johnnie Walker Swing                    blended      4
2010/07/31  Johnnie Walker Gold Label 18 Years Old  blended      2
2010/07/31  Bowmore 12 Years Old                    single malt  4
2010/07/31  Talisker 10 Years Old                   single malt  7
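
For a data set this small, the expected totals can also be computed directly in R as a cross-check. A sketch using only the four 2010/07/01 rows shown above, assuming the file is tab-separated with no header line (the slides do not say); the column names are chosen for this example.

```r
# The four 2010/07/01 sample rows, inlined as tab-separated text.
sample_tsv <- paste(
  "2010/07/01\tThe Macallan 12 Years\tsingle malt\t10",
  "2010/07/01\tBallantine 12 Years Old\tblended\t3",
  "2010/07/01\tBallantine 17 Years Old\tblended\t6",
  "2010/07/01\tJohnnie Walker Gold Label 18 Years Old\tblended\t6",
  sep = "\n"
)

# Assumption: no header row; column names are illustrative.
orders <- read.delim(
  text = sample_tsv, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("date", "brand", "type", "orders")
)

# Totals per type and per day, computed without MapReduce.
by_type <- tapply(orders$orders, orders$type, sum)
by_date <- tapply(orders$orders, orders$date, sum)
print(by_type)
print(by_date)
```

The per-day total for these four rows is 25, which is the value a per-day count should report for 2010/07/01 if no other rows exist for that date.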


Count sales per day
Count sales per brand
Count sales by type


MapReduce execution steps

1. Define the Mapper
2. Define the Reducer
3. Run with Hadoop Streaming
   $ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
       -input scotch.tsv \
       -output output \
       -mapper mapper.r \
       -reducer reducer.r
4. Check the results
   $ cat output/part-00000
   blended 592
   single malt 783


Reducer for counting

#!/usr/bin/env Rscript

# Running totals, keyed by the mapper's output key
env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # Each input line is "key<TAB>value"
  line <- unlist(strsplit(line, "\t"))
  key <- line[1]
  value <- as.integer(line[2])
  if (exists(key, envir = env, inherits = FALSE)) {
    # Key seen before: add to its running total
    oldcount <- get(key, envir = env)
    assign(key, oldcount + value, envir = env)
  } else {
    assign(key, value, envir = env)
  }
}
close(con)

# Emit one "key<TAB>total" line per key
for (key in ls(env, all.names = TRUE)) {
  cat(key, "\t", get(key, envir = env), "\n", sep = "")
}
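
The same summation can be written as a plain function, which is easier to try out interactively. This sketch swaps the environment-based loop for `tapply` and reads all lines into memory at once, so it is an alternative for small inputs, not the reducer shown above. The name `reduce_counts` is made up.

```r
# Sum the values for each key in a vector of "key<TAB>value" lines.
reduce_counts <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(fields, function(f) f[1], character(1))
  values <- vapply(fields, function(f) as.integer(f[2]), integer(1))
  totals <- tapply(values, keys, sum)
  # One "key<TAB>total" string per key, in sorted key order.
  sprintf("%s\t%d", names(totals), as.integer(totals))
}

# Hypothetical mapper output for a handful of records.
mapped  <- c("blended\t3", "single malt\t10", "blended\t6", "blended\t6")
reduced <- reduce_counts(mapped)
writeLines(reduced)
```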


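Counting sales per day

Mapper for sales per day

#!/usr/bin/env Rscript

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- unlist(strsplit(line, "\t"))
  # Key: date (column 1); value: number of orders (column 4)
  date <- line[1]
  order <- line[4]
  cat(sprintf("%s\t%s\n", date, order), sep = "")
}
close(con)

Results

$ cat output/part-00000
2010/07/01 25
2010/07/02 42
2010/07/03 39
(omitted)
2010/07/29 17
2010/07/30 45
2010/07/31 47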


Counting sales per brand

Mapper for sales per brand

#!/usr/bin/env Rscript

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- unlist(strsplit(line, "\t"))
  # Key: brand (column 2); value: number of orders (column 4)
  brand <- line[2]
  order <- line[4]
  cat(sprintf("%s\t%s\n", brand, order), sep = "")
}
close(con)

Results

$ cat output/part-00000
Ardbeg 10 Years Old 166
Ballantine 12 Years Old 142
Ballantine 17 Years Old 150
Bowmore 12 Years Old 149
Johnnie Walker Gold Label 18 Years Old 176
Johnnie Walker Swing 124
Talisker 10 Years Old 176
The Glenlivet 12 Year Old 164
The Macallan 12 Years 128

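Counting sales by type

Mapper for sales by type

#!/usr/bin/env Rscript

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- unlist(strsplit(line, "\t"))
  # Key: type (column 3); value: number of orders (column 4)
  type <- line[3]
  order <- line[4]
  cat(sprintf("%s\t%s\n", type, order), sep = "")
}
close(con)

Results

$ cat output/part-00000
blended 592
single malt 783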
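
The mappers for day, brand, and type differ only in which column supplies the key. A possible generalization, written as a function for local experimentation. `emit_pairs`, `key_col`, and `value_col` are hypothetical names, and the column order is assumed to match the sample table (date, brand, type, orders).

```r
# Turn tab-separated records into "key<TAB>orders" lines.
# key_col selects the key: 1 = date, 2 = brand, 3 = type.
emit_pairs <- function(lines, key_col, value_col = 4) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  vapply(fields,
         function(f) sprintf("%s\t%s", f[key_col], f[value_col]),
         character(1))
}

# Two rows from the sample data, as tab-separated text.
rows <- c(
  "2010/07/01\tThe Macallan 12 Years\tsingle malt\t10",
  "2010/07/01\tBallantine 12 Years Old\tblended\t3"
)

by_brand <- emit_pairs(rows, key_col = 2)
by_type  <- emit_pairs(rows, key_col = 3)
writeLines(by_type)
```

With this shape, the same counting reducer can serve all three analyses unchanged, since it only sees key/value lines.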


Summary

MapReduce: a programming model for processing large-scale data efficiently
Hadoop: an open-source implementation of Google File System and MapReduce
Using Hadoop Streaming, MapReduce jobs can be run in R




Thank you for your attention.


References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: Sixth Symposium on Operating System Design and Implementation, 2004.

[2] Tom White. Hadoop. O'Reilly Japan.

