Data Intensive Text Processing with MapReduce - #3 MapReduce Algorithm Design -


@just_do_neet



#3 MapReduce Algorithm Design

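•MapReduce is simple and scalable: a job consists of just a Mapper and a Reducer.

•That simplicity imposes strong constraints, so only a limited set of techniques is available.

•Within those constraints, this chapter introduces design-pattern-like idioms and problem-solving techniques for MapReduce.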



#3 MapReduce Algorithm Design

•Local aggregation

•Pairs and stripes

•Computing relative frequencies

•Secondary sort

•Relational joins



Local Aggregation


Local Aggregation

•Hadoop writes intermediate data to disk when handing it from Map to Reduce.

•This overhead is large.

•Reducing the amount of intermediate data improves processing efficiency.



Local Aggregation

•Problem setting: extract the most frequent words from Masashi Sada's song lyrics.

•Data source: http://www.cai-insect.jp/sada/




Local Aggregation

•Standard MapReduce processing: emit every time a word occurs in a document.

•https://gist.github.com/3475182

•https://gist.github.com/3475195
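The baseline described above can be sketched in a few lines of Python. This is not the code from the gists; it is a minimal in-process simulation in which `run_word_count` stands in for Hadoop's shuffle, the sample documents are hypothetical, and the text is assumed to be already tokenized with spaces (real Japanese lyrics would need a morphological analyzer first).

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Baseline mapper: emit (word, 1) for every single occurrence."""
    for word in text.split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reducer: sum all the 1s that arrived for this word."""
    yield word, sum(counts)

def run_word_count(docs):
    """Hypothetical stand-in for the shuffle: group by key, then reduce."""
    groups = defaultdict(list)
    emitted = 0  # number of intermediate records passed from Map to Reduce
    for doc_id, text in docs.items():
        for word, one in map_word_count(doc_id, text):
            groups[word].append(one)
            emitted += 1
    totals = {}
    for word in sorted(groups):
        for key, total in reduce_word_count(word, groups[word]):
            totals[key] = total
    return totals, emitted

# Hypothetical, pre-tokenized sample input.
sample_docs = {"doc1": "rain sea rain", "doc2": "sea wind"}
totals, emitted = run_word_count(sample_docs)
```

Here every token becomes one intermediate record, which is the cost that the following slides try to reduce.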






Local Aggregation

•Tally word counts per document in an associative array, then emit (in-mapper combining).

• https://gist.github.com/3475211
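As a rough illustration of the per-document tally (a sketch, not the gist's code), the mapper below counts words in a `Counter` and emits one record per distinct word instead of one per occurrence. The function name and the space-separated input are assumptions made for this example.

```python
from collections import Counter

def map_per_document(doc_id, text):
    """Tally counts within one document, then emit one record per distinct word."""
    counts = Counter(text.split())
    for word, count in counts.items():
        yield word, count

# Hypothetical input: 3 tokens, but only 2 records are emitted.
records = sorted(map_per_document("doc1", "rain sea rain"))
```

The reducer is unchanged: it still sums the values it receives for each word.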





Local Aggregation

•Keep the associative array as a member of the mapper class, and emit only after the counts for all documents have been tallied.
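A minimal sketch of this variant, assuming a mapper object whose `map` method is called once per document and whose `close` method is called once at the end of the task (the class and method names are illustrative, loosely modeled on Hadoop's per-task lifecycle):

```python
from collections import Counter

class InMapperCombiningMapper:
    """Holds counts across all map() calls and emits them once at the end."""

    def __init__(self):
        self.counts = Counter()  # state shared by every document in this task

    def map(self, doc_id, text):
        # Nothing is emitted here; the counts only accumulate.
        self.counts.update(text.split())

    def close(self):
        # Called once after the last document: emit one record per distinct word.
        yield from sorted(self.counts.items())

mapper = InMapperCombiningMapper()
mapper.map("doc1", "rain sea rain")  # hypothetical documents
mapper.map("doc2", "sea wind")
records = list(mapper.close())
```

Five tokens across two documents produce only three intermediate records, at the price of keeping every distinct word in memory until the task ends.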






Local Aggregation

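•Advantages of in-mapper combining

•Fewer records are passed from Map to Reduce, so performance can be expected to improve.

•Disadvantages

•Watch out for memory exhaustion in the Map task.

•Depending on how the data is distributed, it may not help much.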



Local Aggregation

•A naive improvement to in-mapper combining (memory usage)

•https://gist.github.com/3475348

•Flush the contents of the map (the associative array) periodically.
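One way to sketch the periodic flush, under the assumption that "periodically" means "whenever the number of distinct keys reaches a threshold" (the gist may use a different trigger; `max_keys` and the class name are hypothetical):

```python
from collections import Counter

class FlushingMapper:
    """In-mapper combining with a bound on the number of buffered keys."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys
        self.counts = Counter()

    def map(self, doc_id, text):
        for word in text.split():
            self.counts[word] += 1
            if len(self.counts) >= self.max_keys:
                # Buffer is full: emit partial counts and start over.
                yield from self.flush()

    def flush(self):
        partial = list(self.counts.items())
        self.counts.clear()
        return partial

    def close(self):
        # Emit whatever is still buffered at the end of the task.
        return self.flush()

mapper = FlushingMapper(max_keys=2)
emitted = list(mapper.map("doc1", "a b a c")) + mapper.close()
totals = Counter()
for word, count in emitted:
    totals[word] += count  # what the reducer would compute
```

A word can now be emitted more than once with partial counts, so the reducer must still sum them; memory use is bounded in exchange for somewhat more intermediate data.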





Pairs and Stripes


pairs and stripes

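•A technique for aggregating over composite keys

•Example: computing word co-occurrence frequencies in text

•Co-occurrence: when a word appears in a text, a limited set of other words appears frequently in the same text. (Wikipedia)

•「私はさだまさしが好きです。」 ("I like Masashi Sada.") → 「私：さだまさし」, 「私：好き」, ...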



pairs and stripes

•The amount of data produced by co-occurrence extraction is basically O(n^2).

•「私はさだまさしが好きです。」 → 「私：は」, 「私：さだまさし」, 「私：が」, ..., 「好き：です」


Word          Co-occurring words
私            は  さだまさし  が  好き  です
は            (私)  さだまさし  が  好き  です
さだまさし    (私)  (は)  が  好き  です
が            (私)  (は)  (さだまさし)  好き  です
好き          (私)  (は)  (さだまさし)  (が)  です
です          (私)  (は)  (さだまさし)  (が)  (好き)


pairs and stripes

•Problem setting: extract frequently co-occurring words from Masashi Sada's song lyrics.

•Data source: http://www.cai-insect.jp/sada/





pairs and stripes

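•pairs: for each word w, take each co-occurring word u, form the composite key (w, u), and emit the composite key together with its occurrence count.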
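A minimal sketch of the pairs approach, assuming the input is already tokenized and that, as in the example sentence used in these slides, each word w is paired only with the words u that follow it in the same sentence. The function names are illustrative.

```python
def map_pairs(doc_id, tokens):
    """Emit ((w, u), 1) for every word u that follows w in the sentence."""
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:]:
            yield (w, u), 1

def reduce_pairs(pair, counts):
    """Sum the counts that arrived for one composite key (w, u)."""
    yield pair, sum(counts)

# The example sentence, assumed to be tokenized beforehand.
tokens = ["私", "は", "さだまさし", "が", "好き", "です"]
pairs = list(map_pairs("doc1", tokens))
```

For this six-token sentence the mapper emits 15 records, one per (w, u) combination.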








pairs and stripes

•stripes: use the first word w of the window as the key; keep the count of each co-occurring word u in a hash and emit that hash as the value.
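The stripes approach can be sketched as follows, under the same assumptions as for pairs (pre-tokenized input, the window is the rest of the sentence after w). A `Counter` plays the role of the hash, and the names are illustrative.

```python
from collections import Counter

def map_stripes(doc_id, tokens):
    """Emit (w, {u: count}) with one stripe per word w."""
    for i, w in enumerate(tokens):
        stripe = Counter(tokens[i + 1:])
        if stripe:  # the last word has no following words, so nothing to emit
            yield w, stripe

def reduce_stripes(w, stripes):
    """Merge all stripes for w by adding them element-wise."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    yield w, total

tokens = ["私", "は", "さだまさし", "が", "好き", "です"]
stripes = list(map_stripes("doc1", tokens))
```

The same sentence now yields 5 records instead of 15, but each record is larger and the whole stripe for a word must fit in memory.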








pairs and stripes

•「私はさだまさしが好きです」

•pairs

• {(私, は): 1}, {(私, さだまさし): 1}, {(私, が): 1}, {(私, 好き): 1}, {(私, です): 1}, {(は, さだまさし): 1}, {(は, が): 1}, ...

•stripes

• {私: {は: 1, さだまさし: 1, が: 1, 好き: 1, です: 1}}, {は: {さだまさし: 1, が: 1, ...}}

•Number of records emitted from Map to Reduce: pairs > stripes



pairs and stripes

•Frequencies of co-occurring words



Relative Frequency


Computing Relative Freq.

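•Besides the raw count of u co-occurring with a word w, we sometimes want the relative frequency (a conditional probability?).

•That requires the occurrence count of w (the term at the lower right of the formula, i.e. the denominator).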
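The formula on this slide was an image and is not preserved in the text. Assuming it is the usual definition of relative frequency for this problem, where N(w, u) denotes the number of times u co-occurs with w (the symbols f and N are notation chosen here), it can be written as:

```latex
f(u \mid w) = \frac{N(w, u)}{\sum_{u'} N(w, u')}
```

The denominator sums the co-occurrence counts of w over every co-occurring word u', which is the quantity the second bullet says must be computed.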



Computing Relative Freq.

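•stripes: https://gist.github.com/3475934

•For a word w, every co-occurring word u and its count reach the same Reducer, so summing those counts gives the denominator.

•pairs: https://gist.github.com/3475992

•Not possible as is. The Partitioner must be modified so that all keys whose first element is w are sent to the same Reducer.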
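Both variants can be sketched in Python. This is not the code in the gists: the function names are hypothetical, `zlib.crc32` is used only as a convenient deterministic hash, and the pairs reducer below simply buffers the counts for one w in memory, which is an assumption of this sketch rather than something the slide specifies.

```python
import zlib
from collections import Counter
from itertools import groupby

def reduce_relative_stripes(w, stripes):
    """Stripes: the merged stripe already holds every count needed for w."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    marginal = sum(total.values())  # denominator: sum over all u of N(w, u)
    yield w, {u: count / marginal for u, count in total.items()}

def partition_by_left_word(key, num_reducers):
    """Pairs: route on w only, so every (w, *) key lands on the same reducer."""
    w, _u = key
    return zlib.crc32(w.encode("utf-8")) % num_reducers

def reduce_relative_pairs(sorted_pairs):
    """Pairs: process one reducer's input, sorted by key, one w at a time."""
    for w, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        counts = Counter()
        for (_w, u), count in group:
            counts[u] += count
        marginal = sum(counts.values())
        for u, count in counts.items():
            yield (w, u), count / marginal

stripe_result = list(reduce_relative_stripes("a", [Counter({"b": 1, "c": 3})]))
pair_result = list(reduce_relative_pairs([(("a", "b"), 1), (("a", "c"), 3)]))
```

Without the custom partitioner, (a, b) and (a, c) could be sent to different reducers and neither would see the full denominator for a.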







Secondary Sort


Secondary Sort

•We want to sort by Value as well as by Key.

1. Sort inside Reduce.

2. Include the Value to be sorted in the Key when passing data from Map to Reduce (value-to-key conversion).




Secondary Sort

•Problem setting: analyze a list of Masashi Sada's concert venues.

•Sort 1: concert venue

•Sort 2: year of the concert




Secondary Sort

•Sort inside Reduce

• Map→Reduce
{"東京厚生年金会館" : "2000\t1"}
{"東京厚生年金会館" : "2000\t1"}
{"東京厚生年金会館" : "2001\t1"}

• Reduce→Result
{"東京厚生年金会館" : "2000\t2"}
{"東京厚生年金会館" : "2001\t1"} ← sorted by year inside Reduce
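A minimal sketch of a reducer that does this sorting itself, assuming the values arrive as "year<TAB>count" strings as in the example (the function name is illustrative):

```python
from collections import Counter

def reduce_sort_in_reducer(venue, values):
    """Buffer all values for one venue, then emit them ordered by year."""
    per_year = Counter()
    for value in values:
        year, count = value.split("\t")
        per_year[int(year)] += int(count)
    for year in sorted(per_year):  # the secondary sort happens here, in memory
        yield venue, f"{year}\t{per_year[year]}"

# Values may arrive in any order; the reducer puts them in year order.
result = list(reduce_sort_in_reducer("東京厚生年金会館", ["2001\t1", "2000\t1", "2000\t1"]))
```

This is simple, but every value for a key must be held in memory inside the reducer before anything is emitted.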








Secondary Sort

•value-to-key conversion

• Map→Reduce
{"東京厚生年金会館\t2000" : 1}
{"東京厚生年金会館\t2000" : 1}
{"東京厚生年金会館\t2001" : 1} ← the year is included in the Key

• Reduce→Result
{"東京厚生年金会館\t2000" : 2}
{"東京厚生年金会館\t2001" : 1}
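Value-to-key conversion can be sketched with a small in-process simulation. The assumptions here are a composite key of (venue, year), a hypothetical partitioner that looks only at the venue, and a `run_job` helper that stands in for the framework's partition, sort, and group steps.

```python
import zlib
from itertools import groupby

def map_value_to_key(record):
    """Move the year from the value into the key."""
    venue, year = record
    yield (venue, year), 1

def partition_by_venue(key, num_reducers):
    """Partition on the venue only, so one venue stays on one reducer."""
    venue, _year = key
    return zlib.crc32(venue.encode("utf-8")) % num_reducers

def run_job(records, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for record in records:
        for key, value in map_value_to_key(record):
            buckets[partition_by_venue(key, num_reducers)].append((key, value))
    output = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # the framework sorts by (venue, year)
        for key, group in groupby(bucket, key=lambda kv: kv[0]):
            output.append((key, sum(value for _key, value in group)))
    return output

records = [("東京厚生年金会館", 2001), ("東京厚生年金会館", 2000), ("東京厚生年金会館", 2000)]
result = run_job(records)
```

The ordering by year now comes from the framework's sort on the composite key, so the reducer does not need to buffer and sort values itself.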






Relational Joins


Relational Join

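•Only the techniques are introduced here.

•Reduce Side Join → join on the Reduce side

•Map Side Join → join on the Map side

•Memory-Backed Join → keep one data set in memory, either in the Mapper or in an external store (e.g. memcached), and join against it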
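Two of these techniques can be sketched on a toy data set. The employee and department records, the field layout, and the function names are all hypothetical; the reduce-side version simulates the shuffle with a dictionary, and the memory-backed version uses a plain `dict` where a real job might load a file into the Mapper or query memcached.

```python
from collections import defaultdict

def reduce_side_join(employees, departments):
    """Tag each record with its source, group by join key, join in the reducer."""
    groups = defaultdict(list)
    # Map phase: key every record by dept_id and tag where it came from.
    for dept_id, dept_name in departments:
        groups[dept_id].append(("dept", dept_name))
    for emp_name, dept_id in employees:
        groups[dept_id].append(("emp", emp_name))
    # Reduce phase: within one dept_id, combine the two sides.
    joined = []
    for dept_id in sorted(groups):
        dept_names = [value for tag, value in groups[dept_id] if tag == "dept"]
        emp_names = [value for tag, value in groups[dept_id] if tag == "emp"]
        for dept_name in dept_names:
            for emp_name in emp_names:
                joined.append((emp_name, dept_id, dept_name))
    return joined

def memory_backed_join(employees, departments):
    """Hold the smaller data set in memory and look it up while mapping."""
    lookup = dict(departments)
    return [(emp_name, dept_id, lookup[dept_id])
            for emp_name, dept_id in employees
            if dept_id in lookup]

employees = [("alice", 10), ("bob", 20), ("carol", 10)]  # hypothetical rows
departments = [(10, "sales"), (20, "dev")]
by_reduce = reduce_side_join(employees, departments)
by_memory = memory_backed_join(employees, departments)
```

Both produce the same joined rows; the memory-backed version avoids the shuffle entirely but only works when one side is small enough to hold in memory.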




Relational Join

•Reduce Side Join

•Reference: http://code.google.com/p/try-hadoop-mapreduce-java/source/browse/trunk/try-mapreduce/src/main/java/jp/gr/java_conf/n3104/try_mapreduce/JoinWithDeptNameUsingReduceSideJoin.java

•Map Side Join

•Reference: http://code.google.com/p/try-hadoop-mapreduce-java/source/browse/trunk/try-mapreduce/src/main/java/jp/gr/java_conf/n3104/try_mapreduce/JoinWithDeptNameUsingReduceSideJoin.java















Relational Join

•Memory-Backed Join

•Reference: http://d.hatena.ne.jp/wyukawa/20110818/1313670105





Bibliography

•http://www.slideshare.net/nokuno/hadoopreading05-data-intensive3

•http://d.hatena.ne.jp/wyukawa/20111002/1317550750

References (other than the book)









Thank you for your attention.