PARALLELIZATION OF REGULAR EXPRESSION MATCHING AND ITS EVALUATION ON HADOOP

KIMINORI MATSUZAKI, KENTO EMOTO, YU LIU
IPSJ Transactions on Programming (情報処理学会論文誌 プログラミング), Vol.4 No.4 1-11 (Sep. 2011)


0. INTRODUCTION AND MOTIVATION


REGULAR EXPRESSION

LIST HOMOMORPHISM

HADOOP

FINITE AUTOMATON

PARALLELIZATION


DFA IS BETTER


PROCESSOR SCALABILITY

1. OPTIMIZATION OF REGULAR EXPRESSION MATCHING

REGULAR EXPRESSION: (H|h)adoo*p

Hadoop
hadoop
Hadooooop
hadop
Hadooop
hadoooooop
Hadop
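As a quick check of the pattern on this slide, the sketch below uses Python's standard `re` module to confirm that every listed spelling is in the language of `(H|h)adoo*p`. The helper name `matches` and the list `slide_words` are illustrative only and are not part of the original slides.

```python
import re

# The pattern from the slide: 'H' or 'h', then "ado", then zero or more 'o', then 'p'.
PATTERN = re.compile(r"(H|h)adoo*p")


def matches(word: str) -> bool:
    """Return True when the whole word (not just a substring) matches PATTERN."""
    return PATTERN.fullmatch(word) is not None


# The spellings shown on the slide.
slide_words = ["Hadoop", "hadoop", "Hadooooop", "hadop",
               "Hadooop", "hadoooooop", "Hadop"]
results = {word: matches(word) for word in slide_words}
```

`fullmatch` is used rather than `search` so that a word is accepted only when the entire string is matched, which is the notion of matching an automaton implements.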

ACHIEVED WITH REGULAR EXPRESSIONS

full-text search
search engine
XML processing
access log analysis
natural language processing
text replacement
network security
compiler front-end
URL router


FINITE AUTOMATON

NON-DETERMINISTIC FINITE AUTOMATON

[Figure: NFA state diagram with transitions labeled a and ε]

DETERMINISTIC FINITE AUTOMATON

[Figure: DFA state diagram with transitions labeled a, b, c, d and e]
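To make the idea of a deterministic automaton concrete, here is a minimal sketch of a sequential DFA run. The automaton is hand-built for the earlier example pattern `(H|h)adoo*p`; it is an assumption for illustration and is not the automaton drawn on the slide. The names `TRANSITIONS`, `step` and `dfa_accepts` are likewise hypothetical.

```python
# Hand-built DFA for (H|h)adoo*p (an illustrative assumption, not the slide's diagram).
# States: 0 start, 1 after H/h, 2 after a, 3 after d, 4 after one or more o, 5 after p.
DEAD = -1                      # sink state for every missing transition
START, ACCEPTING = 0, {5}
TRANSITIONS = {
    (0, "H"): 1, (0, "h"): 1,
    (1, "a"): 2,
    (2, "d"): 3,
    (3, "o"): 4,
    (4, "o"): 4,
    (4, "p"): 5,
}


def step(state: int, ch: str) -> int:
    """Deterministic transition: exactly one next state per (state, character)."""
    return TRANSITIONS.get((state, ch), DEAD)


def dfa_accepts(text: str) -> bool:
    """Read the text left to right, one table lookup per character."""
    state = START
    for ch in text:
        state = step(state, ch)
    return state in ACCEPTING
```

Each character costs a single lookup, but the loop carries the state from one character to the next, which is the dependency a parallel version has to break.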


PARALLELISM

2. LIST HOMOMORPHISM


({[a],[b],[a, b],[b, c, d],[e, f],..}, ++)

({1,1,2,3,2,..}, +)

HOMOMORPHISM


[1, 2, 3] ++ [7, 8] = [1, 2, 3, 7, 8]

3 + 2 = 5

HOMOMORPHISM
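The two equations on this slide can be read as one instance of a homomorphism from lists under `++` to numbers under `+`, with list length as the mapping. The sketch below checks that reading in Python, where `+` on lists plays the role of `++`; the variable names are illustrative.

```python
left = [1, 2, 3]
right = [7, 8]

# List concatenation, written ++ on the slide.
joined = left + right

# Length maps concatenation to addition: len(x ++ y) = len(x) + len(y).
lengths_added = len(left) + len(right)

# The empty list (identity of ++) maps to 0 (identity of +).
identity_length = len([])
```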

DIVIDE AND CONQUER


LIST HOMOMORPHISM

[Figure: foldl over the input sequence B C D A BA ...]
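A left fold consumes the sequence strictly from left to right, whereas a list homomorphism may split the sequence anywhere and combine the partial results. The sketch below contrasts the two on a toy task (counting occurrences of `A`); the function names `foldl` and `hom` and the example task are assumptions made for illustration.

```python
from functools import reduce
from operator import add


def foldl(step, init, xs):
    """Sequential left fold: ((init `step` x1) `step` x2) ..."""
    return reduce(step, xs, init)


def hom(f, combine, identity, xs):
    """Divide-and-conquer evaluation of a list homomorphism h, where
    h([]) = identity, h([x]) = f(x), h(a ++ b) = combine(h(a), h(b)).
    combine must be associative with identity as its unit."""
    if len(xs) == 0:
        return identity
    if len(xs) == 1:
        return f(xs[0])
    mid = len(xs) // 2
    # The two halves are independent, so they could be evaluated in parallel.
    return combine(hom(f, combine, identity, xs[:mid]),
                   hom(f, combine, identity, xs[mid:]))


sequence = "BCDABA"
count_by_fold = foldl(lambda acc, ch: acc + (ch == "A"), 0, sequence)
count_by_hom = hom(lambda ch: int(ch == "A"), add, 0, sequence)
```

Both computations give the same answer; only the second leaves the order of evaluation free, which is what makes it a candidate for parallel execution.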

O(n/p + log p), where n is the length of the input string and p is the number of compute nodes.

DFA: O((n/p + log p) |Q_D|), where n is the length of the input string, p is the number of compute nodes, and |Q_D| is the number of DFA states.
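One way to see where the |Q_D| factor can come from: a chunk in the middle of the input does not know its starting state, so it is run from every state and summarized as a table of |Q_D| entries. The sketch below follows that reading with a hand-built DFA for `(H|h)adoo*p`; the DFA, the names `summarize`, `compose` and `parallel_accepts`, and the chunking scheme are all assumptions, not the authors' implementation.

```python
from functools import reduce

# Hand-built DFA for (H|h)adoo*p: states 0..5 plus a dead state 6 (illustrative).
NUM_STATES, START, ACCEPTING, DEAD = 7, 0, {5}, 6
EDGES = {(0, "H"): 1, (0, "h"): 1, (1, "a"): 2, (2, "d"): 3,
         (3, "o"): 4, (4, "o"): 4, (4, "p"): 5}


def step(state, ch):
    return EDGES.get((state, ch), DEAD)


def summarize(chunk):
    """table[s] = state reached from s after reading chunk.
    Costs |Q_D| lookups per character instead of one."""
    table = list(range(NUM_STATES))
    for ch in chunk:
        table = [step(s, ch) for s in table]
    return tuple(table)


def compose(first, second):
    """Summary of 'first chunk then second chunk'; associative, O(|Q_D|)."""
    return tuple(second[s] for s in first)


def parallel_accepts(text, pieces):
    size = max(1, -(-len(text) // pieces))            # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    summaries = [summarize(c) for c in chunks]        # independent per chunk
    identity = tuple(range(NUM_STATES))
    # A sequential reduce is used here for brevity; a tree-shaped reduction
    # over p summaries is what would give the log p term.
    total = reduce(compose, summaries, identity)
    return total[START] in ACCEPTING
```

The list comprehension over `chunks` stands in for work that would be distributed across nodes; the result does not depend on how many pieces the input is cut into.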

NFA: O((n/p + log p) |Q_N|^3), where n is the length of the input string, p is the number of compute nodes, and |Q_N| is the number of NFA states.
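A cubic factor arises naturally if each chunk of input is summarized as a |Q_N| × |Q_N| boolean matrix and summaries are combined by boolean matrix multiplication. The slides do not spell out the representation, so the sketch below is one plausible reading. The three-state NFA (strings over {a, b} ending in "ab", with no ε-moves) and all function names are hypothetical.

```python
from functools import reduce

# Hypothetical NFA without ε-moves: accepts strings over {a, b} that end in "ab".
N, START, ACCEPTING = 3, 0, {2}
DELTA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

IDENTITY = [[i == j for j in range(N)] for i in range(N)]


def char_matrix(ch):
    """M[s][t] is True when the NFA can move from s to t on character ch."""
    return [[t in DELTA.get((s, ch), set()) for t in range(N)] for s in range(N)]


def multiply(a, b):
    """Boolean matrix product: |Q_N|^3 elementary steps, and associative."""
    return [[any(a[i][k] and b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def summarize(chunk):
    """Reachability matrix of a whole chunk, one product per character."""
    return reduce(multiply, (char_matrix(ch) for ch in chunk), IDENTITY)


def nfa_accepts(text, pieces):
    size = max(1, -(-len(text) // pieces))            # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    total = reduce(multiply, (summarize(c) for c in chunks), IDENTITY)
    return any(total[START][q] for q in ACCEPTING)
```

Compared with the DFA case, the summary grows from a table of |Q_D| entries to a matrix of |Q_N|^2 entries, and combining two summaries costs |Q_N|^3 rather than |Q_D|.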

3. EVALUATION ON HADOOP

MAP REDUCE

[Figure: INPUT is split across four MAPPERs; one REDUCER combines their results into OUTPUT]
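The mapper/reducer structure on this slide can be imitated in plain Python to show one detail that matters for automaton matching: partial results must be combined in input order, so each mapper output carries its offset. This is a simulation only and does not use the Hadoop API; the two-state DFA (accepting strings with an even number of `a`) and the names `mapper`, `reducer` and `run_job` are assumptions for illustration.

```python
def mapper(offset, chunk):
    """Emit (offset, table), where table[s] is the state reached from s
    after the chunk. States: 0 = even number of 'a' so far, 1 = odd."""
    table = [0, 1]
    for ch in chunk:
        if ch == "a":
            table = [1 - s for s in table]   # 'a' toggles the parity state
    return offset, tuple(table)


def reducer(pairs):
    """Sort partial results by offset, then apply them left to right."""
    state = 0                                # start state
    for _, table in sorted(pairs):
        state = table[state]
    return state == 0                        # state 0 is accepting


def run_job(text, num_mappers):
    size = max(1, -(-len(text) // num_mappers))      # ceiling division
    splits = [(i, text[i:i + size]) for i in range(0, len(text), size)]
    emitted = [mapper(offset, chunk) for offset, chunk in splits]
    emitted.reverse()        # mimic mapper outputs arriving out of order
    return reducer(emitted)
```

Sorting by offset in the reducer is the design choice that keeps the result correct when mapper outputs arrive in arbitrary order, since composing chunk summaries is associative but not commutative in general.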

[Figure: SMALL REGULAR EXPRESSION. Execution time (0s to 500s) against number of nodes (0 to 40), for DFA and NFA.]

[Figure: LARGE REGULAR EXPRESSION. Execution time (0s to 7000s) against number of nodes (0 to 40), for DFA and NFA.]

[Figure: LINEAR. DFA execution time (0s to 300s) against number of states (0 to 6000).]

[Figure: CUBIC. NFA execution time (0s to 4000s) against number of states (0 to 40).]

4. RELEVANT STUDIES

MAXIMUM MARKING PROBLEMS

Matsuzaki, K., Hu, Z. and Takeichi, M.: Derivation of Parallel Programs for Solving Maximum Marking Problems on Lists (リスト上の最大マーク付け問題を解く並列プログラムの導出), IPSJ Transactions on Programming, Vol.49, No.SIG 3 (PRO 36), pp.16-27 (2008).

TREE HOMOMORPHISM

Skillicorn, D.B.: Structured Parallel Computation in Structured Documents, Journal of Universal Computer Science, Vol.3, No.1, pp.42-68 (1997).

Nomura, Y., Emoto, K., Matsuzaki, K., Hu, Z. and Takeichi, M.: Parallelization of XPath Queries Using Tree Skeletons and Its Evaluation (木スケルトンによるXPathクエリの並列化とその評価), Computer Software (コンピュータソフトウェア), Vol.24, No.3, pp.51-62 (2007).

GPGPU BASED

Naghmouchi, J., Scarpazza, D.P. and Berekovic, M.: Small-ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization, Proc. 24th International Conference on Supercomputing, 2010, Tsukuba, Ibaraki, Japan, June 2-4, 2010, Boku, T., Nakashima, H. and Mendelson, A. (Eds.), pp.337-348, ACM (2010).