PARALLEL SORTED NEIGHBORHOOD BLOCKING WITH MAPREDUCE
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig, http://dbs.uni-leipzig.de
Kaiserslautern, BTW 2011


ENTITY RESOLUTION

• Detection of entities in one or more sources that refer to the same real-world object

ENTITY RESOLUTION (2)

• Runtime-intensive task: O(n²) entity comparisons

• Blocking:
  • Semantic grouping of similar entities in blocks
  • Based on blocking keys derived from the entities' attributes
  • Restrict entity comparisons to entities from the same block

• Parallelization
  • MapReduce
  • Exploitation of cloud infrastructures
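To make the effect of blocking concrete, here is a minimal sketch that groups a handful of records by a blocking key and compares only entities within the same block. The toy records and the key function `first_two_letters` are assumptions chosen for illustration; they are not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(entities, blocking_key):
    """Group entities by blocking key and return the candidate pairs per block."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    pairs = []
    for key in sorted(blocks):
        # Comparisons are restricted to entities from the same block.
        pairs.extend(combinations(blocks[key], 2))
    return pairs


# Hypothetical toy records: (id, title).
records = [
    (1, "mapreduce basics"),
    (2, "mapreduce blocking"),
    (3, "entity resolution"),
    (4, "entity matching"),
    (5, "sorting"),
]


def first_two_letters(record):
    """Hypothetical blocking key: the first two letters of the title."""
    return record[1][:2]


candidates = blocking_pairs(records, first_two_letters)
all_pairs = list(combinations(records, 2))
print(len(candidates), "candidate pairs instead of", len(all_pairs))
```

With these five records only the two pairs that share a key are compared, rather than all ten pairs.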


SORTED NEIGHBORHOOD - RUNNING EXAMPLE (w=3)

Input S: a, b, c, d, e, f, g, h, i

Key Generation + Sort by Key (K = blocking key, S = entity):
K S
1 a
1 d
2 b
2 e
2 f
2 h
3 c
3 g
3 i

Sliding Window (w=3), compared pairs:
a-d, a-b, d-b
d-e, b-e
b-f, e-f
e-h, f-h
f-c, h-c
h-g, c-g
c-i, g-i

• Determine blocking key for each entity and sort entities by blocking key
• Move window of fixed size w over sorted records and compare all entities within window
• All entities within a distance of w-1 are compared
• Complexity: O(n²) → O(n) + O(n·log n) + O(n·w)
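The running example can be reproduced with a short sequential sketch. The dictionary `KEYS` encodes the blocking keys of the example; the function name `sorted_neighborhood` and the buffer-based window are illustrative choices, not the authors' code.

```python
# Blocking keys of the running example (entity -> key).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}


def sorted_neighborhood(entities, key, w):
    """Sort by blocking key, then pair each entity with its w-1 predecessors."""
    ordered = sorted(entities, key=key)  # stable sort keeps input order for ties
    pairs = []
    buffer = []  # holds at most w-1 preceding entities
    for entity in ordered:
        pairs.extend((previous, entity) for previous in buffer)
        buffer.append(entity)
        if len(buffer) == w:
            buffer.pop(0)
    return pairs


pairs = sorted_neighborhood("abcdefghi", KEYS.get, w=3)
labels = [f"{x}-{y}" for x, y in pairs]
print(labels)
```

For n=9 and w=3 this yields 15 comparisons, compared with 36 for all pairs.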


OUTLINE

• Motivation

• Sorted Neighborhood and SN with MapReduce
  • Challenge 1: Sorted Reduce Partitions (SRP)
  • Challenge 2: Comparison of Boundary Entities (JobSN/RepSN)

• Experimental Results

• Conclusions & Future Work


MAPREDUCE

• Computation expressed by two UDFs
  • Contain sequential code
  • Executed in parallel among multiple nodes
  • map: (keyin, valuein) → list(keytmp, valuetmp)
  • reduce: (keytmp, list(valuetmp)) → list(keyout, valueout)

• Computation relies on data partitioning and redistribution
  • Number of map tasks m and reduce tasks r
  • Each task is executed by some idle node in the cluster
  • The UDF part partitions the map output and distributes it to the r reduce tasks
  • Sorting of key-value pairs
  • Grouping of key-value pairs by key and invocation of reduce for each group
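The data flow described above can be imitated in a few lines. This is a sequential, in-memory sketch under stated assumptions: `run_mapreduce` is a hypothetical helper (not the Hadoop API), tasks run one after another instead of in parallel, and the input is cut into contiguous splits.

```python
from itertools import groupby
from math import ceil


def run_mapreduce(records, map_fn, part_fn, reduce_fn, m, r):
    """Simulate map, partition, sort, group and reduce on a list of (key, value) records."""
    size = max(1, ceil(len(records) / m))
    splits = [records[i:i + size] for i in range(0, len(records), size)]
    reduce_inputs = [[] for _ in range(r)]
    for split in splits:  # one map task per input split
        for key_in, value_in in split:
            for key_tmp, value_tmp in map_fn(key_in, value_in):
                # part decides which reduce task receives the pair
                reduce_inputs[part_fn(key_tmp, r)].append((key_tmp, value_tmp))
    output = []
    for pairs in reduce_inputs:  # one reduce task per partition
        pairs.sort(key=lambda kv: kv[0])
        for key_tmp, group in groupby(pairs, key=lambda kv: kv[0]):
            output.extend(reduce_fn(key_tmp, [value for _, value in group]))
    return output


# Illustrative job: group entities by blocking key (keys are example values).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
records = list(enumerate("abcdefghi"))

blocks = run_mapreduce(
    records,
    map_fn=lambda _, entity: [(KEYS[entity], entity)],
    part_fn=lambda key, r: key % r,
    reduce_fn=lambda key, values: [(key, "".join(values))],
    m=3,
    r=2,
)
print(blocks)
```

Each reduce call sees all values of one key, which is the property that blocking-based matching relies on.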


ENTITY RESOLUTION WITH MAPREDUCE (m=3, r=2)

[Figure: data flow of the running example; K = blocking key, S = entity, M = match result]
Input Split: S = a, b, c, d, e, f, g, h, i
  map1: a, b, c
  map2: d, e, f
  map3: g, h, i
Map Step: Blocking (map output as K S)
  map1: 1 a, 2 b, 3 c
  map2: 1 d, 2 e, 2 f
  map3: 3 g, 2 h, 3 i
Partitioning “key modulo r”
  reduce1: 1 a, 1 d, 3 c, 3 g, 3 i
  reduce2: 2 b, 2 e, 2 f, 2 h
Reduce Step: Matching
  reduce1: M = a-d, c-i
  reduce2: M = b-f, e-h
Output Merge: M = a-d, c-i, b-f, e-h

• Map phase
  • Input data partitioned in m partitions
  • Each processed by one map task that calls map for each input record (“blocking”)
  • The UDF part partitions the map output and distributes it to the r reduce tasks

• Reduce phase
  • Sorting of key-value pairs by key
  • Grouping of key-value pairs by key
  • Invocation of reduce for each group (“matching”)

• Challenge 1: SN requires totally sorted list of entities
  • All entities assigned to reduce task Ri have smaller blocking key than all entities assigned to reduce task Ri+1
  • “Sorted reduce partitions” (SRP)
  • Must be ensured by part → range partitioning
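The SRP requirement can be checked mechanically. The sketch below compares "key modulo r" with a simple range partitioning on the keys of the running example. The helper names and the split point `[2]` (keys up to 2 go to the first reduce task) are assumptions made for illustration.

```python
from bisect import bisect_left


def partition_keys(keys, part_fn, r):
    """Distribute blocking keys to r reduce partitions using part_fn (0-based index)."""
    partitions = [[] for _ in range(r)]
    for key in keys:
        partitions[part_fn(key)].append(key)
    return partitions


def is_srp(partitions):
    """True if every key in partition i is smaller than every key in partition i+1."""
    non_empty = [p for p in partitions if p]
    return all(max(p) < min(q) for p, q in zip(non_empty, non_empty[1:]))


keys = [1, 2, 3, 1, 2, 2, 3, 2, 3]  # keys of entities a..i in input order
r = 2
split_points = [2]  # assumed range boundary: keys <= 2 -> partition 0

modulo = partition_keys(keys, lambda k: k % r, r)
ranged = partition_keys(keys, lambda k: bisect_left(split_points, k), r)

print("modulo:", modulo, "SRP:", is_srp(modulo))
print("range: ", ranged, "SRP:", is_srp(ranged))
```

Modulo partitioning interleaves keys 1 and 3 with key 2, so concatenating the reduce outputs is not globally sorted; range partitioning keeps the order.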


SORTED NEIGHBORHOOD WITH MAPREDUCE – SRP

• map outputs composite key: partitionPrefix.blockKey
  • partitionPrefix(k) = 1 if k<=2, otherwise 2 (range partitioning)
• part(partitionPrefix.blockKey) = partitionPrefix
• Key-value pairs are sorted and grouped by composite key
• reduce:
    forEach(entity ∈ list(valuetmp))
      match(buffer, entity); // match all buffered entities with entity
      buffer.append(entity);
      if (buffer.size() == w) buffer.removeFirst();

[Figure: SRP on the running example; K = key, S = entity, B = compared pairs]
Key Generation + Partition Prefix (blocking key → composite key)
  map1: 1 a → 1.1 a, 2 b → 1.2 b, 3 c → 2.3 c
  map2: 1 d → 1.1 d, 2 e → 1.2 e, 2 f → 1.2 f
  map3: 3 g → 2.3 g, 2 h → 1.2 h, 3 i → 2.3 i
Partitioning by partition prefix
  reduce1: 1.1 a, 1.1 d, 1.2 b, 1.2 e, 1.2 f, 1.2 h
  reduce2: 2.3 c, 2.3 g, 2.3 i
Sliding Window (+ Matching)
  reduce1: B = a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h
  reduce2: B = c-g, c-i, g-i
  Not compared: f-c? h-c? h-g?

• Challenge 2: Boundary Entities
  • Comparison of entities that are assigned to different reduce tasks
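The following sketch simulates SRP on the running example and lists which pairs are lost compared with the sequential Sorted Neighborhood. The function names and the in-memory data flow are illustrative assumptions; only the rule `partitionPrefix(k) = 1 if k<=2, otherwise 2` and the composite key come from the slide.

```python
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
W = 3


def partition_prefix(key):
    """Range partitioning of the running example."""
    return 1 if key <= 2 else 2


def window_pairs(ordered, w):
    """Sliding window over an already sorted list, as in the reduce function."""
    pairs, buffer = [], []
    for entity in ordered:
        pairs.extend(f"{previous}-{entity}" for previous in buffer)
        buffer.append(entity)
        if len(buffer) == w:
            buffer.pop(0)
    return pairs


def srp(entities, w, r=2):
    """Map to composite keys, partition by prefix, then window within each reduce task."""
    reduce_inputs = {p: [] for p in range(1, r + 1)}
    for entity in entities:
        composite = (partition_prefix(KEYS[entity]), KEYS[entity])
        reduce_inputs[composite[0]].append((composite, entity))
    pairs = []
    for prefix in sorted(reduce_inputs):
        ordered = [e for _, e in sorted(reduce_inputs[prefix], key=lambda kv: kv[0])]
        pairs.extend(window_pairs(ordered, w))
    return pairs


parallel = srp("abcdefghi", W)
sequential = window_pairs(sorted("abcdefghi", key=KEYS.get), W)
missing = [pair for pair in sequential if pair not in parallel]
print(missing)
```

The three missing pairs are exactly those that span the boundary between the two reduce partitions.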


SORTED NEIGHBORHOOD WITH MAPREDUCE – JOBSN

• SN realization using two consecutive jobs
• Job1:
  • SRP + additional output of boundary entities
  • Keys of the additionally output entities are prefixed with an additional boundary component
• Job2:
  • SN for boundary entities
  • part(boundary.partitionIndex.blockKey) = boundary % r
  • Sort and group by composite key

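[Figure: JobSN on the running example; K = key, S = entity, B = compared pairs]
Job1: Sliding Window (+ Matching) + Boundary Prefix
  reduce1 input: 1.1 a, 1.1 d, 1.2 b, 1.2 e, 1.2 f, 1.2 h
  reduce1 output: B = a-d ... f-h; boundary entities 1.2 f, 1.2 h → 1.1.2 f, 1.1.2 h
  reduce2 input: 2.3 c, 2.3 g, 2.3 i
  reduce2 output: B = c-g, c-i, g-i; boundary entities 2.3 c, 2.3 g → 1.2.3 c, 1.2.3 g
Job2: Identity map, partitioning by boundary prefix, Sliding Window (+ Matching)
  map1: 1.1.2 f, 1.1.2 h
  map2: 1.2.3 c, 1.2.3 g
  reduce1 input: 1.1.2 f, 1.1.2 h, 1.2.3 c, 1.2.3 g
  reduce1 output: B = f-c, h-c, h-g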
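A compact sketch of the JobSN idea, starting from the sorted reduce partitions of the running example. It rests on several assumptions: each reduce task of Job1 emits its first and last w-1 entities, w is at least 2, every partition holds at least w-1 entities, and Job2 skips pairs from the same partition because Job1 already compared them (this matches the three pairs shown for Job2). Function names are hypothetical.

```python
W = 3
# Sorted reduce partitions of Job1 in the running example: (entity, blocking key).
PARTITIONS = {
    1: [("a", 1), ("d", 1), ("b", 2), ("e", 2), ("f", 2), ("h", 2)],
    2: [("c", 3), ("g", 3), ("i", 3)],
}


def job1_boundary_output(partitions, w):
    """Emit boundary entities keyed by (boundary, partition index, blocking key)."""
    r = len(partitions)
    out = []
    for p, entities in partitions.items():
        if p < r:  # last w-1 entities belong to the boundary between p and p+1
            out += [((p, p, key), entity) for entity, key in entities[-(w - 1):]]
        if p > 1:  # first w-1 entities belong to the boundary between p-1 and p
            out += [((p - 1, p, key), entity) for entity, key in entities[:w - 1]]
    return out


def job2(boundary_pairs, w):
    """Sliding window per boundary; only pairs from different partitions are compared."""
    by_boundary = {}
    for key, entity in sorted(boundary_pairs, key=lambda kv: kv[0]):
        by_boundary.setdefault(key[0], []).append((key[1], entity))
    result = []
    for boundary in sorted(by_boundary):
        buffer = []
        for part, entity in by_boundary[boundary]:
            result += [f"{prev}-{entity}" for prev_part, prev in buffer if prev_part != part]
            buffer.append((part, entity))
            if len(buffer) == w:
                buffer.pop(0)
    return result


boundary_output = job1_boundary_output(PARTITIONS, W)
boundary_matches = job2(boundary_output, W)
print(boundary_matches)
```

In a real cluster, Job2 would distribute the boundaries over reduce tasks via boundary % r; here all boundaries are processed in one loop.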


SORTED NEIGHBORHOOD WITH MAPREDUCE - REPSN

[Figure: RepSN on the running example; K = key, S = entity, B = compared pairs]
Key Generation + Partition Prefix + Boundary Prefix (map output incl. replicas)
  map1 (a, b, c): 1.1.1 a, 1.1.2 b, 2.2.3 c; replicas 2.1.1 a, 2.1.2 b
  map2 (d, e, f): 1.1.1 d, 1.1.2 e, 1.1.2 f; replicas 2.1.2 e, 2.1.2 f
  map3 (g, h, i): 2.2.3 g, 1.1.2 h, 2.2.3 i; replica 2.1.2 h
Partitioning by boundary prefix
  reduce1: 1.1.1 a, 1.1.1 d, 1.1.2 b, 1.1.2 e, 1.1.2 f, 1.1.2 h
  reduce2: 2.1.1 a, 2.1.2 b, 2.1.2 e, 2.1.2 f, 2.1.2 h, 2.2.3 c, 2.2.3 g, 2.2.3 i
Sliding Window (+ Matching)
  reduce1: B = a-d, a-b, d-b, d-e, b-e, b-f, e-f, e-h, f-h
  reduce2: B = f-c, h-c, h-g, c-g, c-i, g-i

• SN realization using data replication
  • Reduce task i>1 needs last w-1 entities of previous partition in front of its input
  • Potential boundary entities are replicated by the map tasks (two key-value pairs)
  • Replica of entity that is assigned to reduce task Ri is assigned to Ri+1

• Implementation
  • Map key prefixed with boundary component (like JobSN)
  • boundary = partitionPrefix+1 for replicated entities (boundary = partitionPrefix otherwise)
  • part(boundary.partitionPrefix.blockKey) = boundary
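RepSN can be sketched as a single simulated job on the running example. Two details are assumptions read off the example rather than stated in the bullets: each map task replicates only its w-1 entities with the highest blocking keys per partition, and a reduce task keeps only the last w-1 replicas and does not compare replicas with each other. All function names are hypothetical.

```python
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
W, R = 3, 2


def partition_prefix(key):
    """Range partitioning of the running example."""
    return 1 if key <= 2 else 2


def map_task(split, w, r):
    """Emit (boundary, partitionPrefix, blockKey) keys plus replicas for the next reduce task."""
    out, by_prefix = [], {}
    for entity in split:
        key = KEYS[entity]
        prefix = partition_prefix(key)
        out.append(((prefix, prefix, key), entity))
        by_prefix.setdefault(prefix, []).append((key, entity))
    for prefix, items in by_prefix.items():
        if prefix < r:
            items.sort(key=lambda ke: ke[0])
            # replicas get boundary = partitionPrefix + 1
            out += [((prefix + 1, prefix, key), entity) for key, entity in items[-(w - 1):]]
    return out


def reduce_task(boundary, pairs, w):
    """Seed the window with the last w-1 replicas, then slide over the task's own entities."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    buffer = [e for key, e in pairs if key[1] < boundary][-(w - 1):]
    result = []
    for entity in (e for key, e in pairs if key[1] == boundary):
        result += [f"{previous}-{entity}" for previous in buffer]
        buffer.append(entity)
        if len(buffer) == w:
            buffer.pop(0)
    return result


def repsn(splits, w, r):
    reduce_inputs = {b: [] for b in range(1, r + 1)}
    for split in splits:
        for key, entity in map_task(split, w, r):
            reduce_inputs[key[0]].append((key, entity))  # part(...) = boundary
    return {b: reduce_task(b, pairs, w) for b, pairs in reduce_inputs.items()}


result = repsn(["abc", "def", "ghi"], W, R)
print(result)
```

Together the two reduce tasks produce all 15 pairs of the sequential Sorted Neighborhood without a second job, at the cost of shipping the replicas.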


EXPERIMENTAL RESULTS

• 1.4 million publication records, blocking by title.substring(2), w=1000
• 4 dual-core nodes, Hadoop 0.20.2

• Runtime reduction from 9 h to 1.5 h: relative speedup of almost 6
• Runtimes of the implementations differ only slightly
  • JobSN faster for small degree of parallelism
  • RepSN completes faster beginning with m=r=4


CONCLUSIONS

• Application of the MapReduce programming model for parallel execution of typical Entity Resolution workflows

• Realization of Sorted Neighborhood Blocking with MapReduce
  • Sorted reduce partitions
    • Range partitioning
  • Boundary entities
    • JobSN: generation of boundary correspondences by additional job
    • RepSN: SN realization within a single job using data replication in map phase

• Evaluation of the proposed approaches

• Future work
  • Load balancing mechanisms for handling skewed (blocking key) data
  • Multi-pass blocking within a single job


THANK YOU FOR YOUR ATTENTION