
Survey of different approaches for computing KNN on top of Map Reduce


PROJECT: SURVEY OF DIFFERENT APPROACHES FOR COMPUTING KNN ON TOP OF MAP REDUCE

LEA EL BEZE

SUPERVISED BY: J. ROCHAS, GE SONG, F. HUET

DEFINITIONS:

KNN = k nearest neighbors

KNN(r,S) = the set of the k nearest neighbors of r in S

KNNJ(R,S) = {(r, KNN(r,S)) | for all r in R}
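The two definitions above can be written as a brute-force reference implementation. This is only a sketch: the Euclidean distance, the function names `knn` and `knn_join`, and the sample points are assumptions made for illustration.

```python
import heapq
import math

def knn(r, S, k):
    """KNN(r, S): the k points of S closest to r (Euclidean distance)."""
    return heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))

def knn_join(R, S, k):
    """KNNJ(R, S): pair every r in R with KNN(r, S)."""
    return {r: knn(r, S, k) for r in R}

# Hypothetical sample data, not taken from the slides.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (6.0, 5.0), (9.0, 9.0)]
result = knn_join(R, S, k=2)
```

The cost is |R| x |S| distance computations, which is what the distributed algorithms in the rest of the deck try to split up or avoid.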

Example

[Figure: example for k = 3 on two point sets R and S]

Problem: data deluge

Parallelism!

MAP REDUCE

Map Reduce

MapReduce is an architectural pattern for software development, popularized by Google, in which parallel and often distributed computations are performed over potentially very large data sets (Wikipedia).

Idea:

Map<K1,V1> -> list<K2,V2>

Reduce<K2,list<V2>> -> list<K3,V3>
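To make the two signatures concrete, here is a minimal in-memory simulation of one job. It is a sketch, not Hadoop code: `run_mapreduce` and the word-count mapper and reducer are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one job: map, group values by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):        # Map<K1,V1> -> list<K2,V2>
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):                # Reduce<K2,list<V2>> -> list<K3,V3>
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(_, line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return [(word, sum(ones))]

counts = run_mapreduce([(0, "map reduce map"), (1, "reduce map")],
                       count_mapper, count_reducer)
```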

KNN: THE DIFFERENT ALGORITHMS

2 types of algorithms

KNN: exact k nearest neighbors

ANN: approximate nearest neighbors

KNN: EXACT

1. HBKNNJ: BASIC
2. HBNLJ: BLOCK NESTED LOOP

HBKNNJ: Hadoop Basic K Nearest Neighbors Join

PRINCIPLE:

2 datasets, R and S

Join R with S

compute the K nearest neighbors in S of every point of R

Example

[Figure: the cities Tours, Nice, Paris, Toulouse and Reims, divided between R and S]

What are the 2 nearest neighbors of R in S?

HBKNNJ: Hadoop Basic K Nearest Neighbors Join, 1 job

Input:

R: nice,<43,7>  toulouse,<43,1>

S: reims,<49,4>  paris,<48,2>  tours,<47,1>

Map (each record is tagged with its source dataset):

nice,<43,7>,R

toulouse,<43,1>,R

tours,<47,1>,S

reims,<49,4>,S

paris,<48,2>,S

Reduce (distance from each point of R to every point of S):

nice: |nice,reims| = 6.7, |nice,paris| = 7, |nice,tours| = 6

toulouse: |toulouse,reims| = 6.7, |toulouse,paris| = 5, |toulouse,tours| = 2

Output:

nice,tours 6

nice,reims 6.7

toulouse,paris 5

toulouse,tours 2
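The single job above can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: every record is sent to one key so that a single reducer sees both R and S, and distances are plain Euclidean distances on the rounded coordinates, so the values differ from the illustrative figures shown on the slide.

```python
import heapq
import math
from collections import defaultdict

K = 2
R = {"nice": (43, 7), "toulouse": (43, 1)}
S = {"reims": (49, 4), "paris": (48, 2), "tours": (47, 1)}

def mapper(name, point, source):
    # Tag the record with its dataset; one shared key means one reducer.
    yield "all", (name, point, source)

def reducer(_, records):
    r_side = [(n, p) for n, p, src in records if src == "R"]
    s_side = [(n, p) for n, p, src in records if src == "S"]
    for r_name, r_point in r_side:
        dists = [(math.dist(r_point, s_point), s_name) for s_name, s_point in s_side]
        for d, s_name in heapq.nsmallest(K, dists):
            yield r_name, s_name, round(d, 2)

# Shuffle: group mapper output by key.
groups = defaultdict(list)
for source, dataset in (("R", R), ("S", S)):
    for name, point in dataset.items():
        for key, value in mapper(name, point, source):
            groups[key].append(value)

output = [row for key, values in groups.items() for row in reducer(key, values)]
```

Because all distance computations happen in one reducer, the reduce phase does not parallelize in this sketch.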

HBKNNJ: Hadoop Basic K Nearest Neighbors Join

DRAWBACKS:

only the map phase runs in parallel

ADVANTAGES:

a single MapReduce job

IN PRACTICE:

useful for small datasets, but quickly becomes very expensive in time for larger datasets

HBNLJ: Hadoop Block Nested Loop Join

"Efficient Parallel kNN Joins for Large

Goal:

Run the reduce phase in parallel by splitting up the work

Method:

Split R and S into n blocks each and give every pair of blocks to one of n*n reducers

HBNLJ: Hadoop Block Nested Loop Join

Partitioning:

(R1,S1) (R1,S2) (R1,S3)

(R2,S1) (R2,S2) (R2,S3)

(R3,S1) (R3,S2) (R3,S3)

R is distributed by row

S is distributed by column
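The row/column partitioning can be simulated in a few lines. This sketch assumes Euclidean distance and a round-robin split into blocks; `split` and `hbnlj` are hypothetical names, and the second step (merging the per-reducer candidates into the final k neighbors) is included so the result can be checked against brute force.

```python
import heapq
import math
import random
from itertools import product

def split(points, n):
    """Split a list into n blocks of roughly equal size (round-robin)."""
    return [points[i::n] for i in range(n)]

def hbnlj(R, S, k, n):
    r_blocks, s_blocks = split(R, n), split(S, n)
    # Job 1: reducer (i, j) joins block R_i with block S_j and emits candidates.
    candidates = {r: [] for r in R}
    for i, j in product(range(n), repeat=2):
        for r in r_blocks[i]:
            local = heapq.nsmallest(k, s_blocks[j], key=lambda s: math.dist(r, s))
            candidates[r].extend(local)
    # Job 2: merge the (up to n*k) candidates of each r into its final k neighbors.
    return {r: heapq.nsmallest(k, c, key=lambda s: math.dist(r, s))
            for r, c in candidates.items()}

random.seed(0)
R = [(random.random(), random.random()) for _ in range(20)]
S = [(random.random(), random.random()) for _ in range(60)]
exact = {r: heapq.nsmallest(3, S, key=lambda s: math.dist(r, s)) for r in R}
blocked = hbnlj(R, S, k=3, n=3)
```

Each block of R is read by n reducers and so is each block of S, which is the replication cost of this scheme.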

Example

R: nice, toulouse

S: paris, tours, lyon, reims, nancy, lille

1st job, map phase: every point of R is copied to the reducers of its row and every point of S to the reducers of its column

[Figure: the reducer grid filled step by step with the copies of toulouse, nice, paris, nancy, tours, reims, lyon and lille]

1st job, reduce phase: each reducer searches for the nearest neighbors of its block of R inside its block of S

1st job, output (candidates):

toulouse,reims

toulouse,tours

toulouse,paris

toulouse,lyon

nice,reims

nice,nancy

nice,paris

nice,lyon

2nd job: the candidates are grouped by point of R

<toulouse> reims,tours,paris,lyon

<nice> reims,nancy,paris,lyon

2nd job, output:

nice,lyon

nice,paris

toulouse,lyon

toulouse,tours

JOB 1 (inputs R and S): compute the candidates for KNN(R,S)

JOB 2: compute KNN(R,S)

HBNLJ: Hadoop Block Nested Loop Join

Advantage:

The work is split across n*n reducers, so the reduce phase can run in parallel.

Drawbacks:

R and S are each replicated n times

the output contains only cKNN(R,S), that is, the candidates for KNN(R,S)

this requires a 2nd job to obtain KNN(R,S)

HVKNNJ: Hadoop Voronoi K Nearest Neighbors Join

"Efficient processing of k

Voronoi?

it is a structure that divides the space into cells…

[Figure: the points of R and S, the selection of the pivots, and the resulting Voronoi diagram]

Problem: the replication of S

[Figure: illustration of the replication of S for k = 3]

JOB 1 (input R): generate the pivots

JOB 2 (inputs R and S): assign each point to its nearest pivot + statistics

STEP: grouping

JOB 3: compute the KNN

Pivot strategies

Farthest:

on a sample, take the farthest points

KMeans:

on a sample, compute a given number of centroids, which become the pivots
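A sketch of pivot selection followed by assignment of points to Voronoi cells. The "Farthest" strategy is interpreted here as a farthest-first traversal of the sample, which is an assumption; `farthest_pivots` and `assign_cells` are hypothetical names.

```python
import math
import random

def farthest_pivots(sample, p):
    """Start from one sample point, then repeatedly add the sample point
    whose distance to its closest already-chosen pivot is largest."""
    pivots = [sample[0]]
    while len(pivots) < p:
        pivots.append(max(sample, key=lambda x: min(math.dist(x, q) for q in pivots)))
    return pivots

def assign_cells(points, pivots):
    """Voronoi partitioning: each point joins the cell of its nearest pivot."""
    cells = {i: [] for i in range(len(pivots))}
    for x in points:
        nearest = min(range(len(pivots)), key=lambda i: math.dist(x, pivots[i]))
        cells[nearest].append(x)
    return cells

random.seed(1)
data = [(random.random(), random.random()) for _ in range(200)]
pivots = farthest_pivots(random.sample(data, 30), p=4)
cells = assign_cells(data, pivots)
```

The per-cell counts produced by `assign_cells` are the kind of statistics that the later grouping step relies on.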

HVKNNJ: Hadoop Voronoi K Nearest Neighbors

Grouping

Problem:

there are n reducers and p pivot cells: how should the p cells be distributed over the n reducers?

[Figure: Voronoi cells to be distributed over 4 reducers]

Goal:

group the pivots so as to minimize the replication of S

obtain good balancing (equivalent running time for every slot)

2 strategies: Geo and Greedy

Geo: group the pivots that are closest to each other

[Figure: Voronoi cells numbered 1 to 5, grouped by proximity]

But the cells do not all have the same distribution: some groups will take longer than the others

Greedy: distribute the cells according to their scores, putting together the cells that have the most replication in common

thanks to the statistics, the replication of each cell can be bounded, and therefore so can its time complexity

c(v) = complexity of a cell

c(v) = #r * (#s + #rep)

Greedy: example with 2 reducers

[Figure: Voronoi cells A, B, C, D and E]

pivot   rep        score
A       E,B        28
B       A,E,D,C    40
C       B,D        15
D       E,B,C      25
E       A,B,D      44

step   group 1   score   group 2   score
1      A         28      C         15
2      A         28      C,B       55
3      A,E       72      C,B       55
4      A,E       72      C,B,D     80
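The greedy example above can be reproduced with a small sketch. The selection rule used here is an assumption inferred from the example, not a rule stated in the slides: the group with the lowest total score receives the next cell, chosen among the unassigned cells that appear in the replication lists of its members, preferring the highest score. The starting cells A and C are taken from the example.

```python
def greedy_grouping(scores, rep, seeds):
    """Distribute cells over len(seeds) groups, one cell at a time."""
    groups = [[s] for s in seeds]
    unassigned = set(scores) - set(seeds)
    while unassigned:
        # The group with the lowest total score receives the next cell.
        g = min(groups, key=lambda grp: sum(scores[c] for c in grp))
        # Prefer cells that share replication with the group (assumed rule).
        linked = {c for member in g for c in rep[member]} & unassigned
        pool = linked or unassigned
        best = max(sorted(pool), key=lambda c: scores[c])
        g.append(best)
        unassigned.remove(best)
    return {tuple(g): sum(scores[c] for c in g) for g in groups}

scores = {"A": 28, "B": 40, "C": 15, "D": 25, "E": 44}
rep = {"A": ["E", "B"], "B": ["A", "E", "D", "C"], "C": ["B", "D"],
       "D": ["E", "B", "C"], "E": ["A", "B", "D"]}
groups = greedy_grouping(scores, rep, seeds=["A", "C"])
```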

Step 3: grouping

Result:

Greedy performs better with more reducers

ANN: APPROXIMATE

1. HZKNNJ: Z-VALUE
2. LSH: LOCALITY SENSITIVE HASHING

HZKNNJ: Hadoop Z-Value K Nearest Neighbors Join

"Efficient Parallel kNN Joins for Large

Idea:

Transform the d-dimensional points into 1 dimension using the z-value

Search along this single dimension to find the KNNJ
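A sketch of the idea, assuming non-negative integer coordinates: the z-value interleaves the bits of the coordinates (Morton code), and the neighbors of a query are searched only among its predecessors and successors in z-order. The shifted copies used to improve accuracy are left out, and `z_value` and `approx_knn` are hypothetical names.

```python
import math

def z_value(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates."""
    z = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return z

def approx_knn(q, S, k):
    """Take the k predecessors and k successors of q in z-order as
    candidates, then keep the k candidates closest to q."""
    ordered = sorted(S, key=z_value)
    zq = z_value(q)
    pos = sum(1 for s in ordered if z_value(s) < zq)
    candidates = ordered[max(0, pos - k):pos + k]
    return sorted(candidates, key=lambda s: math.dist(q, s))[:k]

S = [(1, 1), (2, 3), (5, 4), (7, 7), (3, 0), (6, 2)]
approx = approx_knn((2, 2), S, k=2)
```

Only 2k distance computations are needed per query once S is sorted, at the price of possibly missing true neighbors that are far away in z-order.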

HZKNNJ: Hadoop Z-Value K Nearest Neighbors Join

[Figure: starting points, creation of the copies, and transformation based on a space-filling curve = mapping onto 1-D]

[Figure: z-value z_q of a point q, with z-(z_q) and z+(z_q) giving the candidates of q]

How should the partition be made?

[Figure: the z-values of S (Z-S) and of R (Z-R) split into partitions; for 2-NN, points are copied across partition boundaries; each partition builds a BTree and outputs its candidates (cKNN)]

Step 0: create the vectors

JOB 1 (inputs R and S): copies + z-values, statistics

JOB 2: compute the candidates

JOB 3: compute the KNN

HLSH: Hadoop Locality Sensitive Hashing

"Parallel Similarity Join"

Idea:
• hash the objects
• objects with the same hash value are in the same bucket
• = searching for collisions

[Figure: example vectors (1,0,-4), (6,-8,7) and (9,0,-8) hashed into buckets]

compute the KNN of the objects of R among the objects of S in the same bucket

empty bucket (no object of R) = eliminated

The hash function

g is a family of hash functions of length M: g(object) = (h1, h2, …, hM)

L families g are used: each object is hashed L times

h(v) = floor((a · v + b) / W)

a = random Gaussian vector

b in [0,W]

W = defines the size of the bucket

3 parameters

L = increases the accuracy, but also the running time

M = makes the buckets more selective: objects that fall in the same bucket are more likely to be close

W = the size of the window
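A sketch of the hash family described above, with h(v) = floor((a · v + b) / W), M functions h concatenated into one key g, and L independent keys per object. The parameter values and the names `make_hash` and `make_g` are assumptions chosen for illustration.

```python
import math
import random

def make_hash(dim, W, rng):
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # random Gaussian vector
    b = rng.uniform(0.0, W)                          # offset in [0, W]
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / W)
    return h

def make_g(dim, M, W, rng):
    """g(v) = (h1(v), ..., hM(v)): the bucket key in one hash table."""
    hs = [make_hash(dim, W, rng) for _ in range(M)]
    return lambda v: tuple(h(v) for h in hs)

rng = random.Random(42)
L, M, W = 3, 4, 10.0
tables = [make_g(3, M, W, rng) for _ in range(L)]

v = (1, 0, -4)
keys = [g(v) for g in tables]    # one bucket key per table, L in total
```

Two objects are compared only if they share a bucket key in at least one of the L tables.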

HLSH: Hadoop Locality Sensitive Hashing

Step 0: define the hash functions

JOB 1 (inputs R and S): generate the hashes and compute the statistics

Step 2: define the partition using the statistics

JOB 2: compute the KNN inside each bucket = candidates

JOB 3: compute the KNN

Partition

there are more buckets than reducers

Hadoop's default partitioning balances them poorly

Define a partition so that every reducer has the same time complexity

for a reducer P: time(P) = Sum(#r_i * #s_i), over the buckets i assigned to P
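One way to build such a partition is a greedy heuristic: sort the buckets by decreasing cost #r * #s and always give the next bucket to the least-loaded reducer. The slides do not say which method was used, so this is only an assumed sketch; `partition_buckets` and the bucket sizes are hypothetical.

```python
import heapq

def partition_buckets(bucket_sizes, n_reducers):
    """bucket_sizes maps bucket id -> (#r, #s).
    Returns a sorted list of (reducer, assigned buckets, total cost)."""
    loads = [(0, p, []) for p in range(n_reducers)]
    heapq.heapify(loads)
    by_cost = sorted(bucket_sizes.items(),
                     key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    for bucket, (n_r, n_s) in by_cost:
        load, p, assigned = heapq.heappop(loads)      # least-loaded reducer
        heapq.heappush(loads, (load + n_r * n_s, p, assigned + [bucket]))
    return sorted((p, assigned, load) for load, p, assigned in loads)

sizes = {"b0": (10, 40), "b1": (5, 20), "b2": (8, 30), "b3": (2, 10), "b4": (6, 25)}
plan = partition_buckets(sizes, n_reducers=2)
```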

HLSH: Hadoop Locality Sensitive Hashing

Pros

fast, with fewer computations

Cons

the hash function is dataset-dependent, although this type of function has been shown to be more efficient

there is no bucket replication when a bucket does not contain enough elements

can be improved (Multi-probe, LSH forest, …)

SUMMARY

1. KNN
   1. HBKNNJ: BASIC
   2. HBNLJ: BLOCK NESTED LOOP
   3. HVKNNJ: VORONOI

Algorithms: HBKNNJ (basic), HBNLJ (block nested loop), HVKNNJ (Voronoi), HZKNNJ (z-value), HLSH (locality sensitive hashing)

Preprocessing: pivots (HVKNNJ), shifts and dimension reduction (HZKNNJ), hash functions (HLSH)

Stages: computing the partitioning/grouping, computing the candidates, partitioning the candidates, computing the KNN

EXPERIMENTS

Context

Run on Hadoop 1.3

On Grid5000

Datasets

2 datasets

OpenStreetMap: OSM

SURF

Measurements

Time and accuracy, as functions of #data, #nodes and the number of nearest neighbors

Accuracy

recall = |A(q) ∩ I(q)| / |I(q)|

A(q) = actual result for q (the neighbors returned by the algorithm)

I(q) = ideal result for q (the exact nearest neighbors)
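The recall formula translates directly into code. A minimal sketch, with hypothetical sample results:

```python
def recall(actual, ideal):
    """|A(q) ∩ I(q)| / |I(q)| for one query q."""
    ideal = set(ideal)
    return len(set(actual) & ideal) / len(ideal)

# Hypothetical result for one query: 2 of the 3 exact neighbors were found.
value = recall(["paris", "tours", "lyon"], ["paris", "tours", "reims"])
```

An exact algorithm always scores 1.0; only the approximate ones (HZKNNJ, HLSH) can fall below it.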

GEOGRAPHIC DATA…

OSM DATA

Geographic data

XML

latitude, longitude

2 dimensions

IMPACT OF THE NUMBER OF NODES

[Figure: running time (sec) versus #nodes (3, 10, 15, 18) on 4x10^3 data, first for HBKNNJ alone, then with HBNLJ and HVKNNJ added]

[Figure: running time (sec) versus #nodes (3, 10, 15, 18) for HBKNNJ, HBNLJ, HVKNNJ, HZKNNJ and HLSH, one panel each for 4x10^3, 200x10^3 and 4000x10^3 data]

IMPACT OF THE DATA SIZE

CONFIGURATION

20 nodes

1 slot/node

K = 20 (find the K nearest neighbors)

the data size varies

the running time is measured

[Figure: running time (sec) versus #data (50, 100, 200, 400, 800, 1600; in units of 4x10^3 data) for HBNLJ, HVKNNJ, HLSH and HZKNNJ, with a zoomed panel for HLSH and HZKNNJ only]

[Figure: accuracy (0.8 to 1) versus #data (50, 200, 400, 800, 1600) for HLSH and HZKNNJ]

IMPACT OF K (K = THE NUMBER OF NEIGHBORS WANTED)

[Figure: running time (sec) versus K (2, 20, 200, 400) on 4x10^3 data for HBNLJ, HVKNNJ, HZKNNJ and HLSH]

But what about accuracy …

[Figure: accuracy (0 to 1) versus K (2, 20, 200, 400) on 4x10^3 data for HBNLJ, HVKNNJ, HZKNNJ and HLSH]

But if the parameters of the hash function are changed …

[Figure: accuracy (0 to 1) versus K (2, 20, 200, 400) on 4x10^3 data for HBNLJ, HVKNNJ, HZKNNJ, HLSH, and HLSH with changed parameters]

and with respect to time …

[Figure: running time (sec) versus K (2, 20, 200, 400) on 4x10^3 data for the same algorithms, with a zoomed panel for HLSH and HLSH with changed parameters]

SURF DATA…

SURF DATA

descriptors generated by the Speeded Up Robust Features (SURF) algorithm

dimension 128

2 algorithms are dropped

HBKNNJ - Basic: too slow

HZKNNJ - Z-Value: accuracy < 5% on a high-dimensional dataset (>30)

IMPACT OF THE DATA SIZE

Raw measurements shown with the charts (semicolon-separated):

1;49724;1186657;1.0;0h19min46sec
2;100039;2276780;1.0;0h37min56sec
4;207402;4665689;1.0;1h17min45sec
8;409052;9273714;1.0;2h34min33sec

[Figure: running time (sec) versus number of images (100, 200, 400, 800, 1600) for HVKNNJ, HBNLJ and HLSH]

[Figure: accuracy (0 to 1) versus number of images (100, 200, 400, 800) for HVKNNJ, HBNLJ and HLSH]

IMPACT OF THE NUMBER OF NODES

[Figure: running time (sec) versus #nodes (10, 20, 30, 40) for HVKNNJ, HBNLJ and HLSH]

Page 134: Survey of different approaches for computing KNN on top of Map Reduce

IMPACT SUR 'K' K = LE NOMBRE DE VOISINS SOUHAITES

134

Page 135: Survey of different approaches for computing KNN on top of Map Reduce

100x100 images : 49724 descriptors

time(

sec)

0

2000

4000

6000

8000

K

2 20 200 2000

HVKNNJ: Voronoi HBNLJ: Bloc nested loop HLSH : LSH

Impact du nombre de K … - temps

Contexte : 20 nodes, 1 slot/nodes

135

Page 136: Survey of different approaches for computing KNN on top of Map Reduce

100x100 images : 49724 descriptors

accu

racy

0

0,25

0,5

0,75

1

K

2 20 200 2000

HBNLJ: Bloc nested loop HVKNNJ: Voronoi HLSH : LSH

Impact du nombre de K … - temps

Contexte : 20 nodes, 1 slot/nodes

136

Page 137: Survey of different approaches for computing KNN on top of Map Reduce

100x100 images : 49724 descriptors

accu

racy

0

0,25

0,5

0,75

1

K

2 20 200 2000

HBNLJ: Bloc nested loop HVKNNJ: Voronoi HLSH : LSH HLSH with best acc

Impact du nombre de K … - temps

Contexte : 20 nodes, 1 slot/nodes

137

Page 138: Survey of different approaches for computing KNN on top of Map Reduce

Impact of K: time

100x100 images: 49724 descriptors

Context: 20 nodes, 1 slot/node

[Figure: time (sec, 0 to 8000) vs. K (2, 20, 200, 2000). Series: HVKNNJ (Voronoi), HBNLJ (block nested loop), HLSH (LSH).]

Page 139: Survey of different approaches for computing KNN on top of Map Reduce

Impact of K: time

100x100 images: 49724 descriptors

Context: 20 nodes, 1 slot/node

[Figure: time (sec, 0 to 8000) vs. K (2, 20, 200, 2000). Series: HVKNNJ (Voronoi), HBNLJ (block nested loop), HLSH (LSH), HLSH with best accuracy.]

Page 140: Survey of different approaches for computing KNN on top of Map Reduce

THE DIFFICULTY OF FINDING THE PARAMETERS FOR EACH ALGORITHM …

Page 141: Survey of different approaches for computing KNN on top of Map Reduce

HBKNNJ: BASIC • NUMBER OF SLOTS

Page 142: Survey of different approaches for computing KNN on top of Map Reduce

HBKNNJ - BASIC: I. Number

Context: K=20, 20 nodes, 1 slot/node, dimension=2

[Figure: time (sec, 110 to 130) vs. #nodes (3, 10, 15, 18), 4x10^3 data. Series: HBKNNJ.]

Page 143: Survey of different approaches for computing KNN on top of Map Reduce

HBNNLJ: BLOCK NESTED LOOP • CHOICE OF THE NUMBER OF REDUCERS • BALANCING

Page 144: Survey of different approaches for computing KNN on top of Map Reduce

HBNNLJ: BLOCK NESTED

Context: K=20, 15 nodes, 1 slot/node

[Figure: time (sec, 0 to 4000 in steps of 400) vs. number of reducers (5x5, 6x6, 7x7, 8x8, 9x9). Series: 5m, 50m, 100m. The minimum of each curve is marked "min".]

Page 145: Survey of different approaches for computing KNN on top of Map Reduce

HBNNLJ: BLOCK NESTED

The work is split across n*n reducers.

To maximize parallelism, split the work according to the number of available slots.

With 15 nodes:

5x5 is not optimal: 25 = 15 + 10, so 15 nodes work, then 10, and 5 stay idle.

7x7 is better: 49 = 15 + 15 + 15 + 4, so only 4 reducers are left for the last wave.

8x8: 64 = 15*4 + 4, better for good balancing.

But more reducers = less work per cell.
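The wave arithmetic on this slide can be checked with a small sketch. It assumes one reducer per slot and reducers of equal duration, which the slides do not state; the function name `wave_schedule` is only illustrative.

```python
import math

def wave_schedule(n, slots):
    """Describe how n*n reducers are scheduled on a fixed number of slots.

    Assumption (not from the slides): every reducer takes the same time,
    so reducers run in successive "waves" of at most `slots` tasks.
    Returns (reducers, waves, reducers_in_last_wave, idle_slots_in_last_wave).
    """
    reducers = n * n
    waves = math.ceil(reducers / slots)
    last = reducers - (waves - 1) * slots
    return reducers, waves, last, slots - last

if __name__ == "__main__":
    for n in range(5, 10):
        reducers, waves, last, idle = wave_schedule(n, 15)
        print(f"{n}x{n}: {reducers} reducers, {waves} waves, "
              f"last wave runs {last}, {idle} slots idle")
```

Under these assumptions the idle slots in the last wave are only part of the picture: a larger grid also means smaller cells, so each wave is shorter.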

Page 146: Survey of different approaches for computing KNN on top of Map Reduce

HBNNLJ: BLOCK NESTED

Context: K=20, 15 nodes, 1 slot/node

[Figure: time (sec, 0 to 4000 in steps of 400) vs. number of reducers (5x5, 6x6, 7x7, 8x8, 9x9). Series: 5m, 50m, 100m. The minimum of each curve is marked "min".]

Page 147: Survey of different approaches for computing KNN on top of Map Reduce

HBNNLJ: BLOCK NESTED

Context: K=20, 15 nodes, 1 slot/node, 100000 data points

Page 148: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI • PIVOT STRATEGY • NUMBER OF PIVOTS • GROUPING STRATEGY

Page 149: Survey of different approaches for computing KNN on top of Map Reduce

1. CHOICE OF THE PIVOT STRATEGY

Page 150: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

Context: K=20, 15 nodes, 1 slot/node, dimension=2

[Figure panels: farthest, kmeans]
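The slide only names the two pivot strategies ("farthest" and "kmeans") without giving details. As an illustration of the first, here is a minimal sketch of a generic farthest-first selection, where each new pivot is the point farthest from the pivots already chosen (max-min distance). This variant, the function names and the Euclidean distance are assumptions, not necessarily what the authors implemented.

```python
import math
import random

def farthest_pivots(points, n_pivots, seed=0):
    """Pick pivots by farthest-first traversal (assumed variant).

    The first pivot is random; each next pivot is the point whose
    distance to its closest already-chosen pivot is the largest.
    """
    rng = random.Random(seed)
    pivots = [points[rng.randrange(len(points))]]
    # Distance from every point to its closest pivot so far.
    nearest = [math.dist(p, pivots[0]) for p in points]
    while len(pivots) < n_pivots:
        idx = max(range(len(points)), key=nearest.__getitem__)
        pivots.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return pivots

def assign_to_cells(points, pivots):
    """Assign each point to the Voronoi cell of its closest pivot."""
    cells = {i: [] for i in range(len(pivots))}
    for p in points:
        i = min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))
        cells[i].append(p)
    return cells
```

A k-means strategy would instead use cluster centers as pivots, which tends to follow the data density rather than its extremes.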

Page 151: Survey of different approaches for computing KNN on top of Map Reduce

2. IMPACT OF THE NUMBER OF PIVOTS

Page 152: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

Context: K=5, 3 nodes, 1 slot/node

[Figure: time (sec, 200 to 1000) vs. #pivots (6, 30, 300, 500, 1500, 3000, 4000). Series: 5M, 50M, 100M. The minimum of each curve is marked "min". Annotation: more pivots = "Java heap space" (out-of-memory error).]

Page 153: Survey of different approaches for computing KNN on top of Map Reduce

3. CHOICE OF THE GROUPING STRATEGY

GEO VS GREEDY

Page 154: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

Page 155: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

geo with 20 reducers

Page 156: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

greedy with 20 reducers

Page 157: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: VORONOI

greedy with 50 reducers

Page 158: Survey of different approaches for computing KNN on top of Map Reduce

HZKNNJ: Z-VALUE • #DATA VS ACCURACY • DIMENSION VS ACCURACY

Page 159: Survey of different approaches for computing KNN on top of Map Reduce

1. #Data vs accuracy

Page 160: Survey of different approaches for computing KNN on top of Map Reduce

HZKNNJ: Z-VALUE

Context: K=20, 15 nodes, 1 slot/node, dimension=2, 1 copy

[Figure: accuracy (0.72 to 1) and time (sec, 0 to 260) vs. #data*1000 (50, 100, 400, 800, 1600). Series: time (sec), accuracy.]

Page 161: Survey of different approaches for computing KNN on top of Map Reduce

HZKNNJ: Z-VALUE

Context: K=20, 15 nodes, 1 slot/node, dimension=2, 1 copy

[Figure: accuracy (0.6 to 1) and time (sec, 0 to 600) vs. #copy (1, 2, 3, 4). Series: time (sec), accuracy.]
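To make the "#copy" parameter concrete, here is a minimal sketch of a z-value (Morton code) computed by bit interleaving, together with randomly shifted copies of the data. It assumes non-negative integer coordinates and a fixed bit width; the function names and the shift range are illustrative and not taken from the slides.

```python
import random

def z_value(coords, bits=16):
    """Interleave the bits of non-negative integer coordinates (Morton code).

    `bits` must be large enough to hold every coordinate, including
    any shift added by shifted_copies().
    """
    z = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * dims + d)
    return z

def shifted_copies(points, n_copies, max_shift, seed=0):
    """Return n_copies versions of the data set.

    Copy 0 is the original; every other copy adds one random vector
    (same vector for all points of that copy) before z-values are taken.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    copies = []
    for i in range(n_copies):
        if i == 0:
            shift = [0] * dims
        else:
            shift = [rng.randrange(max_shift) for _ in range(dims)]
        copies.append([tuple(c + s for c, s in zip(p, shift)) for p in points])
    return copies
```

Sorting each copy by `z_value` gives a different one-dimensional ordering of the same points, so neighbors missed in one ordering may be found in another. This is consistent with the plot, where both accuracy and time are shown against the number of copies.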

Page 162: Survey of different approaches for computing KNN on top of Map Reduce

2. Dimension vs accuracy

Page 163: Survey of different approaches for computing KNN on top of Map Reduce

HZKNNJ: Z-VALUE

Context: K=20, 15 nodes, 1 slot/node, file surf, dimension=128, #data=50000

[Figure: accuracy (%, 4.68 to 4.73) and time (s, 0 to 240) vs. #copy (1, 4, [untitled], 8, 10). Series: time (s), accuracy (%).]

Limited number of copies

Accuracy < 5%

Page 164: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: LOCALITY SENSITIVE HASHING

• HASH FUNCTION

Page 165: Survey of different approaches for computing KNN on top of Map Reduce

1. Partition

Page 166: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Locality Sensitive

Context: K=20, 20 nodes, 1 slot/node, file osm, dimension=2, #data=100000

Page 167: Survey of different approaches for computing KNN on top of Map Reduce

2. The hash function

Page 168: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Locality Sensitive

The hash function

Difficulty: finding both good accuracy and a good running time.

Context: K=20, 20 nodes, 1 slot/node, file osm, dimension=2, #data=400000

Page 169: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Locality Sensitive

The hash function

Changing W affects the number of buckets:

fewer buckets => more elements per bucket

Context: K=20, 20 nodes, 1 slot/node, file osm, dimension=2, #data=400000
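The effect of W on the number of buckets can be illustrated with a short sketch. The slides do not give the hash function itself; the code below assumes the common random-projection form floor((a . v + b) / W), with M such functions concatenated into one bucket key. All function names are illustrative.

```python
import math
import random

def make_projections(dim, m, seed=0):
    """Draw M random Gaussian directions and M offsets.

    Offsets are stored as fractions u in [0, 1), i.e. b = u * W.
    """
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
    offsets = [rng.random() for _ in range(m)]
    return directions, offsets

def bucket_key(point, directions, offsets, w):
    """Bucket key: one value floor((a . v) / W + u) per hash function."""
    return tuple(
        math.floor(sum(a_i * x_i for a_i, x_i in zip(a, point)) / w + u)
        for a, u in zip(directions, offsets)
    )

def count_buckets(points, directions, offsets, w):
    """Number of distinct non-empty buckets for a given width W."""
    return len({bucket_key(p, directions, offsets, w) for p in points})
```

With this form, a larger W makes each bucket wider, so the same points fall into fewer, fuller buckets. In the usual scheme, L independent tables would each use their own call to `make_projections` with a different seed.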

Page 170: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Locality Sensitive

Context: K=20, 8 nodes, 1 slot/node, file surf, dimension=128, #data=400000

[Figure, panel "L=1, M=2": accuracy (%, 0 to 50) and time (s, 0 to 12000) vs. W (5000000, 15000000).]

[Figure, panel "M=7, W=5000000": accuracy (%, 0 to 60) and time (s, 0 to 4000) vs. L (1, 2, 4, 6).]

[Figure, panel "L=1, W=5000000": accuracy (%, 0 to 50) and time (s, 0 to 9000) vs. M (2, 6, 14).]

Page 171: Survey of different approaches for computing KNN on top of Map Reduce

TO SUMMARIZE

Page 172: Survey of different approaches for computing KNN on top of Map Reduce

To each its own problems

HBKNNJ (Basic):
• not enough parallelism
• too much computation

HBNLJ (Block Nested Loop):
• too many replications
• too many useless computations

HVKNNJ (Voronoi):
• a lot of computation, and slow in high dimension (HD)
• can be risky if the pivots are poorly chosen
• better for dispersed data

HZKNNJ (Z-Value):
• accuracy not stable with respect to #data
• very good for small dimensions (< 30)
• very bad in high dimension (HD)

HLSH (Locality Sensitive Hashing):
• accuracy not stable for large K
• parameters are difficult to choose: they depend on K and on the dataset
• very useful for the matching use case, given the large number of candidates

Page 173: Survey of different approaches for computing KNN on top of Map Reduce

CONCLUSION

The experiments are very difficult, with many parameters.

No algorithm is better than all the others: it depends on the dataset.

Page 174: Survey of different approaches for computing KNN on top of Map Reduce

USE CASE: SIMILARITY BETWEEN IMAGES

ALGORITHM: NEAREST NEIGHBORS RATIO
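The slides name the nearest neighbors ratio algorithm without detailing it. A minimal sketch of the usual ratio test follows: a descriptor is matched to its nearest neighbor only if that neighbor is clearly closer than the second nearest one. The threshold of 0.8, the Euclidean distance and the brute-force search (standing in for any of the KNN algorithms surveyed here, with K=2) are assumptions.

```python
import math

def ratio_matches(descriptors_r, descriptors_s, ratio=0.8):
    """Return (i, j) index pairs that pass the nearest-neighbor ratio test.

    i indexes descriptors_r, j indexes descriptors_s. The ratio value
    is an assumed default, not a value given in the slides.
    """
    matches = []
    for i, r in enumerate(descriptors_r):
        # Two nearest neighbors of r in S (K = 2), found by brute force.
        ranked = sorted((math.dist(r, s), j) for j, s in enumerate(descriptors_s))
        if len(ranked) < 2:
            continue
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        # Keep the match only if the best neighbor is unambiguous.
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A descriptor with two almost equally close neighbors is rejected, which is why the test needs the two nearest neighbors and not just the closest one.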

Page 175: Survey of different approaches for computing KNN on top of Map Reduce

The descriptors?

Page 176: Survey of different approaches for computing KNN on top of Map Reduce

Computing the similarity

Page 177: Survey of different approaches for computing KNN on top of Map Reduce

Example:

Page 178: Survey of different approaches for computing KNN on top of Map Reduce

Similarity between images

INPUT: images R, images S

STEP 1: compute the matches using our different algorithms

STEP 2: compute the K highest scores

OUTPUT: the K most similar images for each image
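Step 2 can be sketched as follows, assuming that the score between two images is simply the number of matched descriptors they share. The slides do not define the score, so this definition and the function name are hypothetical.

```python
import heapq
from collections import Counter, defaultdict

def top_k_similar(matches, k):
    """Keep the k highest-scoring images of S for each image of R.

    matches: iterable of (image_r, image_s) pairs, one per matched
    descriptor (the output of step 1). The score is the match count
    (assumed definition). Returns {image_r: [(image_s, score), ...]}.
    """
    scores = defaultdict(Counter)
    for image_r, image_s in matches:
        scores[image_r][image_s] += 1
    return {
        image_r: heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], kv[0]))
        for image_r, counts in scores.items()
    }
```

In a MapReduce setting the counting would be a reduce keyed by the image pair, followed by a second reduce keyed by the image of R that keeps the K best scores.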

Page 179: Survey of different approaches for computing KNN on top of Map Reduce

Impact of the amount of data

[Figure: time (sec, 0 to 14000) vs. #image/category (1, 2, 4, 6, 8, 16). Series: HBNLJ, HVKNNJ, HLSH.]

Page 180: Survey of different approaches for computing KNN on top of Map Reduce

HVKNNJ: Hadoop Voronoi K Nearest Neighbors

JOB 3: KNN

STEP: grouping

input → map → reduce → output

Input: (p_r, r) and (p_s, s)

Map: each r is assigned to its group; for each s, decide to which group of r it must be replicated

Reduce input: (g_r1, [r5, s1, s2…]), (g_r2, [r2, s1, s2…])

Output: KNN(r5, S), KNN(r2, S)

Page 181: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Hadoop Locality Sensitive

JOB 1: hash and statistics

input → map → reduce → output

Input: R, S

Map: (hash, r1), (hash, r2), (hash, s1), (hash, s2), (hash, s3)

Reduce: bucket statistics, and removal of buckets

Output: (hash, s1), (hash, r1), (hash, r2), (hash, s2), (hash, s3)

Page 182: Survey of different approaches for computing KNN on top of Map Reduce

HLSH: Hadoop Locality Sensitive

JOB 1: computing the candidates

input → map → reduce → output

Input: (hash, r1), (hash, r2), (hash, s1), (hash, s2), (hash, s3)

Compute the partitioning of the buckets

Map: defines which reducer each hash goes to

Reduce: compute the candidate KNN (cKNN)

JOB 2: computing the KNN

map → reduce

Input: cKNN

Reduce: compute the KNN

Output: KNN
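The two jobs above can be simulated in memory with a short sketch. It is not Hadoop code: the dictionary of buckets plays the role of the shuffle, and the hash functions are passed in as plain callables (`key_fns`, one per hash table). These names, and the use of Euclidean distance, are assumptions made for the illustration.

```python
import heapq
import math
from collections import defaultdict

def job1_candidates(r_points, s_points, key_fns, k):
    """Candidate KNN: group R and S by (table, hash), search inside buckets.

    r_points, s_points: dicts mapping an id to a coordinate tuple.
    Returns a list of (r_id, distance, s_id) candidates.
    """
    buckets = defaultdict(lambda: ([], []))
    for t, key_fn in enumerate(key_fns):
        for rid, r in r_points.items():
            buckets[(t, key_fn(r))][0].append(rid)
        for sid, s in s_points.items():
            buckets[(t, key_fn(s))][1].append(sid)
    candidates = []
    for r_ids, s_ids in buckets.values():
        for rid in r_ids:
            dists = ((math.dist(r_points[rid], s_points[sid]), sid) for sid in s_ids)
            candidates.extend((rid, d, sid) for d, sid in heapq.nsmallest(k, dists))
    return candidates

def job2_merge(candidates, k):
    """Final KNN: per r, deduplicate the candidates and keep the k closest."""
    per_r = defaultdict(set)
    for rid, d, sid in candidates:
        per_r[rid].add((d, sid))
    return {rid: [sid for _, sid in heapq.nsmallest(k, c)]
            for rid, c in per_r.items()}
```

An r whose buckets contain no element of S gets no candidates at all, which is one way the accuracy of an approximate method can drop.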