UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA SCHOOL OF SOFTWARE ENGINEERING OF USTC
Distances: Measuring similarities
Devert Alexandre, School of Software Engineering of USTC
December 7, 2012 — Slide 1/62
Table of Contents
1 Introduction
2 Strings
3 Data semantic
4 Perceptive models
Devert Alexandre (School of Software Engineering of USTC) — Distances — Slide 2/62
Distances
Many data-mining algorithms, like k-means, rely on a distance measure
Distances
The distance measure expresses how two items of the dataset relate to each other
Distances
So far, we considered points in $\mathbb{R}^n$ and the Euclidean distance.
Distances
But what does a distance measure look like for
• Text documents ?
• Sounds ?
• Shapes ?
Distances
Is the Euclidean distance good enough in all cases?
2 Strings
Binary strings
Let’s consider binary strings of fixed length
001010011101010100111100011111
Hamming distance
A convenient distance measure for strings is the Hamming distance
s1 = 001010011101010100111100011111
s2 = 001010011101010000111100011111
d(s1, s2) = 1
Distance is the number of different digits
s1 = 001010011101010100111100011111
s2 = 001010001101000100110100011111
d(s1, s2) = 3
A way to compute it quickly, if the bit strings are stored as integers
• Take the exclusive or (XOR) of the 2 strings
• Count the ones in the resulting string
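The two steps above can be sketched in Python as follows. This is a minimal illustration, not the author's code; the function name `hamming_bits` is hypothetical, and the strings are the ones used in the Hamming distance example.

```python
def hamming_bits(a: int, b: int) -> int:
    """Hamming distance between two integers seen as bit strings."""
    # XOR leaves a 1 exactly at the positions where a and b differ
    diff = a ^ b
    # Count the ones in the result
    return bin(diff).count("1")


s1 = int("001010011101010100111100011111", 2)
s2 = int("001010001101000100110100011111", 2)
print(hamming_bits(s1, s2))  # prints 3
```

On Python 3.10 and later, `diff.bit_count()` does the same counting without building a string.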
Fixed size strings
The Hamming distance also works for alphabets other than {0, 1}
s1 = GATEAU
s2 = BATEAU
d(s1, s2) = 1
s1 = BIGLOTRON
s2 = BAFFOTRON
d(s1, s2) = 3
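For an arbitrary alphabet, the Hamming distance can be sketched as a position-by-position comparison. The function name `hamming` is hypothetical, and raising an error on unequal lengths is an assumption about how fixed-size strings should be enforced.

```python
def hamming(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    # Compare the strings character by character and count the mismatches
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming("GATEAU", "BATEAU"))        # prints 1
print(hamming("BIGLOTRON", "BAFFOTRON"))  # prints 3
```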
Variable length strings
But for many practical applications, we need to compare strings with different lengths
CYCLOTRON
SYNCHROTRON
SYNCHROPHASOTRON
BIGLOTRON
Levenshtein distance
The Levenshtein distance is the minimum number of edits needed to transform one string into the other
1 insertion of a character
2 deletion of a character
3 substitution of a character
Insertion of a character
FAT ⇒ FAST
Deletion of a character
FART ⇒ FAT
Substitution of a character
FAT ⇒ CAT
CYCLOTRON
SYCLOTRON     substitution C ⇒ S
SYNCLOTRON    insertion YC ⇒ YNC
SYNCHOTRON    substitution L ⇒ H
SYNCHROTRON   insertion HO ⇒ HRO
The distance is 4, because it takes at least 4 steps to turn CYCLOTRON into SYNCHROTRON
Computing Levenshtein distance
Initial step for s1 = CAAT and s2 = CAT. We will fill a matrix step by step, with the distance between each prefix of s1 (columns) and each prefix of s2 (rows)
         C  A  A  T
      0  1  2  3  4
   C  1
   A  2
   T  3
Let’s fill the matrix
• Prefixes C and C are the same, distance is 0
• Prefixes C and CA, CAA, CAAT have distance 1, 2, 3
• Prefixes C and CA, CAT have distance 1, 2
• The remaining cells are filled the same way
         C  A  A  T
      0  1  2  3  4
   C  1  0  1  2  3
   A  2  1  0  1  2
   T  3  2  1  1  1
Distance is 1!
CUT and CAT
         C  U  T
      0  1  2  3
   C  1  0  1  2
   A  2  1  1  2
   T  3  2  2  1
Distance is 1!
The general idea
          x1           x2         ...   xn
      0   1            2          ...   n
  y1  1
  y2  2   d_{i-1,j-1}  d_{i,j-1}
  ...     d_{i-1,j}    d_{i,j}
  ym  m
$d_{i,j}$ is the distance between the prefixes $x_1 x_2 \dots x_i$ and $y_1 y_2 \dots y_j$
If $x_i = y_j$: $d_{i,j} = d_{i-1,j-1}$
If $x_i \neq y_j$: $d_{i,j} = 1 + \min(d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1})$
For CYCLOTRON (columns) and SYNCHROTRON (rows), filling the matrix in the same way gives
          C   Y   C   L   O   T   R   O   N
      0   1   2   3   4   5   6   7   8   9
  S   1   1   2   3   4   5   6   7   8   9
  Y   2   2   1   2   3   4   5   6   7   8
  N   3   3   2   2   3   4   5   6   7   7
  C   4   3   3   2   3   4   5   6   7   8
  H   5   4   4   3   3   4   5   6   7   8
  R   6   5   5   4   4   4   5   5   6   7
  O   7   6   6   5   5   4   5   6   5   6
  T   8   7   7   6   6   5   4   5   6   6
  R   9   8   8   7   7   6   5   4   5   6
  O  10   9   9   8   8   7   6   5   4   5
  N  11  10  10   9   9   8   7   6   5   4
The bottom-right cell gives the distance: 4
Recursive & lazy style
    def lev(a, b):
        # An empty string is len(other) insertions away from the other one
        if not a:
            return len(b)
        if not b:
            return len(a)
        return min(lev(a[1:], b[1:]) + (a[0] != b[0]),  # substitution
                   lev(a[1:], b) + 1,                   # deletion
                   lev(a, b[1:]) + 1)                   # insertion
Most programming languages do not deal well with such code: without memoization, the same sub-problems are recomputed many times
Imperative & eager style
    def lev(s, t):
        # Pad both strings so that index 0 stands for the empty prefix
        s, t, d = ' ' + s, ' ' + t, {}
        for i in range(len(s)):
            d[i, 0] = i
        for j in range(len(t)):
            d[0, j] = j
        for j in range(1, len(t)):
            for i in range(1, len(s)):
                if s[i] == t[j]:
                    d[i, j] = d[i-1, j-1]
                else:
                    d[i, j] = min(d[i-1, j], d[i, j-1], d[i-1, j-1]) + 1
        return d[len(s)-1, len(t)-1]
Can be better, no need to store the complete matrix
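One way to avoid storing the complete matrix is to keep only the previous row and the row being filled, since each cell depends only on its left, upper and upper-left neighbours. The sketch below is one possible version under that assumption; the name `lev_two_rows` is hypothetical.

```python
def lev_two_rows(s: str, t: str) -> int:
    """Levenshtein distance using two rows instead of the full matrix."""
    # Row 0: distance between the empty prefix of s and each prefix of t
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        # First cell: distance between s[:i] and the empty prefix of t
        current = [i]
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                current.append(previous[j - 1])
            else:
                current.append(1 + min(previous[j],       # deletion
                                       current[j - 1],    # insertion
                                       previous[j - 1]))  # substitution
        previous = current
    return previous[-1]


print(lev_two_rows("CYCLOTRON", "SYNCHROTRON"))  # prints 4
```

Memory use drops from one cell per pair of prefixes to two rows of length len(t) + 1, while the running time stays proportional to len(s) × len(t).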
Damerau–Levenshtein distance
The Damerau–Levenshtein distance is like the Levenshtein distance, with one more edit operation
1 insertion of a character
2 deletion of a character
3 substitution of a character
4 transposition of 2 adjacent characters
Computing Damerau–Levenshtein distance
It works almost like the Levenshtein distance. The matrix gets one extra border row and column, so that $d_{i-2,j-2}$ is always defined
              x1  x2  ...  xn
      0   0   1   2   ...  n
      0   0   1   2   ...  n
  y1  1   1
  y2  2   2
  ...               d_{i,j}
  ym  m   m
$d_{i,j}$ is the distance between the prefixes $x_1 x_2 \dots x_i$ and $y_1 y_2 \dots y_j$
If $x_i = y_j$: $d_{i,j} = d_{i-1,j-1}$
If $x_i \neq y_j$: $d_{i,j} = 1 + \min(d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1})$
In addition, if the last 2 characters are swapped ($x_i = y_{j-1}$ and $x_{i-1} = y_j$), a transposition is possible: $d_{i,j} = \min(d_{i,j}, 1 + d_{i-2,j-2})$
This is the restricted variant of the distance, also known as optimal string alignment
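A minimal sketch of this computation is given below. It implements the restricted variant of the Damerau–Levenshtein distance (often called optimal string alignment), in which the $d_{i-2,j-2}$ term is only considered when the last two characters of both prefixes are swapped. The function name `osa_distance` is hypothetical, and the bounds check `i > 1 and j > 1` replaces the extra border row and column of the matrix.

```python
def osa_distance(s: str, t: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    n, m = len(s), len(t)
    # d[i][j] is the distance between the prefixes s[:i] and t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition: the last two characters are swapped
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


print(osa_distance("THOMAS", "TOHMAS"))  # prints 1
```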
Robustness
The Damerau–Levenshtein distance differentiates the following strings
THOMAS ⇒ TOHMAS
THOMAS ⇒ THOMASS
THOMAS ⇒ TOMAS
Each of them is at distance 1 from the string THOMAS
Those differences are likely due to typing errors!
THOMAS
TOHMAS
THOMASS
TOMAS
The Damerau–Levenshtein distance might make our algorithms sensitive to noise, if used for typed things
Jaro distance
A distance specialized for name records: the Jaro distance
$d(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right)$
• d(s1, s2) = 1 means exact match
• d(s1, s2) = 0 means no similarity
• m is the number of matching characters
• t is half the number of transpositions
2 characters from s1 and s2 are matching if they are equal and their positions are not farther apart than $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$
THOMAS
TOHMAS
6 matching characters, m = 6
Matching characters are transposed if they do not appear in the same order in s1 and s2, within the same window of $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ positions
t is equal to half the number of transposed characters
THOMAS
TOHMAS
Mismatched characters H/O and O/H, $t = \frac{2}{2} = 1$
Jaro distance for THOMAS and TOHMAS is $\frac{1}{3} \left( \frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6} \right) = 0.944$
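Compute which characters are matching and which are not, here for s1 = DIXON (columns) and s2 = DICKSONX (rows)
      D  I  X  O  N
   D  1  0  0  0  0
   I  0  1  0  0  0
   C  0  0  0  0  0
   K  0  0  0  0  0
   S  0  0  0  0  0
   O  0  0  0  1  0
   N  0  0  0  0  1
   X  0  0  0  0  0
• |s1| = 5, |s2| = 8
• match window is 3
• m = 4, t = 0
• distance is $\frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) = \frac{1}{3} \left( \frac{4}{5} + \frac{4}{8} + \frac{4}{4} \right) = 0.767$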
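The whole procedure (match window, matching characters, transpositions, final formula) can be sketched as below. This follows the common formulation of the Jaro measure, where transposed characters are the matching characters that appear in a different order in the two strings; the function name `jaro` and the handling of empty strings are assumptions. With m = 4, t = 0, |s1| = 5 and |s2| = 8, the formula evaluates to about 0.767 for DIXON and DICKSONX.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro measure: 1.0 means exact match, 0.0 means no similarity."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings count as identical
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        # Look for an unused equal character of s2 inside the match window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matching characters, in their order of appearance in each string
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    # t is half the number of matching characters that are out of order
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(round(jaro("THOMAS", "TOHMAS"), 3))   # prints 0.944
print(round(jaro("DIXON", "DICKSONX"), 3))  # prints 0.767
```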
The Jaro distance is an example of how to deal with noise in data ⇒ a distance that treats as identical 2 strings that are likely to be the same thing, but with some typing errors
Distance between texts
A simple and popular way to measure distance between large texts for data-mining ⇒ bag of words
• Find a large list L of common words likely to be in texts A and B.
• Build vectors X(A) and X(B), where $X(A)_i$ is the number of occurrences of the word $L_i$ in A.
• The similarity is the dot product $X(A) \cdot X(B)$: the larger it is, the closer the texts
Used for spam-filtering for instance
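The steps above can be sketched as follows. The word list, the sample texts and the tokenization rule (lowercase runs of letters) are all made up for the illustration; a real word list L would be much larger.

```python
import re
from collections import Counter


def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Vector of occurrence counts of each vocabulary word in the text."""
    # Assumption: words are runs of letters, compared in lowercase
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocabulary]


def dot(x: list[int], y: list[int]) -> int:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical word list L and texts
vocabulary = ["free", "money", "meeting", "report"]
text_a = "Free money! Claim your free money now"
text_b = "Get free money today"
text_c = "The meeting report is attached"

x_a = bag_of_words(text_a, vocabulary)
x_b = bag_of_words(text_b, vocabulary)
x_c = bag_of_words(text_c, vocabulary)
print(dot(x_a, x_b))  # prints 4
print(dot(x_a, x_c))  # prints 0
```

The dot product grows with the number of shared words, so texts A and B score higher together than A and C.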
3 Data semantic
UFO sightings data
Let’s consider geographical data : UFO sightings
The raw data looks like this
description   ufo
location      Austin, Texas, USA
sight date    20020804
shape         circle
InfoChimp UFO dataset ⇒ 60000 entries !
Using a string distance for the date string would be silly
d(20020804, 20050804) = 1
d(20020804, 20020704) = 1
d(20020804, 20020814) = 1
One year, one month or ten days are not the same thing !
Using the numeric difference for the date string would be silly too
d(20020804, 20050321) = 29517
Dates are not a single base 10 number !
We need to convert dates into single numbers
description   ufo
location      Austin, Texas, USA
sight date    1028419200
shape         circle
We can convert them to Unix time, the number of seconds since January 1, 1970 UTC (beware, many UFO sightings happened before 1970 and get negative values)
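A minimal sketch of the conversion, assuming the dates are `YYYYMMDD` strings taken at midnight UTC (the function name `date_to_timestamp` is hypothetical). Sightings before 1970 simply get negative timestamps, which still subtract correctly.

```python
from datetime import datetime, timezone


def date_to_timestamp(yyyymmdd: str) -> int:
    """Seconds since 1970-01-01 00:00 UTC for a YYYYMMDD date string."""
    # Assumption: the sighting date is taken at midnight UTC
    day = datetime.strptime(yyyymmdd, "%Y%m%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp())


print(date_to_timestamp("20020804"))  # prints 1028419200

# Ten days apart is now a difference of ten days in seconds
print(date_to_timestamp("20020814") - date_to_timestamp("20020804"))  # prints 864000
```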
Using string distance for location names would be silly
d(SUZHOU , FUZHOU) = 1
d(HEFEI ,HEBEI ) = 1
We need to convert locations into positions
description          ufo
location latitude    30.25
location longitude   97.75
sight date           1028419200
shape                circle
Geo–location services can convert this to geo–coordinates
We will consider just the location and the time
location latitude    30.25
location longitude   97.75
sight date           1028419200
Can we use Euclidean distance now, to cluster those data?
Spherical coordinates
The locations are spherical coordinates
[Figure: a point A on a sphere, with coordinates (r, φ, λ)]
Spherical coordinates are angles, they do not work like the usual Cartesian coordinates
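Distance between 2 spherical coordinates $(\phi_a, \lambda_a)$, $(\phi_b, \lambda_b)$ is not the usual Euclidean distance
$r \arccos(\sin\phi_a \sin\phi_b + \cos\phi_a \cos\phi_b \cos(\lambda_b - \lambda_a))$
Use the Vincenty formula to actually compute this, as the arccos form loses precision for nearby points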
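As an illustration only, the arccos formula above can be written directly in Python. This sketch is not the Vincenty formula; it assumes a perfect sphere, latitudes and longitudes given in degrees, and a mean Earth radius of 6371 km as the default value of the hypothetical `radius` parameter.

```python
import math


def great_circle_distance(lat_a, lon_a, lat_b, lon_b, radius=6371.0):
    """Distance along the sphere between two points given in degrees."""
    phi_a, lam_a, phi_b, lam_b = map(math.radians,
                                     (lat_a, lon_a, lat_b, lon_b))
    cos_angle = (math.sin(phi_a) * math.sin(phi_b)
                 + math.cos(phi_a) * math.cos(phi_b)
                 * math.cos(lam_b - lam_a))
    # Rounding errors can push the value slightly outside [-1, 1]
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)


# A quarter of the equator on a unit sphere is pi / 2
print(great_circle_distance(0.0, 0.0, 0.0, 90.0, radius=1.0))
```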
4 Perceptive models
Picture segmentation
Using the k-means algorithm, one can segment a picture into similar areas
Colors are [r, g, b] triplets, so we can use the Euclidean distance, right?
Color perception
Electronic sensors record color information as 3 signals
[Figure: the three channels red R, green G, blue B]
Color perception
Other color spaces, like YUV, separate luminance and chrominance
[Figure: the three channels luminance Y', chrominance U, chrominance V]
Color interpolation
Let’s interpolate 2 colors $C_a$ and $C_b$, using different color spaces
$C_\alpha = C_a + \alpha (C_b - C_a), \quad \alpha \in [0, 1]$
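The interpolation formula can be sketched component by component as below. The function name `interpolate` and the representation of a color as a tuple of floats are assumptions; the same function applies whatever color space the components come from.

```python
def interpolate(ca, cb, alpha):
    """Color at fraction alpha of the way from ca to cb, alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Apply C_a + alpha * (C_b - C_a) to each component
    return tuple(a + alpha * (b - a) for a, b in zip(ca, cb))


black = (0.0, 0.0, 0.0)
orange = (1.0, 0.5, 0.0)
print(interpolate(black, orange, 0.5))  # prints (0.5, 0.25, 0.0)
```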
Interpolating colors in different color spaces gives different results
Some color spaces introduce new colors when interpolating from one color to another!
Color perception
When you choose to represent colors in a given color space, you choose
• which colors are alike and which colors are very different
• a color "neighbourhood"
YUV color space
The YUV color space has interesting properties
• color interpolation in YUV looks perceptually more correct
• the human eye is much more sensitive to luminance Y' than to chrominance UV
RGB and YUV color spaces
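The transformation from RGB to YUV is linear
$\begin{pmatrix} Y' \\ U \\ V \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.14713 & -0.28886 & 0.436 \\ 0.615 & -0.51499 & -0.10001 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}$
R, G and B are in the [0, 1] range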
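The sketch below applies this matrix and compares Euclidean distances in both color spaces. The names `RGB_TO_YUV`, `rgb_to_yuv` and `euclidean` are hypothetical, and the three test colors are chosen only for illustration: pure red, green and blue are all equally far apart in RGB, but not in YUV.

```python
import math

# Coefficients of the RGB to YUV matrix given on the slide
RGB_TO_YUV = (
    (0.299, 0.587, 0.114),
    (-0.14713, -0.28886, 0.436),
    (0.615, -0.51499, -0.10001),
)


def rgb_to_yuv(rgb):
    """Convert an (R, G, B) triplet in [0, 1] to (Y', U, V)."""
    return tuple(sum(w * c for w, c in zip(row, rgb)) for row in RGB_TO_YUV)


def euclidean(p, q):
    """Euclidean distance between two triplets."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# In RGB, red is exactly as far from green as from blue
print(euclidean(red, green), euclidean(red, blue))

# In YUV, the two distances are no longer equal
print(euclidean(rgb_to_yuv(red), rgb_to_yuv(green)),
      euclidean(rgb_to_yuv(red), rgb_to_yuv(blue)))
```

A clustering algorithm such as k-means would therefore group the same pixels differently depending on the color space in which the distance is computed.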
RGB and YUV color spaces represent the same thing: colors. But
• RGB is the signal as it comes out of the sensors
• YUV takes into account human perception of color
Some color spaces like LAB are even better models for human color perception
Human perception
There are perceptive models for
• colors
• shape
• depth
• sounds
• speech
• motion
• . . .
A correct data-mining approach can return completely meaningless results without a perceptive model
In speech recognition, it is essential, just to obtain a practically usable system