Natural Language Processing
Levenshtein Edit Distance (LED) & Skip Trie Matching (STM)
Vladimir Kulyukin
www.vkedco.blogspot.com
Outline
● Levenshtein Edit Distance (LED)
– Definition
– Recursive Computation
– Dynamic Programming Computation
● Skip Trie
– Background
– Trie & Skip Trie
– Skip Trie Matching
Levenshtein Edit Distance
Minimum Edit Distance
● Suppose we have two strings: source and target
● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target
● Each operation has a cost
● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target
Strings as Prefix Sequences
Any string can be viewed as a sequence of prefixes
1. s = '', then the prefix sequence is <''>
2. s = 'a', then the prefix sequence is <'', 'a'>
3. s = 'ab', then the prefix sequence is <'', 'a', 'ab'>
In general, if s = c1c2...cn, then the prefix sequence is <'', 'c1', 'c1c2', ..., s>
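The prefix view can be sketched in a few lines of Python. The function name prefix_sequence is a hypothetical helper introduced only for this illustration.

```python
def prefix_sequence(s):
    """Return every prefix of s, from the empty string up to s itself."""
    # s[:k] is the prefix made of the first k characters of s.
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence("ab"))  # ['', 'a', 'ab']
```

A string of length n has n + 1 prefixes, which is why the cost tables on the following slides have one more row and column than the strings have characters.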
Definition
● Levenshtein edit distance (LED) is a metric, one of the best known, that measures similarity between two character sequences
● The metric is named after Vladimir Levenshtein who discovered this metric in 1965
● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) to transform source to target
Edit Operations (AKA Edits)
● The standard edit operations, aka edits, are insertion, deletion, & substitution
● Assume pt and ps are legal positions in target and source, respectively
● Insertion – a character at position pt in target is inserted into source at position ps
● Deletion – a character is deleted from source at position ps
● Substitution – a character at position pt in target is substituted for a character at position ps in source
Edit Costs
● The standard edit operations have associated costs
● The costs are application dependent, and are typically positive integers
● For example, the costs of insertion, deletion, and substitution can all be set to 1
● In some contexts, substitution is set to 2 (substitution can be viewed as insertion and deletion)
String Transformation Cost
CT(s1, s2) = numerical cost of transforming source string s1 to target string s2
Tabulating Transformation Costs
TARGET characters '', c1, c2, …, cn label the columns and SOURCE characters '', c1, c2, …, cm label the rows. Each cell holds the cost of transforming a source prefix to a target prefix:
– Row '': CT('', ''), CT('', 'c1'), CT('', 'c1c2'), CT('', 'c1c2c3'), …
– Column '': CT('', ''), CT('c1', ''), CT('c1c2', ''), CT('c1c2c3', ''), …
– Bottom-right cell: CT('c1...cm', 'c1...cn')
Transforming Empty Source to Target
TARGET      ''   c1   c2   c3   c4   c5   …   cn
SOURCE ''   0    i1   i2   i3   i4   i5   …   in
The only way to transform empty source to some target is to insert 0 or more characters into it (ik is the cost of inserting k characters)
Transforming Source to Empty Target
SOURCE   TARGET ''
''       0
c1       d1
c2       d2
c3       d3
…
cm       dm
The only way to transform some source to empty target is to delete 0 or more corresponding characters from it (dk is the cost of deleting k characters)
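The two border cases above can be sketched as follows. The function name init_cost_table is hypothetical, and the sketch assumes a uniform per-character cost, so that ik = k * ins_cost and dk = k * del_cost.

```python
def init_cost_table(source, target, ins_cost=1, del_cost=1):
    """Build the CT table with only row 0 and column 0 filled in."""
    m, n = len(source), len(target)
    # One extra row and column for the empty prefix ''.
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0: empty source -> target prefix, insertions only.
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    # Column 0: source prefix -> empty target, deletions only.
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
    return ct


print(init_cost_table("", "ab"))  # [[0, 1, 2]]
```

The interior cells are left at 0 here; they are filled by the recurrence covered in the dynamic programming part of the deck.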
Examples
Example 01
Let insertion cost = deletion cost = substitution cost = 1.
Let source = '' and target = 'ab'.
How can we transform source to target?
The table has a single row, filled in from left to right:
       ''   a    b
''     0    1    2
- insert 'a' at position 1 in source at cost 1;
- insert 'b' at position 2 in source at cost 1;
So, LED('', 'ab') = 2.
Example 02
Let insert cost = delete cost = substitute cost = 1. Let source = 'ab' and target = ''.
How can we transform source to target?
- Delete 'a' at position 1 in source at cost 1;
- Delete 'b' at position 2 in source at cost 1;
So, LED('ab', '') = 2.
Example 03
Let insert cost = delete cost = substitute cost = 1. Let source = 'abc' and target = 'ac'.
- match 'a' at position 1 in source with 'a' at position 1 in target at cost 0;
- delete 'b' at position 2 in source at cost 1;
- match 'c' at position 3 in source with 'c' at position 2 in target at cost 0.
So, LED('abc', 'ac') = 1.
Recursive LED Algorithm
Specification
LevEdDist(source, target, ins_cost, del_cost, sub_cost)
- source – source string
- target – target string
- ins_cost – cost of insertion
- del_cost – cost of deletion
- sub_cost – cost of substitution
LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns a sequence of edits that converts source to target, together with the Levenshtein distance, i.e., the total cost of those edits
Pseudo Code
LED(source_str, target_str, edit_ops, edit_cost, ins_cost=1, del_cost=1, sub_cost=1):
    #1. compute lengths of source and target strings
    target_len, source_len = len(target_str), len(source_str)
    #2. edit_ops is a list of edit operations that would otherwise be
    #   destructively modified, so work on a copy
    edit_ops_copy = copy(edit_ops)
    if source_len == 0:
        #3. if source is empty, insert all target characters into it
        for c in target_str:
            edit_ops_copy.append(new InsOper(c, ins_cost))
        return edit_cost + target_len * ins_cost, edit_ops_copy
    if target_len == 0:
        #4. if target is empty, delete all characters from source
        for c in source_str:
            edit_ops_copy.append(new DelOper(c, del_cost))
        return edit_cost + source_len * del_cost, edit_ops_copy
Recursion
● If the character at position source_len-1 in source is the same as the character at position target_len-1 in target, set the current cost to 0 (this is a character match, which can be viewed as substituting the source character with the identical target character)
● Match is a zero-cost substitution
● If these characters are not the same, compute the costs of deletion, insertion and substitution, and choose the minimum cost
Pseudo Code: Three Recursive Calls
// choose deletion and recurse
dc_cost, dc_edit_ops = LED(source_str[0:source_len-1], target_str, edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
// choose insertion and recurse
ic_cost, ic_edit_ops = LED(source_str, target_str[0:target_len-1], edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
// choose substitution and recurse
sc_cost, sc_edit_ops = LED(source_str[0:source_len-1], target_str[0:target_len-1], edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
Pseudo Code: Choosing Minimal Edit Sequence
if min_cost == dc_cost:
    edit_ops_copy = copy(dc_edit_ops)
    // add a new delete operator
    edit_ops_copy.append(new DelOper(source_str[source_len-1], del_cost))
else if min_cost == ic_cost:
    edit_ops_copy = copy(ic_edit_ops)
    // add a new insertion operator
    edit_ops_copy.append(new InsOper(target_str[target_len-1], ins_cost))
else if min_cost == sc_cost:
    edit_ops_copy = copy(sc_edit_ops)
    if target_str[target_len-1] == source_str[source_len-1]:
        // if the characters are the same, then there is a match
        edit_ops_copy.append(new MatchOper(target_str[target_len-1], source_str[source_len-1], 0))
    else:
        edit_ops_copy.append(new SubOper(target_str[target_len-1], source_str[source_len-1], sub_cost))
else:
    edit_ops_copy = copy(edit_ops)
// recompute the total cost from the chosen edit sequence
min_cost = compute the cost of edit ops in edit_ops_copy
return min_cost, edit_ops_copy
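A compact runnable version of the recursive idea might look like the sketch below. It is not the deck's code: the name led_recursive and the tuple encoding of edits are assumptions made for this illustration, and each candidate's own edit cost is added before the minimum is taken, so the choice stays correct when the three costs differ.

```python
def led_recursive(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) for transforming source into target."""
    if not source:  # empty source: insert every target character
        return len(target) * ins_cost, [("ins", ch) for ch in target]
    if not target:  # empty target: delete every source character
        return len(source) * del_cost, [("del", ch) for ch in source]

    s_last, t_last = source[-1], target[-1]
    costs = (ins_cost, del_cost, sub_cost)
    d_cost, d_ops = led_recursive(source[:-1], target, *costs)
    i_cost, i_ops = led_recursive(source, target[:-1], *costs)
    s_cost, s_ops = led_recursive(source[:-1], target[:-1], *costs)

    if s_last == t_last:  # a match is a zero-cost substitution
        diagonal = (s_cost, s_ops + [("match", s_last)])
    else:
        diagonal = (s_cost + sub_cost, s_ops + [("sub", s_last, t_last)])
    candidates = [
        diagonal,
        (d_cost + del_cost, d_ops + [("del", s_last)]),
        (i_cost + ins_cost, i_ops + [("ins", t_last)]),
    ]
    # On ties, min() keeps the first candidate, so match/substitution wins.
    return min(candidates, key=lambda pair: pair[0])


print(led_recursive("abc", "ac"))
# (1, [('match', 'a'), ('del', 'b'), ('match', 'c')])
```

The three recursive calls recompute the same prefix pairs many times, so the running time grows exponentially with the string lengths; that repetition is what the tabulated computation removes.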
LED Computation with Dynamic Programming
Computing CT(r, c)
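1. Construct an (m+1) x (n+1) table CT, where m = len(source) and n = len(target); row 0 and column 0 correspond to the empty prefix ''
2. Fill row 0
3. Fill column 0
4. Then, for r ≥ 1 and c ≥ 1, CT[r, c] = min{ CT[r-1, c-1] + sub_cost (or + 0 if the two characters match), CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost }
5. CT[m, n] is the final (and minimal!) cost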
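The steps above can be sketched in Python as follows. The function name led_dp is hypothetical, and the table is sized (m+1) x (n+1) so that row 0 and column 0 can stand for the empty prefix.

```python
def led_dp(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the Levenshtein edit distance using a full cost table."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):  # row 0: insertions only
        ct[0][c] = c * ins_cost
    for r in range(1, m + 1):  # column 0: deletions only
        ct[r][0] = r * del_cost
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            # A match is a zero-cost substitution.
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(
                ct[r - 1][c - 1] + diag,   # substitute or match
                ct[r - 1][c] + del_cost,   # delete from source
                ct[r][c - 1] + ins_cost,   # insert into source
            )
    return ct[m][n]


print(led_dp("abc", "ac"))          # 1
print(led_dp("kitten", "sitting"))  # 3
```

Each cell is computed once from three already-filled neighbours, so the table is filled in O(m*n) time for a single pair of strings.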
Side Notes
● LED is a minimal distance
● LED is a correct minimal distance
● LED can be computed with only 2 rows of the table
● An optimal sequence of edits can be recovered from the CT table
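The last two notes can be illustrated with a short sketch. Both function names and the tuple encoding of edits are assumptions of this example: led_two_rows keeps only the previous and current rows, while recover_edits builds the full table and walks back from CT[m, n] to CT[0, 0].

```python
def led_two_rows(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the distance while storing only two rows of the table."""
    prev = [c * ins_cost for c in range(len(target) + 1)]  # row 0
    for r, s_ch in enumerate(source, start=1):
        curr = [r * del_cost]  # column 0 of the current row
        for c, t_ch in enumerate(target, start=1):
            diag = 0 if s_ch == t_ch else sub_cost
            curr.append(min(prev[c - 1] + diag,
                            prev[c] + del_cost,
                            curr[c - 1] + ins_cost))
        prev = curr
    return prev[-1]


def recover_edits(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) by backtracking through the full table."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
        for c in range(1, n + 1):
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(ct[r - 1][c - 1] + diag,
                           ct[r - 1][c] + del_cost,
                           ct[r][c - 1] + ins_cost)
    edits, r, c = [], m, n
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            same = source[r - 1] == target[c - 1]
            diag = 0 if same else sub_cost
            if ct[r][c] == ct[r - 1][c - 1] + diag:
                edits.append(("match" if same else "sub",
                              source[r - 1], target[c - 1]))
                r, c = r - 1, c - 1
                continue
        if r > 0 and ct[r][c] == ct[r - 1][c] + del_cost:
            edits.append(("del", source[r - 1]))
            r -= 1
        else:  # the cell can only have come from an insertion
            edits.append(("ins", target[c - 1]))
            c -= 1
    edits.reverse()
    return ct[m][n], edits


print(led_two_rows("kitten", "sitting"))  # 3
print(recover_edits("abc", "ac"))
# (1, [('match', 'a', 'a'), ('del', 'b'), ('match', 'c', 'c')])
```

The two-row version saves memory but discards the cells needed for backtracking, so recovering the edit sequence this way requires the full table.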
Skip Trie & Skip Trie Matching
Motivation
● According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970
● Mismanaged diets are estimated to account for 30-35% of cancer and diabetes cases
● A major contributor to the increased caloric intake is the consumer's inability (and sometimes unwillingness) to read & understand nutrition labels
● Nutrition information is rarely available to blind and visually impaired individuals
Critical Barriers
● Manual nutrition intake recording is time-consuming and error-prone, especially on smartphones
● Automated, real-time nutrition information extraction & analysis is weak or nonexistent
● Nutrition decision support
– is not context-sensitive;
– does not couple consumers with dieticians;
– is not integrated with PHRs or ODLs
Persuasive NUTrition Management System (PNUTS)
R&D Road to PNUTS
RoboCart (2003-05), ShopTalk (2006-08), ShopMobile I (2008-10), ShopMobile II (2010-12), PNUTS (2013-Now)
PNUTS Architecture
[Figure: PNUTS architecture. Labeled components: Nutritionist, Coach, Cloud, Consumer/Patient, Inference Engine, OCR, Image Analysis]
Vision-Based Nutrition Information Extraction in PNUTS
[Figure: vision-based extraction pipeline. Labels: Image, Nutrition Label Localizer, Table, Line Segmentor, Lines, OCR, TEXT]
OCR Engine Accuracy Evaluation
● Two hundred images of nutrition label text chunks
● Three categories used to categorize accuracy:
– Complete: OCRed characters are identical to image text
– Partial: at least one OCRed character is missing or misrecognized
– Garbled: either empty string is returned or all OCRed characters are misrecognized
OCR Engine Accuracy
                      Complete      Partial      Garbled
Tesseract on Device   146 (73%)     36 (18%)     18 (9%)
GOCR on Device        42 (21%)      23 (11.5%)   135 (67.5%)
Tesseract on Server   158 (79%)     23 (11.5%)   19 (9.5%)
GOCR on Server        58 (28.99%)   56 (28%)     90 (45%)
OCR Engine Speed in Milliseconds
                      Run 1    Run 2    Run 3    Run 4    Run 5    AVG/Sample   AVG/Image
Tesseract on Device   128238   101438   101643   109678   103205   110439.6     552.1
GOCR on Device        50349    47746    48964    52450    48247    49019.6      245
Tesseract on Server   38958    38061    37850    9891     39032    38289.6      191
GOCR on Server        21253    20842    20195    21182    20520    20763.3      103.8
OCR Error Types
● Error Classification (Kukich 1992)
– Non-words: 'polassium' vs. 'potassium'
– Real-words: 'fats' vs. 'facts'
● State of the Art Error Correction:
– N-Gram
– Levenshtein Edit Distance (LED)
– Both algorithms are implemented in Apache Lucene
Big O Analysis
● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input
● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup
Skip Trie Matching
Trie Data Structure
● Tries are popular on mobile platforms for word completion due to space efficiency
● Worst-case lookup is O(n) where n is the length of the input string
● Efficient storage compared to hash table
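A minimal trie can be sketched as below. The class and method names (TrieNode, Trie, insert, contains) are chosen for this illustration and are not taken from the deck or from any particular library.

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a dictionary word ends here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            # Words with a common prefix share the same path of nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:  # at most one step per input character
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie(["ACID", "ACIDS", "FAT"])
print(trie.contains("ACID"))  # True
print(trie.contains("ACI"))   # False
```

The lookup loop takes one step per input character, which is where the O(n) worst case comes from, and the shared prefixes ('ACID' and 'ACIDS' reuse four nodes) are the source of the storage savings.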
Skip Trie Matching
● Skip Trie Matching (STM) algorithm is based on the idea that the trie data structure can be used to find closest dictionary matches to misspelled words
● It is assumed that the dictionary of words is stored as a trie
● The only parameter in STM is the skip distance – a non-negative integer that defines the maximum number of misrecognized characters allowed in a misspelled word
STM Basic Steps
● Process the input string character by character
● At the trie's current node, find the child character that matches the input's current character
● If a match is found, recurse to that node and consume the input's character
● If no match is found, recurse on each child node after incrementing the skip distance and without consuming the input's current character
● Details and pseudocode are in Kulyukin, Vanka, and Wang (2013), listed in the References
STM Example
Suppose that the OCR engine recognizes the string 'ACID' as 'ACIR' and the trie dictionary has the word 'ACID' as a character path.
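The 'ACIR' example can be traced with the simplified sketch below. It is not the published STM pseudocode: the names build_trie, skip_trie_match and END are hypothetical, the trie is a nested dict, and the sketch assumes that a skipped position stands in for one input character, so every suggestion has the same length as the input. That assumption differs from the "without consuming" wording of the basic steps and was chosen to keep the sketch short.

```python
END = "$"  # hypothetical end-of-word marker stored in the nested dicts


def build_trie(words):
    """Store the dictionary as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def skip_trie_match(node, text, max_skips, pos=0, skips=0, prefix=""):
    """Return dictionary words that differ from text in at most max_skips positions."""
    if pos == len(text):
        return [prefix] if END in node else []
    ch = text[pos]
    if ch != END and ch in node:
        # Greedy step: on a match, follow only the matching child.
        return skip_trie_match(node[ch], text, max_skips,
                               pos + 1, skips, prefix + ch)
    if skips == max_skips:
        return []  # skip budget used up
    suggestions = []
    for child_ch, child in node.items():
        if child_ch == END:
            continue
        # Assume the input character was misrecognized: try every child.
        suggestions += skip_trie_match(child, text, max_skips,
                                       pos + 1, skips + 1, prefix + child_ch)
    return suggestions


trie = build_trie(["ACID", "ACED", "FACT"])
print(skip_trie_match(trie, "ACIR", max_skips=1))  # ['ACID']
```

For 'ACIR', the characters 'A', 'C', 'I' are matched along the path; 'R' has no matching child, so the single child 'D' is tried at the cost of one skip, which yields 'ACID'.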
Back to Big O Analysis
● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input
● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup
● STM – O(n log|Σ|), where n is the size of the input and |Σ| is the size of the alphabet
LED, N-Gram, STM Accuracy & Speed
The results in the table below were obtained on a sample of 600 texts OCRed with Tesseract

                             STM    N-Gram   LED
Run Time (in milliseconds)   20     51       51
Recall                       15%    9%       8%
STM Limitations
● Since STM is greedy, it cannot find all possible suggestions (not a limitation if a vocabulary is limited but a limitation in general)
● Current implementation finds matches only of the same length as the misspelled input
● STM cannot correct real-word errors
Conclusions
● On the tested samples OCRed with Tesseract, STM ran faster and was more accurate than Apache Lucene's implementations of N-Gram & LED
● On the tested samples, Tesseract was more accurate than GOCR
● On the tested samples, GOCR ran faster than Tesseract
References
1. Levenshtein, V. "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Soviet Physics Doklady, Vol. 10, 1966, pp. 707–10.
2. Kukich, K. "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.
3. Kulyukin, V., Vanka, A., Wang, H. "Skip Trie Matching: A Greedy Algorithm for Real-Time OCR Error Correction on Smartphones." International Journal of Digital Information and Wireless Communication (IJDIWC), 3(3): 56-65, 2013. ISSN: 2225-658X.