32nd Internationalization and Unicode Conference San José, CA, September 2008
Matching, Sorting, and Searching with Unicode Text
Eric R. Mader, Michael Ow
IBM Globalization Architecture and Technology
Topics
➢ Matching
–Are two characters the same?
➢ Sorting
–Proper ordering of characters
➢ Searching
–Finding the desired pattern in a given text
Different Ways to Implement
➢ “Non-Globalized Way”
–Concern with individual language only
–Using codepages that are language specific or limited
➢ “Globalized Way”
–Using Unicode
–Using Unicode Collation Algorithm
–Consideration for all languages
Non-Globalized Way
–Matching using the individual code points
•a (0x61) = a (0x61)
* using US-ASCII
Non-Globalized Way, cont.
–Sorting can also be done by comparing code points or a simple mapping of the code points (e.g. in EBCDIC)
Unsorted
Character  Code Point
b          0x62
a          0x61
d          0x64
c          0x63
Sorted
Character  Code Point
a          0x61
b          0x62
c          0x63
d          0x64
*Using US-ASCII
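A minimal Python sketch of the code point sort shown in the table above. The variable names are illustrative only; `ord` returns a character's code point, which for these letters is the same as its US-ASCII value.

```python
# Sort the four letters from the table purely by code point.
letters = ["b", "a", "d", "c"]
by_code_point = sorted(letters, key=ord)

for ch in by_code_point:
    print(ch, hex(ord(ch)))

# The same rule puts every uppercase letter before every lowercase one,
# because "B" is 0x42 and "a" is 0x61.
mixed = sorted(["a", "B"], key=ord)
```

This works for a single unaccented alphabet in one case, which is the limitation the later slides address.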
Non-Globalized Way, cont.
➢ Searching
–Many different algorithms in string search*
–Searching of text done through analysis of character code points only
*algorithms can be globalized
Basics of String Search Algorithms
–There are many ways to search through text:
•Linear Search
•Boyer-Moore
•Quick Search
* There are many more, but because of time constraints we discuss only the common ones listed above.
Simple String Search Algorithm
➢ Linear Search
–Brute force search
–Check every character against pattern
–Very slow
–No preprocessing time
–Performance: O (m * n)
* m is the size of the pattern
* n is the size of the text
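The brute-force search described above can be sketched in a few lines of Python. The function name `linear_search` is an illustrative choice, not something from the slides.

```python
def linear_search(text, pattern):
    """Return the index of the first match, or -1 if there is none.

    No preprocessing; worst case O(m * n) character comparisons.
    """
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = 0
        # Compare the pattern against the text at this position.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


position = linear_search("Cloverfield", "field")
```

With the example used in the following slides, the pattern is tried at positions 0 through 5 before the match is found at position 6.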
Example of Linear Search
–Text: “Cloverfield”
–Pattern: “field”
–The pattern is tried at each position in turn, moving one character to the right after every mismatch:
Cloverfield
field
Cloverfield
 field
Cloverfield
  field
Cloverfield
   field
Cloverfield
    field
Cloverfield
     field
Cloverfield
      field
Match Found
Better String Search Algorithm
➢ Boyer-Moore
–Intelligently skips characters in the text based on the pattern
–Very fast
–Preprocessing time: O(m + |∑|)
–Performance: O(n/m) in the best case
* m is the size of the pattern
* n is the size of the text
Example of Boyer-Moore Search
–Text: “Cloveldfield”
–Pattern: “field”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       4
i       3
e       2
l       1
other   5
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -d       5
2             -ld      5
3             -eld     5
4             -ield    5
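The two tables above can be reproduced with a short Python sketch. This is one straightforward way to build them, not necessarily how the presenters did it: the good suffix table is computed by brute force (the simple form of the rule, without the usual linear-time preprocessing), and all function names are illustrative.

```python
def bad_character_shifts(pattern):
    """Shift for each letter: distance from its last occurrence
    (ignoring the final position) to the end of the pattern.
    Letters not in the table use the "other" shift, len(pattern)."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def good_suffix_shifts(pattern):
    """shifts[k] = smallest shift that keeps the last k matched
    characters consistent with the pattern (brute force)."""
    m = len(pattern)
    shifts = []
    for matched in range(m):
        for shift in range(1, m + 1):
            if all(j - shift < 0 or pattern[j - shift] == pattern[j]
                   for j in range(m - matched, m)):
                shifts.append(shift)
                break
    return shifts


def boyer_moore_search(text, pattern):
    """Return the index of the first match, or -1."""
    n, m = len(text), len(pattern)
    bad = bad_character_shifts(pattern)
    good = good_suffix_shifts(pattern)
    start = 0
    while start <= n - m:
        matched = 0
        # Compare right to left.
        while matched < m and pattern[m - 1 - matched] == text[start + m - 1 - matched]:
            matched += 1
        if matched == m:
            return start
        mismatched = text[start + m - 1 - matched]
        # Take the larger of the two suggested shifts.
        start += max(good[matched], bad.get(mismatched, m) - matched)
    return -1


bad_table = bad_character_shifts("field")
good_table = good_suffix_shifts("field")
found_at = boyer_moore_search("Cloveldfield", "field")
```

For “field” every letter is distinct, which is why every good suffix shift after the first is the full pattern length.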
Example of Boyer-Moore Search, cont.
–Text: “Cloveldfield”
–Pattern: “field”
Cloveldfield
field
–Compare right to left: “e” ≠ “d”; the bad character shift for “e” is 2
Cloveldfield
  field
–“d”, “l”, and “e” match, then “v” ≠ “i”; after 3 matches (“-eld”) the good suffix shift is 5
Cloveldfield
       field
–“d”, “l”, “e”, “i”, and “f” all match
Match Found
Another Fast String Search Algorithm
➢ Quick Search
–Skips based on character after pattern
•Can compare in any order
–Very fast
–Preprocessing time: O(m + |∑|)
–Performance: O(n) for average case
* m is the size of the pattern
* n is the size of the text
Example of Quick Search
–Text: “Cloveldfield”
–Pattern: “field”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       5
i       4
e       3
l       2
d       1
other   6
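A small Python sketch of Quick Search that rebuilds the shift table above. The function names are illustrative; the comparison uses a plain slice comparison, which is allowed because Quick Search can compare the window in any order.

```python
def quick_search_shifts(pattern):
    """Shift for the character just after the window: distance from
    its last occurrence in the pattern to one past the pattern's end.
    Letters not in the table use the "other" shift, len(pattern) + 1."""
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text, pattern):
    """Return the index of the first match, or -1."""
    n, m = len(text), len(pattern)
    shifts = quick_search_shifts(pattern)
    start = 0
    while start <= n - m:
        if text[start:start + m] == pattern:
            return start
        if start + m >= n:
            break  # no character after the window to look at
        start += shifts.get(text[start + m], m + 1)
    return -1


shift_table = quick_search_shifts("field")
found_at = quick_search("Cloveldfield", "field")
```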
Example of Quick Search, cont.
–Text: “Cloveldfield”
–Pattern: “field”
Cloveldfield
field
–No match; the character after the pattern is “l”, so shift by 2
Cloveldfield
  field
–No match; the character after the pattern is “f”, so shift by 5
Cloveldfield
       field
Match Found
The Unicode Standard
➢ Features:
•Single encoding for all languages
•Encodes over 90,000 characters
➢ Issues:
•Canonical equivalence
•More than one sort order
•Sorting is context sensitive
•Sorting strength levels
•Other sorting issues
Canonical Equivalence
Å (U+00C5) ≡ Å (U+212B) ≡ A + ◌̊ (U+030A)
x + ◌̣ + ◌̂ ≡ x + ◌̂ + ◌̣
ự ≡ ư + ◌̣ ≡ ụ + ◌̛ ≡ u + ◌̣ + ◌̛ ≡ u + ◌̛ + ◌̣
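These equivalences can be checked with Python's standard `unicodedata` module. The sketch below assumes the dot in the slide is the combining dot below (U+0323) and the apostrophe-like mark is the combining horn (U+031B); the helper name `canonically_equivalent` is illustrative.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Three spellings of A with ring above.
a_ring_forms = ["\u00C5", "\u212B", "A\u030A"]

# Dot below and circumflex in either order.
x_forms = ["x\u0323\u0302", "x\u0302\u0323"]

# Five spellings of u with horn and dot below.
u_horn_dot_forms = [
    "\u1EF1",          # precomposed
    "\u01B0\u0323",    # u with horn + dot below
    "\u1EE5\u031B",    # u with dot below + horn
    "u\u0323\u031B",   # u + dot below + horn
    "u\u031B\u0323",   # u + horn + dot below
]

all_equal = all(
    canonically_equivalent(forms[0], other)
    for forms in (a_ring_forms, x_forms, u_horn_dot_forms)
    for other in forms[1:]
)
```

A code point comparison would treat every one of these spellings as a different string.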
Sort Order Varies By:
➢ Language
– Swedish: z < ö
– German: ö < z
➢ Usage
– Dictionary: öf < of
– Telephone: of < öf
➢ Customizations
– A < a
– a < A
➢ Versioning
– Fixes
– New Gov. Stds
– New Characters
Sorting Is Context Sensitive
➢ Contractions
– H < Z, but CZ < CH
➢ Expansions
– OE < Œ < OF
➢ Both
– カー < カイ
– キー > キイ
Sorting Strength Levels
1. Base characters: a < b
2. Accents: as < às < at
– ignored if there is an L1 character difference
3. Case: ao < Ao < aò
– ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
– ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order
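The first three levels can be imitated with a toy key function in Python. This is only an illustration of the level idea, not the UCA: the weights are made up (lowercased base letter, code points of the accents, and an uppercase flag), punctuation and the tie-breaker are not modeled, and the name `toy_key` is hypothetical.

```python
import unicodedata


def toy_key(s):
    """Build (level 1, level 2, level 3) lists for a string.

    Tuples compare left to right, so a level 2 difference only matters
    when level 1 is equal, and level 3 only when levels 1 and 2 are equal.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += (ord(ch),)    # accent belongs to the previous base
        else:
            primary.append(ch.lower())     # level 1: base character
            secondary.append(())           # level 2: accents (none yet)
            tertiary.append(ch.isupper())  # level 3: case
    return (primary, secondary, tertiary)


accent_order = sorted(["at", "às", "as"], key=toy_key)
case_order = sorted(["aò", "Ao", "ao"], key=toy_key)
```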
Other Sorting Issues
➢ Normal accents
–cote < coté < côte < côté
•first accent difference determines order
➢ French accents
–cote < côte < coté < côté
•last accent difference determines order
➢ Logical Order Exception (Thai, Lao)
– เ ก sorts like ก เ
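The difference between normal and French accent ordering comes down to the direction in which accent weights are compared. A toy Python sketch, with made-up accent weights (the combining mark's code point) and a hypothetical `accent_key` helper:

```python
import unicodedata


def accent_key(word, backwards=False):
    """Return (base letters, accent weights) for a word.

    With backwards=True the accent weights are reversed, so the LAST
    accent difference in the word decides the order.
    """
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += (ord(ch),)
        else:
            base.append(ch)
            accents.append(())
    if backwards:
        accents.reverse()
    return (base, accents)


words = ["côté", "coté", "côte", "cote"]
normal_order = sorted(words, key=accent_key)
french_order = sorted(words, key=lambda w: accent_key(w, backwards=True))
```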
Unicode Collation Algorithm (UCA)
➢ UTS #10: Unicode Collation Algorithm
– Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.
– Default ordering: all Unicode code points
– Provides for tailoring to given languages
– Also see: The Unicode Standard, §5.17: Sorting and Searching
➢ Aligned with ISO 14651
Collation Elements (CEs)
➢ Ordering established by weights
➢ Weights encoded in Collation Elements
– [ primary weight ] (e.g. base character)
– [ secondary weight ] (e.g. accents)
– [ tertiary weight ] (e.g. case-level)
– [ quaternary weight ] (e.g. punctuation)
➢ Must be accessed sequentially
– Characters to CEs not 1:1
➢ Canonically equivalent characters have same CEs
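Sort Keys
➢ Transform a string into a series of bytes that can be binary-compared
–a:  06 C3 01 20 01 02 00
–A:  06 C3 01 20 01 08 00
–á:  06 C3 01 20 32 01 02 02 00
–ab: 06 C3 06 D7 01 20 20 01 02 02 00
–b:  06 D7 01 20 01 02 00
➢ Each key holds the Level 1 weights, then the Level 2 weights, then the Level 3 weights; a 01 byte separates the levels and 00 ends the key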
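Because sort keys are plain byte strings, ordering them needs nothing more than a byte comparison. The Python sketch below copies the five keys from the slide verbatim and sorts the strings by them; it does not generate sort keys itself.

```python
# Sort keys copied from the slide, as raw bytes.
SORT_KEYS = {
    "a":  bytes.fromhex("06 C3 01 20 01 02 00"),
    "A":  bytes.fromhex("06 C3 01 20 01 08 00"),
    "á":  bytes.fromhex("06 C3 01 20 32 01 02 02 00"),
    "ab": bytes.fromhex("06 C3 06 D7 01 20 20 01 02 02 00"),
    "b":  bytes.fromhex("06 D7 01 20 01 02 00"),
}

# bytes objects compare byte by byte, which is all a sort key needs.
ordered = sorted(SORT_KEYS, key=SORT_KEYS.get)
```

The keys for “a” and “A” differ only in the last weight byte, so case is decided only after the base letter and accent weights have tied.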
Matching With UCA
➢ Compare CEs for both strings
➢ Adjust for strength
➢ Access sequentially
➢ Stop on first mismatch
➢ Number of characters, CEs may differ
➢ No match if number of CEs different
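The matching steps above can be sketched in Python with a tiny, entirely hypothetical CE table. The weights in `TOY_CES` are invented for illustration and are not real UCA or ICU values; a real implementation would obtain CEs from a collator.

```python
import unicodedata

# Hypothetical collation elements: (primary, secondary, tertiary).
# A weight of 0 means "ignorable at that level".
TOY_CES = {
    "a": [(10, 1, 1)],
    "A": [(10, 1, 2)],
    "s": [(20, 1, 1)],
    "ß": [(20, 1, 1), (20, 1, 3)],  # expansion: one character, two CEs
    "\u0300": [(0, 2, 1)],          # combining grave: no primary weight
}


def collation_elements(s):
    """Map a string to its CEs, decomposing first so that canonically
    equivalent strings produce the same CEs."""
    ces = []
    for ch in unicodedata.normalize("NFD", s):
        ces.extend(TOY_CES[ch])
    return ces


def adjusted(ces, strength):
    """Keep only the first `strength` levels and drop CEs that are
    completely ignorable at that strength."""
    trimmed = [ce[:strength] for ce in ces]
    return [ce for ce in trimmed if any(ce)]


def matches(a, b, strength=3):
    left = adjusted(collation_elements(a), strength)
    right = adjusted(collation_elements(b), strength)
    if len(left) != len(right):
        return False              # no match if number of CEs differs
    for x, y in zip(left, right):
        if x != y:
            return False          # stop on first mismatch
    return True
```

For example, `matches("ß", "ss", strength=1)` is true even though the two strings have different numbers of characters, while at strength 3 the tertiary weights differ and the match fails.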
Comparing And Sorting With UCA
➢ Compare CEs
– Sequential
– Best performance for single compare
– Must have same number of CEs
➢ Compare sort keys
– Best performance for multiple compares
String Search And Unicode Text
➢ Canonical equivalence
– Find “à” in “a + ◌̀” and “a + ◌̀” in “à”
➢ Expansions & contractions
– “ß” = “ss”, “å” = “aa”
– “ch” is one character
➢ “Whole character” match
– Don’t find “a” in “a + ◌̀” or “c” in “ch”
➢ Length of pattern, match may differ
String Search And Collation
➢ Solves some problems:
– Same CEs for “à” and “a + ◌̀”
– Same CEs for “ß” and “ss” (at level 1)
– Same CEs for “å” and “aa” (at level 1)
– Won’t find “c” in “ch”
– Same length of pattern, match
➢ Doesn’t solve others:
– CE for “a” also 1st CE for “à” and “a + ◌̀”
String Search And Collation, cont.
➢ CEs:
– Expensive to generate
– Cheap to compare
– Sequential access
– Mapping to character index approximate
➢ “Whole character” match:
– Not enough information
– Use character properties
Linear Search And Collation
➢ Convert pattern to CEs up front
➢ Sequential access a good fit
–Can search forwards or backwards
➢ May read a given CE more than once
– Use circular buffer for performance
➢ Easy to find match bounds
– Validate for “whole character” match
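A toy Python sketch of a linear search over CEs, including the match bounds and the “whole character” check. It assumes a single hypothetical expansion (“ß” behaves like “ss” at level 1) and uses the lowercased letter as a stand-in primary weight; none of this reflects real UCA data, and the function names are illustrative.

```python
EXPANSIONS = {"ß": "ss"}  # hypothetical: characters that produce several CEs


def primary_ces(s):
    """Return (toy primary weight, source character index) pairs."""
    ces = []
    for index, ch in enumerate(s):
        for unit in EXPANSIONS.get(ch, ch):
            ces.append((unit.lower(), index))
    return ces


def ce_linear_search(text, pattern):
    """Return (start, end) character bounds of the first match, or None."""
    target = primary_ces(text)
    wanted = [weight for weight, _ in primary_ces(pattern)]  # pattern CEs, built once
    m = len(wanted)
    if m == 0:
        return None
    for pos in range(len(target) - m + 1):
        window = target[pos:pos + m]
        if [weight for weight, _ in window] != wanted:
            continue
        first, last = window[0][1], window[-1][1]
        # "Whole character" validation: reject a match that begins or
        # ends in the middle of one character's CEs.
        starts_inside = pos > 0 and target[pos - 1][1] == first
        ends_inside = pos + m < len(target) and target[pos + m][1] == last
        if not (starts_inside or ends_inside):
            return first, last + 1
    return None


bounds = ce_linear_search("My fußball table", "fuss")
```

Here the four-character pattern matches a three-character stretch of the text, which is why the bounds have to come from the CE-to-character mapping rather than from the pattern length.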
Boyer-Moore Search And Collation
➢ Use pattern CEs to compute skip tables
– CE “alphabet” large
– Use a hash function
➢ Fetch target CEs backwards
– Even for backwards search
➢ Access pattern makes buffering difficult
Boyer-Moore Search And Collation
–Text: “My fußball table”
–Pattern: “fuss”
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       3
u       2
s       1
other   4
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -s       1
2             -ss      4
3             -uss     4
Boyer-Moore And Collation Example
–Text: “My fußball table”
–Pattern: “fuss”
My fußball table
fuss
–Compare right to left: “f” ≠ “s”; the bad character shift for “f” is 3
My fußball table
   fuss
OOPS
–The four-character pattern is lined up against “fußb”, but the matching text “fuß” is only three characters long, so skips computed from the pattern length can jump past the match
Solving The Skipping Problem
➢ New function: minLengthInChars
– Shortest string generating same CEs
– Always treat pattern as shortest
– May not always skip as far as it could
– Will never skip too far
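The slides name `minLengthInChars` but do not show how it is computed. The Python sketch below is a hypothetical toy version that works on characters instead of CEs: given a table of expansions, it finds the fewest characters that could stand for the pattern. The function name `min_length_in_chars` and the `expansions` argument are assumptions made for illustration.

```python
def min_length_in_chars(pattern, expansions):
    """Fewest characters that can produce the same (toy) CEs as `pattern`.

    `expansions` maps a single character to the string it expands to,
    e.g. {"ß": "ss"}.
    """
    n = len(pattern)
    # best[i] = fewest characters needed to cover pattern[:i]
    best = [0] + [n + 1] * n
    for i in range(n):
        # Cover one pattern character with itself.
        best[i + 1] = min(best[i + 1], best[i] + 1)
        # Or cover a whole expansion with the single character behind it.
        for expanded in expansions.values():
            if pattern.startswith(expanded, i):
                j = i + len(expanded)
                best[j] = min(best[j], best[i] + 1)
    return best[n]


shortest = min_length_in_chars("fuss", {"ß": "ss"})
```

Using this shorter length when building the skip tables gives smaller, safe shifts: the search may skip less than it could, but it will not skip past a match.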
minLengthInChars Example
–Text: “My fußball table”
–Pattern: “fuss” (treat as “fuß”)
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       2
u       1
other   3
Good Suffix Shift Table
# of Matches  Pattern  Shift
0             -        1
1             -ß       3
2             -uß      3
minLengthInChars Example, cont.
–Text: “My fußball table”
–Pattern: “fuss” (treat as “fuß”)
My fußball table
fuß
–Compare right to left: the space is not in the pattern, so shift by 3 (“other”)
My fußball table
   fuß
Match Found
Quick Search And Collation
➢ Search pattern forward
– Fastest way to get CEs
➢ Can also search backward
– Slower than forward
➢ Other search orders:
– More expensive pre-processing
– Non-sequential access expensive
Quick Search And Collation Example
–Text: “My fussball table”
–Pattern: “fussball” (treated as “fußball”)
Preprocessing of Pattern:
Bad Character Shift Table
Letter  Shift
f       7
u       6
s       5
b       4
a       3
l       1
other   8
Quick Search And Collation Example, cont.
–Text: “My fussball table”
–Pattern: “fussball” (treated as “fußball”)
My fussball table
fußball
–No match; the character after the seven-character pattern is “b”, so shift by 4
My fussball table
    fußball
OOPS
–The match “fussball” starts one character earlier, so the shift has skipped over it
Quick Search And Collation Conclusion
➢ Must compare backwards
➢ More expensive to fetch CE after pattern
– Non-sequential access
– Character after might generate multiple CEs
➢ Boyer-Moore seems like a better fit
– No need to fetch extra CE
– Sequential access between skips
Summary
• Matching, sorting, and searching are essential text handling tools
• Using the character code points is not sufficient
• Implementing the Unicode standard and the Unicode Collation Algorithm is the way to go
• Special considerations during implementation
Questions?
More Information
➢ Unicode Collation Algorithm
–http://unicode.org/reports/tr10
➢ ICU
–http://www.icu-project.org