
32nd Internationalization and Unicode Conference San José, CA, September 2008

Matching, Sorting, and Searching with Unicode Text

Eric R. Mader, Michael Ow

IBM Globalization Architecture and Technology


Topics

➢ Matching
– Are two characters the same?

➢ Sorting
– Proper ordering of characters

➢ Searching
– Finding the desired pattern in a given text


Different Ways to Implement

➢ “Non-Globalized Way”
– Concerned with an individual language only
– Uses codepages that are language specific or limited

➢ “Globalized Way”
– Uses Unicode
– Uses the Unicode Collation Algorithm
– Considers all languages


Non-Globalized Way

–Matching using the individual code points

•a (0x61) = a (0x61)

* using US-ASCII


Non-Globalized Way, cont.

– Sorting can also be done by comparing code points or a simple mapping of the code points (e.g. in EBCDIC)

Unsorted:

Character   Code Point
b           0x62
a           0x61
d           0x64
c           0x63

Sorted:

Character   Code Point
a           0x61
b           0x62
c           0x63
d           0x64

* Using US-ASCII


Non-Globalized Way, cont.

➢ Searching

–Many different algorithms in string search*

–Searching of text done through analysis of character code points only

*algorithms can be globalized


Basics of String Search Algorithms

– There are many ways to search through text:
• Linear Search
• Boyer-Moore
• Quick Search

* There are many more, but because of time constraints we will discuss only the common ones listed above.


Simple String Search Algorithm

➢ Linear Search

–Brute force search

–Check every character against pattern

–Very slow

–No preprocessing time

–Performance: O (m * n)

* m is the size of the pattern
* n is the size of the text

Example of Linear Search

– Text: “Cloverfield”
– Pattern: “field”

The pattern is compared at each position in turn, moving one character to the right after every mismatch:

Cloverfield
field           “C” ≠ “f”
 field          “l” ≠ “f”
  field         “o” ≠ “f”
   field        “v” ≠ “f”
    field       “e” ≠ “f”
     field      “r” ≠ “f”
      field     Match Found
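The brute-force search described above can be sketched in a few lines of Python. This is a minimal illustration comparing code points only; the function name `linear_search` is our own and does not come from the slides.

```python
def linear_search(text: str, pattern: str) -> int:
    """Return the first offset where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text: O(m * n) overall.
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


if __name__ == "__main__":
    print(linear_search("Cloverfield", "field"))
```

No preprocessing is needed, which is the one advantage this approach has over the faster algorithms.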


Better String Search Algorithm

➢ Boyer-Moore
– Intelligently skips characters in the text based on the pattern
– Very fast
– Preprocessing time: O(m + |∑|)
– Performance: O(n/m) in the best case

* m is the size of the pattern
* n is the size of the text
* |∑| is the size of the alphabet

Example of Boyer-Moore Search

– Text: “Cloveldfield”
– Pattern: “field”

Preprocessing of Pattern:

Bad Character Shift Table

Letter   Shift
f        4
i        3
e        2
l        1
other    5

Good Suffix Shift Table

# of Matches   Pattern   Shift
0              -         1
1              -d        5
2              -ld       5
3              -eld      5
4              -ield     5

The pattern is compared right to left at each alignment:

Cloveldfield
field            “e” ≠ “d”: no matches, bad character “e” shifts by 2
  field          “eld” matches, “v” ≠ “i”: good suffix “-eld” shifts by 5
       field     Match Found
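As a rough sketch, the bad-character half of this example can be written in Python. Note that this is the Horspool simplification of Boyer-Moore: it uses only a bad-character table and omits the good suffix table. The function names are illustrative, not from the slides.

```python
def bad_character_table(pattern: str) -> dict:
    """Shift for each pattern character except the last one.

    The shift is the distance from the character's rightmost occurrence
    to the end of the pattern; characters not listed shift by len(pattern).
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text: str, pattern: str) -> int:
    """Return the first offset where pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    table = bad_character_table(pattern)
    start = 0
    while start + m <= n:
        # Compare right to left, as Boyer-Moore does.
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            return start
        # Skip according to the text character under the pattern's last position.
        start += table.get(text[start + m - 1], m)
    return -1


if __name__ == "__main__":
    print(bad_character_table("field"))
    print(horspool_search("Cloveldfield", "field"))
```

For the pattern “field” the table comes out as f 4, i 3, e 2, l 1, with every other character shifting by 5, which agrees with the Bad Character Shift Table in the example.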


Another Fast String Search Algorithm

➢ Quick Search
– Skips based on character after pattern
• Can compare in any order
– Very fast
– Preprocessing time: O(m + |∑|)
– Performance: O(n) for average case

* m is the size of the pattern
* n is the size of the text

Example of Quick Search

– Text: “Cloveldfield”
– Pattern: “field”

Preprocessing of Pattern:

Bad Character Shift Table

Letter   Shift
f        5
i        4
e        3
l        2
d        1
other    6

The shift is chosen by the text character just after the current alignment:

Cloveldfield
field            mismatch; next character “l” shifts by 2
  field          mismatch; next character “f” shifts by 5
       field     Match Found
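A minimal Python sketch of Quick Search follows, assuming plain code point comparison. The function names are our own; the window is compared with a simple slice, since Quick Search allows any comparison order.

```python
def quick_search_shifts(pattern: str) -> dict:
    """Shift for each pattern character: distance from its rightmost
    occurrence to the position just past the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    return {ch: m - i for i, ch in enumerate(pattern)}


def quick_search(text: str, pattern: str) -> int:
    """Return the first offset where pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    shifts = quick_search_shifts(pattern)
    start = 0
    while start + m <= n:
        if text[start:start + m] == pattern:
            return start
        if start + m >= n:
            break  # no character after the window, so nothing left to try
        # The character just after the window decides the shift.
        start += shifts.get(text[start + m], m + 1)
    return -1


if __name__ == "__main__":
    print(quick_search_shifts("field"))
    print(quick_search("Cloveldfield", "field"))
```

For “field” the shifts are f 5, i 4, e 3, l 2, d 1, and 6 for any other character, as in the table above.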


The Unicode Standard

➢ Features:
• Single encoding for all languages
• Encodes over 90,000 characters

➢ Issues:
• Canonical equivalence
• More than one sort order
• Sorting is context sensitive
• Sorting strength levels
• Other sorting issues


Canonical Equivalence

Å ≡ Å ≡ A + º

x + . + ^ ≡ x + ^ + .

ự ≡ ư + . ≡ ụ + ’ ≡ u + . + ’ ≡ u + ’ + .

(Here º, ., ^ and ’ stand for the combining ring above, dot below, circumflex, and horn.)
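Canonical equivalence can be checked by normalizing both strings to the same form. The sketch below uses Python's standard `unicodedata` module. The specific code points are our reading of the examples above (U+00C5 and U+212B for the two forms of Å, U+030A for the ring, U+0323 for the dot below, U+0302 for the circumflex, U+031B for the horn), and the helper name is illustrative.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # A WITH RING ABOVE, ANGSTROM SIGN, and A + COMBINING RING ABOVE
    print(canonically_equivalent("\u00C5", "\u212B"))
    print(canonically_equivalent("\u00C5", "A\u030A"))
    # Marks with different combining classes may appear in either order
    print(canonically_equivalent("x\u0323\u0302", "x\u0302\u0323"))
    # U WITH HORN AND DOT BELOW, composed and decomposed in several ways
    print(canonically_equivalent("\u1EF1", "\u01B0\u0323"))
    print(canonically_equivalent("\u1EF1", "u\u0323\u031B"))
```

A plain code point comparison would report every one of these pairs as different.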


Sort Order Varies By:

➢ Language
– Swedish: z < ö
– German: ö < z

➢ Usage
– Dictionary: of < öf
– Telephone: öf < of

➢ Customizations

– A < a

– a < A

➢ Versioning

– Fixes

– New Gov. Stds

– New Characters


Sorting Is Context Sensitive

➢ Contractions
– H < Z, but CZ < CH

➢ Expansions
– OE < Œ < OF

➢ Both
– カー < カイ
– キー > キイ


Sorting Strength Levels

1. Base characters: a < b
2. Accents: as < às < at
– ignored if there is an L1 character difference
3. Case: ao < Ao < aò
– ignored if there is an L1 or L2 difference
4. Punctuation: ab < a-b < aB
– ignored* if there is an L1, L2, or L3 difference
5. Tie-breaker: NFD code point order


Other Sorting Issues

➢ Normal accents
– cote < coté < côte < côté
• first accent difference determines order

➢ French accents
– cote < côte < coté < côté
• last accent difference determines order

➢ Logical Order Exception (Thai, Lao)
– เ ก sorts like ก เ
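The difference between normal and French accent ordering can be illustrated with a toy sort key. This is a sketch under simplifying assumptions, not the UCA: base letters are compared first, and the accent on each letter is represented by the code points of its combining marks. For French ordering the accent list is simply reversed, so the last difference is seen first. All names are our own.

```python
import unicodedata


def accent_profile(word: str):
    """Split a word into base letters and, per letter, its combining marks."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            accents[-1] += (ord(ch),)   # attach the mark to the preceding letter
        else:
            bases.append(ch)
            accents.append(())
    return bases, accents


def normal_key(word: str):
    """Base letters first, then accents compared from the start of the word."""
    bases, accents = accent_profile(word)
    return bases, accents


def french_key(word: str):
    """Base letters first, then accents compared from the end of the word."""
    bases, accents = accent_profile(word)
    return bases, accents[::-1]


if __name__ == "__main__":
    words = ["côté", "coté", "côte", "cote"]
    print(sorted(words, key=normal_key))
    print(sorted(words, key=french_key))
```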


Unicode Collation Algorithm (UCA)

➢ UTS #10: Unicode Collation Algorithm

– Levels, Expansions, Contractions, Punctuation, Canonical Equivalence, etc.

– Default ordering: all Unicode code points

– Provides for tailoring to given languages

– Also see: The Unicode Standard, §5.17: Sorting and Searching

➢ Aligned with ISO 14651


Collation Elements (CEs)

➢ Ordering established by weights

➢ Weights encoded in Collation Elements

– [ primary weight ] (e.g. base character)

– [ secondary weight ] (e.g. accents)

– [ tertiary weight ] (e.g. case-level)

– [ quaternary weight ] (e.g. punctuation)

➢ Must be accessed sequentially

– Characters to CEs not 1:1

➢ Canonically equivalent characters have same CEs


Sort Keys

➢ Transform string into series of bytes which will binary-compare
– a:  06 C3 01 20 01 02 00
– A:  06 C3 01 20 01 08 00
– á:  06 C3 01 20 32 01 02 02 00
– ab: 06 C3 06 D7 01 20 20 01 02 02 00
– b:  06 D7 01 20 01 02 00

In each key the Level 1 bytes come first, followed by 01, the Level 2 bytes, 01, the Level 3 bytes, and a final 00.
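Because sort keys are plain byte strings, ordering them needs nothing more than a binary comparison. The Python sketch below takes the five keys from this slide as data and sorts with the built-in `bytes` ordering. Splitting on the 01 separator is our reading of the key layout, and the helper name `levels` is illustrative.

```python
# Sort keys copied from the slide; the dict itself is only for illustration.
SORT_KEYS = {
    "b": bytes.fromhex("06 D7 01 20 01 02 00"),
    "ab": bytes.fromhex("06 C3 06 D7 01 20 20 01 02 02 00"),
    "á": bytes.fromhex("06 C3 01 20 32 01 02 02 00"),
    "A": bytes.fromhex("06 C3 01 20 01 08 00"),
    "a": bytes.fromhex("06 C3 01 20 01 02 00"),
}


def levels(key: bytes):
    """Split a key into its level 1, 2 and 3 byte runs (assumes 01 separators)."""
    return key[:-1].split(b"\x01")


if __name__ == "__main__":
    # bytes objects compare lexicographically, i.e. a binary compare.
    print(sorted(SORT_KEYS, key=SORT_KEYS.get))
    print(levels(SORT_KEYS["á"]))
```

Sorting by key gives a, A, á, ab, b: the accent difference in á outranks the case difference in A, and any primary difference outranks both.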


Matching With UCA

➢ Compare CEs for both strings
➢ Adjust for strength
➢ Access sequentially
➢ Stop on first mismatch
➢ Number of characters, CEs may differ
➢ No match if number of CEs different


Comparing And Sorting With UCA

➢ Compare CEs
– Sequential
– Best performance for single compare
– Must have same number of CEs

➢ Compare sort keys
– Best performance for multiple compares
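To make the idea of comparing CEs at a chosen strength concrete, here is a toy Python sketch. The weights are invented for illustration and are not the real UCA table: the primary weight is the code point of the lowercased base letter, accents become primary-ignorable CEs with a secondary weight, and uppercase gets a larger tertiary weight. All function names are hypothetical.

```python
import unicodedata


def collation_elements(text: str):
    """Toy CEs as (primary, secondary, tertiary); NOT real UCA weights."""
    ces = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            ces.append((0, ord(ch), 1))           # accent: ignorable at level 1
        else:
            tertiary = 2 if ch.isupper() else 1   # lowercase sorts first
            ces.append((ord(ch.lower()[0]), 1, tertiary))
    return ces


def level_key(ces, level: int):
    """All non-zero weights of one level, in order."""
    return [ce[level] for ce in ces if ce[level] != 0]


def compare(a: str, b: str, strength: int = 3) -> int:
    """Return -1, 0 or 1, looking only at levels 1..strength."""
    ces_a, ces_b = collation_elements(a), collation_elements(b)
    for level in range(strength):
        key_a, key_b = level_key(ces_a, level), level_key(ces_b, level)
        if key_a != key_b:
            return -1 if key_a < key_b else 1
    return 0


if __name__ == "__main__":
    print(compare("as", "às"), compare("às", "at"))
    print(compare("a", "A", strength=2), compare("a", "A", strength=3))
```

Lowering the strength turns a difference into a match: “a” and “A” compare equal at strength 2 but not at strength 3.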


String Search And Unicode Text

➢ Canonical equivalence
– Find “à” in “a + ◌̀” and “a + ◌̀” in “à”

➢ Expansions & contractions
– “ß” = “ss”, “å” = “aa”
– “ch” is one character

➢ “Whole character” match
– Don’t find “a” in “a + ◌̀” or “c” in “ch”

➢ Length of pattern, match may differ


String Search And Collation

➢ Solves some problems:
– Same CEs for “à” and “a + ◌̀”
– Same CEs for “ß” and “ss” (at level 1)
– Same CEs for “å” and “aa” (at level 1)
– Won’t find “c” in “ch”
– Same length of pattern, match

➢ Doesn’t solve others:
– CE for “a” also 1st CE for “à” and “a + ◌̀”


String Search And Collation, cont.

➢ CEs:
– Expensive to generate
– Cheap to compare
– Sequential access
– Mapping to character index approximate

➢ “Whole character” match:
– Not enough information
– Use character properties


Linear Search And Collation

➢ Convert pattern to CEs up front
➢ Sequential access a good fit
– Can search forwards or backwards
➢ May read a given CE more than once
– Use circular buffer for performance
➢ Easy to find match bounds
– Validate for “whole character” match
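The steps above can be sketched as a linear search over CEs instead of characters. The Python below is a toy: the only expansion it knows is the hypothetical table entry “ß” → “ss”, the primary weight is just the lowercased letter, and the names are our own. Each CE remembers which character produced it, so the match bounds can be mapped back to character offsets and checked against character boundaries.

```python
# Hypothetical toy table: one character that generates two CEs.
EXPANSIONS = {"ß": "ss"}


def to_ces(text: str):
    """Return (ce, char_index) pairs; the CE is a toy primary weight."""
    ces = []
    for index, ch in enumerate(text):
        for unit in EXPANSIONS.get(ch, ch):
            ces.append((unit.lower(), index))
    return ces


def ce_linear_search(text: str, pattern: str):
    """Return (start, end) character offsets of the first whole-character
    match at primary strength, or None."""
    text_ces = to_ces(text)
    pattern_ces = [ce for ce, _ in to_ces(pattern)]   # converted up front
    m = len(pattern_ces)
    if m == 0:
        return None
    for i in range(len(text_ces) - m + 1):
        if [ce for ce, _ in text_ces[i:i + m]] != pattern_ces:
            continue
        first, last = i, i + m - 1
        # Reject matches that begin or end in the middle of a character's CEs.
        starts_ok = first == 0 or text_ces[first - 1][1] != text_ces[first][1]
        ends_ok = (last == len(text_ces) - 1
                   or text_ces[last + 1][1] != text_ces[last][1])
        if starts_ok and ends_ok:
            return text_ces[first][1], text_ces[last][1] + 1
    return None


if __name__ == "__main__":
    print(ce_linear_search("My fußball table", "fuss"))
    print(ce_linear_search("My fußball table", "fus"))
```

Searching for “fuss” in “My fußball table” yields a three-character match, while “fus” is rejected because it would end inside the CEs of “ß”.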


Boyer-Moore Search And Collation

➢ Use pattern CEs to compute skip tables
– CE “alphabet” large
– Use a hash function
➢ Fetch target CEs backwards
– Even for backwards search
➢ Access pattern makes buffering difficult


Boyer-Moore Search And Collation

– Text: “My fußball table”
– Pattern: “fuss”

Preprocessing of Pattern:

Bad Character Shift Table

Letter   Shift
f        3
u        2
s        1
other    4

Good Suffix Shift Table

# of Matches   Pattern   Shift
0              -         1
1              -s        1
2              -ss       4
3              -uss      4

Boyer-Moore And Collation Example

– Text: “My fußball table”
– Pattern: “fuss”

My fußball table
fuss                  “f” ≠ “s”: bad character “f” shifts by 3
   fuss               OOPS

The four-character pattern now lines up with “fußb”, although “fuß” alone generates the same CEs as “fuss”. A skip computed from the pattern's length in characters can therefore jump past the match.


Solving The Skipping Problem

➢ New function: minLengthInChars
– Shortest string generating same CEs
– Always treat pattern as shortest
– May not always skip as far as it could
– Will never skip too far
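One way to picture what such a function computes is the small dynamic program below. It is only a sketch under a toy assumption: the sole multi-CE character is the hypothetical table entry “ß”, which generates the CEs of “ss”. The Python name `min_length_in_chars` mirrors the slide's term but is not the actual implementation.

```python
# Hypothetical toy table: character -> primary CEs it generates.
EXPANSIONS = {"ß": ["s", "s"]}


def min_length_in_chars(pattern: str) -> int:
    """Length of the shortest string that generates the same CEs as pattern."""
    pattern_ces = []
    for ch in pattern:
        pattern_ces.extend(EXPANSIONS.get(ch, [ch]))
    n = len(pattern_ces)
    # best[i] = fewest characters needed to generate pattern_ces[i:]
    best = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = 1 + best[i + 1]            # one ordinary character per CE
        for ces in EXPANSIONS.values():
            if pattern_ces[i:i + len(ces)] == ces:
                # One expanding character covers several CEs at once.
                best[i] = min(best[i], 1 + best[i + len(ces)])
    return best[0]


if __name__ == "__main__":
    print(min_length_in_chars("fuss"))
    print(min_length_in_chars("fussball"))
```

Using this minimum as the pattern length when building the skip tables keeps every skip safe, at the cost of sometimes skipping less than would be possible.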

minLengthInChars Example

– Text: “My fußball table”
– Pattern: “fuss” (treat as “fuß”)

Preprocessing of Pattern:

Bad Character Shift Table

Letter   Shift
f        2
u        1
other    3

Good Suffix Shift Table

# of Matches   Pattern   Shift
0              -         1
1              -ß        3
2              -uß       3

My fußball table
fuß                   “ ” ≠ “ß”: bad character “ ” (other) shifts by 3
   fuß                Match Found


Quick Search And Collation

➢ Search pattern forward
– Fastest way to get CEs
➢ Can also search backward
– Slower than forward
➢ Other search orders:
– More expensive pre-processing
– Non-sequential access expensive

Quick Search And Collation Example

– Text: “My fussball table”
– Pattern: “fussball” (treated as “fußball”)

Preprocessing of Pattern:

Bad Character Shift Table

Letter   Shift
f        7
u        6
s        5
b        4
a        3
l        1
other    8

My fussball table
fußball               mismatch; next character “b” shifts by 4
    fußball           OOPS

The seven-character window ends inside “fussball”, so the shift chosen by “b” carries the pattern past the match that begins at “f”.


Quick Search And Collation Conclusion

➢ Must compare backwards
➢ More expensive to fetch CE after pattern
– Non-sequential access
– Character after might generate multiple CEs
➢ Boyer-Moore seems like a better fit
– No need to fetch extra CE
– Sequential access between skips


Summary

• Matching, sorting, and searching are essential text handling tools

• Using the character code points is not sufficient

• Implementing the Unicode standard and the Unicode Collation Algorithm is the way to go

• Special considerations during implementation


Questions?


More Information

➢ Unicode Collation Algorithm
– http://unicode.org/reports/tr10

➢ ICU
– http://www.icu-project.org