Camomile : A Unicode library for OCaml

Yoriyuki Yamagata

National Institute of Advanced Industrial Science and Technology (AIST)

ML Workshop, September 18, 2011

Outline

Overview

ASCII to Unicode : A challenge of multilingualization

Example : Unicode normal forms

ulib

Conclusion

Overview - functionality

Camomile - A Unicode library for OCaml

- Unicode character type
- UTF-8, UTF-16, UTF-32 strings
- Conversion to/from approximately 200 encodings
- Case mapping
- Collation (sort and search)

Overview - features

- Supports only “logical” operations
- No support for rendering or formatting
- Written purely in OCaml
- Functors and lazy evaluation play crucial roles

ASCII to Unicode : challenge of multilingualization

Large number of characters
  - code range 0x0 - 0x10ffff

Multiple representations of strings
  - UTF-8, UTF-16 and UTF-32
  - legacy encodings

Combining characters
  - ä = a + ¨
  - Nguyễn = Nguyê + ˜ + n = Nguye + ˆ + ˜ + n
  - ậ = ạ + ˆ = a + . + ˆ = a + ˆ + .   (here . stands for the combining dot below)

Diverse cultural conventions
  - Case mapping: OΣOΣ → oσoς (Greek)
  - Sorting: ... < H < CH < I < ... (Slovak)
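
To make the "multiple representations" point concrete, here is a minimal sketch of a UTF-8 encoder for a single code point, using only the OCaml standard library. The function name utf8_encode is an illustrative choice and is not part of Camomile. It shows that precomposed ä (U+00E4) and a followed by a combining diaeresis (U+0308) are different byte strings, even though they denote the same text.

```ocaml
(* Encode one Unicode scalar value (0x0 - 0x10ffff, surrogates excluded)
   as a UTF-8 byte string. Hypothetical helper, not Camomile's API. *)
let utf8_encode (cp : int) : string =
  if cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff) then
    invalid_arg "utf8_encode: not a Unicode scalar value";
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xc0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3f))
  end else if cp < 0x10000 then begin
    add (0xe0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3f));
    add (0x80 lor (cp land 0x3f))
  end else begin
    add (0xf0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3f));
    add (0x80 lor ((cp lsr 6) land 0x3f));
    add (0x80 lor (cp land 0x3f))
  end;
  Buffer.contents b

(* Two encodings of the "same" character ä. *)
let precomposed = utf8_encode 0x00e4
let decomposed = utf8_encode 0x0061 ^ utf8_encode 0x0308
```

A byte-wise comparison of precomposed and decomposed reports them as different, which is why the normal forms discussed next are needed.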

Unicode normal forms - what are they?

Unicode has multiple representations of the “same” string.

E.g. ậ = ạ + ˆ = a + . + ˆ = a + ˆ + . etc.

Normal forms give a unique representation.

There are 4 normal forms:

1. NFD
2. NFC
3. NFKD
4. NFKC

We concentrate on NFD.

Unicode normal form - NFD

1. Decompose characters as much as possible
   ậ ⇒ ạ + ˆ ⇒ a + . + ˆ

2. Do a stable sort on combining characters based on their combining class
   a + . + ˆ ⇒ a + . + ˆ   (already in order: the dot below has combining class 220, the circumflex 230)
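
The two steps above can be sketched in a few lines of OCaml. This is a toy illustration, not Camomile's implementation: the tables decompose_one and combining_class are hypothetical and cover only the characters used on these slides, whereas a real library reads them from the Unicode Character Database. Code points are plain integers here.

```ocaml
(* Toy canonical decomposition table (assumption: only the characters
   from the slides are listed). *)
let decompose_one = function
  | 0x1EAD -> [0x1EA1; 0x0302]   (* ậ -> ạ + combining circumflex *)
  | 0x1EA1 -> [0x0061; 0x0323]   (* ạ -> a + combining dot below *)
  | 0x00E2 -> [0x0061; 0x0302]   (* â -> a + combining circumflex *)
  | c -> [c]

(* Step 1: decompose recursively until nothing changes. *)
let rec decompose c =
  match decompose_one c with
  | [c'] when c' = c -> [c]
  | cs -> List.concat_map decompose cs

(* Toy canonical combining classes; 0 marks a starter (base character). *)
let combining_class = function
  | 0x0323 -> 220   (* dot below *)
  | 0x0302 -> 230   (* circumflex *)
  | _ -> 0

(* Step 2: stable-sort every run of combining marks by combining class.
   [run] and [out] are accumulated in reverse order. *)
let reorder cs =
  let by_class a b = compare (combining_class a) (combining_class b) in
  let flush run out =
    List.rev_append (List.stable_sort by_class (List.rev run)) out in
  let rec go run out = function
    | [] -> List.rev (flush run out)
    | c :: rest when combining_class c = 0 -> go [] (c :: flush run out) rest
    | c :: rest -> go (c :: run) out rest
  in
  go [] [] cs

let nfd cs = reorder (List.concat_map decompose cs)
```

With these tables, the precomposed ậ and the sequence â followed by a dot below both normalize to the same list: a, dot below, circumflex.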

Camomile strings - UTF8, UTF16, UCS4

UTF8
  UTF-8 string as a string

UTF16
  UTF-16 string as an unsigned 16-bit integer bigarray

UCS4
  UTF-32 string as a 32-bit integer bigarray

UnicodeString.Type
  UTF8, UTF16 and UCS4 all conform to UnicodeString.Type.
  String operations are functors over UnicodeString.Type.
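
The idea of writing a string operation once, as a functor over a common string signature, can be illustrated with a self-contained sketch. The signature USTRING and the modules IntArrayText, Latin1Text and Count below are hypothetical simplifications made up for this example; the real UnicodeString.Type is richer.

```ocaml
(* Hypothetical, simplified stand-in for a Unicode string signature. *)
module type USTRING = sig
  type t
  val length : t -> int          (* number of code points *)
  val get : t -> int -> int      (* n-th code point *)
end

(* One representation: an array of code points (UCS4-like). *)
module IntArrayText : USTRING with type t = int array = struct
  type t = int array
  let length = Array.length
  let get = Array.get
end

(* Another representation: a Latin-1 byte string, one byte per code point. *)
module Latin1Text : USTRING with type t = string = struct
  type t = string
  let length = String.length
  let get s i = Char.code s.[i]
end

(* A string operation written once, as a functor over the signature. *)
module Count (Text : USTRING) = struct
  let count_non_ascii (s : Text.t) =
    let n = ref 0 in
    for i = 0 to Text.length s - 1 do
      if Text.get s i > 0x7f then incr n
    done;
    !n
end

module CountArray = Count (IntArrayText)
module CountLatin1 = Count (Latin1Text)
```

The same Count code serves both representations, which is the design the slide describes for UTF8, UTF16 and UCS4.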

Camomile modules - UNF

Module for Unicode normal form

    module type Type =
    sig
      type text
      type index  (* needed by the constraint on Make below; elided on the slide *)

      (* Conversion to NFD, and likewise to NFKD, NFC and NFKC *)
      val nfd : text -> text
      val nfkd : text -> text
      val nfc : text -> text
      val nfkc : text -> text

      (* Compare strings by semantic equivalence,
         by lazily building their NFDs and comparing them *)
      val canon_compare : text -> text -> int
    end

    (* Create a module for a given Unicode string type *)
    module Make (Text : UnicodeString.Type) :
      Type with type text = Text.t and type index = Text.index
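
Why laziness helps in a comparison like canon_compare can be shown with a small standard-library sketch. This is not Camomile's code: compare_seq and counted are hypothetical helpers, and plain integer sequences stand in for lazily produced NFD code points. The point is that the comparison stops at the first difference, so the rest of either normalized string is never built.

```ocaml
(* Compare two lazily produced sequences of code points, forcing only as
   many elements as are needed to find the first difference. *)
let rec compare_seq (a : int Seq.t) (b : int Seq.t) : int =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, Seq.Cons _ -> -1
  | Seq.Cons _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    let c = compare x y in
    if c <> 0 then c else compare_seq a' b'

(* Wrap a list as a lazy sequence that counts how many elements were
   actually produced (stand-in for on-demand normalization work). *)
let counted (forced : int ref) (xs : int list) : int Seq.t =
  Seq.map (fun x -> incr forced; x) (List.to_seq xs)

let forced = ref 0
let result = compare_seq (counted forced [1; 2; 3; 4]) (List.to_seq [1; 9; 3; 4])
```

Here the two sequences differ at the second element, so only two elements of the counted sequence are ever produced.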

ulib - yet another Unicode library

Now under development

ulib is compact

- Minimum functionality
- No data file
- No initialization

ulib - yet another Unicode library

ulib is modern

- Rope for Unicode strings
- Zipper for indexing ropes
- Pluggable code converter using first-class modules
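
As an illustration of what "pluggable code converter using first-class modules" can mean, here is a minimal sketch in plain OCaml. It does not reproduce ulib's actual interface: the signature CODEC, the modules Latin1 and Ascii, and the functions register and decode_with are all assumptions invented for this example.

```ocaml
(* Hypothetical converter interface: bytes in, code points out. *)
module type CODEC = sig
  val name : string
  val decode : string -> int list
end

module Latin1 : CODEC = struct
  let name = "ISO-8859-1"
  (* Every Latin-1 byte is its own code point. *)
  let decode s = List.map Char.code (List.of_seq (String.to_seq s))
end

module Ascii : CODEC = struct
  let name = "US-ASCII"
  (* Bytes outside ASCII become U+FFFD REPLACEMENT CHARACTER. *)
  let decode s =
    List.map
      (fun c -> let n = Char.code c in if n < 0x80 then n else 0xfffd)
      (List.of_seq (String.to_seq s))
end

(* Converters are stored as ordinary values in a table, so new ones can
   be plugged in at run time. *)
let registry : (string, (module CODEC)) Hashtbl.t = Hashtbl.create 8

let register (m : (module CODEC)) =
  let module M = (val m) in
  Hashtbl.replace registry M.name m

let () = register (module Latin1); register (module Ascii)

(* Look a converter up by name and unpack it before use. *)
let decode_with (name : string) (s : string) : int list =
  let module M = (val Hashtbl.find registry name : CODEC) in
  M.decode s
```

Because a packed module is a normal value, a table of converters needs no functor application at the call site; decode_with raises Not_found for an unregistered name.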

Conclusion

- Unicode is different from ASCII
- Camomile addresses the "logical" part of Unicode
- Functors and laziness play crucial roles
- A simpler library, "ulib", is now under development

Project URL

Camomile: https://github.com/yoriyuki/Camomile
ulib: https://github.com/yoriyuki/ulib