From UCS-2 to UTF-16
Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16
Why is this an issue?
• The concept of the Unicode standard changed during its first few years
• Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M
• APIs and libraries need to follow this change and support the full range
• Upcoming character assignments (Unicode 3.1, 2001) fall into the added range
“Unicode is a 16-bit character set”
• Concept: 16-bit, fixed-width character set
• Saving space by not including precomposed, rarely-used, obsolete, … characters
• Compatibility, transition strategies, and acceptance forced loosening of these principles
• Unicode 3.1: >90k assigned characters
16-bit APIs
• APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2
• Assuming 1:1 character:code unit
• Examples: Win32, Java, COM, ICU, Qt/KDE
• Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols
Extending the range
• Set aside two blocks of 1k 16-bit values, “surrogates”, for extension
• 1k x 1k = 1M = 100000₁₆ additional code points using a pair of code units
• 16-bit form now variable-width UTF-16
• “Unicode scalar values” 0..10ffff₁₆
• Proposed: 1994; part of Unicode 2.0 (1996)
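The pairing arithmetic above can be sketched in C. This is a minimal illustration, not code from any particular library: the helper names `cp_to_pair` and `pair_to_cp` are made up for this example, and the surrogate block starts D800₁₆ and DC00₁₆ are taken from the UTF-16 definition. Inputs are assumed to be valid (a supplementary code point, or a well-formed pair).

```c
#include <stdint.h>

#define LEAD_BASE  0xD800u   /* first lead (high) surrogate */
#define TRAIL_BASE 0xDC00u   /* first trail (low) surrogate */
#define SUPPL_BASE 0x10000u  /* first code point beyond the 16-bit range */

/* Split a supplementary code point (0x10000..0x10FFFF) into two code units.
   Hypothetical helper; the caller must pass a value in that range. */
static void cp_to_pair(uint32_t cp, uint16_t *lead, uint16_t *trail) {
    uint32_t offset = cp - SUPPL_BASE;                    /* 20-bit value */
    *lead  = (uint16_t)(LEAD_BASE  + (offset >> 10));     /* top 10 bits */
    *trail = (uint16_t)(TRAIL_BASE + (offset & 0x3FFu));  /* low 10 bits */
}

/* Combine a well-formed lead/trail pair back into one code point. */
static uint32_t pair_to_cp(uint16_t lead, uint16_t trail) {
    return SUPPL_BASE + (((uint32_t)(lead - LEAD_BASE) << 10) |
                         (uint32_t)(trail - TRAIL_BASE));
}
```

Each surrogate block holds 1k values, so the pair carries 10 + 10 = 20 bits, which is exactly the 100000₁₆ additional code points.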
Parallel with ISO-10646
• ISO-10646 uses 31-bit codes: UCS-4
• UCS-2: 16-bit codes for subset 0..ffff₁₆
• UTF-16: transformation of subset 0..10ffff₁₆
• UTF-8 covers all 31 bits
• Private Use areas above 10ffff₁₆ slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode
21-bit code points
• Code points (“Unicode scalar values”) up to 10ffff₁₆ use 21 bits
• 16-bit code units still good for strings: variable-width like MBCS
• Default string unit size not big enough for code points
• Dual types for programming?
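To illustrate what "variable-width like MBCS" means for 16-bit strings, here is a small sketch that counts code points in a UTF-16 buffer. The function name is hypothetical, and the choice to count an unpaired surrogate as one code point is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 string of `len` code units.
   Hypothetical helper; an unpaired surrogate is counted as one code point. */
static size_t utf16_count_code_points(const uint16_t *s, size_t len) {
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        uint16_t unit = s[i++];
        /* A lead surrogate followed by a trail surrogate is one code point. */
        if (unit >= 0xD800 && unit <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            ++i;
        }
        ++count;
    }
    return count;
}
```

The string length in code units and the number of code points now differ whenever a surrogate pair is present, so code that assumed 1:1 needs to pick one of the two.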
C: char/wchar_t dual types
• C/C++ standards: dual types
• Strings mostly with char units (8 bits)
• Code points: wchar_t, 8..32 bits
• Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; code point type is implementation-dependent
Unicode: dual types, too?
• Strings could continue with 16-bit units
• Single code points could get 32-bit data type
• Dual-type model like C/C++ MBCS
Alternatives to dual 16/32 types
• UTF-32: all types 32 bits wide, fixed-width
• UTF-8: same complexity after range extension beyond just the BMP, closer to C/C++ model – byte-based
• Use pairs of 16-bit units
• Use strings for everything
• Make string unit size flexible 8/16/32 bits
UCS-2 to UTF-32
• Fixed-width, single base type for strings and code points
• UCS-2 programming assumptions mostly intact
• Wastes at least 33% space, typically 50%
• Performance bottleneck between CPU and memory
UCS-2 to UTF-8
• UCS-2 programming assumes many characters in single code units
• Breaks a lot of code
• Same question of type for code points; follow C model, 32-bit wchar_t?
• More difficult transition than other choices
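A sketch of UTF-8 encoding shows why the UCS-2 assumption of one unit per character breaks: everything outside ASCII takes two to four bytes. The function name `utf8_encode` is hypothetical, and the encoder is limited to the range 0..10ffff₁₆ that UTF-16 can also represent.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (0..0x10FFFF, not a surrogate) as UTF-8.
   Returns the number of bytes written (1..4), or 0 for invalid input.
   Hypothetical helper for illustration. */
static size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x80) {                       /* ASCII: 1 byte */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* rest of the BMP: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* supplementary: 4 bytes */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that the argument is already a 32-bit scalar, which is the "type for code points" question from the slide.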
Surrogate pairs for single chars
• Caller avoids code point calculation
• But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors
• Harder to use with code point constants because they are published as scalar values
• Significant change for caller from using scalars
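A rough sketch of what the callee side of a pair-based API might look like. The calling convention here (second unit is 0 for a BMP character) and the name `pair_arg_to_scalar` are assumptions made for illustration; the deck does not specify one.

```c
#include <stdint.h>

#define IS_LEAD(u)  ((u) >= 0xD800 && (u) <= 0xDBFF)
#define IS_TRAIL(u) ((u) >= 0xDC00 && (u) <= 0xDFFF)

/* Hypothetical callee that accepts one character as (first, second).
   Assumed convention: BMP character   -> first = code unit, second = 0;
                       supplementary   -> first = lead, second = trail.
   Returns 0 and stores the scalar value, or -1 for a malformed argument. */
static int pair_arg_to_scalar(uint16_t first, uint16_t second, uint32_t *scalar) {
    if (IS_LEAD(first)) {
        if (!IS_TRAIL(second)) return -1;          /* lead without trail */
        *scalar = 0x10000u + (((uint32_t)(first - 0xD800) << 10) |
                              (uint32_t)(second - 0xDC00));
        return 0;
    }
    if (IS_TRAIL(first) || second != 0) return -1; /* stray trail or extra unit */
    *scalar = first;
    return 0;
}
```

Every such function needs this kind of validation, and a caller holding a published scalar value such as U+1D11E must first split it into D834₁₆ and DD1E₁₆ by hand.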
Strings for single chars
• Always pass in string (and offset)
• Most general, handles graphemes in addition to code points
• Harder to use with code point constants because they are published as scalar values
• Significant change for caller from using scalars
UTF-flexible
• In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF-size as a compile-time choice?
• Adds interoperability with UTF-8/32 APIs
• Almost no assumptions possible
• Complexity of transition even higher than that of a transition to pure UTF-8; performance?
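One way to picture the compile-time choice is a build switch that selects the code unit type. This is only a sketch: `UTF_SIZE`, `unit_t`, and `char_width` are hypothetical names, and the width function assumes well-formed text with the index at the start of a character.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef UTF_SIZE
#define UTF_SIZE 16   /* hypothetical build-time switch: 8, 16, or 32 */
#endif

#if UTF_SIZE == 8
typedef uint8_t  unit_t;
#elif UTF_SIZE == 16
typedef uint16_t unit_t;
#elif UTF_SIZE == 32
typedef uint32_t unit_t;
#else
#error "UTF_SIZE must be 8, 16, or 32"
#endif

/* Number of code units used by the character that starts at s[i].
   Assumes well-formed text and that i is at a character boundary. */
static size_t char_width(const unit_t *s, size_t i) {
#if UTF_SIZE == 8
    uint8_t b = s[i];
    return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
#elif UTF_SIZE == 16
    return (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;
#else
    (void)s; (void)i;
    return 1;   /* UTF-32 is fixed-width */
#endif
}
```

Code written against `unit_t` can assume nothing about how many units a character takes, which is the "almost no assumptions possible" cost named above.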
Interoperability
• Break existing API users no more than necessary
• Interoperability with other APIs: Win32, Java, COM, now also XML DOM
• UTF-16 is Unicode default: good compromise (speed/ease/space)
• String units should stay 16 bits wide
Does everything need to change?
• String operations: search, substring, concatenation, … work with any UTF without change
• Character property lookup and similar: need to support the extended range
• Formatting: should handle more code points or even graphemes
• Careful evaluation of all public APIs
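As a sketch of why plain string operations can stay unchanged, the search below compares raw 16-bit units and never decodes a surrogate pair. The function name `u16_find` is hypothetical. It relies on a property of UTF-16: lead surrogates, trail surrogates, and ordinary BMP units occupy disjoint value ranges, so a well-formed search string can only match at character boundaries of a well-formed text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find `needle` in `hay` by comparing raw 16-bit code units.
   Returns the index of the first match in code units, or (size_t)-1.
   Hypothetical helper; both strings are assumed to be well-formed UTF-16. */
static size_t u16_find(const uint16_t *hay, size_t hay_len,
                       const uint16_t *needle, size_t needle_len) {
    if (needle_len > hay_len) return (size_t)-1;
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        if (memcmp(hay + i, needle, needle_len * sizeof(uint16_t)) == 0) {
            return i;
        }
    }
    return (size_t)-1;
}
```

The caveat is the input: a search string consisting of a lone surrogate could match half of a pair, which is one reason public APIs still deserve a careful review.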
ICU: some of all
• Strings: UTF-16, UChar type remains 16-bit
• New UChar32 for code points
• Provide macros for C to deal with all UTFs: iteration, random access, …
• C++ CharacterIterator: many new functions
• Property lookup/low-level: UChar32
• Formatting: strings for graphemes
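The kind of C macros meant here can be sketched as follows. These are not ICU's actual macros: the `SK_` names are invented for this example, and `UChar` and `UChar32` are local stand-in typedefs matching the widths given above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;    /* local stand-in: 16-bit string unit */
typedef int32_t  UChar32;  /* local stand-in: code point type */

#define SK_IS_LEAD(u)  (((u) & 0xFC00) == 0xD800)
#define SK_IS_TRAIL(u) (((u) & 0xFC00) == 0xDC00)

/* Random access: if s[i] is the trail half of a pair, move i back to the lead. */
#define SK_SET_CP_START(s, start, i) \
    do { \
        if (SK_IS_TRAIL((s)[i]) && (i) > (start) && SK_IS_LEAD((s)[(i) - 1])) { \
            --(i); \
        } \
    } while (0)

/* Iteration: read the code point at s[i] into c and advance i past it.
   An unpaired surrogate is returned as-is. */
#define SK_NEXT(s, i, length, c) \
    do { \
        (c) = (s)[(i)++]; \
        if (SK_IS_LEAD(c) && (i) < (length) && SK_IS_TRAIL((s)[i])) { \
            (c) = 0x10000 + ((((c) - 0xD800) << 10) | ((s)[(i)++] - 0xDC00)); \
        } \
    } while (0)
```

A caller that lands on an arbitrary index first applies `SK_SET_CP_START` and then reads with `SK_NEXT`, so string storage stays 16-bit while the value handed to property lookup is a 32-bit code point.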
Scalar code points: property lookup
• Old, 16-bit: UChar u_tolower(UChar c) { u[v[c₁₅..₇] + c₆..₀]; }
• New, 21-bit: UChar32 u_tolower(UChar32 c) { u[v[w[c₂₀..₁₀] + c₉..₄] + c₃..₀]; }
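The 21-bit lookup splits the code point into bits 20..10, 9..4, and 3..0 and walks three tables. The sketch below builds such a table at run time to show the indexing. It is a simplified illustration, not ICU's data structure: all names, the fixed capacity, and the decision to store a lowercase delta as the property value are assumptions.

```c
#include <stdint.h>

#define STAGE1_LEN   (0x110000 >> 10) /* 1088 entries, indexed by bits 20..10 */
#define STAGE2_BLOCK 64               /* one block covers bits 9..4 */
#define STAGE3_BLOCK 16               /* one block covers bits 3..0 */
#define MAX_BLOCKS   32               /* arbitrary capacity for this sketch */

static uint16_t w[STAGE1_LEN];                 /* start of a stage-2 block */
static uint16_t v[MAX_BLOCKS * STAGE2_BLOCK];  /* start of a stage-3 block */
static int32_t  u[MAX_BLOCKS * STAGE3_BLOCK];  /* values: lowercase delta */
/* Block 0 of stages 2 and 3 is the shared all-zero "no mapping" block. */
static uint16_t v_used = STAGE2_BLOCK;
static uint16_t u_used = STAGE3_BLOCK;

/* Builder: store a value, allocating private blocks on first use.
   Returns 0 on success, -1 if c is out of range or capacity is exhausted. */
static int table_set(uint32_t c, int32_t value) {
    if (c > 0x10FFFF) return -1;
    uint16_t *p1 = &w[c >> 10];
    if (*p1 == 0) {
        if (v_used + STAGE2_BLOCK > MAX_BLOCKS * STAGE2_BLOCK) return -1;
        *p1 = v_used;
        v_used += STAGE2_BLOCK;
    }
    uint16_t *p2 = &v[*p1 + ((c >> 4) & 0x3F)];
    if (*p2 == 0) {
        if (u_used + STAGE3_BLOCK > MAX_BLOCKS * STAGE3_BLOCK) return -1;
        *p2 = u_used;
        u_used += STAGE3_BLOCK;
    }
    u[*p2 + (c & 0xF)] = value;
    return 0;
}

/* Lookup: u[v[w[bits 20..10] + bits 9..4] + bits 3..0]. */
static uint32_t sketch_tolower(uint32_t c) {
    if (c > 0x10FFFF) return c;
    int32_t delta = u[v[w[c >> 10] + ((c >> 4) & 0x3F)] + (c & 0xF)];
    return (uint32_t)((int32_t)c + delta);
}
```

Unassigned ranges all share the zero blocks, so covering the full 21-bit range costs one small top-level index plus blocks only where data exists. Registering `table_set(0x10400, 0x28)`, for example, makes `sketch_tolower` map the Deseret capital U+10400 to U+10428.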
Formatting: grapheme strings
• Old: void setDecimalSymbol(UChar c);
• New: void setDecimalSymbol(const UnicodeString &s);
Codepage conversion
• To Unicode: results are one or two UTF-16 code units, surrogates stored directly in the conversion table
• From Unicode: triple-stage compact array access from 21-bit code points like property lookup
• Single-character-conversion to Unicode now returns UChar32 values
API first…
• Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi)
• Public APIs reviewed and changed (“luxury” of early project stage) or deprecated and superseded by new versions
• Higher-level implementations to follow before Unicode 3.1 published
More implementations follow…
• Collation: need to prepare for >64k primary keys
• Normalization and Transliteration
• Word/Sentence break iteration
• Etc.
• No non-BMP data before Unicode 3.1 is stable
Other libraries
• Java: planning stage for transition
• Win32: rendering and UniScribe API largely UTF-16-ready
• Linux: standardizing on 32-bit Unicode wchar_t, has UTF-8 locales like other Unixes for char* APIs
• W3C: standards assume full UTF-16 range
Summary
• Transition from UCS-2 to UTF-16 gains importance four years after UTF-16 entered the standard
• APIs for single characters need change or new versions
• String APIs: no change
• Implementations need to handle 21-bit code points
• Range of options
Resources
• Unicode FAQ: http://www.unicode.org/unicode/faq/
• Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/
• ICU: http://oss.software.ibm.com/icu/