From UCS-2 to UTF-16: Discussion and practical example for the transition of a Unicode library from UCS-2 to UTF-16

Why is this an issue?

• The concept of the Unicode standard changed during its first few years

• Unicode 2.0 (1996) expanded the code point range from 64k to 1.1M

• APIs and libraries need to follow this change and support the full range

• Upcoming character assignments (Unicode 3.1, 2001) fall into the added range

“Unicode is a 16-bit character set”

• Concept: 16-bit, fixed-width character set

• Saving space by not including precomposed, rarely-used, obsolete, … characters

• Compatibility, transition strategies, and acceptance forced loosening of these principles

• Unicode 3.1: >90k assigned characters

16-bit APIs

• APIs developed for Unicode 1.1 used 16-bit characters and strings: UCS-2

• Assuming 1:1 character:code unit

• Examples: Win32, Java, COM, ICU, Qt/KDE

• Byte-based UTF-8 (1993) mostly for MBCS compatibility and transfer protocols

Extending the range

• Set aside two blocks of 1k 16-bit values, “surrogates”, for extension

• 1k × 1k = 1M = 100000₁₆ additional code points using a pair of code units

• 16-bit form now variable-width UTF-16

• “Unicode scalar values” 0..10ffff₁₆

• Proposed: 1994; part of Unicode 2.0 (1996)
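The pair arithmetic behind the bullets above can be sketched in C. This is a minimal illustration, not code from any library: the function names are hypothetical, and the block boundaries used (d800..dbff for the first unit of a pair, dc00..dfff for the second) are the surrogate ranges defined for UTF-16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names for illustration only. */

/* Encode scalar value c (0..0x10ffff, not itself a surrogate) as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2). */
int utf16_encode(uint32_t c, uint16_t out[2]) {
    assert(c <= 0x10ffff && !(c >= 0xd800 && c <= 0xdfff));
    if (c <= 0xffff) {
        out[0] = (uint16_t)c;                   /* BMP: one unit, unchanged */
        return 1;
    }
    c -= 0x10000;                               /* 20 bits remain: 1k x 1k */
    out[0] = (uint16_t)(0xd800 + (c >> 10));    /* lead: upper 10 bits */
    out[1] = (uint16_t)(0xdc00 + (c & 0x3ff));  /* trail: lower 10 bits */
    return 2;
}

/* Combine a lead/trail pair back into one scalar value. */
uint32_t utf16_combine(uint16_t lead, uint16_t trail) {
    assert(lead >= 0xd800 && lead <= 0xdbff);
    assert(trail >= 0xdc00 && trail <= 0xdfff);
    return 0x10000 + (((uint32_t)lead - 0xd800) << 10)
                   + ((uint32_t)trail - 0xdc00);
}
```

For instance, the scalar value 1d11e₁₆ encodes to the pair d834₁₆ dd1e₁₆, and the last code point 10ffff₁₆ encodes to dbff₁₆ dfff₁₆.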

Parallel with ISO-10646

• ISO-10646 uses 31-bit codes: UCS-4

• UCS-2: 16-bit codes for subset 0..ffff₁₆

• UTF-16: transformation of subset 0..10ffff₁₆

• UTF-8 covers all 31 bits

• Private Use areas above 10ffff₁₆ slated for removal from ISO-10646 for UTF interoperability and synchronization with Unicode

21-bit code points

• Code points (“Unicode scalar values”) up to 10ffff₁₆ use 21 bits

• 16-bit code units still good for strings: variable-width like MBCS

• Default string unit size not big enough for code points

• Dual types for programming?

C: char/wchar_t dual types

• C/C++ standards: dual types

• Strings mostly with char units (8 bits)

• Code points: wchar_t, 8..32 bits

• Typical use in I18N-ed programs: (8-bit) char strings but (16/32-bit) wchar_t (or 32-bit int) characters; code point type is implementation-dependent

Unicode: dual types, too?

• Strings could continue with 16-bit units

• Single code points could get 32-bit data type

• Dual-type model like C/C++ MBCS
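A dual-type model of this kind might look as follows in C. The type and function names here are assumptions made for the sketch (they are not the ICU names): strings stay arrays of 16-bit units, while a single code point read out of a string gets a 32-bit type. Unpaired surrogates are simply returned as they are, which is one possible policy among several.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;       /* string unit (hypothetical name) */
typedef int32_t  CodePoint32;  /* single code point (hypothetical name) */

/* Read the code point starting at s[*i] and advance *i past it.
 * Requires *i < length. An unpaired surrogate is returned unchanged. */
CodePoint32 next_code_point(const Unit16 *s, size_t length, size_t *i) {
    CodePoint32 c = s[(*i)++];
    if (c >= 0xd800 && c <= 0xdbff && *i < length) {
        Unit16 trail = s[*i];
        if (trail >= 0xdc00 && trail <= 0xdfff) {
            ++*i;  /* consume the second unit of the pair */
            c = 0x10000 + ((c - 0xd800) << 10) + (trail - 0xdc00);
        }
    }
    return c;
}

/* Count code points (not code units) in a UTF-16 string. */
size_t count_code_points(const Unit16 *s, size_t length) {
    size_t i = 0, n = 0;
    while (i < length) {
        next_code_point(s, length, &i);
        ++n;
    }
    return n;
}
```

With this split, a four-unit string that contains one surrogate pair holds three code points.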

Alternatives to dual 16/32 types

• UTF-32: all types 32 bits wide, fixed-width

• UTF-8: same complexity after range extension beyond just the BMP, closer to C/C++ model – byte-based

• Use pairs of 16-bit units

• Use strings for everything

• Make string unit size flexible 8/16/32 bits

UCS-2 to UTF-32

• Fixed-width, single base type for strings and code points

• UCS-2 programming assumptions mostly intact

• Wastes at least 33% space, typically 50%

• Performance bottleneck CPU - memory
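The space cost can be made concrete with a small calculation. The following sketch (the struct and function names are hypothetical) sums the bytes needed to store a sequence of code points in each encoding form; for text that stays within the BMP, UTF-32 comes out at exactly twice the size of UTF-16, which matches the "typically 50%" figure above.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte counts for the same text in the three encoding forms. */
typedef struct {
    size_t utf8, utf16, utf32;
} EncodedSizes;

/* cp[] holds n valid scalar values (0..0x10ffff). */
EncodedSizes encoded_sizes(const uint32_t *cp, size_t n) {
    EncodedSizes r = { 0, 0, 0 };
    for (size_t i = 0; i < n; ++i) {
        uint32_t c = cp[i];
        r.utf8  += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        r.utf16 += c < 0x10000 ? 2 : 4;  /* one unit, or a surrogate pair */
        r.utf32 += 4;                    /* always one 32-bit unit */
    }
    return r;
}
```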

UCS-2 to UTF-8

• UCS-2 programming assumes many characters in single code units

• Breaks a lot of code

• Same question of type for code points; follow C model, 32-bit wchar_t?

• More difficult transition than other choices

Surrogate pairs for single chars

• Caller avoids code point calculation

• But: caller and callee need to detect and handle pairs: caller choosing argument values, callee checking for errors

• Harder to use with code point constants because they are published as scalar values

• Significant change for caller from using scalars
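To illustrate the trade-off, here is a hedged comparison of a scalar-style and a pair-style function. Both functions are invented for this sketch, as is the convention that a second argument of 0 means "single unit". The range 1d100..1d1ff is the Musical Symbols block, which lies outside the BMP.

```c
#include <stdint.h>

/* Scalar style: a published constant such as 0x1d11e is usable directly. */
int is_musical_symbol(uint32_t c) {
    return c >= 0x1d100 && c <= 0x1d1ff;
}

/* Pair style (hypothetical convention: second == 0 means "single unit").
 * The callee must validate the pair; returns -1 for ill-formed input. */
int is_musical_symbol_pair(uint16_t first, uint16_t second) {
    int first_is_lead   = (first  & 0xfc00) == 0xd800;
    int first_is_trail  = (first  & 0xfc00) == 0xdc00;
    int second_is_trail = (second & 0xfc00) == 0xdc00;

    if (second == 0) {
        if (first_is_lead || first_is_trail) {
            return -1;                    /* unpaired surrogate */
        }
        return is_musical_symbol(first);
    }
    if (!first_is_lead || !second_is_trail) {
        return -1;                        /* not a lead/trail pair */
    }
    return is_musical_symbol(0x10000
                             + (((uint32_t)first - 0xd800) << 10)
                             + ((uint32_t)second - 0xdc00));
}
```

A caller that starts from the published value 1d11e₁₆ has to compute d834₁₆ and dd1e₁₆ itself before it can use the pair-style function.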

Strings for single chars

• Always pass in string (and offset)

• Most general, handles graphemes in addition to code points

• Harder to use with code point constants because they are published as scalar values

• Significant change for caller from using scalars

UTF-flexible

• In principle, if the implementation can handle variable-width, MBCS-style strings, could it handle any UTF-size as a compile-time choice?

• Adds interoperability with UTF-8/32 APIs

• Almost no assumptions possible

• Complexity of transition even higher than that of a transition to pure UTF-8; performance?

Interoperability

• Break existing API users no more than necessary

• Interoperability with other APIs: Win32, Java, COM, now also XML DOM

• UTF-16 is Unicode default: good compromise (speed/ease/space)

• String units should stay 16 bits wide

Does everything need to change?

• String operations: search, substring, concatenation, … work with any UTF without change

• Character property lookup and similar: need to support the extended range

• Formatting: should handle more code points or even graphemes

• Careful evaluation of all public APIs
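The first point can be checked with a plain code-unit search. The sketch below (the function name is hypothetical) compares 16-bit units only and knows nothing about surrogates. Because the lead and trail ranges do not overlap each other or the rest of the BMP, a well-formed search pattern cannot match starting in the middle of a pair.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the index of the first occurrence of needle in hay, counted in
 * 16-bit units, or -1 if there is none. No surrogate handling needed. */
long find_units(const uint16_t *hay, size_t hay_len,
                const uint16_t *needle, size_t needle_len) {
    if (needle_len > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + needle_len <= hay_len; ++i) {
        if (memcmp(hay + i, needle, needle_len * sizeof(uint16_t)) == 0) {
            return (long)i;
        }
    }
    return -1;
}
```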

ICU: some of all

• Strings: UTF-16, UChar type remains 16-bit

• New UChar32 for code points

• Provide macros for C to deal with all UTFs: iteration, random access, …

• C++ CharacterIterator: many new functions

• Property lookup/low-level: UChar32

• Formatting: strings for graphemes
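As an illustration of what such C macros could look like for UTF-16, here is a minimal sketch covering random access and backward iteration. The SKETCH_ names are made up for this example and are not the ICU macro names; the code point variable is assumed to be a 32-bit unsigned integer.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IS_LEAD(u)  (((u) & 0xfc00) == 0xd800)
#define SKETCH_IS_TRAIL(u) (((u) & 0xfc00) == 0xdc00)

/* Random access: if index i points at the second unit of a pair,
 * move it back to the first unit. */
#define SKETCH_SET_TO_START(s, start, i) \
    do { \
        if ((i) > (start) && SKETCH_IS_TRAIL((s)[i]) && \
            SKETCH_IS_LEAD((s)[(i) - 1])) { \
            --(i); \
        } \
    } while (0)

/* Backward iteration: move i to the start of the previous code point
 * and store that code point in c (a uint32_t). Requires i > start. */
#define SKETCH_PREV(s, start, i, c) \
    do { \
        (c) = (s)[--(i)]; \
        if (SKETCH_IS_TRAIL(c) && (i) > (start) && \
            SKETCH_IS_LEAD((s)[(i) - 1])) { \
            --(i); \
            (c) = 0x10000 + (((uint32_t)(s)[i] - 0xd800) << 10) \
                          + ((uint32_t)(c) - 0xdc00); \
        } \
    } while (0)
```

Macros rather than functions keep the common BMP case to a load and a comparison, at the cost of evaluating their arguments more than once.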

Scalar code points: property lookup

• Old, 16-bit: UChar u_tolower(UChar c) { u[v[c₁₅..₇] + c₆..₀]; }

• New, 21-bit: UChar32 u_tolower(UChar32 c) { u[v[w[c₂₀..₁₀] + c₉..₄] + c₃..₀]; }
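The three-stage lookup in the "New" line can be sketched as runnable C. Only the index split (bits 20..10, 9..4, 3..0) comes from the slide. Everything else is an assumption made for illustration: the table contents (a signed delta to the lowercase form), the tiny fixed-size block pools, the on-demand builder, and all names. Block 0 of each stage serves as the shared all-zero default, so unassigned ranges cost no extra storage.

```c
#include <stdint.h>

enum {
    STAGE1_SIZE  = 0x110000 >> 10,  /* 1088 entries, indexed by bits 20..10 */
    STAGE2_BLOCK = 1 << 6,          /* 64 entries, indexed by bits 9..4 */
    STAGE3_BLOCK = 1 << 4,          /* 16 entries, indexed by bits 3..0 */
    MAX_BLOCKS   = 8                /* tiny pool, enough for a demo */
};

static uint16_t w[STAGE1_SIZE];                /* stage 1: offsets into v */
static uint16_t v[MAX_BLOCKS * STAGE2_BLOCK];  /* stage 2: offsets into u */
static int32_t  u[MAX_BLOCKS * STAGE3_BLOCK];  /* stage 3: case deltas */
static int v_blocks = 1, u_blocks = 1;         /* block 0 = shared default */

/* Store a value for c, allocating private blocks on demand.
 * Returns 1 on success, 0 if c is out of range or the pools are full. */
int trie_set(int32_t c, int32_t value) {
    if (c < 0 || c > 0x10ffff) return 0;
    int i1 = c >> 10, i2 = (c >> 4) & 0x3f, i3 = c & 0xf;
    if (w[i1] == 0) {
        if (v_blocks == MAX_BLOCKS) return 0;
        w[i1] = (uint16_t)(v_blocks++ * STAGE2_BLOCK);
    }
    if (v[w[i1] + i2] == 0) {
        if (u_blocks == MAX_BLOCKS) return 0;
        v[w[i1] + i2] = (uint16_t)(u_blocks++ * STAGE3_BLOCK);
    }
    u[v[w[i1] + i2] + i3] = value;
    return 1;
}

/* The lookup itself: u[v[w[c20..10] + c9..4] + c3..0]. */
int32_t trie_get(int32_t c) {
    if (c < 0 || c > 0x10ffff) return 0;
    return u[v[w[c >> 10] + ((c >> 4) & 0x3f)] + (c & 0xf)];
}

/* Lowercase mapping as "code point plus stored delta". */
int32_t sketch_tolower(int32_t c) {
    return c + trie_get(c);
}
```

After `trie_set(0x41, 0x20)` and `trie_set(0x10400, 0x28)`, the same three array accesses serve both a BMP letter and a supplementary one, and every code point without an entry falls through to the default blocks and maps to itself.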

Formatting: grapheme strings

• Old: void setDecimalSymbol(UChar c);

• New: void setDecimalSymbol(const UnicodeString &s);

Codepage conversion

• To Unicode: results are one or two UTF-16 code units, surrogates stored directly in the conversion table

• From Unicode: triple-stage compact array access from 21-bit code points like property lookup

• Single-character-conversion to Unicode now returns UChar32 values
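The "to Unicode" direction can be illustrated with a toy table. The codepage below is invented purely for this sketch (bytes below 80₁₆ map to themselves, two further bytes have table entries), and the names are hypothetical. It shows both points made above: the table holds ready-made UTF-16 units, including a surrogate pair, and the single-character function returns one 32-bit code point.

```c
#include <stdint.h>

/* One table entry: the UTF-16 result for a codepage byte. */
typedef struct {
    uint16_t units[2];
    uint8_t  count;   /* 1 or 2 UTF-16 units */
} ToUnicodeEntry;

/* Toy codepage (made up for this example), indexed by byte - 0x80. */
static const ToUnicodeEntry toy_table[] = {
    { { 0x20ac, 0x0000 }, 1 },  /* 0x80 -> U+20AC */
    { { 0xd834, 0xdd1e }, 2 },  /* 0x81 -> U+1D11E, stored as a pair */
};

/* String conversion: copy the stored units; returns 0 if unmapped. */
int toy_to_utf16(uint8_t b, uint16_t out[2]) {
    if (b < 0x80) {
        out[0] = b;
        return 1;
    }
    if (b - 0x80 >= (int)(sizeof toy_table / sizeof toy_table[0])) {
        return 0;
    }
    const ToUnicodeEntry *e = &toy_table[b - 0x80];
    for (int i = 0; i < e->count; ++i) {
        out[i] = e->units[i];
    }
    return e->count;
}

/* Single-character conversion: one 32-bit code point, or -1 if unmapped. */
int32_t toy_get_next_char(uint8_t b) {
    uint16_t units[2];
    int n = toy_to_utf16(b, units);
    if (n == 0) return -1;
    if (n == 1) return units[0];
    return 0x10000 + ((units[0] - 0xd800) << 10) + (units[1] - 0xdc00);
}
```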

API first…

• Tools and basic functions and classes are in place (property lookup, conversion, iterators, BiDi)

• Public APIs reviewed and changed (“luxury” of early project stage) or deprecated and superseded by new versions

• Higher-level implementations to follow before Unicode 3.1 published

More implementations follow…

• Collation: need to prepare for >64k primary keys

• Normalization and Transliteration

• Word/Sentence break iteration

• Etc.

• No non-BMP data before Unicode 3.1 is stable

Other libraries

• Java: planning stage for transition

• Win32: rendering and UniScribe API largely UTF-16-ready

• Linux: standardizing on 32-bit Unicode wchar_t, has UTF-8 locales like other Unixes for char* APIs

• W3C: standards assume full UTF-16 range

Summary

• Transition from UCS-2 to UTF-16 gains importance, four years after UTF-16 became part of the standard

• APIs for single characters need change or new versions

• String APIs: no change

• Implementations need to handle 21-bit code points

• Range of options

Resources

• Unicode FAQ: http://www.unicode.org/unicode/faq/

• Unicode on IBM developerWorks: http://www.ibm.com/developer/unicode/

• ICU: http://oss.software.ibm.com/icu/
