San Jose, California, September 2002
Compact Encodings of Unicode
Markus W. Scherer
Unicode/G11N Software Engineer
IBM Globalization Center of Competency
22nd International Unicode Conference
Agenda
• Encodings in files and protocols
  – Not: processing encoding forms
• Unicode “is too big”
  – Issues and non-issues
• How to reduce size of Unicode text
  – Choice of encoding
  – Optional compression
• Examples and comparisons
What is ICU?
• Internationalization libraries for C, C++, Java*
  – Open source, non-viral
  – Sponsored by IBM
• Unicode standard compliant
  – Full supplementary support
• Cross-platform; extensible and customizable
• High performance and thread-safe
  – Multiple locales in same thread, simultaneously
• Converters for all Unicode charsets & hundreds of legacy codepages
• http://oss.software.ibm.com/icu/
* Sun’s Java licenses an earlier ICU version; ICU4J updates it.
Encodings of Unicode
• Common Unicode character set
• External encodings
  – Files and protocols
  – Almost always byte-serialized
  – Character Encoding Schemes/charsets
• Processing encodings
  – Character Encoding Forms, often 16/32-bit
  – Different requirements
  – Topic for a different presentation…
Unicode “is too big”?
• Perceived large size of Unicode text
  – Compared with legacy codepages
• Size matters
  – Low-speed connections (dial-up, mobile)
  – Little memory (PDA, cell phone, embedded)
• Size does not matter when…
  – Images & other binaries swamp text size
  – High-speed network
  – Temporary documents
  – Large amounts of memory
How big is it?
• Size depends on language/script
• Bytes/char for some language groups:

Languages        Legacy   UTF-8   UTF-16
Western/Latin    1        1       2
Russian/Arabic   1        2       2
Hindi/Thai       1        3       2
CJK              2        3       2
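
The figures in the table can be checked with any converter library. The following is a minimal sketch using Python's built-in codecs; the sample words and the legacy codepages chosen for each row (cp1252, cp1251, tis_620, shift_jis) are assumptions made for illustration and are not taken from the slides.

```python
# Average bytes per character for short sample strings.
# Sample words and legacy codepage choices are illustrative assumptions.
SAMPLES = {
    "Western/Latin": ("Hello", "cp1252"),
    "Russian": ("Привет", "cp1251"),
    "Thai": ("สวัสดี", "tis_620"),
    "Japanese": ("日本語", "shift_jis"),
}


def bytes_per_char(text: str, encoding: str) -> float:
    """Encoded length in bytes divided by the number of code points."""
    return len(text.encode(encoding)) / len(text)


def size_table() -> dict:
    """Map each sample name to (legacy, UTF-8, UTF-16) bytes per character."""
    table = {}
    for name, (text, legacy) in SAMPLES.items():
        table[name] = tuple(
            bytes_per_char(text, enc) for enc in (legacy, "utf-8", "utf-16-be")
        )
    return table


if __name__ == "__main__":
    for name, (legacy, utf8, utf16) in size_table().items():
        print(f"{name:15} legacy={legacy:.0f} utf-8={utf8:.0f} utf-16={utf16:.0f}")
```

Real text mixes in spaces, digits and ASCII punctuation, so measured averages for UTF-8 will usually fall somewhat below the nominal per-script values.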
Legacy codepages
• Compact because
  – Designed for single/few languages
  – Few characters compared with Unicode
• Conversion problems
  – Fallback/substitution of unmappable chars
  – Mapping table differences
  – Loss of parts of text common
• Large number/size of mapping tables
Reduce Unicode text size
• Choice of encoding
  – Encodings designed for different purposes
  – Compactness vs. direct applicability vs. software support etc.
• General-purpose compression
  – Best on top of compact encoding
  – Not available in all applications
UTF-8/16
• Designed for processing but all-purpose
• UTF-8:
  – Byte-based, ASCII-compatible
  – BMP: up to 3 bytes/char
• UTF-16 (BE/LE):
  – Byte-serialization of 16-bit form, not ASCII-compatible
  – BE/LE forms or Byte Order Mark
  – BMP: always 2 bytes/char
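
The byte-order and length properties listed here can be observed directly. This is a small sketch using Python's standard codecs; the helper names `utf16_forms` and `utf8_len` are made up for this example.

```python
def utf16_forms(text: str) -> dict:
    """Serialize the same text in the three UTF-16 encoding schemes."""
    return {
        "utf-16-be": text.encode("utf-16-be"),  # big-endian, no BOM
        "utf-16-le": text.encode("utf-16-le"),  # little-endian, no BOM
        "utf-16": text.encode("utf-16"),        # BOM first, then platform byte order
    }


def utf8_len(code_point: int) -> int:
    """Number of UTF-8 bytes for one code point (surrogates are not valid input)."""
    return len(chr(code_point).encode("utf-8"))


if __name__ == "__main__":
    for name, data in utf16_forms("A").items():
        print(f"{name:10} {data.hex(' ')}")
    for cp in (0x41, 0x7FF, 0xFFFF, 0x10000):
        print(f"U+{cp:04X}: {utf8_len(cp)} UTF-8 byte(s)")
```

Note that even the letter "A" contains a zero byte in UTF-16, which is why the serialization is not ASCII-compatible.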
UTF-7
• 7-bit encoding designed for email
  – Obsolete: email now 8-bit-safe
• Partially ASCII-compatible
• BMP: 2.67 bytes/char plus overhead
  – Base64-encoded UTF-16BE
• Stateful
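
The 2.67 figure follows from the base64 step: each base64 character carries 6 bits, so one 16-bit UTF-16 code unit needs 16/6 ≈ 2.67 characters, plus the shift characters that switch in and out of base64 mode. A short sketch with Python's built-in `utf-7` codec; the sample string and the helper name `utf7_parts` are assumptions for illustration.

```python
import base64


def utf7_parts(text: str):
    """Return the UTF-7 bytes and the unpadded base64 of the UTF-16BE form."""
    encoded = text.encode("utf-7")
    payload = base64.b64encode(text.encode("utf-16-be")).rstrip(b"=")
    return encoded, payload


if __name__ == "__main__":
    sample = "日本語"  # three BMP characters, none of them ASCII
    encoded, payload = utf7_parts(sample)
    print("UTF-7 bytes:   ", encoded)
    print("base64 payload:", payload)
    print("payload bytes per char:", round(len(payload) / len(sample), 2))
```

The payload appears inside the UTF-7 output, surrounded by the shift characters that make up the "plus overhead" part.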
SCSU & BOCU-1
• About as compact as legacy codepages
  – 1 byte/char for small scripts, 2 for CJK; stateful
  – Compress short strings better than LZ-based compression (zip) etc.
• SCSU:
  – Limited* ASCII compatibility (initial state)
  – Complex state, many encoding choices
  – Non-deterministic; arbitrary byte values
  – Established encoding, supported in
    • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
BOCU-1
• BOCU-1:
  – Delta-encoding; avoids control codes
  – MIME text-compatible but not ASCII
  – Deterministic
  – Preserves binary order (for sorting, databases)
  – New encoding; supported by ICU
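
To see why delta-encoding suits small scripts, consider the toy model below. It is not the BOCU-1 algorithm and does not produce BOCU-1 bytes: it only computes the difference between each code point and a running "previous" value. The initial value 0x40, the 128-code-point window, and the one-byte range of -64..63 are assumptions chosen for this sketch.

```python
def toy_deltas(text: str, initial_prev: int = 0x40) -> list:
    """Differences between each code point and a running 'previous' value.

    Toy model only: after each character, 'previous' moves to the middle of
    that character's 128-code-point block (an assumption of this sketch), so
    following characters from the same small script stay close to it.
    """
    prev = initial_prev
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)
        prev = (cp & ~0x7F) + 0x40  # middle of the current 128-wide block
    return deltas


def small_delta_share(text: str, low: int = -64, high: int = 63) -> float:
    """Fraction of deltas that would fit a hypothetical one-byte range."""
    deltas = toy_deltas(text)
    return sum(low <= d <= high for d in deltas) / len(deltas)


if __name__ == "__main__":
    print(toy_deltas("Привет"))
    print(small_delta_share("Привет"))
```

For the Russian sample only the first character produces a large difference (the jump into the Cyrillic block); every later difference is small, which is the property a delta encoder exploits to reach about 1 byte/char.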
SCSU & BOCU-1 text sizes
• Average bytes/char relative to UTF-8

Languages        SCSU   BOCU-1
English/French   100%   100%
Russian/Arabic   55%    60%
Hindi            40%    40%
Thai             40%    45%
Japanese         55%    60%
Korean           85%    75%
Chinese          70%    75%
Encoding vs. compression
• For example: BOCU-1 with WinZip

(sum of seven files)   UTF-8         BOCU-1
Uncompressed           32024 bytes   20723 bytes
Compressed             11659 bytes   10722 bytes
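
The relative savings are easier to read as percentages. The sketch below only does arithmetic on the four byte counts from the table; the dictionary layout and function names are made up for this example.

```python
# Byte counts copied from the table (sum of seven files).
SIZES = {
    "UTF-8": {"uncompressed": 32024, "compressed": 11659},
    "BOCU-1": {"uncompressed": 20723, "compressed": 10722},
}


def percent(part: int, whole: int) -> float:
    """part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summary() -> dict:
    utf8, bocu1 = SIZES["UTF-8"], SIZES["BOCU-1"]
    return {
        "bocu1_vs_utf8_uncompressed": percent(bocu1["uncompressed"], utf8["uncompressed"]),
        "bocu1_vs_utf8_compressed": percent(bocu1["compressed"], utf8["compressed"]),
        "zip_on_utf8": percent(utf8["compressed"], utf8["uncompressed"]),
        "zip_on_bocu1": percent(bocu1["compressed"], bocu1["uncompressed"]),
    }


if __name__ == "__main__":
    for key, value in summary().items():
        print(f"{key:28} {value}%")
```

In these numbers BOCU-1 is roughly 65% of the UTF-8 size before compression but about 92% after it, so most of the encoding advantage disappears once a general-purpose compressor is applied, although the compact encoding still comes out slightly smaller.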
Performance
• Converter performance
  – Roundtrip to/from UTF-16 with ICU:
    • SCSU: 45%..125% of UTF-8 roundtrip time
    • BOCU-1: 40%..160% of UTF-8 roundtrip time
• Depends on encoding ratio
  – Fast for small scripts, 1 byte/char
• Separate compression adds to I/O time
• Conversion time typically swamped by
  – Transmission (low-bandwidth connections)
    • Shorter texts transmit faster!
  – Parsing/processing
Further considerations
• In-document encoding declarations require ASCII readability (XML, HTML)
• Protocol may limit byte values (SMTP)
  – Transfer Encoding Syntax (TES) required for some encodings
    • base64 for SCSU or UTF-16 in emails
    • Increases text size
• Compression removes ASCII readability and uses arbitrary byte values
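
The size cost of a transfer encoding is easy to quantify for base64, which emits 4 output characters for every 3 input bytes. A minimal sketch using Python's standard `base64` module; the sample text and the helper name `base64_overhead` are assumptions, and the line breaks that mail software inserts into long base64 bodies are not modeled.

```python
import base64


def base64_overhead(text: str, encoding: str = "utf-16-be"):
    """Return (raw byte count, base64 byte count) for the encoded text."""
    raw = text.encode(encoding)
    wrapped = base64.b64encode(raw)  # 4 characters per 3 input bytes, padded
    return len(raw), len(wrapped)


if __name__ == "__main__":
    raw_len, wrapped_len = base64_overhead("Привет мир")
    print(f"UTF-16BE: {raw_len} bytes, after base64: {wrapped_len} bytes")
    print(f"growth: {wrapped_len / raw_len:.2f}x")
```

The growth of about one third applies on top of whatever the underlying encoding needs, which can cancel part of the benefit of a compact encoding.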
Conclusion
• UTF-8 and/or UTF-16 work in most cases
• Size of text often not critical
• When small text size needed:
  – Use SCSU or BOCU-1
  – Consider compression
  – Make sure receiver can handle it
References
• Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
• Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/
• SCSU: UTS #6 http://www.unicode.org/reports/tr6/
• BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html
• ICU homepage: http://oss.software.ibm.com/icu/
• Unicode Consortium: http://www.unicode.org/
• IBM developerWorks: http://www.ibm.com/developerworks/unicode/