San Jose, California, September 2002 Compact Encodings of Unicode Markus W. Scherer Unicode/G11N Software Engineer IBM Globalization Center of Competency


San Jose, California, September 2002

Compact Encodings of Unicode

Markus W. Scherer

Unicode/G11N Software Engineer

IBM Globalization Center of Competency


San Jose, California, September 2002, 22nd International Unicode Conference

Agenda

• Encodings in files and protocols
  – Not: Processing encoding forms

• Unicode “is too big”
  – Issues and non-issues

• How to reduce size of Unicode text
  – Choice of encoding
  – Optional compression

• Examples and comparisons


What is ICU?

• Internationalization libraries for C, C++, Java*
  – Open source, non-viral
  – Sponsored by IBM
  * Sun’s Java licenses an earlier ICU version; ICU4J updates it.

• Unicode standard compliant
  – Full supplementary support

• Cross-platform; extensible and customizable

• High performance and thread-safe
  – Multiple locales in same thread, simultaneously

• Converters for all Unicode charsets & hundreds of legacy codepages

• http://oss.software.ibm.com/icu/


Encodings of Unicode

• Common Unicode character set

• External encodings
  – Files and protocols
  – Almost always byte-serialized
  – Character Encoding Schemes/charsets

• Processing encodings
  – Character Encoding Forms, often 16/32-bit
  – Different requirements
  – Topic for different presentation…


Unicode “is too big”?

• Perceived large size of Unicode text
  – Compared with legacy codepages

• Size matters
  – Low-speed connections (dial-up, mobile)
  – Little memory (PDA, cell phone, embedded)

• Size does not matter when…
  – Images & other binaries swamp text size
  – High-speed network
  – Temporary documents
  – Large amounts of memory


How big is it?

• Size depends on language/script

• Bytes/char for some language groups:

  Languages        Legacy   UTF-8   UTF-16
  Western/Latin    1        1       2
  Russian/Arabic   1        2       2
  Hindi/Thai       1        3       2
  CJK              2        3       2
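The UTF-8 and UTF-16 columns can be checked with a few lines of Python. This is a minimal sketch: the sample strings and the helper name `bytes_per_char` are illustrative assumptions, not material from the talk, and legacy codepages are left out because the right codepage differs per language.

```python
# Sample strings are illustrative assumptions, not data from the talk.
samples = {
    "Western/Latin": "Unicode",
    "Russian": "Москва",
    "Hindi": "हिन्दी",
    "CJK": "日本語",
}


def bytes_per_char(text: str, encoding: str) -> float:
    """Average number of encoded bytes per code point (hypothetical helper)."""
    return len(text.encode(encoding)) / len(text)


for name, text in samples.items():
    # "utf-16-be" is used so that no byte order mark is counted.
    print(name, bytes_per_char(text, "utf-8"), bytes_per_char(text, "utf-16-be"))
```

All samples are BMP text, so UTF-16 stays at 2 bytes per character while UTF-8 ranges from 1 to 3.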


Legacy codepages

• Compact because
  – Designed for single/few languages
  – Few characters compared with Unicode

• Conversion problems
  – Fallback/substitution of unmappable chars
  – Mapping table differences
  – Loss of parts of text common

• Large number/size of mapping tables


Reduce Unicode text size

• Choice of encoding
  – Encodings designed for different purposes
  – Compactness vs. direct applicability vs. software support etc.

• General-purpose compression
  – Best on top of compact encoding
  – Not available in all applications


UTF-8/16

• Designed for processing but all-purpose

• UTF-8:
  – Byte-based, ASCII-compatible
  – BMP: up to 3 bytes/char

• UTF-16 (BE/LE):
  – Byte-serialization of 16-bit form, not ASCII-compatible
  – BE/LE forms or Byte Order Mark
  – BMP: always 2 bytes/char
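The byte-order point is easiest to see on actual bytes. The following sketch uses Python's built-in codecs; the sample text is an arbitrary assumption.

```python
import codecs

text = "Aé"  # U+0041, U+00E9 (arbitrary sample)

be = text.encode("utf-16-be")  # big-endian serialization, no BOM
le = text.encode("utf-16-le")  # little-endian serialization, no BOM
utf8 = text.encode("utf-8")    # ASCII "A" stays a single byte 0x41

print(be.hex(" "))    # 00 41 00 e9
print(le.hex(" "))    # 41 00 e9 00
print(utf8.hex(" "))  # 41 c3 a9

# A Byte Order Mark (U+FEFF) in front of the data tells the receiver
# which serialization was used; the generic "utf-16" decoder honors it.
with_bom = codecs.BOM_UTF16_BE + be
assert with_bom.decode("utf-16") == text
```

The zero bytes in both UTF-16 serializations are what makes the form incompatible with ASCII-based tools.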


UTF-7

• 7-bit encoding designed for email
  – Obsolete: email now 8-bit-safe

• Partially ASCII-compatible

• BMP: 2.67 bytes/char plus overhead
  – Base64-encoded UTF-16BE

• Stateful
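The 2.67 figure follows from base64 carrying 6 bits per output character: a 16-bit UTF-16 code unit needs 16/6 ≈ 2.67 characters. A small sketch with Python's standard `base64` module and built-in `utf-7` codec, using an arbitrary sample string:

```python
import base64

text = "日本語"  # three BMP characters (arbitrary sample)

utf16 = text.encode("utf-16-be")   # 3 code units = 6 bytes
payload = base64.b64encode(utf16)  # 6 bytes become 8 base64 characters
print(len(payload) / len(text))    # about 2.67 characters per code unit

# Python's UTF-7 codec wraps the base64 run in shift characters,
# which is the "plus overhead" part and the reason the encoding is stateful.
encoded = text.encode("utf-7")
print(encoded, len(encoded))
assert encoded.decode("utf-7") == text

# Plain ASCII letters pass through unchanged.
assert "Hi".encode("utf-7") == b"Hi"
```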


SCSU & BOCU-1

• About as compact as legacy codepages
  – 1 byte/char for small scripts, 2 for CJK; stateful
  – Compress short strings better than LZW (zip) etc.

• SCSU:
  – Limited* ASCII compatibility (initial state)
  – Complex state, many encoding choices
  – Non-deterministic; arbitrary byte values
  – Established encoding, supported in
    • Various tools & editors (SC UniPad), ICU, Symbian OS (cell phones/PDAs)
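To illustrate the "1 byte/char for small scripts" and short-string claims, here is a toy sketch of one corner of SCSU only: selecting the predefined Cyrillic window and then emitting one byte per character. The function name `toy_scsu_cyrillic` is hypothetical, the tag and window values are taken from UTS #6 as best understood here, and zlib's DEFLATE stands in for "zip"-style compression. A real encoder handles many more cases; use a library such as ICU for actual SCSU.

```python
import zlib

SC2 = 0x12              # SCSU tag byte: select dynamic window 2
WINDOW2_START = 0x0400  # default start of window 2 (Cyrillic) per UTS #6


def toy_scsu_cyrillic(text: str) -> bytes:
    """Encode text whose characters all fall in U+0400..U+047F.

    Hypothetical helper covering a single SCSU window, not a full encoder.
    """
    out = bytearray([SC2])
    for ch in text:
        offset = ord(ch) - WINDOW2_START
        if not 0 <= offset < 0x80:
            raise ValueError(f"{ch!r} is outside the Cyrillic window")
        out.append(0x80 + offset)  # one byte per character
    return bytes(out)


text = "Москва"
utf8 = text.encode("utf-8")
scsu = toy_scsu_cyrillic(text)

# Sizes in bytes: UTF-8, toy SCSU, and zlib-compressed UTF-8.
print(len(utf8), len(scsu), len(zlib.compress(utf8)))
```

For a string this short, the general-purpose compressor's framing overhead makes its output larger than the uncompressed UTF-8, while the window-based form needs one tag byte plus one byte per character.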


BOCU-1

• BOCU-1:
  – Delta-encoding; avoids control codes
  – MIME text-compatible but not ASCII
  – Deterministic
  – Preserves binary order (for sorting, databases)
  – New encoding; supported by ICU
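The idea behind delta-encoding can be sketched by looking at differences between consecutive code points. This is a simplified illustration, not BOCU-1 itself: the helper `deltas` is hypothetical, the starting reference value of 0x40 is an assumption, and the real algorithm adjusts the reference value and maps deltas to byte sequences in ways not shown here.

```python
def deltas(text: str, start: int = 0x40) -> list[int]:
    """Differences between consecutive code points (hypothetical helper).

    Simplified: real BOCU-1 does not use the raw previous code point as its
    reference and treats some characters specially.
    """
    prev = start  # assumed initial reference value
    result = []
    for ch in text:
        cp = ord(ch)
        result.append(cp - prev)
        prev = cp
    return result


print(deltas("привет"))  # [1023, 1, -8, -6, 3, 13]
```

After the first large jump into the Cyrillic block, every difference is small. Small differences are what a delta-encoding can store compactly, which is why text in a single small script approaches one byte per character.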


SCSU & BOCU-1 text sizes

• Average bytes/char relative to UTF-8

  Languages        SCSU   BOCU-1
  English/French   100%   100%
  Russian/Arabic   55%    60%
  Hindi            40%    40%
  Thai             40%    45%
  Japanese         55%    60%
  Korean           85%    75%
  Chinese          70%    75%


Encoding vs. compression

• For example: BOCU-1 with WinZip

  (sum of seven files)   UTF-8         BOCU-1
  Uncompressed           32024 bytes   20723 bytes
  Compressed             11659 bytes   10722 bytes
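Expressed relative to uncompressed UTF-8, these figures work out to roughly 65% for BOCU-1 alone, 36% for zipped UTF-8, and 33% for zipped BOCU-1. The arithmetic, using only the numbers from the table (the dictionary layout and the helper `ratio` are illustrative):

```python
# Byte counts copied from the table above (sum of seven files).
sizes = {
    ("UTF-8", "uncompressed"): 32024,
    ("UTF-8", "compressed"): 11659,
    ("BOCU-1", "uncompressed"): 20723,
    ("BOCU-1", "compressed"): 10722,
}

baseline = sizes[("UTF-8", "uncompressed")]


def ratio(encoding: str, state: str) -> float:
    """Size relative to uncompressed UTF-8 (hypothetical helper)."""
    return sizes[(encoding, state)] / baseline


for encoding, state in sizes:
    print(f"{encoding:7} {state:13} {ratio(encoding, state):.1%}")
```

Compression narrows the gap between the two encodings, but the compact encoding still comes out ahead both before and after zipping.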


Performance

• Converter performance
  – Roundtrip to/from UTF-16 with ICU:
    • SCSU: 45%..125% of UTF-8 roundtrip time
    • BOCU-1: 40%..160% of UTF-8 roundtrip time

• Depends on encoding ratio
  – Fast for small scripts, 1 byte/char

• Separate compression adds to I/O time

• Conversion time typically swamped by
  – Transmission (low-bandwidth connections)
    • Shorter texts transmit faster!
  – Parsing/processing


Further considerations

• In-document encoding declarations require ASCII readability (XML, HTML)

• Protocol may limit byte values (SMTP)
  – TES (transfer encoding syntax) required for some encodings
    • base64 for SCSU or UTF-16 in emails
    • Increases text size

• Compression removes ASCII readability and uses arbitrary byte values
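The size cost of a transfer encoding is easy to quantify for base64: every 3 input bytes become 4 ASCII characters. A minimal sketch using Python's standard `base64` module on an arbitrary sample string:

```python
import base64

text = "Москва"                    # arbitrary sample
raw = text.encode("utf-16-be")     # 12 bytes, including control-range values
wrapped = base64.b64encode(raw)    # ASCII-only transfer form

print(len(raw), len(wrapped))      # 12 16
print(wrapped.isascii())           # True: safe for a 7-bit channel

# The receiver reverses both layers to get the text back.
assert base64.b64decode(wrapped).decode("utf-16-be") == text
```

Here the 4/3 expansion turns 12 bytes into 16, as many as a UTF-8 encoding of the same text wrapped the same way, so a transfer encoding can cancel much of what a compact encoding saves.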


Conclusion

• UTF-8 and/or UTF-16 work in most cases

• Size of text often not critical

• When small text size needed:
  – Use SCSU or BOCU-1
  – Consider compression
  – Make sure receiver can handle it


References

• Forms of Unicode: http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

• Character Encoding Model: UTR #17 http://www.unicode.org/reports/tr17/

• SCSU: UTS #6 http://www.unicode.org/reports/tr6/

• BOCU-1: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/conversion/bocu1/bocu1.html

• ICU homepage: http://oss.software.ibm.com/icu/

• Unicode Consortium: http://www.unicode.org/

• IBM developerWorks: http://www.ibm.com/developerworks/unicode/