Unicode (and Java) Brice Giesbrecht. Objective of Presentation The need for Unicode How it works Differentiate between encodings How to get your browser

Unicode (and Java)
Brice Giesbrecht

Objective of Presentation

The need for Unicode
How it works
Differentiate between encodings
How to get your browser to work…
See how Java consumes and produces data

Overview of Presentation

Character Sets
Unicode
Encodings
Unicode Support in Java
Unicode Support in Databases (?)
Demonstration (web app)
Resources
Door Prizes (for those still awake…)

Character Sets

What is a character set?
  Code page: a mapping in which a sequence of bits, usually a single octet representing integer values 0 through 255, is associated with a specific character (Wikipedia)
Most character sets are a direct mapping of a character to a number (7 bit / 8 bit)
Character sets are NOT fonts!
Encoding is usually a lookup in a table
Most IBM and Microsoft code pages use ASCII as their base set of characters
The English bias (compare to Indic languages)

Character Sets
Issues within a single language
  Selectors to overcome 8 bit limitations (especially for CJK sets)
  Historical importance of platforms and hardware
  Compatibility (or more likely, lack thereof)
  ISCII as an example
Issues outside a single language
  How do you produce content using multiple languages? (Or the characters from those languages?)

http://en.wikipedia.org/wiki/Code_page_437


Character Sets

Enter the standards
ISO-646 (ASCII, still 7 bit)
  12 whole code points to play with!
  C0 Control Set (0x00 – 0x1F)
ISO-8859-n
  0x00 – 0x7F: ISO-646 IRV
  0x80 – 0xFF: different for each set (or part)
  ISO 8859-1 (Latin1)
  C1 Control Set (0x80 – 0x9F)
ISO-2022
  Designed for transmission
  Non-Latin bases & multi-byte sets

Character Sets

Enter Microsoft!
Windows code pages
  http://www.microsoft.com/globaldev/reference/wincp.mspx
Cp1252
  Based on ISO 8859-1
  C1 code points used for printable characters
  Often mislabeled as ISO-8859-1 due to their similarities
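The Cp1252 / ISO-8859-1 confusion can be made concrete with a small sketch. It assumes the JVM ships a charset named windows-1252 (it appears in the charset list later in this deck); the class and method names are made up for illustration. The same byte in the C1 range decodes to a control code point under ISO-8859-1 and to a printable character under Cp1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252VsLatin1 {
    // Decode one byte under the given charset and return the resulting code point.
    static int decodeByte(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available in the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        int[] samples = { 0x41, 0x80, 0x93 };
        for (int b : samples) {
            System.out.printf("byte 0x%02X: ISO-8859-1 -> U+%04X, windows-1252 -> U+%04X%n",
                    b,
                    decodeByte(b, StandardCharsets.ISO_8859_1),
                    decodeByte(b, cp1252));
        }
    }
}
```

Bytes below 0x80 agree in both charsets; the two diverge only in the 0x80 – 0x9F range, which is why mislabeled text often looks almost right.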



Unicode

What is Unicode?
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode

ISO 10646
1990 Merged with the Unicode Consortium
Ties a character, name, and a code point together
BMP – Basic Multilingual Plane (the first 65,536 code points)
ISO and UC character repertoires are synchronized
UCS (Universal Character Set)

Unicode
Q: So are they the same thing?
A: No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646. (http://unicode.org/faq/unicode_iso.html)

Unicode
The Unicode Standard includes a set of characters, names, and coded representations that are identical with those in ISO/IEC 10646:2003. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers. [It] strengthens Unicode support for worldwide communication, software availability, and publishing. (http://www.iso.org)

Unicode
UCS code space: (0x0 – 0x7FFFFFFF)
  128 x 256 x 256 x 256 (GPRC: group, plane, row, cell)
  2,147,483,648 possible code points
The Unicode Character Database
  http://unicode.org/Public/UNIDATA/UCD.html
  Main definition (UnicodeData.txt)
  Available online: http://www.unicode.org/Public/UNIDATA/
Unicode code space: (0x0 – 0x10FFFF)
  17 x 256 x 256
  1,114,112 code points
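The two code-space sizes above are plain multiplication, and the Unicode figure can be cross-checked against the constant Java exposes. The class and method names below are hypothetical; this is only a sketch of the arithmetic.

```java
final class CodeSpaceSize {
    // 128 groups x 256 planes x 256 rows x 256 cells (the 31-bit UCS code space).
    static long ucsCodePoints() {
        return 128L * 256 * 256 * 256;
    }

    // 17 planes x 256 rows x 256 cells (the Unicode code space).
    static int unicodeCodePoints() {
        return 17 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println("UCS code points:     " + ucsCodePoints());
        System.out.println("Unicode code points: " + unicodeCodePoints());
        // Character.MAX_CODE_POINT is 0x10FFFF, so the count is one more than that.
        System.out.println("From Java constant:  " + (Character.MAX_CODE_POINT + 1));
    }
}
```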

Unicode
As of Unicode 5.0.0, 101,063 (9.1%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, leaving 875,441 (78.6%) unassigned. The number of assigned code points is made up as follows:
  98,884 graphemes
  140 formatting characters
  65 control characters
  2,048 surrogate characters

Unicode

Plane 0 (0000-FFFF)
Basic Multilingual Plane (BMP)
  Used for most of the alphabets
  Not all code points are used
  Allocated in areas/blocks

Unicode

Plane 1 (10000-1FFFF)
Supplementary Multilingual Plane (SMP)
  Used for historic scripts such as Linear B, and also for musical and mathematical symbols

Unicode

Plane 2 (20000-2FFFF)
Supplementary Ideographic Plane (SIP)
  Used for about 40,000 rare Chinese characters that are mostly historic

Unicode

Planes 3 to 13 (30000-DFFFF) Unassigned

Unicode

Plane 14 (E0000-EFFFF)
Supplementary Special-purpose Plane (SSP)
  Glyph (font) selection
  code point + variation selector = variation sequence
  http://www.unicode.org/reports/tr37/tr37-3.html (Ideographic Variation Database)

Unicode

Plane 15 (F0000-FFFFF)
Plane 16 (100000-10FFFF)
Plane 0 (E000-F8FF)
Private Use Area (PUA)
  The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways.
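Since every plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates that, plus a private-use check built from the ranges listed on this slide. The class and method names are hypothetical.

```java
final class PlaneLookup {
    // Plane number (0..16) of a code point: each plane spans 0x10000 code points.
    static int planeOf(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        return cp >>> 16;
    }

    // Private-use test using the slide's ranges. Simplification: the last two
    // code points of planes 15 and 16 are noncharacters, which this ignores.
    static boolean isPrivateUse(int cp) {
        return (cp >= 0xE000 && cp <= 0xF8FF)
            || (cp >= 0xF0000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0x4E8C, 0x10302, 0x20000, 0xE0100, 0xE000, 0xF0000, 0x100000 };
        for (int cp : samples) {
            System.out.printf("U+%06X plane %2d privateUse=%b%n",
                    cp, planeOf(cp), isPrivateUse(cp));
        }
    }
}
```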

Unicode

ConScript Unicode Registry
The purpose of the ConScript Unicode Registry (CSUR) is to coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages.
Cirth, Klingon, Tengwar, etc.

Encodings

The purpose of the following encodings is to get the Unicode value to you. Depending on the storage or transmission protocol, different encodings will need to be used. These are not different character sets; they are ways of representing the characters in Unicode.

Encodings

Endianness
  0x1234
  LE: 34 12
  BE: 12 34
Byte Order Mark: 0xFEFF
  Helps determine endianness
  Unicode 3.2: U+2060 (word joiner) takes over the zero-width no-break space role of 0xFEFF
  0xFFFE reserved
  0xFEFF set aside for BOM
  Also used to declare encoding (UTF-8)
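The byte orders above are easy to see from Java. The sketch below encodes the single character U+1234 with the JDK's UTF-16BE, UTF-16LE, and UTF-16 charsets and prints the bytes in hex. The class name and the `hex` helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EndianDemo {
    // Render bytes as space-separated upper-case hex, e.g. "12 34".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234";
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // Java's plain "UTF-16" encoder writes a big-endian BOM first.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));

        // When decoding with "UTF-16", a leading BOM selects the byte order.
        byte[] littleEndianWithBom = { (byte) 0xFF, (byte) 0xFE, 0x34, 0x12 };
        String decoded = new String(littleEndianWithBom, StandardCharsets.UTF_16);
        System.out.printf("decoded: U+%04X%n", (int) decoded.charAt(0));
    }
}
```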

Encodings

UTF-8
  Variable-length character encoding
  Can address all characters in the UCS but was limited by RFC 3629 to just address the Unicode code space
  BOM – EF BB BF
  Format:
    000000-00007F  0zzzzzzz
    000080-0007FF  110yyyyy 10zzzzzz
    000800-00FFFF  1110xxxx 10yyyyyy 10zzzzzz
    010000-10FFFF  11110www 10xxxxxx 10yyyyyy 10zzzzzz
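The bit layout in the format table translates almost directly into shifts and masks. What follows is a hand-rolled sketch for illustration only (the class name is hypothetical); real code would normally use the JDK's own UTF-8 charset, which `main` uses here as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    // Encode one code point following the four rows of the format table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4E8C, 0x10302 };
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X: %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```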

Encodings

UTF-32/UCS-4
  Fixed-length character encoding
  Uses 31 bits
  UCS-4 capable of addressing entire UCS, but was restricted to only cover the Unicode code space
  UTF-32 only covers the Unicode code space
  4E8C, 10302 = 00004E8C, 00010302
  BE BOM – 00 00 FE FF
  LE BOM – FF FE 00 00
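Because UTF-32 stores each code point as one 32-bit integer, encoding is just writing an int in the chosen byte order. The sketch below does that with `ByteBuffer` rather than a charset, so it does not depend on which UTF-32 charsets a given JVM ships. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Utf32Sketch {
    // Write a code point as four bytes in the requested byte order.
    static byte[] encode(int cp, ByteOrder order) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        return ByteBuffer.allocate(4).order(order).putInt(cp).array();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("U+4E8C  BE: " + hex(encode(0x4E8C, ByteOrder.BIG_ENDIAN)));
        System.out.println("U+10302 BE: " + hex(encode(0x10302, ByteOrder.BIG_ENDIAN)));
        // Encoding U+FEFF itself reproduces the two BOM byte sequences.
        System.out.println("BOM     BE: " + hex(encode(0xFEFF, ByteOrder.BIG_ENDIAN)));
        System.out.println("BOM     LE: " + hex(encode(0xFEFF, ByteOrder.LITTLE_ENDIAN)));
    }
}
```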

Encodings

UCS-2
  Fixed-length encoding
  Two-octet
  It is NOT UTF-16!
  Only addresses BMP
  UCS-2BE, UCS-2LE
  Obsoleted by UTF-16

Encodings

UTF-16
  Variable-length encoding
  UTF-16BE, UTF-16LE
  BE BOM – FEFF
  LE BOM – FFFE
  Surrogates are used to address code points outside the BMP. (We will cover this later)

Encodings

UTF-16 Surrogate Pairs
  Needed for code points > 0xFFFF
  High surrogate 0xD800 – 0xDBFF (first surrogate)
  Low surrogate 0xDC00 – 0xDFFF (second surrogate)
  Algorithm:
    ((cp - 0x10000) high 10 bits) | 0xD800
    ((cp - 0x10000) low 10 bits) | 0xDC00
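The surrogate algorithm above can be sketched in a few lines and checked against `Character.toChars`, which performs the same conversion in the JDK. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class SurrogateSketch {
    // Split a supplementary code point into its high and low surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                      // 20 significant bits remain
        char high = (char) (0xD800 | (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x10302;
        char[] pair = toSurrogates(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) pair[0], (int) pair[1]);
        System.out.println("matches Character.toChars: "
                + Arrays.equals(pair, Character.toChars(cp)));
    }
}
```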

Encodings

Which encoding should you use?
  If dealing with CJK or Hindi (>0x0800), UTF-8 requires 3 bytes whereas UTF-16 needs only 2
  UTF-8 is great for ASCII whereas UTF-16 needs 2 bytes for it
  Java uses UTF-16
  Windows uses UTF-16LE internally
  UTF-32 not really used that much
  UTF-8 and UTF-16 are the most common
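The size trade-off is simple to measure. The sketch below counts encoded bytes for an ASCII letter, a Devanagari letter, and a CJK ideograph. The sample characters and the class name are illustrative choices, not taken from the slides.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSize {
    // Number of bytes the string occupies in the given encoding.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // \u escapes keep this source file independent of its own encoding.
        String[] samples = { "A", "\u0905", "\u4E2D" };
        for (String s : samples) {
            // UTF_16LE is used so that no BOM is counted.
            System.out.printf("U+%04X: UTF-8 = %d byte(s), UTF-16 = %d byte(s)%n",
                    s.codePointAt(0),
                    size(s, StandardCharsets.UTF_8),
                    size(s, StandardCharsets.UTF_16LE));
        }
    }
}
```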

Java

J2SE 1.5: Unicode version 4.0
J2SE 1.4: Unicode version 3.0
J2SE 1.3: Unicode version 2.1
Supplementary characters were part of Unicode 3.1
  Addressed in JSR 204 (http://jcp.org/en/jsr/detail?id=204)
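In practice, supplementary characters mean that a Java `char` is a UTF-16 code unit rather than a character, so `String.length()` and the number of code points can differ. The sketch below uses the code point methods on `String` and `Character`. The class and method names are hypothetical.

```java
final class SupplementaryDemo {
    // Count code points rather than UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+10302, which needs a surrogate pair.
        String s = "A" + new String(Character.toChars(0x10302));
        System.out.println("length():     " + s.length());
        System.out.println("code points:  " + countCodePoints(s));

        // Walk the string one code point at a time.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("index %d: U+%04X%n", i, cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
    }
}
```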

Java
Unicode characters are specified using \u, such as \u0039
Unicode can be used in source files
file.encoding=Cp1252 on my machine
  You can change this, but beware…
Java reads and writes using this encoding by default
You can specify the character set to use for reading or writing
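One common way to avoid depending on the platform default is to name the charset when wrapping a stream in a reader or writer. The sketch below uses in-memory streams so that it is self-contained; the class and helper names are hypothetical, and with real files the same wrappers go around `FileInputStream` and `FileOutputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetIo {
    // Write text through a Writer that uses an explicitly named charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read bytes back through a Reader that uses an explicitly named charset.
    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset());
        String text = "caf\u00E9";
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + read(utf8, StandardCharsets.UTF_8).equals(text));
        // Reading with the wrong charset silently produces different characters.
        System.out.println("wrong charset ok: " + read(utf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Decoding with the wrong charset raises no error here; the text simply comes back altered, which is why the encoding has to be known rather than guessed.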

Java
Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297,
IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3,
ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-Solaris, x-euc-jp-linux, x-EUC-TW,
x-eucJP-Open, x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948,
x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRoman, x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS950-HKSCS, x-mswin-936, x-PCK, x-windows-874, x-windows-949, x-windows-950
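The names above look like the canonical charset names a JVM reports, although the slide does not say how the list was produced. Assuming that is the case, a listing for any given JVM can be printed as follows; the class and method names are hypothetical, and the output varies by Java version and platform.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

final class ListCharsets {
    // Canonical charset name -> Charset, for everything this JVM supports.
    static SortedMap<String, Charset> all() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        SortedMap<String, Charset> charsets = all();
        System.out.println(charsets.size() + " charsets available");
        for (String name : charsets.keySet()) {
            System.out.println(name);
        }
        // A single name can also be checked without listing everything.
        System.out.println("UTF-16LE supported: " + Charset.isSupported("UTF-16LE"));
    }
}
```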

Databases (Maybe)

SQL 92
  NATIONAL CHARACTER
    The <key word>s NATIONAL CHARACTER are used to specify a character string data type with a particular implementation-defined character repertoire. Special syntax (N'string') is provided for representing literals in that character repertoire.
Collation
Database Support
  MySQL
  Oracle
  SQL Server
  Postgres

Demonstration
Read/Write/Examine UTF-8/UTF-16/UTF-16LE encoded text (with hex editor)
Show encoding settings in Eclipse and Java
Show how Windows (and Eclipse console) can/can't display some characters
Web browser settings
  Chinese article on cracking of SHA-1
  Martin Fowler article on dependency injection

Resources
The big ones:
  http://www.unicode.org/Public/UNIDATA/
  http://en.wikipedia.org/wiki/Unicode
  http://www.evertype.com/standards/csur
The rest:
  http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
  http://en.wikibooks.org/wiki/Unicode/Character_reference
  http://www.joelonsoftware.com/articles/Unicode.html
  http://www.cl.cam.ac.uk/~mgk25/unicode.html
  http://czyborra.com/charsets/iso646.html
  http://www.fileformat.info/ (GREAT resource)
For fun:
  http://www.omniglot.com/
  http://en.wikipedia.org/wiki/Constructed_language
  http://talideon.com/concultures/wiki/
