Unicode (and Java)
Brice Giesbrecht
Objective of Presentation
  The need for Unicode
  How it works
  Differentiate between encodings
  How to get your browser to work…
  See how Java consumes and produces data
Overview of Presentation
  Character Sets
  Unicode
  Encodings
  Unicode Support in Java
  Unicode Support in Databases (?)
  Demonstration (web app)
  Resources
  Door Prizes (for those still awake…)
Character Sets
  What is a character set?
  Code Page: "a mapping in which a sequence of bits, usually a single octet representing integer values 0 through 255, is associated with a specific character" (Wikipedia)
  Most character sets are a direct mapping of a character to a number (7 bit / 8 bit)
  Character sets are NOT fonts!
  Encoding is usually a lookup in a table
  Most IBM and Microsoft code pages use ASCII as their base set of characters
  The English bias (compare to Indic languages)
Character Sets
  Issues within a single language
    Selectors to overcome 8 bit limitations (especially for CJK sets)
    Historical importance of platforms and hardware
    Compatibility (or more likely, lack thereof)
    ISCII as an example
  Issues outside a single language
    How do you produce content using multiple languages? (Or the characters from those languages?)
  http://en.wikipedia.org/wiki/Code_page_437
Character Sets
  Enter the standards
  ISO-646 (ASCII, still 7 bit)
    12 whole code points to play with! (the positions left open for national variants)
    C0 Control Set (0x00 – 0x1F)
  ISO-8859-n
    0x00 – 0x7F ISO-646 IRV
    0x80 – 0xFF Different for each set (or part)
    ISO 8859-1 (Latin1)
    C1 Control Set (0x80 – 0x9F)
  ISO-2022
    Designed for transmission
    Non Latin bases & multi byte sets
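As a rough illustration of "different for each set (or part)", the sketch below decodes the same byte under two ISO-8859 parts. It assumes the Java runtime ships the ISO-8859-5 charset (only ISO-8859-1 is guaranteed by the platform); the class and method names are made up for this example.

```java
import java.nio.charset.Charset;

final class Iso8859Parts {
    // Decode a single byte under the named charset and return the code point it maps to.
    static int decode(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        return new String(raw, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // The lower half (0x00 - 0x7F) is shared by every part.
        System.out.printf("0x41: U+%04X vs U+%04X%n",
                decode(0x41, "ISO-8859-1"), decode(0x41, "ISO-8859-5"));
        // The upper half (0x80 - 0xFF) depends on the part: Latin vs Cyrillic here.
        System.out.printf("0xE9: U+%04X vs U+%04X%n",
                decode(0xE9, "ISO-8859-1"), decode(0xE9, "ISO-8859-5"));
    }
}
```

The same octet 0xE9 is a Latin letter in part 1 and a Cyrillic letter in part 5, which is why a byte stream is meaningless without knowing which set it was written in.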
Character Sets
  Enter Microsoft!
  Windows code pages
    http://www.microsoft.com/globaldev/reference/wincp.mspx
  Cp1252
    Based on ISO 8859-1
    C1 code points used for printable characters
    Often mislabeled as ISO-8859-1 due to their similarities
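A small sketch of what the mislabeling looks like in practice: text is written as Cp1252 and then read back as if it were ISO-8859-1. It assumes the runtime provides the windows-1252 charset (Java's name for Cp1252); the class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Mislabel {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encode as Cp1252, then decode the same bytes as ISO-8859-1.
    static String mislabel(String text) {
        byte[] raw = text.getBytes(CP1252);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Characters shared by both sets survive the round trip.
        System.out.println(mislabel("caf\u00E9").equals("caf\u00E9"));
        // The euro sign sits at 0x80 in Cp1252, which is a C1 control in ISO-8859-1.
        System.out.printf("U+%04X%n", mislabel("\u20AC").codePointAt(0));
    }
}
```

Most Western European text looks fine under either label, so the mistake tends to show up only when characters from the 0x80 – 0x9F range (euro sign, curly quotes) appear.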
Unicode
  What is Unicode?
  "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
Unicode
  ISO 10646
    1990: Merged with the Unicode Consortium
    Ties a character, name, and a code point together
    BMP – Basic Multilingual Plane (the first 65,536 code points)
    ISO and UC character repertoires are synchronized
    UCS (Universal Character Set)
Unicode
  Q: So are they the same thing?
  A: No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646. (http://unicode.org/faq/unicode_iso.html)
Unicode
  The Unicode Standard includes a set of characters, names, and coded representations that are identical with those in ISO/IEC 10646:2003. It additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers. [It] strengthens Unicode support for worldwide communication, software availability, and publishing. (http://www.iso.org)
Unicode
  UCS Code space: (0x0 – 0x7FFFFFFF)
    128 x 256 x 256 x 256 (GPRC: group, plane, row, cell)
    2,147,483,648 possible code points
  The Unicode Character Database
    http://unicode.org/Public/UNIDATA/UCD.html
    Main Definition (UnicodeData.txt)
    Available online: http://www.unicode.org/Public/UNIDATA/
Unicode
  Unicode Code Space (0x0 – 0x10FFFF)
    17 x 256 x 256
    1,114,112 code points
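The arithmetic behind the two code space sizes can be checked with a few lines of Java. This is only a sketch of the multiplication on the slides; the class and constant names are invented for the example.

```java
final class CodeSpace {
    static final int PLANES = 17;                       // planes 0 through 16
    static final int CODE_POINTS_PER_PLANE = 256 * 256; // rows x cells

    // Size of the Unicode code space: 17 x 256 x 256.
    static int unicodeCodeSpaceSize() {
        return PLANES * CODE_POINTS_PER_PLANE;
    }

    // Size of the original UCS code space: 128 groups x 256 planes x 256 rows x 256 cells.
    static long ucsCodeSpaceSize() {
        return 128L * 256 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println(unicodeCodeSpaceSize());
        System.out.println(ucsCodeSpaceSize());
        // Java exposes the upper bound of the Unicode code space directly.
        System.out.printf("0x%X%n", Character.MAX_CODE_POINT);
    }
}
```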
Unicode
  As of Unicode 5.0.0, 101,063 (9.1%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use, leaving 875,441 (78.6%) unassigned. The number of assigned code points is made up as follows:
    98,884 graphemes
    140 formatting characters
    65 control characters
    2,048 surrogate characters
Unicode
  Plane 0 (0000-FFFF)
    Basic Multilingual Plane (BMP)
    Used for most of the alphabets
    Not all code points are used
    Allocated in areas/blocks
Unicode
  Plane 1 (10000-1FFFF)
    Supplementary Multilingual Plane (SMP)
    Used for historic scripts such as Linear B, but also for musical and mathematical symbols
Unicode
  Plane 2 (20000-2FFFF)
    Supplementary Ideographic Plane (SIP)
    Used for about 40,000 rare Chinese characters that are mostly historic
Unicode
  Planes 3 to 13 (30000-DFFFF)
    Unassigned
Unicode
  Plane 14 (E0000-EFFFF)
    Supplementary Special-purpose Plane (SSP)
    glyph (font) selection
    code point + variation selector = variation sequence
    http://www.unicode.org/reports/tr37/tr37-3.html (Ideographic Variation Database)
Unicode
  Plane 15 (F0000-FFFFF)
  Plane 16 (100000-10FFFF)
  Plane 0 (E000-F8FF)
    Private Use Area (PUA)
  The use of the PUA was a concept inherited from certain Asian encoding systems. These systems had private use areas to encode Japanese Gaiji (rare personal name characters) in application-specific ways.
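Since each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below shows that, plus a private-use check that uses the three ranges exactly as listed on the slide. The class and method names are hypothetical.

```java
final class Planes {
    // Plane number of a code point: each plane spans 0x10000 code points.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside the Unicode code space");
        }
        return codePoint >>> 16;
    }

    // Private-use test using the ranges given on the slide:
    // E000-F8FF in plane 0, plus all of planes 15 and 16.
    static boolean isPrivateUse(int codePoint) {
        return (codePoint >= 0xE000 && codePoint <= 0xF8FF)
            || (codePoint >= 0xF0000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int[] samples = { 0x4E8C, 0x10302, 0x20000, 0xE0100, 0xE000, 0x10FFFF };
        for (int cp : samples) {
            System.out.printf("U+%04X plane %d privateUse=%b%n",
                    cp, planeOf(cp), isPrivateUse(cp));
        }
    }
}
```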
Unicode
  ConScript Unicode Registry
    The purpose of the ConScript Unicode Registry (CSUR) is to coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages.
    Cirth, Klingon, Tengwar, etc.
Encodings
  The purpose of the following encodings is to get the Unicode value to you. Depending on the storage or transmission protocols, different encodings will need to be used. These are not different character sets; they are ways of representing the characters in Unicode.
Encodings
  Endianness
    0x1234
    LE 34 12
    BE 12 34
  Byte Order Mark - 0xFEFF
    Helps determine endianness
    Since Unicode 3.2, 0x2060 (word joiner) takes over its old zero-width no-break space role
    0xFFFE reserved
    0xFEFF set aside for BOM
    Also used to declare encoding (UTF-8)
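The byte orders above can be observed directly from Java by encoding the single character 0x1234. This is a minimal sketch; the hex helper and class name are made up for the example, while the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    // Render bytes as space-separated upper-case hex, e.g. "12 34".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u1234";
        System.out.println("BE:  " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("LE:  " + hex(s.getBytes(StandardCharsets.UTF_16LE)));
        // Java's plain UTF-16 encoder writes a big-endian BOM first.
        System.out.println("BOM: " + hex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

A reader that sees FE FF at the start of a stream knows the data is big-endian; seeing FF FE means the bytes are swapped and the data is little-endian.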
Encodings
  UTF-8
    Variable-length character encoding
    Can address all characters in the UCS but was limited by RFC 3629 to just address the Unicode code space.
    BOM – EF BB BF
    Format
      000000-00007F  0zzzzzzz
      000080-0007FF  110yyyyy 10zzzzzz
      000800-00FFFF  1110xxxx 10yyyyyy 10zzzzzz
      010000-10FFFF  11110www 10xxxxxx 10yyyyyy 10zzzzzz
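The format table translates almost line for line into code. The following is a hand-rolled sketch for illustration only (real code should use the built-in UTF-8 charset); it does not reject surrogate code points, and the class name is invented.

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    // Encode one code point following the four rows of the format table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode code space");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (cp <= 0x7F) {                       // 0zzzzzzz
            out.write(cp);
        } else if (cp <= 0x7FF) {               // 110yyyyy 10zzzzzz
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10yyyyyy 10zzzzzz
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {                                // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x41, 0xE9, 0x4E8C, 0x10302 }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(cp)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.printf("U+%04X -> %s%n", cp, hex.toString().trim());
        }
    }
}
```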
Encodings
  UTF-32/UCS-4
    Fixed-length character encoding
    Uses 31 bits
    UCS-4 capable of addressing entire UCS, but was restricted to only cover the Unicode code space
    UTF-32 only covers the Unicode code space
    4E8C, 10302 = 00004E8C, 00010302
    BE BOM – 00 00 FE FF
    LE BOM – FF FE 00 00
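Because UTF-32 is fixed-length, encoding is just writing each code point as a 4-byte integer in the chosen byte order. The sketch below does this with java.nio.ByteBuffer rather than a UTF-32 charset, since such a charset is not among the ones every Java runtime must provide; the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Utf32Sketch {
    // Write each code point as one 32-bit integer in the requested byte order.
    static byte[] encode(int[] codePoints, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(4 * codePoints.length).order(order);
        for (int cp : codePoints) {
            buf.putInt(cp);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        int[] text = { 0x4E8C, 0x10302 };
        for (byte b : encode(text, ByteOrder.BIG_ENDIAN)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Encoding the BOM code point 0xFEFF the same way yields 00 00 FE FF in big-endian order and FF FE 00 00 in little-endian order, matching the slide.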
Encodings
  UCS-2
    Fixed-length encoding
    Two-octet
    It is NOT UTF-16!
    Only addresses BMP
    UCS-2BE, UCS-2LE
    Obsoleted by UTF-16
Encodings
  UTF-16
    Variable-length encoding
    UTF-16BE, UTF-16LE
    BE BOM – FE FF
    LE BOM – FF FE
    Surrogates are used to address code points outside the BMP. (We will cover this later)
Encodings
  UTF-16 Surrogate Pairs
    Needed for code points > 0xFFFF
    High surrogate 0xD800 – 0xDBFF (first 16-bit code unit)
    Low surrogate 0xDC00 – 0xDFFF (second 16-bit code unit)
    Algorithm:
      ((cp - 0x10000) high 10 bits) | 0xD800
      ((cp - 0x10000) low 10 bits) | 0xDC00
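Written out in Java, the two-line algorithm looks roughly like this. It is a sketch for illustration; the JDK already offers Character.toChars for the same job, and the class and method names here are invented.

```java
final class SurrogateSketch {
    // Split a supplementary code point into its high and low surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = cp - 0x10000;                       // 20 significant bits remain
        char high = (char) (0xD800 | (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10302);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```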
Encodings
  Which Encoding should you use?
    If dealing with CJK or Hindi (0x0800 and above in the BMP), UTF-8 requires 3 bytes whereas UTF-16 needs only 2
    UTF-8 is great for ASCII (1 byte per character) whereas UTF-16 needs 2 bytes for it
    Java uses UTF-16
    Windows uses UTF-16LE internally
    UTF-32 not really used that much
    UTF-8 and UTF-16 are the most common
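The size trade-off is easy to measure. The sketch below counts encoded bytes for an ASCII letter, a CJK character and a Devanagari letter; UTF-16BE is used so that no BOM is counted. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSize {
    // Number of bytes the text occupies in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u4E8C", "\u0915" }; // ASCII, CJK, Devanagari
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    s.codePointAt(0),
                    byteCount(s, StandardCharsets.UTF_8),
                    byteCount(s, StandardCharsets.UTF_16BE));
        }
    }
}
```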
Java
  Unicode version supported by each release
    J2SE 1.5: Unicode 4.0
    J2SE 1.4: Unicode 3.0
    J2SE 1.3: Unicode 2.1
  Supplementary characters were part of Unicode 3.1
  Addressed in JSR 204 (http://jcp.org/en/jsr/detail?id=204)
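One practical consequence of supplementary characters is that a Java String's length counts 16-bit chars, not characters. The sketch below walks a string by code point using the code point methods available since J2SE 5.0; the class and helper names are illustrative.

```java
final class SupplementaryDemo {
    // Collect the code points of a string, stepping over surrogate pairs.
    static int[] codePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "\u4E8C" + new String(Character.toChars(0x10302));
        System.out.println("chars: " + s.length());
        System.out.println("code points: " + codePoints(s).length);
    }
}
```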
Java
  Unicode characters are specified using \u such as \u0039
  Unicode can be used in source files
  file.encoding=Cp1252 on my machine
    You can change this, but beware…
    Java reads and writes using this encoding by default
  You can specify the character set to use for reading or writing
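A minimal sketch of naming the character set explicitly instead of relying on the platform default. It uses in-memory streams so it is self-contained; with files, the same reader and writer classes would wrap file streams. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetIo {
    // Encode text to bytes through a Writer bound to an explicit charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decode bytes to text through a Reader bound to an explicit charset.
    static String read(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        byte[] utf8 = write("caf\u00E9", StandardCharsets.UTF_8);
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals("caf\u00E9"));
        // Reading the same bytes with the wrong charset garbles the last character.
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

Writing and reading with the same named charset round-trips cleanly regardless of what file.encoding happens to be on the machine.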
Java
  Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297,
  IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3,
  ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-Solaris, x-euc-jp-linux, x-EUC-TW,
  x-eucJP-Open, x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948,
  x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRoman, x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS950-HKSCS, x-mswin-936, x-PCK, x-windows-874, x-windows-949, x-windows-950
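A list like the one above can be produced on any machine with a few lines of Java; the exact contents vary by Java version and installation, so the output will likely differ from the slide. The class and helper names are made up for this sketch.

```java
import java.nio.charset.Charset;

final class ListCharsets {
    // True if this runtime can supply a charset with the given name or alias.
    static boolean isAvailable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // availableCharsets() returns a map sorted by canonical charset name.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        System.out.println("default: " + Charset.defaultCharset().name());
    }
}
```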
Databases (Maybe)
  SQL 92
    NATIONAL CHARACTER
      "The <key word>s NATIONAL CHARACTER are used to specify a character string data type with a particular implementation-defined character repertoire. Special syntax (N'string') is provided for representing literals in that character repertoire."
  Collation
  Database Support
    MySQL
    Oracle
    SQL Server
    Postgres
Demonstration
  Read/Write/Examine UTF-8/UTF-16/UTF-16LE encoded text (with hex editor)
  Show encoding settings in Eclipse and Java
  Show how Windows (and the Eclipse console) can/can't display some characters
  Web browser settings
    Chinese article on cracking of SHA-1
    Martin Fowler article on Dependency Injection
Resources
  The big ones:
    http://www.unicode.org/Public/UNIDATA/
    http://en.wikipedia.org/wiki/Unicode
    http://www.evertype.com/standards/csur
  The rest:
    http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
    http://en.wikibooks.org/wiki/Unicode/Character_reference
    http://www.joelonsoftware.com/articles/Unicode.html
    http://www.cl.cam.ac.uk/~mgk25/unicode.html
    http://czyborra.com/charsets/iso646.html
    http://www.fileformat.info/ (GREAT resource)
  For fun:
    http://www.omniglot.com/
    http://en.wikipedia.org/wiki/Constructed_language
    http://talideon.com/concultures/wiki/