Unicode 4.0 Mark Davis President, The Unicode Consortium Note: slides differ from proceedings


Page 1: Unicode 4.0

Unicode 4.0

Mark Davis

President, The Unicode Consortium

Note: slides differ from proceedings

Page 2: Unicode 4.0

Overview
  New Characters
  Conformance
  UAX: Unicode Standard Annexes
  UCD: Unicode Character Database
  UTS: Unicode Technical Standards
    Not part of the Standard, but can claim conformance

Page 3: Unicode 4.0

Properties and Behavior
  Unicode is not just a list of characters
  Properties and behavior are crucial
  With them, new characters can work “out of the box”
  Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)

Page 4: Unicode 4.0

New Characters: 1,228
  Modern Scripts
    (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
    (minority scripts) Limbu, Tai Le, Osmanya
  Historic Scripts
    Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
  Symbols
    Monograms, digrams, tetragrams, other symbols
    modifier & combining characters

Page 5: Unicode 4.0

New Characters (cont.)
  Special Characters
    additional variation selectors (for future CJK variants)
    double-diacritics for dictionary use
  For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
  Character repertoire corresponds to ISO/IEC 10646:2003.

Page 6: Unicode 4.0

Conformance
  Substantially improved specification of conformance requirements
  Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes
  Tightened definitions of UTF-8, UTF-16, UTF-32
  Separate definition of Unicode String
  Clarified conformance status of Unicode Standard Annexes
  Formal definitions of properties & algorithms
  Provisional properties
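
The split between encoding forms (sequences of code units) and encoding schemes (byte serializations of those code units) can be sketched in Python. This is an illustration only, not taken from the slides: the helper name `utf16_code_units` and the sample strings are assumptions made for the example.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 encoding form of text as a list of 16-bit code units.

    Hypothetical helper: it goes through the big-endian scheme only as a
    convenient way to reach the code units from a Python str.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# Encoding form: code units, with no byte order involved.
# U+10400 is a supplementary character, so it takes a surrogate pair.
units = utf16_code_units("a\U00010400")

# Encoding schemes: three different byte serializations of the same form.
big_endian = "a".encode("utf-16-be")     # bytes 00 61
little_endian = "a".encode("utf-16-le")  # bytes 61 00
with_bom = "a".encode("utf-16")          # byte order mark, then native order
```

The list `units` is the same on every platform, while the bytes produced by the schemes differ in order and in whether a BOM is present.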

Page 7: Unicode 4.0

UTF vs. Unicode String
  Important Distinction
  UTF
    Unique representation for Code Point
    All else illegal
      C0 80
      D800 0061
  Unicode String
    Sequence of code units
    Internal Processing, not interchange
    Not necessarily valid UTF
      C0 A0
      D800 0061
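
The sequences on this slide can be checked mechanically. The following is a minimal sketch, assuming Python's strict built-in UTF-8 decoder as the reference for UTF-8 and a hand-written surrogate-pairing check for UTF-16; the function names `is_valid_utf8` and `is_valid_utf16` are hypothetical.

```python
from typing import Sequence

def is_valid_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_valid_utf16(units: Sequence[int]) -> bool:
    """True if every surrogate code unit in units is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return False
        i += 1
    return True

overlong_nul = is_valid_utf8(b"\xc0\x80")            # C0 80
overlong_space = is_valid_utf8(b"\xc0\xa0")          # C0 A0
lone_surrogate = is_valid_utf16([0xD800, 0x0061])    # D800 0061
proper_pair = is_valid_utf16([0xD801, 0xDC00])
```

A Unicode string used for internal processing may hold any of these code unit sequences; only the ones that pass such checks are valid as a UTF.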

Page 8: Unicode 4.0

Conformance (cont.)
  Formalized policies for stability of the standard
  Clarification of semantics of important characters, including BOM
  Revised scope of enclosing combining marks
  Revised semantics of ZWJ for cursive scripts
  Normalization Corrections
    U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
    All corrections subject to strict stability constraints:
      For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)

Page 9: Unicode 4.0

Textual Clarifications
  Major changes to Chapters 2, 3, 6, 14 and 15
  Definitive terminology for code points:
    graphic, format, control, private-use = assigned characters
    surrogate, noncharacter, reserved not characters
  Substantial improvements to many character block descriptions, especially Indic
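
The code point terminology above can be approximated from General_Category values. This is a sketch, not the standard's own definition: the function name `code_point_type` is hypothetical, the mapping of categories to types is my reading of the seven terms on the slide, and the result for reserved code points depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

def code_point_type(cp: int) -> str:
    """Classify a code point into one of the seven types named on the slide."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "reserved"          # unassigned in this interpreter's UCD version
    if category == "Co":
        return "private-use"
    if category == "Cc":
        return "control"
    if category in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"

samples = {cp: code_point_type(cp) for cp in (0x0041, 0x000A, 0x200D, 0xE000, 0xD800, 0xFFFF)}
```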

Page 10: Unicode 4.0

Programming language identifiers
  Now backwards-compatible
    Once a Unicode identifier, always a Unicode identifier
  Alternate definition for complete stability
    Fix set of allowed characters
    Allow all reserved code points
    + Complete stability
    - “Odd” characters
  Also see new UTR on Syntax Characters
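
As one concrete instance of Unicode-based identifiers, Python's own identifier syntax is built on the XID_Start and XID_Continue properties (plus the underscore). The sketch below uses that as a stand-in; it is not the identifier definition from Unicode 4.0 itself, and the wrapper name `is_default_identifier` and the candidate strings are assumptions for illustration.

```python
def is_default_identifier(name: str) -> bool:
    """True if name is a valid identifier under Python's XID-based rules.

    Note: this also returns True for reserved words such as "class",
    because keyword status is a language rule, not a Unicode property.
    """
    return name.isidentifier()

candidates = ["na\u00efve", "x2", "2x", "a-b", "\u03a9mega"]
accepted = [c for c in candidates if is_default_identifier(c)]
```

Here `2x` fails because a digit cannot start an identifier, and `a-b` fails because the hyphen is a syntax character rather than an identifier character.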

Page 11: Unicode 4.0

Case mappings now normative (but tailorable)
  Clearer definition of string functions:
    isUpper(), isLower(), isTitle(), isFold()
    toUpper(), toLower(), toTitle(), toFold()
  Definition of titlecase uses word boundaries
  Note that the Turkic mappings do not maintain canonical equivalence, without additional processing.
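
The string functions named above map fairly directly onto Python's built-in string methods, which apply the default (untailored) full case mappings. The sketch below is an approximation under stated assumptions: the wrapper names mirror the slide, `to_lower_turkic` is a hypothetical tailoring that handles only the dotted and dotless I, and Python's `str.title()` uses its own simple notion of word starts rather than UAX #29 word boundaries.

```python
def to_upper(s: str) -> str:
    return s.upper()

def to_lower(s: str) -> str:
    return s.lower()

def to_fold(s: str) -> str:
    return s.casefold()

def is_fold(s: str) -> bool:
    """A string is case-folded if folding leaves it unchanged."""
    return to_fold(s) == s

def to_lower_turkic(s: str) -> str:
    """Hypothetical Turkic tailoring of lowercase (dotted/dotless I only).

    Handling "I" + U+0307 before the precomposed U+0130 is the kind of
    additional processing needed so that canonically equivalent inputs
    give the same result.
    """
    s = s.replace("I\u0307", "i")   # I + combining dot above
    s = s.replace("\u0130", "i")    # LATIN CAPITAL LETTER I WITH DOT ABOVE
    s = s.replace("I", "\u0131")    # plain I lowercases to dotless i
    return s.lower()

upper_sharp_s = to_upper("Stra\u00dfe")          # full mapping changes length
folded_sharp_s = to_fold("Stra\u00dfe")
default_lower = to_lower("DI\u015e")             # default mapping keeps dotted i
turkic_lower = to_lower_turkic("DI\u015e")       # tailored mapping gives dotless i
```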

Page 12: Unicode 4.0

UAX #9: BIDI
  BIDI: Arabic/Hebrew Display
    HTML, all modern word processors, OSs,…
  New:
    canonical equivalence now preserved
      data change, not algorithm
    shaping is done after reordering, but not across directional boundaries
    clarifications of: ZWJ, ZWNJ, intermediate level processing
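
The bidirectional algorithm is driven by a per-character property, which Python exposes through `unicodedata.bidirectional`. The sketch below only looks up that property; it does not implement the reordering in UAX #9, and the helper name `bidi_classes` and the sample text are assumptions.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the Bidi_Class value of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, Hebrew alef, Arabic alef, European digit, space.
classes = bidi_classes("a\u05d0\u06271 ")

# ZWJ and ZWNJ are boundary-neutral; combining marks are nonspacing marks.
joiner_classes = bidi_classes("\u200d\u200c\u0301")
```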

Page 13: Unicode 4.0

UAX #15: Normalization
  Unique form for text comparison
    W3C Character Model, International Domain Names, Network File System,…
  New:
    Description of Stable Code Points.
    Notation NFC(x) and isNFC(x), in Notation.
    Added pointer to UTN #5 Canonical Equivalences in Applications
    Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections.
    Added Annex 13: Canonical Equivalence.
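
The NFC(x) and isNFC(x) notation can be written out with Python's `unicodedata.normalize`. This is a small sketch using the slide's names as function names; the results reflect whatever Unicode version the interpreter ships, not specifically 4.0.

```python
import unicodedata

def NFC(x: str) -> str:
    """Normalization Form C of x."""
    return unicodedata.normalize("NFC", x)

def isNFC(x: str) -> bool:
    """True if x is already in Normalization Form C."""
    return NFC(x) == x

decomposed = "e\u0301"        # e + combining acute accent
composed = NFC(decomposed)    # single precomposed character U+00E9

# Canonically equivalent strings compare equal once both are normalized.
same_text = NFC("\u212b") == NFC("A\u030a")   # ANGSTROM SIGN vs. A + ring above
```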

Page 14: Unicode 4.0

UAX #14: Line Breaking
  Line-Break (word-wrap) all Unicode text
    Customizable for different languages
  New:
    Negative numbers and dates with hyphens will not break across lines
    Word-Joiner will link any characters (except hard line breaks)
    Behavior of soft hyphen clarified
      marks opportunity for breaking, not specific graphic appearance.
    Rules for GL relaxed: SP and ZW override
    New Property Values: NL, WJ

Page 15: Unicode 4.0

UAX #29: Text Boundaries
  Default “User Character”, Word, Sentence boundaries
    Customizable for different languages
    Word, sentence: tailoring expected
  New:
    Extracted from 3.0, but significantly revised
    Grapheme cluster (“user character”)
      Hangul Syllable or other Base plus (optionally) any number of NSMs
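
The "base plus any number of NSMs" part of the grapheme cluster description can be sketched as follows. This is a deliberate simplification and not the UAX #29 algorithm: it treats General_Category Mn and Me as the nonspacing marks, ignores conjoining Hangul jamo sequences and CR LF, and the function name `simple_grapheme_clusters` is hypothetical.

```python
import unicodedata

def simple_grapheme_clusters(text: str) -> list:
    """Group each base character with the nonspacing marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# e + acute, a + diaeresis + dot below, b, precomposed Hangul syllable
clusters = simple_grapheme_clusters("e\u0301a\u0308\u0323b\ud55c")
```

A user would count four "characters" here even though the string holds seven code points.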

Page 16: Unicode 4.0

No Sub. Changes
  UAX #11: East Asian Width
    Guidelines for choosing character width
  UAX #24: Script Names
    Default script assignment
    Used in regular expressions
    Now UAX
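
The East Asian Width property is available in Python as `unicodedata.east_asian_width`. The sketch below uses it for a rough column count; treating Fullwidth and Wide as two columns and everything else as one is an assumption of this example (ambiguous-width and combining characters need context that it ignores), and `display_columns` is a hypothetical name.

```python
import unicodedata

def display_columns(text: str) -> int:
    """Rough terminal width: 2 columns for Fullwidth/Wide, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )

# Narrow Latin A, fullwidth Latin A, CJK ideograph, halfwidth katakana A
widths = [unicodedata.east_asian_width(ch) for ch in "A\uff21\u4e2d\uff71"]
columns = display_columns("A\uff21\u4e2d\uff71")
```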

Page 17: Unicode 4.0

Superseded UAXes
  Incorporated into and thus superseded by Unicode Version 4.0:
    UAX #13: Unicode Newline Guidelines
    UAX #19: UTF-32
    UAX #21: Case Mappings
    UAX #27: Unicode 3.1
    UAX #28: Unicode 3.2

Page 18: Unicode 4.0

Unicode Character Database
  Crucial Component of Unicode
  Documentation coalesced into UCD.html.
  New properties and values
    Hangul_Syllable_Type, Unicode_Radical_Stroke
    CJK numeric values added.
    PropertyValueAliases adds block names
  UCD fallback props more precisely defined.
    for code points not explicitly in data files
  New Characters
    Appropriate properties assigned

Page 19: Unicode 4.0

UCD 4.0 (cont.)
  Modifier letters
    The general category of 02B9..02BA, 02C6..02CF changed to general category Lm.
  Khmer
    Two Khmer characters are deprecated; four others strongly discouraged.
  Decimal Digits
    Numeric_Type=decimal digit now aligned with General_Category=Nd
  Braille
    Added script value
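
The relationship between General_Category and the numeric properties can be inspected with Python's `unicodedata` module. This is a small sketch; `numeric_profile` is a hypothetical helper, and the values come from the Unicode version bundled with the interpreter rather than from the 4.0 data files.

```python
import unicodedata

def numeric_profile(ch: str) -> tuple:
    """Return (General_Category, decimal value, digit value, numeric value).

    Each numeric field is None when the character lacks that kind of value.
    """
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

arabic_indic_three = numeric_profile("\u0663")   # a decimal digit, category Nd
superscript_two = numeric_profile("\u00b2")      # a digit, but not decimal
roman_eight = numeric_profile("\u2167")          # numeric only

# Modifier letters from the ranges listed on the slide.
modifier_categories = [unicodedata.category(ch) for ch in "\u02b9\u02c6"]
```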

Page 20: Unicode 4.0

UCD 4.0 (cont. 2)
  Case Mapping
    Fixed for Turkish, Lithuanian
  Default Ignorables
    Hangul Filler characters
    Soft-Hyphen, CGJ, ZWS
    Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
  Grapheme_Extend
    removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

Page 21: Unicode 4.0

Unicode Technical Standard
  UTS: separate standard
    independent conformance requirements
  UTR: information and guidelines
  Documents may move from UTR status to UTS

Page 22: Unicode 4.0

UTS #10: Unicode Collation
  Significance:
    String comparison, matching, searching
    Compares all Unicode characters
    Handles linguistic features
      Accents, Case, Punctuation,…
      Contextual weighting,…
    Tailor for different languages
  Version 4.0.0 due Sept. 2003
    From now on, to be sync'ed in repertoire and version with the Unicode Standard.
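
The idea of comparing base letters first, then accents, then case can be illustrated with a toy multi-level sort key. This is not the Unicode Collation Algorithm: it uses no collation element table, orders base letters by code point, and ignores punctuation and contextual weighting. The name `sketch_sort_key` and the word list are assumptions for the example.

```python
import unicodedata

def sketch_sort_key(s: str) -> tuple:
    """Toy three-level key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", s)
    bases = [c for c in decomposed if not unicodedata.combining(c)]
    marks = [c for c in decomposed if unicodedata.combining(c)]
    primary = "".join(c.casefold() for c in bases)   # letters only, caseless
    secondary = tuple(marks)                         # accents break ties next
    tertiary = tuple(c.isupper() for c in bases)     # lowercase before uppercase
    return (primary, secondary, tertiary)

words = ["r\u00e9sum\u00e9", "Resume", "rot", "resume"]
by_code_point = sorted(words)
by_levels = sorted(words, key=sketch_sort_key)
```

Sorting by code point puts the accented word after "rot", while the level-based key keeps all three spellings of "resume" together and separates them only by accent and case.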

Page 23: Unicode 4.0

UTS #18: Regular Exp.
  Significance:
    Crucial to many applications: web, XML,…
    Unicode adds significant requirements
      Level 1: Basic Support
        Perl
      Level 2: Extended Support
      Level 3: Tailored Support
  New:
    Recently approved as UTS (was UTR)
    Adds clearer conformance requirements
      Flexible list of features
      Partial conformance claims
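
One of the requirements Unicode adds to regular expressions is that character classes such as "word character" cover all scripts. The sketch below contrasts Unicode-aware and ASCII-only matching in Python's standard `re` module; it makes no claim about which UTS #18 level that module meets, and the sample text is an assumption.

```python
import re

text = "na\u00efve caf\u00e9 \u6771\u4eac"

# Default str patterns: \w matches letters and digits from any script.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [A-Za-z0-9_], so accented and CJK text is split or lost.
ascii_words = re.findall(r"\w+", text, re.ASCII)
```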

Page 24: Unicode 4.0

UTS #6: SCSU
  Simple Unicode Compression
  Added suitability for XML
  See also Technical Note on BOCU
    Main difference: preserves binary order
      x < y => BOCU(x) < BOCU(y)

Page 25: Unicode 4.0

New UTRs
  Draft UTR #23: Character Properties
    Draft Character Property Model
  Character Folding
    Hiragana-Katakana, Case, …
  Programming Language IDs, Syntax characters

Page 26: Unicode 4.0

Q & A
  Other talks here:
    Common Locale Data
      interchange of language-specific data for sorting, dates, times, currencies
    ICU
      premier Unicode enablement library
      full-featured, x-platform C, C++, Java

Page 27: Unicode 4.0

Background Slides

Page 28: Unicode 4.0

Unicode 3.2 (March, 2002)
  New Characters: 1,016
    Symbols
      Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
    Special Characters
      combining grapheme joiner, word joiner, invisible operators for math, variation selectors
    Modern Scripts
      minority scripts of the Philippines

Page 29: Unicode 4.0

Conformance
  Eliminates irregular UTF-8
  Defines variation sequences
  Replaces ZWNBSP with Word Joiner
  Clarifies scope of combining marks (further revised in 4.0)
  Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables,

Page 30: Unicode 4.0

Textual Clarifications
  Combined vowels in Khmer, characters discouraged in Khmer
  Use of dingbats

Page 31: Unicode 4.0

Unicode Standard Annexes
  UAX #21: Case Mappings (was UTR)

Page 32: Unicode 4.0

Unicode Character Database
  New properties:
    IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph,
    Default_Ignorable_Code_Point, Deprecated, Soft_Dotted, Logical_Order_Exception,
    Grapheme_Base, Grapheme_Extend, Grapheme_Link
  DerivedAge
  Normalization Corrections
  Added Property & Property Value Aliases
  Adds StandardizedVariants.html

Page 33: Unicode 4.0

Related Items
  UTS #10: Unicode Collation Algorithm
    Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, non-characters
    Note: base version still U3.1
  UTR #26: CESU-8
  Unicode Technical Notes
  Updated Character Encoding Stability Policy
  Added Public Review process
  Updated Glossary

Page 34: Unicode 4.0

Unicode 3.1 (March, 2001)
  New Characters: 44,946
    First supplementaries encoded!
    Modern scripts
      CJK Ideographs (now totaling 71,039)
    Historic scripts
      Old Italic, Gothic, Deseret, Byzantine Musical Symbols
    Symbols
      Mathematical Alphanumeric Symbols, (Western) Musical Symbols

Page 35: Unicode 4.0

Conformance
  Non-shortest-form UTF-8 excluded
  Clarification of the stability of the standard, code units vs. code points, non-characters, normative properties, informative properties, normative references
  Revisions of guidelines: wchar_t, unassigned code points, identifiers
  Major revision of Georgian
  Use of ZWNJ and ZWJ for ligatures
  Language tag characters encoded but discouraged

Page 36: Unicode 4.0

Unicode Standard Annexes
  UAX #19: UTF-32

Page 37: Unicode 4.0

Unicode Character Database
  Major revision of PropList properties:
    White_Space, Bidi_Control, Join_Control, Hex_Digit
    Alphabetic, Ideographic, Lowercase, Uppercase
    ID_Start, ID_Continue, XID_Start, XID_Continue
    Noncharacter_Code_Point
    Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
  New properties: Case folding, Scripts
  Added DerivedProperties, NormalizationTest

Page 38: Unicode 4.0

Related Items
  Documented Character Encoding Stability Policy
  UTS #10: Unicode Collation Algorithm
    Merged data files; updated to base version 3.1
  UTR #18: Unicode Regular Expression Guidelines
  UTR #20: Unicode in XML and other Markup Languages
  UTR #22: Character Mapping Tables
  UTR #24: Script Names

Page 39: Unicode 4.0

Schedule
  2003, April: UCD/UAXes
    Final data files available
    Implementation can proceed
  2003, September: Book Available