Unicode Normalization Mark Davis www.macchiato.com


  • Normalization
      • Uniqueness: two equivalent strings have precisely the same normalized form
      • Fast binary comparison, accurate digital signatures
      • Recommended for XML, JavaScript and other standards

  • Canonical Equivalence
      • Fundamental equivalence
      • Indistinguishable to users, when correctly rendered
      • Includes:
          • Combining sequences
          • Hangul
          • Singletons

  • Compatibility Equivalence
      • Formatting differences:
          • Font variants
          • Breaking differences (-)
          • Cursive forms
          • Circled
          • Width, size, rotated
          • Super/subscripts
          • Squared characters
          • Fractions
          • Others

  • UTR #15: Unicode Normalization Forms
      • Form D: Canonical Decomposition
      • Form KD: Compatibility Decomposition
      • Form C: Form D + Canonical Composition
      • Form KC: Form KD + Canonical Composition

  • Normalization Requirement
      • Uniqueness: two equivalent strings will have precisely the same normalized form
      • If two strings x and y are canonical equivalents, then
          • C(x) = C(y)
          • D(x) = D(y)
      • If two strings x and y are compatibility equivalents, then
          • KC(x) = KC(y)
          • KD(x) = KD(y)
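
The uniqueness requirement can be checked with any conforming library. The following is a minimal sketch using Python's standard `unicodedata` module, whose form names NFC, NFD, NFKC and NFKD correspond to Forms C, D, KC and KD. The choice of Python, the sample strings and the helper `normalized_set` are assumptions made for illustration and are not taken from the slides.

```python
import unicodedata

# Three canonically equivalent spellings (sample strings chosen for illustration).
canonical_equivalents = [
    "\u00C5",   # LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u212B",   # ANGSTROM SIGN
    "A\u030A",  # "A" followed by COMBINING RING ABOVE
]

# Two compatibility-equivalent spellings: the "fi" ligature and plain "fi".
compatibility_equivalents = ["\uFB01", "fi"]


def normalized_set(strings, form):
    """Hypothetical helper: the set of distinct normalized results."""
    return {unicodedata.normalize(form, s) for s in strings}


# Canonical equivalents collapse to a single string under Form C and Form D.
nfc_forms = normalized_set(canonical_equivalents, "NFC")
nfd_forms = normalized_set(canonical_equivalents, "NFD")

# Compatibility equivalents collapse under KC and KD, but stay distinct under C.
nfkc_forms = normalized_set(compatibility_equivalents, "NFKC")
nfkd_forms = normalized_set(compatibility_equivalents, "NFKD")
nfc_compat_forms = normalized_set(compatibility_equivalents, "NFC")
```

Comparing the normalized results with `==` is then an ordinary binary comparison, which is what makes the normalized form useful for lookups and signatures.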

  • Affected Characters
      • None of the forms affect text with only ASCII characters (U+0000 to U+007F)
      • None of the forms generate compatibility characters that were not in the source text
      • Both KD and KC replace compatibility characters
      • Both D and C maintain compatibility characters
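
These properties are easy to spot-check. The sketch below again assumes Python's `unicodedata` module and uses SUPERSCRIPT TWO as a sample compatibility character. Both choices are illustrative and not from the slides.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Every ASCII character, U+0000 to U+007F, in one string.
ascii_text = "".join(chr(cp) for cp in range(0x80))
ascii_unchanged = all(
    unicodedata.normalize(form, ascii_text) == ascii_text for form in FORMS
)

# SUPERSCRIPT TWO is a compatibility character (sample chosen for illustration).
superscript_two = "\u00B2"
results = {form: unicodedata.normalize(form, superscript_two) for form in FORMS}
# C and D keep the character; KC and KD replace it with the plain digit.
```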

  • Cautions: Decomposition
      • Requires decomposition mappings from the Unicode Character Database
      • Those decomposition mappings must be applied recursively
      • The string must be put into canonical order
      • Either Canonical or Compatibility
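
The three steps can be sketched as follows. This is a simplified illustration and not a conforming implementation: it reads the mappings through Python's `unicodedata.decomposition`, which does not cover the algorithmic decomposition of Hangul syllables, so Hangul is left out. The function names are hypothetical.

```python
import unicodedata


def decompose_char(ch, compatibility=False):
    """Recursively apply the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)  # e.g. "00E2 0300" or "<super> 0032"
    if not mapping:
        return [ch]
    parts = mapping.split()
    if parts[0].startswith("<"):
        # A leading tag such as <super> marks a compatibility mapping.
        if not compatibility:
            return [ch]
        parts = parts[1:]
    result = []
    for part in parts:
        # Recursion: the mapped characters may have mappings of their own.
        result.extend(decompose_char(chr(int(part, 16)), compatibility))
    return result


def canonical_order(chars):
    """Stable-sort each run of non-starters by canonical combining class."""
    out = list(chars)
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = j
    return out


def decompose(text, compatibility=False):
    """Sketch of Form D (or Form KD when compatibility=True), Hangul excluded."""
    chars = []
    for ch in text:
        chars.extend(decompose_char(ch, compatibility))
    return "".join(canonical_order(chars))
```

For non-Hangul input the result can be compared against `unicodedata.normalize("NFD", text)` or `unicodedata.normalize("NFKD", text)` as a sanity check.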

  • Cautions: Composition
      • Decomposition required first!
      • Then canonical composition
      • Composition data: fixed at Unicode 3.0.0
      • Some characters are excluded from composition
      • Form C and Form KC can still have combining characters!
          • Required for Indic, Arabic, Hebrew, etc.

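  • Caution: Both C & D
      • None of the normalization forms is closed under string concatenation
      • Example (NFC and NFD): two strings that are each normalized, such as "a" and a string that starts with a combining mark, can concatenate to a string that is not normalized
      • Exceptions easy to test for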
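
A concrete pair of cases, sketched with Python's `unicodedata` module. The sample strings and the helper names `is_normalized` and `normalized_concat` are assumptions chosen for illustration.

```python
import unicodedata


def is_normalized(text, form):
    """True if normalizing leaves the text unchanged."""
    return unicodedata.normalize(form, text) == text


def normalized_concat(x, y, form):
    """Hypothetical helper: concatenate, then renormalize the result."""
    return unicodedata.normalize(form, x + y)


# Form C: "a" and COMBINING CIRCUMFLEX ACCENT are each normalized,
# but their concatenation composes to a single character.
nfc_x, nfc_y = "a", "\u0302"

# Form D: the concatenation puts COMBINING DOT BELOW (class 220) after
# COMBINING CIRCUMFLEX ACCENT (class 230), which is not canonical order.
nfd_x, nfd_y = "a\u0302", "\u0323"

nfc_joined = normalized_concat(nfc_x, nfc_y, "NFC")
nfd_joined = normalized_concat(nfd_x, nfd_y, "NFD")
```

Renormalizing the whole concatenation is the simplest safe approach. A specialized function could limit the work to the characters around the join.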

  • Composition Process
      • Decompose (D or KD)
      • Combine unblocked characters with the previous starter, if possible
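
The successive pairing can be observed with a library implementation. This sketch assumes Python's `unicodedata` module. The sample sequences, a with circumflex and grave and an Arabic letter with a vowel mark, are chosen for illustration.

```python
import unicodedata

# "a" + COMBINING CIRCUMFLEX ACCENT + COMBINING GRAVE ACCENT
decomposed = "a\u0302\u0300"

# First pairing: the starter "a" combines with the circumflex.
first_pair = unicodedata.normalize("NFC", decomposed[:2])

# Second pairing: the new starter combines with the grave accent.
composed = unicodedata.normalize("NFC", decomposed)

# "If possible": ARABIC LETTER BEH + ARABIC FATHA has no precomposed form,
# so the combining mark is still present after composition.
beh_fatha = "\u0628\u064E"
beh_fatha_nfc = unicodedata.normalize("NFC", beh_fatha)
```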

  • Composition Exclusions
      • Script specifics + futures
      • Singletons
      • Non-starter sequences
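
The effect of an exclusion is that the character is decomposed by Form C and never recomposed. The sketch below assumes Python's `unicodedata` module. The three sample characters and their assignment to the classes are illustrative choices, not examples recovered from the slide.

```python
import unicodedata

# One sample character per class (assumed examples, chosen for illustration).
samples = {
    "script specific": "\uFB31",        # HEBREW LETTER BET WITH DAGESH
    "singleton": "\u212B",              # ANGSTROM SIGN
    "non-starter sequence": "\u0344",   # COMBINING GREEK DIALYTIKA TONOS
}

# Form C output for each sample; none of them survives normalization.
nfc = {name: unicodedata.normalize("NFC", ch) for name, ch in samples.items()}
survives = {name: nfc[name] == ch for name, ch in samples.items()}
```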

  • Legacy Encoding
      • Legacy text is normalized if it maps 1:1 to normalized Unicode text
      • Legacy sets:
          • Prenormalized: e.g. ISO 8859-1
          • Normalizable: e.g. ISO 2022 (ISO 5426/ISO 8859-1/)
          • Unnormalizable: e.g. ISO 5426

  • Programming Identifiers
      • Closed under all Normalization Forms, if minor changes incorporated
      • Modified syntax:

            identifier := start ( start | extend )*
            start      := [{Lu}{Ll}{Lt}{Lm}{Lo}{Nl}] - irregulars - combining_like
            extend     := [{Mn}{Mc}{Nd}{Pc}{Cf}] - irregulars + combining_like + mid_dot

      • (Almost) closed under Case Mappings
          • See SpecialCasing.txt
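
A minimal sketch of how such a syntax could be checked, assuming Python's `unicodedata` module for the general categories. It reads the modified syntax as removing `combining_like` from `start` and adding it to `extend`. The sets `IRREGULARS` and `COMBINING_LIKE` are left as empty placeholders, because the actual character lists are given in UTR #15 and are not reproduced on the slide. All function names are hypothetical.

```python
import unicodedata

START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc", "Cf"}

# Placeholders (assumption): fill these from the lists in UTR #15.
IRREGULARS = frozenset()
COMBINING_LIKE = frozenset()
MID_DOT = "\u00B7"  # MIDDLE DOT


def is_start(ch):
    return (
        unicodedata.category(ch) in START_CATEGORIES
        and ch not in IRREGULARS
        and ch not in COMBINING_LIKE
    )


def is_extend(ch):
    if ch in COMBINING_LIKE or ch == MID_DOT:
        return True
    return unicodedata.category(ch) in EXTEND_CATEGORIES and ch not in IRREGULARS


def is_identifier(text):
    """identifier := start ( start | extend )*"""
    if not text or not is_start(text[0]):
        return False
    return all(is_start(ch) or is_extend(ch) for ch in text[1:])


def identifier_key(text):
    """Form KC key for storing or comparing identifiers."""
    return unicodedata.normalize("NFKC", text)
```

With this sketch, an identifier spelled with the "fi" ligature (U+FB01) and one spelled with plain "fi" produce the same key, so a compiler or linker comparing keys would treat them as the same name.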

  • Resources
      • Reference version on Unicode Site
      • Production Version
          • http://oss.software.ibm.com/icu
          • ICU: C/C++ and Java Versions
          • Open Source, with IBM Public License
          • Free commercial use and distribution: Not Viral!
      • Panel later today
      • Other companies also providing: ask!

  • Normalization
      • Uniqueness: two equivalent strings have precisely the same normalized form
      • Fast binary comparison, accurate digital signatures
      • Recommended for XML, JavaScript and other standards

  • Q & A

  • Backup Slides

  • Definition: Starter
      • S is a starter = canonical class of zero in the Unicode Character Database
      • Can start a composition
      • Examples:
          • Starters: spacing marks, some non-spacing marks; e.g. a
          • Non-starters: most non-spacing marks

  • Definition: Blocked
      • C is blocked from S if there is some character B between S and C, and either
          • B is a starter, or
          • B has the same canonical class as C
      • Examples:
          • ABC: B blocks C from A
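
Both definitions translate directly into code. The following sketch assumes Python's `unicodedata.combining`, which returns the canonical combining class of a character. The function names and sample strings are hypothetical.

```python
import unicodedata


def is_starter(ch):
    """A starter has canonical combining class zero."""
    return unicodedata.combining(ch) == 0


def is_blocked(text, s_index, c_index):
    """True if text[c_index] is blocked from text[s_index].

    Follows the definition on the slide: some character B lies between
    S and C, and B is a starter or has the same canonical class as C.
    """
    c_class = unicodedata.combining(text[c_index])
    for b in text[s_index + 1:c_index]:
        b_class = unicodedata.combining(b)
        if b_class == 0 or b_class == c_class:
            return True
    return False


# "b" is a starter, so it blocks the circumflex from "a".
blocked_by_starter = is_blocked("ab\u0302", 0, 2)

# Circumflex and grave share class 230, so the circumflex blocks the grave.
blocked_by_same_class = is_blocked("a\u0302\u0300", 0, 2)

# Dot below (class 220) does not block the circumflex (class 230).
not_blocked = is_blocked("a\u0323\u0302", 0, 2)
```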

  • Testing Conformance: Canonical

    For all Unicode characters X:
      • C(X) = C(D(X))
      • D(X), C(X) in canonical order

    X has no CDM:
      • X = D(X)
      • X = C(X)

    X has a CDM:
      • X ≠ D(X)
      • No characters in D(X) have a CDM
      • X ∈ Exclusions: X ≠ C(D(X))
      • X ∉ Exclusions: X = C(D(X))

  • Unicode Normalization
      • Introduction
      • Normalization forms
      • Design goals
      • Specification
      • Excluded characters
      • Versions
      • Legacy encodings
      • Applications

  • Characters and Encoding Forms

    Levels: Abstract, Encoded, Serialized

        Encoded     Serialized (UTF-16BE)    Serialized (UTF-8)
        C5          00C5                     C3 85
        212B        212B                     E2 84 AB
        F0000       DB80 DC00                F3 B0 80 80
        61 30A      0061 030A                61 CC 8A
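
The serialized byte values for the code points on this slide (C5, 212B, F0000, and 61 followed by 30A) can be reproduced with Python's built-in codecs. The helper names below are hypothetical. This is an illustrative sketch, not part of the original deck.

```python
# The four encoded rows from the slide, as Python strings.
rows = ["\u00C5", "\u212B", "\U000F0000", "a\u030A"]


def code_points(text):
    """Encoded level: hexadecimal code points, space separated."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def serialized(text, encoding):
    """Serialized level: the bytes of the text in the given encoding, as hex."""
    return text.encode(encoding).hex().upper()


table = [
    (code_points(text), serialized(text, "utf-16-be"), serialized(text, "utf-8"))
    for text in rows
]
```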

    Normalization is the process of converting equivalent strings to a unique format. It allows for fast binary comparison and accurate digital signatures. It is recommended for XML, JavaScript and other technologies.

    There are two different forms of equivalency in Unicode: canonical equivalency and compatibility equivalency. We will first discuss the differences between those.

    Canonical equivalence is a basic equivalency between characters.

    Compatibility equivalents are characters that are distinguished by format.

    Forms C and D erase the distinction between canonical equivalents, while Forms KD and KC erase the distinctions between compatibility equivalents. Forms C and D preserve the essential semantics of the text. Forms KD and KC do not. Since forms KD and KC may erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, may remove distinctions that are important to the semantics of the text. The best way to think of these normalization forms is like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate.

    The most significant requirement for all the normalization forms is uniqueness. This is maintained by the four forms for their respective equivalencies.

    There are some important features of the normalization forms.

    Certain features are worth special attention. Decomposition must be done in accordance with the full specification to work properly.

    Both composition normalization forms perform decomposition first. They then canonically compose the text. This means that compatibility characters that were not in the original will never be introduced.

    Because additional composite characters may be added to future versions of the Unicode Standard, composition is less stable than decomposition. Therefore, it is necessary to specify a fixed version for the composition process, so that implementations can get the same result for normalization even if they upgrade to a new version of Unicode.

    Note that even after composition, there may still be combining marks left in the text!

    Without limiting the repertoire, there is no way to produce a normalized form that is closed under simple string concatenation. If desired, however, a specialized function could be constructed that produced a normalized concatenation.

    Now the process is pretty simple: first decompose, then combine pairs of characters if possible. The limitations are that the first must be a starter, and the second must be unblocked. Since this is done successively, a sequence like a + circumflex + grave will first combine the a and the circumflex, then combine the a-circumflex with the grave.

    There are certain types of characters that are not included in composition. These fall into the three classes listed above. A list of these characters is provided in the Unicode Character Database on the Unicode Website.

    While the Normalization Forms are specified for Unicode text, they can also be extended to non-Unicode (legacy) character encodings. This is based on mapping the legacy character set strings to and from Unicode. These legacy sets fall into three categories:

    The Unicode Standard provides a recommended syntax for identifiers for programming languages that allow the use of non-ASCII languages in code. It is a natural extension of the identifier syntax used in C and other programming languages.

    That is, the first character of an identifier can be an uppercase letter, lowercase letter, titlecase letter, modifier letter, other letter, or letter number. The subsequent characters of an identifier can be any of those, plus non-spacing marks, spacing combining marks, decimal numbers, connector punctuations, and formatting codes (such as the right-to-left mark). Normally the formatting codes should be filtered out before storing or comparing identifiers.

    Normalization can be used to avoid problems where apparently identical identifiers are not treated equivalently. Such problems can appear both during compilation and during linking, in particular also across different programming languages. To avoid such problems, programming languages should normalize identifiers before storing or comparing them, preferably in Normalization Form KC, especially if the identifiers are caseless. While Normalization Form C can also be used, Form KC eliminates variations that are probably not relevant to the specification of programming language identifiers.

    However, if programming languages are using Form KC to level differences between characters, then they need to use a slight modification of the identifier syntax from the Unicode Standard to deal with the idiosyncrasies of a small number of characters. These changes are indicated with red in the diagram. A full account is given in Unicode Technical Report #15.

    IBM's alphaWorks site provides Normalization code, in two different forms:
      • A Java version
      • A C/C++ version. This latter version is part of the IBM Classes for Unicode, which provides open source code under IBM's public license. This provides for commercial use, modification and distribution.

    For more details, see the alphaWorks site.

    We will now talk a bit about the technical details involved in normalization. For most people, this is not important, since they will just use a normalization package that conforms to the specification. However, for the curious we will take some time to spell out some of the details.

    The first is a starter, which is a character that begins a sequence of characters that may be composed.

    Next we need the definition of blocked. This is used to characterize sequences that can't be combined, because a change in order would violate canonical equivalence.