Han Unification for Chinese/Japanese/Korean Wanxing Wang July 17, 2018 University of Waterloo CS 846 - Advanced Topics in Electronic Publishing

Page 1

Han Unification for Chinese/Japanese/Korean

Wanxing Wang
July 17, 2018

University of Waterloo CS 846 - Advanced Topics in Electronic Publishing

Page 2

Agenda

• Han Characters
• Han Unification
• CJK Unified Ideographs in Unicode Table
• Further Discussion

Page 3

Han Characters

• Western vs East Asian words – Phonograms vs Ideograms
• Example:

bright /brīt/ 明
moon /mo͞on/ 月
sun /sən/ 日

Page 4

Han Characters

• Phonograms: easy to represent with a small set of symbols
• Ideograms: cannot be generated automatically; each one must be “hardcoded” when encoding
• Record number: 48027 ideographs

Page 5

Variants of Han Characters

• Adapted to non-Chinese cultures:
    • Hanzi in Chinese, kanji in Japanese and hanja in Korean
• Examples:
    • 切手:
        • Chinese – “to cut hand”
        • Japanese – “stamp”
    • 中国 – in Japanese:
        • China
        • A region in western Honshu

Page 6

Variants of Han Characters

• Ideograph simplification
    • Traditional Chinese
        • Hong Kong, Macao, Taiwan and overseas Chinese communities
    • Simplified Chinese
        • By the Chinese government during the 20th century
        • Mainland China and Singapore
    • Simplified Japanese
        • By the Japanese government after the Second World War

Page 7

Variants of Han Characters

• Traditional and Simplified ideographs

Page 8

Variants of Han Characters

• Ideograph variants in the same character set

Page 9

Variants of Han Characters

• Variants of character glyphs
    • A wide variation in the glyphs used in different countries and for different applications.

Page 10

Variants of Han Characters

Page 11

Han Unification

• Unicode attempts to unify all ideographs from the many CJK national character set standards into a single set of ideographs
• Goal:
    • Provide coverage for the major CJK character set standards
• Benefits:
    • A much larger repertoire of characters than found in other CJK character set standards.
    • Compatibility with the characters in existing CJK character set standards.

Page 12

Three Dimensional Conceptual Model

• X-axis: semantic (meaning, function)
• Y-axis: abstract form (general form)
• Z-axis: actual shape (instantiated, typeface form)
• Only Z-axis differences were merged or unified in Unicode.

Page 13

Three Dimensional Conceptual Model

Page 14

Unification Rules

• Source Separation Rule: If two ideographs are distinct in a primary source standard, then they are not unified.
    • Round-trip rule
    • Z-axis variant

Page 15

Unification Rules

• Noncognate Rule: If two ideographs are unrelated in historical derivation (noncognate characters), then they are not unified.
    • Noncognate
    • Cognate

Page 16

Abstract Shape

• Y-axis: abstract shape
• Ideographic Component Structure

Page 17

Abstract Shape

• Ideograph Features
    • Number of components
    • Relative positions of components in each complete ideograph
    • Structure of a corresponding component
    • Treatment in a source character set
    • Radical contained in a component
• If one or more of these features differ between the ideographs compared, the ideographs are considered to have different abstract shapes.

Page 18

Unification Rules

• Any two ideographs that possess the same abstract shape are then unified provided that their unification is not disallowed by either the Source Separation Rule or the Noncognate Rule.
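
The decision rule on this slide can be sketched as a small predicate. This is only an illustration under stated assumptions: the `Ideograph` record, its field names, and the placeholder values (`"S1"`, `"D1"`, `"STD-A"`) are hypothetical and are not taken from Unicode data. Real unification decisions are made by expert review, not by string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ideograph:
    """Hypothetical record describing one candidate ideograph."""
    label: str
    abstract_shape: str   # assumed identifier for the Y-axis abstract shape
    derivation: str       # assumed identifier for the historical origin
    # Assumed: source standards that give this form its own code point.
    sources: frozenset = frozenset()


def can_unify(a: Ideograph, b: Ideograph) -> bool:
    # Source Separation Rule: both forms are encoded separately
    # in at least one common source standard, so keep them apart.
    if a.sources & b.sources:
        return False
    # Noncognate Rule: unrelated historical derivation blocks unification.
    if a.derivation != b.derivation:
        return False
    # Otherwise unify only when the abstract shapes match.
    return a.abstract_shape == b.abstract_shape


x = Ideograph("x", "S1", "D1", frozenset({"STD-A"}))
y = Ideograph("y", "S1", "D1", frozenset({"STD-B"}))
z = Ideograph("z", "S1", "D1", frozenset({"STD-A"}))
print(can_unify(x, y))  # same shape, same derivation, no shared source
print(can_unify(x, z))  # blocked: both are distinct in STD-A
```

The two blocking rules are checked first so that a matching abstract shape can never override them, which mirrors the wording of the slide.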

Page 19

Examples

• Ideographs not unified

Page 20

Examples

• Ideographs unified

Page 21

Unicode Ideographs, Radicals and Strokes

• Reduced from 121000 to 20902 ideographs by Han Unification
• Arranged by radical, followed by the number of additional strokes.
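
A minimal sketch of radical-stroke ordering, assuming each character carries a (Kangxi radical number, additional strokes) pair. The five sample values are assumed to match the Unihan `kRSUnicode` field; the names `SAMPLE` and `radical_stroke_key` are made up for this example.

```python
# (character, Kangxi radical number, additional strokes)
# Values assumed to match the Unihan kRSUnicode field.
SAMPLE = [
    ("明", 72, 4),
    ("人", 9, 0),
    ("丁", 1, 1),
    ("日", 72, 0),
    ("一", 1, 0),
]


def radical_stroke_key(entry):
    """Sort by radical first, then by additional stroke count."""
    _, radical, extra_strokes = entry
    return (radical, extra_strokes)


ordered = [ch for ch, _, _ in sorted(SAMPLE, key=radical_stroke_key)]
by_code_point = sorted(ch for ch, _, _ in SAMPLE)

print(ordered)
print(ordered == by_code_point)
```

For this small sample the radical-stroke order and the code point order coincide, which is what the arrangement described above implies for characters in the original block.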

Page 22

Han Ideograph Arrangement

• The arrangement of the Unicode Han characters is based on the positions of characters as they are listed in four major dictionaries:
    • The KangXi Zidian: chosen as primary
        • It contains most of the source characters
        • Commonly used throughout East Asia

Page 23

CJK Unified Ideograph URO

• Unicode’s original block of 20902 ideographs is referred to as the Unified Repertoire and Ordering (URO).
• Range: U+4E00 ~ U+9FFF
    • Original: U+4E00 ~ U+9FA5
    • Version 4.1: U+9FA6 ~ U+9FBB
    • Later additions: U+9FBC ~ U+9FCB

Page 24

CJK Unified Ideographs Extension A-F

• Extension A: U+3400 ~ U+4DBF
    • The last large repertoire of ideographs to be added to Unicode’s BMP.
• Extension B: U+20000 ~ U+2A6DF
    • In Plane 2
• Extension C: U+2A700 ~ U+2B73F
• Extension D: U+2B740 ~ U+2B81F
• Extension E: U+2B820 ~ U+2CEAF
• Extension F: U+2CEB0 ~ U+2EBEF
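
The block ranges listed in these slides lend themselves to a simple lookup. The sketch below hardcodes only the URO and Extension A-F ranges given above; the table name `CJK_BLOCKS` and the function `cjk_block` are hypothetical, and the table does not cover blocks added to Unicode later.

```python
# (block name, first code point, last code point), as listed in the slides.
CJK_BLOCKS = [
    ("Extension A", 0x3400, 0x4DBF),
    ("URO", 0x4E00, 0x9FFF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
]


def cjk_block(ch):
    """Return the name of the unified-ideograph block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def plane(ch):
    """Unicode plane number: 0 is the BMP, 2 holds Extensions B-F."""
    return ord(ch) >> 16


print(cjk_block("明"), plane("明"))
print(cjk_block("\U00020000"), plane("\U00020000"))
print(cjk_block("A"))
```

A range match only says the code point falls inside the block; it does not check whether that code point is actually assigned.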

Page 25

CJK Compatibility Ideographs

• U+F900 ~ U+FAFF
• A Unicode block created to contain Han characters that were encoded in multiple locations in other established character encodings.
• In order to retain round-trip compatibility between Unicode and those encodings.
• Includes a few regular ideographs that do not have duplicates.

Page 26

CJK Compatibility Ideographs

• A process called Normalization can be applied to CJK Compatibility Ideographs, and the result is that they are converted into their Canonical Equivalents.
• For some locales and for some code points, the application of Normalization effectively removes distinctions.
• Example:
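
As one illustration of this effect, Python's standard `unicodedata` module can normalize a compatibility ideograph. U+F900 is chosen here only as a sample code point; it is not necessarily the example shown on the original slide.

```python
import unicodedata

compat = "\uF900"  # CJK COMPATIBILITY IDEOGRAPH-F900
normalized = unicodedata.normalize("NFC", compat)

# The compatibility ideograph has a canonical (singleton) decomposition,
# so even NFC replaces it with the unified ideograph.
print(f"U+{ord(compat):04X} -> U+{ord(normalized):04X}")
print(unicodedata.decomposition(compat))
print(compat == normalized)
```

After normalization the original code point is gone, so a round trip back to the source encoding can no longer tell the two characters apart.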

Page 27

Kangxi Radicals, CJK Radicals Supplement and CJK Strokes

• Kangxi Radicals: U+2F00 ~ U+2FD5
    • Includes characters that represent the complete set of 214 classical radicals as used by the vast majority of ideograph dictionaries.
• CJK Radicals Supplement: U+2E80 ~ U+2EF3
    • This collection of radical variants appears to be somewhat ad-hoc.
• CJK Strokes: U+31C0 ~ U+31CF, U+31D0 ~ U+31E3
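
A short check of the Kangxi Radicals range, using Python's standard `unicodedata` module. It confirms that the range U+2F00 ~ U+2FD5 holds exactly 214 code points and shows how a radical character relates to the corresponding unified ideograph under compatibility normalization (NFKC). The variable names are illustrative only.

```python
import unicodedata

KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5
radical_count = KANGXI_LAST - KANGXI_FIRST + 1

radical_one = "\u2F00"
name = unicodedata.name(radical_one)
as_ideograph = unicodedata.normalize("NFKC", radical_one)

print(radical_count)
print(name)
print(f"U+{ord(radical_one):04X} -> U+{ord(as_ideograph):04X}")
```

The radical and the ideograph look alike but are separate characters; only compatibility normalization maps one to the other.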

Page 28

Further Discussion

• Language tags and Han Unification
    • A common misunderstanding: Han characters cannot be rendered properly without language information.
    • Plain text remains legible in the absence of these specifications.

Page 29

Further Discussion

• What if the ideographs are not enough?
    • GETA MARK: U+3013 〓
    • IDEOGRAPHIC VARIATION INDICATOR: U+303E 〾
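
One way the geta mark might be used is as a visible placeholder for characters that a target repertoire lacks. The sketch below assumes the available repertoire is given as a plain Python set; `substitute_unavailable` is a hypothetical helper, not part of any standard library.

```python
import unicodedata

GETA_MARK = "\u3013"
VARIATION_INDICATOR = "\u303E"


def substitute_unavailable(text, available):
    """Replace every character missing from `available` with the geta mark."""
    return "".join(ch if ch in available else GETA_MARK for ch in text)


print(unicodedata.name(GETA_MARK))
print(unicodedata.name(VARIATION_INDICATOR))
print(substitute_unavailable("明日", {"明"}))
```

The substitution is lossy: once a character is replaced by 〓, the original cannot be recovered from the text alone.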

Page 30

Questions

Page 31

Thanks for listening!