.
Structure-based Web Access Method for Ancient Chinese Characters
Xiaoqing Lu, Yingmin Tang, Zhi Tang, Yujun Gao, Jianguo Zhang
Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Beijing Founder Electronics Co., Ltd., Beijing, 100085, China
Center for Chinese Font Design and Research, Beijing, 100871, China
State Key Laboratory of Digital Publishing Technology (Peking University Founder Group Co., Ltd.), 100871, Beijing, China
{lvxiaoqing,tangyingmin,tangzhi}@pku.edu.cn, {gao_yujun,zjg}@founder.com
2013.11.19, Chongqing, China
.
NLP&CC 2013
Outline
• Background
• Formalization of relationships between ACCs and modern characters
• Establishment of Super Large Font
• ACC Database
• Implementation and results
.
Background (1/3)
• Ancient Chinese Characters (ACCs)
  - Important heritage of Chinese history
  - Date back at least 3,300 years
  - Development is not one-dimensional
  - Collection, management, and access on the Internet
.
Background (2/3)
• Problem 1: Involves very large quantities of modern characters

Block                      Range          Comment
CJK Unified Ideographs     4E00–9FFF      Common
Extension A                3400–4DBF      Rare
Extension B                20000–2A6DF    Rare, historic
Extension C                2A700–2B73F    Rare, historic
Extension D                2B740–2B81F    Uncommon, some in current use
Compatibility              F900–FAFF      Duplicates, unifiable variants, corporate characters
Compatibility Supplement   2F800–2FA1F    Unifiable variants
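To make the scale of the encoded repertoire concrete, the block ranges listed above can be turned into a small lookup. This is only an illustrative sketch: the names `CJK_BLOCKS`, `cjk_block`, and `total_code_points` are hypothetical, and the count is the size of the ranges, not the number of characters actually assigned in them.

```python
# Block ranges copied from the table above (inclusive code points, hexadecimal).
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Compatibility", 0xF900, 0xFAFF),
    ("Compatibility Supplement", 0x2F800, 0x2FA1F),
]


def cjk_block(ch):
    """Return the name of the listed block that contains ch, or None."""
    code_point = ord(ch)
    for name, low, high in CJK_BLOCKS:
        if low <= code_point <= high:
            return name
    return None


def total_code_points():
    """Sum the sizes of all listed ranges (range size, not assigned characters)."""
    return sum(high - low + 1 for _, low, high in CJK_BLOCKS)


if __name__ == "__main__":
    print(cjk_block("中"))
    print(total_code_points())
```

A lookup of this kind shows why a conventional input method struggles: most of the listed ranges lie outside the common block.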
.
Background (3/3)
• Problems 2 & 3
  - Lack of software code
  - Traditional IMEs (input method editors) are not suitable for ACCs
.
Related work
• 1993, Xusheng Ji
• 1994, Ning Li
• 1996, Fangzheng Chen
• 2003, Zaixing Zhang
• 2004, Zhiji Liu
• 2005, Derming Juang
• 2007, Yi Zhuang
• 2008, James S. Kirk
• 2008, Dan Chen
• ... ...
.
2 Formalization of relationships between ACCs and modern characters
• Contemporary encoded characters
  - Existing encoded Chinese characters
  - Marks for uncoded Chinese characters
• ACCs
  - Corresponding relationships with contemporary encoded characters
  - No corresponding relationships with the contemporary encoded characters
.
2 Formalization of relationships between ACCs and modern characters
• Two relations
• Three Types of ACCs
  - Recognized characters
  - Ambiguous characters
  - Unrecognized characters
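One possible reading of the three types is by how many contemporary encoded characters an ACC is mapped to. The sketch below assumes, and the slide does not state this, that a recognized character maps to exactly one contemporary character, an ambiguous character to several candidates, and an unrecognized character to none. The names `AccType` and `classify_acc` are hypothetical.

```python
from enum import Enum


class AccType(Enum):
    RECOGNIZED = "recognized"
    AMBIGUOUS = "ambiguous"
    UNRECOGNIZED = "unrecognized"


def classify_acc(candidates):
    """Classify an ACC by its candidate contemporary characters.

    Assumption (not from the slides): the type follows from the number of
    distinct contemporary characters the ACC is believed to correspond to.
    """
    distinct = set(candidates)
    if not distinct:
        return AccType.UNRECOGNIZED
    if len(distinct) == 1:
        return AccType.RECOGNIZED
    return AccType.AMBIGUOUS


if __name__ == "__main__":
    print(classify_acc(["馬"]))        # one candidate
    print(classify_acc(["日", "曰"]))  # several candidates
    print(classify_acc([]))            # no candidate
```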
.
3 Establishment of Super Large Font
• Automatic generation of Chinese characters [27-30]
• Rules regarding glyph structure decomposition
• Redundant expressions of glyph structures are permitted
• Multi-level radicals
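The ideas of structure decomposition, redundant expressions, and multi-level radicals can be illustrated with a small recursive table. This is a sketch under stated assumptions, not the rule set used for the Super Large Font: it borrows the Unicode Ideographic Description Characters (⿰, ⿱, ⿲) as structure operators, and the table `DECOMPOSITIONS` and both function names are hypothetical.

```python
# Each character maps to one or more decompositions, so redundant
# expressions of the same glyph structure are allowed.
# A decomposition is (structure_operator, [components]).
DECOMPOSITIONS = {
    "湖": [("⿰", ["氵", "胡"]), ("⿲", ["氵", "古", "月"])],
    "胡": [("⿰", ["古", "月"])],
    "古": [("⿱", ["十", "口"])],
}


def leaf_components(ch):
    """Expand ch to its lowest-level components, following the first
    decomposition listed at every level."""
    options = DECOMPOSITIONS.get(ch)
    if not options:
        return [ch]
    _, parts = options[0]
    leaves = []
    for part in parts:
        leaves.extend(leaf_components(part))
    return leaves


def all_components(ch):
    """Collect the components at every level and across every alternative
    decomposition, so a lookup can start from any radical level."""
    found = set()
    for _, parts in DECOMPOSITIONS.get(ch, []):
        for part in parts:
            found.add(part)
            found |= all_components(part)
    return found


if __name__ == "__main__":
    print(leaf_components("湖"))
    print(sorted(all_components("湖")))
```

Keeping the alternative decompositions side by side means a user who sees 湖 as 氵 + 胡 and one who sees it as 氵 + 古 + 月 can both reach the same glyph.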
.
ACC Database (1/3)
• Relation Schema

Item             Meaning
Unicode          Contemporary Chinese character Unicode for this ancient character.
Dynasty          Dynasty when this ancient character was used.
Type             Type of this ancient character (e.g. pictographic characters, ideograph, and phonogram).
Classification   Class type of this ancient character (e.g. inscriptions on bones or tortoise shells of the Shang Dynasty, inscriptions on bronze, seal character, etc.).
Place            Contemporary place where this ancient character was unearthed.
Carrier          Carrier of this ancient character (e.g. the name or the number of a certain bronze implement).
Country          Ancient country where this ancient character was used.
SubbaseID        Number of the font database storing this ancient character.
SubID            Code of the ancient character, used in sub-font database.
Filename         File name for the picture of this ancient character.
ID               The unique ID of this ancient character in the font database.
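The relation schema above could be expressed as a single table. The sketch below uses SQLite only for illustration. The column names follow the schema items, but the table name `acc`, the column types, the helper `find_by_unicode`, and the sample row are all assumptions; the row is a made-up placeholder, not a record from the actual database.

```python
import sqlite3

# Column names follow the schema items; the types are assumptions.
SCHEMA = """
CREATE TABLE acc (
    ID             INTEGER PRIMARY KEY,  -- unique ID in the font database
    Unicode        TEXT,                 -- contemporary character code, if any
    Dynasty        TEXT,
    Type           TEXT,
    Classification TEXT,
    Place          TEXT,
    Carrier        TEXT,
    Country        TEXT,
    SubbaseID      INTEGER,
    SubID          TEXT,
    Filename       TEXT
)
"""


def find_by_unicode(conn, code):
    """Return (ID, Dynasty, Filename) for every ACC linked to a contemporary code."""
    query = "SELECT ID, Dynasty, Filename FROM acc WHERE Unicode = ? ORDER BY ID"
    return conn.execute(query, (code,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Placeholder row for illustration only.
conn.execute(
    "INSERT INTO acc VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "U+99AC", "Shang", "pictographic", "oracle bone",
     "placeholder place", "placeholder carrier", "placeholder country",
     1, "0001", "acc_0001.png"),
)

if __name__ == "__main__":
    print(find_by_unicode(conn, "U+99AC"))
```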
.
ACC Database (2/3)
• Other relation schemas
  - Dynasty and Country (DC_RS)
  - Ancient Chinese Character Classification (ACCC_RS)
  - ACC Type (ACCT_RS)
  - Unicode and Glyph (UG_RS)
  - Radical and Component (RC_RS)
  - Ancient Image (AI_RS)
  - Contemporary Image (CI_RS)
.
ACC Database (3/3)
• Relationships of the data tables
.
Implementation and results
• Retrieval method
.
Implementation and results
.