ACL-IJCNLP 2015 The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing July 30-31, 2015 Beijing, China



  • ACL-IJCNLP 2015

    The 53rd Annual Meeting of the Association for Computational Linguistics and the

    7th International Joint Conference on Natural Language Processing

    Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

    July 30-31, 2015
    Beijing, China

  • ©2015 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

    Order copies of this and other ACL proceedings from:

    Association for Computational Linguistics (ACL)
    209 N. Eighth Street
    Stroudsburg, PA 18360
    USA
    Tel: +1-570-476-8006
    Fax: [email protected]

    ISBN 978-1-941643-57-0


  • Preface

    Welcome to the Eighth SIGHAN Workshop on Chinese Language Processing! Sponsored by the Association for Computational Linguistics (ACL) Special Interest Group on Chinese Language Processing (SIGHAN), this year’s SIGHAN-8 workshop is being held in Beijing, China, on July 30-31, 2015, and is co-located with ACL-IJCNLP 2015. The workshop program includes three keynote speeches, research paper presentations and two Bake-offs. We hope that these events will bring together researchers and practitioners to share ideas and developments in various aspects of Chinese language processing.

    We received 17 valid submissions, each of which was assigned to three reviewers. After a rigorous review process, we accepted 5 papers for oral presentation (a 30% acceptance rate) and 6 papers for poster presentation, for an overall acceptance rate of 65%.

    We are honored to welcome our distinguished speakers. Dr. Min Zhang (Distinguished Professor, Soochow University, China) and Rou Song (Professor, Beijing Language and Culture University, China) will give the first keynote speech, "Discourse and Machine Translation." Yanxiong Lu and Lianqiang Zhou (WeChat Pattern Recognition Center, Tencent) will speak on "Intelligent Q&A System and NLP Open Platform." Finally, Dr. Lun-Wei Ku (Assistant Research Fellow, Academia Sinica, Taiwan) will speak on "From Lexical to Compositional Chinese Sentiment Analysis."

    We would also like to thank the Bake-off organizers. The first task, Chinese Spelling Check, was organized by Dr. Yuen-Hsien Tseng (National Taiwan Normal University), Dr. Lung-Hao Lee (National Taiwan Normal University), Dr. Li-Ping Chang (National Taiwan Normal University), and Dr. Hsin-Hsi Chen (National Taiwan University). The second task, Topic-Based Chinese Message Polarity Classification, was organized by Dr. Xiangwen Liao (Fuzhou University, China), Dr. Ruifeng Xu (Harbin Institute of Technology, China), Dr. Binyang Li (University of International Relations, China), and Dr. Liheng Xu (Institute of Automation, Chinese Academy of Sciences, China). A total of sixteen teams participated in these two tasks and achieved good results.

    Finally, we would like to thank all authors for their submissions. We appreciate your active participation and support in ensuring a smooth and successful conference. The publication of these papers represents the joint effort of many researchers; we are grateful to the review committee for their work and to the SIGHAN committee for their continuing support. We wish everyone a rewarding and eye-opening time at the workshop.

    SIGHAN-8 Workshop Co-organizers
    Liang-Chih Yu, Yuan Ze University
    Zhifang Sui, Peking University
    Yue Zhang, Singapore University of Technology and Design
    Vincent Ng, University of Texas at Dallas


  • Organizing Committee

    Organizers:

    Liang-Chih Yu, Yuan Ze University
    Zhifang Sui, Peking University
    Yue Zhang, Singapore University of Technology and Design
    Vincent Ng, University of Texas at Dallas

    SIGHAN Committee:

    Chengqing Zong, Chinese Academy of Sciences
    Min Zhang, Soochow University
    Gina-Anne Levow, University of Washington
    Nianwen Xue, Brandeis University

    Program Committee:

    Chia-Hui Chang, National Central University
    Li-Ping Chang, National Taiwan Normal University
    Wanxiang Che, Harbin Institute of Technology
    Hsin-Hsi Chen, National Taiwan University
    Kuan-hua Chen, National Taiwan University
    Xiangyu Duan, Soochow University
    Xianpei Han, Chinese Academy of Sciences
    Xuanjing Huang, Fudan University
    Jing Jiang, Singapore Management University
    Chunyu Kit, City University of Hong Kong
    Wai Lam, Chinese University of Hong Kong
    Chao-Hong Liu, Dublin City University
    Lung-Hao Lee, National Taiwan University
    Haizhou Li, Institute for Infocomm Research
    Jyun-Jie Lin, Yuan Ze University
    Yang Liu, Tsinghua University
    Xiangwen Liao, Fuzhou University
    Jianyun Nie, University of Montreal
    Likun Qiu, Ludong University
    Fuji Ren, The University of Tokushima
    Weiwei Sun, City University of Hong Kong
    Yuen-Hsien Tseng, National Taiwan Normal University
    Hsin-Min Wang, Academia Sinica
    Kun Wang, Chinese Academy of Sciences
    Derek F. Wong, University of Macau
    Chung-Hsien Wu, National Cheng Kung University
    Ruifeng Xu, Harbin Institute of Technology
    Chin-Sheng Yang, Yuan Ze University
    Jui-Feng Yeh, National Chiayi University
    Guodong Zhou, Soochow University
    Qiang Zhou, Tsinghua University
    Jingbo Zhu, Northeastern University


  • Invited Talk: Discourse and Machine Translation
    ZHANG Min, Soochow University, China

    SONG Rou, Beijing Language and Culture University, China

    Abstract

    Discourse in linguistics refers to a unit of language longer than a single sentence. It has not been well studied in the computational linguistics research community, but it has attracted more and more attention in recent years. This talk consists of two parts: discourse, and machine translation. We will first give an overview of discourse and review the state of the art of discourse research from both linguistic and computational viewpoints, and then discuss how machine translation can benefit from discourse-level information. Finally, we conclude the talk with a discussion of future directions.

    Biography

    ZHANG Min is a distinguished professor and vice dean of the School of Computer Science and Technology, and director of the Research Institute for Human Language Technology, at Soochow University (China). He received his Ph.D. degree in computer science from Harbin Institute of Technology (China) in 1997, and studied and worked overseas in industry and academia in South Korea and Singapore from 1997 to 2013. His current research interests include machine translation and natural language processing. He has co-authored 2 Springer books and more than 130 papers in leading journals and conferences, and co-edited 13 books published by Springer and IEEE. He is an associate editor of IEEE T-ASLP (2015-2017).

    SONG Rou is a professor and Ph.D. supervisor in Applied Linguistics and Computer Application at Beijing Language and Culture University. He received his Bachelor's degree in mathematics and mechanics from Beijing University in 1968 and his Master's degree in computer science from Beijing University in 1981. He has worked on Chinese information processing for decades as PI of more than 10 national-level projects, with research focuses on discourse analysis, Chinese word segmentation, computer-aided proofreading, Chinese word attributes, Chinese orthographic computing, Chinese POS, and so on. He has published more than 100 papers in leading journals and conferences in computer science and linguistics, and has developed and commercialized several software systems with two patents. He has received several awards from Beijing City and the MOE, China, and has been appointed as a guest professor at several domestic and overseas universities and research institutes.


  • Invited Talk: Intelligent Q&A System and NLP Open Platform
    LU Yanxiong and ZHOU Lianqiang

    WeChat Pattern Recognition Center, Tencent

    Abstract

    Building a general Q&A system that can handle any subject is a very challenging AI task. Internet social platforms accumulate large numbers of active users and UGC (User-Generated Content) data, which become valuable crowdsourcing resources. In this talk, we will discuss the opportunity of using WeChat crowdsourcing resources to build an intelligent Q&A system, as well as some open questions and challenges on this topic.

    The Tencent Open Platform "Wen Zhi" provides comprehensive natural language processing APIs, including lexical, syntactic, semantic and paragraph-level modules. It also provides web crawling, data extraction and transcoding services. In this talk we will give an overview of the Tencent NLP open platform as well as the techniques behind it.

    Biography

    LU Yanxiong is a senior researcher at the WeChat Pattern Recognition Center, Tencent, where he has worked on search query analysis, Q&A systems and other NLP-related projects. His current work focuses on WeChat semantic analysis, and his research interests include search engines, machine learning, NLP and big data analysis. Before joining Tencent, Yanxiong worked at Baidu; he holds a Master's degree from Xidian University.

    ZHOU Lianqiang has worked on NLP and machine learning at Tencent, including search query rewriting, user interest mining, and word segmentation. He is now a senior researcher and team leader of the NLP research group in the Tencent Intelligent Computing and Search Lab. Before joining Tencent, Lianqiang worked at several Internet companies; he holds a Master's degree from Harbin Institute of Technology.


  • Invited Talk: From Lexical to Compositional Chinese Sentiment Analysis

    KU Lun-Wei
    Academia Sinica, Taiwan

    Abstract

    Sentiment analysis determines the polarity and strength of sentiment-bearing expressions, and it has been an important and attractive research area due to its close affinity to applications. Past research on sentiment analysis depended highly on lexical semantics. However, sentiment analysis requires an understanding of context, and shallow features such as bags of words cannot fulfill this need. As a result, compositional semantics, which concerns the construction of meaning based on syntax, has been applied to sentiment analysis through different approaches. In the Chinese language, as morphological structures may represent the compositional semantics inside Chinese words, compositional sentiment analysis can even start from determining the sentiment of morphemes, which this talk will touch on.

    This talk will begin with some background knowledge of sentiment analysis, such as how sentiment is categorized, where to find available corpora and which models are commonly applied, especially for the Chinese language. I will describe our work on compositional Chinese sentiment analysis from words to sentences. Our recently developed related resources, including the Chinese Morphological Dataset, the Augmented NTU Sentiment Dictionary (aug-NTUSD), E-HowNet with sentiment information, and the Chinese Opinion Treebank, will also be introduced in this talk. I'll end by describing how we have begun to test our compositional model with word embeddings.

    Biography

    KU Lun-Wei received her Ph.D. degree in Computer Science and Information Engineering from National Taiwan University. She then joined the Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology (Yuntech), Taiwan, as an assistant professor. In Aug. 2012, she joined the Institute of Information Science, Academia Sinica as an assistant research fellow. Previously, she was a postdoctoral researcher at the Department of Computer Science and Information Engineering, National Taiwan University, working on the project “Machine learning methods for ranking problems in multilingual information retrieval”. She was a project researcher in the Acer Product Value Lab, Taiwan, between Apr. 2003 and May 2004, where she joined a project on speech recognition services for a home media center. She was a software engineer/project manager at NaturalTel, a platform service provider for carriers, where she joined the development of a speech entertainment service platform for Far EasTone (Fetnet), Taiwan. Her international recognition includes a CyberLink Technical Elite Fellowship in 2007, an IBM Ph.D. Fellowship in 2008, the ROCLING Doctoral Dissertation Distinction Award in 2009, and a Good Design Award in 2012. Her research interests include natural language processing, information retrieval, sentiment analysis, and computational linguistics. She has been working on Chinese sentiment analysis since 2005 and was the co-organizer of the NTCIR MOAT Task (Multilingual Opinion Analysis Task, traditional Chinese side) from 2006 to 2010. She is also one of the organizers of the SocialNLP workshop, which has been held jointly with IJCNLP 2013, Coling 2014, WWW 2015 and NAACL 2015. This year, she serves as the area chair of the sentiment analysis and opinion mining track in The 53rd Annual Meeting of


  • the Association for Computational Linguistics and The 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015), as well as in The 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015). Other professional international activities she has been involved in include Publication Co-Chair of The 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Publicity Chair of The Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012), and Finance Chair of The Sixth Asia Information Retrieval Societies Conference (AIRS 2010).


  • Table of Contents

    Sequential Annotation and Chunking of Chinese Discourse Structure
    Frances Yung, Kevin Duh and Yuji Matsumoto . . . . . . . . . . . . . . 1

    Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
    Shichang Wang, Chu-Ren Huang, Yao Yao and Angel Chan . . . . . . . . . . . . . . 7

    Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
    Aaron Li-Feng Han, Xiaodong Zeng, Derek F. Wong and Lidia S. Chao . . . . . . . . . . . . . . 15

    Sentence selection for automatic scoring of Mandarin proficiency
    Jiahong Yuan, Xiaoying Xu, Wei Lai, Weiping Ye, Xinru Zhao and Mark Liberman . . . . . . . . . . . . . . 21

    ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer
    Ting-Hao Huang, Yun-Nung Chen and Lingpeng Kong . . . . . . . . . . . . . . 26

    Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check
    Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang and Hsin-Hsi Chen . . . . . . . . . . . . . . 32

    HANSpeller++: A Unified Framework for Chinese Spelling Correction
    Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang and Xueqi Cheng . . . . . . . . . . . . . . 38

    Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation
    Yih-Ru Wang and Yuan-Fu Liao . . . . . . . . . . . . . . 46

    Introduction to a Proofreading Tool for Chinese Spelling Check Task of SIGHAN-8
    Tao-Hsing Chang, Hsueh-Chih Chen and Cheng-Han Yang . . . . . . . . . . . . . . 50

    Overview of Topic-based Chinese Message Polarity Classification in SIGHAN 2015
    Xiangwen Liao, Binyang Li and Liheng Xu . . . . . . . . . . . . . . 56

    A Joint Model for Chinese Microblog Sentiment Analysis
    Yuhui Cao, Zhao Chen, Ruifeng Xu, Tao Chen and Lin Gui . . . . . . . . . . . . . . 61

    Learning Salient Samples and Distributed Representations for Topic-Based Chinese Message Polarity Classification
    Xin Kang, Yunong Wu and Zhifei Zhang . . . . . . . . . . . . . . 68

    A combined sentiment classification system for SIGHAN-8
    Qiuchi Li, Qiyu Zhi and Miao Li . . . . . . . . . . . . . . 74

    Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
    MinJun Park and Yulin Yuan . . . . . . . . . . . . . . 79

    A CRF Method of Identifying Prepositional Phrases in Chinese Patent Texts
    Hongzheng Li and Yaohong Jin . . . . . . . . . . . . . . 86

    Emotion in Code-switching Texts: Corpus Construction and Analysis
    Sophia Lee and Zhongqing Wang . . . . . . . . . . . . . . 91

    Chinese in the Grammatical Framework: Grammar, Translation, and Other Applications
    Aarne Ranta, Tian Yan and Haiyan Qiao . . . . . . . . . . . . . . 100


  • KWB: An Automated Quick News System for Chinese Readers
    Yiqi Bai, Wenjing Yang, Hao Zhang, Jingwen Wang, Ming Jia, Roland Tong and Jie Wang . . . . . . . . . . . . . . 110

    Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
    Gongye Jin, Daisuke Kawahara and Sadao Kurohashi . . . . . . . . . . . . . . 120

    Chinese Spelling Check System Based on N-gram Model
    Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen and Lei Huang . . . . . . . . . . . . . . 128

    NTOU Chinese Spelling Check System in Sighan-8 Bake-off
    Wei-Cheng Chu and Chuan-Jie Lin . . . . . . . . . . . . . . 137

    Topic-Based Chinese Message Sentiment Analysis: A Multilayered Analysis System
    Hongjie Li, Zhongqian Sun and Wei Yang . . . . . . . . . . . . . . 144

    Rule-Based Weibo Messages Sentiment Polarity Classification towards Given Topics
    Hongzhao Zhou, Yonglin Teng, Min Hou, Wei He, Hongtao Zhu, Xiaolin Zhu and Yanfei Mu . . . . . . . . . . . . . . 149

    Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
    Chun Liao, Chong Feng, Sen Yang and Heyan Huang . . . . . . . . . . . . . . 158

    CT-SPA: Text sentiment polarity prediction model using semi-automatically expanded sentiment lexicon
    Tao-Hsing Chang, Ming-Jhih Lin, Chun-Hsien Chen and Shao-Yu Wang . . . . . . . . . . . . . . 164

    Chinese Microblogs Sentiment Classification using Maximum Entropy
    Dashu Ye, Peijie Huang, Kaiduo Hong, Zhuoying Tang, Weijian Xie and Guilong Zhou . . . . . . . . . . . . . . 171

    NDMSCS: A Topic-Based Chinese Microblog Polarity Classification System
    Yang Wang, Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang . . . . . . . . . . . . . . 180

    NEUDM: A System for Topic-Based Message Polarity Classification
    Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang . . . . . . . . . . . . . . 185


  • Workshop Program

    Thursday, July 30, 2015

    09:00–09:10 Opening Session

    09:10–10:30 Invited Talk

    Discourse and Machine Translation
    Min Zhang and Rou Song

    10:30–10:50 Coffee Break

    10:50–12:30 Workshop Session

    10:50–11:10 Sequential Annotation and Chunking of Chinese Discourse Structure
    Frances Yung, Kevin Duh and Yuji Matsumoto

    11:10–11:30 Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method
    Shichang Wang, Chu-Ren Huang, Yao Yao and Angel Chan

    11:30–11:50 Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
    Aaron Li-Feng Han, Xiaodong Zeng, Derek F. Wong and Lidia S. Chao

    11:50–12:10 Sentence selection for automatic scoring of Mandarin proficiency
    Jiahong Yuan, Xiaoying Xu, Wei Lai, Weiping Ye, Xinru Zhao and Mark Liberman

    12:10–12:30 ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer
    Ting-Hao Huang, Yun-Nung Chen and Lingpeng Kong


  • Thursday, July 30, 2015 (continued)

    12:30–14:30 Lunch

    14:30–15:30 Invited Talk

    From Lexical to Compositional Chinese Sentiment Analysis
    Lun-Wei Ku

    15:30–16:00 Coffee Break

    16:00–17:20 Bake-off Task 1: Chinese Spelling Check

    16:00–16:20 Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check
    Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang and Hsin-Hsi Chen

    16:20–16:40 HANSpeller++: A Unified Framework for Chinese Spelling Correction
    Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang and Xueqi Cheng

    16:40–17:00 Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation
    Yih-Ru Wang and Yuan-Fu Liao

    17:00–17:20 Introduction to a Proofreading Tool for Chinese Spelling Check Task of SIGHAN-8
    Tao-Hsing Chang, Hsueh-Chih Chen and Cheng-Han Yang


  • Friday, July 31, 2015

    09:00–10:30 Invited Talk

    Intelligent Q&A System and NLP Open Platform
    Yanxiong Lu and Lianqiang Zhou

    10:30–11:00 Coffee Break

    11:00–12:20 Bake-off Task 2: Topic-Based Chinese Message Polarity Classification

    11:00–11:20 Overview of Topic-based Chinese Message Polarity Classification in SIGHAN 2015
    Xiangwen Liao, Binyang Li and Liheng Xu

    11:20–11:40 A Joint Model for Chinese Microblog Sentiment Analysis
    Yuhui Cao, Zhao Chen, Ruifeng Xu, Tao Chen and Lin Gui

    11:40–12:00 Learning Salient Samples and Distributed Representations for Topic-Based Chinese Message Polarity Classification
    Xin Kang, Yunong Wu and Zhifei Zhang

    12:00–12:20 A combined sentiment classification system for SIGHAN-8
    Qiuchi Li, Qiyu Zhi and Miao Li


  • Friday, July 31, 2015 (continued)

    12:20–14:00 Lunch

    14:00–15:20 Poster Session

    Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
    MinJun Park and Yulin Yuan

    A CRF Method of Identifying Prepositional Phrases in Chinese Patent Texts
    Hongzheng Li and Yaohong Jin

    Emotion in Code-switching Texts: Corpus Construction and Analysis
    Sophia Lee and Zhongqing Wang

    Chinese in the Grammatical Framework: Grammar, Translation, and Other Applications
    Aarne Ranta, Tian Yan and Haiyan Qiao

    KWB: An Automated Quick News System for Chinese Readers
    Yiqi Bai, Wenjing Yang, Hao Zhang, Jingwen Wang, Ming Jia, Roland Tong and Jie Wang

    Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
    Gongye Jin, Daisuke Kawahara and Sadao Kurohashi

    Chinese Spelling Check System Based on N-gram Model
    Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen and Lei Huang

    NTOU Chinese Spelling Check System in Sighan-8 Bake-off
    Wei-Cheng Chu and Chuan-Jie Lin

    Topic-Based Chinese Message Sentiment Analysis: A Multilayered Analysis System
    Hongjie Li, Zhongqian Sun and Wei Yang

    Rule-Based Weibo Messages Sentiment Polarity Classification towards Given Topics
    Hongzhao Zhou, Yonglin Teng, Min Hou, Wei He, Hongtao Zhu, Xiaolin Zhu and Yanfei Mu

    Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
    Chun Liao, Chong Feng, Sen Yang and Heyan Huang


  • Friday, July 31, 2015 (continued)

    CT-SPA: Text sentiment polarity prediction model using semi-automatically expanded sentiment lexicon
    Tao-Hsing Chang, Ming-Jhih Lin, Chun-Hsien Chen and Shao-Yu Wang

    Chinese Microblogs Sentiment Classification using Maximum Entropy
    Dashu Ye, Peijie Huang, Kaiduo Hong, Zhuoying Tang, Weijian Xie and Guilong Zhou

    NDMSCS: A Topic-Based Chinese Microblog Polarity Classification System
    Yang Wang, Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang

    NEUDM: A System for Topic-Based Message Polarity Classification
    Yaqi Wang, Shi Feng, Daling Wang and Yifei Zhang

    15:20–15:30 Closing Session


  • Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 1–6, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing

    Sequential Annotation and Chunking of Chinese Discourse Structure

    Frances Yung, Kevin Duh and Yuji Matsumoto
    Nara Institute of Science and Technology
    8916-5 Takayama, Ikoma, Nara, 630-0192 Japan
    {pikyufrances-y, kevinduh, matsu}@is.naist.jp

    Abstract

    We propose a linguistically driven approach to represent discourse relations in Chinese text as sequences. We observe that certain surface characteristics of Chinese texts, such as the order of clauses, are overt markers of discourse structures, yet existing annotation proposals adapted from formalisms constructed for English do not fully incorporate these characteristics. We present an annotated resource consisting of 325 articles in the Chinese Treebank. In addition, using this annotation, we introduce a discourse chunker based on a cascade of classifiers and report 70% top-level discourse sense accuracy.

    1 Introduction

    Discourse relations refer to the relations between units of text at document level. As a key to language processing, they are used in tasks such as automatic summarization, sentiment analysis and text coherence assessment (Lin et al., 2011; Trivedi and Eisenstein, 2013; Yoshida et al., 2014). While discourse-annotated English resources are available, resources in other languages are limited. In this work, we present the linguistic motivation behind the Chinese discourse-annotated corpus we constructed, and preliminary experiments on discourse chunking of Chinese.

    1.1 Related Work

    Major discourse-annotated resources in English include the RST Treebank (Carlson et al., 2001) and the Penn Discourse Treebank (PDTB) (Prasad et al., 2008). The RST Treebank represents discourse relations in a tree structure, where a satellite text span is related to a nucleus text span.

    On the other hand, the Penn Discourse Treebank represents discourse structure in a predicate-argument-like structure, where discourse connectives (DCs) relate two text spans (Arg1 and Arg2). Under this framework, covert discourse relations are represented by implicit DCs.

    PDTB’s annotation scheme is adopted by the recently released Chinese Discourse Treebank (CDTB) (Zhou and Xue, 2015). Other efforts to exploit Chinese discourse relations include cross-lingual annotation projection based on machine translation or word-aligned parallel corpora (Zhou et al., 2012; Li et al., 2014). Combinations of the RST and PDTB formalisms have also been proposed: Zhou et al. (2014) add the distinction between satellite and nucleus to PDTB-style annotation, and Li et al. (2014b) label the connectives in an RST tree.

    1.2 Motivation

    Interpretation of discourse relations, as of other linguistic structures, is subject to the surface form of the text. We notice that Chinese discourse structures are expressed by certain surface features that do not exist in English.

    First of all, Chinese sentences are sequences of clauses, typically separated by punctuation. Each clause can be considered a discourse argument. Above the clause level, Chinese sentences (marked by ‘。’) are also units of discourse (Chu, 1998). When presented with texts where periods and commas are removed, native Chinese speakers disagree on where to restore them (Bittner, 2013). The actual sentence segmentation of the text thus represents the spans of discourse arguments intended by the writer and should be taken into account.
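    Since candidate arguments are recoverable directly from punctuation, the two-level segmentation (sentences marked by ‘。’, clauses within them marked by commas) can be sketched as follows. This is a minimal illustration, not the paper's code; the exact punctuation sets and the function name `segment` are our assumptions.

```python
import re

# Illustrative sketch: segment Chinese text into sentences (delimited
# by 。！？) and candidate discourse arguments (clauses delimited by
# ，；：). The punctuation sets are assumptions, not the paper's spec.
SENT_END = "。！？"
CLAUSE_SEP = "，；："

def segment(text):
    # Keep each sentence together with its end mark, then split every
    # sentence into clauses and drop empty fragments.
    sentences = re.findall(rf"[^{SENT_END}]+[{SENT_END}]?", text)
    return [
        [c for c in re.split(rf"[{CLAUSE_SEP}]", s.rstrip(SENT_END)) if c]
        for s in sentences if s.strip()
    ]

print(segment("虽然天气不好，我们还是出门了。他很高兴。"))
# [['虽然天气不好', '我们还是出门了'], ['他很高兴']]
```

    The nested output mirrors the two-layer structure: the outer list follows sentence boundaries, the inner lists hold the clause-level candidate arguments.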

    Secondly, it is well known that syntactic structure is expressed by word order in Chinese - and so is


    discourse. While Arg1 can occur before or after Arg2 in English, arguments predominantly occur in a fixed order in Chinese, depending on the logical relation. For example, the same concession relation can be expressed by both constructions (1) and (2) in English, but only construction (1) is acceptable in Chinese.

    1. 虽然 (suiran, although) Arg2 , Arg1 .

    2. Arg1 ,虽然 (suiran, although) Arg2 .

    According to Chinese linguistics, adjunct clauses and discourse adverbials always precede the main clauses (Gasde and Paul, 1996; Chu and Ji, 1999). The clauses are semantically arranged in a topic-comment sequence following the writer’s conceptual mind (Tai, 1985; Bittner, 2013). When the arguments are not arranged in the standard order, the sense of the DC is altered. For example, when ‘虽然’ (suiran, although) is used in construction (2), it represents an ‘expansion’ relation (Huang et al., 2014). Therefore, discourse relations should be defined given the order of the arguments.

    Lastly, parallel DCs are frequent in Chinese discourse, yet often only one DC of the pair occurs to signify the same relation (Zhou et al., 2014). For example, (3) and (4) are grammatical alternatives to (1).

    3. 虽然 (suiran, although) Arg1 ,但是 (danshi, but) Arg2 .

    4. Arg1 ,但是 (danshi, but) Arg2 .

    Instead of viewing ‘虽然 (suiran, although) - 但是 (danshi, but)’ as a pair of parallel DCs, they can be regarded individually as a forward-linking (fw-linking) DC and a backward-linking (bw-linking) DC. A fw-linking DC relates its attached discourse unit to a later unit, while a bw-linking DC relates its attached discourse unit to a previous unit. Findings in linguistic studies also show that fw-linking DCs only link discourse units within the sentence boundary. On the other hand, bw-linking DCs can link a discourse unit to a preceding unit within or outside the sentence boundary, except when they are paired with a fw-linking DC (Eifring, 1995).

    To summarize, in contrast with the ambiguous arguments in English, punctuation and limitations on DC usage explicitly mark certain discourse structures in Chinese. Section 2 illustrates

    the design of our annotation scheme driven by these constraints.

    2 Sequential discourse annotation

    We propose to follow the natural discourse chains in Chinese and annotate discourse structure as a sequence of alternating arguments and DCs. This section highlights the main differences of our scheme compared with other frameworks.

    2.1 Arguments

    Each clause separated by punctuation other than quotation marks is treated as a candidate argument. Clauses that do not function as discourse units are classified into three types: attribution, optional punctuation and non-discourse adverbial.

    The main difference of our annotation scheme is that the order of the arguments for each DC is defined by default. Since the arguments of a particular discourse relation occur in a fixed order and are always adjacent, each argument is related to the immediately preceding argument by a bw-linking DC. In turn, the DC in the first clause of a sentence links the sentence to the previous one, preserving the two-layer structure denoted by punctuation. An implicit bw-linking DC is inserted if the clause does not contain an explicit DC.

    Another characteristic of our annotation is that ‘parallel DCs’ are annotated separately as one fw-linking DC and one bw-linking DC. Implicit bw-linking DCs are inserted, if possible, even when the relation is already marked by a fw-linking DC in the previous argument.¹ In other words, duplicated annotation of one relation is allowed. This helps create more valid samples to capture various combinations of Chinese DCs. When an argument spans more than one discourse unit, a fw-linking DC is used to mark the start of the span. Similarly, an implicit DC is inserted if necessary.
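    The resulting alternating argument/DC sequence can be sketched as a small helper. This is an illustration of the chain shape only; the flat-list layout, the dict input, and the `<implicit>` placeholder are our assumptions, not the paper's actual annotation format.

```python
# Illustrative sketch of the sequential representation: each argument
# after the first is linked to its predecessor by a bw-linking DC, with
# an implicit placeholder inserted when no explicit DC is present.
IMPLICIT = "<implicit>"

def build_chain(arguments, explicit_dcs):
    """arguments: clause strings in textual order.
    explicit_dcs: dict mapping argument index -> its explicit
    bw-linking DC (a missing key means no explicit DC on that clause)."""
    chain = [arguments[0]]
    for i, arg in enumerate(arguments[1:], start=1):
        # The DC links argument i back to argument i-1.
        chain.append(explicit_dcs.get(i, IMPLICIT))
        chain.append(arg)
    return chain

print(build_chain(["Arg1", "Arg2", "Arg3"], {1: "但是"}))
# ['Arg1', '但是', 'Arg2', '<implicit>', 'Arg3']
```

    Every odd position of the chain holds a bw-linking DC (explicit or implicit), so each relation and its two adjacent arguments can be read off locally.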

    2.2 Connectives

    There is a large variety of DCs in Chinese and their syntactic categories are controversial. Huang et al. (2014) report a lexicon of 808 DCs, 359 of which are found in the data. Since many DCs signal the same relation, we adopt a functionalist approach to label DC senses.

    In this approach, a DC is not limited to any syntactic category. Annotators are asked to perform

    1 Temporal relations are often marked by one fw-linking DC alone, and it is not acceptable to insert an implicit bw-linking DC. In this case, the ‘redundant’ tag is used.


    a linguistic test by replacing a candidate expression with an unambiguous and preferably frequent DC of similar sense, which we call a ‘main DC’. If the replacement is acceptable, then the expression is identified as a DC and its sense is categorized under the ‘main DC’.

    For example, ‘尤为’ and ‘特别是’ (youwei, tebieshi, in particular / especially) are categorized under ‘尤其’ (youqi, in particular), if the annotator agrees that they are interchangeable in the context. The list of main DCs is not pre-defined but is constructed in the course of annotation. Based on the assigned ‘main DC’, each DC instance is categorized into the 4 main senses defined in PDTB: contingency, comparison, temporal, and expansion.
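    The two-level labeling (surface DC → ‘main DC’ → PDTB sense) amounts to a simple lookup; the entries below reuse the examples from the text, and the expansion sense assigned to ‘尤其’ is our assumption, not a value from the corpus:

```python
# Two-level sense labeling: surface DCs that pass the replaceability test
# are grouped under a 'main DC', and each main DC carries one of the four
# PDTB top-level senses. Entries reuse the examples from the text only;
# the sense assignment for 尤其 is assumed for illustration.
dc_to_main = {"尤为": "尤其", "特别是": "尤其", "尤其": "尤其"}
main_to_sense = {"尤其": "expansion"}   # assumed sense assignment

def label(dc: str) -> str:
    """Map a surface DC to its PDTB main sense via its main DC."""
    return main_to_sense[dc_to_main[dc]]

print(label("特别是"))
```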

    The discourse and syntactic limitations of the DCs are considered in the replaceability test. For example, the following pairs are not labeled with the same ‘main DC’ even if the signaled discourse relation is the same:

    • Fw- vs. bw-linking DCs: 虽然 (suiran, although) and 但是 (danshi, but)

    • Cause-result vs. result-cause order: 因为...所以... (yinwei...suoyi..., because...therefore...) and 之所以...是因为... (zhisuoyi...shiyinwei..., the reason why...is because...)2

    • Placed before vs. after the subject: 却 (que, but) and 但是 (danshi, but)

    The list of ‘main DCs’ is not pre-defined but is constructed in the course of annotation; an expression is registered as another ‘main DC’ if it cannot be replaced. Note that expressions that are considered ‘alternative lexicalizations’ in PDTB or CDTB are also categorized as explicit connectives if they pass the replaceability test. Otherwise, an implicit DC, chosen from the list of ‘main DCs’, is inserted.

    2.3 Annotation results

    Materials of the corpus are raw texts of 325 articles (2,353 sentences) from the Chinese Treebank (Bies et al., 2007). Errors that affect the annotation process, namely punctuation errors that lead to wrong segmentation, have been corrected.

    201 DCs are identified in our data, of which 66 are fw-linking DCs. The DCs are categorized into 73 ‘main DCs’, and 22 have ambiguous

    2 The 2 pairs are treated as 4 different DCs.

    senses (labelled with more than one ‘main DC’). The distribution of the tags is shown in Table 1. Note that some of the ‘implicit’ relations we define belong to ‘explicit’ in other annotation schemes, since ‘double annotation’ occurs in our annotation.

                   CON    COM    TEM    EXP    total
    Explicit       380    248    521    683    1832
    Implicit      1551    446    164   3022    5183

                   ADV    ATT    OPT    total
    Non-discourse  630    783    336    1749

    Table 1: Distribution of various tags in the annotated corpus (4 senses: CONtingency, COMparison, TEMporal, EXPansion; 3 types of non-discourse-unit segments: ATTribution, OPTional punctuation, and non-discourse ADVerbial)

    3 End-to-end discourse chunker

    Our linguistically driven annotation of discourse structure takes the surface discourse features as ground truth. In particular, we define discourse relations based on default argument order and span. We demonstrate its learnability by building a discourse chunker in the form of a classifier cascade, as used in English discourse parsing (Lin et al., 2010). Features are extracted from the default arguments of each relation. We evaluate the accuracy of each component and the overall accuracy of the final output, classifying up to the 4 main senses. The pipeline consists of 5 classifiers, as shown in Figure 1, each of which is trained with the relevant samples, e.g. only arguments annotated with explicit DCs are used to train the explicit DC classifier. 289 and 36 articles are used as training and testing data respectively.

    Features include lexical and syntactic features (bag of words, bag of POS, word pairs and production rules) that have been used in classifying implicit English DCs (Pitler et al., 2009; Lin et al., 2010), and the probability distribution of senses for explicit DC classification. The extraction of features is based on automatic parsing by the Stanford Parser (Levy and Manning, 2003). We also use the surrounding discourse relations as features, hypothesizing that certain relation sequences are more likely than others. The classifiers are trained by SVM with a linear kernel using the LIBSVM package (Chang and Lin, 2011).
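    The routing of a segment through the five-step cascade can be sketched as follows; the rule-based stand-in classifiers and toy lexicon replace the trained SVMs and feature extraction only to keep the sketch self-contained:

```python
# Sketch of the 5-step classifier cascade (Figure 1). Real SVM classifiers
# and feature extraction are replaced by trivial stand-ins; only the
# routing of a segment through the pipeline is illustrated.
EXPLICIT_DCS = {"但是", "因为", "所以"}   # toy lexicon for the stand-ins

def classify(segment: str) -> str:
    # Step 1: non-discourse-unit segment or not (stand-in: attribution cue).
    if segment.endswith("说"):
        return "non-discourse"              # Step 4 would assign its 3 types
    # Step 2: does the segment carry an explicit DC?
    if any(dc in segment for dc in EXPLICIT_DCS):
        # Step 3: 4-way sense classification of the explicit DC (stand-in).
        return "comparison" if "但是" in segment else "contingency"
    # Step 5: 4-way sense classification of the implicit relation (stand-in).
    return "expansion"

print(classify("但是比赛照常进行"))
print(classify("他继续工作"))
```

    In the real pipeline each branch point is a trained LIBSVM classifier over the lexical and syntactic features described above, rather than a string test.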


    Figure 1: Cascade of discourse relation classifiers.

    3.1 Results per component

    Table 2 shows the accuracies of individual classifiers tested on relevant samples. Results based on predictions of the most frequent class are listed as baseline (BL). As expected, implicit relations (IMP) are much harder to classify than explicit relations (EXP). The classification result for non-discourse-unit segments (Non-dis or not) is similar to the preliminary report of Li et al. (2014b) (averaged F1 88.8%, accuracy 89.0%).

    Step  Classifier       Test F1/Acc   BL F1/Acc
    1     Non-dis or not   .91/.94       .44/.80
    2     EXP identifier   .92/.93       .39/.65
    3     EXP 4 senses     .90/.92       .15/.58
    4     Non-dis 3 types  .86/.88       .17/.35
    5     IMP 4 senses     .41/.61       .18/.58

    Table 2: Accuracies of individual classifiers on ‘gold’ test samples. F1 is the average of the F1 for each class.

    3.2 End-to-end evaluation

    We run the classifiers from Steps 1-5. After Step 1, identified non-discourse-unit segments are joined into one argument and the features are updated. The discourse context features are also updated after each step based on the last classifier's output. The tag of a fw-linking DC is switched to the next segment, as a relation connecting the next segment to the current one. The current segment is then passed to the implicit classifier, given that there is no bw-linking DC.

    For applications that need discourse, it may not be necessary to distinguish between explicit and implicit relations. Thus, we combine the outputs of the explicit and implicit classifiers when evaluating the end-to-end outputs. Specifically, the pipeline outputs one of the 4 discourse senses or ‘non-discourse-unit’ across a segment boundary, while the reference can contain more than one label, since duplicated annotation is allowed. The system prediction is considered correct if it is included in the gold tag set. The combined outputs are evaluated in terms of accuracy.
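    The scoring rule can be stated directly in code; the labels below are invented for illustration:

```python
# Accuracy under duplicated annotation: the single system prediction per
# segment boundary is correct if it is contained in the gold tag set,
# which may hold more than one label. Example data is illustrative.
def accuracy(predictions, gold_sets):
    correct = sum(p in g for p, g in zip(predictions, gold_sets))
    return correct / len(predictions)

preds = ["expansion", "comparison", "temporal"]
gold = [{"expansion", "contingency"}, {"comparison"}, {"contingency"}]
print(accuracy(preds, gold))  # 2 of 3 boundaries match
```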

    Table 3 shows the classification accuracies evaluated by the above principle under different error propagation settings. For example, given gold identification of non-discourse segments (Step 1) and gold explicit DC classification (Step 2), classification of the 4 main explicit senses reaches an accuracy of 0.854, but drops to 0.800 if Steps 1 and 2 are automatic.3 It is observed that errors are generally propagated along the pipeline. Similar to the finding in English (Pitler et al., 2009), the discourse context as predicted by earlier classifiers does not affect the later steps - the results are the same based on gold or automatic outputs. The end-to-end accuracy of the proposed pipeline is 65.7% and the baseline (classify all as ‘expansion’) is 50.0%.

    Accuracies
    Gold up to   non-dis   exp/imp    explicit   non-dis   implicit   over-
    Step         or not    /non-dis   senses     types     senses     all
                 2-way     3-way      4-way      3-way     4-way      5-way
    4            Gold      Gold       Gold       Gold      .670       .706
    3            Gold      Gold       Gold       .879      .670       .706
    2            Gold      Gold       .854       .879      .670       .703
    1            Gold      .888       .800       .865      .665       .697
    -            .862      .847       .800       .836      .657       .657

    Table 3: Accuracies at each stage under different error propagation settings.

    Finally, we experimented with different variations of the pipeline, as shown in Table 4. The best result (70.1% accuracy) is obtained by classifying implicit DCs and non-discourse units in one step. For comparison, Huang and Chen (2011) report an accuracy of 88.28% on 4-way classification of inter-sentential discourse senses, and Huang and Chen (2012) report an accuracy of 81.63% on 2-way classification of intra-sentential contingency vs. comparison senses.

    3 Note that the results under the complete gold settings do not necessarily echo the results of the individual components, where duplicated outputs are counted individually.


    Note that the result is much degraded if we train one 5-way classifier to classify all relations. This shows that explicit and implicit DCs ought to be treated separately, even though we are not concerned with distinguishing them in the final output.

    Pipeline variation                    Overall 5-way acc.
    steps 1-5                             .657
    combine steps 1-5                     .549
    switch steps 1 & 2                    .697
    switch steps 1 & 2
      + combine steps 4 & 5               .701

    Table 4: 5-way accuracies of modified pipelines

    4 Conclusion

    This work presents the annotation principles of our Chinese discourse corpus based on linguistic analysis. We propose to embrace the overt sequential features as ground-truth discourse structures, and to categorize DCs by their discourse functions. Based on the manually annotated corpus, we built and evaluated a classifier cascade that classifies explicit and implicit relations, and the results support that our annotation is tractably learnable. The annotation is available at http://cl.naist.jp/nldata/zhendisco/.

    References

    Ann Bies, Martha Palmer, Justin Mott, and Colin Warner. 2007. English Chinese Translation Treebank v 1.0.

    Maria Bittner. 2013. Topic states in Mandarin discourse. Proceedings of the North American Conference on Chinese Linguistics.

    Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. Proceedings of the SIGdial Workshop on Discourse and Dialogue.

    Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology.

    Chauncey Chenghsi Chu and Zongren Ji. 1999. A Cognitive-Functional Grammar of Mandarin Chinese. Crane.

    Chauncey Chenghsi Chu. 1998. A Discourse Grammar of Mandarin Chinese. P. Lang.

    Halvor Eifring. 1995. Clause Combination in Chinese. BRILL.

    Horst-Dieter Gasde and Waltraud Paul. 1996. Functional categories, topic prominence, and complex sentences in Mandarin Chinese. Linguistics, 34.

    Hen-Hsen Huang and Hsin-Hsi Chen. 2011. Chinese discourse relation recognition. Proceedings of the International Joint Conference on Natural Language Processing.

    Hen-Hsen Huang and Hsin-Hsi Chen. 2012. Contingency and comparison relation labeling and structure prediction in Chinese sentences. Proceedings of the Annual Meeting of SIGDIAL.

    Hen-Hsen Huang, Tai-Wei Chang, Huan-Yuan Chen, and Hsin-Hsi Chen. 2014. Interpretation of Chinese discourse connectives for explicit discourse relation recognition. Proceedings of the International Conference on Computational Linguistics.

    Roger Levy and Christopher Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? Proceedings of the Annual Meeting of the Association for Computational Linguistics.

    Junyi Jessy Li, Marine Carpuat, and Ani Nenkova. 2014. Cross-lingual discourse relation analysis: A corpus study and a semi-supervised classification system. Proceedings of the International Conference on Computational Linguistics.

    Yancui Li, Wenhi Feng, Jing Sun, Fang Kong, and Guodong Zhou. 2014b. Building Chinese discourse corpus with connective-driven dependency tree structure. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

    Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010. A PDTB-styled end-to-end discourse parser. Technical report, National University of Singapore.

    Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011. Automatically evaluating text coherence using discourse relations. Proceedings of the Annual Meeting of the Association for Computational Linguistics.

    Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.

    Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. Proceedings of the Language Resources and Evaluation Conference.

    James H.Y. Tai. 1985. Temporal sequence and Chinese word order. Iconicity in Syntax.


    Rakshit Trivedi and Jacob Eisenstein. 2013. Discourse connectors for latent subjectivity in sentiment analysis. Proceedings of the North American Chapter of the Association for Computational Linguistics.

    Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

    Yuping Zhou and Nianwen Xue. 2015. The Chinese Discourse TreeBank: A Chinese corpus annotated with discourse relations. Language Resources and Evaluation, 49(2).

    Lan Jun Zhou, Wei Gao, Binyang Li, Zhongyu Wei, and Kam-Fai Wong. 2012. Cross-lingual identification of ambiguous discourse connectives for resource-poor language. Proceedings of the International Conference on Computational Linguistics.

    Lan Jun Zhou, Binyang Li, Zhongyu Wei, and Kam-Fai Wong. 2014. The CUHK Discourse TreeBank for Chinese: Annotating explicit discourse connectives for the Chinese Treebank. Proceedings of the Language Resources and Evaluation Conference.


    Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 7–14, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing

    Create a Manual Chinese Word Segmentation Dataset Using Crowdsourcing Method

    Shichang Wang, Chu-Ren Huang, Yao Yao, Angel Chan
    Department of Chinese and Bilingual Studies

    The Hong Kong Polytechnic University
    Hung Hom, Kowloon, Hong Kong

    [email protected]{churen.huang, y.yao, angel.ws.chan}@polyu.edu.hk

    Abstract

    The manual Chinese word segmentation dataset WordSegCHC 1.0, which was built by eight crowdsourcing tasks conducted on the Crowdflower platform, contains the manual word segmentation data of 152 Chinese sentences whose lengths range from 20 to 46 characters excluding punctuation. All the sentences received 200 segmentation responses in their corresponding crowdsourcing tasks, and the numbers of valid responses range from 123 to 143 (each sentence was segmented by more than 120 subjects). We also propose an evaluation method called manual segmentation error rate (MSER) to evaluate the dataset; the MSER of the dataset proves to be very low, which indicates reliable data quality. In this work, we applied the crowdsourcing method to the Chinese word segmentation task, and the results confirmed again that the crowdsourcing method is a promising tool for linguistic data collection; the framework of crowdsourced linguistic data collection used in this work can be reused in similar tasks; the resultant dataset fills a gap in Chinese language resources to the best of our knowledge, and it has potential applications in the research of word intuition of Chinese speakers and Chinese language processing.

    1 Introduction

    Chinese word segmentation, which can be conducted by humans or computers in written or oral form, is a hot topic receiving great interest from several branches of linguistics, especially theoretical, computational and psychological linguistics, simply because it relates to, or perhaps is the key to, several critical theoretical and applicational issues, for example word definition, word intuition and Chinese language processing.

    However, in the traditional laboratory setting,

    limited by budget and/or the difficulty of large-scale subject recruitment, etc., it is very difficult or even impossible to build a large manual Chinese word segmentation dataset (the defining feature of this kind of dataset is that each sentence must be segmented by a large group of people in order to measure the word intuition of Chinese speakers), and this hinders the availability of such language resources. Fortunately, the crowdsourcing method can perhaps help us solve this problem. Being aware of this background, the crowdsourced manual Chinese word segmentation dataset WordSegCHC 1.0 was built with multiple purposes in mind.

    The first purpose is to further explore the application of the crowdsourcing method in language resource building and linguistic studies in the context of the Chinese language. The crowdsourcing method is a promising tool to solve the linguistic data bottleneck problem which widely occurs in various linguistic studies; it is efficient and economical and can help us achieve much higher randomness and much larger scale in sampling; in annotation tasks we can also get much higher redundancy to help us make decisions on ambiguous cases with more confidence; although its signal-to-noise ratio (SNR) is usually lower than that of the traditional laboratory method, it can yield high-quality data as good as or even better than the traditional method when combined with several data quality control measures including parameter optimization, screening questions, performance monitoring, data validation, data cleansing, majority voting, peer review, spammer monitoring, etc. (Crump et al., 2013; Allahbakhsh et al., 2013; Mason and Suri, 2012; Behrend et al., 2011; Buhrmester et al., 2011; Callison-Burch and Dredze, 2010; Paolacci et al., 2010; Ipeirotis et al., 2010; Munro et al., 2010; Snow et al., 2008).

    We have already successfully applied the crowdsourcing method to the semantic transparency of compound rating task and built a semantic transparency dataset which contains the semantic transparency rating data of about 1,200 disyllabic Chinese nominal compounds (Wang et al., 2014a); we want to further extend the application of the crowdsourcing method to the Chinese word segmentation task, to further evaluate the crowdsourcing method, and to build a new language resource.

    The second purpose is to support the studies

    on word intuition of Chinese speakers in general, and to examine the effect of semantic transparency on word intuition in particular. Word intuition is speakers' intuitive knowledge of wordhood, i.e., what a word is. Laymen's word segmentation behavior is not instructed by linguistic theories of the word, but by their word intuition, and hence reflects their word intuition; because of this, the word segmentation task has been used to measure and study word intuition (王立, 2003; Hoosain, 1992). The basic idea is as follows: if a Chinese sentence is segmented by, for example, 100 subjects, we can then observe which slices of the sentence are consistently treated as words by these subjects, which slices are consistently treated as non-words, and which slices are not so consistent, being treated as words by some and non-words by others. This kind of segmentation consistency can be a convenient measurement of Chinese speakers' word intuition.

    Word intuition per se is an important issue

    awaiting more research, which can contribute to the investigation of the cognitive mechanism of humans' language competence and shed new light on the theoretical problem of word definition, for the theoretical definition of word should generally accord with speakers' word intuition (王洪君, 2006; 王立, 2003; 胡明扬, 1999; 陆志韦, 1964).

    The semantic transparency/compositionality of a

    multi-morphemic form, simply speaking, is the extent to which the lexical meaning of the whole form can be derived from the lexical meanings of its constituents. More accurately speaking, this definition is merely the definition of the overall semantic transparency (OST) of a multi-morphemic form; besides that, there is constituent semantic transparency (CST) too, which means the extent to which the lexical meaning of each constituent as an independent lexical form retains itself in the lexical meaning of the whole form.

    In the context of theoretical linguistics, semantic transparency is used as an empirical criterion of wordhood (Duanmu, 1998; 吕叔湘, 1979; Chao, 1968), but for Chinese disyllabic forms this criterion seems to be ignored to some extent by some linguists based on word intuition (王洪君, 2006; 冯胜利, 2004; 王立, 2003; 冯胜利, 2001; 胡明扬, 1999; 冯胜利, 1996; 吕叔湘, 1979); it is also treated as an indicator of lexicalization (Packard, 2000; 董秀芳, 2002; 李晋霞 and 李宇明, 2008). In the context of psycholinguistics, it is an “extremely important factor” (Libben, 1998) affecting the mechanism of the mental lexicon, for example the representation, processing/recognition, and memorizing of multi-morphemic words (Han et al., 2014; Mok, 2009; 王春茂 and 彭聃龄, 2000; 王春茂 et al., 2000; 王春茂 and 彭聃龄, 1999; Libben, 1998; Tsai, 1994). Following this line of investigation, it is significant to examine the role semantic transparency plays in Chinese speakers' word intuition towards Chinese disyllabic forms. When we built the dataset, we carefully selected sentence stimuli containing word stimuli that cover all possible kinds of semantic transparency types, to enable us to examine the role semantic transparency plays in the word intuition of Chinese speakers.

    The widely used Chinese segmented corpora,

    for example, the Sinica corpus (Chen et al., 1996), are usually segmented first by segmentation programs and then revised by experts according to a certain word segmentation standard. From the inconsistent segmentation cases we can find plenty of useful information to explore word intuition. But from the perspective of the measurement of Chinese speakers' word intuition, the data are biased by segmentation programs and word segmentation standards, so they are not so suitable and reliable for this purpose.

    In order to better serve the studies of word intuition of Chinese speakers, we need manual word segmentation datasets. In such a dataset, each and every sentence is segmented manually by a large group of laymen, say 100, without the influence of any linguistic theory or any Chinese word segmentation standard. This kind of dataset, which is both large and publicly accessible, is, to the best of our knowledge, still a gap in Chinese language resources.

    And the third purpose is that the resultant manual Chinese word segmentation dataset may have potential applications in the studies of Chinese language processing, especially in the studies of automatic Chinese word segmentation and cognitive models of Chinese language processing.

    2 Construction

    2.1 Materials

    The stimuli of word segmentation tasks are at least phrases, but we prefer naturally occurring sentences. In order to cover more linguistic phenomena to better support the studies of word intuition, we decided to use more than 150 long sentences (the crowdsourcing method makes this possible). Meanwhile, the resultant dataset must be able to support the examination of the effect of semantic transparency on word intuition; so these sentence stimuli should also contain words which cover all the word stimuli to be used in the examination of the semantic transparency effect. So the stimuli selection procedure consists of two steps: (1) word selection, i.e., to select an initial set of words which covers all the word stimuli that would be used in the examination of the semantic transparency effect, and (2) sentence selection, i.e., to select a set of sentences which contains the words selected in step 1 (each sentence carries one word) and at the same time satisfies other requirements.

    Word Selection

    We have already created a crowdsourced semantic transparency dataset SimTransCNC 1.0 which contains the overall and constituent semantic transparency rating data of about 1,200 Chinese bimorphemic nominal compounds which have mid-range word frequencies (Wang et al., 2014a). Based on this dataset, 152 words are selected; for the distribution of these words, see Table 1.

    pounds of the structure modifier-head, and coverthree substructures: NN, AN, and VN. Follow-ing (Libben et al., 2003), we differentiate fourtransparency types: TT, TO, OT, and OO; “T”means “transparent”, and “O” means “opaque”.TT words show the highest OST scores and themost balanced CST scores, e.g., “江水”; OO

    Word Structure

    Transaprency Type NN AN VN

    TT 20 10 10TO 20 6 10OT 20 10 10OO 20 10 6

    Table 1: Distribution of types of selected words.

    words have the lowest OST scores and the mostbalanced CST scores, e.g., “脾气”; TO and OTwords bearmid-rangeOST scores and themost im-balanced CST scores, e.g., “音色” (TO) and “贵人” (OT).

    Sentence Selection

    The words selected in step 1 are used as indexes, and all the sentences carrying them in Sinica corpus 4.0 are extracted. One sentence is selected for each word roughly according to the following criteria: (1) the length of the sentence should be between 20 and 50 characters (punctuation excluded); (2) the sentence should not contain too many punctuation marks; (3) concrete and narrative sentences are preferred to abstract ones which are difficult to understand; (4) if we cannot find proper sentences in the Sinica corpus for some words, we use other corpora (only 5 sentences). In this way, a total of 152 sentences are selected; for the length (in characters) distribution, see Table 2.

    Length of Sentence

    Min     20
    Max     46
    Sum     4,946
    Mean    32.54
    SD      5.46

    Table 2: Length distribution of selected sentences.

    2.2 Crowdsourcing Task Design

    These 152 sentence stimuli are evenly and randomly divided into eight sentence groups; each sentence group has 19 sentences. We created one crowdsourcing task for each sentence group on Crowdflower; according to our previous studies, compared to Amazon Mechanical Turk (MTurk), Crowdflower is a more feasible platform for Chinese linguistic data collection (Wang et al., 2014b; Wang et al., 2014a).


    Questionnaires

    The core of each crowdsourcing task is a questionnaire. Each questionnaire consists of five sections: (1) title, (2) instructions, (3) demographic questions, (4) screening questions, and (5) segmentation task; both simplified and traditional Chinese character versions are provided. Section 3, demographic questions, asks the on-line subjects to provide their identity information on gender, age, level of education, and email address (optional). Section 4, screening questions, consists of four simple questions on the Chinese language which can be used to test whether a subject is a Chinese speaker or not; the first two questions are open-ended Chinese character identification questions: each question shows a picture containing a simple Chinese character and asks the subject to identify that character and type it in the text-box below it; the third question is a close-ended homophonic character identification question: it shows the subject a character and asks him/her to identify its homophonic character among 10 different characters; the fourth one is a close-ended antonymous character identification question, asking the subject to identify the antonymous character of the given one from 10 different characters. Section 4 of the eight crowdsourcing tasks shares the same question types but has different question instances. Section 5, the segmentation task, shows the subjects 19 sentence stimuli and asks them to insert a word boundary symbol (“/”) at each word boundary they perceive; the subjects are required to insert a “/” behind each punctuation mark and the last character of a sentence; the subjects are also informed that they need not care about right or wrong, but should just follow their intuition.

    Parameters of Tasks

    These eight crowdsourcing tasks are created with the following parameters: (1) each worker account can only submit one response to one task; (2) each IP address can only submit one response to one task; (3) we only accept responses from mainland China, Hong Kong, Macao, Taiwan, Singapore, Indonesia, Malaysia, Thailand, Australia, Canada, Germany, the United States, and New Zealand; (4) we pay 0.25 USD for one response.

    Quality Control Measures

    The following quality control measures are used: (1) section 4, screening questions, is used to discriminate Chinese speakers from non-Chinese speakers and to block bots; (2) section 5, the segmentation task, remains invisible unless the first two screening questions are correctly answered; (3) the answers to the segmentation questions in section 5 must comply with a prescribed format to prevent random strings: a) the segmentation answer to each sentence must be composed only of the original sentence with one or zero “/” behind each Chinese character and each punctuation mark, b) in the answers, behind each punctuation mark there must be a “/”, c) the end of an answer must be a “/”; (4) submission attempts are blocked unless all the required questions are answered and the answers satisfy the above conditions; (5) data cleansing is conducted after data collection to rule out invalid responses.
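    Measure (3) can be sketched as a format check; the punctuation set and function name are illustrative, not the exact implementation used on the platform:

```python
# Sketch of the format check on a segmentation answer: the answer must be
# the original sentence with at most one '/' after each character, a '/'
# after every punctuation mark, and a '/' at the end. The punctuation set
# is illustrative, not the full set used in the tasks.
PUNCT = set("，。、；：！？")

def valid_answer(sentence: str, answer: str) -> bool:
    if not answer.endswith("/"):
        return False
    # Stripping all slashes must recover the original sentence exactly.
    if answer.replace("/", "") != sentence:
        return False
    if "//" in answer:                      # at most one '/' per position
        return False
    # Every punctuation mark must be followed by '/'.
    for i, ch in enumerate(answer[:-1]):
        if ch in PUNCT and answer[i + 1] != "/":
            return False
    return True

print(valid_answer("这是好东西", "这是/好/东西/"))   # well-formed
print(valid_answer("这是好东西", "这是/好/东西"))    # missing final '/'
```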

    2.3 Procedure

    We first ran a small pretest task to test whether the tasks were correctly designed, and it turned out that the pretest task could run smoothly. Then we launched the first task and let it run alone for about two days to further test the task design. After we finally confirmed that the tasks could really run smoothly, we launched the other seven tasks and let them run concurrently. Our aim was to collect 200 responses for each task; the speed was amazingly fast in the beginning, and all eight tasks received their first 100 responses in the first three to six days; then the speed became slower and slower, and it eventually took us about 1.3 months to reach our aim; after all, Crowdflower is not a native Chinese crowdsourcing platform, so this kind of speed is understandable.

    2.4 Data Cleansing

    All tasks successfully obtained 200 responses; however, not all responses are valid. Compared to the laboratory setting, the crowdsourcing environment is quite noisy by nature, so before the newly collected data can be used in any serious analysis to draw reliable conclusions, data cleansing must be conducted.

    The raw responses underwent rule-based data

    cleansing. A response is considered invalid if it has at least one of the following five features: (1) at least one of the four screening questions is incorrectly answered; (2) the lengths of the resultant segments of at least one of its 19 sentences are all one character; (3) at least one segment longer than seven characters is observed in the resultant segments of its 19 sentences; (4) the completion time of the response is shorter than five minutes; (5) the completion time of the response is longer than one hour. Invalid responses were ruled out; the numbers of valid responses of the eight tasks are listed in Table 3.
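    The five invalidity rules can be sketched as a filter; the response fields and thresholds below are our assumptions about a convenient representation, not the actual task export format:

```python
# Sketch of the rule-based response cleansing: a response is invalid if
# any of the five conditions below holds. Field names and the response
# object are hypothetical; thresholds follow the description in the text.
def is_valid(response) -> bool:
    # (1) all four screening questions must be answered correctly
    if not all(response["screening_correct"]):
        return False
    for segs in response["segments"]:       # one segment list per sentence
        # (2) not every segment may be a single character
        if all(len(s) == 1 for s in segs):
            return False
        # (3) no segment may exceed seven characters
        if any(len(s) > 7 for s in segs):
            return False
    # (4)/(5) completion time between five minutes and one hour
    return 300 <= response["seconds"] <= 3600

resp = {"screening_correct": [True] * 4,
        "segments": [["这是", "好", "东西"]],
        "seconds": 600}
print(is_valid(resp))
```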

    2.5 Results

    The resultant dataset contains the manual Chinese word segmentation data of 152 sentences whose lengths range from 20 to 46 characters (M = 32.54, SD = 5.46), and each sentence is segmented by at least 123 and at most 143 subjects (M = 133.5, SD = 7.37).

    Task   Valid Responses   %

    1      142               71.0
    2      143               71.5
    3      138               69.0
    4      135               67.5
    5      133               66.5
    6      127               63.5
    7      123               61.5
    8      127               63.5

    Min    123               61.5
    Max    143               71.5
    Mean   133.5             66.75
    SD     7.37              3.68

    Table 3: Numbers of valid responses of the tasks.

    3 Evaluation

    Although Fleiss' kappa can be used to measure the agreement between raters, high agreement does not necessarily mean high data quality, especially in the situation of intuition measurement where variations among subjects are expected. Nor can it show directly how many errors the resultant dataset actually contains. Knowing how many errors the dataset contains is very important for assessing the reliability of the conclusions drawn from the dataset. We first define two kinds of manual segmentation errors, and based on that, an evaluation method called manual segmentation error rate (MSER) is proposed to evaluate the resultant dataset.

3.1 Types of Manual Segmentation Errors

In Chinese phrases/sentences, there are three types of non-monosyllabic segments from the point of view of manual word segmentation: ridiculous segments, indivisible segments, and modest segments. A ridiculous segment usually cannot be treated as one valid unit/word because it makes no sense in the context of the phrase/sentence; for example, in the phrase "这是好东西", the segment "好东" cannot be treated as one unit/word because it is incomprehensible. An indivisible segment usually cannot be divided because it is a fixed unit whose lexical meaning cannot easily be derived from the lexical meanings of its constituents (i.e., it is semantically opaque); it becomes incomprehensible if divided; in the example phrase, the segment "东西" is of this type. A modest segment can either be treated as one unit/word or divided into two or more units/words, because it is equally comprehensible either way; the segment "这是" in the example phrase is of this type.

Two circumstances can be treated as errors of manual word segmentation. First, if a ridiculous segment appears in the segmentation results, it is treated as an error (type I error); second, if an indivisible segment is divided in the segmentation results, it is also treated as an error (type II error). These two circumstances are incompatible with our general word intuition because they are simply incomprehensible, and they cannot be explained by variation of word intuition among speakers. Normally, when subjects do word segmentation tasks carefully according to their word intuition, these errors do not occur, so we can treat them as errors. Such errors occur when subjects try to cheat by segmenting randomly or make accidental mistakes.

    3.2 Manual Segmentation Error Rate

A subject divides the phrase/sentence S into n (n ∈ ℕ+) segments by n segmentation operations (not n − 1: leaving the remaining segment at the tail as one word also "confirms" it, so it counts as a segmentation operation too). A segmentation operation can yield only one of the following four possible results: one type I error, one type II error, one type I error plus one type II error (two errors; e.g., "好东/西"), or no error. Suppose e′ (e′ ∈ ℕ) is the number of times the type I error occurred during the segmentation process and e′′ (e′′ ∈ ℕ) the number of times the type II error occurred; then we can define the manual segmentation error rate (MSER):

MSER = (e′ + e′′) / n


In extreme cases, MSER could be greater than one; for example, in the segmentation result "去哈/尔滨/", e′ = 2, e′′ = 1, n = 2, so MSER = 3/2. If this happens, we simply set MSER = 1. MSER can be used to evaluate manual word segmentation results; a lower MSER means better data quality. Consider its collective form: if S is segmented by m (m ∈ ℕ+) subjects, and the ith (1 ⩽ i ⩽ m) subject's type I error count, type II error count, and segmentation operation count are e′_i, e′′_i, and n_i respectively, then

the collective form of MSER is:

MSER = ∑_{i=1}^{m} (e′_i + e′′_i) / ∑_{i=1}^{m} n_i

Conveniently, type I errors and their counts can be found in the unigram frequency list of the segmentation results, and type II errors and their counts in the bigram frequency list of the segmentation results.
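As a sketch, the per-subject and collective forms of MSER can be computed as follows, assuming the error counts e′ and e′′ have already been tallied (e.g., from the unigram and bigram frequency lists mentioned above):

```python
# Minimal sketch of MSER; error counts are assumed to be given.

def mser(e1, e2, n):
    """Per-subject MSER: e1 type I errors, e2 type II errors,
    n segmentation operations. Capped at 1 per the extreme-case rule."""
    return min((e1 + e2) / n, 1.0)

def collective_mser(counts):
    """counts: iterable of (e1_i, e2_i, n_i) triples, one per subject."""
    total_errors = sum(e1 + e2 for e1, e2, _ in counts)
    total_ops = sum(n for _, _, n in counts)
    return total_errors / total_ops
```

For the "去哈/尔滨/" example above, `mser(2, 1, 2)` returns 1.0 after capping.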

3.3 Evaluation Procedure and Results

Among the 19 sentences of each task, three sentences were sampled for evaluation: the first sentence, the middle (10th) sentence, and the last (19th) sentence. We calculated the MSER for each of them; see Table 4 for details. The MSERs of the segmentation results of these sentences are all very low (< .05), and the mean is only .013 (SD = .004); this means the resultant dataset contains few errors and indicates that the data quality is good.

    4 Conclusion

We created the manual Chinese word segmentation dataset WordSegCHC 1.0 using the crowdsourcing method; to the best of our knowledge, there are no publicly available resources of this kind. It can support studies of word intuition, especially of the effect of semantic transparency on word intuition, and has potential applications in Chinese language processing.

We also proposed an evaluation method called manual segmentation error rate (MSER) to evaluate manual word segmentation datasets. The error rate of the dataset proved to be very low, which indicates that its data quality is reliable.

This work also confirmed again that the crowdsourcing method is a feasible, convenient, and re-

Task  Sentence  ∑n      ∑e′    ∑e′′   MSER
1     S1        2864    13     20     .012
1     S10       3904    18     16     .009
1     S19       4046    12     7      .005
2     S1        2993    29     19     .016
2     S10       2000    9      6      .008
2     S19       2529    19     26     .018
3     S1        6634    32     27     .009
3     S10       2834    21     14     .012
3     S19       2894    43     22     .022
4     S1        2612    24     22     .018
4     S10       1836    14     8      .012
4     S19       2640    26     20     .017
5     S1        2361    15     14     .012
5     S10       2829    14     7      .007
5     S19       2489    14     15     .012
6     S1        2906    35     22     .020
6     S10       2758    21     8      .011
6     S19       1711    20     13     .019
7     S1        1857    19     11     .016
7     S10       3125    35     14     .016
7     S19       2808    28     10     .014
8     S1        2465    23     14     .015
8     S10       3238    23     11     .011
8     S19       2042    15     7      .011
Min             1711    9      6      .005
Max             6634    43     27     .022
Sum             68375   522    353
Mean            2848.96 21.75  14.71  .013
SD              989.76  8.51   6.3    .004

Table 4: Segmentation error rates (MSER) of the segmentation results of the eight tasks.

liable tool to collect linguistic data. Through this work, a reusable general framework for crowdsourced linguistic data collection is also presented; following this framework, larger similar Chinese language resources can be constructed.

We will use this dataset to examine the role of semantic transparency in the word intuition of Chinese speakers and to induce the factors affecting word intuition. The consequent discoveries will deepen our understanding of the word definition problem in the Chinese language, which has both theoretical and practical significance.

In the future, once the factors modulating Chinese speakers' word intuition are clear, a computational cognitive model of Chinese word segmentation (Wu, 2011) can perhaps be proposed, and we believe this could be an interesting new direction of Chinese word segmentation research.

    Acknowledgments

    The work described in this paper was supported bya grant from the Research Grants Council of theHong Kong SAR, China (Project No. 544011).


References

M Allahbakhsh, B Benatallah, A Ignjatovic, HR Motahari-Nezhad, E Bertino, and S Dustdar. 2013. Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing, 17(2):76–81.

Tara S Behrend, David J Sharek, Adam W Meade, and Eric N Wiebe. 2011. The viability of crowdsourcing for survey research. Behavior Research Methods, 43(3):800–813.

Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. 2011. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 1–12. Association for Computational Linguistics.

Yuen Ren Chao. 1968. A Grammar of Spoken Chinese. University of California Press.

Keh-Jiann Chen, Chu-Ren Huang, Li-Ping Chang, and Hui-Li Hsu. 1996. Sinica corpus: Design methodology for balanced corpora. In B.-S. Park and J.B. Kim, editors, Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation, pages 167–176. Seoul: Kyung Hee University.

Matthew JC Crump, John V McDonnell, and Todd M Gureckis. 2013. Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3):e57410.

San Duanmu. 1998. Wordhood in Chinese. In New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese, pages 135–196.

Yi-Jhong Han, Shuo-chieh Huang, Chia-Ying Lee, Wen-Jui Kuo, and Shih-kuen Cheng. 2014. The modulation of semantic transparency on the recognition memory for two-character Chinese words. Memory & Cognition, pages 1–10.

Rumjahn Hoosain. 1992. Psychological reality of the word in Chinese. Advances in Psychology, 90:111–130.

Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67. ACM.

Gary Libben, Martha Gibson, Yeo Bom Yoon, and Dominiek Sandra. 2003. Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84(1):50–64.

Gary Libben. 1998. Semantic transparency in the processing of compounds: Consequences for representation, processing, and impairment. Brain and Language, 61(1):30–44.

Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1–23.

Leh Woon Mok. 2009. Word-superiority effect as a function of semantic transparency of Chinese bimorphemic compound words. Language and Cognitive Processes, 24(7-8):1039–1081.

Robert Munro, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 122–130. Association for Computational Linguistics.

Jerome L Packard. 2000. The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge University Press.

Gabriele Paolacci, Jesse Chandler, and Panagiotis G Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411–419.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.

Chih-Hao Tsai. 1994. Effects of semantic transparency on the recognition of Chinese two-character words: Evidence for a dual-process model. Master's thesis, Graduate Institute of Psychology, National Chung Cheng University, Chia-Yi, Taiwan.

Shichang Wang, Chu-Ren Huang, Yao Yao, and Angel Chan. 2014a. Building a semantic transparency dataset of Chinese nominal compounds: A practice of crowdsourcing methodology. In Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, pages 147–156, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Shichang Wang, Chu-Ren Huang, Yao Yao, and Angel Chan. 2014b. Exploring mental lexicon in an efficient and economic way: Crowdsourcing method for linguistic experiments. In Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex), pages 105–113, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.


Zhijie Wu. 2011. A cognitive model of Chinese word segmentation for machine translation. Meta: journal des traducteurs / Meta: Translators' Journal, 56(3):631–644.

冯胜利. 1996. 论汉语的"韵律词". 中国社会科学, (1):161–176.

冯胜利. 2001. 从韵律看汉语"词""语"分流之大界. 中国语文, (1):27–37.

冯胜利. 2004. 论汉语"词"的多维性. 语言科学, 3(3):161–174.

吕叔湘. 1979. 汉语语法分析问题. 商务印书馆.

李晋霞 and 李宇明. 2008. 论词义的透明度. 语言研究, (3):60–65.

王春茂 and 彭聃龄. 1999. 合成词加工中的词频、词素频率及语义透明度. 心理学报, 31(3):266–273.

王春茂 and 彭聃龄. 2000. 多词素词的通达表征：分解还是整体. 心理科学, 23(4):395–398.

王春茂, 彭聃龄, et al. 2000. 重复启动作业中词的语义透明度的作用. 心理学报, 32(2):127–132.

王洪君. 2006. 从本族人语感看汉语的"词". 语言科学.

王立. 2003. 汉语词的社会语言学研究. 商务印书馆.

胡明扬. 1999. 说"词语". 语言文字应用, 3.

董秀芳. 2002. 词汇化：汉语双音词的衍生和发展. 四川民族出版社.

陆志韦. 1964. 汉语的构词法. 科学出版社.


Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 15–20, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing

    Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model

    Aaron Li-Feng Han* Xiaodong Zeng+ Derek F. Wong+ Lidia S. Chao+

    * Institute for Logic, Language and Computation, University of Amsterdam

Science Park 107, 1098 XG Amsterdam, The Netherlands
+ NLP2CT Laboratory / Department of Computer and Information Science, University of Macau, Macau S.A.R., China
[email protected] [email protected] [email protected] [email protected]

    Abstract

Named entity recognition (NER) plays an important role in the NLP literature. Traditional methods tend to employ large annotated corpora to achieve high performance. Unlike many semi-supervised learning models for the NER task, in this paper we employ the graph-based semi-supervised learning (GBSSL) method to utilize freely available unlabeled data. The experiment shows that the unlabeled corpus can enhance the state-of-the-art conditional random field (CRF) learning model and has the potential to improve tagging accuracy, even though the margin is still small in the current experiments.

1. Introduction

Named entity recognition (NER) can be regarded as a sub-task of information extraction and plays an important role in the natural language processing literature. The NER challenge has attracted many researchers from NLP, and several successful NER shared tasks have been held in past years. The annotations in the MUC-7 Named Entity tasks (Marsh and Perzanowski, 1998) consist of entities (organization, person, and location), times, and quantities such as monetary values and percentages, across English, Chinese, and Japanese.

The entity categories in the CONLL-02 (Tjong Kim Sang, 2002) and CONLL-03 (Tjong Kim Sang and De Meulder, 2003) NER shared tasks consist of persons, locations, organizations, and names of miscellaneous entities, and the languages span Spanish, Dutch, English, and German.

1 http://www-nlpir.nist.gov/related_projects/muc/proceedings/ne_task.html

The SIGHAN bakeoff-3 (Levow, 2006) and bakeoff-4 (Jin and Chen, 2008) tasks offer standard Chinese NER (CNER) corpora for training and testing, which contain the three commonly used entity types, i.e., personal names, location names, and organization names. The CNER task is generally more difficult than for Western languages due to the lack of word boundary information in Chinese text.

Traditional methods for entity recognition tend to employ external annotated corpora to enhance the machine learning stage and improve testing scores with the enhanced models (Zhang et al., 2006; Mao et al., 2008; Yu et al., 2008). Conditional random field (CRF) models have shown advantages and good performance in CNER tasks compared with other machine learning algorithms (Zhou et al., 2006; Zhao and Kit, 2008), such as ME, HMM, etc. However, annotated corpora are generally very expensive and time-consuming to produce.

On the other hand, a large amount of freely available unlabeled data on the internet can be used for research. For this reason, some researchers have begun to explore the use of unlabeled data, and semi-supervised learning methods based on labeled training data and unlabeled external data have shown their advantages (Blum and Chawla, 2001; Shin et al., 2006; Zha et al., 2008; Zhang et al., 2013).


2. Semi-supervised Learning

In the semi-supervised learning setting, a sample {Z_i = (X_i, Y_i)}_{i=1}^{n_l} is usually observed with labels Y_i ∈ {−1, 1}, in addition to independent unlabeled samples {X_j}_{j=n_l+1}^{n}, with n = n_l + n_u. Each X_k = (X_k1, X_k2, ..., X_kp), k ∈ (1, n), is a p-dimensional input (Wang and Shen, 2007). The labeled samples are independently and identically distributed according to an unknown joint distribution P(x, y), and the unlabeled samples are independently and identically distributed from the distribution P(x). Many semi-supervised learning models are designed through assumptions relating P(x) to the conditional distribution, which cover the EM method, Bayesian networks, etc. (Zhu, 2008).

Graph-based semi-supervised learning (GBSSL) methods have been successfully employed by many researchers. For instance, Goldberg and Zhu (2006) design a GBSSL model for sentiment categorization; Celikyilmaz et al. (2009) propose a GBSSL model for question answering; Talukdar and Pereira (2010) use GBSSL methods for class-instance acquisition; Subramanya et al. (2010) utilize a GBSSL model for structured tagging models; Zeng et al. (2013) use the GBSSL method for joint Chinese word segmentation and part-of-speech (POS) tagging and achieve higher performance compared with previous works. However, as far as we know, the GBSSL method has not been applied to the CNER task. To test the effectiveness of the GBSSL model on the traditional CNER task, this paper utilizes unlabeled data to enhance CRF learning through the GBSSL method.

3. Designed Models

To briefly introduce the GBSSL method, let D_l = {(x_j, r_j)}_{j=1}^{l} denote l annotated data points, where r_j is the empirical label distribution of x_j. Let the unlabeled data types be denoted as D_u = {x_i}_{i=l+1}^{m}. Then the entire dataset can be represented as D = D_l ∪ D_u. Let G = (V, E) be an undirected graph with V as the vertices and E as the edges, and let V_l and V_u represent the labeled and unlabeled vertices respectively. One important step is to select a proper similarity measure to calculate the similarity between a pair of vertices (Das and Smith, 2012). According to the smoothness assumption, if two instances are similar according to the graph, then their output labels should also be similar (Zhu, 2005).

There are three main stages in the designed models: graph construction, label propagation, and CRF learning. Graph construction is performed on both labeled and unlabeled data, and the unlabeled data is automatically tagged through the label propagation stage. Then the tagged external data is added to the manually annotated training corpus to enhance the CRF learning model.

3.1 Graph Construction & Label Propagation

    We follow the research of Subramanya et al. (2010) to represent the vertices using character trigrams in labeled and unlabeled sentences for graph construction.

A symmetric k-NN graph is utilized, with the edge weights calculated by a symmetric similarity function designed by Zeng et al. (2013).

The feature set we employed to measure the similarity of two vertices based on co-occurrence statistics is the one optimized by Han et al. (2013) for CNER tasks, as shown in Table 1.

Feature                  Meaning
U_n, n ∈ (−4, 2)         Unigram, from the previous 4th to the following 2nd character
B_{n,n+1}, n ∈ (−2, 1)   Bigram, 4 pairs of features, from the previous 2nd to the following 2nd character

Table 1: Feature set for measuring vertex similarity in graph construction and for training the CRF model.
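For illustration, the Table 1 templates can be instantiated as below. The padding symbol and the feature-name format are our assumptions, not details from the paper:

```python
# Sketch of the Table 1 feature templates at character position i:
# unigrams U_n for n in -4..2 and bigrams B_{n,n+1} for n in -2..1.

def extract_features(chars, i, pad="#"):
    def ch(j):
        # out-of-range positions map to an assumed padding symbol
        return chars[j] if 0 <= j < len(chars) else pad
    feats = {}
    for n in range(-4, 3):            # U_-4 ... U_2  (7 unigram features)
        feats[f"U{n}"] = ch(i + n)
    for n in range(-2, 2):            # B_-2 ... B_1  (4 bigram features)
        feats[f"B{n}"] = ch(i + n) + ch(i + n + 1)
    return feats
```

For the sentence "这是好东西" at position 2, this yields `U0 = "好"` and `B0 = "好东"`, i.e., 7 unigram and 4 bigram features per position.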

After the graph construction on both labeled and unlabeled data, we use the sparsity-inducing penalty label propagation algorithm (Das and Smith, 2012) to induce trigram-level label distributions from the constructed graph, based on the Junto toolkit (Talukdar and Pereira, 2010).
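As a rough illustration of label propagation on such a graph, the sketch below clamps labeled vertices and iteratively averages neighbor distributions. This is plain neighborhood averaging, not the sparsity-inducing penalty algorithm of Das and Smith (2012) that the system actually uses:

```python
import numpy as np

# Minimal label-propagation sketch: labeled vertices are clamped,
# unlabeled vertices repeatedly take the weighted average of neighbors.

def propagate(W, seed, iters=30):
    """W: (n, n) symmetric edge-weight matrix.
    seed: dict {vertex_index: label distribution (length-k array)}.
    Returns an (n, k) matrix of label distributions."""
    n = W.shape[0]
    k = len(next(iter(seed.values())))
    q = np.full((n, k), 1.0 / k)        # start from uniform distributions
    for v, r in seed.items():
        q[v] = r
    deg = W.sum(axis=1, keepdims=True)  # weighted degree of each vertex
    for _ in range(iters):
        q = W @ q / np.clip(deg, 1e-12, None)  # neighbor averaging
        for v, r in seed.items():              # re-clamp labeled vertices
            q[v] = r
    return q
```

On a three-vertex chain with the two ends labeled with opposite classes, the middle vertex converges to an even split, which is the smoothness assumption at work.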

3.2 CRF Training

In the CRF model, assume a graph G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines, and let Y = {Y_v | v ∈ V}, so that Y is indexed by the vertices of G. The joint distribution over the label sequence Y given X takes the form:

    16

P_θ(y|x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|_e, x) + ∑_{v∈V,k} μ_k g_k(v, y|_v, x) )

Here f_k and g_k are the feature functions, and μ_k and λ_k are the parameters trained from a specific dataset (Lafferty et al., 2001). The feature set employed in CRF learning is also the optimized one shown in Table 1. The training method used for the CRF model is a quasi-Newton algorithm2. The corpus automatically annotated by graph-based label propagation will affect the trained parameters μ_k and λ_k.

    4. Experiments

4.1 Data

We employ the SIGHAN bakeoff-3 (Levow, 2006) MSRA (Microsoft Research Asia) training and testing data as the standard setting. To test the effectiveness of the GBSSL method for the CRF model in CNER tasks, we utilize plain (unannotated) text from SIGHAN bakeoff-2 (Emerson, 2005) and bakeoff-4 (Jin and Chen, 2008) as external unlabeled data. The dataset is described in Table 2 in terms of sentence counts.

                  Bakeoff-3 Corpus        External
                  Training    Testing     Unlabeled
Sentence Number   50,425      4,365       31,640

Table 2: Corpus information.

4.2 Result Analysis

We set two baselines for the evaluation. One baseline is a simple left-to-right maximum matching model (MaxMatch) based on the training data; the other is the closed CRF model (Closed-CRF), which uses no unlabeled data. The GBSSL-based semi-supervised CRF model is denoted GBSSL-CRF.

The training costs of the CRF learning stage are detailed in Table 3. The comparison shows that the number of extracted features grows from 8,729,098 to 11,336,486 (+29.87%) due to the external dataset, and the corresponding iterations and training hours grow by 12.86% and 77.04% respectively.

2 http://www.nag.com/numeric/fl/nagdoc_fl23/html/E04/e04conts.html

              Features      Iterations   Time (h)
Closed-CRF    8,729,098     350          4.53
GBSSL-CRF     11,336,486    395          8.02

Table 3: Training costs for CRF learning.

The evaluation results are shown in Table 4 in terms of recall, precision, and the harmonic mean of recall and precision (F1-score). The evaluation shows that both the Closed-CRF and GBSSL-CRF models largely outperform the MaxMatch baseline. Compared with the Closed-CRF model, the GBSSL-CRF model yields a higher precision score and a lower recall score, resulting in a slight improvement in F1 score. Both GBSSL-CRF and Closed-CRF show higher precision than recall.

              Total-R   Total-P   Total-F
MaxMatch      48.8      59.0      53.4
Closed-CRF    77.95     90.27     83.66
GBSSL-CRF     77.84     90.62     83.74

Table 4: Evaluation results.
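The Total-F column is the harmonic mean of Total-R and Total-P; for instance, the Closed-CRF row can be checked as:

```python
def f1(recall, precision):
    # F1-score: harmonic mean of recall and precision
    return 2 * recall * precision / (recall + precision)

print(round(f1(77.95, 90.27), 2))  # → 83.66, matching the Closed-CRF row
```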

To look further into the GBSSL performance on each kind of entity, we report the detailed evaluation results from the aspect o