Automatic Measurement of Syntactic Development
in Child Language
Kenji Sagae
Language Technologies Institute
Student Research Symposium
September 2005
Joint work with
Alon Lavie and Brian MacWhinney
Using Natural Language Processing in Child Language Research

- CHILDES database (MacWhinney, 2000)
  - Several megabytes of child-parent dialog transcripts
  - Part-of-speech and morphology analysis tools available
- Recently proposed syntactic annotation scheme (Sagae et al., 2004): Grammatical Relations (GRs)
  - POS analysis is not enough for many research questions
  - Only a very small amount of annotated data exists
- Parsing: can we use current NLP tools to analyze CHILDES GRs?
  - This would allow, for example, automatic measurement of syntactic development
Outline

- The CHILDES GR annotation scheme
- Automatic GR analysis
- Measurement of syntactic development
CHILDES GR Scheme (Sagae et al., 2004)

- Addresses the needs of child language researchers
- Grammatical Relations (GRs): subject, object, adjunct, etc.
- Represented as labeled dependencies: each dependent word is linked to a head word by a dependency label (a minimal data-structure sketch follows)
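To make the representation concrete, here is a minimal sketch (not the actual CHILDES tooling) of how a labeled dependency could be stored, assuming 1-based word indices with 0 standing for the root of the utterance:

```python
from dataclasses import dataclass

@dataclass
class GrammaticalRelation:
    """One labeled dependency: a dependent word attached to a head word under a GR label."""
    dependent: int   # index of the dependent word in the utterance (1-based)
    head: int        # index of the head word (0 = root of the utterance)
    label: str       # GR label, e.g. SUBJ, OBJ, DET, JCT

# "we eat the cheese" -> SUBJ(we, eat), ROOT(eat), DET(the, cheese), OBJ(cheese, eat)
words = ["we", "eat", "the", "cheese"]
grs = [
    GrammaticalRelation(dependent=1, head=2, label="SUBJ"),
    GrammaticalRelation(dependent=2, head=0, label="ROOT"),
    GrammaticalRelation(dependent=3, head=4, label="DET"),
    GrammaticalRelation(dependent=4, head=2, label="OBJ"),
]
```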
Automatic Syntactic (GR) Analysis

- Input: a sentence; output: a dependency structure (GRs)
- Three steps:
  1. Text preprocessing
  2. Unlabeled dependency identification
  3. Dependency labeling
Step 1: Text Preprocessing Prepares Utterances for Parsing

- The CHAT transcription system explicitly marks certain extra-grammatical material: disfluencies, retracing, and repetitions
- The CLAN tools (MacWhinney, 2000) remove this extra-grammatical material and provide POS and morphological analyses (a rough sketch of this kind of cleanup follows)
- CHAT and the CLAN tools are publicly available: http://childes.psy.cmu.edu
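The cleanup step can be illustrated with a very rough Python sketch. The marker set handled here is simplified and illustrative only; the real CLAN tools cover many more CHAT codes:

```python
import re

def strip_extragrammatical(utterance: str) -> str:
    """Rough sketch of the kind of cleanup CLAN performs before parsing.
    The patterns below are illustrative only."""
    # filled pauses such as "&uh" or "&-um"
    utterance = re.sub(r"&-?\w+", "", utterance)
    # retraced or repeated material, e.g. "<the dog> [/]" or "dog [//]"
    utterance = re.sub(r"(<[^>]*>|\S+)\s*\[/+\]", "", utterance)
    # collapse whitespace left behind
    return re.sub(r"\s+", " ", utterance).strip()

print(strip_extragrammatical("&uh <the dog> [/] the dog barked ."))
# -> "the dog barked ."
```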
Step 2: Unlabeled Dependency Identification

- Use an existing constituent parser trained on the Penn Treebank, such as Charniak (2000)
  - Why? A large training corpus is available: the Penn Treebank (Marcus et al., 1993)
- Convert the parser output to dependencies using a head table, which picks a head child for each constituent (see the sketch below)
- Alternatively, use a dependency parser, for example the MALT parser (Nivre and Scholz, 2004) or Yamada and Matsumoto (2003)
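The head-table idea can be sketched as follows. The rules and the toy tree below are illustrative assumptions, not the actual Penn Treebank conversion rules, which are considerably more elaborate:

```python
# Toy head-percolation rules: for each constituent label, which child
# categories to prefer as the head, searched in order.
HEAD_RULES = {
    "S":  ["VP", "NP"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NP"],
}

def find_head(label, children):
    """Pick the head child of a constituent using the head table."""
    for preferred in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == preferred:
                return child
    return children[-1]  # fallback: rightmost child

def to_dependencies(tree, deps=None):
    """Convert a nested (label, children-or-word) tree into a list of
    (dependent_word, head_word) pairs."""
    if deps is None:
        deps = []
    label, children = tree
    if isinstance(children, str):          # leaf: (POS, word)
        return children, deps
    head_child = find_head(label, children)
    head_word, _ = to_dependencies(head_child, deps)
    for child in children:
        if child is not head_child:
            dep_word, _ = to_dependencies(child, deps)
            deps.append((dep_word, head_word))
    return head_word, deps

tree = ("S", [("NP", [("PRO", "we")]),
              ("VP", [("VB", "eat"),
                      ("NP", [("DET", "the"), ("NN", "cheese")])])])
root, deps = to_dependencies(tree)
print(root, deps)
# eat [('the', 'cheese'), ('cheese', 'eat'), ('we', 'eat')]
```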
Domain Issues

- The parser training data comes from a very different domain: WSJ text vs. parent-child dialogs
- Domain-specific training data would be better, but it would have to be created manually
- Performance is nonetheless acceptable, since the utterances are shorter and simpler
  - Unlabeled dependency accuracy on WSJ test data: 92%
  - Unlabeled dependency accuracy on CHILDES data (2,000 words): 90%
Final Step: Dependency Labeling

- Training data is required
- Labeling dependencies is easier than finding unlabeled dependencies, so less training data is needed than for full labeled dependency parsing
- Use a classifier: TiMBL (Daelemans et al., 2004)
  - Extract features from the unlabeled dependency structure
  - GR labels are the target classes
Features Used for GR Labeling

- Head and dependent words, and their POS tags
- Whether the dependent comes before or after the head
- How far the dependent is from the head
- The label of the lowest node in the constituent tree that includes both the head and the dependent
Features Used for GR Labeling: An Example

- Consider the words "we" and "eat"
- Features: we, pro, eat, v, before, 1, S
- Class: SUBJ

(A toy sketch of feature extraction and memory-based labeling follows.)
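The following sketch extracts the features listed above for a single dependency and labels it with a simple feature-overlap nearest-neighbor lookup. It is a minimal stand-in for the actual setup, which uses TiMBL, a memory-based (k-nearest-neighbor) learner; the training examples here are invented for illustration:

```python
def gr_features(words, tags, dep_idx, head_idx, lowest_label):
    """Features for labeling one dependency: dependent and head words
    and POS tags, direction, distance, and the label of the lowest
    constituent covering both."""
    return (
        words[dep_idx], tags[dep_idx],        # dependent word, POS
        words[head_idx], tags[head_idx],      # head word, POS
        "before" if dep_idx < head_idx else "after",
        abs(head_idx - dep_idx),
        lowest_label,
    )

def label_gr(features, training_examples):
    """Toy memory-based labeling: return the label of the training
    instance sharing the most feature values with this one."""
    def overlap(example):
        return sum(a == b for a, b in zip(features, example[0]))
    return max(training_examples, key=overlap)[1]

words = ["we", "eat", "the", "cheese"]
tags = ["pro", "v", "det", "n"]
feats = gr_features(words, tags, dep_idx=0, head_idx=1, lowest_label="S")
print(feats)  # ('we', 'pro', 'eat', 'v', 'before', 1, 'S')

training = [(("you", "pro", "run", "v", "before", 1, "S"), "SUBJ"),
            (("ball", "n", "throw", "v", "after", 1, "VP"), "OBJ")]
print(label_gr(feats, training))  # SUBJ
```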
Good GR Labeling Results with a Small Training Set

- 5,000 words for training, 2,000 words for testing
- Accuracy of dependency labeling (on perfect unlabeled dependencies): 91.4%
- Overall accuracy (Charniak parser + dependency labeling): 86.9%
Some GRs Are Easier Than Others

- Overall accuracy: 86.9%
- Easily identifiable GRs: DET, POBJ, INF, NEG (precision and recall above 98%)
- Difficult GRs: COMP, XCOMP (below 65%)
  - These account for less than 4% of the GRs seen in the training and test sets
Precision and Recall of Specific GRs

GR      Precision  Recall  F-score
SUBJ    0.94       0.93    0.93
OBJ     0.83       0.91    0.87
COORD   0.68       0.85    0.75
JCT     0.91       0.82    0.86
MOD     0.79       0.92    0.85
PRED    0.80       0.83    0.81
ROOT    0.91       0.92    0.91
COMP    0.60       0.50    0.54
XCOMP   0.58       0.64    0.61
Index of Productive Syntax (IPSyn) (Scarborough, 1990)

- A measure of child language development
- Assigns a numerical score for grammatical complexity (from 0 to 112 points)
- Used in hundreds of studies
IPSyn Measures Syntactic Development

- IPSyn was designed for investigating differences in language acquisition
  - Differences between groups (for example, bilingual children)
  - Individual differences (for example, delayed language development)
  - Focus on syntax
- Addresses weaknesses of Mean Length of Utterance (MLU)
  - MLU is surprisingly useful until about age 3, then reaches a ceiling (or becomes unreliable)
- IPSyn is very time-consuming to compute
IPSyn Is More Informative Than MLU in Children Over Age 3

[Figure: IPSyn and MLU scores plotted against age in months (24 to 54). The IPSyn score continues to increase with age, while the MLU score levels off after about age 3.]
Computing IPSyn (manually)

- Take a corpus of 100 transcribed utterances (consecutive, no repetitions)
- Identify 56 specific language structures (IPSyn items), for example:
  - Presence of auxiliaries or modals
  - Inverted auxiliary in a wh-question
  - Conjoined clauses
  - Fronted or center-embedded subordinate clauses
- Count occurrences of each item (zero, one, or two or more)
- Add up the counts (the scoring scheme is sketched below)
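The counting scheme is simple enough to sketch directly: each of the 56 items contributes 0, 1, or 2 points depending on whether it occurs zero, one, or two or more times in the sample, for a maximum of 112. The item codes below are hypothetical labels used only for illustration:

```python
def ipsyn_score(item_occurrences):
    """item_occurrences maps each IPSyn item to how many times the
    structure was found in the 100-utterance sample.  Each item
    contributes 0, 1, or 2 points (two or more occurrences cap at 2)."""
    return sum(min(count, 2) for count in item_occurrences.values())

# Hypothetical item codes, for illustration only.
counts = {"aux_or_modal": 3, "inverted_aux_wh_question": 1, "conjoined_clauses": 0}
print(ipsyn_score(counts))  # 3
```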
Automating IPSyn

- Existing state of manual computation:
  - Spreadsheets
  - Search each sentence for language structures
  - Use part-of-speech tagging to narrow down the number of sentences to inspect for certain structures (for example: Verb + Noun, Determiner + Adjective + Noun)
- Can't we just use part-of-speech tagging? Only one other automated implementation of IPSyn exists, and it uses only words and POS tags
Automating IPSyn without Syntactic Analysis

- Use patterns of words and parts of speech to find language structures
- Computerized Profiling, or CP (Long, Fey and Channell, 2004)
- Works well for many IPSyn items, such as the Determiner + Adjective + Noun sequence (see the sketch below)
- But does not work very well for several important items, such as fronted or center-embedded subordinate clauses and inverted auxiliaries in wh-questions
- Cuts down manual work significantly (good), but fully automatic IPSyn scores are only somewhat accurate (not so good)
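The flat word/POS-pattern approach can be illustrated with a short sketch for the Determiner + Adjective + Noun item. This is not CP's implementation, just a minimal example of the technique, using CHILDES-style lowercase POS tags:

```python
def find_det_adj_noun(tagged_utterance):
    """Find Determiner + Adjective + Noun sequences using POS tags alone.
    Flat patterns like this suffice for some IPSyn items, but not for
    clause-level ones."""
    hits = []
    for i in range(len(tagged_utterance) - 2):
        window = tagged_utterance[i:i + 3]
        if [tag for _, tag in window] == ["det", "adj", "n"]:
            hits.append(tuple(word for word, _ in window))
    return hits

utterance = [("the", "det"), ("big", "adj"), ("dog", "n"), ("barked", "v")]
print(find_det_adj_noun(utterance))  # [('the', 'big', 'dog')]
```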
Some IPSyn Items Require Syntactic Analysis for Reliable Recognition (and some don't)

- Determiner + Adjective + Noun
- Auxiliary verb
- Adverb modifying adjective or nominal
- Subject + Verb + Object
- Sentence with 3 clauses
- Conjoined sentences
- Wh-question with inverted auxiliary/modal/copula
- Relative clauses
- Propositional complements
- Fronted subordinate clauses
- Center-embedded clauses
Automating IPSyn with Grammatical Relation Analyses

- Search for language structures using patterns that involve POS tags and GRs (labeled dependencies)
- There is still room for under- and over-generalization, but the patterns are easier to write and more reliable
- Examples (a minimal search sketch follows this list):
  - Wh-embedded clauses: search for wh-words whose head (or transitive head) is a dependent in a GR of type [XC]SUBJ, [XC]PRED, [XC]JCT, [XC]MOD, COMP, or XCOMP
  - Relative clauses: search for a CMOD where the dependent is to the right of the head
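The relative-clause pattern quoted above can be sketched in a few lines. The GR analysis of the example sentence is hypothetical and shown only to exercise the pattern; each GR is a (dependent_index, head_index, label) triple with 1-based word indices:

```python
def has_relative_clause(grs):
    """Relative-clause pattern: a CMOD relation whose dependent word
    appears to the right of its head word."""
    return any(label == "CMOD" and dep > head for dep, head, label in grs)

# "this is the car I saw": hypothetical analysis where "saw" (word 6)
# is a CMOD dependent of "car" (word 4)
grs = [(1, 2, "SUBJ"), (2, 0, "ROOT"), (3, 4, "DET"),
       (4, 2, "PRED"), (5, 6, "SUBJ"), (6, 4, "CMOD")]
print(has_relative_clause(grs))  # True
```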
Evaluation Data

- Two sets of transcripts with IPSyn scoring, from two different child language research groups
  - Set A: scored fully manually; 20 transcripts; ages around 3 years
  - Set B: scored with CP first, then manually corrected; 25 transcripts; ages around 8 years
- Two transcripts in each set were held out for development and debugging
Evaluation Metrics: Point Difference

- The absolute point difference between the score produced by our system and the score computed manually
- Simple, and shows how close the automatic scores are to the manual scores
- The acceptable range is smaller for older children
Evaluation Metrics: Point-to-Point Accuracy

- Reflects overall reliability over each scoring decision made in the computation of IPSyn scores
- Scoring decisions: the presence or absence of each language structure in the transcript
- Point-to-point accuracy = (number of correct decisions) / (total number of decisions)
- Commonly used for assessing inter-rater reliability among human scorers (for IPSyn, about 94%)

(Both metrics are sketched below.)
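Both metrics are straightforward to compute; here is a minimal sketch with invented example values, where scoring decisions are represented as booleans (structure present or absent):

```python
def point_difference(auto_score, manual_score):
    """Absolute difference between automatic and manual IPSyn totals."""
    return abs(auto_score - manual_score)

def point_to_point_accuracy(auto_decisions, manual_decisions):
    """Fraction of individual scoring decisions (presence or absence of
    each structure) on which the system agrees with the human scorer."""
    correct = sum(a == m for a, m in zip(auto_decisions, manual_decisions))
    return correct / len(manual_decisions)

print(point_difference(84, 81))                                        # 3
print(point_to_point_accuracy([True, False, True, True],
                              [True, False, True, False]))             # 0.75
```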
Results

IPSyn scores from:
- Our GR-based system (GR)
- Manual scoring (HUMAN)
- Computerized Profiling (CP)
GR-based IPSyn Is Quite Accurate

System      Avg. Point Difference to HUMAN    Point-to-Point Reliability (%)
GR (total)  3.3                               92.8
CP (total)  8.3                               85.4
GR (set A)  3.7                               92.5
CP (set A)  6.2                               86.2
GR (set B)  2.9                               93.0
CP (set B)  10.2                              84.8
Error Analysis: Four Problematic Items Cause Half of the Errors

Four (of 56) IPSyn items account for about half of all mistakes made by our GR-based system:

(a) Propositional complement: 16.9% ("I said you can go now")
(b) Copula/modal/auxiliary for emphasis or ellipsis: 12.3% ("I thought he ate his cake, but he didn't.")
(c) Relative clause: 10.6% ("This is the car I saw.")
(d) Bitransitive predicate: 5.8% ("I gave her the book.")

Items (a), (c), and (d) stem from incorrect GR analyses; item (b) stems from an imperfect search pattern.
Conclusion and Future Work

- We can annotate transcripts of child language with Grammatical Relations using current NLP tools and a small amount of manually annotated data
- The reliability of an automated version of IPSyn that uses CHILDES GRs is close to that of human scoring
- GR analysis still needs work: more training data, other parsing techniques
- Use of GR-based IPSyn by child language researchers should reveal additional problem areas
References
Charniak, E. 2000. A maximum-entropy-inspired parser. Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics. Seattle, WA.
Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. 2004. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Research Group Technical Report Series, no. 04-02.
Long, S. H., Fey, M. E., Channell, R. W. 2004. Computerized Profiling (version 9.6.0). Cleveland, OH: Case Western Reserve University.
MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates.
Marcus, M. P., Santorini, B., Marcinkiewicz, M. A. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19.
Nivre, J., Scholz, M. 2004. Deterministic parsing of English text. Proceedings of the International Conference on Computational Linguistics (pp. 64-70). Geneva, Switzerland.
Sagae, K., MacWhinney, B., Lavie, A. 2004. Adding syntactic annotations to transcripts of parent-child dialogs. Proceedings of the Fourth International Conference on Language Resources and Evaluation. Lisbon, Portugal.
Scarborough, H. S. 1990. Index of Productive Syntax. Applied Psycholinguistics, 11, 1-22.
Appendix: Where POS Tagging Is Not Enough

- Sentences with the same POS sequence may have different structures:
  (a) Before [,] he told the man he was cold.
  (b) Before he told the story [,] he was cold.
- Some syntactic structures are difficult to recognize using only POS tags and words: search patterns may under- and over-generate
- Using syntactic analysis is easier and more reliable