Corpora
Annotation
Corpus Linguistics L615/415: Annotation - Week 2
Olga Scrivner
Review: Corpus Characteristics
Prototypical Corpus
1. Machine-readable (Unicode/ASCII) text files
2. Representative
3. Balanced
4. Data from natural communicative settings
Image: http://images.clipartpanda.com/row-of-books-clipart-5384556-pile-of-books--vector-illustration.jpg
Unicode vs ASCII
Character encoding - translating a character to a number
Morse code → character to tone
ASCII - 7-bit encoding and only 128 characters (American English)
Unicode - 8-, 16-, or 32-bit code units (UTF-8, UTF-16, UTF-32)
A bit (binary unit) can hold only one of two values: 0 or 1
Eight bits make a byte
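To make the encoding points above concrete, here is a minimal Python sketch. It is only an illustration: the helper name `describe` and the sample characters are assumptions, not part of the course material. It shows the number assigned to a character, that number written in bits, and how many bytes each Unicode encoding needs.

```python
def describe(ch):
    """Return the code point and encoded sizes for a single character."""
    code_point = ord(ch)  # the number the character is translated to
    return {
        "code_point": code_point,
        "binary": format(code_point, "b"),            # the same number written in bits
        "fits_in_ascii": code_point < 128,            # ASCII covers only 0..127
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),   # "-le" avoids counting a byte-order mark
        "utf32_bytes": len(ch.encode("utf-32-le")),
    }

# Hypothetical sample characters: Latin, accented Latin, Cyrillic, currency sign
for ch in ["A", "é", "ж", "€"]:
    print(ch, describe(ch))
```

Under these assumptions, "A" fits in seven bits and one UTF-8 byte, while the other characters fall outside ASCII and need two or three UTF-8 bytes.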
Unicode
https://www.branah.com/unicode-converter
http://online-toolz.com/tools/hex-binary-convertor.php
What Are the Differences?
Documentary-linguistic corpora
Small corpus with audio/video recordings designed to provide an overview of an endangered language (unbalanced)
Prototypical corpora
Balanced corpus from natural communicative settings
Experimental corpora
Corpus violating the natural communicative setting: subjects' behavior is controlled with carefully developed experimental stimuli
Annotation
Process of assigning a label to a tokenized word identifying the part of speech of the word
Process of marking each word with its base (dictionary) form
Initial segmentation process (words, numbers, punctuation)
Annotation with a phrase-structure representation or dependency-tree representation
Annotation of senses of word forms
Annotation of the set of sounds
Annotation of features such as tone units, pause, stress
Nonverbal annotation
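The initial segmentation step in the list above (splitting text into words, numbers, and punctuation) can be sketched with a single regular expression. This is a rough illustration under simplifying assumptions; the pattern and the name `tokenize` are hypothetical, and real tokenizers handle many more cases.

```python
import re

# Three alternatives, tried in order:
#   1. numbers, optionally with . or , inside (3.50, 1,000)
#   2. words, optionally joined by a hyphen or apostrophe (well-known, don't)
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Corpus linguistics is my favorite class!"))
print(tokenize("It costs 3.50 dollars."))
```

Each token produced this way is the unit that later annotation layers (part of speech, lemma, sense) attach their labels to.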
Lemmatization vs Stemming
Lemma → base form
Stem → truncation
1 worked working works
2 managed manager manageable managing
3 apples apple
Stemmer: http://9ol.es/porter_js_demo.html
Lemmatizer: http://textanalysisonline.com/nltk-wordnet-lemmatizer
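The contrast between truncation and base-form lookup can be sketched in a few lines of Python. This is a toy illustration, not the Porter stemmer or the WordNet lemmatizer linked above: the suffix list, the small lemma dictionary, and the names `toy_stem` and `toy_lemmatize` are all assumptions made for the example.

```python
# Hypothetical suffix list, far cruder than the Porter stemmer's rules
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_stem(word):
    """Truncate the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-made dictionary mapping word forms to their base forms
LEMMAS = {
    "worked": "work", "working": "work", "works": "work",
    "managed": "manage", "managing": "manage",
    "apples": "apple",
}

def toy_lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["worked", "working", "works", "managed", "managing", "apples", "apple"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

The stems need not be real words ("manag", "appl"), whereas the lemmas are always dictionary forms.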
POS Tagging
Task: POS tag the following sentence:
Corpus linguistics is my favorite class!
http://textanalysisonline.com/nltk-pos-tagging
POS Tagging - Results
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Parsing
Parse: It is very hot today.
Parsing - Results
Identify: a) phrase-structure parsing and b) dependency parsing
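As a rough sketch of how the two representations differ as data, the example sentence can be encoded both ways in Python. The specific analysis is only one plausible reading and is an assumption of this example: the bracket labels follow Penn Treebank conventions, the relation names follow Universal Dependencies, and the helper names `leaves` and `to_brackets` are hypothetical.

```python
# a) Phrase structure: nested (label, children...) tuples; a leaf is (POS, word)
phrase_structure = (
    "S",
    ("NP", ("PRP", "It")),
    ("VP",
        ("VBZ", "is"),
        ("ADJP", ("RB", "very"), ("JJ", "hot")),
        ("NP", ("NN", "today"))),
    (".", "."),
)

# b) Dependency: flat (head, relation, dependent) triples, one per word
dependencies = [
    ("ROOT", "root", "hot"),
    ("hot", "nsubj", "It"),
    ("hot", "cop", "is"),
    ("hot", "advmod", "very"),
    ("hot", "obl:tmod", "today"),
]

def is_leaf(tree):
    """A leaf has exactly one child and that child is the word string."""
    return len(tree) == 2 and isinstance(tree[1], str)

def leaves(tree):
    """Collect the words of a phrase-structure tree from left to right."""
    if is_leaf(tree):
        return [tree[1]]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def to_brackets(tree):
    """Render the tree in bracketed notation."""
    if is_leaf(tree):
        return f"({tree[0]} {tree[1]})"
    return f"({tree[0]} " + " ".join(to_brackets(c) for c in tree[1:]) + ")"

print(to_brackets(phrase_structure))
for head, relation, dependent in dependencies:
    print(f"{relation}({head}, {dependent})")
```

The phrase-structure tree groups words into nested constituents, while the dependency analysis links each word directly to a single head word.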
Semantic Annotation: Word Sense Disambiguation
“drop me a line when you get there”
http://wordnetweb.princeton.edu/perl/webwn
Semantic Annotation
“Semantic annotation is an extremely time- and resource-consuming task”