Building a corpus


Annotation and mark-up

• The next step (?)

• A corpus can just contain ‘raw text’

• However, it is possible to add extra information into the corpus.

– e.g. what the corpus contains.

• This might help with the analysis of the data.

• Extra information = annotation or mark-up

Building a corpus

• What’s the difference?

• Annotation:

– ‘the practice of adding interpretative linguistic information to a corpus’ (Leech 2005);

– interpretative;

– Linguistic information;

– results in a value-added corpus.

Annotation and mark-up

• What’s the difference?

• Mark-up:

– descriptive; verifiable.

– features of the data beyond the words (layout, structure);

– metadata – data about the data (what the data is);

– results in a value-added corpus.

Annotation and mark-up

• WARNING! Mark-up and annotation:

– these terms are often used interchangeably in the literature.

• And ‘tagging’!

– can mean any process that adds extra information to a corpus.

Annotation and mark-up

Information about features of the text:

• Sentence starts/ends

• Paragraph starts/ends

• Page starts/ends/numbers

• Chapters

• Formatting

• etc.

Mark-up

Information about the text:

• Where it came from

• Who produced it

• Genre

• etc.

• Usually at the start of a file in a header.

• Mark-up is usually a manual process.

Mark-up

Annotation

Adding information that results from some sort of linguistic analysis of the data in the corpus:

• grammatical class (parts of speech).

• discourse presentation.

• speech acts.

• and so on...

• Can be done manually or automatically.

Annotation - example

and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,

Annotation - example

and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,

• should: Deontic

• would: Epistemic

• could: Epistemic

Adding Annotation and Mark-up

• What do mark-up and annotation look like?

• XML-style mark-up and annotation are frequently used.

• XML: eXtensible Mark-up Language.

Adding Annotation and Mark-up

• XML-style annotation and mark-up comprises:

– an element (mandatory; you choose the name);

– an attribute (optional; you choose the name);

– angle brackets < ... > ;

• <element> ....text.... </element>

• <element attribute="..."> ...text... </element>

• There is always an end tag.

• Each start or end marker is referred to, in shorthand, as a tag.

Annotation - example

and Bromssell having demanded that it <mod type="d">should</mod> be free unto them to take againe their places, the first President did oppose it, saying, it <mod type="e">would</mod> be time enough when all the informations are read. They thought this <mod type="e">could</mod> be done this morning,

Annotation - example

In <mod type="d">should</mod>:

• element: mod

• attribute: type="d"

• end tag: </mod>
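A minimal sketch of how tags like these might be read back out of the text, using Python's standard xml.etree.ElementTree module. The <s> wrapper element and the variable names are assumptions added for illustration; only the <mod type="..."> tags come from the example. Reading "d" as deontic and "e" as epistemic follows the labels on the earlier slide.

```python
import xml.etree.ElementTree as ET

# Fragment from the example; the <s> wrapper is an assumption, added only
# because an XML parser needs a single root element.
fragment = (
    '<s>and Bromssell having demanded that it <mod type="d">should</mod> '
    'be free unto them to take againe their places, the first President '
    'did oppose it, saying, it <mod type="e">would</mod> be time enough '
    'when all the informations are read. They thought this '
    '<mod type="e">could</mod> be done this morning,</s>'
)

# Assumed expansion of the attribute values, based on the slide labels.
LABELS = {"d": "deontic", "e": "epistemic"}

root = ET.fromstring(fragment)

# Collect (word, modality) pairs for every <mod> element, in text order.
modals = [(m.text, LABELS[m.get("type")]) for m in root.iter("mod")]

for word, modality in modals:
    print(word, modality)
```

Because the analysis is stored in the tags, the same loop could count deontic versus epistemic modals across a whole file.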

An example header from the Huddersfield EModE news corpus:

<File id="J2.1">
  <Header>
    <Title> The spoyle of Antwerpe </Title>
    <Author> George Gascoigne </Author>
    <PubDate> 1576 </PubDate>
    <Source> EEBO </Source>
    <Words> 2112 </Words>
  </Header>
  <Text> ........ news text here ...... ........ ....... </Text>
</File>
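One reason to keep metadata in a header like this is that it can be read automatically. The sketch below assumes the header is written with straight quotes (which an XML parser requires) and uses Python's standard xml.etree.ElementTree; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The header from the example, with a placeholder for the news text.
doc = """<File id="J2.1">
  <Header>
    <Title> The spoyle of Antwerpe </Title>
    <Author> George Gascoigne </Author>
    <PubDate> 1576 </PubDate>
    <Source> EEBO </Source>
    <Words> 2112 </Words>
  </Header>
  <Text> news text here </Text>
</File>"""

root = ET.fromstring(doc)
header = root.find("Header")

# Turn each header element into a dictionary entry, trimming the spaces
# that surround the values in the file.
metadata = {child.tag: child.text.strip() for child in header}
metadata["id"] = root.get("id")

print(metadata)
```

With one such dictionary per file, texts could be selected by date, author or source before any analysis starts.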

Mark-up - example

Mark-up - example

A Courante of newes from the East India

A true Relation of the taking of the Ilands of Lantore and Polaroone in the parts of Banda in the East Indies by the Hollanders, which Ilands had yeelded themselves subject unto the King of England. Written to the East India company in England from their Factors there.

About the Month of December 1620. the Dutch Generall having prepared a force of 16. ships, declared to our President, that he intended an exploit for the good of both Companies, without mentioning any particulars of his designes.

Mark-up - example

<p>A Courante of newes from the East India</p>

<p>A true Relation of the taking of the Ilands of Lantore and Polaroone in the parts of Banda in the East Indies by the Hollanders, which Ilands had yeelded themselves subject unto the King of England. Written to the East India company in England from their Factors there.</p>

<p>About the Month of December 1620. the Dutch Generall having prepared a force of 16. ships, declared to our President, that he intended an exploit for the good of both Companies, without mentioning any particulars of his designes.</p>

Mark-up - example

<p>A Courante of newes from the East India</p>

<p><i>A true Relation of the taking of the Ilands of Lantore and Polaroone in the parts of Banda in the East Indies by the Hollanders, which Ilands had yeelded themselves subject unto the King of England. Written to the East India company in England from their Factors there.</i></p>

<p>About the Month of December 1620. the Dutch Generall having prepared a force of 16. ships, declared to our President, that he intended an exploit for the good of both Companies, without mentioning any particulars of his designes.</p>

Mark-up - example

<p><h>A Courante of newes from the East India</h></p>

<p><i>A true Relation of the taking of the Ilands of Lantore and Polaroone in the parts of Banda in the East Indies by the Hollanders, which Ilands had yeelded themselves subject unto the King of England. Written to the East India company in England from their Factors there.</i></p>

<p>About the Month of December 1620. the Dutch Generall having prepared a force of 16. ships, declared to our President, that he intended an exploit for the good of both Companies, without mentioning any particulars of his designes.</p>

Mark-up - example

<p><h>A Courante of newes from the East India</h></p>

<p><i>A true Relation of the taking of the Ilands of Lantore and <note mod="Pulau Run">Polaroone</note> in the parts of Banda in the East Indies by the Hollanders, which Ilands had yeelded themselves subject unto the King of England. Written to the East India company in England from their Factors there.</i></p>

<p>About the Month of December 1620. the Dutch Generall having prepared a force of 16. ships, declared to our President, that he intended an exploit for the good of both Companies, without mentioning any particulars of his designes.</p>

Automated annotation

• Annotation and mark-up can be a manual process (takes ages)

• But some annotation can be done automatically

– e.g. grammatical class of each word in the corpus (noun, verb, etc.) (CLAWS)

– e.g. word meaning (semantic) (USAS)

Automated Annotation: CLAWS

• Constituent Likelihood Automatic Word-tagging System

• Developed over many years at Lancaster University

• 96-97% accurate

• Assigns part-of-speech tags to words using a predefined list of tags (a tagset)

– C1 (the first tagset): 132 tags

– C7: around 160 tags

Automated Annotation: CLAWS

CLAWS

I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose , but even that was sweet in a way

Automated Annotation: CLAWS

CLAWS: C7

I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 ,_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1
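Output in this word_TAG format is easy to process further. As a rough sketch (the function and variable names are illustrative, not part of CLAWS), the tagged sentence can be split into word/tag pairs and the tags counted to give a small grammatical frequency list:

```python
from collections import Counter

# The C7-tagged sentence from the example.
tagged = (
    "I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ "
    "different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 "
    "pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 "
    ",_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1"
)

def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    # Split on the last underscore only, so a word that itself
    # contains an underscore would still be handled.
    return [tuple(token.rsplit("_", 1)) for token in text.split()]

pairs = parse_tagged(tagged)

# Frequency of each part-of-speech tag in the sentence.
pos_freq = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(pos_freq["JJ"])  # number of words tagged as general adjectives
```

In this sentence JJ (general adjective) is the most frequent tag, with five occurrences.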

Automated Annotation: USAS

• UCREL Semantic Analysis System

• (UCREL = University Centre for Computer Corpus Research on Language)

• Assigns tags to each word using a hierarchical framework of categorization

• Based originally on McArthur’s (1981) Longman Lexicon of Contemporary English

Automated Annotation: USAS

• Has undergone many revisions during the development of USAS.

• Groups words together that are conceptually related.

– Hierarchy of 21 major discourse fields.

– These expand into 232 category labels.

– [http://www.comp.lancs.ac.uk/ucrel/usas/]

USAS 21 Top Level Semantic Categories

A GENERAL & ABSTRACT TERMS

B THE BODY & THE INDIVIDUAL

C ARTS & CRAFTS

E EMOTION

F FOOD & FARMING

G GOVERNMENT & PUBLIC DOMAIN

H ARCHITECTURE, HOUSING & THE HOME

I MONEY & COMMERCE (IN INDUSTRY)

K ENTERTAINMENT

L LIFE & LIVING THINGS

M MOVEMENT, LOCATION, TRAVEL, TRANSPORT

N NUMBERS & MEASUREMENT

O SUBSTANCES, MATERIALS, OBJECTS, EQUIPMENT

P EDUCATION

Q LANGUAGE & COMMUNICATION

S SOCIAL ACTIONS, STATES & PROCESSES

T TIME

W WORLD & ENVIRONMENT

X PSYCHOLOGICAL ACTIONS, STATES & PROCESSES

Y SCIENCE & TECHNOLOGY

Z NAMES & GRAMMAR

A1 GENERAL AND ABSTRACT TERMS
A1.1.1 General actions, making etc.
A1.1.2 Damaging and destroying
A1.2 Suitability
A1.3 Caution
A1.4 Chance, luck
A1.5 Use
A1.5.1 Using
A1.5.2 Usefulness
A1.6 Physical/mental
A1.7 Constraint
A1.8 Inclusion/Exclusion
A1.9 Avoiding
A2 Affect
A2.1 Affect:- Modify, change
A2.2 Affect:- Cause/Connected
A3 Being
A4 Classification
A4.1 Generally kinds, groups, examples
A4.2 Particular/general; detail
A5 Evaluation
A5.1 Evaluation:- Good/bad
A5.2 Evaluation:- True/false
A5.3 Evaluation:- Accuracy
A5.4 Evaluation:- Authenticity
A6 Comparing
A6.1 Comparing:- Similar/different
A6.2 Comparing:- Usual/unusual
A6.3 Comparing:- Variety
A7 Definite (+ modals)
A8 Seem
A9 Getting and giving; possession
A10 Open/closed; Hiding/Hidden; Finding; Showing
A11 Importance
A11.1 Importance: Important
A11.2 Importance: Noticeability
A12 Easy/difficult
A13 Degree
A13.1 Degree: Non-specific
A13.2 Degree: Maximizers
A13.3 Degree: Boosters
A13.4 Degree: Approximators
A13.5 Degree: Compromisers
A13.6 Degree: Diminishers
A13.7 Degree: Minimizers
A14 Exclusivizers/particularizers
A15 Safety/Danger

B1 Anatomy and physiology
B2 Health and disease
B3 Medicines and medical treatment
B4 Cleaning and personal care
B5 Clothes and personal belongings

C1 Arts and crafts

E1 EMOTIONAL ACTIONS, STATES AND PROCESSES General
E2 Liking
E3 Calm/Violent/Angry
E4 Happy/sad
E4.1 Happy/sad: Happy
E4.2 Happy/sad: Contentment
E5 Fear/bravery/shock
E6 Worry, concern, confident

F1 Food
F2 Drinks
F3 Cigarettes and drugs
F4 Farming & Horticulture

G1 Government, Politics and elections
G1.1 Government etc.
G1.2 Politics
G2 Crime, law and order
G2.1 Crime, law and order: Law and order
G2.2 General ethics
G3 Warfare, defence and the army; weapons

H1 Architecture and kinds of houses and buildings
H2 Parts of buildings
H3 Areas around or near houses
H4 Residence
H5 Furniture and household fittings

I1 Money generally
I1.1 Money: Affluence
I1.2 Money: Debts
I1.3 Money: Price
I2 Business
I2.1 Business: Generally
I2.2 Business: Selling
I3 Work and employment
I3.1 Work and employment: Generally
I3.2 Work and employment: Professionalism
I4 Industry

K1 Entertainment generally
K2 Music and related activities
K3 Recorded sound etc.
K4 Drama, the theatre and showbusiness
K5 Sports and games generally
K5.1 Sports
K5.2 Games
K6 Children's games and toys

L1 Life and living things
L2 Living creatures generally
L3 Plants

M1 Moving, coming and going
M2 Putting, taking, pulling, pushing, transporting &c.
M3 Vehicles and transport on land
M4 Shipping, swimming etc.
M5 Aircraft and flying
M6 Location and direction
M7 Places
M8 Remaining/stationary

N1 Numbers
N2 Mathematics
N3 Measurement
N3.1 Measurement: General
N3.2 Measurement: Size
N3.3 Measurement: Distance
N3.4 Measurement: Volume
N3.5 Measurement: Weight
N3.6 Measurement: Area
N3.7 Measurement: Length & height
N3.8 Measurement: Speed
N4 Linear order
N5 Quantities
N5.1 Entirety; maximum
N5.2 Exceeding; waste
N6 Frequency etc.

O1 Substances and materials generally
O1.1 Substances and materials generally: Solid
O1.2 Substances and materials generally: Liquid
O1.3 Substances and materials generally: Gas
O2 Objects generally
O3 Electricity and electrical equipment
O4 Physical attributes
O4.1 General appearance and physical properties
O4.2 Judgement of appearance (pretty etc.)
O4.3 Colour and colour patterns
O4.4 Shape
O4.5 Texture
O4.6 Temperature

P1 Education in general

Q1 LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION
Q1.1 LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION
Q1.2 Paper documents and writing
Q1.3 Telecommunications
Q2 Speech acts
Q2.1 Speech etc:- Communicative
Q2.2 Speech acts
Q3 Language, speech and grammar
Q4 The Media
Q4.1 The Media:- Books
Q4.2 The Media:- Newspapers etc.
Q4.3 The Media:- TV, Radio and Cinema

S1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1.1 SOCIAL ACTIONS, STATES AND PROCESSES
S1.1.2 Reciprocity
S1.1.3 Participation
S1.1.4 Deserve etc.
S1.2 Personality traits
S1.2.1 Approachability and Friendliness
S1.2.2 Avarice
S1.2.3 Egoism
S1.2.4 Politeness
S1.2.5 Toughness; strong/weak
S1.2.6 Sensible
S2 People
S2.1 People:- Female
S2.2 People:- Male
S3 Relationship
S3.1 Relationship: General
S3.2 Relationship: Intimate/sexual
S4 Kin
S5 Groups and affiliation
S6 Obligation and necessity
S7 Power relationship
S7.1 Power, organizing
S7.2 Respect
S7.3 Competition
S7.4 Permission
S8 Helping/hindering
S9 Religion and the supernatural

T1 Time
T1.1 Time: General
T1.1.1 Time: General: Past
T1.1.2 Time: General: Present; simultaneous
T1.1.3 Time: General: Future
T1.2 Time: Momentary
T1.3 Time: Period
T2 Time: Beginning and ending
T3 Time: Old, new and young; age
T4 Time: Early/late

W1 The universe
W2 Light
W3 Geographical terms
W4 Weather
W5 Green issues

X1 PSYCHOLOGICAL ACTIONS, STATES AND PROCESSES
X2 Mental actions and processes
X2.1 Thought, belief
X2.2 Knowledge
X2.3 Learn
X2.4 Investigate, examine, test, search
X2.5 Understand
X2.6 Expect
X3 Sensory
X3.1 Sensory:- Taste
X3.2 Sensory:- Sound
X3.3 Sensory:- Touch
X3.4 Sensory:- Sight
X3.5 Sensory:- Smell
X4 Mental object
X4.1 Mental object:- Conceptual object
X4.2 Mental object:- Means, method
X5 Attention
X5.1 Attention
X5.2 Interest/boredom/excited/energetic
X6 Deciding
X7 Wanting; planning; choosing
X8 Trying
X9 Ability
X9.1 Ability:- Ability, intelligence
X9.2 Ability:- Success and failure

Y1 Science and technology in general
Y2 Information technology and computing

Z0 Unmatched proper noun
Z1 Personal names
Z2 Geographical names
Z3 Other proper names
Z4 Discourse Bin
Z5 Grammatical bin
Z6 Negative
Z7 If
Z8 Pronouns etc.
Z9 Trash can
Z99 Unmatched

Tags consist of an upper-case letter, which indicates the major discourse field, followed by a digit, which indicates the first subdivision of that field. For example, the major semantic field of government and the public domain is designated by the letter G.

This major field has three subdivisions:

• government, politics and elections – G1;

• crime, law and order – G2;

• warfare, defence and the army – G3.

The first subdivision (G1) is further divided:

• government etc. – G1.1;

• politics – G1.2.

Automated Annotation: USAS

G - Government and the public domain

• G1 Government, politics and elections

– G1.1 Government, etc.

– G1.2 Politics

• G2 Crime, law and order

• G3 War, defence and the army: weapons
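The hierarchy is encoded in the tag itself, so a tag can be taken apart mechanically. The sketch below is an assumption-laden illustration: the pattern is inferred from the tags shown on these slides (a letter, a number, optional dotted subdivisions, and the optional trailing + or - markers that appear in USAS output), and the function names are hypothetical rather than part of USAS.

```python
import re

# Pattern assumed from the slide examples: one upper-case letter, a number,
# optional ".number" subdivisions, optional trailing +/- markers.
TAG_RE = re.compile(r"^([A-Z])(\d+)((?:\.\d+)*)([+-]*)$")

def parse_usas(tag):
    """Break a tag such as 'G1.2' into field letter, levels and markers."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a recognised tag: {tag!r}")
    letter, first, rest, markers = match.groups()
    levels = [int(first)] + [int(n) for n in rest.split(".") if n]
    return {"field": letter, "levels": levels, "markers": markers}

def ancestors(tag):
    """Return the path from the major field down to the tag itself."""
    parsed = parse_usas(tag)
    path = [parsed["field"], f'{parsed["field"]}{parsed["levels"][0]}']
    for n in parsed["levels"][1:]:
        path.append(f"{path[-1]}.{n}")
    return path

print(ancestors("G1.2"))  # ['G', 'G1', 'G1.2']
```

Grouping tags by their first letter, or by the first two elements of the path, gives counts at coarser levels of the hierarchy.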

Automated Annotation: USAS

USAS

I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose , but even that was sweet in a way

Automated Annotation: USAS

USAS

I_Z8 liked_E2+ him_Z8 ,_PUNC and_Z5 he_Z8 was_A3+ different_A6.1- from_Z5 other_A6.1- boys_S2.2 ,_PUNC not_Z6 at_Z6 all_Z6 pushy_S1.2.3+ ,_PUNC except_Z5 pushy_S1.2.3+ to_Z5 please_E4.2+ I_Z8 suppose_X2.1 ,_PUNC but_Z5 even_A13.1 that_Z8 was_A3+ sweet_X3.1 in_A13.4 a_A13.4 way_A13.4
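A small sketch of how semantically tagged output like this might be summarised by major discourse field. The spacing around punctuation has been normalised so that every token is separated by a space; the variable names are illustrative and this is not how any particular tool is known to do it.

```python
from collections import Counter

# The USAS-tagged sentence, with a space before each punctuation token.
tagged = (
    "I_Z8 liked_E2+ him_Z8 ,_PUNC and_Z5 he_Z8 was_A3+ different_A6.1- "
    "from_Z5 other_A6.1- boys_S2.2 ,_PUNC not_Z6 at_Z6 all_Z6 "
    "pushy_S1.2.3+ ,_PUNC except_Z5 pushy_S1.2.3+ to_Z5 please_E4.2+ "
    "I_Z8 suppose_X2.1 ,_PUNC but_Z5 even_A13.1 that_Z8 was_A3+ "
    "sweet_X3.1 in_A13.4 a_A13.4 way_A13.4"
)

pairs = [tuple(token.rsplit("_", 1)) for token in tagged.split()]

# Count words per major discourse field (the first letter of each tag),
# leaving punctuation out of the count.
field_freq = Counter(tag[0] for _, tag in pairs if tag != "PUNC")

for field, count in field_freq.most_common():
    print(field, count)
```

Here Z (names and grammar) dominates, as expected for a sentence full of pronouns and function words. Note also that sweet receives X3.1 (Sensory: Taste), which seems doubtful in this context and is a reminder that automatic tags need checking.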

For and against tagging

Leech (2005) - annotation adds value.

Sinclair (2004) - annotation = ‘perilous activity’

• Leech (2005): “annotation is a means to make a corpus much more useful — an enrichment of the original raw corpus.”

“adding annotation to a corpus is giving 'added value', which can be used for research by the individual or team that carried out the annotation, but which can also be passed on to others who may find it useful for their own purposes.”

For tagging

Against tagging

• Sinclair (2004):

“The interspersing of tags in a language text is a perilous activity, because the text thereby loses integrity…”

– ‘Current Issues in Corpus Linguistics’ (Sinclair 2004: 191)

OK – but you can always have two copies of the corpus: tagged and untagged. And many corpus tools will ignore tags if asked.
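If the tags are XML-style, the untagged copy can be regenerated from the tagged one at any time. A minimal sketch using Python's standard xml.etree.ElementTree; the shortened excerpt, the <p> wrapper and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened, tagged excerpt in the style of the earlier modality example.
tagged = (
    '<p>it <mod type="d">should</mod> be free unto them, '
    'it <mod type="e">would</mod> be time enough</p>'
)

def strip_tags(xml_text):
    """Return the text of an XML fragment with every tag removed."""
    # itertext() walks the element tree and yields only the text content.
    return "".join(ET.fromstring(xml_text).itertext())

untagged = strip_tags(tagged)
print(untagged)
```

The tagged file stays the master copy, and the raw text is derived from it whenever it is needed.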

Against tagging

• Sinclair (2004):

“..one cosy consequence of using tagged text is that the description which produces the tags in the first place is not challenged – it is protected. The corpus data can only be observed through the tags; that is to say, anything the tags are not sensitive to will be missed”

OK – but also remember that tagging preserves your analysis for others to see and check. The act of tagging can also test out a description (model).

• Sinclair (2004):

“In corpus-driven linguistics you do not use pre-tagged text, but you process the raw text directly and then patterns of this uncontaminated text are able to be observed.”

Against tagging

OK – but you can always have two copies of the corpus: tagged and untagged. So you can do both sorts of approaches.

• There are types of searches and analysis that you cannot do without annotation:

– Advanced searches for more fine grained, or complex analyses, based on tags, or tags and words;

• e.g. find noun phrases with a single premodifying adjective.

– search for grammatical and other patterns (depending on what tags are added);

– and do quantitative analysis at other levels of abstraction (e.g. semantic analysis).
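As an illustration of a tag-based search, the sketch below looks for a noun preceded by exactly one adjective in C7-tagged text. It is a deliberate simplification of "noun phrase": it assumes that adjective tags begin with JJ and noun tags with NN, and the function name is hypothetical.

```python
def single_adj_noun(pairs):
    """Find adjective + noun pairs where only one adjective precedes the noun."""
    hits = []
    for i in range(1, len(pairs)):
        word, tag = pairs[i]
        prev_word, prev_tag = pairs[i - 1]
        # Tag two places back, or an empty string at the start of the text.
        before_prev = pairs[i - 2][1] if i >= 2 else ""
        if (tag.startswith("NN")
                and prev_tag.startswith("JJ")
                and not before_prev.startswith("JJ")):
            hits.append((prev_word, word))
    return hits

# Part of the CLAWS-tagged example sentence.
tagged = ("different_JJ from_II other_JJ boys_NN2 ,_, "
          "not_XX at_RR21 all_RR22 pushy_JJ")
pairs = [tuple(token.rsplit("_", 1)) for token in tagged.split()]

print(single_adj_noun(pairs))  # [('other', 'boys')]
```

A search like this is not possible on raw text, because nothing in the words themselves says which ones are adjectives or nouns.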

Why tag?

• Practical reasons why annotation is useful:

– Disambiguation – telling apart words that have the same spelling but different meanings.

• Depending on the corpus and how it is tagged, we can, for example:

– compare language use within different sections of corpora,

– or compare genres, or sections within genres.

Why tag?

• Manual examination of corpus:

– preserves your analysis; objective record of analysis;

– tests out models (account for all data).

• Reusability of annotations:

– Others may re-use, test out, or improve your analysis.

• Multi-functionality:

– Others may re-use your analysis to answer different research questions (RQs).

Why tag?

• However, certain reservations must be kept in mind:

– it might be risky to use someone’s analysis uncritically;

– we need to be aware of the annotation scheme used and any problems it might have;

– the annotation in a corpus might not be perfectly accurate;

– automatic tagging is subject to the limits of the software (% accuracy?).

Issues?

• Is there a clear dividing line between mark-up (representation of the text) and annotation (analysis of the text; metalinguistic info)?

• Any additions to the raw data in a corpus could be said to be acts of interpretation.

• For example, genre classification, deciding what is a sentence boundary …

Mark-up vs. annotation?

Mark-up vs. annotation?

• “No!” she exclaimed, “it can’t be!”

Summary

• Annotation can help to disambiguate elements in our corpus.

• It can make possible searches and frequency counts at higher levels of abstraction than just the word level.

• It can create a common basis of analysis for everyone to use – and to inspect.

• It can open up new possibilities for corpus analysis.

Remember ....

The computer can’t do it all for us – we still have to analyse the results and ask ...

‘What does it all mean?’

References

Leech, G. 2005 ‘Adding Linguistic Annotation’, in M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice (Oxford: Oxbow Books), pp. 17-29 [http://ahds.ac.uk/linguistic-corpora/]

Sinclair, J. 2004 Trust the Text: Language, Corpus and Discourse (London: Routledge)

CLAWS

• Web-based corpus tool:

http://ucrel.lancs.ac.uk/claws/trial.html

Test out the tagger with some text.

WMatrix

• Web-based corpus tool.

• Developed by Paul Rayson at Lancaster University.

• Server located at Lancaster.

WMatrix

You can use Wmatrix to produce:

• Word frequency lists – the standard stuff

• Grammatical frequency lists – based on the grammatical class of each word

• Semantic frequency lists – based on the semantic category of each word

WMatrix

• Using a web interface:

– Texts are uploaded onto the Wmatrix server (at Lancaster)

– The upload procedure automatically adds

• (i) Grammatical or Parts of Speech (POS) tags; and

• (ii) semantic tags

WMatrix

• CLAWS parts of speech (POS) tagger

• USAS semantic tagger.

– CLAWS = Constituent Likelihood Automatic Word-tagging System (96-97% accurate)

– USAS = UCREL Semantic Analysis System (91% accurate)

• (UCREL = University Centre for Computer Corpus Research on Language)

WMatrix

Allows analysis of texts at :

– the word level

– the grammatical level (POS)

– and the semantic level

WMatrix

Allows text comparison at:

– the word level

– the grammatical level (POS)

– and the semantic level

WMatrix

Keyness:

• A concept made popular by Mike Scott

• WordSmith Tools (Scott 1996, 1999, 2007) has a ‘Keyword’ facility which has been used successfully in many studies

WMatrix

Keyness:

• Compare word frequency lists from Text A with those from Text B

• Apply a statistical test

• Find words that are over-used (or under-used) in Text A to a statistically significant degree, i.e. not just by chance. These are the key words.

WMatrix

Keyness:

• Word level – Key words

• Grammatical level – Key POS

• Semantic level – Key concepts

– Wmatrix uses log-likelihood to calculate Keyness
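A sketch of the two-text log-likelihood statistic as it is commonly calculated for keyness. The slides say only that Wmatrix uses log-likelihood, so treating this as the exact calculation Wmatrix performs is an assumption; the function name and the counts in the example are made up for illustration.

```python
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood for one item occurring freq_a times in Text A
    (total_a words) and freq_b times in Text B (total_b words)."""
    # Expected frequencies if the item were equally common in both texts.
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Made-up counts: an item occurs 20 times in 1,000 words of Text A
# and 5 times in 1,000 words of Text B.
score = log_likelihood(20, 1000, 5, 1000)
print(round(score, 2))
```

The higher the score, the less likely the difference is due to chance; a value above 3.84 corresponds to p < 0.05 with one degree of freedom. The same function works for a word, a POS tag or a semantic tag, since only the counts change.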

WMatrix

• Semantic level analysis potentially useful

• Capture themes or identify important sections of a text that might have been missed by key word analysis alone

WMatrix

Note:

– The grammatical and semantic quantifications performed by Wmatrix are still word-based

– Wmatrix counts the word forms within each grammatical or semantic grouping based on the tags it automatically assigns

– So it is the number of words within a group that decides the group's ranking.

WMatrix

Next >>>

practical exercises using Wmatrix.