Text Analysis
Khurshid Ahmad, Professor of Computer Science, Department of Computer Science
Trinity College Dublin-2, IRELAND
PREAMBLE: MODELS FOR TEXT TECHNOLOGY?
Empirical Models: Distribution of linguistic patterns (words, phrases, sentences); collocation; semantic prosody; synchronic/diachronic studies
Psychological/Observational Models: Distribution of conceptual categories; acquisition; degeneration; diachronic studies
Intuitive Models: Distribution of grammatical categories; constituency; governance; synchronic studies
CORPUS LINGUISTICS
The aim of corpus linguistics is ‘to base accounts of language on corpora derived from systematic recordings of conversations and real discourse of other kinds, as opposed to examples obtained by introspection, by judgement of grammarians, or by haphazard observation’; and a corpus is defined ‘as any systematic collection of speech or writing in a language or variety of a language’ (Matthews 1997:78).
Matthews, P. H. (1997). Oxford Concise Dictionary of Linguistics. Oxford & New York: Oxford University Press.
Representative Corpora: The BNC
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written.
http://www.natcorp.ox.ac.uk/corpus/index.xml
The BNC is: Sample; General; Synchronic; Monolingual
Representative Corpora: The BNC
The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.
http://www.natcorp.ox.ac.uk/corpus/index.xml
Characteristic | The British National Corpus
Sample | COMPRISES written sources: (a) samples of 45,000 words are taken from various parts of single-author texts; (b) shorter texts up to a maximum of 45,000 words, or multi-author texts such as magazines and newspapers, are included in full. Sampling allows for a wider coverage of texts within the 100 million limit, and avoids over-representing idiosyncratic texts.
General | INCLUDES many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.
Synchronic | COVERS British English of the late twentieth century, rather than the historical development which produced it.
Monolingual | DEALS with modern British English, not other languages used in Britain. However, non-British English and foreign-language words do occur in the corpus.
Representative Corpora: The BNC
http://www.natcorp.ox.ac.uk/corpus/index.xml
Distribution of the first 100 most frequent tokens in the BNC according to the cumulative frequency of ten tokens at a time.
Token | Cumulative Relative Frequency | No. of OCW
the, of, and, a, in, to, it, is, was, to | 21.28% | 0
i, for, you, he, be, with, on, that, by, at | 6.66% | 0
are, not, this, but, 's, they, his, from, had, she | 4.35% | 0
which, or, we, an, n't, 's, were, that, been, have | 3.25% | 0
their, has, would, what, will, there, if, can, all, her | 2.42% | 0
as, who, have, do, that, one, said, them, some, could | 1.90% | 0
him, into, its, then, two, when, up, time, my, out | 1.57% | 1
so, did, about, your, now, me, no, more, other, just | 1.37% | 0
these, also, people, any, first, only, new, may, very, should | 1.18% | 1
as, like, her, than, as, how, well, way, our, as | 1.02% | 0
TOTAL | 45.00% | 2
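A table of this kind can be approximated for any text with a few lines of Python. The sketch below is only illustrative: the function name `decile_shares` and the regular-expression tokeniser are assumptions of this example, not the procedure used to build the BNC figures (the BNC splits clitics such as n't and 's into separate tokens, which this tokeniser does not).

```python
import re
from collections import Counter


def decile_shares(text, group_size=10, groups=10):
    """Return (tokens, share) for successive groups of the most frequent tokens.

    share is the fraction of all running tokens accounted for by the group.
    """
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    ranked = Counter(tokens).most_common(group_size * groups)
    rows = []
    for start in range(0, len(ranked), group_size):
        group = ranked[start:start + group_size]
        share = sum(count for _, count in group) / total
        rows.append(([tok for tok, _ in group], share))
    return rows


if __name__ == "__main__":
    sample = "the atom and the nucleus of the atom"
    for toks, share in decile_shares(sample, group_size=2, groups=2):
        print(", ".join(toks), f"{share:.2%}")
```

Summing the shares of the first ten groups gives the kind of total reported in the last row of the table.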
Representative Corpora: The BNC
http://www.natcorp.ox.ac.uk/corpus/index.xml
Distribution of the first 200 most frequent tokens in the BNC according to the cumulative frequency of ten tokens at a time.
Token | Cumulative Relative Frequency | No. of OCW
between, years, er, many, those, there, 've, being, because, do | 0.88% | 1
're, yeah, three, down, such, back, good, where, year, through | 0.77% | 4
'll, must, still, even, know, too, here, get, own, does | 0.70% | 1
oh, last, no, more, 'm, going, so, erm, after, us | 0.65% | 1
government, might, same, much, see, yes, go, make, day, man | 0.60% | 4
another, world, see, got, work, however, life, again, against, think | 0.57% | 4
never, under, one, most, old, over, know, something, mr, take | 0.54% | 2
why, each, while, part, on, number, out_of, made, different, really | 0.49% | 3
went, ', came, after, children, always, four, without, one, within | 0.46% | 3
system, local, during, most, although, next, small, case, great, things | 0.43% | 6
TOTAL | 6.09% | 29
LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS
Interpretament
Texts are responses to previous texts, and are themselves responded to in turn; the cycle continues, hence the diachronic dimension.
LANGUAGE AS A SYSTEM: The moonlighting terms - Lexicogenesis?
Year | The orthodoxy
1477 | An atom is a hypothetical body, so small as to be incapable of further division; and thus to be one of the ultimate particles of nature.
1650 | Physical Atoms: The supposed ultimate particles in which matter actually exists (without reference to its stability).
1819 | Chemical Atoms: The smallest particles in which the elements combine, or are known to possess the properties of a particular element.
Nye, M.J. (1986). The Question of the Atom: From the Karlsruhe Congress to the 1st Solvay Congress. A compilation of primary sources. Los Angeles: Tomash Publishers.
Year | The orthodoxy
1899 | Thomson's atomic structure based on Aepinus's one-fluid theory of electricity
1904 | Nagaoka's 'Saturnian' atom, reinterpreting Maxwell's observations about the planet
1906 | Rayleigh's infinite electron atom: an elaboration of Thomson's atomic structure
1909 | Rutherford's 'nucleus' theory: an experimentalist interpreting Nagaoka's and Crookes' observations
1913 | Bohr's theory of atomic structure: the one great fan of Rutherford's scattering experiments
Conn, G.K.T. and Turner, H.D. (1965). The Evolution of the Nuclear Atom. London: Iliffe Books Ltd; New York: American Elsevier Pub. Co.
Languages are constantly in flux
The corpus linguist explores the discourse as a system that can be explained without referring to a discourse external reality or to the mental state of the members of the discourse community.
Teubert, Wolfgang (2003). Writing, hermeneutics and corpus linguistics. Logos and Language Vol. IV (no. 2), pp. 1-17.
LANGUAGE AS A SYSTEM: INPUTS & OUTPUTS
Interpretament
Where will you find the evidence of use, definition, and elaboration of terms like:
• inclusive learning environment (e-Learning)
• Borromean Halo Nuclei (Radioactive Nuclear Beam Physics)
• honeycombed catalytic converter (Automotive Engineering)
• individualist weak supervenience (Philosophy of Science)
• indoor blood videotaping (Forensic Science)
EXCEPT IN A TEXT CORPUS?
Language as a System: The moonlighting terms - Lexicogenesis?
Term/'Concept' | The old 'truth' | The new 'truth'
Motion: Objects move because of | an in-built tendency to move (Aristotle) | something exerts 'attraction' (Galileo)
Solar Cycle: Sunrise is caused by | a rising Sun (Brahe) | a turning earth (Kepler)
Combustion: The burning of an object means | that the mass of the object decreases by losing phlogiston to air (Priestley) | the mass of the object increases by gaining oxygen from air (Lavoisier)
Heartbeat: Blood circulation is caused by | an explosion during diastole of the heart (Descartes) | a compression during systole of the heart (Harvey)
Species: The distinction between species | an absolute phenomenon that has been determined in the past (Linnaeus) | a contemporaneous phenomenon with borders between the species (Darwin)
Verschuuren, G. M. N. (1986). Investigating the Life Sciences: An Introduction to the Philosophy of Science. Oxford: Pergamon Press.
LANGUAGE & CHANGE
DEVELOPMENT OF CONCEPTS: ATOM
I. In philosophical and scientific use.
In senses 2 and 3 now generally held to consist of a positively charged nucleus, in which is concentrated most of the mass of the atom, and round which orbit negatively charged electrons.
1. A hypothetical body, so infinitely small as to be incapable of further division; and thus held to be one of the ultimate particles of matter, by the concourse of which, according to Leucippus and Democritus, the universe was formed.
2. In Nat. Phil. physical atoms: the supposed
ultimate particles in which matter actually exists
(without reference to their divisibility or the
contrary), aggregates of which held in their places
by molecular forces, constitute all material bodies.
3. chemical atoms: a. The smallest particles
in which the elements combine either with
themselves, or with each other, and thus the
smallest quantity of matter known to possess the
properties of a particular element. b. The smallest
quantity in which a group of elements, called a
radical, forms a compound corresponding to one
formed by a simple element, or behaves like an
element; thus the smallest known quantity of a
chemical compound.
Entry printed from Oxford English Dictionary Online © Oxford University Press 2001
II. In popular use.
4. From sense 1, as the nearest popular conception to the
atoms of the philosophers: One of the particles of dust which
are rendered visible by light; a mote in the sunbeam. arch. or
Obs.
1784 COWPER Task I. 361 The rustling straw sends up a frequent
mist of atoms. 1821 BYRON Two Foscari III. i, Moted rays of light
Peopled with dusty atoms.
5. The smallest conceivable portion or fragment of
anything; a very minute portion or quantity, a particle, a
jot: a. of matter.
c1630 DRUMMOND OF HAWTHORNDEN Poems (1633) 166 Like tinder
when flints atoms on it fall. 1644 DIGBY Nat. Bodies vi. (1658) 54 Little attoms
of oyl..ascend apace up the week of a burning candle. 1835 SIR J. ROSS N.-W.
Pass. xxxiv. 477 There was not an atom of water.
b. of things immaterial. logical atom: one of the essential
and indivisible elements into which some philosophers hold
that statements can be analysed.
1873 C. S. PEIRCE in Mem. Amer. Acad. Arts & Sci. IX. II. 343 The
logical atom, or term not capable of logical division, must be one of
which every predicate may be universally affirmed or denied... 1918
[see ATOMISM 1b]. 1958 G. J. WARNOCK Eng. Philos. since 1900 v.
54 Russell's world of indefinitely numerous, independent logical atoms
is the metaphysical opposite of Bradley's Absolute.
LANGUAGE & CHANGE
DEVELOPMENT OF CONCEPTS: NUCLEUS
Pl. nuclei and nucleuses. [a. L. nucleus (nuculeus) kernel, inner part, f. nucula or nuc-, nux nut. So F. nucleus, It., Sp., and Pg. nucleo.]
I. 1. Astr. a. The more condensed portion of the head of a comet.
b. A more condensed, usu. brighter, central part of a
galaxy or nebula.
2. A supposed interior crust of the earth. Obs.
3. A central part or thing around which other parts or things are grouped, collected,
or compacted; that which forms the centre or kernel of some aggregate or mass.
a. Of material (esp. more or less solid) things.
b. Of communities or groups of persons.
c. Of immaterial things.
d. Of places, buildings, etc.
e. Of collections of things.
4. Archæol. A block of flint or other stone from which early implements have been made.
Entry printed from Oxford English Dictionary Online © Oxford University Press 2001
II. 5. Botany a. The kernel of a nut. Now rare or Obs.
b. The kernel of a seed (see quots.).
c. The central part of an ovule.
d. In Lichens: (see quot. 1832).
e. In Fungi: (see quots.).
f. The hilum of a starch-granule.
6. a. The rudiments of the shell in certain molluscs.
b. Any discrete mass of grey matter in the central nervous system.
The term is used in numerous English and mod.L. combs.
distinguishing the various different nuclei.
7. Biol. A cell organelle present in most of the cells of all organisms except
the most primitive, usu. as a single subspherical structure, and consisting
(except when undergoing division) of a membrane enclosing a ground
substance (the nuclear sap) in which lie the chromosomes, one or more
nucleoli, etc., and functioning as the repository of genetic information and
as the director of metabolic and synthetic activity of the cell.
8. Chem. An arrangement of atoms, esp. a ring structure,
characteristic of a number of organic compounds.
9. A particle on which crystals, droplets, or bubbles can
form in a fluid.
10. A small group of bees, including a queen, used as the
foundation of a new colony.
KNOWLEDGE & CHANGE
DEVELOPMENT OF CONCEPTS: NUCLEUS
11. Physics. The positively charged central constituent of the atom, comprising nearly all its mass but occupying only a very small part of its volume and now known to be composed of protons and neutrons.
In Rutherford's 1911 paper called merely a ‘central charge’. In the examples in the first paragraph nucleus is used for various speculative notions concerning the atom.
12. a. Phonetics. The syllable of a word (spoken in isolation)
that bears the primary accent; in an utterance, the syllable or
syllables given particular emphasis.
b. Linguistics. The main word or words in a combination, phrase,
or sentence; also = KERNEL n.1 8b.
Hence nucleus v. trans., to make into a nucleus, to concentrate.
1899 KIPLING Stalky 252 They'd withdrawn all the troops they could, but I
nucleused about forty Pathans.
LEXICAL SIGNATURE?
– The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
– The same concept may be referred to by different names;
– The frequency of words in a text carries a signature: if the text is specialist, then a select few terms are used repeatedly;
– Everyday, general language texts seldom carry a signature.
TEXT, TEXTURE, TEXTUALITY
Etymologically, text comes from a metaphorical use of the Latin verb texere (to weave), suggesting a sequence of sentences or utterances 'interwoven' structurally and semantically.
A text can be regarded as a sequential collection of sentences or utterances which form a UNITY by reason of their linguistic COHESION and semantic COHERENCE. However, it is possible for a text to comprise a single sentence, e.g. a road sign.
TEXT, TEXTURE, TEXTUALITY
New definitions appended from the OED 1993
text, n.1
Add: [1.] e. Short for TEXT-BOOK n. 2.
[2.] d. Linguistics. (A unit of) connected
discourse whose function is communicative
and which forms the object of analysis and
description. Cf. text-frequency, linguistics
TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT
Cohesion refers to the means by which sentences in a text are linked to each other to form a paragraph. This linkage leads to larger units as well: paragraphs in a chapter, chapters in a book. The sentences are made to stick together.
The sentences themselves are words stuck together through the use of and, but, not and so on. Sometimes we use he, she, it, they ... to make sentences stick together. At other times we repeat words: the same word, words related to the word, sounds related to the word, or substitutes for a word.
TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT
Two kinds of words which glue a text:
GRAMMATICAL WORDS;
Conjunctions: and, but, if, then
Pronouns: he, she, it, they, them
Prepositions: on, in, of
Modal verbs: verbs to be
LEXICAL WORDS (Repetition)
Nouns: Names of person, place and things
Adjectives: words that qualify a noun
Adverbs: Modifier of a verb
INFLECTIONS & DERIVATIONS:
Markers for plurals (car → cars)
Nominalisation: nouns from verbs (react → reaction)
TEXT, TEXTURE, TEXTUALITY: COHESION IN TEXT
LEXICAL WORDS (Repetition)
Simple repetition (the word form + plurals):
Inflection: reaction → reactions
Complex repetition:
Derivation: react → reaction; reactant; motor → motoring; crime → criminal
Paraphrase: Genus/Species/Instance:
Electrons/Protons → Particles → Building blocks of the Universe
Compounding: {stable, unstable, trans-Uranic, halo, compound, ..} + nucleus
forensic + {analysis, laboratory, technician, science…}
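The distinction between simple and complex repetition can be illustrated with a deliberately crude heuristic. In the sketch below, the function name `repetition_type`, the plural test (adding "s") and the shared-prefix threshold `min_stem` are all assumptions made for illustration; a real analysis would use a morphological analyser or stemmer, and a shared prefix will wrongly link unrelated words such as "station" and "static".

```python
def repetition_type(a, b, min_stem=4):
    """Classify the lexical link between two word forms.

    Returns "simple" for an identical form or a plural in -s,
    "complex" when the two forms share a long enough prefix
    (a rough stand-in for derivation), and None otherwise.
    """
    a, b = a.lower(), b.lower()
    if a == b or a + "s" == b or b + "s" == a:
        return "simple"
    # Count how many leading characters the two forms have in common.
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return "complex" if shared >= min_stem else None


if __name__ == "__main__":
    for pair in [("reaction", "reactions"), ("react", "reaction"),
                 ("motor", "motoring"), ("crime", "criminal"),
                 ("electron", "particle")]:
        print(pair, repetition_type(*pair))
```

Paraphrase links such as electron → particle are not recoverable from the word forms alone, which is why the last pair is left unclassified here.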
TEXT, TEXTURE, TEXTUALITY: COHERENCE IN TEXT
If a text makes sense, then we can identify it as such.
Sentences may be connected together because they
refer to the same person, place, event or thing. The
connectivity, or sticking together, is provided by the
content or the meaning.
Coherence can be understood better if one looks at literary texts: terms like plot, narrative, and narration are often used to describe the unity of a given literary text.
TEXT, TEXTURE, TEXTUALITY
Textuality is a term used to denote the various standards that a text (a collection of linguistic units) should have in order to be regarded as a text.
There are many features, cohesion and coherence being the most prominent. The authors and the readers typically have a plan or purpose when they respectively write and read a text: this is called intentionality. Acceptability is a standard which refers to the possible use a text may have for its readers. A text is generally expected to comprise new information (informativity). A text is typically related to other texts and the readers of a text usually expect it to be the case (intertextuality). A text is expected to have relevance to the context (situationality).
KNOWLEDGE & COMMUNICATION
Communication is, broadly, the process of exchanging information or messages; human language, in speech and writing, is the most significant and most complex communication system.
• A human language-based communications system is comparable to a machine-based (e.g. computer-based) communications system: in 1949, Shannon and Weaver introduced an elegant theory of communication. Messages in Shannon and Weaver's system are transmitted as signals from transmitter or sender to receiver via the medium of speech, for example, along the channel of sound waves.
• The human transmitter, however, is (usually) also the creator of the message; and what is communicated may be not only factual, or even verbal, but also attitudinal, social or cultural information. Indeed, humans can communicate with each other when they are silent.
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION
• Language can be viewed as 'a communicative
process based on knowledge. Generally when
humans use language, the producer and
comprehender are processing information, making
use of their knowledge of the language and of the
topics of conversation. Language is a process of
communication between intelligent active
processors, in which both the producer and the
comprehender(s) perform complex cognitive tasks.
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION
The producer has communicative goals, including
effects to be achieved,
information to be conveyed, and
attitudes to be expressed
• The comprehender attempts to understand the meaning of the producer's communicative goals:
by reacting (verbally or non-verbally),
by inferring new information,
by updating existing data about processes or devices,
by focusing attention on something or some of its properties, or
by preparing for subsequent utterances of the producer
KNOWLEDGE & (Co-OPERATIVE) COMMUNICATION
[Figure: A model of co-operative communication. A Producer (current goals; cognitive processing; knowledge base comprising knowledge of the language, of the situation and of the world) and a Comprehender (understood meaning; cognitive processing; knowledge base comprising knowledge of the language, of the situation and of the world) are connected through a Medium (speech or writing).]
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
A. Input corpus_GL /* a general language corpus comprising N_GL individual words */
B. Input corpus_SL /* a corpus of specialist texts comprising N_SL individual words */
C. Conduct a univariate analysis of the contrastive distribution of linguistic tokens in the two corpora: extract terminology, ontology
D. Conduct a multivariate analysis of the tokens within specialist texts to find keywords by the extent to which each keyword accounts for the variance in texts
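Step C can be sketched as a ratio of relative frequencies: how much more often a token occurs in the specialist corpus than in the general one. The code below is a minimal illustration and not the author's implementation; the function name `weirdness` and the add-one smoothing used for tokens missing from the general corpus are assumptions of this example.

```python
from collections import Counter


def weirdness(special_tokens, general_tokens):
    """Ratio of relative frequency in the specialist corpus to that in the
    general-language corpus, for every token of the specialist corpus."""
    f_sl, f_gl = Counter(special_tokens), Counter(general_tokens)
    n_sl, n_gl = len(special_tokens), len(general_tokens)
    scores = {}
    for word, freq in f_sl.items():
        rel_sl = freq / n_sl
        # Add-one smoothing (an assumption) avoids division by zero for
        # tokens that never occur in the general corpus.
        rel_gl = (f_gl[word] + 1) / (n_gl + 1)
        scores[word] = rel_sl / rel_gl
    return scores


if __name__ == "__main__":
    special = "the nucleus of the atom and the charge of the nucleus".split()
    general = "the cat sat on the mat and the dog sat on the rug".split()
    ranked = sorted(weirdness(special, general).items(),
                    key=lambda item: item[1], reverse=True)
    for word, score in ranked:
        print(f"{word:10s} {score:.2f}")
```

Tokens with a high ratio are candidate terms; closed-class words such as "the" score low because they are frequent in both corpora.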
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
UNIVARIATE ANALYSIS
• Input corpus_GL /* a general language corpus comprising N_GL individual words */
• Input corpus_SL /* a corpus of specialist texts comprising N_SL individual words */
• Contrast the distribution of words in corpus_GL and corpus_SL
• Select single words based on z-scores for relative frequency and weirdness (equivalent to tf-idf)
• Find collocation patterns for selected single words;
• Find hyponymic patterns using textual markers;
• Construct a local grammar using collocation and hyponymic links
• Generate a recursive transition network based on local grammars.
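The step of finding hyponymic patterns using textual markers can be illustrated with a single marker, "such as". This is a sketch under stated assumptions: the slides do not list the markers used, so the choice of "such as", the regular expression `SUCH_AS` and the function name `hyponym_pairs` are all hypothetical, and the hypernym is naively taken to be the single word before the marker.

```python
import re

# One textual marker only: "<hypernym> such as <item>, <item> and <item>".
SUCH_AS = re.compile(r"(\w+),? such as ([\w ,]+?)(?:[.;]|$)")
# Separators inside the enumeration: commas (optionally followed by and/or),
# or a bare "and"/"or".
SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs signalled by 'such as'."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1).lower()
        for item in SEPARATOR.split(match.group(2)):
            item = item.strip().lower()
            if item:
                pairs.append((item, hypernym))
    return pairs


if __name__ == "__main__":
    print(hyponym_pairs("Particles such as electrons, protons and neutrons."))
```

Pairs collected this way could feed the local grammar mentioned in the list; multi-word hypernyms and other markers would need further patterns.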
SPECIAL LANGUAGE
• The special language of focussed, single-minded pursuits: science, technology, sports, politics, philosophy, ...
• A natural language privileges persons; in contrast, the “splinter of ordinary language” that we call [specialised] scientific discourse privileges a world of objects, processes, happenings, events.
• The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
• The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events.
•The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
GENERAL LANGUAGE
The Trial Franz Kafka (1916)
Chapter One Arrest - Conversation with Mrs. Grubach - Then Miss Bürstner
Someone must have been telling lies about Josef K., he knew he had done nothing wrong
but, one morning, he was arrested. Every day at eight in the morning he was brought his
breakfast by Mrs. Grubach's cook -Mrs. Grubach was his landlady - but today she didn't come. That had never happened before. K. waited a little while, looked from his pillow at the old woman who lived opposite and who was watching him with an inquisitiveness
quite unusual for her, and finally, both hungry and disconcerted, rang the bell. There was
immediately a knock at the door and a man entered. He had never seen the man in this
house before. He was slim but firmly built, his clothes were black and close-fitting, with many folds and pockets, buckles and buttons and a belt, all of which gave the impression of
being very practical but without making it very clear what they were actually for. "Who are you?" asked K., sitting half upright in his bed. The man, however, ignored the question as if
his arrival simply had to be accepted, and merely replied, "You rang?" "Anna should have
brought me my breakfast," said K.
http://www.gutenberg.org/dirs/etext05/ktria11.txt Translation Copyright (C) by David Wyllie
Total word count: 718
Number of different words: 429
Complexity factor (Lexical Density): 59.7%
Readability (Gunning-Fog Index) (6-easy, 20-hard): 8.2
Total number of characters: 7901
Number of characters without spaces: 4048
Average Syllables per Word: 1.44
Sentence count: 82
Average sentence length (words): 17.83
Max sentence length (words): 106
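The reported "Complexity factor (Lexical Density)" appears to be the number of different words divided by the total word count: 429 / 718 is about 59.7%, and the corresponding figures on the later specialist-text slide (324 / 681) give about 47.6%. This reading is an inference from the numbers, not something the slides state, and lexical density is more commonly defined as the proportion of content words. A minimal sketch, with hypothetical function names and a crude tokeniser:

```python
import re


def type_token_ratio(different_words, total_words):
    """Different words (types) divided by running words (tokens)."""
    return different_words / total_words


def ratio_from_text(text):
    """Compute the same ratio directly from raw text."""
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return type_token_ratio(len(set(tokens)), len(tokens))


if __name__ == "__main__":
    print(f"{type_token_ratio(429, 718):.1%}")  # general-language excerpt
    print(f"{type_token_ratio(324, 681):.1%}")  # specialist text
```

On this measure the specialist text repeats its vocabulary more than the literary excerpt does, which is consistent with the idea of a lexical signature.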
Rank | Word | Occurrences | Frequency
1 | you | 20 | 2.80%
2 | said | 14 | 1.90%
3 | him | 12 | 1.70%
4 | what | 10 | 1.40%
5 | them | 9 | 1.30%
5 | man | 9 | 1.30%
6 | looked | 7 | 1.00%
6 | time | 7 | 1.00%
6 | room | 7 | 1.00%
7 | much | 6 | 0.80%
http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=7849
The ‘identificatory force’ of subject position in grammar of specialist discourse is reserved for objects, processes, happenings, events
SPECIAL LANGUAGE
Total word count: 681
Number of different words: 324
Complexity factor (Lexical Density): 47.60%
Readability (Gunning-Fog Index) (6-easy, 20-hard): 8.4
Total number of characters: 8276
Number of characters without spaces: 4690
Average Syllables per Word: 1.73
Sentence count: 99
Average sentence length (words): 14.57
Max sentence length (words): 55
Readability (Alternative) beta (100-easy, 20-hard, optimal 60-70): 45.8
Rank | Word | Occurrences | Frequency
1 | nucleus | 26 | 3.8%
2 | number | 20 | 2.9%
3 | charge | 18 | 2.6%
4 | atom | 17 | 2.5%
5 | atomic | 12 | 1.8%
5 | scattering | 12 | 1.8%
6 | electrons | 10 | 1.5%
7 | nuclear | 9 | 1.3%
8 | elements | 8 | 1.2%
8 | particle | 8 | 1.2%
Nuclear Constitution of Atoms. Bakerian Lecture by SIR E. RUTHERFORD, F.R.S. (Cavendish Professor of Experimental Physics, University of Cambridge). The Proceedings of the Royal Society, A, 97, 1920, pp. 374-400
Mäki, Uskali. (2001). (Ed.) The Economic World View: Studies in the Ontology of Economics. Cambridge: Cambridge University Press.
Terminology, Ontology and Semantics: Theories and Things
Ontology and Metaphysics
Terminology, Ontology and Semantics:
Lexical Signature from Prof Amazon?
account action agents argument assumption behavior behaviour
beliefs between cambridge cannot case causal causes change
choice claim conditions constraint different does
economic economists economy empirical even
evolutionary example expectations explanation fact factors firm first
game general good however idea individual interest kind laws
level machines macroeconomics market matter may means
metaphysical might model must natural nature neoclassical new
ontological ontology part particular patterns people point possible
preferences press price principle probability problem properties
question rather rational real reason rosenberg sargent say science
see seems sense set should social terms theory things thus
two university use value view whether work world
Font Size | Frequency
12 | >100 AND <250
14 | >250 AND <500
16 | >1000 AND <1500
SPECIAL LANGUAGES
Special language is a language used in a subject field and characterized by the use of specific linguistic means of expression.
http://stats.oecd.org/glossary/detail.asp?ID=6151
SPECIAL LANGUAGES
Functionality and special language
Many researchers have purported to demonstrate that certain languages display particular features which are distinctly suited to serving a purpose required of that language or that have been melded by its use.
Type | Exemplar | Reference
natural languages | Sacapultec Maya | John du Bois (1987)
specialist languages | language of science | Sager, Dungworth and McDonald 1981
 | legal language | Swales and Bhatia 1983
 | language of commerce | Hoffman (1984)
operational languages | Seaspeak | Strevens 1984
 | language of military; police | Wikipedia
SPECIAL LANGUAGES: TEXT TYPES
Imaginative Texts;
Informative Texts;
{Hortatory Texts;}
{Instructive Texts;}
SPECIAL LANGUAGES: WORD CLASSES
Open Class;
Closed Class;
Additional Class: Numerals & Interjections
SPECIAL LANGUAGE
 | Automotive Engineering | Nuclear Physics | Theoretical Linguistics
Text Size
Total No. of Tokens | 326621 | 472108 | 688733
Total Number of Texts | 132 | 158 | 68
Language Variety
British English Texts | 95 | 59 | 29
American English Texts | 36 | 37 | 37
Other English Texts | 1 | 62 | 2
Register
Imaginative: Adverts | 38 | 0 | 0
Imaginative: Popular Science | 5 | 31 | 1
Imaginative: News | 24 | 8 | 0
Imaginative: Letters | 0 | 8 | 3
Informative: Journal Papers | 53 | 73 | 30
Informative: Book - Theses | 0 | 0 | 23
Informative: Book - Monographs | 4 | 22 | 8
Informative: Official | 1 | 9 | 0
Instructive: Manuals | 7 | 0 | 0
Instructive: Reports | 0 | 7 | 3
SPECIAL LANGUAGES: CHARACTERISTICS
Characteristics of special languages
Interlocking definitions;
Technical Taxonomies;
Special Expressions;
Lexical Density;
Syntactic Ambiguity;
Grammatical Metaphor;
Semantic Discontinuity
Halliday and Martin (1993:71-84)
SPECIAL LANGUAGES: CHARACTERISTICS
Frequency distribution of the first 100 most frequent words in a linguistics corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token | Cumulative Rel. Frequency | No. of OCW
the, of, in, a, and, is, to, that, as, for | 24.47% | 0
are, be, this, we, it, which, with, by, not, i | 6.92% | 0
on, gender, an, from, have, s, can, or, there, nouns | 3.81% | 2
one, but, agreement, noun, at, has, o, other, these, n | 2.67% | 2
will, form, case, such, if, no, language, two, all, structure | 2.19% | 4
some, they, may, between, more, example, e, c, only, mor | 1.99% | 2
would, singular, also, semantic, theory, languages, p, its, forms, see | 1.74% | 5
b, morphology, plural, same, t, number, where, class, stem, so | 1.51% | 5
rules, masculine, then, word, syntactic, given, x, feature, binding, feminine | 1.37% | 7
first, second, than, lexical, hierarchy, subject, when, like, however, different | 1.28% | 3
TOTAL | 47.95% | 30
SPECIAL LANGUAGES: CHARACTERISTICS
Frequency distribution of the first 100 most frequent words in a nuclear physics corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token | Cumulative Relative Frequency | No. of OCW
the, of, in, and, a, to, is, for, that, be | 27.44% | 0
with, are, by, this, I, as, from, which, we, on | 6.25% | 0
at, energy, it, an, nuclei, nucleus, have, r, these, nuclear | 3.65% | 4
s, c, can, neutron, p, electrons, not, will, has, was | 2.64% | 2
b, two, scattering, been, phys, one, particle, particles, e, between | 2.19% | 4
number, n, such, target, atom, j, cross, m, or, potential | 1.98% | 3
h, k, state, where, only, atoms, also, model, very, but | 1.76% | 3
electron, mev, states, calculations, structure, than, nucleon, data, q, mass | 1.61% | 8
density, more, core, f, d, section, if, theory, order, may | 1.45% | 4
t, elements, range, about, first, system, other, however, interaction, matter | 1.31% | 5
TOTAL | 50.98% | 33
SPECIAL LANGUAGES: CHARACTERISTICS
Frequency distribution of the first 100 most frequent words in an automotive engineering corpus. Open class words (OCWs) are indicated through the use of bold type face.
Token | Cumulative Relative Frequency | No. of OCW
the, of, and, to, in, a, is, for, with, on | 24.11% | 0
as, be, are, by, that, this, at, it, which, engine | 5.93% | 1
from, fuel, was, system, emissions, catalyst, control, an, exhaust, or | 3.70% | 6
have, not, can, vehicle, cars, s, has, air, emission, test | 2.79% | 5
vehicles, speed, will, wheel, car, pressure, were, all, brake, been | 2.36% | 6
these, than, fig, more, no, but, catalytic, also, only, when | 1.91% | 2
high, unleaded, co, braking, new, temperature, european, g, conditions, one | 1.72% | 6
abs, use, standards, gas, up, nox, its, hc, if, time | 1.55% | 4
sensor, valve, road, systems, engines, during, diesel, they, would, used | 1.38% | 5
so, however, two, into, driving, converter, three, low, other, between | 1.28% | 2
Cumulative Frequency | 46.72% | 39
SPECIAL LANGUAGE TERMINOLOGY
Terminology.
1. refers to the usage and study of terms, that is to say words and compound words generally used in specific contexts.
2. refers to a more formal discipline which systematically studies the labelling or designating of concepts particular to one or more subject fields or domains of human activity, through research and analysis of terms in context, for the purpose of documenting and promoting correct usage. This study can be limited to one language or can cover more than one language at the same time (multilingual terminology, bilingual terminology, and so forth).
http://en.wikipedia.org/wiki/Terminology
SPECIAL LANGUAGE TERMINOLOGY
http://en.wikipedia.org/wiki/Terminology
Terminology is a subject in its own right with its theoretical formalism, methods, techniques and tools.
Terminologists:
• analyze the concepts and concept structures used in a field or domain of activity
• identify the terms assigned to the concepts
• establish correspondences between terms in the various languages
• compile the terminology, on paper or in databases
• manage terminology databases
• create new terms
28
Simple Methodology
• Extract nouns and verbs from a source text
• Find classes in SUMO for the nouns and verbs
• Record a mapping as being either equal, subsuming or instance.
– type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser
• Create a subclass of SUMO if it's a subsuming mapping
• Add properties to the subclass
– reusing SUMO properties
– extending SUMO properties by creating a &%subrelation of an existing property
• Add English definition to the class
– define constraints that express how the subclass is more specific than the superclass
• Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously
Permission to reuse granted so long as this notice is not altered – Author: Adam Pease [email protected], 2003
Suggested Upper Merged Ontology
Simple Methodology
An ontology is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them.
http://en.wikipedia.org/wiki/Ontology_%28computer_science%29#Domain_ontologies_and_upper_ontologies
Individuals: the basic or "ground level" objects
Classes: sets, collections, or types of objects
Attributes: properties, features, characteristics, or parameters that objects can have and share
Relations: ways that objects can be related to one another
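The four ingredients can be pictured as a very small data model. The sketch below is illustrative only: the class names `OntClass`, `Individual` and `Relation`, and the nuclear-physics example objects, are assumptions made here and do not come from SUMO or from the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    """A class: a set, collection, or type of objects."""
    name: str
    parent: Optional["OntClass"] = None
    attributes: set = field(default_factory=set)  # attribute names objects may carry

    def is_a(self, other):
        """True if this class is `other` or one of its descendants."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


@dataclass
class Individual:
    """A basic, 'ground level' object belonging to a class."""
    name: str
    cls: OntClass
    values: dict = field(default_factory=dict)  # attribute name -> value


@dataclass(frozen=True)
class Relation:
    """A named way in which two individuals are related."""
    name: str
    subject: str
    obj: str


# Hypothetical example objects.
particle = OntClass("Particle", attributes={"charge"})
nucleon = OntClass("Nucleon", parent=particle)
neutron = OntClass("Neutron", parent=nucleon)
n1 = Individual("neutron-1", neutron, {"charge": 0})
part_of = Relation("partOf", "neutron-1", "nucleus-1")

print(neutron.is_a(particle))
print(particle.is_a(neutron))
```

Reasoning over such a model amounts to following `parent` links and relations; a full ontology language adds axioms on top of this.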
29
PREAMBLE: Special Language
A note on creativity or terminicide
Source | Lexical Difficulty
Nature | 55.5
Science | 44.8
Quality Newspaper | 0.0
New Scientist | 4.0
Physics Today | 13.3
Cell | 31.6
Donald Hayes (1992) ‘The growing inaccessibility of science’. Nature. Vol 356, pp 739-740
‘That science has become more difficult for nonspecialists to understand is a truth universally acknowledged’. The choice of words in a journal paper is very different from that in a quality newspaper, and this obscures the work of the scientists.
Lexicogenesis: Diachronic Semantic Change
The establishment of the unstable nucleus (1990’s)
30
LEXICAL SIGNATURE?
– The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
– The same concept may be referred to by different names;
– The frequency of words in a text carries a signature: if the text is specialist then a select few terms are repeatedly used;
– Everyday, general language texts seldom carry a signature.
Texts in forensic science can be identified by the signature:
SINGLE TERMS: evidence, crime, scene, forensic, police, identification, case, court, analysis, time, information, blood
& COMPOUND TERMS: crime scene, forensic evidence, court case, blood analysis, earprint, fingerprint, crime scenes
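One rough way to surface such a signature is to count content words and adjacent content-word pairs. The sketch below assumes a small hand-made stop list (`STOP`) and an invented three-sentence text; neither comes from the Surrey corpus, and the function name `signature` is illustrative.

```python
from collections import Counter
import re

# Hypothetical stop list of grammatical words; far from complete.
STOP = {"the", "a", "an", "of", "in", "and", "to", "for", "from", "was", "is"}


def signature(text, n_single=5, n_compound=3):
    """Return the most frequent single terms and two-word compound candidates.

    Compound candidates are adjacent pairs of non-stop words. Sentence
    boundaries are ignored, so some pairs will span two sentences.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    singles = Counter(t for t in tokens if t not in STOP)
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a not in STOP and b not in STOP
    )
    single_terms = [word for word, _ in singles.most_common(n_single)]
    compound_terms = [" ".join(pair) for pair, _ in pairs.most_common(n_compound)]
    return single_terms, compound_terms


# Invented sample text, for illustration only.
sample = (
    "The crime scene was sealed. Police examined the crime scene for "
    "forensic evidence. The forensic evidence from the crime scene went to court."
)
single_terms, compound_terms = signature(sample, n_single=4, n_compound=2)
print(single_terms)
print(compound_terms)
```

Raw frequency alone would also promote everyday words in a larger text, which is why the slides go on to compare frequencies against a general-language corpus.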
31
LEXICAL SIGNATURE?
In texts from all specialist domains, a few repeatedly used terms form the SIGNATURE. These terms are used PRODUCTIVELY: in plural form, as (heads of) compounds, and in derivative forms.
nucleus → nuclei (pl.), nuclear (adjective); stable/unstable nuclei; halo/closed shell nuclei; nuclear force/reaction; nuclear matter
crime → crime, criminal, crimes, criminals, criminalistics, criminology, criminalist(s), criminological, criminality; crime scene; crime of passion; property crime
BUILDING A THESAURUS
32
Word | SFSC: Relative Frequency | BNC: Relative Frequency | WEIRDNESS (SFSC/BNC)
the | 6.8% | 6.2% | 1.1
of | 3.7% | 2.9% | 1.2
and | 2.7% | 2.7% | 1.0
to | 2.5% | 2.6% | 1.0
a | 2.4% | 2.1% | 1.1
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The five words have about the same relative frequency in the two corpora. These are the so-called closed class words, or grammatical words; since both corpora consist of English language texts, one expects to find them with much the same frequency in each. There is no weirdness in the use of these words in the Forensic Science corpus.
BUILDING A THESAURUS
Word | SFSC: Relative Frequency | BNC: Relative Frequency | WEIRDNESS (SFSC/BNC)
evidence | 0.47% | 0.021% | 22
crime | 0.40% | 0.007% | 57
scene | 0.27% | 0.007% | 40
forensic | 0.25% | 0.001% | 473
police | 0.25% | 0.028% | 9
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The five words do not have the same distribution in the two corpora. These are the so-called open class words, or lexical words. Relative to corpus size, evidence is 22 times as frequent in the Surrey corpus as in the BNC. And forensic is the most weird: relative to corpus size, there are 473 instances in the Surrey corpus for every one in the BNC.
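The weirdness column is the ratio of the two relative frequencies. A minimal sketch, using the rounded percentages shown on the slides (the function name `weirdness` is illustrative, and treating a zero general-language frequency as infinite weirdness is an assumption):

```python
import math


def weirdness(special_rel_freq, general_rel_freq):
    """Ratio of a word's relative frequency in the special-language corpus
    to its relative frequency in the general-language corpus.

    A word absent from the general corpus is given infinite weirdness.
    """
    if general_rel_freq == 0:
        return math.inf
    return special_rel_freq / general_rel_freq


# (SFSC %, BNC %) pairs as printed on the slides.
rows = {
    "the": (6.8, 6.2),
    "evidence": (0.47, 0.021),
    "crime": (0.40, 0.007),
    "police": (0.25, 0.028),
    "earprint": (0.0137, 0.0),
}
for word, (sfsc, bnc) in rows.items():
    print(word, weirdness(sfsc, bnc))
```

Because the slide percentages are rounded, ratios recomputed this way need not match the printed weirdness exactly; for a very rare BNC word such as forensic the difference can be large.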
BUILDING A THESAURUS
33
Word | SFSC: Relative Frequency | BNC: Relative Frequency | WEIRDNESS (SFSC/BNC)
bitemark | 0.0187% | 0% | ∞
earprint | 0.0137% | 0% | ∞
accelerant | 0.0115% | 0% | ∞
pyrolysis | 0.0139% | 0.00001% | 634
ballistics | 0.0146% | 0.00002% | 1263
British National Corpus (BNC) = 100 Million words;
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words, and their weirdness is infinite. Pyrolysis and ballistics do occur in the BNC, but only rarely.
BUILDING A THESAURUS
Collocation patterns – semantic prosody in Surrey Forensic Science Corpus
34
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
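\[ \text{weirdness} = \frac{N_{GL}\, f_{SL}}{(1 + f_{GL})\, N_{SL}} \]
Coll. Strength: \[ k_{ij} = \frac{f_{ij} - \bar{f}}{\sigma} > k_0 \]
Coll. Spread: \[ U_i = \frac{1}{10} \sum_{j=1}^{10} \left( p_i^j - \bar{p}_i \right)^2 > U_0 \]
Peakedness: \[ p_i^j \geq \bar{p}_i + k_1 \sqrt{U_i} \]
Metrics: \[ (k_0, k_1, U_0) \equiv (1, 1, 10) \]
Here SL and GL denote the special-language and general-language corpora, f is a word's frequency and N the corpus size in words; p_i^j is the frequency of collocate i at position j relative to the node word, and \bar{p}_i is its mean over the ten positions.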
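The weirdness measure on this slide appears to add one to the general-language count, so that a word missing from the general corpus still receives a finite score. A small sketch under that reading; the function name and the raw counts below are hypothetical, while the corpus sizes are those quoted for the SFSC and the BNC.

```python
def smoothed_weirdness(f_sl, n_sl, f_gl, n_gl):
    """Weirdness with add-one smoothing of the general-language count.

    f_sl, n_sl: word frequency and corpus size (special-language corpus)
    f_gl, n_gl: word frequency and corpus size (general-language corpus)
    """
    return (n_gl * f_sl) / ((1 + f_gl) * n_sl)


N_SFSC = 580_000        # 0.58 million words
N_BNC = 100_000_000     # 100 million words

# Hypothetical counts: a term seen 80 times in the special corpus and
# either never, or 99 times, in the general corpus.
print(smoothed_weirdness(80, N_SFSC, 0, N_BNC))
print(smoothed_weirdness(80, N_SFSC, 99, N_BNC))
```

The smoothing trades the ∞ entries of the unsmoothed ratio for large but comparable numbers, which makes it possible to rank neologisms against each other.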
Ahmad, Khurshid, and Rogers, Margaret A. (2001). ‘Corpus Linguistics and Terminology Extraction’. In (Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.
Smadja, Frank. (1994). ‘Retrieving Collocations from Text: Xtract’. In (Ed.) Susan Armstrong(-Warwick). Using Large Corpora. Cambridge, Massachusetts & London, England: MIT Press. pp 143-177.
British National Corpus (BNC) = 100 Million words;
Surrey Nanotube Corpus = 1.09 Million words;
BUILDING A THESAURUS
Collocate       Freq   -5   -4   -3   -2   -1    1    2    3    4    5
nanotubes        690    8    8    9    2    0  647    6    0    7    3
nanotube         252    3    2    2    0    0  229    2    1    5    8
single-walled     77    0    0    1    1   75    0    0    0    0    0
aligned           94    1    1    3    5   74    0    1    1    3    5
multiwalled       70    1    1    2    0   59    0    0    1    5    1
amorphous         58    1    1    6    0   46    0    1    1    0    2
atoms             51    1    2    0    1    0   42    0    1    3    1
Collocations with carbon (frequency of 1506) in the Surrey Nanoscale science corpus.
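The positional counts in this table are what the spread and peakedness tests attributed to Smadja's Xtract operate on. The sketch below applies them to two rows of the table; the function name `spread_and_peaks` is illustrative, and the thresholds U0 = 10 and k1 = 1 are taken from the slide on the extraction algorithm.

```python
from math import sqrt

# Positions of the collocate relative to the node word "carbon".
POSITIONS = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]


def spread_and_peaks(position_counts, k1=1.0):
    """Return (mean, spread, peak positions) for one collocate.

    spread is the variance of the counts over the ten positions;
    a position is a peak when its count is at least
    mean + k1 * sqrt(spread).
    """
    n = len(position_counts)
    mean = sum(position_counts) / n
    spread = sum((p - mean) ** 2 for p in position_counts) / n
    threshold = mean + k1 * sqrt(spread)
    peaks = [pos for pos, p in zip(POSITIONS, position_counts) if p >= threshold]
    return mean, spread, peaks


# Rows copied from the table of collocations with "carbon".
nanotubes = [8, 8, 9, 2, 0, 647, 6, 0, 7, 3]
single_walled = [0, 0, 1, 1, 75, 0, 0, 0, 0, 0]

for name, row in [("nanotubes", nanotubes), ("single-walled", single_walled)]:
    mean, spread, peaks = spread_and_peaks(row)
    print(name, mean, spread, spread > 10, peaks)
```

Both rows pass the spread test and each has a single peak: nanotubes immediately after carbon (position 1) and single-walled immediately before it (position -1), which is what makes "carbon nanotubes" and "single-walled carbon" candidate compound terms.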
35
British National Corpus (BNC) = 100 Million words;
Surrey Nanotube Corpus = 1.09 Million words;
BUILDING A THESAURUS
Collocations with carbon nanotubes (frequency of 647) in the Surrey Nanoscale science corpus.
Collocate       Freq   -5   -4   -3   -2   -1    1    2    3    4    5
single-walled     73    0    0    1    1   71    0    0    0    0    0
aligned           63    1    1    1    5   48    0    0    2    4
multiwalled       53    0    0    1    0   46    0    0    5    1
properties        60    1    4   15   32    0    0    0    6    2
multiwall         34    0    1    0    1   30    0    2    0    0
No. Potential ‘Hyponymic’ Patterns
1 NP0 such as {NP1, NP2, …, (and|or) NPn}
2 such NP0 as {NP1, NP2, …, (and|or) NPn}
3 {NP1, NP2, …, NPn} (and|or) other NP0
4 NP0 (including|especially) {NP1, NP2, …, (and|or) NPn}
Examples: injury including broken bone; the bow lute, such as the Bambara ndang
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
36
An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION
• This method has been successfully applied in recent years in the synthesis of various metal nanostructures such as nanowires, nanorods, and nanoparticles.
• Occasional multiwall carbon nanotubes and other carbon nanostructures were also found following annealing at higher (> °C) temperatures.
• The present method will be extended to find and fix nanoparticles including polymers, colloids, micelles, and hopefully biological molecules/tissues in solution.
• This technique is promising because many different types of nanowires, like nanotubes or semiconductor nanowires, are now synthetically available.
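Sentences like these can be mined with simple surface patterns. The sketch below covers only the "such as" and "including|especially" patterns and makes a strong simplifying assumption: the superordinate term NP0 is taken to be the single word before the cue phrase, since finding real noun-phrase boundaries would need a part-of-speech tagger or chunker. The names `PATTERNS`, `ITEM_SPLIT` and `hyponym_pairs` are illustrative.

```python
import re

# Splits a coordinated list on ", and ", ", or ", "," or a bare " and "/" or ".
ITEM_SPLIT = re.compile(r",\s*(?:and|or)\s+|,\s*|\s+(?:and|or)\s+")

PATTERNS = [
    # NP0 such as NP1, NP2, ... (and|or) NPn
    re.compile(r"(?P<hyper>\w[\w-]*) such as (?P<hypos>[^.;]+)"),
    # NP0 (including|especially) NP1, NP2, ... (and|or) NPn
    re.compile(r"(?P<hyper>\w[\w-]*),? (?:including|especially) (?P<hypos>[^.;]+)"),
]


def hyponym_pairs(sentence):
    """Return (hyponym candidate, hypernym candidate) pairs found in a sentence."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hyper = match.group("hyper")
            for item in ITEM_SPLIT.split(match.group("hypos").strip()):
                item = item.strip()
                if item:
                    pairs.append((item, hyper))
    return pairs


s1 = ("This method has been successfully applied in recent years in the "
      "synthesis of various metal nanostructures such as nanowires, "
      "nanorods, and nanoparticles.")
s3 = ("The present method will be extended to find and fix nanoparticles "
      "including polymers, colloids, micelles, and hopefully biological "
      "molecules/tissues in solution.")

for pair in hyponym_pairs(s1) + hyponym_pairs(s3):
    print(pair)
```

The first sentence yields three clean pairs under nanostructures. The second shows the limits of surface matching: the last list item comes back as a whole clause fragment, so candidates of this kind still need filtering before they enter a thesaurus or ontology.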
British National Corpus (BNC) = 100 Million words;
Surrey Philosophy of Science Corpus = 1.042 Million words; 164 Texts, 1990-2000;
Journal Papers; Letters, Conference Announcements, Courses
BUILDING A THESAURUS & ONTOLOGY
37
PREAMBLE: Special Language
A note on creativity or terminicide
Source | Lexical Difficulty
Popular Science (Discover) | -4.7
Farm-workers talking to cows | -63.8
Children’s fiction | -27.4
Nat. History magazine (Ranger Rick) | -22.6
Fiction | -19.3
Quality Newspaper | 0.0
Lexicogenesis: Diachronic Semantic Change
The establishment of the nuclear atom (1890-1930)
Bohr