
Sinclair: Corpus, Concordance, Collocation


John Sinclair is a well-known name in the field of applied linguistics. In this book he discusses the "unit of meaning", concordance, and collocation, and how they are evaluated.


Evaluating instances

Introduction

For the last four chapters, we have been studying concordances in one form or another. Each instance has been taken to be as important as any other, and has had to be accounted for. This is a valuable discipline, but only the very first step towards the automation of text study. In this chapter and the next, we begin to evaluate concordances and devise new kinds of information about language.

The starting point of this chapter is probably unexpected: it is that most actual examples are unrepresentative of the pattern of the word or phrase for which they are chosen.

Such is the intricate nature of the ties between one segment of text and the surrounding text, and the relation between the text and the world and the intended outcomes of the communication, that the act of plucking a few words from any text is not likely to provide a freestanding instance of its constituent words, each acting typically.

The vast majority can be safely discarded when their statistical contribution to the concordance as a whole has been recorded. We need a lot of text so that there will always be a sufficient residue of useful examples, and also to provide criteria for discarding the others in the first place.

Throw away your evidence

The policy of discarding examples, and particularly examples which do not fit a description, is likely to have to struggle for popularity in linguistics. The Cult of the Counter-example is still very strong, in myth if not always in observance, and it is important for students of text to define a careful position in this regard, which will be quite different from that of students of sentences. For example, the computer corpora of the early sixties, Brown (Kučera and Francis 1967) and LOB (Hofland and Johansson 1982), represent a transitional stage; they


were most carefully constructed in an attempt to be representative, and each instance is cherished. The corpora are just small enough (one million words) for this to be possible for determined scholars.

The received wisdom of corpus linguistics is that fairly small corpora, of one million words or even fewer, are adequate for grammatical purposes, since the frequency of occurrence of so many grammatical or function words is quite high. In the LOB corpus, for example (one million words of English printed in the UK in the year 1961), the commonest 100 words are almost all grammatical, and range in frequency from the at 68,315 to people at 953 (the 'lexical' words in that range are said 2,074, time 1,654, man 1,072, years 1,067). There are a few stragglers, like shall 348, itself 272, nor 200, down to whatsoever 7, and whichsoever 1; but there are numerous instances of grammatical words, sufficient to enable conventional grammatical statements to be made.

However, the availability nowadays of much larger corpora makes it possible to evaluate conventional grammatical statements. Presumably, the shorter original corpora did little more than confirm the generally agreed positions on English grammar. The new evidence suggests that grammatical generalizations do not rest on a rigid foundation, but are the accumulation of the patterns of hundreds of individual words and phrases. The language looks rather different when you look at a lot of it at once.

Such evidence has not been available before. Linguists have had to rely on their intuitions, their limited capacity for thorough textual analysis, and whatever has caught their eye or ear as they have encountered large extents of language behaviour, in their daily lives or in their professional work. The great dictionaries relied on human beings for their examples, and there is as yet no substitute. This method is likely to highlight the unusual in English and perhaps miss some of the regular, humdrum patterns.

In grammars, the tradition of citing examples has faded away since A Modern English Grammar (Jespersen 1909; 1949). Even A Comprehensive Grammar of the English Language (CGEL) (Quirk et al. 1985) relies heavily on invented examples. The CGEL had the corpus of the Survey of English Usage available to it. This is a corpus approaching one million words, spanning twenty-five years, and including a substantial proportion of spoken English. Occasional reference is made in the grammar to Survey material, but no attempt is made to confront and account for the evidence. Hardly any of the examples are citations, though citations must have been readily available. One is forced to conclude that the authors were following a methodology which gave low priority to one of the concerns of this book, which is to press for the use of actual language data as a basis for all descriptive statements.

A valid generalization about data must relate to the data in a systematic way; each relevant instance must either support the generalization or exhibit features which make the generalization subordinate to some other descriptive statement. Hence, it is important to fix on a particular body of data (which is best chosen on non-linguistic principles), and then engage with every instance. If that procedure is adopted in language work, it soon becomes necessary to acquire very large quantities of data, or else generalizations cannot be made. Language is very complex, and people use it for their own ends, without normally being conscious of the relation between their verbal behaviour and the way that behaviour is characterized. They are creative, or expedient, or casual, or confused; or they have unusual matters to put into usual words, so they have to combine them in unusual ways.

It is, therefore, necessary to have access to a large corpus because the normal use of language is highly specific, and good representative examples are hard to find. This is as true of grammar as of lexis, because grammar is not made of just the patterns of the common grammatical words, but relies on the whole vocabulary of the language.

One further factor makes it essential to collect a large number of instances. Many words have more than one meaning, sense, or usage, and these occur in very uneven distribution. As far as I know, no systematic research has yet been done in this area, so the following remarks are speculations based on observation and occasional probings.

Frequent words have, in general, a more complex set of senses than infrequent words. If we divide and number senses in the conventional dictionary manner, we may discover a statistical relationship between the number of occurrences of a word and the number of different senses it realizes. Hence, the accumulation of instances of a frequent word is not just more of the same, but ever more clear evidence of complexity.

In addition to this, we must allow that, just as some words are much more common than others, some senses of one word are much more


common than other senses of the same word, many times more common. So if we need, say, fifty occurrences of a sense of a word in order to describe it thoroughly, then the corpus has to be large enough to find fifty instances of the least common sense. In practice, the decision about the 'least common sense' is an artificial one; no matter what the size, there are always loose ends, occasional odd examples, etc. But wherever this limit is fixed, we shall observe huge discrepancies in the frequency of the recognized senses, and this will produce a heavy demand for very long texts.
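The arithmetic behind this demand for long texts can be made concrete with a minimal sketch. It estimates how much text is needed before the rarest sense of a word can be expected to reach a target count. The function name `corpus_size_needed` and the 2 per cent sense share are assumptions made purely for illustration; the frequency of second (a little over 1,200 occurrences in 7.3 million words) is the figure quoted later in this chapter.

```python
import math


def corpus_size_needed(target_instances, word_count, corpus_tokens, rarest_sense_share):
    """Estimate the number of running words needed to expect
    `target_instances` occurrences of a word's least common sense.

    word_count / corpus_tokens is the observed rate of the word-form;
    rarest_sense_share is the (assumed) fraction of those occurrences
    that realize the least common sense.
    """
    if not 0 < rarest_sense_share <= 1:
        raise ValueError("rarest_sense_share must be in (0, 1]")
    word_rate = word_count / corpus_tokens        # occurrences per running word
    sense_rate = word_rate * rarest_sense_share   # rate of the rarest sense
    return math.ceil(target_instances / sense_rate)


# 'second': roughly 1,200 occurrences in 7.3 million words (from the text).
# The 2% share for the rarest sense is a hypothetical value.
needed = corpus_size_needed(50, 1200, 7_300_000, 0.02)
print(f"about {needed:,} running words")
```

Under these assumed figures the requirement is roughly double the 7.3 million words in which the word was actually counted, which is the sense in which uneven sense distribution drives up the size of corpus needed.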

Text and language

The distinction has often been made between text and language on a dimension of abstraction. Language is an abstract system; it is realized in text, which is a collection of instances. This is clearly an inadequate point of view, because we do not end up with anything like text by 'generating' word strings from grammars. In particular, there is hardly any allowance for the combinatorial meanings in text. If text (including, and in particular, spoken text) is not a strict realization of meaningful abstract decisions, then either it is subject to random distortion, or it is in part the result of decisions which are not recorded in the abstract system, but which take precedence over those which are. Many of these are components of the rather mystical notion of 'coherence' that is beyond the generative competence of grammars. Random factors will certainly not explain how coherence arises; so we are forced to conclude that the realization route is not through conventional grammars, but through some kind of functional analysis.

Actual text will always be deviant with respect to structural rules of the conventional kind. Some of the factors that lead to deviance have already been mentioned: creativity, expediency, inattention, confusion, and the need to express the unusual. Another major factor is shared knowledge among communicators, which leads to the actual occurrence of many utterances which are proscribed by rule (for example, an obligatory transitive verb occurring without an object where the real-world thing that could give rise to an expressed object is obvious).

The grammarian's dilemma is this: does he or she study actual instances, knowing that most of them are untypical? Or does he or she ignore actual instances and study a set of instances which have not occurred, including a high proportion that could never occur? The former choice characterized the field linguists in the first half of this century, and the latter choice has been evident in the linguistics of the last thirty years.

The new option opened up by the computer is to evaluate actual instances and select the most typical. A complete set of typical instances would exemplify the dominant structural patterns of the language without recourse to abstraction, or indeed to generalization. The mass of instances each contain just a small element of typicality, but a few contain several typical features. In such circumstances, although it may sound paradoxical, examples which are typical are rather uncommon, and have to be found by statistical methods.

It is, therefore, unnecessary to make a sharp distinction between abstract and actual language structure, the sort of distinction embodied in Saussure's langue and parole or Chomsky's competence and performance. The existence of these dichotomies is to allow us to abstract from the chaos of life a system of meaningful choices and to insulate the abstract system. I have already conceded that some proportion of the complexity of text may be attributable to accidental or random factors, but that is far from sufficient explanation. It may indeed have obscured what actually goes on. In fact, the main simplification that is introduced by conventional grammar has nothing to do with the purity of abstraction as against the chaos of life. It is merely the decoupling of lexis and syntax.

In the explicit theoretical statement of linguistics, grammatical and lexical patterns vary independently of each other. In most grammars, it is an assumption that is obviously taken for granted. For example, it is rare for a grammar to note that a certain structure is only appropriate for a particular sense of a word. The same goes for morphology. In contrast, grammars attribute independent meaning to syntactic arrangements. Equally, it is rare for a dictionary to note the common syntactic patterns of a word in a particular sense. Pedagogical dictionaries are increasingly seeing this as essential information for learners, but it is often added in the form of afterthoughts such as usage notes. The implicit stance of a conventional dictionary entry is that most of the words in daily use have several meanings, and any occurrence of the word could signal any one of the meanings. If this were actually the case, communication would be virtually impossible.


The decoupling of lexis and syntax leads to the creation of a rubbish dump that is called 'idiom', 'phraseology', 'collocation', and the like. If two systems are held to vary independently of each other, then any instances of one constraining the other will be consigned to a limbo for odd features, occasional observations, usage notes, etc. But if evidence accumulates to suggest that a substantial proportion of the language description is of this mixed nature, then the original decoupling must be called into question. The evidence now becoming available casts grave doubts on the wisdom of postulating separate domains of lexis and syntax.

In modern lexical research, it is part of the long-term task to specify accurately the established phrases of a language. A phrase can be defined for the moment as a co-occurrence of words which creates a sense that is not the simple combination of the sense of each of the words. One is first struck by the fixity and regularity of phrases, then by their flexibility and variability, then by the characteristically creative extensions and adaptations which occur, sometimes more often than the 'ordinary' form.

In this work, it is much more fruitful to start by supposing that lexical and syntactic choices correlate, than that they vary independently of each other.

Meaning and structure

For the remainder of this chapter (as in Chapter 4), I should like to widen the domain of syntax to include lexical structure as well, and call the broader domain structure. In the spirit of the preceding argument, I shall define structure as any privileges of occurrence of morphemes; we do not in the first analysis have to decide whether these are lexical or syntactic, or, as so often, a bit of both.

Is it then best to hypothesize that sense and structure are inseparable? Unfortunately not. If that were so, ambiguity would be impossible. More than one sense can be realized by the same structure, and, in the simplest case, by the same word. We must, then, consider whether ambiguity is incidental or systematic. If it is much more than incidental, it will constitute strong evidence for the independence of lexis and syntax.

However, although ambiguity causes great headaches in automatic parsing, if we look at the way people actually operate with language we see it as a sporadic and almost accidental coincidence, rarely constituting a communicative problem. Much more would threaten the basic notion of realization in language: that structure realizes sense and therefore normally differentiates one sense from another.

If sense and structure are not independent of each other and not inseparable, then they must be associated. Here we can frame a hypothesis that can act as a substitute for the langue/parole distinction. We can postulate that the underlying unit of composition is an integrated sense-structure complex, but that the exigencies of text frequently obscure this. This position offers a sharp contrast to the atomistic model featured by most grammars, and the argument is developed in the next chapter.

Our descriptive task then becomes the identification of the regular and typical associations, leading to the identification of one or more 'citation forms' for each distinct sense. The distinguishing features of the citation forms could then be stated, and explanations could be offered for the occurrence of non-citation forms. A citation form would involve a modest step in abstraction. It is also likely that many citation forms contain some systematic variables, such as pronoun selections, which leaves a modicum of independence to the grammar.

Procedure

How, then, do we find the citation forms, especially since we believe text to be largely composed of non-citation forms? I propose to outline a method for tackling one area of structure, in this case collocation, which gives promise of valuable results. The same principle can be applied to other structural features.

The procedure begins with a machine-generated concordance to a large corpus, as we have used in previous studies in this book. The usual kind of concordance is adequate, where all the occurrences of a word-form are retrieved, each in the middle of a line of text. A line of text may contain as many as eight or nine words on either side of the central word, or node, and we do not expect to need more than four or five on either side.

A concordance has many of the properties of a natural text, and it is reasonable for the purposes of statistical analysis to treat each cited line as if it were a sentence, and so to examine the vocabulary of the concordance. In order to do this, a list is compiled in frequency order, of all the word-forms in the concordance. These are called the collocates of the node. This raw list is then processed as follows:


a. The lines are trimmed so that only those words that are reasonably likely to be attracted by the node are left in. It was strongly suggested by Sinclair, Jones, and Daley (1970) that beyond four words from the node there were no statistical indications of the attractive power of the node. At present we are experimenting with the environments of between one and five words on either side of the node, both balanced and unbalanced, to see if there is an optimum setting.

b. There is no point in considering very infrequent collocates, and there is usually a long tail to the frequency lists. A suitable cut-off point (for example, less than ten per cent of the frequency of the node) should be determined.

c. Each of the remaining collocates is given a weighting, by relating its frequency in the concordance to its overall frequency in the full corpus. So a common word gets a low rating, and a word which makes a distinctive collocation with the node will score high.

d. Each line of the concordance is now examined for the typicality of its collocates, by adding up the weightings of each collocate in its environment. The concordance is now re-sorted so that the most typical instances should come to the top.

From this point onwards, an automatic procedure is not yet fully established, and the study continues largely on a subjective basis. First, any obvious phrases are identified and removed. Next, there is a search for the clustering of collocates and their mutual attraction and repulsion, for example, pairs and groups of collocates which frequently occur in the same line, and pairs which never do. Then the other aspects of structural patterning are brought in: the occurrence of the very frequent words, ordering of items in the line, and syntactic structure.

If it is suspected that there are two or more principal senses of a word, an attempt is made to isolate a sense, using explicit criteria. When a sense is fully described, all the lines that exemplify it are then removed, and the new, shorter concordance is reprocessed from the beginning. Gradually, this procedure should identify the distinct senses of a word. Each cycle will, however, reduce the size of the remaining concordance substantially, and the overall size of the corpus will quickly become a limiting factor.

Findings

This technique, in a provisional form, was recently applied to the concordance of the word second. The word was chosen as being fairly frequent (over 1,200 occurrences in 7.3 million words), and as having two rather distinctive major senses. It was found that the first pass identified the Second World War as a phrase which had 14 occurrences in the 50 most typical. The next pass, omitting the 14 phrases, identified a major sense which was strongly associated with preceding the, occasionally his or her, and with words like first, third, time, year, act, child, and wife in the environment. The next pass identified a sense which was strongly associated with preceding per, and before that a word like cycles or radians. A number of similar instances had a instead of per, but a is also occasionally used in the other main sense.

There was little else except a hint of possible phrases second hand and second class.

The two main meanings of second, then, are associated one with definiteness and the other with indefiniteness. This is at least as important as the observation that one is a modifier and one a noun.

A closer look at the full concordance confirms these findings. There is, however, a third fairly prominent use of second which does not emerge in the collocational analysis. This requires neither a definite nor an indefinite determiner, and the word functions as a discourse organizer. It is quite often preceded by and. It is not surprising that this use does not attract strong lexical collocations, because it occurs according to the exigencies of the discourse and should be largely independent of such things as content, topic, message.

These findings are crude, preliminary, and partial. No doubt a study of secondly would identify the third sense of second as a discourse organizer, to be absorbed into the lemma second(ly). The study of seconds might add new features and new uses, and so on. In due course, we shall see. The technique has at least managed to isolate the most basic contrast of meaning. In another trial at an international conference, the system successfully distinguished among sole = only, sole = fish, and sole = bottom of shoes or feet.
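Steps a to d of the procedure described in this chapter can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program actually used: the function names, the toy concordance lines for second, and the corpus frequencies are all hypothetical, and the weighting formula (relative frequency in the concordance divided by relative frequency in the corpus) is only one possible reading of "relating its frequency in the concordance to its overall frequency in the full corpus".

```python
from collections import Counter


def trim(tokens, node_index, span=4):
    """Step a: keep at most `span` words on either side of the node."""
    left = tokens[max(0, node_index - span):node_index]
    right = tokens[node_index + 1:node_index + 1 + span]
    return left + right


def rank_by_typicality(lines, corpus_freq, corpus_size, span=4, cutoff=0.10):
    """Rank concordance lines by the summed weights of their collocates.

    lines: list of (tokens, node_index) pairs, one per concordance line.
    corpus_freq: word -> frequency in the full corpus.
    Returns (ranked, weights), where ranked is a list of (score, tokens)
    with the most typical lines first.
    """
    windows = [trim(tokens, idx, span) for tokens, idx in lines]
    collocate_freq = Counter(word for window in windows for word in window)
    conc_size = sum(collocate_freq.values())

    # Step b: discard collocates below `cutoff` times the node frequency.
    node_freq = len(lines)
    kept = {w: f for w, f in collocate_freq.items() if f >= cutoff * node_freq}

    # Step c (assumed formula): ratio of relative frequencies, so that
    # common words rate low and distinctive collocates rate high.
    weights = {}
    for word, freq in kept.items():
        overall = corpus_freq.get(word, 0)
        if overall > 0:
            weights[word] = (freq / conc_size) / (overall / corpus_size)

    # Step d: score each line and re-sort, most typical first.
    scored = [
        (sum(weights.get(word, 0.0) for word in window), tokens)
        for window, (tokens, _) in zip(windows, lines)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored, weights


# Hypothetical toy data: four concordance lines for the node 'second'.
lines = [
    (["the", "second", "world", "war"], 1),
    (["the", "second", "world", "war", "ended"], 1),
    (["cycles", "per", "second"], 2),
    (["and", "second", "we", "must"], 1),
]
# Hypothetical full-corpus frequencies and corpus size.
corpus_freq = {"the": 1000, "world": 20, "war": 20, "ended": 10, "cycles": 2,
               "per": 5, "and": 800, "we": 300, "must": 50}

# A high cutoff is used only because the toy concordance is so small.
ranked, weights = rank_by_typicality(lines, corpus_freq, 10_000, cutoff=0.5)
for score, tokens in ranked:
    print(f"{score:8.2f}  {' '.join(tokens)}")
```

To imitate the later passes described above, the lines that exemplify an isolated sense or phrase would be filtered out of `lines` and the function run again on the shorter concordance; as the text notes, each such cycle shrinks the data available.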


Conclusion

The conclusion that can be drawn from this and other studies is that it is folly to decouple lexis and syntax, or either of those and semantics.

The realization of meaning is much more explicit than is suggested by abstract grammars. The model of a highly generalized formal syntax, with slots into which fall neat lists of words, is suitable only in rare uses and specialized texts. By far the majority of text is made of the occurrence of common words in common patterns, or in slight variants of those common patterns. Most everyday words do not have an independent meaning, or meanings, but are components of a rich repertoire of multi-word patterns that make up text. This is totally obscured by the procedures of conventional grammar.

The next chapter takes up this argument in detail. The notion of citation forms is developed in a separate paper (Sinclair 1984).

Introduction

This chapter concludes the description of word co-occurrence as we currently conceive it. The next stage is to write a dictionary of collocations, and the project is in hand (Sinclair et al. forthcoming). The argument brings together a number of themes that have been developing throughout the book, in particular, the notions of dependent and independent meaning, and the relation of texts to grammar.


Two models of interpretation

It is contended here that in order to explain the way in which meaning arises from language text, we have to advance two different principles of interpretation. One is not enough. No single principle has been advanced which accounts for the evidence in a satisfactory way. The two principles are the open-choice principle and the idiom principle.

The open-choice principle

This is a way of seeing language text as the result of a very large number of complex choices. At each point where a unit is completed (a word or a phrase or a clause), a large range of choice opens up and the only restraint is grammaticalness.

This is probably the normal way of seeing and describing language. It is often called a 'slot-and-filler' model, envisaging texts as a series of slots which have to be filled from a lexicon which satisfies local restraints. At each slot, virtually any word can occur. Since language is believed to operate simultaneously on several levels, there is a very complex pattern of choices in progress at any moment, but the underlying principle is simple enough.

Any segmental approach to description which deals with progressive choices is of this type. Any tree structure shows it clearly: the nodes on the tree are the choice points. Virtually all grammars are constructed on the open-choice principle.


The idiom principle

It is clear that words do not occur at random in a text, and that the open-choice principle does not provide for substantial enough restraints on consecutive choices. We would not produce normal text simply by operating the open-choice principle.

To some extent, the nature of the world around us is reflected in the organization of language and contributes to the unrandomness. Things which occur physically together have a stronger chance of being mentioned together; also concepts in the same philosophical area, and the results of exercising a number of organizing features such as contrasts or series. But even allowing for these, there are many ways of saying things, many choices within language that have little or nothing to do with the world outside.

There are sets of linguistic choices which come under the heading of register, and which can be seen as large-scale conditioning choices. Once a register choice is made, and these are normally social choices, then all the slot-by-slot choices are massively reduced in scope or even, in some cases, pre-empted.

Allowing for register as well, there is still far too much opportunity for choice in the model, and the principle of idiom is put forward to account for the restraints that are not captured by the open-choice model.

The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments. To some extent, this may reflect the recurrence of similar situations in human affairs; it may illustrate a natural tendency to economy of effort; or it may be motivated in part by the exigencies of real-time conversation. However it arises, it has been relegated to an inferior position in most current linguistics, because it does not fit the open-choice model.

At its simplest, the principle of idiom can be seen in the apparently simultaneous choice of two words, for example, of course. This phrase operates effectively as a single word, and the word space, which is structurally bogus, may disappear in time, as we see in maybe, anyway, and another.

Where there is no variation in the phrase, we are dealing with a fairly trivial mismatch between the writing system and the grammar. The of in of course is not the preposition of that is found in grammar books. The preposition of is normally found after the noun head of a nominal group, or in a quantifier like a pint of ... . In an open-choice model, of can be followed by any nominal group (see Chapter 6 for details). Similarly, course is not the countable noun that dictionaries mention; its meaning is not a property of the word, but of the phrase. If it were a countable noun in the singular it would have to be preceded by a determiner to be grammatical, so it clearly is not.

It would be reasonable to add phrases like of course to the list of compounds, like cupboard, whose elements have lost their semantic identity, and make allowance for the intrusive word space. The same treatment could be given to hundreds of similar phrases: any occasion where one decision leads to more than one word in text. Idioms, proverbs, clichés, technical terms, jargon expressions, phrasal verbs, and the like could all be covered by a fairly simple statement.

However, the principle of idiom is far more pervasive and elusive than we have allowed so far. It has been noted by many writers on language, but its importance has been largely neglected. Some features of the idiom principle follow:

a. Many phrases have an indeterminate extent. As an example, consider set eyes on. This seems to attract a pronoun subject, and either never or a temporal conjunction like the moment, the first time, and the word has as an auxiliary to set. How much of this is integral to the phrase, and how much is in the nature of collocational attraction?

b. Many phrases allow internal lexical variation. For example, there seems to be little to choose between in some cases and in some instances; or between set x on fire and set fire to x.

c. Many phrases allow internal lexical syntactic variation. Consider the phrase it's not in his nature to ... . The word it is part of the phrase, and so is the verb is, though this verb can vary to was and perhaps can include modals. Not can be replaced by any 'broad' negative, including hardly, scarcely, etc. In is fixed, but his can be replaced by any possessive pronoun and perhaps by some names with 's. Nature is fixed.

d. Many phrases allow some variation in word order. Continuing the last example, we can postulate to recriminate is not in his nature, or it is not in the nature of ... .

Page 8: SInclair Corpus Concordance Collocation

e. Many uses of words and phrases attract other words in strong collocation; for example, hard work, hard luck, hard facts, hard evidence.

f. Many uses of words and phrases show a tendency to co-occur with certain grammatical choices. For example, it was pointed out in Chapter 5 that the phrasal verb set about, in its meaning of something like 'inaugurate', is closely associated with a following verb in the -ing form, for example, set about leaving ... . What is more, the second verb is usually transitive, for example, set about testing it. Very often, set will be found in co-occurrence patterns.

g. Many uses of words and phrases show a tendency to occur in a certain semantic environment. For example, the verb happen is associated with unpleasant things: accidents and the like.

The overwhelming nature of this evidence leads us to elevate the principle of idiom from being a rather minor feature, compared with grammar, to being at least as important as grammar in the explanation of how meaning arises in text. Support comes unexpectedly from a different quarter.

Evidence from long texts

In the current lexical analysis of long texts, a number of problems have arisen, not all of which were anticipated:

1 The 'meanings' of very frequent, so-called grammatical words are a headache in any lexicography, but the problem they typify fits in with some of the newer difficulties.

2 Some 'meanings' of frequent words seem to have very little meaning at all, for example, take, in take a look at this; make in make up your mind.

3 The commonest meanings of the commonest words are not the meanings supplied by introspection; for example, the meaning of back as 'the posterior part of the human body, extending from the neck to the pelvis' (Collins English Dictionary (CED) 2nd edition 1986 sense 1) is not a very common meaning. Not until sense 47, the second adverbial sense, do we come to 'in, to or towards the original starting point, place or condition', which is closer to the commonest usage in our evidence.

4 The commonest meanings of many less common words are not those supplied by introspection. Sense 1 offered in the CED for pursue is 'to follow (a fugitive etc.) in order to capture or overtake', yet by far the commonest meaning is sense 5, 'to apply oneself to (one's studies, hobbies, interests etc.)'.

I think most speakers of English would agree with the CED's ordering of senses, whatever the evidence from frequency. What is disquieting is the apparent lack of good reason for the enormous discrepancy between the sense to which our intuitions give priority, and the most frequent one.

From this we can put forward some tentative generalizations:

1 There is a broad general tendency for frequent words, or frequent senses of words, to have less of a clear and independent meaning than less frequent words or senses. These meanings of frequent words are difficult to identify and explain; and, with the very frequent words, we are reduced to talking about uses rather than meanings. The tendency can be seen as a progressive delexicalization, or reduction of the distinctive contribution made by that word to the meaning.

2 This dependency of meaning correlates with the operation of the idiom principle to make fewer and larger choices. The evidence of collocation supports the point. If the words collocate significantly, then to the extent of that significance, their presence is the result of a single choice.

3 The 'core' meaning of a word (the one that first comes to mind for most people) will not normally be a delexical one. A likely hypothesis is that the 'core' meaning is the most frequent independent sense. This hypothesis would have to be extensively tested, but if it proved to hold good then it would help to explain the discrepancy referred to above, between the most frequent sense and what intuition suggests is the most important or central one.

4 Most normal text is made up of the occurrence of frequent words, and the frequent senses of less frequent words. Hence, normal text is largely delexicalized, and appears to be formed by exercise of the idiom principle, with occasional switching to the open-choice principle.

5 Just as it is misleading and unrevealing to subject of course to grammatical analysis, it is also unhelpful to attempt to analyse grammatically any portion of text which appears to be constructed on the idiom principle.

The last point contains an implication that a description must indicate how users know which way to interpret each portion of an utterance. The boundaries between stretches constructed on different principles will not normally be clear-cut, and not all stretches carry as much evidence as of course does to suggest that it is not constructed by the normal rules of grammar.

It should be recognized that the two models of language that are in use are incompatible with each other. There is no shading of one into another; the switch from one model to the other will be sharp. The models are diametrically opposed.

The last two points taken together suggest one reason why language text is often indeterminate in its interpretation and hence very flexible in use. If the 'switch points' between two modes of interpretation are not always explicitly signalled, and the two modes offer sharply contrasting ways of interpreting the data, then it is quite likely that an utterance will not be interpreted in exactly the same way in which it was constructed. Also, two listeners, or two readers, will not interpret in precisely the same way.

For normal texts, we can put forward the proposal that the first mode to be applied is the idiom principle, since most of the text will be interpretable by this principle. Whenever there is good reason, the interpretive process switches to the open-choice principle, and quickly back again. Lexical choices which are unexpected in their environment will presumably occasion a switch; choices which, if grammatically interpreted, would be unusual are an affirmation of the operation of the idiom principle.

Some texts may be composed in a tradition which makes greater than normal use of the open-choice principle; legal statements, for example. Some poems may contrast the two principles of interpretation. But these are specialized genres that require additional practice in understanding.

It thus appears that a model of language which divides grammar and lexis, and which uses the grammar to provide a string of lexical choice points, is a secondary model. It cannot be relinquished, because a text still has many switch points where the open-choice model will come into play. It has an abstract relevance, in the sense that much of the text shows a potential for being analysed as the result of open choices, but the other principle, the idiom principle, dominates. The open-choice analysis could be imagined as an analytical process which goes on in principle all the time, but whose results are only intermittently called for.

This view of how the two principles are deployed in interpretation can be used to make predictions about the way people behave, and the accuracy of the predictions can be used as a measure of the accuracy of the model. Areas of relevant study include: the transitional probabilities of words; the prevalent notion of chunking (see Chapter 9); the occurrence of hesitations, etc., and the placement of boundaries; and the behaviour of subjects trying to guess the next word in a mystery text.

Collocation

The above is the framework within which I would like to consider the role of collocation. Collocation, as has been mentioned, illustrates the idiom principle. On some occasions, words appear to be chosen in pairs or groups and these are not necessarily adjacent.

One aspect of collocation has been of enduring interest. When two words of different frequencies collocate significantly, the collocation has a different value in the description of each of the two words. If word a is twice as frequent as word b, then each time they occur together is twice as important for b than it is for a. This is because that particular event accounts for twice the proportion of the occurrence of b than of a. So when all the occurrences of a with b are counted up and evaluated, one figure is recorded in the profile of a, and another figure, double the size, is recorded in the profile of b.

By entering the same set of events twice, once as the collocation of a with b and again as the collocation of b with a, one incurs the strictures of Benson, Brainerd, and Greaves (1985) who say 'there are problems here: double counting of nodes and double counting of collocates. The parts now add up to considerably more than the whole, which makes computation under any statistical model inaccurate'. In practice, the possibility of double entry allows us to highlight two different aspects of collocation.

I would like to consider separately the two types of collocation instanced above, using the term node for the word that is being studied, and the term collocate for any word that occurs in the specified environment of a node. Each successive word in a text is thus both node and collocate, though never at the same time. When a is node and b is collocate, I shall call this downward collocation: collocation of a with a less frequent word (b). When b is node and a is collocate, I shall call this upward collocation. The whole of a given word list may be treated in this way.

There appears to be a systematic difference between upward and downward collocation. Upward collocation, of course, is the weaker pattern in statistical terms, and the words tend to be elements of grammatical frames, or superordinates. Downward collocation by contrast gives us a semantic analysis of a word.

No standard of statistical significance is claimed at present, because many typical collocations are of such low frequency compared with the overall length of a text. Because of the low frequency of the vast majority of words, almost any repeated collocation is a most unlikely event, but because the set of texts is so large, unlikely events of this kind may still be the result of chance factors. However, no speaker of English would doubt the importance of these patterns. One recognizes them immediately, because they are features of the organization of texts; often subliminal, they cannot be reliably retrieved by introspection.

In distinguishing upward and downward collocation I have made a buffer area of (plus or minus) 15 per cent of the frequency of the node word. For example, let us take a word occurring 1,000 times; when it is examined as a node, collocates are grouped into:

upward collocates: those whose own occurrence is over 115 per cent of the node frequency (that is, 1,150);
neutral collocates: between 85 per cent and 115 per cent of the node frequency (in this instance, 850 and 1,150); this is the buffer area;
downward collocates: less than 85 per cent (in this instance, 850).

Let us illustrate collocational patterns, in a provisional way, with the word back. I shall make no attempt to differentiate separate senses, but will put the collocates into ad hoc groups.

Neutral collocates are added on an ad hoc basis to upward or downward groups, and are given round brackets. Since this has to be a summary account of a very large set of data, I have removed some items which seem to be of little general significance. These include personal names, contracted forms like I'll, and word-forms whose co-occurrence with back is infrequent and carries no conviction of general significance. Of the last category, the form anger only occurs in the title of the play Look Back in Anger.

The nouns and verbs listed below as collocating with back are representative only. Given the uncertainty at the limits of statistical significance, it could be more misleading to include doubtful contenders. Thus, while get, go, and bring are unlikely to be challenged, beach, box, and ... are much less convincing when the actual instances are examined.

The qualification for an instance being scrutinized is co-occurrence within four words of back, on either side, this being the cut-off point established some years ago (Jones and Sinclair 1974). No account is taken of syntax, punctuation, change of speaker, or anything other than the word-forms themselves.

No doubt the studies which succeed this one will sharpen up the picture considerably. For example, the evidence of back suggests that few intuitively interesting collocations cross a punctuation mark. But it would be unwise to generalize from the pattern of one word, particularly such an unusual one as back. Now that tagged and parsed texts are becoming available, the co-patterning of lexical and grammatical choices is open to research. But it is still important to draw attention to the strength of patterning which emerges from the rawest of unprocessed data.

In pushing forward into new kinds of observation of language, the computer is simultaneously pulling us back to some very basic facts that are often ignored in linguistics. The set of four choices, a,b,c,k, from the alphabet, arranged in the sequence b,a,c,k with nothing in between them, that is, back, is an important linguistic event in its own right, long before it is ascribed a word-class or a meaning. It is difficult for users of English to notice this, but it is the computer's starting point.
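The frequency bands that separate upward, neutral, and downward collocates (a buffer of 15 per cent either side of the node frequency) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `collocation_band` and its parameters are hypothetical, and integer arithmetic is used so that the boundary values 850 and 1,150 fall exactly inside the buffer area.

```python
def collocation_band(node_freq: int, collocate_freq: int, buffer_pct: int = 15) -> str:
    """Classify a collocate relative to its node by raw frequency.

    Hypothetical helper: 'upward' if the collocate is more frequent than
    115 per cent of the node frequency, 'downward' if it is less frequent
    than 85 per cent, and 'neutral' inside the buffer area.
    """
    # Compare in units of one per cent to avoid floating-point rounding.
    scaled = collocate_freq * 100
    if scaled > node_freq * (100 + buffer_pct):
        return "upward"
    if scaled < node_freq * (100 - buffer_pct):
        return "downward"
    return "neutral"


# A node occurring 1,000 times, as in the worked example in the text.
for freq in (1151, 1150, 850, 849):
    print(freq, collocation_band(1000, freq))
```

Treating the exact boundary values as neutral is a design choice of this sketch; the text only says 'over 115 per cent' and 'less than 85 per cent'.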

Analysis of the collocational pattern of back

Upward collocates: back Prepositions/adverbs/conjuncr~ons: UL, (down), from, ini

cent (in tl to, now, I

Neui warc a sul .&--

tral collocates are added on an ad hoc basis to upward o 1 groups, and are given round brackets. Since this h: nmary account of a very large set of data, I have remov

I L C ~ U S which seem to be of little general significance. These ILICIUUC

persc es, contracted forms like I'll, and word-form

lr down- IS to be ed some

:--1...l-

then, tc

Pronouns n,...,a.-.-:%,.

I, up, wht ;: her, him

?n

1, me, she. .c. L o w L;

, them, we

r uaacwlrc prulluulla. .,c., , ,,.S, my, (your) ma1 nam s whose


borne, hotel, ofice, road, streets, village, The n after pronc

' . I

neaning of back as 'return' attracts expressions ot time and place; and where are also prominent. The presence of four subject suns may have a more general explanation than anything to do .

back, but the absence of you and I from the list may be worth ling. Possessive prc lggest the anatomical sense of back vould explain why their d o not figure prominently. The rerbsget and go ar t 3UyLIVlJinate~ of a large number of verbs of In, many of which will be found in the downward collocates. ave selected a few examples of these words to show the way in h the basic syntax of back is established. The sets of examples

~ u l l o w the four categories mentioned abov-

I

Nouns: camp, flat, garden, home, hotel, office, road, streets, village, yard, bed, chair, couch, door, sofa, wall, window, feet, forehead, hair, hand, head, neck, shoulder, car, seat, mind, sleep, kitchen, living room, porch, room.

wirn pursi and v two 1

,nouns SL

they and . .?..,.a%?,-.-, The word-class groupings above are based on frequency with back; many words actually occur in more than one word-class. Verbs are given in their most frequent form. Note the preponderance of past tense verbs, reflecting the temporal meaning of back.

The prepositions and adverbs suggest some typical phrases with back, and the nouns are largely those of direction, physical space, and human anatomy. A few typical examples follow:

motic I h

whicl L - l l - -

.eally was : drive bac --.. -..- -

g back at to the ten ,, l.,,I, 4

school race L,.*" n-..:.

like bein :k down 1 ---..+.. ? A .

Verbs:
You arrive back on the Thursday
May bring it back into fashion
We climbed back up on the stepladder
They had come back to England
She never cut back on flowers
It possibly dates back to the war
The bearer drew back in fear
We drove back to Cambridge
You can fall back on something definite
I flew back home in a light aircraft
He flung back the drapes joyously
Don't try to hold her back
She lay back in the darkness
He leaned back in his chair
He looked back at her, and their eyes met
Pay me back for all you took from me
Pulled back the bedclothes and climbed into bed
I pushed back my chair and made to rise
Shall I put it back in the box for you
I rolled back onto the grass
She sat back and crossed her legs
Edward was sent back to school
He shouted back

I The girl stared back ' walking back to Fifth Aven

w IIFII VUL pal C I I L ~ L ~ I I I C uaLn 1 I u1r6 I all:

~llowed him back into the u '

lefty slap o n the back n turned back to the booksht

have him ck to her nice to h;

? went Pack to the Pungalov

Ten can I went ba

~ o u l d be

back hot typing ave them

1

ne, docto

back

:y had cot never cut

ossibly dh

E has goni b want bat . .

e back to :k into hi

an back to m y cabi.. your dorr get back

L y uaik to the sallic u L a L

her pare1 s office n

) back to : )w I must a,, ,, Lor

nitory a t to work .-ma .-.net

Dow Verb

J ~~

nward co s: arrive,

llocates: 1 bring, et J

back c., climbc

/ - I 1 , etc., dal

J~ 1 1

?d, come, . n~ etc., cut n I

:es, etc., ~ w , erc., arove, erc., Tall, erc., Tiew, pung, nanaea, nold, etc., ked, lay, etc., leaned, etc., looked, looking, etc., pay, pulled, etc., shed, etc., put, ran, rocking, rolled, rush, sank, sat, etc., sent, etc., wted, snapped, stared, stepped, steps, etc., stood, threw, traced, pried, etc., walked, etc., wavc

art jer P U shc

.d. ), past, t o ~sitions: along, behind, ontc ward, tot

right uards

.:rbs: again, forth, further, slowly, strh ctive: nor, ey started


Prepo

Adver

sitions: I 7

location

I tive: s: I

He stepped back and said ... He then stood back for a minute The woman threw her head back These could be traced back to the early sixties -1e turned back to the book! she walked back to the bus ! We waved back like anythin 3ands heid behind Walked back towa: ;ater we came bacl Rock us gently bac If you look further The straight back t -4e went slowly bac rhings wc : crawled . .

~ u l d soon back to c . .

~y back tc ry back tc ,ven a bac - L,.J

his back rd the h o ~ k a ~ a i n .. ..

0---

k and for back in n o his cabi , . . .

:k to his I get back

amp '11 drive you back to your flat Vot a bit like his back garden -Ie turned and went back home We had to go back to the hotel You've just got back from the ofice Set back from the road The back streets of Glasgow 911 the wa ) the village 3n his WL; ) the apartment Without e .k yard 30 back t~ "CU

i e leaned back in I jtepping outside th 9 man standing by the back rom went back to 3ritain would be b; l e brushed back h With the back of his nana ;he put her head back against the seat The hairs on the back of my neck l e gestured back over his shoulder rhey got back into

is chair e back dc . . . the wind( ~ c k on its is hair r I

th ny files n 3ook

21

the car

20

)or wall >w feet

Collocation

:re was so he back o :n we go I

,me beer on the back seat tf his mind lack to sleep again

You must come back to the kitchen
She went back into the living room
Beside me here on the back porch
He came back into the room

Conclusion

All the evidence points to an underlying rigidity of phraseology, despite a rich superficial variation. Hardly any collocates occur more than once in more than two patterns. The phraseology is frequently discriminatory in terms of sense; for example, there are almost as many instances of flat on her back as back to her flat. Some, like arrive, seem characteristic of the spoken language, some, like hotel, show the wisdom of allowing a nine-word span for collocation.
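A nine-word span is the node plus four words on either side. As a rough illustration of how such a count might be made, the sketch below tallies every word-form found within that span of each occurrence of a node. It is a minimal sketch, not the procedure actually used for the study: the function name `collocates_within_span`, the whitespace tokenization, and the sample sentence are all assumptions made for the example. In keeping with the description in the text, it takes no account of syntax, punctuation, or anything other than the word-forms themselves.

```python
from collections import Counter


def collocates_within_span(tokens, node, span=4):
    """Count word-forms occurring within `span` tokens either side of `node`.

    Hypothetical helper: `tokens` is a list of word-forms in text order.
    The node occurrence itself is excluded from its own window.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to four words before the node
        right = tokens[i + 1:i + 1 + span]     # up to four words after the node
        counts.update(left + right)
    return counts


# Invented sample text, tokenized on whitespace for simplicity.
sample = "he leaned back in his chair and looked back at her".split()
profile = collocates_within_span(sample, "back")
print(profile.most_common(3))
```

Because each occurrence of the node is treated separately, a word that falls inside the span of two occurrences is counted twice, once for each.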

Early predictions of lexical structure were suitably cautious; there was no reason to believe that the patterns of lexis should map on to semantic structures. For one thing, lexis was syntagmatic and semantics was paradigmatic; for another, lexis was limited to evidence of physical co-occurrence, whereas semantics was intuitive and associative.

The early results given here are characteristic of present evidence; there is a great deal of overlap with semantics, and very little reason to posit an independent semantics for the purpose of text description.

Words about words

Introduction In the final chapter, we look at the way in which people explain the meaning of words, especially in dictionaries. Although lexicography is a practical skill, a dictionary is a systematic description of a language. In turn, it must be assumed that any such description rests on the foundations of a theoretical position, whether articulated or not.

The argument in this chapter makes something of a contrast with that in earlier work (Sinclair 1984), where I make a case against the attempt to devise a theory of lexicography. At that time, lexicography seemed to me to be almost entirely a matter of managing a number of routine factors like resources and project aims. The relevant theory was linguistic theory, pure and simple. Expertise in computation, printing, book design, reference, and other skills was required from time to time, but this was not felt to be of a theoretical nature, even when, as in computational science, theory was readily available. Lexicography was held up by the practitioners to be a largely practical matter, and ... theories in the way.

However, in the later stages of compiling the Cobuild dictionary (Sinclair et al. 1987) it was decided to develop a new style of presenting lexicographical information. The process began in a straightforward attempt to explain the meaning and use of words in ordinary English sentences, and it ended in a radical critique of conventional lexicography. This exercise now appears to be the first step in articulating a theory of language reflexivity: the capacity of language to talk about itself. The importance of this capacity has not been properly recognized as yet, or even the extent of its occurrence in everyday usage. This chapter hopes to contribute to a better understanding of language about language.

The rationale is set out in Hanks (1987). Each entry for a word in the Cobuild dictionary begins with some formal matters like a listing of word-forms, and a guide to pronunciation. Then, paragraph by paragraph, the meanings and uses of the word are explained. At the end of each explanation there is usually an example. To the side of the main text is an extra column with abbreviated notes on grammar and semantics.

In this chapter, I shall concentrate on the structure of explanations. The explanations lead to hypotheses about inference, metalanguage, and the general nature of lexical statement.

Here are some representative lexical statements from the Cobuild dictionary:

A house is a building in which people live ...
If you defeat someone, you win a victory over them in a contest such as a battle, game or argument.
A pure substance is not mixed with anything else.
If something happens often, it happens many times or much of the time.

Structure

These statements are all divisible into two principal parts:

A house | is a building in which people live ...
If you defeat someone | you win a victory over them in a contest ...
A pure substance | is not mixed with anything else.
If something happens often | it happens many times or much of the time.

The first parts of each sentence break down further into two sub-parts, shown by the type-face changes. One or more words are in bold type, and the rest is in roman. The word or phrase in bold type is called the topic of the sentence, and the rest of the first part can be called the co-text. For example, in the second statement above, defeat is the topic, and if you and someone constitute the co-text.

The second part of each sentence is an explanatory comment on the topic, and is called the comment. Comments are sometimes divisible according to the surface syntax. This is called chunking; in this kind of sentence, successive chunks express gradually increasing depth of detail.
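The division of a first part into topic and co-text can be made concrete with a small sketch. It is illustrative only: in the dictionary the topic is marked by bold type, whereas here it is simply passed in as a string, and the function name `split_first_part` is hypothetical.

```python
def split_first_part(first_part: str, topic: str) -> list[str]:
    """Return the co-text chunks that surround the topic in a first part.

    Hypothetical helper: the topic is matched as a whole-word sequence,
    and whatever stands before and after it is returned as co-text.
    """
    words = first_part.split()
    topic_words = topic.split()
    n = len(topic_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == topic_words:
            before = " ".join(words[:i])
            after = " ".join(words[i + n:])
            # Drop an empty chunk when the topic starts or ends the first part.
            return [chunk for chunk in (before, after) if chunk]
    raise ValueError("topic not found in first part")


print(split_first_part("If you defeat someone", "defeat"))
```

For the second statement this yields the two co-text chunks named in the text, the words before the topic and the word after it.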

There is another element of structure in each of the statements, and it can occur physically in either the first part or the second part. This element is an indication of the actual structure, and is called the operator. In the statements above, the operators are:

a. at the outset of the first part: if;
b. at the outset of the second part: is.

Table 1 shows the analysis so far:


Var The1 nresc

iation i n co-te: Kt ' the first I:

.rt

'topic' 'operator' 'comment'

a woman asa COW ...

First Pa e types of ar.

!art that a re quite d 'e are som ented so f TOPIC

About the word itself

The statement may be about the word itself, and may not use the device of putting the word as topic in an appropriate context, for example:

You use naturally to indicate that you think something is very obvious.

In ordinary written English, the word naturally in the above example would be highlighted in some way, italics or inverted commas usually. Since it is a dictionary headword and in bold face, it is not further distinguished. However, this type of sentence is a different way of tackling the job of explanation. Other examples are:

To levitate means to rise and float in the air.
Naturalistic describes people or things that ...
Meanwhile means while a particular thing is happening.

What people mean

The statement may be about what people mean when they use a word or phrase, rather than what the word or phrase means:

If you describe a woman as a cow ...
If you say that something gets up your nose ...
If you say that something is smashing, you ...
If you refer to someone as a pipsqueak, you ...

In these cases, it is important to point out that the opinion of the speaker ... social protocol ... . Nothing is inherently 'smashing', in the way that it might be 'smooth' or 'strong'. The co-text includes a verb such as describe, say, refer, call, and the topic is found in a subordinate clause or a similar secondary structure. This strategy is part of the 'report' category in grammar, and typically a report contains a statement inside it. Hence, in: 'If you describe a woman as a cow ...' you are reported as saying that the woman is a cow. That woman is a cow is close in structure to a house is a building, which is a structure already analysed. The new structure including report can be represented as follows in Table 2:

Table 2: Report structures

The other examples can be described in similar ways. The terms topic, operator, and comment are re-used but in lower case and inside inverted commas to make it clear that they are embedded. Note that the comment at the lower level is the topic at the text level.

The Cobuild dictionary is sparing with explanations of this type, and only uses them when it would be misleading to ignore the subjective quality of the meaning. For example, it was implied above that 'smooth' and 'strong' were inherent qualities, but on closer inspection they are seen to be quite subjective. Something is smooth only if there is general agreement about that as a description of it. Some objective qualities of the object referred to will be relevant in deciding whether or not it can be called smooth. In contrast, if you consider something or someone 'smashing', this seems to be a very personal judgement.

Structure: verb explanations

I return to the normal type of entry, to consider further the structure of the co-text. The focus is on the explanation of verbs.

Animate subjects

In nearly every entry there is reference to a person; the sort of person who will be using English. The neutral way of referring to this person is with the pronoun you, and in this sense it is used many times on each page of the Cobuild dictionary.

Occasionally, though, you is felt not to be appropriate. The implication of using you is that the sentence expresses something that anyone might reasonably and normally do, so when we are explaining things which are socially undesirable, the pronoun you may be replaced by someone, for example:

If someone burps, they make a noise ...
If someone totters, they walk in an unsteady way ...
If someone flings you into prison, ...
If someone defrauds you, ...
If someone burgles a building.

If someone who is very ill is sinking, ...
If someone, especially a child, sneaks on you, ...
If someone in authority rules on a particular situation or problem ...

Notice that you sometimes reappears as the object of the verb, so that the co-text presents the action as something happening to a person who may be the user, rather than the other way round. There is a lot of room here for interpretation of what is taken to be undesirable, and the dictionary projects an ... view of the world through devices of this nature.

At present the use of someone as both subject and object is avoided, so the entry for seduce begins 'If you seduce someone', rather than 'If someone seduces someone'. The third possibility, 'If someone seduces you', carries an implication of this being a reasonably probable event, and so is avoided. Someone also replaces you if the sentence expresses an activity which is difficult, unusual, or outside the subject's control, for example:

It is a small step fr topic verb:

,om that t o name directly a suitable subject for the

If the police a r r m yuu, ... When artists exhibit ...

JVCI'

the 7

A1 sedu

voided, so I The disadvantage of this last structure is that it appears by a natural sort

of implication to exclude anyone other than the named people. It is thus a risky statement to make in a dictionary because the conventions of language are so easily extended. On the other hand, it is clumsy and uneconomical to keep saying 'If someone such as a policeman arrests you ...', when the likelihood of being arrested by anyone else is very small.

We must presume, then, that in the cases of direct naming it is unlikely, but not impossible, that someone other than the named person may be an appropriate subject.

Playing is associated particularly with children, and also with pet animals. Adults do it occasionally, but their play is more often

1 expressed in clauses with an object, for example, 'Do you play chess?'. The entry in Cobuild opens thus:

When children, animals or perhaps adults play,

able event e expresse ..

slips into a particu

vns, ... lar state ( r, they change into

'hen somc someone

:one sews tames a I

8, ... wild anim

t .

la1 or bird, ... :re an actlvlty 1s so unaes~rable or unusual that it would sound ~ r d to sug: ~rdinary people do it, the choice of a subject is ldoned a1 Instead, the explanation takes the form of a :ment abc 3rd itself, as instanced earlier in the section on iation in co-text':

abst a b a ~

gest that ( together. )ut the wc

1 Similarly, the entry for sting begins: rise and means ...

& - - -

float in t t o levitate o torture - - - -. - - -

means to someone 1 - l L - _ _ _ *

le air ... If an insect, animal, or plant stings you, ...

1 The creatures that cause a sting are partially identified. Similarly, the

! entry for lay includes the sense: I

I When a bird or female animal lays an egg ...

U C C ~ S I U I I ~ I ~ I L C T I I ~ C I V ~ . co sumeune IS me word people, which is a1 and so i ~ctivities: .S more na

~~ ~ -

le descrip plur mmunal :

'hen people ski, ... people riot, ... [hen people demon people agree on sc

The conventions of interpretation apply to non-humans in the same way as to humans, and to mixtures of human and non-human. For example, it would not be sufficiently specific to present the verb play in the co-text 'when you play ...' or even 'when people r'--- ' use of words like > v r r ~ e u r ~ e and ~ e o b l e allows an i m ~ o r t a n t de-

pment of the co-text which i Aems of explaining usaee is I i applies to t-~ch ne-.v ~vord. he words someone and people can ue uuallneu ~y auuing an 111us- ve or restrictive phrase:

velo prot

IS deiied'~ how to de

to the wo r ide whic

Ine of the : alterna-

te subject! I . I lnanlmate o~ jec t s and abstract entities are dealt with by a slrnllar set

of conventions based on the word something. Here are so I

I

trati


[...] something glows, ...
[...] something ensues, ...
[...] something goes with something else, ...

This pronoun also can be qualified:

[...] something such as success, glory or love evades you, ...
[...] something such as an idea or subject [...]
[...] something unpleasant, for example pain, sets in ...

If there is an accent on the plural, things can replace something:

[...] things such as ideas, beliefs or statements sweep a place ...

Where the likely subject is extremely restricted, it can be named:

[...] a play is taken off, ...
When an object breaks, ...
When the sun sets, ...

This structure allows the use of a fairly general subject as well as specific ones:

If a powerful force tears something from somewhere, ...
When jelly, glue, cement, or some other soft or liquid substance sets, ...

Mixed subjects

Quite commonly a verb has a sense where the subject can be either animate or inanimate. No suitable pronoun exists in English for this purpose, so we have to resort to [...]:

If someone or something degenerates, ...
[...] someone or something falls, ...

With plural forms, this becomes:

When things or people disappoint you, ...

In the examples given so far, it will have been noticed that they all start with an operator, if or when. The Cobuild dictionary compilers


chose whichever was the more [...] in relation to the completion of the expression. The result seems to be that in relation to people and animals, if introduces an activity which is broadly speaking within the discretion of the individual, whereas there is rather more inevitability about when. With when, the relation between subject and verb is a little more mutually determined. So, since sewing and skiing are [...] human activities, they are introduced by when. But rioting and burgling, while still recognizably human, are not typical, and so are introduced by if.

In relation to inanimate objects, a similar distinction is observed. If the action seems to be inherent in the nature of the object, for example, to break, the operator will be when; when is also used with the sun sets. If the action is only something that might happen, then if is used.

The structural options that have been identified so far do not present any formidable array, and the main lines can be set out simply as in Table 3. This shows how the first parts of the verb explanations are structured. Verb explanations are the most structurally complex, and the other parts of speech will show fewer options. Idiomatic phrases need special attention.

Analysis of the second part

The analysis of the second part of the explanations of words proceeds along similar lines. All explanatory [...] references back to elements of the first part, and then a [...] structure of [...] detail.

First chunk

The first chunk of the second part of the statement is now further described, returning to the original examples of several different parts of speech, as in Table 1. In many examples, some of the words recall words in the co-text, either by repetition or other types of cohesion. These are called the framework of the explanation. The remainder of the first chunk is a rephrasing of the topic, and is called the gloss. So in the examples given, the analysis is as follows in Table 4:

First part                     Second part
A house                        is a building
If you defeat someone          you win a victory over them
A pure substance               is not mixed with anything else
If something happens often     it happens many times

Table 4: Analysis of the second part

Of the items in the framework, a, you, and happens are repetitions of words in the co-text, and it and them are pronouns which refer back as follows:

it refers back to something;
them refers back to someone.

This analysis permits us to isolate the gloss element and study its role and function. In the examples above:

a. building is a replacement of house, and we can assume that the two words are in a recognizable semantic relation - in this case hyponymy, with building the superordinate. The following chunk in which people live provides a restriction on building, giving a classic definition:

building          in which people live
superordinate     restriction

This can be paraphrased as 'a house is a type of building. It is different from other types of building by the fact that people live in it.'

b. win a victory over replaces defeat. Defeat is a transitive verb, with the object someone in the co-text. In the comment, the verb is replaced by a structure with a verb, its object, and a preposition. The object of the original verb now becomes the object of the preposition.

c. not mixed with anything else replaces pure. Here the explanation relies on the negative, and the semantic relation is antonymy. The adjective is replaced by a past participle and an adjunct.


d. many times replaces often, the adverb giving way to a nominal group.

Second chunk

The second chunks of the second part relate to the first chunks in ways which have already been met in the first part analysis (see Table 5).

Second chunk
in which people live                                  qualifier
in a contest such as a battle, game or argument       adjunct

Table 5: Analysis [...]

Discussion

The outline analysis of the language of lexical statement shows a slight specialization of the normal conventions of English. The only physical difference is the identification of the topic by using bold face, and the analysis hinges on this. The rules of English grammar and semantics are unaffected.

The Cobuild explanatory style [...] out of a set of guide-lines, worked [...] dictionary compilers, who [...] without restraint and used their natural choices of language to express the meanings that they wanted to convey. Several stages of consultative editing reduced the range of different structures to what is published, and work since then has made further rationalizations.

There is no particular communicative virtue in having an obviously formulaic style of explanation. Traditional dictionaries use a set of compression techniques which require specific decoding skills, and in rejecting these in favour of ordinary English it would be counterproductive to return to a formula in disguise. Rigid rules of compilation lull compilers into a false sense of security, and may obscure important distinctions of usage. For some applications, the repetitive simplicity of formulae may give greater accessibility than more accurate and varied expressions. Equally, there is some point in cutting down variation that seems over-subtle for the kind of communication in which it appears. A dictionary is a text in which most units of discourse are very brief, and in which the overall structure is highly repetitive. The user will naturally be led to expect that the same kind of type-face, or the same kind of phraseology, will carry the same sort of information. The reverse of this is that differences are meaningful, and so phraseology should be standardized at least to the point where differences can be justified.

There are a number of applications of the kind of analysis presented here. First of all, it will be possible to make an exhaustive comparison of actual dictionary writing. The analysis will be a professional tool with which the wording of dictionaries can be improved. Lexicographers will be able to consider alternative expressions, knowing exactly how they differ. Problem areas, for example adjectives, can be experimented with in a systematic way. The vocabulary and syntax of lexical statement will be made explicit, so that people can understand how and why meanings are explained.

Secondly, the description will be a part of the general description of English. Since the Cobuild type of explanation relies on the natural use of words, there are no new conventions to learn. The structures and meanings set out in this chapter have been derived from the dictionary text, not prescribed in advance. The same cannot be said for most dictionary definitions, which are very obscurely structured for anyone approaching from normal English. Because of some long-established habits (for example, that the first part must be the topic only) and a great concern for compression, they require rules as if they were written in another language. So the structural resources of English are hardly available to the compilers in this kind of lexicography, and the explanations of this type of conventional dictionary are not able to be assimilated into the general repertoire of the user.

Inferences and implications

The analysis has shown that each explanation gives rise to a number of entailments, implications, and inferences. For example, the first verb explanation is phrased as follows:

If you defeat someone, you win a victory over them in a contest such as a battle, game or argument.

[...] we can assume that, in these circumstances [...]


a battle is a contest
a game is a contest
an argument is a contest
you defeat someone in a contest
defeating someone means winning a victory over them
you win a victory over someone in a contest

The subject of defeat is you, so we can infer that defeating is not considered an unusual or reprehensible activity. The operator is if, signifying that it is not an inherent activity, but one that the human race is distinctly likely to indulge in from time to time. The object of defeat is someone, indicating that defeating is done to a person, but not a specially selected one.

So the combination of the general conventions of English and the particular conventions of the Cobuild dictionary is powerful in interpreting the explanations.

The language used to explain the meaning of words is an important part of our linguistic repertoire. In the style studied here, it is clearly only a slight extension of the ordinary use of English. Thus, all the flexibility of a natural language is available for implications, inferences, etc. In turn, these can be developed into a set of tools by which dictionaries can be constructed and understood. All sorts of valuable semantic and structural statements can be retrieved.

For the future, this analysis offers the possibility of harnessing one of the most powerful, but least understood, features of a natural language - the features of paraphrase. The lexical statements set up equivalences, which are found in the topic and gloss entries. See Table 6 for an analysis of our original examples:

Topic (Table 1)     Gloss (Table 4)
house            =  building
defeat           =  win a victory over
pure             =  not mixed with anything else
often            =  many times

Table 6: Paraphrase

The analysis will establish that building is a superordinate of house, and the rest are synonymic. The grammatical replacement shown, for example, in defeat - win a victory over will also be made explicit. This activity, if carried out thoroughly and accurately, is a reasonable model for the understanding of the text.

That is to say, if a person can routinely rephrase a given sentence in his or her own words and state the difference between the two sentences in a third sentence, that person will be seen as understanding the language. If a machine can perform in a similar way, the machine can reasonably be described as understanding the language. And once a machine can be seen as understanding a language, the map of information technology will have to be re-drawn.

Summing up

This book is an attempt to show that there is a lot more to learn about the English language than it was possible to imagine a few years ago. It has been suspected, of course, in all the work on idiom usage that has accumulated in language teaching materials. While grammars and dictionaries continue to report the structure of language as if it could be neatly divided, many of those people who are professionally engaged in handling language have known in their bones that the division into grammar and vocabulary obscures a very central area of meaningful organization. In fact, it may well be argued on the basis of the work in this book that when we have thoroughly pursued the patterns of co-occurrence of linguistic choices there will be little or no need for a separate residual grammar or lexicon.

That remains to be seen. Certainly, the first application of computers to the study of language corpora has uncovered a lot of new facts which have to be built in to our descriptions of languages. And it should be stressed that this book reports only the first dipping of an inquisitive toe into the vast pool of language texts. The corpus of the 1980s, although boasting a central size of 20 million words, will be seen in another decade as a relatively modest repository of evidence; the software tools increase in sophistication month by month, and must still be regarded as primitive compared with what the real needs are. Most limiting of all, our concepts, our ideas of what to expect and how to understand what we are observing, are not keeping pace with the evidence available. There is as yet little or no discussion at an international level and, beyond the Cobuild project, no thorough exploitation of corpus linguistics.
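The topic and gloss equivalences discussed under 'Inferences and implications' (house = building, defeat = win a victory over, and so on) lend themselves to a small mechanical illustration. The following is only a minimal sketch, not the Cobuild procedure: the names EQUIVALENCES, hyponym_statements and rephrase are hypothetical, and the matching is naive string handling that assumes uninflected words and the exact 'a X such as a A, B or C' wording of the defeat explanation.

```python
import re

# Topic/gloss pairs as listed in the text's paraphrase table.
EQUIVALENCES = {
    "house": "building",
    "defeat": "win a victory over",
    "pure": "not mixed with anything else",
    "often": "many times",
}

EXPLANATION = (
    "If you defeat someone, you win a victory over them "
    "in a contest such as a battle, game or argument."
)


def with_article(noun: str) -> str:
    """Prefix 'a' or 'an' (assumption: the choice depends only on the first letter)."""
    return ("an " if noun[0].lower() in "aeiou" else "a ") + noun


def hyponym_statements(explanation: str) -> list[str]:
    """Read 'a X such as a A, B or C' as the statements 'a A is a X', ..."""
    match = re.search(r"\ba (\w+) such as an? ([^.]+)", explanation)
    if match is None:
        return []
    superordinate, tail = match.groups()
    # Split the exemplifying list on commas and on the word 'or'.
    items = [part.strip() for part in re.split(r",|\bor\b", tail) if part.strip()]
    return [f"{with_article(item)} is a {superordinate}" for item in items]


def rephrase(sentence: str, equivalences: dict[str, str] = EQUIVALENCES) -> str:
    """Replace each topic word by its gloss (whole words only, no inflection)."""
    for topic, gloss in equivalences.items():
        sentence = re.sub(rf"\b{re.escape(topic)}\b", gloss, sentence)
    return sentence


if __name__ == "__main__":
    for statement in hyponym_statements(EXPLANATION):
        print(statement)
    print(rephrase("you defeat someone in a contest"))
```

Under these assumptions the sketch reproduces some of the inferences listed for defeat: the 'such as' phrase yields the three 'is a contest' statements, and substituting the gloss for the topic turns 'you defeat someone in a contest' into 'you win a victory over someone in a contest'. Anything beyond this, such as handling inflected forms or stating the difference between the two sentences, would need a real grammatical analysis.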