Page 1: lexicographic evidence

Lexicographic Evidence

This part explains how to design, acquire, and process a collection of linguistic data which will form the raw material for a dictionary.

Page 2: lexicographic evidence

Comprehension Questions (1)

1. What is a reliable dictionary?

2. What is subjective evidence and its limits?

3. What is a citation?

4. What should be the basic steps in setting up a reading programme?

5. What are the advantages and disadvantages of citations?

Page 3: lexicographic evidence

Comprehension Questions (2)

6. What is a corpus?

7. What are the points that should be considered in designing a corpus?

8. How large should a corpus be?

9. How do we decide what kinds of written or spoken material our corpus should include?

10. Can a corpus be representative?

Page 4: lexicographic evidence

Comprehension Questions (3)

11. What is ‘skewing’?

12. What are the questions that should be answered before starting to form the corpus?

13. What is linguistic annotation?

Page 5: lexicographic evidence

A ‘Reliable’ Dictionary

A reliable dictionary is one whose generalizations about word behavior approximate closely to the ways in which people normally use language when engaging in real communicative acts. Yet it is difficult to determine how people normally use words. There is a need for evidence.

Page 6: lexicographic evidence

Subjective Evidence and Its Limits

Introspection, that is, consulting your own mental lexicon, is a form of evidence, but it cannot alone form the basis of a reliable dictionary, since one individual’s store of linguistic knowledge is inevitably incomplete and idiosyncratic.

Informant-testing, in which speakers of a language are questioned about their use of words, is also of limited value for mainstream lexicography for similar reasons. Both are essentially subjective forms of evidence.

Creating a reliable dictionary involves a number of challenging tasks, but the observation of language in use is certainly the indispensable first stage in the process.

Page 7: lexicographic evidence

Citations

A citation is a short extract from a text which provides evidence for a word, phrase, usage, or meaning in authentic use. Until the late twentieth century, the OED’s citations were written in longhand on index cards known as slips. These were filed alphabetically according to the keyword of the citation.

Page 8: lexicographic evidence

DNA

If a blog has a common ancestor with the diary, one can say that it has a DNA.

E.g. ‘MySpace’ shares at least some of its DNA with the ‘scrapbook’.

Page 9: lexicographic evidence

Setting up a Reading Programme

Some dictionary publishers provide online forms to enable members of the public to contribute citations. Most of these publishers get unusable citations since their programmes are not well planned. A good reading programme, on the other hand, will often have great value.

Page 10: lexicographic evidence

Setting up a Reading Programme

There is a need for at least four main data fields:

1- Keyword or phrase: the usage that the citation illustrates, filed under the headword to which it relates.

2- The citation itself: usually a single sentence is adequate, but there may be more than one.

3- Information about the source of the citation: the date, title, and author’s name are all important; additional information (such as the page number) may be useful for specialized or historical dictionaries.

4- A comment field: this gives readers the option of adding a note to clarify the citation; it may, for example, be a new meaning that needs explaining, or it may be characteristic of one particular dialect.
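
The four data fields can be pictured as one record per citation. The sketch below is only an illustration: the class name, the attribute names, and the `file_by_keyword` helper are assumptions made for this example, and the source details of the sample citation are placeholders because the slides do not give them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """One record in a hypothetical citation database."""
    keyword: str                 # field 1: the word or phrase the citation illustrates
    text: str                    # field 2: the citation itself, usually one sentence
    source_date: str             # field 3: source details (date, title, author)
    source_title: str
    source_author: str
    comment: str = ""            # field 4: optional clarifying note from the reader
    page: Optional[int] = None   # extra source detail for specialized or historical work


def file_by_keyword(citations):
    """Return citations in alphabetical order of keyword, like slips in a drawer."""
    return sorted(citations, key=lambda c: c.keyword.lower())


example = Citation(
    keyword="DNA",
    text="'MySpace' shares at least some of its DNA with the 'scrapbook'.",
    source_date="unknown",       # placeholder: not given in the slides
    source_title="unknown",      # placeholder
    source_author="unknown",     # placeholder
    comment="Figurative use: shared ancestry between two kinds of text.",
)
```

Keeping the comment and page number optional reflects the point that only the keyword, the citation, and the core source details are always needed.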

Page 11: lexicographic evidence

Advantages of Citations

1- They are helpful for monitoring language change.

2- They give information about the terminology of a specific subject field or a particular variety or dialect.

3- They are helpful in training lexicographers.

Page 12: lexicographic evidence

Disadvantages of Citations

1- Collecting data in this way is labour-intensive, so volumes will always be low.

2- Although instances of usage are authentic, there is a big subjective element in their selection.

Page 13: lexicographic evidence

The Central Role of the Corpus

A citation bank alone, even the largest one, cannot usually supply language data in the required volumes, so the case for a large corpus is clear.

A “corpus” is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005).

Page 14: lexicographic evidence

Some Inescapable Truths

There is no such thing as a perfect corpus for lexicography.

First of all, the corpus is a sample. It is not possible to examine every extant example of usage for the language. To create a sample that fairly reflects the wider population, there is a need for carefully selected criteria.

Secondly, selecting texts on the basis of their ‘quality’, and excluding those which fail this test, is fundamentally at odds with the descriptive ethos of corpus linguistics. Who is to judge which texts are ‘good’, and on what basis? It is clear that a lexicographic corpus must be a genuine and inclusive snapshot of a language, not a set of texts that have been specially chosen to advance someone’s notion of what constitutes ‘good’ usage.

Page 15: lexicographic evidence

Corpora: Design Issues

Designing a corpus means making decisions about:

1- how large it will be.

2- which broad categories of text it will include.

3- what proportions of each category it will include.

4- which individual texts it will include.

Page 16: lexicographic evidence

Size: How large is large enough?

It is certain that the more data we have, the more we learn. Yet there are also some hypotheses on the size of the corpus. Zipf’s Law predicts that the tenth most frequent word in a corpus will occur twice as often as the 20th most frequent word, ten times as often as the 100th most frequent word, and 100 times as often as the 1000th most frequent word. Thus, it can be said that in a corpus of 100 million words, a simple right- or left-sorted concordance clearly shows most of the normal patterns of usage for all words except the very rare.
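
The ratios quoted for Zipf’s Law follow from assuming that a word’s frequency is proportional to 1 divided by its rank. The sketch below works through that arithmetic; the function name and the value of `TOP_COUNT` are assumptions chosen only for illustration, not figures from any real corpus.

```python
def zipf_expected_count(rank, top_count):
    """Expected occurrences of the word at `rank` when frequency is proportional to 1 / rank."""
    return top_count / rank


# Assumed count for the most frequent word in a 100-million-word corpus (illustrative only).
TOP_COUNT = 6_000_000

expected = {rank: zipf_expected_count(rank, TOP_COUNT) for rank in (10, 20, 100, 1000)}

# How many times more often the 10th word occurs than the 20th, 100th, and 1000th words.
ratios = {rank: expected[10] / expected[rank] for rank in (20, 100, 1000)}

print(expected)
print(ratios)
```

Under these assumptions, a word at rank 1000 is still expected about 6,000 times, whereas a word at rank 100,000 is expected only about 60 times, which is one way to see why only the very rare words remain thinly attested.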

Page 17: lexicographic evidence

Different texts, different styles

However large a corpus may be, if its words are taken from only a limited area (for instance, from newspapers), they cannot represent all aspects of the language, and the results may be misleading. For instance, the word party will most frequently occur in the sense of a political organization rather than a social event. A corpus consisting of a single type of text will reflect only the stylistic and subject-matter features of that particular genre. It will be, as corpus linguists say, a ‘skewed’ corpus. Therefore, the corpus should include different texts and different styles.

Page 18: lexicographic evidence

Can a Corpus be Representative?

The standard way of avoiding bias is to collect a ‘random sample’. Yet random sampling may not represent the language well. One partial solution is to apply stratified sampling. This involves breaking up the total population into a number of subcategories or types, then creating independent random samples from each of these groupings. But this immediately raises two questions:

1- How do we define these subcategories?

2- How do we decide what proportions of each subcategory the corpus should include?
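
Stratified sampling as described here can be sketched in a few lines. The category names, the proportions, and the function name below are all hypothetical; choosing them is exactly the open problem the two questions raise, so the values are placeholders rather than recommendations.

```python
import random


def stratified_sample(texts_by_category, proportions, total, seed=0):
    """Draw an independent random sample from each category.

    texts_by_category: dict mapping category name -> list of text identifiers
    proportions: dict mapping category name -> share of the sample (shares sum to 1)
    total: desired number of texts overall
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for category, texts in texts_by_category.items():
        wanted = round(total * proportions[category])
        # A category cannot contribute more texts than it holds.
        sample[category] = rng.sample(texts, min(wanted, len(texts)))
    return sample


# Hypothetical population and proportions, for illustration only.
population = {
    "newspapers": [f"news_{i}" for i in range(50)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "conversation": [f"speech_{i}" for i in range(20)],
}
shares = {"newspapers": 0.4, "fiction": 0.3, "conversation": 0.3}
chosen = stratified_sample(population, shares, total=20)
```

Each category is sampled separately, so a large category such as newspapers cannot crowd out a small one, which is the bias that a single random sample over the whole population risks.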

Page 19: lexicographic evidence

It is almost impossible to define the population that the corpus should be representative of, and since the population is unlimited, it is logically impossible to establish ‘correct’ proportions of each component. An achievable objective should be “a balanced corpus”.

Page 20: lexicographic evidence

Selecting Texts

The corpus collection is usually recursive.

First, some texts from a range of sources are gathered.

Next, the texts are analyzed to identify recurring clusters of linguistic features.

This enables us to establish provisional categories of texts, grouped on the basis of shared linguistic features.

Then more texts are collected to reflect these feature distributions.

Then the analysis is repeated on the enlarged corpus, which now contains more texts.

The process thus proceeds in a cyclical fashion until we collect a large corpus whose contents reflect the proportions in which the various key features are observable in large bodies of text.

Page 21: lexicographic evidence

Spoken Data: A Special Case

With a corpus of spoken language, there are no obvious objective measures that can be used to define the target population. The spoken data should represent variables like gender, social class, age and religion. The conversations that form the corpus should reflect the diversity of the spoken language.

Page 22: lexicographic evidence

A Note on ‘Skewing’

Skewing refers to a form of bias in data whereby a particular feature is either over- or under-represented to a degree that distorts the general picture. As corpora grow larger, problems with skewing usually recede gradually.

Page 23: lexicographic evidence

There are some questions that should be answered before starting to form the corpus.

Language: Will the corpus be monolingual, bilingual, or multilingual?

Time: Will the corpus be synchronic or diachronic? In a synchronic corpus, the constituent texts come from one specific period of time, whereas the texts making up a diachronic corpus come from an extended period.

Mode: Will the corpus include written texts, spoken texts, or both? The status of chat room conversations, which have the characteristics of both spoken and written texts, is another point that requires attention in corpus formation.

Page 24: lexicographic evidence

Medium

Medium refers to the channel in which the text appears. A simple classification here would distinguish print media and spoken media. The former include books, newspapers, magazines, journals, dissertations, movie scripts, government documents and legal statutes. Spoken media include face-to-face conversations, broadcasts and podcasts, public meetings, and educational settings. Once again, traditional categories become blurred when we add the web to the mix. Some ‘new’ text types (blogs and social networking sites, for example) are exclusive to the web, but many documents exist in both print and electronic media.

Page 25: lexicographic evidence

Dealing with Sublanguages

When we think about the vocabulary of a language, it is useful to make a broad distinction between core usages and sublanguages. The word deuce is part of a sublanguage: it belongs to the vocabulary of tennis. A word like important, on the other hand, belongs to the core vocabulary of English. The following question arises at this point: will we include the sublanguages?

Page 26: lexicographic evidence

Collecting Written Data

In the past, the work of lexicographers was not so easy. Earlier corpora made extensive use of scanning and keyboarding, which were both slow and labour-intensive processes. Today it is possible to find various texts in digital form.

Page 27: lexicographic evidence

Collecting Spoken Data

Traditionally, spoken data has been difficult and expensive to collect. Consequently, although the majority of communicative events in a language occur in spoken mode, few corpora include high proportions of spoken material. For instance, only 10 per cent of the BNC is spoken. Nowadays, web-derived spoken data, which offers up-to-date material in large quantities and at low cost, begins to look like an attractive alternative.

Page 28: lexicographic evidence

Collecting Data from the Web

The question of ‘whether the web is a corpus’ is a hotly debated topic in language engineering circles. For lexicography, it is better to see the web as a source of texts from which a lexicographic corpus can be assembled.

Page 29: lexicographic evidence

Sample Size

There are arguments for using complete texts rather than extracts. In many registers, the discourse structure and rhetorical features of a text may vary as it proceeds from its opening paragraphs, through its central sections, to the concluding chapters. The BNC’s solution to this was to ensure that 40,000-word samples were taken variously from the beginning, middle, and end of its source documents.
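
One way to picture sampling from different parts of a document is sketched below. The slides do not say how the BNC chose the position for each document, so rotating through the three positions is an assumption made for this example, as are the function names.

```python
def take_sample(words, sample_size, position):
    """Return a contiguous run of words from the start, middle, or end of a text."""
    if len(words) <= sample_size:
        return list(words)  # short texts are used whole
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - sample_size) // 2
    elif position == "end":
        start = len(words) - sample_size
    else:
        raise ValueError(f"unknown position: {position}")
    return words[start:start + sample_size]


def rotate_positions(documents, sample_size):
    """Cycle through the three positions so that no part of a text is favoured (assumed policy)."""
    positions = ("beginning", "middle", "end")
    return [
        take_sample(words, sample_size, positions[i % 3])
        for i, words in enumerate(documents)
    ]
```

Here each document is a list of word tokens. Across many documents the rotation gives openings, central sections, and conclusions roughly equal weight, which addresses the concern that discourse structure varies through a text.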

Page 30: lexicographic evidence

Copyright and Permissions

Unless a corpus is made up of much older texts, most of its source material is likely to be protected by copyright. So corpus-builders should get permission from the copyright owners to include the documents in their corpus. This is not an easy task. It is one of the most time-consuming aspects of the project. It is recommended that corpus-builders never offer to pay for permission to include a text. Once money starts changing hands, a precedent would be established that could have fatal consequences for corpus-creation efforts worldwide.

Page 31: lexicographic evidence

Processing and Annotating the Data

To give the corpus its final form from its raw state, some operations are carried out.

Page 32: lexicographic evidence

Clean-up, standardization, and text encoding

This is essentially the process of taking a heterogeneous collection of input documents and converting them all to a standard, usable form. For instance, non-linguistic sounds in spoken data (like erm, ooh, mhm) and unusable texts in written data (like indexes, tables, diagrams) are not included in the corpus.
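
A small part of the clean-up step can be sketched as follows. The filler list contains only the three examples named on this slide, and the function names are hypothetical; a real project would need a fuller inventory and an agreed transcription standard.

```python
import re

# Filler inventory: only the three examples from the slide (erm, ooh, mhm).
FILLERS = {"erm", "ooh", "mhm"}


def strip_fillers(utterance):
    """Drop non-linguistic filler tokens from a transcribed utterance."""
    tokens = utterance.split()
    # Compare each token without surrounding punctuation and in lower case.
    kept = [t for t in tokens if t.strip(".,!?").lower() not in FILLERS]
    return " ".join(kept)


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()
```

For example, `strip_fillers("Erm, I think, mhm, it was Tuesday")` keeps only the words that carry linguistic content, and `normalize_whitespace` brings differently formatted inputs to one standard form.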

Page 33: lexicographic evidence

Documentation

This means providing each input text with a unique ‘header document’ which records its essential features. Headers typically give bibliographic information (title, author’s name, date and place of publication, and the like) and precisely locate each text in whatever typology is being used.
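
A header can be thought of as a small structured record attached to each text. In the sketch below the class name, the field names, and the `duplicate_ids` check are assumptions for illustration; the two typology slots simply reuse the mode and medium distinctions discussed earlier in this part.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextHeader:
    """Hypothetical header record for one corpus text."""
    text_id: str            # unique identifier (assumed naming scheme)
    title: str
    author: str
    publication_date: str
    publication_place: str
    mode: str               # typology slot, e.g. "written" or "spoken"
    medium: str             # typology slot, e.g. "book" or "broadcast"


def duplicate_ids(headers):
    """Return the text_ids that occur more than once, in sorted order."""
    counts = Counter(h.text_id for h in headers)
    return sorted(text_id for text_id, n in counts.items() if n > 1)
```

Checking for duplicate identifiers is one simple way to enforce the requirement that every header be unique.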

Page 34: lexicographic evidence

Linguistic Annotation

This means enriching raw text by adding grammatical information which will enable corpus users to frame sophisticated queries and extract maximum benefit from the data. For instance, She is tagged as a personal pronoun, and Really is tagged as a general adverb. A well-tagged corpus allows us to focus on each pattern in turn and view a manageable number of examples.
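
The value of tagging for querying can be shown with a tiny hand-tagged sentence. The tag labels below are plain-English stand-ins rather than the codes of any real tagset, the sentence is invented, and both helper functions are hypothetical.

```python
# A hand-tagged sentence; the labels are illustrative, not a real tagset.
tagged_sentence = [
    ("She", "personal_pronoun"),
    ("really", "general_adverb"),
    ("likes", "verb"),
    ("old", "adjective"),
    ("dictionaries", "noun"),
]


def words_with_tag(tagged_tokens, tag):
    """Return every word carrying the given tag."""
    return [word for word, t in tagged_tokens if t == tag]


def tag_bigrams(tagged_tokens, first_tag, second_tag):
    """Return adjacent word pairs whose tags match the two-tag pattern."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in pairs
        if t1 == first_tag and t2 == second_tag
    ]
```

A query such as `tag_bigrams(tagged_sentence, "general_adverb", "verb")` picks out one grammatical pattern at a time, which is what lets a lexicographer view a manageable number of examples instead of every occurrence of a word.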

Page 35: lexicographic evidence

Final Thoughts

In this part, a methodology for building a corpus for use in lexicography has been outlined. This is certainly a difficult task, and there is no perfect corpus, since language is diverse and dynamic. The aim is to form a balanced, standardized, well-tagged corpus. For many kinds of research, a corpus with meticulously detailed headers and fine-grained linguistic annotation is precisely what is needed.

Page 36: lexicographic evidence

Summary (translated from Turkish): Lexicographic Evidence

This part has explained how the data that will serve as the source for a planned dictionary are designed, collected, and processed. First of all, the importance of the dictionary compilers’ own vocabulary should be emphasized. However, it is certain that an individual’s vocabulary, no matter how extensive, is not sufficient for such a work. In the past, the most frequently used data collection method was the citation method. Today, although this method has lost some of its importance, it is still in use. Indeed, special programmes have been developed for collecting data from volunteers over the internet. With the development of computer and internet technologies, the corpus method has become the most widely used method of data collection. Throughout the part, it has been emphasized that those who prepare a corpus face many questions and that their work is very difficult. Language is diverse and dynamic in structure, so a perfect corpus cannot be expected. The lexicographers’ aim should be to build a corpus that is balanced, that best represents the language in use, that takes into account the variables arising from the people who use the language and the settings in which it is used, and that is organized in a way that is useful for preparing a dictionary.