1986 Neuhaus Shakespeare Lexical Database


Lexical Database Design: The Shakespeare Dictionary Model

H. Joachim Neuhaus
Westfälische Wilhelms-Universität, FB 12
D-4400 Münster, West Germany

1. The Data

The Shakespeare Dictionary (SHAD) project has been using structured databases since 1983. The system is implemented on a PRIME 250-II computer using standard CODASYL-DBMS software and related tools. The project has been able to draw on a vast repository of computerized material dealing with Shakespeare and the English lexicon. Initially, it was part of the "Sonderforschungsbereich 100 Elektronische Sprachforschung" sponsored on the national level by the Deutsche Forschungsgemeinschaft. The research team has been directed by Marvin Spevack and H. Joachim Neuhaus, now both at Münster, and Thomas Finkenstaedt, now at Augsburg. Spevack's Complete and Systematic Concordance to the Works of Shakespeare (Hildesheim and New York, 1968-1978) and Finkenstaedt's Chronological English Dictionary (Heidelberg, 1970), both in machine readable form, were used in a computer-assisted lemmatization procedure (Spevack, Neuhaus, and Finkenstaedt 1974).

A chronologically arranged dictionary, where entries are sorted according to the year of first occurrence, makes it possible to "stop" the development of the recorded English vocabulary at any desired moment and to compare, for instance, Shakespeare's vocabulary with the corpus of English words recorded up to 1623, when the First Folio appeared (Neuhaus 1978). The set of words in Shakespeare can be compared with the complement set of words available in Elizabethan English, but not attested in Shakespeare's works. In this way there is a systematic integration into the total vocabulary. As a result, our database model can easily be expanded or transferred to cover larger or different vocabularies.

In order to present the complete Shakespearean vocabulary and to disengage SHAD from dependence on a single edition of Shakespeare, the data were expanded to include all stage directions and speech-prefixes in all quartos up to and including the First Folio (Volume VII of the Complete and Systematic Concordance to the Works of Shakespeare), and the "bad" quartos (Volume VIII). Volume IX presents all substantive variants, producing a composite Shakespearean vocabulary in modern and eventually old spelling.

In analysing this material a strict differentiation between vocabulary level and text level has been observed. Further data-preparation on the vocabulary level concentrated on formal properties of Shakespearean lemmata, such as morphological structure, or etymological background. There is a complete morphology for all lemmata (ca. 20,000 records), which gives detailed structural descriptions of derivations, compounds, and other combinations, as well as all inflected word-forms, as they occur in the text. The etymological data include word histories and loan relations, again supplemented by chronological data. Content-oriented criteria were used in a taxonomic classification of all lemmata (Spevack 1977). On the whole, there are more than thirty fields of information in the original lemma-record file. For the multidimensional analysis and presentation of these resources it seemed natural to use database concepts.
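The chronological comparison described in this section, "stopping" the recorded vocabulary at a given year and splitting it into words attested in the author and the complement set, can be illustrated with ordinary set operations. The following is only a minimal sketch: the function names are hypothetical, and the lemma names and years are placeholders, not SHAD data.

```python
def vocabulary_up_to(first_attested, year):
    """Return the lemmata whose first recorded occurrence is not later than `year`."""
    return {lemma for lemma, first_year in first_attested.items() if first_year <= year}


def compare_with_author(first_attested, author_lemmata, year):
    """Split the vocabulary recorded up to `year` into the part attested
    in the author's works and the complement set."""
    recorded = vocabulary_up_to(first_attested, year)
    attested = recorded & author_lemmata
    complement = recorded - author_lemmata
    return attested, complement


# Placeholder data (assumption): lemma -> year of first recorded occurrence.
first_attested = {"alpha": 1225, "beta": 1374, "gamma": 1598, "delta": 1604, "epsilon": 1650}
# Placeholder data (assumption): lemmata attested in the author's works.
author_lemmata = {"alpha", "gamma", "delta"}

attested, complement = compare_with_author(first_attested, author_lemmata, 1623)
print(sorted(attested))    # ['alpha', 'delta', 'gamma']
print(sorted(complement))  # ['beta']
```

Choosing a different cut-off year only changes the argument; the stored data stay the same, which is the point of keeping the dating as a field of the record.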

Due to a special intervention of the Deutsche Forschungsgemeinschaft and the support of the Ministry, which we both gratefully acknowledge, we could implement our first database in 1983 on a newly installed PRIME 250-II computer. The PRIME DBMS software, which we use, is actually one of the first commercial products which closely adhered to the CODASYL network data model.

The design started with a database schema for Shakespearean word-formation and etymology. Since then the system has grown steadily, including now a thesaurus structure and a link to the text itself. The database is accessed in batch mode using the FORTRAN and COBOL interfaces, and interactively with the VISTA query language and report generator. Of course, in a first implementation not only the database schema itself, but the preparation of files, and the programming of the database creation job have to be carried out. The first word-formation database was established in three separate steps. The total time needed to complete the job was about 17 hours. Physical design is especially important in large databases. Our Münster team was interested in that aspect from the very beginning (Döge 1984).

2. Preliminary Design Considerations

Linguists and lexicographers are latecomers to the field of database applications. Database software has been available since the early 1960's. The early 1970's brought a wide variety of commercial products and a consolidation on the conceptual side, which ultimately led to standardization, design philosophies, and specifications of "normal forms". At that time lexicographers still used the concept of an archive when talking about new technologies, such as Barnhart (1973), Chapman (1973), and Lehmann (1973) at the 1972 International Conference on Lexicography in English. Similarly, in the late 1970's, we witnessed preparations for a Stanford Computer Archive of Language Materials. There is nothing wrong with the idea of an archive. But a database is


something different. By now, the expression "database" should only be used as a technical term. Perhaps "data bank" may be used instead of "database" when talking about files of data, or archives in a conventional sense. The Association for Literary and Linguistic Computing may have had this clarification in mind when naming its specialist group "Structured Data Bases".

Although hierarchical data models and network models had been available since the early 1960s, and relational architectures since the early 1970s (Codd 1970), software implementations were not generally accessible in university computing centres due to high cost, and lack of special support. Although the Münster computing centre had the hierarchical IMS software, a product of IBM, it was not made available for our project. Looking back from today, that may not have been a handicap for at least two reasons: lexical relationships are only rarely hierarchical in a natural sense, and, more importantly, hierarchical systems do not have a common standard. There is no migration path from one software product to another. Since a Shakespeare database will have a rather long life cycle, and was meant to be a model for similar projects, the requirement of a standard model seemed to be imperative. The process of standardization has been proceeding more rapidly for the CODASYL network model than for any other architecture. In the early 1980s there was just this model that fulfilled our requirements, and this is basically true even today.

Beginning with the early 1980's lexical symposia and conferences had an ample share of papers reporting on ongoing research which used the database concept in a variety of ways. In 1981 Nagao et al. reported on "An Attempt to Computerize Dictionary Data Bases" (1982). At the same conference a University of Bonn group (Brustkern and Hess 1982) presented "The Bonnlex Lexicon System", which two years later evolved into a "Cumulated Word Data Base for the German Language" (Brustkern and Schulze 1983). A list of similar projects could easily be extended. One might have expected that the logical design of lexical databases would have built on structural linguistics, where we typically find entities and relationships, and in general, set theoretic notions, which can directly be translated into conceptual data-structures.

Surprisingly, in many designs, linguistic considerations did not seem to have played a major role. Instead, the authors simulate conventional lay-out and typesetting arrangements of printed dictionaries. An example is the widespread dictionary usage to print one "Headword" in bold type and then use special symbols, such as the tilde, to refer to the headword, or parts of it, thus saving space for the treatment of further lexical items with the same spelling. Nagao et al. (1982) very faithfully transferred this and other lay-out details into their design. But should a conventional "Headword" and its dependencies be a serious candidate for a database entity? Are the reasons that led dictionary publishers to accept certain lay-out techniques at all relevant for an electronic database? These questions seem not to have been raised. The design seems to have


[Figure 1: network diagram. Record types: SYSTEM, MORPHEME, ALLOMORPH, SEGMENT, LEMMA. Set types: System-to-Morpheme, System-to-Allomorph, Morpheme-to-Allomorph, Allomorph-to-Segment, System-to-Lemma, Lemma-to-Segment.]

Figure 1. Data-Structure for Morphological Families (SHAD, database fragment)

become a paradigm case of an imitation design, where a new technology replicates design features of an older technology. The basic misunderstanding is the false identification of a mere presentation in a printed dictionary with an underlying lexical information structure.

If the "Headword" is not a relevant database entity, which entity should be taken instead? There is only one serious candidate: the lemma. The lemma is a well defined linguistic notion. It is also well known in computational work due to various automatic or semi-automatic lemmatization algorithms. It is an abstract notion in the sense that printed dictionaries and database systems need a lemma-name to refer to it. Language specific conventions usually govern the choice of a lemma-name. Latin verbs, for example, are customarily lemmatized using the first person singular present form as lemma-name. A lemma is the set of all its inflected word-forms. It thus comprises a complete inflectional paradigm. Some lemmata have defective paradigms or suppletive paradigms. Conventional dictionaries quite often include paradigmatic information in their front matter. The user has to relate specific cases to these examples. A database can relate these explicitly. A natural way to do this is by a one-to-many relationship between lemma and word-form. In an author dictionary word-forms will be further related to the text, and its internal structure.

A machine-readable dictionary is just a starting point for a structured lexical database. In the Bonn "Word Data Base for the German Language" (Brustkern and Schulze 1983b) there is but one database entity, "Lexical Entry", which seems to correspond to the lemma rather than to a "Headword". The authors speak about the "microstructure" and the "macrostructure" in respect to
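The one-to-many relationship between lemma and word-form mentioned above might be sketched as follows. This is a hedged illustration in Python, not the CODASYL schema used in SHAD: the class `LemmaStore` and its attribute names are hypothetical, and the inflected forms listed are ordinary modern English inflections chosen for illustration, not records taken from the SHAD files.

```python
from collections import defaultdict


class LemmaStore:
    """Toy owner/member structure: each lemma owns its word-form records."""

    def __init__(self):
        # (lemma-name, word class) -> word-forms owned by that lemma
        self.forms_of = defaultdict(list)
        # spelling -> every lemma that owns a word-form with this spelling
        self.lemmata_of = defaultdict(list)

    def add(self, lemma_name, word_class, forms):
        key = (lemma_name, word_class)
        for form in forms:
            self.forms_of[key].append(form)
            self.lemmata_of[form].append(key)


store = LemmaStore()
# The lemma-name alone is ambiguous: verb and noun are distinct lemmata.
store.add("trust", "vb.", ["trust", "trusts", "trusted", "trusting"])
store.add("trust", "n.", ["trust", "trusts"])

print(store.forms_of[("trust", "vb.")])  # ['trust', 'trusts', 'trusted', 'trusting']
print(store.lemmata_of["trusted"])       # [('trust', 'vb.')]
print(store.lemmata_of["trust"])         # [('trust', 'vb.'), ('trust', 'n.')]
```

Each word-form record has exactly one owner here. A spelling shared by two lemmata simply appears as two member records, one per owner, so the paradigm of each lemma stays explicit.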


Class  Lemma             Frequency  Dating

Family trow
vb.    trow                     17  Old Eng.
n.     troth                   111  1175
vb.    betroth                  12  1303
adj.   troth-plight              2  1330
n.     troth-plight              1  1513
pp.    new-trothed               1  1598
pp.    fair-betrothed            1  1607

Family truce
n.     truce                    15  1225

Family true
adj.   true                    849  Old Eng.
adv.   truly                   180  Old Eng.
n.     truth                   361  Old Eng.
adj.   untrue                    7  Old Eng.
n.     untruth                   4  Old Eng.
n.     true-love                10  800
n.     true                     36  1300
pp.    true-hearted              3  1471
pp.    truer-hearted             1  1471
n.     truepenny                 1  1519
pp.    true-born                 2  1589
pp.    true-anointed             1  1590
pp.    true-derived              1  1592
pp.    true-disposing            1  1592
pp.    true-divining             1  1593
pp.    true-telling              1  1593
pp.    true-devoted              1  1594
adj.   honest-true               1  1596
pp.    true-begotten             1  1596
pp.    true-bred                 3  1596
pp.    true-fixed                1  1599
pp.    true-meant                1  1604

Family trust
n.     trust                     1  1225
vb.    trust                    22  1225
adj.   trusty                   21  1225
n.     mistrust                  9  1374
vb.    mistrust                 14  1374
vb.    distrust                  3  1430
n.     distrust                  3  1513
adj.   mistrustful               2  1529
adj.   trustless                 1  1530
n.     truster                   2  1537
n.     self-trust                1  1588
adj.   distrustful               1  1589

Figure 4. Etymological Grouping of four Morphological Families

References

Barnhart, Clarence L. "Plan for a Central Archive for Lexicography in English." In Annals of the New York Academy of Sciences, No. 211 (1973), pp. 302-306.

Brustkern, J. and K. H. Hess. "The Bonnlex Lexicon System." In Lexicography in the Electronic Age. Ed. J. Goetschalckx and L. Rolling. Amsterdam: North-Holland, 1982, pp. 38-40.

Brustkern, J. and W. Schulze. "Towards a Cumulated Word Data Base for the German Language." Proc. Sixth International Conference on Computers and the Humanities. 6-8 June 1983. Raleigh, North Carolina.

-----. "The Structure of the Word Data Base for the German Language." Proc. International Conference on Data Bases in the Humanities and Social Sciences. 10-12 June 1983. New Brunswick, New Jersey.

Chapman, Robert L. "On Collecting for the Central Archive." In Annals of the New York Academy of Sciences, No. 211 (1973), pp. 307-311.

Codd, E. F. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13.6 (1970), 377-387.

Döge, Michael. Probleme eines CODASYL-Datenbanksystems, dargestellt am Beispiel des DBMS-Software-Paketes der Firma PRIME. Münster, 1984.

Finkenstaedt, Thomas. A Chronological English Dictionary. Listing 80,000 Words in Order of their Earliest Known Occurrence. Heidelberg, 1970 (with Ernst Leisi, Dieter Wolff).

Lehmann, W. P. "On the Design of a Central Archive for Lexicography in English." In Annals of the New York Academy of Sciences, No. 211 (1973), pp. 312-317.

Nagao, M. et al. "An Attempt to Computerize Dictionary Data Bases." In Lexicography in the Electronic Age. Ed. J. Goetschalckx and L. Rolling. Amsterdam: North-Holland, 1982, pp. 51-78.

Neuhaus, H. Joachim. "Author Vocabularies compared with Chronological Dictionaries." Bulletin of the Association for Literary and Linguistic Computing, 6 (1978), 15-19.

-----. "Design Options for a Lexical Database of Old English." Problems of Old English Lexicography. Ed. Alfred Bammesberger. Eichstätter Beiträge 15. Regensburg, 1985, 197-210.

Spevack, Marvin. A Complete and Systematic Concordance to the Works of Shakespeare. 9 volumes. Hildesheim, 1968-1980.

-----. "SHAD: A Shakespeare Dictionary." Computers in the Humanities. Ed. J. L. Mitchell. Edinburgh, 1974, 111-123. (with Th. Finkenstaedt, H. J. Neuhaus)

-----. "SHAD (A Shakespeare Dictionary). Toward a Taxonomic Classification of the Shakespeare Corpus." Computing in the Humanities. Proceedings of the Third International Conference on Computing in the Humanities. Ed. Serge Lusignan and John S. North. Waterloo, Ontario, 1977, 107-114.