SPOKEN LANGUAGE CORPUS PROJECT SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES

DESCRIPTION

SPOKEN LANGUAGE CORPUS PROJECT. SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES. The Asmara Declaration – Rusandre. What’s the point of spoken language corpora? – Jens. Overview of the project and its phases – Rusandre. The recording phase – Jens/Mmem - PowerPoint PPT Presentation

  • SPOKEN LANGUAGE CORPUS PROJECT
    SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES

  • Workshop Overview
    The Asmara Declaration - Rusandre
    What's the point of spoken language corpora? - Jens
    Overview of the project and its phases - Rusandre
    The recording phase - Jens/Mmem
    The transcription phase - Jens
    The checking phase - Jens
    The tagging phase - Leif/Rusandre
    Research output - Jens

  • THE ASMARA DECLARATION - 2000
    Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
    All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.

  • ASMARA DECLARATION - CNTD
    Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
    The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.

  • What's the point of spoken language corpora?
    Jens Allwood

    Corpus linguistics / Armchair linguistics

  • PROJECT MANAGEMENT

  • OBJECTIVES
    To develop a platform of computer supported basic linguistic resources for the previously disadvantaged languages of SA
    The resources will be in the form of
      archived audio-visual recordings of activity-based natural language use;
      machine-readable transcriptions of recordings for corpus-driven searches;
      morphologically tagged corpora for corpus-based searches.

  • PROJECT PHASES
    2002 - 2004
    Ongoing
    Audio-video recordings of activity-based spoken language use (min. 200 hrs p/l).
    Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
    Checking and editing of transcriptions.
    Manual morphological tagging of corpora.
    Automated tagging of corpora.
    Research outputs.

  • The recording phase
    What to record
    Activity types
    What to think about when recording natural language dialogues
    Keep it natural
    The video camera, microphone, etc
    Keep the camera fixed!

  • Recording and transcription
    Practical exercise!

    A short recording
    Transcribe together

  • Transcription Structure
    Header (background information about transcription and recorded activity)
    Body (the actual transcription consisting of two kinds of elements)
      Contributions (transcribed utterances of participants in the recorded activity)
      Information lines (mark various peculiar aspects of the contributions and recorded activity)

  • Example of a header
    @ Recorded activity ID: V010501
    @ Activity type: Informal conversation
    @ Recorded activity title: Getting to know each other
    @ Recorded activity date: 20020725
    @ Recorder: Britta Zawada
    @ Participant: A = F2 (Lunga)
    @ Participant: B = F1 (Bukiwe)
    @ Transcriber: Mvuyisi Siwisa
    @ Transcription date: 20020805
    @ Checker: Rusandre Hendrikse
    @ Checking date: 20020912
    @ Anonymised: No
    @ Activity Medium: face-to-face
    @ Activity duration: 00:44:30
    @ Other time coding: Each section
    @ Tape: V0105
    @ Section: Family affairs
    @ Section: Crime
    @ Section: Unemployment
    @ Section: Closing
    @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
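
Since the header is machine-readable, a small parser can turn it into a lookup table. The following is a minimal sketch, not the project's own tooling: the function name `parse_header` is hypothetical, and it assumes one `@ Name: value` field per line, with fields such as Participant and Section allowed to repeat.

```python
from collections import defaultdict

# A subset of the header lines from the example above.
HEADER_LINES = [
    "@ Recorded activity ID: V010501",
    "@ Activity type: Informal conversation",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Activity duration: 00:44:30",
    "@ Section: Family affairs",
    "@ Section: Crime",
]


def parse_header(lines):
    """Map each header field name to the list of values given for it."""
    fields = defaultdict(list)
    for line in lines:
        if not line.startswith("@"):
            continue  # skip anything that is not a header line
        # Split on the first colon only: values such as 00:44:30 contain colons.
        name, _, value = line[1:].partition(":")
        fields[name.strip()].append(value.strip())
    return dict(fields)


header = parse_header(HEADER_LINES)
print(header["Participant"])
print(header["Section"])
```

Every field maps to a list, so single-valued and repeated fields can be read the same way.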

  • Transcription header
    @ Recorded activity ID: V010501
      V = Video, 01 = project number
      05 = Tape number within this project
      01 = Recording number

    @ Activity type: Informal conversation

    @ Recorded activity title: Getting to know each other
    @ Recorded activity date: 20020725
    @ Recorder: Britta Zawada
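
The breakdown of the recorded activity ID can be sketched in code. This is an illustration under stated assumptions: the fixed layout (one letter followed by three two-digit fields) is inferred from the single example V010501, only the letter V is documented here, and the name `decode_activity_id` is hypothetical.

```python
import re

# Assumed layout, inferred from V010501: one letter, then three 2-digit fields.
ACTIVITY_ID = re.compile(r"([A-Z])(\d{2})(\d{2})(\d{2})")

# Only "V = Video" is given in the slides; other letters are left undecoded.
MEDIA = {"V": "Video"}


def decode_activity_id(activity_id):
    """Split a recorded activity ID into medium, project, tape and recording."""
    match = ACTIVITY_ID.fullmatch(activity_id)
    if match is None:
        raise ValueError(f"unexpected activity ID format: {activity_id!r}")
    medium, project, tape, recording = match.groups()
    return {
        "medium": MEDIA.get(medium, medium),
        "project": int(project),
        "tape": int(tape),
        "recording": int(recording),
        # The first five characters correspond to the Tape field (V0105).
        "tape_id": activity_id[:5],
    }


print(decode_activity_id("V010501"))
```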

  • Transcription header, cont
    @ Participant: A = F2 (Lunga)
    @ Participant: B = F1 (Bukiwe)
      F stands for female
      F1 is unique for Bukiwe in the entire corpus
      A and B are IDs for the participants
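
A participant value combines three pieces of information: the ID used within this transcription, the corpus-wide ID, and the name. The sketch below separates them. It assumes the corpus-wide ID is always one letter followed by a number, as in F1 and F2; the name `parse_participant` is hypothetical.

```python
import re

# Assumed shape, based on "A = F2 (Lunga)": local ID = letter + number (name).
PARTICIPANT = re.compile(r"(\w+)\s*=\s*([A-Z])(\d+)\s*\((.+)\)")


def parse_participant(value):
    """Split a Participant header value into its parts."""
    match = PARTICIPANT.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unexpected participant format: {value!r}")
    local_id, sex, number, name = match.groups()
    return {
        "local_id": local_id,        # ID within this transcription (A, B)
        "corpus_id": sex + number,   # unique in the entire corpus (F1, F2)
        "sex": sex,                  # F stands for female
        "name": name,
    }


print(parse_participant("A = F2 (Lunga)"))
print(parse_participant("B = F1 (Bukiwe)"))
```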

  • Transcription header, cont
    @ Transcriber: Mvuyisi Siwisa
    @ Transcription date: 20020805

    @ Checker: Rusandre Hendrikse
    @ Checking date: 20020912

  • Transcription header, cont
    @ Anonymised: No
      Indicates whether personal names, etc. have been changed to pseudonyms (Yes) or not (No), both in the header and in the conversation

    @ Activity Medium: face-to-face
      Normally spoken, face to face, but could also have other values, like telephone conversations.

  • Transcription header, cont
    @ Activity duration: 00:44:30
      Duration in hours, minutes and seconds

    @ Other time coding: Each section
      There is a time line for each section

    @ Tape: V0105
      This is a part of the recorded activity ID
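
The duration and date fields can be converted to standard time values for sorting or arithmetic. This is a minimal sketch: the duration format (hours, minutes, seconds) is stated above, while reading the date fields as YYYYMMDD is an assumption based on values such as 20020725. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_duration(value):
    """Convert an hh:mm:ss duration such as 00:44:30 to a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_date(value):
    """Convert a date field, assumed to be YYYYMMDD, to a date object."""
    return datetime.strptime(value, "%Y%m%d").date()


duration = parse_duration("00:44:30")
recorded = parse_date("20020725")
transcribed = parse_date("20020805")

print(duration.total_seconds())      # length of the recording in seconds
print((transcribed - recorded).days) # days between recording and transcription
```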

  • Transcription header, cont
    @ Section: Family affairs
    @ Section: Crime
    @ Section: Unemployment
    @ Section: Closing

    @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
      Any relevant information that is not covered by any of the required headings

  • The body
    This is the actual transcription - the background information is in the header
    Four kinds of lines:
      Contribution: $A: uyakhonza kanene
      Information line: @ < nod >
      Section line: At office
      Time line: # 00:10:00
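
The four kinds of body lines can be told apart by their first character. The sketch below does so under one explicit assumption: no marker is visible for section lines in the example, so any non-empty line that does not start with `$`, `@` or `#` is treated as a section line. The name `classify_line` is hypothetical.

```python
# First character of each marked line kind, as shown in the examples above.
PREFIXES = {"$": "contribution", "@": "information", "#": "time"}


def classify_line(line):
    """Return the kind of a body line based on its first character."""
    text = line.strip()
    if not text:
        return "blank"
    # Assumption: unmarked lines are section lines.
    return PREFIXES.get(text[0], "section")


for example in ["$A: uyakhonza kanene", "@ < nod >", "At office", "# 00:10:00"]:
    print(classify_line(example), "|", example)
```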

  • Sections
    Family affairs
    $B: sibabini kuphela esibabalwe sada safunda ke noko sakwazi ukuphangela sikwazi ke noko kuba ndinobhuti wam osebenzayo...
    Religion
    $B: uyakhonza kanene
    $A: ndiyakhonza owu ndiyamthand{a} [4 uthixo ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
    $B: [4 nantso ke sisi e: e: ]4
    $B: nantso ke into efunekayo uthixo ulithemba lethu [5 uthixo ulithemba lethu ulixhadi lethu ]5 uligwiba
    $A: [5 ulixhadi lethu ulixhadi lethu]5
    $B: [6 uligwiba andazi ukuba ndingangendithini ngendiphi na xa uthixo heyi ]6

    Situation on their arrival at Medunsa
    $A: [6 ucinga ukuba ngesiphi na ngesisemedunsa ]6
    $B: uye wasithatha khona waza kusibeka kule ndawo...

  • Contributions
    Religion
    $B: uyakhonza kanene
    $A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
    @ < name: God's name >

  • Overlaps
    Religion
    $B: uyakhonza kanene
    $A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
    $B: [4 nantso ke sisi // e: e: ]4
    @ < name >
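
In the example above, the stretches that the two speakers produce at the same time appear to be enclosed in brackets carrying the same number, `[4 ... ]4`. On that reading, the overlapping stretches can be collected per number. This is a rough sketch with a hypothetical function name, not a full parser for the transcription format.

```python
import re
from collections import defaultdict

# "[n text ]n": the closing bracket repeats the number of the opening bracket.
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\](?:\1)(?!\d)")

CONTRIBUTIONS = [
    "$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze "
    "ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela",
    "$B: [4 nantso ke sisi // e: e: ]4",
]


def collect_overlaps(contributions):
    """Group overlapped stretches by overlap number as (speaker, text) pairs."""
    overlaps = defaultdict(list)
    for line in contributions:
        # Split off the speaker label at the first colon.
        speaker, _, utterance = line.partition(":")
        for number, text in OVERLAP.findall(utterance):
            overlaps[int(number)].append((speaker.lstrip("$").strip(), text))
    return dict(overlaps)


overlaps = collect_overlaps(CONTRIBUTIONS)
for speaker, text in overlaps[4]:
    print(speaker, "->", text)
```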

  • Contrastive stress, pauses and lengthening
    $B: abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda ke la meko yokungabikho mzali uqhubayo / uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi // kodwa ke // andigxeki nto kuba ke / ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
    $A: [2 ya / m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
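
The three phenomena in the slide title can be picked out of a contribution token by token. The sketch below rests on assumptions read off the example rather than stated in it: capitals (ABAZANGE) mark contrastive stress, a trailing colon (nje:) marks lengthening, and slashes mark pauses, with more slashes taken to mean a longer pause. The name `mark_prosody` is hypothetical.

```python
def mark_prosody(utterance):
    """List stress, pause and lengthening markers found in an utterance."""
    features = []
    for token in utterance.split():
        if token.startswith("$"):
            continue  # speaker label such as "$B:", not part of the speech
        if set(token) == {"/"}:
            features.append(("pause", len(token)))   # "/" -> 1, "//" -> 2
        elif token.isupper() and len(token) > 1:
            features.append(("stress", token))
        elif token.endswith(":"):
            features.append(("lengthening", token))
    return features


sample = "$B: abanye ke bazihlalele nje: / abanye ABAZANGE bafune sikolo // uyayiqonda"
for feature in mark_prosody(sample):
    print(feature)
```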

  • Unclear speech and glottal stop
    $M: loo nto ke njengo{ku}ba sekunyanzeleke ukuba ndiye phaya nje (...) ndikwazi ukuncedisa phaya ndiyiphushile ukwenzela ukuba ndibe neclaim endizakuba nayo that is why ndithole because ndiyaclaimer so that at least uba ndiclayimile ndikwazi ukuhamba
    $T: ke ngoku ke yenye yezinto endifuna ukuyoyenza
    $M: ngolwesithathu (what she said to me ngoku bendiphaya ngecawe) besingcwaba umfazi kasicaka jama
    $T: ee andekufuni ukutya

  • Comment Lines
    $A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa
    @ < loan English: bag >
    @ < gesture: hand wipes >
    $B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu < >
    @ < name: clan name >
    @ < comment: A drops her book >
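
In the example above, each `< ... >` stretch in a contribution is followed by one `@ < kind: note >` line, in the same order. Assuming that order-based pairing holds in general (the slides do not state it), the comments can be linked back to the words they describe. The name `pair_comments` is hypothetical.

```python
import re

# Text between angle brackets, with surrounding spaces removed.
SPAN = re.compile(r"<\s*(.*?)\s*>")

LINES = [
    "$A: kunetha imvula sinemithwalo engaka < yebhegi > < yho yho yho > nako sisa",
    "@ < loan English: bag >",
    "@ < gesture: hand wipes >",
    "$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > "
    "ukuba wayengekho ngesasitheni na asazi mntu < >",
    "@ < name: clan name >",
    "@ < comment: A drops her book >",
]


def pair_comments(lines):
    """Pair each marked stretch with the comment line that follows it."""
    pairs = []
    pending = []  # marked stretches of the latest contribution, in order
    for line in lines:
        if line.startswith("$"):
            pending = SPAN.findall(line)
        elif line.startswith("@"):
            found = SPAN.search(line)
            if found is None:
                continue  # information line without a bracketed comment
            kind, _, note = found.group(1).partition(":")
            marked = pending.pop(0) if pending else ""
            pairs.append((marked, kind.strip(), note.strip()))
    return pairs


for pair in pair_comments(LINES):
    print(pair)
```

An empty `< >` in the contribution yields an empty marked stretch, which fits comments on events rather than words.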

  • Research output
    Jens Allwood
    A distributed database (corpus)
    Networks (homepages)
    Spoken language corpus activities (seminars, workshops)

  • TAGGING SPOKEN LANGUAGE SAMPLES
    PROBLEMATIC ISSUES CONVENTIONS & STANDARDS
    A P Hendrikse 16/03/04

  • PROBLEMATIC ISSUES
    Loans and codeswitching
    Fixed expressions
    Spoken language reductions
    Morphophonological issues
    Designing a tag set
    Manual tagging
    A drag-and-drop tagger
    Automated tagging

  • Loans and Codeswitching
    Non-indigenised codeswitching
      ndifuna
    Indigenised but non-standardised codeswitching loans >
      ndiyakleyimisha? ndiyaklayimisha?
      ndiyafonisha? ndiyafowunisha?

  • Fixed Expressions
    A continuum:
      Idioms/proverbs prefabricated expressions collocations
    How fixed is fixed?
      Into yokuba (*izinto zokuba)
      Nantso ke (*nantsi ke?)
      (Ke) kaloku (ke)
      Bafondini/mfondini
      Undincedile
      Ungadinwa nangomso

  • Fixed Expressions cntd
    Flagging fixed phrases
      Into_yokuba
      Ke_kaloku_ke
    Morphosyntactic tagging or not?
      Ke_kaloku_ke
      Or
      Ke_kaloku_ke
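
Flagging fixed phrases by joining their words with underscores can be sketched as a lookup against a phrase list. This is an illustration only: the phrase list is a small sample taken from the slides, the longest-match-first strategy is an assumption, and optional elements such as "(Ke) kaloku (ke)" are not handled. The name `flag_fixed_phrases` is hypothetical.

```python
# Sample phrases from the slides, stored as lower-case word tuples.
FIXED_PHRASES = [
    ("ke", "kaloku", "ke"),
    ("into", "yokuba"),
    ("nantso", "ke"),
]


def flag_fixed_phrases(tokens, phrases=FIXED_PHRASES):
    """Join the words of known fixed phrases with underscores."""
    ordered = sorted(phrases, key=len, reverse=True)  # try longest phrases first
    flagged = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            window = tokens[i:i + len(phrase)]
            if tuple(word.lower() for word in window) == phrase:
                flagged.append("_".join(window))
                i += len(phrase)
                break
        else:
            # No phrase starts here: keep the token unchanged.
            flagged.append(tokens[i])
            i += 1
    return flagged


print(flag_fixed_phrases("nantso ke into efunekayo uthixo ulithemba lethu".split()))
print(flag_fixed_phrases("Ke kaloku ke".split()))
```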

  • Spoken language reductions
    Standardised reductions
      Ngokuba > ngoba
      Written standard reduction: reconstruction convention {} not used, i.e. *ngo{ku}ba
    Non-standardised reductions
      Musa ukuhamba > sukuhamba (wsr) > Suhamba (non-standardised)

  • Spoken Reductions cntd
    Reconstruction convention
      S{uku}hamba
    Tagged
      S{uku}hamba
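
Read against transcript forms such as ndiyamthand{a} and njengo{ku}ba, the curly braces appear to enclose material that was not pronounced but is restored by the transcriber. On that assumption, both the spoken and the reconstructed form can be derived from one token. The function names below are hypothetical.

```python
import re

# Material inside curly braces: assumed to be reconstructed, not pronounced.
BRACES = re.compile(r"\{([^{}]*)\}")


def spoken_form(token):
    """Drop the reconstructed material to get the form as spoken."""
    return BRACES.sub("", token)


def reconstructed_form(token):
    """Keep the reconstructed material and drop only the braces."""
    return BRACES.sub(r"\1", token)


for token in ["ndiyamthand{a}", "njengo{ku}ba"]:
    print(token, "| spoken:", spoken_form(token), "| full:", reconstructed_form(token))
```

Keeping both forms in one token lets a search run on either the spoken or the full form without a second copy of the transcription.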

  • Morphophonological Issues
    Coalescence
      Nenkomo > nenkomo
      Neenkomo > neenkomo
    Syllabification
      Ngasendl{w}ini > ngasendl{w}ini
      Ayikafiki > ayikafiki

  • Morphophonological cntd
    Elision
      Andinamoto > andinamoto
    Stem modifications
      Emlanjeni > emlanjeni

  • Designing a tag set
    Granularity
    Lexical categories
      N, V (Tagging lexical categories is problematic in an agglutinating language)
    Syntagmatic morphological slots
      amadodana > amadodana

  • Designing cntd
    Paradigmatic instantiations within a syntagmatic slot
      gnp = ---
    Word categories
      nje (wenjenje) nje; njalo; njeya
      ke
      ke kaloku ke
      ke kaloku ke
      ke_kaloku_ke
      emlanjeni > ??

  • Designing cntd
    Spoken language expressions
    Non-word like expressions: 2 problems
      Standardising orthographic representation
      Tags
    e: mh: uh_uh_uh

  • Designing cntd
    Word-like expressions
      thixo
      Thixo
      Thixo
      Heyi_wethu
      Nantso_ke
      Suka_(wena)

  • Manual tagging
    Manual tagging necessary for 3 reasons
      Identifying tagging problems and problematic phenomena and revising the tag set
      Developing a training corpus
      Correcting automated tagging errors
    Manual (typing) tagging not ideal
      Tedious
      Error-prone
    Solution: Drag-and-drop tagger

  • Drag-and-drop tagger
    Demonstration of drag-and-drop tagger