June 14, 2013 | IALLT Conference | A Video Corpus for Language Learning: Open Source Tools & Materials from the Corpus-to-Classroom Project

A Video Corpus for Language Learning: Open Source Tools & Materials from the Corpus-to-Classroom Project


DESCRIPTION

Presentation at IALLT 2013

Citation preview

  • 1. June 14, 2013 | IALLT Conference. A Video Corpus for Language Learning: Open Source Tools & Materials from the Corpus-to-Classroom Project

2. Who we are. Rachael Gilg: Project Manager / Web Developer. Arthur Wendorf: Educational Technologist / Developer / Spanish Instructor. Martí Quixal: Computational Linguist / Developer / Spanish Instructor. Almeida Jacqueline Toribio & Barbara E. Bullock: Project Co-Directors. Carl Blyth: Director of COERLL.

4. Agenda. (1) Introduction to the Corpus-to-Classroom Project. (2) Project results: the SpinTX Video Archive, a pedagogically-friendly interface to the Spanish in Texas Corpus; involving teachers in the development of open educational resources; a model for open source corpus development.

5. Introduction to the Corpus-to-Classroom Project

6. Corpora in the Classroom: the promise. Corpus = a large, structured collection of language. Benefits for language learning: naturalistic language use; motivation; real language; discovery learning.

7. Example: CORPUS DEL ESPAÑOL

8. Example: CORPUS DEL ESPAÑOL. Pros: View examples of language in context. Linguistic annotations enable searching by part-of-speech, etc.

9. Example: CORPUS DEL ESPAÑOL. Cons: Designed for researchers, not educators. Limited utility to untrained end users. Content not openly licensed.

10. Example: YouTube

11. Example: YouTube. Pros: Engaging video content, many with captions. Many videos are openly licensed (CC-BY).

12. Example: YouTube. Cons: Searching is time-consuming. Content can disappear without warning. Sometimes blocked by K-12 schools.

13. Our two-pronged approach. SpinTX: Corpus-to-Classroom: grant from the University of Texas Longhorn Innovation Fund for Technology (2012-2013). Spanish in Texas Video Corpus: a project of COERLL, a National Foreign Language Resource Center (2010-2014).

14. Spanish in Texas Corpus: a collection of sociolinguistic video interviews that provide rich content for language learning. Goals: make publicly available authentic data about variation in Spanish as spoken in Texas, for education and for research; encourage teachers, students, and the public to view local varieties as a resource.

15. Corpus-to-Classroom: a searchable collection of pre-selected, corrected, annotated clips from the larger corpus. Goals: develop a pedagogically friendly interface for the Spanish in Texas Corpus; involve teachers and learners in the development of open educational resources based on the corpus; create a model for using open source tools and a pedagogical interface that can be adapted for any language corpus.

16. About the Corpus. Spanish in Texas Corpus: 92 sociolinguistic interview videos (avg. 30-45 min); transcribed (approx. 650,000 words); time-synced video caption files; tagged for linguistic features. SpinTX Video Archive: 327 video clips from 33 speakers (avg. 1-4 min); transcribed (approx. 80,000 words); time-synced video caption files; tagged for linguistic and pedagogical features; completely open (no registration required, open CC license); teacher-friendly interface.

18. The SpinTX Video Archive: a pedagogically-friendly interface to the Spanish in Texas Corpus

19. Needs assessment with educators

20. Needs assessment with educators. How do you use authentic video in your teaching? How do you find videos to use? What problems do you encounter? How can you imagine using the Spanish in Texas videos in your classes?

21. Primary goals of the interface. Enable educators to easily find and use videos that suit the curriculum (search by grammar point, theme, vocabulary, etc.). Enable accessibility and content openness (downloadable from an open site with a license enabling remixing). Enable educators to curate sets of videos for comparison and study (favoriting and tagging videos). Provide access to supporting materials such as lesson plans and activity templates (develop a community to share ready-made materials and templates).

22. Secondary goals of the interface. Employ in the development of materials for teacher training. Engage students as co-researchers.

24. Technical Overview of SpinTX Archive. Drupal 7: Taxonomy module integration; Community Tags module. Apache Solr search engine: keyword search; faceted browsing.

25. Ideas for future development. Advanced search capability: support for wildcards; improved phrase searching; improved keyword-in-context result view. Data visualizations: word and/or tag clouds; language maps. Enhanced word-level annotations: hover over a word in a transcript and see all annotations.

26. Formative evaluation of Beta version. Data collection methods: online user survey (http://goo.gl/4Lbbg); web analytics (navigation patterns, popular content); search analytics; user observation and feedback through ongoing workshops and focus groups.

27. Formative evaluation of Beta version. Data collection methods: online user survey (http://goo.gl/4Lbbg); web analytics (navigation patterns, popular content); search analytics; user observation and feedback through ongoing workshops and focus groups. Results of formative evaluation will drive future development of the interface.

28. Involving Teachers in the Development of OER

29. Workshops with Educators. Summer 2012 Workshop: ~100 secondary and college Spanish teachers. Fall 2012 Working Group: ~10 Univ. of Texas Spanish teachers. Spring 2013 Workshops: multiple conferences & Univ. of Texas Spanish teachers. Summer 2013 Working Group: ~10 secondary and college Spanish teachers.

30. Sample materials from the community (1)

32. Sample materials from the community (2). Idea from teacher workshop: use videos for grammar lessons to develop the students' metalinguistic and critical thinking skills as they pertain to language. Searched and selected clips for a lesson on por vs. para. Lesson tested in a heritage learners class. Anecdotal evidence that video lessons were effective and motivating to students.

33. Current Templates. Four templates: Cloze; Data-Driven Learning (DDL); Variation; Schema.

34. Cloze Template

35. Cloze Template: Activity

36. Data-Driven Learning (DDL) Template

37. Data-Driven Learning (DDL) Template: Activity

38. Variation Template: Pre-class Preparation

39. Variation Template: Activity

40. Schema Template: Pre-class Preparation

41. Schema Template: Activity

42. Publication of OER. Templates and community-developed lesson plans will be available on the SpinTX website by August 2013. We encourage the publication of videos on third-party platforms for remixing educational content.

43. A Model for Open Source Corpus Development

44. Sharing development practices and code. Use of open source software and open APIs. Custom code developed for the project: public GitHub repository, http://github.com/coerll. Project documentation (research protocols, development processes and methodologies, etc.): Corpus-to-Classroom Blog, http://sites.la.utexas.edu/corpus-to-classroom/; For Researchers page on spanishintexas.org, http://spanishintexas.org/for-researchers/

45. Recruit locally. Recruit and train interns: Internal Review Board training; video shooting and audio recording; practice interviews on site. Recruit family, friends, acquaintances: any Spanish-speaking resident of TX; conduct interviews in their home communities.

46. Interview protocol. Sampling of a large set of questions (~75) from NPR StoryCorps (Historias); biographical information. Average length: 30-45 min. Language: Spanish and mixed. Consent form and talent release. Metadata on speaker and interviewer (Google Docs).

47. Interview Metadata

48. Processing the Videos. Intake interview materials: create unique ID for video and forms; archive raw video and remove from camera. Video and transcript preparation: edit and export videos using Final Cut Pro; sound and image correction; upload to Automatic Sync to be transcribed by a bilingual transcriber (3-5 day turnaround; approx. $85 per hour of video).

49. Original Transcript from Automatic Sync

50. Upload video and transcript to YouTube for syncing

51. Download SRT file

52. Prepare Transcript for TreeTagger

53. Run through TreeTagger

54. Combine Data from SRT File and TreeTagger File, and add additional Tags

55. Manual clip selection and description

56. Divide CSV Files and Videos into Clips and adjust Timings and Numberings

57. Automatic Pedagogical Annotation of Clips

58. SpinTX Clip Data Published on GitHub: http://www.github.com/coerll

59. Questions?

60. Links. SpinTX Video Archive: http://www.spintx.org. Spanish in Texas Corpus: http://www.spanishintexas.org. Slides from this presentation will be posted at: http://www.slideshare.net/spanish_in_texas
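
The clip-splitting step named on slide 56 (dividing caption data into clips and adjusting timings and numberings) can be illustrated with a minimal Python sketch. It is not the project's own code, which lives in the GitHub repository linked above and may work quite differently. The names `Cue`, `parse_srt`, and `cut_clip` are hypothetical, the sample captions are invented for illustration, and the sketch assumes a standard SRT file of the kind downloaded on slide 51. It keeps the cues that fall entirely inside a clip, shifts their timestamps so the clip starts at zero, and renumbers them from 1.

```python
import re
from dataclasses import dataclass

# SRT timestamps look like 00:01:10,000 (hours:minutes:seconds,milliseconds).
TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


def to_ms(stamp: str) -> int:
    """Convert an SRT timestamp to milliseconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms: int) -> str:
    """Convert milliseconds back to an SRT timestamp."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without caption text are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = lines[1].split(" --> ")
        cues.append(Cue(int(lines[0]), to_ms(start), to_ms(end), " ".join(lines[2:])))
    return cues


def cut_clip(cues: list[Cue], clip_start_ms: int, clip_end_ms: int) -> list[Cue]:
    """Keep cues fully inside the clip, rebase timings to 0, renumber from 1."""
    kept = [c for c in cues if c.start_ms >= clip_start_ms and c.end_ms <= clip_end_ms]
    return [
        Cue(i, c.start_ms - clip_start_ms, c.end_ms - clip_start_ms, c.text)
        for i, c in enumerate(kept, start=1)
    ]


# Invented sample captions, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hola, buenos días.

2
00:01:10,000 --> 00:01:12,250
Nací en San Antonio.

3
00:01:12,250 --> 00:01:15,000
Mi familia es de Laredo.
"""

clip = cut_clip(parse_srt(SAMPLE), to_ms("00:01:10,000"), to_ms("00:01:20,000"))
for c in clip:
    print(c.index, to_stamp(c.start_ms), "-->", to_stamp(c.end_ms), c.text)
```

Cues that straddle a clip boundary are dropped in this sketch. A real pipeline might instead trim them to the boundary, which is a design choice the slides do not specify.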