Cleaning plain text books with Text::Perfide::BookCleaner

DESCRIPTION

Slides from a presentation about Text::Perfide::BookCleaner given at PtPW2011. T::P::BC is a Perl module created to clean books in plain text format, making them suitable for further automatic text processing activities.

Text of Cleaning plain text books with Text::Perfide::BookCleaner

  • 1. Cleaning plain text books with Text::Perfide::BookCleaner. André Santos, andrefs@cpan.org. September 23, 2011
  • 2. Outline: 1. Introduction (Per-Fide, Text alignment, Books); 2. Text::Perfide::BookCleaner; 3. Conclusions, wish list and future work
  • 6. Project Per-Fide: Joint venture between the Computer Science Department and the School of Humanities of the University of Minho. Portuguese in parallel with six languages: Español, Russian, Français, Italiano, Deutsch, English. Goal: build parallel corpora that will establish a relation between Portuguese and the other six languages.
  • 7. [Parallel] Corpora. Corpora: collection of natural language texts. Parallel corpora: collection of natural language bitexts. Bitext: pair formed by a text in a given language and its translation in another language, frequently aligned. Alignment: mapping between the sentences/paragraphs/words of one text and the other.
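The definitions on this slide can be pictured as a small data structure. The sketch below is only an illustration and is not taken from the Per-Fide code base: the `Bitext` class, its field names, and the sample sentences are all hypothetical. It stores the two texts as sentence lists and the alignment as pairs of index groups, so that one-to-many correspondences can be expressed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Bitext:
    """Hypothetical container: a text, its translation, and their alignment."""
    source_lang: str
    target_lang: str
    source: List[str]          # sentences of the original text
    target: List[str]          # sentences of the translation
    # Each entry maps a group of source sentence indices
    # to a group of target sentence indices.
    alignment: List[Tuple[List[int], List[int]]] = field(default_factory=list)

    def aligned_pairs(self) -> Iterator[Tuple[str, str]]:
        """Yield the aligned text chunks as (source, target) strings."""
        for src_idx, tgt_idx in self.alignment:
            yield (
                " ".join(self.source[i] for i in src_idx),
                " ".join(self.target[j] for j in tgt_idx),
            )


# Made-up sample: two Portuguese sentences aligned to one English segment.
sample = Bitext(
    source_lang="pt",
    target_lang="en",
    source=["Bom dia.", "Como estás?"],
    target=["Good morning, how are you?"],
    alignment=[([0, 1], [0])],
)
pairs = list(sample.aligned_pairs())
```

Under this representation a parallel corpus is simply a collection of such objects.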
  • 12. Project Per-Fide: Original texts in the seven languages and their translations. Two main genres: contemporary fiction and non-fiction. Non-fiction: judicial, journalistic, religious, technical, ... Fiction: contemporary novels and short stories. per-fide.di.uminho.pt
  • 18. Text alignment: Manual or automatic. Paragraph/sentence/word level. Automatic alignment tools/algorithms generally fall into three categories. Length based: when two sentences correspond, the words in them also correspond. Lexical/dictionary based: relies on lexical information or dictionaries to perform the alignment. Partial similarity (cognates) based: relies on occurrences of tokens graphically or otherwise identical (cognates).
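To make the three categories more concrete, here is a minimal sketch of one scoring cue per category for a candidate sentence pair. It does not reproduce any particular alignment tool: the function names, the toy lexicon, and the sample sentences are assumptions made for illustration. The length cue also assumes the usual reading of length-based methods, namely that a sentence and its translation tend to have similar lengths.

```python
import re
from typing import Dict


def _tokens(sentence: str):
    """Lowercased word tokens (letters and digits), punctuation dropped."""
    return re.findall(r"\w+", sentence.lower())


def length_score(src: str, tgt: str) -> float:
    """Length cue: ratio of the shorter to the longer sentence, in characters."""
    a, b = len(src), len(tgt)
    if max(a, b) == 0:
        return 1.0
    return min(a, b) / max(a, b)


def dictionary_score(src: str, tgt: str, lexicon: Dict[str, str]) -> float:
    """Lexical cue: share of source tokens whose lexicon translation occurs in tgt."""
    src_tokens = _tokens(src)
    tgt_tokens = set(_tokens(tgt))
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok) in tgt_tokens)
    return hits / len(src_tokens)


def cognate_score(src: str, tgt: str) -> float:
    """Cognate cue: Jaccard overlap of tokens written identically in both sentences."""
    src_tokens, tgt_tokens = set(_tokens(src)), set(_tokens(tgt))
    if not src_tokens or not tgt_tokens:
        return 0.0
    return len(src_tokens & tgt_tokens) / len(src_tokens | tgt_tokens)


# Made-up sentence pair and toy lexicon, for illustration only.
pt = "Tron foi lançado em 1982."
en = "Tron was released in 1982."
toy_lexicon = {"foi": "was", "em": "in"}

scores = {
    "length": length_score(pt, en),
    "dictionary": dictionary_score(pt, en, toy_lexicon),
    "cognates": cognate_score(pt, en),
}
```

A real aligner would combine cues like these with a search over all candidate pairings (typically dynamic programming) rather than scoring one pair in isolation.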
  • 19. Text alignment: Example. Table: Extract of sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron.