Rencontres TEI Council Lyon 2009
Serge Heiden, ICAR Laboratory / Lyon [email protected]
Council, ENS-LSH, Lyon (France), 1 April 2009
Context (1/2)
Project objective (2007-2009): To develop an open-source software platform for Textometry analysis of textual data
Partners:
  Univ. of Lyon (lead) [Weblex]
  Univ. of Nice [Hyperbase]
  Univ. of Franche-Comté [Diatag]
  Univ. of Paris 3 [Lexico]
  Univ. of Oxford [Xaira]
  Univ. of Montréal [Sato]
  And others: Univ. of Chicago [PhiloLogic]
Web sites:
  http://textometrie.ens-lsh.fr (project site)
  http://textometrie.sourceforge.net (dev site)
Context (2/2)
Textometry methodology
  TEI-encoded and NLP-enriched textual data analysis
  Qualitative data analysis
    Deep text search engine, KWIC concordances
    Hypertextual data rendering and navigation
  Quantitative data analysis
    Factorial analysis, classification, specificity
    N-gram analysis, cooccurrence, collocation, burst
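As an illustration of the KWIC concordances mentioned above, here is a minimal Python sketch. It is not the platform's implementation: the function name `kwic`, the `width` parameter and the whitespace tokenization are assumptions made for this example only.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match.

    Hypothetical helper: matching is case-insensitive on the surface
    form, and the context window is counted in tokens.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Toy corpus, tokenized on whitespace (an assumption for the example).
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for left, kw, right in kwic(tokens, "sat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} | {kw} | {right}")
```

A real concordancer would query indexed lexical units and their properties (lemma, pos) rather than raw strings.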
TEI Role and Usage
Open-source contract between data and software
Textometry point of view for data input from TEI:
  Textual dimensions (main language, secondary language, cited text, out of text - comments, notes, titles…) <index>
  Lexical units (words, phrases…) and their properties (pos, lemma…) <w>
  Contextual units (sentence, verse, chapter, text…) and their properties (language, number, domain, genre…) <s>
  Contrasts between units
  Structural units (navigation: physical - page, logical) <pb/>
  References (unit coordinates based on their properties)
  Rendering (device, segmentation, style)
  Alignment (between two corpora)
Discussion (1/2) : Textometry related TEI element types (BFM : A. Lavrentiev)
Tokenize words (segment + value)
  >= : expan|note|name|s
  = : w|abbr|num
  < : c|ex
Segment sentences (segment + value)
  > : TEI|text|front|body|div|head|trailer|p|ab|sp|speaker|list
  >~ : q|quote|item
Transversal
  ~ : choice|corr|sic|add|del|reg|orig|foreign|hi|title|supplied|subst|damage|pb|lb|milestone|gap
Meta: note, teiHeader
Primary linguistic content of a text: index ?
NLP results: specify stand-off
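The element classes listed above could be handed to a tokenizer as a lookup table. The Python sketch below only transcribes the lists from the slide; the category keys such as `word>=` are hypothetical labels that reuse the slide's symbols without interpreting them, and `build_index` is an illustrative helper, not part of any existing tool.

```python
from collections import defaultdict

# Element groups copied from the slide; keys are hypothetical labels.
GROUPS = {
    "word>=": "expan|note|name|s",
    "word=": "w|abbr|num",
    "word<": "c|ex",
    "sentence>": "TEI|text|front|body|div|head|trailer|p|ab|sp|speaker|list",
    "sentence>~": "q|quote|item",
    "transversal": ("choice|corr|sic|add|del|reg|orig|foreign|hi|title|"
                    "supplied|subst|damage|pb|lb|milestone|gap"),
    "meta": "note|teiHeader",
}


def build_index(groups):
    """Invert the groups: element name -> list of categories it belongs to."""
    index = defaultdict(list)
    for category, names in groups.items():
        for name in names.split("|"):
            index[name].append(category)
    return dict(index)


INDEX = build_index(GROUPS)

for name in ("w", "note", "pb", "q"):
    print(name, INDEX.get(name, []))
```

An element may fall into several categories (note appears both in a word-level list and under Meta), which is why the index maps each name to a list.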
Discussion (2/2) : Software related information
Bind software parameters to TEI texts
  meta.xml file of the ODT format
  corpus_parameters.xml of Xaira software
  => external pointer in teiHeader (like image or audio files) ?