Rencontres TEI Council Lyon 2009 Serge Heiden ICAR Laboratory / Lyon University slh@ens-lsh.fr Council, ENS-LSH, Lyon (France), 1 April 2009


Rencontres TEI Council
Lyon 2009

Serge Heiden
ICAR Laboratory / Lyon University
slh@ens-lsh.fr

Council, ENS-LSH, Lyon (France), 1 April 2009


Context (1/2)

Project objective (2007-2009): to develop an open-source software platform for Textometry analysis of textual data

Partners:
- Univ. of Lyon (lead) [Weblex]
- Univ. of Nice [Hyperbase]
- Univ. of Franche-Comté [Diatag]
- Univ. of Paris 3 [Lexico]
- Univ. of Oxford [Xaira]
- Univ. of Montréal [Sato]

And others:
- Univ. of Chicago [PhiloLogic]

Web sites:
- http://textometrie.ens-lsh.fr (project site)
- http://textometrie.sourceforge.net (dev site)


Context (2/2)

Textometry methodology: analysis of TEI-encoded and NLP-enriched textual data

Qualitative data analysis:
- Deep text search engine, KWIC concordances
- Hypertextual data rendering and navigation

Quantitative data analysis:
- factorial analysis, classification, specificity
- N-gram analysis, cooccurrence, collocation, burst
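As an illustration of the qualitative side, a KWIC (keyword in context) concordance can be sketched in a few lines of Python. This is a minimal sketch, not the platform's implementation: the function name `kwic`, the `width` parameter and the toy token list are assumptions made for the example, and the input is taken to be already tokenized.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword.

    Hypothetical helper: matching is a case-insensitive comparison of word forms.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy corpus, already split into word forms.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for left, hit, right in kwic(tokens, "sat", width=2):
    # Right-align the left context so that the hits line up in one column.
    print(f"{left:>12} | {hit} | {right}")
```

A real search engine would query word properties (lemma, part of speech) and structural context rather than raw forms, which is where the TEI encoding comes in.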


TEI Role and Usage

Open-source contract between data and software

Textometry point of view for data input from TEI:
- Textual dimensions (main language, secondary language, cited text, out of text - comments, notes, titles…) <index>
- Lexical units (words, phrases…) and their properties (pos, lemma…) <w>
- Contextual units (sentence, verse, chapter, text…) and their properties (language, number, domain, genre…) <s>
- Contrasts between units
- Structural units (navigation: physical - page, logical) <pb/>
- References (unit coordinates based on their properties)
- Rendering (device, segmentation, style)
- Alignment (between two corpora)
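To make the mapping concrete, the sketch below reads lexical units (`<w>`), their contextual unit (`<s>`) and the current page (`<pb/>`) from a small TEI-like fragment using Python's standard `xml.etree.ElementTree`. Several things here are assumptions made for the example: the sample text, the use of `lemma` and `type` attributes on `<w>` to carry lemma and part of speech, and the function name `read_units`. The fragment omits the `teiHeader`, so it is not a valid TEI document.

```python
import xml.etree.ElementTree as ET

NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace

# Illustrative fragment only; attribute choices are assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1"/>
    <p>
      <s n="1"><w lemma="the" type="DET">The</w> <w lemma="cat" type="NOUN">cats</w> <w lemma="sleep" type="VERB">sleep</w></s>
      <s n="2"><w lemma="dog" type="NOUN">Dogs</w> <w lemma="bark" type="VERB">bark</w></s>
    </p>
  </body></text>
</TEI>"""


def read_units(xml_text):
    """Return one row per <w>, with its sentence number and current page."""
    root = ET.fromstring(xml_text)
    page = None
    rows = []
    for elem in root.iter():  # document order
        if elem.tag == f"{{{NS}}}pb":
            page = elem.get("n")
        elif elem.tag == f"{{{NS}}}s":
            # Simplification: a <pb/> inside a sentence is not applied
            # to the words that follow it within that sentence.
            for w in elem.iter(f"{{{NS}}}w"):
                rows.append({
                    "page": page,
                    "s": elem.get("n"),
                    "form": w.text,
                    "lemma": w.get("lemma"),
                    "pos": w.get("type"),
                })
    return rows


for row in read_units(SAMPLE):
    print(row)
```

Each row combines a lexical unit with the properties of its contextual and structural units, which is the kind of coordinate a reference or a contrast between units could be built on.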


Discussion (1/2): Textometry related TEI element types (BFM: A. Lavrentiev)

Tokenize words (segment + value):
- >= : expan|note|name|s
- = : w|abbr|num
- < : c|ex

Segment sentences (segment + value):
- > : TEI|text|front|body|div|head|trailer|p|ab|sp|speaker|list
- >~ : q|quote|item

Transversal:
- ~ : choice|corr|sic|add|del|reg|orig|foreign|hi|title|supplied|subst|damage|pb|lb|milestone|gap

Meta: note, teiHeader

Primary linguistic content of a text: index ?

NLP results: specify stand-off
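One possible reading of the word-level classes above is that `>=` elements contain one or more words, `=` elements coincide with exactly one word, and `<` elements sit inside a word. The slides do not define the operators, so this reading is an assumption. Under it, a tokenizer can be sketched as follows; `WORD_RELATION`, `relation_to_word` and `tokenize` are hypothetical names, the sample sentence is invented, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Element classes from the slide; the meaning of the operators is assumed.
WORD_RELATION = {
    ">=": {"expan", "note", "name", "s"},  # contains one or more words
    "=": {"w", "abbr", "num"},             # is exactly one word
    "<": {"c", "ex"},                      # is part of a word
}


def relation_to_word(tag):
    """Return the slide's relation symbol for an element name, or None."""
    for relation, names in WORD_RELATION.items():
        if tag in names:
            return relation
    return None


def tokenize(elem):
    """Yield word forms found in elem, in document order."""
    if relation_to_word(elem.tag) == "=":
        # One token, whatever is nested inside (e.g. <ex> within <w>).
        form = "".join(elem.itertext()).strip()
        if form:
            yield form
        return
    # Otherwise split the element's own text on whitespace and recurse.
    # Simplification: a "<" element found outside a "=" element is
    # treated like ordinary running text.
    if elem.text:
        yield from elem.text.split()
    for child in elem:
        yield from tokenize(child)
        if child.tail:
            yield from child.tail.split()


sentence = ET.fromstring(
    "<s>The <w>D<ex>octo</ex>r</w> paid <num>3</num> <abbr>fr.</abbr></s>"
)
print(list(tokenize(sentence)))
```

The transversal elements (`choice`, `corr`, `sic` and so on) would need an explicit policy, for instance which branch of a `choice` to index, which this sketch does not attempt.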


Discussion (2/2): Software related information

Bind software parameters to TEI texts:
- meta.xml file of the ODT format
- corpus_parameters.xml of Xaira software

=> external pointer in teiHeader (like image or audio files) ?