
Accessing distributed linguistic resources

An XML based architecture

Laurent Romary

Laboratoire Loria, Nancy (F)

Samuel Cruz-Lara, Patrice Bonhomme, Christophe de Saint Rat

Overview

Objectives

General network organization

Role of XML in the architecture

Implementation

Perspectives

Objectives

Distributed access to linguistic resources

– Linguistic resources

• multilingual texts (books, newspaper articles), mono or multilingual dictionaries, transcription of spoken data etc.

– Usages

• Researchers: linguists, lexicographers

• Professionals: translators, teachers

• Larger public: information on language use

Objectives - cont.

– Distributed servers

• Local maintenance of resources

– Linguistic competence (Finnish!)

– Specific philological and/or scholarly competencies (historical manuscripts, transcriptions of ethnographic work etc.)

– Copyright aspects (local agreements with editors)

• Distribution and allocation of load

– Large amounts of data

– Main processing done on the server side

General context

National

– Silfide project

• CNRS and Agence des Universités Francophones

• Registering and distributing French linguistic resources

European

– MLIS/Elan project

• EU - DG XIII funding

• Networking existing LR access environments

General Network Organization

User scenario (workflow)

User connection

Selection of servers

– server profiles

Selection of resources

– header queries

Content queries

– Concordances, word lists, statistics etc.

Servers: two main sets of functionalities

Local access servers

– User identification (User DB)

– Query broadcast - Result set merging

Resource servers

– Query interpretation (resource DB)
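The division of labour above (an access server broadcasts a query, each resource server interprets it, and the access server merges the result sets) can be sketched in Python. This is only an illustration under stated assumptions: the functions `nancy` and `birmingham` are hypothetical in-process stand-ins for resource servers that would really be reached over HTTP, and the record fields are borrowed from the concordance example in the result-set slide.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for resource servers: each takes a query string and
# returns a list of records (dicts). A real server would be reached over HTTP.
def nancy(query):
    return [{"nodeWord": "flies", "leftContext": "Time"}] if query == "flies" else []

def birmingham(query):
    return [{"nodeWord": "flies", "leftContext": "Fruit"}] if query == "flies" else []

def broadcast(query, servers):
    """Send one query to every resource server and merge the result sets."""
    names = list(servers)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        # pool.map keeps the order of `names`, so results line up with servers.
        result_sets = list(pool.map(lambda n: servers[n](query), names))
    merged = []
    for name, records in zip(names, result_sets):
        for record in records:
            # Tag each record with its origin so merged results stay traceable.
            merged.append({"server": name, **record})
    return merged

results = broadcast("flies", {"Nancy": nancy, "Birmingham": birmingham})
```

Tagging each record with the server it came from is one simple way to keep heterogeneous results distinguishable after merging.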

An extensive use of XML

– Linguistic resources are semi-structured documents (cf. Abiteboul, Buneman etc.)

– Linguistic resources have long (but not everywhere) been encoded in SGML

• Cf. TEI: Text Encoding Initiative

– Historical links between the TEI and XML

• C. M. Sperberg-McQueen, Steve DeRose, Henry Thompson etc.

XML and linguistic resources

Being able to isolate sub-documents

– E.g. dictionary entries, concordance lines etc.

Being able to filter|merge|sort data extensively

– E.g. combining results extracted from various (and probably heterogeneous) documents

Introducing flexibility in document presentation (cf. variety of usages): XSL

Document structure - XML

<TEI.2>
  <teiHeader>…</teiHeader>
  <text>
    <front>…</front>
    <body>
      <div n="a">
        <entry>…</entry>
        <entry>…</entry>
      </div>
      <div n="b">…</div>
    </body>
    <back>…</back>
  </text>
</TEI.2>
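With this structure, isolating sub-documents amounts to selecting elements from the tree. The following Python sketch uses the standard library's `xml.etree.ElementTree` to pull dictionary entries out of each `<div>`. The sample document and its `<orth>` content are illustrative assumptions, not data from the project.

```python
import xml.etree.ElementTree as ET

# Illustrative sample following the TEI.2 skeleton; the entry content is assumed.
TEI_SAMPLE = """<TEI.2>
  <teiHeader/>
  <text>
    <body>
      <div n="a"><entry><orth>apple</orth></entry><entry><orth>arrow</orth></entry></div>
      <div n="b"><entry><orth>banana</orth></entry></div>
    </body>
  </text>
</TEI.2>"""

def entries_by_division(xml_text):
    """Return each <div>'s entries as standalone XML strings, keyed by its n attribute."""
    root = ET.fromstring(xml_text)
    return {
        div.get("n"): [
            ET.tostring(entry, encoding="unicode").strip()
            for entry in div.findall("entry")
        ]
        for div in root.iter("div")
    }

subdocs = entries_by_division(TEI_SAMPLE)
```

Each value is a self-contained fragment that could be shipped to a client or merged with fragments from other documents.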

Document structure

TEI.2
  teiHeader
  text
    front
    body
      div (e.g.: letter 'a')
        entry, entry, entry, …
      div (letter 'b')
        entry, …
      div …
    back

XML in the network architecture

Why?

• Coherence between the content and the “glue”

• E.g. combining results and user information

How?

• At the user level

– User identification

– Workspace

• At the information flow level

– Queries

– Result sets

An umbrella document: SIL

SIL: Silfide Interface Language

sil (attribute uid): access, (ws | ui | ql | rs)+
  access
    login (attribute id, #required)
    passwd: #PCDATA
    default (attribute group = guest | admin | project)
    project*: #PCDATA

User Information (<ui>)

<name>: user name

<first>Patrice</first><last>Bonhomme</last>

<email>[email protected]</email>

<org>: organization information

Attribute status=public|private etc.

<orgname>

<function>

<address>

Workspace (<ws>)

<prefs>: list of preferences

<pref name="language" value="FR">

<basket>+: list of resources

<resource sid="BirServer" idno="Shakespeare23">

<histos>?: access history

Queries (<ql>)

A query language combining:

– Constraints on the XML structure (à la XPath)

– Constraints on the linguistic content

• ELAN Common Query Language to be implemented (or interfaced) by all servers

Note: to be merged with recent proposals on XQL

Query Language: example

<ql>
  <query>
    <select selscrit="all">
      <output>
        <xptr alias="x1">descendant(1,orth)</xptr>
      </output>
      <from url="dico.xml">
        <xptr id="x1">root().descendant(all,entry)</xptr>
      </from>
      <where>
        <term neg="asis">
          <grep neg="asis" case="yes">
            <xptr alias="x1">descendant(all,pos).child(all,#text)</xptr>
            <regexp>n.</regexp>
          </grep>
        </term>
      </where>
    </select>
  </query>
</ql>
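Read informally, the example query selects every `entry` in `dico.xml`, keeps those whose `pos` text matches the regular expression `n.`, and outputs the first `orth` descendant of each. The Python sketch below mimics that reading over a small in-memory document. It is not the ELAN query engine: the sample data is invented, and treating `case="yes"` as "case-sensitive" is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Invented stand-in for dico.xml.
DICO = """<body>
  <entry><orth>arrow</orth><pos>n.</pos></entry>
  <entry><orth>fly</orth><pos>v.</pos></entry>
  <entry><orth>time</orth><pos>n.</pos></entry>
</body>"""

def select_orth(xml_text, pos_pattern):
    """Return the first <orth> text of every entry whose <pos> text matches."""
    root = ET.fromstring(xml_text)
    pattern = re.compile(pos_pattern)  # case-sensitive (assumed meaning of case="yes")
    hits = []
    for entry in root.iter("entry"):  # from: root().descendant(all,entry)
        pos_texts = [p.text or "" for p in entry.iter("pos")]
        if any(pattern.search(text) for text in pos_texts):  # where: grep on pos text
            orth = entry.find(".//orth")  # output: descendant(1,orth)
            if orth is not None:
                hits.append(orth.text)
    return hits

matches = select_orth(DICO, r"n.")
```

Note that, as in the slide, the unescaped `.` in `n.` matches any character, so the pattern also accepts strings such as `nn`.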

Result sets (<rs>)

<meta>: metadata information about the result (cf. query)

<attr name="from">10</attr>
<attr name="to">20</attr>

<result>: a list of elementary results/records

<record count="11">
  <attr name="leftContext">Time</attr>
  <attr name="nodeWord">flies</attr>
  <attr name="rightContext">like an arrow</attr>
</record>
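A client receiving such a result set can turn it into plain data structures with a few lines of Python. The sketch below assumes that the `<attr>` elements shown under `<meta>` are its children, that `<record>` elements sit inside `<result>`, and that the whole is wrapped in `<rs>`. That nesting is inferred from the slide layout rather than from a published DTD.

```python
import xml.etree.ElementTree as ET

# Sample result set; the nesting is an assumption based on the slide.
RS_SAMPLE = """<rs>
  <meta><attr name="from">10</attr><attr name="to">20</attr></meta>
  <result>
    <record count="11">
      <attr name="leftContext">Time</attr>
      <attr name="nodeWord">flies</attr>
      <attr name="rightContext">like an arrow</attr>
    </record>
  </result>
</rs>"""

def read_result_set(xml_text):
    """Split a result set into its metadata dict and a list of record dicts."""
    root = ET.fromstring(xml_text)
    meta = {a.get("name"): a.text for a in root.findall("./meta/attr")}
    records = []
    for rec in root.findall("./result/record"):
        fields = {a.get("name"): a.text for a in rec.findall("attr")}
        fields["count"] = int(rec.get("count"))
        records.append(fields)
    return meta, records

def concordance_line(record):
    """Format one record as a keyword-in-context line."""
    return f'{record["leftContext"]} [{record["nodeWord"]}] {record["rightContext"]}'

meta, records = read_result_set(RS_SAMPLE)
```

Because every record is a flat list of named attributes, the same reader works for concordances, word lists, or statistics without change.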

Putting things together

[Diagram: user information (SilUI/XML) and workspace (SilWS/XML); a query (SilQL/XML) is broadcast and the result returns as SilRS/XML.]

Implementation

Main technical choices

– Access servers implemented as Java servlets within an HTTP server

– Resource servers interfaced through a servlet

A single element of centralization: the Network Management Unit (NMU)

– CORBA connection to query and administer the NMU

Administration

[Diagram: Server 1 and Server 2 each comprise RS_status, NmuClientServlet, Dispatcher and ResourceServlet. Links are labelled CORBA (NMU) and HTTP / XML (Web browser, client applet).]

Cache capabilities

[Diagram: the BroadcastServlet of one Silfide server exchanges SIL/CQL/XML queries and SIL/RS/XML results with the QueryServlet of the Silfide servers at Leiden and Birmingham. Each QueryServlet has a cache and reaches its local DB through an ElanQueryHandler driver (connection+, native/SilRS).]
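The slides do not say how the cache is keyed or invalidated. As a purely illustrative sketch, one simple policy is to memoize result sets by the text of the incoming query, so that a repeated query never reaches the database driver. The class and parameter names below are hypothetical and do not come from the Silfide code base.

```python
class CachedQueryHandler:
    """Hypothetical query handler that memoizes result sets by query text."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that actually hits the local DB
        self._cache = {}
        self.misses = 0              # number of queries forwarded to the DB

    def handle(self, sil_query):
        if sil_query not in self._cache:
            self.misses += 1
            self._cache[sil_query] = self._run_query(sil_query)
        return self._cache[sil_query]

# Usage with a stand-in database function.
handler = CachedQueryHandler(lambda q: f"<rs><!-- results for {q} --></rs>")
first = handler.handle("<ql>flies</ql>")
second = handler.handle("<ql>flies</ql>")
```

A production cache would also need an eviction and invalidation policy, which this sketch leaves out.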

Conclusions

Experiment

• A first network with Nancy (FR), Birmingham (UK), Leiden (NL) [, Pisa (IT)]

• Check demo availability at http://www.loria.fr/projets/MLIS/ELAN

Genericity of the model

– Coping with other distributed information environments

Perspectives

– Specific problems associated with linguistic resources

– Clusters of documents (e.g. multilingual alignment) — RDF?

– On-line editing/annotation of documents

– Aiming at a moving target

• XSL: self-contained filtering mechanisms

• XQL: real DB+query engines associated with XML?

– Still: experimenting is VERY useful to understand problems and make things evolve
