Accessing distributed linguistic resources
An XML based architecture
Laurent Romary
Laboratoire Loria, Nancy (F)
Samuel Cruz-Lara, Patrice Bonhomme, Christophe de Saint Rat
Overview
– Objectives
– General Network organization
– Role of XML in the architecture
– Implementation
– Perspectives
Objectives
Distributed access to linguistic resources
– Linguistic resources
• multilingual texts (books, newspaper articles), monolingual or multilingual dictionaries, transcriptions of spoken data etc.
– Usages
• Researchers: linguists, lexicographers
• Professionals: translators, teachers
• Larger public: information on language use
Objectives - cont.
– Distributed servers
• Local maintenance of resources
  – Linguistic competence (Finnish!)
  – Specific philological and/or scholarly competencies (historical manuscripts, transcriptions of ethnographic work etc.)
  – Copyright aspects (local agreements with publishers)
• Distribution and allocation of load
  – Large amount of data
  – Main processing done on the server side
General context
National
– Silfide project
• CNRS and Agence des Universités Francophones
• Registering and distributing French linguistic resources
European
– MLIS/Elan project
• EU - DG XIII funding
• Networking existing LR access environments
General Network Organization
User scenario (workflow)
User connection
Selection of servers
– server profiles
Selection of resources
– header queries
Content queries
– Concordances, word lists, statistics etc.
Servers: two main sets of functionalities
Local access servers
– User identification (User DB)
– Query broadcast - Result set merging
Resource servers
– Query interpretation (resource DB)
An extensive use of XML
– Linguistic resources are semi-structured documents (cf. Abiteboul, Buneman etc.)
– Linguistic resources have long (but not everywhere) been encoded in SGML
• Cf. TEI: Text Encoding Initiative
– Historical links between the TEI and XML
• C. M. Sperberg-McQueen, Steve DeRose, Henry Thompson etc.
XML and linguistic resources
Being able to isolate sub-documents
– E.g. dictionary entries, concordance lines etc.
Being able to filter|merge|sort data extensively
– E.g. combining results extracted from various (and probably heterogeneous) documents
Introducing flexibility in document presentation (cf. variety of usages): XSL
Document structure - XML
<TEI.2>
  <teiHeader>…</teiHeader>
  <text>
    <front>…</front>
    <body>
      <div n="a">
        <entry>…</entry>
        <entry>…</entry>
      </div>
      <div n="b">…</div>
    </body>
    <back>…</back>
  </text>
</TEI.2>
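Isolating sub-documents from a file shaped like the skeleton above can be sketched as follows. This is a minimal illustration rather than the Silfide implementation: the sample dictionary, the `<orth>` children, and the function name `entries_by_division` are assumptions made for the example, and only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dictionary following the TEI.2 skeleton
# (the entries and their <orth> children are invented for illustration).
TEI_SAMPLE = """
<TEI.2>
  <teiHeader/>
  <text>
    <front/>
    <body>
      <div n="a">
        <entry><orth>abbey</orth></entry>
        <entry><orth>able</orth></entry>
      </div>
      <div n="b">
        <entry><orth>baker</orth></entry>
      </div>
    </body>
    <back/>
  </text>
</TEI.2>
"""


def entries_by_division(tei_xml):
    """Map each <div n="..."> to the serialized <entry> sub-documents it holds."""
    root = ET.fromstring(tei_xml)
    result = {}
    for div in root.findall("./text/body/div"):
        result[div.get("n")] = [
            # Each entry is serialized on its own, i.e. isolated as a sub-document.
            ET.tostring(entry, encoding="unicode").strip()
            for entry in div.findall("entry")
        ]
    return result


entries = entries_by_division(TEI_SAMPLE)
for letter, items in entries.items():
    print(letter, len(items), items[0])
```

Each extracted entry is a well-formed XML fragment in its own right, which is what makes it possible to ship, merge, or restyle entries independently of the dictionary they came from.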
Document structure
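[Figure: tree view of the same document. TEI.2 contains teiHeader and text; text contains front, body and back; body contains div elements (e.g. letter 'a', letter 'b', …), each holding a series of entry elements.]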
XML in the network architecture
Why?
• Coherence between the content and the “glue”
• E.g. combining results and user information
How?
• At the user level
  – User identification
  – Workspace
• At the information flow level
  – Queries
  – Result sets
An umbrella document: SIL
SIL: Silfide Interface Language
[Figure: structure of the sil element. Labels recoverable from the diagram: sil; uid; (ws | ui | ql | rs)+; access; login (id - #required); passwd (#PCDATA); group (default; guest|admin|project); project* (#PCDATA).]
User Information (<ui>)
<name>: user name
  <first>Patrice</first><last>Bonhomme</last>
  <email>[email protected]</email>
<org>: organization information
  Attribute status=public|private etc.
  <orgname>
  <function>
  <address>
Workspace (<ws>)
<prefs>: List of preferences
  <pref name="language" value="FR">
<basket>+: List of resources
  <resource sid="BirServer" idno="Shakespeare23">
<histos>?: access history
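To make the umbrella idea concrete, here is a small sketch that assembles a SIL document carrying both user information and a workspace. The element and attribute names (`sil`, `ui`, `name`, `first`, `last`, `ws`, `prefs`, `pref`, `basket`, `resource`, `sid`, `idno`) come from the slides; the exact nesting, the helper name `build_sil`, and the omission of the other elements are assumptions, since the slides do not give the full DTD.

```python
import xml.etree.ElementTree as ET


def build_sil(first, last, prefs, basket):
    """Build a <sil> document with a <ui> and a <ws> child (assumed nesting)."""
    sil = ET.Element("sil")

    # User information: <ui><name><first/><last/></name></ui>
    ui = ET.SubElement(sil, "ui")
    name = ET.SubElement(ui, "name")
    ET.SubElement(name, "first").text = first
    ET.SubElement(name, "last").text = last

    # Workspace: preferences plus one basket of selected resources.
    ws = ET.SubElement(sil, "ws")
    prefs_el = ET.SubElement(ws, "prefs")
    for key, value in prefs.items():
        ET.SubElement(prefs_el, "pref", {"name": key, "value": value})
    basket_el = ET.SubElement(ws, "basket")
    for sid, idno in basket:
        ET.SubElement(basket_el, "resource", {"sid": sid, "idno": idno})
    return sil


sil_doc = build_sil(
    "Patrice", "Bonhomme",
    prefs={"language": "FR"},
    basket=[("BirServer", "Shakespeare23")],
)
sil_text = ET.tostring(sil_doc, encoding="unicode")
print(sil_text)
```

Because user data and workspace travel in the same document type as queries and results, a server only needs one parser and one set of conventions to handle everything it receives.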
Queries (<ql>)
A query language combining:
– Constraints on the XML structure (à la XPath)
– Constraints on the linguistic content
• ELAN Common Query Language to be implemented (or interfaced) by all servers
Rem: To be merged with recent proposals on XQL
Query Language: example
<ql>
  <query>
    <select selscrit="all">
      <output>
        <xptr alias="x1">descendant(1,orth)</xptr>
      </output>
      <from url="dico.xml">
        <xptr id="x1">root().descendant(all,entry)</xptr>
      </from>
      <where>
        <term neg="asis">
          <grep nag="asis" case="yes">
            <xptr alias="x1">descendant(all,pos).child(all,#text)</xptr>
            <regexp>n.</regexp>
          </grep>
        </term>
      </where>
    </select>
  </query>
</ql>
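One plausible reading of this query is: for every entry in dico.xml whose part-of-speech text matches the regular expression, output the first orth descendant. The sketch below hand-translates that reading into Python. It is not an interpreter for the ELAN Common Query Language: the dictionary content, the function name `run_query`, and the choice of search (rather than full-match) semantics for `<grep>` are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content standing in for dico.xml.
DICO = """<TEI.2><text><body><div n="c">
<entry><orth>chat</orth><pos>n. m.</pos></entry>
<entry><orth>chanter</orth><pos>v.</pos></entry>
<entry><orth>chose</orth><pos>n. f.</pos></entry>
</div></body></text></TEI.2>"""


def run_query(dico_xml, pattern, case_sensitive=True):
    """Return the first <orth> text of each entry whose <pos> text matches."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regexp = re.compile(pattern, flags)
    root = ET.fromstring(dico_xml)
    hits = []
    # <from>: root().descendant(all,entry)
    for entry in root.iter("entry"):
        # <where>: descendant(all,pos).child(all,#text) tested against <regexp>
        pos_texts = [pos.text or "" for pos in entry.iter("pos")]
        if any(regexp.search(text) for text in pos_texts):
            # <output>: descendant(1,orth)
            orth = entry.find(".//orth")
            if orth is not None:
                hits.append(orth.text)
    return hits


# The pattern "n." is taken as-is from the <regexp> element of the example.
headwords = run_query(DICO, "n.")
print(headwords)
```

The three parts of the query map directly onto the three commented steps, which is the point of separating structural constraints (the `xptr` expressions) from content constraints (the `grep`).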
Result sets (<rs>)
<meta>: metadata information about the result (cf. query)
  <attr name="from">10</attr>
  <attr name="to">20</attr>
<result>: a list of elementary results/records
  <record count="11">
    <attr name="leftContext">Time</attr>
    <attr name="nodeWord">flies</attr>
    <attr name="rightContext">like an arrow</attr>
  </record>
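A client receiving such a result set could turn it into concordance lines along these lines. The sketch assumes that `<meta>` and `<result>` are direct children of `<rs>` and that `<record>` sits inside `<result>`; the helper names `read_result_set` and `kwic_line` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Sample result set assembled from the fragments on the slide (nesting assumed).
RS_SAMPLE = """
<rs>
  <meta>
    <attr name="from">10</attr>
    <attr name="to">20</attr>
  </meta>
  <result>
    <record count="11">
      <attr name="leftContext">Time</attr>
      <attr name="nodeWord">flies</attr>
      <attr name="rightContext">like an arrow</attr>
    </record>
  </result>
</rs>
"""


def attrs_to_dict(parent):
    """Collect the <attr name="...">value</attr> children of an element."""
    return {a.get("name"): (a.text or "") for a in parent.findall("attr")}


def read_result_set(rs_xml):
    """Return (metadata dict, list of record dicts) for one <rs> document."""
    root = ET.fromstring(rs_xml)
    meta = attrs_to_dict(root.find("meta"))
    records = []
    for record in root.findall("./result/record"):
        fields = attrs_to_dict(record)
        fields["count"] = int(record.get("count"))
        records.append(fields)
    return meta, records


def kwic_line(record, width=12):
    """Format one record as a keyword-in-context line."""
    left = record["leftContext"]
    return f"{left:>{width}} [{record['nodeWord']}] {record['rightContext']}"


meta, records = read_result_set(RS_SAMPLE)
for rec in records:
    print(rec["count"], kwic_line(rec))
```

The generic name/value `<attr>` pairs keep the result format independent of the kind of query: a word list or a statistics query can reuse the same container with different attribute names.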
Putting things together
[Figure: data flow between the SIL components. Labels: SilUI/XML, SilWS/XML, Query, SilQL/XML, Broadcast, Result, SilRS/XML.]
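The broadcast and merge step performed by a local access server can be illustrated with a toy model. Here the resource servers are plain in-process functions rather than remote servlets, the corpora are invented, and the merge policy (concatenate in server-name order, then renumber) is an assumption; the slides do not say how result sets are actually merged.

```python
from concurrent.futures import ThreadPoolExecutor


def make_server(name, sentences):
    """Create a hypothetical resource server holding a tiny corpus."""
    def answer(word):
        # Return one record per sentence containing the query word.
        return [
            {"server": name, "text": sentence}
            for sentence in sentences
            if word in sentence.split()
        ]
    return answer


# Server names follow the experiment sites; the sentences are made up.
SERVERS = {
    "Nancy": make_server("Nancy", ["time is money", "il fait beau"]),
    "Birmingham": make_server("Birmingham", ["time flies like an arrow"]),
    "Leiden": make_server("Leiden", ["de tijd vliegt"]),
}


def broadcast(word, servers):
    """Send the same query to every server in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {name: pool.submit(fn, word) for name, fn in servers.items()}
        merged = []
        # Merge in a deterministic order so the numbering is reproducible.
        for name in sorted(futures):
            merged.extend(futures[name].result())
    # Renumber the merged records, mirroring the count attribute of <record>.
    for count, record in enumerate(merged, start=1):
        record["count"] = count
    return merged


merged = broadcast("time", SERVERS)
for record in merged:
    print(record["count"], record["server"], record["text"])
```

In the real architecture each call would be an HTTP request carrying a SIL query and each answer a SIL result set, but the control flow (fan out, collect, merge, renumber) stays the same.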
Implementation
Main technical choices
– Access servers implemented as Java servlets within an HTTP server
– Resource servers interfaced through a servlet
A single element of centralization: the Network Management Unit (NMU)
– CORBA connection to query and administer the NMU
Administration
[Figure: administration architecture. Labels: N M U; CORBA; HTTP / XML; Web Browser; Client Applet; Server 1 and Server 2, each showing RS_status, NmuClientServlet, Dispatcher and ResourceServlet.]
Cache capabilities
[Figure: cache placement in the query path. Labels: BroadcastServlet, QueryServlet, cache and ElanQueryHandler on Silfide servers; driver, connection+ and native/SilRS towards the DB at Leiden and the DB at Birmingham; messages exchanged as SIL/CQL/XML (queries) and SIL/RS/XML (results).]
Conclusions
Experiment
• A first network with Nancy (FR), Birmingham (UK), Leiden (NL) [, Pisa (IT)]
• Check demo availability at http://www.loria.fr/projets/MLIS/ELAN
Genericity of the model
– Coping with other distributed information environments
Perspectives
– Specific problems associated with linguistic resources
– Clusters of documents (e.g. multilingual alignment): RDF?
– On-line editing/annotation of documents
– Aiming at a moving target
• XSL: self-contained filtering mechanisms
• XQL: real DB+query engines associated with XML?
– Still: experimenting is VERY useful to understand problems and make things evolve