Ulrich Schäfer - DFKI language technology lab - DELPH-IN Summit Fefor 06/2006
Heart of Gold Tutorial
An XML-based middleware for the integration of deep and shallow natural language processing components
Ulrich Schäfer
DFKI language technology lab
Museu dos coches, Lisbon
Talk outline
● history
● middleware
● application clients
● modules
● PET input chart
● transformation service
● practical tour, configuration
● SDL cascades
● visualization gadgets
● web page
Heart of Gold History
roots in Whiteboard (2000-2003)
● WHAM
● shallow XML standoff annotation, XSLT ("WHAT"), PET extensions, pipeline integration
● API-based, focus on German
yy extensions to PET (~2001)
DeepThought: Heart of Gold (2002-2004)
● multilinguality, RMRS output
● flexible configuration, networking
● fallback to shallow if deep fails
extensions in QUETAL (2003-2005)
● SDL (sub-architectures with loops, parallelism)
● automatic stylesheet generation (NER, RMRS)
● new modules (Sleepy, TreeTagger), ontology interface
Heart of Gold
[Diagram labels: Application; Queries; Results; NLP components (deep parser, tagger, named entity recognizer, ...)]
Heart of Gold
[Diagram labels: Application; Queries; Results; MIDDLEWARE; NLP components]
Heart of Gold
[Architecture diagram labels: Application; Queries; Results; MIDDLEWARE (XML-RPC); Module Communication Manager; Modules; NLP components; Computed annotations (XML, RMRS); External, persistent annotation database]
Heart of Gold
[Architecture diagram labels: Application; Queries; Results; Module Communication Manager; Modules; External NLP components; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database]
Application Clients
open a session with a configuration of active modules
each query ("analyze") has parameters
● session ID
● input text
● depth of deepest analysis requested (e.g. 10 for tokens, 40 for NER, 100 for PET)
● language code
client gets result of deepest analysis as answer, other analyses on request
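The query parameters above could be marshalled into an XML-RPC request roughly as follows. This is a minimal sketch that only builds the request body with Python's standard `xmlrpc.client`. The method name `analyze` comes from the slide, but the parameter order, the helper name `build_analyze_request`, and the sample values are assumptions for illustration, not the documented Heart of Gold API.

```python
import xmlrpc.client


def build_analyze_request(session_id, text, depth, language):
    """Serialize one "analyze" query as an XML-RPC request body.

    The parameter order is an assumption made for illustration only.
    """
    params = (session_id, text, depth, language)
    return xmlrpc.client.dumps(params, methodname="analyze")


if __name__ == "__main__":
    # Depth 100 requests the deepest analysis (PET in the slide's example).
    print(build_analyze_request("session0001", "Kim sleeps.", 100, "en"))
```

Building the payload separately from sending it makes it easy to inspect what a client would put on the wire before pointing it at a running server.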
Processing Strategy
Shallowest component first (e.g. tokenizer).
Then other components with increasing depth, up to requested depth.
Fallback to result of previous component if no result from component with requested depth.
Each component gets the output of previous component as input plus the output from other components if configured.
The result of the query is the result of the deepest component in the sequence.
Analysis results from other components are returned on request.
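The strategy can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the middleware's implementation: modules are modelled as hypothetical `(depth, name, function)` triples, a module signals "no result" by returning `None`, and each module only receives the output of the last successful module.

```python
def run_pipeline(modules, text, requested_depth):
    """Run modules shallowest-first, up to requested_depth.

    Returns ((name, result) of the deepest module that produced a result,
    dict of all results so other analyses can be returned on request).
    """
    results = {}
    deepest = (None, None)
    current_input = text
    for depth, name, func in sorted(modules, key=lambda m: m[0]):
        if depth > requested_depth:
            break
        output = func(current_input)
        if output is None:
            # No result: fall back to what the previous component produced.
            continue
        results[name] = output
        deepest = (name, output)
        current_input = output
    return deepest, results


if __name__ == "__main__":
    toy_modules = [
        (100, "parser", lambda tokens: None),  # toy deep parser that fails
        (10, "tokenizer", lambda s: s.split()),
        (40, "ner", lambda toks: [t for t in toks if t.istitle()]),
    ]
    deepest, everything = run_pipeline(toy_modules, "Kim visited Lisbon", 100)
    print(deepest)
```

In the toy run the parser yields nothing, so the answer falls back to the result of the named entity step, while the tokenizer output stays available in the second return value.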
Annotation Storage
Session
Annotation collection (1 per input text)
Standoff annotations (analyses computed by components)
● XML standoff annotation and/or RMRS
● in Main Memory, XML:DB, File System
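The three-level layout (session, annotation collection, standoff annotation) can be pictured as nested dictionaries held in main memory. The sketch below is only a mental model: the class names, method names, and the `collection0001` ID format are hypothetical and not taken from the Heart of Gold code base.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationCollection:
    """All standoff annotations computed for one input text."""
    text: str
    annotations: dict = field(default_factory=dict)  # annotation ID -> XML


@dataclass
class Session:
    """A session holds one annotation collection per input text."""
    session_id: str
    collections: dict = field(default_factory=dict)  # collection ID -> collection

    def new_collection(self, text):
        # Hypothetical ID scheme, chosen only for this example.
        acid = "collection%04d" % (len(self.collections) + 1)
        self.collections[acid] = AnnotationCollection(text)
        return acid

    def store(self, acid, aid, xml):
        self.collections[acid].annotations[aid] = xml

    def fetch(self, acid, aid):
        return self.collections[acid].annotations[aid]
```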
Modules
modules are adapters to external NLP components (PET, tagger, NER, ...)
connection direct (e.g. process streams) or via XML-RPC
depth, language, name are mandatory configuration properties
input is output from previous module, alternative and additional input configurable
XML output mandatory (RMRS generation optional, e.g. via XSLT stylesheet)
Sample Module Configuration: PETno
# configuration file for PET module
#
module.name=PET
module.depth=100
module.language=no
#
# root element name for XML output
module.rootelement=pet
#
# common modules settings end here -----
#
# path to cheap binary
pet.binary=components/pet/bin/cheap
#
# additional library search path for cheap
pet.libs=components/pet/lib
#
# working directory (where the grammar is)
pet.grammardir=components/pet/norwegian
#
# prefix for grammar file
pet.grammarprefix=norsourcepet
#
# command line options for cheap
pet.options=-mrs=xml -limit=30000 -nsolutions=1
#
# character set encoding for PET input
pet.inputencoding=ISO-8859-1
#
# character set encoding for PET output
pet.outputencoding=ISO-8859-1
#
# input annotation(s), comma-separated
# (for use in conjunction with PIC mode)
# use "rawtext" for raw input text.
# omitting/empty value means take input from
# previous component (XML)
pet.inputannotation=rawtext
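A module configuration in this key=value format, with one property per line, can be read with a few lines of Python. The sketch below also checks the three properties the Modules slide calls mandatory (name, depth, language). The function name `parse_module_config` and the error handling are assumptions for illustration, and the real middleware may read its configuration differently.

```python
MANDATORY_KEYS = ("module.name", "module.depth", "module.language")


def parse_module_config(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only: values such as "-mrs=xml" contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    missing = [k for k in MANDATORY_KEYS if k not in props]
    if missing:
        raise ValueError("missing mandatory properties: " + ", ".join(missing))
    props["module.depth"] = int(props["module.depth"])
    return props
```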
Integrated Components
Name          Purpose                      Depth   Lang. resources          Implemented in
JTok          tokenizer, SBR               10      en, de, it               Java
TnT           stat. PoS tagger             20      en, de                   C
ChaSen        tagger, segmentation         20      ja                       C
TreeTagger    tagger                       20      de, en, fr, it, es...    C
Chunkie       stat. chunker                30      en, de                   C
ChunkieRmrs   RMRS of chunks               35      en, de                   XSLT, XTDL, SDL
SProUT        morph, IE/NER                40      en,de,el,fr,es,ja,...    Java
LingPipe      NER, coreference resolver    40      en, es, ...              Java
Corcy         coreference resolver         45      en                       Python
RASP          stat. parser                 50      en                       Lisp
Sleepy        stat. parser                 50      de                       OCaml
PET           deep parser                  100     en,de,el,ja,[it,no]      C, C++
SDL           sub-architectures            n       -                        Java
RMRSmerge     merge RMRSes                 110     -                        XSLT, SDL
Simple Integration of new Components
1. Subclass Module
2. Implement init(), process() and shutdown()
3. Use e.g. XSL transformation to generate RMRS output (cf. TnT, SProUT integration)
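The three steps can be illustrated with a Python analogy. Only the method names `init()`, `process()` and `shutdown()` come from the slide. The base class below, its constructor arguments, and the toy tokenizer are hypothetical stand-ins and do not reproduce the actual `Module` class of the middleware.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class Module(ABC):
    """Hypothetical base class mirroring the three-method life cycle."""

    def __init__(self, name, depth, language):
        self.name, self.depth, self.language = name, depth, language

    @abstractmethod
    def init(self):
        """Start or connect to the wrapped component."""

    @abstractmethod
    def process(self, input_annotation):
        """Return the component's analysis as an XML string."""

    @abstractmethod
    def shutdown(self):
        """Release the wrapped component."""


class WhitespaceTokenizerModule(Module):
    """Toy adapter: wraps str.split() and emits XML output."""

    def init(self):
        self.ready = True

    def process(self, input_annotation):
        tokens = input_annotation.split()
        body = "".join("<token>%s</token>" % escape(t) for t in tokens)
        return "<tokens>%s</tokens>" % body

    def shutdown(self):
        self.ready = False
```

Step 3 (RMRS generation via XSLT) is left out here, since it depends on a stylesheet processor and on the component's XML format.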
Annotation Metadata
<metadata acid="collection0002" component="PET" created="Do, 4 Dez 2003 18:25:16 +0100" processingtime="00:08,140" sessionid="session0001" diagnosis="OK">
  <conf name="pet.cfg">
    <entry name="module.rootelement" value="pet"/>
    <entry name="module.language" value="en"/>
    <entry name="module.depth" value="100"/>
    <entry name="pet.grammarprefix" value="english"/>
    <entry name="pet.options" value="-mrs=xml"/>
    <entry name="pet.inputencoding" value="ISO-8859-1"/>
    <entry name="pet.outputencoding" value="ISO-8859-1"/>
    <entry name="pet.inputannotation" value="rawtext"/>
  </conf>
</metadata>
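A client that receives such a metadata element can pull out the processing attributes and the recorded configuration with Python's standard `xml.etree.ElementTree`. The sketch below uses a shortened copy of the sample and assumes that `diagnosis` is an attribute of `<metadata>`. The helper name `read_metadata` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above; 'diagnosis' as an attribute is assumed.
SAMPLE = """<metadata acid="collection0002" component="PET"
    processingtime="00:08,140" sessionid="session0001" diagnosis="OK">
  <conf name="pet.cfg">
    <entry name="module.language" value="en"/>
    <entry name="module.depth" value="100"/>
    <entry name="pet.options" value="-mrs=xml"/>
  </conf>
</metadata>"""


def read_metadata(xml_text):
    """Return (attributes of <metadata>, dict of recorded config entries)."""
    root = ET.fromstring(xml_text)
    config = {e.get("name"): e.get("value") for e in root.iter("entry")}
    return dict(root.attrib), config


if __name__ == "__main__":
    attrs, config = read_metadata(SAMPLE)
    print(attrs["component"], attrs["diagnosis"], config["module.depth"])
```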
PET XML Input Chart ('PIC', 'PiXML')
generalisation and extension of yy input mode (cf. example; DTD in HoG doc)
TnT-, ChaSen-, SproutModule adapted to generate PiXML as additional annotation
XML-wise 'concatenation' of n input charts via XSLT stylesheet
PicModule for text input without PoS tagger
PET XML Input Chart
conf/en/pet.cfg:
# input annotations, comma-separated
pet.inputannotation=TnTpiXML,SProUTpiXML
# stylesheet for XML chart combination
pet.combinestylesheet=xsl/pic/combinepixml.xsl
#
# stylesheet for preprocessing the PET input chart (opt.)
pet.preprocstylesheet=xsl/pic/remove-subspan-items.xsl
[Figure panels: TnTpiXML; SProUTpiXML]
Heart of Gold
[Architecture diagram labels: Application; Queries; Results; Module Communication Manager; Modules; External NLP components; Computed annotations (XML, RMRS); External, persistent annotation database]
with TransformationService
[Architecture diagram labels: Application; Queries; Results; Module Communication Manager; Modules; External NLP components; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database]
TransformationService
Central XSLT class with access to computed Heart of Gold annotations via special URI:
URI syntax (in XPath): document(hog://sid/acid/aid)/PATH/TO/ELEMENT
where sid = session ID, acid = annotation collection ID, aid = annotation ID
Session
Annotation collection (1 per input text)
Standoff annotations (analyses computed by components)
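The `hog://sid/acid/aid` scheme is simple enough to take apart by hand. The sketch below only illustrates how the three IDs map onto the URI. The function name `parse_hog_uri` and the validation rules are assumptions, not part of TransformationService.

```python
PREFIX = "hog://"


def parse_hog_uri(uri):
    """Split hog://sid/acid/aid into (session ID, collection ID, annotation ID)."""
    if not uri.startswith(PREFIX):
        raise ValueError("not a hog:// URI: %r" % uri)
    parts = uri[len(PREFIX):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected hog://sid/acid/aid, got %r" % uri)
    sid, acid, aid = parts
    return sid, acid, aid


if __name__ == "__main__":
    # Sample IDs modelled on the metadata example; the values are illustrative.
    print(parse_hog_uri("hog://session0001/collection0002/PET"))
```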
XSLT for Component Integration
post-processing of SProUTput:
1. PET input chart generation with mapping to generic HPSG NE types
2. RMRS generation
both stylesheets generated automatically at compile time from TDL type hierarchies of SProUT named entity grammars (5500 more lines...):
XSLT for Component Integration
IE-like structured RMRS output for application:
only NE span and type information for PET:
SProUTput:
Sample Module Configuration: PETen
# configuration file for PET module
#
module.name=PET
module.depth=100
module.language=en
#
# root element name for XML output
module.rootelement=pet
#
# common modules settings end here -----
#
# path to cheap binary
pet.binary=components/pet/bin/cheap
#
# additional library search path for cheap
pet.libs=components/pet/lib
#
# working directory (where the grammar is)
pet.grammardir=components/pet/erg
#
# prefix for grammar file
pet.grammarprefix=english
#
# command line options for cheap
pet.options=-xml_counts -mrs=xml -default-les -limit=30000 -nsolutions=2
#
# character set encoding for PET input
pet.inputencoding=UTF-8
#
# character set encoding for PET output
pet.outputencoding=UTF-8
#
# input annotation(s), comma-separated
# (for use in conjunction with yy mode)
# use "rawtext" in conjunction with non-yy mode
# omitting/empty value means take input from
# previous component (XML)
pet.inputannotation=TnTpiXML,SProUTpiXML
#pet.inputannotation=rawtext
#
# stylesheet for XML input chart combination
pet.combinestylesheet=xsl/pic/combinepixml.xsl
#
# stylesheet for preprocessing the input chart
# no transformation if unset
#pet.preprocstylesheet=xsl/pic/remove-subspan-items.xsl
#
# stylesheet for postprocessing fragments
# return only the n longest fragments
# unset=return all (=no stylesheet application)
pet.postprocstylesheet=xsl/rmrs/extract-longest-fragment.xsl
#
# stylesheet parameter: number of fragments to return
# unset=return all (=no stylesheet application)
pet.postprocfragments=5
Heart of Gold tour
download/installation
prerequisites: Python, Java, Mozilla/Firefox
directory structure below hog/
ISO 639 language codes
Logging: log4j configuration
Heart of Gold configuration files in conf/
XML-RPC server and ant configuration: conf/mocoman.cfg
-> logging configuration in conf/logging/
session configuration in conf/en/
-> module configurations in conf/en/
-> component configurations in components/XXXX/YYY
Starting and stopping server, using clients
SDLModule
Generic module that plugs SDL (Krieger '03) sub-architectures into the Heart of Gold
Generic SProUT and XSLT SDL modules implemented (SProUT grammars and XSLT stylesheets via configuration)
Access to other (computed) Heart of Gold annotations via TransformationService
Application: RMRS construction from chunks
Can also serve as 'standalone' SProUT wrapper for shallow cascades
Heart of Gold Schema with SDLModule
[Architecture diagram labels: Application; Queries; Results; Module Communication Manager; Modules; External NLP components; SDLModule; Compiled SDL sub-architecture(s); SDL XsltModules; SDL SproutModules; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database]
ChunkieRMRS cascade within HoG
ChunkieRMRS (SDL-defined module)
Constraint-Based RMRS Construction from Shallow Grammars (Frank et al. 2004)
SDL definition of SProUT-XSLT cascade for ChunkieRMRS
de.dfki.lt.quetal.sdlgen.chunkiermrs_de = ( sprout_rmrs_pos + xslt_morph_filter + sprout_rmrs_lex + xslt_nodeid_cat + sprout_rmrs_comp + sprout_rmrs_final + xslt_fsxml2rmrsxml + xslt_reorder )
sprout_rmrs_pos = de.dfki.lt.sdl.sprout.SproutModulesTextXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-pos/rmrs/rmrs-pos.cfg", "SDLs-RMRS-pos")
xslt_morph_filter = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/morphposfilter.xsl", "SDLx-Morph-filter", "aid", "Chunkie")
sprout_rmrs_lex = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-lex/rmrs/rmrs-lex.cfg", "SDLs-RMRS-lex")
xslt_nodeid_cat = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/nodeinfo.xsl", "SDLx-Node-info", "aid", "Chunkie")
sprout_rmrs_comp = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-cascade/rmrs/rmrs-cascade.cfg", "SDLs-RMRS-casc")
sprout_rmrs_final = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-final/rmrs/rmrs-final.cfg", "SDLs-RMRS-final")
xslt_fsxml2rmrsxml = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/rmrsfs2rmrsxml.xsl", "SDLx-RMRS-2dtd")
xslt_reorder = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/reorderrmrsdtrs.xsl", "SDLx-RMRS-reorder")
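Reading `+` in the definition above as sequential composition (an interpretation, since the slide does not spell out the operator semantics), the cascade behaves like a chain of functions in which each module consumes the output of the one before it. The Python sketch below shows that idea with toy string functions standing in for the SProUT and XSLT modules. The helper `sequence` is hypothetical and unrelated to the code generated by the SDL compiler.

```python
from functools import reduce


def sequence(*modules):
    """Compose modules left to right: each output feeds the next module."""
    def cascade(data):
        return reduce(lambda value, module: module(value), modules, data)
    return cascade


if __name__ == "__main__":
    # Toy stand-ins for the eight cascade steps.
    toy_cascade = sequence(str.strip, str.lower, str.split)
    print(toy_cascade("  Heart of Gold "))
```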
Configuration of ChunkieRMRS SdlModule
# configuration file for Chunkie RMRS module (SDL)
#
module.name=ChunkieRmrs
module.depth=35
module.language=en
# root element name for XML output
module.rootelement=chunkiermrs
# ----- common modules settings end here -----
# name of input annotation (raw text for first cascade/SProUT)
sdl.inputannotation=rawtext
# class name of compiled SDL definition
# (same as class name at beginning of .sdl file)
# can be compiled using 'ant chunkiermrs'
sdl.classname=de.dfki.lt.hog.sdlgen.chunkiermrs_en
SDL definition of RmrsMerge cascade
XSLT Stylesheets developed by Anette Frank
de.dfki.lt.hog.sdlgen.rmrsmerge = ( rmrs_ep_rargs2rels + adjust_nespans + merge_ne_to_petrasp + rmrs_rels2ep_rargs + reorder_rmrs_dtrs )
rmrs_ep_rargs2rels = de.dfki.lt.sdl.xslt.XsltModulesStringDomEncapsulated ("xsl/sdl/rmrsmerge/rmrs_ep_rargs2rels.xsl", "SDLx_rargs2rels")
adjust_nespans = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/adjust_nespans.xsl", "SDLx_adjustnespans", "aid", "Sprout")
merge_ne_to_petrasp = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/merge-ne-to-rasp.xsl", "SDLx_netorasp", "aid", "Sprout")
rmrs_rels2ep_rargs = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/rmrs_rels2ep_rargs.xsl", "SDLx_rels2rargs")
reorder_rmrs_dtrs = de.dfki.lt.sdl.xslt.XsltModulesDomStringEncapsulated ("xsl/sdl/rmrsmerge/reorderrmrsdtrs.xsl", "SDLx_reorderdtrs", "aid", "xmltext")
Configuration of RmrsMerge module
# configuration file for RmrsMerge module (SDL)
#
module.name=RmrsMerge
module.depth=110
module.language=en
# root element name for XML output
module.rootelement=merged-rmrs
# ----- common modules settings end here -----
# name of input annotation (PET or RASP)
sdl.inputannotation=PET
# class name of compiled SDL definition
# (same as class name at beginning of .sdl file)
# can be compiled using 'ant rmrsmerge'
sdl.classname=de.dfki.lt.hog.sdlgen.rmrsmerge
Visualization Gadgets
HTML (generic XML, RMRS) xsl/html/xml2html.xsl, rmrs2html.xsl
AVM (generic XML, SProUTput): applet part of SProUT runtime
LaTeX (FS-XML, SProUTput, RMRS) fs2latex tool, xsl/latex/rmrs2latex.xsl
Complete PHP-based Webdemo portal is part of Heart of Gold CVS
Documentation, Papers, Downloads
● core middleware is LGPL
● different licences for (externally developed) components
● http://heartofgold.dfki.de
● http://lists.delph-in.net