Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft

Dictionary Entry Parsing

Lothar Lemnitzer, Claudia Kunze

[email protected], [email protected]

Computational Lexicography at ESSLLI 2005


Topics

What is Dictionary Entry Parsing?

The Structure of Dictionary Entries

Problems of Parsing Entries

Architecture of the LexParse Parser

Resources


What is Dictionary Entry Parsing?

Dictionary Entry Parsing

takes entries of print dictionaries as input

segments the entries

classifies the segments according to their function

converts the entry into a tree-like or an SGML representation
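The three steps above can be sketched on a toy example. This is only an illustration, not the LexParse implementation: the entry string, the German equivalents, the label names and the classification heuristics are all assumptions made up for the sketch.

```python
import re

# Hypothetical toy entry: headword, part of speech, then equivalents.
ENTRY = "black adj. schwarz; dunkel"


def segment(entry):
    """Split an entry into raw segments at whitespace and semicolons."""
    return [s for s in re.split(r"[;\s]+", entry) if s]


def classify(segments):
    """Label each segment by position and shape (toy heuristics)."""
    labelled = []
    for i, seg in enumerate(segments):
        if i == 0:
            label = "headword"      # assumed: the first segment is the headword
        elif seg.endswith("."):
            label = "pos"           # assumed: abbreviations mark part of speech
        else:
            label = "equivalent"
        labelled.append((label, seg))
    return labelled


def to_tree(labelled):
    """Wrap the labelled segments in a one-level tree."""
    return ("entry", [(label, text) for label, text in labelled])


def parse_entry(entry):
    """Run the three steps: segment, classify, convert."""
    return to_tree(classify(segment(entry)))
```

Real entries need far richer rules than these positional heuristics, which is why the approach described in the following slides is driven by a grammar.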


Why parse dictionary entries?

Printed dictionaries are considered to be a store of valuable linguistic information

They have been re-used as a source for lexical databases in NLP

Some print dictionaries have been converted into electronic dictionaries (electronic publishing)


The structure of dictionary entries

Figure 1: Bilingual dictionary entry for the headword black


The structure of dictionary entries

Most dictionary entries have the following structural characteristics:

They consist of information items (e.g. part of speech, equivalent(s))

Information items serve to provide information about the headword

Some of the information items are optional

Some of the information items are grouped
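One way to picture these characteristics is as a small data model in which optional items default to "absent" and grouped items live in their own record. The class and field names below are assumptions chosen for illustration; they are not taken from LexParse.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Grammar:
    """A group of items: part of speech plus an optional inflection note."""
    part_of_speech: str
    inflection: Optional[str] = None


@dataclass
class Entry:
    """An entry: every item provides information about the headword."""
    headword: str
    grammar: Optional[Grammar] = None                      # optional group
    equivalents: List[str] = field(default_factory=list)   # optional items

    def items(self):
        """List the information items that are actually present."""
        found = []
        if self.grammar is not None:
            found.append(("part_of_speech", self.grammar.part_of_speech))
            if self.grammar.inflection is not None:
                found.append(("inflection", self.grammar.inflection))
        found.extend(("equivalent", e) for e in self.equivalents)
        return found
```

For instance, `Entry("black", Grammar("adj"), ["schwarz"]).items()` lists a part of speech and one equivalent, while `Entry("black").items()` is empty because every item besides the headword is optional in this sketch.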


The structure of dictionary entries

Structural relations between information items in dictionary entries:

Linear precedence: some information items precede or follow others

(Immediate) dominance: some higher nodes dominate groups of information items (e.g. grammar → part of speech, inflection)
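Both relations can be checked mechanically once an entry is available as a tree. The sketch below assumes a simple nested-tuple encoding and invented node labels and values; it illustrates the two relations only and is not a description of how LexParse stores its trees.

```python
# A node is (label, text) for a leaf or (label, [children]) for an inner node.
# Labels and values are made up for the example.
TREE = ("entry", [
    ("headword", "black"),
    ("grammar", [
        ("pos", "adj"),
        ("inflection", "-er, -est"),
    ]),
    ("equivalent", "schwarz"),
])


def leaves(node):
    """Return the leaf labels in left-to-right order."""
    label, content = node
    if isinstance(content, str):
        return [label]
    result = []
    for child in content:
        result.extend(leaves(child))
    return result


def precedes(tree, a, b):
    """Linear precedence: the first a-leaf comes before the first b-leaf."""
    order = leaves(tree)
    return order.index(a) < order.index(b)


def immediately_dominates(node, parent, child):
    """Immediate dominance: a node labelled parent has a direct child."""
    label, content = node
    if isinstance(content, str):
        return False
    if label == parent and any(c[0] == child for c in content):
        return True
    return any(immediately_dominates(c, parent, child) for c in content)
```

Here `grammar` immediately dominates `pos`, whereas `entry` dominates `pos` only indirectly, through the `grammar` node.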


Additional difficulties

Implicit information must be made explicit (e.g. gender → part of speech)

Abbreviations must be resolved (e.g. the tilde symbol representing the headword)
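Both difficulties can be illustrated in a few lines. The sketch rests on two assumptions that the slides do not spell out: that a gender label on an entry implies the part of speech "noun", and that gender is abbreviated as m, f or n. The function and key names are hypothetical.

```python
GENDER_LABELS = {"m", "f", "n"}  # assumed gender abbreviations


def resolve_tilde(text, headword):
    """Replace the tilde placeholder with the headword it stands for."""
    return text.replace("~", headword)


def make_explicit(items):
    """Add the part of speech when only a gender label is given.

    Assumption: a gender label implies that the headword is a noun.
    """
    explicit = dict(items)
    if explicit.get("gender") in GENDER_LABELS and "part_of_speech" not in explicit:
        explicit["part_of_speech"] = "noun"
    return explicit
```

With the headword "black", `resolve_tilde("~ market", "black")` yields "black market"; an item set that already carries a part of speech is left unchanged by `make_explicit`.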


Structure indicators

Structure indicators are essential to the format of dictionary entries

They mark the beginning and end of information items ('fields')

Punctuation and other symbols are used as non-typographic structure indicators

Fonts and typefaces are used as typographic structure indicators
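As a rough illustration, a tokenizer can separate both kinds of indicator from the text they delimit. The sketch assumes that typeface changes arrive as `<b>`/`<i>` style tags; this markup is a stand-in chosen for the example, since the real encoding depends on the typesetting data.

```python
import re

# Typographic indicators: assumed font tags. Non-typographic: punctuation.
TOKEN = re.compile(
    r"(?P<typo></?[bi]>)"       # font switches such as <b> or </i>
    r"|(?P<punct>[;,()])"       # punctuation used as field delimiters
    r"|(?P<text>[^<;,()]+)"     # everything in between
)


def tokenize(entry):
    """Return (kind, value) pairs, dropping whitespace-only text."""
    tokens = []
    for match in TOKEN.finditer(entry):
        value = match.group().strip()
        if value:
            tokens.append((match.lastgroup, value))
    return tokens
```

Applied to `<b>black</b> <i>adj</i> schwarz; dunkel`, the tokenizer reports the bold and italic switches as `typo` tokens, the semicolon as a `punct` token, and the words as `text` tokens.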


Elements of a dictionary entry grammar

The dictionary entry grammar guides the analysis of the entries

It defines the set of well-formed entries

A dictionary entry grammar is a quadruple (CEI, CNI, R, WA)

CEI = terminal alphabet; CNI = non-terminal symbols; R = set of rules; WA = initial symbol
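The quadruple maps directly onto a small record type. The sketch below is an assumption-laden illustration: it stores a single right-hand side per non-terminal for brevity, reuses the symbols WA, FK, SK, XFLBE and XFLEN that appear in the example rule on the Grammar slide, and treats the two Xcodes as terminals.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class EntryGrammar(NamedTuple):
    CEI: FrozenSet[str]               # terminal alphabet
    CNI: FrozenSet[str]               # non-terminal symbols
    R: Dict[str, Tuple[str, ...]]     # rules: left-hand side -> right-hand side
    WA: str                           # initial symbol


def is_consistent(g):
    """Check that the four components fit together."""
    if g.WA not in g.CNI or g.CEI & g.CNI:
        return False
    symbols = g.CEI | g.CNI
    return all(lhs in g.CNI and set(rhs) <= symbols for lhs, rhs in g.R.items())


GRAMMAR = EntryGrammar(
    CEI=frozenset({"XFLBE", "XFLEN"}),
    CNI=frozenset({"WA", "FK", "SK"}),
    R={"WA": ("XFLBE", "FK", "SK", "XFLEN")},
    WA="WA",
)
```

`is_consistent` only checks that the components reference each other correctly; deciding whether a given entry is well-formed additionally requires a parser that applies the rules.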


Tasks of the LexParse parser

Split any (standard) entry of any dictionary into segments

Reconstruct the hierarchical structure of the entry

Resolve abbreviations and make all information explicit

Report on malformed entries

Represent the data in a well-defined format (e.g. SGML, database records)


Parser configuration and input

General: executable programme, configuration file

Specific: dictionary entry grammar, dictionary data


Architecture of LexParse

Figure 2: The LexParse parser, developed by Storrer and Hauser


Directives for preprocessing

Directives prepare the input file or typesetting tape for the parser

Delete superfluous lines and patterns

Convert patterns into Xcodes
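A minimal sketch of such directives follows. It is not LexParse's directive syntax, which the slides do not show: the tuple format, the `%` comment-line pattern and the mapping of round brackets to the codes XBRPO and XBRPC are assumptions made for illustration.

```python
import re

# Each directive is (action, regex, replacement); the format is hypothetical.
DIRECTIVES = [
    ("delete", r"^%.*$", None),       # assumed: lines starting with % are noise
    ("convert", r"\(", " XBRPO "),    # assumed: opening bracket code
    ("convert", r"\)", " XBRPC "),    # assumed: closing bracket code
]


def preprocess(lines, directives=DIRECTIVES):
    """Apply all directives to every line and drop lines left empty."""
    out = []
    for line in lines:
        for action, pattern, replacement in directives:
            if action == "delete":
                line = re.sub(pattern, "", line)
            else:
                line = re.sub(pattern, replacement, line)
        line = " ".join(line.split())   # normalise whitespace
        if line:
            out.append(line)
    return out
```

Given the lines `% typesetting header` and `black (colour) schwarz`, the first is deleted and the second becomes `black XBRPO colour XBRPC schwarz`.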


Xcodes

Xcodes reflect structure indicators

Typeface information, brackets and some special characters should be converted into Xcodes

e.g.: Cat → XBRPO, *, XBRPC (a category is expanded to a string enclosed by brackets)

Ambiguous cases are resolved by treating the majority of cases correctly or by defining sub-patterns (e.g. for the semicolon)


Grammar

The grammar is a set of rewrite rules

Non-terminal symbols are expanded to sets of terminal and non-terminal symbols

e.g.: WA → FK, SK

LexParse style: WA → XFLBE, FK, SK, XFLEN
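To show how such rewrite rules can drive the analysis, here is a small top-down recognizer over a token list. Only the WA rule and the Cat rule come from the slides; the expansions of FK and SK are hypothetical, `*` is simplified to match exactly one text token, and the parsing strategy is a generic recursive descent, not necessarily the one LexParse uses.

```python
XCODES = {"XFLBE", "XFLEN", "XBRPO", "XBRPC"}

RULES = {
    "WA": ["XFLBE", "FK", "SK", "XFLEN"],   # rule from the slide above
    "FK": ["*"],                            # hypothetical expansion
    "SK": ["Cat"],                          # hypothetical expansion
    "Cat": ["XBRPO", "*", "XBRPC"],         # rule from the Xcodes slide
}


def parse(symbol, tokens, pos=0):
    """Return (tree, next_pos), or None if tokens do not match symbol."""
    if symbol == "*":                       # wildcard: one non-Xcode token
        if pos < len(tokens) and tokens[pos] not in XCODES:
            return tokens[pos], pos + 1
        return None
    if symbol not in RULES:                 # terminal: must match exactly
        if pos < len(tokens) and tokens[pos] == symbol:
            return symbol, pos + 1
        return None
    children = []
    for part in RULES[symbol]:              # non-terminal: expand in order
        result = parse(part, tokens, pos)
        if result is None:
            return None
        child, pos = result
        children.append(child)
    return (symbol, children), pos


def is_well_formed(tokens):
    """An entry is well-formed if WA consumes all of its tokens."""
    result = parse("WA", tokens)
    return result is not None and result[1] == len(tokens)
```

The token list `["XFLBE", "black", "XBRPO", "colour", "XBRPC", "XFLEN"]` is accepted, while a list missing the closing XFLEN is rejected, which is the kind of condition a report on malformed entries could be built on.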


Display options

Specifies the format of the output

Options are: SGML, Tree, Map
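The sketch below shows what two of these output formats might look like for the same parse tree. It assumes a nested-tuple tree encoding and invented labels; the exact output of LexParse is not shown in the slides, and the Map option is left out because the slides do not describe it.

```python
from xml.sax.saxutils import escape

# A node is (label, text) for a leaf or (label, [children]) otherwise.
TREE = ("entry", [("headword", "black"), ("grammar", [("pos", "adj")])])


def to_sgml(node):
    """Render the tree as nested tags."""
    label, content = node
    if isinstance(content, str):
        inner = escape(content)
    else:
        inner = "".join(to_sgml(child) for child in content)
    return f"<{label}>{inner}</{label}>"


def to_tree(node, depth=0):
    """Render the tree as indented lines, one node per line."""
    label, content = node
    pad = "  " * depth
    if isinstance(content, str):
        return [f"{pad}{label}: {content}"]
    lines = [f"{pad}{label}"]
    for child in content:
        lines.extend(to_tree(child, depth + 1))
    return lines


def display(node, option="SGML"):
    """Dispatch on the display option (only SGML and Tree are sketched)."""
    if option == "SGML":
        return to_sgml(node)
    if option == "Tree":
        return "\n".join(to_tree(node))
    raise ValueError(f"unsupported display option: {option}")
```

Keeping the renderers separate from the parse tree means a further output format can be added without touching the parser.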


Directives for postprocessing

Directives clean the output

Delete superfluous lines and patterns

Convert patterns (e.g. German umlauts)
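A postprocessing pass of this kind might look as follows. How umlauts are encoded in the actual typesetting data is not stated in the slides; the sketch assumes a TeX-like convention in which a double quote precedes the base letter, and the function name is hypothetical.

```python
import re

# Assumed input encoding: "a for ä, "o for ö, and so on.
UMLAUTS = {
    '"a': "ä", '"o': "ö", '"u': "ü",
    '"A': "Ä", '"O': "Ö", '"U': "Ü",
    '"s': "ß",
}


def postprocess(lines):
    """Convert umlaut codes and drop lines that contain only whitespace."""
    cleaned = []
    for line in lines:
        line = re.sub(r'"[aouAOUs]', lambda m: UMLAUTS[m.group()], line)
        if line.strip():
            cleaned.append(line)
    return cleaned
```

A rule this simple would also rewrite quoted attribute values that happen to start with one of these letters, so a real directive set would need narrower patterns.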


Conclusion

LexParse is a general parser for (standard) dictionary entries

LexParse deviates in some respects from a general language parser, since the language of dictionary entries is special

LexParse prepares the data for subsequent formal processing (e.g. in a lexical database)

LexParse provides error reports and is therefore useful for consistency checking
