A Semi-supervised Method for Efficient Construction of Statistical Spoken
Language Understanding Resources
Seokhwan Kim, Minwoo Jeong, and Gary Geunbae Lee
Pohang University of Science and Technology (POSTECH), South Korea
ABSTRACT
We present a semi-supervised framework for constructing spoken language understanding resources at very low cost. We
generate context patterns with a few seed entities and a large
amount of unlabeled utterances. Using these context patterns,
we extract new entities from the unlabeled utterances. The
extracted entities are appended to the seed entities, and we can
obtain the extended entity list by repeating these steps. Our
method is based on an utterance alignment algorithm, a variant of biological sequence alignment. Using this method, we can obtain precise entity lists with high coverage, which helps reduce the cost of building resources for statistical spoken language understanding systems.
SCORING CANDIDATES
● The score of the alignment between a raw utterance and a
context pattern
OVERALL PROCEDURE
1. Prepare a seed entity list E and an unlabeled corpus C.
2. Find utterances in the corpus C that contain entities from E, and replace the matched entity spans in the found utterances with a label indicating the location of the entity. Add the partially labeled utterances to the context pattern set P.
3. Align each utterance in the corpus C with each context pattern in P, and extract as new entity candidates the parts of the utterance that are aligned with the entity label in the context pattern.
4. Compute the score of each entity candidate extracted in step 3, and add only high-scoring candidates to E.
5. If no entities were added to E in step 4, terminate the process with the entity list E, the context pattern set P, and the partially labeled corpus C as results. Otherwise, return to step 2 and repeat the process.
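The bootstrapping loop above can be sketched as follows. This is a simplified illustration, not the paper's implementation: exact one-word slot matching stands in for the alignment-based extraction of step 3, a pattern-support count stands in for the candidate score of step 4, and the corpus, label, and threshold are made up for the example.

```python
import re

LABEL = "<CITY_NAME>"  # illustrative entity label

def make_patterns(utterances, entities):
    """Step 2: replace entity words with the label to form context patterns.
    Entities at the utterance boundaries are skipped (see the pattern-
    extraction rule)."""
    patterns = set()
    for utt in utterances:
        words = utt.split()
        for ent in entities:
            if ent in words[1:-1]:
                patterns.add(" ".join(LABEL if w == ent else w for w in words))
    return patterns

def extract_candidates(utterances, patterns):
    """Step 3 (simplified): turn the label slot into a one-word wildcard and
    count, for each candidate word, how many pattern matches support it."""
    counts = {}
    for pat in patterns:
        regex = re.compile("^" + re.escape(pat).replace(re.escape(LABEL), r"(\w+)") + "$")
        for utt in utterances:
            m = regex.match(utt)
            if m:
                counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

def bootstrap(corpus, seeds, threshold=2):
    """Steps 1-5: iterate until no new candidate clears the score threshold."""
    entities = set(seeds)
    while True:
        patterns = make_patterns(corpus, entities)
        counts = extract_candidates(corpus, patterns)
        new = {c for c, n in counts.items() if n >= threshold} - entities
        if not new:  # step 5: fixed point reached
            return entities, patterns
        entities |= new

corpus = [
    "i want to fly to boston on monday",
    "i want to fly to denver on monday",
    "i want to fly to seattle on friday",
    "book a flight to denver please",
    "book a flight to boston please",
]
entities, patterns = bootstrap(corpus, seeds={"boston"})
print(sorted(entities))  # → ['boston', 'denver']
```

Note that "seattle" is not extracted: it occurs only once in a pattern context, so it never reaches the support threshold, which mirrors how the score in step 4 filters out weakly supported candidates.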
MOTIVATION
Spoken Language Understanding (SLU) is the problem of extracting semantic meaning from recognized utterances and filling the correct values into a semantic frame structure. Most statistical SLU approaches require a sufficient amount of training data consisting of utterances labeled with their corresponding semantic concepts. Building this training data is one of the most important and costly tasks in managing spoken dialog systems. We concentrate on utilizing a semi-supervised information extraction method to build resources for the statistical SLU modules of spoken dialog systems.
EXTRACTING CONTEXT PATTERNS
To overcome the context-sparseness problem of spoken utterances, we use the full utterance itself, rather than its sub-phrases, as a context pattern for extracting the named entities in an utterance. First, we assume that each entry in the entity list is precise and belongs to exactly one category, whether it is a seed entity or an entity added during the overall procedure. For each entry in the entity list, we find the utterances containing it and build an utterance template by replacing the entity span in the utterance with the defined entity label. In this replacement, we exclude entities located at the beginning or end of the utterance, because context patterns containing entities in such positions can cause confusion when determining entity boundaries later in the procedure.
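The replacement rule, including the boundary exclusion, can be sketched as below. The sketch handles single-word entities only (multi-word entities would need span matching), and `make_pattern` and the label string are illustrative names, not from the paper.

```python
def make_pattern(utterance, entity, label):
    """Replace the entity word with its label, or return None when the
    entity sits at the beginning or end of the utterance (such patterns
    are excluded because they blur entity boundaries during alignment)."""
    words = utterance.split()
    if entity not in words:
        return None  # entity does not occur in this utterance
    if words[0] == entity or words[-1] == entity:
        return None  # boundary position: excluded
    return " ".join(label if w == entity else w for w in words)

print(make_pattern("i want to fly to denver on monday", "denver", "<CITY_NAME>"))
# → i want to fly to <CITY_NAME> on monday
print(make_pattern("denver to boston please", "denver", "<CITY_NAME>"))
# → None (entity at the utterance start)
```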
ALIGNMENT-BASED NAMED ENTITY RECOGNITION
We first align a raw utterance with a context pattern containing entity labels. Then, from the best alignment between them, we extract the parts of the raw utterance aligned to the entity labels in the context pattern as entity candidates belonging to the category of the corresponding entity label.
MATRIX COMPUTATION AND TRACEBACK
The traceback step starts at the position with the maximum score among the first column and the first row. Then, the next position after position [i, j] is determined by the following policies:
• If tar(i) and ref(j) are identical, the next position is [i + 1, j + 1].
• Otherwise, the next position is the position with the maximum score among [i + 1, j + 1] ~ [i + 1, n] (row i + 1) and [i + 1, j + 1] ~ [m, j + 1] (column j + 1).
● The score of an entity candidate e_j extracted by a context pattern ref_i
● The final score of an entity candidate e_j
EXPERIMENTS
We evaluated our method on the CU-Communicator corpus, which consists of 13,983 utterances. We chose the three most frequently occurring semantic categories in the corpus: CITY_NAME, MONTH, and DAY_NUMBER. We empirically set the entity selection threshold to 0.3.
● Result of automatic entity list extension

Category     # of seeds  # of extended entities  # of total entities  Precision  Recall
CITY_NAME    20          123                     209                  65.04%     37.91%
MONTH        1           10                      12                   100%       83.33%
DAY_NUMBER   3           27                      34                   100%       79.41%
● Result of corpus labeling experiment

Category     Precision  Recall  F-measure
CITY_NAME    91.30      86.83   89.01
MONTH        98.98      87.24   92.74
DAY_NUMBER   92.00      82.03   86.73
Overall      93.24      85.53   89.22

where ref is a context pattern, tar is a raw utterance, n is the number of words in ref, m is the number of words in tar, t is the number of aligned entity labels, and e is the number of words extracted as entity candidates.