
A Semi-supervised Method for Efficient Construction of Statistical Spoken Language Understanding Resources

Seokhwan Kim, Minwoo Jeong, and Gary Geunbae Lee
Pohang University of Science and Technology (POSTECH), South Korea

ABSTRACT

We present a semi-supervised framework for constructing spoken language understanding resources at very low cost. We generate context patterns from a few seed entities and a large amount of unlabeled utterances. Using these context patterns, we extract new entities from the unlabeled utterances. The extracted entities are appended to the seed entities, and by repeating these steps we obtain an extended entity list. Our method is based on an utterance alignment algorithm that is a variant of the biological sequence alignment algorithm. With this method, we obtain precise entity lists with high coverage, which helps reduce the cost of building resources for statistical spoken language understanding systems.

SCORING CANDIDATES

● The score of the alignment between a raw utterance and a context pattern

OVERALL PROCEDURE

1. Prepare a seed entity list E and an unlabeled corpus C.
2. Find utterances in the corpus C containing the lexical forms of entities in E, and replace the matched entity portions of those utterances with a label indicating the location of an entity. Add the partially labeled utterances to the context pattern set P.
3. Align each utterance in the corpus C with each context pattern in P, and extract as new entity candidates the parts of the utterance that are matched with the entity label in the context pattern.
4. Compute the score of the entity candidates extracted in step 3, and add only high-scoring candidates to E.
5. If no entities were added to E in step 4, terminate the process with the entity list E, the context pattern set P, and the partially labeled corpus C as results. Otherwise, return to step 2 and repeat the process.
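The loop in steps 1-5 can be sketched as follows. This is a deliberately simplified, runnable illustration: alignment is reduced to exact slot matching and the candidate score to the fraction of context patterns that extract a candidate, whereas the actual method uses the alignment and scoring described in the other sections. All function names are illustrative; only the 0.3 threshold comes from the experiments.

```python
# Simplified sketch of the bootstrapping loop (steps 1-5).
# Alignment is reduced to exact slot matching; score = fraction of
# patterns extracting the candidate. Names are illustrative.

THRESHOLD = 0.3   # entity selection threshold from the experiments
LABEL = "<ENTITY>"

def make_patterns(entities, corpus):
    # Step 2: replace entity mentions (not at utterance boundaries) with LABEL.
    patterns = set()
    for utt in corpus:
        words = utt.split()
        for i in range(1, len(words) - 1):
            if words[i] in entities:
                patterns.add(tuple(words[:i] + [LABEL] + words[i+1:]))
    return patterns

def extract_candidates(corpus, patterns):
    # Step 3 (simplified): an utterance matches a pattern if every word agrees
    # except the LABEL slot, whose word becomes an entity candidate.
    counts = {}
    for utt in corpus:
        words = utt.split()
        for pat in patterns:
            if len(pat) != len(words):
                continue
            if all(p == w or p == LABEL for p, w in zip(pat, words)):
                cand = words[pat.index(LABEL)]
                counts[cand] = counts.get(cand, 0) + 1
    return counts

def bootstrap(seeds, corpus):
    entities = set(seeds)                                  # step 1
    while True:
        patterns = make_patterns(entities, corpus)         # step 2
        counts = extract_candidates(corpus, patterns)      # step 3
        total = max(len(patterns), 1)
        new = {c for c, n in counts.items()                # step 4
               if n / total >= THRESHOLD and c not in entities}
        if not new:                                        # step 5
            return entities, patterns
        entities |= new

corpus = [
    "i want to fly to denver tomorrow",
    "i want to fly to boston tomorrow",
    "a flight to boston please",
    "a flight to seattle please",
]
entities, patterns = bootstrap({"denver"}, corpus)
# starting from the single seed "denver", the loop also picks up
# "boston" (iteration 1) and "seattle" (iteration 2)
```

Note how the second iteration helps: "boston" appears in a second context, which yields a new pattern that in turn extracts "seattle".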

MOTIVATION

Spoken Language Understanding (SLU) is the problem of extracting semantic meaning from recognized utterances and filling the correct values into a semantic frame structure. Most statistical SLU approaches require a sufficient amount of training data consisting of utterances labeled with their corresponding semantic concepts. Building such training data is one of the most important and most costly tasks in developing spoken dialog systems. We concentrate on a semi-supervised information extraction method for building resources for statistical SLU modules in spoken dialog systems.

EXTRACTING CONTEXT PATTERNS

To overcome the context sparseness problem of spoken utterances, we use not sub-phrases of an utterance but the full utterance itself as the context pattern for extracting named entities. First, we assume that each entry in the entity list is precise and belongs to exactly one category, whether it is a seed entity or an entity added at an intermediate stage of the overall procedure. For each entry in the entity list, we find the utterances containing it and make an utterance template by replacing the entity portion of the utterance with the defined entity label. In this replacement, we exclude entities located at the beginning or end of the utterance, because context patterns containing entities in such positions can cause ambiguity in determining entity boundaries in the later steps.
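Template creation for a single entity-list entry can be sketched as below, including multi-word entities and the boundary-exclusion rule just described. The function and label names are illustrative, not the authors' code.

```python
# Sketch of utterance-template creation with the boundary-exclusion rule.
# Names are illustrative.

def make_template(utterance, entity, label):
    """Replace `entity` in `utterance` with `label`, unless the entity
    sits at the very beginning or end of the utterance."""
    words = utterance.split()
    ent = entity.split()
    n = len(ent)
    for i in range(len(words) - n + 1):
        if words[i:i+n] == ent:
            # exclude matches at the utterance boundaries
            if i == 0 or i + n == len(words):
                continue
            return " ".join(words[:i] + [label] + words[i+n:])
    return None

make_template("i want to fly to new york on monday", "new york", "<CITY_NAME>")
# -> "i want to fly to <CITY_NAME> on monday"
make_template("new york to denver", "new york", "<CITY_NAME>")
# -> None (an entity at the beginning of the utterance is excluded)
```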

ALIGNMENT-BASED NAMED ENTITY RECOGNITION

We first align a raw utterance with a context pattern containing entity labels. Then, from the best alignment between them, we extract the parts of the raw utterance aligned to the entity labels in the context pattern as entity candidates belonging to the categories of the corresponding labels.

MATRIX COMPUTATION / TRACE BACK

The traceback step starts at the position with the maximum score among the first column and the first row. Then, the position following position [i, j] is determined by the following policies:

• If tar(i) and ref(j) are identical, the next position is [i + 1, j + 1].
• Otherwise, the next position is the position with the maximum score among [i + 1, j + 1] ~ [i + 1, n] and [i + 1, j + 1] ~ [m, j + 1].
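The poster's matrix-computation formula is not reproduced here, so the sketch below substitutes a standard Needleman-Wunsch global alignment with simple match/mismatch/gap scores, treating the entity label as matching any word; the utterance words aligned to the label during traceback become the entity candidate. The scoring constants and function names are illustrative assumptions, not the paper's exact variant.

```python
# Alignment-based candidate extraction, sketched with standard
# Needleman-Wunsch global alignment (the paper uses its own variant).
# The entity label is treated as a wildcard that matches any word.

LABEL = "<CITY_NAME>"
MATCH, MISMATCH, GAP = 2, -1, -1   # illustrative scoring constants

def sim(ref_word, tar_word):
    if ref_word == LABEL or ref_word == tar_word:
        return MATCH
    return MISMATCH

def extract_by_alignment(ref, tar):
    """Align context pattern `ref` with raw utterance `tar`; return the
    utterance words aligned to the entity label."""
    r, t = ref.split(), tar.split()
    n, m = len(r), len(t)
    # fill the DP matrix for global alignment
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i-1][0] + GAP
    for j in range(1, m + 1):
        F[0][j] = F[0][j-1] + GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i-1][j-1] + sim(r[i-1], t[j-1]),
                          F[i-1][j] + GAP,
                          F[i][j-1] + GAP)
    # traceback, collecting utterance words aligned to the label
    i, j, cand = n, m, []
    while i > 0 and j > 0:
        if F[i][j] == F[i-1][j-1] + sim(r[i-1], t[j-1]):
            if r[i-1] == LABEL:
                cand.append(t[j-1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i-1][j] + GAP:
            i -= 1
        else:
            # gap in ref: this simple sketch does not absorb extra
            # utterance words next to the label slot into the candidate
            j -= 1
    return " ".join(reversed(cand))

extract_by_alignment("i want to fly to <CITY_NAME> tomorrow",
                     "i want to fly to boston tomorrow")
# -> "boston"
```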

● The score of an entity candidate e_j extracted by a context pattern ref_i
● The final score of an entity candidate e_j

EXPERIMENTS

We evaluated our method on the CU-Communicator corpus, which consists of 13,983 utterances. We chose the three most frequently occurring semantic categories in the corpus: CITY_NAME, MONTH, and DAY_NUMBER. We empirically set the entity selection threshold to 0.3.

● Result of automatic entity list extension

Category     # of seeds   # of extended entities   # of total entities   Precision   Recall
CITY_NAME    20           123                      209                   65.04%      37.91%
MONTH        1            10                       12                    100%        83.33%
DAY_NUMBER   3            27                       34                    100%        79.41%

● Result of corpus labeling experiment

Category     Precision   Recall   F-measure
CITY_NAME    91.30       86.83    89.01
MONTH        98.98       87.24    92.74
DAY_NUMBER   92.00       82.03    86.73
Overall      93.24       85.53    89.22

In the scoring formulas, ref is a context pattern, tar is a raw utterance, n is the number of words in ref, m is the number of words in tar, t is the number of aligned entity labels, and e is the number of words extracted as entity candidates.
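The F-measure column is the standard harmonic mean of precision and recall; the per-category rows can be sanity-checked directly from the table values:

```python
# Check that F-measure = harmonic mean of precision and recall
# for the corpus labeling results.

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(91.30, 86.83), 2))  # CITY_NAME  -> 89.01
print(round(f_measure(98.98, 87.24), 2))  # MONTH      -> 92.74
print(round(f_measure(92.00, 82.03), 2))  # DAY_NUMBER -> 86.73
```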