Using Random Forests Language Models in IBM RT-04 CTS
Peng Xu (1) and Lidia Mangu (2)
1. CLSP, The Johns Hopkins University
2. IBM T.J. Watson Research Center
March 24, 2005
n-gram Smoothing
Smoothing: take some probability mass from seen n-grams and redistribute it among unseen n-grams.
Over 10 different smoothing techniques have been proposed in the literature.
Interpolated Kneser-Ney: consistently the best performance [Chen & Goodman, 1998]
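For reference, a sketch of the interpolated Kneser-Ney estimate for trigrams (notation assumed here, not taken from the slides; D is a discount and \lambda(\cdot) the normalizing interpolation weight):

P_{KN}(w_0 \mid w_{-2}, w_{-1}) = \dfrac{\max\{C(w_{-2} w_{-1} w_0) - D,\, 0\}}{C(w_{-2} w_{-1})} + \lambda(w_{-2}, w_{-1})\, P_{KN}(w_0 \mid w_{-1}),

where the lower-order distribution is built from continuation counts rather than raw counts.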
More Data…
"There's no data like more data."
[Berger & Miller, 1998] Just-in-time language model.
[Zhu & Rosenfeld, 2001] Estimate n-gram counts from the web.
[Banko & Brill, 2001] Efforts should be directed toward data collection, instead of learning algorithms.
[Keller et al., 2002] n-gram counts from the web correlate reasonably well with BNC data.
[Bulyko et al., 2003] Web text sources are used for language modeling.
[RT-04] U. of Washington web data for language modeling.
More Data
More data as the solution to data sparseness?
The web has "everything": web data is noisy.
The web does NOT have everything: language models using web data still have a data sparseness problem.
[Zhu & Rosenfeld, 2001] In 24 random web news sentences, 46 out of 453 trigrams were not covered by AltaVista.
In-domain training data is not always easy to get.
Do better smoothing techniques matter when the training data is millions of words?
Outline
Motivation
Random Forests for Language Modeling
  Decision Tree Language Models
  Random Forests Language Models
Experiments
  Perplexity
  Speech Recognition: IBM RT-04 CTS
Limitations
Conclusions
Dealing With Sparseness in n-grams
Clustering: combine words into groups of words. All components still need to use smoothing. [Goodman, 2001]
Decision trees: cluster histories into equivalence classes. An appealing idea, but negative results were reported. [Potamianos & Jelinek, 1997]
Maximum entropy: use n-grams as features in an exponential model. There is almost no difference in performance from interpolated Kneser-Ney models. [Chen & Rosenfeld, 1999]
Neural networks: represent words with real-valued vectors. The models rely on interpolation with Kneser-Ney models in order to get superior performance. [Bengio, 1999]
Our Motivation
A better smoothing technique is desirable: better use of the available data is often important!
Improvements in smoothing should also help other means of dealing with the data sparseness problem.
Our Approach
Extend the appealing idea of history clustering from decision trees.
Overcome the problems in decision tree construction…
…by using Random Forests!
Decision Tree Language Models
Decision trees: equivalence classification of histories.
Each leaf is specified by the answers to a series of questions which lead from the root to that leaf.
Each leaf corresponds to a subset of the histories; thus the histories are partitioned (i.e., classified).
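In symbols (notation assumed here): if \Phi maps each history to the leaf it reaches, the decision-tree model estimates

P(w_0 \mid w_{-2}, w_{-1}) \approx P(w_0 \mid \Phi(w_{-2}, w_{-1})),

so all histories that fall into the same leaf share one conditional distribution.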
Construction of Decision Trees
Data driven: decision trees are constructed on the basis of training data.
The construction requires:
1. The set of possible questions
2. A criterion evaluating the desirability of questions
3. A construction stopping rule or post-pruning rule
Decision Tree Language Models: An Example
Example: trigrams (w-2, w-1, w0)
Questions about positions: "Is w-i ∈ S?" and "Is w-i ∈ Sc?"
There are two history positions for a trigram.
Each pair (S, Sc) defines a possible split of a node, and therefore of the training data; S and Sc are complements with respect to the training data.
A node gets less data than its ancestors.
(S, Sc) are obtained by an exchange algorithm (a sketch follows below).
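The slides do not spell out the exchange algorithm, so here is a minimal Python sketch of the usual greedy exchange procedure, assuming training-data log-likelihood as the split criterion and random initialization (both mentioned on later slides); the function names and data layout are illustrative, not the authors' code.

import math
import random
from collections import Counter, defaultdict

def node_loglik(counts_by_value, values):
    # Training-data log-likelihood of the child node formed by the histories
    # whose queried position takes a value in `values` (ML estimate, unsmoothed).
    word_counts = Counter()
    for v in values:
        word_counts.update(counts_by_value[v])
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    return sum(c * math.log(c / total) for c in word_counts.values())

def exchange_split(events, position, seed=0):
    # Greedy exchange: partition the values observed at `position` of the history
    # into (S, Sc) so as to locally maximize training-data log-likelihood.
    # `events` is a list of (history, predicted_word) pairs.
    rng = random.Random(seed)
    counts_by_value = defaultdict(Counter)
    for history, word in events:
        counts_by_value[history[position]][word] += 1
    values = set(counts_by_value)
    S = {v for v in values if rng.random() < 0.5}   # random initialization
    improved = True
    while improved:
        improved = False
        for v in sorted(values):
            S_new = S ^ {v}                         # tentatively move v to the other side
            old = node_loglik(counts_by_value, S) + node_loglik(counts_by_value, values - S)
            new = node_loglik(counts_by_value, S_new) + node_loglik(counts_by_value, values - S_new)
            if new > old:                           # keep the move only if it helps
                S, improved = S_new, True
    return S, values - S

# Toy usage with the training data of the next slide: aba, aca, bcb, bbb, ada.
events = [(('a', 'b'), 'a'), (('a', 'c'), 'a'), (('b', 'c'), 'b'),
          (('b', 'b'), 'b'), (('a', 'd'), 'a')]
print(exchange_split(events, position=0))           # e.g. ({'a'}, {'b'})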
Decision Tree Language Models: An Example
Training data: aba, aca, bcb, bbb, ada
Root node {ab, ac, bc, bb, ad}: a:3, b:2
Is the first word in {a}? → leaf {ab, ac, ad}: a:3, b:0
Is the first word in {b}? → leaf {bc, bb}: a:0, b:2
New event 'adb' in test: first word a, reaches the left leaf.
New event 'bdb' in test: first word b, reaches the right leaf.
New event 'cba' in test: Stuck! Its first word answers neither question.
Construction of Decision Trees: Our Approach
Grow a decision tree until maximum depth using training data
Questions are automatically obtained as a tree is constructed
Use training data likelihood to evaluate questions
Perform no smoothing during growing
Prune fully grown decision tree to maximize heldout data likelihood
Incorporate KN smoothing during pruning
Smoothing Decision Trees
Using ideas similar to interpolated Kneser-Ney smoothing (a sketch of the form is given below).
Note: Not all histories in one node are smoothed in the same way. Only the leaves are used as equivalence classes.
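A hedged sketch of the form this takes, assuming an interpolated Kneser-Ney analogue in which \Phi(h) is the leaf reached by history h, D a discount, \lambda(\cdot) the normalizing weight, and a lower-order KN model as the back-off distribution:

P_{DT}(w_0 \mid \Phi(w_{-2}, w_{-1})) = \dfrac{\max\{C(\Phi(w_{-2}, w_{-1}), w_0) - D,\, 0\}}{C(\Phi(w_{-2}, w_{-1}))} + \lambda(\Phi(w_{-2}, w_{-1}))\, P_{KN}(w_0 \mid w_{-1}).

The back-off term depends on the actual lower-order history w_{-1}, not on the leaf, which is why histories in the same leaf need not be smoothed identically.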
Problems with Decision Trees
Training data fragmentation: as the tree is developed, questions are selected on the basis of less and less data.
Optimality: the exchange algorithm is a greedy algorithm, and so is the tree-growing algorithm.
Overtraining and undertraining: deep trees fit the training data well but do not generalize well to new test data; shallow trees are not sufficiently refined.
Amelioration: Random Forests
Breiman applied the idea of random forests to relatively small problems. [Breiman, 2001]
Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.
Apply a test datum x to all K decision trees, producing classes y1, y2, …, yK.
Accept the plurality decision:
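In symbols (a standard formulation, notation assumed here):

\hat{y} = \arg\max_{y} \sum_{i=1}^{K} \mathbf{1}(y_i = y),

i.e., the class that receives the most votes among the K trees.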
Example of a Random Forest
[Figure: three decision trees T1, T2, T3; an example x is sent through each tree and classified by the plurality vote of the three outputs.]
Random Forests for Language Modeling
Two kinds of randomness:
  Selection of the positions to ask about (alternatives: position 1, position 2, or the better of the two)
  Random initialization of the exchange algorithm
100 decision trees: the ith tree estimates P_DT(i)(w0 | w-2, w-1).
The final estimate is the average over all trees:
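In symbols, with M = 100 trees:

P_{RF}(w_0 \mid w_{-2}, w_{-1}) = \dfrac{1}{M} \sum_{i=1}^{M} P_{DT(i)}(w_0 \mid w_{-2}, w_{-1}).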
Experiments
Perplexity (PPL):
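For a test set of N words, the standard definition is

\text{PPL} = \exp\!\Big(-\dfrac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid h_i)\Big),

where h_i is the history of word w_i.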
UPenn Treebank part of WSJ: about 1 million words for training and heldout (90%/10%), 82 thousand words for test
Normalized text
Experiments: Aggregating
Considerable improvement already with 10 trees!
Embedded Random Forests
Smoothing a decision tree:
Better smoothing: embedding!
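A hedged reading of the embedding idea, building on the smoothing sketch given earlier: when a decision tree is smoothed, each leaf distribution backs off to a lower-order Kneser-Ney model; in the embedded random forest, that lower-order model is itself replaced by a lower-order random forest, so the back-off term becomes P_{RF}^{(lower)}(w_0 \mid w_{-1}) rather than P_{KN}(w_0 \mid w_{-1}), applied recursively down the n-gram orders.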
Speech Recognition Experiments
Word Error Rate by lattice rescoring, IBM 2004 Conversational Telephony System for Rich Transcription
Fisher data: 22 million words
WEB data: 525 million words, using frequent Fisher n-grams as queries
Other data: Switchboard, Broadcast News, etc.
Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to 3.2 million unique n-grams
Test set: DEV04
Speech Recognition Experiments
Baseline: KN 4-gram
110 random DTs
Sampling data without replacement
Fisher+WEB: linear interpolation
Embedding in the Fisher RF, no embedding in the WEB RF
Word Error Rate (DEV04):
             Fisher 4-gram    Fisher+WEB 4-gram
KN           14.1%            13.7%
RF           13.5%            13.1%
p-value      <0.001           <0.001
Practical Limitations of the RF Approach
Memory: decision tree construction uses much more memory.
It is not easy to realize the performance gain when the training data is really large:
Because we have over 100 trees, the final model becomes too large to fit in memory.
Computing probabilities in parallel incurs extra cost in online computation.
Effective language model compression or pruning remains an open question.
Conclusions: Random Forests
New RF language modeling approach
More general LM: the RF generalizes the DT, which generalizes the n-gram
Randomized history clustering
Good generalization: better n-gram coverage, less biased toward the training data
Significant improvements in IBM RT-04 CTS on DEV04