
Word Syllabification with Linear-Chain Conditional Random Fields

CSE 250B, Winter 2013, Project 2

Clifford Champion
Computer Science and Engineering
UC San Diego, CA 92037
[email protected]

Malathi Raghavan
Computer Science and Engineering
UC San Diego, CA 92037
[email protected]

Abstract

In this paper we consider the application of machine learning to the orthographic syllabification of English words. We detail our choice of linear-chain conditional random fields for this task and our choice of specific feature function classes. For numerical solutions we employ two gradient-based optimization methods, stochastic gradient ascent and the closely related Collins Perceptron, and explore parameter values for regularization and learning rate. We apply our methods to a labeled data set of English words and report our challenges and findings.

1 Introduction

Automatic orthographic syllabification is an important and interesting problem. It is important because of its direct use in column-boundary hyphenation for print communications. Hyphenation used in this way, when combined with other print techniques such as justification and anti-“river” algorithms, greatly improves readability, thus enabling information to be consumed more efficiently.

It is an interesting problem because, from the outside, applying machine learning to it successfully appears non-trivial yet possibly tractable. It is non-trivial because, as with most natural languages, English word formation is neither perfectly regular nor constant over time, and the same is true of the rules for choosing the most appropriate place to insert a hyphen. Moreover, either you inspect the word letter by letter, or you rely on higher-level knowledge such as phonetics, etymology, or other aspects of natural language, and none of these is trivial in scope.

In this paper we limit our consideration to letter-based learning. The space of possible words, given their lengths and the number of possible characters, is quite large. For a very rough sense of the number of possible, sensical (present or future) English words, consider the number of six-letter words formed from 26 possible letters without repetition, which amounts to $\frac{26!}{(26-6)!}$, or about 165.7 million possible words, and of course not all words are precisely six letters long. One key point here is that we cannot train on future words. It is therefore important that the knowledge acquired by the learning algorithm from the limited number of current words available for training be “portable”, so that it is suitable for unseen and future words.
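The six-letter count quoted above can be checked with a couple of lines of Python. This is only a sanity check of the arithmetic; the variable names are our own.

```python
import math

# Number of ordered six-letter strings drawn from 26 letters without repetition,
# i.e. 26! / (26 - 6)!.
six_letter_words = math.perm(26, 6)
in_millions = six_letter_words / 1e6
print(six_letter_words, in_millions)
```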

1.1 Observable Characteristics of Word Formation and Spelling

Nevertheless, tractability seems possible for a number of reasons. Firstly, many English words are actually compounds of Latin, Greek, and Germanic roots and affixes, thus the entropy of the true space of words is much smaller. Secondly and similarly, many English words are merely inflections of a base word (e.g. “walk” versus “walked”), and these inflections often form with regularity in spelling and pronunciation. Thirdly, given the nature of vowels and consonants, there are certain vowel pairs and consonant pairs that are unlikely to be divided by syllabification. Fourthly and finally, there is a degree of correlation (to the untrained eye) between word formation and both spelling and syllabification. For example, consider “nonplussed” with Latin roots “non” and “plus(s)”, and English inflectional suffix “ed”, which has a standard syllabification of non-plussed. The last two points above give us the most cause for hope in our purely letter-based approach for finding accurate and tractable automatic syllabification.

1.2 Choosing the Right Model

Because most words are formed from smaller units (one or more roots, prefixes, suffixes), and because each unit tends not to affect the others in the word, there is a locality of influence among contiguous letter groups. For instance, consider in-sti-tute and in-sti-tu-tion. The hyphenation for the letter group “institu” is unchanged between these two words, which is to say, the change of “-te” to “-tion” had no effect on the hyphenation of the first half of the word.

Thus, because of strong isolation between letter groups, we believe a linear-chain graphical model is appropriate here. Further, we can only expect to train a model correctly if our training examples are fully labeled, thus we also need a conditional probability model. The two obvious choices are a directed hidden Markov model (HMM) or an undirected random field model. The HMM probably has more complexity than is necessary, and other work has already shown strong results using an undirected model [1]. Thus we choose a linear-chain conditional random field (CRF) with the assumption that for our goals it is both necessary and sufficient.

1.3 Applying Linear-Chain CRFs to Letter Tagging

Conceptually the inputs and outputs of a fully-trained linear-chain CRF are simple. For an input word $\bar{x}$ of length $n$, our output should be a tag sequence $\bar{y}$ also of length $n$, representing the most appropriate syllabification for $\bar{x}$. Note that our tag sequence $\bar{y}$ is also called our label of $\bar{x}$. Each tag position in $\bar{y}$ corresponds naturally to a position in $\bar{x}$. Tag encoding mechanisms are described in further detail later in this paper.

As we will see later in more detail, querying our linear-chain CRF for the most likely $\bar{y}$ for a given $\bar{x}$ will mean constructing $\bar{y}$ tag by tag in sequence. The likelihood of a certain next tag in the sequence should depend at most on its neighboring tags (this is the linear-chain restriction) and on $\bar{x}$ in some way; otherwise the model is not linear-chain. Thus learning the best linear-chain CRF means learning the best predictors for a “next tag” given a “previous tag” (and $\bar{x}$). Beyond the linear-chain restriction on the structure of predicting $\bar{y}$, we introduce a further restriction on how wide our search net is within the elements of input $\bar{x}$. Again by assuming locality of influence between letters and tags, we assume the “next tag” in the construction of $\bar{y}$ is only influenced by the “previous letter” in $\bar{x}$ and zero or more following letters (we designate the total number of considered elements $c$, short for “context size”). We explore different values (and combinations) of $c$ later in our experiments, which amounts to using n-grams of different sizes (c-grams) to help choose the next tag.
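To make the notion of a c-gram aligned with the “previous letter” concrete, here is a minimal Python sketch. The function names `cgram` and `cgrams` and the 0-based string indexing are assumptions of this illustration, not part of the original implementation.

```python
def cgram(word: str, i: int, c: int) -> str:
    """Return the c letters of `word` starting at the "previous letter" x_{i-1}.

    Positions index the Python string directly (0-based), which is an
    assumption of this sketch. The slice is shorter than c when it runs
    past the end of the word.
    """
    if i < 1 or c < 1:
        raise ValueError("need i >= 1 and c >= 1")
    return word[i - 1 : i - 1 + c]


def cgrams(word: str, c: int) -> list[str]:
    """All full-length c-grams of `word`, in left-to-right order."""
    return [cgram(word, i, c) for i in range(1, len(word) - c + 2)]


print(cgrams("golf", 2))
print(cgrams("golf", 3))
```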

1.4 As a Linear Model

General linear models are simple and effective [2], and a log-linear model in particular is appropriate and intuitive for representing our linear-chain CRF. There are two important and equivalent ways of looking at the log-linear model representation here: through the lens of linear-chain CRFs, and as a simple conditional likelihood.

Through the lens of linear-chain CRFs, our model is composed of $J$ feature functions $f_j(y_{i-1}, y_i, \bar{x}, i)$ that quantify relationships between letters and tags; $J$ weights, stored as a vector $\bar{w}$, that quantify the relative importance of the different feature functions; and finally a score function $g_i(y_{i-1}, y_i)$ that is simply the following weighted sum for letter position $i$.

$$g_i(y_{i-1}, y_i \mid \bar{x}; \bar{w}) = \sum_{j=1}^{J} w_j f_j(y_{i-1}, y_i, \bar{x}, i)$$

Intuitively, this linear equation gives us a “score” of how likely a tag value ($y_i$) at letter position $i$ is to follow the previous tag value ($y_{i-1}$) at position $i-1$. It should be noted that the above notation for $f_j$ is actually more general than what our c-gram based approach needs (as discussed in section 1.3). A more precise form would be $f_j(y_{i-1}, y_i, x_{i-1}, \ldots, x_{i-1+c(j)-1})$, to remind us that we are only considering c-grams taken from $\bar{x}$ and aligned with position $i-1$ ($c$ is a function of $j$ in this form since we may use different size c-grams for different subsets of our feature function set). We will continue to use the first form given, for consistency with existing literature.
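The weighted sum $g_i$ can be sketched in a few lines of Python. The helper `make_indicator`, the toy weights, and the wrapped word `^abase$` are illustrative assumptions; the sketch only shows how indicator-style feature functions combine into a score.

```python
from typing import Callable, Sequence

# A feature function takes (y_prev, y_cur, x, i) and returns a number.
FeatureFn = Callable[[str, str, str, int], float]


def make_indicator(gram: str, tags: str) -> FeatureFn:
    """Indicator feature: the c-gram starting at x[i-1] equals `gram`
    and the tag pair (y_{i-1}, y_i) equals `tags`."""
    def f(y_prev: str, y_cur: str, x: str, i: int) -> float:
        return float(x[i - 1 : i - 1 + len(gram)] == gram and y_prev + y_cur == tags)
    return f


def g(i: int, y_prev: str, y_cur: str, x: str,
      weights: Sequence[float], features: Sequence[FeatureFn]) -> float:
    """g_i(y_{i-1}, y_i) = sum_j w_j * f_j(y_{i-1}, y_i, x, i)."""
    return sum(w * f(y_prev, y_cur, x, i) for w, f in zip(weights, features))


# Toy setup (hypothetical weights): two 2-gram features over the word "^abase$".
x = "^abase$"
features = [make_indicator("ab", "BB"), make_indicator("as", "II")]
weights = [1.5, -0.5]
score_bb = g(2, "B", "B", x, weights, features)  # "ab" with tags B, B fires feature 0
score_bi = g(2, "B", "I", x, weights, features)  # no feature fires
```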


In contrast, we can view our problem in terms of simple conditional likelihood, where the general linear model outputs the entire label (tag sequence) rather than individual tags one position at a time. The resulting equation is very similar.

$$p(\bar{y} \mid \bar{x}; \bar{w}) = \frac{h(\bar{y} \mid \bar{x}; \bar{w})}{\sum_{\bar{y}' \in Y} h(\bar{y}' \mid \bar{x}; \bar{w})}$$

where

$$h(\bar{y} \mid \bar{x}; \bar{w}) = \exp\left( \sum_{j=1}^{J} w_j F_j(\bar{x}, \bar{y}) \right)$$

$$F_j(\bar{x}, \bar{y}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \bar{x}, i)$$

This is basically the familiar form for multi-class logistic regression. Here $p$ is a proper probability distribution because of its denominator, also denoted $Z(\bar{x}, \bar{w})$, which ensures that $p$ sums to 1 over all possible $\bar{y}$. However, if we consider only the expression inside the exponent of $h$, we see that $p$ and $h$ are very similar to $g_i$, the only difference being that the former measure the likelihood of pairs of entire $\bar{x}$ and $\bar{y}$, while the latter measures the (unnormalized) likelihood of pairs of aligned letter and tag groups of $\bar{x}$ and $\bar{y}$. This relationship between $h$ and $g_i$ is captured explicitly by the equation for $F_j$, given in terms of $f_j$ summed over every position $i$.
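For very short words, the conditional likelihood can be evaluated by brute force, which makes the role of the normalizer explicit. The sketch below is illustrative only: the two-tag alphabet, the start tag standing in for $y_0$, and the single toy feature are assumptions, and enumerating every label is feasible only for tiny $n$.

```python
import itertools
import math

TAGS = "OX"  # illustrative two-tag alphabet
START = "^"  # assumed start tag standing in for y_0


def score(x, y, weights, features):
    """sum_j w_j * F_j(x, y), where F_j sums f_j over every position i."""
    padded = START + y
    return sum(
        w * sum(f(padded[i - 1], padded[i], x, i) for i in range(1, len(padded)))
        for w, f in zip(weights, features)
    )


def prob(x, y, weights, features):
    """p(y | x; w) by explicit enumeration of all |TAGS|^n labels."""
    def h(label):
        return math.exp(score(x, label, weights, features))
    z = sum(h("".join(c)) for c in itertools.product(TAGS, repeat=len(y)))
    return h(y) / z


# Toy feature (hypothetical): fires whenever the tag pair is "XO".
features = [lambda y_prev, y_cur, x, i: float(y_prev + y_cur == "XO")]
weights = [1.0]
total = sum(
    prob("abc", "".join(c), weights, features)
    for c in itertools.product(TAGS, repeat=3)
)
```

Summing `prob` over all eight labels of a three-letter word gives 1 up to floating-point error, which is exactly what the denominator guarantees.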

1.5 Learning and the Log Conditional Likelihood

As with most learning problems, we start by formalizing the learning problem in terms of an objective function that must be optimized. Here our goal is to maximize the log conditional likelihood (LCL) of our model over the training sample set we wish to learn from. For a training sample of size $T$ our objective function is the following.

$$LCL = \sum_{t=1}^{T} \log p(\bar{y}_t \mid \bar{x}_t; \bar{w})$$

Here $p(\bar{y} \mid \bar{x}; \bar{w})$ is as given above, and $\bar{x}_t$ and $\bar{y}_t$ are the $t$-th human-provided word and syllabification tagging from the training sample set.

Our free parameters are the weights and the inclusion or exclusion of feature functions. All else being equal, learning weights is a continuous optimization problem and thus we can use gradient methods. Learning the inclusion or exclusion of feature functions is, strictly speaking, not continuous, thus the feature function set must be decided upon by a human being. However, we can still in some sense find an optimal set of feature functions by simply including many feature functions very liberally, and letting our optimization over parameter $\bar{w}$ effectively exclude feature functions which are inconsequential. As we will see in section 3, we can in fact filter our feature function set significantly even before we begin optimizing over $\bar{w}$.

Formally this leaves us with $\bar{w}$ as our only free parameter. We seek an optimal value $\hat{w}$ defined as follows.

$$\hat{w} = \arg\max_{\bar{w}} LCL$$

As we will see in section 2, there are ways to locate this optimum efficiently by exploiting the structure and assumptions of linear-chain CRFs.

2 Design and Analysis of Algorithms

We utilize two different approaches to solving the log conditional likelihood (LCL) optimization. Both are a form of gradient following, however each estimates the gradient in a very different way. As with gradient methods in general, the goal is to find parameter values such that the gradient of the objective function is zero. Because there is no missing data, the objective function is concave (equivalently, the negative LCL is convex) and any local optimum will in fact be the global optimum [3].

2.1 Stochastic Gradient Ascent

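The gradient of the log conditional likelihood as defined in section 1 can be shown to be as follows.

$$\frac{\partial}{\partial w_j} LCL = \sum_{t=1}^{T} \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[ F_j(\bar{x}_t, \bar{y}') \right] \right)$$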

In standard gradient ascent our update rule would be $w_j := w_j + \lambda \left( \frac{\partial}{\partial w_j} LCL \right)$ for some learning rate $\lambda$. For stochastic gradient ascent (SGA), however, we drop the sum over $T$ and instead update each $w_j$ after evaluating the summand expression for just one randomly drawn example, and repeat this process until convergence. Proof of the convergence of gradient ascent is beyond the scope of this paper. We can assume the randomization has taken place once before starting SGA, and so we will continue to use subscript $t$ to refer to individual training examples. Our SGA update rule for a component $w_j$ of $\bar{w}$ then becomes the following.

$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[ F_j(\bar{x}_t, \bar{y}') \right] \right)$$

That is, for each training example $(\bar{x}_t, \bar{y}_t)$ we update all values $w_j$ by computing the contribution of the training example to the total gradient, scaled by the learning rate $\lambda$.

Note that computing the value of the feature function $F_j$ is constant time, but computing the expectation $E[F_j]$ is not. In order to compute the expectation quickly, we rely on the so-called forward and backward vectors, and on $g_i$. It can be shown [2] that the expectation can be rewritten as follows.

$$E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[ F_j(\bar{x}_t, \bar{y}') \right] = \sum_{i=1}^{n} \sum_{y_{i-1}} \sum_{y_i} f_j(y_{i-1}, y_i, \bar{x}, i) \, \frac{\alpha(i-1, y_{i-1}) \, \exp(g_i(y_{i-1}, y_i)) \, \beta(y_i, i)}{Z(\bar{x}, \bar{w})}$$

In the above equation, $\alpha$ and $\beta$ are look-ups in the forward and backward matrices, and the two innermost sums are over all possible tag values for elements in $\bar{y}$. The forward and backward vectors are well-documented in other sources [2] and so we will not go into detail here other than to remind the reader that they capture unnormalized, marginal probabilities over tag sequences ending (or beginning) with specific tag values, and that precomputing $\alpha$ and $\beta$ takes $O(nm^2)$ time for each new $\bar{x}$ or $\bar{w}$, where $m$ is the number of possible tag values.
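The forward and backward recursions can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used for the experiments: the score table `G`, the convention that the start tag shares the index space of the ordinary tags, and the toy values are all assumptions. A brute-force sum over every tag sequence is included as a reference for the normalizer.

```python
import itertools
import math


def forward_backward(G, start):
    """G[k][u][v] holds the score g_{k+1}(u, v); `start` is the start tag index.

    Returns (alpha, beta, Z). alpha[k][v] is the summed unnormalized weight of
    all tag prefixes of length k ending in v, and beta[k][u] is the same for
    suffixes that follow tag u at position k. Cost is O(n m^2).
    """
    n, m = len(G), len(G[0])
    alpha = [[0.0] * m for _ in range(n + 1)]
    beta = [[0.0] * m for _ in range(n + 1)]
    alpha[0][start] = 1.0
    beta[n] = [1.0] * m
    for k in range(1, n + 1):
        for v in range(m):
            alpha[k][v] = sum(alpha[k - 1][u] * math.exp(G[k - 1][u][v]) for u in range(m))
    for k in range(n - 1, -1, -1):
        for u in range(m):
            beta[k][u] = sum(math.exp(G[k][u][v]) * beta[k + 1][v] for v in range(m))
    return alpha, beta, beta[0][start]


def brute_force_Z(G, start):
    """Reference value: sum over all m^n tag sequences (tiny inputs only)."""
    n, m = len(G), len(G[0])
    total = 0.0
    for y in itertools.product(range(m), repeat=n):
        prev, s = start, 0.0
        for k, v in enumerate(y):
            s += G[k][prev][v]
            prev = v
        total += math.exp(s)
    return total


# Arbitrary toy scores for n = 3 positions and m = 2 tags (illustrative only).
G = [[[0.1, -0.2], [0.3, 0.0]], [[0.5, 0.2], [-0.4, 0.1]], [[0.0, 0.7], [0.2, -0.1]]]
alpha, beta, Z = forward_backward(G, start=0)
```

Once `beta` is filled in, the normalizer is a single look-up, which matches the constant-time claim for $Z$.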

The three summations take $O(nm^2)$ time, times the cost of the inner expression, which is constant after $\alpha$, $\beta$, $Z(\bar{x}, \bar{w})$, and $g_i$ have been computed. $Z$ can be computed in constant time from $\beta$. Computing $g_i$ for all tag value pairs takes $O(Jnm^2)$ time but can be reused for all values of $j$ for that update. Thus computing the expectation $E[F_j]$ for a single training example $(\bar{x}_t, \bar{y}_t)$, for all $J$ weight updates in one iteration of SGA, is simply $O(Jnm^2 + Jnm^2)$ or just $O(Jnm^2)$.

As we will discuss in section 3, we are able to substantially reduce the size of 𝐽 used by

SGA without affecting the accuracy of our training.

2.2 Collins Perceptron

A close cousin of stochastic gradient ascent is the Collins Perceptron. The basic argument against (stochastic) gradient ascent is that computing the expectation is expensive and unnecessary. Collins Perceptron proposes a reasonable approximation to the true expectation by assuming $E[F_j] \approx F_j(\bar{x}, \hat{y})$ where $\hat{y} = \arg\max_{\bar{y}'} p(\bar{y}' \mid \bar{x}; \bar{w})$. Proof of convergence is beyond the scope of this paper.

The update step for one element of $\bar{w}$ using Collins Perceptron instead of SGA is below.

$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - F_j(\bar{x}_t, \hat{y}) \right)$$

Thus, in place of computing the expectation $E[F_j]$ we compute an argmax to find $\hat{y}$. Solving the argmax is essentially the inference problem for linear-chain CRFs, and is computed efficiently using the Viterbi algorithm. We will not go into much detail of the Viterbi algorithm [2] except to remind the reader of the recursive equation for finding $\hat{y}$.

$$\hat{y}_{k-1} = \arg\max_{u} \left[ U(k-1, u) + g_k(u, \hat{y}_k) \right]$$

Function $U$ above provides the score of the best path through tag values at each position. It is computable in $O(m^2)$ time per position and depends directly on $g_i$, which is computable in $O(Jnm^2)$ time as described earlier but only depends on $\bar{x}$ and $\bar{w}$. Thus, computing the optimal $\hat{y}$ takes $O(Jnm^2 + nm^2)$ time.
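A compact Python sketch of the Viterbi recursion and its backtracking step is given below. As before, the score table `G`, the start-tag convention, and the toy values are assumptions of this illustration; the brute-force maximum is included only to check the result on a tiny input.

```python
import itertools


def viterbi(G, start):
    """Return the highest-scoring tag sequence for scores G[k][u][v] = g_{k+1}(u, v)."""
    n, m = len(G), len(G[0])
    neg_inf = float("-inf")
    # U[k][v]: best score of any tag prefix of length k that ends in tag v.
    U = [[neg_inf] * m for _ in range(n + 1)]
    U[0][start] = 0.0
    for k in range(1, n + 1):
        for v in range(m):
            U[k][v] = max(U[k - 1][u] + G[k - 1][u][v] for u in range(m))
    # Backtrack: y_hat_{k-1} = argmax_u [U(k-1, u) + g_k(u, y_hat_k)].
    y = [max(range(m), key=lambda v: U[n][v])]
    for k in range(n, 1, -1):
        y.append(max(range(m), key=lambda u: U[k - 1][u] + G[k - 1][u][y[-1]]))
    return y[::-1]


def path_score(G, y, start):
    """Total score of tag sequence y, i.e. the sum of g_k over all positions."""
    prev, s = start, 0.0
    for k, v in enumerate(y):
        s += G[k][prev][v]
        prev = v
    return s


# Arbitrary toy scores for n = 3 positions and m = 2 tags (illustrative only).
G = [[[0.1, -0.2], [0.3, 0.0]], [[0.5, 0.2], [-0.4, 0.1]], [[0.0, 0.7], [0.2, -0.1]]]
y_hat = viterbi(G, start=0)
best = max(path_score(G, y, 0) for y in itertools.product(range(2), repeat=3))
```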

The value $\hat{y}$ does not change while iterating through $j$ in the update step, thus computing $\hat{y}$ is a one-time cost per iteration. All other computations for a single $w_j$ are constant, thus the complexity of one iteration of Collins Perceptron is $O(J + Jnm^2)$ or simply $O(Jnm^2)$. As with SGA, we can greatly reduce the size of $J$ for greater efficiency in practice.


2.3 Sparsity of Feature Function Output

As alluded to above and described in more detail in section 3, each feature function is based on inspecting two sequential tags $y_{i-1}$ and $y_i$, and a c-gram from $\bar{x}$. In fact each feature function $f_j$ will be defined as a conjunction of indicator functions, based on the presence (or absence) of specific character sequences, such as “gol” for a 3-gram. The space of feature functions therefore is quite large. However, for any given word $\bar{x}$, most feature functions will return 0, simply because each word cannot contain more than a handful of specific c-grams. Further, for any given training set of words, some c-grams will not appear at all (such as “qqq”).

Thus we introduce two optimizations that both effectively reduce $J$. The first is to remove any feature functions that depend on c-grams not seen in the training set. Such feature functions would end up with a $w_j$ value of 0, thus simply removing them from consideration has no effect on correctness.

The second optimization we perform is to associate with each example $\bar{x}_t$ a set of feature function indices corresponding to those feature functions which depend on c-grams contained in $\bar{x}_t$. For example, if $\bar{x}_t$ is “golf” and $c$ is strictly 2, then the feature function indices for $\bar{x}_t$ should involve only those feature functions depending on “go”, “ol”, and “lf”. Our setup is described in more detail in section 3, and includes some other nuances not specific to the algorithm analysis here.

Because all of the above algorithms depend on computing $g_i$, and $g_i$ depends on both $\bar{x}$ and each $f_j$, the above two optimizations have a significant effect on the overall computing time.
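The two optimizations can be sketched in Python as follows. The function names, the dictionary-based index, and the choice to create one feature per (c-gram, tag pair) for every c-gram seen in training are assumptions of this illustration; the original Java code may organize the index differently.

```python
def build_feature_index(words, tag_pairs, sizes):
    """Assign an index j to every (c-gram, tag pair) whose c-gram occurs in `words`.

    Features over c-grams never seen in training are simply never created.
    """
    index = {}
    for word in words:
        for c in sizes:
            for s in range(len(word) - c + 1):
                for pair in tag_pairs:
                    index.setdefault((word[s : s + c], pair), len(index))
    return index


def active_features(word, index, tag_pairs, sizes):
    """Sorted indices of the feature functions that can be nonzero for `word`."""
    return sorted(
        {
            index[(word[s : s + c], pair)]
            for c in sizes
            for s in range(len(word) - c + 1)
            for pair in tag_pairs
            if (word[s : s + c], pair) in index
        }
    )


# Toy training set with the OX tag pairs and 2-grams only.
TAG_PAIRS = ["OO", "OX", "XO", "XX"]
index = build_feature_index(["golf", "gold"], TAG_PAIRS, sizes=[2])
golf_active = active_features("golf", index, TAG_PAIRS, sizes=[2])
```

Here the training set yields four distinct 2-grams (“go”, “ol”, “lf”, “ld”), and the list for “golf” covers only the features built on “go”, “ol”, and “lf”, so a score computation for that word can skip everything else.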

3 Design of Experiments

Our sample set comes from a modified Celex dataset of hyphenated English words, as prepared and used by Trogkanis and Elkan [1] for their paper on the same problem. The dataset contains about 66,000 words; it excludes proper nouns and words with numbers, punctuation, or accent markings, and excludes multiple alternative hyphenations per single word, if any.

3.1 Label Space and Word Preprocessing

For our experiment we tested two different tag spaces for representing the given hyphenated words.

Our first scheme, “BIO”, has three possible tags for any letter of the input word. ‘B’ denotes the beginning of a syllable, ‘I’ denotes an intermediate letter in the syllable, and ‘O’ denotes the end of a syllable. As a convention, single-letter syllables are encoded as “B”, and 2-letter syllables as “BO”. For instance, the hyphenated word “a-base” will be encoded as “BBIIO”.

The second encoding scheme we use is our “OX” scheme, inspired by prior work [1], which has two possible tags for any letter. ‘X’ denotes a letter at the end of a syllable, while ‘O’ denotes any other letter. As a convention, single-letter syllables are encoded as ‘O’. For instance, the hyphenated word “a-base” will be encoded as “XOOOO”.
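Both encodings can be sketched in a few lines of Python. The function names are our own, and the OX rule used here is an assumption inferred from the “a-base” example: ‘X’ is placed on the letter immediately before each hyphen, and every other letter (including the final letter of the word) is tagged ‘O’. Input is assumed to be well formed, with no empty syllables.

```python
def encode_bio(hyphenated: str) -> str:
    """BIO tags: 'B' begins a syllable, 'O' ends it, 'I' marks letters in between.

    Single-letter syllables become "B" and two-letter syllables become "BO".
    """
    tags = []
    for syllable in hyphenated.split("-"):
        if len(syllable) == 1:
            tags.append("B")
        else:
            tags.append("B" + "I" * (len(syllable) - 2) + "O")
    return "".join(tags)


def encode_ox(hyphenated: str) -> str:
    """OX tags, assuming 'X' marks a letter immediately followed by a hyphen."""
    syllables = hyphenated.split("-")
    tags = []
    for k, syllable in enumerate(syllables):
        last = "X" if k < len(syllables) - 1 else "O"
        tags.append("O" * (len(syllable) - 1) + last)
    return "".join(tags)


print(encode_bio("a-base"), encode_ox("a-base"))
```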

Further, we wrap each word and each label with special starting and ending tokens, ‘^’ and ‘$’ respectively (borrowing a convention from regular expressions), to make it more convenient to compute our feature functions at word boundaries, and because we also want to be able to learn any patterns that are related to the start or end of a word.

The final step of our preprocessing is to map all characters into a contiguous space of 8-bit integer values. For our BIO experiments the mapping is given below.

{ ^, $, B, I, O } ⇒ { 0, 1, 2, 3, 4 }

{ ^, $, A, B, C, …, Z } ⇒ { 0, 1, 2, 3, 4, …, 27 }

3.2 Feature Function Design

As described earlier, we consider feature functions that depend on c-gram instances within $\bar{x}$. We devised templates for 2-grams, 3-grams, and 4-grams, and can invoke these templates for every possible c-gram permutation of letters and special symbols.

One example of a low-level feature function we used matches every occurrence of the 3-letter sequence “ase” whose corresponding tags are “BI*”, where * can be any tag.

$$f(\bar{x}, i, y_{i-1}, y_i) = I(x_{i-1} x_i x_{i+1} = \text{'ase'}) \cdot I(y_{i-1} y_i = \text{'BI'})$$

For a BIO scheme using 2-grams, 3-grams and 4-grams, the total number of feature functions possible using this template is $9 \times 26^2 + 9 \times 26^3 + 9 \times 26^4 = 4{,}277{,}052$. We notice that many of these feature functions will never appear in the input set and would keep a zero weight anyway. Therefore, instead of enumerating all the possible feature function instantiations, we parse the input set and define in memory only those feature functions that appear at least once. Performing this step reduces our feature function set size to 47,599 in the OX scheme and 53,590 in the BIO scheme. We provide our reduction of $J$ in greater detail (per n-gram choices) below.
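The size of the full BIO template space, and the reductions implied by the reported set sizes, can be reproduced with a short calculation. The variable names are our own. The 9 is the number of ordered pairs of BIO tags, and the OX total with 4 tag pairs is our own extrapolation of the same template, not a figure from the experiments.

```python
BIO_TAG_PAIRS = 3 * 3  # ordered pairs (y_{i-1}, y_i) over the tags B, I, O
OX_TAG_PAIRS = 2 * 2   # assumed analogue for the two-tag OX scheme

bio_total = sum(BIO_TAG_PAIRS * 26**c for c in (2, 3, 4))
ox_total = sum(OX_TAG_PAIRS * 26**c for c in (2, 3, 4))

# Fraction of the template space removed after keeping only observed features.
bio_reduction = 1 - 53590 / bio_total
ox_reduction = 1 - 47599 / ox_total
print(bio_total, round(100 * bio_reduction, 2), round(100 * ox_reduction, 2))
```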


3.3 Programming Environment and Code Optimizations

Our initial efforts were focused on leveraging the R programming language, as was used in our most recent paper. However, for the goals of this paper, R soon became unwieldy, primarily due to poorer development tools and the absence of static typing. Thus, halfway into our efforts we ported our existing code to Java, which immediately showed performance gains. Any remaining bottlenecks we were able to identify and optimize using the VisualVM profiler included in the JDK.

Owing to the large number of feature functions and input samples, initial performance (in Java) was worrisome. Over a much smaller subset totalling only 7,500 examples, our initial implementation of Collins Perceptron with 2-gram feature functions took 20 minutes for one epoch, while SGA took 2 minutes for just one example (also with 2-gram feature functions).

Through profiling we ultimately introduced a number of optimizations at key points in the code. These optimizations included temporary memoization of $g_i$, better multiplicative short-circuiting inside loops as soon as any zero term was detected, and the introduction of a feature function index list per example, as described in section 2.3, which greatly improved speed in general. Other minor optimizations included using 32-bit integers instead of 64-bit floating point numbers for Collins Perceptron where correctness is not affected, and rewriting code to avoid unnecessary copying when performing subsequence comparisons.

The above optimizations reduced our runtime for one epoch of Collins Perceptron down to 45 seconds, and one update of SGA down to 1-2 seconds.

Individual experiments were divided and executed over different machines in order to ensure we could include results for every permutation. These included two multi-core machines with 6 GB and 12 GB of RAM, and one cloud instance with a dedicated 1.6 GB of RAM and two cores. Casual observation never indicated our experiments utilizing more than 400 MB of allocated memory.

[Figure: Decrease in J after Preprocessing Optimization (vertical axis: % decrease)]

3.4 Regularization

For SGA we include a regularization factor to prevent overfitting and weight explosion. This modifies our SGA update rule slightly to include a $-2\mu w_j$ term [4].

$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[ F_j(\bar{x}_t, \bar{y}') \right] - 2\mu w_j \right)$$

The optimal choice of $\mu$ is determined via grid search.
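As a minimal sketch of the regularized update, the function below applies one SGA step to a whole weight vector. The function name, argument names, and toy numbers are assumptions of this illustration; the observed and expected feature values are taken as given inputs rather than computed here.

```python
def sga_update(w, F_obs, F_exp, lr, mu=0.0):
    """One SGA step: w_j += lr * (F_j(x_t, y_t) - E[F_j] - 2 * mu * w_j) for every j.

    With mu = 0 this reduces to the unregularized update rule.
    """
    return [wj + lr * (fo - fe - 2.0 * mu * wj) for wj, fo, fe in zip(w, F_obs, F_exp)]


# Toy values (hypothetical) for a two-feature model.
w_new = sga_update(w=[1.0, -2.0], F_obs=[2.0, 0.0], F_exp=[1.5, 0.5], lr=0.1, mu=0.25)
w_plain = sga_update(w=[1.0, -2.0], F_obs=[2.0, 0.0], F_exp=[1.5, 0.5], lr=0.1)
```

In the toy call the regularization term pulls both weights toward zero relative to the unregularized step, which is the intended effect of the $-2\mu w_j$ term.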

3.5 Hyper-parameter Search

We performed a randomized 70%/30% split of the sample set for use in training and validation respectively.

Within our training set (70% of the original data set), we perform a 7-fold rotating cross-validation for two separate grid searches [5] over the hyperparameters of SGA: learning rate $\lambda$ and regularization constant $\mu$.

After sufficiently expanding our search limits, our grid search candidates were as follows.

$\lambda$: $\{10^{-7}, 10^{-6}, \ldots, 10^{0}\}$

$\mu$: $\{2^{-7}, 2^{-6}, \ldots, 2^{-1}\}$

We first grid search over $\lambda$ using $\mu = 0.125$, and limit SGA to 2000 iterations per candidate value. The point of grid search is not to fully converge, but to get a quick estimate of the best $\lambda$ for convergence during the full set training. After the best $\lambda$ is selected, we use it during grid search over $\mu$ to find the best regularization constant. We again limit SGA to 2000 iterations per candidate value.

3.6 Full Training Stopping Conditions

For Collins Perceptron we determine convergence by comparing the trailing average of validation set accuracy between two consecutive epochs. The trailing average is computed by taking the average of the accuracy of the last three epochs. We stop the process when the trailing average starts decreasing, or after hitting a predetermined limit on the number of epochs. As a convenience in the code, instead of computing the accuracy as a percentage, we count the number of correct predictions on-the-fly while performing the perceptron update.
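The stopping rule can be sketched as a small helper that is called once per epoch with the accuracy history so far. The function names and the handling of the first few epochs (averaging over however many values exist) are assumptions of this illustration.

```python
def trailing_average(history, window=3):
    """Mean of the last `window` values (fewer if the history is shorter)."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def should_stop(history, max_epochs, window=3):
    """Stop when the trailing average decreases or the epoch limit is reached."""
    if len(history) >= max_epochs:
        return True
    if len(history) < 2:
        return False
    return trailing_average(history, window) < trailing_average(history[:-1], window)


# Accuracy per epoch (hypothetical values); the last epoch drags the average down.
still_improving = should_stop([0.80, 0.85, 0.88], max_epochs=40)
started_dropping = should_stop([0.90, 0.91, 0.92, 0.80], max_epochs=40)
```

Since only comparisons matter, the history may hold raw counts of correct predictions instead of percentages, as long as every epoch is evaluated on the same number of samples.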


For SGA we simply halt on a hard limit on number of epochs only. As we wanted to

explore a breadth of experiment permutations (different tag schemes, feature function

schemes) within our time limits, we opted for a hard limit of 1 epoch for SGA and 40

epochs for Collins Perceptron.

In both cases, we would ideally use a much higher hard limit on epoch count to allow

convergence to happen on its own, within some measure (e.g. trailing average as above,

or by a threshold 𝑤𝑗 difference in magnitude).

4 Results of Experiments

4.1 Grid Search Results

The results of grid search for learning rate and regularization constant are shown below, based on the 2-gram, OX scheme experiment configuration.

The optimal value of learning rate we found is 0.01.

Algorithm: Collins Perceptron

    while prev_trailing_avg < cur_trailing_avg and epoch_number < max_epochs do
        set num_correct = 0
        foreach sample (x, y) do
            set ŷ = Viterbi(x, weights)
            if y ≠ ŷ then
                foreach weight w_j do
                    w_j := w_j + λ (F_j(x, y) − F_j(x, ŷ))
                end foreach
            else
                set num_correct = num_correct + 1
            end if
        end foreach
        set prev_trailing_avg = cur_trailing_avg
        compute cur_trailing_avg over the last 3 epochs
        set epoch_number = epoch_number + 1
    end while

Algorithm: Stochastic Gradient Ascent

    for t = 1 to T do
        for j = 1 to J do
            w_j := w_j + λ (F_j(x_t, y_t) − E[F_j(x_t, y′)]), with y′ drawn from p(y′ | x_t; w)
        end for
    end for


Grid search for the regularization constant indicated an optimal value of 0.25.

4.2 Full Training Results

For final training over our training set (70% of the entire data set), we performed and compared 16 different experiment permutations, spanning both training methods, both tagging schemes, and various choices of $c$ (and combinations thereof) for the c-grams in our feature functions. We estimate the success of each permutation by measuring per-letter accuracy over the remaining 30% of the data set. Our letter-level accuracy results are shown below.

[Figure: Learning Rate vs Accuracy, for learning rates 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001]

[Figure: Regularization Constant vs Accuracy]

We find that the overall accuracy of the SGA solver is lower than that of Collins Perceptron. This could possibly be due to our hard stopping condition for SGA, which did not give it sufficient time to converge. Within SGA, we see that the accuracy for the BIO tag set is lower than that of the OX tag set. One reason for this could be that we used the same hyperparameters for both tag sets. With Collins Perceptron, by contrast, BIO tag encoding performed about as well as OX, and in some cases slightly better.

Because of the faster running time of Collins Perceptron, we were also able to compare 4-grams with the combination of 2-through-4-grams. We found that using 4-gram feature functions increases accuracy compared to 2-grams and 3-grams, but the combination of 2-gram, 3-gram, and 4-gram feature functions does not cause any major change in accuracy over simply using 4-grams only.

Tag set    4-gram    2, 3 and 4-gram
OX         96.77%    96.34%
BIO        95.96%    96.32%

Computed using Collins Perceptron only.

5 Findings and Lessons Learned

We were very pleased with the speed and efficiency of Collins Perceptron. We spent less

total processing time per experiment than with SGA, yet received better final accuracy

than SGA.

The nature of the code optimizations employed opened up a new way of thinking about machine learning algorithms in general for us. Where before $\bar{x}_t$ would have been thought of as simply an input on a conveyor belt, we now look at each example as pairable with meta information (e.g. the use of a per-example feature function index list), solely for the purpose of using $\bar{x}_t$ in the most efficient way possible during learning. This perspective shift will continue to influence how we implement our object-oriented or functional approaches to machine learning going forward.

[Figure: Letter-level Accuracy of Collins Perceptron and SGA for 2-gram OX, 2-gram BIO, 3-gram OX, 3-gram BIO, 2 and 3-gram OX, and 2 and 3-gram BIO]

For Collins Perceptron we find that using c-gram feature functions for a single value of $c$ gives better results than a combination of multiple c-gram sizes. For instance, the accuracy for 3-gram OX is 93.48% whereas for 2 and 3-gram OX it is 92.83%. One explanation for this might be that when 2-gram and 3-gram feature functions are both available, the weight for a particular 2-gram feature function is diluted over two weights, one for the 2-gram and another for related 3-grams.

We underestimated the importance of performing grid search for each individual experiment permutation, which we feel is the most likely explanation for why SGA performed very competitively for the 2-gram OX experiment (on which we executed grid search), while performing comparatively poorly for all other experiment permutations. In the future we intend to be much more thorough in our use of grid search.

References

[1] N. Trogkanis and C. Elkan, "Conditional Random Fields for Word Hyphenation," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010.

[2] C. Elkan, "Log-linear models and conditional random fields," 2013. [Online]. Available: http://cseweb.ucsd.edu/~elkan/250B/loglinearCRFs.pdf. [Accessed February 2013].

[3] C. Sutton and A. McCallum, "An Introduction to Conditional Random Fields," 2010.

[4] C. Elkan, "Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training," 17 January 2013. [Online]. Available: http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf. [Accessed February 2013].

[5] C.-W. Hsu, C.-C. Chang and C.-J. Lin, "A Practical Guide to Support Vector Classification," 15 April 2010. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. [Accessed January 2013].