Word Syllabification with Linear-Chain Conditional Random Fields
CSE 250B, Winter 2013, Project 2
Clifford Champion Computer Science and Engineering
UC San Diego, CA 92037
Malathi Raghavan Computer Science and Engineering
UC San Diego, CA 92037
Abstract
In this paper we consider the application of machine learning to
orthographic syllabification of English words. We detail our choice of
linear-chain conditional random fields for this task, and our choice of specific feature function classes. For numerical solutions we employ two
gradient-based optimization solvers, stochastic gradient ascent and
the closely related Collins Perceptron, and explore parameter values for regularization and learning rate. We apply our methods to a labeled data
set of English words and report our challenges and findings.
1 Introduction
Automatic orthographic syllabification is an important and interesting problem. It is
important because of its direct use in column-boundary hyphenation for print communications.
Hyphenation used in this way, when combined with other print techniques such
as justification and anti-“river” algorithms, improves readability considerably, enabling
information to be consumed more efficiently.
It is an interesting problem because from the outside it appears non-trivial, yet
possibly tractable, to apply machine learning successfully. It is non-trivial
because, as with most natural languages, English word formation is neither perfectly
regular nor constant over time, and the same holds for the rules for choosing the
most appropriate place to insert a hyphen. Moreover, either one inspects the word
letter by letter, or one relies on higher-level knowledge such as phonetics,
etymology, or other aspects of natural language, none of which is trivial in scope.
In this paper we limit our consideration to letter-based learning. The number of possible words
is quite large, given the range of word lengths and the number of possible characters. For a very rough sense of the number of possible,
sensical (present or future) English words, consider the number of six-letter words formed
from 26 possible letters without repetition, which amounts to $\frac{26!}{(26-6)!}$, or about 165.7 million
possible words, and of course not all words are precisely six letters long. One key point
here is that we cannot train on future words. It is therefore important that the knowledge
acquired from the limited number of current words available for training has some “portability”,
so that it remains suitable for unseen and future
words.
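As a quick check of the count quoted above, the number of ordered six-letter strings without repeated letters can be computed directly. This is only an arithmetic sketch; the variable name is ours.

```python
import math

# Ordered selections of 6 distinct letters out of 26: 26! / (26 - 6)!
six_letter_count = math.perm(26, 6)

# The same value written out with factorials, as in the text.
assert six_letter_count == math.factorial(26) // math.factorial(20)

print(f"{six_letter_count:,}")
```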
1.1 Observable Characteristics of Word Formation and Spelling
Nevertheless, tractability seems possible for a number of reasons. Firstly, many English
words are actually compounds of Latin, Greek, and Germanic roots and affixes, thus the
entropy of the true space of words is much smaller. Secondly and similarly, many Eng-
lish words are merely inflections of a base word (e.g. “walk” versus “walked”), and these
inflections often form with regularity in spelling and pronunciation. Thirdly, given the
nature of vowels and consonants, there are certain vowel pairs and consonant pairs that
are unlikely to be divided by syllabification. Fourthly and finally, there is a degree of
correlation (to the untrained eye) between word formation and both spelling and syllabi-
fication. For example, consider “nonplussed” with Latin roots “non” and “plus(s)”, and
English inflectional suffix “ed”, which has a standard syllabification of non - plussed.
The last two points above give us the most cause for hope in our purely letter-based ap-
proach for finding accurate and tractable automatic syllabification.
1.2 Choosing the Right Model
Because most words are formed from smaller units (one or more roots, prefixes, suffixes),
and because each unit tends not to affect the others of the word, there is a locality of
influence among contiguous letter groups. For instance consider in - sti - tute and in - sti
- tu - tion. The hyphenation for letter group “institu” is unchanged between these two
words, which is to say, the change of “-te” to “-tion” had no effect on the hyphenation for
the first half of the words.
Thus, because of the strong isolation between letter groups, we believe a linear-chain graphical model is
appropriate here. Further, we can only expect to train a model correctly if our
training examples are fully labeled, so we also need a conditional probability model. The
two obvious choices are a directed hidden Markov model (HMM), or an undirected random field model.
The HMM probably has more complexity than is necessary, and other work has
already shown strong results using an undirected model [1]. Thus we choose a linear-chain
conditional random field (CRF) with the assumption that for our goals it is both
necessary and sufficient.
1.3 Applying Linear-Chain CRFs to Letter Tagging
Conceptually the inputs and outputs to a fully-trained linear-chain CRF are simple. For
an input word $\bar{x}$ of length $n$, our output should be a tag sequence $\bar{y}$ also of length $n$,
representing the most appropriate syllabification for $\bar{x}$. Note that our tag sequence $\bar{y}$ is
also called our label of $\bar{x}$. Each tag position in $\bar{y}$ corresponds naturally to a position in $\bar{x}$.
Tag encoding mechanisms are described in further detail later in this paper.
As we will see later in more detail, querying our linear-chain CRF for the most likely $\bar{y}$
for a given $\bar{x}$ will mean constructing $\bar{y}$ tag by tag in sequence. The likelihood of a certain
next tag in the sequence should depend at most on its neighboring tags (this is the linear-chain
restriction) and on $\bar{x}$ in some way, otherwise it is not linear-chain. Thus
learning the best linear-chain CRF means learning the best predictors for a “next tag”
given a “previous tag” (and $\bar{x}$). Beyond the linear-chain restriction on the structure of
predicting $\bar{y}$, we introduce a further restriction on how wide our search net is within the
elements of input $\bar{x}$. Again by assuming locality of influence between letters and tags, we
assume the “next tag” in the construction of $\bar{y}$ is only influenced by the “previous letter”
in $\bar{x}$, and zero or more following letters (we designate the total number of considered
elements $c$, short for “context size”). We explore different values (and combinations) of $c$
later in our experiments, which amounts to using different sized n-grams (c-grams) to
help choose the next tag.
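The context window described above can be sketched as a small helper. This is an illustration under our own assumptions: positions are 0-based, the word is already wrapped in the boundary symbols introduced in section 3.1, and the function name `context_cgram` is hypothetical.

```python
from typing import Optional

def context_cgram(word: str, i: int, c: int) -> Optional[str]:
    """Return the c letters starting at the 'previous letter' x_{i-1},
    i.e. the window consulted when choosing the tag at position i.
    Returns None when the window would run past either end of the word."""
    start = i - 1
    if start < 0 or start + c > len(word):
        return None
    return word[start:start + c]

# With the wrapped word "^golf$", the tag for 'o' (position 2) sees "gol" when c = 3.
print(context_cgram("^golf$", 2, 3))
```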
1.4 As a Linear Model
General linear models are simple and effective [2], and a log-linear model in particular is
appropriate and intuitive for representing our linear-chain CRF. There are two im-
portant and equivalent ways of looking at the log-linear model representation here:
through the lens of linear-chain CRFs, and as a simple conditional likelihood.
Through the lens of linear-chain CRFs, our model is composed of $J$ feature functions
$f_j(y_{i-1}, y_i, \bar{x}, i)$ that quantify relationships between letters and tagging; $J$
weights stored as a vector $\bar{w}$ that quantify the relative importance of the different feature
functions; and finally a score function $g_i(y_{i-1}, y_i)$ that is simply the following weighted
sum for letter position $i$.
$$g_i(y_{i-1}, y_i \mid \bar{x}; \bar{w}) = \sum_{j=1}^{J} w_j f_j(y_{i-1}, y_i, \bar{x}, i)$$
Intuitively, this linear equation gives us a “score” of how likely a tag value ($y_i$) at letter
position $i$ is to follow the previous tag value ($y_{i-1}$) at position $i-1$. It should be noted
that the above notation for $f_j$ is actually more general than what our c-gram based approach
needs (as discussed in section 1.3). A more precise form would be
$f_j(y_{i-1}, y_i, x_{i-1}, \ldots, x_{i-1+c(j)-1})$ to remind us that we are only considering c-grams taken
from $\bar{x}$ and aligned with the position $i-1$ ($c$ is a function of $j$ in this form since we may
use different size c-grams for different subsets of our feature function set). We will continue
to use the first form given for consistency with existing literature.
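To make the weighted sum concrete, here is a minimal sketch of the score function for indicator features keyed by a c-gram and a tag pair. The dictionary layout, the function name `g`, and the example weights are assumptions for illustration, not the data structures used in our implementation.

```python
from typing import Dict, Tuple

# Hypothetical feature key: (c-gram, previous tag, current tag).
Feature = Tuple[str, str, str]

def g(weights: Dict[Feature, float], word: str, i: int,
      prev_tag: str, tag: str, sizes=(2, 3)) -> float:
    """Score of `tag` at position i following `prev_tag`: the sum of the
    weights of every indicator feature that fires at this position."""
    total = 0.0
    for c in sizes:
        cgram = word[i - 1:i - 1 + c]      # c-gram aligned with position i - 1
        if len(cgram) == c:                # skip windows that run off the word
            total += weights.get((cgram, prev_tag, tag), 0.0)
    return total

# Made-up weights, chosen only to show the arithmetic.
weights = {("as", "B", "I"): 1.5, ("ase", "B", "I"): 0.5, ("as", "B", "O"): -1.0}
print(g(weights, "^base$", 3, "B", "I"))   # both the 2-gram and 3-gram features fire
```

Because every feature is an indicator, the sum over all $J$ features collapses to a handful of dictionary look-ups per position.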
In contrast, we can view our problem in terms of simple conditional likelihood, where
the general linear model outputs the entire label (tag sequence) rather than individual
tags one position at a time. The resulting equation is very similar.
$$p(\bar{y} \mid \bar{x}; \bar{w}) = \frac{h(\bar{y} \mid \bar{x}; \bar{w})}{\sum_{\bar{y}' \in Y} h(\bar{y}' \mid \bar{x}; \bar{w})}$$
where
$$h(\bar{y} \mid \bar{x}; \bar{w}) = \exp\left(\sum_{j=1}^{J} w_j F_j(\bar{x}, \bar{y})\right)$$
$$F_j(\bar{x}, \bar{y}) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, \bar{x}, i)$$
This is basically the familiar form for multi-class logistic regression. Here $p$ is a proper
probability distribution due to its denominator, also denoted $Z(\bar{x}, \bar{w})$, which ensures
that the sum of $p$ over all possible $\bar{y}$ is 1. If we consider only the expression inside
the exponential of $h$, we see that $p$ and $h$ are very similar to $g_i$, the only difference being that
the former measure the likelihood of an entire $\bar{y}$ paired with $\bar{x}$, while the latter measures
the (unnormalized) likelihood of a pair of adjacent tags aligned with a letter group of $\bar{x}$. This relationship
between $h$ and $g_i$ is captured explicitly by the equation for $F_j$ given in terms of
$f_j$ summed over every position $i$.
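For a very short word the normalization can be checked by brute force, enumerating every tag sequence. The sketch below assumes the OX tag set, 2-gram features only, a wrapped word, and made-up weights; it uses the identity that the sum over $j$ of $w_j F_j$ equals the sum over positions of $g_i$.

```python
import itertools
import math

TAGS = ("O", "X")

def g(weights, word, i, prev_tag, tag):
    """Score g_i: weight of the 2-gram feature active at position i (0 if unseen)."""
    return weights.get((word[i - 1:i + 1], prev_tag, tag), 0.0)

def log_h(weights, word, tags):
    """log h(y|x;w) = sum_i g_i(y_{i-1}, y_i); `word` is wrapped as '^...$'."""
    total, prev = 0.0, "^"
    for i in range(1, len(word) - 1):          # letter positions only
        total += g(weights, word, i, prev, tags[i - 1])
        prev = tags[i - 1]
    return total

def probability(weights, word, tags):
    """p(y|x;w): h for this tagging divided by the sum of h over all taggings."""
    n = len(word) - 2
    z = sum(math.exp(log_h(weights, word, y))
            for y in itertools.product(TAGS, repeat=n))
    return math.exp(log_h(weights, word, tags)) / z

weights = {("^a", "^", "X"): 1.0, ("ab", "X", "O"): 0.5}   # made-up weights
dist = {"".join(y): probability(weights, "^ab$", y)
        for y in itertools.product(TAGS, repeat=2)}
print(dist)
```

Enumeration costs $m^n$ evaluations, which is why the forward and backward vectors of section 2.1 are needed for real words.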
1.5 Learning and the Log Conditional Likelihood
As with most learning problems, we start by formalizing the learning problem in terms
of an objective function that must be optimized. Here our goal is to maximize the log conditional likelihood (LCL) of our model on the training sample set we wish
to learn from. For a training sample of size $T$ our objective function is the following.
$$LCL = \sum_{t=1}^{T} \log p(\bar{y}_t \mid \bar{x}_t; \bar{w})$$
Here $p(\bar{y} \mid \bar{x}; \bar{w})$ is as given above, and $\bar{x}_t$ and $\bar{y}_t$ are the $t$-th human-provided word
and syllabification tagging from the training sample set.
Our free parameters are the weights and inclusion/exclusion of feature functions. All else
being equal, learning weights is a continuous optimization problem and thus we can use
gradient methods. Learning the inclusion/exclusion of feature functions strictly speaking
is not continuous, thus the feature function set must be decided upon by a human being.
However, we can still in some sense find an optimal set of feature functions by simply
including many feature functions very liberally, and letting our optimization over parameter
$\bar{w}$ effectively exclude feature functions which are inconsequential. As we will see
in section 3, we can in fact filter our feature function set significantly even before we
begin optimizing over $\bar{w}$.
Formally this leaves us with $\bar{w}$ as our only free parameter. We seek an optimal value $\hat{w}$
defined as follows.
$$\hat{w} = \arg\max_{\bar{w}} LCL$$
As we will see in section 2, there are ways to locate this optimum efficiently, by exploit-
ing the structure and assumptions of linear-chain CRFs.
2 Design and Analysis of Algorithms
We utilize two different approaches to solving the log conditional likelihood (LCL) optimization.
Both are a form of gradient following; however, each estimates the gradient in
a very different way. As with gradient methods in general, the goal is to find parameter
values such that the gradient of the objective function is zero. Because there is no missing
data, the objective function is concave (its negative is convex), and any local optimum will in fact be the
global optimum [3].
2.1 Stochastic Gradient Ascent
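The gradient of the log conditional likelihood as defined in section 1 can be shown to be
as follows.
$$\frac{\partial}{\partial w_j} LCL = \sum_{t=1}^{T} \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[F_j(\bar{x}_t, \bar{y}')\right] \right)$$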
In standard gradient ascent our update rule would be $w_j := w_j + \lambda \frac{\partial}{\partial w_j} LCL$ for some
learning rate $\lambda$. For stochastic gradient ascent (SGA), however, we drop the sum over the $T$ examples
and instead update each $w_j$ after evaluating the summand for just one
randomly drawn example, and repeat this process until convergence. Proof of the
convergence of gradient ascent is beyond the scope of this paper. We can assume the
randomization has taken place once before starting SGA, and so we will continue to use
subscript $t$ to refer to individual training examples. Our SGA update rule for a component
$w_j$ of $\bar{w}$ then becomes the following.
$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[F_j(\bar{x}_t, \bar{y}')\right] \right)$$
That is, for each training example $(\bar{x}_t, \bar{y}_t)$ we update all values $w_j$ by computing the contribution
of the training example to the total gradient, scaled by the learning rate $\lambda$.
Note that computing the value of the feature function $F_j$ is constant time, but computing
the expectation $E[F_j]$ is not. In order to compute the expectation quickly, we rely on
the so-called forward and backward vectors, and $g_i$. It can be shown [2] that the expectation
can be rewritten as follows.
$$E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[F_j(\bar{x}_t, \bar{y}')\right] = \sum_{i=1}^{n} \sum_{y_{i-1}} \sum_{y_i} f_j(y_{i-1}, y_i, \bar{x}, i) \, \frac{\alpha(i-1, y_{i-1}) \exp(g_i(y_{i-1}, y_i)) \, \beta(y_i, i)}{Z(\bar{x}, \bar{w})}$$
In the above equation, $\alpha$ and $\beta$ are look-ups in the forward and backward matrices, and
the two innermost sums are over all possible tag values for elements in $\bar{y}$. The forward
and backward vectors are well-documented in other sources [2] and so we will not go
into detail here other than to remind the reader that they capture unnormalized marginal
probabilities over tag sequences ending (or beginning) with specific tag values, and
that precomputing $\alpha$ and $\beta$ takes $O(nm^2)$ time for each new $\bar{x}$ or $\bar{w}$, where $m$ is the number of possible tag values.
The three summations take $O(nm^2)$ time, times the cost of the inner expression,
which is constant after $\alpha$, $\beta$, $Z(\bar{x}, \bar{w})$, and $g_i$ have been computed. $Z$ can be computed in
constant time from $\beta$. Computing $g_i$ for all tag value pairs takes $O(Jnm^2)$ time, but the result can
be reused for all values of $j$ in that update. Thus computing the expectation $E[F_j]$ for a
single training example $(\bar{x}_t, \bar{y}_t)$, for all $J$ weight updates in one iteration of SGA, is
$O(Jnm^2 + Jnm^2)$ or just $O(Jnm^2)$.
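The forward and backward recursions and the resulting expectation can be sketched as follows. This is an illustration, not our Java implementation. It assumes tags are integers 0..m-1, that the scores are precomputed in a nested list `G` with `G[i][u][v]` the score of tag `v` at position `i` after tag `u`, and that position 0 follows a fixed start state (so only row `G[0][0]` is used). For brevity the feature `f` is evaluated on transitions between letter positions only. A brute-force enumeration is included purely as a check.

```python
import itertools
import math

def forward_backward(G, m):
    """Unnormalized forward (alpha) and backward (beta) tables and Z, in O(n m^2)."""
    n = len(G)
    alpha = [[0.0] * m for _ in range(n)]
    beta = [[1.0] * m for _ in range(n)]
    for v in range(m):
        alpha[0][v] = math.exp(G[0][0][v])
    for i in range(1, n):
        for v in range(m):
            alpha[i][v] = sum(alpha[i - 1][u] * math.exp(G[i][u][v]) for u in range(m))
    for i in range(n - 2, -1, -1):
        for u in range(m):
            beta[i][u] = sum(math.exp(G[i + 1][u][v]) * beta[i + 1][v] for v in range(m))
    Z = sum(alpha[n - 1])
    return alpha, beta, Z

def expected_count(G, m, f):
    """E[F] for F(y) = sum_{i>=1} f(i, y_{i-1}, y_i), using pairwise marginals."""
    alpha, beta, Z = forward_backward(G, m)
    total = 0.0
    for i in range(1, len(G)):
        for u in range(m):
            for v in range(m):
                marginal = alpha[i - 1][u] * math.exp(G[i][u][v]) * beta[i][v] / Z
                total += f(i, u, v) * marginal
    return total

def brute_force_expected_count(G, m, f):
    """Same quantity by enumerating all m**n tag sequences (for checking only)."""
    n = len(G)
    Z = total = 0.0
    for y in itertools.product(range(m), repeat=n):
        score = G[0][0][y[0]] + sum(G[i][y[i - 1]][y[i]] for i in range(1, n))
        h = math.exp(score)
        Z += h
        total += h * sum(f(i, y[i - 1], y[i]) for i in range(1, n))
    return total / Z

# Made-up scores for a 3-letter word with m = 2 tags.
G = [[[0.1, -0.2], [0.0, 0.0]],
     [[0.3, 0.0], [-0.5, 0.7]],
     [[0.2, -0.1], [0.4, 0.0]]]
f = lambda i, u, v: 1.0 if (u, v) == (1, 0) else 0.0   # fires on tag pair (1, 0)
print(expected_count(G, 2, f), brute_force_expected_count(G, 2, f))
```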
As we will discuss in section 3, we are able to substantially reduce the size of 𝐽 used by
SGA without affecting the accuracy of our training.
2.2 Collins Perceptron
A close cousin to stochastic gradient ascent is the Collins Perceptron. The basic argument
against (stochastic) gradient ascent is that computing the expectation is expensive and
unnecessary. The Collins Perceptron replaces the true expectation with a reasonable approximation,
assuming $E[F_j] \approx F_j(\bar{x}, \hat{y})$ where $\hat{y} = \arg\max_{\bar{y}'} p(\bar{y}' \mid \bar{x}; \bar{w})$. Proof of convergence
is beyond the scope of this paper.
The update step for one element of $\bar{w}$ using the Collins Perceptron instead of SGA is below.
$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - F_j(\bar{x}_t, \hat{y}) \right)$$
Thus, in place of computing the expectation $E[F_j]$ we compute an argmax to find $\hat{y}$.
Solving the argmax is essentially the inference problem for linear-chain CRFs, and is
computed efficiently using the Viterbi algorithm. We will not go into much detail of
the Viterbi algorithm [2] except to remind the reader of the recursive equation for finding
$\hat{y}$.
$$\hat{y}_{k-1} = \arg\max_{u} \left[ U(k-1, u) + g_k(u, \hat{y}_k) \right]$$
Function $U$ above provides the score of the best path through tag values up to each position.
It is computable in $O(m^2)$ time per position, and depends directly on $g_i$, which is computable
in $O(Jnm^2)$ time as described earlier and depends only on $\bar{x}$ and $\bar{w}$. Thus computing the
optimal $\hat{y}$ takes $O(Jnm^2 + nm^2)$ time.
The value $\hat{y}$ does not change while iterating through $j$ in the update step, so computing
$\hat{y}$ is a one-time cost per iteration. All other computations for a single $w_j$ are
constant, thus the complexity of one iteration of Collins Perceptron is $O(J + Jnm^2)$ or
simply $O(Jnm^2)$. As with SGA, we can greatly reduce the size of $J$ for greater efficiency
in practice.
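A compact sketch of the Viterbi recursion follows. As before, it assumes integer tags 0..m-1 and precomputed scores `G[i][u][v]`, with position 0 following a fixed start state (only `G[0][0]` is used). These conventions and the example scores are ours, chosen for illustration.

```python
def viterbi(G, m):
    """Return the highest-scoring tag sequence for score tables G, in O(n m^2)."""
    n = len(G)
    U = [[0.0] * m for _ in range(n)]    # U[k][v]: best score of a path ending in v at k
    back = [[0] * m for _ in range(n)]   # back[k][v]: best previous tag for that path
    for v in range(m):
        U[0][v] = G[0][0][v]
    for k in range(1, n):
        for v in range(m):
            best_u = max(range(m), key=lambda u: U[k - 1][u] + G[k][u][v])
            back[k][v] = best_u
            U[k][v] = U[k - 1][best_u] + G[k][best_u][v]
    # Recover the path from the last position backwards.
    y = [0] * n
    y[n - 1] = max(range(m), key=lambda v: U[n - 1][v])
    for k in range(n - 1, 0, -1):
        y[k - 1] = back[k][y[k]]
    return y

# Made-up scores for a 3-letter word with m = 2 tags.
G = [[[0.1, -0.2], [0.0, 0.0]],
     [[0.3, 0.0], [-0.5, 0.7]],
     [[0.2, -0.1], [0.4, 0.0]]]
print(viterbi(G, 2))
```

Note that the locally best first tag (tag 0, score 0.1) is not part of the globally best path, which is why the backward pass over `back` is needed.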
2.3 Sparsity of Feature Function Output
As alluded to earlier and described in more detail in section 3, each feature function is
based on inspecting two sequential tags $y_{i-1}$ and $y_i$, and a c-gram from $\bar{x}$. In fact each
feature function $f_j$ will be defined as a conjunction of indicator functions, based on the
presence (or absence) of a specific character sequence, such as “gol” for a 3-gram. The space
of feature functions therefore is quite large. However, for any given word $\bar{x}$, most feature
functions will return 0, simply because each word cannot contain more than a handful of
specific c-grams. Further, for any given training set of words, some c-grams will not appear
at all (such as “qqq”).
Thus we introduce two optimizations that both effectively reduce 𝐽. The first is to re-
move any feature functions that depend on c-grams not seen in the training set. Such
feature functions would end up with 𝑤𝑗 value of 0, thus simply removing them from
consideration has no effect on correctness.
The second optimization we perform is to associate with each example $\bar{x}_t$ a set of feature
function indices corresponding to those feature functions which depend on c-grams contained
in $\bar{x}_t$. For example, if $\bar{x}_t$ is “golf” and $c$ is strictly 2 then the feature function
indices for $\bar{x}_t$ should involve only those feature functions depending on “go”, “ol”, and “lf”.
Our setup is described in more detail in section 3, and includes some other nuance not
specific to the algorithm analysis here.
Because all of the above algorithms depend on computing $g_i$, and $g_i$ depends on both $\bar{x}$
and each $f_j$, the above two optimizations have a significant effect on the overall computing
time.
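Both optimizations can be sketched with a single pass over the training words. The helper names below are hypothetical, and the index is kept at the level of c-grams, so each id stands for the group of feature functions (one per tag pair) that depend on that c-gram.

```python
def cgrams(word, sizes):
    """All distinct c-grams of the given sizes contained in `word`."""
    return {word[s:s + c] for c in sizes for s in range(len(word) - c + 1)}

def build_index(words, sizes=(2,)):
    """Assign an integer id to every c-gram seen in training (optimization 1)
    and record, per word, the ids of the c-grams it contains (optimization 2)."""
    ids = {}
    per_word = []
    for w in words:
        grams = sorted(cgrams(w, sizes))
        per_word.append([ids.setdefault(gram, len(ids)) for gram in grams])
    return ids, per_word

ids, per_word = build_index(["golf", "gold"])
print(ids)        # only c-grams that occur in training receive an id
print(per_word)   # per-example index lists consulted when computing g_i
```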
3 Design of Experiments
Our sample set comes from a modified CELEX dataset of hyphenated English words, as
prepared and used by Trogkanis and Elkan [1] for their paper on the same problem. The
dataset contains about 66,000 words. It excludes proper nouns and words with numbers, punctuation,
or accent markings, and it excludes alternative hyphenations, so that each word has a single
hyphenation.
3.1 Label Space and Word Preprocessing
For our experiment we tested two different tag spaces for representing the given hy-
phenated words.
Our first scheme, “BIO”, has three possible tags for any letter of the input word. ‘B’ de-
notes the beginning of a syllable, ‘I’ denotes an intermediate letter in the syllable, and
‘O’ denotes the end of a syllable. As a convention, single letter syllables are encoded as
“B”, and 2-letter syllables as “BO”. For instance, the hyphenated word “a-base” will be
encoded as “BBIIO”.
The second encoding scheme we use is our “OX” scheme inspired by prior work [1],
which has two possible tags for any letter. ‘X’ denotes a letter at the end of a syllable,
while ‘O’ denotes any other letter. As a convention, single letter syllables are encoded as
‘O’. For instance, the hyphenated word “a-base” will be encoded as “XOOOO”.
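The two encodings can be sketched as small functions that reproduce the worked examples above. The BIO encoder follows the stated conventions directly. The OX encoder follows the “a-base” example, marking with ‘X’ every letter that is immediately followed by a hyphen; that reading is our assumption for this sketch, and the function names are hypothetical.

```python
def encode_bio(hyphenated: str) -> str:
    """B = first letter of a syllable, O = last letter, I = letters in between.
    One-letter syllables become 'B' and two-letter syllables 'BO'."""
    tags = []
    for syllable in hyphenated.split("-"):
        if len(syllable) == 1:
            tags.append("B")
        else:
            tags.append("B" + "I" * (len(syllable) - 2) + "O")
    return "".join(tags)

def encode_ox(hyphenated: str) -> str:
    """X = letter immediately followed by a hyphen, O = every other letter
    (assumption inferred from the 'a-base' example)."""
    syllables = hyphenated.split("-")
    tags = []
    for k, syllable in enumerate(syllables):
        last = "X" if k < len(syllables) - 1 else "O"
        tags.append("O" * (len(syllable) - 1) + last)
    return "".join(tags)

print(encode_bio("a-base"), encode_ox("a-base"))
```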
Further, we wrap each word and each label with special starting and ending tokens, ‘^’
and ‘$’ respectively (borrowing a convention from regular expressions). This makes it more
convenient to compute our feature functions at word boundaries, and lets us
learn any patterns that are related to the start or end of a word.
The final step of our preprocessing is to map all characters into a contiguous space of 8-bit
integer values. For our BIO experiments the mappings for tags and for letters are given below.
{ ^, $, B, I, O } ⇒ { 0, 1, 2, 3, 4 }
{ ^, $, A, B, C, …, Z } ⇒ { 0, 1, 2, 3, 4, …, 27 }
3.2 Feature Function Design
As described earlier, we consider feature functions that depend on c-gram instances
within $\bar{x}$. We devised templates for 2-grams, 3-grams, and 4-grams, and can instantiate these
templates for every possible c-gram of letters and special symbols.
One example of a low-level feature function we used matches every occurrence of the 3-letter
sequence “ase” whose corresponding tags are “BI*”, where * can be any tag.
$$f(\bar{x}, i, y_{i-1}, y_i) = I(x_{i-1} x_i x_{i+1} = \text{'ase'}) \cdot I(y_{i-1} y_i = \text{'BI'})$$
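A feature-function template of this kind can be sketched as a closure that builds one indicator per (c-gram, tag pair). The factory name `make_feature` and the 0-based indexing into a wrapped word are assumptions for illustration.

```python
def make_feature(cgram: str, tag_pair: str):
    """Build f(x, i, y_prev, y_cur): 1 when the c-gram starting at x_{i-1}
    equals `cgram` and the two tags spell `tag_pair`, otherwise 0."""
    c = len(cgram)

    def f(x: str, i: int, y_prev: str, y_cur: str) -> int:
        return int(x[i - 1:i - 1 + c] == cgram and y_prev + y_cur == tag_pair)

    return f

f_ase_BI = make_feature("ase", "BI")

# In "^base$" the letters "ase" start at index 2, so the feature can fire at i = 3.
print(f_ase_BI("^base$", 3, "B", "I"))
```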
For a BIO scheme using 2-grams, 3-grams and 4-grams, the total number of feature
functions possible using this template is $9 \times 26^2 + 9 \times 26^3 + 9 \times 26^4 = 4{,}277{,}052$. We notice
that not all of these feature functions will appear in the input set, and those that do not will maintain
a zero weight anyway. Therefore, instead of enumerating all the possible feature function
instantiations, we parse the input set and define in memory only those feature functions
that appear at least once. Performing this step reduces our feature function set size to
47,599 in the OX scheme and 53,590 in the BIO scheme. We provide our reduction of $J$ in greater detail (per n-gram choices) below.
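The size of the full template space quoted above is straightforward to verify. The sketch counts one feature function per tag pair and per c-gram over 26 letters, as in the text; the OX figure is the same calculation with two tags and is our own arithmetic, not a number reported from the experiments.

```python
def template_count(num_tags: int, sizes=(2, 3, 4), alphabet: int = 26) -> int:
    """Number of (c-gram, tag pair) combinations for the given c-gram sizes."""
    tag_pairs = num_tags * num_tags
    return sum(tag_pairs * alphabet ** c for c in sizes)

bio_total = template_count(3)   # 9 tag pairs for B, I, O
ox_total = template_count(2)    # 4 tag pairs for O, X
print(f"{bio_total:,} {ox_total:,}")
```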
3.3 Programming Environment and Code Optimizations
Our initial efforts were focused on leveraging the R programming language, as was used
in our most recent paper. However, for the goals of this paper, R soon became unwieldy,
primarily due to poorer development tools and the absence of static typing. Thus, halfway
into our efforts we ported our existing code to Java, which immediately showed performance
gains. We were able to identify and optimize the remaining bottlenecks using the
VisualVM profiler included in the JDK.
Owing to the large number of feature functions and input samples, initial performance
(in Java) was worrisome. Over a much smaller subset totalling only 7500 examples,
our initial implementation of Collins Perceptron with 2-gram feature functions took 20 minutes for one
epoch, while SGA took 2 minutes for just one example (also with 2-gram feature functions).
Through profiling we ultimately introduced a number of optimizations at key points in
the code. These optimizations included temporary memoization of 𝑔𝑖, better multiplica-
tive short-circuiting inside loops as soon as any zero term was detected, and the
introduction of a feature function index list per example, as described in section 2.3,
which greatly improved speed in general. Other minor optimizations included using 32-
bit integers instead of 64-bit floating point numbers for Collins Perceptron where cor-
rectness is not affected, and rewriting code to avoid unnecessary copying when
performing subsequence comparisons.
The above optimizations reduced our runtime for one epoch of Collins Perceptron down
to 45 seconds, and one update of SGA down to 1-2 seconds.
Individual experiments were divided and executed over different machines so that we
could include results for every permutation. The machines included two multi-core machines
with 6 GB and 12 GB of RAM, and one cloud instance with a dedicated 1.6 GB of
RAM and two cores. Casual observation never indicated our experiments utilizing more
than 400 MB of allocated memory.
[Figure: Decrease in J after Preprocessing Optimization. Vertical axis: % decrease.]
3.4 Regularization
For SGA we include a regularization factor to prevent overfitting and weight explosion.
This modifies our SGA update rule slightly to include a $-2\mu w_j$ term [4].
$$w_j := w_j + \lambda \left( F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}\left[F_j(\bar{x}_t, \bar{y}')\right] - 2\mu w_j \right)$$
The optimal choice of 𝜇 is determined via grid search.
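The regularized update can be sketched as a single vectorized step. Here `observed` and `expected` stand for the per-feature values $F_j(\bar{x}_t, \bar{y}_t)$ and $E[F_j]$, assumed to be computed elsewhere; the function name and the example numbers are hypothetical.

```python
def sga_update(w, observed, expected, lam, mu):
    """One regularized SGA step for every weight:
    w_j := w_j + lam * (F_j(x, y) - E[F_j] - 2 * mu * w_j)."""
    return [wj + lam * (fj - ej - 2.0 * mu * wj)
            for wj, fj, ej in zip(w, observed, expected)]

# Made-up values: the first weight is pulled up by the data and down by the penalty.
print(sga_update([1.0, 0.0], [2.0, 1.0], [1.0, 1.0], lam=0.5, mu=0.25))
```

With `mu = 0` this reduces to the unregularized update of section 2.1.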
3.5 Hyper-parameter Search
We performed a randomized 70%/30% split of the sample set for use in training and
validation respectively.
Among our training set (70% of the original data set), we perform a 7-fold rotating
cross-validation for two separate grid searches [5] for the hyperparameters of SGA:
learning rate 𝜆 and regularization constant 𝜇.
After sufficiently expanding our search limits, our grid search candidates were as follows.
$\lambda$: $\{10^{-7}, 10^{-6}, \ldots, 10^{0}\}$
$\mu$: $\{2^{-7}, 2^{-6}, \ldots, 2^{-1}\}$
We first grid search over 𝜆 using 𝜇 = 0.125, and limit SGA to 2000 iterations per candi-
date value. The point of grid search is not to fully converge, but to get a quick estimate
of the best 𝜆 for convergence during the full set training. After the best 𝜆 is selected, we
use it during grid search over 𝜇 to find the best regularization rate. We again limit SGA
to 2000 iterations per candidate value.
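The candidate grids and the rotating 7-fold split can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for "train SGA for 2000 iterations on the training folds and return mean validation accuracy"; the interleaved fold assignment is also just one simple choice, assuming the examples were shuffled beforehand.

```python
lambda_grid = [10.0 ** k for k in range(-7, 1)]   # 10^-7 ... 10^0
mu_grid = [2.0 ** k for k in range(-7, 0)]        # 2^-7 ... 2^-1

def k_fold_indices(n, k=7):
    """Yield (train_indices, validation_indices) for each of k rotating folds."""
    folds = [list(range(f, n, k)) for f in range(k)]
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]

def grid_search(candidates, evaluate):
    """Return the candidate with the highest score under `evaluate`."""
    return max(candidates, key=evaluate)

# Toy scoring function, used only to show the call shape.
print(grid_search(mu_grid, lambda mu: -abs(mu - 0.25)))
```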
3.6 Full Training Stopping Conditions
For Collins Perceptron we determine convergence by comparing the trailing average of
validation set accuracy between two consecutive epochs. The trailing average is computed
by taking the average of the accuracy percentage of the last three epochs. We
stop the process when the trailing average starts decreasing, or after hitting a predetermined
limit on the number of epochs. As a convenience in the code, instead of
computing the accuracy as a percentage, we count the number of correct predictions on-the-fly
while performing the perceptron update.
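The stopping rule can be sketched as a small predicate over the per-epoch accuracy history. The function names are hypothetical, and requiring at least `window + 1` epochs before the averages are compared is our assumption.

```python
def trailing_average(values, window=3):
    """Mean of the last `window` values (or of all values if there are fewer)."""
    recent = values[-window:]
    return sum(recent) / len(recent)

def should_stop(accuracies, max_epochs, window=3):
    """Stop when the epoch limit is reached or the trailing average has dropped
    relative to the trailing average one epoch earlier."""
    if len(accuracies) >= max_epochs:
        return True
    if len(accuracies) < window + 1:
        return False
    current = trailing_average(accuracies, window)
    previous = trailing_average(accuracies[:-1], window)
    return current < previous

print(should_stop([90.0, 92.0, 93.0, 94.0], max_epochs=40))   # still improving
print(should_stop([94.0, 94.0, 93.0, 91.0], max_epochs=40))   # trailing average fell
```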
For SGA we simply halt on a hard limit on number of epochs only. As we wanted to
explore a breadth of experiment permutations (different tag schemes, feature function
schemes) within our time limits, we opted for a hard limit of 1 epoch for SGA and 40
epochs for Collins Perceptron.
In both cases, we would ideally use a much higher hard limit on epoch count to allow
convergence to happen on its own, within some measure (e.g. trailing average as above,
or by a threshold 𝑤𝑗 difference in magnitude).
4 Results of Experiments
4.1 Grid Search Results
The results of grid search for learning rate and regularization constant are shown below,
based on 2-gram, OX scheme experiment configuration.
The optimal value of learning rate we found is 0.01.
Collins Perceptron
  while prev_trailing_avg < cur_trailing_avg and epoch_number < max_epochs do
    set prev_trailing_avg = cur_trailing_avg
    set num_correct = 0
    foreach sample ($\bar{x}$, $\bar{y}$) do
      set $\hat{y}$ = Viterbi($\bar{x}$, weights)
      if $\bar{y} \neq \hat{y}$ then
        foreach weight $w_j$ do
          $w_j := w_j + \lambda (F_j(\bar{x}, \bar{y}) - F_j(\bar{x}, \hat{y}))$
        end foreach
      else
        set num_correct = num_correct + 1
      end if
    end foreach
    compute cur_trailing_avg from num_correct over the last 3 epochs
    set epoch_number = epoch_number + 1
  end while
Stochastic Gradient Ascent
  for t = 1 to T do
    for j = 1 to J do
      $w_j := w_j + \lambda (F_j(\bar{x}_t, \bar{y}_t) - E_{\bar{y}' \sim p(\bar{y}' \mid \bar{x}_t; \bar{w})}[F_j(\bar{x}_t, \bar{y}')])$
    end for
  end for
Grid search for the regularization constant indicated an optimal value of 0.25.
4.2 Full Training Results
For final training over our training set (70% of the entire data set), we performed and compared
16 different experiment permutations, spanning both training methods, both
tagging schemes, and various choices of c-gram size $c$ for our feature functions.
We estimate the success of each permutation by measuring per-letter accuracy
over the remaining 30% of the data set. Our letter-level accuracy results are shown below.
[Figure: Learning Rate vs Accuracy. Learning rates shown: 1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001. Vertical axis: accuracy (%).]
[Figure: Regularization Constant vs Accuracy. Vertical axis: accuracy (%).]
We find that the overall accuracy of the SGA solver is lower than that of Collins Perceptron.
This could be due to our hard stopping condition for SGA, which may not give it
sufficient time to converge. Within SGA, we see that the accuracy for the BIO
tag set is lower than that of the OX tag set. One reason for this could be that we used the
same hyperparameters for both tag sets. With Collins Perceptron, in contrast, BIO tag encoding performed about
as well as OX, and in some cases slightly better.
Because of the faster running time of Collins Perceptron, we were also able to compare
4-grams alone with the combination of 2- through 4-grams. We found that using 4-gram feature functions
increases accuracy compared to 2-grams and 3-grams, but the combination of 2-gram, 3-gram,
and 4-gram feature functions does not cause any major change in the accuracy
over using 4-grams only.
Tag set    4-gram    2, 3 and 4-gram
OX         96.77%    96.34%
BIO        95.96%    96.32%
Computed using Collins Perceptron only.
5 Findings and Lessons Learned
We were very pleased with the speed and efficiency of Collins Perceptron. We spent less
total processing time per experiment than with SGA, yet received better final accuracy
than SGA.
The nature of the code optimizations employed opened up a new way of thinking about
machine learning algorithms in general for us. Where before $\bar{x}_t$ would have been thought
of as simply an input on a conveyor belt, we now look at each example as pairable with
meta information (e.g. the use of a per-example feature function index list), solely for
the purpose of using $\bar{x}_t$ in the most efficient way possible during learning. This perspective
shift will continue to influence how we implement our object-oriented or functional
approaches to machine learning going forward.
[Figure: Letter-level Accuracy (%) for Collins and SGA across 2-gram OX, 2-gram BIO, 3-gram OX, 3-gram BIO, 2 and 3-grams OX, and 2 and 3-grams BIO.]
For Collins Perceptron we find that using c-gram feature functions for a single value of $c$
gives better results than a combination of multiple c-gram sizes. For instance, the accuracy
for 3-gram OX is 93.48% whereas for 2 and 3-gram OX it is 92.83%. One explanation
for this might be that when 2-gram and 3-gram feature functions are both available, the
weight for a particular 2-gram feature function is diluted over several weights, one for the
2-gram and others for related 3-grams.
We underestimated the importance of performing grid search for each individual permutation
of the experiments, which we feel is the most likely explanation for why SGA
performed very competitively for the 2-gram OX experiment (on which we executed
grid search), while performing comparatively poorly for all other experiment permutations. In the
future we intend to be much more thorough about our use of grid search.
References
[1] N. Trogkanis and C. Elkan, "Conditional Random Fields for Word Hyphenation," in
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010.
[2] C. Elkan, "Log-linear models and conditional random fields," 2013. [Online].
Available: http://cseweb.ucsd.edu/~elkan/250B/loglinearCRFs.pdf. [Accessed
February 2013].
[3] C. Sutton and A. McCallum, "An Introduction to Conditional Random Fields," 2010.
[4] C. Elkan, "Maximum Likelihood, Logistic Regression, and Stochastic Gradient
Training," 17 January 2013. [Online]. Available:
http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf. [Accessed February 2013].
[5] C.-W. Hsu, C.-C. Chang and C.-J. Lin, "A Practical Guide to Support Vector
Classification," 15 April 2010. [Online]. Available:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. [Accessed January 2013].