
Learning the IHMM through iterative map-reduce

Sebastien Bratieres, Jurgen van Gael, Andreas Vlachos, Zoubin Ghahramani

University of Cambridge

Context

Web-scale datasets frequently make distributed learning unavoidable: parallelization and acceleration methods alone are not sufficient to make learning tractable in a realistic amount of time.

Non-parametric models, like Dirichlet process-based models, the Indian buffet process, or the infinite HMM used here, are very attractive from a theoretical point of view: their inference does not require model selection in order to fit "capacity" parameters, such as the number of states or clusters; these are learned just like any other parameter.

Here, we selected one such model, the infinite HMM; applied it to a task, part-of-speech tagging; implemented it on the map-reduce platform Hadoop; and examined execution times.

Our learning algorithm is based on Gibbs sampling and thus requires several thousand iterations. It turns out that the per-iteration overhead imposed by Hadoop makes it prohibitive to iterate this often. Map-reduce implementations targeting precisely this requirement, like Twister, currently still under development, are expected to bring a real advantage.

The infinite HMM

[Graphical model of the infinite HMM: hidden states s0, s1, s2, ... emit observations y1, y2, ...; each state k = 1 ... ∞ has a transition row πk and an emission vector θk, governed by the hyperparameters β, γ, α and H.]

• State sequence s1, s2, ..., sT. Each state stems from a collection of K states.

• Observation sequence y1, y2, ..., yT, with observations stemming from a vocabulary of size V.

• Transition matrix π consisting of rows πk, with πkl = p(st+1 = l | st = k).

• Emission vector θk for each state, of length V; p(yt | st, θst) is a simple categorical distribution.

The prior for πk is a Dirichlet distribution with concentration α and base β, drawn from a symmetric Dirichlet distribution parameterized by γ. Emission vectors have as prior a symmetric Dirichlet distribution whose parameter is H.
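To make this generative structure concrete, the following is a minimal numpy sketch (not the authors' implementation). It uses a finite truncation K of the infinite state space; the variable names, the hyperparameter values, and the choice of drawing s0 directly from β are illustrative assumptions.

```python
# Minimal truncated sketch of the generative model described above (assumed,
# not the authors' code).  K is a finite truncation of the infinite state
# space; hyperparameter names follow the poster (gamma, alpha, H).
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 10, 50, 20                               # states, vocabulary, length
gamma, alpha, H = 3.0, 3.0, 0.5

beta = rng.dirichlet(np.full(K, gamma / K))        # shared base over states
pi = rng.dirichlet(alpha * beta, size=K)           # one transition row per state
theta = rng.dirichlet(np.full(V, H), size=K)       # one emission vector per state

s = np.empty(T, dtype=int)
y = np.empty(T, dtype=int)
s_prev = rng.choice(K, p=beta)                     # s0 (assumed drawn from beta)
for t in range(T):
    s[t] = rng.choice(K, p=pi[s_prev])             # p(s_t | s_{t-1}) = pi[s_{t-1}]
    y[t] = rng.choice(V, p=theta[s[t]])            # categorical emission
    s_prev = s[t]
```

The shared base β is what ties the rows πk together; in the full model it comes from the stick-breaking construction of the hierarchical Dirichlet process, which the finite Dirichlet(γ/K, ..., γ/K) above approximates.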

Here is the outline of the inference algorithm for the IHMM, applied to PoS tagging, with individual steps implemented as map-reduce jobs:

• count transitions (map over sentences)
• draw each πk from a Dirichlet(β + counts) (map over states)
• count emissions (map over sentences)
• draw each θk from the posterior, a Dirichlet(symmetric H + counts) (map over vocabulary)
• sample auxiliary variables (used for beam sampling, an instance of auxiliary-variable MCMC) from each sequence of two states in sentences (map over sentences)
• with reference to the Chinese Restaurant representation of the Dirichlet Process, sample the number of tables used by the state transitions (considering each transition in the counts a new customer) (map over elements in the transition matrix)
• based on table counts, resample β
• expand β, πk, θk
• run the dynamic program to sample new state assignments (map over sentences; see the sketch after this list)
• clean up (prune unused states) β, πk, θk (map over sentences twice: first take note of used states, then remove unused ones)
• ... and iterate
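The dynamic-programming step referenced above dominates the per-iteration cost (it is the single job kept in the hadoop-2 configurations below). Here is a hedged numpy sketch of beam sampling for one sentence, i.e. the body of one map task. The function name beam_resample_states, the layout of s_old (the previous assignment, including s0), and the assumption that β, πk and θk have already been expanded to a fixed number of represented states K are ours, not taken from the poster.

```python
import numpy as np

def beam_resample_states(y, s_old, pi, theta, rng):
    """One beam-sampling sweep for a single sentence (hedged sketch).
    y: observed word ids (length T); s_old: previous state assignment
    including s0 (length T+1); pi: (K, K) transition rows; theta: (K, V)
    emission vectors."""
    T, K = len(y), pi.shape[0]
    # auxiliary (slice) variables: u_t ~ Uniform(0, pi[s_{t-1}, s_t])
    u = rng.uniform(0.0, pi[s_old[:-1], s_old[1:]])
    # forward filtering, keeping only transitions with pi[l, k] > u_t
    fwd = np.zeros((T, K))
    prev = np.zeros(K)
    prev[s_old[0]] = 1.0
    for t in range(T):
        allowed = pi > u[t]                              # (K, K) mask
        fwd[t] = theta[:, y[t]] * (allowed * prev[:, None]).sum(axis=0)
        fwd[t] /= fwd[t].sum()
        prev = fwd[t]
    # backward sampling of the new state sequence
    s_new = np.empty(T, dtype=int)
    s_new[-1] = rng.choice(K, p=fwd[-1])
    for t in range(T - 2, -1, -1):
        w = fwd[t] * (pi[:, s_new[t + 1]] > u[t + 1])
        s_new[t] = rng.choice(K, p=w / w.sum())
    return s_new
```

Because each slice variable u[t] rules out every transition whose probability falls below it, the forward pass only sums over a finite set of allowed predecessors, which is what makes exact resampling possible under the infinite model.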

Map-reduce with Hadoop

Hadoop is an open-source Apache project which implements the map-reduce distribution paradigm. Map-reduce operates on data formatted as <key, value> pairs.

Steps (the sketch after this list shows the first IHMM job written in this form):

• split input data (<K1, V1>) into chunks
• execute a task on each chunk (the map task): obtain intermediate data <K2, V2>
• sort on K2
• send each set of K2 to one reducer
• perform the reduce task: obtain final data <K3, V3>
• (possibly sort again, on K3)
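To illustrate this flow on the algorithm above, here is a hedged sketch of the first job of an iteration, transition counting, written as a Hadoop Streaming mapper and reducer in Python. The tab-separated key format and the assumption that each input line carries one sentence's current state assignment (e.g. "3 7 7 2") are illustrative, not the authors' actual job.

```python
# Hedged sketch of the "count transitions" step as a Hadoop Streaming job.
# Assumption: each input line holds one sentence's current state assignment.
import sys
from itertools import groupby

def mapper(stream=sys.stdin):
    """Map over sentences: emit <K2, V2> = <"prev,next", 1> per transition."""
    for line in stream:
        states = line.split()
        for prev, nxt in zip(states, states[1:]):
            sys.stdout.write(f"{prev},{nxt}\t1\n")

def reducer(stream=sys.stdin):
    """Hadoop has already sorted on K2; sum the counts in each key group."""
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        sys.stdout.write(f"{key}\t{total}\n")            # final <K3, V3>

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

With Hadoop Streaming the same script would be passed to the -mapper and -reducer options; the aggregated counts then feed the Dirichlet(β + counts) draws of the next job.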

Experiments

We ran different versions of the same algorithm on different sizes of learning corpus and measured iteration duration. The learning corpus consisted of Wall Street Journal sentences: we used subsets of 1e3, 1e4 and 1e5 words, the entire corpus of 1e6 words, and an artificial data set of size 1e7 created by duplicating the 1e6 data set.

Configurations were as follows:

parallel   implementation of the iHMM in .NET which uses multithreading on a quad-core 2.4 GHz machine with 8 GB of RAM

hadoop-1-*   Hadoop version, running on a cluster, where each iteration contains 9 map-reduce jobs; * stands for the number of slave nodes; hardware: Amazon “small” type, i.e. 32-bit platforms with one CPU equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor, 1.7 GB of memory, with 160 GB of storage.

hadoop-2-*   Hadoop version, where each iteration contains only one map-reduce job, the most CPU-intensive one (the dynamic programming step); same hardware, always just 1 slave node

hadoop-3-*   same as hadoop-2 but on more efficient hardware: Amazon “extra large” nodes, 64-bit platforms with 8 virtual cores, each equivalent to 2.5 times the reference 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor, 7 GB of memory, 1690 GB of storage.

Here are the 18 clusters obtained after 1160 iterations, starting from 10 clusters, learning from a 1e6 word corpus. A token appears in a cluster if, in the last Gibbs iteration, it was assigned at least once to the corresponding state. Tokens are ranked inside each cluster according to a relevance metric, the log-likelihood ratio (a sketch of such a statistic follows the cluster list). Tokens may be repeated among clusters, as there is no deterministic token-to-cluster assignment.

SENTENCE_END
SENTENCE_START
., ?, !, ", %, quips
it, n’t, he, which, they, be, , share, she, there, ’s, more, who, not, Japan, them, expected, well, Congress, but
, of, to, in, and, the, ’s, for, is, that, on, by, with, from, at, as, are, was, will, said
The, In, But, ", ", A, It, He, They, Mr., And, That, If, For, This, As, At, a, Some, While
coupled, crocidolite, success, blacks, 1965, 8.07, Arnold, Richebourg, conversations, new-home, smokers, ward, worksheets, prove, Bell, Darkhorse, Italy, Panama, buses, insisted
the, a, $, its, of, an, and, their, Mr., his, New, no, other, any, this, these, recent, some, The, two
also, We, spokesman, Corp., I, addition, There, company, admitting, Computer, declined, Acquisition, L., Ross, White, House, you, has, Inc., officials
%, million, company, market, billion, York, trading, cents, year, president, Inc., ", funds, shares, months, days, is, markets, yen
Yeargin, High, end, carrier, Hills, dollar, lot, Angeles, IRS, Orleans, principal, circuit, most, Mississippi, buyer, Lane, Street, female, Senate, result
Co, N.J, 2645.90, Calif, Mass, UNESCO, fines, soon, pollen, Baltimore, fired, 1956, Conn, Danville, Egypt, anxious, goodwill, privilege, severable, shelter
Trotter, home, 106, 301, Amadou-Mahtar, Appellate, Chadha, Confidence, Corrigan, Crew, Curt, Dakota, Express-Buick, Federico, Fuentes, Glove, Harrison, Islamic, Jeremy, Judiciary
Always, Heiwado, N.V, Stung, Perhaps, 4, 5, 3, 2, )
River, Jr., States, says, School, Bridge, sheet, voice, Corps, Times, piece, Giuliani, Price, underwriter, Code, Register, Tashi, accurate, condition, fills
new, of, same, own, vice, first, appropriations, U.S., financial, Japanese, executive, 10, Big, major, 100, good, stock, previous, bad, joint
Earns
Haruki, Nomenklatura, abuzz, alumni, ambassadors, attended, backyard, boarding, celebrate, diminish, finite, lessening, milestones, preventative, profess, rebuild, relax, tapping, thin-lipped, cloth
Danube
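The relevance metric used for the ranking above is described only as a log-likelihood ratio; the poster does not give the exact formula, so the following is just a plausible reconstruction using the standard G² statistic over a 2x2 token-versus-cluster contingency table, with invented counts in the example call.

```python
# Hypothetical reconstruction of a log-likelihood-ratio relevance score for
# ranking a token inside a cluster (Dunning-style G^2 over a 2x2 table).
from math import log

def llr(k11, k12, k21, k22):
    """k11: token in this cluster, k12: token in other clusters,
    k21: other tokens in this cluster, k22: other tokens elsewhere."""
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    def term(obs, exp):
        return obs * log(obs / exp) if obs > 0 else 0.0
    return 2.0 * (term(k11, row1 * col1 / total) +
                  term(k12, row1 * col2 / total) +
                  term(k21, row2 * col1 / total) +
                  term(k22, row2 * col2 / total))

# e.g. a token assigned 40 times to this state and 10 times to others,
# in a state holding 5,000 of 1,000,000 assignments (invented numbers)
print(round(llr(40, 10, 4960, 994990), 1))
```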

Results

[Figures: iteration runtime in seconds against the number of data points (10^4 to 10^7), on log scales. One panel shows the hadoop-1 configurations H1-1 through H1-16 (runtime axis roughly 500 to 5000 s); a second shows the parallel .NET baseline together with the hadoop-2 configurations H2-a through H2-f and the hadoop-3 configurations H3-1 and H3-4 (runtime axis roughly 10 to 1000 s); a third overlays all series.]

Next steps

1. Reproduce the experiment with Twister, a map-reduce framework built specifically for iterative algorithms: it leaves the mappers running from one iteration to the next instead of restarting them. Twister is under development at Indiana University. Stay tuned!

2. Scale from the Wall Street Journal corpus to the English Wikipedia (from 1e6 to 2e9 words). NLP evaluation can then no longer be carried out against a gold standard, such as the WSJ labels used so far, so we will resort to indirect methods.

References

[1] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The Infinite Hidden Markov Model. Advances in Neural Information Processing Systems, 14:577–584, 2002.

[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[3] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam Sampling for the Infinite Hidden Markov Model. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, 2008.

[4] J. Van Gael, A. Vlachos, and Z. Ghahramani. The infinite HMM for unsupervised PoS tagging. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 678–687, Singapore, 2009.
