NLP Demystified 2013nasmith/slides/socs-6-28-13.pdf · secondperson pronoun u...

NLP Demystified

Noah Smith
Language Technologies Institute
Machine Learning Department
School of Computer Science
Carnegie Mellon University

nasmith@cs.cmu.edu

Outline

1.  Automatically categorizing documents
2.  Decoding sequences of words
3.  Clustering documents and/or words

Categorizing Documents: Examples

•  Mosteller and Wallace (1964): authorship of the Federalist papers

•  News categories: U.S., world, sports, religion, business, technology, entertainment, ...

•  How positive or negative is a review of a film or restaurant?

•  Is a given email message spam?
•  What is the reading level of a piece of text?
•  How influential will a research paper be?
•  Will a congressional bill pass committee?

The Vision

•  Human experts label some data
•  Feed the data to a learning algorithm that constructs an automatic labeling function

•  Apply that function to as much data as you want!

Basic Recipe for Document Categorization

1.  Obtain a pool of correctly categorized documents D.

2.  Define a function f from documents to feature vectors.

3.  Define a parameterized function h_w from feature vectors to categories.

4.  Select h’s parameters w using a training sample from D.

5.  Estimate performance on a held-out sample from D.

1. Obtain Categorized Documents

Spinoza, 17th century rationalist

2. Define the Feature Vector Function

•  Simplest choice: one dimension per word, and let [f(d)]_j be the count of word w_j in d.

•  Twists:
–  Monotonic transforms, like dividing by the length of d or taking a log.
–  Increase the weights of words that occur in fewer documents (“inverse document frequency”)
–  n-grams
–  Count specially defined groupings of words
–  Statistical tests to select words likely to be informative
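A minimal sketch of the simplest feature function and the inverse-document-frequency twist. Whitespace tokenization, lowercasing, the log(N / document frequency) form of IDF, and the names `count_features` and `idf_weights` are assumptions made for illustration, not something the slides specify.

```python
import math
from collections import Counter

def count_features(doc):
    """Map a document to word counts: [f(d)]_j = count of word w_j in d."""
    return Counter(doc.lower().split())

def idf_weights(docs):
    """One common IDF form (an assumption): log(N / number of docs containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

docs = ["the cat sat on the mat", "the dog chased the cat", "a dog barked"]
counts = count_features(docs[0])
idf = idf_weights(docs)
# Reweight the raw counts so that words found in fewer documents count for more.
tfidf = {word: c * idf[word] for word, c in counts.items()}
```

Here "the" occurs twice in the first document but appears in two of the three documents, so its weight is scaled down relative to a rarer word such as "mat".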

3. Define a Function from Feature Vectors to Categories

•  Simplest choice: linear model

$$h_{\mathbf{w}}(d) = \arg\max_{c} \; \mathbf{w}_c^\top \mathbf{f}(d) + w_c^{\text{bias}}$$

w_c is the vector of coefficients associating each feature with class c (can be positive or negative).
–  Advantage: interpretability
–  Advantage: computational efficiency

•  Some alternatives: k-nearest neighbors, decision trees, neural networks, ...
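The linear prediction rule can be sketched in a few lines. The sparse dictionary representation, the function name `predict`, and the toy weights are hypothetical, chosen only to make the argmax concrete.

```python
def predict(features, weights, biases):
    """Return the class c that maximizes w_c . f(d) + bias_c.

    features: dict feature name -> value, i.e. f(d) in sparse form
    weights:  dict class -> dict feature name -> coefficient (w_c)
    biases:   dict class -> bias term
    """
    def score(c):
        linear = sum(weights[c].get(name, 0.0) * value
                     for name, value in features.items())
        return linear + biases[c]
    return max(weights, key=score)

# Toy, hand-set parameters (not learned from data).
weights = {
    "sports":   {"goal": 1.5, "election": -1.0},
    "politics": {"goal": -0.5, "election": 2.0},
}
biases = {"sports": 0.1, "politics": 0.0}
label = predict({"goal": 2, "team": 1}, weights, biases)
```

Because each class score is a weighted sum, the contribution of every feature can be read off directly, which is the interpretability advantage the slide mentions.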

4. Select Parameters using Data

•  Also known as “machine learning.”
•  Many learning options for linear classifiers!

[Figure: learners for linear classifiers grouped under “probabilistic interpretation” and “discriminative”; labels shown: LR, SVM, perceptron]

4. Select Parameters using Data

Optimization view of learning:

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \; R(\mathbf{w}) + \frac{1}{|D_{\text{train}}|} \sum_{d \in D_{\text{train}}} L(d; \mathbf{w})$$

•  R(w): “regularization” to avoid overfitting
•  The averaged sum: “empirical risk” = average loss over training data

Typical loss functions for linear models are convex and can be efficiently optimized using online or batch iterative algorithms with convergence guarantees.
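As one concrete instance of this objective, the sketch below uses a squared-L2 regularizer and the logistic loss for a two-class problem, minimized by batch gradient descent. The choice of loss, regularizer, step size, and the toy data are all assumptions for illustration; the slides leave R and L generic.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, data, lam):
    """R(w) + empirical risk: lam * ||w||^2 plus the mean logistic loss (labels are +1 / -1)."""
    reg = lam * sum(wj * wj for wj in w)
    risk = sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in data) / len(data)
    return reg + risk

def gradient(w, data, lam):
    """Gradient of the objective above with respect to w."""
    grad = [2.0 * lam * wj for wj in w]
    for x, y in data:
        coeff = -y / (1.0 + math.exp(y * dot(w, x))) / len(data)
        for j, xj in enumerate(x):
            grad[j] += coeff * xj
    return grad

def gradient_descent(data, lam=0.01, step=0.5, iters=100):
    """Batch iterative optimization starting from w = 0."""
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        g = gradient(w, data, lam)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Toy data: the first feature is a constant 1 acting as the bias.
data = [([1.0, 2.0], 1), ([1.0, 1.5], 1), ([1.0, -1.0], -1), ([1.0, -2.0], -1)]
w_hat = gradient_descent(data)
```

Raising `lam` pulls the learned weights toward zero, trading training loss for a simpler model.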

4. Select Parameters using Data

Considerations:
•  Do you want posterior probabilities, or just labels?

•  What methods do you understand well enough to explain in your paper?

•  What methods will your readers understand?
•  What implementations are available?
–  Cost, scalability, programming language, compatibility with your workflow, ...

•  How well does it work (on held-out data)?

5. Estimate Performance

•  Always, always, always use held-out data.
–  Multiple rounds of tests? Fresh testing data!

•  Consider the “most frequent class” baseline.
•  Consider inter-annotator agreement.
•  What to measure?
–  Accuracy
–  When one class is special: precision/recall

5. Estimate Performance

[Figure: precision and recall]

$$h_{\mathbf{w}}(d) = \arg\max_{c} \; \mathbf{w}_c^\top \mathbf{f}(d) + w_c^{\text{bias}}$$
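A small sketch of the measures named on these slides, using the standard definitions: precision is the fraction of predicted target labels that are correct, and recall is the fraction of true target labels that were found. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def accuracy(gold, predicted):
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)

def most_frequent_class_baseline(gold):
    """Accuracy obtained by always predicting the most common gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)

def precision_recall(gold, predicted, target):
    """Precision and recall when one class (target) is special."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    gold_pos = sum(1 for g in gold if g == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall

# Held-out labels and a classifier's predictions (toy values).
gold = ["spam", "ham", "spam", "ham", "spam"]
pred = ["spam", "spam", "ham", "ham", "spam"]
acc = accuracy(gold, pred)
base = most_frequent_class_baseline(gold)
p, r = precision_recall(gold, pred, "spam")
```

In this toy case the classifier's accuracy equals the most-frequent-class baseline, which is exactly the kind of comparison the baseline is meant to expose.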

Outline

✓  Automatically categorizing documents
2.  Decoding sequences of words
3.  Clustering documents and/or words

Decoding Word Sequences: Examples

•  Categorizing each word by its part-of-speech or semantic class

•  Recognizing mentions of named entities
•  Segmenting a document into parts
•  Parsing a sentence into a grammatical or semantic structure

High-Level View

[Figure: classification (document d, category c) contrasted with structured prediction (outputs ... y_N)]

Possible Lines of Attack

1.  Transform into a sequence of classification problems (see part 1).

2.  Transform into a sequence labeling problem and use a variant of the Viterbi algorithm.

3.  Design a representation, prediction algorithm, and learning algorithm for your particular problem.

Shameless Self-Promotion

Noah A. Smith (Carnegie Mellon University), Linguistic Structure Prediction
Synthesis Lectures on Human Language Technologies (Series Editor: Graeme Hirst, University of Toronto)
Morgan & Claypool Publishers, www.morganclaypool.com
ISBN: 978-1-60845-405-1
Series ISSN: 1947-4040

About SYNTHESIS: This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

A major part of natural language processing now depends on the use of text data to build linguistic analyzers. We consider statistical, computational approaches to modeling linguistic structure. We seek to unify across many approaches and many kinds of linguistic structures. Assuming a basic understanding of natural language processing and/or machine learning, we seek to bridge the gap between the two fields. Approaches to decoding (i.e., carrying out linguistic structure prediction) and supervised and unsupervised learning of models that predict discrete structures as outputs are the focus. We also survey natural language processing problems to which these methods are being applied, and we address related topics in probabilistic inference, optimization, and experimental methodology.

$56.43 on amazon.com; possibly free in electronic form, through your university’s library

Lines of Attack

1.  Reduce to a sequence of classification problems (see part 1).

2.  Reduce to a sequence labeling problem and use a variant of the Viterbi algorithm.

3.  Design a representation, prediction algorithm, and learning algorithm for your problem.

Sequence Labeling

•  Input: sequence of symbols x_1 x_2 ... x_L
•  Output: sequence of labels y_1 y_2 ... y_L, each ∈ Λ

Prediction rule:

$$h_{\mathbf{w}}(x) = \arg\max_{y_1 \ldots y_L} \; \mathbf{w}^\top \mathbf{f}(x_1 \ldots x_L, y_1 \ldots y_L)$$

Problem: there are O(|Λ|^L) choices for y_1 y_2 ... y_L!

Sequence Labeling with Local Features

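A key assumption about f allows us to solve the problem exactly, in O(|Λ|^2 L) time and O(|Λ| L) space.

$$h_{\mathbf{w}}(x) = \arg\max_{y_1 \ldots y_L} \; \mathbf{w}^\top \mathbf{f}(x_1 \ldots x_L, y_1 \ldots y_L) = \arg\max_{y_1 \ldots y_L} \; \mathbf{w}^\top \sum_{\ell=1}^{L-1} \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y_\ell y_{\ell+1})$$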

If I knew the best label sequence for x_1 ... x_{L–1}, then y_L would be easy. That decision would depend only on state L – 1. I don’t know that best sequence, but there are only |Λ| options at L – 1. So I only need the score of the best sequence up to L – 1, for each possible label at L – 1. Call this V[L – 1, y] for y ∈ Λ. From this, I can score each label at L, for each hypothetical label at L – 1. Score of the best sequences up to L – 1 relies similarly on score of the best sequences up to L – 2. Ditto, at every other timestep L – 2, L – 3, ... 1.

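$$y^*_L = \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top \left( \sum_{\ell=1}^{L-2} \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y^*_\ell y^*_{\ell+1}) + \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L) \right)$$

$$= \arg\max_{y_L \in \Lambda} \; \mathbf{w}^\top \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y^*_{L-1} y_L)$$

The sum over ℓ = 1 ... L – 2 does not depend on y_L, so it drops out of the argmax.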

(Featurized) Viterbi Algorithm

•  Precompute V[*, *] from left to right. V[1, *] = 0. For ℓ = 2 to L, for each y in Λ:

$$V[\ell, y] = \max_{y' \in \Lambda} \; V[\ell - 1, y'] + \mathbf{w}^\top \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y' y)$$

$$B[\ell, y] = \arg\max_{y' \in \Lambda} \; V[\ell - 1, y'] + \mathbf{w}^\top \mathbf{f}_{\text{local}}(x_1 \ldots x_L, y' y)$$

•  Backtrack and select the labels from right to left.

$$y^*_L = \arg\max_{y} V[L, y]$$

For ℓ = L – 1 to 1:

$$y^*_\ell = B[\ell + 1, y^*_{\ell+1}]$$
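The recurrences above can be sketched as follows. The callback `local_score(x, pos, prev_label, label)` stands in for w^T f_local applied to one adjacent label pair; passing the position explicitly, the 0-based indexing, and the toy scoring tables are assumptions made for illustration.

```python
def viterbi(x, labels, local_score):
    """Best label sequence under a score that sums over adjacent label pairs.

    Positions are 0-based, so V[0][y] = 0 here corresponds to V[1, *] = 0.
    Runs in O(|labels|^2 * L) time with O(|labels| * L) space.
    """
    L = len(x)
    if L == 0:
        return []
    V = [{y: 0.0 for y in labels}]   # V[pos][y]: best score of a prefix ending in y
    B = [{}]                         # B[pos][y]: best previous label (backpointer)
    for pos in range(1, L):
        V.append({})
        B.append({})
        for y in labels:
            best_prev = max(labels,
                            key=lambda yp: V[pos - 1][yp] + local_score(x, pos, yp, y))
            V[pos][y] = V[pos - 1][best_prev] + local_score(x, pos, best_prev, y)
            B[pos][y] = best_prev
    # Backtrack from right to left.
    best = [None] * L
    best[L - 1] = max(labels, key=lambda y: V[L - 1][y])
    for pos in range(L - 2, -1, -1):
        best[pos] = B[pos + 1][best[pos + 1]]
    return best

# Hypothetical hand-set scores: word/label affinities and label-pair affinities.
EMIT = {("the", "D"): 2.0, ("dog", "N"): 2.0, ("dog", "V"): 0.5, ("barks", "V"): 2.0}
TRANS = {("D", "N"): 1.0, ("N", "V"): 1.0}

def toy_score(x, pos, prev, cur):
    score = TRANS.get((prev, cur), 0.0) + EMIT.get((x[pos], cur), 0.0)
    if pos == 1:  # the first word has no pair of its own, so score it with the first pair
        score += EMIT.get((x[0], prev), 0.0)
    return score

tags = viterbi(["the", "dog", "barks"], ["D", "N", "V"], toy_score)
```

Only the best score per label is kept at each position, which is why the cost is quadratic in the number of labels and linear in the sequence length rather than exponential.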

Part of Speech Tagging

After  paying  the  medical  bills  ,  Frances  was  nearly  broke  .
RB     VBG     DT   JJ       NNS    ,  NNP      VBZ  RB      JJ     .

•  Adverb (RB)
•  Verb (VBG, VBZ, and others)
•  Determiner (DT)
•  Adjective (JJ)
•  Noun (NN, NNS, NNP, and others)
•  Punctuation (., ,, and others)

Named Entity Recognition

With Commander Chris Ferguson at the helm ,

Atlantis touched down at Kennedy Space Center .

Named Entity Recognition

With  Commander  Chris     Ferguson  at  the  helm  ,
O     B-person   I-person  I-person  O   O    O     O

Atlantis         touched  down  at  Kennedy  Space    Center   .
B-space-shuttle  O        O     O   B-place  I-place  I-place  O
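Once each token carries a B-/I-/O tag, entity mentions can be read off by grouping consecutive tags. The sketch below assumes the tags on the slide pair with the tokens as written in the example (B- opens a mention, I- continues it, O is outside any mention); the function name `bio_to_spans` is hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity type, mention text) pairs from B-/I-/O tags."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type)
        if starts_new:
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any mention
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans

tokens = ("With Commander Chris Ferguson at the helm , "
          "Atlantis touched down at Kennedy Space Center .").split()
# Assumed token-to-tag pairing for the slide's example.
tags = ["O", "B-person", "I-person", "I-person", "O", "O", "O", "O",
        "B-space-shuttle", "O", "O", "O", "B-place", "I-place", "I-place", "O"]
mentions = bio_to_spans(tokens, tags)
```

This encoding is what turns mention recognition into a sequence labeling problem: one label per token, with the mention boundaries recoverable afterwards.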

Word Alignment

Mr. President , Noah’s ark was filled not with production factors , but with living creatures.

NULL Noahs Arche war nicht voller Productionsfactoren , sondern Geschöpfe .

Sequence Labeling

1.  Obtain a pool of correctly labeled sequences D.
2.  Define a locally factored function f from sequences and labelings to feature vectors.
3.  Define a parameterized function h_w from feature vectors to labelings.
4.  Select h’s parameters w using a training sample from D.
5.  Estimate performance on a held-out sample from D.

Structured Learners Generalize Linear Classification Learners!

•  hidden Markov models ⟵ naïve Bayes
•  conditional random fields ⟵ logistic regression
•  structured perceptron ⟵ perceptron
•  structured SVM ⟵ support vector machine

Additional Notes

•  Outputs that are trees, graphs, logical forms, other strings ...
–  parse trees (phrase structure, dependencies)
–  coreference relationships among entity mentions (and pronouns)
–  a huge range of semantic analyses

•  Evaluation?

Dependency Parse

Frame-Semantic Parse

Run our Parsers!

http://demo.ark.cs.cmu.edu/parse

Outline

✓  Automatically categorizing documents
✓  Decoding sequences of words
3.  Clustering documents and/or words

Clustering Real Data

K-Means

Given: points {x_1, …, x_N}, K (number of clusters)
1. Arbitrarily select μ_1, …, μ_K.
2. Assign each x_i to the nearest μ_j.
3. Select each μ_j to be the mean of all x_i assigned to it.
4. If all μ_j have converged, stop; else go to 2.
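The four steps can be sketched directly. Picking the initial means by sampling K of the input points, using squared Euclidean distance, and stopping when the means no longer change are assumptions for illustration; the slides only say "arbitrarily select" and "nearest".

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)              # step 1: arbitrary initial means
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # step 2: assign to the nearest mean
            nearest = min(range(k), key=lambda j: sq_dist(p, means[j]))
            clusters[nearest].append(p)
        new_means = [mean(c) if c else means[j]   # step 3: recompute each mean
                     for j, c in enumerate(clusters)]
        if new_means == means:                 # step 4: stop once nothing moves
            break
        means = new_means
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, k=2)
```

The result depends on the starting means in general; on well-separated toy data like this the two groups are recovered regardless of which points are drawn first.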

K-Means, Visualized

K-Means for Text?

•  Documents
–  Use the same f we might use for classification.

•  Words
–  Use “context” vectors ...

Where’s the beef?

chicken

Hypothetical Counts based on Syntactic Dependencies

          Modified-by-     Subject-of-   Object-of-   Modified-by-    Modified-by-
          ferocious(adj)   devour(v)     pet(v)       African(adj)    big(adj)
Lion      15               5             0            6               15
Dog       7                3             8            0               12
Cat       1                1             6            1               9
Elephant  0                0             0            10              15
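To see how such context vectors group words, the sketch below compares the rows of the hypothetical count table. Cosine similarity is an assumption chosen here because it ignores how frequent a word is overall; the slides do not name a similarity measure.

```python
import math

# Rows of the hypothetical count table; columns are
# ferocious, devour, pet, African, big.
contexts = {
    "lion":     [15, 5, 0, 6, 15],
    "dog":      [7, 3, 8, 0, 12],
    "cat":      [1, 1, 6, 1, 9],
    "elephant": [0, 0, 0, 10, 15],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """The other word whose context vector is most similar to this word's."""
    others = [w for w in contexts if w != word]
    return max(others, key=lambda w: cosine(contexts[word], contexts[w]))

dog_cat = cosine(contexts["dog"], contexts["cat"])
```

With these counts, "dog" and "cat" are each other's nearest neighbours, largely because both occur as the object of "pet".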

Brown Clustering

Given: corpus of length N, K
1.  Assign each word to its own cluster (V clusters)
2.  Repeat V – K times:
•  Find the single merge (c_j, c_k) that results in a new clustering with the highest Quality score
•  Prepend c_j’s bitstring with 0 and c_k’s with 1 (and the same for all their descendants)

Mini-Example

Bitstrings that share a prefix are in the same cluster, at some level of granularity.
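The prefix property can be made concrete by cutting every bitstring at a chosen depth. The words and bitstrings below are invented for illustration and are not taken from any trained clustering; `clusters_at_depth` is a hypothetical helper name.

```python
from collections import defaultdict

def clusters_at_depth(bitstrings, depth):
    """Group words whose bitstrings share the same first `depth` bits."""
    groups = defaultdict(list)
    for word, bits in bitstrings.items():
        groups[bits[:depth]].append(word)
    return dict(groups)

# Hypothetical bitstrings, as if read off a small merge tree.
bitstrings = {
    "dog": "000", "cat": "001",
    "run": "010", "walk": "011",
    "the": "10", "a": "11",
}
coarse = clusters_at_depth(bitstrings, 1)
fine = clusters_at_depth(bitstrings, 2)
```

Short prefixes give a few coarse clusters and longer prefixes give finer ones, so a single hierarchy can supply features at several granularities.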

Clusters from Brown et al. (1992)

Clusters from Owoputi et al. (2013) (56M Tweets)

acronyms for laughter: lmao lmfao lmaoo lmaooo hahahahaha lool c�u rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
onomatopoeic laughter: haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
affirmative: yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
negative: yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
metacomment: smh jk #fail #random #fact sm� #smh #winning #realtalk smdh #dead #justsaying
second person pronoun: u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
prepositions: w fo fa fr fro ov fer fir whit abou ar serie fore fah fuh w/her w/that fron isn agains
“contractions”: tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
going to: gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
so+: soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
mischievous: ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
happy: :) (: =) :)) :] :’) =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
sad: :( :/ -_- -.- :-( :’( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
love: <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333
F-word + ing: fucking fuckin freaking bloody freakin friggin effin effing fuckn fucken frickin fukin f'n fckn flippin �n motherfucking fckin f*cking fricken fukn fuccin fcking fukkin

Browse our Twitter Clusters!

http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html

Additional Notes

•  Soft clustering allows items to have mixed membership in different clusters.
–  Typically accomplished with probabilistic models
–  Latent Dirichlet allocation is a popular Bayesian model

•  Evaluation?
•  One view of clusters: feature creation!

Summary

•  supervised classification (5 steps: data, features, prediction function, learning, evaluation)
•  structured prediction: local factoring + dynamic programming
•  unsupervised clustering: alternating or greedy optimization
