Stats 170A: Project in Data Science
Text Analysis and Classification

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)

Reading, Homework, Lectures

• Reference reading
  – Python: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

• Homework 7
  – Due by 2pm Wednesday next week

• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects

Room for improvement….

Predicting the Sentiment of Tweets

"wow :o this was included in the playlist #awesome"  →  positive
"I could not be happier with my life right now"  →  positive
"HAPPY BIRTHDAY TO THE PANDERIFIC PANDA IK. @TheOrionSound"  →  positive
"I'm tired as hell I never get a off day during the week anymore"  →  negative
"The movie was really dull and stupid"  →  negative
"@washingtonpost @silvajanes How awful!!!!!!"  →  negative

"Documents"  →  Class Labels

Predicting the Scores of Movie Reviews

"I liked this movie. Well done. Great acting and direction"  →  4
"Terrific movie, saw it 3 times. Loved the scenes in Mexico."  →  5
"Boring as hell. Who gets paid to create this stuff?"  →  1
"Not one of the director's best efforts. Tom Cruise was terrible."  →  2

"Documents"  →  Ratings

Entity Recognition

The same review documents as above; here the task is to recognize entity mentions in the text (e.g., "Mexico", "Tom Cruise") rather than to predict the ratings.

Examples of User Location Fields in Twitter

User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL


Basic Concepts in Text Representation

Representing Text Documents Numerically

• How do we represent text (strings) for input to machine learning algorithms (which generally require numeric representations)?

• Define a vocabulary = a set of words or terms

• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary

• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in "embedding space"

• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words

Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1    24        21   9      0         0           3         1
d2    32        10   5      0         3           0         1
d3    12        16   5      0         0           0         1
d4    6         7    2      0         0           0         1
d5    43        31   20     0         3           0         1
d6    2         0    0      18        7           16        2
d7    0         0    1      32        12          0         2
d8    3         0    0      22        4           2         2
d9    1         0    0      34        27          25        2
d10   6         0    0      17        4           23        2

Concepts and Terminology

• Document:
  – A collection of words
  – A book, a news article, a report, a Web page, an email, a tweet, etc.
  – May contain both text and metadata.
  – Examples of metadata: author name(s), date, where published, etc.

Note that the definition of a document is flexible: e.g., a book could be a single document, or each section of a book could be considered a "document".

• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017

Concepts and Terminology

• Tokens:
  – Groupings of characters in the raw text
  – Individual words (or "word types") + possibly numbers, punctuation, etc.

• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as "cat", "dog", terms such as "U.S." or "92697"
  – Can also include bigrams ("San Diego"), trigrams ("New York City"), etc.

• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary

Typical Pipeline for Building a Text Classifier

Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier

Note that the steps in the pipeline, and how these are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem. A minimal sketch of such a pipeline in scikit-learn follows.
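The following is a minimal sketch of this pipeline using scikit-learn's CountVectorizer (see the reference reading above); the toy documents are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
docs = ["The dog chased the cat and the mouse.",
        "Why did the dog do this?"]

# CountVectorizer performs tokenization, lowercasing, stop word removal,
# and bag-of-words creation in one step
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)         # document-term count matrix
print(vectorizer.get_feature_names_out())  # the vocabulary that was defined
print(X.toarray())                         # counts, one row per document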

Example of a Text Analysis Pipeline (for Information Extraction)

[Figure: the information extraction pipeline from http://www.nltk.org/book/ch07.html: raw text → sentence segmentation → tokenization → part-of-speech tagging → entity detection → relation detection]

Example

Document = 'Chapter 1: The Beginning. In the beginning, life was tough! ……….'

Tokens = {'chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', …}
(Punctuation and whitespace are usually ignored)

Vocabulary = {'chapter', '1', 'the', 'beginning', 'in', 'life', 'was', 'tough', …}

Bag of Words = {['chapter', 1], ['1', 1], ['the', 2], ['beginning', 2], …}


Tokenization

• Split up text into individual tokens (words/terms)

• The simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens (a sketch follows below):

Text:   If you had a magic potion I'd love to have it.
Tokens: If  you  had  a  magic  potion  I'd  love  to  have  it
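A minimal sketch of this simplest approach with a regular expression (the pattern keeps an internal apostrophe, an assumption so that "I'd" survives as a single token):

import re

text = "If you had a magic potion I'd love to have it."

# Keep runs of alphabetic characters (allowing one internal apostrophe);
# all punctuation and whitespace act as separators
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
print(tokens)
# ['If', 'you', 'had', 'a', 'magic', 'potion', "I'd", 'love', 'to', 'have', 'it']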

Issues in Tokenization

• Finland's capital → Finland? Finlands? Finland's?
• what're, I'm, isn't → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??

From Speech and Language Processing, 3rd ed., Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/

Tokenization Software

• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., the tokenizer functions in NLTK
  – e.g., the tokenizer from Stanford's natural language group

• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • data with the original sequence and formatting, the tokenized list, the bag of words, etc.
  – The sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:

    If you had a magic potion I'd love to have it . If that makes sense

    If we decide to extract n-grams later on, we know that "it" and "If" should not be combined, so it's useful to retain this information.

Sentence Detection: Example using a Decision Tree

[Figure: a decision tree for deciding whether a period ends a sentence, from Speech and Language Processing, 3rd ed., Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/]

Example: Tokenization of Tweets

There are special-purpose tokenizers for different types of text, e.g., for Tweets:

from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tknzr = TweetTokenizer()

# tweets is assumed to be a pandas DataFrame of tweets with a 'text' column
for i in range(10):
    print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
    print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))

Sample Output from the 2 Tokenizers

NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…']
NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']

NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89']
NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']

NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…']
NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']

Example: Defining a Vocabulary

Raw text (a string in Python):
raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this

The vocabulary (the unique tokens, normalizing to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.

The counts for a bag-of-words representation are:
the (3), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)

If we remove stop words we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)

A small sketch of these steps in plain Python follows.
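A minimal sketch of these steps with collections.Counter (the stop word set here is a hand-picked assumption for illustration; a real application would use a fuller list such as NLTK's):

import re
from collections import Counter

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

tokens = re.findall(r"[a-z]+", raw1.lower())   # tokenize and lowercase
counts = Counter(tokens)                       # bag-of-words counts

print(len(counts))   # vocabulary size: 10
print(counts)        # Counter({'the': 3, 'dog': 2, 'chased': 1, ...})

# Illustrative stop word list (an assumption; NLTK provides a fuller one)
stop_words = {'the', 'and', 'why', 'did', 'do', 'this'}
content_counts = {w: c for w, c in counts.items() if w not in stop_words}
print(content_counts)   # {'dog': 2, 'chased': 1, 'cat': 1, 'mouse': 1}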

Defining the Vocabulary (after Tokenization)

• Vocabulary
  – The set of terms (words) used to construct the document-term matrix

• Basic approach: use single words (unigrams) as terms

• Remove very common terms (e.g., stop words)

• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc.

• Can extend the term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …), e.g., "feel good" / "New York City"

Frequency of Words in English Usage

[Figure: word frequency plotted against frequency rank, showing a very long "tail" of rare words. Graph from www.prismnet.com/~dierdorf/wordfrequency.png]

Stop Word Removal

• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)

Example:

If you had a magic potion I'd love to have it . If that makes sense

After stop word removal only content words such as "magic", "potion", "love", "makes", "sense" remain.

But what about this?

[Prince Hamlet] To be or not to be ...

Here nearly every word is on a typical stop word list, yet the phrase is meaningful: stop word removal can discard important content.

NLTK Stop Word List

[Figure: the NLTK English stop word list; see the sketch below for how to access it.]

Note: in many applications there may be additional domain-specific "stop words" that are very common and that we may want to remove since they contain little information, e.g., the term "restaurant" in reviews.
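A minimal sketch for inspecting NLTK's built-in stop word list (assumes the 'stopwords' corpus has been downloaded):

import nltk
# nltk.download('stopwords')   # one-time download, if not already present
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(len(stops))    # size of the list (179 in recent NLTK versions)
print(stops[:10])    # ['i', 'me', 'my', 'myself', 'we', 'our', ...]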

N-grams

• Useful n-grams are groups of n words that commonly appear in sequence
  – "Computer Science"
  – "Los Angeles"
  – "New York City"
  – Note that we can have both terms like "computer" and "computer science", e.g.,
    "I am studying computer science at UCI and I bought a new computer last week."

• Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, "wait time" may be more informative than "time"

Automatically Finding N-grams

• e.g., Bigrams
  – Keep track of all unique pairs that occur sequentially (excluding stop words):

    "I visited Microsoft Research for a computer science job interview last week."
    "* visited Microsoft Research * * computer science job interview last week."

  – Rank by frequency of occurrence (or the number of docs a bigram appears in)
  – Can also rank by other metrics

• Keep the top K in the vocabulary
  – How large should K be? Depends on the application, amount of data, etc.
  – Might need to search over different values of K (e.g., using cross-validation)

• Same idea for tri-grams, 4-grams, etc.

• See also the collocations how-to in NLTK: http://www.nltk.org/howto/collocations.html (a sketch is shown below)
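A minimal sketch of bigram finding with NLTK's collocation tools (the toy token list is made up; on a real corpus you would pass the tokens of all documents):

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

tokens = ("i visited microsoft research for a computer science "
          "job interview last week").split()

finder = BigramCollocationFinder.from_words(tokens)

# Exclude bigrams that contain a stop word
stops = set(stopwords.words('english'))
finder.apply_word_filter(lambda w: w in stops)

# raw_freq ranks bigrams by frequency of occurrence; keep the top K
print(finder.nbest(BigramAssocMeasures.raw_freq, 5))
# e.g. [('computer', 'science'), ('microsoft', 'research'), ...] (all tied here)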


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1    24        21   9      0         0           3         1
d2    32        10   5      0         3           0         1
d3    12        16   5      0         0           0         1
d4    6         7    2      0         0           0         1
d5    43        31   20     0         3           0         1
d6    2         0    0      18        7           16        2
d7    0         0    1      32        12          0         2
d8    3         0    0      22        4           2         2
d9    1         0    0      34        27          25        2
d10   6         0    0      17        4           23        2

Use labeled training data (supervised learning) to learn a classifier (Homework Assignment 7).

Overview of Text Classification

Text Classification

• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories

• Data Representation
  – "Bag of words/terms" most commonly used: either counts or binary
  – Can also use other weightings and additional features (e.g., metadata)

• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods

Example of Document-by-Term Matrix

      predict  finance  stocks  goal  score  team  Class Label
d1    3        1        4       0     0      1     1
d2    2        2        1       0     1      0     1
d3    1        1        2       0     0      0     1
d4    1        2        0       0     0      0     1
d5    3        1        1       0     1      0     1
d6    1        0        0       1     3      1     2
d7    0        0        1       4     1      0     2
d8    2        0        0       1     2      1     2
d9    1        0        0       2     1      2     2
d10   1        0        0       1     3      1     2

Possible Weights for a Linear Classifier

        predict  finance  stocks  goal  score  team  Class Label
Weight  0.1      2.5      1.0     -2.0  -0.2   -1.5
d1      3        1        4       0     0      1     1
d2      2        2        1       0     1      0     1
d3      1        1        2       0     0      0     1
d4      1        2        0       0     0      0     1
d5      3        1        1       0     1      0     1
d6      1        0        0       1     3      1     2
d7      0        0        1       4     1      0     2
d8      3        0        0       1     2      1     2
d9      1        0        0       2     1      2     2
d10     1        0        0       1     3      1     2

With these weights, e.g., d1 scores 3(0.1) + 1(2.5) + 4(1.0) + 1(-1.5) = 5.3 and d7 scores 1(1.0) + 4(-2.0) + 1(-0.2) = -7.2, so positive scores indicate class 1 and negative scores indicate class 2.

Real Example from Yelp Data

Yelp Dataset
Number of Reviews:                     706,693
Number of Reviews w/o Neutral Rating:  595,468
Number of Tokens:                      85,392,376
Vocabulary Size w/o Stopwords:         176,114
Array Dimensions:                      (595468, 176114)
Number of cells in the Array:          104,870,251,352
Non-zero entries:                      28,357,001
Density:                               0.0027%

Example of a Pipeline for Document Classification

Training Documents (corpus)
→ Tokenization → Lists of Tokens
→ Stop word and rare word removal (using frequency counts) → Vocabulary
→ Bag of Words
→ Machine Learning Algorithm → Document Classifier

Example of a Pipeline for Document Classification

Training:
Training Documents (corpus) → Tokenization → Lists of Tokens → Stop word and rare word removal (using frequency counts) → Vocabulary → Bag of Words → Machine Learning Algorithm → Document Classifier

Prediction:
New Document → Tokenization → Lists of Tokens → Bag of Words (using the same Vocabulary) → Document Classifier → Label Prediction

Key Steps in Document Analysis Pipelines (for Bag of Words)

• Tokenization
  – Various options (e.g., with punctuation, non-alphanumeric symbols, etc.)

• Vocabulary definition
  – N-grams, stop word removal, rare word removal, stemming

• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)

• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc.

TF-IDF Weighting of Features

In practice the inputs can be weighted; it can be helpful to use "TF-IDF weights" instead of raw counts.

TF(t, d) = term frequency = the number of times term t occurs in document d

IDF(t) = inverse document frequency = log(N / number of docs containing term t)
(where N = total number of docs in the corpus)

TF-IDF(t, d) = TF(t, d) * IDF(t)

The IDF term has the effect of upweighting terms that occur in few docs (a scikit-learn sketch follows).
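A minimal sketch in scikit-learn; note that TfidfVectorizer implements a smoothed variant of the formula above (by default a natural log with add-one smoothing, followed by length-normalizing each row), so the numbers differ from a hand computation but the effect is the same:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the city mayor spoke downtown",           # toy corpus, for illustration
        "traffic on the freeway into the city",
        "the city budget vote"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # rows = documents, columns = TF-IDF weights
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # "city" (in every doc) gets a low weight;
                              # "freeway" (in one doc) gets a high weight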

TF-IDF Example

N = 1000 in a corpus of news articles

Term 1: t = "city", appears in 500 documents
IDF(t) = log(1000/500) = log(2) = 1 (log is base 2; the base is not important)

Term 2: t = "freeway", appears in 10 documents
IDF(t) = log(1000/10) = log(100) = 6.64

So occurrences of "freeway" will get upweighted by a factor of 6.64 compared to occurrences of "city".

Classification Methods for Text

• Logistic regression: widely used for bag-of-words
  – The input dimensionality (vocabulary size) is high (e.g., 5k to 500k), so regularization is useful

• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information

A sketch of a bag-of-words logistic regression classifier is shown below.
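A minimal sketch of a bag-of-words logistic regression classifier in scikit-learn (the tiny labeled corpus is made up; the eli5 tutorial linked below works through a real example with model inspection):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["loved the movie great acting",        # toy labeled corpus
        "terrific film saw it twice",
        "boring and dull script",
        "terrible acting not worth seeing"]
labels = [1, 1, 0, 0]                          # 1 = positive, 0 = negative

# Regularization strength is set by C (smaller C = stronger penalty), which
# matters when the input dimensionality (vocabulary size) is large
clf = make_pipeline(CountVectorizer(), LogisticRegression(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["a great film"]))           # expect [1]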

[Figures: worked example of inspecting a scikit-learn text classifier with the eli5 package, from http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html and https://github.com/TeamHG-Memex/eli5]

Sentiment Analysis

A very useful tutorial, with many practical tips and ideas, is online at http://sentiment.christopherpotts.net/

Example: Predicting if Review Text is Positive or Negative

• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model
  (can also combine both approaches)

• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • "This movie is not depressing the way I thought it would be. I don't think I have anything negative to say about it."
    • "I wasn't disappointed at all with the acting – in fact the acting was excellent."
  – Negative reviews:
    • "Sitting through this movie was not something I enjoyed doing"
    • "This wasn't a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold"

Negation Marking (during Tokenization)

• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark (a sketch follows below)

From http://sentiment.christopherpotts.net/lingstruc.html
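A minimal sketch of this idea (the negation and punctuation patterns are simplified assumptions; Potts' page gives fuller regular expressions, and NLTK ships a similar utility as nltk.sentiment.util.mark_negation):

import re

NEGATION = re.compile(r"(?:^(?:never|no|not|cannot)$|n't$)")
CLAUSE_PUNCT = re.compile(r"^[.:;!?,]$")

def mark_negation(tokens):
    # Append _NEG to every token between a negation word and the next
    # clause-level punctuation mark
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.match(tok):
            negated = False
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if NEGATION.search(tok):
                negated = True
    return out

print(mark_negation("I did n't enjoy it at all , but the acting was great".split()))
# ['I', 'did', "n't", 'enjoy_NEG', 'it_NEG', 'at_NEG', 'all_NEG', ',', 'but', ...]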

Examples of Negation Marking

[Figure: marked-up example sentences from http://sentiment.christopherpotts.net/lingstruc.html]

Improvements in Classification Accuracy

[Figure: classification accuracy gains from negation marking, from http://sentiment.christopherpotts.net/lingstruc.html]

Sentiment Lexicons (Dictionaries)

• Counts, percentages, and weights of words in various categories
  – e.g., the degree to which a word tends to be positive or negative

• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc.
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon

Yelp Challenge Data Set

Jupyter Notebook Example with the Yelp Dataset

Vector Representations for Text

Vector Representations allow Generalization

• Example:
  Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
  Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
  Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]

• If the word "cat" occurs in a document then we know that other words like "kitten" and "pet" are similar

• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, a neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vector components, then words that are close together will produce similar outputs
  – This can help with generalization

Another Example of Word Embedding

Publicly-Available Pre-Trained Word Embeddings

• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase

• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets

• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model (see the loading sketch below)
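A minimal sketch of loading the pre-trained Google News vectors with gensim (assumes you have downloaded GoogleNews-vectors-negative300.bin from the Word2Vec page above; the uncompressed file is a few GB):

from gensim.models import KeyedVectors

# Load the pre-trained 300-dimensional Google News vectors (binary format)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.most_similar("cat", topn=5))
# e.g. [('cats', 0.81), ('dog', 0.76), ('kitten', 0.75), ...]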

Cosine Distance between two Vectors

Cosine distance = 1 - cosine similarity. Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90°) = 0 (a NumPy sketch follows).

[Figure: three vectors x, y, z plotted against two components; y and z are more similar under cosine similarity than y and x because the angle between y and z is much smaller.]
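A minimal sketch of cosine similarity with NumPy (the vectors reuse the made-up "cat"/"kitten" example from earlier):

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between u and v: 1 for parallel, 0 for orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat    = np.array([1.5, 1.5, -0.2])
kitten = np.array([1.4, 1.5, -0.2])
print(cosine_similarity(cat, kitten))       # close to 1.0
print(1 - cosine_similarity(cat, kitten))   # cosine distance, close to 0.0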

Examples of Similarity between Words
(Examples from https://code.google.com/archive/p/word2vec/)

[Tables: words ranked by cosine similarity to "France" and by cosine similarity to "San Francisco".]

Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90°) = 0.


Most Similar Vectors to 'cat' from word2vec 300d embeddings

Rank  Word              Cosine Similarity
1     cat               1.00
2     cats              0.81
3     dog               0.76
4     kitten            0.75
5     feline            0.73
6     beagle            0.72
7     puppy             0.71
8     pup               0.69
9     pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65

Most Similar Vectors to 'oxycontin' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     oxycontin           1.00
2     Oxycontin           0.79
3     Oxycodone           0.71
4     OxyContin           0.69
5     morphine_pills      0.67
6     hydrocodone         0.67
7     Lortab              0.67
8     oxycodone           0.65
9     OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59

Most Similar Vectors to 'iran' from word2vec 300d embeddings

Rank  Word         Cosine Similarity
1     iran         1.00
2     pakistan     0.66
3     israel       0.64
4     lebanon      0.62
5     russia       0.62
6     iraq         0.61
7     egypt        0.60
8     Irans        0.59
9     afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55

Most Similar Vectors to 'Iran' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     Iran                1.00
2     Tehran              0.83
3     Iranian             0.82
4     Islamic_republic    0.81
5     Islamic_Republic    0.81
6     Iranians            0.78
7     Teheran             0.76
8     Syria               0.75
9     Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62

Recurrent Neural Networks for Sequential Data

Network Model for Predicting the Next Word in a Sentence

Sentence: The dog saw the cat on the wall.

[Figure: a network sketch. The input is the current word as a binary "one-hot" encoding over the vocabulary (the, dog, saw, cat, on, wall, .), e.g., 1000000 for "the" and 0100000 for "dog"; the hidden unit activations (h1, h2) are real-valued; the target output is the next word, again as a binary one-hot encoding.]

Standard Recurrent Neural Network

Different to a "feedforward" network: the hidden unit at position t now also has input from the hidden vector at time t-1.

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Key points:
- f_W at each position computes the RNN state
- This state is updated recursively from the previous state and the current input

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Key points:
- The x → f and f → h arrows are weight matrices
- The same weight matrices are (usually) used across the sequence (a small numeric sketch follows)

Figures from http://cs231n.stanford.edu/slides/2017/
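A minimal sketch of this state update in NumPy, h_t = tanh(W_hh h_{t-1} + W_xh x_t); the dimensions and random weights are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # h -> h weights
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # x -> h weights

def rnn_step(h_prev, x):
    # the same weight matrices are reused at every position in the sequence
    return np.tanh(W_hh @ h_prev + W_xh @ x)

h = np.zeros(hidden_dim)                    # initial state
for x in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 input vectors
    h = rnn_step(h, x)
print(h)                                    # final hidden state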

Learning the Weights in a Recurrent Network

Learning:
- A loss function compares predictions and targets
- Compute the gradient (based on the error) and update the weights

Figures from http://cs231n.stanford.edu/slides/2017/

Example: Prediction of Sequences of Characters

Figures from http://cs231n.stanford.edu/slides/2017/

Simulating Sequences from an RNN

[Figures: step-by-step generation of a character sequence by sampling from the RNN's output distribution and feeding each sample back in as the next input, from http://cs231n.stanford.edu/slides/2017/]

Output from an RNN Model Trained on Shakespeare

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

Examples from "The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Output from an RNN Model Trained on Cooking Recipes

From https://gist.github.com/nylki/1efbaa36635956d35bcc

Different Recurrent Network Models

[Figure: RNN input/output configurations and example tasks: image classification (one-to-one), image captioning (one-to-many), sentiment analysis (many-to-one), machine translation (many-to-many), and synced sequences, e.g., video classification (aligned many-to-many).]

Part-of-Speech Tagging

Part of Speech (POS) Tagging

• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection

• However, there are many more specialized categories
  – e.g., proper nouns: 'Toronto', 'Smith', …
  – e.g., comparative adjectives: 'bigger', 'smaller', …
  – e.g., symbols: '3.12', '$', …

• Assigning POS categories to words in text is known as tagging

Universal Tagset (as used in NLTK)

12 universal tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation

See "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/

Universal Tagset (used in NLTK)

Tag   Meaning              English Examples
ADJ   adjective            new, good, high, special, big, local
ADP   adposition           on, of, at, with, by, into, under
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner, article  the, a, some, most, every, no, which
NOUN  noun                 year, home, costs, time, Africa
NUM   numeral              twenty-four, fourth, 1991, 14:24
PRT   particle             at, on, out, over, per, that, up, with
PRON  pronoun              he, their, her, its, my, I, us
VERB  verb                 is, say, told, given, playing, would
.     punctuation marks    . , ; !
X     other                ersatz, esprit, dunno, gr8, univeristy

(from Section 2.3 in Chapter 5 of the NLTK Book)

POS Tagging Algorithms

• Tagging Algorithms: "POS Taggers"
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on the words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
    • trained on manually labeled documents

• Tagging performance is best on material similar to what the tagger was originally trained on (often news documents)

• Tags can be helpful "downstream" for various applications
  – e.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – e.g., for information extraction we might focus only on nouns

Challenges in POS Tagging

• Tagging words in text with their correct POS tags is not simply a matter of assigning words to tags using a lookup table

• Semantic context
  – "The negotiator was able to bridge the gap between the 2 sides"
  – Here 'bridge' is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret 'bridge' here as a verb

• Ambiguity, e.g.,
  – "The president was entertaining last night"
    • Both the adjective and verb tags for "entertaining" work here, i.e., there is ambiguity

• Tokenization issues
  – The algorithm must be able to deal with tokens such as "I'ld" or "pre-specified"

Software for Part of Speech Tagging

• Many software packages and online tools are available

• NLTK POS Tagger

• The Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc.)
  – http://nlp.stanford.edu/software/tagger.shtml
  – Uses the Penn Treebank tagset

• Online demo
  – http://demo.ark.cs.cmu.edu/parse

Example: Removing Stop Words with a POS Tagger

• Filter out any term that does not belong to a (user-defined) target set of POS classes

SPEAKER  WORD    POS TAG
PATIENT  if      IN   (Preposition)
PATIENT  you     PRP  (Personal Pronoun)
PATIENT  had     VBD  (Verb, Past Tense)
PATIENT  a       DT   (Determiner)
PATIENT  magic   JJ   (Adjective)
PATIENT  potion  NN   (Noun)
PATIENT  i       PRP  (Personal Pronoun)
PATIENT  'd      MD   (Modal)
PATIENT  love    VB   (Verb, base form)
PATIENT  to      TO
PATIENT  have    VB   (Verb, base form)
PATIENT  it      PRP  (Personal Pronoun)

Example rule: extract adjectives and nouns only (a sketch follows).
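A minimal sketch of this filtering with NLTK's pos_tag (assumes the punkt, averaged_perceptron_tagger, and universal_tagset resources have been downloaded via nltk.download):

import nltk

text = "If you had a magic potion I'd love to have it."
tagged = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')

# Example rule from the slide: keep adjectives and nouns only
keep = {'ADJ', 'NOUN'}
print([word for word, tag in tagged if tag in keep])
# expect something like ['magic', 'potion']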