Stats 170A: Project in Data Science
Text Analysis and Classification

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)

Reading, Homework, Lectures

• Reference reading
  – Python: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

• Homework 7
  – Due by 2pm Wednesday next week

• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects

Room for improvement….

Predicting the Sentiment of Tweets

"wow :o this was included in the playlist #awesome"  →  positive
"I could not be happier with my life right now"  →  positive
"HAPPY BIRTHDAY TO THE PANDERIFIC PANDA IK. @TheOrionSound"  →  positive
"I'm tired as hell I never get a off day during the week anymore"  →  negative
"The movie was really dull and stupid"  →  negative
"@washingtonpost @silvajanes How awful!!!!!!"  →  negative

"Documents"  →  Class Labels

Predicting the Scores of Movie Reviews

"I liked this movie. Well done. Great acting and direction"  →  4
"Terrific movie, saw it 3 times. Loved the scenes in Mexico."  →  5
"Boring as hell. Who gets paid to create this stuff?"  →  1
"Not one of the director's best efforts. Tom Cruise was terrible."  →  2

"Documents"  →  Ratings

Entity Recognition

The same review documents as above; here the task is to recognize entity mentions in the text (e.g., "Mexico", "Tom Cruise") rather than to predict the ratings.

Examples of User Location Fields in Twitter

User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL


Basic Concepts in Text Representation

Representing Text Documents Numerically

• How do we represent text (strings) for input to machine learning algorithms (which generally require numeric representations)?

• Define a vocabulary = a set of words or terms

• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary

• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in "embedding space"

• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words

Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1    24        21   9      0         0           3         1
d2    32        10   5      0         3           0         1
d3    12        16   5      0         0           0         1
d4    6         7    2      0         0           0         1
d5    43        31   20     0         3           0         1
d6    2         0    0      18        7           16        2
d7    0         0    1      32        12          0         2
d8    3         0    0      22        4           2         2
d9    1         0    0      34        27          25        2
d10   6         0    0      17        4           23        2

Concepts and Terminology

• Document:
  – A collection of words
  – A book, a news article, a report, a Web page, an email, a tweet, etc.
  – May contain both text and metadata.
  – Examples of metadata: author name(s), date, where published, etc.

Note that the definition of a document is flexible: e.g., a book could be a single document, or each section of a book could be considered a "document".

• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017

Concepts and Terminology

• Tokens:
  – Groupings of characters in the raw text
  – Individual words (or "word types") + possibly numbers, punctuation, etc.

• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as "cat", "dog", terms such as "U.S." or "92697"
  – Can also include bigrams ("San Diego"), trigrams ("New York City"), etc.

• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary

Typical Pipeline for Building a Text Classifier

Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier

Note that the steps in the pipeline, and how these are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem. A minimal sketch of such a pipeline in scikit-learn follows.
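The following is a minimal sketch of this pipeline using scikit-learn's CountVectorizer (see the reference reading above); the toy documents are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
docs = ["The dog chased the cat and the mouse.",
        "Why did the dog do this?"]

# CountVectorizer performs tokenization, lowercasing, stop word removal,
# and bag-of-words creation in one step
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)         # document-term count matrix
print(vectorizer.get_feature_names_out())  # the vocabulary that was defined
print(X.toarray())                         # counts, one row per document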

Example of a Text Analysis Pipeline (for Information Extraction)

[Figure: the information extraction pipeline from http://www.nltk.org/book/ch07.html: raw text → sentence segmentation → tokenization → part-of-speech tagging → entity detection → relation detection]

Example

Document = 'Chapter 1: The Beginning. In the beginning, life was tough! ……….'

Tokens = {'chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', …}
(Punctuation and whitespace are usually ignored)

Vocabulary = {'chapter', '1', 'the', 'beginning', 'in', 'life', 'was', 'tough', …}

Bag of Words = {['chapter', 1], ['1', 1], ['the', 2], ['beginning', 2], …}


Tokenization

• Split up text into individual tokens (words/terms)

• The simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens (a sketch follows below):

Text:   If you had a magic potion I'd love to have it.
Tokens: If  you  had  a  magic  potion  I'd  love  to  have  it
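A minimal sketch of this simplest approach with a regular expression (the pattern keeps an internal apostrophe, an assumption so that "I'd" survives as a single token):

import re

text = "If you had a magic potion I'd love to have it."

# Keep runs of alphabetic characters (allowing one internal apostrophe);
# all punctuation and whitespace act as separators
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
print(tokens)
# ['If', 'you', 'had', 'a', 'magic', 'potion', "I'd", 'love', 'to', 'have', 'it']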

Issues in Tokenization

• Finland's capital → Finland? Finlands? Finland's?
• what're, I'm, isn't → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??

From Speech and Language Processing, 3rd ed., Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/

Tokenization Software

• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., the tokenizer functions in NLTK
  – e.g., the tokenizer from Stanford's natural language group

• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • data with the original sequence and formatting, the tokenized list, the bag of words, etc.
  – The sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:

    If you had a magic potion I'd love to have it . If that makes sense

    If we decide to extract n-grams later on, we know that "it" and "If" should not be combined, so it's useful to retain this information.

Sentence Detection: Example using a Decision Tree

[Figure: a decision tree for deciding whether a period ends a sentence, from Speech and Language Processing, 3rd ed., Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/]

Example: Tokenization of Tweets

There are special-purpose tokenizers for different types of text, e.g., for Tweets:

from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tknzr = TweetTokenizer()

# tweets is assumed to be a pandas DataFrame of tweets with a 'text' column
for i in range(10):
    print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
    print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))

Sample Output from the 2 Tokenizers

NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…']
NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']

NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89']
NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']

NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…']
NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']

Example: Defining a Vocabulary

Raw text (a string in Python):
raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this

The vocabulary (the unique tokens, normalizing to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.

The counts for a bag-of-words representation are:
the (3), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)

If we remove stop words we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)

A small sketch of these steps in plain Python follows.
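A minimal sketch of these steps with collections.Counter (the stop word set here is a hand-picked assumption for illustration; a real application would use a fuller list such as NLTK's):

import re
from collections import Counter

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

tokens = re.findall(r"[a-z]+", raw1.lower())   # tokenize and lowercase
counts = Counter(tokens)                       # bag-of-words counts

print(len(counts))   # vocabulary size: 10
print(counts)        # Counter({'the': 3, 'dog': 2, 'chased': 1, ...})

# Illustrative stop word list (an assumption; NLTK provides a fuller one)
stop_words = {'the', 'and', 'why', 'did', 'do', 'this'}
content_counts = {w: c for w, c in counts.items() if w not in stop_words}
print(content_counts)   # {'dog': 2, 'chased': 1, 'cat': 1, 'mouse': 1}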

Defining the Vocabulary (after Tokenization)

• Vocabulary
  – The set of terms (words) used to construct the document-term matrix

• Basic approach: use single words (unigrams) as terms

• Remove very common terms (e.g., stop words)

• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc.

• Can extend the term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …), e.g., "feel good" / "New York City"

Frequency of Words in English Usage

[Figure: word frequency plotted against frequency rank, showing a very long "tail" of rare words. Graph from www.prismnet.com/~dierdorf/wordfrequency.png]

Stop Word Removal

• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)

Example:

If you had a magic potion I'd love to have it . If that makes sense

After stop word removal only content words such as "magic", "potion", "love", "makes", "sense" remain.

But what about this?

[Prince Hamlet] To be or not to be ...

Here nearly every word is on a typical stop word list, yet the phrase is meaningful: stop word removal can discard important content.

NLTK Stop Word List

[Figure: the NLTK English stop word list; see the sketch below for how to access it.]

Note: in many applications there may be additional domain-specific "stop words" that are very common and that we may want to remove since they contain little information, e.g., the term "restaurant" in reviews.
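A minimal sketch for inspecting NLTK's built-in stop word list (assumes the 'stopwords' corpus has been downloaded):

import nltk
# nltk.download('stopwords')   # one-time download, if not already present
from nltk.corpus import stopwords

stops = stopwords.words('english')
print(len(stops))    # size of the list (179 in recent NLTK versions)
print(stops[:10])    # ['i', 'me', 'my', 'myself', 'we', 'our', ...]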

N-grams

• Useful n-grams are groups of n words that commonly appear in sequence
  – "Computer Science"
  – "Los Angeles"
  – "New York City"
  – Note that we can have both terms like "computer" and "computer science", e.g.,
    "I am studying computer science at UCI and I bought a new computer last week."

• Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, "wait time" may be more informative than "time"

Automatically Finding N-grams

• e.g., Bigrams
  – Keep track of all unique pairs that occur sequentially (excluding stop words):

    "I visited Microsoft Research for a computer science job interview last week."
    "* visited Microsoft Research * * computer science job interview last week."

  – Rank by frequency of occurrence (or the number of docs a bigram appears in)
  – Can also rank by other metrics

• Keep the top K in the vocabulary
  – How large should K be? Depends on the application, amount of data, etc.
  – Might need to search over different values of K (e.g., using cross-validation)

• Same idea for tri-grams, 4-grams, etc.

• See also the collocations how-to in NLTK: http://www.nltk.org/howto/collocations.html (a sketch is shown below)
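A minimal sketch of bigram finding with NLTK's collocation tools (the toy token list is made up; on a real corpus you would pass the tokens of all documents):

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

tokens = ("i visited microsoft research for a computer science "
          "job interview last week").split()

finder = BigramCollocationFinder.from_words(tokens)

# Exclude bigrams that contain a stop word
stops = set(stopwords.words('english'))
finder.apply_word_filter(lambda w: w in stops)

# raw_freq ranks bigrams by frequency of occurrence; keep the top K
print(finder.nbest(BigramAssocMeasures.raw_freq, 5))
# e.g. [('computer', 'science'), ('microsoft', 'research'), ...] (all tied here)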


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1    24        21   9      0         0           3         1
d2    32        10   5      0         3           0         1
d3    12        16   5      0         0           0         1
d4    6         7    2      0         0           0         1
d5    43        31   20     0         3           0         1
d6    2         0    0      18        7           16        2
d7    0         0    1      32        12          0         2
d8    3         0    0      22        4           2         2
d9    1         0    0      34        27          25        2
d10   6         0    0      17        4           23        2

Use labeled training data (supervised learning) to learn a classifier (Homework Assignment 7).

Overview of Text Classification

Text Classification

• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories

• Data Representation
  – "Bag of words/terms" most commonly used: either counts or binary
  – Can also use other weightings and additional features (e.g., metadata)

• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods

Example of Document-by-Term Matrix

      predict  finance  stocks  goal  score  team  Class Label
d1    3        1        4       0     0      1     1
d2    2        2        1       0     1      0     1
d3    1        1        2       0     0      0     1
d4    1        2        0       0     0      0     1
d5    3        1        1       0     1      0     1
d6    1        0        0       1     3      1     2
d7    0        0        1       4     1      0     2
d8    2        0        0       1     2      1     2
d9    1        0        0       2     1      2     2
d10   1        0        0       1     3      1     2

Possible Weights for a Linear Classifier

        predict  finance  stocks  goal  score  team  Class Label
Weight  0.1      2.5      1.0     -2.0  -0.2   -1.5
d1      3        1        4       0     0      1     1
d2      2        2        1       0     1      0     1
d3      1        1        2       0     0      0     1
d4      1        2        0       0     0      0     1
d5      3        1        1       0     1      0     1
d6      1        0        0       1     3      1     2
d7      0        0        1       4     1      0     2
d8      3        0        0       1     2      1     2
d9      1        0        0       2     1      2     2
d10     1        0        0       1     3      1     2

With these weights, e.g., d1 scores 3(0.1) + 1(2.5) + 4(1.0) + 1(-1.5) = 5.3 and d7 scores 1(1.0) + 4(-2.0) + 1(-0.2) = -7.2, so positive scores indicate class 1 and negative scores indicate class 2.

Real Example from Yelp Data

Yelp Dataset
Number of Reviews:                     706,693
Number of Reviews w/o Neutral Rating:  595,468
Number of Tokens:                      85,392,376
Vocabulary Size w/o Stopwords:         176,114
Array Dimensions:                      (595468, 176114)
Number of cells in the Array:          104,870,251,352
Non-zero entries:                      28,357,001
Density:                               0.0027%

Example of a Pipeline for Document Classification

Training Documents (corpus)
→ Tokenization → Lists of Tokens
→ Stop word and rare word removal (using frequency counts) → Vocabulary
→ Bag of Words
→ Machine Learning Algorithm → Document Classifier

Example of a Pipeline for Document Classification

Training:
Training Documents (corpus) → Tokenization → Lists of Tokens → Stop word and rare word removal (using frequency counts) → Vocabulary → Bag of Words → Machine Learning Algorithm → Document Classifier

Prediction:
New Document → Tokenization → Lists of Tokens → Bag of Words (using the same Vocabulary) → Document Classifier → Label Prediction

Key Steps in Document Analysis Pipelines (for Bag of Words)

• Tokenization
  – Various options (e.g., with punctuation, non-alphanumeric symbols, etc.)

• Vocabulary definition
  – N-grams, stop word removal, rare word removal, stemming

• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)

• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc.

TF-IDF Weighting of Features

In practice the inputs can be weighted; it can be helpful to use "TF-IDF weights" instead of raw counts.

TF(t, d) = term frequency = the number of times term t occurs in document d

IDF(t) = inverse document frequency = log(N / number of docs containing term t)
(where N = total number of docs in the corpus)

TF-IDF(t, d) = TF(t, d) * IDF(t)

The IDF term has the effect of upweighting terms that occur in few docs (a scikit-learn sketch follows).
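A minimal sketch in scikit-learn; note that TfidfVectorizer implements a smoothed variant of the formula above (by default a natural log with add-one smoothing, followed by length-normalizing each row), so the numbers differ from a hand computation but the effect is the same:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the city mayor spoke downtown",           # toy corpus, for illustration
        "traffic on the freeway into the city",
        "the city budget vote"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # rows = documents, columns = TF-IDF weights
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # "city" (in every doc) gets a low weight;
                              # "freeway" (in one doc) gets a high weight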

TF-IDF Example

N = 1000 in a corpus of news articles

Term 1: t = "city", appears in 500 documents
IDF(t) = log(1000/500) = log(2) = 1 (log is base 2; the base is not important)

Term 2: t = "freeway", appears in 10 documents
IDF(t) = log(1000/10) = log(100) = 6.64

So occurrences of "freeway" will get upweighted by a factor of 6.64 compared to occurrences of "city".

Classification Methods for Text

• Logistic regression: widely used for bag-of-words
  – The input dimensionality (vocabulary size) is high (e.g., 5k to 500k), so regularization is useful

• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information

A sketch of a bag-of-words logistic regression classifier is shown below.
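A minimal sketch of a bag-of-words logistic regression classifier in scikit-learn (the tiny labeled corpus is made up; the eli5 tutorial linked below works through a real example with model inspection):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["loved the movie great acting",        # toy labeled corpus
        "terrific film saw it twice",
        "boring and dull script",
        "terrible acting not worth seeing"]
labels = [1, 1, 0, 0]                          # 1 = positive, 0 = negative

# Regularization strength is set by C (smaller C = stronger penalty), which
# matters when the input dimensionality (vocabulary size) is large
clf = make_pipeline(CountVectorizer(), LogisticRegression(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["a great film"]))           # expect [1]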

[Figures: worked example of inspecting a scikit-learn text classifier with the eli5 package, from http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html and https://github.com/TeamHG-Memex/eli5]

Sentiment Analysis

A very useful tutorial, with many practical tips and ideas, is online at http://sentiment.christopherpotts.net/

Example: Predicting if Review Text is Positive or Negative

• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model
  (can also combine both approaches)

• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • "This movie is not depressing the way I thought it would be. I don't think I have anything negative to say about it."
    • "I wasn't disappointed at all with the acting – in fact the acting was excellent."
  – Negative reviews:
    • "Sitting through this movie was not something I enjoyed doing"
    • "This wasn't a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold"

Negation Marking (during Tokenization)

• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark (a sketch follows below)

From http://sentiment.christopherpotts.net/lingstruc.html
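A minimal sketch of this idea (the negation and punctuation patterns are simplified assumptions; Potts' page gives fuller regular expressions, and NLTK ships a similar utility as nltk.sentiment.util.mark_negation):

import re

NEGATION = re.compile(r"(?:^(?:never|no|not|cannot)$|n't$)")
CLAUSE_PUNCT = re.compile(r"^[.:;!?,]$")

def mark_negation(tokens):
    # Append _NEG to every token between a negation word and the next
    # clause-level punctuation mark
    out, negated = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.match(tok):
            negated = False
            out.append(tok)
        elif negated:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if NEGATION.search(tok):
                negated = True
    return out

print(mark_negation("I did n't enjoy it at all , but the acting was great".split()))
# ['I', 'did', "n't", 'enjoy_NEG', 'it_NEG', 'at_NEG', 'all_NEG', ',', 'but', ...]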

Examples of Negation Marking

[Figure: marked-up example sentences from http://sentiment.christopherpotts.net/lingstruc.html]

Improvements in Classification Accuracy

[Figure: classification accuracy gains from negation marking, from http://sentiment.christopherpotts.net/lingstruc.html]

Sentiment Lexicons (Dictionaries)

• Counts, percentages, and weights of words in various categories
  – e.g., the degree to which a word tends to be positive or negative

• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc.
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon

Yelp Challenge Data Set

Jupyter Notebook Example with the Yelp Dataset

Vector Representations for Text

Vector Representations allow Generalization

• Example:
  Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
  Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
  Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]

• If the word "cat" occurs in a document then we know that other words like "kitten" and "pet" are similar

• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, a neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vector components, then words that are close together will produce similar outputs
  – This can help with generalization

Another Example of Word Embedding

Publicly-Available Pre-Trained Word Embeddings

• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase

• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets

• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model (see the loading sketch below)
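A minimal sketch of loading the pre-trained Google News vectors with gensim (assumes you have downloaded GoogleNews-vectors-negative300.bin from the Word2Vec page above; the uncompressed file is a few GB):

from gensim.models import KeyedVectors

# Load the pre-trained 300-dimensional Google News vectors (binary format)
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.most_similar("cat", topn=5))
# e.g. [('cats', 0.81), ('dog', 0.76), ('kitten', 0.75), ...]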

Cosine Distance between two Vectors

Cosine distance = 1 - cosine similarity. Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90°) = 0 (a NumPy sketch follows).

[Figure: three vectors x, y, z plotted against two components; y and z are more similar under cosine similarity than y and x because the angle between y and z is much smaller.]
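A minimal sketch of cosine similarity with NumPy (the vectors reuse the made-up "cat"/"kitten" example from earlier):

import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between u and v: 1 for parallel, 0 for orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat    = np.array([1.5, 1.5, -0.2])
kitten = np.array([1.4, 1.5, -0.2])
print(cosine_similarity(cat, kitten))       # close to 1.0
print(1 - cosine_similarity(cat, kitten))   # cosine distance, close to 0.0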

Examples of Similarity between Words
(Examples from https://code.google.com/archive/p/word2vec/)

[Tables: words ranked by cosine similarity to "France" and by cosine similarity to "San Francisco".]

Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90°) = 0.


Most Similar Vectors to 'cat' from word2vec 300d embeddings

Rank  Word              Cosine Similarity
1     cat               1.00
2     cats              0.81
3     dog               0.76
4     kitten            0.75
5     feline            0.73
6     beagle            0.72
7     puppy             0.71
8     pup               0.69
9     pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65

Most Similar Vectors to 'oxycontin' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     oxycontin           1.00
2     Oxycontin           0.79
3     Oxycodone           0.71
4     OxyContin           0.69
5     morphine_pills      0.67
6     hydrocodone         0.67
7     Lortab              0.67
8     oxycodone           0.65
9     OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59

Most Similar Vectors to 'iran' from word2vec 300d embeddings

Rank  Word         Cosine Similarity
1     iran         1.00
2     pakistan     0.66
3     israel       0.64
4     lebanon      0.62
5     russia       0.62
6     iraq         0.61
7     egypt        0.60
8     Irans        0.59
9     afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55

Most Similar Vectors to 'Iran' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     Iran                1.00
2     Tehran              0.83
3     Iranian             0.82
4     Islamic_republic    0.81
5     Islamic_Republic    0.81
6     Iranians            0.78
7     Teheran             0.76
8     Syria               0.75
9     Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62

Recurrent Neural Networks for Sequential Data

Network Model for Predicting the Next Word in a Sentence

Sentence: The dog saw the cat on the wall.

[Figure: a network sketch. The input is the current word as a binary "one-hot" encoding over the vocabulary (the, dog, saw, cat, on, wall, .), e.g., 1000000 for "the" and 0100000 for "dog"; the hidden unit activations (h1, h2) are real-valued; the target output is the next word, again as a binary one-hot encoding.]

Standard Recurrent Neural Network

Different to a "feedforward" network: the hidden unit at position t now also has input from the hidden vector at time t-1.

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Key points:
- f_W at each position computes the RNN state
- This state is updated recursively from the previous state and the current input

Figures from http://cs231n.stanford.edu/slides/2017/

State Computation in a Neural Network

Key points:
- The x → f and f → h arrows are weight matrices
- The same weight matrices are (usually) used across the sequence (a small numeric sketch follows)

Figures from http://cs231n.stanford.edu/slides/2017/
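A minimal sketch of this state update in NumPy, h_t = tanh(W_hh h_{t-1} + W_xh x_t); the dimensions and random weights are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # h -> h weights
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # x -> h weights

def rnn_step(h_prev, x):
    # the same weight matrices are reused at every position in the sequence
    return np.tanh(W_hh @ h_prev + W_xh @ x)

h = np.zeros(hidden_dim)                    # initial state
for x in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 input vectors
    h = rnn_step(h, x)
print(h)                                    # final hidden state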

Learning the Weights in a Recurrent Network

Learning:
- A loss function compares predictions and targets
- Compute the gradient (based on the error) and update the weights

Figures from http://cs231n.stanford.edu/slides/2017/

Example: Prediction of Sequences of Characters

Figures from http://cs231n.stanford.edu/slides/2017/

Simulating Sequences from an RNN

[Figures: step-by-step generation of a character sequence by sampling from the RNN's output distribution and feeding each sample back in as the next input, from http://cs231n.stanford.edu/slides/2017/]

Output from an RNN Model Trained on Shakespeare

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

Examples from "The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Output from an RNN Model Trained on Cooking Recipes

From https://gist.github.com/nylki/1efbaa36635956d35bcc

Different Recurrent Network Models

[Figure: RNN input/output configurations and example tasks: image classification (one-to-one), image captioning (one-to-many), sentiment analysis (many-to-one), machine translation (many-to-many), and synced sequences, e.g., video classification (aligned many-to-many).]

Part-of-Speech Tagging

Part of Speech (POS) Tagging

• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection

• However, there are many more specialized categories
  – e.g., proper nouns: 'Toronto', 'Smith', …
  – e.g., comparative adjectives: 'bigger', 'smaller', …
  – e.g., symbols: '3.12', '$', …

• Assigning POS categories to words in text is known as tagging

Universal Tagset (as used in NLTK)

12 universal tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ - adjectives
ADV - adverbs
ADP - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET - determiners
NUM - cardinal numbers
PRT - particles or other function words
X - other: foreign words, typos, abbreviations
. - punctuation

See "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/

Universal Tagset (used in NLTK)

Tag   Meaning              English Examples
ADJ   adjective            new, good, high, special, big, local
ADP   adposition           on, of, at, with, by, into, under
ADV   adverb               really, already, still, early, now
CONJ  conjunction          and, or, but, if, while, although
DET   determiner, article  the, a, some, most, every, no, which
NOUN  noun                 year, home, costs, time, Africa
NUM   numeral              twenty-four, fourth, 1991, 14:24
PRT   particle             at, on, out, over, per, that, up, with
PRON  pronoun              he, their, her, its, my, I, us
VERB  verb                 is, say, told, given, playing, would
.     punctuation marks    . , ; !
X     other                ersatz, esprit, dunno, gr8, univeristy

(from Section 2.3 in Chapter 5 of the NLTK Book)

POS Tagging Algorithms

• Tagging Algorithms: "POS Taggers"
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on the words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
    • trained on manually labeled documents

• Tagging performance is best on material similar to what the tagger was originally trained on (often news documents)

• Tags can be helpful "downstream" for various applications
  – e.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – e.g., for information extraction we might focus only on nouns

Challenges in POS Tagging

• Tagging words in text with their correct POS tags is not simply a matter of assigning words to tags using a lookup table

• Semantic context
  – "The negotiator was able to bridge the gap between the 2 sides"
  – Here 'bridge' is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret 'bridge' here as a verb

• Ambiguity, e.g.,
  – "The president was entertaining last night"
    • Both the adjective and verb tags for "entertaining" work here, i.e., there is ambiguity

• Tokenization issues
  – The algorithm must be able to deal with tokens such as "I'ld" or "pre-specified"

Software for Part of Speech Tagging

• Many software packages and online tools are available

• NLTK POS Tagger

• The Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc.)
  – http://nlp.stanford.edu/software/tagger.shtml
  – Uses the Penn Treebank tagset

• Online demo
  – http://demo.ark.cs.cmu.edu/parse

Example: Removing Stop Words with a POS Tagger

• Filter out any term that does not belong to a (user-defined) target set of POS classes

SPEAKER  WORD    POS TAG
PATIENT  if      IN   (Preposition)
PATIENT  you     PRP  (Personal Pronoun)
PATIENT  had     VBD  (Verb, Past Tense)
PATIENT  a       DT   (Determiner)
PATIENT  magic   JJ   (Adjective)
PATIENT  potion  NN   (Noun)
PATIENT  i       PRP  (Personal Pronoun)
PATIENT  'd      MD   (Modal)
PATIENT  love    VB   (Verb, base form)
PATIENT  to      TO
PATIENT  have    VB   (Verb, base form)
PATIENT  it      PRP  (Personal Pronoun)

Example rule: extract adjectives and nouns only (a sketch follows).
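A minimal sketch of this filtering with NLTK's pos_tag (assumes the punkt, averaged_perceptron_tagger, and universal_tagset resources have been downloaded via nltk.download):

import nltk

text = "If you had a magic potion I'd love to have it."
tagged = nltk.pos_tag(nltk.word_tokenize(text), tagset='universal')

# Example rule from the slide: keep adjectives and nouns only
keep = {'ADJ', 'NOUN'}
print([word for word, tag in tagged if tag in keep])
# expect something like ['magic', 'potion']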