3/17/09
Text Processing
CISC489/689-010, Lecture #3, Monday, Feb. 16
Ben Carterette
Indexing
• An index is a list of things (keys) with pointers to other things (items).
  – Keywords → catalog numbers (shelves).
  – Concepts → page numbers.
  – Terms → documents.
• Need for indexes:
  – Ease of use.
  – Speed.
  – Scalability.
Manual vs. Automatic Indexing
• Manual:
  – An “expert” assigns keys to each item.
  – Example: card catalog.
• Automatic:
  – Keys automatically identified and assigned.
  – Example: Google.
• Automatic as good as manual for most purposes.
Text Processing
• First step in automatic indexing.
• Converting documents into index terms.
• Terms are not just words.
  – Not all words are of equal value in a search.
  – Sometimes not clear where words begin and end.
    • Especially when not space-separated, e.g. Chinese, Korean.
  – Matching the exact words typed by the user doesn’t work very well in terms of effectiveness.
Text Processing Steps
• For each document:
  – Parse it to locate the parts that are important.
  – Segment and tokenize the text in the important parts to get words.
  – Remove stop words.
  – Stem words to common roots.
• Advanced processing may include phrases, entity tagging, link-graph features, and more.
Parsing
• Some parts of a document are more important than others.
• Document parser recognizes structure using markup such as HTML tags.
  – Headers, anchor text, bolded text are likely to be important.
  – JavaScript, style information, navigation links less likely to be important.
  – Metadata can also be important.
Example Wikipedia Page
Wikipedia Markup
<title>Tropical fish</title>
<text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.
…
Wikipedia HTML
Document Parsing
• HTML pages organize into trees.
<HTML>
  <HEAD>
    <TITLE> Tropical fish
    <META>
  <BODY>
    <H1> Tropical fish
    <P>
      <B> Tropical fish
      <A> fish
      <A> tropical
      include found in environments around the world
• Nodes contain blocks of text.
End Result of Parsing
• Blocks of text from important parts of page.
  – Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Fishkeepers often use the term “tropical fish” to refer only those requiring fresh water, with saltwater tropical fish referred to as “marine fish”.
• Next step: segmenting and tokenizing.
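One possible way to get from the Wikipedia markup to a plain text block is sketched below. It is only an illustration, not the parser the course expects: the class name `WikiMarkupStripper` is hypothetical, templates are assumed not to be nested, and the markup is assumed to use straight apostrophes for bold and italics.

```java
import java.util.regex.Pattern;

class WikiMarkupStripper {
    // Templates such as {{Unreferenced|date=July 2008}} (non-nested only, by assumption).
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // Piped links such as [[fresh water|freshwater]]: keep the display text after the bar.
    private static final Pattern PIPED_LINK = Pattern.compile("\\[\\[[^\\]|]*\\|([^\\]]*)\\]\\]");
    // Plain links such as [[fish]]: keep the link text itself.
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\]|]*)\\]\\]");
    // Runs of two or more apostrophes mark italics or bold.
    private static final Pattern QUOTE_RUN = Pattern.compile("'{2,}");

    static String strip(String markup) {
        String text = TEMPLATE.matcher(markup).replaceAll("");
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = QUOTE_RUN.matcher(text).replaceAll("");
        // Collapse the whitespace left behind by removed markup.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "{{Unreferenced|date=July 2008}} '''Tropical fish''' include [[fish]] "
                + "found in [[Tropics|tropical]] environments"));
    }
}
```

A real parser would also record which text came from links or titles, since those fields may be weighted differently from the body.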
Tokenizing
• Forming words from sequence of characters in blocks of text.
• Surprisingly complex in English, can be harder in other languages.
• Early IR systems:
  – Any sequence of alphanumeric characters of length 3 or more.
  – Terminated by a space or other special character.
  – Upper-case changed to lower-case.
Tokenizing
• Example:
  – “Bigcorp's 2007 bi-annual report showed profits rose 10%.” becomes
  – “bigcorp 2007 annual report showed profits rose”
• Too simple for search applications or even large-scale experiments
• Why? Too much information lost
  – Small decisions in tokenizing can have major impact on effectiveness of some queries
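The early-IR rule can be written in a few lines, which makes it easy to see what gets lost. This is a minimal sketch assuming ASCII text; the class name `EarlyTokenizer` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class EarlyTokenizer {
    /** Keeps alphanumeric runs of length 3 or more, lower-cased. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Any character that is not a letter or digit ends the current token.
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (piece.length() >= 3) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."));
    }
}
```

On the example sentence, the possessive "s", the prefix "bi", and the number "10" all fall under the length cutoff, which is exactly the information loss the slide warns about.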
Tokenizing Problems
• Small words can be important in some queries, usually in combinations
  – xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
• Both hyphenated and non-hyphenated forms of many words are common
  – Sometimes hyphen is not needed
    • e-bay, wal-mart, active-x, cd-rom, t-shirts
  – At other times, hyphens should be considered either as part of the word or a word separator
    • winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
Tokenizing Problems
• Special characters are an important part of tags, URLs, code in documents
• Capitalized words can have different meaning from lower case words
  – Bush, Apple
• Apostrophes can be a part of a word, a part of a possessive, or just a mistake
  – rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
Tokenizing Problems
• Numbers can be important, including decimals
  – nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
• Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
  – I.B.M., Ph.D., cis.udel.edu
• Note: tokenizing steps for queries must be identical to steps for documents
Tokenizing Process
• Assume we have used the parser to find blocks of important text.
• A word may be any sequence of alphanumeric characters terminated by a space or special character.
  – everything converted to lower case.
  – everything indexed.
• Defer complex decisions to other components
  – example: 92.3 → 92 3 but search finds documents with 92 and 3 adjacent
  – incorporate some rules to reduce dependence on query transformation components
End Result of Tokenization
• List of words in blocks of text.
  – tropical fish include fish found in tropical environments around the world including both freshwater and salt water species fishkeepers often use the term tropical fish to refer only those requiring fresh water with saltwater tropical fish referred to as marine fish
• Next step: stopping.
• But first: text statistics.
Text Statistics
• Huge variety of words used in text, but
• Many statistical characteristics of word occurrences are predictable
  – e.g., distribution of word counts
• Retrieval models and ranking algorithms depend heavily on statistical properties of words
  – e.g., important words occur often in documents but are not high frequency in collection
Zipf’s Law
• Distribution of word frequencies is very skewed
  – a few words occur very often, many words hardly ever occur
  – e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
• Zipf’s “law”:
  – observation that rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$)
    • assuming words are ranked in order of decreasing frequency
  – i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is probability of word occurrence and $c \approx 0.1$ for English
Zipf’s Law
Wikipedia Statistics (wiki000 subset)
Total documents                  5,001
Total word occurrences           22,545,922
Vocabulary size                  348,436
Words occurring > 1000 times     2,751
Words occurring once             163,404

Word         Freq    r         Pr (%)       r·Pr
politician   5,096   510       0.023        0.116
contractor   100     14,852    4.4·10^-4    0.066
kickboxer    10      56,125    4.4·10^-5    0.025
comdedian    1       185,035   4.4·10^-6    0.008
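The last column of the table can be recomputed from the other columns as a sanity check. The sketch below assumes Pr is simply the word's frequency divided by the total number of word occurrences; the class and method names are illustrative only.

```java
class ZipfCheck {
    /** Returns r * Pr, where Pr = freq / totalOccurrences (a probability, not a percentage). */
    static double rankTimesProbability(long rank, long freq, long totalOccurrences) {
        double probability = (double) freq / totalOccurrences;
        return rank * probability;
    }

    public static void main(String[] args) {
        long total = 22_545_922L; // total word occurrences in the wiki000 subset
        String[] words = {"politician", "contractor", "kickboxer"};
        long[] freqs = {5096, 100, 10};
        long[] ranks = {510, 14_852, 56_125};
        for (int i = 0; i < words.length; i++) {
            System.out.printf("%-12s r*Pr = %.3f%n", words[i],
                    rankTimesProbability(ranks[i], freqs[i], total));
        }
    }
}
```

The computed values should land close to the table's r·Pr column, with small differences from rounding. Note how the product drifts away from c ≈ 0.1 for rare words, where Zipf's law fits less well.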
Top 50 Words from wiki000 Subset
Zipf’s Law for wiki000 Subset
[Figure: probability vs. rank for the wiki000 subset]
Zipf’s Law
• What is the proportion of words with a given frequency?
  – Word that occurs $n$ times has rank $r_n = k/n$
  – Number of words with frequency $n$ is
    • $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
  – Proportion found by dividing by total number of words = highest rank = $k$
  – So, proportion with frequency $n$ is $1/(n(n+1))$
Zipf’s Law
• Example word frequency ranking
• To compute number of words with frequency 493
  – rank of “png” minus the rank of “defend”
  – 5005 − 5001 = 4

Rank   Word        Freq
4999   objective   494
5000   albany      494
5001   defend      494
5002   appeals     493
5003   125         493
5004   lasting     493
5005   png         493
Example
• Proportions of words occurring n times in 5,001 Wikipedia documents
• Vocabulary size is 348,436.

Num. occurrences (n)   Predicted proportion (1/(n(n+1)))   Actual proportion   Actual number of words
1                      .500                                .469                163,404
2                      .167                                .151                52,672
3                      .083                                .070                24,272
4                      .050                                .045                15,685
5                      .033                                .030                10,437
6                      .024                                .022                7,832
7                      .018                                .017                5,962
8                      .014                                .014                4,890
9                      .011                                .011                3,886
10                     .009                                .009                3,291
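Both proportion columns can be reproduced with a couple of one-line functions. This sketch assumes the actual proportion is the number of words with frequency n divided by the vocabulary size; the names are illustrative.

```java
class ZipfProportion {
    /** Predicted proportion of vocabulary words occurring exactly n times: 1 / (n (n + 1)). */
    static double predicted(int n) {
        return 1.0 / ((double) n * (n + 1));
    }

    /** Observed proportion: words with that frequency divided by vocabulary size. */
    static double actual(long wordsWithFrequency, long vocabularySize) {
        return (double) wordsWithFrequency / vocabularySize;
    }

    public static void main(String[] args) {
        long vocabulary = 348_436L;
        long[] counts = {163_404, 52_672, 24_272}; // words occurring 1, 2, 3 times
        for (int n = 1; n <= counts.length; n++) {
            System.out.printf("n=%d predicted=%.3f actual=%.3f%n",
                    n, predicted(n), actual(counts[n - 1], vocabulary));
        }
    }
}
```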
Vocabulary Growth
• As corpus grows, so does vocabulary size
  – Fewer new words when corpus is already large
• Observed relationship (Heaps’ Law):
  $v = k \cdot n^{\beta}$
  where $v$ is vocabulary size (number of unique words), $n$ is the number of words in corpus, and $k$, $\beta$ are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)
wiki000 Subset Example
[Figure: vocabulary size vs. words in collection for the wiki000 subset, with fitted curve $v \approx 18.61 \cdot n^{0.5819}$]
Heaps’ Law Predictions
• Predictions for TREC collections are accurate for large numbers of words
  – e.g., first 22,545,922 words of wiki000 scanned
  – prediction is 353,587 unique words
  – actual number is 348,436
• Predictions for small numbers of words (i.e. < 1000) are much worse
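The wiki000 prediction can be checked directly from the fitted parameters. A minimal sketch, assuming k = 18.61 and β = 0.5819 from the wiki000 fit; the class name is illustrative.

```java
class HeapsLaw {
    /** Heaps' law: v = k * n^beta, with n the number of words seen so far. */
    static double predictVocabulary(double k, double beta, long wordsInCorpus) {
        return k * Math.pow(wordsInCorpus, beta);
    }

    public static void main(String[] args) {
        double predicted = predictVocabulary(18.61, 0.5819, 22_545_922L);
        long actual = 348_436L;
        System.out.printf("predicted=%.0f actual=%d relative error=%.3f%n",
                predicted, actual, (predicted - actual) / actual);
    }
}
```

Plugging in a small n shows why early predictions are poor: the curve is fitted to the whole collection, so it is dominated by behavior at large n.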
Heaps’ Law Predictions
• Heaps’ Law works with very large corpora
  – new words occurring even after seeing 30 million!
• New words come from a variety of sources
  – spelling errors, invented words (e.g. product, company names), code, other languages, email addresses, etc.
• Search engines must deal with these large and growing vocabularies
Stopping
• Function words (determiners, prepositions) have little meaning on their own
• High occurrence frequencies
  – Top 6 words: the, of, and, in, to, a
• Treated as stopwords (i.e. removed)
  – reduce index space, improve response time, improve effectiveness
• Can be important in combinations
  – e.g., “to be or not to be”
Stopping
• Keep track of all very common words in a stopwords list.
• During text processing, ignore any word on the list.
• Stopword list can be created from high-frequency words or based on a standard list
• Lists are customized for applications, domains, and even parts of documents
  – e.g., “click” is a good stopword for anchor text
Stopping
• When storage space is not a concern, it can be better to not stop.
  – Queries are less restricted.
  – Remove stopwords at query time unless user says to include them.
• Google does not stop.
  – “to be or not to be” returns results.
  – +the returns results (over 14 billion).
End Result of Stopping
• List of words minus those on the stop list.
  – tropical fish include fish found tropical environments around world including both freshwater salt water species fishkeepers often use term tropical fish refer only those requiring fresh water saltwater tropical fish referred marine fish
• Next step: stemming.
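Stopping amounts to a set lookup per token. The sketch below uses a tiny stop list chosen only for illustration (the top-6 words from the slides plus "as" and "with"); it is an assumption, not the list used to produce the result above, and a real list would be longer and tuned to the application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Stopper {
    // Illustrative stop list (an assumption): top-6 words plus "as" and "with".
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "and", "in", "to", "a", "as", "with"));

    /** Returns the tokens that are not on the stop list, in their original order. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "fishkeepers", "often", "use", "the", "term", "tropical", "fish",
                "to", "refer", "only", "those");
        System.out.println(removeStopWords(tokens));
    }
}
```

A hash set keeps the lookup constant-time, which matters when every token of every document passes through this check.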
Stemming
• Many morphological variations of words
  – inflectional (plurals, tenses)
  – derivational (making verbs nouns etc.)
• In most cases, these have the same or very similar meanings
• Stemmers attempt to reduce morphological variations of words to a common stem
  – usually involves removing suffixes
• Can be done at indexing time or as part of query processing (like stopwords)
Stemming
• Generally a small but significant effectiveness improvement
  – can be crucial for some languages
  – e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Stemming
• Two basic types
  – Dictionary-based: uses lists of related words
  – Algorithmic: uses program to determine related words
• Algorithmic stemmers
  – suffix-s: remove ‘s’ endings assuming plural
    • e.g., cats → cat, lakes → lake
    • Many false negatives: supplies → supplie
    • Some false positives: ups → up
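The suffix-s stemmer is small enough to write out in full, which makes its failure modes easy to see. A minimal sketch; the class name is illustrative.

```java
class SuffixSStemmer {
    /** Removes a single trailing 's', assuming it marks a plural. */
    static String stem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cats", "lakes", "supplies", "ups", "fish"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

"supplies" becomes "supplie" rather than "supply", so it fails to match its singular (a false negative), while "ups" is conflated with the unrelated word "up" (a false positive).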
Porter Stemmer
• Algorithmic stemmer used in IR experiments since the 70s
• Consists of a series of rules designed to strip the longest possible suffix at each step
• Provably effective
• Produces stems not words
• Makes a number of errors and difficult to modify
Porter Stemmer
• Example step (1 of 5)
Porter Stemmer
• Porter2 stemmer addresses some of these issues
• Approach has been used with other languages
Krovetz Stemmer
• Hybrid algorithmic-dictionary
  – Word checked in dictionary
    • If present, either left alone or replaced with “exception”
    • If not present, word is checked for suffixes that could be removed
    • After removal, dictionary is checked again
• Produces words not stems
• Comparable effectiveness
• Lower false positive rate, somewhat higher false negative
Stemmer Comparison
End Result of Stemming
• List of stemmed terms:
  – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish
  – (from Porter2 stemmer)
• Next step: advanced processing, or indexing.
Advanced Text Processing
• Part-of-speech tagging.
• Sense disambiguation.
• Synonym classification.
• Named entity tagging.
• Phrase identification.
• Referent resolution.
• Sentence segmentation.
• Translation.
• Speech recognition.

Martin Hall, 49, head of public policy and external affairs at the London Stock Exchange, is to leave at the end of June.
… The departure of Hall, who had been in the running to be head of corporate affairs at the BBC, appears to have been prompted by the decision of the new chief executive, Michael Lawrence, to split Hall’s job in two and take the public policy element under his own wing.

<person id=pe1>Martin Hall</person>, 49, <sense num=2>head</sense> of <ow1>public policy</ow1> and external affairs at the <corp id=co1>London Stock Exchange</corp>, is to <syn grp=1>leave</syn> at the end of June.
… The <syn grp=1>departure</syn> of <person id=pe1>Hall</person>, <ref to=pe1>who</ref> had been in the running to be head of corporate affairs at the <corp id=co2>BBC</corp>, appears to have been prompted by the decision of the new chief executive, <person id=pe2>Michael Lawrence</person>, to split <person id=pe1>Hall’s</person> job in two and take the public policy element under <ref to=pe1>his</ref> own wing.
Text Processing Errors
• All text processing is errorful.
  – Design decisions produce segmentation errors, stopping errors, stemming errors.
  – False positives and false negatives.
  – More advanced methods → more difficult processing → more errors.
• Does the benefit outweigh the cost?
  – Segmentation & stemming: definitely.
  – POS tagging, NE tagging: depends on domain.
  – Synonym classes: maybe not.
End Result of Text Processing
<title>Tropical fish</title>
<text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.
• Metadata:
  – Title: Tropical fish
• Important fields:
  – Links: fish tropic freshwat salt water fishkeep marin fish
• Body:
  – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish
Course Project
• Phase I, worksheet 1.
  – Write a text processing module.
  – Parse Wikipedia pages, tokenize, stop, and stem.
  – Answer questions about Wikipedia data: how big is vocabulary, how many word occurrences are there, etc.
• Due next Wednesday.
  – Please start ASAP!
Expectations
• Read Wikipedia pages off disk.
• Identify parts of them that do not need to be indexed.
• Convert the rest into a list of words.
• Drop stop words, stem remaining words to terms.
• Keep track of the number of times each term appears, how many documents it appears in.
Pseudo Java
    import java.io.*;
    import java.util.*;
    …
    // term -> number of occurrences; Integer, since generics cannot hold the primitive int
    Map<String, Integer> termCounts = new HashMap<>();
    File doc = new File(filename);
    // try-with-resources closes the scanner even if processing fails
    try (Scanner docScanner = new Scanner(doc)) {
        while (docScanner.hasNextLine()) {
            List<String> terms = processLine(docScanner.nextLine());
            for (String currentTerm : terms) {
                // merge() starts an unseen term at 1 and otherwise adds 1 to its count
                termCounts.merge(currentTerm, 1, Integer::sum);
            }
        }
    }
    public List<String> processLine(String line) {
        List<String> terms = new ArrayList<>();
        Scanner lineScanner = new Scanner(line);
        // split on runs of whitespace ("\\s*" also matches the empty string between characters)
        lineScanner.useDelimiter("\\s+");
        while (lineScanner.hasNext()) {
            String word = lineScanner.next();
            /* check if word is appropriate for indexing or if it marks the start of a block to ignore */
            if (word.indexOf("{{") >= 0) {
                /* ignore words until closing the block with a }} */
            }
            … /* other conditions */
            /* strip non-alphanumeric characters and lower-case */
            word = word.replaceAll("[^a-zA-Z0-9]", "");
            word = word.toLowerCase();
            /* skip words left empty by stripping, then check if word is in the stop list */
            if (!word.isEmpty() && !isStopWord(word)) {
                word = stemmer.stem(word); /* stem word */
                terms.add(word);
            }
        }
        return terms;
    }