CS124/LINGUIST180 From Languages to Information
Unix for Poets
Dan Jurafsky
(original by Ken Church, modifications by me and Chris Manning)
Stanford University
Unix for Poets
• Text is everywhere
  • The Web
  • Dictionaries, corpora, email, etc.
  • Billions and billions of words
• What can we do with it all?
  • It is better to do something simple than nothing at all.
  • You can do simple things from a Unix command line
  • Sometimes it's much faster even than writing a quick Python tool
  • DIY is very satisfying
Exercises we'll be doing today
1. Count words in a text
2. Sort a list of words in various ways
  • ASCII order
  • "rhyming" order
3. Extract useful info from a dictionary
4. Compute n-gram statistics
5. Work with parts of speech in tagged text
Tools
• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• sed (edit string: replacement)
• cat (send file(s) in stream)
• echo (send text in stream)
• cut (columns in tab-separated files)
• paste (paste columns)
• head
• tail
• rev (reverse the characters of each line)
• comm
• join
• shuf (shuffle lines of text)
Prerequisites: get the text file we are using
• myth: ssh into a myth and then do:
  scp cardinal:/afs/ir/class/cs124/nyt_200811.txt.gz .
• Or, if you're using your own Mac or Unix laptop, do that, or you could download the file, if you haven't already:
  http://cs124.stanford.edu/nyt_200811.txt.gz
• Then:
  gunzip nyt_200811.txt.gz
Prerequisites
• The Unix "man" command
  • e.g., man tr (shows command options; not friendly)
• Input/output redirection:
  • > "output to a file"
  • < "input from a file"
  • | "pipe"
• CTRL-C (interrupt a running command)
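A minimal sketch of the three redirection operators on a throwaway file. The file name demo.txt and the fruit list are invented for illustration and are not part of the course data.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# ">" sends a command's output to a file (replacing any old contents).
printf 'pear\napple\nfig\n' > "$workdir/demo.txt"

# "<" feeds a file to a command's standard input.
# (tr -d ' ' strips the padding some versions of wc add.)
line_count=$(wc -l < "$workdir/demo.txt" | tr -d ' ')

# "|" pipes one command's output into the next command's input.
first_sorted=$(sort < "$workdir/demo.txt" | head -n 1)

echo "$line_count lines; first in sorted order: $first_sorted"
```

Every pipeline in the later exercises is built from these same three pieces.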
Exercise 1: Count words in a text
• Input: text file (nyt_200811.txt) (after it's gunzipped)
• Output: list of words in the file with freq counts
• Algorithm
  1. Tokenize (tr)
  2. Sort (sort)
  3. Count duplicates (uniq -c)
• Go read the man pages and figure out how to pipe these together
Solution to Exercise 1
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c
  25476 a
   1271 A
      3 AA
      3 AAA
      1 Aalborg
      1 Aaliyah
      1 Aalto
      2 aardvark
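To see what each flag contributes, here is the same pipeline run on one invented sentence instead of the NYT file. In tr, -c takes the complement of the set (everything that is not a letter) and -s squeezes runs of the replacement character into one, so each word lands on its own line. The function name count_words and the sample text are made up for this sketch, and LC_ALL=C is added only to make the sort order predictable (in that locale uppercase sorts before lowercase, unlike the output shown on the slide).

```shell
# Tokenize, sort, and count, reading text from standard input.
count_words() {
  tr -sc 'A-Za-z' '\n' | LC_ALL=C sort | uniq -c
}

sample='the cat saw the other cat. The end'
result=$(printf '%s\n' "$sample" | count_words)

# Counts: 1 The, 2 cat, 1 end, 1 other, 1 saw, 2 the
# (the amount of padding before each count varies between systems).
printf '%s\n' "$result"
```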
Some of the output
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head -n 5
  25476 a
   1271 A
      3 AA
      3 AAA
      1 Aalborg
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head
  • Gives you the first 10 lines
  • tail does the same with the end of the input
  • (You can omit the "-n", as in head -5, but it's discouraged.)
Extended Counting Exercises
1. Merge upper and lower case by downcasing everything
  • Hint: Put in a second tr command
2. How common are different sequences of vowels (e.g., ieu)
  • Hint: Put in a second tr command
Solutions
Merge upper and lower case by downcasing everything
tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | sort | uniq -c
or
tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1. tokenize by replacing the complement of letters with newlines
2. replace all uppercase with lowercase
3. sort alphabetically
4. merge duplicates and show counts
Solutions
• How common are different sequences of vowels (e.g., ieu)
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | tr -sc 'aeiou' '\n' | sort | uniq -c
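A small sketch of the vowel-sequence pipeline on an invented phrase, to show what the last tr does: everything that is not a vowel becomes a newline, so each run of vowels ends up on its own line. The function name vowel_sequences and the sample phrase are assumptions for illustration.

```shell
# Tokenize, downcase, then keep only runs of vowels, one per line.
vowel_sequences() {
  tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | tr -sc 'aeiou' '\n' | LC_ALL=C sort | uniq -c
}

result=$(printf 'Adieu, adieu to you and you\n' | vowel_sequences)

# Counts: 3 a, 2 ieu, 1 o, 2 ou
printf '%s\n' "$result"
```

Note that "y" is treated as a consonant here, so "you" contributes "ou".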
Sorting and reversing lines of text
• sort
• sort -f   Ignore case
• sort -n   Numeric order
• sort -r   Reverse sort
• sort -nr  Reverse numeric sort
• echo "Hello" | rev
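A brief sketch of how these options differ, using three invented lines that start with a number. Plain sort compares character by character, so "12" comes before "3"; -n compares the leading numbers instead. LC_ALL=C is added here only to pin down the plain-sort order.

```shell
numbers=$(printf '3 pear\n12 apple\n7 fig\n')

# Character order: "1" sorts before "3" and "7".
plain_first=$(printf '%s\n' "$numbers" | LC_ALL=C sort | head -n 1)

# Numeric order: 3 < 7 < 12.
numeric_first=$(printf '%s\n' "$numbers" | sort -n | head -n 1)

# Reverse numeric order puts the largest count on top.
largest=$(printf '%s\n' "$numbers" | sort -nr | head -n 1)

# rev reverses the characters within each line.
reversed=$(echo "Hello" | rev)

echo "plain: $plain_first | numeric: $numeric_first | largest: $largest | rev: $reversed"
```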
Counting and sorting exercises
• Find the 50 most common words in the NYT
  • Hint: Use sort a second time, then head
• Find the words in the NYT that end in "zz"
  • Hint: Look at the end of a list of reversed words
  • tr 'A-Z' 'a-z' < filename | tr -sc 'A-Za-z' '\n' | rev | sort | rev | uniq -c
Counting and sorting exercises
• Find the 50 most common words in the NYT
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | sort -nr | head -n 50
• Find the words in the NYT that end in "zz"
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | rev | sort | rev | uniq -c | tail -n 10
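A minimal sketch of "rhyming" order on six invented words: reverse each word, sort, then reverse back, so words are ordered by their endings. Words ending in "zz" collect at the bottom because "z" sorts last. The function name rhyme_sort is made up, and uniq -c is applied after the second rev so that the counts themselves are not reversed.

```shell
# Sort words by their endings and count duplicates.
rhyme_sort() {
  rev | LC_ALL=C sort | rev | uniq -c
}

result=$(printf 'jazz\nbuzz\ncat\njazz\nfizz\nhat\n' | rhyme_sort | tail -n 3)

# Last three lines: 2 jazz, 1 fizz, 1 buzz
printf '%s\n' "$result"
```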
Lesson
• Piping commands together can be simple yet powerful in Unix
• It gives flexibility.
• Traditional Unix philosophy: small tools that can be composed
Bigrams = word pairs and their counts
Algorithm:
1. Tokenize by word
2. Create two almost-duplicate files of words, off by one line, using tail
3. paste them together so as to get word i and word i+1 on the same line
4. Count
Bigrams
• tr -sc 'A-Za-z' '\n' < nyt_200811.txt > nyt.words
• tail -n +2 nyt.words > nyt.nextwords
• paste nyt.words nyt.nextwords > nyt.bigrams
• head -n 5 nyt.bigrams
  KBR said
  said Friday
  Friday the
  the global
  global economic
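The same four steps on a five-word invented sentence, as a rough sketch. The scratch file names (words, nextwords) are assumptions chosen to mirror nyt.words and nyt.nextwords. Because the second file is one line shorter, the last pasted line holds a single word with no partner.

```shell
workdir=$(mktemp -d)

# 1. Tokenize: one word per line.
printf 'the cat saw the cat\n' | tr -sc 'A-Za-z' '\n' > "$workdir/words"

# 2. Same list starting from line 2, so line i holds word i+1.
tail -n +2 "$workdir/words" > "$workdir/nextwords"

# 3. Put word i and word i+1 side by side, separated by a tab.
paste "$workdir/words" "$workdir/nextwords" > "$workdir/bigrams"

# 4. Count, most frequent first.
top_bigram=$(LC_ALL=C sort "$workdir/bigrams" | uniq -c | sort -nr | head -n 1)

# The most frequent pair is "the cat", seen twice.
printf '%s\n' "$top_bigram"
```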
Exercises
• Find the 10 most common bigrams
  • (For you to look at:) What part-of-speech pattern are most of them?
• Find the 10 most common trigrams
Solutions
• Find the 10 most common bigrams
  tr 'A-Z' 'a-z' < nyt.bigrams | sort | uniq -c | sort -nr | head -n 10
• Find the 10 most common trigrams
  tail -n +3 nyt.words > nyt.thirdwords
  paste nyt.words nyt.nextwords nyt.thirdwords > nyt.trigrams
  cat nyt.trigrams | tr "[:upper:]" "[:lower:]" | sort | uniq -c | sort -rn | head -n 10
grep
• Grep finds patterns specified as regular expressions
• grep rebuilt nyt_200811.txt
  Conn and Johnson, has been rebuilt, among the first of the 222
  move into their rebuilt home, sleeping under the same roof for the
  the part of town that was wiped away and is being rebuilt. That is
  to laser trace what was there and rebuilt it with accuracy," she
  home - is expected to be rebuilt by spring. Braasch promises that a
  the anonymous places where the country will have to be rebuilt,
  "The party will not be rebuilt without moderates being a part of
grep
• Grep finds patterns specified as regular expressions
  • globally search for regular expression and print
• Finding words ending in -ing:
  • grep 'ing$' nyt.words | sort | uniq -c
grep
• grep is a filter: you keep only some lines of the input
• grep gh        keep lines containing "gh"
• grep '^con'    keep lines beginning with "con"
• grep 'ing$'    keep lines ending with "ing"
• grep -v gh     keep lines NOT containing "gh"
• egrep [extended syntax]
  • egrep '^[A-Z]+$' nyt.words | sort | uniq -c
    ALL UPPERCASE (egrep, grep -E, grep -P, even grep might work)
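A short sketch of these filters on six invented words, one per line. It uses grep -E for the extended syntax; the word list and variable names are made up for illustration. Note that "icon" contains "con" but does not begin with it, so the ^ anchor excludes it.

```shell
words=$(printf 'conga\nlaugh\nsing\nNATO\nicon\nUS\n')

# Keep lines beginning with "con".
begins_con=$(printf '%s\n' "$words" | grep '^con')

# Keep lines ending with "ing".
ends_ing=$(printf '%s\n' "$words" | grep 'ing$')

# Keep lines NOT containing "gh".
no_gh=$(printf '%s\n' "$words" | grep -v gh)

# Extended syntax: lines made up entirely of uppercase letters.
all_caps=$(printf '%s\n' "$words" | grep -E '^[A-Z]+$')

printf 'begins with con: %s\nends with ing: %s\n' "$begins_con" "$ends_ing"
printf 'all caps:\n%s\n' "$all_caps"
```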
Counting lines, words, characters
• wc nyt_200811.txt
  140000 1007597 6070784 nyt_200811.txt
• wc -l nyt.words
  1017618 nyt.words
Exercise: Why is the number of words different?
Exercises on grep & wc
• How many all uppercase words are there in this NYT file?
• How many 4-letter words?
• How many different words are there with no vowels
  • What subtypes do they belong to?
• How many "1 syllable" words are there
  • That is, ones with exactly one vowel
Type/token distinction: different words (types) vs. instances (tokens)
Solutions on grep & wc
• How many all uppercase words are there in this NYT file?
  grep -P '^[A-Z]+$' nyt.words | wc
• How many 4-letter words?
  grep -P '^[a-zA-Z]{4}$' nyt.words | wc
• How many different words are there with no vowels
  grep -v '[AEIOUaeiou]' nyt.words | sort | uniq | wc
• How many "1 syllable" words are there
  tr 'A-Z' 'a-z' < nyt.words | grep -P '^[^aeiouAEIOU]*[aeiouAEIOU]+[^aeiouAEIOU]*$' | sort | uniq | wc
Type/token distinction: different words (types) vs. instances (tokens)
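A tiny sketch of the type/token distinction on six invented words. Counting lines gives tokens; counting after sort | uniq gives types. It also shows why the sort matters: uniq only merges duplicates that sit on adjacent lines.

```shell
words=$(printf 'to\nbe\nor\nnot\nto\nbe\n')

# Tokens: every instance counts.
tokens=$(printf '%s\n' "$words" | wc -l | tr -d ' ')

# Types: distinct words only (be, not, or, to).
types=$(printf '%s\n' "$words" | sort | uniq | wc -l | tr -d ' ')

# Without sort, no two equal words are adjacent, so nothing is merged.
unsorted=$(printf '%s\n' "$words" | uniq | wc -l | tr -d ' ')

echo "tokens=$tokens types=$types uniq-without-sort=$unsorted"
```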
sed
• sed is used when you need to make systematic changes to strings in a file (larger changes than 'tr')
• It's line based: you optionally specify a line (by regex or line numbers) and specify a regex substitution to make
• For example, to change all cases of "George" to "Jane":
  • sed 's/George/Jane/g' nyt_200811.txt | less
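A small sketch of sed substitution on invented lines. Without a trailing g flag, s/// replaces only the first match on each line; with g it replaces every match. The last command shows the optional line selector: the substitution runs only on lines matching the regex /^b/.

```shell
line='George met George'

# No flag: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/George/Jane/')

# g flag: every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/George/Jane/g')

# Address: substitute only on lines that begin with "b".
addressed=$(printf 'a George\nb George\n' | sed '/^b/s/George/Jane/')

printf '%s\n%s\n%s\n' "$first_only" "$every" "$addressed"
```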
sed exercises
• Count frequency of word-initial consonant sequences
  • Take tokenized words
  • Delete the first vowel through the end of the word
  • Sort and count
• Count word-final consonant sequences
sed exercises
• Count frequency of word-initial consonant sequences
  tr "[:upper:]" "[:lower:]" < nyt.words | sed 's/[aeiouAEIOU].*$//' | sort | uniq -c
• Count word-final consonant sequences
  tr "[:upper:]" "[:lower:]" < nyt.words | sed 's/^.*[aeiou]//g' | sort | uniq -c | sort -rn | less
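Both substitutions on four invented lowercase words, as a rough sketch. The first deletes from the first vowel to the end of the word, leaving the initial consonants. The second relies on .* being greedy: it matches as far as the last vowel, leaving only the final consonants.

```shell
words=$(printf 'string\nstrong\nthink\nthat\n')

# Word-initial consonant sequences: drop the first vowel and everything after it.
initial=$(printf '%s\n' "$words" | sed 's/[aeiou].*$//' | LC_ALL=C sort | uniq -c)

# Word-final consonant sequences: drop everything through the last vowel.
final=$(printf '%s\n' "$words" | sed 's/^.*[aeiou]//' | LC_ALL=C sort | uniq -c)

# initial: 2 str, 2 th      final: 2 ng, 1 nk, 1 t
printf 'initial:\n%s\nfinal:\n%s\n' "$initial" "$final"
```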
cut: tab-separated files
scp <sunet>@myth.stanford.edu:/afs/ir/class/cs124/parses.conll.gz .
gunzip parses.conll.gz
head -n 5 parses.conll
1  Influential  _  JJ   JJ   _  2   amod   _  _
2  members      _  NNS  NNS  _  10  nsubj  _  _
3  of           _  IN   IN   _  2   prep   _  _
4  the          _  DT   DT   _  6   det    _  _
5  House        _  NNP  NNP  _  6   nn     _  _
cut: tab-separated files
• Frequency of different parts of speech:
  cut -f 4 parses.conll | sort | uniq -c | sort -nr
• Get just words and their parts of speech:
  cut -f 2,4 parses.conll
• You can deal with comma-separated files with: cut -d,
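A minimal sketch of cut on a three-line invented file laid out like the first four columns of parses.conll (index, word, underscore, part of speech). The file name toy.conll and its contents are assumptions for illustration.

```shell
workdir=$(mktemp -d)

# Three tab-separated rows: index, word, "_", part-of-speech tag.
printf '1\tthat\t_\tDT\n2\trabbit\t_\tNN\n3\truns\t_\tVBZ\n' > "$workdir/toy.conll"

# Column 4 only: the part-of-speech tags.
pos_tags=$(cut -f 4 "$workdir/toy.conll")

# Columns 2 and 4: word and tag, still tab-separated.
word_pos=$(cut -f 2,4 "$workdir/toy.conll")

# -d, switches the delimiter from tab to comma.
csv_tags=$(printf 'that,DT\nrabbit,NN\n' | cut -d, -f 2)

printf '%s\n' "$word_pos"
```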
cut exercises
• How often is 'that' used as a determiner (DT) "that rabbit" versus a complementizer (IN) "I know that they are plastic" versus a relative (WDT) "The class that I love"
  • Hint: With grep -P, you can use '\t' for a tab character
• What determiners occur in the data? What are the 5 most common?
cut exercise solutions
• How often is 'that' used as a determiner (DT) "that rabbit" versus a complementizer (IN) "I know that they are plastic" versus a relative (WDT) "The class that I love"
  cat parses.conll | grep -P '(that\t_\tDT)|(that\t_\tIN)|(that\t_\tWDT)' | cut -f 2,4 | sort | uniq -c
• What determiners occur in the data? What are the 5 most common?
  cat parses.conll | tr 'A-Z' 'a-z' | grep -P '\tdt\t' | cut -f 2,4 | sort | uniq -c | sort
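A rough sketch of the 'that' count on a seven-line invented file with five tab-separated columns. Instead of grep -P and '\t', which not every grep supports, this variant stores a literal tab in a shell variable and uses grep -E; that substitution is this sketch's choice, not the method in the solution above. The file name toy.conll and its rows are made up.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# index, word, "_", tag, tag
printf '1\tthat\t_\tDT\tDT\n2\trabbit\t_\tNN\tNN\n3\tknow\t_\tVBP\tVBP\n' > "$workdir/toy.conll"
printf '4\tthat\t_\tIN\tIN\n5\tthat\t_\tIN\tIN\n6\tclass\t_\tNN\tNN\n7\tthat\t_\tWDT\tWDT\n' >> "$workdir/toy.conll"

# Keep rows whose word is "that" with one of the three tags, then count word/tag pairs.
that_counts=$(grep -E "${tab}that${tab}_${tab}(DT|IN|WDT)${tab}" "$workdir/toy.conll" \
  | cut -f 2,4 | LC_ALL=C sort | uniq -c)

# Counts: 1 that DT, 2 that IN, 1 that WDT
printf '%s\n' "$that_counts"
```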