
CS 124/LINGUIST 180 From Languages to Information

Unix for Poets

Dan Jurafsky
(original by Ken Church, modifications by me and Chris Manning)

Stanford University

Unix for Poets

• Text is everywhere
  • The Web
  • Dictionaries, corpora, email, etc.
  • Billions and billions of words
• What can we do with it all?
  • It is better to do something simple than nothing at all.
  • You can do simple things from a Unix command line
  • Sometimes it’s much faster even than writing a quick Python tool
  • DIY is very satisfying

Exercises we’ll be doing today

1. Count words in a text
2. Sort a list of words in various ways
  • ASCII order
  • “rhyming” order
3. Extract useful info from a dictionary
4. Compute n-gram statistics
5. Work with parts of speech in tagged text

Tools

• grep: search for a pattern (regular expression)
• sort
• uniq -c (count duplicates)
• tr (translate characters)
• wc (word or line count)
• sed (edit string: replacement)
• cat (send file(s) in stream)
• echo (send text in stream)
• cut (columns in tab-separated files)
• paste (paste columns)
• head
• tail
• rev (reverse lines)
• comm
• join
• shuf (shuffle lines of text)

Prerequisites: get the text file we are using

• myth: ssh into a myth and then do:
  scp cardinal:/afs/ir/class/cs124/nyt_200811.txt.gz .
• Or, if you’re using your own Mac or Unix laptop, do that, or you could download it, if you haven't already:
  http://cs124.stanford.edu/nyt_200811.txt.gz
• Then:
  gunzip nyt_200811.txt.gz

Prerequisites

• The Unix “man” command
  • e.g., man tr (shows command options; not friendly)
• Input/output redirection:
  • > “output to a file”
  • < “input from a file”
  • | “pipe”
• CTRL-C
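A minimal sketch of the three redirection operators. The scratch directory, file name, and sample string are assumptions made up for illustration; they are not part of the course files.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
demo_dir=$(mktemp -d)

# ">" sends a command's output to a file (overwriting the file if it exists).
echo "hello unix poets" > "$demo_dir/greeting.txt"

# "<" feeds a file to a command as its input.
upper=$(tr 'a-z' 'A-Z' < "$demo_dir/greeting.txt")
echo "$upper"

# "|" pipes the output of one command into the next one.
# (tr -d ' ' removes the padding that some versions of wc print.)
word_count=$(echo "hello unix poets" | wc -w | tr -d ' ')
echo "$word_count"
```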

Exercise 1: Count words in a text

• Input: text file (nyt_200811.txt) (after it’s gunzipped)
• Output: list of words in the file with freq counts
• Algorithm
  1. Tokenize (tr)
  2. Sort (sort)
  3. Count duplicates (uniq -c)
• Go read the man pages and figure out how to pipe these together

Solution to Exercise 1

• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c

  25476 a
  1271 A
  3 AA
  3 AAA
  1 Aalborg
  1 Aaliyah
  1 Aalto
  2 aardvark
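To see what each stage contributes, the same pipeline can be tried on a one-line sample instead of the full corpus. This is only a sketch: the sample sentence is invented, `LC_ALL=C` is added so the sort order is plain byte order (capitals first) on any machine, and the final `sed` just strips the padding that `uniq -c` puts before each count.

```shell
# A tiny stand-in for the corpus (hypothetical sample text).
sample='The cat saw the dog. The dog ran!'

# 1. tr:   every run of non-letters becomes a single newline (one word per line)
# 2. sort: brings identical words next to each other
# 3. uniq: collapses adjacent duplicates and prefixes each with its count
counts=$(printf '%s\n' "$sample" |
  tr -sc 'A-Za-z' '\n' |
  LC_ALL=C sort |
  uniq -c |
  sed 's/^ *//')
echo "$counts"
```

Note that "The" and "the" are counted separately here, which is what the next exercise addresses.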

Some of the output

• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head -n 5

  25476 a
  1271 A
  3 AA
  3 AAA
  1 Aalborg

• tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | head
  • Gives you the first 10 lines
  • tail does the same with the end of the input
  • (You can omit the “-n” but it’s discouraged.)

Extended Counting Exercises

1. Merge upper and lower case by downcasing everything
  • Hint: Put in a second tr command
2. How common are different sequences of vowels (e.g., ieu)?
  • Hint: Put in a second tr command

Solutions

Merge upper and lower case by downcasing everything:
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | sort | uniq -c
or
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c

1. tokenize by replacing the complement of letters with newlines
2. replace all uppercase with lowercase
3. sort alphabetically
4. merge duplicates and show counts

Solutions

• How common are different sequences of vowels (e.g., ieu)?
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | tr -sc 'aeiou' '\n' | sort | uniq -c
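A small sketch of the vowel-sequence pipeline on an invented four-word sample. The third `tr` treats every non-vowel (including the newlines between words) as a separator, so what remains is one vowel run per line. `LC_ALL=C` and the trailing `sed` are additions for a predictable sort order and unpadded counts.

```shell
# Hypothetical sample; it starts with a vowel on purpose. If the input begins
# with a consonant, the output starts with one empty line, which then gets
# counted as an (empty) entry.
sample='Audio radio quiet diet'

vowel_runs=$(printf '%s\n' "$sample" |
  tr -sc 'A-Za-z' '\n' |
  tr 'A-Z' 'a-z' |
  tr -sc 'aeiou' '\n' |
  LC_ALL=C sort |
  uniq -c |
  sed 's/^ *//')
echo "$vowel_runs"
```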

Sorting and reversing lines of text

• sort
• sort -f    Ignore case
• sort -n    Numeric order
• sort -r    Reverse sort
• sort -nr   Reverse numeric sort

• echo "Hello" | rev
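The difference between the default sort and `-n` is easiest to see on numbers of different lengths. A short sketch with made-up input; `paste -sd ' ' -` is used only to join the result onto one line for display.

```shell
numbers=$(printf '10\n9\n100\n')

# Default sort compares character by character, so "100" comes before "9".
text_order=$(printf '%s\n' "$numbers" | LC_ALL=C sort | paste -sd ' ' -)

# -n compares by numeric value; -nr does the same, largest first.
numeric_order=$(printf '%s\n' "$numbers" | sort -n | paste -sd ' ' -)
reverse_numeric=$(printf '%s\n' "$numbers" | sort -nr | paste -sd ' ' -)

# rev reverses the characters of each line.
reversed=$(echo "Hello" | rev)

echo "$text_order / $numeric_order / $reverse_numeric / $reversed"
```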

Counting and sorting exercises

• Find the 50 most common words in the NYT
  • Hint: Use sort a second time, then head
• Find the words in the NYT that end in "zz"
  • Hint: Look at the end of a list of reversed words
  • tr 'A-Z' 'a-z' < filename | tr -sc 'A-Za-z' '\n' | rev | sort | rev | uniq -c

Counting and sorting exercises

• Find the 50 most common words in the NYT
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | sort | uniq -c | sort -nr | head -n 50
• Find the words in the NYT that end in "zz"
  tr -sc 'A-Za-z' '\n' < nyt_200811.txt | tr 'A-Z' 'a-z' | rev | sort | rev | uniq -c | tail -n 10
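Both ideas can be checked on a handful of words. This is a sketch on an invented word list, not the corpus: the first pipeline sorts the counts numerically in reverse, and the second sorts on reversed spellings so that words with the same ending end up together ("rhyming" order).

```shell
# Hypothetical word list, one word per line.
words=$(printf 'buzz\nfizz\njazz\ncat\nbuzz\nthe\nthe\nthe\n')

# Most common first: count, then sort by count (largest first), keep the top 2.
top2=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | sort -nr |
  head -n 2 | sed 's/^ *//')

# Rhyming order: reverse each word, sort, reverse back, drop duplicates.
# Words ending in "zz" land at the end of the list.
rhyming=$(printf '%s\n' "$words" | rev | LC_ALL=C sort | rev | uniq |
  paste -sd ' ' -)

echo "$top2"
echo "$rhyming"
```

Running `rev` the second time before `uniq -c` keeps the counts themselves from being reversed.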

Lesson

• Piping commands together can be simple yet powerful in Unix
• It gives flexibility.
• Traditional Unix philosophy: small tools that can be composed

Bigrams = word pairs and their counts

Algorithm:
1. Tokenize by word
2. Create two almost-duplicate files of words, off by one line, using tail
3. paste them together so as to get word_i and word_i+1 on the same line
4. Count

Bigrams

• tr -sc 'A-Za-z' '\n' < nyt_200811.txt > nyt.words
• tail -n +2 nyt.words > nyt.nextwords
• paste nyt.words nyt.nextwords > nyt.bigrams
• head -n 5 nyt.bigrams

  KBR said
  said Friday
  Friday the
  the global
  global economic
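The off-by-one trick is easier to follow on five words. A sketch using a scratch directory and an invented sentence (the file names inside the scratch directory are placeholders):

```shell
work_dir=$(mktemp -d)

# One word per line.
printf 'the cat saw the cat\n' | tr -sc 'A-Za-z' '\n' > "$work_dir/words"

# Same list, starting from the second line (shifted up by one).
tail -n +2 "$work_dir/words" > "$work_dir/nextwords"

# paste joins line i of both files with a tab: word i next to word i+1.
paste "$work_dir/words" "$work_dir/nextwords" > "$work_dir/bigrams"

first_bigram=$(head -n 1 "$work_dir/bigrams")
bigram_lines=$(wc -l < "$work_dir/bigrams" | tr -d ' ')
cat "$work_dir/bigrams"
```

The last line of the result holds the final word followed by a tab and nothing else, because the shifted file is one line shorter; on a large corpus that single incomplete pair is usually ignored.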

Exercises

• Find the 10 most common bigrams
  • (For you to look at:) What part-of-speech pattern are most of them?
• Find the 10 most common trigrams

Solutions

• Find the 10 most common bigrams
  tr 'A-Z' 'a-z' < nyt.bigrams | sort | uniq -c | sort -nr | head -n 10
• Find the 10 most common trigrams
  tail -n +3 nyt.words > nyt.thirdwords
  paste nyt.words nyt.nextwords nyt.thirdwords > nyt.trigrams
  cat nyt.trigrams | tr "[:upper:]" "[:lower:]" | sort | uniq -c | sort -rn | head -n 10
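The trigram version adds one more shifted copy of the word list. A sketch on an invented seven-word sample, with placeholder file names in a scratch directory, that reports only the single most frequent trigram:

```shell
work_dir=$(mktemp -d)

# One word per line, plus copies shifted by one and by two lines.
printf 'to be or not to be or\n' | tr -sc 'A-Za-z' '\n' > "$work_dir/w1"
tail -n +2 "$work_dir/w1" > "$work_dir/w2"
tail -n +3 "$work_dir/w1" > "$work_dir/w3"

# Join the three columns, count identical lines, and keep the most frequent.
top_trigram=$(paste "$work_dir/w1" "$work_dir/w2" "$work_dir/w3" |
  LC_ALL=C sort | uniq -c | sort -nr | head -n 1 | sed 's/^ *//')
echo "$top_trigram"
```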

grep

• Grep finds patterns specified as regular expressions
• grep rebuilt nyt_200811.txt

  Conn and Johnson, has been rebuilt, among the first of the 222
  move into their rebuilt home, sleeping under the same roof for the
  the part of town that was wiped away and is being rebuilt. That is
  to laser trace what was there and rebuilt it with accuracy," she
  home - is expected to be rebuilt by spring. Braasch promises that a
  the anonymous places where the country will have to be rebuilt,
  "The party will not be rebuilt without moderates being a part of

grep

• Grep finds patterns specified as regular expressions
  • globally search for regular expression and print
• Finding words ending in -ing:
  • grep 'ing$' nyt.words | sort | uniq -c

grep

• grep is a filter: you keep only some lines of the input
• grep gh        keep lines containing “gh”
• grep '^con'    keep lines beginning with “con”
• grep 'ing$'    keep lines ending with “ing”
• grep -v gh     keep lines NOT containing “gh”
• egrep [extended syntax]
  • egrep '^[A-Z]+$' nyt.words | sort | uniq -c     ALL UPPERCASE
    (egrep, grep -E, grep -P, even grep might work)
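Each of these filters can be tried on a short invented word list. In this sketch `grep -E` stands in for `egrep`, and `LC_ALL=C` is an added precaution so that `[A-Z]` means exactly the 26 ASCII capitals regardless of locale.

```shell
# Hypothetical word list, one word per line.
words=$(printf 'night\nconcert\nsinging\nNASA\nicon\nring\n')

with_gh=$(printf '%s\n' "$words" | grep gh | paste -sd ' ' -)
starts_con=$(printf '%s\n' "$words" | grep '^con' | paste -sd ' ' -)
ends_ing=$(printf '%s\n' "$words" | grep 'ing$' | paste -sd ' ' -)
without_gh=$(printf '%s\n' "$words" | grep -v gh | paste -sd ' ' -)
all_caps=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[A-Z]+$' | paste -sd ' ' -)

echo "$with_gh | $starts_con | $ends_ing | $without_gh | $all_caps"
```

"icon" contains "con" but does not begin with it, so the `^` anchor excludes it.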

Counting lines, words, characters

• wc nyt_200811.txt
  140000 1007597 6070784 nyt_200811.txt
• wc -l nyt.words
  1017618 nyt.words

Exercise: Why is the number of words different?
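One way to explore the exercise is to compare the two counts on a single sentence. This sketch uses an invented sentence and only illustrates that the two commands define "word" differently; it is not claimed to account for the whole difference in the corpus.

```shell
sample="Don't panic, the U.S. economy grew."

# wc -w counts whitespace-separated chunks.
wc_words=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')

# The tr tokenizer splits at every non-letter, so "Don't" and "U.S."
# each become two lines.
tr_tokens=$(printf '%s\n' "$sample" | tr -sc 'A-Za-z' '\n' | wc -l | tr -d ' ')

echo "wc -w: $wc_words, tr tokens: $tr_tokens"
```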

Exercises on grep & wc

• How many all uppercase words are there in this NYT file?
• How many 4-letter words?
• How many different words are there with no vowels?
  • What subtypes do they belong to?
• How many “1 syllable” words are there?
  • That is, ones with exactly one vowel

Type/token distinction: different words (types) vs. instances (tokens)

Solutions on grep & wc

• How many all uppercase words are there in this NYT file?
  grep -P '^[A-Z]+$' nyt.words | wc
• How many 4-letter words?
  grep -P '^[a-zA-Z]{4}$' nyt.words | wc
• How many different words are there with no vowels?
  grep -v '[AEIOUaeiou]' nyt.words | sort | uniq | wc
• How many “1 syllable” words are there?
  tr 'A-Z' 'a-z' < nyt.words | grep -P '^[^aeiouAEIOU]*[aeiouAEIOU]+[^aeiouAEIOU]*$' | sort | uniq | wc

Type/token distinction: different words (types) vs. instances (tokens)
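The type/token distinction maps directly onto whether the list is de-duplicated before counting. A sketch on an invented six-word list; it also shows that `uniq` on unsorted input only merges neighbouring duplicates, so it does not give the number of types by itself.

```shell
# Hypothetical token list, one word per line.
words=$(printf 'the\ncat\nthe\ndog\ncat\nthe\n')

# Tokens: every instance counts.
tokens=$(printf '%s\n' "$words" | wc -l | tr -d ' ')

# Types: sort first so duplicates are adjacent, then collapse them.
types=$(printf '%s\n' "$words" | sort | uniq | wc -l | tr -d ' ')

# Without sort, no two equal words are adjacent here, so nothing is merged.
adjacent_only=$(printf '%s\n' "$words" | uniq | wc -l | tr -d ' ')

echo "tokens=$tokens types=$types uniq-without-sort=$adjacent_only"
```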

sed

• sed is used when you need to make systematic changes to strings in a file (larger changes than ‘tr’)
• It’s line based: you optionally specify a line (by regex or line numbers) and specify a regex substitution to make
• For example, to change all cases of “George” to “Jane”:
  • sed 's/George/Jane/g' nyt_200811.txt | less
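A short sketch of how the substitution behaves with and without the `g` flag, and with a line address in front. The sample strings are invented for illustration.

```shell
line='George met George; George left.'

# Without g, only the first match on each line is replaced.
first_only=$(echo "$line" | sed 's/George/Jane/')

# With g, every match on the line is replaced.
every_match=$(echo "$line" | sed 's/George/Jane/g')

# A leading address restricts the command to some lines: here, line 2 only.
second_line_only=$(printf 'George\nGeorge\nGeorge\n' | sed '2s/George/Jane/' |
  paste -sd ' ' -)

echo "$first_only"
echo "$every_match"
echo "$second_line_only"
```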

sed exercises

• Count frequency of word-initial consonant sequences
  • Take tokenized words
  • Delete the first vowel through the end of the word
  • Sort and count
• Count word-final consonant sequences

sed exercises

• Count frequency of word-initial consonant sequences
  tr "[:upper:]" "[:lower:]" < nyt.words | sed 's/[aeiouAEIOU].*$//' | sort | uniq -c
• Count word-final consonant sequences
  tr "[:upper:]" "[:lower:]" < nyt.words | sed 's/^.*[aeiou]//g' | sort | uniq -c | sort -rn | less
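Both substitutions can be checked on a few lowercase words. This is a sketch with an invented word list; note that a word with no vowel at all (here "sky") passes through unchanged, since the pattern never matches.

```shell
# Hypothetical lowercase word list, one word per line.
words=$(printf 'string\nstrong\nthat\nsky\nspring\n')

# Word-initial consonants: delete from the first vowel to the end of the line.
initial=$(printf '%s\n' "$words" | sed 's/[aeiou].*$//' |
  LC_ALL=C sort | uniq -c | sed 's/^ *//')

# Word-final consonants: ".*" is greedy, so this deletes everything up to
# and including the LAST vowel.
final=$(printf '%s\n' "$words" | sed 's/^.*[aeiou]//' |
  LC_ALL=C sort | uniq -c | sed 's/^ *//')

echo "$initial"
echo "$final"
```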

cut: tab-separated files

scp <sunet>@myth.stanford.edu:/afs/ir/class/cs124/parses.conll.gz .
gunzip parses.conll.gz

head -n 5 parses.conll

  1  Influential  _  JJ   JJ   _  2   amod   _  _
  2  members      _  NNS  NNS  _  10  nsubj  _  _
  3  of           _  IN   IN   _  2   prep   _  _
  4  the          _  DT   DT   _  6   det    _  _
  5  House        _  NNP  NNP  _  6   nn     _  _

cut: tab-separated files

• Frequency of different parts of speech:
  cut -f 4 parses.conll | sort | uniq -c | sort -nr
• Get just words and their parts of speech:
  cut -f 2,4 parses.conll
• You can deal with comma-separated files with: cut -d,
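A sketch of the same `cut` calls on four invented rows. The rows imitate only the first four columns of the CoNLL file (index, word, placeholder, tag); the real file has more columns.

```shell
# Hypothetical tab-separated rows: index, word, "_", part-of-speech tag.
rows=$(printf '1\tthe\t_\tDT\n2\tdog\t_\tNN\n3\tsaw\t_\tVBD\n4\tthis\t_\tDT\n')

# Column 4 only, counted, most frequent tag first.
pos_top=$(printf '%s\n' "$rows" | cut -f 4 | LC_ALL=C sort | uniq -c |
  sort -nr | head -n 1 | sed 's/^ *//')

# Columns 2 and 4 together (cut keeps the tab between them).
word_tag_first=$(printf '%s\n' "$rows" | cut -f 2,4 | head -n 1)

# -d changes the delimiter, e.g. to a comma.
csv_field=$(echo 'a,b,c' | cut -d, -f 2)

echo "$pos_top / $word_tag_first / $csv_field"
```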

cut exercises

• How often is ‘that’ used as a determiner (DT) “that rabbit” versus a complementizer (IN) “I know that they are plastic” versus a relative (WDT) “The class that I love”?
  • Hint: With grep -P, you can use '\t' for a tab character
• What determiners occur in the data? What are the 5 most common?

cut exercise solutions

• How often is ‘that’ used as a determiner (DT) “that rabbit” versus a complementizer (IN) “I know that they are plastic” versus a relative (WDT) “The class that I love”?
  cat parses.conll | grep -P '(that\t_\tDT)|(that\t_\tIN)|(that\t_\tWDT)' | cut -f 2,4 | sort | uniq -c
• What determiners occur in the data? What are the 5 most common?
  cat parses.conll | tr 'A-Z' 'a-z' | grep -P '\tdt\t' | cut -f 2,4 | sort | uniq -c | sort -nr
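`grep -P` is a GNU grep option and may not be available everywhere. As a hedged alternative sketch, the columns can be cut first and a literal tab placed in an ordinary `grep -E` pattern via a shell variable. The rows below are invented and imitate only the first four columns of the CoNLL file.

```shell
# A literal tab character, for use inside the pattern.
tab=$(printf '\t')

# Hypothetical rows: index, word, "_", part-of-speech tag.
rows=$(printf '1\tthat\t_\tDT\n2\trabbit\t_\tNN\n3\tthat\t_\tIN\n4\tthat\t_\tIN\n5\tthat\t_\tWDT\n')

# Keep word and tag, keep only "that" with one of the three tags, then count.
that_tags=$(printf '%s\n' "$rows" | cut -f 2,4 |
  grep -E "^that${tab}(DT|IN|WDT)\$" |
  LC_ALL=C sort | uniq -c | sed 's/^ *//')
echo "$that_tags"
```

Like the solution above, this match is case-sensitive, so a sentence-initial "That" would need a lowercasing step or an extra alternative in the pattern.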
