Jasminka Dobša - UNIGRAZ · PDF fileUsing R for Text mining Jasminka Dobša Faculty of Organization and Informatics University of Zagreb March 18th, 2015, Karl-Franzens 1 Uni Graz

Using R for Text mining

Jasminka Dobša Faculty of Organization and Informatics

University of Zagreb

1 March 18th, 2015, Karl-Franzens Uni Graz

Outline

•  Definition of text mining •  Vector space model or Bag of words repersentation

  Representation of documents and queries in VSM   Measure of similarity between document and query   Example   Prepocessing of text documents

•  Tools for text mining •  Using R for text minig

  Data import   Preprocessing of data   Creation of term-document matrix   Opearating with term-document matrix

March 18th, 2015, Karl-Franzens Uni Graz

2

Definition and basic concepts

•  Discipline of text mining is an integral part of the wider discipline of data mining

•  It is content-based processing of unstructured text documents and extraction of useful information from them

•  Content-based treatment: treatment based only on the content of the documents, and not on some metadata such as date of publication, source of publication, type of document or similar

•  Basic text mining tasks:   Information retrieval   Automatic classification of text documents


3

Information retrieval: definition and basic concepts

•  The problem: determining a set of documents from a collection that are relevant to a particular user query

•  Assessment of the relevance of documents is based on a comparison of index terms that appear in texts of documents and text of the query

•  Index term: basically any word or group of words that appear in the text

•  When the text is represented by set of index terms much of the semantics contained in the text is lost

•  Information about the interconnectedness between index terms is lost


4

Labels and basic models

•  Labels:   d j - document of collection of documents   q - query   ai , aj – vector representations of documents   q - vector representation of the query

•  Basic models:   Probabilistic   Boolean model   Vector model


5

Vector space model

•  Vector space model is today one of the most popular models for information retrieval

•  It is developed by Gerald Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)

•  In this model similarity between documents and queries can take real values in the interval [0,1]

•  Indexing terms are joined by weight •  Documents are ranked according to the degree of similarity with

the query •  Representation in the vector space model called Bag of words

representation •  Vector space model assumes independence of indexing terms


6

Vector space model

•  Notation   - set of index terms

  - set of documents   aij weight associated to pair (ti , dj) , i=1,2,..,m; j=1,2,...,n   qi , i=1,2,...,m weights associated to pair (ti ,q), q query

•  In the vector space model document dj , j=1,2,...n is represented by the vector aj =(aj1 , aj2 , ..., ajm )T

•  Query is represented by the vector q=(q1 , q2 , ..., qm )T •  Collection od documents is represented by term-document matrix


7

{ }mtttT ,,, 21 …=

{ }ndddD ,,, 21 …=

Term-document matrix

•  The terms are axes in the space

•  Documents and queries are points or vectors in space

•  Space is high-dimensional - dimension corresponds to the number of index terms

•  Documents usually contain only a few index terms from the list of index terms

•  Vectors representing documents are rare: most coordinator is equal to 0


8

=

↓↓↓

aaa

aaaaaa

A

ddd

mnmm

n

n

n

…

…

…

21

22221

11211

21

t

tt

m←

←

←

2

1

Rows represent concepts Columns represent documents

Measure of similarity

•  The similarity between the document and the query is measured by the angle between their representations

•  The measure of similarity between document and query is the cosine of the angle between them


9

qa

qaqa22

),(cosj

Tj

j =

Measure of similarity

•  Vector representations of documents usually are standardized so that its Euclidean norm is 1

•  Norm of a representation of the query does not affect the ranking of documents by relevance

•  The measure of similarity between document and query usually is measured by the dot product of vectors in the numerator


10

12=ja

2q

The process of assigning weights to index terms

•  Insisting on high recall reduces precision of retrieval and vice versa •  Therefore, more component to assign weight index terms are

introduced •  Components:

  Local - depends on the frequency of occurrence of the term in a particular document (Term frequency - TF component)

  Global - factor for the index term depends inversely on the number of documents in which the term appears (the inverse of the frequency of the term in the collection ( Inverse document frequency - IDF component)

  Normalization - vector representations of a documents are divided by Euclidean norm of representation (this avoids the situation that longer documents are often returned as relevant)


11

Advantages and disadvantages of the vector space model

•  Advantages   Measure the cosine similarity between documents and queries

anables to rank documents according to the importance for a query

  The scheme of assigning weight index terms increases the efficiency of information retrieval

•  Disadvantage   The assumption of independence of index terms

(questionable!)


12

Example

•  A collection of 15 documents (titles of books)   9 from the field of data mining   5 from the field of linear algebra   One that combines these two areas (application of linear

algebra to data mining)

•  A list of words is formed as follows:

  From the words contained in at least two documents   The words on the list of stop words are discarded   The words are reduced to their basic forms, ie. variations of

the word is mapped to the basic version of the word


Documents 1/2

D1 Survey of text mining: clustering, classification, and retrieval

D2 Automatic text processing: the transformation analysis and retrieval of information by computer

D3 Elementary linear algebra: A matrix approach

D4 Matrix algebra & its applications statistics and econometrics

D5 Effective databases for text & document management

D6 Matrices, vector spaces, and information retrieval

D7 Matrix analysis and applied linear algebra

D8 Topological vector spaces and algebras


Documents 2/2

D9 Information retrieval: data structures & algorithms

D10 Vector spaces and algebras for chemistry and physics

D11 Classification, clustering and data analysis

D12 Clustering of large data sets

D13 Clustering algorithms

D14 Document warehousing and text mining: techniques for improving business operations, marketing and sales

D15 Data mining and knowledge discovery


Terms Data mining terms Linear algebra terms Neutral terms

Text Linear Analysis Mining Algebra Application

Clustering Matrix Algorithm Classification Vector

Retrieval Space

Information

Document

Data


The concepts are mapped in their basic versions, eg. Applications, Applied → Application

Queries

•  Q1: Data mining   Relevant documents : All data mining documents

(blue)

•  Q2: Using linear algebra for data mining   Relevant document: D6


Rezultati pretraživanja weight bxx tfn

Document Q1 Q2 Q1 Q2

D1 0,4472 0,4472 0,3607 0,2424

D2 0 0 0 0

D3 0 1,1547 0 0,6417

D4 0 0,5774 0 0,1471

D5 0 0 0 0

D6 0 0 0 0

D7 0 0,8944 0 0,4598

D8 0 0,5774 0 0,1654

D9 0,5 0,5 0,2634 0,1771

D10 0 0,5774 0 0,1654

D11 0,5 0,5 0,2634 0,1770

D12 0,7071 0,7071 0,4488 0,3016

D13 0 0 0 0

D14 0,5774 0,5774 0,4092 0,2750

D15 1,4142 1,4142 1,0000 0,6720 18 March 18th, 2015, Karl-Franzens Uni Graz

Comments on search results

•  Consider that in the process of search are returned all documents whose measure of similarity with the query is greater than 0

•  For the first query (Data mining)   All documents are returned in the search process are relevant   Not all relevant documents are returned in the search   Only those documents containing the terms contained in the query are

returned - search is the lexical, but not semantical

•  For the second query (Using linear algebra for data mining)   None of the returned document is relevant   The only relevant document is not returned in the search

•  Search using weight bxx and tfn did not result in significantly different results presented in this case


19

Preprecessing of text

•  Index term   Any word that appears in the collection of documents   A series of several words that appears in the texts of

documents

•  Preprocessing of text includes a number of procedures which prepares the text for the processing and which increases processing efficiency itself

•  Some of the procedures are as follows:   Lexical processing   Removal of terms such as conjunctions, prepositions and

similar words that have low discriminatory value in information retrieval, so-called stop words

  Transformation of the words to their root form (stemming) or on their basic form (lemmatization)


20

•  Lexical processing consist of removing punctuation, removing of the bracket, the conversion of the letters in lower case

•  Bringing the words to root form consist of removing suffixes and prefixes

•  The alternative form of a word usually is obtained by adding suffixes

•  Most algorithms for reducing words to their stem form are limited to removing suffixes

•  The best known algorithm for reducing words to root form is Porter's algorithm developed for the English language

•  In addition to indexing terms which consist of only one word in the text can be identified and phrases that often occur, eg. New York


21

Tools for text mining

•  Today most of statistical software products offers text mining capabilities

•  This include:   Preprocessing of corpus (collection of documents)

  Data preparation   Importing of text data   Cleanig   Specific task of preprocessing (stop words removal,

stemming …)   Creation of term document matrix   Finding associations between terms and documents   Clustering of terms and documents   Automatic classification   Semantic information retrieval ....


22

Tools for text mining

•  Commercial   Clearforest   Copernik Summarizer   dtSearch   Insightful Infact   Inxight   SPSS Clementine   SAS Text Miner   TEMIS   WordStat   Statistica

•  Open Source   GATE   RapidMiner   Weka/KEA   R/tm


23

R: Packages

•  The main package in R for text mining is tm package •  The modular structure of tm enables integrating

existing functionalities from other text mining tool kids •  For example:

  Weka (via Rweka and Snowball)   Enable stemming and tokenization methods

  OpenNLP   Enables tokenization, sentence detection, part of speech

tagging

•  Advanced text mining methods beyond the scope of commercial products are available via extension text mining packages in R   Kernlab: string kernels   lsi: latent sematic indexing


24

Data import

•  The main structure for managing documents in tm is Corpus – collection of text documents

•  Possible implementations of corpus   Volatile corpus (default implementation)

  Corpus is R object held fully in memory   When R object is destroyed the corpus is also destroyed

  Permanent corpus   Documents of collection physically are stored outside of R

(in a database)   Corpus is not destroyed when corresponding R objects are

destroyed


25

Data import


26

•  Volatile corpus is created by command VCorpus(x, readerControl)   x: Source (input location); predefined sources are:

  DirSource: a directory   VectorSource: a vector interpreting each component as

document   DataframeSource: CSV, XLS files

  readaerControl: a list with components reader and language   reader constructs a corpus from elements delivered by the

source

•  In the case of permanent corpus there is third argument dbControl (controls the data base)

•  Storig of collection in the data base enables management of very large text collections, but shortcoming might be slower access performance

Readers

Reader Description

readPlain() Read in files as plain text ignoring matadata

readRCV1 Read in files in Reuters Corpus Volume 1 XML format

readReut21578XML() Read in files in Reuter-21578 XML format

readGmane() Read in Gmane RSS feeds

readNewsgroup() Read in newsgroup postings (e-mails)

readPDF() Read in PDF documents

readDOC() Read in MS Word documents

readHTML() Read in simply structured HTML documents


27

Data import: examples

•  Plain text files in the directory txt containing Latin texts by Roman poet Ovid can be read by:   txt<-system.file("texts","txt",package="tm")   ovid<-VCorpus(DirSource(txt,encoding="UTF-8"),

readerControl=list(language="lat"))

  > ovid   <<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

•  By VectorSource corpus can be created by character vectos   docs<-c("This is a text.","This another one.")

  corp<-VCorpus(VectorSource(docs))   > corp   <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>


28

Data import: examples

•  Creation of corpus for some Ruters documents   reut21578<-

system.file("texts","crude",package="tm")

  reuters<-VCorpus(DirSource(reut21578), readerControl=list(reader=readReut21578XMLasPlain))

  > reuters   <<VCorpus (documents: 20, metadata (corpus/indexed): 0/0)>>


29

Reuters-21578 collection of documents

•  The most widely used data set for classification •  It consists of Reuters reports from 1987 •  Divided into documents for learning and documents for

testing (ModApte Split) •  Documents for learning: reports from begining of 1987

to 7 April 1987 •  Documents for verification: Reports from 8 April until

the end of 1987 •  9603 document for learning and 3299 document for

testing •  10 large classes, 90 classes for which there is at least

one document for learning and at least one document for testing, totaly 118 classes

•  There are different class March 18th, 2015, Karl-Franzens Uni Graz

30

Reuters-21578 collection of documents

•  10 big classes


31

Class No of train documents

No of test documents

Earn 2877 1087

Acquisitions 1650 719

Money-fx 538 179

Grain 433 149

Crude 389 189

Trade 369 118

Interest 347 131

Ship 197 89

Wheat 212 71

Corn 182 56

Inspecting corpora

•  Full content of text documents is displayed with command inspect()   inspect(ovid[1])   <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>> Si quis in hoc artem populo non novit amandi, hoc legat et lecto carmine doctus amet. arte citae veloque rates remoque moventur, arte leves currus: arte regendus amor.

curribus Automedon lentisque erat aptus habenis, Tiphys in Haemonia puppe magister erat: me Venus artificem tenero praefecit Amori; Tiphys et Automedon dicar Amoris ego.

ille quidem ferus est et qui mihi saepe repugnet: sed puer est, aetas mollis et apta regi.

Phillyrides puerum cithara perfecit Achillem, atque animos placida contudit arte feros. qui totiens socios, totiens exterruit hostes,   creditur annosum pertimuisse senem. March 18th, 2015, Karl-Franzens Uni Graz

32

Inspecting corpora

•  inspect(corp)   <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>> This is a text.

[[2]] <<PlainTextDocument (metadata: 7)>> This another one.


33

Transformations

•  Typical transformations are stemming, stopwords removal...

•  Transformations are obtained by command tm_map() •  Extra whitespace is eliminated by:

  reuters<-tm_map(reuters,stripWhitespace) •  Conversation to lower case:

  reuters<-tm_map(reuters,content_transformer(tolower))

•  Removal of stopwords:   reuters<-tm_map(reuters,removeWords,stopwords("english"))

•  Removal of punctation   reuters<-tm_map(reuters,removePunctuation)

•  Removal of numbers   reuters<-tm_map(reuters,removeNumbers)


34

English stop words

•  stopwords("english") [1] "i" "me" "my" "myself" "we" [6] "our" "ours" "ourselves" "you" "your" [11] "yours" "yourself" "yourselves" "he" "him" [16] "his" "himself" "she" "her" "hers" [21] "herself" "it" "its" "itself" "they" [26] "them" "their" "theirs" "themselves" "what" [31] "which" "who" "whom" "this" "that" [36] "these" "those" "am" "is" "are" [41] "was" "were" "be" "been" "being" [46] "have" "has" "had" "having" "do" [51] "does" "did" "doing" "would" "should"

....and so on March 18th, 2015, Karl-Franzens Uni Graz

35

German stop words

•  stopwords("german")

[1] "aber" "alle" "allem" "allen" "aller" "alles" [7] "als" "also" "am" "an" "ander" "andere" [13] "anderem" "anderen" "anderer" "anderes" "anderm" "andern" [19] "anderr" "anders" "auch" "auf" "aus" "bei" [25] "bin" "bis" "bist" "da" "damit" "dann" [31] "der" "den" "des" "dem" "die" "das" [37] "daß" "derselbe" "derselben" "denselben" "desselben"

"demselben" [43] "dieselbe" "dieselben" "dasselbe" "dazu" "dein" "deine" [49] "deinem" "deinen" "deiner" "deines" "denn" "derer"

.... and so on


36

Transformations: stemming

> inspect(reuters[1]) <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

diamond shamrock corp said effective today cut contract prices crude oil 1.50 dlrs barrel. reduction brings posted price west texas intermediate 16.00 dlrs barrel, copany said. " price reduction today made light falling oil product prices weak crude oil market," company spokeswoman said. diamond latest line u.s. oil companies cut contract, posted, prices last two days citing weak oil markets. Reuter


37

Transformations: stemming

reuters<-tm_map(reuters, stemDocument)

> inspect(reuters[1])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

diamond shamrock corp said effect today cut contract price crude oil 1.50 dlrs barrel. reduct bring post price west texa intermedi 16.00 dlrs barrel, copani said. " price reduct today made light fall oil product price weak crude oil market," compani spokeswoman said. diamond latest line u.s. oil compani cut contract, posted, price last two day cite weak oil markets. reuter


38

Creating term-document matrix

•  Term-document matrix   Terms: rows   Documents: columns

•  Documet-term matrix   Terms: columns   Documents: rows

•  dtm<-DocumentTermMatrix(reuters) •  tdm<-TermDocumentMatrix(reuters)


39

> dim(dtm) [1] 20 1208 Matrix of dimension 20 (documents)×1208 (terms) > dim(tdm) [1] 1208 20 Matrix of dimension 1208 (terms) × 20 (documents)

Creating term-document matrix


40

> inspect(dtm[5:10,740:743]) <<DocumentTermMatrix (documents: 6, terms: 4)>> Non-/sparse entries: 2/22 Sparsity : 92% Maximal term length: 7 Weighting : term frequency (tf) Terms Docs negat negoti neither net character(0) 0 0 0 2 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 1 0 0 0

Incorporate weighting (TfIdf)

> dtm<-DocumentTermMatrix(reuters,control=list(weighting=weightTfIdf))


41

> inspect(dtm[5:10,740:743]) <<DocumentTermMatrix (documents: 6, terms: 4)>> Non-/sparse entries: 2/22 Sparsity : 92% Maximal term length: 7 Weighting : term frequency - inverse document frequency (normalized) (tf-idf) Terms Docs negat negoti neither net character(0) 0.00000000 0 0 0.08003571 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.01338058 0 0 0.00000000

Operating with Term-Document Matrix

•  Finding frequent terms in collection


42

findFreqTerms(dtm,10) [1] "accord" "barrel" "bpd" "crude" "dlrs" "futur" "govern" [8] "kuwait" "last" "market" "meet" "mln" "new" "offici" [15] "oil" "one" "opec" "pct" "price" "product" "reuter" [22] "said" "said." "saudi" "say" "sheikh" "u.s." "will"

Operating with term-document matrix

•  Finding associations between terms


43

> findAssocs(dtm,"opec",0.8) opec oil 0.88 emerg 0.87 15.8 0.86 analyst 0.85 meet 0.85 said 0.84 want 0.82 abil 0.80


•  Removing sparse terms: terms which have at least x% of zeros in their representation


44

> inspect(removeSparseTerms(dtm,0.4)) <<DocumentTermMatrix (documents: 20, terms: 4)>> Non-/sparse entries: 73/7 Sparsity : 9% Maximal term length: 6 Weighting : term frequency (tf) Terms Docs oil price reuter said Id_1 5 5 1 1 Id_2 11 4 2 9 Id_3 2 2 1 1 Id_4 1 2 1 1 Id_5 1 0 1 3 ...

Creating term-document matrix with given dictionary


45

> inspect(DocumentTermMatrix(reuters,list(dictionary=c("prices", "crude","oil")))) <<DocumentTermMatrix (documents: 20, terms: 3)>> Non-/sparse entries: 29/31 Sparsity : 52% Maximal term length: 6 Weighting : term frequency (tf) Terms Docs crude oil prices Id_1 2 5 0 Id_2 0 11 0 Id_3 2 2 0 Id_4 3 1 0 Id_5 0 1 0 Id_6 1 7 0 Id_7 0 3 0 Id_8 0 3 0 Id_9 0 4 0 Id_10 0 9 0

Clustering of terms

•  Ploting dendrogram of hierarhical clusterig of terms •  plot(hclust(dist(removeSparseTerms(tdm,

0.6)),method="ward.D"))


46

EXAMPLE

Starkweather, J. (2014). Introduction to basic Text Mining in R. http://it.unt.edu/benchmarks/issues/2014/01/rss-matters


47

Text: University of North Texas (UNT) policy that governs Research and Statistical Support services (RSS)


48

Importing of data and cleaning

•  Reading by command readLines

  policy.HTML.page<-readLines("http://policy.unt.edu/policy/3-5")

•  Identification of relevant parts of the text, cleaning of data....   id.1<-3+which(policy.HTML.page==" TOTAL

UNIVERSITY </div>")

  id.2<-id.1+5

  text.data<-d[id.1:id.2]

•  After little bit of cleaning we get the text....


49

Text data


50

> text.d [1] "Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand." [2] "The primary goal of Academic Computing Services is to assist faculty and students in using computing facilities. The emphasis here is on "assist." Some of the facilities users might require are statistical programming systems such as SPSS and SAS, programming languages such as FORTRAN and C, and/or other specialized academic software. The staff can also be of assistance in the design stage of research or instructional projects so that ease of use of the computing facilities is assured. With ever greater emphasis by all disciplines on computing, it is desirable for individual faculty members and students to learn to deal with the computing facilities directly, without the need for intervention by the professional staff." [3] "In order to facilitate use, therefore, Academic Computing Services will aid individuals to learn the use of the hardware and software facilities available. Under ordinary circumstances Academic Computing Services personnel will not literally do projects for an individual, but will try to provide advice on how to complete the project most efficiently and effectively. In addition to limits on personnel time, it is not generally a good idea for a person working on a research project to maintain a position remote from the analysis of their data. The analysis of research data is an intrinsic part of the research process, and it requires the same care and understanding of the research assumptions as does the design and collection of the data."

.........

Conversion of text to corpus

Loading of tm package and conversion of characters of strings to corpus by command VectorSource >library(tm)

  txt<-VectorSource(text.d)

> txt.corpus<-Corpus(txt)


51

Corpus


52

> inspect(txt.corpus[1]) <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand.

Transformations / preprocessing

•  Conversion in lower case   txt.corpus<-tm_map(txt.corpus,tolower)

•  Removing of punctuation   txt.corpus<-tm_map(txt.corpus,removePunctuation)

•  Removing of numbers   txt.corpus<-tm_map(txt.corpus,removeNumbers)

•  Removing of stop words   txt.corpus<-tm_map(txt.corpus,removeWords,stopwords("english"))

•  Reduction to root form of words   > library(SnowballC)   > txt.corpus<-tm_map(txt.corpus,stemDocument)


53

Creating of term-document matrix

•  tdm<-TermDocumentMatrix(txt.corpus)


54

> inspect(tdm[1:10,1:5]) <<TermDocumentMatrix (terms: 10, documents: 5)>> Non-/sparse entries: 13/37 Sparsity : 74% Maximal term length: 10 Weighting : term frequency (tf) Docs Terms character(0) character(0) character(0) character(0) character(0) academ 3 2 2 0 2 accomplish 0 0 0 0 1 acquir 0 0 0 1 0 addit 0 0 1 0 0 advic 0 0 1 0 0 aid 0 0 1 0 0 also 0 1 0 0 0 amount 0 0 0 2 0 analysi 0 0 2 0 0 andor 0 1 0 0 0 >


•  Finding frequent terms   > findFreqTerms(x=tdm,lowfreq=8,highfreq=Inf)   [1] "academ" "comput" "facil" "research" "servic" "use„

•  Finding correlated words •  findAssocs(x=tdm,term="comput",corlimit=0.6) •  comput •  faculti 0.82 •  student 0.82 •  academ 0.75 •  comprehens 0.70 •  cooper 0.70 •  defin 0.70 •  hand 0.70 •  imper 0.70 •  intend 0.70 •  made 0.70 •  may 0.70 •  one 0.70 •  particular 0.70


55


•  Removal of unfrequent terms   Removig of terms whose representation has sparsity at least 60%

  tdm.common.60<-removeSparseTerms(x=tdm,sparse=0.60)

  Removig of terms whose representation has sparsity at least 20%

  tdm.common.20<-removeSparseTerms(x=tdm,sparse=0.20)


56


57

tdm <<TermDocumentMatrix (terms: 161, documents: 6)>> Non-/sparse entries: 251/715 Sparsity : 74% Maximal term length: 14 Weighting : term frequency (tf)

> tdm.common.60 <<TermDocumentMatrix (terms: 17, documents: 6)>> Non-/sparse entries: 66/36 Sparsity : 35% Maximal term length: 10 Weighting : term frequency (tf)

> tdm.common.20 <<TermDocumentMatrix (terms: 4, documents: 6)>> Non-/sparse entries: 22/2 Sparsity : 8% Maximal term length: 9 Weighting : term frequency (tf)

Hierarchical clustering of index terms

•  plot(hclust(stats::dist(tdm.common.20),method="ward.D2"))


58

Hierarchical clustering of index terms

•  > clust<-hclust(stats::dist(tdm.common.60),method="ward.D2")

•  > plot(clust)


59

Cutting a dendrogram

•  cutree(clust,k=4) •  academic assist best can center computing facilities 1 2 2 2 2 3 1 faculty personnel provide requires resources services software 2 2 4 2 2 1 4 staff students use 4 2 1


60

Cutting a dendrogram


61

1

3

4 2

Clustering of documents •  dtm<-DocumentTermMatrix(txt.corpus) •  > clust<-hclust(stats::dist(dtm),method="ward.D2") •  > plot(clust)


62

Wordcloud

•  Package wordcloud library(wordcloud)

> (v<-sort(rowSums(as.matrix(tdm)),decreasing=TRUE))

computing academic services facilities use 21 10 10 9 9 provide software staff will faculty 7 7 7 7 6 project research students assist can 5 5 5 4 4 center information personnel programming requires

4 4 4 4 4


63

Wordcloud

•  (myNames<-names(v)) •  (d<-data.frame(word=myNames,freq=v))


64

word freq computing computing 21 academic academic 10 services services 10 facilities facilities 9 use use 9 provide provide 7 software software 7 staff staff 7 will will 7 faculty faculty 6

Wordcloud

•  wordcloud(d$word,d$freq,min.freq=3)


65

References

•  R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison-Wesley, New York, 1999.

•  Starkweather, J. (2014). Introduction to basic Text Mining in R. Pridobljeno na http://it.unt.edu/benchmarks/issues/2014/01/rss-matters

•  Feinerer, I. (2014). Introduction to the tm Package Text Mining in R. http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

•  Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1–54. http://www.jstatsoft.org/v25/i05/paper


66