Using R for Text mining
Jasminka Dobša Faculty of Organization and Informatics
University of Zagreb
March 18th, 2015, Karl-Franzens Uni Graz
Outline
• Definition of text mining
• Vector space model or Bag of words representation
  Representation of documents and queries in VSM
  Measure of similarity between document and query
  Example
  Preprocessing of text documents
• Tools for text mining
• Using R for text mining
  Data import
  Preprocessing of data
  Creation of term-document matrix
  Operating with term-document matrix
Definition and basic concepts
• The discipline of text mining is an integral part of the wider discipline of data mining
• It is content-based processing of unstructured text documents and extraction of useful information from them
• Content-based treatment: treatment based only on the content of the documents, and not on some metadata such as date of publication, source of publication, type of document or similar
• Basic text mining tasks:
  Information retrieval
  Automatic classification of text documents
Information retrieval: definition and basic concepts
• The problem: determining a set of documents from a collection that are relevant to a particular user query
• Assessment of the relevance of documents is based on a comparison of index terms that appear in texts of documents and text of the query
• Index term: basically any word or group of words that appears in the text
• When the text is represented by a set of index terms, much of the semantics contained in the text is lost
• Information about the interconnectedness between index terms is lost
Labels and basic models
• Labels:
  d_j - a document from the collection of documents
  q - a query
  a_i, a_j - vector representations of documents
  q - vector representation of the query
• Basic models:
  Boolean model
  Probabilistic model
  Vector model
Vector space model
• Vector space model is today one of the most popular models for information retrieval
• It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)
• In this model similarity between documents and queries can take real values in the interval [0,1]
• Index terms are assigned weights
• Documents are ranked according to the degree of similarity with the query
• The representation in the vector space model is called the Bag of words representation
• The vector space model assumes independence of index terms
Vector space model
• Notation
  T = {t_1, t_2, ..., t_m} - set of index terms
  D = {d_1, d_2, ..., d_n} - set of documents
  a_ij - weight associated with the pair (t_i, d_j), i = 1, 2, ..., m; j = 1, 2, ..., n
  q_i, i = 1, 2, ..., m - weights associated with the pair (t_i, q), where q is the query
• In the vector space model a document d_j, j = 1, 2, ..., n is represented by the vector a_j = (a_j1, a_j2, ..., a_jm)^T
• The query is represented by the vector q = (q_1, q_2, ..., q_m)^T
• The collection of documents is represented by the term-document matrix
Term-document matrix
• The terms are axes in the space
• Documents and queries are points or vectors in space
• Space is high-dimensional - dimension corresponds to the number of index terms
• Documents usually contain only a few index terms from the list of index terms
• Vectors representing documents are sparse: most coordinates are equal to 0
         d_1   d_2   ...   d_n
          ↓     ↓           ↓
    [ a_11  a_12  ...  a_1n ]  ← t_1
A = [ a_21  a_22  ...  a_2n ]  ← t_2
    [  ...   ...  ...   ... ]
    [ a_m1  a_m2  ...  a_mn ]  ← t_m

Rows represent terms; columns represent documents
Measure of similarity
• The similarity between the document and the query is measured by the angle between their representations
• The measure of similarity between document and query is the cosine of the angle between them
cos(a_j, q) = (a_j^T q) / (||a_j||_2 ||q||_2)
Measure of similarity
• Vector representations of documents are usually normalized so that their Euclidean norm is 1
• The norm of the representation of the query does not affect the ranking of documents by relevance
• The measure of similarity between document and query is therefore usually computed as just the dot product in the numerator
||a_j||_2 = 1,  and ||q||_2 is constant over all documents
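As a minimal base-R sketch (the function name is my own, not from the slides), the cosine measure above can be computed directly:

```r
# Cosine similarity between a document vector a and a query vector q;
# returns 0 for an all-zero vector instead of NaN.
cosine_sim <- function(a, q) {
  na <- sqrt(sum(a^2)); nq <- sqrt(sum(q^2))
  if (na == 0 || nq == 0) return(0)
  sum(a * q) / (na * nq)
}

a1 <- c(1, 1, 0, 0)   # toy document vector over 4 index terms
q  <- c(1, 0, 0, 0)   # toy query vector
cosine_sim(a1, q)     # 1/sqrt(2)
```

With a1 normalized to unit norm, only the dot product would remain, which is exactly the simplification described above.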
The process of assigning weights to index terms
• Insisting on high recall reduces precision of retrieval and vice versa
• Therefore, several components for assigning weights to index terms are introduced
• Components:
  Local - depends on the frequency of occurrence of the term in a particular document (term frequency - TF component)
  Global - the factor for an index term depends inversely on the number of documents in which the term appears (inverse document frequency - IDF component)
  Normalization - vector representations of documents are divided by the Euclidean norm of the representation (this avoids longer documents being favored as relevant)
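The three components can be sketched in base R as follows (my own helper, not tm's weightTfIdf, which uses a slightly different formula; it assumes every document contains at least one term with nonzero IDF):

```r
# TF-IDF weighting of a term-document matrix A (terms in rows, raw counts),
# followed by normalization of each document column to unit Euclidean norm.
tf_idf <- function(A) {
  idf <- log(ncol(A) / rowSums(A > 0))  # global (IDF) component
  W <- A * idf                          # local (TF) times global component
  sweep(W, 2, sqrt(colSums(W^2)), "/")  # normalization component
}

A <- rbind(t1 = c(2, 0, 0), t2 = c(1, 1, 0), t3 = c(0, 1, 1))
round(colSums(tf_idf(A)^2), 10)  # each column now has unit norm
```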
Advantages and disadvantages of the vector space model
• Advantages
  The cosine measure of similarity between documents and queries enables ranking documents according to their importance for a query
  The scheme of assigning weights to index terms increases the effectiveness of information retrieval
• Disadvantage
  The assumption of independence of index terms (questionable!)
Example
• A collection of 15 documents (titles of books)
  9 from the field of data mining
  5 from the field of linear algebra
  One that combines these two areas (application of linear algebra to data mining)
• A list of words is formed as follows:
  From the words contained in at least two documents
  The words on the list of stop words are discarded
  The words are reduced to their basic forms, i.e. variations of a word are mapped to its basic version
Documents 1/2
D1 Survey of text mining: clustering, classification, and retrieval
D2 Automatic text processing: the transformation analysis and retrieval of information by computer
D3 Elementary linear algebra: A matrix approach
D4 Matrix algebra & its applications to statistics and econometrics
D5 Effective databases for text & document management
D6 Matrices, vector spaces, and information retrieval
D7 Matrix analysis and applied linear algebra
D8 Topological vector spaces and algebras
Documents 2/2
D9 Information retrieval: data structures & algorithms
D10 Vector spaces and algebras for chemistry and physics
D11 Classification, clustering and data analysis
D12 Clustering of large data sets
D13 Clustering algorithms
D14 Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15 Data mining and knowledge discovery
Terms
Data mining terms: Text, Mining, Clustering, Classification, Retrieval, Information, Document, Data
Linear algebra terms: Linear, Algebra, Matrix, Vector, Space
Neutral terms: Analysis, Application, Algorithm
The terms are mapped to their basic versions, e.g. Applications, Applied → Application
Queries
• Q1: Data mining
  Relevant documents: all data mining documents (blue)
• Q2: Using linear algebra for data mining
  Relevant document: D6
Search results for weights bxx and tfn

Document   Q1 (bxx)   Q2 (bxx)   Q1 (tfn)   Q2 (tfn)
D1         0.4472     0.4472     0.3607     0.2424
D2         0          0          0          0
D3         0          1.1547     0          0.6417
D4         0          0.5774     0          0.1471
D5         0          0          0          0
D6         0          0          0          0
D7         0          0.8944     0          0.4598
D8         0          0.5774     0          0.1654
D9         0.5        0.5        0.2634     0.1771
D10        0          0.5774     0          0.1654
D11        0.5        0.5        0.2634     0.1770
D12        0.7071     0.7071     0.4488     0.3016
D13        0          0          0          0
D14        0.5774     0.5774     0.4092     0.2750
D15        1.4142     1.4142     1.0000     0.6720
Comments on search results
• Assume that the search returns all documents whose measure of similarity with the query is greater than 0
• For the first query (Data mining)
  All documents returned by the search are relevant
  Not all relevant documents are returned
  Only documents containing the terms contained in the query are returned - the search is lexical, not semantic
• For the second query (Using linear algebra for data mining)
  None of the returned documents is relevant
  The only relevant document is not returned
• Search using weights bxx and tfn did not give significantly different results in this case
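The retrieval rule used in this example (return every document with positive cosine similarity, ranked by score) can be sketched in base R; the matrix and names below are illustrative toys, not the 15-document collection:

```r
# Rank documents by cosine similarity to the query; keep scores > 0.
rank_docs <- function(A, q) {
  sims <- apply(A, 2, function(a) {
    d <- sqrt(sum(a^2)) * sqrt(sum(q^2))
    if (d == 0) 0 else sum(a * q) / d
  })
  sort(sims[sims > 0], decreasing = TRUE)
}

A <- cbind(D1 = c(1, 1, 0), D2 = c(0, 0, 1), D3 = c(1, 0, 0))
rownames(A) <- c("data", "mining", "algebra")
rank_docs(A, q = c(1, 1, 0))  # D1 ranked first; D2 not returned at all
```

Note that a document sharing no terms with the query (D2 here) gets score 0 and is never returned, which is exactly the lexical-not-semantic behavior discussed above.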
Preprocessing of text
• Index term
  Any word that appears in the collection of documents
  A sequence of several words that appears in the texts of documents
• Preprocessing of text includes a number of procedures which prepare the text for processing and which increase processing efficiency
• Some of the procedures are as follows:
  Lexical processing
  Removal of terms such as conjunctions, prepositions and similar words that have low discriminatory value in information retrieval, so-called stop words
  Transformation of words to their root form (stemming) or to their basic form (lemmatization)
• Lexical processing consists of removing punctuation, removing brackets, and converting letters to lower case
• Reducing words to their root form consists of removing suffixes and prefixes
• The alternative form of a word is usually obtained by adding suffixes
• Most algorithms for reducing words to their stem form are limited to removing suffixes
• The best known algorithm for reducing words to root form is Porter's algorithm, developed for the English language
• In addition to index terms consisting of only one word, phrases that often occur in the text can be identified, e.g. New York
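A deliberately crude suffix-stripping sketch in base R (nowhere near the full Porter algorithm, which applies ordered rule sets with conditions on the remaining stem):

```r
# Strip a few common English suffixes; real stemmers such as Porter's
# handle many more rules and check properties of the stem that remains.
strip_suffix <- function(words) sub("(ing|ed|es|s)$", "", words)

strip_suffix(c("clusters", "indexed", "mining"))
```

Even this toy version shows why stemmed output like "reduct" or "compani" (seen in the Reuters example later) looks truncated: stems need not be dictionary words.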
Tools for text mining
• Today most statistical software products offer text mining capabilities
• These include:
  Preprocessing of a corpus (collection of documents)
    Data preparation
    Importing of text data
    Cleaning
    Specific preprocessing tasks (stop words removal, stemming ...)
  Creation of the term-document matrix
  Finding associations between terms and documents
  Clustering of terms and documents
  Automatic classification
  Semantic information retrieval
  ....
Tools for text mining
• Commercial
  ClearForest
  Copernic Summarizer
  dtSearch
  Insightful InFact
  Inxight
  SPSS Clementine
  SAS Text Miner
  TEMIS
  WordStat
  Statistica
• Open Source
  GATE
  RapidMiner
  Weka/KEA
  R/tm
R: Packages
• The main package in R for text mining is the tm package
• The modular structure of tm enables integrating existing functionality from other text mining toolkits
• For example:
  Weka (via RWeka and Snowball)
    Enables stemming and tokenization methods
  OpenNLP
    Enables tokenization, sentence detection, part-of-speech tagging
• Advanced text mining methods beyond the scope of commercial products are available via extension packages in R
  kernlab: string kernels
  lsa: latent semantic analysis
Data import
• The main structure for managing documents in tm is the Corpus - a collection of text documents
• Possible implementations of a corpus
  Volatile corpus (default implementation)
    The corpus is an R object held fully in memory
    When the R object is destroyed the corpus is also destroyed
  Permanent corpus
    Documents of the collection are physically stored outside of R (in a database)
    The corpus is not destroyed when corresponding R objects are destroyed
Data import
• A volatile corpus is created by the command VCorpus(x, readerControl)
  x: Source (input location); predefined sources are:
    DirSource: a directory
    VectorSource: a vector, interpreting each component as a document
    DataframeSource: CSV, XLS files
  readerControl: a list with components reader and language
    reader constructs a corpus from elements delivered by the source
• In the case of a permanent corpus there is a third argument dbControl (which controls the database)
• Storing the collection in a database enables management of very large text collections, but a shortcoming might be slower access performance
Readers
Reader Description
readPlain() Read in files as plain text ignoring metadata
readRCV1 Read in files in Reuters Corpus Volume 1 XML format
readReut21578XML() Read in files in Reuter-21578 XML format
readGmane() Read in Gmane RSS feeds
readNewsgroup() Read in newsgroup postings (e-mails)
readPDF() Read in PDF documents
readDOC() Read in MS Word documents
readHTML() Read in simply structured HTML documents
Data import: examples
• Plain text files in the directory txt containing Latin texts by Roman poet Ovid can be read by: txt<-system.file("texts","txt",package="tm") ovid<-VCorpus(DirSource(txt,encoding="UTF-8"),
readerControl=list(language="lat"))
> ovid <<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
• By VectorSource a corpus can be created from a character vector
docs<-c("This is a text.","This another one.")
corp<-VCorpus(VectorSource(docs)) > corp <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
Data import: examples
• Creation of a corpus for some Reuters documents
reut21578<-system.file("texts","crude",package="tm")
reuters<-VCorpus(DirSource(reut21578), readerControl=list(reader=readReut21578XMLasPlain))
> reuters <<VCorpus (documents: 20, metadata (corpus/indexed): 0/0)>>
Reuters-21578 collection of documents
• The most widely used data set for text classification
• It consists of Reuters newswire reports from 1987
• Divided into documents for learning and documents for testing (ModApte split)
• Documents for learning: reports from the beginning of 1987 to 7 April 1987
• Documents for testing: reports from 8 April until the end of 1987
• 9603 documents for learning and 3299 documents for testing
• 10 large classes; 90 classes with at least one document for learning and at least one document for testing; 118 classes in total
• The classes are of very different sizes
30
Reuters-21578 collection of documents
• 10 big classes
Class          No of train documents   No of test documents
Earn 2877 1087
Acquisitions 1650 719
Money-fx 538 179
Grain 433 149
Crude 389 189
Trade 369 118
Interest 347 131
Ship 197 89
Wheat 212 71
Corn 182 56
Inspecting corpora
• Full content of text documents is displayed with command inspect() inspect(ovid[1]) <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]] <<PlainTextDocument (metadata: 7)>> Si quis in hoc artem populo non novit amandi, hoc legat et lecto carmine doctus amet. arte citae veloque rates remoque moventur, arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis, Tiphys in Haemonia puppe magister erat: me Venus artificem tenero praefecit Amori; Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet: sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem, atque animos placida contudit arte feros. qui totiens socios, totiens exterruit hostes, creditur annosum pertimuisse senem.
Inspecting corpora
• inspect(corp) <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>
[[1]] <<PlainTextDocument (metadata: 7)>> This is a text.
[[2]] <<PlainTextDocument (metadata: 7)>> This another one.
Transformations
• Typical transformations are stemming, stop word removal...
• Transformations are applied with the command tm_map()
• Extra whitespace is eliminated by:
reuters<-tm_map(reuters,stripWhitespace)
• Conversion to lower case:
reuters<-tm_map(reuters,content_transformer(tolower))
• Removal of stop words:
reuters<-tm_map(reuters,removeWords,stopwords("english"))
• Removal of punctuation:
reuters<-tm_map(reuters,removePunctuation)
• Removal of numbers:
reuters<-tm_map(reuters,removeNumbers)
English stop words
• stopwords("english") [1] "i" "me" "my" "myself" "we" [6] "our" "ours" "ourselves" "you" "your" [11] "yours" "yourself" "yourselves" "he" "him" [16] "his" "himself" "she" "her" "hers" [21] "herself" "it" "its" "itself" "they" [26] "them" "their" "theirs" "themselves" "what" [31] "which" "who" "whom" "this" "that" [36] "these" "those" "am" "is" "are" [41] "was" "were" "be" "been" "being" [46] "have" "has" "had" "having" "do" [51] "does" "did" "doing" "would" "should"
.... and so on
35
German stop words
• stopwords("german")
[1] "aber" "alle" "allem" "allen" "aller" "alles" [7] "als" "also" "am" "an" "ander" "andere" [13] "anderem" "anderen" "anderer" "anderes" "anderm" "andern" [19] "anderr" "anders" "auch" "auf" "aus" "bei" [25] "bin" "bis" "bist" "da" "damit" "dann" [31] "der" "den" "des" "dem" "die" "das" [37] "daß" "derselbe" "derselben" "denselben" "desselben"
"demselben" [43] "dieselbe" "dieselben" "dasselbe" "dazu" "dein" "deine" [49] "deinem" "deinen" "deiner" "deines" "denn" "derer"
.... and so on
March 18th, 2015, Karl-Franzens Uni Graz
36
Transformations: stemming
> inspect(reuters[1]) <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effective today cut contract prices crude oil 1.50 dlrs barrel. reduction brings posted price west texas intermediate 16.00 dlrs barrel, copany said. " price reduction today made light falling oil product prices weak crude oil market," company spokeswoman said. diamond latest line u.s. oil companies cut contract, posted, prices last two days citing weak oil markets. Reuter
Transformations: stemming
reuters<-tm_map(reuters, stemDocument)
> inspect(reuters[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
diamond shamrock corp said effect today cut contract price crude oil 1.50 dlrs barrel. reduct bring post price west texa intermedi 16.00 dlrs barrel, copani said. " price reduct today made light fall oil product price weak crude oil market," compani spokeswoman said. diamond latest line u.s. oil compani cut contract, posted, price last two day cite weak oil markets. reuter
Creating term-document matrix
• Term-document matrix
  Terms: rows
  Documents: columns
• Document-term matrix
  Terms: columns
  Documents: rows
• dtm<-DocumentTermMatrix(reuters)
• tdm<-TermDocumentMatrix(reuters)
> dim(dtm) [1] 20 1208 Matrix of dimension 20 (documents)×1208 (terms) > dim(tdm) [1] 1208 20 Matrix of dimension 1208 (terms) × 20 (documents)
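To make the structure concrete, here is a small base-R sketch that builds a term-document matrix of raw term frequencies from scratch (toy tokenizer and my own function name; TermDocumentMatrix does this, plus weighting and sparse storage, for a whole corpus):

```r
# Build a term-document matrix (terms in rows, documents in columns)
# from a character vector of documents, using a naive tokenizer.
build_tdm <- function(docs) {
  tokens <- lapply(docs, function(d) {
    tk <- strsplit(tolower(d), "[^a-z]+")[[1]]
    tk[nzchar(tk)]                       # drop empty tokens
  })
  terms <- sort(unique(unlist(tokens)))
  m <- sapply(tokens, function(tk) table(factor(tk, levels = terms)))
  rownames(m) <- terms
  colnames(m) <- paste0("d", seq_along(docs))
  m
}

build_tdm(c("Text mining", "mining of data"))
```

Transposing the result gives the document-term matrix orientation used by DocumentTermMatrix.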
Creating term-document matrix
> inspect(dtm[5:10,740:743]) <<DocumentTermMatrix (documents: 6, terms: 4)>> Non-/sparse entries: 2/22 Sparsity : 92% Maximal term length: 7 Weighting : term frequency (tf) Terms Docs negat negoti neither net character(0) 0 0 0 2 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 0 0 0 0 character(0) 1 0 0 0
Incorporate weighting (TfIdf)
> dtm<-DocumentTermMatrix(reuters,control=list(weighting=weightTfIdf))
> inspect(dtm[5:10,740:743]) <<DocumentTermMatrix (documents: 6, terms: 4)>> Non-/sparse entries: 2/22 Sparsity : 92% Maximal term length: 7 Weighting : term frequency - inverse document frequency (normalized) (tf-idf) Terms Docs negat negoti neither net character(0) 0.00000000 0 0 0.08003571 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.00000000 0 0 0.00000000 character(0) 0.01338058 0 0 0.00000000
Operating with Term-Document Matrix
• Finding frequent terms in collection
findFreqTerms(dtm,10) [1] "accord" "barrel" "bpd" "crude" "dlrs" "futur" "govern" [8] "kuwait" "last" "market" "meet" "mln" "new" "offici" [15] "oil" "one" "opec" "pct" "price" "product" "reuter" [22] "said" "said." "saudi" "say" "sheikh" "u.s." "will"
Operating with term-document matrix
• Finding associations between terms
> findAssocs(dtm,"opec",0.8) opec oil 0.88 emerg 0.87 15.8 0.86 analyst 0.85 meet 0.85 said 0.84 want 0.82 abil 0.80
Operating with term-document matrix
• Removing sparse terms: terms which have at least x·100% zeros in their representation are removed
> inspect(removeSparseTerms(dtm,0.4)) <<DocumentTermMatrix (documents: 20, terms: 4)>> Non-/sparse entries: 73/7 Sparsity : 9% Maximal term length: 6 Weighting : term frequency (tf) Terms Docs oil price reuter said Id_1 5 5 1 1 Id_2 11 4 2 9 Id_3 2 2 1 1 Id_4 1 2 1 1 Id_5 1 0 1 3 ...
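On a plain matrix, the effect of removeSparseTerms can be reproduced with a one-line sparsity filter (my own helper and toy counts; tm operates on its sparse matrix classes instead):

```r
# Keep only terms whose proportion of zero entries is below `sparse`.
drop_sparse_terms <- function(tdm, sparse) {
  tdm[rowMeans(tdm == 0) < sparse, , drop = FALSE]
}

tdm <- rbind(oil  = c(5, 11, 2, 1),
             opec = c(0,  0, 3, 0),
             said = c(1,  9, 1, 3))
drop_sparse_terms(tdm, 0.4)  # "opec" (75% zeros) is removed
```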
Creating term-document matrix with given dictionary
> inspect(DocumentTermMatrix(reuters,list(dictionary=c("prices", "crude","oil")))) <<DocumentTermMatrix (documents: 20, terms: 3)>> Non-/sparse entries: 29/31 Sparsity : 52% Maximal term length: 6 Weighting : term frequency (tf) Terms Docs crude oil prices Id_1 2 5 0 Id_2 0 11 0 Id_3 2 2 0 Id_4 3 1 0 Id_5 0 1 0 Id_6 1 7 0 Id_7 0 3 0 Id_8 0 3 0 Id_9 0 4 0 Id_10 0 9 0
Clustering of terms
• Plotting the dendrogram of hierarchical clustering of terms
• plot(hclust(dist(removeSparseTerms(tdm,0.6)),method="ward.D"))
EXAMPLE
Starkweather, J. (2014). Introduction to basic Text Mining in R. http://it.unt.edu/benchmarks/issues/2014/01/rss-matters
Text: University of North Texas (UNT) policy that governs Research and Statistical Support services (RSS)
Importing of data and cleaning
• Reading by command readLines
policy.HTML.page<-readLines("http://policy.unt.edu/policy/3-5")
• Identification of relevant parts of the text, cleaning of data....
id.1<-3+which(policy.HTML.page==" TOTAL UNIVERSITY </div>")
id.2<-id.1+5
text.d<-policy.HTML.page[id.1:id.2]
• After a little bit of cleaning we get the text....
Text data
> text.d [1] "Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand." [2] "The primary goal of Academic Computing Services is to assist faculty and students in using computing facilities. The emphasis here is on "assist." Some of the facilities users might require are statistical programming systems such as SPSS and SAS, programming languages such as FORTRAN and C, and/or other specialized academic software. The staff can also be of assistance in the design stage of research or instructional projects so that ease of use of the computing facilities is assured. With ever greater emphasis by all disciplines on computing, it is desirable for individual faculty members and students to learn to deal with the computing facilities directly, without the need for intervention by the professional staff." [3] "In order to facilitate use, therefore, Academic Computing Services will aid individuals to learn the use of the hardware and software facilities available. Under ordinary circumstances Academic Computing Services personnel will not literally do projects for an individual, but will try to provide advice on how to complete the project most efficiently and effectively. In addition to limits on personnel time, it is not generally a good idea for a person working on a research project to maintain a position remote from the analysis of their data. 
The analysis of research data is an intrinsic part of the research process, and it requires the same care and understanding of the research assumptions as does the design and collection of the data."
.........
Conversion of text to corpus
Loading the tm package and conversion of the character strings to a corpus by the command VectorSource
> library(tm)
> txt<-VectorSource(text.d)
> txt.corpus<-Corpus(txt)
Corpus
> inspect(txt.corpus[1]) <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand.
Transformations / preprocessing
• Conversion to lower case
txt.corpus<-tm_map(txt.corpus,content_transformer(tolower))
• Removal of punctuation
txt.corpus<-tm_map(txt.corpus,removePunctuation)
• Removal of numbers
txt.corpus<-tm_map(txt.corpus,removeNumbers)
• Removal of stop words
txt.corpus<-tm_map(txt.corpus,removeWords,stopwords("english"))
• Reduction of words to their root form
> library(SnowballC)
> txt.corpus<-tm_map(txt.corpus,stemDocument)
Creating the term-document matrix
• tdm<-TermDocumentMatrix(txt.corpus)
> inspect(tdm[1:10,1:5]) <<TermDocumentMatrix (terms: 10, documents: 5)>> Non-/sparse entries: 13/37 Sparsity : 74% Maximal term length: 10 Weighting : term frequency (tf) Docs Terms character(0) character(0) character(0) character(0) character(0) academ 3 2 2 0 2 accomplish 0 0 0 0 1 acquir 0 0 0 1 0 addit 0 0 1 0 0 advic 0 0 1 0 0 aid 0 0 1 0 0 also 0 1 0 0 0 amount 0 0 0 2 0 analysi 0 0 2 0 0 andor 0 1 0 0 0 >
Operating with term-document matrix
• Finding frequent terms
> findFreqTerms(x=tdm,lowfreq=8,highfreq=Inf)
[1] "academ" "comput" "facil" "research" "servic" "use"
• Finding correlated words
> findAssocs(x=tdm,term="comput",corlimit=0.6)
comput
faculti    0.82
student    0.82
academ     0.75
comprehens 0.70
cooper     0.70
defin      0.70
hand       0.70
imper      0.70
intend     0.70
made       0.70
may        0.70
one        0.70
particular 0.70
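findAssocs scores are essentially Pearson correlations between the rows of the term-document matrix; on a plain matrix this can be checked with base R's cor() (toy counts below, not the UNT policy data):

```r
# Correlation between the occurrence patterns of terms across documents.
m <- rbind(comput  = c(3, 2, 4, 1),
           faculti = c(2, 1, 3, 1),
           oil     = c(0, 5, 0, 2))
cor(m["comput", ], m["faculti", ])  # high: the terms co-occur
cor(m["comput", ], m["oil", ])      # negative: they do not
```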
Operating with term-document matrix
• Removal of infrequent terms
  Removing terms whose representation has sparsity of at least 60%:
tdm.common.60<-removeSparseTerms(x=tdm,sparse=0.60)
  Removing terms whose representation has sparsity of at least 20%:
tdm.common.20<-removeSparseTerms(x=tdm,sparse=0.20)
tdm <<TermDocumentMatrix (terms: 161, documents: 6)>> Non-/sparse entries: 251/715 Sparsity : 74% Maximal term length: 14 Weighting : term frequency (tf)
> tdm.common.60 <<TermDocumentMatrix (terms: 17, documents: 6)>> Non-/sparse entries: 66/36 Sparsity : 35% Maximal term length: 10 Weighting : term frequency (tf)
> tdm.common.20 <<TermDocumentMatrix (terms: 4, documents: 6)>> Non-/sparse entries: 22/2 Sparsity : 8% Maximal term length: 9 Weighting : term frequency (tf)
Hierarchical clustering of index terms
• plot(hclust(stats::dist(tdm.common.20),method="ward.D2"))
Hierarchical clustering of index terms
• > clust<-hclust(stats::dist(tdm.common.60),method="ward.D2")
• > plot(clust)
Cutting a dendrogram
• cutree(clust,k=4) • academic assist best can center computing facilities 1 2 2 2 2 3 1 faculty personnel provide requires resources services software 2 2 4 2 2 1 4 staff students use 4 2 1
Cutting a dendrogram
(dendrogram figure: the four clusters, labelled 1-4, marked on the plot)
Clustering of documents
• dtm<-DocumentTermMatrix(txt.corpus)
• > clust<-hclust(stats::dist(dtm),method="ward.D2")
• > plot(clust)
Wordcloud
• Package wordcloud library(wordcloud)
> (v<-sort(rowSums(as.matrix(tdm)),decreasing=TRUE))
 computing    academic    services  facilities         use
        21          10          10           9           9
   provide    software       staff        will     faculty
         7           7           7           7           6
   project    research    students      assist         can
         5           5           5           4           4
    center information   personnel programming    requires
         4           4           4           4           4
Wordcloud
• (myNames<-names(v)) • (d<-data.frame(word=myNames,freq=v))
                 word freq
computing   computing   21
academic     academic   10
services     services   10
facilities facilities    9
use               use    9
provide       provide    7
software     software    7
staff           staff    7
will             will    7
faculty       faculty    6
Wordcloud
• wordcloud(d$word,d$freq,min.freq=3)
References
• R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison-Wesley, New York, 1999.
• Starkweather, J. (2014). Introduction to basic Text Mining in R. Retrieved from http://it.unt.edu/benchmarks/issues/2014/01/rss-matters
• Feinerer, I. (2014). Introduction to the tm Package Text Mining in R. http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
• Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1–54. http://www.jstatsoft.org/v25/i05/paper