
Using R for Text mining

Jasminka Dobša Faculty of Organization and Informatics

University of Zagreb

March 18th, 2015, Karl-Franzens Uni Graz

Outline

•  Definition of text mining
•  Vector space model or Bag of words representation
    - Representation of documents and queries in VSM
    - Measure of similarity between document and query
    - Example
    - Preprocessing of text documents
•  Tools for text mining
•  Using R for text mining
    - Data import
    - Preprocessing of data
    - Creation of term-document matrix
    - Operating with term-document matrix


Definition and basic concepts

•  Text mining is an integral part of the wider discipline of data mining

•  It is content-based processing of unstructured text documents and extraction of useful information from them

•  Content-based treatment: treatment based only on the content of the documents, and not on metadata such as date of publication, source of publication, type of document or similar

•  Basic text mining tasks:
    - Information retrieval
    - Automatic classification of text documents


Information retrieval: definition and basic concepts

•  The problem: determining a set of documents from a collection that are relevant to a particular user query

•  Assessment of the relevance of documents is based on a comparison of index terms that appear in texts of documents and text of the query

•  Index term: basically any word or group of words that appears in the text

•  When the text is represented by a set of index terms, much of the semantics contained in the text is lost

•  Information about the interconnectedness between index terms is lost


Labels and basic models

•  Labels:
    - $d_j$: a document from the collection of documents
    - $q$: query
    - $a_i$, $a_j$: vector representations of documents
    - $q$: vector representation of the query

•  Basic models:
    - Probabilistic model
    - Boolean model
    - Vector model


Vector space model

•  The vector space model is today one of the most popular models for information retrieval

•  It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)

•  In this model, the similarity between documents and queries can take real values in the interval [0,1]

•  Index terms are assigned weights
•  Documents are ranked according to their degree of similarity with the query
•  The representation used in the vector space model is called the Bag of words representation
•  The vector space model assumes independence of index terms


Vector space model

•  Notation
    - $T = \{t_1, t_2, \ldots, t_m\}$: set of index terms
    - $D = \{d_1, d_2, \ldots, d_n\}$: set of documents
    - $a_{ij}$: weight associated with the pair $(t_i, d_j)$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$
    - $q_i$, $i = 1, 2, \ldots, m$: weights associated with the pair $(t_i, q)$, where $q$ is the query

•  In the vector space model, document $d_j$, $j = 1, 2, \ldots, n$, is represented by the vector $a_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^T$

•  The query is represented by the vector $q = (q_1, q_2, \ldots, q_m)^T$
•  The collection of documents is represented by the term-document matrix


Term-document matrix

•  The terms are axes in the space

•  Documents and queries are points or vectors in space

•  Space is high-dimensional - dimension corresponds to the number of index terms

•  Documents usually contain only a few index terms from the list of index terms

•  Vectors representing documents are sparse: most coordinates are equal to 0

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
$$

Row $i$ corresponds to term $t_i$ and column $j$ to document $d_j$: rows represent terms (concepts), columns represent documents

Measure of similarity

•  The similarity between the document and the query is measured by the angle between their representations

•  The measure of similarity between document and query is the cosine of the angle between them

$$
\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
$$

Measure of similarity

•  Vector representations of documents are usually normalized so that their Euclidean norm is 1: $\|a_j\|_2 = 1$

•  The norm of the query representation, $\|q\|_2$, does not affect the ranking of documents by relevance

•  The similarity between document and query is therefore usually measured by the dot product $a_j^T q$, the numerator of the cosine formula
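The two similarity measures can be compared in a few lines of base R. This is only a sketch on made-up toy data: the matrix `A`, the query `q` and the helper `cosine_scores()` are illustrative assumptions and do not come from the slides.

```r
# Cosine similarity between each document column of A and a query vector q.
# The 3 x 3 matrix and the query are made-up toy data.
cosine_scores <- function(A, q) {
  doc_norms <- sqrt(colSums(A^2))   # Euclidean norm of each document column a_j
  q_norm <- sqrt(sum(q^2))          # Euclidean norm of the query
  as.vector(crossprod(A, q)) / (doc_norms * q_norm)
}

# matrix() fills column by column, so each group of three values is one document.
A <- matrix(c(1, 0, 1,
              0, 1, 1,
              1, 1, 0),
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("d1", "d2", "d3")))
q <- c(1, 0, 1)

scores <- cosine_scores(A, q)
names(scores) <- colnames(A)

# With unit-length document vectors, the plain dot product differs from the
# cosine only by the constant factor 1 / ||q||, so the ranking is the same.
A_unit <- sweep(A, 2, sqrt(colSums(A^2)), "/")
dot_scores <- as.vector(crossprod(A_unit, q))

print(scores)
print(dot_scores)
```

Dividing `dot_scores` by the query norm gives back `scores`, which is why the query norm can be ignored when only the ranking matters.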


The process of assigning weights to index terms

•  Insisting on high recall reduces the precision of retrieval, and vice versa
•  Therefore, several components are used to assign weights to index terms
•  Components:
    - Local: depends on the frequency of occurrence of the term in a particular document (term frequency, TF component)
    - Global: the factor for an index term depends inversely on the number of documents in which the term appears (inverse document frequency, IDF component)
    - Normalization: the vector representation of a document is divided by its Euclidean norm (this avoids the situation in which longer documents are returned as relevant more often)
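The three components can be combined as in the following base R sketch. It uses one common variant (raw term frequency times log(n / df), then unit-norm scaling); the slides do not fix a particular formula, so both the formula and the toy counts are assumptions.

```r
# Toy term-frequency matrix (made-up counts): rows are terms, columns are documents.
tf <- matrix(c(2, 1, 1,
               1, 0, 2,
               0, 3, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "market"), c("d1", "d2", "d3")))

n_docs <- ncol(tf)
df <- rowSums(tf > 0)      # number of documents containing each term
idf <- log(n_docs / df)    # global component: rarer terms get a larger factor

# Local (TF) times global (IDF); idf has one entry per row and is recycled down the rows.
weighted <- tf * idf

# Normalization: divide every document column by its Euclidean norm.
tfidf <- sweep(weighted, 2, sqrt(colSums(weighted^2)), "/")
print(round(tfidf, 3))
```

In this toy collection "oil" occurs in every document, so its IDF factor is log(1) = 0 and it contributes nothing to any document vector.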


Advantages and disadvantages of the vector space model

•  Advantages
    - The cosine similarity measure between documents and queries enables ranking of documents according to their importance for a query
    - The scheme for assigning weights to index terms increases the efficiency of information retrieval

•  Disadvantage
    - The assumption of independence of index terms (questionable!)


Example

•  A collection of 15 documents (titles of books)
    - 9 from the field of data mining
    - 5 from the field of linear algebra
    - One that combines these two areas (application of linear algebra to data mining)

•  A list of words is formed as follows:
    - From the words contained in at least two documents
    - The words on the list of stop words are discarded
    - The words are reduced to their basic forms, i.e. variations of a word are mapped to the basic version of the word


Documents 1/2

D1 Survey of text mining: clustering, classification, and retrieval

D2 Automatic text processing: the transformation analysis and retrieval of information by computer

D3 Elementary linear algebra: A matrix approach

D4 Matrix algebra & its applications statistics and econometrics

D5 Effective databases for text & document management

D6 Matrices, vector spaces, and information retrieval

D7 Matrix analysis and applied linear algebra

D8 Topological vector spaces and algebras


Documents 2/2

D9 Information retrieval: data structures & algorithms

D10 Vector spaces and algebras for chemistry and physics

D11 Classification, clustering and data analysis

D12 Clustering of large data sets

D13 Clustering algorithms

D14 Document warehousing and text mining: techniques for improving business operations, marketing and sales

D15 Data mining and knowledge discovery


Terms

| Data mining terms | Linear algebra terms | Neutral terms |
|---|---|---|
| Text | Linear | Analysis |
| Mining | Algebra | Application |
| Clustering | Matrix | Algorithm |
| Classification | Vector | |
| Retrieval | Space | |
| Information | | |
| Document | | |
| Data | | |

The terms are mapped to their basic versions, e.g. Applications, Applied → Application

Queries

•  Q1: Data mining
    - Relevant documents: all data mining documents (shown in blue on the slides)

•  Q2: Using linear algebra for data mining
    - Relevant document: D6


Search results

| Document | bxx, Q1 | bxx, Q2 | tfn, Q1 | tfn, Q2 |
|---|---|---|---|---|
| D1 | 0.4472 | 0.4472 | 0.3607 | 0.2424 |
| D2 | 0 | 0 | 0 | 0 |
| D3 | 0 | 1.1547 | 0 | 0.6417 |
| D4 | 0 | 0.5774 | 0 | 0.1471 |
| D5 | 0 | 0 | 0 | 0 |
| D6 | 0 | 0 | 0 | 0 |
| D7 | 0 | 0.8944 | 0 | 0.4598 |
| D8 | 0 | 0.5774 | 0 | 0.1654 |
| D9 | 0.5 | 0.5 | 0.2634 | 0.1771 |
| D10 | 0 | 0.5774 | 0 | 0.1654 |
| D11 | 0.5 | 0.5 | 0.2634 | 0.1770 |
| D12 | 0.7071 | 0.7071 | 0.4488 | 0.3016 |
| D13 | 0 | 0 | 0 | 0 |
| D14 | 0.5774 | 0.5774 | 0.4092 | 0.2750 |
| D15 | 1.4142 | 1.4142 | 1.0000 | 0.6720 |
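Several of the bxx values for Q1 can be reproduced with base R. The sketch below assumes that document vectors are binary and scaled to unit Euclidean norm while the query stays an unscaled binary vector; this is inferred from the numbers in the table, not stated on the slides. The index terms per document were assigned by hand from the titles and the term list, and only a subset of the 15 documents is used.

```r
# Index terms per document, read by hand from the titles and the term list
# (a subset of the collection, to keep the sketch short).
doc_terms <- list(
  D1  = c("text", "mining", "clustering", "classification", "retrieval"),
  D3  = c("linear", "algebra", "matrix"),
  D9  = c("information", "retrieval", "data", "algorithm"),
  D12 = c("clustering", "data"),
  D15 = c("data", "mining")
)
vocab <- sort(unique(unlist(doc_terms)))

# Binary term-document matrix: 1 if the term occurs in the document, else 0.
A <- sapply(doc_terms, function(terms) as.numeric(vocab %in% terms))
rownames(A) <- vocab

# Scale each document column to unit Euclidean norm.
A_unit <- sweep(A, 2, sqrt(colSums(A^2)), "/")

# Binary query vector for Q1 "Data mining".
q1 <- as.numeric(vocab %in% c("data", "mining"))

scores <- round(drop(crossprod(A_unit, q1)), 4)
print(scores)
```

Under these assumptions the scores for D1, D3, D9, D12 and D15 agree with the bxx column for Q1 (0.4472, 0, 0.5, 0.7071 and 1.4142).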


Comments on search results

•  Assume that the search returns all documents whose similarity with the query is greater than 0

•  For the first query (Data mining)
    - All documents returned by the search are relevant
    - Not all relevant documents are returned by the search
    - Only documents containing terms from the query are returned: the search is lexical, not semantic

•  For the second query (Using linear algebra for data mining)
    - None of the returned documents is relevant
    - The only relevant document is not returned by the search

•  The bxx and tfn weightings did not produce significantly different results in this case


Preprocessing of text

•  Index term
    - Any word that appears in the collection of documents
    - A series of several words that appears in the texts of documents

•  Preprocessing of text includes a number of procedures which prepare the text for processing and which increase the efficiency of the processing itself

•  Some of the procedures are as follows:
    - Lexical processing
    - Removal of terms such as conjunctions, prepositions and similar words that have low discriminatory value in information retrieval, so-called stop words
    - Transformation of words to their root form (stemming) or to their basic form (lemmatization)


•  Lexical processing consists of removing punctuation, removing brackets and converting letters to lower case

•  Bringing words to their root form consists of removing suffixes and prefixes

•  The alternative form of a word is usually obtained by adding suffixes

•  Most algorithms for reducing words to their stem form are limited to removing suffixes

•  The best known algorithm for reducing words to their root form is Porter's algorithm, developed for the English language

•  In addition to index terms which consist of only one word, frequently occurring phrases can also be identified in the text, e.g. New York


Tools for text mining

•  Today most statistical software products offer text mining capabilities

•  These include:
    - Preprocessing of the corpus (collection of documents)
        - Data preparation
        - Importing of text data
        - Cleaning
        - Specific preprocessing tasks (stop word removal, stemming, …)
    - Creation of the term-document matrix
    - Finding associations between terms and documents
    - Clustering of terms and documents
    - Automatic classification
    - Semantic information retrieval ....


Tools for text mining

•  Commercial
    - Clearforest
    - Copernic Summarizer
    - dtSearch
    - Insightful Infact
    - Inxight
    - SPSS Clementine
    - SAS Text Miner
    - TEMIS
    - WordStat
    - Statistica

•  Open Source
    - GATE
    - RapidMiner
    - Weka/KEA
    - R/tm


R: Packages

•  The main package in R for text mining is the tm package
•  The modular structure of tm enables integration of existing functionality from other text mining toolkits
•  For example:
    - Weka (via RWeka and Snowball)
        - Enables stemming and tokenization methods
    - OpenNLP
        - Enables tokenization, sentence detection, part-of-speech tagging

•  Advanced text mining methods beyond the scope of commercial products are available via extension text mining packages in R
    - kernlab: string kernels
    - lsa: latent semantic indexing


Data import

•  The main structure for managing documents in tm is the Corpus: a collection of text documents

•  Possible implementations of a corpus
    - Volatile corpus (default implementation)
        - The corpus is an R object held fully in memory
        - When the R object is destroyed, the corpus is also destroyed
    - Permanent corpus
        - Documents of the collection are physically stored outside of R (in a database)
        - The corpus is not destroyed when the corresponding R objects are destroyed


Data import


•  A volatile corpus is created by the command VCorpus(x, readerControl)
    - x: a Source (input location); predefined sources are:
        - DirSource: a directory
        - VectorSource: a vector, interpreting each component as a document
        - DataframeSource: a data frame (for example read from CSV or XLS files)
    - readerControl: a list with the components reader and language
        - reader constructs a corpus from the elements delivered by the source

•  In the case of a permanent corpus there is a third argument, dbControl (controls the database)

•  Storing the collection in a database enables management of very large text collections, but a shortcoming might be slower access performance


Readers

| Reader | Description |
|---|---|
| readPlain() | Read in files as plain text, ignoring metadata |
| readRCV1() | Read in files in Reuters Corpus Volume 1 XML format |
| readReut21578XML() | Read in files in Reuters-21578 XML format |
| readGmane() | Read in Gmane RSS feeds |
| readNewsgroup() | Read in newsgroup postings (e-mails) |
| readPDF() | Read in PDF documents |
| readDOC() | Read in MS Word documents |
| readHTML() | Read in simply structured HTML documents |


Data import: examples

•  Plain text files in the directory txt, containing Latin texts by the Roman poet Ovid, can be read by:
    library(tm)
    txt <- system.file("texts", "txt", package = "tm")
    ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                    readerControl = list(language = "lat"))
    > ovid
    <<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>

•  With VectorSource, a corpus can be created from a character vector:
    docs <- c("This is a text.", "This another one.")
    corp <- VCorpus(VectorSource(docs))
    > corp
    <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>


Data import: examples

•  Creation of a corpus for some Reuters documents:
    reut21578 <- system.file("texts", "crude", package = "tm")
    reuters <- VCorpus(DirSource(reut21578),
                       readerControl = list(reader = readReut21578XMLasPlain))
    > reuters
    <<VCorpus (documents: 20, metadata (corpus/indexed): 0/0)>>


Reuters-21578 collection of documents

•  The most widely used data set for classification
•  It consists of Reuters reports from 1987
•  Divided into documents for learning and documents for testing (ModApte split)
•  Documents for learning: reports from the beginning of 1987 to 7 April 1987
•  Documents for testing: reports from 8 April until the end of 1987
•  9603 documents for learning and 3299 documents for testing
•  10 large classes; 90 classes for which there is at least one document for learning and at least one document for testing; 118 classes in total

•  There are different class

Reuters-21578 collection of documents

•  10 big classes

| Class | No. of training documents | No. of test documents |
|---|---|---|
| Earn | 2877 | 1087 |
| Acquisitions | 1650 | 719 |
| Money-fx | 538 | 179 |
| Grain | 433 | 149 |
| Crude | 389 | 189 |
| Trade | 369 | 118 |
| Interest | 347 | 131 |
| Ship | 197 | 89 |
| Wheat | 212 | 71 |
| Corn | 182 | 56 |

Inspecting corpora

•  The full content of text documents is displayed with the command inspect()
    inspect(ovid[1])
    <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    Si quis in hoc artem populo non novit amandi,
    hoc legat et lecto carmine doctus amet.
    arte citae veloque rates remoque moventur,
    arte leves currus: arte regendus amor.

    curribus Automedon lentisque erat aptus habenis,
    Tiphys in Haemonia puppe magister erat:
    me Venus artificem tenero praefecit Amori;
    Tiphys et Automedon dicar Amoris ego.

    ille quidem ferus est et qui mihi saepe repugnet:
    sed puer est, aetas mollis et apta regi.

    Phillyrides puerum cithara perfecit Achillem,
    atque animos placida contudit arte feros.
    qui totiens socios, totiens exterruit hostes,
    creditur annosum pertimuisse senem.

Inspecting corpora

•  inspect(corp)
    <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    This is a text.

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    This another one.


Transformations

•  Typical transformations are stemming, stop word removal, ...

•  Transformations are applied with the command tm_map()
•  Elimination of extra whitespace:
    reuters <- tm_map(reuters, stripWhitespace)
•  Conversion to lower case:
    reuters <- tm_map(reuters, content_transformer(tolower))
•  Removal of stop words:
    reuters <- tm_map(reuters, removeWords, stopwords("english"))
•  Removal of punctuation:
    reuters <- tm_map(reuters, removePunctuation)
•  Removal of numbers:
    reuters <- tm_map(reuters, removeNumbers)
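To see what these transformations do to the text itself, the same steps can be imitated in base R on a plain character vector. This is a rough sketch and not tm's implementation: `clean_text()` and the tiny stop word list are hypothetical, and unlike removePunctuation the sketch replaces punctuation with a space.

```r
# Hypothetical helper: lower-casing, punctuation and number removal,
# stop word removal and whitespace stripping on a character vector.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)        # punctuation -> space
  x <- gsub("[[:digit:]]+", " ", x)        # numbers -> space
  tokens <- strsplit(x, "[[:space:]]+")    # splitting also collapses extra whitespace
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok) & !(tok %in% stop_words)]
    paste(tok, collapse = " ")
  }, character(1))
}

docs <- c("This is a text.", "This   another one, from 2015!")
small_stop_list <- c("this", "is", "a", "from")   # illustrative only, not tm's list

cleaned <- clean_text(docs, small_stop_list)
print(cleaned)
```

The order of the steps matters: lower-casing comes first so that "This" matches the lower-case stop word "this".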


English stop words

•  stopwords("english")
     [1] "i"       "me"      "my"         "myself"     "we"
     [6] "our"     "ours"    "ourselves"  "you"        "your"
    [11] "yours"   "yourself" "yourselves" "he"        "him"
    [16] "his"     "himself" "she"        "her"        "hers"
    [21] "herself" "it"      "its"        "itself"     "they"
    [26] "them"    "their"   "theirs"     "themselves" "what"
    [31] "which"   "who"     "whom"       "this"       "that"
    [36] "these"   "those"   "am"         "is"         "are"
    [41] "was"     "were"    "be"         "been"       "being"
    [46] "have"    "has"     "had"        "having"     "do"
    [51] "does"    "did"     "doing"      "would"      "should"

    ....and so on

German stop words

•  stopwords("german")
     [1] "aber"     "alle"      "allem"     "allen"     "aller"     "alles"
     [7] "als"      "also"      "am"        "an"        "ander"     "andere"
    [13] "anderem"  "anderen"   "anderer"   "anderes"   "anderm"    "andern"
    [19] "anderr"   "anders"    "auch"      "auf"       "aus"       "bei"
    [25] "bin"      "bis"       "bist"      "da"        "damit"     "dann"
    [31] "der"      "den"       "des"       "dem"       "die"       "das"
    [37] "daß"      "derselbe"  "derselben" "denselben" "desselben" "demselben"
    [43] "dieselbe" "dieselben" "dasselbe"  "dazu"      "dein"      "deine"
    [49] "deinem"   "deinen"    "deiner"    "deines"    "denn"      "derer"

    .... and so on


Transformations: stemming

> inspect(reuters[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>

diamond shamrock corp said effective today cut contract prices crude oil 1.50 dlrs barrel. reduction brings posted price west texas intermediate 16.00 dlrs barrel, copany said. " price reduction today made light falling oil product prices weak crude oil market," company spokeswoman said. diamond latest line u.s. oil companies cut contract, posted, prices last two days citing weak oil markets. Reuter


Transformations: stemming

reuters<-tm_map(reuters, stemDocument)

> inspect(reuters[1])

<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

diamond shamrock corp said effect today cut contract price crude oil 1.50 dlrs barrel. reduct bring post price west texa intermedi 16.00 dlrs barrel, copani said. " price reduct today made light fall oil product price weak crude oil market," compani spokeswoman said. diamond latest line u.s. oil compani cut contract, posted, price last two day cite weak oil markets. reuter


Creating term-document matrix

•  Term-document matrix
    - Terms: rows
    - Documents: columns

•  Document-term matrix
    - Terms: columns
    - Documents: rows

•  dtm <- DocumentTermMatrix(reuters)
•  tdm <- TermDocumentMatrix(reuters)

    > dim(dtm)
    [1]   20 1208
    Matrix of dimension 20 (documents) × 1208 (terms)
    > dim(tdm)
    [1] 1208   20
    Matrix of dimension 1208 (terms) × 20 (documents)
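The relation between the two matrices can be illustrated without tm. The base R sketch below counts terms in three made-up, already preprocessed documents; the names `tdm_toy` and `dtm_toy` are illustrative and the objects are ordinary dense matrices, not tm's sparse classes.

```r
# Three made-up documents, already lower-cased and cleaned.
docs <- c(d1 = "oil price oil market",
          d2 = "price cut",
          d3 = "oil market market")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document.
tdm_toy <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
rownames(tdm_toy) <- terms

# The document-term matrix is the transpose.
dtm_toy <- t(tdm_toy)

print(tdm_toy)
print(dim(dtm_toy))
```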


Creating term-document matrix

> inspect(dtm[5:10,740:743])
<<DocumentTermMatrix (documents: 6, terms: 4)>>
Non-/sparse entries: 2/22
Sparsity           : 92%
Maximal term length: 7
Weighting          : term frequency (tf)

              Terms
Docs           negat negoti neither net
  character(0)     0      0       0   2
  character(0)     0      0       0   0
  character(0)     0      0       0   0
  character(0)     0      0       0   0
  character(0)     0      0       0   0
  character(0)     1      0       0   0

Incorporate weighting (TfIdf)

> dtm<-DocumentTermMatrix(reuters,control=list(weighting=weightTfIdf))

> inspect(dtm[5:10,740:743])
<<DocumentTermMatrix (documents: 6, terms: 4)>>
Non-/sparse entries: 2/22
Sparsity           : 92%
Maximal term length: 7
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

              Terms
Docs                negat negoti neither        net
  character(0) 0.00000000      0       0 0.08003571
  character(0) 0.00000000      0       0 0.00000000
  character(0) 0.00000000      0       0 0.00000000
  character(0) 0.00000000      0       0 0.00000000
  character(0) 0.00000000      0       0 0.00000000
  character(0) 0.01338058      0       0 0.00000000

Operating with Term-Document Matrix

•  Finding frequent terms in collection

> findFreqTerms(dtm,10)
 [1] "accord" "barrel" "bpd"    "crude" "dlrs"  "futur"   "govern"
 [8] "kuwait" "last"   "market" "meet"  "mln"   "new"     "offici"
[15] "oil"    "one"    "opec"   "pct"   "price" "product" "reuter"
[22] "said"   "said."  "saudi"  "say"   "sheikh" "u.s."   "will"

Operating with term-document matrix

•  Finding associations between terms

> findAssocs(dtm,"opec",0.8)
        opec
oil     0.88
emerg   0.87
15.8    0.86
analyst 0.85
meet    0.85
said    0.84
want    0.82
abil    0.80
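As far as the tm documentation describes it, the association score is a correlation between term columns of the matrix, with the last argument acting as a lower limit. A comparable calculation in base R, on made-up counts (the matrix and its term names are assumptions for illustration):

```r
# Toy document-term matrix (made-up counts): rows are documents, columns are terms.
dtm_toy <- matrix(c(3, 2, 0,
                    1, 1, 1,
                    0, 0, 4,
                    2, 1, 0),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:4), c("opec", "oil", "wheat")))

# Pearson correlation of every term with the chosen term.
assoc <- cor(dtm_toy)[, "opec"]
assoc <- assoc[names(assoc) != "opec"]

# Keep only terms whose correlation reaches the lower limit, strongest first.
limit <- 0.8
strong <- sort(round(assoc[assoc >= limit], 2), decreasing = TRUE)
print(strong)
```

Here "oil" rises and falls together with "opec" across the documents and passes the limit, while "wheat" is negatively correlated and is dropped.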


Operating with term-document matrix

•  Removing sparse terms: terms which have at least x% of zeros in their representation

> inspect(removeSparseTerms(dtm,0.4))
<<DocumentTermMatrix (documents: 20, terms: 4)>>
Non-/sparse entries: 73/7
Sparsity           : 9%
Maximal term length: 6
Weighting          : term frequency (tf)

      Terms
Docs   oil price reuter said
  Id_1   5     5      1    1
  Id_2  11     4      2    9
  Id_3   2     2      1    1
  Id_4   1     2      1    1
  Id_5   1     0      1    3
  ...
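The filtering rule can be sketched in base R as a per-term share of zero entries compared against the threshold. The toy matrix below is made up, and the sketch works on a dense matrix rather than on tm's sparse representation.

```r
# Toy document-term matrix (made-up counts): rows are documents, columns are terms.
dtm_toy <- matrix(c(5, 1, 0, 0,
                    2, 0, 0, 1,
                    1, 3, 0, 0,
                    4, 2, 1, 0,
                    1, 1, 0, 0),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(paste0("Id_", 1:5),
                                  c("oil", "price", "opec", "saudi")))

# Share of documents in which each term does not occur.
term_sparsity <- colMeans(dtm_toy == 0)

# Keep terms whose sparsity is below the threshold (0.4, as in the call above).
max_sparsity <- 0.4
dense <- dtm_toy[, term_sparsity < max_sparsity, drop = FALSE]
print(colnames(dense))
```

With a threshold of 0.4, "opec" and "saudi" (absent from 80% of the toy documents) are dropped, while "oil" and "price" remain.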


Creating term-document matrix with given dictionary

> inspect(DocumentTermMatrix(reuters,list(dictionary=c("prices", "crude","oil"))))
<<DocumentTermMatrix (documents: 20, terms: 3)>>
Non-/sparse entries: 29/31
Sparsity           : 52%
Maximal term length: 6
Weighting          : term frequency (tf)

       Terms
Docs    crude oil prices
  Id_1      2   5      0
  Id_2      0  11      0
  Id_3      2   2      0
  Id_4      3   1      0
  Id_5      0   1      0
  Id_6      1   7      0
  Id_7      0   3      0
  Id_8      0   3      0
  Id_9      0   4      0
  Id_10     0   9      0

Clustering of terms

•  Plotting the dendrogram of a hierarchical clustering of terms
•  plot(hclust(dist(removeSparseTerms(tdm, 0.6)), method = "ward.D"))
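The same clustering call can be tried on a small dense matrix in base R. The counts and term names below are made up; instead of plotting, the sketch cuts the tree into two groups so the result can be read directly.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents.
tdm_toy <- matrix(c(10, 10,  0,  0,
                     9, 11,  0,  0,
                     0,  0, 10, 10,
                     0,  1,  9, 10),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(c("oil", "crude", "wheat", "grain"),
                                  paste0("d", 1:4)))

# dist() computes Euclidean distances between rows, i.e. between terms.
tree <- hclust(dist(tdm_toy), method = "ward.D")

# Cut the dendrogram into two clusters; plot(tree) would draw it instead.
groups <- cutree(tree, k = 2)
print(groups)
```

Terms that occur in the same documents ("oil" with "crude", "wheat" with "grain") end up in the same cluster.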


EXAMPLE

Starkweather, J. (2014). Introduction to basic Text Mining in R. http://it.unt.edu/benchmarks/issues/2014/01/rss-matters


Text: University of North Texas (UNT) policy that governs Research and Statistical Support services (RSS)


Importing and cleaning the data

•  Reading the page with the command readLines

  policy.HTML.page<-readLines("http://policy.unt.edu/policy/3-5")

•  Identifying the relevant part of the text and cleaning the data:

  id.1<-3+which(policy.HTML.page==" TOTAL UNIVERSITY </div>")

  id.2<-id.1+5

  text.data<-policy.HTML.page[id.1:id.2]

•  After a little cleaning we get the text (stored as text.d)...
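The cleaning step is not shown on the slide. A minimal base R sketch of what such cleaning might look like is given here; the input lines and the helper `clean_html` are hypothetical and are not taken from the original example.

```r
# Hypothetical raw lines, standing in for the extracted part of the page.
raw <- c("<p>The emphasis here is on &quot;assist.&quot;</p>",
         "  <p>Academic   Computing Services</p>  ")

# Hypothetical helper: strip tags, decode one entity, tidy whitespace.
clean_html <- function(x) {
  x <- gsub("<[^>]+>", "", x)                  # drop HTML tags
  x <- gsub("&quot;", "\"", x, fixed = TRUE)   # decode one common entity
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)                                    # strip leading/trailing blanks
}

cleaned <- clean_html(raw)
cleaned
```

Decoding entities is an optional extra: the text shown on the next slide still contains `&quot;`, so the original cleaning evidently did not include it.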


Text data


> text.d

[1] "Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand."

[2] "The primary goal of Academic Computing Services is to assist faculty and students in using computing facilities. The emphasis here is on &quot;assist.&quot; Some of the facilities users might require are statistical programming systems such as SPSS and SAS, programming languages such as FORTRAN and C, and/or other specialized academic software. The staff can also be of assistance in the design stage of research or instructional projects so that ease of use of the computing facilities is assured. With ever greater emphasis by all disciplines on computing, it is desirable for individual faculty members and students to learn to deal with the computing facilities directly, without the need for intervention by the professional staff."

[3] "In order to facilitate use, therefore, Academic Computing Services will aid individuals to learn the use of the hardware and software facilities available. Under ordinary circumstances Academic Computing Services personnel will not literally do projects for an individual, but will try to provide advice on how to complete the project most efficiently and effectively. In addition to limits on personnel time, it is not generally a good idea for a person working on a research project to maintain a position remote from the analysis of their data. The analysis of research data is an intrinsic part of the research process, and it requires the same care and understanding of the research assumptions as does the design and collection of the data."

...


Conversion of text to corpus

Loading the tm package and converting the vector of character strings to a corpus with the commands VectorSource and Corpus

> library(tm)

> txt<-VectorSource(text.d)

> txt.corpus<-Corpus(txt)


Corpus


> inspect(txt.corpus[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
Academic Computing Services exists to assist faculty and students in the best, most comprehensive use of the computing facilities. This policy statement is not intended in any way to restrict the services the Computing and Information Technology Center attempts to provide. Because personnel resources are limited, it is imperative that some attempt be made to define what Academic Computing Services can do in a timely fashion and what may not be possible. In particular, the successful provision of Academic Computing Services requires considerable cooperation between the Computing and Information Technology Center staff on the one hand and faculty and students on the other hand.


Transformations / preprocessing

•  Conversion to lower case
  txt.corpus<-tm_map(txt.corpus,content_transformer(tolower))

•  Removing punctuation
  txt.corpus<-tm_map(txt.corpus,removePunctuation)

•  Removing numbers
  txt.corpus<-tm_map(txt.corpus,removeNumbers)

•  Removing stop words
  txt.corpus<-tm_map(txt.corpus,removeWords,stopwords("english"))

•  Reduction of words to their root form (stemming)
  > library(SnowballC)
  > txt.corpus<-tm_map(txt.corpus,stemDocument)
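To see what the first four transformations do to a single string, here is a rough base R approximation. It is a sketch only: the stop list is a short hypothetical one (tm's stopwords("english") is much longer), the helper `preprocess` is invented for the example, and stemming is left out because base R has no stemmer.

```r
# Hypothetical short stop list for illustration.
stop_list <- c("the", "of", "to", "and", "in", "is")

# Hypothetical helper: handles one string and returns the remaining tokens.
preprocess <- function(x, stops) {
  x <- tolower(x)                       # lower case
  x <- gsub("[[:punct:]]+", "", x)      # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)      # remove numbers
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stops)]   # remove stop words
}

tokens <- preprocess("The 3 goals of Academic Computing Services, in 2015.",
                     stop_list)
tokens
```

The order of the steps matters: stop words are matched in lower case, so lower-casing has to come before their removal.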


Creating the term-document matrix

•  tdm<-TermDocumentMatrix(txt.corpus)


> inspect(tdm[1:10,1:5])
<<TermDocumentMatrix (terms: 10, documents: 5)>>
Non-/sparse entries: 13/37
Sparsity           : 74%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms        character(0) character(0) character(0) character(0) character(0)
  academ                3            2            2            0            2
  accomplish            0            0            0            0            1
  acquir                0            0            0            1            0
  addit                 0            0            1            0            0
  advic                 0            0            1            0            0
  aid                   0            0            1            0            0
  also                  0            1            0            0            0
  amount                0            0            0            2            0
  analysi               0            0            2            0            0
  andor                 0            1            0            0            0
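The structure of such a matrix can be reproduced in base R from tokenised documents. The three toy documents below are hypothetical, and the construction is a plain counting sketch rather than tm's sparse implementation.

```r
# Toy tokenised, stemmed documents (hypothetical data).
docs <- list(d1 = c("academ", "comput", "servic", "comput"),
             d2 = c("comput", "facil"),
             d3 = c("research", "academ"))

# Vocabulary: all distinct terms, sorted alphabetically.
terms <- sort(unique(unlist(docs)))

# One column per document, one row per term; entries are raw term frequencies.
tdm_toy <- sapply(docs, function(tokens) table(factor(tokens, levels = terms)))
tdm_toy

# Sparsity as a share of zero cells among all cells.
sparsity <- mean(tdm_toy == 0)
sparsity
```

Most cells are zero even in this tiny example, which is why tm stores these matrices in a sparse format.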


Operating with term-document matrix

•  Finding frequent terms
  > findFreqTerms(x=tdm,lowfreq=8,highfreq=Inf)
  [1] "academ" "comput" "facil" "research" "servic" "use"

•  Finding correlated words
  > findAssocs(x=tdm,terms="comput",corlimit=0.6)
               comput
  faculti        0.82
  student        0.82
  academ         0.75
  comprehens     0.70
  cooper         0.70
  defin          0.70
  hand           0.70
  imper          0.70
  intend         0.70
  made           0.70
  may            0.70
  one            0.70
  particular     0.70
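Both operations can be mimicked on a small dense matrix in base R. This is a sketch on hypothetical counts, assuming that associations are Pearson correlations between the rows of the term-document matrix; it is not tm's own code.

```r
# Toy term-document matrix (hypothetical counts): rows are terms, columns documents.
m <- rbind(comput  = c(5, 3, 1, 0),
           faculti = c(2, 1, 0, 0),
           project = c(0, 0, 2, 3))
colnames(m) <- paste0("doc", 1:4)

# Frequent terms: total count across documents at or above a lower bound.
frequent <- rownames(m)[rowSums(m) >= 5]
frequent

# Associations: correlation of one term's counts with every other term's counts.
assoc <- cor(t(m))["comput", ]
assoc <- sort(assoc[names(assoc) != "comput"], decreasing = TRUE)
round(assoc[assoc >= 0.6], 2)
```

Here "faculti" rises and falls together with "comput" across the documents and passes the 0.6 limit, while "project" is negatively correlated and is filtered out.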


Operating with term-document matrix

•  Removal of infrequent terms

  Removing terms whose representation has sparsity of at least 60%

  tdm.common.60<-removeSparseTerms(x=tdm,sparse=0.60)

  Removing terms whose representation has sparsity of at least 20%

  tdm.common.20<-removeSparseTerms(x=tdm,sparse=0.20)


> tdm
<<TermDocumentMatrix (terms: 161, documents: 6)>>
Non-/sparse entries: 251/715
Sparsity           : 74%
Maximal term length: 14
Weighting          : term frequency (tf)

> tdm.common.60
<<TermDocumentMatrix (terms: 17, documents: 6)>>
Non-/sparse entries: 66/36
Sparsity           : 35%
Maximal term length: 10
Weighting          : term frequency (tf)

> tdm.common.20
<<TermDocumentMatrix (terms: 4, documents: 6)>>
Non-/sparse entries: 22/2
Sparsity           : 8%
Maximal term length: 9
Weighting          : term frequency (tf)
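The reported sparsity values can be checked by hand from the non-sparse and sparse entry counts. The short R calculation below copies the counts from the three printouts; the data frame `entries` is just a hypothetical container for them.

```r
# Non-sparse and sparse cell counts copied from the three printouts.
entries <- data.frame(
  matrix     = c("tdm", "tdm.common.60", "tdm.common.20"),
  non_sparse = c(251, 66, 22),
  sparse     = c(715, 36, 2)
)

# Total cells = terms x documents; sparsity = share of zero cells, in percent.
entries$cells    <- entries$non_sparse + entries$sparse
entries$sparsity <- round(100 * entries$sparse / entries$cells)
entries
```

The cell totals equal terms times documents (161 x 6, 17 x 6 and 4 x 6), and the rounded shares reproduce the 74%, 35% and 8% shown in the printouts.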


Hierarchical clustering of index terms

•  plot(hclust(stats::dist(tdm.common.20),method="ward.D2"))


Hierarchical clustering of index terms

•  > clust<-hclust(stats::dist(tdm.common.60),method="ward.D2")

•  > plot(clust)


Cutting a dendrogram

•  cutree(clust,k=4)
    academic     assist       best        can     center  computing facilities
           1          2          2          2          2          3          1
     faculty  personnel    provide   requires  resources   services   software
           2          2          4          2          2          1          4
       staff   students        use
           4          2          1
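The same two calls, hclust and cutree from base R's stats package, can be tried on a small dense matrix. The counts below are hypothetical toy data chosen so that two groups of terms are easy to see.

```r
# Toy term-document matrix (hypothetical counts): rows are terms, columns documents.
m <- rbind(academic = c(3, 2, 2, 0),
           services = c(3, 2, 3, 0),
           research = c(0, 0, 4, 1),
           project  = c(0, 0, 3, 2))

# Euclidean distances between term rows, then Ward clustering.
clust <- hclust(dist(m), method = "ward.D2")

# Cut the tree into two groups; the result is a named integer vector.
groups <- cutree(clust, k = 2)
groups
```

Because dist works on rows, a term-document matrix yields clusters of terms, while the transposed (document-term) matrix yields clusters of documents.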


Cutting a dendrogram


[Figure: dendrogram of the index terms with the four clusters marked 1, 3, 4 and 2]


Clustering of documents

•  dtm<-DocumentTermMatrix(txt.corpus)
•  > clust<-hclust(stats::dist(dtm),method="ward.D2")
•  > plot(clust)


Wordcloud

•  Package wordcloud
  > library(wordcloud)

> (v<-sort(rowSums(as.matrix(tdm)),decreasing=TRUE))
  computing    academic    services  facilities         use
         21          10          10           9           9
    provide    software       staff        will     faculty
          7           7           7           7           6
    project    research    students      assist         can
          5           5           5           4           4
     center information   personnel programming    requires
          4           4           4           4           4


Wordcloud

•  (myNames<-names(v))
•  (d<-data.frame(word=myNames,freq=v))

                 word freq
computing   computing   21
academic     academic   10
services     services   10
facilities facilities    9
use               use    9
provide       provide    7
software     software    7
staff           staff    7
will             will    7
faculty       faculty    6
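The frequency table that feeds the word cloud can be built the same way on a toy matrix in base R. The counts are hypothetical, and the final filter only imitates the effect of a minimum-frequency cut-off; it does not call the wordcloud package.

```r
# Toy term-document matrix (hypothetical counts): rows are terms, columns documents.
m <- rbind(computing = c(4, 5, 3),
           academic  = c(3, 2, 2),
           use       = c(1, 2, 0))

# Total frequency per term, most frequent first.
v <- sort(rowSums(m), decreasing = TRUE)

# Data frame with one row per term; row names are inherited from names(v).
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)

# Terms that would survive a minimum frequency of 5.
d[d$freq >= 5, ]
```

The row names repeat the word column because data.frame takes them from the names of `v`, which is why each word appears twice in the printed table.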


Wordcloud

•  wordcloud(d$word,d$freq,min.freq=3)


References

•  R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, Addison-Wesley, New York, 1999.

•  Starkweather, J. (2014). Introduction to basic Text Mining in R. Retrieved from http://it.unt.edu/benchmarks/issues/2014/01/rss-matters

•  Feinerer, I. (2014). Introduction to the tm Package Text Mining in R. http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

•  Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1–54. http://www.jstatsoft.org/v25/i05/paper
