Presented by: Ajay Ram K P

Text Analytics



Page 2: Text Analytics


What is Text Analytics?

Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into useful business intelligence.

Text analytics processes can be performed manually, but the amount of text-based data available to companies today makes it increasingly important to use intelligent, automated solutions.

Page 3: Text Analytics


Why is Text Analytics important?

Emails, online reviews, tweets, call center agent notes, and the vast array of other written feedback all hold insight into customer wants and needs, but only if you can unlock it.

Text analytics is the way to extract meaning from this unstructured text, and to uncover patterns and themes.

Page 4: Text Analytics


Text Analytics in R

Text analytics in R is carried out with the help of the tm package.

It is a framework for text mining applications within R.

It contains functions for actions such as content transformation, word removal, finding frequent terms, and a lot more.

Page 5: Text Analytics


The Case Study Data

The data used is a collection of game reviews in an Excel sheet.

Game reviews from 1000 gamers are recorded in the data set.

The objective is to analyze these reviews, treating all of them as one text, and find the most frequent words.

Page 6: Text Analytics


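Part 1

Reading the Data

The reviews are read into a variable docs using the functions VectorSource() and Corpus().

VectorSource() creates a source in which each element of a character vector is treated as one document. Corpus() builds a corpus (a collection of text documents) from that source.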

review.txt
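A minimal sketch of the reading step, assuming the reviews are available as plain text with one review per line. The temporary file is a made-up stand-in for review.txt, and collapsing the lines into a single string is an assumption based on the stated goal of treating all reviews as one text.

```r
library(tm)

# Hypothetical stand-in for review.txt: three short reviews, one per line.
path <- tempfile(fileext = ".txt")
writeLines(c("Great game, loved the graphics!",
             "Too many bugs. Crashed 3 times.",
             "Great story and great gameplay."), path)

# Read the reviews and collapse them into one text (assumption, see above).
reviews <- readLines(path)
text <- paste(reviews, collapse = " ")

# VectorSource() treats each vector element as a document,
# so a length-one vector yields a corpus with a single document.
docs <- Corpus(VectorSource(text))
length(docs)
```

Passing reviews directly to VectorSource() instead of text would give one document per review, which is what association measures across documents need.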

Page 7: Text Analytics


Cleaning the Data

Data cleansing is required because most of the reviews contain punctuation, numbers, stop words, etc. that we don't need for the analysis.

Depending on what you are trying to achieve with your analysis, you may want to do the data cleaning step differently.

Data cleansing is done using the tm_map() function in R.
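One possible cleaning pipeline, as a sketch. The slides only name tm_map(); the particular transformations and their order are assumptions, and the two reviews are made up. VCorpus() is used here in place of Corpus() only to make the cleaned text easy to read back.

```r
library(tm)

# Made-up reviews standing in for the real data.
docs <- VCorpus(VectorSource(c("Great game, loved the graphics!",
                               "Too many bugs. Crashed 3 times.")))

# Each tm_map() call applies one transformation to every document.
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop English stop words
docs <- tm_map(docs, stripWhitespace)                    # collapse repeated spaces

# Pull the cleaned text back out as a character vector.
cleaned <- unname(sapply(docs, as.character))
cleaned
```

Lower-casing comes before stop word removal because the stop word list is lower-case.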

Page 8: Text Analytics


Finding the frequent terms and their frequency

Converting the document into a Document-Term Matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. A term-document matrix is its transpose, with terms as rows and documents as columns.

The tm package stores document-term matrices as sparse matrices for efficiency. Since we only have 1000 reviews and one document, we can just convert our term-document matrix into a normal matrix, which is easier to work with.

Code:
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)

We then take the row sums of this matrix (its rows are the terms), which gives us a named vector of term frequencies.

Now we can sort this vector to see the most frequently used words.

Code:
v <- sort(rowSums(m), decreasing=TRUE)
head(v)
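The same steps on a tiny made-up corpus, as a sketch, so the counts can be checked by hand. The three documents are invented for illustration; tdm is just a local name for the term-document matrix.

```r
library(tm)

# Three made-up, already-clean documents.
docs <- VCorpus(VectorSource(c("great game great graphics",
                               "game crashed",
                               "great story")))

# Terms are rows, documents are columns.
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
dim(m)  # 5 terms x 3 documents

# Summing each row gives the total count of each term across all documents.
v <- sort(rowSums(m), decreasing = TRUE)
head(v)  # "great" occurs 3 times and "game" twice; the other terms once
```

as.matrix() is fine at this size; for a much larger corpus the dense matrix can become too big to hold in memory.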

Page 9: Text Analytics


Page 10: Text Analytics


Plotting the Word Cloud

To plot the word cloud, we use the wordcloud package.
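A sketch of the plotting call. The frequency vector is made up and stands in for the sorted vector v from the previous step; the argument values are illustrative assumptions, not the settings used for the slides.

```r
library(wordcloud)

# Made-up term frequencies standing in for the sorted vector v.
v <- c(great = 3, game = 2, story = 1, graphics = 1)
d <- data.frame(word = names(v), freq = as.numeric(v))

# Fix the seed so the word layout is reproducible.
set.seed(1234)
wordcloud(words = d$word, freq = d$freq,
          min.freq = 1,          # keep every term in this tiny example
          max.words = 100,       # upper limit on the number of words drawn
          random.order = FALSE)  # most frequent words are placed first
```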

Page 11: Text Analytics


And Voila!!!

Page 12: Text Analytics


Part 2

Creating the Network

For network creation, we use the packages igraph, sna, and network.

Page 13: Text Analytics


Creating the Network

Finding the associations: the findAssocs() function is used.
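A sketch of findAssocs() on a made-up corpus. The function correlates term counts across documents, so this example assumes several documents rather than one collapsed text. The term "great" and the 0.5 correlation limit are illustrative choices.

```r
library(tm)

# Four made-up documents.
docs <- VCorpus(VectorSource(c("great graphics great story",
                               "great graphics",
                               "boring story",
                               "boring gameplay")))
tdm <- TermDocumentMatrix(docs)

# Terms whose counts correlate with "great" at 0.5 or above.
assoc <- findAssocs(tdm, "great", 0.5)
assoc[["great"]]
```

Here "graphics" appears only in documents that also contain "great", so it passes the limit, while "boring" never co-occurs with "great" and is left out.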

Page 14: Text Analytics


Creating the Network

Plotting the graph, using the igraph package and its graph.data.frame() function (called graph_from_data_frame() in newer igraph releases).
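A sketch of building and plotting such a graph. The edge list is made up and stands in for term pairs and correlations taken from findAssocs(); the column names and the plot settings are assumptions.

```r
library(igraph)

# Made-up edge list: each row links two terms, weighted by their association.
edges <- data.frame(from   = c("great", "great", "game"),
                    to     = c("graphics", "story", "crashed"),
                    weight = c(0.90, 0.55, 0.70))

# The first two columns define the edges; further columns become edge attributes.
g <- graph_from_data_frame(edges, directed = FALSE)

# Draw stronger associations with thicker lines.
plot(g, edge.width = E(g)$weight * 5, vertex.label.color = "black")
```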

Page 15: Text Analytics


And there it is!!!

Page 16: Text Analytics


Another Graph

A graph in which the frequent terms are the nodes and their frequencies are the interaction strength.

Page 17: Text Analytics


In case of large networks

Say the network has more than 10K nodes. Such networks will be complicated.

To quantify such networks, we look at the statistical aspects of the network.

Random network, scale-free network, or hierarchical network models would be a good fit in such cases.

[Figures: Random Network, Scale-free Network, Hierarchical Network]
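A sketch of comparing two of these models by their summary statistics, using igraph generators. The network size (1000 nodes, to keep it quick), the parameter values, and the helper summarise_network() are assumptions for illustration. The hierarchical model is left out because this sketch does not assume a ready-made generator for it.

```r
library(igraph)

set.seed(42)
n <- 1000

# Random (Erdos-Renyi) network: each pair of nodes is linked with probability p.
g_random <- sample_gnp(n, p = 4 / n)

# Scale-free (Barabasi-Albert) network: new nodes prefer well-connected nodes.
g_scale_free <- sample_pa(n, m = 2, directed = FALSE)

# Hypothetical helper: a few statistics that describe a network's structure.
summarise_network <- function(g) {
  d <- degree(g)
  c(mean_degree = mean(d),
    max_degree  = max(d),
    clustering  = transitivity(g, type = "global"))
}

rbind(random     = summarise_network(g_random),
      scale_free = summarise_network(g_scale_free))
```

Both networks have a similar mean degree, but the scale-free one typically contains a few hubs with far more links than any node in the random one, which is the kind of difference these statistics are meant to expose.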

Page 18: Text Analytics


Where else can network approaches be powerful?

Biological science

Economics

Computer science

Page 19: Text Analytics
Page 20: Text Analytics


THANK YOU!!!