Visual Analytics for Linguistics - Day 4 ESSLLI - structured data

Preview:

Citation preview

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Visual Analytics for Linguistics - Day 4

Olga Scrivner

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

What You Will Learn

DAY 1 Introduction to Visual Analytics

DAY 2 Visualization Methods, Design, and Tools

DAY 3 Working with Unstructured Data

DAY 4 Working with Structured Data

DAY 5 Advanced Analytics

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Our Materials - Web Site

http://obscrivn.wixsite.com/visualization

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

What We Need

I Interactive Text Mining Suite

I Language Variation Suite

I Download R code

I Download files from Day 3 and Day 4

I R and Rstudio

I R libraries: tm, party, partykit

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

What We Need

I Interactive Text Mining Suite

I Language Variation Suite

I Download R code

I Download files from Day 3 and Day 4

I R and Rstudio

I R libraries: tm, party, partykit

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Text Mining Review

I Intro to ITMS - slides Day 3I R coding

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Text Mining Review

1. Open RStudio2. Open R file textmining.R:

3. Set up working directory

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Text Mining Package - tm

I Main structure - corpus

I Corpus is constructed via DirSource, VectorSource,DataframeSource

mycorpus <- Corpus(VectorSource(file.txt))

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Preprocessing with tm

I lower case

mycorpus <- tm_map(mycorpus, tolower)

I remove punctuation

mycorpus <- tm_map(mycorpus,removePunctuation)

I remove numbers

mycorpus <- tm_map(mycorpus, removeNumbers)

I remove stopwords

mycorpus <- tm_map(mycorpus, removeWords,stopwords(’english’))

I mycorpus <- tm_map(mycorpus, stripWhitespace)

I mycorpus <- tm_map(mycorpus,PlainTextDocument)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

What is LVS?

Language Variation Suite

It is a Shiny web application originally designed for dataanalysis in sociolinguistic research.

It can be used for:I Processing spreadsheet data

I Visualizing data

I Analyzing means, regression, conditional trees ...(and much more)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Background

LVS is built in R using Shiny package:

1. R - a free programming language for statisticalcomputing and graphics

2. Shiny App - a web application framework for R

Computational power of R + Web interactivity

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Background

http://littleactuary.github.io/blog/Web-application-framework-with-Shiny/

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Workspace

Browser

I Chrome, Firefox, Safari - recommendable

I Explorer may cause instability issues

AccessibilityI PC, Mac, Linux

I Data files will be uploaded from any location on yourcomputer

I Smart PhoneI Data files must be on a cloud platform connected to

your phone account (e.g. dropbox)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Server

Since LVS is hosted on a server, Shiny idle time-out settingsmay stop application when it is left inactive (it will grey out).

Solution: Click reload and re-upload your csv file

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Data Preparation

Important things to consider before data entry:I File format:

I Comma separated value (CSV) - faster processingI Excel format will slow processing

I Column names should not contain spacesI Permitted: non-accented characters, numbers,

underscore, hyphen, and period

I One column must contain your dependent variableI The rest of the columns contain independent variables

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Terminology Review

a. Categorical - non-numerical data with two values

I yes - no; male - female

b. Continuous - numerical data

I duration, age, year

c. Multinomial - non-numerical data with three or morevalues

I regions, nationalities

d. Ordinal - scale: currently not supported

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Terminology Review

a. Categorical - non-numerical data with two values

I yes - no; male - female

b. Continuous - numerical data

I duration, age, year

c. Multinomial - non-numerical data with three or morevalues

I regions, nationalities

d. Ordinal - scale: currently not supported

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Workshop Files

1. http://obscrivn.wixsite.com/visualizationDownload file Day 4: movie_metadata.csvSimplified set from https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

2. LVS web site: http://languagevariationsuite.com/

http://cl.indiana.edu/~obscrivn/docs/movie_metadata.csv

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Movie Data

I BudgetI DirectorI Actor 1I Director facebook likesI Actor 1 facebook likesI GenreI Year

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Language Variation Suite - Structure

1. Data

I Upload file, data summary, adjust data, cross tabulation

2. Visual Analysis

I Plotting, cluster classification

3. Inferential Statistics

I Modeling, regression, conditional trees, random forest

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Language Variation Suite - Structure

1. Data

I Upload file, data summary, adjust data, cross tabulation

2. Visual Analysis

I Plotting, cluster classification

3. Inferential Statistics

I Modeling, regression, conditional trees, random forest

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Upload File

Upload movie_metadata.csv

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Uploaded Dataset

The data content is imported as a table and allows forsorting columns.

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Summary

Summary provides a quantitative summary for each variable,e.g. frequency count, mean, median.

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Data Structure

1. Total number of observations (rows)

2. Number of variables (columns)

3. Variable types

I Factor - categorical valuesI Num - numeric values (0.95, 1.05)I Int - integer values (1, 2, 3)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cross Tabulation

Cross-tabulation examines the relationship between variables.

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cross Tabulation Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cross Tabulation Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Adjusting Browser - Layout

Shiny pages are fluid and reactive.

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Adjusting Browser - Layout

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Language Variation Suite - Structure

1. Data

I Upload file, data summary, adjust data, cross tabulation

2. Visual Analysis

I Plotting, cluster classification

3. Inferential statistics

I Modeling, regression, varbrul analysis, conditional trees,random forest

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

One Variable Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

One Variable Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Customizing Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Customizing Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Saving Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cluster Plot

I Classification of data into sub-groups is based onpairwise similarities

I Groups are clustered in the form of a tree-likedendrogram

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cluster Plot

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Cluster Plot

Group 1 Animation, Biography and Group 2 Action, Drama,Comedy

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Inferential Statistics

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Language Variation Suite - Structure

1. Data

I Upload file, data summary, adjust data, cross tabulation

2. Visual Analysis

I Plotting, cluster classification

3. Inferential statistics

I Modeling, regression, varbrul analysis, conditional trees,random forest

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

How to Create a Regression Model

Step 1 Modeling - create a model with dependentand independent variables

Step 2 Regression - specify the type of regression(fixed, mixed) and type of dependent variable(binary, continuous, multinomial)

Step 3 Stepwise Regression - compare models(Log-likelihood, AIC, BIC)

Step 4 Conditional Trees - apply non-parametrictests to the model

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Modeling

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Regression Types

I Model

a.) Fixed effect

b.) Mixed effect - individual speaker/token variation (withingroup)

I Type of Dependent Variable

a.) Binary/categorical (only two values)

b.) Continuous (numeric)

c.) Multinomial - categorical with more than two values

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Regression

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Model Output

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Model Output

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Interpretation: Budget and Genre

I Genre Action is the reference value

I Positive coefficient - positive effect

I Negative coefficient - negative effect

http://www.free-online-calculator-use.com/scientific-notation-converter.html

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Interpretation: Budget and Genre

I Genre Action is the reference value

I Positive coefficient - positive effect

I Negative coefficient - negative effecthttp://www.free-online-calculator-use.com/scientific-notation-converter.html

exponential notation:1.46e-7

0.000000146

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Conditional Tree

Conditional tree: a simple non-parametric regression analysis,commonly used in social and psychological studies

I Linear regression: all information is combined linearly

I Conditional tree regression: visual splitting to captureinteraction between variables

Recursive splitting (tree branches)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Conditional Tree

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Conditional Tree

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Conditional Tree

1. Genre is the significant factor for budget2. Budget distribution is split in two groups:

I Action and Animation

I Biography, Comedy and Drama

3. Budget is significantly higher for Animation and Action

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Random Forest

1. Variable importance for predictors

2. Robust technique with small n large p data

3. All predictors considered jointly (allows for inclusion ofcorrelated factors)

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Random Forest

Let’s add more factors!

I Return to Modeling

I Add independent factors: director facebook likes,actor 1 facebook likes, title year

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Random Forest

I Genre is the most important predictor for this model.I Close to zero or red-dotted line (cut off values) -

irrelevant for this model

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Random Forest

I Genre is the most important predictor for this model.I Close to zero or red-dotted line (cut off values) -

irrelevant for this model

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

Assignment

I Select LVS or ITMS

I Upload your own file (csv for LVS or txt/pdf for ITMS)

I Explore your data

Visual Analyticsfor Linguistics -

Day 4

Olga Scrivner

Course Info

tm Package

StatisticalVisualization

Data Preparation

LVS

Working withData

Visual Analytics

InferentialAnalysis

References I

Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics.Cambridge: Cambridge University Press

Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In DouglasBiber Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics.Cambridge: Cambridge University Press

Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0.

http://gifsanimados.espaciolatino.com/x_bob_esponja_8.gif

https://daniellestolt.files.wordpress.com/2013/01/are-you-ready1.gif

http://www.martijnwieling.nl/R/sheets.pdf

Recommended