English R Lightning Talks @ BURN (2014-04-22)

23 April 2014

László Gönczy: Exploratory data analysis: project experience and ongoing developments

Quanopt


Gergely Horváth: R workshop in Bucharest

KSH

buchaRest

Literature

Romania - organizer

Giraffe and church

Big-big professor

Mr V. Tepes alias Dracula

Hungary

Serbia

Ancient hero - Traian

Austria

Romania

RO – GB - NL


Quanopt

Imre Kocsis: Bigvis: plotting (relatively) large data in R

Budapest University of Technology and Economics, Department of Measurement and Information Systems

Bigvis: plotting (relatively) large data in R

Kocsis Imre

[email protected]

BURN Meetup, 2014.04.22.


Let’s do Exploratory Data Analysis!

„Flight data”

2008: 113MB df

~7 million x 29

> system.time(print((qplot(data=b, x=Distance, y=AirTime))))

   user  system elapsed
  102.2    60.2   163.5

SotA

Relatively Painless Visual EDA

Relatively Painless Handling of Big Data

[…]

[…]

bigvis

From Hadley Wickham

A rather generic approach

o Paper: vita.had.co.nz/papers/bigvis.pdf

o Slides: files.meetup.com/1406240/bigvis.pdf

A reference implementation in R

o ggplot2 gets a huge boost

o GitHub: hadley/bigvis

Big Data EDA?

Subsampling is a hassle.

You probably want…

0. For the whole data

1. Summary statistics over

2. Interval-binned data

+ Error approx. would be nice

+ Suppress outliers (or not)

Put in pictures…

ggplot2 bigvis

Few seconds

bigvis (simplified) workflow

Data in memory
→ bin(): interval binning
→ condense(): count, sum, mean, median, sd
→ smooth(): smooth out errors
→ peel(): peel off outliers

… and then plot with ggplot
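
The bin and condense steps can be imitated in base R to see what they compute. This is only a sketch on synthetic data, not the bigvis API: the variable names, the bin width of 100 and the simulated Distance/AirTime relationship are assumptions made for illustration.

```r
# Base-R imitation of the bin -> condense idea (NOT the bigvis functions).
set.seed(1)
n <- 1e5
distance <- runif(n, 0, 3000)                  # synthetic stand-in for Distance
air_time <- distance / 8 + rnorm(n, sd = 10)   # synthetic stand-in for AirTime

bin_width <- 100                               # assumed bin width

# Bin: fixed-width interval binning, one bin index per observation
bin_id <- floor(distance / bin_width) + 1
n_bins <- max(bin_id)

# Condense: one row of summaries per bin instead of one row per observation
condensed <- data.frame(
  bin_left = (seq_len(n_bins) - 1) * bin_width,
  count    = tabulate(bin_id, nbins = n_bins),
  mean_air = as.vector(tapply(air_time,
                              factor(bin_id, levels = seq_len(n_bins)),
                              mean))
)

head(condensed, 3)
```

The condensed table has one row per bin, so plotting it costs the same whether the input had a thousand rows or seven million.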

Some other aspects

Some further automatic magic with KDE

Relative error estimation with alpha / hues

Vis. patterns for (n, m)-d datasets

o n: # of binned variables

o m: # of summaries

o Dens. estimate: (1,1)-d, earlier: (2,1)-d

Parallelization & decoupling?

The pattern can scale by moving out concerns from R

bin: see MapReduce

Some formulations are easy for stream processing, too

[Diagram: data → bin → summarize → smooth → visualize]


Summary: depends…

Distributive stats: count, sum, min, max

Algebraic stats: mean, sd, higher moments

Holistic…? (quantiles, countdistinct)
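
Why the distinction matters for scaling can be shown in a few lines. The sketch below assumes the data arrives in four chunks (the chunking and the helper name `partial` are hypothetical): distributive partials are combined by plain addition, and the algebraic statistics are then derived from the combined partials.

```r
# Per-chunk partial summaries: count, sum and sum of squares are distributive
partial <- function(x) c(n = length(x), s = sum(x), ss = sum(x^2))

set.seed(42)
x <- rexp(10000)
chunks <- split(x, rep(1:4, length.out = length(x)))   # four hypothetical chunks

# Combine the partials by addition (the "reduce" step)
total <- Reduce(`+`, lapply(chunks, partial))

# Algebraic statistics follow from the combined partials.
# Note: this textbook sd formula can lose precision for data with a large mean.
mean_all <- total[["s"]] / total[["n"]]
sd_all   <- sqrt((total[["ss"]] - total[["s"]]^2 / total[["n"]]) /
                 (total[["n"]] - 1))

c(mean = mean_all, sd = sd_all)
```

A holistic statistic such as the median has no fixed-size partial of this kind, which is why it is the hard case.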



Input: mostly „resolution” bound

R excels here


Towards interactive EDA?

Bin-summarize-smooth can still take long…

Precompute/cache…

… and e.g. update after new batches
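
A minimal sketch of the precompute-and-update idea, assuming the bin edges are fixed up front and only counts are cached. The helper `count_bins` and the batch sizes are hypothetical.

```r
breaks <- seq(0, 3000, by = 100)   # fixed bin edges agreed on up front (assumed)

count_bins <- function(x, breaks) {
  # findInterval gives the bin index; 0 or length(breaks) means out of range
  idx <- findInterval(x, breaks, rightmost.closed = TRUE)
  tabulate(idx[idx >= 1 & idx < length(breaks)], nbins = length(breaks) - 1)
}

set.seed(7)
# Precomputed once from the raw data at rest
cache <- count_bins(runif(5000, 0, 3000), breaks)

# A new batch only needs its own counts added; the raw data is not rescanned
new_batch <- runif(1000, 0, 3000)
cache <- cache + count_bins(new_batch, breaks)

sum(cache)
```

The client then only ever reads the small cached vector, not the raw data.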

Raw data-at-rest

RDBMS / in-memory summarized data

client


András Tajti: Changing User Roles in an Online Forum

Changing User Roles in an Online Forum

András Tajti

BURN meetup

04.23.2014.

Questions

1. Can we declare patterns in user behaviour?

2. Can we detect the change of the behaviour?

Of course, we can!

I will show you one way...

Theoretical tools

● You need features to describe behaviour:
  – Network science

● You need to find the most important variables:
  – Principal component analysis

● You need to find users with similar behaviour:
  – Cluster analysis

Practical tools

● To do all the computations, I used R packages:
  – igraph for extracting network features
  – pcaPP and rrcov for PCA
  – fpc for cluster evaluation

● Of course, mostly basic R functions were used:
  – princomp for PCA
  – hclust for hierarchical clustering
  – compiler package for faster computation

What does a forum look like?

● A post is either a reply to another post or not:
  – A post has an out-degree of at most one
  – It can have an in-degree above one, as any later post can refer to it.
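
The degree structure of such a reply graph can be checked on a toy thread. The data frame below is hypothetical and the sketch uses base R only, whereas the talk used igraph for the real feature extraction.

```r
# Hypothetical toy thread: post id, author, and the post it replies to
# (NA marks a thread starter)
posts <- data.frame(
  id       = 1:6,
  author   = c("ann", "bob", "ann", "cid", "bob", "cid"),
  reply_to = c(NA, 1, 2, 1, 4, 1),
  stringsAsFactors = FALSE
)

# Out-degree: 1 if the post replies to something, else 0 (never more than 1)
out_degree <- as.integer(!is.na(posts$reply_to))

# In-degree: how many later posts reply to this one
in_degree <- tabulate(posts$reply_to[!is.na(posts$reply_to)],
                      nbins = nrow(posts))

# A simple per-user feature: number of posts
posts_per_user <- table(posts$author)

data.frame(id = posts$id, out_degree, in_degree)
```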

Users' features

● To describe behaviour, I used:
  – Number of posts
  – Number of neighbours
  – Parent users' in- and out-degree
  – All of the above as ranks and relative ranks

Choosing important features

● Main problem: all variables have heavy-tailed distributions
  – Principal component analysis works best for normally distributed variables
  – Alternatives:
    ● Robust correlation estimation
    ● Projection pursuit methods
  – Winner: ROBPCA from rrcov (as PcaHubert)
  – Mostly the same results as the original princomp

Searching groups

● Cluster analysis:
  – Hierarchical, with Euclidean distance and complete linkage
  – Used on the PCA scores increased with explained variance
  – Technical limits on the number of clusters:
    ● Min.: 2 (the result contained groupings with at least three groups)
    ● Max.: 30 (was reached a few times)
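
A minimal sketch of this pipeline with the base functions named earlier (princomp, hclust). The data is synthetic, the feature names are placeholders, and weighting the score columns by their share of explained variance is only one possible reading of "increased with explained variance". The choice of k = 3 is arbitrary here.

```r
set.seed(3)
# Synthetic heavy-tailed user features (log-normal), standing in for forum data
features <- matrix(rlnorm(200 * 4), ncol = 4,
                   dimnames = list(NULL, c("posts", "neighbours",
                                           "parent_in", "parent_out")))

# Classical PCA on the correlation matrix
pca <- princomp(features, cor = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# ASSUMPTION: weight each score column by its share of explained variance
weighted_scores <- sweep(pca$scores, 2, explained, `*`)

# Hierarchical clustering: Euclidean distance, complete linkage
tree  <- hclust(dist(weighted_scores, method = "euclidean"),
                method = "complete")
roles <- cutree(tree, k = 3)

table(roles)
```

In the talk the number of clusters was chosen from goodness measures rather than fixed in advance.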

Selecting cluster numbers

● For every goodness measure, I was looking for:
  – First local minimum/maximum
  – Sharpest "elbow"

Select by eye

What is changing?

● I used "time windows" to slice the data
● One window contained 1000 posts and their full threads
● I ran role detection for all sets
● Then compared memberships between clusters

How to compare memberships?

● There are users only in one or the other dataset

● Two groups are similar if they have significant number of common users:
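
The comparison can be sketched as a cross-tabulation of the users present in both windows. The role assignments below are hypothetical, and the sketch only counts shared users. It does not reproduce the significance criterion from the talk, which is not preserved in these slides.

```r
# Hypothetical role assignments from two consecutive time windows
window_a <- c(u1 = 1, u2 = 1, u3 = 2, u4 = 2, u5 = 3)
window_b <- c(u2 = 1, u3 = 1, u4 = 2, u5 = 2, u6 = 1)

# Only users present in both windows can be compared
common <- intersect(names(window_a), names(window_b))

# Rows: clusters in window A, columns: clusters in window B,
# cells: number of shared users
overlap <- table(a = window_a[common], b = window_b[common])

overlap
```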

Example

Example

Thank You!

[email protected] @atajti

The code will be available at github.com/atajti/changingForumRoles


MTA

Dénes Tóth: Dilemmas in package development: interactive visualization, GUIs, largish data, extensibility

[email protected]

Dilemmas in package development:

Dénes Tóth


BURN Meetup, Budapest, 23.04.2014.

Dénes Tóth / [email protected] MTA TTK KPI/ humlab.cogpsyphy.hu


• Electroencephalography (EEG)
  – Voltage fluctuations (μV) recorded at the scalp
  – A typical setup: 32-128 channels, 500-1000 Hz sampling rate, 30-90 minutes recording time, 20-30 participants → 200 MB – 2 GB / participant
  – Tasks: raw data import + signal processing (filtering, resampling, artifact correction [e.g. eye movements])
  – Visual inspection is unavoidable → interactive visualization is a must

EEG





• Cognitive experiments: what does the brain do if exposed to A versus B
  – EEG & events = Event-related potentials (ERP)
  – 40-200 repetitions per condition, factorial design (Fac1 x Fac2)
  – Tasks: marker handling, segmentation, artifact rejection, averaging, time-frequency analyses → extract components & do statistics (e.g. clustering, ANOVA, etc.) → tremendous number of analytic possibilities
  – Randomization statistics (e.g. 5000 permutations)
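
To make the cost of randomization statistics concrete, here is a minimal sign-flip permutation test on synthetic paired data. The amplitudes, sample size and effect size are invented, and this is a generic textbook procedure, not the eegR implementation.

```r
set.seed(11)
# Synthetic per-participant mean amplitudes (microvolts) in two conditions
cond_a <- rnorm(25, mean = 2.0, sd = 1)
cond_b <- rnorm(25, mean = 1.2, sd = 1)
diffs  <- cond_a - cond_b

observed <- mean(diffs)
n_perm   <- 5000

# Under the null hypothesis the sign of each paired difference is exchangeable
perm_stats <- replicate(
  n_perm,
  mean(diffs * sample(c(-1, 1), length(diffs), replace = TRUE))
)

# Two-sided p-value, counting the observed statistic itself
p_value <- (sum(abs(perm_stats) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

In a real ERP analysis this loop runs for every channel and time point, which is where the performance pressure comes from.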

ERP





• No dedicated comprehensive package in R for EEG, but a lot of related packages (e.g. signal, mFilter, icaOcularCorrection + one trillion statistical methods)

• Present (eegR)
  – Base data class: array
  – Basic operation: apply-like
  – ~60 functions, ~4000 lines → appropriate for a specific workflow
  – No cohesive system
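
What "array as base class, apply-like operations" can look like is sketched below on random data. The dimension order (channels x time points x trials) and all names are assumptions for illustration, and no eegR function is used.

```r
set.seed(5)
# ASSUMED layout: channels x time points x trials
eeg <- array(rnorm(4 * 100 * 30), dim = c(4, 100, 30),
             dimnames = list(channel = paste0("ch", 1:4),
                             time = NULL, trial = NULL))

# Averaging over trials gives one waveform per channel (an ERP-like average)
erp <- apply(eeg, c(1, 2), mean)

# Peak amplitude per channel, again via apply
peak <- apply(erp, 1, max)

dim(erp)
peak
```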

eegR package



• Future (dream :)
  – Covers all basic analytic steps + highly extensible
  – Provides workflow + GUI + scripting
  – Handles out-of-memory datasets well, easy parallelization
  – Interactive visualization capabilities

eegR package



• One package
  – Pros: easy install process, better tuning options
  – Cons: less general, harder to extend

• One core package + extensions
  – Pros: anyone can write extensions, easy to invoke other packages
  – Cons: the core package must be very well written

Question I. One package or related packages?



• Range
  – Introduce only classes, methods and utility functions, or provide a basic stand-alone package?

• Classes
  – S3 / S4 / R5?

Question I/a. How to write a good core?



Workflow approach: R AnalyticFlow

• Pros:
  – the natural way of EEG signal processing
  – unconstrained scripting

• Cons:
  – reliability?
  – performance?

Question II. What about the user interface?



• GUI coverage
  – Full GUI ←→ subtask- (function-) related GUI

• GUI type
  – Desktop GUI ←→ web-based GUI
  – gWidgets2 | gWidgetsWWW2 | Shiny | ...

Question II. What about the user interface?



• SciDB would be great, but only available on Linux

• Two candidates: ff & gdsfmt packages
  – ff package: more comprehensive
  – gdsfmt package: lightweight & fast

Question III. Which out-of-RAM package to choose?



• iPlots, playwith, ggvis etc.: good, but not efficient

• Performance issues: a 10-sec part can contain 128 x 1000 x 10 = 1,280,000 data points

• Candidates for line plots:
  – Acinonyx
  – rCharts w. Dygraphs

Question IV. Interactive visualization?



• Acinonyx (iPlots Extreme)
  – Pros: very fast, iContainer
  – Cons: very poor documentation, not on CRAN

• rCharts and other web-based tools, esp. JavaScript libraries
  – Dygraphs: fast and nice, but no official port to rCharts
  – Communication between JS and R?

Question IV. Interactive visualization?



Thank you!

Q1: One package or related packages?

Q1a: What should the base package cover? Do I need S4 or R5?

Q2: User interface? → R AnalyticFlow, GUIs

Q3: How to handle out-of-memory data?

Q4: Interactive visualization?



rapporter.net

Gergely Daróczi: pander: Transforming R objects to Pandoc’s markdown

pander: A Pandoc writer in R
Transforming R objects to Pandoc’s markdown

Gergely Daróczi
[email protected]

Budapest Users of R Network

23 April 2014

What is pander?
A collection of helper functions to print markdown syntax

> ?pandoc.(footnote|header|horizontal.rule|image|link|p)(.return)?
> ?pandoc.(emphasis|strikeout|strong|verbatim)(.return)?

> pandoc.strong('foobar')
**foobar**

> pandoc.strong.return('foobar')
[1] "**foobar**"

> pandoc.header('foobar', level = 2)

## foobar

> pandoc.header('foobar', style = 'setext')

foobar
======


What is pander?
Collection of helper functions to map R objects to markdown

> ?pandoc.(list|table)(.return)?

> pandoc.list(list('foo', list('bar')))

* foo
    * bar

> pandoc.table(head(iris, 2), split.table = Inf)

-------------------------------------------------------------------
 Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
-------------- ------------- -------------- ------------- ---------
     5.1            3.5            1.4            0.2       setosa

     4.9             3             1.4            0.2       setosa
-------------------------------------------------------------------


What is pander?
Collection of helper functions to map R objects to various markdown languages

> pandoc.table(head(iris, 2), split.table = Inf, style = 'rmarkdown')

|  Sepal.Length  |  Sepal.Width  |  Petal.Length  |  Petal.Width  |  Species  |
|:--------------:|:-------------:|:--------------:|:-------------:|:---------:|
|      5.1       |      3.5      |      1.4       |      0.2      |  setosa   |
|      4.9       |       3       |      1.4       |      0.2      |  setosa   |

> pandoc.table(head(iris, 2), split.table = Inf, style = 'simple')

 Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
-------------- ------------- -------------- ------------- ---------
     5.1            3.5            1.4            0.2       setosa
     4.9             3             1.4            0.2       setosa


What is pander?
Collection of helper functions to map R objects to various markdown languages

> iris$Species <- 'foos and bars'; names(iris) <- gsub('.', ' ', names(iris), fixed = TRUE)
> pandoc.table(head(iris, 4), split.table = Inf, style = 'grid',
+     split.cells = 5, justify = 'left')

+----------+---------+----------+---------+------------+
| Sepal    | Sepal   | Petal    | Petal   | Species    |
| Length   | Width   | Length   | Width   |            |
+==========+=========+==========+=========+============+
| 5.1      | 3.5     | 1.4      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.9      | 3       | 1.4      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.7      | 3.2     | 1.3      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.6      | 3.1     | 1.5      | 0.2     | foos       |
|          |         |          |         | and        |
|          |         |          |         | bars       |
+----------+---------+----------+---------+------------+


What is pander?
S3 method to map R objects to markdown

> ?pander(.return)?

> methods(pander)

[1] pander.anova* pander.aov* pander.cast_df* pander.character*

[5] pander.data.frame* pander.default* pander.density* pander.evals*

[9] pander.factor* pander.glm* pander.htest* pander.image*

[13] pander.list* pander.lm* pander.logical* pander.matrix*

[17] pander.NULL* pander.numeric* pander.option pander.POSIXct*

[21] pander.POSIXt* pander.prcomp* pander.rapport* pander.table*

Non-visible functions are asterisked

> pander(head(iris, 1), split.table = Inf)

-------------------------------------------------------------------

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

-------------- ------------- -------------- ------------- ---------

5.1 3.5 1.4 0.2 setosa

-------------------------------------------------------------------



> pander(letters[1:7])

_a_, _b_, _c_, _d_, _e_, _f_ and _g_

> pander(ks.test(runif(50), runif(50)))

---------------------------------------------------

Test statistic P value Alternative hypothesis

---------------- --------- ------------------------

0.18 _0.3959_ two-sided

---------------------------------------------------

Table: Two-sample Kolmogorov-Smirnov test: ‘runif(50)‘ and ‘runif(50)‘

> pander(chisq.test(table(mtcars$am, mtcars$gear)))

---------------------------------------

Test statistic df P value

---------------- ---- -----------------

20.94 2 _2.831e-05_ * * *

---------------------------------------

Table: Pearson’s Chi-squared test: ‘table(mtcars$am, mtcars$gear)‘

Warning message:In chisq.test(table(mtcars$am, mtcars$gear)) :

Chi-squared approximation may be incorrect



> pander(lm(mtcars$wt ~ mtcars$hp), summary = TRUE)

--------------------------------------------------------------

Estimate Std. Error t value Pr(>|t|)

----------------- ---------- ------------ --------- ----------

**mtcars$hp** 0.009401 0.00196 4.796 4.146e-05

**(Intercept)** 1.838 0.3165 5.808 2.389e-06

--------------------------------------------------------------

-------------------------------------------------------------

Observations Residual Std. Error $R^2$ Adjusted $R^2$

-------------- --------------------- ------- ----------------

32 0.7483 0.4339 0.4151

-------------------------------------------------------------

Table: Fitting linear model: mtcars$wt ~ mtcars$hp


What is pander?
S3 method to map R objects to pretty formatted markdown

> panderOptions('table.split.table', Inf)

> panderOptions('table.style', 'grid')

> emphasize.cells(which(iris > 1.3, arr.ind = TRUE))

> pander(iris)

+----------------+---------------+----------------+---------------+------------+

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |

+================+===============+================+===============+============+

| *5.1* | *3.5* | *1.4* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *4.9* | *3* | *1.4* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *4.7* | *3.2* | 1.3 | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *4.6* | *3.1* | *1.5* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *5* | *3.6* | *1.4* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *5.4* | *3.9* | *1.7* | 0.4 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *4.6* | *3.4* | *1.4* | 0.3 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *5* | *3.4* | *1.5* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+

| *4.4* | *2.9* | *1.4* | 0.2 | setosa |

+----------------+---------------+----------------+---------------+------------+


What is pander?
Tool for literate programming like Sweave, knitr or brew

> ?Pandoc.brew

> Pandoc.brew(text = '
+ Pi equals to <%= pi %>, and the best damn cars are:
+ <%= head(mtcars, 2) %>
+ ')

Pi equals to _3.142_, and the best damn cars are:

--------------------------------------------------------

mpg cyl disp hp drat wt

------------------- ----- ----- ------ ---- ------ -----

**Mazda RX4** 21 6 160 110 3.9 2.62

**Mazda RX4 Wag** 21 6 160 110 3.9 2.875

--------------------------------------------------------

Table: Table continues below

--------------------------------------------------

qsec vs am gear carb

------------------- ------ ---- ---- ------ ------

**Mazda RX4** 16.46 0 1 4 4

**Mazda RX4 Wag** 17.02 0 1 4 4

--------------------------------------------------



Features of Pandoc.brew:

• brew loops and conditional parts of a report just like with brew,
• capturing plots and images with automatically applied theme,
• render all R objects automatically in Pandoc’s markdown,
• recording all warning/error messages plus the raw R objects along with anything printed to stdout and the printed results,
• custom caching mechanism to disk or RAM with auto-dependency,
• convert to HTML/pdf/odt/docx at one go,
• no chunk options (only workaround),
• building reports also in interactive session with an R5 reference class.

http://rapporter.github.io/pander/#brew-to-pandoc








What is pander?
Tool for literate programming like Sweave, knitr or brew – with global options

> ?panderOptions
> ?evalsOptions

• number formatting style (decimal mark, digits, trailing spaces etc.),
• date format,
• table formats (split, alignment, caption etc.),
• vector options (separator, copula, wrapper character),
• global graph settings for base, lattice and ggplot2 calls:
  – color palette, font settings, grid,
  – legend position, axis labels angle etc.
• plot dimensions, resolution,
• cache options, hooks, filter output etc.

http://rapporter.github.io/pander/#pander-options







What is pander?
Tool for literate programming like Sweave, knitr or brew – a quick comparison

> require(wordcloud)
> pkgs <- ctv:::.get_pkgs_from_ctv_or_repos('ReproducibleResearch')[[1]]
> wordcloud(pkgs, rep(1, times = length(pkgs)), colors = rainbow(length(pkgs)),
+     random.color = TRUE)

And pander is intended to be a wrapper around Pandoc, so it also transforms markdown files to other document formats:

> ?Pandoc.convert
> Pandoc.brew(..., convert = '(html|pdf|odt|docx)', ...)


Job advertisements

Data Scientist

Requirements:
* Data warehouse experience
* SQL, NoSQL
* Programming (e.g. Perl)
* English

Advantages:
* R programming
* Math or insurance degree
* German

Rails programmer

Requirements:
* 2 yrs of Rails experience
* jQuery, Ajax
* git
* work without specs :)

Advantages:
* stats knowledge
* GH and SO activity
* SaaS experience

01 László Gönczy Exploratory data analysis.

02 Gergely Horváth R workshop in Bucharest.

03 Imre Kocsis Bigvis: plotting large data in R.

04 András Tajti Changing User Roles in a Forum.

05 Dénes Tóth Dilemmas in package development.
