6 lightning talks on various R topics at the Budapest Users of R Network:
• László Gönczy - Exploratory data analysis: project experience and ongoing developments
• Gergely Horváth - R workshop in Bucharest
• Imre Kocsis - Bigvis: plotting (relatively) large data in R
• András Tajti - Changing User Roles in an Online Forum
• Dénes Tóth - Dilemmas in package development: interactive visualization, GUIs, largish data, extensibility
• Gergely Daróczi - Transforming R objects to Pandoc’s markdown
More details: http://www.meetup.com/Budapest-Users-of-R-Network/events/174345362/
23 April 2014
László Gönczy: Exploratory data analysis: project experience and ongoing developments
Quanopt
Gergely Horváth: R workshop in Bucharest
KSH
buchaRest
Literature
Romania - organizer
Girafe and church
Big-big professor
Mr V. Tepes alias Dracula
Hungary
Serbia
Ancient hero - Traian
Austria
Romania
RO – GB - NL
Quanopt
Imre Kocsis: Bigvis: plotting (relatively) large data in R
Budapest University of Technology and Economics, Department of Measurement and Information Systems
Bigvis: plotting (relatively) large data in R
Kocsis Imre
BURN Meetup, 2014.04.22.
Let’s do Exploratory Data Analysis!
"Flight data"
2008: 113 MB data frame, ~7 million rows x 29 columns
> system.time(print(qplot(data = b, x = Distance, y = AirTime)))
   user  system elapsed
  102.2    60.2   163.5
State of the art
Relatively painless visual EDA
Relatively painless handling of big data
[…]
[…]
bigvis
From Hadley Wickham
A rather generic approach
o Paper: vita.had.co.nz/papers/bigvis.pdf
o Slides: files.meetup.com/1406240/bigvis.pdf
A reference implementation in R
o ggplot2 gets a huge boost
o GitHub: hadley/bigvis
Big Data EDA?
Subsampling is a hassle.
You probably want…
0. For the whole data
1. Summary statistics over
2. Interval-binned data
+ Error approx. would be nice
+ Suppress outliers (or not)
Put in pictures…
[Figure: ggplot2 vs. bigvis plots, annotated "Few seconds"]
bigvis (simplified) workflow
Data in memory → bin() → condense() → smooth() → peel()
o bin(): interval binning
o condense(): count, sum, mean, median, sd
o smooth(): smooth out errors
o peel(): peel off outliers
… and then plot with ggplot
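The bin and condense steps can be sketched in base R, without the bigvis package. This is only an illustration of the idea, not the bigvis API: the data are simulated as a stand-in for the flight data, and the names `distance`, `air_time`, `bin_width` and `condensed` are made up for this example.

```r
# Sketch of interval binning and per-bin summaries in base R.
# Simulated stand-in for the flight data; all names are illustrative.
set.seed(1)
n <- 1e5
distance <- rexp(n, rate = 1 / 700)               # heavy right tail
air_time <- distance / 8 + rnorm(n, sd = 10)

bin_width <- 100
# bin: map every observation to the left edge of its fixed-width interval
bin_left <- floor(distance / bin_width) * bin_width

# condense: one row per bin with the count and mean of the second variable
condensed <- data.frame(
  distance = sort(unique(bin_left)),
  count    = as.vector(tapply(air_time, bin_left, length)),
  mean     = as.vector(tapply(air_time, bin_left, mean))
)

# The plot now draws one point per bin instead of one per observation.
plot(condensed$distance, condensed$mean, cex = sqrt(condensed$count) / 30,
     xlab = "Distance (binned)", ylab = "Mean air time")
```

The number of plotted points depends on the bin width rather than on the number of rows, which is why the binned plot stays cheap as the data grow.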
Some other aspects
Some further automatic magic with KDE
Relative error estimation with alpha / hues
Vis. patterns for (n, m)-d datasets
o n: # of binned variables
o m: # of summaries
o Dens. estimate: (1,1)-d, earlier: (2,1)-d
Parallelization & decoupling?
The pattern can scale by moving concerns out of R
bin: see MapReduce
Some formulations are easy for stream processing, too
[Diagram: data → bin → summarize → smooth → visualize]
Parallelization & decoupling?
Summary: depends…
Distributive stats: count, sum, min, max
Algebraic stats: mean, sd, higher moments
Holistic…? (quantiles, countdistinct)
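A small base R sketch of why the three kinds of statistics parallelize differently. The four chunks stand in for workers of a MapReduce-style job; the data and all variable names are assumptions made for this illustration.

```r
# Sketch: combining per-chunk summaries, as a MapReduce-style job would.
set.seed(42)
x <- rexp(10000)
chunks <- split(x, rep(1:4, length.out = length(x)))   # four "workers"

# Distributive: partial results combine with the same operation.
partial_n   <- sapply(chunks, length)
partial_sum <- sapply(chunks, sum)
total_n   <- sum(partial_n)
total_sum <- sum(partial_sum)

# Algebraic: the mean is a fixed-size function of distributive parts.
combined_mean <- total_sum / total_n

# Holistic: the median of chunk medians is generally not the median,
# so no fixed-size partial result is enough in general.
median_of_medians <- median(sapply(chunks, median))

all.equal(combined_mean, mean(x))   # TRUE
```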
Parallelization & decoupling?
Input: mostly "resolution" bound
R excels here
Towards interactive EDA?
Bin-summarize-smooth can still take long…
Precompute/cache…
… and e.g. update after new batches
[Diagram: raw data-at-rest → RDBMS / in-memory summarized data → client]
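One way the "precompute, then update after new batches" idea could look in base R. This is a sketch under assumptions: the cache is a plain data frame of per-bin counts and totals, and `summarize_batch` and `update_cache` are hypothetical helper names, not functions from bigvis or any other package.

```r
# Sketch: keep a cache of per-bin counts and totals, and fold in new batches.
bin_width <- 10

summarize_batch <- function(x) {
  bin <- floor(x / bin_width) * bin_width
  data.frame(bin   = sort(unique(bin)),
             n     = as.vector(tapply(x, bin, length)),
             total = as.vector(tapply(x, bin, sum)))
}

update_cache <- function(cache, batch) {
  both <- rbind(cache, summarize_batch(batch))
  # counts and totals are distributive, so adding them per bin is exact
  aggregate(cbind(n, total) ~ bin, data = both, FUN = sum)
}

set.seed(7)
first  <- runif(1000, 0, 100)
second <- runif(500, 0, 100)
cache <- summarize_batch(first)
cache <- update_cache(cache, second)
cache$mean <- cache$total / cache$n     # derived on demand from cached parts
```

Only the small summary table has to be kept close to the client; the raw data are touched once per batch.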
András Tajti: Changing User Roles in an Online Forum
Changing User Roles in an Online Forum
András Tajti
BURN meetup
04.23.2014.
Questions
1. Can we identify patterns in user behaviour?
2. Can we detect changes in that behaviour?
Of course, we can!
I will show you one way...
Theoretical tools
● You need features to describe behaviour:
– Network science
● You need to find the most important variables:
– Principal component analysis
● You need to find users with similar behaviour:
– Cluster analysis
Practical tools
● To do all the computations, I used R packages:
– igraph for extracting network features
– pcaPP and rrcov for PCA
– fpc for cluster evaluation
● Of course, basic R functions were used the most:
– princomp for PCA
– hclust for hierarchical clustering
– compiler package for faster computation
What does a forum look like?
● A post is either a reply to another post or not:
– A post has an out-degree of at most one
– It can have a higher in-degree, as any later post can refer to it.
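The degree structure above can be sketched with a small reply table. The talk used igraph for feature extraction; the version below uses only base R to stay self-contained, and the `posts` data and its column names are invented for this example.

```r
# Sketch: a thread as a reply table; names and data are made up.
posts <- data.frame(
  id       = 1:6,
  author   = c("ann", "bob", "ann", "cid", "bob", "dee"),
  reply_to = c(NA, 1, 2, 1, NA, 1)      # NA: the post replies to nothing
)

# Out-degree of a post is 0 or 1: it replies to at most one earlier post.
out_degree <- as.integer(!is.na(posts$reply_to))

# In-degree counts the later posts that refer to it, and is unbounded.
in_degree <- tabulate(posts$reply_to, nbins = nrow(posts))

# User-level edges: who replied to whom.
replies <- posts[!is.na(posts$reply_to), ]
edges <- data.frame(from = replies$author,
                    to   = posts$author[match(replies$reply_to, posts$id)])
```

User-level features such as the number of neighbours can then be counted from `edges`.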
Users' features
● To describe behaviour, I used:
– Number of posts
– Number of neighbours
– Parent users' in- and out-degree
– All of the above as ranks and relative ranks
Choosing important features
● Main problem: all variables have heavy-tailed distributions
– Principal component analysis works best for normally distributed variables
– Alternatives:
● Robust correlation estimation
● Projection pursuit methods
– Winner: ROBPCA from rrcov (PcaHubert)
– Mostly the same results as the original princomp
Searching groups
● Cluster analysis:
– Hierarchical, with Euclidean distance and complete linkage
– Applied to the PCA scores weighted by explained variance
– Technical limits on the number of clusters:
● Min.: 2 (the results always contained at least three groups)
● Max.: 30 (reached a few times)
Selecting cluster numbers
● For every goodness measure, I was looking for
– First local minimum/maximum
– Sharpest “elbow”
Select by eye
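A sketch of the clustering and cluster-number steps with base R functions only. The features are simulated, plain `princomp` stands in for the robust PCA, weighting the scores by the share of explained variance is one reading of the slide, and the within-cluster sum of squares is just one possible goodness measure (the talk evaluated clusters with fpc).

```r
# Sketch of the clustering step on simulated, heavy-tailed user features.
set.seed(3)
features <- matrix(rlnorm(200 * 4), ncol = 4,
                   dimnames = list(NULL, c("posts", "neighbours",
                                           "parent_in", "parent_out")))

# Plain PCA here; the talk settled on rrcov::PcaHubert instead.
pca <- princomp(features, cor = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Assumption: weight each score column by its share of explained variance.
scores <- sweep(pca$scores, 2, explained, `*`)

tree <- hclust(dist(scores, method = "euclidean"), method = "complete")

# Candidate cluster numbers 2..30, with one simple goodness measure
# (total within-cluster sum of squares) to look for an elbow.
within_ss <- sapply(2:30, function(k) {
  groups <- cutree(tree, k = k)
  sum(sapply(split(as.data.frame(scores), groups), function(g) {
    sum(scale(g, scale = FALSE)^2)
  }))
})
plot(2:30, within_ss, type = "b", xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")
```

The final choice is still made by looking at the curve, as on the slide.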
What is changing?
● I used “time windows” to slice the data
● One window contained 1000 posts and their full thread
● I ran role detection on every window
● Then I compared memberships between clusters
How to compare memberships?
● Some users appear in only one of the two datasets
● Two groups are similar if they share a significant number of users:
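The similarity criterion itself did not survive on this slide, so the sketch below is only one possible way to quantify "a significant number of common users", not the speaker's own measure. It restricts both clusterings to users present in both windows and uses a hypergeometric tail probability; `overlap_p_value` and the toy data are hypothetical.

```r
# Sketch only: one possible overlap measure, not the talk's own formula.
# cluster_a / cluster_b: named vectors, user -> cluster label in each window.
overlap_p_value <- function(cluster_a, cluster_b, label_a, label_b) {
  common <- intersect(names(cluster_a), names(cluster_b))  # users in both
  in_a <- cluster_a[common] == label_a
  in_b <- cluster_b[common] == label_b
  shared <- sum(in_a & in_b)
  # P(at least `shared` common users) if the memberships were unrelated
  phyper(shared - 1, m = sum(in_a), n = length(common) - sum(in_a),
         k = sum(in_b), lower.tail = FALSE)
}

window1 <- c(u1 = 1, u2 = 1, u3 = 1, u4 = 2, u5 = 2, u6 = 2, u7 = 1)
window2 <- c(u1 = "a", u2 = "a", u3 = "a", u4 = "b", u5 = "b", u6 = "b",
             u8 = "a")
p <- overlap_p_value(window1, window2, label_a = 1, label_b = "a")
```

Users `u7` and `u8` appear in only one window and are ignored, which mirrors the first bullet above.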
Example
Thank You!
@atajti
The code will be available at github.com/atajti/changingForumRoles
MTA
Dénes Tóth: Dilemmas in package development: interactive visualization, GUIs, largish data, extensibility
Dilemmas in package development: interactive visualization, GUIs, largish data, extensibility
Dénes Tóth
BURN Meetup, Budapest, 23.04.2014.
Dénes Tóth / [email protected] MTA TTK KPI/ humlab.cogpsyphy.hu
EEG
• Electroencephalography (EEG)
– Voltage fluctuations (μV) recorded at the scalp
– A typical setup: 32-128 channels, 500-1000 Hz sampling rate, 30-90 minutes recording time, 20-30 participants → 200 MB – 2 GB / participant
– Tasks: raw data import + signal processing (filtering, resampling, artifact correction [e.g. eye movements])
– Visual inspection is unavoidable → interactive visualization is a must
ERP
• Cognitive experiments: what does the brain do when exposed to A versus B?
– EEG & events = event-related potentials (ERP)
– 40-200 repetitions per condition, factorial design (Fac1 x Fac2)
– Tasks: marker handling, segmentation, artifact rejection, averaging, time-frequency analyses → extract components & do statistics (e.g. clustering, ANOVA) → tremendous number of analytic possibilities
– Randomization statistics (e.g. 5000 permutations)
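For the randomization statistics mentioned above, a minimal permutation test in base R might look like this. The data are simulated and treated as two independent groups; for a within-subject ERP design one would more likely permute condition labels within participants, so take this purely as an illustration of the principle.

```r
# Sketch: permutation test for a difference between two conditions.
set.seed(11)
cond_a <- rnorm(20, mean = 1.0)    # e.g. one mean amplitude per participant
cond_b <- rnorm(20, mean = 0.2)

observed <- mean(cond_a) - mean(cond_b)
pooled <- c(cond_a, cond_b)
n_perm <- 5000

perm_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)                     # break the condition labels
  mean(shuffled[1:20]) - mean(shuffled[21:40])
})

# Two-sided p-value; the +1 counts the observed labelling itself.
p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
```

Running this for every channel and time point is what makes the workload heavy.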
eegR package
• No dedicated comprehensive package in R for EEG, but a lot of related packages (e.g. signal, mFilter, icaOcularCorrection + one trillion statistical methods)
• Present (eegR)
– Base data class: array
– Basic operation: apply-like
– ~60 functions, ~4000 lines → appropriate for a specific workflow
– No cohesive system
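What "base data class: array, basic operation: apply-like" can mean in practice is sketched below. This is not eegR code; the dimension order (channel x time x trial), the sizes and the variable names are assumptions chosen for the example.

```r
# Sketch: EEG epochs as a 3-d array, processed with apply-like calls.
set.seed(5)
n_chan <- 4; n_time <- 100; n_trial <- 30
eeg <- array(rnorm(n_chan * n_time * n_trial),
             dim = c(n_chan, n_time, n_trial),
             dimnames = list(channel = paste0("Ch", 1:n_chan),
                             time = NULL, trial = NULL))

# Averaging over trials gives one ERP waveform per channel.
erp <- apply(eeg, c(1, 2), mean)            # channel x time matrix

# Baseline correction: subtract each channel's mean over the first 20 samples.
baseline <- rowMeans(erp[, 1:20])
erp_corrected <- sweep(erp, 1, baseline, `-`)
```

The approach is compact, but every step needs the whole array in memory, which connects to the out-of-RAM question later in the talk.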
eegR package
• Future (dream :)
– Covers all basic analytic steps + highly extensible
– Provides workflow + GUI + scripting
– Handles out-of-memory datasets well, easy parallelization
– Interactive visualization capabilities
Question I. One package or related packages?
• One package
– Pros: easy install process, better tuning options
– Cons: less general, harder to extend
• One core package + extensions
– Pros: anyone can write extensions, easy to invoke other packages
– Cons: the core package must be very well written
Question I/a. How to write a good core?
• Range
– Introduce only classes, methods and utility functions, or provide a basic stand-alone package?
• Classes
– S3 / S4 / R5?
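To make the S3 / S4 / R5 choice concrete, here is the same tiny container written in all three systems. The class names, fields and the `resample` method are invented for this comparison and have nothing to do with the actual eegR design.

```r
# Sketch: the same tiny "eeg" container in the three class systems.
library(methods)

# S3: a list with a class attribute; methods are found by naming convention.
new_eeg_s3 <- function(data, srate) {
  structure(list(data = data, srate = srate), class = "eeg_s3")
}
print.eeg_s3 <- function(x, ...) {
  cat("S3 eeg:", nrow(x$data), "channels at", x$srate, "Hz\n")
  invisible(x)
}

# S4: formal slots with type checking and an explicit validity function.
setClass("EegS4",
         representation(data = "matrix", srate = "numeric"),
         validity = function(object) {
           if (length(object@srate) == 1 && object@srate > 0) TRUE
           else "srate must be a single positive number"
         })

# Reference class ("R5"): mutable objects with methods attached to them.
EegRC <- setRefClass("EegRC",
  fields = list(data = "matrix", srate = "numeric"),
  methods = list(
    resample = function(new_rate) {
      srate <<- new_rate          # modifies the object in place
      invisible(.self)
    }
  ))

signal <- matrix(0, nrow = 32, ncol = 10)
s3 <- new_eeg_s3(signal, 500)
s4 <- new("EegS4", data = signal, srate = 500)
rc <- EegRC$new(data = signal, srate = 500)
rc$resample(250)
```

S3 is the lightest for extension authors, S4 gives validation for free, and reference classes fit stateful objects such as GUIs or large data handles.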
Question II. What about the user interface?
Workflow approach: R AnalyticFlow
• Pros:
– the natural way of EEG signal processing
– unconstrained scripting
• Cons:
– reliability?
– performance?
Question II. What about the user interface?
• GUI coverage
– Full GUI ←→ subtask- (function-) related GUI
• GUI type
– Desktop GUI ←→ web-based GUI
– gWidgets2 | gWidgetsWWW2 | Shiny | ...
Question III. Which out-of-RAM package to choose?
• SciDB would be great, but only available on Linux
• Two candidates: ff & gdsfmt packages
– ff package: more comprehensive
– gdsfmt package: lightweight & fast
Question IV. Interactive visualization?
• iPlots, playwith, ggvis etc.: good, but not efficient
• Performance issues: a 10-sec part can contain 128 x 1000 x 10 = 1,280,000 data points
• Candidates for line plots:
– Acinonyx
– rCharts w. Dygraphs
Question IV. Interactive visualization?
• Acinonyx (iPlots Extreme)
– Pros: very fast, iContainer
– Cons: very poor documentation, not on CRAN
• rCharts and other web-based tools, esp. JavaScript libraries
– Dygraphs: fast and nice, but no official port to rCharts
– Communication between JS and R?
Thank you!
Q1: One package or related packages?
Q1a: What should the base package cover? Do I need S4 or R5?
Q2: User interface? → R AnalyticFlow, GUIs
Q3: How to handle out-of-memory data?
Q4: Interactive visualization?
rapporter.net
Gergely Daróczi: pander: Transforming R objects to Pandoc’s markdown
pander: A Pandoc writer in R
Transforming R objects to Pandoc’s markdown
Gergely Daróczi
Budapest Users of R Network
23 April 2014
What is pander? A collection of helper functions to print markdown syntax
> ?pandoc.(footnote|header|horizontal.rule|image|link|p)(.return)?
> ?pandoc.(emphasis|strikeout|strong|verbatim)(.return)?
> pandoc.strong('foobar')
**foobar**
> pandoc.strong.return('foobar')
[1] "**foobar**"
> pandoc.header('foobar', level = 2)
## foobar
> pandoc.header('foobar', style = 'setext')
foobar
======
Gergely Daróczi (rapporter.net) pander: A Pandoc writer in R 23/4/2014 2 / 15
What is pander? Collection of helper functions to map R objects to markdown
> ?pandoc.(list|table)(.return)?
> pandoc.list(list('foo', list('bar')))
* foo
    * bar
> pandoc.table(head(iris, 2), split.table = Inf)
-------------------------------------------------------------------
 Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
-------------- ------------- -------------- ------------- ---------
     5.1            3.5           1.4            0.2       setosa

     4.9             3            1.4            0.2       setosa
-------------------------------------------------------------------
What is pander? Collection of helper functions to map R objects to various markdown languages
> pandoc.table(head(iris, 2), split.table = Inf, style = 'rmarkdown')
| Sepal.Length   | Sepal.Width   | Petal.Length   | Petal.Width   | Species   |
|:--------------:|:-------------:|:--------------:|:-------------:|:---------:|
| 5.1            | 3.5           | 1.4            | 0.2           | setosa    |
| 4.9            | 3             | 1.4            | 0.2           | setosa    |
> pandoc.table(head(iris, 2), split.table = Inf, style = 'simple')
 Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
-------------- ------------- -------------- ------------- ---------
     5.1            3.5           1.4            0.2       setosa
     4.9             3            1.4            0.2       setosa
What is pander? Collection of helper functions to map R objects to various markdown languages
> iris$Species <- 'foos and bars'; names(iris) <- gsub('.', ' ', names(iris), fixed = TRUE)
> pandoc.table(head(iris, 4), split.table = Inf, style = 'grid',
+ split.cells = 5, justify = 'left')
+----------+---------+----------+---------+------------+
| Sepal    | Sepal   | Petal    | Petal   | Species    |
| Length   | Width   | Length   | Width   |            |
+==========+=========+==========+=========+============+
| 5.1      | 3.5     | 1.4      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.9      | 3       | 1.4      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.7      | 3.2     | 1.3      | 0.2     | setosa     |
+----------+---------+----------+---------+------------+
| 4.6      | 3.1     | 1.5      | 0.2     | foos       |
|          |         |          |         | and        |
|          |         |          |         | bars       |
+----------+---------+----------+---------+------------+
What is pander? S3 method to map R objects to markdown
> ?pander(.return)?
> methods(pander)
[1] pander.anova* pander.aov* pander.cast_df* pander.character*
[5] pander.data.frame* pander.default* pander.density* pander.evals*
[9] pander.factor* pander.glm* pander.htest* pander.image*
[13] pander.list* pander.lm* pander.logical* pander.matrix*
[17] pander.NULL* pander.numeric* pander.option pander.POSIXct*
[21] pander.POSIXt* pander.prcomp* pander.rapport* pander.table*
Non-visible functions are asterisked
> pander(head(iris, 1), split.table = Inf)
-------------------------------------------------------------------
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-------------- ------------- -------------- ------------- ---------
5.1 3.5 1.4 0.2 setosa
-------------------------------------------------------------------
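How an S3 generic can map different R objects to markdown is sketched below in base R. `to_md` and its methods are made up for this illustration; they are not pander functions and produce much simpler output than `pander()`.

```r
# Sketch of S3 dispatch to markdown; `to_md` is a made-up generic,
# not part of pander.
to_md <- function(x, ...) UseMethod("to_md")

to_md.default <- function(x, ...) paste(format(x), collapse = ", ")

to_md.character <- function(x, ...) {
  x <- paste0("_", x, "_")                  # emphasize every element
  if (length(x) < 2) return(x)
  paste(paste(head(x, -1), collapse = ", "), "and", tail(x, 1))
}

to_md.data.frame <- function(x, ...) {
  # header row, separator row, then the formatted cells
  cells <- rbind(names(x), "---", as.matrix(format(x)))
  paste(apply(cells, 1, function(row) {
    paste0("| ", paste(row, collapse = " | "), " |")
  }), collapse = "\n")
}

to_md(letters[1:3])
cat(to_md(head(iris, 2)), "\n")
```

Supporting a new class then only requires writing one more method, which is how the list of `pander.*` methods above grows.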
What is pander? S3 method to map R objects to markdown
> pander(letters[1:7])
_a_, _b_, _c_, _d_, _e_, _f_ and _g_
> pander(ks.test(runif(50), runif(50)))
---------------------------------------------------
Test statistic P value Alternative hypothesis
---------------- --------- ------------------------
0.18 _0.3959_ two-sided
---------------------------------------------------
Table: Two-sample Kolmogorov-Smirnov test: `runif(50)` and `runif(50)`
> pander(chisq.test(table(mtcars$am, mtcars$gear)))
---------------------------------------
Test statistic df P value
---------------- ---- -----------------
20.94 2 _2.831e-05_ * * *
---------------------------------------
Table: Pearson’s Chi-squared test: `table(mtcars$am, mtcars$gear)`
Warning message:
In chisq.test(table(mtcars$am, mtcars$gear)) :
  Chi-squared approximation may be incorrect
What is pander? S3 method to map R objects to markdown
> pander(lm(mtcars$wt ~ mtcars$hp), summary = TRUE)
--------------------------------------------------------------
Estimate Std. Error t value Pr(>|t|)
----------------- ---------- ------------ --------- ----------
**mtcars$hp** 0.009401 0.00196 4.796 4.146e-05
**(Intercept)** 1.838 0.3165 5.808 2.389e-06
--------------------------------------------------------------
-------------------------------------------------------------
Observations Residual Std. Error $R^2$ Adjusted $R^2$
-------------- --------------------- ------- ----------------
32 0.7483 0.4339 0.4151
-------------------------------------------------------------
Table: Fitting linear model: mtcars$wt ~ mtcars$hp
What is pander? S3 method to map R objects to pretty formatted markdown
> panderOptions('table.split.table', Inf)
> panderOptions('table.style', 'grid')
> emphasize.cells(which(iris > 1.3, arr.ind = TRUE))
> pander(iris)
+----------------+---------------+----------------+---------------+------------+
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
+================+===============+================+===============+============+
| *5.1* | *3.5* | *1.4* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *4.9* | *3* | *1.4* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *4.7* | *3.2* | 1.3 | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *4.6* | *3.1* | *1.5* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *5* | *3.6* | *1.4* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *5.4* | *3.9* | *1.7* | 0.4 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *4.6* | *3.4* | *1.4* | 0.3 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *5* | *3.4* | *1.5* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
| *4.4* | *2.9* | *1.4* | 0.2 | setosa |
+----------------+---------------+----------------+---------------+------------+
What is pander? Tool for literate programming like Sweave, knitr or brew
> ?Pandoc.brew
> Pandoc.brew(text = '
+ Pi equals to <%= pi %>, and the best damn cars are:
+ <%= head(mtcars, 2) %>
+ ')
Pi equals to _3.142_, and the best damn cars are:
--------------------------------------------------------
mpg cyl disp hp drat wt
------------------- ----- ----- ------ ---- ------ -----
**Mazda RX4** 21 6 160 110 3.9 2.62
**Mazda RX4 Wag** 21 6 160 110 3.9 2.875
--------------------------------------------------------
Table: Table continues below
--------------------------------------------------
qsec vs am gear carb
------------------- ------ ---- ---- ------ ------
**Mazda RX4** 16.46 0 1 4 4
**Mazda RX4 Wag** 17.02 0 1 4 4
--------------------------------------------------
What is pander? Tool for literate programming like Sweave, knitr or brew
Features of Pandoc.brew:
• brew loops and conditional parts of a report just like with brew,
• capturing plots and images with automatically applied theme,
• render all R objects automatically in Pandoc’s markdown,
• recording all warning/error messages plus the raw R objects along with anything printed to stdout and the printed results,
• custom caching mechanism to disk or RAM with auto-dependency,
• convert to HTML/pdf/odt/docx at one go,
• no chunk options (only workaround),
• building reports also in interactive session with an R5 reference class.
http://rapporter.github.io/pander/#brew-to-pandoc
What is pander? Tool for literate programming like Sweave, knitr or brew – with global options
?panderOptions
?evalsOptions
• number formatting style (decimal mark, digits, trailing spaces etc.),
• date format,
• table formats (split, alignment, caption etc.),
• vector options (separator, copula, wrapper character),
• global graph settings for base, lattice and ggplot2 calls:
– color palette, font settings, grid,
– legend position, axis labels angle etc.
• plot dimensions, resolution,
• cache options, hooks, filter output etc.
http://rapporter.github.io/pander/#pander-options
What is pander? Tool for literate programming like Sweave, knitr or brew – a quick comparison
> require(wordcloud)
> pkgs <- ctv:::.get_pkgs_from_ctv_or_repos('ReproducibleResearch')[[1]]
> wordcloud(pkgs, rep(1, times = length(pkgs)), colors = rainbow(length(pkgs)),
+ random.color = TRUE)
And pander is intended to be a wrapper around Pandoc, so it can transform markdown files to other document formats:
> ?Pandoc.convert
> Pandoc.brew(..., convert = '(html|pdf|odt|docx)', ...)
Job advertisements
Data Scientist
Requirements:
* Data warehouse experience
* SQL, NoSQL
* Programming (e.g. Perl)
* English
Advantages:
* R programming
* Math or insurance degree
* German
Rails programmer
Requirements:
* 2 yrs of Rails experience
* jQuery, Ajax
* git
* work without specs :)
Advantages:
* stats knowledge
* GH and SO activity
* SaaS experience
01 László Gönczy Exploratory data analysis.
02 Gergely Horváth R workshop in Bucharest.
03 Imre Kocsis Bigvis: plotting large data in R.
04 András Tajti Changing User Roles in a Forum.
05 Dénes Tóth Dilemmas in package development.