R and Modern Statistical Computing Robert Gentleman


Page 1: R  and Modern Statistical Computing

R and

Modern Statistical Computing

Robert Gentleman

Page 2: R  and Modern Statistical Computing

Outline

• Introduction

• R past

• R present

• R future

• Bioconductor

Page 3: R  and Modern Statistical Computing

What is R?

• R is an environment for data analysis and visualization

• R is an open source implementation of the S language

• S-Plus is a commercial implementation of the S language

• The current version of R is 1.4.1

• www.r-project.org

Page 4: R  and Modern Statistical Computing

R Core

• Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, and Luke Tierney

• Duncan Murdoch, Martyn Plummer, Vincent Carey

Page 5: R  and Modern Statistical Computing

Funding for R

• to date R has had little funding (no formal funding)

• our universities, particularly the University of Auckland, have provided support

• Dept of Biostatistics, Harvard has donated 5,000.00

Page 6: R  and Modern Statistical Computing

R History

• 1991: Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R

• 1992: Design and implementation of pre-R.

• 1993: The first announcement of R

• 1995: R available by ftp under the GPL

Page 7: R  and Modern Statistical Computing

R History

• 1996: A mailing list is started and maintained by Martin Maechler at ETH

• 1997: The R core group is formed

• 1999: DSC meeting in Vienna, the first time many of R core meet

• 2000: R 1.0.0 is released

• 2002: R 1.4.1 is the current release

Page 8: R  and Modern Statistical Computing

Open Source

• R is both open source and open development

– you can look at the source code and you can propose changes that we will generally adopt

• R is not in the public domain

• You are given a license to run our software

– GPL (current)

– LGPL (under consideration)

Page 9: R  and Modern Statistical Computing

R and Omegahat

• Omegahat: www.omegahat.org

• Omegahat is another initiative that will allow us to explore alternative implementations and languages without disturbing the R user base too much

• Current contents are largely the work of Duncan Temple Lang and John Chambers

Page 10: R  and Modern Statistical Computing

R Design

• Many of the features of S but with slightly different semantics and memory management.

• We chose Scheme for our semantic model.

• Much of the original code has since been replaced but the basic model remains intact.

Page 11: R  and Modern Statistical Computing

R Internals

• R is written mainly in C

• Our original intention was for R to be as platform independent as practical.

• We began with Macintosh as a primary delivery platform and Unix as our primary development platform.

Page 12: R  and Modern Statistical Computing

What Platforms?

• Unix of many flavours including Linux, Solaris, FreeBSD, AIX (compiles on 64 bit machines)

• Windows - 95/98/NT and 2000

• both binaries and source available

• R can be obtained from – www.r-project.org

Page 13: R  and Modern Statistical Computing

R Internals

• One difference with S is scope

– R uses a different set of rules to bind variables to values

• In S it is hard to treat programs as data

• R should be source code compatible with S-Plus for most code that you will write

Page 14: R  and Modern Statistical Computing

Environments

• an environment is a mechanism for binding symbols to values (hence similar to a hash table)

• each environment has a parent environment

• a big difference between R and S is that R has lexical scope
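The three points above can be sketched in a few lines of base R. This is only an illustration: the names `e`, `child`, `f` and `g` are made up for the example, and the code uses nothing beyond `new.env`, `assign`, `get` and `exists`.

```r
# An environment binds symbols to values, much like a hash table.
e <- new.env()
assign("x", 10, envir = e)
get("x", envir = e)

# Each environment has a parent; lookups fall through to it by default.
child <- new.env(parent = e)
found_via_parent <- exists("x", envir = child)
found_in_child   <- exists("x", envir = child, inherits = FALSE)

# Lexical scope: the free variable x in f is resolved where f was
# defined, not where f is called.
x <- "global"
f <- function() x
g <- function() {
  x <- "local to g"   # not seen by f
  f()
}
g()
```

Under lexical scope `g()` returns the value of `x` visible where `f` was defined; the local `x` inside `g` is ignored.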

Page 15: R  and Modern Statistical Computing

Environments

• a function has an environment associated with it and that environment provides bindings for any free variables in the function

• another way that this can be thought of is that in R functions have mutable state

• Ihaka and Gentleman (JCGS, 2000)

• environments are also associated with formulas in R
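A minimal sketch of what "mutable state" means here, assuming nothing beyond base R. The names `make_counter`, `make_formula` and `offset_value` are hypothetical and chosen only for this illustration; this is not the example used in the cited paper.

```r
# Each call to make_counter creates a fresh environment holding `count`.
# The returned function keeps that environment and can modify it.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # update the binding in the enclosing environment
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()
counter_a()
counter_a()
counter_b()   # independent of counter_a: it has its own environment

# A formula also records the environment in which it was created.
make_formula <- function() {
  offset_value <- 2
  y ~ x + offset_value
}
fm <- make_formula()
get("offset_value", envir = environment(fm))
```

The two counters do not interfere with each other because each closure carries its own environment, and the formula can still find `offset_value` after `make_formula` has returned.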

Page 16: R  and Modern Statistical Computing

How did we do it?

• we took advantage of certain technologies

• CVS – for version control

• a reasonably sophisticated checking system

• every example in R is runnable and is run many times by all users

• any changes made must pass the checking routine before they are committed

Page 17: R  and Modern Statistical Computing

Testing

• this very simple idea makes distributed development possible

• I am responsible for writing examples for my code (and I should be because I know it)

• others are responsible for making sure that they do not break my code (by running my examples)

Page 18: R  and Modern Statistical Computing

R Package System

• packages are self-contained units of code with documentation

• there are automatic testing features built in

• all functions must have examples and the examples must run

• interesting commands:

– example, update.packages
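The two commands can be tried roughly as follows. This is a sketch that assumes a standard R installation with help pages available; `median` is just a convenient topic picked for the illustration. `update.packages` contacts a package repository, so it is left commented out.

```r
# Fetch the example code shipped with the help page for median(),
# without running it.
ex_lines <- example("median", package = "stats",
                    character.only = TRUE, give.lines = TRUE)
head(ex_lines)

# Run the same examples; this is what the checking system does for
# every documented function.
example("median", package = "stats", character.only = TRUE, echo = FALSE)

# Bring installed packages up to date (needs network access):
# update.packages()
```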

Page 19: R  and Modern Statistical Computing

Databases

• R will talk to most databases

• the ability to access large tables, execute SQL queries etc

• RPgSQL has the notion of proxy objects

– R symbols refer to tables in the database

– these can look like data.frames in R

Page 20: R  and Modern Statistical Computing

Object Oriented Programming

• S3 class system is a good start but it has some major deficiencies

• in Programming with Data, John Chambers introduced a new and potentially much better system

• object oriented programming helps us build better programs and deal more naturally with complex data structures

Page 21: R  and Modern Statistical Computing

Object Oriented Programming

• a formal mechanism for defining classes of objects

• these provide us with an abstraction that lets us deal with complex data

• generic functions and methods also reduce complexity (for the user)

– plot is a generic, methods are defined to implement plot for different types of data
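A small sketch of formal classes with a generic and its methods, using the `methods` package that implements the system described in Programming with Data. The class `Expression`, its slots and the generic `describe` are hypothetical names invented for this example.

```r
library(methods)

# A formal class definition: slot names and types are declared up front.
setClass("Expression",
         representation(gene = "character", values = "numeric"))

# A generic function: the user calls describe(), and the method that
# runs depends on the class of the argument.
setGeneric("describe", function(object, ...) standardGeneric("describe"))

setMethod("describe", "Expression", function(object, ...) {
  paste0(object@gene, ": mean ", mean(object@values))
})

setMethod("describe", "numeric", function(object, ...) {
  paste0("numeric vector of length ", length(object))
})

e1 <- new("Expression", gene = "geneA", values = c(1, 2, 3))
describe(e1)
describe(c(4, 5))
```

The user only needs to remember one function name; the class of the data selects the implementation, which is the same idea as `plot` dispatching on different types of data.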

Page 22: R  and Modern Statistical Computing

Object Oriented Programming

• more important for developers than for users

• it may not be worth defining classes and methods interactively

• Vincent Carey has been working on better mechanisms for documenting the classes and methods

Page 23: R  and Modern Statistical Computing

R as a broker

• R can execute code in virtually any other language

• R has connections, these can be used to access data via different protocols

• R is embeddable in other languages

– systems like Perl, Python, Postgres, Apache

– allow the user to define and use procedural languages
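Connections can be illustrated with base R alone. In this sketch the data are made up, and a text connection and a temporary file stand in for remote sources; the point is that the reading code does not change with the kind of connection.

```r
# A text connection: the "source" is a character vector held in memory.
con <- textConnection(c("gene,value", "a,1.5", "b,2.5"))
dat <- read.csv(con)
close(con)   # the connection was already open, so we close it ourselves

# A file connection: same reading functions, different source.
tmp <- tempfile(fileext = ".csv")
writeLines(c("gene,value", "a,1.5", "b,2.5"), tmp)
con2 <- file(tmp, open = "r")
first_line <- readLines(con2, n = 1)
close(con2)
unlink(tmp)

# Other constructors such as url() and gzfile() return connections that
# can be passed to the same functions.
dat
```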

Page 24: R  and Modern Statistical Computing

R as a broker

• this means that we can push the calculations to more natural places

• computation can be done where the data are rather than by transporting data

• this will greatly increase our ability to process large data sets

Page 25: R  and Modern Statistical Computing

R: Future

• where to next?

• XML and markup languages

• compilation

• object oriented programming

Page 26: R  and Modern Statistical Computing

XML

• eXtensible Markup Language

• has many friends, XSLT, XLINK, …

• similar to HTML, but more flexible

• <foo> hi there </foo>

• I define my own tags, and provide information about their meaning
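The `<foo>` fragment above can be read from R with the XML package, assuming it is installed (it is an add-on package, not part of base R). This sketch is guarded so that it does nothing when the package is missing.

```r
if (requireNamespace("XML", quietly = TRUE)) {
  # Parse the fragment from a string rather than from a file.
  doc  <- XML::xmlTreeParse("<foo> hi there </foo>", asText = TRUE)
  root <- XML::xmlRoot(doc)

  tag  <- XML::xmlName(root)    # the tag I defined
  text <- XML::xmlValue(root)   # the content it wraps
  print(tag)
  print(text)
}
```

The parser only recovers the structure; what a `foo` element means is still up to whoever defined the tag.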

Page 27: R  and Modern Statistical Computing

XML

• it allows us to provide semantics/meaning to data

• it separates content from presentation

• content can be presented in many different ways (SAS – output)

• we can use a single parser written by an expert

Page 28: R  and Modern Statistical Computing

XML

• data can be read and understood directly from the source

• eg: we want to search PubMed abstracts

• these are contained in web pages at NCBI

• using the XML package and htmlTreeParse this is a simple operation from within R

Page 29: R  and Modern Statistical Computing

XML

• will form the basis of a more flexible documentation format

• documentation is really content, how you view the help page is rendering (as HTML, internal R, etc.).

• the ability to selectively run examples with lots of control

Page 30: R  and Modern Statistical Computing

XML

• live documents

• reports etc can be made into live documents using XML (or similar strategies)

• see Sweave (Leisch, 2002) in R 1.5.0 or from Fritz’s web site

• documents can automatically update (daily/weekly etc)

Page 31: R  and Modern Statistical Computing

Compilation

• most users are interested in compilation because they believe it will increase speed

• we are interested in it for a variety of reasons

– understanding how to compile helps us understand how the language functions (where the warts are)

– virtual machines: JVM, .Net

Page 32: R  and Modern Statistical Computing

Training

• we need to develop a new syllabus for statistical computing courses

• tools that are needed include

– computational inference

– database interactions

– software design and structure

– markup languages (and relatives)

Page 33: R  and Modern Statistical Computing

The Future

• statistical computing can develop into a rich subject if it is encouraged

• encouragement needs to take several different approaches

• support: financial, career development

• statistical computing is a laboratory science, it needs to be funded and run that way

Page 34: R  and Modern Statistical Computing

Production of Code

• we need to encourage (very strongly) writers of methodology to provide code that implements their methodology

• the mathematical or theoretical description of a data analytic technique is really worth very little

• if that technique is implemented then it is much more useful

Page 35: R  and Modern Statistical Computing

Production of Code

• the R package system is a reasonable delivery mechanism

• some design principles will be needed

Page 36: R  and Modern Statistical Computing

An Example

• Bioconductor is a new software initiative

– www.bioconductor.org

• among the goals of this project is the deployment of high quality software for the analysis of genomic data

• the challenges are varied and exciting

Page 37: R  and Modern Statistical Computing

Genomic Data

• the data are large; tens of thousands of genes across a few hundred samples

• the biologists have developed high throughput methods for screening samples

• we need to develop high throughput methods for analysis
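One way to picture a high throughput analysis is a statistic computed for every gene at once instead of one gene at a time. The sketch below uses simulated data (not real expression measurements) at a reduced size, and a two-group Welch t statistic is chosen purely as an illustration; all names are made up for the example.

```r
set.seed(1)
n_genes   <- 2000   # scaled down from "tens of thousands"
n_samples <- 40
expr  <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)  # genes x samples
group <- rep(c(TRUE, FALSE), each = n_samples / 2)           # two sample groups

# Row-wise sample variance without looping over genes.
row_var <- function(m) rowSums((m - rowMeans(m))^2) / (ncol(m) - 1)

a <- expr[, group]
b <- expr[, !group]

# Welch t statistic for every gene in one vectorised expression.
t_stat <- (rowMeans(a) - rowMeans(b)) /
  sqrt(row_var(a) / ncol(a) + row_var(b) / ncol(b))

summary(t_stat)
```

For a single gene the result agrees with `t.test(a[1, ], b[1, ])$statistic`, but the vectorised form avoids thousands of separate function calls.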

Page 38: R  and Modern Statistical Computing

Genomic Data

• other challenges: much of the data is non-numeric

• the annotation of genes, their location on the chromosome, deletions, mutations

• the role of the gene in a particular pathway

Page 39: R  and Modern Statistical Computing

Genomics

• what do we measure?

– DNA (the raw thing)

– mRNA (microarrays – transcribed DNA)

– protein (proteomics – translated DNA)

• these data gain value from annotation, from knowledge about adjacent genes or gene products

• data sources are varied with different formats, error structures etc

Page 40: R  and Modern Statistical Computing

TGF-β pathway

• TGF-β (transforming growth factor beta) plays an essential role in the control of development and morphogenesis in multicellular organisms.

• This is done through SMADS, a family of signal transducers and transcriptional activators.

Page 41: R  and Modern Statistical Computing
Page 42: R  and Modern Statistical Computing

Pathways

• http://www.grt.kyushu-u.ac.jp/spad/

• There are many open questions regarding the relationship between expression level and pathways.

• It is not clear whether expression level data will be informative.

Page 43: R  and Modern Statistical Computing

Thanks

• Ross Ihaka, without whom there would be no R

• John Chambers, for S and gracious guidance

• Luke Tierney, Vince Carey, Duncan Temple Lang

• Dept of Stats, U of Auckland