Savvas [email protected]
EPCC, The University of Edinburgh
SPRINT
A Simple Parallel R INTerface
March 2010
Overview
• What is SPRINT
• How is SPRINT different from other parallel R packages
• Biological example: Post-genomic data analysis
• Code comparison
SPRINT
Simple Parallel R INTerface (www.r-sprint.org)
“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinformatics, Dec 2008.
Issues of existing parallel R packages
• Difficult to program
• Require the scientist to also be a parallel programmer!
• Require substantial changes to existing scripts
• Can’t be used to solve some problems
• No data dependencies allowed
Biological example
• Data: A matrix of expression measurements with genes in rows and samples in columns
• Problem: Using all or many genes will either crash or be very slow (R memory allocation limits, number of computations)
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory
11,000 x 320           | 26.85 MB         | 923.15 MB (0.9 GB)
22,000 x 320           | 53.7 MB          | 3,692.62 MB (3.6 GB)
35,000 x 320           | 85.44 MB         | 9,346 MB (9.12 GB)
45,000 x 320           | 109.86 MB        | 15,449.52 MB (15.08 GB)
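The sizes in the correlation table are consistent with simple arithmetic: an input of genes x samples and a full genes x genes result, both stored as 8-byte doubles, with 1 MB taken as 2^20 bytes. The sketch below works through that arithmetic. The helper name `matrix_mb` is hypothetical, and the storage assumptions are inferred from the numbers rather than stated in the slides.

```r
# Size in MB (1 MB = 2^20 bytes) of an n_rows x n_cols matrix.
# Assumption: every value is an 8-byte double and the full square
# correlation matrix is kept in memory.
matrix_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^20
}

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes     = genes,
  input_mb  = round(matrix_mb(genes, samples), 2),  # genes x samples input
  output_mb = round(matrix_mb(genes, genes), 2)     # genes x genes result
)
print(sizes)
```

The output grows with the square of the gene count, so doubling the number of genes roughly quadruples the memory needed for the result. The computed values match the table to within rounding of the last digit.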
Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time
36,612 x 76            | 500,000           | 20,750 seconds (6 hours)
36,612 x 76            | 1,000,000         | 41,500 seconds (12 hours)
73,224 x 76            | 500,000           | 35,000 seconds (10 hours)
73,224 x 76            | 1,000,000         | 70,000 seconds (20 hours)
March 2010 SPRINTSPRINT 7
Workarounds and solution
• Workaround:
– Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.
– Perform multiple executions and post-process the data. This can become a very painful procedure.
• Solution: Parallelisation of R code can be made accessible to bioinformaticians/statisticians. A library of solutions coded once by experts, then easy end-point use by all.
[Diagram: Big PostGenomic Data, R, SPRINT, HPC, Biological Results]
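As an illustration of the "multiple executions and post-process" workaround, the sketch below computes a gene-by-gene correlation matrix one block of rows at a time and then stitches the blocks together. It uses only base R on small random data. The block size and variable names are hypothetical, and this is not a description of how SPRINT's `pcor` is implemented. It only shows that each block can be computed independently, which is what makes the problem suitable for parallelisation.

```r
set.seed(1)
# Hypothetical toy data: 200 genes in rows, 20 samples in columns.
x <- matrix(rnorm(200 * 20), nrow = 200)

block_size <- 50
starts <- seq(1, nrow(x), by = block_size)

# Each block correlates a slice of genes against all genes.
# Blocks do not depend on each other, so they could run separately.
blocks <- lapply(starts, function(s) {
  rows <- s:min(s + block_size - 1, nrow(x))
  cor(t(x[rows, , drop = FALSE]), t(x))
})

# Post-processing step: stitch the blocks back into one matrix.
full <- do.call(rbind, blocks)

# The stitched result agrees with the single-call computation.
stopifnot(isTRUE(all.equal(full, cor(t(x)))))
```

Doing this by hand for tens of thousands of genes means managing many intermediate results, which is the bookkeeping the slide describes as painful.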
Benchmarks (256 processes)
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory | Total run time in serial (seconds)            | Total run time in parallel (seconds)
11,000 x 320           | 26.85 MB         | 923.15 MB (0.9 GB)         | 63.18                                         | 4.76
22,000 x 320           | 53.7 MB          | 3,692.62 MB (3.6 GB)       | “Error: cannot allocate vector of size 3.6 Gb” | 13.87
35,000 x 320           | 85.44 MB         | 9,346 MB (9.12 GB)         | CRASHED                                       | 36.64
45,000 x 320           | 109.86 MB        | 15,449.52 MB (15.08 GB)    | CRASHED                                       | 42.18
Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time in serial | Total run time in parallel (seconds)
36,612 x 76            | 500,000           | 20,750 seconds (6 hours)           | 73.18
36,612 x 76            | 1,000,000         | 41,500 seconds (12 hours)          | 146.64
73,224 x 76            | 500,000           | 35,000 seconds (10 hours)          | 148.46
73,224 x 76            | 1,000,000         | 70,000 seconds (20 hours)          | 294.61
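The benchmark figures can be turned into speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by the process count). The sketch below does this for the rows that have a serial time. The data frame and column names are hypothetical; the numbers are copied from the tables above.

```r
processes <- 256

bench <- data.frame(
  case = c("correlation 11,000 x 320",
           "permutation 36,612 x 76, 500,000",
           "permutation 36,612 x 76, 1,000,000",
           "permutation 73,224 x 76, 500,000",
           "permutation 73,224 x 76, 1,000,000"),
  serial_s   = c(63.18, 20750, 41500, 35000, 70000),
  parallel_s = c(4.76, 73.18, 146.64, 148.46, 294.61)
)

# Speedup: how many times faster the parallel run is.
bench$speedup <- bench$serial_s / bench$parallel_s
# Efficiency: fraction of the ideal 256x speedup actually achieved.
bench$efficiency <- bench$speedup / processes

print(bench, digits = 4)
```

The serial permutation times are estimates rather than measurements, so any computed speedup above 256 should be read as a sign of uncertainty in the estimate, not as a measured super-linear result.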
Correlation code comparison
# Serial R version
edata <- read.table("largedata.dat")
pearsonpairwise <- cor(edata)
write.table(pearsonpairwise, "Correlations.txt")
quit(save="no")
# SPRINT version
library("sprint")
edata <- read.table("largedata.dat")
ff_handle <- pcor(edata)
pterminate()  # shut down SPRINT before quitting R
quit(save="no")
Permutation testing code comparison
# Serial R version
library("multtest")  # provides mt.maxT and the golub data set
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
quit(save="no")
# SPRINT version
library("sprint")
library("multtest")  # still needed for the golub data set
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
pterminate()  # shut down SPRINT before quitting R
quit(save="no")
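For readers unfamiliar with permutation testing, the sketch below shows the basic idea for a single gene in base R: compute a test statistic, recompute it many times with the class labels shuffled, and count how often the shuffled statistic is at least as extreme. The data and the helper `t_abs` are hypothetical. This is not the `mt.maxT` or `pmaxT` implementation, and it leaves out the multiple-testing adjustment those functions perform across genes.

```r
set.seed(42)
n_perm <- 1000

# Hypothetical data: 10 samples per class, class 1 shifted upwards.
labels <- rep(c(0, 1), each = 10)
expr   <- rnorm(20) + 2 * labels

# Absolute two-sample t statistic (Welch, the t.test default).
t_abs <- function(values, groups) {
  unname(abs(t.test(values[groups == 1], values[groups == 0])$statistic))
}

observed <- t_abs(expr, labels)

# Each permutation shuffles the labels and is independent of the others.
permuted <- replicate(n_perm, t_abs(expr, sample(labels)))

# Permutation p-value, counting the observed statistic as one permutation.
p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
print(p_value)
```

Because every permutation is independent, the permutation count can be divided among processes, which is why the run times in the work load table fall so sharply in parallel.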
SPRINT
• Website: http://www.r-sprint.org/
• Source code can be downloaded from the website
• Soon also in the CRAN repository
• Mailing list: [email protected]
• Contact email: [email protected]
Acknowledgements
DPM Team:
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
EPCC Team:
• Terry Sloan
• Michal Piotrowski
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger
This work is supported by the Wellcome Trust and the NAG dCSE Support service.
Numerical Algorithms Group