HPC in Bioinformatics and Genomics
Daniel Kahn, Clément Rezvoy and Frédéric Vivien
Lyon 1 University & INRIA HELIX team
LIP-ENS & INRIA GRAAL team
D. Kahn, HPC in Bioinformatics and Genomics IBM Campus Day
Moore’s law in genomics
Ø Exponential increase
Ø Doubling time ~20 months
New high-throughput technologies
Ø Pyrosequencing (Roche 454 GS FLX)
  • 100-400 Mb per run (1 day)
  • Long reads (up to 400 bp)
  • ~15 Gb raw data
Ø Illumina Genome Analyzer
  • 1,500 Mb per run (3 days)
  • Short reads (35 bp)
  • ~1 Tb raw data
Ø Applied Biosystems SOLID sequencer
  • 3,000 Mb per run (5 days)
  • Short reads (35 bp)
  • ~15 Tb raw data
Uses of high throughput sequencing
Ø Population genomics
  • For instance, the 1000 Genomes Project (human)
Ø Individual sequencing
Ø Metagenomics
  • Comprehensive appraisal of microbial communities and gene repertoires in various environments
Ø Phylogenomics
  • Resolving the history of genes and species
Ø …
Ø Just as many computing challenges
Large scale protein sequence analysis
Ø All vs. all
Ø The challenge of protein modularity
  • Most proteins are combinatorial arrangements of conserved modules (domains)
[Figure: domain arrangements of the proteins LuxR, GerE, FixJ, OmpR, Spo0A, NtrC and NifA]
The ProDom project
Ø Need for an automated process in order to allow for comprehensive analysis
Ø Automatically decompose proteins into domains and cluster domain families, using MKDOM2
Ø Generate multiple alignments and trees for all families
Ø Automatically generate mutually consistent representations for all proteins
Resolving combinatorial proteins
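[Flowchart: the MKDOM2 program. At the ith iteration a query is taken from the DB and checked for internal repeats (yes / no branches, each leading to a query). The query is searched against the DB with PSI-BLAST, with three possible outcomes: no match, matches, repeat matches. DB changes: remove newly found domains, split modified sequences, sort by size. The updated DB feeds the (i+1)th iteration.]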
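The iteration shown in the MKDOM2 flowchart can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MKDOM2 implementation: exact substring matching stands in for PSI-BLAST, the shortest sequence in the DB is assumed to serve as the query (suggested by the "sort by size" step), and the names `find_repeat_unit` and `greedy_domains` are hypothetical.

```python
def find_repeat_unit(seq, min_len=3):
    """Return the shortest unit u (len >= min_len) with seq == u * k, k >= 2, else None."""
    n = len(seq)
    for size in range(min_len, n // 2 + 1):
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return None


def greedy_domains(db, min_len=3):
    """Toy greedy domain decomposition.

    Assumption: exact substring search replaces PSI-BLAST, and the
    shortest remaining sequence is taken as the query at each iteration.
    Returns a list of (query, number_of_hits) pairs, one per family.
    """
    db = sorted((s for s in db if len(s) >= min_len), key=len)
    families = []
    while db:
        seed = db[0]
        # Internal repeat detection: if the seed is a tandem repeat,
        # use the repeat unit as the query instead of the whole seed.
        query = find_repeat_unit(seed, min_len) or seed
        hits = 0
        leftovers = []
        for seq in db:
            n_found = seq.count(query)
            if n_found == 0:
                leftovers.append(seq)
                continue
            hits += n_found
            # Remove the newly found domain and split the modified sequence.
            leftovers.extend(f for f in seq.split(query) if len(f) >= min_len)
        families.append((query, hits))
        # Sort by size before the next iteration.
        db = sorted(leftovers, key=len)
    return families


if __name__ == "__main__":
    toy_db = ["ABCDE", "ABCDEFGHIJ", "FGHIJXYZ", "XYZXYZ"]
    for query, hits in greedy_domains(toy_db):
        print(query, hits)
```

Each pass scans the whole remaining DB, so the order in which queries are chosen determines the families that come out. That is the greedy behaviour the next slide refers to.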
Drawbacks of sequential MKDOM2
Ø Greedy algorithm
Ø Scales quadratically
Ø Data follow Moore’s law
→ no longer tractable!
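The argument on this slide can be made explicit. A minimal worked sketch, assuming that "scales quadratically" means a run time proportional to the square of the database size, and taking the ~20-month doubling time from the Moore's law slide (the symbols N, C, T and t are introduced here for illustration only):

```latex
\begin{align*}
N(t) &= N_0 \, 2^{t/T}, \qquad T \approx 20 \text{ months} \\
C(t) &\propto N(t)^2
  \;\Longrightarrow\;
  C(t) = C_0 \, 2^{2t/T} = C_0 \, 2^{t/(T/2)}
\end{align*}
```

Under these assumptions the sequential run time doubles roughly every 10 months, twice as fast as the data itself.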
Parallelization of MKDOM2
Ø Parallelization of the main loop
Ø Distribute sequences for independent family construction
Ø Difficulties:
  • Heterogeneous run times for the main loop
  • Possible dependencies between families
→ Precalculate an all vs. all comparison in order to select independent queries
→ Send batches of independent sequences before worker nodes are idle
→ Verify family independence a posteriori
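The selection and verification steps above can be illustrated as follows. This is a hedged sketch, not the authors' scheduler: it assumes the precalculated all vs. all comparison is available as a mapping from each sequence to the set of sequences it hits, and it treats two queries as independent when these neighbourhoods do not overlap. The names `select_independent_batch` and `verify_independent` are hypothetical.

```python
def select_independent_batch(hits, candidates, batch_size):
    """Pick up to batch_size queries whose hit neighbourhoods are disjoint.

    hits: dict mapping a sequence id to the set of ids it matches in the
          precalculated all vs. all comparison (assumed input format).
    candidates: sequence ids in the order they should be considered.
    """
    batch, claimed = [], set()
    for query in candidates:
        footprint = {query} | hits.get(query, set())
        if footprint.isdisjoint(claimed):
            batch.append(query)
            claimed |= footprint
            if len(batch) == batch_size:
                break
    return batch


def verify_independent(families):
    """A posteriori check: no sequence id may appear in two families."""
    seen = set()
    for members in families:
        members = set(members)
        if not seen.isdisjoint(members):
            return False
        seen |= members
    return True


if __name__ == "__main__":
    all_vs_all = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set(), "e": {"d"}}
    print(select_independent_batch(all_vs_all, ["a", "b", "c", "d", "e"], 3))
    print(verify_independent([{"a", "b"}, {"d"}]))
```

The a posteriori check is still needed because the precalculated comparison only predicts which families will overlap. The families actually built by the workers may differ from that prediction.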
Speed-up on medium-scale test set
Ø 32 archaeal genomes
Ø 21.5 M amino acids
Large-scale test set
Ø 263 genomes
Ø 950,216 protein sequences
Ø 339 M amino acids
Ø Run on GRID’5000 (150 nodes)
Ø Half of the data set processed in only 20 hours
Database crunching
Increasing query sizes
Variable sizes of domain families
Heterogeneous run times
Ø ~1000-fold range
Large result queue
… yet efficient node usage
Ø 86% processor usage
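Processor usage can be read as total busy time divided by total available node time. The sketch below shows how that ratio behaves when run times are heterogeneous. It assumes a simple policy in which each task goes to whichever worker becomes free first; this is an illustrative stand-in, not the dispatch strategy actually used in the run, and `simulate_dispatch` is a hypothetical name.

```python
import heapq


def simulate_dispatch(task_times, n_workers):
    """Assign tasks, in order, to whichever worker becomes free first.

    Returns (makespan, usage), where
    usage = total busy time / (n_workers * makespan).
    """
    free_at = [0.0] * n_workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in task_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    usage = sum(task_times) / (n_workers * makespan) if makespan else 0.0
    return makespan, usage


if __name__ == "__main__":
    # One long task among short ones, as with heterogeneous run times.
    print(simulate_dispatch([4, 2, 2, 2, 2], n_workers=2))
    print(simulate_dispatch([3, 1], n_workers=2))
```

With only two tasks of very different length, one worker sits idle for most of the run. A long queue of pending tasks is what keeps the ratio high when run times vary widely.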
Full-scale protein domain analysis
Ø To be scaled up 7-fold for full processing of UniProt today!
Ø Will require stable MPI usage of ~1000 processors over the grid
Ø Appropriate infrastructure not yet identified
Ø A further program, MPI_MKDOM3, is envisioned to make full use of the precalculated all vs. all comparison
… required in order to further cope with Moore’s law
INRA Toulouse: Emmanuel COURCELLE, Daniel KAHN
Support: PRABI, EU (EMBRACE & IMPACT), IN2P3, GRID’5000
Lyon 1 University, INRIA HELIX project: Aurélie LAUGRAUD, Lauranne DUQUENNE, Daniel KAHN
LIP-ENS Lyon: Clément REZVOY, Frédéric VIVIEN