EGEE-III INFSO-RI-222667
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Ignacio Blanquer
Vicente Hernández
Bioinformatics Applications in the Spanish Network for e-Science
Outline
• The Spanish Network for e-Science
– Structure and link with the Spanish NGI.
• Bioinformatics applications in the Spanish Network for e-Science.
• Challenges for Bioinformatics on the Grid.
Bioinformatics Session - EGEE’09 - Barcelona
The Creation of the Spanish Network for e-Science
• Following the interest raised by the research centres and groups participating in national and international projects on Grids and Supercomputing, a White Paper on e-Science was produced (http://www.fecyt.es/e-ciencia/libroblanco.htm).
• To meet the need for global coordination and for common tools that ease access to resources, the Spanish Network for e-Science (CAC-2007-52) was created by the Ministry of Science and Innovation.
– Officially approved in December 2007 and coordinated by Vicente Hernández García (Universidad Politécnica de Valencia).
• One of the mandates of the Network was to set up the Spanish NGI, which was officially created in July 2009.
– The ministry nominated Isabel Campos (IFCA) as the coordinator of the Spanish NGI.
Participant Groups
• More than 50 institutions and 97 research groups.
• More than 1000 researchers.
• Dynamic structure
– 28 groups have joined since the activity started.
• Structured in four activity areas
– EGEE Booth Number 6.
Infrastructure
• CESGA: 339 cores, 1 TB
• UPV: 36 cores, 1 TB
• UNIZAR: 54 cores, 0.8 TB
• CIEMAT: 220 cores, 2.7 TB
• PIC: 1296 cores, 10 TB
• IFCA: 867 cores, 1 TB

• gLite-based
• Own BDII (EGEE-compatible)
• Supporting IBERGRID (ES+PT)
• 3 different WMs (Xbroker, gLite-WMS, GridWay)
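
The per-site figures above can be aggregated to get a feel for the overall size of the infrastructure. The following is a minimal sketch: the names `SITES`, `totals` and `core_share` are illustrative, and the TB figure is treated as a single capacity value per site because the slide does not say what it measures.

```python
# Per-site figures as listed on the slide: (cores, capacity in TB).
# What the TB value measures is not stated in the source.
SITES = {
    "CESGA": (339, 1.0),
    "UPV": (36, 1.0),
    "UNIZAR": (54, 0.8),
    "CIEMAT": (220, 2.7),
    "PIC": (1296, 10.0),
    "IFCA": (867, 1.0),
}


def totals(sites):
    """Return (total cores, total TB rounded to one decimal) over all sites."""
    cores = sum(c for c, _ in sites.values())
    tb = sum(t for _, t in sites.values())
    return cores, round(tb, 1)


def core_share(sites):
    """Return each site's percentage share of the total number of cores."""
    total, _ = totals(sites)
    return {name: round(100.0 * c / total, 1) for name, (c, _) in sites.items()}


if __name__ == "__main__":
    print(totals(SITES))      # (2812, 16.5)
    print(core_share(SITES))
```

With these figures the six sites add up to 2812 cores and 16.5 TB, with PIC and IFCA contributing most of the cores.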
Applications
• 3 roles are identified
– Mature applications aiming at a challenging experiment.
– Pilots that require intensive porting and a feasibility study.
– Support groups with experience in porting applications.
• Pilots, Applications and Support Groups are certified by an expert board.
• An internal call for projects was set up.
[Diagram: workflow for pilots and applications. Boxes: Pilots; Applications; Pilot Selection; Expert Panel; Analysis and Selection; Resource Allocation; Pilot Migration; Support Groups; Deployment and Test; Report; Applications Proposal; Expert Panel; Autonomous Migration; Assisted Migration; Production; NGI Infrastructure; Support Groups.]
Overview of the Bioinformatics Applications
• Consolidated use
– Work on current databases to analyse quality, improve annotation or increase usability.
– CD-HIT, GBLAST, BiG (metagenomics).
• Emerging use
– Port new applications to the Grid to provide new services.
– GFrodock, G-MIRA, Filogen.
http://www.e-ciencia.es/wiki/index.php/CD-HIT
CD-HIT
• Identification of representative sequences of protein families using CD-HIT
– Proposed by the Spanish National Cancer Research Centre (CNIO).
– It uses the resources available through the Spanish Network for e-Science and the CD-HIT algorithm to produce non-redundant versions of the available databases more frequently.
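
CD-HIT is generally described as a greedy incremental clustering method: sequences are processed from longest to shortest, and each one either joins the first existing representative it is similar enough to or becomes a new representative. The sketch below illustrates only that idea. The similarity function is a stand-in based on Python's `difflib`, not CD-HIT's own word filter and alignment, and the names `identity`, `greedy_cluster`, `representatives` and the 0.9 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the identity measure CD-HIT uses."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering.

    Sequences are visited from longest to shortest. Each sequence joins the
    first cluster whose representative is similar enough, otherwise it
    becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


def representatives(seqs, threshold=0.9):
    """Return one representative per cluster: the non-redundant set."""
    return [rep for rep, _ in greedy_cluster(seqs, threshold)]


if __name__ == "__main__":
    toy = [
        "MKTAYIAKQRQISFVKSHFSRQ",
        "MKTAYIAKQRQISFVKSHFSRA",  # differs from the first in one residue
        "GGGGGGGGGGCCCCCCCCCC",
    ]
    print(representatives(toy))
```

Because every new sequence is compared against the representatives found so far, the work grows with the database, which is one reason regenerating non-redundant releases regularly benefits from Grid resources.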
http://www.e-ciencia.es/wiki/index.php/BLAST
GBLAST
• Analysis of horizontal gene transfer through a BLAST processing service
– Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the GRyCAP, from the Universidad Politécnica de Valencia.
– This experiment aims at identifying horizontal gene transfer between prokaryotes and plants, using the UniProt database, by comparing all known prokaryotic sequences (~4M) against all the known sequences of plants (~0.5M), animals (~1.5M) and fungi (~0.4M).
[Table: output size, using the columns as input and the rows as the reference database.]
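
The approximate sequence counts quoted above give a sense of the scale of the experiment. The sketch below counts the naive number of query/reference sequence pairs, an upper bound that BLAST's heuristics avoid evaluating exhaustively, and the number of independent jobs obtained by splitting the queries. The chunk size of 10,000 queries per job and the function names are hypothetical, chosen only for illustration.

```python
# Approximate sequence counts quoted on the slide.
PROKARYOTES = 4_000_000
REFERENCES = {"plants": 500_000, "animals": 1_500_000, "fungi": 400_000}


def pair_counts(n_queries, references):
    """Number of (query, reference) sequence pairs per reference group."""
    return {name: n_queries * n for name, n in references.items()}


def jobs_needed(n_queries, chunk_size):
    """Independent jobs when the queries are split into fixed-size chunks."""
    return -(-n_queries // chunk_size)  # ceiling division


if __name__ == "__main__":
    counts = pair_counts(PROKARYOTES, REFERENCES)
    for name, n in counts.items():
        print(f"{name}: {n:.1e} pairs")
    print(f"total: {sum(counts.values()):.1e} pairs")
    # Hypothetical chunk size of 10,000 query sequences per job.
    print(jobs_needed(PROKARYOTES, 10_000), "jobs per reference database")
```

With these figures the three comparisons add up to about 9.6e12 sequence pairs, and a 10,000-query chunk would yield 400 jobs per reference database.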
http://www.e-ciencia.es/wiki/index.php/GFrodock
GFrodock
• Grid-Fast ROtational DOCKing
– Proposed by the Centro de Investigaciones Biológicas (CSIC).
– The objective is to determine the interaction between two proteins by analysing their atomic structure.
– It aims at solving one of the CAPRI (Critical Assessment of PRedicted Interactions) scientific challenges.
BiG - Metagenomic Analysis on the Grid
• Quality of the phylogenetic annotation of bacteria
– Comparative phylogenetic experiment on a soil sample against different releases of the GenBank NR database.
– Many of the assignments of sample fragments to biological families have changed, even recently.
– The rate of change does not decrease over time; in many cases it increases.
– This reveals that the complete diversity of such communities is not yet sufficiently well described in current databases.
http://www.e-ciencia.es/wiki/index.php/MIRA
G-MIRA
• Assembly of pyrosequences
– Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the Grid and High Performance Computing Research Group of the Universidad Politécnica de Valencia.
– The new high-throughput sequencing techniques produce millions of reads of between 80 and 500 nucleotides each, which require intensive post-processing for their assembly.
– This pilot focuses on porting to the Grid one well-known code for this kind of sequence, which requires vast computing and memory resources.
http://www.e-ciencia.es/wiki/index.php/Filogen
Filogen
• Construction of phylogenetic trees
– Proposed by the Aragon Institute of Engineering Research (I3A).
– Phylogenetics aims at reconstructing the evolutionary relations among species and living beings using the information from their genomes.
– This pilot focuses on porting a suite of general-purpose codes for this purpose, in order to reduce the long response times of challenging executions.
Current Status
• 4 projects already have a VO created (vo.odthpiv.es-ngi.eu, vo.blast.es-ngi.eu, vo.filogen.es-ngi.eu and vo.frodock.es-ngi.eu).
• 3 projects (GBLAST, Filogen and G-MIRA) have been granted resources for porting through an internal call for projects.
• 33% of the resources have been consumed by biomed applications.
[Chart: Resource Usage. Biomed 33.1%, Others 66.9%.]
Challenges 1/2
• From the point of view of the resources
– Improved scheduling of jobs
    - Highly dynamic behaviour of resources (multiple entry points, information system refresh delays, wide geographic distribution, …).
    - Need for Quality of Service and job run-length prediction.
    - Need for much more scalable algorithms and models.
        • Go beyond the simple high-throughput approach based on splitting the input.
    - Minimisation of I/O bandwidth consumption.
        • Improvement of locality of reference for large databases.
– Specialised resources
    - Main memory constraints.
    - Availability of pre-existing tuned configurations of widely used software.
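
For reference, the "simple high-throughput approach based on splitting the input" mentioned above can be sketched as follows: a FASTA input is parsed into records and grouped into fixed-size chunks, each of which would be submitted as an independent job. This is a minimal illustration under stated assumptions, not the code used by any of the projects; `read_fasta` and `chunks` are hypothetical helper names, and job submission itself is left out.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunks(records, size):
    """Group records into lists of at most `size` records, one per job."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    fasta = ">a\nAC\nGT\n>b\nTT\n>c\nGG\n".splitlines()
    for i, batch in enumerate(chunks(read_fasta(fasta), 2)):
        print(f"job {i}: {[header for header, _ in batch]}")
```

Every chunk still needs the full reference database at the worker node, which is where the concerns about I/O bandwidth and locality of reference come in.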
Challenges 2/2
• From the point of view of the community
– Trade-off in public databases between extensive coverage of the available information and its quality.
    - Many results of using the Grid in bioinformatics have focused on this issue.
    - Since databases are growing exponentially in size, this issue is likely to remain relevant in the medium term.
– Popularisation of community access
    - Availability of simpler interfaces and configurable workflows.
    - But Grids are not adequate for every kind of problem.
        • Do not create excessive expectations.
        • Many research groups already have medium-size computing resources that can handle most of the daily work.
        • Build users’ confidence.