INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Life Sciences Applications
José R. Valverde
EMBnet/CNB
jr
• PhD in Medicine and Surgery
  – You already guessed ;-)
  – Forensic Dr.
  – Molecular Biologist
  – Exobiologist, IVF-ET...
• Computer Scientist
  – Bioinformatician
• Actually
  – EMBnet node manager
EMBnet: provide support, training, resources and services to Life Sciences
Daydreaming
• One day we would like to
  – Go to the doctor
  – Have a blood sample taken
  – Get a personalized diagnosis and therapy
• We would also love to live in a World with
  – No hunger
  – No contamination
  – Biodiversity
  – Lots of fun!
The Human Genome
• Laid the basis for
  – Genomics
  – Proteomics
  – Transcriptomics
  – High-throughput structure analysis
• Sets the stage for the “Databang!”
  – Genome sequencing in many organisms
  – Genomic analysis implies a quantum leap
      several orders of magnitude forward
  – New experimental approaches
  – Genome sequencing overnight(?)
      For less than 1K€/genome!
Oh, no! Not me again!
• What about YOU?
  – Would you like your genome sequenced?
The Databang!
• Data growth in Molecular Biology
  – Exponential till recently (2x every year)
  – Greater than exponential lately (2x every 8 months, 6 months...)
  – The worst is still to come
• With a doubling time of less than 6 months
  – You miss half the knowledge gathered in all of Human History if you lag a few months behind!
• Experimental work
  – Classical: one gene
  – Modern: one genome
• Forget the classical Databank
• Welcome the new Databang!
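The "lag a few months behind" arithmetic can be checked with a minimal sketch. It assumes idealised exponential growth with a fixed doubling time; the function name `fraction_missed` is illustrative and does not come from the talk.

```python
def fraction_missed(lag_months: float, doubling_months: float) -> float:
    """Fraction of today's data that did not yet exist `lag_months` ago,
    assuming the total doubles every `doubling_months` (an idealisation)."""
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 1.0 - 2.0 ** (-lag_months / doubling_months)


# Compare the doubling times quoted on the slide for a fixed 6-month lag.
for doubling in (12, 8, 6):
    missed = fraction_missed(6, doubling)
    print(f"doubling every {doubling:>2} months: a 6-month lag misses {missed:.0%}")
```

Under this model a 6-month lag misses about 29% of the data when it doubles yearly, about 41% at an 8-month doubling time, and exactly half when the lag equals the doubling time.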
Beyond Molecular Biology
• Medicine
  – Homo sum, humani nihil a me alienum puto (Terence): "I am human, and nothing human is alien to me" ;-)
  – Knowledge application
  – Knowledge INTEGRATION
      Medical records
      Neurology, immunology, etc...
• Pharmacology
  – Drug identification
  – Drug testing
• Biotechnology
• Chemistry
• You name it!
Pursue your dreams!
• Never, ever give up!
But, how?
• Your doctor will need to analyse your whole genome
  – and compare against population standards
• Your pharmacist will need to find the best drug
  – Out of millions
• Engineers will need to understand Life
  – From molecules to populations
  – And how to modify it ecologically
• Shorthands/rules/laws will need to be drawn
Meaning what?
• Huge amounts of raw power at the fingertips...
  – Of many professionals
  – To store replicated data (security, accessibility, efficiency...)
  – To analyze vast amounts of data
• A scaffold of knowledge
  – Built stepwise on top of prior knowledge (molecules, cells, tissues, organisms, populations, ecosystems)
  – From many sources (Biology, Medicine, Industry, etc.)
  – By many professionals (all over the World)
• Tight security
  – To protect data (personal, corporate) from abuse
Getting there
• A component based architecture
  – Multifaceted, multiheaded, multihosted
  – Integrable, deals with complexity
• A lot of power
  – HPC systems
  – Grid systems
• Politics
  – Security in the face of tremendous stress forces
      Corporate, Political, Private, Moral, Ethical
Component Architecture
• This is Science, man!
  – There should be no barriers to collaboration
• Object Oriented Web Services (and CORBA, .Net, etc.)
• BioMOBY (www.biomoby.org)
  – Web Services based
  – Workflows with Taverna (under reevaluation)
  – Distributed development
  – Integration with the Grid (MyGrid, UK)
  – Examples of BioMOBY applications:
      Sequence conversion
      Protein structure modelling
      Sequence comparison
      Gene finding...
Component layers
LAYER (with EXAMPLES)
• Low level
  – System('xxx')
  – Examples: CGIs
• Middleware
  – Submit('xxx')
  – Examples: PHP:Grid, DRMAA
• Application
  – Analyze('xxx')
  – Decide('xxx')
  – Predict('xxx')
  – Examples: DOCK-ws, BLAST-ws, MODELLER-ws
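The three layers can be sketched as three functions, each hiding the one below it. This is only an illustration under stated assumptions: the lowercase names mirror the slide's System/Submit/Analyze, the "middleware" simply runs the job locally, and the "analysis" (counting residues in a child Python process) is a hypothetical placeholder for a real tool.

```python
import subprocess
import sys


def system(argv):
    """Low level: run one command locally and return its standard output."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def submit(argv, runner=system):
    """Middleware: decide where the job runs.
    This stand-in runs it locally; a real deployment would hand the job
    to a queue or grid broker instead (assumption, not shown here)."""
    return runner(argv)


def analyze(sequence, runner=system):
    """Application: a domain-level call that hides all scheduling details."""
    # Hypothetical 'analysis': the child process prints the sequence length.
    code = "import sys; print(len(sys.argv[1]))"
    output = submit([sys.executable, "-c", code, sequence], runner=runner)
    return int(output.strip())


print(analyze("MKTAYIAKQR"))
```

Because each layer only talks to the one directly beneath it, swapping the `runner` (local shell, batch queue, grid) should not require touching `analyze`.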
Base services: Blast
• Different interfaces with different calling conventions
• Dynamically changing with each new version / sysadmin / fashion
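One way to shield callers from differing calling conventions is a thin adapter that builds the command line for each one. The sketch below covers two NCBI command-line styles as I recall them (legacy `blastall` and BLAST+); the flags should be verified against the installed version, and the function name `blast_command` is purely illustrative.

```python
def blast_command(convention, program, database, query_file, output_file):
    """Build an argument list for one of two BLAST command-line conventions.

    Nothing is executed here; the caller decides how and where to run it.
    """
    if convention == "legacy":
        # Legacy NCBI style: one 'blastall' binary, program chosen with -p.
        return ["blastall", "-p", program, "-d", database,
                "-i", query_file, "-o", output_file]
    if convention == "blast+":
        # BLAST+ style: one executable per program, long option names.
        return [program, "-db", database,
                "-query", query_file, "-out", output_file]
    raise ValueError(f"unknown BLAST convention: {convention!r}")


# Hypothetical database and file names, for illustration only.
for style in ("legacy", "blast+"):
    print(" ".join(blast_command(style, "blastp", "swissprot", "query.fa", "hits.txt")))
```

A service wrapper such as the BLAST-ws mentioned in this talk could keep a stable interface while only this adapter changes between installations.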
Derived services
☞ Call upon existing servers on remote systems
☞ Might be called from servers on remote sites
Distributed data queries
• Distributed DBMS (SRS Federation, www.srsfed.org)
  • Store databases distributed/replicated over central nodes
  • Distribute database processing to hosting servers
  • Distribute database queries transparently from distributed front ends
• User data
  • Find the best way to store/access
• Data collection into databases
  • Test distributed collection/storage
  • Systems
Good problem for ☞ HPC, gridification
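Distributing a query transparently from a front end amounts to a scatter-gather step: send the same query to every hosting node, then merge the partial answers. The sketch below is a toy model under loud assumptions: the nodes are in-memory dictionaries with made-up accession keys, threads stand in for network calls, and none of this reflects the actual SRS Federation interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for database nodes, each holding a slice of the records.
NODES = {
    "node-a": {"P1": "kinase", "P2": "protease"},
    "node-b": {"P3": "kinase inhibitor", "P4": "transporter"},
}


def query_node(node_name, keyword):
    """Run the query on one node and return its local hits."""
    records = NODES[node_name]
    return [(node_name, key) for key, text in sorted(records.items())
            if keyword in text]


def federated_query(keyword):
    """Front end: scatter the query to all nodes, gather and merge the hits."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda name: query_node(name, keyword), sorted(NODES))
        # pool.map yields results in submission order, so the merge is stable.
        return [hit for hits in partial for hit in hits]


print(federated_query("kinase"))
```

The caller sees a single result list and never learns which node held which record, which is the transparency the slide asks for.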
HPC
• MPI, queues
• Good for massively parallel jobs
  – e.g. Molecular Dynamics on MareNostrum
  – e.g. 3D reconstruction on MareNostrum
• Very expensive
• Good for embarrassingly parallel jobs
  – But so is the Grid
• Good for communication dependent jobs
  – Large messages
  – Many messages
• Don't get me wrong: there is a lot of life in HPC
Classical problems
• There are still huge problems with huge demands on compute power
  • Structure refinement (X-ray, NMR, Microscopy...)
  • Structure prediction (Homology, Threading, MD)
  • Structure analysis (docking, MD/QM simulations, QSAR, 3D-QSAR)
  • Many others
• Coarse and fine grain computation
• Benefit from distributed computing
  • Farm / cluster / grid / supercomputers
  • PVM/MPI implementations may exist
MareNostrum
The life science program will take advantage of the supercomputer to gain a deeper understanding of the behavior of living organisms.
Research lines
• Genomic analysis
• Data mining
• Systems Biology
• Prediction of protein fold
• Molecular interactions
Grid computing
• Affordable
• Good for embarrassingly parallel jobs
  – WISDOM, GROCK, 3D analysis, HMMER...
• Appropriate for parallel jobs (big clusters)
  – MPI, MD, etc.
• Appropriate for distributed data
  – Medical imaging, databases, etc.
• Good for HT and HP (highly popular) tasks
  – EMBOSS, EMBRACE, etc.
High throughput data
• New experimental approaches generate pervasive hyper-exponential data streams
• Processing requires massive computing power
• Currently beyond reach of common developers
• There are some solutions
• Parallel processing in Molecular Structure
• But the vast majority of applications are still single threaded monolithic processes
• And developers are used to it!
Processing HT data
• Distribute computation over as many nodes as possible:
  • Supercomputing centres
  • Departmental servers
  • Workstations
  • PDAs, mobiles, commodity appliances
  • Fridges, toasters, etc... as they become available
• Bring developers in
• MPI, PVM are powerful but lack programmers
• OO is intuitive and widely available
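Spreading independent work over many nodes follows a simple split, map, merge pattern. The sketch below assumes that worker threads stand in for remote nodes and that GC content is the per-item task; both are illustrative choices, not something the talk specifies.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Per-item task: fraction of G and C bases in one DNA sequence."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


def chunked(items, size):
    """Split the workload into independent chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_all(sequences, workers=4, chunk_size=2):
    """Send each chunk to a worker and merge the results in input order."""
    chunks = list(chunked(sequences, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(lambda chunk: [gc_content(s) for s in chunk], chunks)
        return [value for chunk in per_chunk for value in chunk]


print(process_all(["GGCC", "ATAT", "GCAT"]))
```

Since no chunk depends on another, the same structure carries over when the workers are departmental servers or grid nodes rather than threads; only the dispatch step changes.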
GROCK: HT docking
• Why do we want easy High-Throughput docking?
  – Find the best matches between two molecular structures, for a probe molecule against all molecules in a database
  – Drug against protein: identify drug function, predict secondary effects
  – Protein against proteins: identify protein interactions, build interaction networks
  – Protein against drugs: identify candidate drugs for therapy
  – Beyond a single organism
Match molecule vs. database
• Sort pairs by energy
• For each pair
  – Save 1000 best matches
  – Show 10 best for exploration
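The keep-1000, show-10 step is a top-k selection. The sketch below is a minimal illustration, assuming that a lower (more negative) energy means a better match and that each result is an `(energy, probe, target)` tuple; the scores and molecule names are invented and this is not the GROCK implementation.

```python
import heapq


def best_matches(scored_pairs, keep=1000, show=10):
    """Return (kept, shown): the `keep` lowest-energy matches, best first,
    and the first `show` of those for interactive exploration."""
    # nsmallest returns its result sorted ascending by the key.
    kept = heapq.nsmallest(keep, scored_pairs, key=lambda pair: pair[0])
    return kept, kept[:show]


# Hypothetical docking scores: (energy, probe, target).
scores = [
    (-7.2, "drugX", "protA"),
    (-9.1, "drugX", "protB"),
    (-3.4, "drugX", "protC"),
]
kept, shown = best_matches(scores, keep=2, show=1)
print(kept)
print(shown)
```

`heapq.nsmallest` avoids sorting the whole result set when only a small fraction is kept, which matters when the database holds far more than 1000 candidates.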
Other EGEE examples
• GATE: Geant4 application for tomographic emission
• CDSS: Clinical decision support system
• GPS@: Genomics web portal
• SIMRI3D: Magnetic resonance image simulator
• gPTM3D: Interactive radiography visualization
• WISDOM: Docking platform for tropical diseases
• Pharmacokinetics: contrast agent diffusion in MRI
• Bronze standard: evaluation of medical imaging algorithms
• SPLATCHE: Genome evolution modelling
• Mammogrid project
• HealthGrid, EMBRACE, etc...
A distribution architecture
[Figure labels: Users; Web servers; Front Ends; Back Ends]
Security
• Access control and authentication
  – CAs
• Tru$t
  – VOs
• Encryption (e.g. parrot/perroquet)
• Usage/access policies
• SOCIAL POLITICS (e.g. France)
  – Patient privacy is sacred
• PRIVATE INTERESTS (e.g. Pharma and Biotech)
• Criminal abuse
  – Crackers
Science, Medicine, etc...
• Back to collaboration
  – We need to stand on the shoulders of giants (and dwarfs as well)
  – We need to share information
• Can we rely on personal certificates?
  – Services need server certificates
  – How do we deal with multiple access to private data?
• Does it make sense?
  – Research groups
  – Research projects
  – Doctor(s) and patient(s)
• A brave new World
  – For you to morph
Kudos to
• YOU ALL
  – for being here, your help, encouragement, feedback and support
  – and not falling asleep
• The TEAM at CNB
  – Biocomputing
      José M. Carazo, Carlos Pérez-Roca, Enrique de Andrés, Natalia Jiménez, Sjors Schëres, Alfredo
  – Bioinformatics
      José R. Valverde, David J. García
• The NA4 Biomed task force
• The EU for EGEE and EGEE-II