eScience and Grid
Tools and techniques for the next generation scientist
Professor Brian Vinter
Head of the Copenhagen eScience Center
eScience
«The next 10 to 20 years will see computational science firmly embedded in the fabric of science – the most profound development in the scientific method in over three centuries.»
US Department of Energy 2003.
Mega-Science
The next scientific period will be dominated by Mega-Science projects
• 10^4 researchers on a single project
• Extreme data production
• Highly integrated collaboration between different groups of scientists
Examples
• CERN LHC
• ALMA
• Mars project
Data Production
1997: Total data worldwide approx. 12 exabytes (incl. documents, film, TV, pictures, …) (1)
1999: 2-3 exabytes of data produced (2)
2002: Approx. 5 exabytes of data produced (2)
1 Exabyte = 1000 Petabytes
1 Petabyte = 1000 Terabytes
1 Terabyte = 1000 Gigabytes
1 Gigabyte = 1000 Megabytes
Global data availability doubles every 4-5 years.
1) http://www.lesk.com/mlesk/ksg97/ksg.html
2) http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
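The doubling claim can be turned into an annual growth rate. The sketch below assumes smooth exponential growth, which the slide does not state; the function names are illustrative only.

```python
def annual_growth_rate(doubling_time_years: float) -> float:
    """Annual growth rate implied by a constant doubling time."""
    return 2 ** (1 / doubling_time_years) - 1


def projected_volume(start_exabytes: float, years: float,
                     doubling_time_years: float) -> float:
    """Volume after `years`, assuming smooth exponential growth (an assumption)."""
    return start_exabytes * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    # The slide gives a doubling time of 4 to 5 years.
    for doubling_time in (4, 5):
        rate = annual_growth_rate(doubling_time)
        print(f"doubling every {doubling_time} years: {rate:.1%} per year")
```

Under that assumption, a 4-5 year doubling time corresponds to roughly 15-19% growth per year.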
eScience Components
Modeling and simulation
Data acquisition and handling
Visualization
HPC and Grid
[Figure: molecular systems of 54, 442, and 1372 molecules]
Why is it getting more difficult?
System sizes and time scales
[Figure: two logarithmic scales, system size and process time]
System size (number of atoms): 1 (H2O2), 10 (single peptide), 1000 (biomimetic compound), 10^4 (biopolymers), 10^5 (proteins), 10^6 (ribosomes)
Process time (seconds): 10^-15 (photo-ionization), 10^-12 (proton transfer), 10^-9, 10^-6, 10^-3 (protein folding), 1, 10^3 (this seminar)
Nano-modeling
• Extremely CPU- and data-intensive algorithms
• Complex structure calculations
• Multiple days of execution even on a supercomputer
• Runs on both PCs and supercomputers
eScience and Bio/Med
We expect very good results from eScience in biology and medicine
The foremost advantages will come from introducing a mathematical causal understanding of biological systems
• Bioinformatics is already doing this
An emerging field: Systems Biology
• Systems Medicine is also starting internationally
Calculations in treatment
Computational methods are already important in medical planning
• Radiation planning
• Bypass flow modeling
• Robotic surgery
• …
Every human is unique
Also at the genetic level
In our genome, which is written with the alphabet ACGT, we have a number of micro-mutations called single nucleotide polymorphisms (SNPs)
These SNPs are often without consequence but
• Some make us sick
• Some are indicators of a faulty gene
• Others influence our response to a drug
The last complication makes it very hard to make drugs for the general population
We want to move from commodity medicine to custom-tailored drugs
Personalized medicine
An example
Approx. 60% of today's medicines are metabolized by cytochrome P450 enzymes
• Some people have highly efficient P450 while others have very slow and inefficient P450
• Knowledge of a patient's P450 level will allow us to dose medicine to the individual much more efficiently
This is already in early use
And this is eScience how?
Developing a drug is not a linear process
The human genome is written with billions of letters
• Any person has millions of SNP mutations
• Finding the SNP that has an effect is a highly complex computational task
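One way to see why the search gets expensive: if an effect depends on a combination of SNPs rather than a single one, the number of candidate combinations grows very quickly. The sketch below only counts candidates; the figure of 3 million SNPs is an assumed stand-in for "millions", and the combination view is an illustration, not a claim about how any particular tool searches.

```python
from math import comb

# Assumption: a stand-in for the "millions" of SNPs mentioned on the slide.
ASSUMED_SNP_COUNT = 3_000_000


def candidate_sets(snp_count: int, set_size: int) -> int:
    """Number of distinct SNP combinations of a given size."""
    return comb(snp_count, set_size)


if __name__ == "__main__":
    for size in (1, 2, 3):
        count = candidate_sets(ASSUMED_SNP_COUNT, size)
        print(f"combinations of {size} SNP(s): {count:.3e}")
```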
eScience and geology
Geology and hydrology have also been using computational methods for a long time
There are very interesting aspects in combining different methods
• E.g. including biological systems in the models
• Inverse mapping of seismic data
It turns out that we use the same techniques in medicine
• And soon in industry
Grid
Minimum intrusion Grid
[Figure: users and resources connected through the Grid]
Processing plants
Like the power grid, the computing Grid has many types of power producers
• High-yield power plants (fossil fuel, nuclear, …)
  • Supercomputers and large farms
• Low-yield producers (windmills, etc.)
  • Individual PCs and games consoles
• Very low-yield producers (solar panels, etc.)
  • Web browsers
One Click
Interactive Applications
VGrids
Best thing since sliced bread
VGrids are Virtual Organizations in MiG
They are a dead easy way to create collaborations
• Share files
• Share resources
• Private entry page
• Public Web page
Portals
VOs can generate their own private entry pages, including application portals
Files in VGrids
A user must keep her personal home directory regardless of which VGrid she works in
But VGrids have a common directory that only members of the VGrid may access
• These are represented as directories in the user's home directory
VGrid owners can create sub-VGrids
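The directory rules above can be sketched as a small data structure. This is a hypothetical model for illustration, not MiG's actual implementation or API: the class, method names, and the choice not to inherit membership from a parent VGrid are all assumptions.

```python
class VGrid:
    """Hypothetical model of a VGrid with a shared, members-only directory."""

    def __init__(self, name, owners=(), members=(), parent=None):
        self.name = name
        self.owners = set(owners)
        self.members = set(members)
        self.parent = parent

    def path(self):
        """Shared directory path, nested under the parent VGrid if there is one."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"

    def can_access(self, user):
        """Only members (and owners) may enter the shared directory."""
        return user in self.members or user in self.owners

    def create_sub_vgrid(self, owner, name):
        """Only an owner of this VGrid may create a sub-VGrid."""
        if owner not in self.owners:
            raise PermissionError(f"{owner} does not own {self.path()}")
        return VGrid(name, owners={owner}, members={owner}, parent=self)


def home_view(user, vgrids):
    """Shared directories that appear in the user's home directory."""
    return sorted(v.path() for v in vgrids if v.can_access(user))
```

For example, if `alice` owns a VGrid `bio` that `bob` has joined, and she creates a sub-VGrid `docking`, then `home_view("bob", ...)` lists only `bio` in this model, while her own view lists both.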
Examples
eScience on Grid
GeneRecon
GeneRecon seeks to identify genetic factors behind hereditary diseases
The overall idea is to compare two sets of genomes
• One where the disease is observed
• One where the disease is not observed
Approx. 1000 individuals in each set
GeneRecon is developed at the Bioinformatics Research Center, Århus University
GeneRecon
The algorithm is a Markov-chain Monte Carlo method
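For readers unfamiliar with the technique, here is a minimal random-walk Metropolis sampler, one of the simplest Markov-chain Monte Carlo methods. It is a generic textbook sketch and not GeneRecon's algorithm; the target density, step size, and seed are assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_density, start, steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density.

    `log_density` returns the log of the (unnormalized) target density.
    """
    rng = random.Random(seed)
    current = start
    current_logp = log_density(current)
    samples = []
    for _ in range(steps):
        # Propose a move by adding symmetric Gaussian noise.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_logp = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if (proposal_logp >= current_logp
                or rng.random() < math.exp(proposal_logp - current_logp)):
            current, current_logp = proposal, proposal_logp
        samples.append(current)
    return samples


if __name__ == "__main__":
    # Illustrative target: a standard normal distribution.
    draws = metropolis(lambda x: -0.5 * x * x, start=0.0, steps=20_000)
    print(sum(draws) / len(draws))
```

Each chain is sequential, but independent chains or tests can run on separate machines, which is what makes this kind of workload a natural fit for a Grid.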
A test run consists of approx. 30,000 individual tests
• One test runs from 1 to 10 days on a PC
• In total no less than 82 CPU years
MiG hosted the execution on Grid and brought the total run time below a month
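The 82 CPU-year figure follows from the numbers on the slide when every test takes the minimum of one day. The sketch below reproduces that arithmetic; treating "below a month" as 30 days is an assumption made only to estimate the average number of CPUs needed.

```python
TESTS = 30_000          # from the slide
DAYS_PER_YEAR = 365


def cpu_years(tests: int, days_per_test: float) -> float:
    """Total sequential CPU time, in years."""
    return tests * days_per_test / DAYS_PER_YEAR


def average_concurrency(total_cpu_days: float, wall_clock_days: float) -> float:
    """Average number of busy CPUs needed to finish within the wall-clock time."""
    return total_cpu_days / wall_clock_days


if __name__ == "__main__":
    print(f"lower bound: {cpu_years(TESTS, 1):.1f} CPU years")
    print(f"upper bound: {cpu_years(TESTS, 10):.1f} CPU years")
    # Assumption: "below a month" taken as 30 days, at 1 day per test.
    print(f"CPUs needed: {average_concurrency(TESTS * 1, 30):.0f}")
```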
Statistics
1315 jobs were submitted to Grid at the same time
0 jobs were lost
First result
• 2:04:44
Last result
• 28 days, 5:42:54
[Table: min, avg, and max of queue time, execution time, and total time; extracted values: 0.01, 2.08, 5546, 101, 505392, 678]
Groundwater modeling on Funen
[Figure: aggregated objective function (11.0 to 18.0) versus number of model evaluations (0 to 1000)]
Calibration of the Assens model:
1 model evaluation = 30 min
920 model evaluations = 19 days
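The 19-day figure is simply 920 evaluations at 30 minutes each, run one after another. The sketch below checks that and adds an idealized estimate for several PCs; the parallel estimate assumes that evaluations are independent and that there is no scheduling overhead, neither of which the slide states.

```python
MINUTES_PER_EVALUATION = 30   # from the slide
EVALUATIONS = 920             # from the slide


def serial_days(evaluations: int, minutes_each: float) -> float:
    """Wall-clock days when evaluations run one after another on one PC."""
    return evaluations * minutes_each / 60 / 24


def ideal_parallel_hours(evaluations: int, minutes_each: float, workers: int) -> float:
    """Idealized wall-clock hours with `workers` PCs and no overhead (assumption)."""
    rounds = -(-evaluations // workers)  # ceiling division
    return rounds * minutes_each / 60


if __name__ == "__main__":
    print(f"1 PC:   {serial_days(EVALUATIONS, MINUTES_PER_EVALUATION):.1f} days")
    print(f"10 PCs: {ideal_parallel_hours(EVALUATIONS, MINUTES_PER_EVALUATION, 10):.0f} hours")
```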
AUTOCAL OfficeGRID
Days to hours
[Figure: one master distributing work to four clients]
[Figure: objective function (1.5 to 5.0) versus time (0 to 100 h) for AUTOCAL (1 PC) and AUTOCAL OfficeGRID (10 PCs)]
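The master/client layout in the figure can be sketched with a worker pool. This is an illustration of the pattern only, not AUTOCAL or OfficeGRID code: `evaluate_model` is a hypothetical stand-in for one model run, and threads on one machine stand in for the client PCs.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_model(parameter: float) -> float:
    """Hypothetical stand-in for one model evaluation (lower is better)."""
    return (parameter - 3.0) ** 2 + 1.5


def calibrate(candidates, workers=10):
    """Master: hand candidate parameters to the workers and keep the best one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate_model, candidates))
    best_score, best_parameter = min(zip(scores, candidates))
    return best_parameter, best_score


if __name__ == "__main__":
    grid = [i * 0.5 for i in range(13)]  # candidate parameters 0.0 .. 6.0
    print(calibrate(grid))
```

The master only needs the score of each evaluation back, so the clients can be ordinary office PCs that pick up one evaluation at a time.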
Drug Design
Molecular docking is a time-consuming calculation, which this project performs in two steps
The first step is a coarse calculation that can eliminate molecules that won’t dock
• This process can run on PCs and PS3s; a lot of work is being done towards efficient utilization of the CELL CPU for molecular docking
The molecules that survive the first step are then modeled more precisely at quantum level on classic supercomputers and clusters
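The two steps amount to a filter followed by a more expensive ranking. The sketch below shows that structure only; the scoring functions, the cutoff, and the convention that lower scores are better are all hypothetical, and no real docking code is involved.

```python
def screen(molecules, coarse_score, precise_score, coarse_cutoff):
    """Two-stage screening: cheap filter first, expensive ranking on survivors.

    `coarse_score` and `precise_score` are hypothetical callables; lower
    scores are assumed to mean better docking.
    """
    # Step 1: cheap calculation, suited to PCs and consoles.
    survivors = [m for m in molecules if coarse_score(m) <= coarse_cutoff]
    # Step 2: expensive calculation, run only on the survivors.
    return sorted(survivors, key=precise_score)


if __name__ == "__main__":
    # Made-up scores for three placeholder molecules.
    coarse = {"a": 0.2, "b": 0.9, "c": 0.4}
    precise = {"a": -7.1, "c": -9.3}  # "b" is never scored precisely
    print(screen(["a", "b", "c"], coarse.get, precise.__getitem__, 0.5))
```

The point of the design is that the expensive step never sees molecules the coarse step has already ruled out.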
SeGrid
Still a proposal
The idea is to share sensitive data through Grid and use the Grid technology to manage access control and automatic anonymization
More information
www.eScience.dk
Portal for KU's eScience activities
www.migrid.org
Portal for the Minimum intrusion Grid
www.rcuk.ac.uk/escience/
The very ambitious UK eScience program