Microsoft E Science 2009
[1]
Beyond the Tsunami: Developing the Infrastructure to Deal with Life Sciences Data
Christopher Southan and Graham Cameron, EMBL-European Bioinformatics Institute (EBI), Cambridge, U.K.
[2]
EBI and Sanger at Hinxton: Engaging with the Data Challenges
• Technology for sequence data generation and reduction
• Repositories, storage, archiving
• Databases, entity linking, infrastructure and utility
• Biocuration, annotation, standards, ontologies
• Experimental biological data from research groups
• Data exploitation, mining and visualisation
• Biological hypothesis iteration
[3]
EMBL-Bank
[Chart: vertical axis from 0 to 3E+11]
Release 101, Aug 2009, 163 million entries, 283 billion bases
[4]
10 years of Rapid Growth
GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP.
08-OCT-2009 (Rel. 102, Created)
08-OCT-2009 (Rel. 102, Last updated, Version 1)
Influenza A virus (A/Chengdu/03/2009(H1N1)) segment 4 hemagglutinin (HA)
Jiang T., Qin C., Li X., Zhao H., Yu M., Deng Y., Yu X., Han J., Qin E., Zhu Q.; "A community transmission of influenza A (H1N1) virus in a boarding school in China, 22-27 July 2009"
*******************************************************************************************
AF177758; SV 1; linear; mRNA; STD; HUM; 1868 BP.
10-SEP-1999 (Rel. 61, Created)
07-OCT-2008 (Rel. 97, Last updated, Version 6)
Homo sapiens ubiquitin specific protease 16 (USP16) mRNA, complete cds.
PUBMED; 10786635.
Smith T.S., Southan C.; "Sequencing, tissue distribution and chromosomal assignment of a novel ubiquitin-specific protease USP23"; Biochim. Biophys. Acta 1490(1-2):184-88(2000).
Ensembl-Gn; ENSG00000143258; Homo_sapiens.
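The first line of each entry above follows the semicolon-delimited layout of an EMBL flat-file ID line: accession, sequence version, topology, molecule type, data class, taxonomic division and length. As a minimal sketch of how such a line could be read programmatically, the Python below splits it into named fields. The names `EmblSummary` and `parse_embl_summary` are hypothetical and chosen for illustration only; this is not an EBI tool or API.

```python
from typing import NamedTuple


class EmblSummary(NamedTuple):
    """Fields of an EMBL ID line (hypothetical container for illustration)."""
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    division: str
    length_bp: int


def parse_embl_summary(line: str) -> EmblSummary:
    """Split a semicolon-delimited EMBL ID line into its seven fields."""
    text = line.strip()
    # Real flat files prefix the line with "ID"; the slide omits it.
    if text.startswith("ID "):
        text = text[2:].strip()
    fields = [f.strip() for f in text.rstrip(".").split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    accession, sv, topology, mol_type, data_class, division, length = fields
    return EmblSummary(
        accession=accession,
        sequence_version=int(sv.split()[1]),   # "SV 1" -> 1
        topology=topology,
        molecule_type=mol_type,
        data_class=data_class,
        division=division,
        length_bp=int(length.split()[0]),      # "1701 BP" -> 1701
    )


if __name__ == "__main__":
    print(parse_embl_summary(
        "GU057010; SV 1; linear; viral cRNA; STD; VRL; 1701 BP."
    ))
```

Applied to the two entries on this slide, the division field separates the viral record (VRL) from the human one (HUM), and the length field gives 1701 and 1868 base pairs respectively.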
[5]
New Technology > New Data Archives
[Chart: volume (TB) by data type. Categories: assembled sequence, capillary traces, next-generation reads. Values shown: 1.9, 70, 35]
European Nucleotide Archive Snapshot March 2009
[6]
Accelerating Genome Coverage
Jan 2009, 4370 projects
[7]
from EBI/Sanger
[8]
The 1000 Genomes Project: Cataloging Human Genetic Variation
• Initial human genome: 10 years and 40 gigabases
• Over the next two years the equivalent of two human genomes will be produced every 24 hours
• Completed dataset will be 6 trillion DNA bases, 500 TB
• 60-fold more than 28 years of EMBL-Bank
• Expected to cover 1200 genomes
[9]
Data Exploitation: EBI Accesses
Last 4 years of hit-rates for web pages and web services
[Chart: vertical axis from 0 to 1,200,000; series: CGI, API]
[10]
Genomes
Nucleotide sequence
Expression
Proteomes
Protein families and domains
Protein structure
Protein interactions
Chemical entities
Pathways
Systems
Literature, ontologies
Towards a sustainable infrastructure for biological information in Europe, to support life science, translation to medicine, the environment, bio-industries and society.
[11]
Conclusions
• The International Nucleotide Sequence Database Collaboration will exceed 300 billion bases in 2009.
• Storage at the EBI has doubled annually and is now 5 Petabytes.
• Next-Generation Sequencing is increasing data production ~10-fold.
• By 2010 the full genomic variation in over 1000 people will be revealed and genomes from over 1000 species completed.
• An increase in data mining is needed to facilitate conversion into knowledge.
• The European ELIXIR project and other global initiatives to enhance the sustainable infrastructure for biological databases are essential.
• The impact of data-intensive computing on the Life Sciences will be profound and transforming.
• Exploitation will bring major benefits for biology, medicine, agriculture, biofuels and environmental science.
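To make the storage bullet concrete, the sketch below works out what annual doubling implies starting from the 5 Petabytes quoted here. It assumes, purely for illustration, that the doubling rate continues unchanged. The function name `projected_storage_pb` is hypothetical, and the output is arithmetic under that assumption, not a forecast from the authors.

```python
def projected_storage_pb(current_pb: float, years: int,
                         annual_factor: float = 2.0) -> float:
    """Extrapolate storage after `years`, assuming constant annual growth.

    annual_factor=2.0 encodes the "doubled annually" figure from the slide;
    continuing at that rate is an assumption, not a stated plan.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return current_pb * annual_factor ** years


if __name__ == "__main__":
    for year in range(4):
        print(f"+{year} years: {projected_storage_pb(5.0, year):g} PB")
```

Under this assumption, three further doublings would take 5 PB to 40 PB, which shows how quickly the infrastructure requirement grows.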