Making Sense of Life Sciences Data
Nigel Martin
21st May 2008
Life Sciences Informatics
The development and use of computational methods for the
• acquisition
• management
• analysis and
• interpretation
of biological and medical information to determine biological functions and mechanisms as well as their applications in user communities
This biological and medical information is encoded in the vast amounts of data now generated in the life sciences, e.g. DNA data
Life Sciences Informatics
CC AA CC CC TTGG ……
[Figure: Homo sapiens genome (made of DNA). A gene is the permanent copy; gene expression yields RNA (temporary copy) and protein (product), whose function (job) contributes to biological processes.]
Life Sciences Data is Complex
• The primary data of DNA and protein sequences are held in large repositories such as the EMBL Nucleotide Sequence Database
• The latest release contains 114,475,051 sequences comprising 215,540,553,360 nucleotides
• But life sciences data comprises much besides sequence data…
• e.g. CATH protein structure classification
• e.g. herpesvirus evolutionary tree
• e.g. KEGG metabolic pathway
• e.g. PubMed medical abstract
Toxicol Appl Pharmacol. 2004 Dec 1;201(2):178-85.
cDNA microarray analysis of rat alveolar epithelial cells following exposure to organic extract of diesel exhaust particles.
Koike E, Hirano S, Furuyama A, Kobayashi T.
Particulate Matter (PM2.5) and Diesel Exhaust Particles (DEP) Research Project, National Institute for Environmental Studies, Tsukuba, Ibaraki, 305-8506, Japan.
Diesel exhaust particles (DEP) induce pulmonary diseases including asthma and chronic bronchitis. Comprehensive evaluation is required to know the mechanisms underlying the effects of air pollutants including DEP on lung diseases. Using a cDNA microarray, we examined changes in gene expression in SV40T2 cells, a rat alveolar type II epithelial cell line, following exposure to an organic extract of DEP. We identified candidate sensitive genes that were up- or down-regulated in response to DEP. The cDNA microarray analysis revealed that a 6-h exposure to the DEP extract (30 μg/ml) increased (>2-fold) the expression of 51 genes associated with drug metabolism, antioxidation, cell cycle/proliferation/apoptosis, coagulation/fibrinolysis, and expressed sequence tags (ESTs), and decreased (<0.5-fold) that of 20 genes. In the present study, heme oxygenase (HO)-1, an antioxidative enzyme, showed the maximum increase in gene expression; and type II transglutaminase (TGM-2), a regulator of coagulation, showed the most prominent decrease among the genes. We confirmed the change in the HO-1 protein level by Western blot analysis and that in the enzyme activity of TGM-2. The organic extract of DEP increased the expression of HO-1 protein and decreased the enzyme activity of TGM-2. Furthermore, these effects of DEP on either HO-1 or TGM-2 were reduced by N-acetyl-l-cysteine (NAC), thus suggesting that oxidative stress caused by this organic fraction of DEP may have induced these cellular responses. Therefore, an increase in HO-1 and a decrease in TGM-2 might be good markers of the biological response to organic compounds of airborne particulate substances.
PMID: 15541757 [PubMed - in process]
• e.g. Gene Ontology http://www.geneontology.org/
• GO:0008150 : biological_process ( 109503 )
• GO:0005575 : cellular_component ( 98453 )
• GO:0003674 : molecular_function ( 108120 )
    • GO:0016209 : antioxidant activity ( 478 )
    • GO:0005488 : binding ( 31317 )
    • GO:0003824 : catalytic activity ( 35260 )
    • GO:0030188 : chaperone regulator activity ( 14 )
    • GO:0030234 : enzyme regulator activity ( 2087 )
    • GO:0005554 : molecular_function unknown ( 29597 )
    • GO:0003774 : motor activity ( 522 )
    • GO:0045735 : nutrient reservoir activity ( 36 )
    • GO:0004871 : signal transducer activity ( 8356 )
    • GO:0005198 : structural molecule activity ( 3428 )
    • GO:0030528 : transcription regulator activity ( 8552 )
        • GO:0017163 : negative regulator of basal transcription activity ( 15 )
        • GO:0003701 : RNA polymerase I transcription factor activity ( 31 )
        • GO:0003702 : RNA polymerase II transcription factor activity ( 982 )
        • GO:0003709 : RNA polymerase III transcription factor activity ( 41 )
        • GO:0030401 : transcription antiterminator activity ( 16 )
        • GO:0003712 : transcription cofactor activity ( 731 )
        • GO:0003700 : transcription factor activity ( 5510 )
        • GO:0016986 : transcription initiation factor activity ( 82 )
        • GO:0016988 : transcription initiation factor antagonist activity ( 9 )
        • GO:0003715 : transcription termination factor activity ( 38 )
        • GO:0016563 : transcriptional activator activity ( 499 )
        • GO:0003711 : transcriptional elongation regulator activity ( 97 )
        • GO:0016564 : transcriptional repressor activity ( 507 )
        • GO:0000156 : two-component response regulator activity ( 394 )
    • GO:0045182 : translation regulator activity ( 687 )
    • GO:0005215 : transporter activity ( 9054 )
    • GO:0030533 : triplet codon-amino acid adaptor activity ( 555 )
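The Gene Ontology listing above is hierarchical, and a term's meaning depends on where it sits under the three top-level categories. As a minimal sketch, the Python below stores child-to-parent links for a handful of the listed terms and walks from a term up to its root. The nesting is an assumption read off the order of the listing, and the single-parent map is a simplification: GO is a directed acyclic graph in which a term can have several parents. The names `PARENT`, `NAME` and `path_to_root` are hypothetical.

```python
# Child -> parent links for a few terms from the listing.
# Assumption: nesting is inferred from the listing's order; real GO terms
# may have more than one parent, which this simple map does not model.
PARENT = {
    "GO:0005488": "GO:0003674",  # binding -> molecular_function
    "GO:0030528": "GO:0003674",  # transcription regulator activity -> molecular_function
    "GO:0003700": "GO:0030528",  # transcription factor activity -> transcription regulator activity
    "GO:0016564": "GO:0030528",  # transcriptional repressor activity -> transcription regulator activity
}

NAME = {
    "GO:0003674": "molecular_function",
    "GO:0005488": "binding",
    "GO:0030528": "transcription regulator activity",
    "GO:0003700": "transcription factor activity",
    "GO:0016564": "transcriptional repressor activity",
}


def path_to_root(term):
    """Follow parent links from a term up to its top-level category."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


if __name__ == "__main__":
    print(" -> ".join(NAME[t] for t in path_to_root("GO:0003700")))
```

A lookup of this kind is what lets an annotation made with a specific term also count towards its more general ancestors.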
Life Sciences Informatics in Birkbeck Comp Sci
Example Research Areas:
• Evolutionary analysis: reconstruction of evolutionary events from genomic and related data
• Integration of life sciences data: data and knowledge management techniques to support the integration, analysis, mining and visualisation of life sciences data
• Medical informatics: data integration, semantic modelling, fuzzy inferencing and data mining techniques to support virtual integration of medical records
For full details of topics, people, projects, publications…
http://www.dcs.bbk.ac.uk/research/bioinf
Evolutionary Analysis
• Annotating evolutionary trees
Mathematical models and algorithms addressing problems such as:
• Given an evolutionary species tree and a set of trees built on the same extant species according to similarity between individual gene families, find a mapping of the individual gene trees onto the species tree exhibiting gene duplications and losses to account for the differences
• Given an evolutionary species tree and patterns of presence/absence of genes in the extant species, compute evolutionary scenarios of gene gain, horizontal transfer and loss events to account for the patterns
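The first problem above, mapping a gene tree onto a species tree, can be illustrated with the textbook least-common-ancestor (LCA) reconciliation. This is a minimal sketch and not necessarily the method used in the work described here: each gene tree node is mapped to the lowest species tree node covering the species below it, and a duplication is inferred wherever a node maps to the same place as one of its children. Trees are nested tuples with species names at the leaves, gene trees are assumed binary, losses are not counted, and the species names `A`, `B`, `C` and all function names are hypothetical.

```python
def leaves(node):
    """Set of species names below a tree node (a leaf is a string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def index_species_tree(tree):
    """Map each species tree node (identified by its leaf set) to its parent and depth."""
    parent, depth = {}, {}

    def walk(node, par, d):
        key = leaves(node)
        parent[key], depth[key] = par, d
        if isinstance(node, tuple):
            for child in node:
                walk(child, key, d + 1)

    walk(tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species tree nodes, found by climbing parents."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def reconcile(gene_tree, parent, depth):
    """Return (species node the gene tree maps to, number of inferred duplications)."""
    if isinstance(gene_tree, str):
        return frozenset([gene_tree]), 0
    left, right = gene_tree  # assumption: binary gene trees
    m_left, d_left = reconcile(left, parent, depth)
    m_right, d_right = reconcile(right, parent, depth)
    m = lca(m_left, m_right, parent, depth)
    # A node mapping to the same species node as a child implies a duplication.
    dup = 1 if m in (m_left, m_right) else 0
    return m, d_left + d_right + dup


if __name__ == "__main__":
    species_tree = (("A", "B"), "C")
    parent, depth = index_species_tree(species_tree)
    for gene_tree in [(("A", "B"), "C"), (("A", "C"), "B")]:
        print(gene_tree, reconcile(gene_tree, parent, depth)[1])
```

A gene tree that agrees with the species tree needs no duplications, while the discordant tree `(("A", "C"), "B")` is explained by one duplication at the root, with losses elsewhere that this sketch leaves uncounted.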
Evolutionary Analysis
• Applied to the analysis of evolutionary gain and loss of functions in herpesvirus genomes
Reconstructed history of HPF161 Host–virus interaction
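To give a feel for reconstructing a gain and loss history from presence/absence patterns, here is a minimal sketch using Fitch parsimony on a single gene. This is a standard, much simpler technique swapped in for illustration, not the method behind the reconstruction shown: it only minimises the number of presence/absence changes on the tree and does not model horizontal transfer. The tree, the species names `A` to `D` and the function name `fitch` are hypothetical.

```python
def fitch(tree, present):
    """Bottom-up Fitch pass for one gene.

    tree: nested 2-tuples with species names at the leaves.
    present: dict mapping species name -> True/False (gene present or absent).
    Returns (set of candidate states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {present[tree]}, 0
    left, right = tree
    states_l, cost_l = fitch(left, present)
    states_r, cost_r = fitch(right, present)
    common = states_l & states_r
    if common:
        # The children agree on at least one state: no extra change is needed here.
        return common, cost_l + cost_r
    # The children disagree: one gain or loss must lie on a branch below this node.
    return states_l | states_r, cost_l + cost_r + 1


if __name__ == "__main__":
    species_tree = ((("A", "B"), "C"), "D")
    pattern = {"A": True, "B": True, "C": False, "D": False}
    print(fitch(species_tree, pattern))
```

For this pattern the root is reconstructed as lacking the gene with a single change, which corresponds to one gain on the branch leading to `A` and `B`.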
Integration of Life Sciences Data
• Integrating transcriptomics and structural data to reveal protein functions: BioMap
• A data warehouse to support analysis and mining integrating data including microarray gene expression data, protein structure data, CATH structural classification data, functional data including Gene Ontology, KEGG (Gene, Orthology, Genome, Pathway…)
• Creation of a pilot Grid for proteomics resources: ISPIDER
• An integrated platform of proteomics resources supporting techniques for distributed querying, workflows and data analysis tasks in a Grid
• Research approach based on semantic mapping services using the techniques developed in the AutoMed project http://www.doc.ic.ac.uk/automed/
Integrated Proteomics Informatics Platform - Architecture
[Architecture diagram; the labels recoverable from it are listed below]
• Layers: ISPIDER Proteomics Clients; ISPIDER Proteomics Grid Infrastructure; Existing E-Science Infrastructure; Public Proteomic Resources
• Clients: Vanilla Query Client; 2D Gel Visualisation Client + Aspergil. Extensions + Phosph. Extensions; PPI Validation + Analysis Client; Protein ID Client
• Services and components: myGrid Ontology Services; myGrid DQP; DAS; AutoMed; myGrid Workflows; Proteome Request Handler; Instance Ident/Mapping Services; Proteomic Ontologies/Vocabularies; Source Selection Services; Data Cleaning Services
• Resources, each with a WS interface, shown as Existing Resources and ISPIDER Resources: PS, PF, TR, GS, FA, PPI, PID, PRIDE, PEDRo, Phos
• Work packages: WP1 to WP6
KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package
Medical Informatics
• ASsociation Studies assisted by Inference and Semantic Technologies – ASSIST
• 10 E.U. partners: U.K., Greece, Belgium, Germany, Spain
The main objectives of ASSIST are to:
• Allow researchers to combine phenotypic and genotypic data
• Unify multiple patient records repositories
• Automate the process of evaluating medical hypotheses
• Provide an inference engine capable of statistically evaluating medical data
• Offer expressive, graphical tools for medical researchers to post their queries.
Medical Informatics
[Architecture diagram; the labels recoverable from it are listed below]
• Sources: AUTh (Greece), Charite (Germany), Ghent (Belgium), each with its own relational schema, accessed through AutoMed Wrappers (JDBC/Grid Services)
• AutoMed Metadata Repository; AutoMed transformation pathways; AutoMed Query Processor
• Virtual Integrated Relational Schema; Virtual Integrated OWL Schema
• Medical Rules Repository (First-Order Logic)
• Web Interface
• Query flow labels: SeRQL query, SeRQL result, IQL query, expanded IQL query
• ASSIST query processing builds on AutoMed technology with integrated ontology and inference rules capabilities
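The idea of a virtual integrated schema over several hospital databases can be sketched as follows. This is an illustration only and does not use or reproduce the AutoMed API, its transformation pathways or IQL: each source keeps its own schema, a per-source mapping translates rows into a shared virtual schema on demand, and queries are posed against the virtual schema without copying the data. All column names, rows and function names are invented for the example.

```python
# Hypothetical source data: each site uses its own column names and encodings.
SOURCES = {
    "AUTh": [{"pid": 1, "marker": "pos"}, {"pid": 2, "marker": "neg"}],
    "Charite": [{"patient": "c7", "marker_status": False}],
    "Ghent": [{"id": 9, "result": 1}],
}

# Hypothetical mappings from each source schema to the virtual integrated schema
# (columns: patient_id, marker_positive).
MAPPINGS = {
    "AUTh": lambda r: {
        "patient_id": f"AUTh:{r['pid']}",
        "marker_positive": r["marker"] == "pos",
    },
    "Charite": lambda r: {
        "patient_id": f"Charite:{r['patient']}",
        "marker_positive": bool(r["marker_status"]),
    },
    "Ghent": lambda r: {
        "patient_id": f"Ghent:{r['id']}",
        "marker_positive": r["result"] == 1,
    },
}


def virtual_rows():
    """Yield rows of the virtual schema, translating source rows on the fly."""
    for source, rows in SOURCES.items():
        for row in rows:
            yield MAPPINGS[source](row)


def query(predicate):
    """Evaluate a predicate over the virtual schema; nothing is materialised beforehand."""
    return [row for row in virtual_rows() if predicate(row)]


if __name__ == "__main__":
    for row in query(lambda r: r["marker_positive"]):
        print(row["patient_id"])
```

Keeping the mappings separate from the data means a new site can be added by supplying one more mapping, which is the general appeal of virtual over materialised integration.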
Making Sense of Life Sciences Data
• Some areas of on-going and future research
• automated reasoning using ontologies and wider domain knowledge
• evolutionary reconstruction exploiting domain knowledge
• analysis and mining of heterogeneous distributed resources
• metrics for data integration quality
• The overarching motivation is the potential to make scientific discoveries that can improve quality of life
Some Collaborators
Funding
Further Information
http://www.dcs.bbk.ac.uk/research/bioinf