View
527
Download
1
Category
Preview:
DESCRIPTION
Introduction to Bioinformatics. Presented on May 9, 2013 at the Hospital La Fe in Valencia.
Citation preview
Bioinformatics in medicinetoday
David Montanerdmontaner@cipf.es
Centro de Investigación Príncipe FelipeInstitute of Computational Genomics
9 May 2013in Valencia
David Montaner Bioinformatics in medicine 1/26
Genomics
“Progress in science depends on new techniques, newdiscoveries and new ideas, probably in that order.”Sydney Brenner, 1980
Microarray devices and high-throughput sequencing allow usmeasuring thousands or millions of genomic characteristics.
David Montaner Bioinformatics in medicine 2/26
Genomics vs. genetics
Genetics:• Single genes are responsible for biological changes.• one gene→ one hypothesis→ one p-value→ conclusions
Genomics:• Genes or genomic features act together to produce
biological changes.• many genes→ many hypothesis→ many p-value→→ more data analysis
• Computational support is needed even for drawingconclusions
David Montaner Bioinformatics in medicine 3/26
Genomic numbers
Microarray:• 30.000 genes• 2 million SNPs• 100 Mb
Measured features:• genes, isoforms• SNPs, Polymorphisms• IN-DELS• loss of heterozygosity• methylation• copy number alterations
NGS:• 30.000 genes• 30.000 transcripts• 20 million SNPs• 10-100 GB
Registered information:• Genomic characteristics:
position, chromosome ...• Biological function• Disease association• miRNA targets
David Montaner Bioinformatics in medicine 4/26
Genomic databases
Nucleic Acid Research lists +1500 online databases!http://www.oxfordjournals.org/nar/database/c
• Many different databases for each category, which should Iuse?
• No standards: different IDs, methods, servers, formats, ...• Lack of international initiatives, many local and small
databases• Different gene IDs, more than 50• In vivo vs in silico databases
David Montaner Bioinformatics in medicine 5/26
Biological databases (Wikipedia)1 Primary nucleotide
sequence databases2 Metadatabases3 Genome databases4 Protein sequence
databases5 Proteomics databases6 Protein structure
databases7 Protein model databases8 RNA databases9 Carbohydrate structure
databases10 Protein-protein interactions
11 Signal transductionpathway databases
12 Metabolic pathwaydatabases
13 Experimental datarepositories (MicroarraysNGS, Sanger)
14 Exosomal databases15 Mathematical model
databases16 PCR / real time PCR
primer databases17 Specialized databases18 Taxonomic databases19 Wiki-style databasesDavid Montaner Bioinformatics in medicine 6/26
Primary nucleotide sequencedatabases
Contain any kind of nucleotide sequences, form genes togenomes.
The International Nucleotide Sequence Database (INSD)Collaboration:
• GenBankNational Center for Biotechnology Information (NCBI)
• European Nucleotide Archive (ENA)European Bioinformatics Institute (EBI)
• DNA Data Bank of Japan (DDBJ)
David Montaner Bioinformatics in medicine 7/26
GenBankPrimary nucleotide sequence databases
• available on the NCBI ftp site:http://www.ncbi.nlm.nih.gov/Ftp/
• A new release is made every two months.• 3 types of entries:
• CoreNucleotide (the main collection)• dbEST (Expressed Sequence Tags)• dbGSS (Genome Survey Sequences)
Access:• Search for sequence identifiers using Entrez Nucleotide:
http://www.ncbi.nlm.nih.gov/nucleotide/• Align GenBank sequences to a query sequence using
BLAST (Basic Local Alignment Search Tool).http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Several other e-utilities (see book)See an example of a GenBank record.
David Montaner Bioinformatics in medicine 8/26
Metadatabases
• Collect and organize data from primary nucleotidesequence databases and may other resources.
• Make the information available in a convenient format andprovide data handling resources: web pages, applicationprogramming interface (API) …
• Focus on particular species, diseases …
Examples• Entrez: searches through almost all NCBI resources.
http://www.ncbi.nlm.nih.gov/sites/gquery• GeneCards: provides genomic, proteomic, transcriptomic,
genetic and functional information for human genes (knownand predicted )http://www.genecards.org/
David Montaner Bioinformatics in medicine 9/26
EntrezMetadatabases
• Searches through almost all NCBI resources.• Entrez search page: http://www.ncbi.nlm.nih.gov/sites/gquery• queries can be saved if you have a a MyNCBI account
http://www.ncbi.nlm.nih.gov/
David Montaner Bioinformatics in medicine 10/26
Genome databasesCollect genome sequences and annotation (specification aboutgenes) for particular organisms, and try to improve them:
• Data curation.• Complete missing information using insilico methods.• Generate new relational organization.• Complement feature IDs.• Provide “easy” access, visualization …
Examples
• Ensembl: automatic annotation on selected eukaryotegenomes.
• UCSC Genome Browser: reference sequence and workingdraft assemblies for a large collection of genomes
• Wormbase: genome of the model organism C.elegans.
David Montaner Bioinformatics in medicine 11/26
EnsemblGenome databases
• Ensembl is a joint project between European BioinformaticsInstitute (EBI) the European Molecular Biology Laboratory(EMBL) and the Wellcome Trust Sanger Institute.
• Develop a software system which produces and maintainsautomatic annotation on selected vertebrate andeukaryote genomes.
• http://www.ensembl.org
David Montaner Bioinformatics in medicine 12/26
UCSC Genome BrowserGenome databases
• UCSC: University of California, Santa Cruz.• This site contains the reference sequence and working
draft assemblies for a large collection of genomes.• http://genome.ucsc.edu/
David Montaner Bioinformatics in medicine 13/26
Protein sequence databases
• Most times proteins are the final unit of interest to research.• There is a direct conversion from DNA/RNA sequences to
protein sequences.• Gene IDs and protein IDs are equivalently used by
researchers (biologists not bioinformaticians …)
Examples
• UniProt: Universal Protein Resource (EBI)• Swiss-Prot (Swiss Institute of Bioinformatics)• InterPro Classifies proteins into families and predicts the
presence of domains and sites.• Pfam Protein families database of alignments and HMMs
(Sanger Institute)
David Montaner Bioinformatics in medicine 14/26
RNA databases
• Contain information about RNA molecules.• Most of them regarding gene regulatory factors. (Gene
information is usually in other repositories).
Examples
• mirBase: microRNAshttp://www.mirbase.org/
• TRANSFAC: transcription factors in eukaryote (Proprietarydatabase).
• JASPAR: transcription factor binding sites for eukaryote(Open access, curated, non-redundant).http://jaspar.genereg.net/
David Montaner Bioinformatics in medicine 15/26
Protein-protein interactions
• Proteins are the main functional units.• But they do not work in isolation.• Pretty useless at the moment but promising in the future …• some information is experimental, but most of it is
generated insilico.
Examples• IntAct: protein–small molecule
and protein–nucleic acidinteractions.
• BIND: Biomolecular InteractionNetwork Database.
David Montaner Bioinformatics in medicine 16/26
Signal transduction pathwaydatabases& Metabolic pathway databases
• Information about how genes (or proteins) interact amongthem.
• not only physical interactions …
Examples• Reactome: free online database of biological pathways.
http://www.reactome.org• KEGG: Kyoto Encyclopedia of Genes and Genomes.
Metabolic pathways.http://www.genome.jp/kegg/pathway.html
David Montaner Bioinformatics in medicine 17/26
KEGGMetabolic pathway databases
David Montaner Bioinformatics in medicine 18/26
Experimental data repositories
Contain Microarray, NGS, Sanger, and other experimental highthroughput data.
• GEO: Gene Expression Omnibus (NCBI)http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: database of functional genomicsexperiments including (EBI)http://www.ebi.ac.uk/arrayexpress/
• The Cancer Genome Atlas (TCGA): Data on differentcancer related tissues.http://cancergenome.nih.gov/
David Montaner Bioinformatics in medicine 19/26
Bioinformatics
Training• Biology 1/3• Statistics 1/3• Computer science 1/3←−
Efficiently combine:• Experimental information• Database registered knowledge
Time and resources:• As in the wet lab
David Montaner Bioinformatics in medicine 20/26
Example
David Montaner Bioinformatics in medicine 21/26
Example I
Autistic children
1 (microarray) NGS data processing• data quality control, filtering...• map against reference genome• CNV calling
2 CNV filtering• just 75 rare de novo CNV events (not registered in
databases)• filter out the long ones• keep the ones that contain genes
David Montaner Bioinformatics in medicine 22/26
Example II
3 move to the gene level• 47 loci in total affecting 433 human genes
4 Building the background likelihood network• GO annotations• KEGG pathways• InterPro domains• protein-proteins interactions. Databases: BIND, BioGRID,
DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS• sequence homology between the gene pair (BLAST)
David Montaner Bioinformatics in medicine 23/26
Example III
5 Search for high scoring clusters affected by CNVs6 Evaluating significance of cluster scores:
10.000 simulations
David Montaner Bioinformatics in medicine 24/26
Example IV7 Functional characterization of the identified network
8 And, finally, draw conclusions
David Montaner Bioinformatics in medicine 25/26
Questions
Thanks
David Montaner Bioinformatics in medicine 26/26
Recommended