Bioinformatics in medicine today. David Montaner ([email protected]), Centro de Investigación Príncipe Felipe, Institute of Computational Genomics. 9 May 2013 in Valencia.

Bioinformatics Introduction


DESCRIPTION

Introduction to Bioinformatics. Presented on May 9, 2013 at the Hospital La Fe in Valencia.


Page 1: Bioinformatics Introduction

Bioinformatics in medicine today

David Montaner
[email protected]

Centro de Investigación Príncipe Felipe
Institute of Computational Genomics

9 May 2013 in Valencia

David Montaner Bioinformatics in medicine 1/26

Page 2: Bioinformatics Introduction

Genomics

“Progress in science depends on new techniques, new discoveries and new ideas, probably in that order.” Sydney Brenner, 1980

Microarray devices and high-throughput sequencing allow us to measure thousands or millions of genomic characteristics.


Page 3: Bioinformatics Introduction

Genomics vs. genetics

Genetics:
• Single genes are responsible for biological changes.
• one gene → one hypothesis → one p-value → conclusions

Genomics:
• Genes or genomic features act together to produce biological changes.
• many genes → many hypotheses → many p-values → more data analysis
• Computational support is needed even for drawing conclusions.
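To make the "many p-values" point concrete, here is a minimal sketch of one common multiple-testing adjustment, the Benjamini-Hochberg procedure. The slides do not name a specific method, so the choice of procedure, the function name and the five example p-values are all assumptions made for illustration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes (illustrative numbers only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
```

With these made-up numbers, four genes have a raw p-value below 0.05, but only the first two remain below 0.05 after adjustment. This is the kind of extra analysis step that one gene and one p-value never required.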


Page 4: Bioinformatics Introduction

Genomic numbers

Microarray:
• 30,000 genes
• 2 million SNPs
• 100 Mb

NGS:
• 30,000 genes
• 30,000 transcripts
• 20 million SNPs
• 10-100 GB

Measured features:
• genes, isoforms
• SNPs, polymorphisms
• indels
• loss of heterozygosity
• methylation
• copy number alterations

Registered information:
• Genomic characteristics: position, chromosome ...
• Biological function
• Disease association
• miRNA targets


Page 5: Bioinformatics Introduction

Genomic databases

Nucleic Acids Research lists more than 1,500 online databases!
http://www.oxfordjournals.org/nar/database/c

• Many different databases for each category: which should I use?
• No standards: different IDs, methods, servers, formats, ...
• Lack of international initiatives; many local and small databases
• Different gene IDs, more than 50
• In vivo vs. in silico databases
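The ID problem above usually ends up as a cross-reference table that is indexed by whichever namespace a dataset happens to use. A minimal sketch, assuming a tiny hand-written table: the helper names `build_index` and `convert` are hypothetical, and in practice the table would come from a metadatabase export rather than being typed in.

```python
from typing import Dict, List, Optional

# Cross-reference table: one row per gene, one column per ID namespace.
# The identifiers are the public ones for the human genes TP53 and BRCA1.
XREF: List[Dict[str, str]] = [
    {"symbol": "TP53", "entrez": "7157",
     "ensembl": "ENSG00000141510", "uniprot": "P04637"},
    {"symbol": "BRCA1", "entrez": "672",
     "ensembl": "ENSG00000012048", "uniprot": "P38398"},
]


def build_index(rows: List[Dict[str, str]], source: str) -> Dict[str, Dict[str, str]]:
    """Index the table by one namespace so lookups are a dictionary access."""
    return {row[source]: row for row in rows}


def convert(gene_id: str, source: str, target: str) -> Optional[str]:
    """Translate an ID from one namespace to another; None if unknown."""
    row = build_index(XREF, source).get(gene_id)
    return row[target] if row is not None else None


ensembl_id = convert("7157", "entrez", "ensembl")
symbol = convert("P38398", "uniprot", "symbol")
missing = convert("0", "entrez", "symbol")
```

Returning `None` for unknown IDs, rather than raising, keeps the unmapped identifiers visible, which matters because silently dropped genes are a common source of error when merging databases.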


Page 6: Bioinformatics Introduction

Biological databases (Wikipedia)

1 Primary nucleotide sequence databases
2 Metadatabases
3 Genome databases
4 Protein sequence databases
5 Proteomics databases
6 Protein structure databases
7 Protein model databases
8 RNA databases
9 Carbohydrate structure databases
10 Protein-protein interactions
11 Signal transduction pathway databases
12 Metabolic pathway databases
13 Experimental data repositories (Microarrays, NGS, Sanger)
14 Exosomal databases
15 Mathematical model databases
16 PCR / real time PCR primer databases
17 Specialized databases
18 Taxonomic databases
19 Wiki-style databases

Page 7: Bioinformatics Introduction

Primary nucleotide sequence databases

Contain any kind of nucleotide sequences, from genes to genomes.

The International Nucleotide Sequence Database (INSD) Collaboration:
• GenBank: National Center for Biotechnology Information (NCBI)
• European Nucleotide Archive (ENA): European Bioinformatics Institute (EBI)
• DNA Data Bank of Japan (DDBJ)


Page 8: Bioinformatics Introduction

GenBank
Primary nucleotide sequence databases

• Available on the NCBI FTP site: http://www.ncbi.nlm.nih.gov/Ftp/
• A new release is made every two months.
• 3 types of entries:
  • CoreNucleotide (the main collection)
  • dbEST (Expressed Sequence Tags)
  • dbGSS (Genome Survey Sequences)

Access:
• Search for sequence identifiers using Entrez Nucleotide: http://www.ncbi.nlm.nih.gov/nucleotide/
• Align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool): http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Several other e-utilities (see book)

See an example of a GenBank record.
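As an illustration of the e-utilities mentioned above, the sketch below builds the URL that asks NCBI's `efetch` service for one record in GenBank flat-file format. The helper name `genbank_record_url` and the example accession (NM_000546, a human TP53 transcript) are choices made here, not part of the slides, and the code only constructs the URL without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genbank_record_url(accession: str) -> str:
    """Build an efetch URL for one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",    # the Entrez nucleotide collection
        "id": accession,    # accession or GI number
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = genbank_record_url("NM_000546")
```

The record itself could then be downloaded with `urllib.request.urlopen(url)`; that step is left out so the example runs without network access.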


Page 9: Bioinformatics Introduction

Metadatabases

• Collect and organize data from primary nucleotide sequence databases and many other resources.
• Make the information available in a convenient format and provide data handling resources: web pages, application programming interface (API) …
• Focus on particular species, diseases …

Examples
• Entrez: searches through almost all NCBI resources. http://www.ncbi.nlm.nih.gov/sites/gquery
• GeneCards: provides genomic, proteomic, transcriptomic, genetic and functional information for human genes (known and predicted). http://www.genecards.org/


Page 10: Bioinformatics Introduction

Entrez
Metadatabases

• Searches through almost all NCBI resources.
• Entrez search page: http://www.ncbi.nlm.nih.gov/sites/gquery
• Queries can be saved if you have a MyNCBI account: http://www.ncbi.nlm.nih.gov/


Page 11: Bioinformatics Introduction

Genome databases

Collect genome sequences and annotation (specification about genes) for particular organisms, and try to improve them:
• Data curation.
• Complete missing information using in silico methods.
• Generate new relational organization.
• Complement feature IDs.
• Provide “easy” access, visualization …

Examples
• Ensembl: automatic annotation on selected eukaryote genomes.
• UCSC Genome Browser: reference sequence and working draft assemblies for a large collection of genomes.
• WormBase: genome of the model organism C. elegans.


Page 12: Bioinformatics Introduction

Ensembl
Genome databases

• Ensembl is a joint project between the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute.
• It develops a software system which produces and maintains automatic annotation on selected vertebrate and eukaryote genomes.
• http://www.ensembl.org


Page 13: Bioinformatics Introduction

UCSC Genome Browser
Genome databases

• UCSC: University of California, Santa Cruz.
• This site contains the reference sequence and working draft assemblies for a large collection of genomes.
• http://genome.ucsc.edu/


Page 14: Bioinformatics Introduction

Protein sequence databases

• Most often, proteins are the final unit of interest in research.
• There is a direct conversion from DNA/RNA sequences to protein sequences.
• Gene IDs and protein IDs are used interchangeably by researchers (biologists, not bioinformaticians …)
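The direct conversion from nucleotide to protein sequence mentioned above can be sketched with the standard genetic code. This is a minimal illustration, not a replacement for a sequence library: it assumes the coding sequence starts in frame at position 0, stops at the first stop codon, and contains only A, C, G, T or U (any other letter raises a `KeyError`).

```python
# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its one-letter amino acid ("*" means stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence: str) -> str:
    """Translate a DNA or RNA coding sequence up to the first stop codon."""
    seq = sequence.upper().replace("U", "T")  # treat RNA as DNA
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCTAA")
```

Here `ATG GCC TAA` reads as methionine, alanine, stop, so `peptide` is `"MA"`. The conversion only works in one direction: because several codons encode the same amino acid, the nucleotide sequence cannot be recovered from the protein.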

Examples

• UniProt: Universal Protein Resource (EBI)
• Swiss-Prot (Swiss Institute of Bioinformatics)
• InterPro: classifies proteins into families and predicts the presence of domains and sites.
• Pfam: protein families database of alignments and HMMs (Sanger Institute)


Page 15: Bioinformatics Introduction

RNA databases

• Contain information about RNA molecules.
• Most of them concern gene regulatory factors. (Gene information is usually in other repositories.)

Examples
• miRBase: microRNAs. http://www.mirbase.org/
• TRANSFAC: transcription factors in eukaryotes (proprietary database).
• JASPAR: transcription factor binding sites for eukaryotes (open access, curated, non-redundant). http://jaspar.genereg.net/


Page 16: Bioinformatics Introduction

Protein-protein interactions

• Proteins are the main functional units.
• But they do not work in isolation.
• Pretty useless at the moment but promising in the future …
• Some information is experimental, but most of it is generated in silico.

Examples
• IntAct: protein–small molecule and protein–nucleic acid interactions.
• BIND: Biomolecular Interaction Network Database.


Page 17: Bioinformatics Introduction

Signal transduction pathway databases & Metabolic pathway databases

• Information about how genes (or proteins) interact with each other.
• Not only physical interactions …

Examples
• Reactome: free online database of biological pathways. http://www.reactome.org
• KEGG: Kyoto Encyclopedia of Genes and Genomes. Metabolic pathways. http://www.genome.jp/kegg/pathway.html


Page 18: Bioinformatics Introduction

KEGG
Metabolic pathway databases


Page 19: Bioinformatics Introduction

Experimental data repositories

Contain microarray, NGS, Sanger, and other experimental high-throughput data.

• GEO: Gene Expression Omnibus (NCBI). http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: database of functional genomics experiments (EBI). http://www.ebi.ac.uk/arrayexpress/
• The Cancer Genome Atlas (TCGA): data on different cancer-related tissues. http://cancergenome.nih.gov/


Page 20: Bioinformatics Introduction

Bioinformatics

Training
• Biology 1/3
• Statistics 1/3
• Computer science 1/3 ←

Efficiently combine:
• Experimental information
• Database-registered knowledge

Time and resources:
• As in the wet lab


Page 21: Bioinformatics Introduction

Example


Page 22: Bioinformatics Introduction

Example I

Autistic children

1 (microarray) NGS data processing
  • data quality control, filtering...
  • map against reference genome
  • CNV calling

2 CNV filtering
  • just 75 rare de novo CNV events (not registered in databases)
  • filter out the long ones
  • keep the ones that contain genes
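The three filtering rules in step 2 can be sketched as a single pass over the called CNVs. The slides give neither the data format nor the length cutoff, so the `CNV` record, the `MAX_LENGTH` value and the example events (including the gene names) are hypothetical and serve only to show the logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CNV:
    chrom: str
    start: int
    end: int
    de_novo: bool        # absent from both parents
    in_databases: bool   # already registered as a known variant
    genes: List[str] = field(default_factory=list)

    @property
    def length(self) -> int:
        return self.end - self.start


# Hypothetical cutoff; the slides only say "filter out the long ones".
MAX_LENGTH = 5_000_000


def filter_cnvs(cnvs: List[CNV], max_length: int = MAX_LENGTH) -> List[CNV]:
    """Keep rare de novo CNVs that are not too long and contain genes."""
    return [
        c for c in cnvs
        if c.de_novo and not c.in_databases
        and c.length <= max_length
        and c.genes
    ]


# Made-up events, one per reason for being kept or dropped.
cnvs = [
    CNV("chr1", 1_000_000, 1_200_000, True, False, ["GENE_A", "GENE_B"]),
    CNV("chr2", 5_000_000, 5_100_000, False, False, ["GENE_C"]),  # inherited
    CNV("chr3", 2_000_000, 2_050_000, True, True, ["GENE_D"]),    # known
    CNV("chr4", 1_000_000, 9_000_000, True, False, ["GENE_E"]),   # too long
    CNV("chr5", 3_000_000, 3_010_000, True, False, []),           # no genes
]
kept = filter_cnvs(cnvs)
```

Only the `chr1` event passes all the rules in this toy input.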


Page 23: Bioinformatics Introduction

Example II

3 Move to the gene level
  • 47 loci in total affecting 433 human genes

4 Building the background likelihood network
  • GO annotations
  • KEGG pathways
  • InterPro domains
  • protein-protein interactions. Databases: BIND, BioGRID, DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS
  • sequence homology between the gene pair (BLAST)
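Step 3 above, moving from CNV loci to the genes they affect, is essentially an interval-overlap query against gene coordinates from a genome database. A minimal sketch, assuming half-open coordinates and a small in-memory gene table: the `Interval` type, the function names and all coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive (half-open interval)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_in_loci(loci: List[Interval],
                  gene_coords: Dict[str, Interval]) -> Set[str]:
    """Return the genes whose coordinates overlap any of the loci."""
    return {
        gene
        for gene, coords in gene_coords.items()
        if any(overlaps(locus, coords) for locus in loci)
    }


# Made-up loci and gene coordinates for illustration.
loci = [Interval("chr1", 1000, 5000), Interval("chr2", 200, 900)]
gene_coords = {
    "GENE_A": Interval("chr1", 4000, 6000),  # partially inside the first locus
    "GENE_B": Interval("chr1", 5000, 7000),  # starts where the locus ends
    "GENE_C": Interval("chr2", 100, 250),    # overlaps the second locus
    "GENE_D": Interval("chr3", 1000, 2000),  # different chromosome
}
affected = genes_in_loci(loci, gene_coords)
```

This brute-force comparison is fine for tens of loci, as in the example on the slide; for genome-wide queries an interval tree or a sorted sweep would be the usual choice.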


Page 24: Bioinformatics Introduction

Example III

5 Search for high-scoring clusters affected by CNVs

6 Evaluating significance of cluster scores: 10,000 simulations
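Step 6 can be illustrated with a generic simulation-based (empirical) p-value: score many random gene sets and count how often they do at least as well as the observed cluster. The slides do not say how the null gene sets were simulated, so drawing random sets of the same size is an assumption here, as are the function name, the toy scoring function and the gene weights.

```python
import random
from typing import Callable, Sequence


def empirical_p_value(observed_score: float,
                      all_genes: Sequence[str],
                      cluster_size: int,
                      score: Callable[[Sequence[str]], float],
                      n_simulations: int = 10_000,
                      seed: int = 0) -> float:
    """Fraction of random gene sets scoring at least the observed value."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    hits = 0
    for _ in range(n_simulations):
        sample = rng.sample(all_genes, cluster_size)
        if score(sample) >= observed_score:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_simulations + 1)


# Toy set-up: 50 hypothetical genes, each with a made-up weight.
weights = {f"g{i}": float(i) for i in range(50)}
genes = list(weights)


def toy_score(gene_set: Sequence[str]) -> float:
    return sum(weights[g] for g in gene_set)


# An observed score that no random set of 5 genes can reach (maximum is 235).
p_extreme = empirical_p_value(236.0, genes, 5, toy_score)
# An observed score that every random set reaches.
p_trivial = empirical_p_value(0.0, genes, 5, toy_score)
```

With 10,000 simulations the smallest reportable p-value is 1/10,001, which is why the number of simulations limits how much significance can be claimed.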


Page 25: Bioinformatics Introduction

Example IV

7 Functional characterization of the identified network

8 And, finally, draw conclusions
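One common way to carry out a functional characterization like step 7 is an over-representation test: given a set of selected genes, ask whether a pathway or annotation term appears more often than chance would predict. The slides do not state which method was used, so the hypergeometric test below is an assumed stand-in, and the function name and counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe: int, n_annotated: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Small made-up example: 10 genes in total, 5 annotated to a pathway,
# 5 selected, and all 5 selected genes are in the pathway.
p_all_in = hypergeom_enrichment_p(10, 5, 5, 5)
```

In this toy case there is exactly one way out of 252 to pick the five annotated genes, so `p_all_in` equals 1/252. When many terms are tested, the resulting p-values need a multiple-testing adjustment before drawing conclusions.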


Page 26: Bioinformatics Introduction

Questions

Thanks
