Basic bioinformatics concepts, databases and tools Introduction to the training and Sequence databases Joachim Jacob http://www.bits.vib.be Updated 22 February 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf

BITS: Basics of sequence databases

DESCRIPTION

Module 1: Sequence databases. Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

Page 1: BITS: Basics of sequence databases

Basic bioinformatics concepts, databases and tools

Introduction to the training

and Sequence databases

Joachim Jacob
http://www.bits.vib.be

Updated 22 February 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf

Page 2: BITS: Basics of sequence databases

Scope

Introductory training to Bioinformatics

Exploring and understanding

databases and software

for everyday bioinformatics use

If there is any term which is unclear, please stop me and ask me!

Page 3: BITS: Basics of sequence databases

Bio

all data is derived from living samples

Informatics

that data is stored and analyzed in and with computers to obtain understanding

This is an extremely broad description, from which we will nevertheless extract common principles during the course

Bioinformatics ...

Page 4: BITS: Basics of sequence databases

Bioinformatics is present in every aspect of life sciences research

Page 13: BITS: Basics of sequence databases

Bioinformatics ...

Bio

- different types of living samples

Informatics

- storing and categorizing the information and making it easily accessible

- interpreting that information reliably

Page 14: BITS: Basics of sequence databases

Bioinformatics … and its companion

Bio

- different types of living samples

Informatics

- storing and categorizing the information and making it easily accessible

- interpreting that information reliably

Statistics

- large numbers, observational data

Page 15: BITS: Basics of sequence databases

The siblings of Bioinformatics

Based on the biological component extracted from life, the measured properties and the ultimate goal of the analysis, different sub-disciplines of bioinformatics exist.

DNA → Genomics
RNA → Transcriptomics
proteins → Proteomics
metabolites → Metabolomics

Other sub-disciplines: Epigenomics, Structural bioinformatics, Systems biology, Microbiomics, Interactomics, Metagenomics, Functional genomics, Comparative genomics

Page 16: BITS: Basics of sequence databases

Mere data is worth nothing

Data = symbols

Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata.

Knowledge: application of data and information; answers "how" questions

Understanding: appreciation of "why"

Wisdom

CGCTACGCATATCGCT

- Dasypus novemcinctus
- found in my garden
- part of genome
- sequenced in June 2010

This species seems to be related to my neighbor's pet, because it also has this sequence

Has the same mother

http://www.systems-thinking.org/dikw/dikw.htm

Page 17: BITS: Basics of sequence databases

Biology Computer Statistics

Bioinformatics research, as a specific branch on the boundary of life science, mathematics and computer science: the 'tool manufacturer'

Tools and approaches

Life sciences research as the major 'end user' of the bioinformatics tools and conclusions: the 'tool user'

data (?) → knowledge (!)

Page 18: BITS: Basics of sequence databases

This course is organised in several modules

Module 1: Sequence databases: what, where, how

Module 2: Sequence comparisons: searching, aligning

Module 3: Sequence analysis – domains in protein sequences and predicting functionality, standardisation and useful links

Module 4: Beyond sequences - additional important data sources

Module 5: Genome Browsers - integrating biological data and performing reproducible bioinformatics research in the Galaxy

Page 19: BITS: Basics of sequence databases

Overview of the crash course

Page 20: BITS: Basics of sequence databases

One tip for the future

Be prepared for change...

Information is fluid

So are bioinfo tools

Learn how to accommodate change

Major resources are more stable

Important concepts do not change often

Page 21: BITS: Basics of sequence databases

Module 1

Sequence databases

Page 22: BITS: Basics of sequence databases

Module 1: Sequence databases

Sequence databases store DNA and RNA sequences. In bioinformatics they are still by far the largest collections of biological data, and they are used by many subdisciplines of bioinformatics.

http://www.ebi.ac.uk/embl/Services/DBStats/

Page 23: BITS: Basics of sequence databases

... and growing

http://www.ebi.ac.uk/embl/Services/DBStats/

Page 24: BITS: Basics of sequence databases

Three major nucleotide databanks host primary sequence data

European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/
- Division EMBL-Bank (European Molecular Biology Laboratory) (single)
- Trace Archive
- SRA Archive

GenBank at NCBI - http://www.ncbi.nlm.nih.gov/
- maintained at NCBI (National Center for Biotechnology Information, USA)

DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/
- maintained at NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan)

Page 25: BITS: Basics of sequence databases

These databases are filled with NA sequence information by scientists and consortia

Submitters:
- Individual scientists
- Large-scale sequencing projects
- Patent offices

Submitters → primary sequence data → primary sequence database

Each primary sequence = one experiment

Basically, all 'source' nucleotide material

Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena

Page 26: BITS: Basics of sequence databases

Primary NA sequence can be produced by Sanger-based technologies or NGS technologies

Sample → DNA, or RNA → (reverse transcription, RT) → cDNA

Sanger: low output in number of sequences, high quality, 400-850 bp. Read profiles in .abi format, stored in the Trace Archive.

NGS: different technologies. Extremely high output rate, low quality, 30-600 bp. Reads in .fastq format, stored in the SRA.

These techniques can only read DNA strands, so RNA must first be converted to cDNA with reverse transcriptase before it is loaded onto the machines.

Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader/sanger_method_page.htm
NGS overview: http://seqanswers.com/forums/showthread.php?t=3561

Page 27: BITS: Basics of sequence databases

Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab

Overview major DNA reading technologies

Page 28: BITS: Basics of sequence databases

Entries in the primary sequence databases fall into two major categories

High-quality single submissions (Sanger)
- gene sequence (genomic, 'STD' data class)
- mRNA sequence (via cDNA, 'STD')
- BAC/YAC/cosmid sequences
- genome sequencing projects (contigs, assemblies, WGS)
- genome markers, STS (sequence tagged sites, unique short sequences from a genome)

Low-quality batch submissions
- Expressed Sequence Tags (EST)
- Genome Survey Sequences (GSS)
- high-throughput sequence data (e.g. NGS)

Source material: DNA, RNA, cDNA

http://www.ebi.ac.uk/ena/about/formats

Page 29: BITS: Basics of sequence databases

The batch submissions originate mostly from sequencing centers

[Figure: large-scale sequencing projects. Labels: chromosome, fragment, sequencing library, sequence reads, assemble sequence, annotation (cyp30, cyp309, insvcg343). Submission arrows appear at several stages, e.g. whole genome shotgun.]

Page 30: BITS: Basics of sequence databases

Each primary database stores its sequences and batch submissions in its own way...

- NCBI: ESTs are stored in dbEST (a separate database)
- ENA: ESTs are part of EMBL-Bank, in the 'EST' data class

The same holds for GSS (see dbGSS at NCBI)

EST: expressed sequence tag, often a partial sequence, derived from RNA and submitted in batch. Example:

sample → RNA → RNA-seq

>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC
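The EST batch above is written in FASTA format: a header line starting with '>' followed by the sequence. As a minimal sketch (the function name parse_fasta is ours, not part of any database tool), such a batch could be read like this:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped over several lines; join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The three example ESTs from the slide.
EST_BATCH = """>est1
ATCGACTAGCATCA
>est2
TCGACTAGCGACTA
>est3
CAGCATCATCGAC
"""

ests = parse_fasta(EST_BATCH)
for name, seq in ests:
    print(name, len(seq), seq)
```

For real work an established library is usually the safer choice; this sketch only shows how little structure the format has.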

Page 31: BITS: Basics of sequence databases

Batch submissions are marked and/or stored differently than single submissions

ENA structure (tier: type of information)
- ENA-Reads: sequencing and sampling information
- ENA-Assembly: assembly information
- ENA-Annotation: feature annotation

Classes:
1) EMBL-Bank
2) Trace Archive - raw data (capillary sequencing)
3) Sequence Read Archive - raw data (next-generation sequencing)

http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Labels on the figure: batch submissions; entries in data class EST are also batch submissions

Page 32: BITS: Basics of sequence databases

The 'normal' submissions are a minority in primary sequence databases

http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass

Page 33: BITS: Basics of sequence databases

Primary sequence dbs are synchronised and every sequence receives a unique identifier

The database maintainers assign each sequence a unique accession number (AC), shared between all three databases, in addition to their own internal ID number (info at NCBI). When a sequence is updated, the accession number is extended with a version number, e.g. .1 (see SVA).

Example of an accession number: BC010109.2

http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)

International Nucleotide Sequence Database Collaboration: ENA, GenBank + SRA, DDBJ
http://www.insdc.org/
- Synchronized daily
- Collaboration on features, taxonomy, ...
- All use the same accession IDs, project IDs and feature tables (see later)
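A versioned accession such as BC010109.2 consists of the stable accession number and a version suffix. The following is a small sketch of how the two parts could be separated; the helper name split_accession is hypothetical:

```python
def split_accession(versioned):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None when no version suffix is present.
    """
    accession, dot, version = versioned.partition(".")
    return accession, int(version) if dot else None


def same_record(a, b):
    """True when two identifiers point to the same record, ignoring the version."""
    return split_accession(a)[0] == split_accession(b)[0]


print(split_accession("BC010109.2"))            # accession and version
print(same_record("BC010109.1", "BC010109.2"))  # same record, different versions
```

Comparing on the accession alone is useful when you only care about which record is meant; keep the version when you need the exact sequence that was analysed.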

Page 34: BITS: Basics of sequence databases

Each sequence entry contains three categories of information

1. Info about the sequence, submitters and literature (metadata)

2. Annotations of the sequence (metadata related to the sequence)

3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)

• A sequence record is called 'annotated' when biological information is added and linked to a position in the sequence

• Annotations, also called 'features', are abbreviated as codes, which can be found in the Feature Tables

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

Page 35: BITS: Basics of sequence databases

This sequence information can be written in different formats

(Plain) text format, e.g. GenBank

1. General info

Official shared accession

GenBank-specific identifier (simply incremented with each new record)

There are many different identifiers, roughly as many as there are databases → conversion tools can translate identifiers where needed (see exercises)

*In humans, the HUGO Nomenclature Committee determines the official gene name

http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt

Page 36: BITS: Basics of sequence databases

2. Annotation

Labels on the record: feature name, qualifier name

db_xref = cross-references: links to related records in other databases (see later). The format is dbname:identifier
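Because a db_xref value follows the pattern dbname:identifier, it can be split mechanically. The sketch below assumes that only the first colon separates the two parts (some identifiers contain colons themselves); the identifier "GeneID:12345" is a made-up illustration.

```python
def parse_db_xref(value):
    """Split a db_xref value 'dbname:identifier' at the first colon."""
    dbname, sep, identifier = value.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError("not a dbname:identifier pair: %r" % value)
    return dbname, identifier


# "taxon:9606" is the NCBI Taxonomy cross-reference for human;
# "GeneID:12345" is a hypothetical identifier used only for illustration.
for xref in ["taxon:9606", "GeneID:12345"]:
    print(parse_db_xref(xref))
```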

Page 37: BITS: Basics of sequence databases

Each protein sequence also receives an accession number

3. Sequence

Page 38: BITS: Basics of sequence databases

Other sequence formats

FASTA (minimal metadata, basically only the sequence)

>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGCCGGACAGCTCTCGAGCGCATCGACGACGAC

ASN.1: Abstract Syntax Notation One

EMBL: all info as in GenBank format, referred to online as 'plain text'

XML

FASTQ: sequence info and base 'call' quality

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Important

'Format' has nothing to do with the program in which you save your file! You don't have a choice: it needs to be 'plain text format' (.txt, not a word-processor file such as .doc or .rtf that is opened with MS Word). WordPad is a good choice for this, provided you save the file as a text document. 'Format' in bioinformatics is all about how the information is structured and written down in the plain text file.
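To make the point about FASTQ concrete: each read takes four plain-text lines (header, sequence, '+', quality string), with one quality character per base. The sketch below uses a made-up read and assumes the common Phred+33 encoding, where the quality score is the character code minus 33; older files may use a different offset.

```python
def parse_fastq(text, offset=33):
    """Parse FASTQ text into (name, sequence, quality scores) tuples.

    Assumes unwrapped four-line records and a Phred offset of 33.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(ch) - offset for ch in qual]
        records.append((header[1:], seq, scores))
    return records


# A made-up read, for illustration only.
EXAMPLE = "@read1\nACGT\n+\nII5!\n"
reads = parse_fastq(EXAMPLE)
print(reads)
```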

Page 39: BITS: Basics of sequence databases

Degree of annotation differs between entries

ENA structure (tier: type of information)
- ENA-Reads: sequencing and sampling information
- ENA-Assembly: assembly information
- ENA-Annotation: feature annotation

Classes:
1) EMBL-Bank
2) Trace Archive - raw data (capillary sequencing)
3) Sequence Read Archive - raw data (next-generation sequencing)

http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Labels on the figure: good sequence annotations; experiment information is most important in batch submissions (e.g. which species, which technique, ...)

Batch-submitted sequences are poorly annotated; single submissions are annotated better

Page 40: BITS: Basics of sequence databases

SRA contains batch-submitted records, for which the experiment information matters most

Since the sequences are barely (or not) annotated, the experiment description is important: which machine, which organism, which tissue, which developmental stage, disease, treatment, …

Page 41: BITS: Basics of sequence databases

How to get sequences into the db, and back out

Submit
- Sequin (GenBank, stand-alone)
- BankIt (GenBank web tool)
- Webin (EMBL online submission)

Always submit your sequence data (mostly required by journals) and include your accession number in articles (not any other number).

Retrieve

One or a few sequences → use one of the numerous web-based tools
- GenBank: Entrez
- EMBL: EB-eye
- MRS: developed for easy retrieval

Many sequences (batch retrieval)
→ use FTP (file transfer protocol)
→ use Perl (a flexible programming language)
→ BioMart http://www.biomart.org/
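Batch retrieval by script usually comes down to requesting a URL per batch of accession numbers. The slides suggest Perl; the sketch below uses Python instead and only builds the request URL, without contacting any server. It assumes NCBI's E-utilities efetch service, which is not mentioned in the slides, and the helper name efetch_url is ours.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service (an assumption of this
# sketch; check NCBI's usage guidelines before sending many requests).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build a URL that asks for the given accessions as plain text."""
    params = {
        "db": db,                    # database to query
        "id": ",".join(accessions),  # comma-separated accession numbers
        "rettype": rettype,          # e.g. "fasta" or "gb"
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


url = efetch_url(["BC010109.2"])
print(url)
```

The resulting URL can be opened in a browser or downloaded with any HTTP client.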

Page 42: BITS: Basics of sequence databases

Example of a primary NA sequence record (ENA)

http://www.ebi.ac.uk/ena/about/formats

Page 43: BITS: Basics of sequence databases

Example of a primary NA sequence record (ENA)

http://www.ebi.ac.uk/ena/about/formats

Text format

Code usable for

searching

Data linked to that

code

Page 44: BITS: Basics of sequence databases

Primary sequence data contains a lot of redundancy!

- Several gene sequences from different labs
- EST sequences from transcripts
- Chromosome sequence
- cDNA sequence

All match the same gene. A database search often returns all of these sequences... A lot of redundancy!

Page 45: BITS: Basics of sequence databases

The primary sequences are the basis for analyses that generate derived sequence data

Scientists/Consortia → primary databases

– Source for further analyses. Which?

• Create protein sequences

• Curate the sequence database

• Assemble genomes

• Search for similarities

• Aggregate information about one gene

• …

Results stored in derived databases

Page 46: BITS: Basics of sequence databases

Protein databases come in two kinds

Page 47: BITS: Basics of sequence databases

The most important protein db is UniProt, which contains 'automatic' and manual entries

UniProt Knowledge Base - 'the best annotated protein database of the world'

http://www.uniprot.org/

Page 48: BITS: Basics of sequence databases

The most important protein db is UniProt, which contains 'automatic' and manual entries

Page 49: BITS: Basics of sequence databases

RefSeq - The NCBI way to reduce redundancy in primary sequence data

RefSeq is NCBI 'Reference Sequences' (prot and nuc)

Redundancy from primary sequence data is reduced both automatically and by manual annotation of NA and protein sequences. 'one natural biological molecule = one entry'. Links back to the original primary sequences. Hugely popular and a basis for a lot of analyses.

http://www.ncbi.nlm.nih.gov/RefSeq/

Click to apply refseq filter in entrez search

Page 50: BITS: Basics of sequence databases

RefSeq has its own identifiers, not to be mixed up with accession numbers

RefSeq entry codes look similar to accession numbers, but they are not accession numbers (note the underscore!). RefSeq records are also in GenBank format. Note: the 'Features' section lists the raw sequences from which the entry was derived. (Typical mistake: searching with a RefSeq code in UniProt.)

NC_* (curated) complete genomic element (chromosome, plasmid, ...)
NT_* (automated) intermediate assembly from BAC
NZ_* (automated) incomplete genomic sequence from WGS
NW_* (automated) intermediate assembly from WGS
NG_* (curated) incomplete genomic element corresponding to gene
NM_* (curated) mRNA
NR_* (curated) non-coding RNA or predicted transcript of pseudogene
NP_* (curated) protein
ZP_* (automated) protein predicted from WGS sequence (NZ_*)
YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
XM_* (automated) mRNA
XR_* (automated) non-coding RNA or predicted transcript of pseudogene
XP_* (automated) protein

http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/RefSeq/key.html
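The prefix list above can be turned into a small lookup table, which also catches the typical mistake of confusing RefSeq codes with accession numbers. This is a sketch based only on the prefixes listed on this slide; the identifier NM_000000.1 is made up for illustration.

```python
# Prefix -> (curation status, molecule type), copied from the list above.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequence"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def classify_refseq(identifier):
    """Return (status, molecule) for a RefSeq code, or None if it is not one."""
    prefix, underscore, _rest = identifier.partition("_")
    if not underscore:
        return None  # no underscore: looks like an ordinary accession number
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000000.1"))  # hypothetical RefSeq mRNA code
print(classify_refseq("BC010109.2"))   # primary accession number, not RefSeq
```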

Page 51: BITS: Basics of sequence databases

UniRef – UniProt's redundancy-reducing system for protein sequences

Non-redundant protein sequences from UniProt

~ RefSeq

Redundant sequences are hidden by clustering them:
• UniRef100 = completely identical sequences
• UniRef90 = sequences that are at least 90% identical
• UniRef50 = sequences that are at least 50% identical

See http://www.uniprot.org/help/uniref
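The idea behind the 100% level can be sketched in a few lines: entries with exactly the same sequence are collapsed into one cluster. This is only an illustration of the principle with made-up entries, not UniProt's actual clustering procedure, and it does not cover the 90% and 50% levels, which require sequence comparison.

```python
from collections import defaultdict


def cluster_identical(records):
    """Group entry names by identical sequence (case-insensitive)."""
    clusters = defaultdict(list)
    for name, sequence in records:
        clusters[sequence.upper()].append(name)
    return dict(clusters)


# Made-up protein entries, for illustration only.
ENTRIES = [
    ("entryA", "MKTAYIAK"),
    ("entryB", "MKTAYIAK"),   # identical to entryA
    ("entryC", "MKTAYLAK"),   # one residue differs
]

clusters = cluster_identical(ENTRIES)
for sequence, members in clusters.items():
    print(sequence, members)
```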

Page 52: BITS: Basics of sequence databases

NCBI's Gene – summarizes gene information including sequence information from primary dbs

Example of the gene NPR1 from A. thaliana

Page 53: BITS: Basics of sequence databases

UniGene – summarizes transcriptomic information around genes

Page 54: BITS: Basics of sequence databases

And a lot more derived databases with sequence information exist

Repbase:

repeats (Alu, …), maintained by Jerzy Jurka at the Genetic Information Research Institute (Mountain View CA, USA). The CENSOR server can be used to "clean" sequences. http://www.girinst.org/repbase

miRBase → published miRNA sequences

http://www.mirbase.org/

Eukaryotic promoter database

http://www.epd.isb-sib.ch/

UniVec

GenBank subset + some sequences from commercial sources - ftp://ftp.ncbi.nih.gov/pub/UniVec/

Page 55: BITS: Basics of sequence databases

Overview of the most important sequence databases

Primary sequence data: DDBJ, ENA, GenBank (GB)

Derived: TrEMBL (UniProt), GenPept

Curated: Swiss-Prot (UniProt), RefSeq

Integrated search portals: Entrez, ENA search, EB-eye, UniProt

Page 56: BITS: Basics of sequence databases

Common gene annotations on sequences

[Figure: from genome to protein. Labels: genome sequence (e.g. Chr6); gene sequence with enhancers/promoters, exons, introns and terminator; mRNA with 5'UTR, CDS, 3'UTR and poly(A) tail (AAAAAAAAAAAAA); protein, obtained via the genetic code tables.]
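The relation between the mRNA annotations and the protein can be sketched as follows: the CDS is the part of the mRNA between the 5'UTR and the 3'UTR, and it is translated codon by codon with a genetic code table. The example assumes the standard genetic code only (other tables exist) and uses a made-up mRNA with hypothetical coordinates.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(mrna, cds_start, cds_end):
    """Translate mrna[cds_start:cds_end] (0-based, end-exclusive) to protein.

    Translation stops at the first stop codon.
    """
    cds = mrna[cds_start:cds_end].upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up mRNA: 5'UTR (GGC) + CDS (ATG GCC AAA TAA) + 3'UTR (TTT) + poly(A).
MRNA = "GGC" + "ATGGCCAAATAA" + "TTT" + "AAAA"
protein = translate_cds(MRNA, 3, 15)
print(protein)
```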

Page 57: BITS: Basics of sequence databases

Searching the database for your gene of interest

First you have to determine for yourself which information you want

- NA sequences vs. protein sequences

- If NA: genomic sequences or RNA-derived sequences

- All possible sequences that exist, or only curated ones

- Protein sequences of which quality

- ...

Page 58: BITS: Basics of sequence databases

Entrez is a starting point for searches at NCBI
http://www.ncbi.nlm.nih.gov/sites/gquery

Page 59: BITS: Basics of sequence databases

Visualising the db_xrefs in records at NCBI

Page 60: BITS: Basics of sequence databases

ENA has its text-search portal
http://www.ebi.ac.uk/ena/

Page 61: BITS: Basics of sequence databases

Results from an ENA search are organised following the ENA database structure

Page 62: BITS: Basics of sequence databases

UniProt has a simple search box leading to a sophisticated search results page

Page 63: BITS: Basics of sequence databases

Complex searches can be achieved by using the index codes in the database

e.g.

“oc=Primates and de=complete and de=cds and de=MHC”

This query could answer: give me all complete coding sequences of MHC available in primates.

Label on the record: code usable for searching
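What such an indexed query does can be imitated on a few toy records: every condition names a field code (oc for the organism classification, de for the description) and a term, and a record is returned only when all conditions hold. The records below are made up, and plain substring matching is an assumption of this sketch; a real search engine matches indexed words.

```python
def matches(record, conditions):
    """True when every (field, term) condition is found in the record."""
    return all(
        term.lower() in record.get(field, "").lower()
        for field, term in conditions
    )


# The example query from the slide, as (field code, term) pairs.
QUERY = [("oc", "Primates"), ("de", "complete"), ("de", "cds"), ("de", "MHC")]

# Made-up records, for illustration only.
RECORDS = [
    {"id": "toy1", "oc": "Eukaryota; Mammalia; Primates",
     "de": "Homo sapiens MHC class I antigen, complete cds"},
    {"id": "toy2", "oc": "Eukaryota; Mammalia; Rodentia",
     "de": "Mus musculus MHC class II antigen, complete cds"},
    {"id": "toy3", "oc": "Eukaryota; Mammalia; Primates",
     "de": "Pan troglodytes MHC class I antigen, partial cds"},
]

hits = [record["id"] for record in RECORDS if matches(record, QUERY)]
print(hits)
```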

Page 64: BITS: Basics of sequence databases

Meta-search tools can search different sequence databases at once.

MRS

Open source, developed by Maarten Hekkelman at Radboud U. (Nijmegen, the Netherlands). It allows searching in different databases at once and also provides statistics on the databases.

Alternatives: ACNUC, SRS

Page 65: BITS: Basics of sequence databases

Logical operators

Q1 AND Q2 (symbol: &)

Q1 OR Q2 (symbol: |)

Q1 NOT Q2 (symbol: !)

Searching involves making combinations of conditions. The difference between a logical AND, OR and NOT is explained here with Venn diagrams.
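The Venn diagrams correspond directly to set operations on the result lists of two queries. In this sketch the record names are made up; note that Python writes NOT between two sets as a difference (-), not as the ! used in the search tools.

```python
# Made-up result sets of two queries.
q1 = {"rec1", "rec2", "rec3"}
q2 = {"rec2", "rec3", "rec4"}

both = q1 & q2      # Q1 AND Q2: records found by both queries
either = q1 | q2    # Q1 OR Q2: records found by at least one query
only_q1 = q1 - q2   # Q1 NOT Q2: records found by Q1 but not by Q2

print(sorted(both))
print(sorted(either))
print(sorted(only_q1))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits from the first result.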

Page 66: BITS: Basics of sequence databases

Hands-on!

Every module ends with an exercise session.

We will now explore how data is stored in different sequence databases. You get …. minutes for this exercise.

Afterwards, we will summarize some of the difficulties you might have experienced.

Page 67: BITS: Basics of sequence databases

Summary

- This course is organised in several modules
- Module 1: Sequence databases
- Three major nucleotide databanks host primary sequence data
- These databases are filled with NA sequence information by scientists and consortia
- The batch submissions originate mostly from sequencing centers
- Each primary database stores its sequences and batch submissions in its own way...
- Batch submissions are marked and/or stored differently than single submissions
- The 'normal' submissions are a minority in primary sequence databases
- Primary sequence dbs are synchronised and every sequence receives a unique identifier
- One sequence entry contains three categories of different types of information
- This sequence information can be written in different formats
- Degree of annotation differs between entries
- SRA contains batch submitted records of which experiment information is of most importance
- How to get sequences into the db, and back out
- Primary sequence data contains a lot of redundancy!
- The primary sequences are the basis for analyses that generate derived sequence data
- Protein databases come in two kinds
- The most important protein db is UniProt and contains 'automatic' and manual entries
- RefSeq - The NCBI way to reduce redundancy in primary sequence data
- RefSeq has its own identifiers, not to be mixed up with accession numbers
- UniRef – UniProt redundancy reducing system for protein sequences
- NCBI's Gene – summarizes gene information including sequence information from primary dbs
- UniGene – summarizes transcriptomic information around genes
- And a lot more derived databases with sequence information exist
- Searching the database for your gene of interest
- Entrez is a starting point for searches at NCBI
- Visualising the db_xrefs in records at NCBI
- ENA has its text-search portal
- Results from an ENA search are organised following the ENA database structure
- UniProt has a simple search box leading to a sophisticated search results page
- Complex searches can be achieved by using the index codes in the database
- Meta-search tools can search different sequence databases at once.
- Hands-on!