Basics on bioinformatics

Lecture 2

Nunzio D’Agostino

nunzio.dagostino@entecra.it; nunzio.dagostino@gmail.com

Database or databank?

Initially

o Databank (UK)

o Database (USA)

Solution

The abbreviation db

Entity-Relationship (ER) modeling

Notation uses three main constructs:

o Data entities

Represent a set or collection of objects in the real world that share the same properties: a person, place, object, event or concept about which data is to be maintained.

o Attributes

Named property or characteristic of an entity

o Relationships

Association between the instances of one or more entity types

Relationships can be classified as either

one-to-one 1:1

one-to-many 1:N

many-to-many N:N
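The three constructs above can be sketched in code. This is only an illustration, not part of the lecture: the entity names `Cultivar` and `Sequence`, their attributes, and the values used are hypothetical. Each class is an entity, each field is an attribute, and the list field carries a one-to-many relationship.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    """Entity: one sequence record. Attributes: accession and length."""
    accession: str
    length: int


@dataclass
class Cultivar:
    """Entity: one cultivar. The list holds the 'many' side of a 1:N relationship."""
    name: str
    sequences: List[Sequence] = field(default_factory=list)


# One cultivar instance related to many sequence instances (1:N).
# All names and values below are made up for the example.
cultivar = Cultivar(name="cultivar A")
cultivar.sequences.append(Sequence(accession="SEQ0001", length=492))
cultivar.sequences.append(Sequence(accession="SEQ0002", length=346))

total_length = sum(seq.length for seq in cultivar.sequences)
print(cultivar.name, len(cultivar.sequences), total_length)
```

A one-to-one relationship would replace the list with a single reference, and a many-to-many relationship would need a list on both sides (or a separate linking structure).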

Connectivity

Cardinality: 1 : 1, 1 : N, N : M

ER example

database: basic structure

Databases are composed of tables of data.

Gi        Accession  Length  Cultivar           Dev. stage                       Tissue    Sequence
30320090  CD003352   356     -                  Turning stage of fruit ripening  Pericarp  GTACTCCTAAAC…..
15195408  BI421671   492     TA496              25-40 days old                   callus    CCACAACCACA…..
50892290  AJ784669   346     West Virginia 106  8 days post anthesis             fruit     CAAATTTA…..

Tables hold logically related sets of data. A table is essentially the same thing as a spreadsheet: a set of rows and columns

Each table has several records or entries:

a record stores all the information for a given individual

Records are the rows of a data table

Each record has several fields:

A field is an individual piece of data, a single attribute of the record.

Fields are the columns of a data table

Each record (row) has a unique identifier, the primary key.

The primary key serves to identify the data stored in this record across all the tables in the database.

Databases are manipulated with a language called SQL (Structured Query Language). It’s a “baby English” type of language: it uses real words, but is rigid in terms of their order and placement.

Various database software: Oracle, MS Access, MySQL, etc.
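As a minimal sketch of these ideas (table, records, fields, primary key, SQL), the example table can be loaded into SQLite through Python's standard `sqlite3` module. SQLite is used here only because it needs no server; the lecture does not prescribe it. The table name `sequences`, the column names, and the choice of `gi` as primary key are assumptions made for the illustration, and the truncated sequence column is left out.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table; each column is a field. Using "gi" as primary key is an assumption.
conn.execute(
    """
    CREATE TABLE sequences (
        gi        INTEGER PRIMARY KEY,
        accession TEXT NOT NULL UNIQUE,
        length    INTEGER,
        cultivar  TEXT,
        dev_stage TEXT,
        tissue    TEXT
    )
    """
)

# Each tuple is one record (row) of the table shown on the slide.
records = [
    (30320090, "CD003352", 356, None, "Turning stage of fruit ripening", "Pericarp"),
    (15195408, "BI421671", 492, "TA496", "25-40 days old", "callus"),
    (50892290, "AJ784669", 346, "West Virginia 106", "8 days post anthesis", "fruit"),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?, ?, ?)", records)

# SQL query: records longer than 350 nucleotides, shortest first.
rows = conn.execute(
    "SELECT accession, length FROM sequences WHERE length > 350 ORDER BY length"
).fetchall()
print(rows)

# Look up a single record through its primary key.
row = conn.execute(
    "SELECT accession FROM sequences WHERE gi = ?", (15195408,)
).fetchone()
print(row)
```

The `SELECT ... FROM ... WHERE ...` statements show the rigid word order mentioned above: the keywords are plain English, but they must appear in this sequence.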

Why biological databases?

o Make biological data available to scientists

Consolidation of data (gather data from different sources)

Provide access to large datasets that cannot be published explicitly (genome, …)

o Make biological data available in computer-readable format

Make data accessible for automated analysis

Biological db

o Vary in size, quality, coverage, level of interest

o Many of the major ones covered in the annual Database Issue of

Nucleic Acids Research

2010

What makes a good db?

o comprehensiveness

o accuracy

o is up-to-date

o good interface

o batch search/download

o API (web services, DAS, etc.)

“Must have” items when using a db

o Remember the server, the database, and the program version used

o Write down sequence identification numbers

o Databases are not like good wine (use up-to-date builds)

o Use local installs when it becomes necessary

Primary and derived data

Primary databases:

Databases consisting of data derived experimentally, such as nucleotide sequences and three-dimensional structures.

Secondary databases:

Databases consisting of data derived from the analysis or treatment of primary data.

Nucleotide sequence databases

GenBank www.ncbi.nlm.nih.gov/GenBank

EMBL www.ebi.ac.uk/embl

DDBJ www.ddbj.nig.ac.jp

The 3 databases are synchronized on a daily basis, and the accession numbers are consistent.

There are no legal restrictions on the use of these databases. However, there are some patented sequences in the databases.

GenBank sample record

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

LOCUS       AF115338                 591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
      481 tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa
      541 cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g

Record sections: header, title, taxonomy, citation, features, sequence
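Because the record is plain text with fixed keywords, simple fields can be read by a program. The sketch below is an illustration only: the helper names `parse_locus` and `parse_base_count` are hypothetical, and splitting on whitespace is a simplification that works for this sample record but is not a complete GenBank parser. It reads the LOCUS and BASE COUNT lines and checks that the four base counts add up to the sequence length.

```python
locus_line = "LOCUS       AF115338                 591 bp    DNA     linear   BCT 19-AUG-1999"
base_count_line = "BASE COUNT      157 a    133 c    170 g    131 t"


def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-based simplification)."""
    fields = line.split()
    # fields: LOCUS, name, length, "bp", molecule type, topology, division, date
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


def parse_base_count(line):
    """Return a dict such as {'a': 157, ...} from a BASE COUNT line."""
    tokens = line.split()[2:]  # drop the words "BASE" and "COUNT"
    # The remaining tokens alternate: count, base, count, base, ...
    return {base: int(count) for count, base in zip(tokens[0::2], tokens[1::2])}


locus = parse_locus(locus_line)
counts = parse_base_count(base_count_line)
print(locus["name"], locus["length"], locus["molecule"])
print(counts, sum(counts.values()))
```

Here 157 + 133 + 170 + 131 = 591, which matches the length given on the LOCUS line.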

Protein sequence database

The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

UniProtKB (UniProt Knowledgebase)

is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. It consists of two sections:

o Swiss-Prot, which is manually annotated and reviewed.

o TrEMBL, which is automatically annotated and is not reviewed.

The UniProt Reference Clusters (UniRef), which are used to speed up sequence similarity searches.

UniProt entry

Protein data bank

The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies (X-ray, NMR, computationally predicted).

Mission: maintain a single archive of macromolecular structural data that is freely and openly available to the global community

Number of Structures Available

PDB entry

Protein structure levels

The Gene Ontology (GO)

GO goals

The GO Website http://www.geneontology.org

GO is divided into 3 domains (levels of annotation):

o Molecular function - basic activities of a gene product at the molecular level

o Biological process - set of molecular events with a defined beginning and end

o Cellular component - the parts of a cell or its extracellular environment

GO structure

The structure of GO can be described in terms of a directed acyclic graph (DAG), where each GO term is a node, and the relationships between the terms are arcs between the nodes.

Figure: example graph with the terms nucleus, chromosome, mitochondrion, nuclear chromosome and mitochondrial chromosome, connected by is_a and part_of arcs.

GO currently has 2 relationship types:

is_a

An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.

part_of

A part_of child of a parent means that the child is always a constituent of the parent that, in combination with other constituents of the parent, makes up the parent.
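A DAG of this kind can be sketched with a plain dictionary. The example below is an illustration under stated assumptions: it uses the five terms named in the figure, and the specific arcs (each chromosome type is_a chromosome and part_of its organelle) are the usual reading of that figure, not something the extracted text states explicitly. The function name `ancestors` is hypothetical.

```python
# Each term maps to a list of (relationship, parent) arcs.
# The arcs are an assumed reading of the slide figure.
go_edges = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "mitochondrial chromosome": [("is_a", "chromosome"), ("part_of", "mitochondrion")],
    "chromosome": [],
    "nucleus": [],
    "mitochondrion": [],
}


def ancestors(term, edges):
    """Collect every term reachable from `term` by following arcs upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear chromosome", go_edges)))
print(sorted(ancestors("mitochondrial chromosome", go_edges)))
```

Note that "nuclear chromosome" has two parents, which is what makes the structure a graph rather than a tree.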

Searching for papers

http://www.ncbi.nlm.nih.gov/pubmed

http://scholar.google.com/

http://www.scopus.com/home.url

http://portal.isiknowledge.com/

Querying GenBank

http://www.ncbi.nlm.nih.gov/sites/gquery

Search from the Entrez main page for the gene whose accession number is BC043443.

o How many results do we get in the Gene db?

o What is the official name of the gene? Other possible names?

o On which DNA strand is it located?

o How many splice variants does it have?

o Which disease is the gene associated with?

o Is it involved in the apoptosis process?

o How long is the coding sequence of the first splice variant?

Querying GenBank

http://www.ncbi.nlm.nih.gov/genbank/

NG_000007

What kind of molecule is it? Genomic DNA

Where is the promoter of the HBB gene located? Upstream of nucleotide 70545

Indicate the number of exons = 3

Indicate the length of the second exon = 71039-70817+1 = 223 nts

Indicate the number of introns = 2

Indicate the length of the first intron = 70816-70685+1 = 132 nts

Indicate the location of the 5' UTR = 70545..70594

Indicate the length of the 5' UTR = 70594-70545+1 = 50 nts

Indicate the location of the 3' UTR = 72019..72150

Indicate the length of the 3' UTR = 72150-72019+1 = 132 nts
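The lengths above all follow the same rule: GenBank locations are 1-based and include both endpoints, so a feature spanning start..end is end - start + 1 nucleotides long. A small sketch of that arithmetic, using coordinates quoted in the exercise (the function name `interval_length` is hypothetical):

```python
def interval_length(start, end):
    """Length of a 1-based, inclusive interval start..end."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Coordinates taken from the exercise answers.
features = {
    "5' UTR": (70545, 70594),
    "exon 2": (70817, 71039),
    "3' UTR": (72019, 72150),
}

lengths = {name: interval_length(start, end) for name, (start, end) in features.items()}
for name, length in lengths.items():
    print(name, length, "nts")
```

Forgetting the "+1" is the usual mistake: 70594 - 70545 is 49, but the interval contains 50 positions.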

Indicate the nucleotide positions of the start codon = 70595,70596,70597

Download the sequence of the HBB gene in FASTA format

Region shown: from 70545 to 72150

>gi|28380636:70545-72150 Homo sapiens beta globin region (HBB@); and hemoglobin, beta (HBB); and hemoglobin, delta (HBD); and hemoglobin, epsilon 1 (HBE1); and hemoglobin, gamma A (HBG1); and hemoglobin, gamma G (HBG2), RefSeqGene on chromosome 11
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAG
ACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGG
TGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGG
CAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGAC
AACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACT
TCAGGGTGAGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAG
GAAGGGGATAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCT
CAGGATCGTTTTAGTTTCTTTTATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCT
TTTTTTTTCTTCTCCGCAATTTTTACTATTATACTTAATGCCTTAACATTGTGTATAACAAAAGGAAATA
TCTCTGAGATACATTAAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGAAT
ATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTCTTTTATTTTTAATTGATACATAAT
CATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTACACATATTGACCAAATCAGGGTAA
TTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTTTATCTTATTTCTAATA
CTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATTCTAAAG
AATAACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAAT
TGTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTTTATTTT
ATGGTTGGGATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTT
ATCTTCCTCCCACAGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
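A FASTA record is a header line starting with ">" followed by sequence lines, so a downloaded file can be checked with a few lines of code. The sketch below is an illustration only: the function names `read_fasta` and `expected_length` are hypothetical, it handles a single record, and the toy record is made up. It compares the sequence length with the range written in the header (for the HBB download, 70545-72150 corresponds to 1606 nucleotides).

```python
import re


def read_fasta(text):
    """Return (header, sequence) for a text holding a single FASTA record."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("this sketch handles one record only")
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


def expected_length(header):
    """Length implied by a ':start-end' range in the header, or None if absent."""
    match = re.search(r":(\d+)-(\d+)", header)
    if match is None:
        return None
    start, end = (int(value) for value in match.groups())
    return end - start + 1


# Toy record (made up) with a 20-nucleotide range in the header.
toy = ">gi|0:11-30 toy record\nACGTACGTAC\nGGCCTTAAGG\n"
toy_header, toy_sequence = read_fasta(toy)
print(len(toy_sequence), expected_length(toy_header))

# Range used in the HBB download.
hbb_length = expected_length("gi|28380636:70545-72150 Homo sapiens beta globin region")
print(hbb_length)
```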

Querying GenBank: link to geneID

Querying PubMed

http://www.ncbi.nlm.nih.gov/pubmed

How many articles did Nunzio D’Agostino publish?

D'Agostino, Nunzio [Full Author Name] OR D Agostino, Nunzio [Full Author Name]

How many of these are related to EST?

D'Agostino, Nunzio [Full Author Name] AND EST [Title/Abstract]

How many of these are in the journal BMC Genomics?

(D'Agostino, Nunzio [Full Author Name] OR D Agostino, Nunzio [Full Author Name]) AND BMC Genomics [journal]

How many articles in PubMed include the word “RNA-Seq” in the title?

RNA-Seq [title]

How many reviews containing the word “transcriptome” were published in 2008?

transcriptome [title] AND review [Publication Type] AND 2008 [publication date]
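The same queries can also be sent programmatically. The lecture itself uses the PubMed web page; the sketch below assumes instead the NCBI E-utilities ESearch endpoint, with its `db`, `term` and `rettype=count` parameters. It only builds the request URLs and performs no network call. The function name `build_count_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# NCBI E-utilities ESearch endpoint (an assumption: the slides use the web page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_url(term):
    """Build an ESearch URL that asks only for the number of PubMed hits."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH_URL + "?" + urlencode(params)


# Query strings taken from the exercise.
queries = [
    "D'Agostino, Nunzio [Full Author Name] OR D Agostino, Nunzio [Full Author Name]",
    "RNA-Seq [title]",
    "transcriptome [title] AND review [Publication Type] AND 2008 [publication date]",
]

urls = [build_count_url(query) for query in queries]
for url in urls:
    print(url)

# Decoding the URL again returns the original query text.
decoded = parse_qs(urlsplit(urls[0]).query)["term"][0]
```

`urlencode` takes care of escaping spaces, brackets and the apostrophe, so the query text can be written exactly as it is typed into the PubMed search box.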
