Winter 2008 UNCC CS/Bioinformatics Instructors: Weller, Carr

BINF 6211
Design and Implementation of Bioinformatics Databases
Lecture 22
April 21st, 2008
Dr. Jennifer W. Weller
Dr. Andrew Carr

Agenda

• Genetic Databases
  – OMIM
  – dbSNP

Genetic Information Databases

• From the phenotype perspective
  – A mutation may be inferred from the way it tracks in crosses (inheritance)
  – Given enough crosses, the relative location may be inferred
    • Recombination frequency with respect to linked phenotypes
  – The physical map location provides an absolute position
  – Mutant alleles have the sequence changes leading to the range of phenotypes associated with the disease
  – The frame of reference is within the set of samples having the phenotype
• From the physical location perspective
  – Not all sequence changes (alleles) lead to different phenotypes
    • The changes may be synonymous
    • The changes may lead to subtle, multi-genic effects
  – Variants are defined with respect to the gene/chromosomal location
  – The frame of reference is the genome sequence representative of a population.
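
The inference of relative location from crosses can be illustrated with a small calculation. This is a minimal sketch, not taken from the lecture: the offspring counts are made-up numbers and the function name is hypothetical. It computes the recombination frequency between two linked phenotypes as recombinant offspring divided by total offspring, and converts it to centimorgans under the usual approximation that 1% recombination is about 1 cM for closely linked loci.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of offspring showing a recombinant phenotype combination."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring counted")
    return recombinant / total


# Made-up counts from a hypothetical test cross of two linked phenotypes.
rf = recombination_frequency(parental=830, recombinant=170)
map_distance_cm = rf * 100  # approximation that only holds for small distances
print(f"RF = {rf:.2f}, approx. {map_distance_cm:.0f} cM")
```

The result is a relative distance between the two loci; an absolute position still requires the physical map.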

Online Mendelian Inheritance in Man

• A catalog of human genes and genetic disorders
  – Expert curation: Dr. Victor A. McKusick (JHU) and colleagues
  – Developed for the World Wide Web and served by NCBI.
• Contents: textual information and references, a federated database
  – Links to MEDLINE
  – Links to sequence records in Entrez
  – Links to additional related resources at NCBI and elsewhere, as curators deem relevant.
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

• List is not comprehensive; contents must fit the categories defined.
• Two strategies to get subsets of interest
  – Via phenotype
    • Use the Limits page to retrieve records that have the prefixes (+, #, %, or none) by checking the box in front of each, then click GO
  – Via clinical synopsis
    • Don’t select the above boxes, then click GO
    • Not all disease-related records have such a synopsis
  – Use the History page to combine the two searches

http://www.ncbi.nlm.nih.gov/Omim/mimstats.html

Y-linked is 4xxxxx OID

Record Types

• "Mendelian inheritance" refers to the transmission of inherited characters
  – Via reproductive transmission of genes.
• Character types have keys in certain ranges, as below
  – (100000- , 200000- ) Autosomal loci or phenotypes from before May 15, 1994.
  – (300000- ) X-linked loci or phenotypes
  – (400000- ) Y-linked loci or phenotypes
  – (500000- ) Mitochondrial loci or phenotypes
  – (600000- ) Autosomal loci or phenotypes from after May 15, 1994
• Allelic variants have the MIM number of the parent entry, a period, then a unique 4-digit number.
  – Example: Factor IX (hemophilia B) locus is 306900
    • Alleles are 306900.0001 to 306900.0101.
  – The beta-globin locus (HBB) is 141900
    • Sickle hemoglobin allele is 141900.0243.
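
The numbering scheme above is regular enough to decode mechanically. The following is a minimal sketch, assuming six-digit MIM numbers with an optional four-digit allele suffix as described on the slide; the names `MIM_RANGES` and `parse_mim` are hypothetical and not part of any NCBI tool.

```python
import re

# Leading digit -> category, following the key ranges listed on the slide.
MIM_RANGES = {
    "1": "Autosomal locus or phenotype (before May 15, 1994)",
    "2": "Autosomal locus or phenotype (before May 15, 1994)",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "Mitochondrial locus or phenotype",
    "6": "Autosomal locus or phenotype (after May 15, 1994)",
}

# Six digits, optionally followed by a period and a four-digit allele number.
MIM_PATTERN = re.compile(r"^(\d{6})(?:\.(\d{4}))?$")


def parse_mim(identifier: str):
    """Split a MIM identifier into (parent entry, allele suffix or None, category)."""
    match = MIM_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a MIM number: {identifier!r}")
    parent, allele = match.groups()
    category = MIM_RANGES.get(parent[0], "Unknown range")
    return parent, allele, category


print(parse_mim("306900"))       # Factor IX locus
print(parse_mim("141900.0243"))  # sickle hemoglobin allele of HBB
```

Keeping the parent entry and the allele suffix as separate fields makes it straightforward to group all allelic variants under their locus.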

MIM special characters

• Symbols preceding an entry number mean:
  – An asterisk (*) indicates a gene of known sequence.
  – A number symbol (#) indicates that it is a descriptive entry
    • This is usually of a phenotype, and will be explained in the first paragraph; discussion of related genes is included in the Gene entry
    • This does not represent a unique locus.
  – A plus sign (+) indicates a description of a gene of known sequence and a phenotype.
  – A percent sign (%) indicates a confirmed Mendelian phenotype or phenotypic locus for which the molecular basis is not known.
  – No symbol indicates a description of a phenotype where the Mendelian basis is not proven, or there may be phenotypic crossover
  – A caret symbol (^) indicates that the entry no longer exists.
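
The prefix symbols can be treated as a small lookup table. This is a sketch only: the dictionary paraphrases the meanings listed above, `split_prefix` is a hypothetical helper, and the prefix/number pairings in the usage lines are for illustration and do not claim to reflect the real status of those entries.

```python
# Prefix symbol -> meaning, paraphrased from the slide.
MIM_PREFIXES = {
    "*": "gene of known sequence",
    "#": "descriptive entry, usually a phenotype; not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed Mendelian phenotype or locus, molecular basis unknown",
    "^": "entry no longer exists",
    "": "phenotype whose Mendelian basis is not proven",
}


def split_prefix(entry: str):
    """Return (prefix symbol, MIM number, meaning) for an entry such as '#123456'."""
    entry = entry.strip()
    prefix = entry[0] if entry and entry[0] in "*#+%^" else ""
    number = entry[len(prefix):]
    return prefix, number, MIM_PREFIXES[prefix]


# Illustrative inputs only; the prefixes shown are not looked up in OMIM.
for example in ["*123456", "%123456", "123456"]:
    print(split_prefix(example))
```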

Other limitations

• There is no aggregation function to track how many inherited diseases have a known sequence
• There is no aggregation function for inherited diseases with a known phenotype but no corresponding sequence (those with % prefix)
  – Entrez Gene will let you retrieve human genes for which there is no sequence data or for which only a phenotype is known, BUT there is no keyword here for disease genes
    • human[orgn] NOT gene_nucleotide[filter]
    • human[orgn] AND phenotype_only[Properties]
    • Note: these lists will overlap
    • To get those with a phenotype and an OMIM record use human[orgn] AND phenotype_only[Properties] AND gene_omim[filter]
• While there is an emphasis on inheritance and cytogenetics, there is very little information on chromosomal aberrations (these are often NOT inherited).
  – For this the genome-wide map of chromosomal break points elsewhere is a better source (although this does not include monoploid or polyploid examples)
• An OMIM record may link to a gene that is not actually the primary locus, if it was included in the discussion by the authors of a paper; these are the links at the top and bottom of the record, while those in the side bar are limited to the specific locus.
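
The Entrez Gene queries above can also be issued programmatically. The sketch below only builds request URLs for the NCBI E-utilities ESearch endpoint and does not send them; the query strings are copied from the slide, and it is an assumption that these 2008-era filter and property names are still accepted by the current service. The names `QUERIES` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query strings as given on the slide (filter names may have changed since 2008).
QUERIES = {
    "no_sequence": "human[orgn] NOT gene_nucleotide[filter]",
    "phenotype_only": "human[orgn] AND phenotype_only[Properties]",
    "phenotype_with_omim": (
        "human[orgn] AND phenotype_only[Properties] AND gene_omim[filter]"
    ),
}


def esearch_url(term: str, db: str = "gene", retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for the given term."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


for name, term in QUERIES.items():
    print(name, esearch_url(term))
```

Because the first two result lists overlap, any count derived from them needs de-duplication by Gene ID before the numbers are combined.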

Gene Map

• This section has information on the cytogenetic locations of genes: a single tabular file, with chromosomes in order, each from ptel to qtel
  – Limited to those with demonstrated cytogenetic location
    • Genes mapped only to a chromosome are given at the end of the list for that chromosome
  – The Web version is searchable by gene symbol, chromosomal location or keyword
  – There is an associated file called the GeneMapKey to describe the column headings in the file, and special characters used

Morbid Map

• Alphabetical list of diseases described in OMIM, with location as known
  – Searchable by gene name, location and keyword
  – There is a graphical view of this data, which can be displayed in the Entrez Map Viewer
    • You need to select the correct display settings in order to interpret the input correctly
• Special symbols
  – [ ] information for molecular aberrations that don’t lead to something classified as a disease
  – { } indicates a variant that leads to pathogen susceptibility
  – ? means the mapping status is unresolved
• After the name of the disorder there is a (number) that indicates the method of mapping
  – With respect to the WT gene (1)
  – With respect to the mutant allele (2)
  – Both (3)
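
A disorder label carrying these symbols can be decoded with a few string operations. This is a minimal sketch using the meanings as stated on the slide; `parse_disorder` is a hypothetical helper, the example labels are invented, and the real Morbid Map file may use conventions not covered here.

```python
import re

# Trailing "(n)" mapping-method codes, with the meanings given on the slide.
MAPPING_METHOD = {
    1: "mapped with respect to the wild-type gene",
    2: "mapped with respect to the mutant allele",
    3: "both",
}


def parse_disorder(label: str) -> dict:
    """Decode the special symbols wrapped around a Morbid Map disorder name."""
    text = label.strip()
    method = None
    match = re.search(r"\((\d)\)\s*$", text)
    if match:
        method = int(match.group(1))
        text = text[: match.start()].strip()
    unresolved = text.startswith("?")
    text = text.lstrip("?").strip()
    return {
        "name": text.strip("[]{}").strip(),
        "non_disease": text.startswith("[") and text.endswith("]"),
        "susceptibility": text.startswith("{") and text.endswith("}"),
        "unresolved": unresolved,
        "method": MAPPING_METHOD.get(method),
    }


# Invented labels, used only to exercise each symbol.
print(parse_disorder("{Example susceptibility} (3)"))
print(parse_disorder("?Example disorder (2)"))
```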

Local hosting

• OMIM is not relational but some of the information is tracked in a relational system:
  – MIM number, create date, update dates
• It uses the ASN.1 format
  – You can download selected files (matrices of commonly requested data; think data hypercubes):
    • The complete text of OMIM
    • The OMIM Gene table, either from the ftp site or from the directory of the NCBI Web site as an alphabetical list of gene symbols and their MIM numbers.
    • The OMIM Gene Map key and columns in the GeneMap file
    • The OMIM Morbid Map
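
For a local copy, the relationally tracked fields (MIM number, create date, update dates) suggest a small two-table layout. The sketch below uses an in-memory SQLite database; the table and column names are assumptions chosen for illustration and do not describe NCBI's actual schema, and the inserted row uses a placeholder MIM number and placeholder dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mim_entry (
        mim_number  INTEGER PRIMARY KEY,
        create_date TEXT NOT NULL            -- ISO 8601 date
    );
    CREATE TABLE mim_update (
        mim_number  INTEGER NOT NULL REFERENCES mim_entry(mim_number),
        update_date TEXT NOT NULL,           -- one row per update
        PRIMARY KEY (mim_number, update_date)
    );
    """
)

# Placeholder record; not a real OMIM entry or real dates.
conn.execute("INSERT INTO mim_entry VALUES (?, ?)", (999999, "2000-01-01"))
conn.executemany(
    "INSERT INTO mim_update VALUES (?, ?)",
    [(999999, "2004-06-01"), (999999, "2008-04-21")],
)

latest = conn.execute(
    "SELECT mim_number, MAX(update_date) FROM mim_update GROUP BY mim_number"
).fetchall()
print(latest)
```

Update dates go in their own table because an entry has one create date but any number of updates.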

OMIM Case Study

• I want to design an array that will capture the mutations known to be associated with the Collagen I-A1 gene.

• I want to know what other genes I might need to assay for patients with this phenotype

• I want to know what patient data I should collect to do a good job on clinical signs and symptoms.

Nearby genes on the chromosome

Go back to the Collagen DB

Lists of specific types of mutations, but not in a text file

For SNP-specific assays

Circular

NIA database, but no data on bone

Pathway data rather than chromosomal location data

Signs and Symptoms

SNPs

• SNPs are single nucleotide polymorphisms, occurring about once per 300 nt
  – Sequence variants that do not change the number of bases in a gene
    • Can still cause early truncation of a gene product
• Most are biallelic
• Several large-scale international and commercial projects have been undertaken to assess the level of polymorphism in the human genome, in various populations
• Why useful: the collection of such markers is unique for an individual
  – Mapping
  – Defining population structure
  – Performing functional studies
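
The point that a substitution can truncate a gene product without changing the number of bases can be shown on a toy sequence. This is a sketch under stated assumptions: the coding sequence is made up, and `first_stop` and `substitute` are hypothetical helpers. A single C-to-T change turns the codon CAG (glutamine) into the stop codon TAG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def substitute(cds: str, position: int, base: str) -> str:
    """Return a copy of cds with a single base replaced (length unchanged)."""
    return cds[:position] + base + cds[position + 1:]


# Made-up coding sequence: ATG GCA CAG GTT CTG TAA
reference = "ATGGCACAGGTTCTGTAA"
variant = substitute(reference, 6, "T")  # CAG -> TAG at the third codon

print(len(reference) == len(variant))
print(first_stop(reference), first_stop(variant))
```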

dbSNP

• dbSNP is a public database of single nucleotide polymorphisms (SNPs) and a bit more
  – Any species is allowed and from any part of a particular genome.
  – SNPs linked to known genes or expressed DNA segments (ESTs) are most useful.
    • Thus SNPs from these regions are prioritized for integration with other NCBI databases/views/tools.
• dbSNP includes several types of simple genetic polymorphisms
  – single-base nucleotide substitutions
  – small-scale multi-base deletions or insertions
  – retroposable element insertions
  – microsatellite repeat variation.
• Experimental information is also included:
  – the sequence information around the polymorphism
  – specific experimental conditions necessary to perform an experiment (such as PCR of the locus)
  – frequency information by population or individual genotype.

Integration map

dbSNP Schemas

• dbSNP Schema
  – > 100 tables and many relationships among tables.
  – No single ER diagram with all dbSNP tables is available
• Sub-schemas are available in which tables are grouped according to subject areas:
  – Batch Submission
  – Submitted SNP
  – Submitted SNP, population frequency and individual genotype
  – Frequency calculation by submitted SNP and population
  – SNP Mapping and Annotation
• Version control: b125_SNPContigLoc_b34_3 is the mapping data for b125 SNPs that are mapped to NCBI genome build 34 version 3.
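
The version-controlled table name encodes three version numbers that a loader script may want to check. This is a minimal sketch that assumes names follow exactly the pattern of the one example on the slide; the regular expression and `parse_table_name` are hypothetical, and other dbSNP tables may be named differently.

```python
import re

# Pattern inferred from the single example on the slide: b125_SNPContigLoc_b34_3
TABLE_NAME = re.compile(
    r"^b(?P<snp_build>\d+)_(?P<table>[A-Za-z]+)"
    r"_b(?P<genome_build>\d+)_(?P<genome_version>\d+)$"
)


def parse_table_name(name: str) -> dict:
    """Extract the dbSNP build, base table name, and genome build/version."""
    match = TABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected table name: {name!r}")
    parts = match.groupdict()
    return {
        "snp_build": int(parts["snp_build"]),
        "table": parts["table"],
        "genome_build": int(parts["genome_build"]),
        "genome_version": int(parts["genome_version"]),
    }


print(parse_table_name("b125_SNPContigLoc_b34_3"))
```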

• SNPs are indexed by two different accession numbers
  – the HANDLE | ID / NCBI | ssASSAY ID forms, which refer to an individual submission record
  – the NCBI | rsSNP ID form, which refers to the abstracted SNP and all associated records.
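
Code that handles dbSNP identifiers usually needs to tell the two accession types apart. The sketch below assumes the common textual form of a prefix (ss or rs) followed by digits; `classify_accession` is a hypothetical helper and the example numbers are invented, not real records.

```python
import re

ACCESSION = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)


def classify_accession(accession: str):
    """Return (prefix, number, description) for an ss or rs accession."""
    match = ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    prefix = match.group(1).lower()
    number = int(match.group(2))
    if prefix == "ss":
        description = "individual submission record"
    else:
        description = "abstracted SNP with all associated records"
    return prefix, number, description


# Invented accession numbers, for illustration only.
print(classify_accession("ss12345"))
print(classify_accession("rs67890"))
```

Several ss records can point at the same rs record, so a local schema would typically model this as a many-to-one relationship.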

No data

Try next

What should the record have
