Chapter 1 Review of Literature

1.1 Background

All of us exhibit a wide range of phenotypic variation: morphological,

physiological, psychological, behavioral, susceptibility to diseases, as well as

response to drugs. This phenotypic diversity is thought to arise largely from

genetic variations, although gene-environment interactions are increasingly

recognized as important. The genetic variations may be of several types,

such as single nucleotide polymorphisms (SNPs), insertions/deletions, block

substitutions, inversions, variable number of tandem repeat sequences (VNTRs),

microsatellites, and copy number variations (CNVs).

Recent astounding progress in sequencing technology has opened a way for

personal genomics, with six genome sequences already published in the last two

years: those of J. Craig Venter (Levy et al. 2007) and James D. Watson (Wheeler et al.

2008), as well as four additional genomes from Han Chinese (Asian) (Wang et al.

2008), Nigerian (African) (Bentley et al. 2008), and two Korean (Ahn et al.

2009;Kim et al. 2009) individuals. The availability of these genomes has furthered

our understanding of the various forms of human genetic variations. While some

of these genetic variations have no proven phenotypic influences, others may have

significant consequences. For instance, they may increase our risk of developing a

disease or lower the likelihood of our response to a particular pharmaceutical

treatment. Comprehensive understanding of various forms of genetic variations

may uncover the genetic basis of human phenotypic differences.

SNPs (often pronounced as ‘snips’), the most common source of human genetic

variations, have been implicated in a number of human complex disorders, as well

as in inter-individual variability in drug response. However, the identification of

disease-associated SNPs from the large pool of SNPs is a daunting task, especially

given the cost and labour involved in existing methods, including

whole genome association studies (WGAS).

Computational approaches, by prioritization of functional SNPs, may facilitate

experimental efforts to identify SNPs that impact biological processes. This thesis

deals with the development of such computational methods capable of identifying

functionally relevant coding regions, as well as SNPs likely to be deleterious, and

their applications in population genetics.

1.2 Single Nucleotide Polymorphisms

1.2.1 Introduction

The genetic variations resulting in the substitution of one nucleotide for another in

a DNA sequence are called single-nucleotide polymorphisms, or SNPs (Figure

1.1). For a variation to be considered a SNP, it must occur in at least 1 per cent of

the population (Botstein, Risch 2003).
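
The 1 per cent criterion can be made concrete with a short calculation. The sketch below is only an illustration: it assumes a biallelic site, reads the criterion as a minor allele frequency (MAF) of at least 0.01, and uses hypothetical genotype counts and function names that do not come from any cited study.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Minor allele frequency at a biallelic site, from genotype counts.

    n_aa, n_ab and n_bb are the numbers of individuals with genotypes
    AA, AB and BB (A and B are hypothetical labels for the two alleles).
    """
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


def is_snp(n_aa: int, n_ab: int, n_bb: int, threshold: float = 0.01) -> bool:
    """Apply the 1 per cent criterion (assumed here to mean MAF >= 0.01)."""
    return minor_allele_frequency(n_aa, n_ab, n_bb) >= threshold


# Hypothetical sample of 1000 individuals: 970 AA, 29 AB, 1 BB.
maf = minor_allele_frequency(970, 29, 1)
print(f"MAF = {maf:.4f}; counts as a SNP: {is_snp(970, 29, 1)}")
```

With these made-up counts the minor allele accounts for 31 of 2000 chromosomes, so the site passes the threshold; a variant seen once in the same sample would not.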

Figure 1.1 A single nucleotide difference between individuals 1 and 2, depicting a SNP.

These DNA base pair polymorphisms are of two types: transitions, which involve

substitution of a purine by another purine (A ↔ G) or a pyrimidine by another

pyrimidine nucleotide (C ↔ T); and transversions, which involve substitution of a

purine by a pyrimidine or vice versa.
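
The distinction between the two types can be expressed in a few lines of code. This is a minimal sketch; the function name substitution_type is chosen here for illustration only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Enumerate all 12 possible substitutions and count each type.
counts = {"transition": 0, "transversion": 0}
for ref in "ACGT":
    for alt in "ACGT":
        if ref != alt:
            counts[substitution_type(ref, alt)] += 1
print(counts)
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions.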

1.2.2 Classification of SNPs

SNPs are classified based on their genomic location into coding and non-coding

SNPs. Non-coding SNPs may occur in the promoter region of the gene, within

introns, 5´- and 3´- untranslated regions, and intergenic regions (Figure 1.2).

Coding SNPs, which occur in the coding (exonic) region of genes, are of two

types: synonymous and non-synonymous. Synonymous SNPs change a codon

specifying an amino acid into another that codes for the same amino acid.

Therefore, there is no change in the amino acid sequence of the protein. Non-

synonymous SNPs are further sub-classified into missense and nonsense SNPs.

While missense SNPs change the codon specifying an amino acid to another

specifying a different amino acid, nonsense SNPs change the codon specifying an

amino acid to a stop codon.
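
The classification of coding SNPs described above can be sketched with the standard genetic code. The code below is an illustration under stated assumptions: the function name classify_coding_snp is hypothetical, codons are given on the coding strand as DNA, and reference codons that are already stop codons are rejected because the scheme above does not cover them.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous', 'missense' or 'nonsense' for a one-base change."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        raise ValueError("codons must be three letters from A, C, G, T")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == "*":
        raise ValueError("reference codon is a stop codon; not covered here")
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_snp("GAG", "GTG"))  # Glu -> Val
print(classify_coding_snp("TGG", "TGA"))  # Trp -> stop
```

The three calls illustrate one substitution of each class: a silent change within the glutamic acid codons, a glutamic acid to valine replacement, and a tryptophan codon converted to a stop codon.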

Figure 1.2 Classification of SNPs according to their genomic location.

While the majority of SNPs reside in intergenic regions, genic SNPs comprise

~ 7 million SNPs. Table 1.1 shows the distribution of genic SNPs in the human

genome (dbSNP build 130; May 03, 2009).

Table 1.1 Distribution of genic SNPs in the human genome

Genomic Region    SNP Count    Gene Count
Intron            6,671,226    20,818
UTR                 200,750    17,830
Missense            110,927    20,411
Synonymous           78,310    18,703
Nonsense              3,536     2,431

1.2.3 SNP databases

Today, the primary resource for SNPs is dbSNP, available at NCBI (Sherry et al.

2001), which currently contains more than 17 million SNPs (dbSNP build 130).

Information on disease-associated SNPs is available from several databases:

Human Gene Mutation Database (HGMD) (Stenson et al. 2009) collates known

gene mutations responsible for human inherited diseases; Genetic Association

Database (GAD) (Becker et al. 2004) is an archive of human genetic association

studies of complex diseases and disorders; the Online Mendelian Inheritance in Man

(OMIM) database is a catalog of inherited genetic disorders mapped to

human genes, and of highly penetrant but rare (MAF < 0.01) mutations (Hamosh et

al. 2005;Rashbass 1995); Swiss-Prot classifies SNPs into disease (disease-

associated SNPs) and polymorphisms (benign SNPs) (Boeckmann et al. 2003);

Human Genome Variation Database (HGVbase) (Fredman et al. 2004) provides a

centralized compilation of summary level findings from genetic association

studies, thus facilitating research into DNA sequence variation and human

phenotypes.

Databases with frequency information of SNPs across different human

populations are also available. The ALelle FREquency Database (ALFRED)

(Cheung et al. 2000) is a database of allele frequencies of DNA variants for

multiple anthropologically defined populations. ALFRED has data on 18068

polymorphisms in 681 populations. Similarly, HapMap (International HapMap

Consortium 2003) characterizes over 3.1 million human SNPs genotyped in 270

individuals from four geographically diverse populations. IGVdb is another major

effort that has genotyped individuals from distinct Indian sub-populations (Indian

Genome Variation Consortium 2005;Indian Genome Variation Consortium 2008).

Various resources for visualisation of SNP locations and other genome

annotations have become available in the public domain. One such resource is the

UCSC Genome Browser (Kuhn et al. 2009), wherein a large array of annotations

has been assembled. The other primary genome resource is Ensembl (Birney et al.

2006;Hubbard et al. 2009). Within Ensembl, users can visualise variation in and

around genes, and their data annotations are of high quality and embedded into an

elegant interface. In addition, a large number (>750) of locus-specific databases

(LSDBs) are also available in the public domain (Appendix 1.1). Collectively,

these efforts may help researchers find the genetic basis of phenotypic variability

among individuals.

1.2.4 SNPs as molecular markers

The introduction of molecular markers in genetic analysis has transformed

research in genetic medicine. These molecular markers are the genetic variations

associated with phenotypic traits like predisposition to common diseases and

individual variations in drug responses. The most important property of a good

marker is that it should be sufficiently polymorphic that a randomly chosen

individual is likely to be heterozygous for it. This is quantitatively assessed by the “mean

heterozygosity” or “polymorphism information content” parameters assigned to

each marker (Botstein, Risch 2003). Moreover, the marker alleles should be easily

and inexpensively genotyped; and should be distributed throughout the genome.

Blood group variants, HLA variants, and electrophoretic variants of serum

proteins were used as molecular markers for human genetic analysis until the 1970s.

However, all these had several limitations (Strachan, Read 1999). Since the

discovery of restriction fragment length polymorphisms (RFLPs) in the human

genome, variable DNA sequences have been used almost exclusively as genetic

markers. But RFLPs suffer from the major handicap of limited polymorphism;

each has only two alleles, allowing a maximum probability of heterozygosity

equal to 0.5. More polymorphic markers such as variable number tandem repeats

(VNTRs), whose high polymorphism is due to variability in the number of

tandem repeats of length < 100 nt, are difficult to detect and are unevenly

distributed in the genome (Jeffreys et al. 1985). The polymorphism of

microsatellites is due to variations in the number of tandem repeats of short

sequence units typically ranging from two to four nucleotides in size. Many have

5-10 alleles and heterozygosity levels of 0.75 or greater. However, the mutation

rate of microsatellites is high, which is a concern for association and linkage

disequilibrium (LD) studies.
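
The heterozygosity figures quoted for biallelic and multi-allelic markers follow from a simple formula. The sketch below assumes random mating (Hardy-Weinberg proportions), under which the expected heterozygosity is one minus the sum of squared allele frequencies; the function names are chosen here for illustration.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i ** 2), assuming random mating."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


def max_heterozygosity(n_alleles: int) -> float:
    """Upper bound on H, reached when all alleles are equally frequent."""
    if n_alleles < 1:
        raise ValueError("need at least one allele")
    return 1.0 - 1.0 / n_alleles


# Biallelic markers (RFLPs, SNPs) versus markers with 5 or 10 alleles.
for k in (2, 5, 10):
    print(f"{k} alleles: maximum heterozygosity = {max_heterozygosity(k):.2f}")

# A biallelic marker with unequal (hypothetical) allele frequencies.
print(f"p = 0.9, q = 0.1: H = {expected_heterozygosity([0.9, 0.1]):.2f}")
```

A marker with two alleles cannot exceed a heterozygosity of 0.5, whereas markers with 5 to 10 equally frequent alleles can reach 0.8 to 0.9, which is consistent with the values of 0.75 or greater reported for many microsatellites.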

SNPs came into use at the turn of the twenty-first century (Venter et al. 2001;Sachidanandam

et al. 2001). The major advantage of SNPs is their abundance, theoretically

allowing detection of tighter linkages and associations. Millions of SNPs have

already been identified, corresponding to a frequency of about 1/300 bp. In

addition to frequency, SNPs have the benefit of being more stable and easily

amenable to automation for assessment in large scale experiments (Wang, Moult

2001;Goddard et al. 2000). Moreover, the extraordinary abundance of SNPs

largely offsets the disadvantage of their being biallelic and makes them the most

attractive molecular marker system developed so far.

1.3 SNPs in Disease Susceptibility

1.3.1 Types of genetic diseases

Genetic diseases, one of the major classes of human diseases, have been broadly

classified, based on their genetic etiology, into monogenic and polygenic

(complex) diseases.

In monogenic diseases (also referred to as ‘Mendelian’ or ‘single-gene’ disorders),

mutations in a single gene are both necessary and sufficient to produce the clinical

phenotype. For example, single amino acid substitutions of the human β-globin

gene (HBB) are responsible for β-thalassaemia, sickle cell anemia, and other

haemoglobinopathies, which are the most common genetic diseases of blood

(Doss, Sethumadhavan 2009). Sickle cell disease arises from a mutation

substituting thymine for adenine in the sixth codon of the beta-chain gene, GAG

to GTG, resulting in the substitution of glutamic acid by valine at position 6 of the Hb

beta chain.

On the other hand, common human diseases, such as obesity, diabetes,

cardiovascular disease, cancer and asthma, follow a more complicated inheritance

pattern, and are proving much harder to analyze. Difficulties are caused by

incomplete penetrance (a person carrying a predisposing allele may not exhibit the

disease phenotype); genetic heterogeneity (mutations in one or several genes may

result in identical phenotypes); and polygenic inheritance (a trait is controlled by

multiple gene interactions, such that each individual predisposing allele has a low

risk factor and shows weak correlation with the disease trait). In addition,

environmental factors may also play an important role in shaping disease

phenotypes (Yue et al. 2005). Such disorders are referred to as ‘complex’ or

‘multifactorial’ disorders.

The identification of genes contributing to the susceptibility and progression of

complex human diseases has become a major focus of genetics research in the

post-genomics landscape (Glazier et al. 2002).

1.3.2 Hypotheses regarding the role of genetic variants in complex disorders

Two contrasting hypotheses have been discussed in the past regarding the role of

genetic variants in complex disorders.

The ‘common disease-common variant’ (CD-CV) hypothesis posits the role of

common variants, with small to modest penetrance (Risch, Merikangas 1996), in

susceptibility to complex traits. The APOE*4 allele (Corbo, Scacchi 1999), which

confers increased susceptibility to Alzheimer disease (Saunders et al. 1993); the

CCR5∆32 allele, which confers resistance to infection by HIV-1 (Dean et al. 1996) and the F5

1691 G -> A allele (also known as FV Leiden) in deep vein thrombosis (Bertina et

al. 1994) are examples of common variants causing a common phenotype in

human populations.

On the other hand, the ‘common disease-rare variant’ hypothesis proposes that a

significant proportion of the inherited susceptibility to common complex diseases

may be due to the summation of the effects of a series of low frequency variants

of a variety of different genes, each conferring a moderate but readily detectable

increase in relative risk (Bodmer 1999;Frayling et al. 1998). An important role for

rare variants in inherited multifactorial susceptibility to colorectal cancer has been

suggested by the effects of rare, highly penetrant variants in the APC gene

(Bodmer 1999;Frayling et al. 1998).

On the whole, concerted efforts are being made by various researchers to

determine the significance of common and rare variants in complex disorders.

1.4 SNPs in Pharmacogenomics

There is wide variability in the response of individuals to standard doses of drug

therapy. This is an important problem in applied medicine, where it may lead to

therapeutic failures or adverse drug reactions (Figure 1.3).

Figure 1.3 Different patients respond to drugs differently.

Knowledge of human SNPs promises to uncover the cause of this inter-individual

variability in drug response. Pharmacogenomics is a science that explores the

ways by which variations in genes may be used to predict a patient’s response to a

particular drug (Roses 2004).

1.4.1 SNPs associated with variable drug response

A number of variants have been shown to be associated with variable drug

response among individuals (Appendix 1.2). Mentioned below are a few examples

of SNPs shown to be associated with differential drug response.

• Inhaled corticosteroids (e.g., budesonide) are one of the most commonly

used therapeutic agents in treatment of asthma (Bruni et al. 2009;Humbert

et al. 2008). The gene encoding TBX21 (transcription factor T-box expressed

in T cells) has been implicated in asthma pathogenesis. The C allele of a

missense SNP in this gene (rs2240017, His33Gln) has been shown to be

associated with significant improvement in airway responsiveness to

budesonide in asthmatics (Tantisira et al. 2004).

• The β(2)-adrenergic receptor (ADRB2) is the target for β(2)-agonist drugs

used for bronchodilation in asthma and other respiratory diseases. The

genotype at nucleotide position 46 in the β(2)AR gene of asthmatic patients

has been shown to be significantly associated with their responder status to

salbutamol treatment (Bhatnagar et al. 2005).

• Inter-patient variability in blood pressure response to β-blocker

monotherapy is well-known. Two common β(1)-adrenergic receptor (ADRB1)

polymorphisms, rs1801252 (A allele) and rs1801253 (C allele),

have been associated with good antihypertensive response to metoprolol

and carvedilol in patients with hypertension (Johnson et al. 2003).

• Catechol-O-methyltransferase (COMT) is a strong candidate for

therapeutic response to antipsychotic medication. The G allele encoding valine

at position 158 in this gene (rs4680) was found to be over-represented in

poor responders to risperidone (Gupta et al. 2009).

• Therapy with statins lowers total and low-density lipoprotein (LDL)

cholesterol, and has proven to be highly effective for cardiovascular risk

reduction. However, there is wide variation in inter-individual response to

statin therapy. The C allele of a missense SNP (rs20455, Trp719Arg) in

kinesin-like protein 6 (KIF6) has been shown to be associated with

improved response to statin drugs, including atorvastatin, pravastatin,

rosuvastatin and simvastatin (Iakoubova et al. 2008).

• Inter-individual variability in pain relief by morphine has been shown to

be significantly (P<0.0001) associated with the SNP rs1799971 in the OPRM1

gene (encoding the µ-opioid receptor, which is the primary site of action for

morphine), with the A allele showing a better response to morphine (Campa et

al. 2008).

All these studies provide good insight into the role of SNPs in variable drug

response, and their plausible role in pharmacogenomics.

1.4.2 SNPs associated with adverse drug reactions

Some individuals develop adverse effects to a particular drug, while others do not.

This inter-individual variability has been attributed to the presence of SNPs

(Appendix 1.3). The following examples indicate the role of SNPs

as predictors of adverse drug reactions:

• 5-Fluorouracil (5-FU) is a drug used in the treatment of some types of

cancer, including colon, breast, stomach, and esophageal cancer.

Dihydropyrimidine dehydrogenase (DPYD) plays an important role in the

metabolism of 5-FU. The incidence of chemotherapeutic toxicity (moderate to

severe nausea and vomiting) has been found to be significantly higher in

gastric carcinoma and colon carcinoma patients carrying the C allele of SNP

rs1801265 and the G allele of SNP rs1801159 in DPYD (Zhang et al. 2007).

• The A allele of SNP rs2075252 and the T allele of SNP rs4668123 in the LRP2 gene

(Low-density lipoprotein receptor-related protein 2) were found to be

associated with higher incidence of ototoxicity (hearing loss) in patients

treated with cisplatin, a platinum-based chemotherapy drug used to treat

various types of cancers (Riedemann et al. 2008).

• Cyclosporine, one of the immunosuppressive drugs usually given in renal

transplant cases, has been found to be associated with gum hyperplasia in

some patients. This toxicity has been shown to be associated with the A allele

of SNP rs231775 in CTLA4 (Cytotoxic T-lymphocyte antigen 4) in these

cases (Kusztal et al. 2007).

• Methotrexate, an antifolate chemotherapeutic agent, is widely used, alone

or in combination with other drugs, in the treatment of a number of

hematologic malignancies (Robien et al. 2005;Stern, Raizer 2005;Sterba et

al. 2005;Colozza et al. 2006), especially non-Hodgkin’s lymphomas (NHL),

as well as non-malignant conditions (Grim et al. 2003;Choy et al. 2005).

However, its effectiveness is limited by toxicity against normal tissues,

particularly towards gastrointestinal epithelium, bone marrow and liver

(Gorlick, Bertino 1999;Ulrich et al. 2002). The T allele of SNP rs1801133 in

MTHFR has been associated with, and may explain, the toxicity and variable

outcome of methotrexate therapy in patients with NHL (Gemmati et al.

2007).

In view of the above observations, it seems reasonable to believe that SNPs might

play a crucial role in identifying inter-individual variability in drug response, as

well as decreasing the risk for unexpected toxicities.

1.5 SNPs in Population-Based Studies

Human subpopulations are known to differ in susceptibility to diseases,

response to drugs, and also in the allele frequency distribution of SNPs. For

example, in the United States, asthma prevalence and mortality are the highest

among Puerto Ricans and the lowest among Mexicans (Choudhry et al. 2006).

Similarly, variability in response to salbutamol and albuterol has been observed

among asthmatics in Indian and Puerto Rican populations, respectively (Kukreti et

al. 2005;Choudhry et al. 2005). Thus, when the population under study consists of

a mixture of two or more subpopulations that have different allele frequencies

(population admixture/ stratification) and disease risks, associations between

genotype and outcome may be spurious. Hence, it becomes crucial to uncover the

heterogeneity, if any, existing in the population under study.
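
How stratification produces a spurious association can be shown with a small worked example. The counts below are entirely hypothetical and are constructed so that carrying the allele has no effect on disease risk within either subpopulation; the helper odds_ratio is likewise introduced only for this sketch.

```python
def odds_ratio(case_carriers, control_carriers, case_noncarriers, control_noncarriers):
    """Odds ratio for disease in allele carriers versus non-carriers."""
    return (case_carriers * control_noncarriers) / (control_carriers * case_noncarriers)


# Each tuple: (case carriers, control carriers, case non-carriers, control non-carriers)
# Subpopulation A: 80% carriers, 20% disease risk regardless of carrier status.
subpop_a = (160, 640, 40, 160)
# Subpopulation B: 20% carriers, 5% disease risk regardless of carrier status.
subpop_b = (10, 190, 40, 760)
# Pooled sample that ignores the population structure.
pooled = tuple(a + b for a, b in zip(subpop_a, subpop_b))

print(f"Odds ratio within A: {odds_ratio(*subpop_a):.2f}")
print(f"Odds ratio within B: {odds_ratio(*subpop_b):.2f}")
print(f"Odds ratio, pooled:  {odds_ratio(*pooled):.2f}")
```

Within each subpopulation the odds ratio is 1, yet the pooled sample gives an odds ratio above 2, simply because the subpopulation with the higher allele frequency also has the higher disease risk.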

Two different views have been proposed relating the distribution of a SNP across

populations with disease-association. The CD-CV hypothesis, as previously

discussed in section 1.3.2, proposes that risk alleles for common complex diseases

are common (i.e. ≥ 5%). Thus, they are likely to be found in multiple human

populations, rather than being population specific (Lander 1996;Chakravarti

1999;Reich, Lander 2001;Pritchard, Cox 2002). However, Ioannidis et al, from a

meta-analysis of disease-association studies, proposed that the frequencies of

disease-associated alleles show large heterogeneity between races (Ioannidis et al.

2004). Thus, the question of whether risk alleles discovered in one population

account for disease prevalence across all human populations still remains

unanswered.

The International HapMap Project (International HapMap Consortium 2003) has,

to date, genotyped over 3.1 million human SNPs in individuals from four

geographically diverse populations (Yoruba in Ibadan, Nigeria; Japanese in

Tokyo; Han Chinese in Beijing; and Utah residents with northern and western

European ancestry). This may help in understanding the patterns of common

genetic diversity in the human genome in order to accelerate the search for the

genetic basis of human diseases.

Moreover, the Indian Genome Variation (IGV) Consortium (Indian Genome

Variation Consortium 2005;Indian Genome Variation Consortium 2008) has

shown the existence of heterogeneity among 55 Indian populations, and also

identified clusters of sub-populations with substantial genetic homogeneity. This

may be instrumental in careful design and analysis of association studies across

Indian populations. As an example, a strong allelic/genotypic association of a

missense SNP (rs1042713; R16G) in the β2-adrenergic receptor (ADRB2) with

response to salbutamol in the Indian population has been observed (Kukreti et al.

2005). However, the SNP shows substantial heterogeneity across Indian

populations, with AA genotype frequencies ranging from 0.048 in AA-C-IP4 to

0.69 in DR-S-LP3 (Figure 1.4), which could be associated with the variable

response to salbutamol among Indian asthmatics (Kukreti et al. 2005).

Figure 1.4 Color composite map of genotype frequency of rs1042713. Red:

genotype frequency of AA (poor responder); blue: genotype

frequency of GG (good responder); green: genotype frequency of

AG (Indian Genome Variation Consortium 2008).

These studies provide a framework for designing future epidemiological studies to

identify populations with differential disease susceptibility and variable response

to a given drug or a class of drugs.

1.6 Functional Missense SNPs

Missense SNPs, which lead to substitution of an amino acid by another amino

acid, are the most pertinent to human inherited diseases (Stenson et al. 2003).

According to the 2009 release of HGMD (Krawczak et al. 2000), non-

synonymous SNPs account for more than half of all genetic polymorphisms

known to cause inherited diseases (Table 1.2).

Table 1.2 Relative frequencies of types of mutations underlying disease phenotypes.

Change                           Number    % of total
Missense/nonsense                 49806       56.4
Splicing                           8548        9.7
Regulatory                         1459        1.7
Small deletions                   14063       15.9
Small insertions                   5751        6.5
Small indels                       1295        1.5
Repeat variations                   267        0.3
Gross insertions/duplications      1053        1.2
Complex rearrangements              772        0.9
Gross deletions                    5303        6.0

However, a large number of missense SNPs are ‘benign’ and have minimal impact

on the structure or function of protein; while some are ‘functional’, and may lead

to significant changes in protein properties. These functional missense SNPs may

exert their effect on human physiology through various mechanisms, including

modification of splice sites; inactivation of protein functional sites, such as

catalytic, ligand-binding and post-translational modification sites; alteration of

protein solubility and stability; or affecting the interactions of proteins, thereby

perturbing protein functions such as the kinetic parameters of enzymes, signal

transduction activities of transmembrane receptors, and architectural roles of

structural proteins (Rebbeck et al. 2004). For instance, a missense SNP

(Cys260Tyr) associated with hereditary hemochromatosis in HLA-H protein

disrupts a critical disulphide bond (Figure 1.5).

Figure 1.5 A missense SNP (C260Y) in HLA-H protein disrupts a critical

disulphide bond, thereby perturbing the structure of this protein.

1.7 Identification of Functional Missense SNPs: Finding Needles in a Haystack?

As the amount of genomic information that is available greatly exceeds the

information about the function of variants, it becomes imperative to develop

methods that prioritize the genetic variants to be genotyped in genetic studies. The

increasingly large number of SNPs deposited in the public databases has provided

a platform to perform genome-wide computational analyses of these SNPs and

their relationship with inter-individual variation in susceptibility to complex

disorders and response to drugs.

The initial efforts to understand the patterns of SNPs in the coding regions of

genes were made by Cargill et al (Cargill et al. 1999) and Halushka et al

(Halushka et al. 1999). Since then, progress in comparative genomics, and

evidence that functional elements tend to lie in conserved regions (Carlton et al.

2006), have enabled, to some extent, the prediction of SNPs that are likely to be

deleterious for the structure or function of proteins, and that may therefore lead to

disease. Over the last decade, several methods have utilized the sequence

conservation of a particular amino acid within a family of sequences to predict

whether an amino acid substitution affects protein function. SIFT (Sorts Intolerant

from Tolerant) (Ng, Henikoff 2001;Ng, Henikoff 2003), for example, presumes

that critical amino acids will be conserved in the protein family, and so changes at

well-conserved positions are predicted as deleterious. Such evolutionary

information, integrated with other structural and biochemical properties of

functional SNPs, may permit comprehensive prediction of functional SNPs.
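
The general idea behind conservation-based prediction can be sketched as follows. This is not the SIFT algorithm: it is a deliberately simplified illustration that scores one alignment column by normalized Shannon entropy, and the function names, the toy columns and the 0.8 threshold are all assumptions made for this example.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Conservation of one alignment column: 1.0 is invariant, lower is more variable.

    Computed as 1 minus the Shannon entropy of the residues, normalized by
    log2(20), the maximum entropy for 20 amino acids. Gaps ('-') are ignored.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
    return 1.0 - entropy / math.log2(20)


def predict_substitution(column: str, new_residue: str, threshold: float = 0.8) -> str:
    """Toy rule: a residue never seen at a well-conserved position is 'deleterious'."""
    observed = set(column.upper()) - {"-"}
    if new_residue.upper() in observed:
        return "tolerated"
    return "deleterious" if column_conservation(column) >= threshold else "tolerated"


# Hypothetical columns taken from an alignment of eight homologous sequences.
invariant_cys = "CCCCCCCC"
variable_hydrophobic = "AVLIAVLI"
print(predict_substitution(invariant_cys, "Y"))
print(predict_substitution(variable_hydrophobic, "M"))
```

Real methods weight sequences, account for physicochemical similarity between residues and calibrate their thresholds on known variants, none of which is attempted here.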

The Crescendo method identifies residues that have a higher degree of

conservation than would be expected on the basis of the local structural

environment (Chelliah et al. 2004). Thereafter, known SNPs are mapped onto the

structure of the proteins, and based on the assumption that these additional

restraints may be due to functions mediated by interactions with other molecules,

their effect on function is predicted. For instance, if a residue is a known catalytic

residue or is close to a known binding site, substitutions of this residue are

predicted to affect function.

The LS-SNP database (Karchin et al. 2005) contains predictions for missense SNPs,

made with machine learning techniques using features of protein structure, sequence

and evolution, and specifically identifies SNPs that interfere with the formation of

domain–domain interfaces or have an effect on protein–ligand binding.

Predictions of the PolyPhen method rely on empirical rules derived from the

sequence, phylogenetic and structural information characterizing the substitution

(Ramensky et al. 2002;Sunyaev et al. 2001).

Several methods have estimated the loss or gain in energy of the protein structure

due to single amino acid substitutions. The Site Directed Mutator (SDM) method

uses a set of conformationally constrained environment-specific substitution tables

(ESSTs) to calculate the difference in the stability scores, analogous to the

difference in free energy, for the folded and unfolded state for the wild-type and

mutant protein structures (Topham et al. 1997). The I-Mutant2.0 method

(Capriotti et al. 2005) uses both sequence and structural information in support

vector machine (SVM) learning to predict protein stability changes upon single

amino acid substitutions, as does the MUpro method (Cheng et al. 2006).

A widely used computational technique to predict functional residues, based upon

the evolutionary conservation of sequences, is the "evolutionary trace" (ET)

method (Innis et al. 2000;Lichtarge et al. 1996;Lichtarge, Sowa 2002). In this

method, each residue is ranked by evolutionary importance by comparison with

groups of proteins which originate from a common node in a phylogenetic tree.

The information obtained by the ET method can then be mapped on to known

protein structures, thus allowing us to identify clusters of important amino acids.

All these studies indicate that the need for identification of functional genetic

variants among a vast number of irrelevant ones in which they are immersed has

translated into a need for sophisticated tools to effectively prioritize genetic

variants underlying complex diseases.

1.8 Objectives of the Study

In this thesis, we comprehensively study all human missense SNPs reported in

public databases for their functional and demographic significance in complex

disorders and pharmacogenetics. Specifically, the objectives of this work

were as follows:

1. To identify protein coding exons likely to harbor disease-associated missense

SNPs.

2. To develop an automated classification scheme capable of distinguishing

between functional and benign missense SNPs.

3. To prioritize SNPs for disease-association and pharmacogenetics in Indian

populations.

4. a) To develop a comprehensive database for the analysis of protein coding

exons.

b) To develop a web-server for the identification of functional missense SNPs.
