http://www.iaeme.com/IJCET/index.asp 54 [email protected]
International Journal of Computer Engineering & Technology (IJCET)
Volume 8, Issue 5, Sep-Oct 2017, pp. 54–66, Article ID: IJCET_08_05_007
Available online at
http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=5
Journal Impact Factor (2016): 9.3590(Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976–6375
© IAEME Publication
A REVIEW ON BIOCOMPUTING APPROACHES
AND TOOLS FOR IDENTIFICATION OF
SINGLE NUCLEOTIDE POLYMORPHISMS
Neelofar Sohi
Assistant Professor, Department of Computer Engineering,
Punjabi University, Patiala, India
Amardeep Singh
Professor, Department of Computer Engineering,
Punjabi University, Patiala, India
ABSTRACT
Single Nucleotide Polymorphisms (SNPs) are the most common source of genetic
variations. There has been enormous research in the area of Biocomputing and
Bioinformatics on identification and analysis of SNPs. A large number of methods
have been developed for their identification ever since the importance of SNPs in
understanding of diseases emerged with the completion of Human Genome Project.
This paper reviews Single Nucleotide Polymorphisms, their importance, their
association to diseases, Biocomputing approaches and tools available for their
identification up to 2017.
Keywords: Single Nucleotide Polymorphisms, SNPs, Biocomputing, Genetic
Variations, SNP identification.
Cite this Article: Neelofar Sohi and Amardeep Singh, A Review on Biocomputing
Approaches and Tools for Identification of Single Nucleotide Polymorphisms.
International Journal of Computer Engineering & Technology, 8(5), 2017, pp. 54–66.
http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=5
1. INTRODUCTION
The completion of the Human Genome Project (HGP) was a major breakthrough in the history of
genetics. HGP, the world’s largest international collaborative research project, was launched
in 1990 by the US Department of Energy and the National Institutes of Health (NIH) with the
aim of sequencing the complete human genome. The project was completed in 2003, having
sequenced the human genome’s 3.3 billion base pairs and revealed that there are about 20,500
human genes. The information furnished by HGP opened new avenues for understanding diseases,
their genetic basis and the genetic variants responsible for them. This understanding of the
connection between sequence variations and phenotype can lead to better diagnosis, prevention
and treatment of diseases [1]. Sequence analyses show that 99% of
genome sequences of different individuals are identical [2]. The remaining difference of about
1% is due to genetic variations, some of which lead to inherited diseases. Genetic variations
are of various types, such as Single Nucleotide Polymorphisms (SNPs), insertions/deletions,
block substitutions, inversions, variable number of tandem repeat sequences (VNTRs) and copy
number variations (CNVs). Single nucleotide polymorphisms are the most common source of
genetic variation, accounting for 90% of the sequence differences and about half of the known
human inherited diseases [3]. The Database of Short Genetic Variations, dbSNP [4], hosted by
the National Center for Biotechnology Information (NCBI), lists about 349,313,504 human SNPs,
of which 24,983,387 are validated (dbSNP build 150, February 2017, available at
ftp://ftp.ncbi.nlm.nih.gov/snp/).
1.1. Motivation for Work
There has been an explosion of research on SNPs and their identification, and many review
papers on SNPs are available. However, a systematic review covering SNPs from various aspects
has been lacking in the last five years. This paper reviews SNPs, their importance, their
association with diseases, and the Biocomputing approaches and tools available for their
identification up to 2017.
1.2. Single Nucleotide Polymorphisms
Single Nucleotide Polymorphisms are genetic variations that occur when a single nucleotide,
i.e. Adenine (A), Guanine (G), Cytosine (C) or Thymine (T), in the genome sequence is
altered. A single nucleotide polymorphism results in the substitution of one nucleotide for
another in a DNA sequence.
Figure 1 Single Nucleotide Polymorphism
SNPs are mostly biallelic polymorphisms, that is, the nucleotide identity at these positions
is constrained to one of two possibilities in humans [5]. There are two types of single
nucleotide substitution: transitions, where the substitution is between purines (A, G) or
between pyrimidines (C, T), and transversions, which involve substitution of a purine by a
pyrimidine or vice versa.
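The distinction between transitions and transversions can be illustrated with a minimal Python sketch. The function name `substitution_type` is hypothetical and introduced here only for illustration; it is not part of any tool reviewed in this paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, an A to G change stays within the purines and is reported as a transition, whereas an A to C change crosses classes and is reported as a transversion.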
1.2.1. Classification of Single Nucleotide Polymorphisms
SNPs are classified based on their genomic location into coding and non-coding SNPs.
Coding SNPs occur in the coding region of the gene, which takes part in protein formation.
Coding SNPs are of two types: synonymous SNPs and non-synonymous SNPs. Synonymous SNPs change
a codon specifying an amino acid into another codon that codes for the same amino acid, hence
no change occurs in the amino acid sequence of the protein. Non-synonymous SNPs change the
amino acid sequence of the protein and are themselves of two kinds. Missense SNPs change the
codon specifying an amino acid into one that produces a different amino acid, whereas nonsense
SNPs change the codon into a stop codon, which terminates the process of protein formation and
leads to an incomplete or non-functional protein. Non-coding SNPs may occur in introns, the
promoter region, the 5’ and 3’ untranslated regions and intergenic regions.
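The classification of coding SNPs described above can be sketched in Python using the standard genetic code. This is only an illustrative sketch: the names `CODON_TABLE` and `classify_coding_snp` are hypothetical, codons are assumed to be given on the coding DNA strand, and a change that removes a stop codon is simply reported as missense here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous', 'missense' or 'nonsense' for a single-base codon change."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein sequence unchanged
    if alt_aa == "*":
        return "nonsense"     # new stop codon, truncated protein
    return "missense"         # different amino acid
```

For instance, GAA to GAG leaves glutamic acid unchanged (synonymous), GAG to GTG replaces glutamic acid by valine (missense), and TAC to TAA introduces a stop codon (nonsense).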
Figure 2 Classification of Single Nucleotide Polymorphisms
1.3. Functional Effects of Single Nucleotide Polymorphisms
SNPs may occur in the coding regions of genes, the non-coding regions of genes or the
intergenic regions (regions between genes). A SNP is said to be functional if it affects
factors such as splicing, transcription and protein structure, hence causing a phenotypic
difference between members of the species. A variant may affect the expression or translation
of a gene product, either by interrupting a regulatory region or by interfering with normal
splicing and mRNA function [6]. About 3% to 5% of human SNPs are functional.
1.3.1. Effect of Coding region SNPs
Coding region SNPs are of more interest to scientists as they are more likely to alter the
biological function of a protein. An altered protein in many cases is responsible for disorders
and abnormalities in humans. Synonymous changes in coding regions are more common than
non-synonymous changes. Non-synonymous SNPs account for more than half of all genetic
polymorphisms known to cause inherited diseases (as per 2009 release of HGMD) [7].
1.3.2. Effect of Non-Coding region SNPs
The role of non-coding SNPs is much less studied. Many non-coding SNPs that reside in the
non-coding sequences (e.g. introns, the promoter region, the 5’ and 3’ untranslated regions)
surrounding protein coding genes have been shown to have profound effects on the expression
of neighbouring genes and may cause disease phenotypes. These are called regulatory SNPs.
Such SNPs may affect gene expression by interrupting a regulatory region [8]. Non-coding SNPs
are also linked to a higher risk of cancer [9]. Some studies suggest that coding region SNPs,
specifically non-synonymous SNPs, show greater functional effect and association with
diseases, whereas other studies associate non-coding SNPs with these effects.
1.4. Association of Genetic Variations and Diseases
There are two classes of genetic variations, viz. mutations and polymorphisms. DNA sequence
variations that do not lead to diseases are termed polymorphisms; to be termed ‘normal’ they
must occur in at least 1% of the population. Many polymorphisms are found in genes and
influence characteristics like hair colour, eye colour and height. DNA sequence variations
that may lead to diseases are termed mutations. More than 99% of human genome sequences are
identical; only 1% is different, which may be responsible for predisposition to diseases and
variability in response to drugs [5]. A study to establish the relationship between a disease
and regions of the genome is termed an association study. If a
variation is found to be causative of a disease, then the occurrence of that variation would
be higher in the group of cases (individuals with the disease) than in the group of controls
(individuals not affected by the disease). Penetrance and expressivity are the two parameters
employed to find association. Penetrance is defined as the percentage of individuals having a
particular mutation who exhibit clinical signs or phenotype of the associated disease.
Expressivity is defined as the extent to which a genotype is phenotypically expressed in
individuals [10].
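The case-control comparison and the definition of penetrance given above can be made concrete with a small Python sketch. The allelic odds ratio used here is a standard epidemiological measure and is an assumption of this sketch rather than a statistic named in the text; the function names and the counts in the example are hypothetical.

```python
def allele_odds_ratio(case_alt: int, case_ref: int,
                      control_alt: int, control_ref: int) -> float:
    """Odds of carrying the variant allele in cases relative to controls."""
    if case_ref == 0 or control_alt == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (case_alt * control_ref) / (case_ref * control_alt)


def penetrance(affected_carriers: int, total_carriers: int) -> float:
    """Percentage of mutation carriers who show the associated phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    return 100.0 * affected_carriers / total_carriers
```

With 60 variant and 40 reference alleles among cases and 30 variant and 70 reference alleles among controls, the odds ratio is 3.5, i.e. the variant is over-represented in cases. If 45 of 60 carriers show the phenotype, the penetrance is 75%.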
1.5. Importance of SNP Identification
Genetic and environmental factors are the key factors responsible for diseases. Environmental
exposures play a larger role in human phenotypic variation than genetic variation, but they
are fundamentally more difficult to measure. DNA is stable throughout life, with a single
physical chemistry that enables generic approaches for measurement. Single nucleotide
polymorphisms are the most common source of genetic variation and underlie differences in a
person’s susceptibility and predisposition to inherited diseases. Hence, identification of
SNPs may lead to early prevention, diagnosis and treatment of diseases. Knowledge of SNPs
also helps in understanding the variation among individuals in drug response.
Pharmacogenomics is the study of how variations in genes are related to an individual’s
response to a particular drug; identification of SNPs can therefore be useful for
personalised medicine [11].
Genetic diseases, a major class of human diseases, are broadly classified into monogenic and
polygenic diseases. Monogenic diseases are those where a single gene mutation is responsible
for the disease. In the 1990s, the focus of research shifted from monogenic diseases towards
the analysis of complex multifactorial diseases like osteoporosis, diabetes, cardiovascular
diseases, inflammatory diseases, psychiatric disorders and most cancers. These diseases are
polygenic, with multiple gene variants each contributing a small effect to the disease. They
are very hard to analyse, and the genes involved are harder still to locate (Collins et al.,
1998). Complex polygenic diseases occur at a much higher frequency and consequently are a
great social burden [12]. Therefore, a highly reliable marker is required for identification
of the genes involved. Markers are genetic variations associated with phenotypic conditions
which can be used to locate genes associated with a disease [11]. SNPs emerged as an
appropriate marker system because they are polymorphic, stable, abundant, amenable to
automation, easy and inexpensive to genotype, and capable of having direct functional
consequences besides being surrogate markers. Therefore, by using SNP markers, it is often
possible to test for association between a functional variant and a phenotype directly [12].
They are used in genome-wide association studies (GWAS) for gene to phenotype mapping [11].
SNPs have also been used for human identification and for reconstructing the history of the
genome in population studies, for which they are suitable because they are abundant,
evolutionarily stable and inherited from one generation to the next. SNP detection has been
used in forensic genetics, where it can be used to evaluate rare, degraded and nearly
fossilized nucleic acid evidence. Studying the frequency and distribution of SNPs can lead to
information on the evolution of the species [5]. SNPs have therefore been used in the study
of evolution, race, migration and lineage [13], and for population classification besides
using genes [14, 15]. A classification procedure was proposed for two populations using eight
SNP markers selected from a sample of 641 collected from an Epidemic Society of Shanghai [13].
SNPs having a high Minor Allele Frequency (0.249 < MAF < 0.355) were selected randomly from
eight different chromosomes. The accuracy of the classification procedure depends upon the
number of SNP markers.
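Selection of markers by allele frequency, as in the classification study cited above, can be sketched as follows. This is a minimal illustration and not the procedure of [13]: the genotype encoding (two-character strings such as "AG"), the function names and the example data are assumptions, and the frequency computed is the minor allele frequency of a biallelic SNP.

```python
from collections import Counter


def minor_allele_frequency(genotypes) -> float:
    """Frequency of the rarer allele among diploid genotypes such as 'AA', 'AG', 'GG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / total


def select_by_maf(snps: dict, low: float, high: float) -> list:
    """Names of SNPs whose minor allele frequency lies strictly between low and high."""
    return [name for name, genotypes in snps.items()
            if low < minor_allele_frequency(genotypes) < high]
```

Applied to a hypothetical panel, `select_by_maf(panel, 0.249, 0.355)` keeps only those SNPs whose rarer allele is common enough to be informative in both populations.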
1.6. Challenges in SNP Identification
SNP identification, also termed characterisation, aims at identifying, analysing and
annotating SNPs. SNP annotation aims at predicting the effect and function of SNPs [6].
There is a lack of SNP identification methods which also provide annotation. Several methods
exist for identifying SNPs, though there is no global approach to identify all types of SNPs.
Most of the methods focus on coding region SNPs; SNPs lying in non-coding regions are ignored
owing to poor understanding of their effects, even though such SNPs may affect gene
expression by interrupting a regulatory region [8]. Another issue is that a SNP may be
present in an individual but may or may not have a deleterious effect. A SNP is more or less
deleterious depending upon whether the allele containing it is more or less expressed.
Penetrance and expressivity, as defined in Section 1.4, are the parameters used to analyse
this aspect. A SNP may lead to an inherited predisposition to a disease but have reduced or
incomplete penetrance in a particular individual. The penetrance of a mutation may be
affected by age and gender in addition to other factors [10]. So, a SNP identification method
must also distinguish between disease-causing and functionally neutral SNPs. SNPs are
excellent markers for locating candidate genes associated with a disease [11]. Hence, a SNP
identification method can be enriched with the capability to identify candidate genes
containing causal variants associated with a disease.
2. BIOCOMPUTING APPROACHES FOR IDENTIFICATION OF SNPS
A general approach followed for identification of SNPs is to create a catalogue of variants
in human genes and test them for association with diseases. The 1000 Genomes Project, which
set out to provide a comprehensive description of common human genetic variation,
reconstructed the genomes of 2,504 individuals from 26 populations, characterizing a broad
spectrum of genetic variation: in total over 88 million variants (84.7 million SNPs, 3.6
million short insertions/deletions and 60,000 structural variants). This project provides a
benchmark for the distribution and understanding of human genetic variation and of the
processes that shape genetic diversity and disease biology [16]. Another approach is to
generate a genome-wide high resolution map of known polymorphisms and test them for
association with diseases. The goal of the International HapMap Project is to develop a map
and determine the common patterns of DNA sequence variation in the human genome in order to
carry out candidate-gene, linkage based and genome-wide association studies. The
International HapMap Consortium is developing a map of these patterns across the genome by
determining the genotypes of one million or more sequence variants, their frequencies and
the degree of association between them, in DNA samples from populations with ancestry from
parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants
that affect common disease and the development of diagnostic tools [17]. Complex polygenic
diseases are believed to have multiple gene variants contributing to disease, and it remains
a question whether mutations at any one gene are necessary and sufficient to lead to the
phenotype. One prominent approach to locate SNPs underlying complex traits is ‘linkage
disequilibrium’, where the physical distance between two polymorphisms is treated as the
arbiter of the degree of association between them. Each SNP has a unique history; therefore
the time of origin and the relative frequencies of the SNPs also affect linkage
disequilibrium between two SNPs [8].
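The degree of association between two SNPs can be quantified as sketched below. The text does not give a formula, so this sketch assumes the standard population-genetics definitions of the linkage disequilibrium coefficient, D = p_AB - p_A * p_B, and of the squared correlation r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)), where p_A and p_B are allele frequencies at the two loci and p_AB is the frequency of the haplotype carrying both alleles. The function name is hypothetical.

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B on their own
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b  # deviation from independent assortment
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, r_squared
```

When the two alleles always occur together (p_AB = p_A = p_B = 0.5) the sketch gives D = 0.25 and r^2 = 1, while independent loci (p_AB = p_A * p_B) give D = 0 and r^2 = 0.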
3. BIOCOMPUTING TOOLS FOR IDENTIFICATION OF SNPS
With SNPs being the most common form of genetic variation, there is great interest in SNP
discovery. To serve this need, a large number of public databases are available online which
offer data about single nucleotide polymorphisms (SNPs), other forms of genetic variations,
diseases and genome annotations. Table 1 provides a summary of some of these popular
databases. A number of Biocomputing methods and tools for identification of SNPs exist. Many
reviews on various aspects of SNP identification, SNP identification software and sequencing
technologies are available [6, 18-24]. Some of the prominent SNP identification tools and
their links are listed in Table 2. A few state-of-the-art techniques for identification of
SNPs are presented in the following sections.
3.1. PolyPhred
Figure 3 Flowchart for PolyPhred
Phred is a base-calling program with improved accuracy and lower error rates than ABI
sequencing software [25]. It assigns an error probability to each called base and uses a
four-phase procedure to determine a sequence of base calls from the processed trace. In the
first phase, the idealized peak locations (predicted peaks) are determined. In the second
phase, observed peaks are identified in the trace. In the third phase, observed peak
locations are matched to the predicted peak locations, omitting some peaks and splitting
others. In the final phase, unmatched observed peaks are checked to see whether they
represent a base that could not be assigned to a predicted peak; if so, the corresponding
base is inserted into the read sequence. Phrap is the sequence alignment program used to
align the subject sequence against the reference sequence. PolyPhred is an automated program
for identification of SNPs, used in conjunction with Phred, Phrap and Consed. It reads the
normalised peak areas and quality values obtained from Phred for each position in the
sequence. If a second peak is detected at a base and there is a reduction in the peak height,
PolyPhred calls it a heterozygous site. Consed is an editing and viewing tool that can be
used by the analyst for editing and evaluating the traces [26]. As part of the Japanese
Millennium Genome Project, Haga et al. (2002) identified a total of 190,562 genetic
variations, consisting of 174,269 SNPs and 16,293 insertions/deletions, from DNA samples of
24 Japanese individuals using PolyPhred [26] for SNP identification [27]. Data and methods of
the study are available at the project web site (http://snp.ims.u-tokyo.ac.jp).
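The error probability that Phred assigns to each base is conventionally reported as a quality score Q = -10 log10(P). The relationship is not spelled out in the text above; the sketch below shows the conversion in both directions, with hypothetical function names.

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Probability that a base call with the given Phred quality score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability: float) -> float:
    """Phred quality score corresponding to a base-call error probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(probability)
```

A score of 20 therefore corresponds to one expected error in 100 base calls, and a score of 30 to one in 1,000, which is why quality thresholds in the range of 20 to 30 appear frequently in SNP calling.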
3.2. Newberg’s Technique
Figure 4 Flowchart for Newberg’s Technique
Phred is used for base-calling to identify the bases in the sequence. Next, Phrap is used to
align the sequence against the reference sequences. Phrap classifies sequences that do not
have sufficient similarity information as singlets and excludes them from the set of
assembled contigs. The SNP identification procedure consists of a series of four filters.
Filter 1 eliminates clusters of mismatches that occur in regions of low quality trace data;
it searches windows of 5, 10 or 20 bp around the single base-pair mismatch position. Filter 2
identifies the type of sequence mismatch as a base substitution or an insertion/deletion.
Filters 3 and 4 check the quality of each base call relative to its position and frequency in
a contig: filter 3 ignores mismatches in the first 100 bases, and filter 4 requires a
mismatch to occur in more than one sequence in a contig to be considered a high quality
candidate SNP. This eliminates the mismatches that could arise from copying errors [28].
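Filters 3 and 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the published implementation: each mismatch is assumed to be recorded as a tuple (contig position, read identifier, zero-based offset within the read), filter 4 is assumed to count only the mismatches that survive filter 3, and the function name `candidate_snps` is hypothetical.

```python
from collections import defaultdict


def candidate_snps(mismatches, skip_bases: int = 100, min_reads: int = 2) -> list:
    """Contig positions that pass the position and frequency filters.

    mismatches: iterable of (contig_position, read_id, read_offset) tuples,
                where read_offset is the zero-based position within the read.
    """
    reads_by_position = defaultdict(set)
    for contig_position, read_id, read_offset in mismatches:
        # Filter 3: ignore mismatches within the first `skip_bases` bases of a read.
        if read_offset < skip_bases:
            continue
        reads_by_position[contig_position].add(read_id)
    # Filter 4: the mismatch must be seen in more than one sequence of the contig.
    return sorted(position for position, reads in reads_by_position.items()
                  if len(reads) >= min_reads)
```

A mismatch seen in two reads well inside each read is retained, whereas a mismatch seen only once, or supported only by the first 100 bases of a read, is discarded.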
3.3. SNP Detector
Figure 5 Flowchart for SNP Detector
First, Phred is used to obtain base calls, quality scores and primary and secondary peak
information for each trace file. SIM, based on the Smith-Waterman algorithm, is used for
alignment of the subject sequence to a reference sequence. The Neighbourhood Quality Standard
is used to check that the variation site and each base in its flanking window exceed a
user-defined quality threshold. A variation is considered a true variation if this base and
each base in its 4 bp flanking region have a Phred quality score ≥ 25 and the sequence
similarity is ≥ 95%. The height of the secondary peak corresponding to a heterozygous allele
must be at least 30% of the height of the primary peak. A peak with a height less than 20% of
the primary peak is considered noise [29].
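The quality and peak-height criteria above can be sketched as two small checks. The sketch assumes that the 4 bp flanking region extends four bases on each side of the candidate site, and it labels secondary peaks between 20% and 30% of the primary peak as "ambiguous", a placeholder of this sketch since the text does not say how they are handled. The function names are hypothetical.

```python
def passes_nqs(qualities, site: int, flank: int = 4, min_quality: int = 25) -> bool:
    """Neighbourhood Quality Standard: the site and every base within `flank`
    positions on either side must reach `min_quality`."""
    start, end = site - flank, site + flank
    if start < 0 or end >= len(qualities):
        return False  # not enough flanking sequence to evaluate the site
    return all(q >= min_quality for q in qualities[start:end + 1])


def classify_secondary_peak(primary_height: float, secondary_height: float,
                            het_ratio: float = 0.30, noise_ratio: float = 0.20) -> str:
    """Interpret a secondary peak relative to the primary peak height."""
    if primary_height <= 0:
        raise ValueError("primary_height must be positive")
    ratio = secondary_height / primary_height
    if ratio >= het_ratio:
        return "heterozygous"
    if ratio < noise_ratio:
        return "noise"
    return "ambiguous"  # between the two thresholds; handling is not specified in the text
```

A candidate site would then be reported only if `passes_nqs` holds for its read and the secondary peak is classified as heterozygous.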
3.4. PolyScan
Figure 6 Flowchart for PolyScan
Phred is used for base-calling to identify the bases in the sequence. Next, the sequence is
aligned against the reference sequences using the cross_match program. SNPs are identified as
doublet peaks whose heights are half of those for homozygous individuals; a drop in the peak
height is the indicator of the presence of a variation. Two procedures, namely Horizontal
Scan and Vertical Scan, are performed: in Horizontal Scan, distance metrics are computed
within reads, and in Vertical Scan, peak height is computed [30].
3.5. VarDetect
Figure 7 Flowchart for VarDetect
A partitioning and re-sampling technique improves the base calling procedure. The main idea
is to detect the primary peak corresponding to the primary allele and the secondary peak
corresponding to the secondary allele. The presence of multiple peaks at a location indicates
the presence of a SNP. Next, the observed peak intensity ratio $Q_{o,i}$ is calculated from
the intensities of the various peaks at the $i$-th location as
$Q_{o,i} = \text{highest peak intensity} / \text{sum of all intensities}$. The vicinity peak
intensity ratio $Q_{v,i}$ is calculated relative to the two bases to the left and the two
bases to the right of the base call location, e.g. $Q_{v,3} = (I_1 + I_2 + I_4 + I_5)/4$.
SNPs can be predicted from the difference between the vicinity and observed peak intensity
ratios, $\delta = Q_{v,i} - Q_{o,i}$, called the detection value. If this difference is
significant, i.e. above a defined threshold value, it is indicative of a SNP. Next, the
CodeMap technique is used to convert chromatogram traces to numeric codes: homozygous bases
are converted into codes 0 and 2, while a heterozygous base is converted to 1 [31].
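The detection value can be sketched in Python as below. The sketch makes two assumptions that are not stated in the text: the terms I_1, I_2, I_4 and I_5 in the vicinity ratio are taken to be the observed peak intensity ratios of the four neighbouring positions, so that both ratios are on the same scale, and the threshold of 0.3 is an arbitrary illustrative value. The trace is assumed to be a list of four-channel intensity tuples, one per base position, and the function names are hypothetical.

```python
def observed_ratio(intensities) -> float:
    """Q_o: highest peak intensity divided by the sum of all channel intensities."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("intensities must sum to a positive value")
    return max(intensities) / total


def detection_values(trace, threshold: float = 0.3) -> list:
    """Return (position, delta, flagged) for every position with two neighbours per side.

    trace: list of per-position tuples of channel intensities (e.g. A, C, G, T).
    """
    q_o = [observed_ratio(position) for position in trace]
    results = []
    for i in range(2, len(q_o) - 2):
        # Q_v: mean observed ratio of the two bases on each side (assumed interpretation).
        q_v = (q_o[i - 2] + q_o[i - 1] + q_o[i + 1] + q_o[i + 2]) / 4
        delta = q_v - q_o[i]
        results.append((i, delta, delta > threshold))
    return results
```

At a clean homozygous position the observed ratio is close to 1 and the detection value is close to 0, whereas a heterozygous position with two peaks of similar height has an observed ratio near 0.5 and, with clean neighbours, a detection value near 0.5.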
3.6. PineSAP
Figure 8 Flowchart for PineSAP
Phred is used for base-calling, and sequence alignment is done using Phrap and ProbconsRNA.
Next, the PolyBayes and PolyPhred techniques are used for SNP identification [32].
Table 1 Summary of some prominent Single Nucleotide Polymorphism Databases
| Database | URL of web resource | Information in the Database |
|---|---|---|
| Database of Expressed Sequence Tags (dbEST) | http://www.ncbi.nlm.nih.gov/dbEST | Short-pass reads of cDNA (transcript) sequences [33] |
| Entrez | - | Integrated database retrieval system that provides access to a diverse set of 40 databases that together contain 1.3 billion records [34] |
| Human Genic Bi-Allelic Sequence (HGBASE) | http://hgbase.interactiva.de/ | Database of polymorphisms located in human intra-genic sequences [35] |
| Online Mendelian Inheritance in Man (OMIM) | http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim | Human genes and genetic phenotypes; relationship between phenotype and genotype [36] |
| Database of Short Genetic Variations (dbSNP) | http://www.ncbi.nlm.nih.gov/SNP | Primary resource for Single Nucleotide variations, microsatellites, small-scale insertions and deletions [4] |
| JSNP database | http://snp.ims.u-tokyo.ac.jp/ | Catalog of SNPs responsible for different genetic related disorders in the Japanese population [37] |
| University of California Santa Cruz (UCSC) Genome Browser | http://genome.ucsc.edu/ | Popular web-based tool for displaying a portion of a genome including gene predictions, mRNA and expressed sequence tag alignments, SNPs [38] |
| SWISS-PROT | www.expasy.org/sprot/ | High level of annotation of function, domain structure, post-translational modifications, variants of protein [39] |
| Human Gene Mutation Database (HGMD) | http://www.hgmd.cf.ac.uk/ac/index.php | Germline mutations underlying human inherited diseases [7] |
| HapMap | http://www.hapmap.org | Catalog of common genetic variants [17] |
| Database of Genomic Variants (DGV) | http://projects.tcag.ca/variation | Catalog of structural variation in healthy control samples for studying correlation to genomic variation with phenotypic data [40] |
| CASCAD | http://cascad.niob.knaw.nl. | Catalog of candidate single nucleotide polymorphisms predicted using a computational approach from publicly available sequence data of the rat and zebrafish [41] |
| SNP Function Portal | - | SNP functional annotations for genomic elements, transcription regulation, protein function, pathway, disease and population genetics [42] |
| Database of genotype and phenotype (dbGAP) | http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap | Genotype-phenotype association, results of GWAS, medical resequencing, molecular diagnostic assays [43] |
| Functional Single Nucleotide Polymorphism (F-SNP) | http://compbio.cs.queensu.ca/F-SNP/ | Integrates information obtained from 16 bioinformatics tools and databases about the functional effects of SNPs [44] |
| Database of genomic Structural Variation (dbVar) | http://www.ncbi.nlm.nih.gov/dbvar | Large-scale copy number variants (CNV), insertions, deletions, inversions and translocations longer than 50 base pairs (bp) [45] |
| SNPedia | https://www.snpedia.com/ | SNPedia is a wiki investigating human genetics. It provides personal genome annotation, interpretation and analysis [46] |
| BioProject | www.ncbi.nlm.nih.gov/bioproject/ | Sequence variants, genotype-phenotype association, nucleotide sequence sets and epigenetic information [47] |
| GWAS db | http://jjwanglab.org/gwasdb | Contains genetic variants with functional annotations, genomic mapping, gene expression and disease associations [48] |
| Pharmacogenetics Knowledge Base (PharmGKB) | http://www.pharmgkb.org/ | Annotation of genetic variants and gene-drug-disease relationship [49] |
| Dog Genome SNP Database (DoGSD) | http://dogsd.big.ac.cn/ | Provides information about already identified SNPs in dog/wolf related genetic diseases [50] |
| Ensembl | http://www.ensembl.org | Finds SNPs and other variants for a gene and association with diseases [51] |
Table 2 Single Nucleotide Polymorphism identification tools
| Tool | URL of Tool |
|---|---|
| SNPeffect | http://snpeffect.vib.be/ |
| novoSNP | http://www.molgen.ua.ac.be/bioinfo/novosnp/ |
| PupaSuite | http://pupasuite.bioinfo.cipf.es/ |
| Sorting Intolerant from Tolerant (SIFT) | http://blocks.fhcrc.org/sift/SIFT.html |
| Polymorphism Phenotyping (PolyPhen) | http://genetics.bwh.harvard.edu/pph/ |
| SAP prediction method | http://sapred.cbi.pku.edu.cn/ |
| PMut | http://mmb2.pcb.ub.es:8080/PMut/ |
| Screening for Nonacceptable Polymorphisms (SNAP) | http://cubic.bioc.columbia.edu/services/SNAP/ |
| SNPSeek | http://snp.wustl.edu/cgi-bin/SNPseek/index.cgi |
| SNP@Promoter | http://variome.kobic.re.kr/SNPatPromoter/ |
| SNPper | http://snpper.chip.org/ |
| Genewindow | http://www.genewindow.nci.nih.gov/ |
| SIFT | http://blocks.fhcrc.org/sift/SIFT.html |
| PolyPhen | http://www.bork.embl-heidelberg.de/PolyPhen/ |
| SNP3D | http://www.snps3d.org/ |
| SNPeffect | http://snpeffect.vib.be/index.php |
| PicSNP | http://plaza.umin.ac.jp/_hchang/picsnp/ |
| Pupa SNP Finder | http://pupasnp.bioinfo.cnio.es/ |
| BioPerl | http://bio.perl.org/ |
| HaploSNPer | http://www.bioinformatics.nl/tools/haplosnper/ |
| Human chromosome 21 cSNP database | http://csnp.unige.ch/ |
| QualitySNPng | http://www.bioinformatics.nl/QualitySNPng/ |
4. CONCLUSIONS
There has been an explosion of research on SNPs and SNP identification ever since the
association between genetic variations and diseases emerged. SNP identification can enable
identification of candidate genes containing causal variants and of candidate causal SNPs
conferring predisposition to diseases. This enables early prevention, diagnosis and treatment
of diseases. There is a need for identification methods which can process a large number of
bases at low cost and in less time, paving the way for future research on computational
approaches. Bioinformatics aims at developing and organising the large number of available
databases storing information on SNPs, their functions and associated diseases. It also aims
to develop effective tools which can mine the relevant data from these databases to enable
analysis of disease-causing SNPs. This review paper aims to guide researchers on Single
Nucleotide Polymorphisms (SNPs) and the Biocomputing approaches and tools available for SNP
identification.
REFERENCES
[1] An Overview of the Human Genome Project (2015). National Human Genome Research
Institute (NHGRI) homepage. Available: http://www.genome.gov/12011238
[2] Sachidanandam, R. et al. (2001). A map of human genome sequence variation containing
1.42 million single nucleotide polymorphisms. Nature, 409:928-933.
[3] Collins, F.S. et al. (1998). A DNA polymorphism discovery resource for research on
human genetic variation. Genome Research, 8: pp. 1229–1231.
[4] Sherry, S.T. et al. (2001). dbSNP: the NCBI database of Genetic variation. Nucleic Acids
Research, 29(1): pp. 308-311.
[5] Sripichai, O. and Fucharoen, S. (2007). Genetic Polymorphisms and Implications for
Human Diseases. Journal of the Medical association of Thailand, 90(2): 394-398.
Available: http://www.medassocthai.org/journal
[6] Mooney, Sean (2005). Bioinformatics approaches and resources for single nucleotide
[7] polymorphism functional analysis. Briefings in Bioinformatics, 6(1): 44–56.
[8] Krawczak, M. et al. (2000). Human Gene Mutation Database-A Biomedical Information
and Research Resource. Human Mutation, 15: 45-51.
[9] Chakravarti, A. (1999). Population Genetics-making sense out of sequence. Nature
genetics (supplement), 21: 56-60.
[10] Gongcheng, L. et al. (2014). Regulatory Variants and Disease: The E-Cadherin −160C/A
SNP. Molecular Biology International, 2014.
[11] Available: http://www.hindawi.com/journals/mbi/2014/967565/
[12] Shawky, R.M. (2014). Reduced penetrance in human inherited disease. The Egyptian
Journal of Medical Human Genetics, 15: 103-111.
[13] Altshuler, D. et al. (2008). Genetic Mapping in Human Disease. Science, 322(5903):
881–888.
[14] Gray, I. C. et al. (2000). Single nucleotide polymorphisms as tools in human genetics.
Human Molecular Genetics, 9(16): 2403-2408.
A Review on Biocomputing Approaches and Tools for Identification of Single Nucleotide
Polymorphisms
http://www.iaeme.com/IJCET/index.asp 65 [email protected]
[15] Hu, P. et al. (2016). A Simple Algorithm for Population Classification. Scientific Reports,
6, article no.: 23491.
[16] Zhou, N. and Wang L. (2007). Effective selection of informative SNPs and classification
on the HapMap genotype data. BMC Bioinformatics, 8:484.
[17] Kohnemann, S. and Pfeiffer H. (2011). Application of mtDNA SNP analysis in forensic
casework. Forensic Sci. Int. Genet., 5(3): 216–221.
[18] The 1000 Genomes Project Consortium (2015). A global reference for human genetic
variation. Nature, 526: 68-87.
[19] International HapMap Consortium (2005). Nature, 437:1299.
[20] Mooney, S.D. et al. (2010). Bioinformatic Tools for Identifying Disease Gene and SNP
Candidates. Methods Molecular Biology, 628: 307–319.
[21] Johnson, A.D. et al. (2009). SNP bioinformatics: a comprehensive review of resources.
Circulation: Cardiovascular Genetics, 2(5): 530-536.
[22] Medvedev, P. et al. (2009). Computational methods for discovering structural variation
with next-generation sequencing. Nature Methods, 6: S13-S20.
[23] Nielsen, R. et al. (2011). Genotype and SNP calling from next-generation sequencing
data. Nature Reviews Genetics, 12(6): 443–451.
[24] Kumar, S. et al. (2012). SNP Discovery through Next-Generation Sequencing and Its
Applications. International Journal of Plant Genomics, 2012, article id: 831460.
[25] Bianco et al. (2013). Database tools in genetic diseases research. Genomics, 101:
75–85.
[26] Ghorbani, M. and Karimi, H. (2014). Ten Bioinformatics Tools for Single Nucleotide
Polymorphisms Detection. American Journal of Bioinformatics, 3(2): 45-48.
[27] Ewing, B.L. et al. (1998). Base-calling of automated sequencer traces using Phred. I.
Accuracy assessment. Genome Research, 8: 175-185.
[28] Nickerson, D.A. et al. (1997). PolyPhred: Automating the detection and genotyping of
single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids
Research, 25: 2745–2751.
[29] Haga, H. et al. (2002). Gene-based SNP discovery as part of the Japanese Millennium
Genome Project: identification of 190 562 genetic variations in the human genome.
Journal of Human Genetics, 47: 605-610.
[30] Picoult-Newberg, L. et al. (1999). Mining SNPs from EST Databases. Genome Research, 9:
167–174.
[31] Zhang, J. et al. (2005). SNPdetector: A Software Tool for Sensitive and Accurate SNP
Detection. PLoS Computational Biology, 1(5): 0395-0404.
[32] Chen, K. et al. (2007). PolyScan: An automatic indel and SNP detection approach to the
analysis of human resequencing data. Genome Research, 17: 659–666.
[33] Ngamphiw, C. et al. (2008). VarDetect: a nucleotide sequence variation exploratory tool.
BMC Bioinformatics, 9(Suppl 12): S9.
[34] Wegrzyn, J.L. et al. (2009). PineSAP-sequence alignment and SNP identification
pipeline. Bioinformatics, 25(19): 2609–2610.
[35] Boguski, M.S. et al. (1993). dbEST–database for expressed sequence tags. Nature
Genetics, 4: 332–333.
[36] Schuler, G.D. et al. (1996). Entrez: molecular biology database and retrieval system.
Methods in Enzymology, 266: 141-62.
[37] Sarkar, C. et al. (1998). Human genetic bi-allelic sequences (HGBASE), a database of
intra-genic polymorphisms. Memorias do Instituto Oswaldo Cruz, 93(5): 693-4.
[38] Hamosh, A. et al. (2000). Online Mendelian Inheritance in Man (OMIM). Human
Mutation, 15: 57–61.
[39] Hirakawa, M. et al. (2002). JSNP: a database of common gene variations in the Japanese
population. Nucleic Acids Research, 30(1): 158-62.
[40] Kent, W.J. et al. (2002). The Human Genome Browser at UCSC. Genome Research, 12:
996-1006.
[41] Boeckmann, B. et al. (2003). The SWISS-PROT protein knowledgebase and its
supplement TrEMBL in 2003. Nucleic Acids Research, 31(1): 365-70.
[42] Iafrate, A.J. et al. (2004). Detection of large-scale variation in the human genome. Nature
Genetics, 36(9): 949-51.
[43] Guryev et al. (2005). CASCAD: a database of annotated candidate single nucleotide
polymorphisms associated with expressed sequences. BMC Genomics, 6(10).
[44] Wang, P. et al. (2006). SNP Function Portal: a web database for exploring the function
implication of SNP alleles. Bioinformatics, 22(14): e523-9.
[45] Mailman, M.D. et al. (2007). The NCBI dbGaP database of genotypes and phenotypes.
Nature Genetics, 39: 1181–1186.
[46] Lee, P.H. and Shatkay, H. (2008). F-SNP: computationally predicted functional SNPs for
disease association studies. Nucleic Acids Research, 36(Database issue):D820-4.
[47] Church, D.M. et al. (2010). Public data archives for genomic structural variation. Nature
Genetics, 42: 813-814.
[48] Cariaso, M. and Lennon, G. (2012). SNPedia: a wiki supporting personal genome
annotation, interpretation and analysis. Nucleic Acids Research, 40(Database issue):
D1308–D1312.
[49] Barrett, T. et al. (2012). BioProject and BioSample databases at NCBI: facilitating capture
and organization of metadata. Nucleic Acids Research, 40(Database issue): D57-63.
[50] Li, M.J. et al. (2012). GWASdb: a database for human genetic variants identified by
genome-wide association studies. Nucleic Acids Research, 40(Database issue): D1047-54.
[51] Whirl-Carrillo, M. et al. (2012). Pharmacogenomics Knowledge for Personalized
Medicine. Clinical Pharmacology and Therapeutics, 92(4): 414–417.
[52] Bai, B. et al. (2015). DoGSD: the dog and wolf genome SNP database. Nucleic Acids
Research, 43(Database issue): D777-83.
[53] Aken, B.L. et al. (2016). The Ensembl gene annotation system. Database, 2016:
baw093.