
IDENTIFICATION OF GENETIC VARIANT IN BUFFALO

GENOME USING ddRAD SEQUENCE

A

DISSERTATION

SUBMITTED TO ORISSA UNIVERSITY OF AGRICULTURE &

TECHNOLOGY, BHUBANESWAR

IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE

DEGREE OF

MASTER OF SCIENCE IN BIOINFORMATICS

BY

ANJAN KUMAR PRADHAN

Adm. No.-28BI/15

DEPARTMENT OF BIOINFORMATICS

CENTRE FOR POST GRADUATE STUDIES

ORISSA UNIVERSITY OF AGRICULTURE AND TECHNOLOGY

BHUBANESWAR-751003

2017

Advisor Mrs. Sushma Rani Martha


CERTIFICATE –II

This is to certify that the dissertation entitled “Identification of Genetic Variant in Buffalo
Genome Using ddRAD Sequence”, submitted by Anjan Kumar Pradhan to the Orissa University of
Agriculture & Technology, Bhubaneswar, in partial fulfilment of the requirements for the award
of the degree of Master of Science in Bioinformatics, has been approved by the student's
advisory committee after an oral examination of the same in collaboration with the external
examiner.

ADVISORY COMMITTEE

1. Dr. D.C. Mishra (Chairman) ……………..
   Senior Scientist
   ICAR-IASRI, New Delhi

2. Mrs. Sushma Rani Martha (Member) ……………..
   Asst. Professor
   Department of Bioinformatics

3. Mr. Sukanta Kumar Pradhan (Member) ……………..
   Head of the Department
   Department of Bioinformatics

External Examiner ……………..


ACKNOWLEDGEMENT

It is my privilege to express my deep sense of gratitude to my advisor, Mrs. Sushma Rani Martha,
Asst. Professor, Department of Bioinformatics, Orissa University of Agriculture and Technology,
for her constant guidance, support and help, without which the project would not have been
successfully completed.

I express my profound gratitude to Dr. D.C. Mishra, Senior Scientist, ICAR-IASRI, New Delhi, for
allowing me to carry out the dissertation work under his guidance.

I am thankful to the members of the advisory committee, Mrs. Sushma Rani Martha, Asst. Professor,
Dept. of Bioinformatics, OUAT; Mr. Sukant Kumar Pradhan, HOD, Dept. of Bioinformatics, OUAT; and
Dr. D.C. Mishra, Senior Scientist, ICAR-IASRI, New Delhi, for their support and encouragement to
carry out this work successfully.

I offer my sincere thanks to Sushree Didi for her encouragement.

I convey my heartfelt thanks to all my faculty members, Mr. Surya Narayan Ratha, Mrs. Sucharita
Balabanta Ray and Mr. Sujit Kumar Dash, for their guidance at each and every step of my
experimental laboratory work. I heartily thank Dr. K. K. Chaturbedi for the help, support and
innovative ideas given to me for the completion of this project work.

I would like to express my heartiest and cordial regards to my beloved parents, who are a real
inspiration for me in every step of my life and give me unbounded emotional support. I feel great
pleasure in expressing my love to all my dearest friends for their strong mental support
throughout the project period and during these two years.

I feel honoured to be a part of this auspicious university, which provided me with a healthy
atmosphere during these two years. Last but not least, I express my gratitude to God for the
invaluable inspiration for the accomplishment of this work.

Anjan Kumar Pradhan


CONTENTS

CHAPTER NO.   PARTICULARS               PAGE NO.

I.            INTRODUCTION              1-3

II.           REVIEW OF LITERATURE      4-12

III.          MATERIALS AND METHODS     13-35

IV.           RESULT AND DISCUSSION     36-50

V.            CONCLUSION                51

              REFERENCE

              CURRICULUM VITAE

LIST OF FIGURES


FIGURE NO.   PARTICULARS   PAGE NO.

1. Difference between RAD & ddRAD 16

2. Per base sequence quality 20

3. Per tile sequence quality 20

4. Per sequence quality score 21

5. Per base sequence content 22

6. Per sequence GC content 22

7. Per base N content 23

8. Sequence length distribution 24

9. Sequence duplication levels 25

10. Adapter content 27

11. Kmer content 27

12. Trimming report 29

13. Stacks diagram 32

14. SNPs (Milk yield trait samples) 39

15. Haplotypes 40

16. SNPs (Lactation period trait samples) 42

17. Haplotypes 42

18. SNPs (Age at first calving) 49

19. Haplotypes 49


LIST OF TABLES

TABLE NO.   PARTICULARS   PAGE NO.

1. ddRAD sequence 14

2. Basic statistics 19

3. Over represented sequences 26

4. Milk yield trait 37

5. Marker (Milk yield trait) in the population 38

6. Lactation period trait 41

7. SNP summary statistics in the population 43

8. Haplotype summary statistics in the population 43

9. Hapstats summary statistics in the population 44

10. Sumstats summary statistics in the population 45

11. Sumstats_summary statistics in the population 46

12. Marker (Lactation period trait) in the population 46

13. Age at first calving trait 47

13.1 Marker (Age at first calving trait) in the population 48

Name of the Student : Anjan Kumar Pradhan

Admission No : 28BI/15

Title of thesis : Identification of Genetic Variant in Buffalo Genome Using ddRAD Sequence

Degree for which thesis submitted : Master of Science in Bioinformatics

Name of the Dept. & University : Department of Bioinformatics, Centre for Post Graduate Studies,
Orissa University of Agriculture & Technology, Bhubaneswar, Orissa, 751003

Year of submission : 2017

Name of the advisor : Mrs. Sushma Rani Martha

ABSTRACT

Bubalus bubalis (water buffalo) is an agro-economically important livestock species due to its
multipurpose use in India and other Asian countries. The aim of this study is to identify single
nucleotide polymorphisms (SNPs) in the buffalo genome using ddRAD sequencing through the
STACKS pipeline. Here we have used double digest restriction-associated DNA sequencing (ddRAD)
to identify and annotate genetic variants for three buffalo traits: milk yield, lactation
period, and age at first calving. The Stacks pipeline uses ddRAD sequence data to create genetic
maps and conduct population analysis. It assembles loci de novo from an individual’s sequence
reads or by using a reference sequence. These loci are catalogued and compared against other
individuals’ loci to create a map of alleles. Stacks can identify thousands of markers and use
this information to study genomic structure and assembly. Stacks employs a Catalog to record all
loci identified in a population and matches individuals to that Catalog to determine which
haplotype alleles are present at every locus in each individual.

Keywords: ddRAD sequence, STACKS pipeline, Sequence Alignment Mapping, Genetic Variant, NGS,
Stacks Web Interface.


CHAPTER-I

INTRODUCTION

Water buffalo (Bubalus bubalis) was domesticated approximately 5000 years ago in India to secure
a supply of milk, meat and power [Rudolph, M. C. et al. 2007]. It has been grouped into (i) swamp
buffalo, primarily developed for draught purposes, and (ii) river buffalo, primarily used for
milk production. Among the 13 recognized breeds of water buffalo, the majority are milch breeds
in India, and some of them have been listed on a state-level conservation plan by the Ministry of
Agriculture, Government of India [Ding, X. et al. 2012]. As buffalo milk occupies the highest
share in the Indian dairy sector, future improvement in traits of economic importance depends on
the genetic variation present within and between breeds. Even though buffaloes have an important
role in the Indian agricultural economy, most of the breeds have not been exploited to their full
genetic potential. Recently, genomic selection in cattle has been adopted globally to accelerate
genetic gains [Van Horn, C. G., Caviglia 2005]. Molecular markers like single nucleotide
polymorphisms (SNPs) can play a significant role in livestock improvement through conventional
breeding programmes. However, the present genomic resources for river buffalo are limited.
Moreover, molecular genetic diversity in river buffalo has been explored using cattle-based
microsatellite markers [Mashek, D. G. and Coleman 2006]. Taking advantage of the availability of
the fully sequenced cattle genome and other related genomic resources, and given the close
evolutionary relationship between cattle and river buffalo, we sequenced river buffalo genomes
on a large scale to detect genetic variants and, in particular, identified SNPs on a large scale,
which may help in the study of river buffalo genomics. The genetic component plays a major role
in milk production and other functional traits of dairy animals [Mercade, A. et al. 2006].

The advent of next-generation sequencing has enabled a robust and more cost-effective approach
for the high-throughput identification of SNPs. Recently, exome/targeted capture sequencing has
been used to analyse disease traits in livestock species because it is efficient and
cost-effective [6]. In the present study we carried out targeted sequencing for discovering
variants in and across targeted regions. To the best of our knowledge, there are no earlier
studies on targeted (exome) sequencing in river buffalo for high-throughput variant discovery.


Although there are many advantages to raising water buffalo, these animals remain underutilized.
In particular, water buffalo breeders and farmers have been facing many challenges and problems,
such as poor reproductive efficiency, sub-optimal production potential, higher than normal
incidence of infertility, and lower rates of calf survival. Genome research has created a broad
basis for promoting and utilizing gene technologies in many fields of livestock production. For
example, genome biotechnology will provide a major opportunity to advance sustainable animal
production systems of higher productivity by manipulating the variation within and between
breeds to realize more rapid and better-targeted gains in breeding value. This type of research
will also make it possible to distinguish molecular phenotypes and thus improve the use of
genetic resources in domestic animals. Therefore, the present review focuses on the currently
available genome resources in water buffalo, thus providing knowledge and technologies that can
help optimize production potential, reproductive efficiency, product quality, nutritional value
and resistance to diseases in the species. Genetics is responsible for approximately half the
observed change in performance internationally in well-structured cattle breeding programs.
Almost all, if not all, individual characteristics, including animal health, have a genetic
basis. Wherever genetic variation exists, breeding for improvement is possible. Although the
heritability of most health traits is low to moderate, considerable exploitable genetic
variation does exist.

Water buffalo provide more than 5% of the world’s milk supply, and their milk contains less
water and more fat, lactose, protein, and minerals than cow milk [Schwehm, J. M 1998]. Water
buffalo milk is used to make butter, butter oil, high quality cheeses, and other high quality
dairy products. Their meat is leaner than beef, containing less fat and cholesterol while having
a comparable taste [Manjithaya, R. R. and Dighe 2004]. Their hide can be used to make good
quality leather products, and they make good beasts of burden, providing 20% to 30% of all farm
power; they are superior draught animals in waterlogged conditions such as rice paddies. Water
buffalo utilize less digestible feeds than cattle, making them easier to maintain using locally
available roughages. In addition, water buffalo serve as a cash reserve, to be sold when the
need arises, thus securing the economic status of many families. The husbandry system of water
buffalo depends on the purpose for which they are bred and maintained. They are often referred
to as "the living tractor of the East". It is probably possible to plough deeper with buffalo
than with either oxen or horses.

India is considered the home tract of some of the best buffalo breeds. Because buffaloes are
preferred for milk, many she-buffaloes are moved from the breeding tract to thickly populated
urban and industrial centres to meet the milk requirements of the population there. Here they
are generally slaughtered after completing one or two lactations. Their progeny are allowed to
die of neglect, and thus no replacement of superior germplasm is possible. Indian buffaloes are
an important source of milk supply today and yield nearly three times as much milk as cows. More
than half of the total milk produced in the country (55%) was contributed by the 47.22 million
milch buffaloes, whereas the 57.0 million cows contributed only 45% of the total milk yield.
Indian buffaloes are water buffaloes. There are about 10 indigenous standard breeds of
buffaloes, which are well known for their milking qualities.

Bubalus bubalis (water buffalo) is an agro-economically important livestock species due to its
multipurpose use in India and other Asian countries. The aim of this study is to identify single
nucleotide polymorphisms (SNPs) for three buffalo traits (milk yield, lactation period and age
at first calving) using ddRAD sequences through the STACKS pipeline. Stacks identifies loci in a
set of individuals, aligned to a reference genome (including gapped alignments), and then
genotypes each locus. Stacks incorporates a maximum likelihood statistical model to identify
sequence polymorphisms and distinguish them from sequencing errors. Stacks employs a Catalog to
record all loci identified in a population and matches individuals to that Catalog to determine
which haplotype alleles are present at every locus in each individual.
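To illustrate how a maximum likelihood model can separate true polymorphisms from sequencing errors, the following is a minimal sketch of a diploid genotype call at a single site. It assumes a simple multinomial read model with a fixed per-read error rate and a likelihood ratio cut-off of 3.84; the function names, the fixed `error_rate` and the `threshold` value are illustrative assumptions, not the Stacks source code or its parameter settings.

```python
import math


def genotype_log_likelihoods(counts, error_rate=0.01):
    """Return the two most frequent bases and the log-likelihoods of the
    homozygous and heterozygous genotypes (illustrative multinomial model)."""
    # Rank the four nucleotides by how many reads support them.
    ranked = sorted("ACGT", key=lambda base: counts.get(base, 0), reverse=True)
    n = [counts.get(base, 0) for base in ranked]
    n1, n2, n_rest = n[0], n[1], n[2] + n[3]
    e = error_rate
    # Homozygote: every read that is not the major base is an error.
    ll_hom = n1 * math.log(1 - 3 * e / 4) + (n2 + n_rest) * math.log(e / 4)
    # Heterozygote: the two most frequent bases are both true alleles.
    ll_het = (n1 + n2) * math.log(0.5 - e / 4) + n_rest * math.log(e / 4)
    return ranked[0], ranked[1], ll_hom, ll_het


def call_genotype(counts, error_rate=0.01, threshold=3.84):
    """Call a genotype, or 'unknown' if the likelihood ratio is too small."""
    b1, b2, ll_hom, ll_het = genotype_log_likelihoods(counts, error_rate)
    likelihood_ratio = 2 * abs(ll_hom - ll_het)
    if likelihood_ratio < threshold:  # assumed cut-off (chi-square, 1 d.f.)
        return "unknown"
    if ll_hom > ll_het:
        return b1 + b1
    return "".join(sorted(b1 + b2))


print(call_genotype({"A": 20}))           # many reads, one base
print(call_genotype({"A": 10, "G": 10}))  # two well-supported bases
print(call_genotype({"A": 2}))            # too few reads to decide
```

With very low coverage the two genotype likelihoods are too close to distinguish, which is why the sketch returns "unknown" rather than forcing a call.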

OBJECTIVES

• Data compilation and preprocessing of the ddRAD sequence data for three traits in buffalo.

• Identification and annotation of genetic variants.


CHAPTER-II

REVIEW OF LITERATURE

Since the early 1800s, breed development was based on phenotypic selection for coat color and
polled phenotypes, and included the imposition of severe bottlenecks followed by breed expansion
via artificial insemination. During the last 50 years, animal breeding based on quantitative
genetics has resulted in remarkable progress in improving production traits for milk and meat
(Andersson and Georges 2004). Selection (natural and human-imposed) and nonselective forces
(demographic events and introgression) have therefore driven changes within the cattle genome.
Their combined effects have created exceptional phenotypic diversity and genetic adaptation to
local environments across the globe within the modern cattle breeds.

It is generally accepted that there are four mechanisms of evolutionary change: mutation,
genetic drift, gene flow or migration (demographic history), and selection. However, only
selection is locus specific, while the first three forces work uniformly across the whole
genome. Selection can be divided into three modes: positive selection, purifying selection (or
negative selection, eliminating a deleterious mutation), and balancing selection (including
heterozygote advantage and frequency-dependent selection). Positive selection is a mode of
natural selection that drives the increase in prevalence of advantageous alleles due to their
favorable effects on fitness (Biswas and Akey 2006; Kelley and Swanson 2008; Oleksyk et al.
2010). Genetic hitchhiking refers to changes in the frequency of an allele because of linkage
with a positively selected or neutral allele at another locus.

The availability of genomic data has spurred many approaches for mapping positive selection,
mainly based on reduced local variability, deviations in the marker frequency, increased linkage
disequilibrium (LD), and extended haplotype structure. Methods such as CLR, CMS, FST, EHH, iHS,
and hapFLK (Tajima 1989; Fay and Wu 2000; Sabeti et al. 2002; Nielsen et al. 2005; Voight et al.
2006; Grossman et al. 2010; Fariello et al. 2013) have been widely used in human, mouse, rat, and
domesticated animals like dogs, cattle, sheep, pigs, horses, and chickens (Waterston et al.
2002; Gibbs et al. 2004; Rubin et al. 2010, 2012; Kijas et al. 2012; Petersen et al. 2013). One
method (di) was recently developed to identify genomic regions indicative of selection with a
high degree of genetic differentiation


between dog breeds (Akey et al. 2010). Distinct from FST, which measures the fraction of total
genetic variation between two populations, the di value is defined as a function of unbiased
estimates of all pairwise FST between one breed and the remaining breeds within a population. It
is suited for detecting selection specific to a particular breed, or subset of breeds, and
isolating the direction of change. It was utilized to track lineage-specific signatures of
selection in the dog and horse genomes, revealing its power to detect selection acting on both
newly arisen and preexisting variations (Akey et al. 2010; Petersen et al. 2013).

Selection mapping is a powerful approach, together with genome-wide association studies, to
detect candidate genes associated with quantitative traits. Selection mapping in cattle has
previously been investigated using lower density markers like the BovineSNP50 array (Flori et
al. 2009; Hayes et al. 2009; Qanbari et al. 2010, 2011; Stella et al. 2010; Rothammer et al.
2013). Only recently were similar studies reported based on higher density markers like the
BovineHD array (Porto-Neto et al. 2013; Utsunomiya et al. 2013; Kemper et al. 2014; Perez et al.
2014). More recently, sequence-based signatures were reported in Fleckvieh (Qanbari et al.
2014). However, these studies focused on a limited number of breeds with specific traits.
Therefore, it is possible that many breed-specific selection signatures remain undetected due to
the lack of comparison across breeds. There are a few targeted studies of the haplotype pattern
and evolution of selected gene families like Toll-like receptors in cattle (Seabury et al.
2010). However, to our knowledge, no systematic effort has been reported to investigate the
haplotype pattern and evolution of positively selected genes in the cattle genome.

In this study, we investigated diverse genomic selection using high-density single nucleotide
polymorphism (SNP) data of five distinct cattle breeds, including Holstein (HOL), Angus (ANG),
Charolais (CHL), Brahman (BRM), and N’Dama (NDA). HOL, ANG, and CHL are taurine breeds from
Europe. HOLs represent the highest-production dairy animals, originally from the Netherlands and
northern Germany. Their black-and-white color was due to artificial selection by the breeders.
ANG cattle, first developed in Scotland, are used in beef production. They are naturally polled
(do not have horns) and solid black or red in color. CHL is a dual purpose breed (both milk and
beef) that originated in France and is known for its large body size, bone structure, and white
to cream coat. NDA is an indigenous local taurine breed from West Africa. With a small size and
fawn coat, NDA is well known for its trypanotolerance and shows superior resistance to ticks and
other parasites (http://www.ansi.okstate.edu/breeds/cattle/, last accessed December 2, 2014).
BRM is a composite of several zebu breeds imported from India (Guzerat, Kankrej, Gir, and
others), and


was first bred in America in the 1880s for beef production with a minor taurine contribution
(Decker et al. 2014). The BRM is known for its gray coat, heat tolerance, and disease
resistance. We performed a genome-wide scan with the BovineHD SNP genotypes to map selection
signatures among these five diverse cattle breeds.
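The di statistic mentioned above combines all pairwise FST values between one breed and the others. As usually stated, for each SNP it sums the standardized pairwise values, di = Σ_{j≠i} (FST_ij − E[FST_ij]) / sd[FST_ij], where the mean and standard deviation are taken over all SNPs for that breed pair. The sketch below assumes the per-SNP pairwise FST values have already been computed; the function name, the dictionary layout and the toy numbers are illustrative assumptions, and the use of the sample standard deviation is a choice made here for simplicity.

```python
from statistics import mean, stdev


def di_per_snp(pairwise_fst, breed):
    """Sum of standardized pairwise FST values between `breed` and every
    other breed, computed separately for each SNP.

    pairwise_fst maps a (breed_a, breed_b) tuple to a list of per-SNP FST
    values; all lists are assumed to follow the same SNP order.
    """
    n_snps = len(next(iter(pairwise_fst.values())))
    di = [0.0] * n_snps
    for pair, values in pairwise_fst.items():
        if breed not in pair:
            continue  # only pairs involving the focal breed contribute
        mu = mean(values)
        sd = stdev(values)
        if sd == 0:
            continue  # a constant FST profile carries no signal
        for k, value in enumerate(values):
            di[k] += (value - mu) / sd
    return di


# Toy per-SNP FST values for three breeds and three SNPs (made-up numbers).
toy_fst = {
    ("HOL", "ANG"): [0.1, 0.2, 0.3],
    ("HOL", "BRM"): [0.2, 0.2, 0.5],
    ("ANG", "BRM"): [0.3, 0.1, 0.2],
}
print(di_per_snp(toy_fst, "HOL"))
```

In practice the per-SNP values are typically averaged in genomic windows, and windows in the upper tail of the di distribution are treated as candidate regions of breed-specific selection.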

2.1 Genetic Variant

A genetic variant is an alteration in the most common DNA nucleotide sequence. The term variant
can be used to describe an alteration that may be benign, pathogenic, or of unknown
significance, and it is increasingly being used in place of the term mutation. The main classes
of variation are:

• Mutations: changes at the level of DNA in which one or more base pairs have undergone a
change. The change could occur at random or be due to a factor in the environment. Major
deletions, insertions, and genetic rearrangements can affect several genes or large areas of a
chromosome at once.

• Polymorphisms: differences in individual DNA which are not mutations. Single-nucleotide
polymorphisms (SNPs) are the most common, occurring about once every 1,000 bases.

• Copy number variations: some DNA repeats itself (e.g. AAGAAGAAGAAG), and there can be
variation in the number of repeats.

Genetic variation means that biological systems, both individuals and populations, are
different over space. Each gene pool includes various alleles of genes. The variation occurs
both within and among populations, supported by individual carriers of the variant genes.
Genetic variation is brought about, fundamentally, by mutation, which is a permanent change in
the chemical structure of chromosomes. Genetic recombination also produces changes within
alleles.

2.1.1 Among individuals within a population

Genetic variation among individuals within a population can be identified at a variety of
levels. It is possible to identify genetic variation from observations of phenotypic variation
in either quantitative traits (traits that vary continuously and are coded for by many genes,
e.g., leg length in dogs) or discrete traits (traits that fall into discrete categories and are
coded for by one or a few genes, e.g., white, pink, or red petal color in certain flowers).


Genetic variation can also be identified by examining variation at the level of enzymes using
the process of protein electrophoresis. Polymorphic genes have more than one allele at each
locus. Half of the genes that code for enzymes in insects and plants may be polymorphic, whereas
polymorphisms are less common in vertebrates.

Ultimately, genetic variation is caused by variation in the order of bases in the nucleotides in
genes. New technology now allows scientists to sequence DNA directly, which has identified even
more genetic variation than was previously detected by protein electrophoresis. Examination of
DNA has shown genetic variation in both coding regions and the non-coding intron regions of
genes. Genetic variation will result in phenotypic variation if variation in the order of
nucleotides in the DNA sequence results in a difference in the order of amino acids in the
proteins coded by that DNA sequence, and if the resultant differences in amino acid sequence
influence the shape, and thus the function, of the enzyme.

2.1.2 Between populations

Geographic variation means genetic differences in populations from different locations. This is
caused by natural selection or genetic drift.

2.1.3 Measurement

Genetic variation within a population is commonly measured as the percentage of gene loci that
are polymorphic or the percentage of gene loci in individuals that are heterozygous.
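As a small worked example of these two measures, the sketch below computes the proportion of polymorphic loci and the mean observed heterozygosity from a toy set of diploid genotypes. The data layout (a dictionary from locus name to a list of allele pairs, one pair per individual), the locus names and the function names are assumptions made for illustration only.

```python
def proportion_polymorphic(genotypes):
    """Fraction of loci at which more than one allele is observed."""
    polymorphic = 0
    for calls in genotypes.values():
        alleles = {allele for pair in calls for allele in pair}
        if len(alleles) > 1:
            polymorphic += 1
    return polymorphic / len(genotypes)


def mean_observed_heterozygosity(genotypes):
    """Average, over loci, of the fraction of heterozygous individuals."""
    per_locus = []
    for calls in genotypes.values():
        heterozygous = sum(1 for a, b in calls if a != b)
        per_locus.append(heterozygous / len(calls))
    return sum(per_locus) / len(per_locus)


# Four hypothetical loci genotyped in four individuals.
toy_genotypes = {
    "locus1": [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")],
    "locus2": [("C", "C"), ("C", "C"), ("C", "C"), ("C", "C")],
    "locus3": [("T", "T"), ("T", "C"), ("T", "T"), ("T", "T")],
    "locus4": [("G", "G"), ("G", "G"), ("G", "G"), ("G", "G")],
}
print(proportion_polymorphic(toy_genotypes))        # 0.5
print(mean_observed_heterozygosity(toy_genotypes))  # 0.1875
```

Here two of the four loci carry more than one allele, and the per-locus heterozygosities are 0.5, 0, 0.25 and 0, which average to 0.1875.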

2.1.4 Sources

Random mutations are the ultimate source of genetic variation. Mutations are likely to be rare,
and most mutations are neutral or deleterious, but in some instances the new alleles can be
favored by natural selection. Polyploidy is an example of chromosomal mutation. Polyploidy is a
condition wherein organisms have three or more sets of chromosomes (3n or more). Crossing over
and random segregation during meiosis can result in the production of new alleles or new
combinations of alleles. Furthermore, random fertilization also contributes to variation.
Variation and recombination can be facilitated by transposable genetic elements, endogenous
retroviruses, LINEs, SINEs, etc. For a given genome of a multicellular organism, genetic
variation may be acquired in somatic cells or inherited through the germline.

2.1.5 Forms

Genetic variation can be divided into different forms according to the size and type of genomic
variation underpinning genetic change. Small-scale sequence variation includes base-pair
substitutions and indels. Large-scale structural variation can be either copy number variation
(loss or gain) or chromosomal rearrangement (translocation, inversion, or segmental acquired
uniparental disomy). Numerical variation in whole chromosomes or genomes can be either
polyploidy or aneuploidy.
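For the small-scale forms, the distinction between a substitution and an indel can be read directly from the reference and alternate alleles of a variant. The sketch below assumes alleles are given as plain strings in the style of the REF and ALT columns of a variant call file; the function name and the category labels are illustrative choices.

```python
def classify_small_variant(ref, alt):
    """Classify a variant by comparing reference and alternate alleles."""
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"  # single base-pair change (SNP)
    if len(ref) < len(alt):
        return "insertion"     # alternate allele gains bases
    if len(ref) > len(alt):
        return "deletion"      # alternate allele loses bases
    return "multi-base substitution"


for ref, alt in [("A", "G"), ("A", "AGG"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_small_variant(ref, alt))
```

Large-scale structural and numerical variation cannot be recognized from allele strings alone and needs read depth or read-pair evidence, so it is outside the scope of this sketch.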

2.1.6 Maintenance in populations

A variety of factors maintain genetic variation in populations. Potentially harmful recessive
alleles can be hidden from selection in the heterozygous individuals in populations of diploid
organisms (recessive alleles are only expressed in the less common homozygous individuals).
Natural selection can also maintain genetic variation in balanced polymorphisms. Balanced
polymorphisms may occur when heterozygotes are favored or when selection is frequency
dependent. The only source of genetic variation in asexual organisms is mutation. Thus, if
replication of the genetic material were perfect, there would be no genetic variation and,
thus, no evolution.

The importance of genetic variation is seen when the environment changes. In such cases, if
genetic variation is not present, the prevalent genotypes might not be suited to the changed
environment and the species might die out. If there is genetic variation, natural selection can
act on it and bring about adaptation to the new environment. In scientific terms, variation is
the degree to which progeny differ from their parents. These are the differences found in
morphological, physiological, cytological and behavioural traits of individuals belonging to
the same species, race and family. They appear in offspring and siblings due to

• reshuffling of genes by chance separation of chromosome.

• crossing over

• chance combination of chromosomes during meiosis and fertilisation

• mutations

• effect of environment.

Water buffalo milk presents physicochemical features different from those of other ruminant
species, such as a higher content of fatty acids and proteins. The physical and chemical
parameters of swamp and river type water buffalo milk differ. Water buffalo milk contains higher
levels of total solids, crude protein, fat, calcium, and phosphorus, and a slightly higher
content of lactose, compared with cow milk. The high level of total solids makes water buffalo
milk ideal for processing into value-added dairy products such as cheese. The conjugated
linoleic acid (CLA) content in milk ranged from 4.4 mg/g fat in September to 7.6 mg/g fat in
June. Seasons and genetics may play a role in the variation of CLA level and changes in the
gross composition of water buffalo milk. Water buffalo milk is processed into a large variety of
dairy products:

• Cream churns much faster at higher fat levels and gives higher overrun than cow cream.

• Butter from water buffalo cream displays more stability than that from cow cream.

• Ghee from water buffalo milk has a different texture, with a bigger grain size, than ghee from
cow milk.

• Heat-concentrated milk products in the Indian subcontinent include paneer, khoa, rabri, kheer
and basundi.

• Fermented milk products include dahi, yogurt, and chakka.

Water buffalo meat, sometimes called "carabeef", is often passed off as beef in certain regions,

and is also a major source of export revenue for India. In many Asian regions, buffalo meat is

less preferred due to its toughness; however, recipes have evolved (rendang, for example)

where the slow cooking process and spices not only make the meat palatable, but also preserve

it, an important factor in hot climates where refrigeration is not always available. Their hides provide tough and useful leather, often used for shoes. The bones and horns are often made into jewellery, especially earrings. Horns are used for the embouchure of musical instruments, such as ney and kaval.

2.2 Next Generation Sequencing

In the past, two general strategies have been widely used for whole genome sequencing: BAC

by BAC sequencing and shotgun sequencing. Both strategies employ the Sanger method,

which is relatively costly, time consuming, and labor intensive [31]. Therefore, the high

demand for low-cost sequencing has led to the development of high-throughput sequencing

technologies, called next-generation sequencing. As reviewed by Jiang et al. [32], three such next-generation sequencing technologies have been commercialized: Roche/454 Life Sciences (http://www.454.com), Illumina/Solexa (http://www.Illumina.com) and Applied Biosystems/SOLiD (http://solid.appliedbiosystems.com). These sequencing methods no longer use the Sanger method. Instead, the 454 technology is based on pyrosequencing and emulsion PCR; the Solexa technology uses a sequencing-by-synthesis approach on DNA fragments attached to a flow cell; and the SOLiD (supported oligonucleotide ligation and detection) technology is a short-read sequencing method based on ligation. All three methods can produce a large amount of sequence in a relatively short time.

Sanger sequencing was developed by Frederick Sanger and colleagues in 1977 and was widely used for about 25 years. Nowadays it has mostly been replaced by next-generation sequencing. Sanger sequencing is quite slow and can sequence only a few thousand nucleotides in a week. Next-generation sequencing is fast, easy to operate and cost effective. It can sequence about 200 billion nucleotides in a week, a figure expected to rise to 600 billion in the next few years. The difference is like comparing the data stored on a floppy disk (Sanger sequencing) with a 2 TB hard drive (next-generation sequencing). Genome sequencing has advanced so much in the last few years that the entire genome of an individual can now be sequenced within a couple of hours.

Nucleic acid sequencing is a method for determining the exact order of nucleotides present in a

given DNA or RNA molecule. In the past decade, the use of nucleic acid sequencing has

increased exponentially as the ability to sequence has become accessible to research and

clinical labs all over the world. The first major foray into DNA sequencing was the Human

Genome Project, a $3 billion, 13-year-long endeavor, completed in 2003. The Human Genome

Project was accomplished with first-generation sequencing, known as Sanger sequencing.

Sanger sequencing (the chain-termination method), developed by Frederick Sanger and colleagues in 1977, was considered the gold standard for nucleic acid sequencing for the subsequent two and a half decades (Sanger et al., 1977). Since completion of the first human genome sequence, demand for cheaper and faster sequencing methods has increased greatly. This demand has driven the development of second-generation sequencing methods, or next-generation sequencing (NGS).

NGS platforms perform massively parallel sequencing, during which millions of fragments of

DNA from a single sample are sequenced in unison. Massively parallel sequencing technology

facilitates high-throughput sequencing, which allows an entire genome to be sequenced in less

than one day. In the past decade, several NGS platforms have been developed that provide low-cost, high-throughput sequencing. Here we highlight two of the most commonly used platforms in research and clinical labs today: the Life Technologies Ion Torrent Personal Genome Machine (PGM) and the Illumina MiSeq. The creation of these and other NGS

platforms has made sequencing accessible to more labs, rapidly increasing the amount of

research and clinical diagnostics being performed with nucleic acid sequencing.

The growing power and reducing cost sparked an enormous range of applications of Next

generation sequencing (NGS) technology. Gradually, sequencing is starting to become the

standard technology to apply, certainly at the first step where the main question is “what's all

involved”, “what's the basis”. It should be realized that for many applications sequencing

would always have been the method of choice, yet it was science-fiction, technically

unthinkable and later possible but far too costly. We perform genome-wide association studies (GWAS) using SNP arrays simply because we cannot afford to perform whole-genome sequencing in tens of thousands of individuals. This is changing rapidly and sequencing will

become our molecular microscope, the tool to get a first look. Although replication,

transcription, translation, methylation and nuclear DNA folding are completely different

processes, they can all be studied using sequencing. An important advantage of sequence data

is its quality, robustness and low noise. It should be noted that a successful NGS project requires expertise both in the wet lab and on the bioinformatics side in order to warrant high-quality data and data interpretation. The sequence itself is hard evidence of its correctness. A

sequencing system will not produce “random” sequences and when it does this becomes

evident immediately from QC calls obtained from spike-in controls. Furthermore random

sequences will have no match and can be easily discarded.

2.3 Single Nucleotide Polymorphisms Identification

Single nucleotide polymorphisms (SNPs) are stable, biallelic sequence variants that are distributed throughout the genome and are present at an appreciable frequency (>1%) in the human population. Together with other types of polymorphism, such as insertions and deletions, they account for genome variation among individuals, and SNPs are responsible for the largest share of this sequence variation.

There are three main reasons to identify SNPs:

• Some of them may be involved in diseases due to the mutation they cause.

• Even SNPs that do not change protein expression may lie close to unknown deleterious mutations on the same chromosome. Thus, they might be used as markers for broad haplotype analysis.

• Their low rate of recurrent mutation makes them stable indicators of human history. SNPs might be used to reveal the connections between human populations through time.

The principal projects involved in this identification are the SNP Consortium (which identified about 95% of the sequenced SNPs) and the International Human Genome Sequencing Consortium. Single-nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions (regions between genes). SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code. Association studies can determine whether a genetic variant is associated with a disease or trait [6].

A tag SNP is a representative single-nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium (the non-random association of alleles at two or more loci). Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

Haplotype mapping: sets of alleles or DNA sequences can be clustered so that a single SNP can identify many linked SNPs. Linkage disequilibrium (LD), a term used in population genetics, indicates non-random association of alleles at two or more loci, not necessarily on the same chromosome. In practice, SNP alleles or DNA sequences that are close together in the genome tend to be inherited together. LD is affected by two parameters: 1) the distance between the SNPs (the larger the distance, the lower the LD), and 2) the recombination rate (the lower the recombination rate, the higher the LD).
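The strength of LD between two biallelic SNPs is commonly summarised by the coefficient D and by r². The following is a minimal sketch of that calculation from haplotype counts; the function name and the example counts are illustrative assumptions and are not taken from the buffalo data.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic loci from haplotype counts.

    n_AB, n_Ab, n_aB, n_ab are counts of the four possible haplotypes
    (hypothetical inputs; phased haplotypes are assumed to be available).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total                 # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total         # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total         # frequency of allele B at locus 2
    D = p_AB - p_A * p_B                # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = (D * D) / denom if denom > 0 else 0.0
    return D, r_squared


if __name__ == "__main__":
    # Complete association: only A-B and a-b haplotypes are observed.
    print(linkage_disequilibrium(50, 0, 0, 50))
    # Independence: all four haplotypes are equally frequent.
    print(linkage_disequilibrium(25, 25, 25, 25))
```

An r² close to 1 means one SNP predicts the other almost perfectly, which is the property a tag SNP relies on.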

CHAPTER-III

MATERIALS AND METHOD

Materials Used

Data on three water buffalo traits (milk yield, lactation period and age at first calving) were collected, and the cattle reference genome (GCA_000003055.5_Bos_taurus_UMD_3.1.1_genomic(1).fna) was downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/055/GCA_000003055.5_Bos_taurus_UMD_3.1.1/). For every individual sample, demultiplexed forward and reverse FASTQ files (ddRAD sequence data only) were used in the analysis. A simple naming convention (a single-word locality code/name and a single-word sample identifier separated by an underscore) must be followed for every sample; examples are 1_R1_001.fastq and 1_R2_001.fastq. A script that reads a text file of sample names and calls process_radtags from Stacks can be used to demultiplex the samples and rename them according to this convention.

3.1 Double Digest Restriction-Site Associated DNA Sequencing

Double digest restriction-site associated DNA sequencing (ddRAD-sequence) is a reduced-representation sequencing technology, built on next-generation sequencing, that samples loci flanked by restriction enzyme sites across the genome. ddRAD-sequence has been widely applied to SNP marker development and genotyping in animals, especially marine animals, as the original ddRAD protocol was developed and tested mainly on animal data. However, wide application of ddRAD-sequence technology in plant species has not been achieved so far. Here, we aim to develop an optimized ddRAD library preparation protocol that is accessible for buffalo without extensive preliminary experiments or start-up costs. In contrast to single-enzyme RAD sequencing, ddRAD-sequence uses a two-enzyme double digest followed by precise size selection that excludes regions flanked by either [a] very close or [b] very distant restriction enzyme (RE) recognition sites, recovering a library consisting of only fragments close to the target size.
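The effect of a double digest followed by size selection can be illustrated with a small in silico sketch. It assumes the standard recognition sites of PstI (CTGCA^G), MspI (C^CGG) and EcoRI (G^AATTC), scans only the given strand (all three sites are palindromic), and uses a hypothetical toy sequence and size window. It is an illustration of the principle, not the protocol used in this work.

```python
import re

# Recognition site and cut offset within the site (top strand).
ENZYMES = {
    "PstI": ("CTGCAG", 5),   # CTGCA^G
    "MspI": ("CCGG", 1),     # C^CGG
    "EcoRI": ("GAATTC", 1),  # G^AATTC
}


def cut_positions(seq, enzyme):
    """Return (cut position, enzyme name) for every site of one enzyme."""
    site, offset = ENZYMES[enzyme]
    # A lookahead pattern also reports overlapping sites.
    return [(m.start() + offset, enzyme)
            for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme_a, enzyme_b, min_len, max_len):
    """Keep fragments bounded by two different enzymes and inside the size window."""
    cuts = sorted(cut_positions(seq, enzyme_a) + cut_positions(seq, enzyme_b))
    fragments = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        length = end - start
        if left != right and min_len <= length <= max_len:
            fragments.append((start, end, left, right))
    return fragments


if __name__ == "__main__":
    # Toy sequence: one PstI site followed by two MspI sites.
    toy = "A" * 10 + "CTGCAG" + "T" * 20 + "CCGG" + "A" * 10 + "CCGG" + "T" * 5
    print(double_digest(toy, "PstI", "MspI", min_len=10, max_len=30))
```

Only the PstI-MspI fragment is kept in the toy example; the MspI-MspI fragment is dropped because both of its ends were cut by the same enzyme.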

Table 1. ddRAD sequences

Name (restriction enzyme)   Sequence (5′ – 3′)

PstI                        TCTTTCCCTACACGACGCTCTTCCGATCTGCA
                            GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

EcoRI                       CTGGAGTTCAGACGTGTGCTCTTCCGATCT
                            AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

HindIII                     TCTTTCCCTACACGACGCTCTTCCGATCT
                            AGCTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

SalI                        CTGGAGTTCAGACGTGTGCTCTTCCGATC
                            TCGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

MspI                        CTGGAGTTCAGACGTGTGCTCTTCCGATCT
                            CGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Indexed primers for PCR

Forward primer              AATGATACGGCGACCACCGAGATCTACACXXXXXXXXACACTCTTTCCCTACACGACGCTCTTCC

Reverse primer              CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGTTCAGACGTGTGCTCTTC

Genotyping requires thousands of genomes to be compared in a reliable, consistent way.

Restriction site associated DNA sequencing (RAD-Sequence) interrogates a fraction of

the genome across many individuals, an ideal method for genotyping. By using restriction

enzyme digestion and sequencing the regions adjacent to restriction sites, researchers can

examine the same subset of genomic regions for thousands of individuals and identify

many genetic markers along the genome. Other NGS methods examine a larger portion

of the genome and offer more data, but they are costly and cannot be used to study the thousands of individuals required for genotyping. RAD-Sequence applications include genetic marker discovery, local genome assembly, QTL mapping and linkage mapping.

SciGenom uses double digest RAD-Sequence (ddRAD-Sequence), a variation of RAD-

Sequence, for genotyping. Traditional RAD-Sequence uses one restriction enzyme and

random shearing to generate fragments from genomic DNA. However, these are high

DNA loss steps and offer little control over the fragments that are sequenced. For

organisms without a reference genome, a significant portion of the RAD-Sequence data has

been discarded due to sequence read errors and the presence of variable sites. ddRAD-Sequence was designed to address these shortcomings. In ddRAD-Sequence, genomic DNA is digested with two restriction enzymes, and the resulting fragments undergo adaptor ligation and precise size selection before sequencing. Only a very small fraction of the fragments will be sequenced, and these fragments come from the same genomic regions across individuals. Further, ddRAD requires half as many reads to achieve high-confidence SNP calling, because the chance of obtaining duplicate reads from the same restriction site is very low. Due to these modifications, ddRAD has become a more economical method to genotype thousands of individuals, and has been used for SNP discovery between two Peromyscus species that have no reference sequence.

3.1.1 Paired-End ddRAD-Sequence

The ddRAD-sequence data are paired-end. Each pair of R1 and R2 reads is separated by a gap of known size, give or take approximately 90 bp, so linking these reads as haplotypes would be ideal. The 90 bp uncertainty is not critically important, and each R1/R2 pair could be concatenated with an appropriate number of Ns. Each full locus could then be run through the pipeline as a coherent whole.
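The concatenation described above can be sketched as follows. The sketch assumes that R2 is reverse-complemented so that both reads lie on the same strand, and it uses a hypothetical spacer length; the appropriate number of Ns depends on the size-selection window of the library.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(r1, r2, gap=10):
    """Join R1 and the reverse complement of R2 with a spacer of Ns.

    gap is a hypothetical spacer length chosen for illustration.
    """
    return r1 + "N" * gap + reverse_complement(r2)


if __name__ == "__main__":
    print(join_pair("ACGT", "AAGG", gap=3))
```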

RAD vs ddRAD

RAD: reads lie between the restriction site and a random site.

ddRAD: reads lie between the two restriction sites, which gives more flexibility in balancing breadth of coverage against depth of coverage.

Fig1. Difference between RAD and ddRAD Sequencing

Double-digest restriction site-associated DNA sequencing (ddRAD-Sequence) enables high-throughput genome-wide genotyping with next-generation sequencing technology. Although computational in silico prediction of restriction sites from the genome sequence is recognized as an effective approach for choosing the restriction enzymes to be used, few reports have evaluated the in silico predictions against actual experimental data. In this study, we designed and

demonstrated a workflow for in silico and empirical ddRAD-Sequence analysis in Buffalo, as

follows: (i) in silico prediction of optimum restriction enzymes from the reference genome, (ii)

verification of the prediction by actual ddRAD-Sequence data of four restriction enzyme

combinations, (iii) establishment of a computational data processing pipeline for high-

confidence single nucleotide polymorphism (SNP) calling, and (iv) validation of SNP accuracy

by construction of genetic linkage maps. The quality of SNPs based on de novo assembly

reference of the ddRAD-Sequence reads was comparable with that of SNPs obtained using the

published reference genome of Cattle. Comparisons of SNP calls in diverse Buffalo lines

revealed that SNP density in the genome influenced the detectability of SNPs by ddRAD-

Sequence.

The Stacks pipeline uses RAD-Sequence data to create genetic maps and conduct population analysis. It assembles loci de novo from an individual's sequence reads or by using a reference sequence. These loci are catalogued and compared against other individuals' loci to create a map of alleles. Stacks can identify thousands of markers and use this information to study genomic structure and assembly. Stacks can export data to JoinMap, R/qtl and VCF formats. In addition to Stacks, SciGenom has the ability to use GATK, MUSCLE, MCL and BLAST in the analysis pipeline.

3.2 FastQC

Modern high throughput sequencers can generate tens of millions of sequences in a single run. Before

analysing this sequence to draw biological conclusions you should always perform some simple quality

control checks to ensure that the raw data looks good and there are no problems or biases in your data

which may affect how you can usefully use it. Most sequencers will generate a QC report as part of

their analysis pipeline, but this is usually only focused on identifying problems which were generated

by the sequencer itself. FastQC aims to provide a QC report which can spot problems which originate

either in the sequencer or in the starting library material. FastQC can be run in one of two modes. It can

either run as a standalone interactive application for the immediate analysis of small numbers of FastQ

files, or it can be run in a non-interactive mode where it would be suitable for integrating into a larger

analysis pipeline for the systematic processing of large numbers of files.
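For the non-interactive mode, one way to drive FastQC from a script is sketched below. It assumes that the fastqc executable is on the PATH and that the output directory (a hypothetical name here) already exists. Only the command is built; running it, for example with subprocess.run, is left to the caller.

```python
import shlex


def fastqc_command(fastq_files, outdir="fastqc_reports", threads=2):
    """Build a non-interactive FastQC command as an argument list.

    outdir and threads are illustrative defaults, not values from this study.
    """
    return ["fastqc", "--outdir", outdir, "--threads", str(threads), *fastq_files]


if __name__ == "__main__":
    cmd = fastqc_command(["1_R1_001.fastq", "1_R2_001.fastq"])
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```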

3.2.1 Opening a Sequence file

To open one or more Sequence files interactively simply run the program and select File >

Open. You can then select the files you want to analyse. Newly opened files will immediately

appear in the set of tabs at the top of the screen. Because of the size of these files it can take a

couple of minutes to open them. FastQC operates a queueing system where only one file is

opened at a time, and new files will wait until existing files have been processed.

FastQC supports files in the following formats

• FastQ (all quality encoding variants)

• Casava FastQ files

• Colorspace FastQ

• GZip compressed FastQ

• SAM

• BAM

• SAM/BAM Mapped only (normally used for colorspace data)

FastQ files were used as input to FastQC in this work.

3.2.2 Evaluating Results

The analysis in FastQC is performed by a series of analysis modules. The left hand side of the

main interactive display or the top of the HTML report show a summary of the modules which

were run, and a quick evaluation of whether the results of the module seem entirely normal

(green tick), slightly abnormal (orange triangle) or very unusual (red cross).

It is important to stress that although the analysis results appear to give a pass/fail result, these

evaluations must be taken in the context of what you expect from your library. A 'normal'

sample as far as FastQC is concerned is random and diverse. Some experiments may be

expected to produce libraries which are biased in particular ways. You should treat the

summary evaluations therefore as pointers to where you should concentrate your attention and

understand why your library may not look random and diverse. Specific guidance on how to

interpret the output of each module can be found in the modules section of the help.

3.2.3 FastQC Report

Quality checks are performed for the following parameters:

• Basic Statistics: The Basic Statistics module generates some simple composition statistics for the file analysed. Basic Statistics never raises a warning or an error.

Table 2. Basic Statistics

Measure                              Value

Filename                             1_R1_001.fastq

File type                            Conventional base calls

Encoding                             Sanger / Illumina 1.9

Total Sequences                      172201

Sequences flagged as poor quality    0

Sequence length                      100

%GC                                  46

• Per base sequence quality: This view shows an overview of the range of quality values across all bases at each position in the FastQ file. The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).

Fig2. Per base sequence quality

Fig3. Per tile sequence quality

• Per sequence quality scores: It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc.); however, these should represent only a small percentage of the total sequences. A warning is raised if the most frequently observed mean quality is below 27, which equates to a 0.2% error rate. An error is raised if the most frequently observed mean quality is below 20, which equates to a 1% error rate.

Fig 4. Per sequence quality scores
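The error rates quoted for the quality thresholds follow from the Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion is given below; the function names are illustrative, and the Phred+33 offset is assumed because the files in this work use Sanger / Illumina 1.9 encoding.

```python
def phred_to_error(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read, as used by the per sequence quality module."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for q in (20, 27):
        print(q, f"{100 * phred_to_error(q):.2f}% error")
    print(mean_quality("IIII"))
```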

• Per base sequence content: Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called. This module issues a warning if the difference between A and T, or G and C, is greater than 10% in any position. This module will fail if the difference between A and T, or G and C, is greater than 20% in any position.

Fig5. Per base sequence content
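The warning and failure rule for per base sequence content can be reproduced with a few lines of code. The sketch below is an illustrative re-implementation of the thresholds quoted above, not FastQC's own code; the function names and the toy reads are assumptions.

```python
from collections import Counter


def per_position_percentages(reads):
    """Percentage of A, C, G and T at each position across a list of reads."""
    length = min(len(r) for r in reads)
    table = []
    for i in range(length):
        counts = Counter(r[i] for r in reads)
        total = sum(counts[b] for b in "ACGT")
        table.append({b: (100.0 * counts[b] / total if total else 0.0)
                      for b in "ACGT"})
    return table


def content_flag(pct):
    """Apply the 10% (warn) and 20% (fail) thresholds to one position."""
    diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
    if diff > 20:
        return "fail"
    if diff > 10:
        return "warn"
    return "pass"


if __name__ == "__main__":
    toy_reads = ["AC", "AG", "TC", "TG"]
    for position, pct in enumerate(per_position_percentages(toy_reads), start=1):
        print(position, pct, content_flag(pct))
```

In ddRAD libraries the first bases of every read carry the restriction-site overhang, so a failure of this module at the start of the reads is expected rather than alarming.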

• Per sequence GC content: This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads. This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads.

Fig6. Per sequence GC content

• Per base N content: If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This module plots out the percentage of base calls at each position for which an N was called. This module raises a warning if any position shows an N content of >5%. This module will raise an error if any position shows an N content of >20%.

Fig7. Per base N content

• Sequence Length Distribution: In many cases this will produce a simple graph showing a peak only at one size, but for variable length FastQ files this will show the relative amounts of each different size of sequence fragment. This module will raise a warning if all sequences are not the same length. This module will raise an error if any of the sequences have zero length.

Fig8. Sequence length distribution

• Sequence Duplication Levels: This module counts the degree of duplication for every sequence in the set and creates a plot showing the relative number of sequences with different degrees of duplication. This module will issue a warning if non-unique sequences make up more than 20% of the total. This module will issue an error if non-unique sequences make up more than 50% of the total.

Fig 9. Sequence duplication levels

• Overrepresented sequences: This module lists all of the sequences which make up more than 0.1% of the total. To conserve memory, only sequences which appear in the first 200,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but does not appear at the start of the file could be missed by this module. This module will issue a warning if any sequence is found to represent more than 0.1% of the total. This module will issue an error if any sequence is found to represent more than 1% of the total.

Table 3. Overrepresented sequences

Sequence                                            Count  Percentage           Possible Source

ATAGAGGCCAGCGGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACT  1420   0.8246177432186805   Illumina Multiplexing PCR Primer 2.01 (100% over 34bp)

ATAGAGGCCATGCCTCTCTAGTTCTTCAAGGGATGACAGGACACTTGTCG  795    0.4616697928583458   No Hit

ATAGAGGCCATGCCAGGCCTCCCTGTCCATCACCAACTCCCGGAGTTCAC  399    0.23170597151003766  No Hit

ATAGAGGCCATGCATTGGAGAAGGAAATGGCAACCCACTCCAGTGTTCTT  331    0.19221723451083328  No Hit

ATAGAGGCCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCC  279    0.16201996504085342  No Hit

ATAGAGGCCATGCTGCGATTCATGGGGTCGCAAAGAGTCGGACACGACTG  206    0.11962764443876632  No Hit

ATAGAGGCCATGCCAGGCCTCCCTGTCCATCACCAACTCCCAGAGTTCAC  203    0.11788549427703672  No Hit
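The percentage column of Table 3 is the count of each sequence divided by the total number of sequences in the file (172201, from the Basic Statistics table). The sketch below recomputes the value for the most frequent sequence and applies the 0.1% and 1% thresholds described for this module; the function name is an illustrative assumption.

```python
def overrepresented_status(count, total_sequences):
    """Return (percentage, status) for one overrepresented sequence."""
    pct = 100.0 * count / total_sequences
    if pct > 1.0:
        status = "fail"
    elif pct > 0.1:
        status = "warn"
    else:
        status = "pass"
    return pct, status


if __name__ == "__main__":
    # Count of the top sequence and the total number of reads in 1_R1_001.fastq.
    pct, status = overrepresented_status(1420, 172201)
    print(f"{pct:.4f}% -> {status}")
```

All seven sequences in the table fall between 0.1% and 1%, so the module raises a warning rather than an error for this file.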

• Adapter Content

Fig10. Adapter content

• Kmer Content: This module counts the enrichment of every 5-mer within the sequence library. It calculates an expected level at which this k-mer should have been seen based on the base content of the library as a whole and then uses the actual count to calculate an observed/expected ratio for that k-mer. In addition to reporting a list of hits it will draw a graph for the top 6 hits to show the pattern of enrichment of that k-mer across the length of your reads. This will show if you have a general enrichment, or if there is a pattern of bias at different points over your read length.

Fig11. Kmer content
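The observed/expected ratio described for the k-mer module can be sketched as follows. This is a simplified illustration, assuming that the expected count of a k-mer is the product of the library-wide base frequencies multiplied by the total number of k-mers; FastQC's own implementation differs in detail (for example, it also tracks enrichment by position).

```python
from collections import Counter


def kmer_obs_exp(reads, k=5):
    """Return {kmer: observed/expected ratio} for all k-mers in the reads."""
    # Base frequencies across the whole library.
    base_counts = Counter(base for read in reads for base in read)
    total_bases = sum(base_counts.values())
    freq = {base: n / total_bases for base, n in base_counts.items()}

    # Observed k-mer counts.
    kmer_counts = Counter(read[i:i + k]
                          for read in reads
                          for i in range(len(read) - k + 1))
    total_kmers = sum(kmer_counts.values())

    ratios = {}
    for kmer, observed in kmer_counts.items():
        probability = 1.0
        for base in kmer:
            probability *= freq[base]
        expected = probability * total_kmers
        ratios[kmer] = observed / expected
    return ratios


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAC", "GGGGGCCCCC"]
    ratios = kmer_obs_exp(toy_reads)
    # Show the most enriched k-mers first.
    for kmer, ratio in sorted(ratios.items(), key=lambda kv: -kv[1])[:3]:
        print(kmer, round(ratio, 1))
```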

3.3 Trimming

After the FastQC checks, the next step is quality trimming of the sequence data. Trimmomatic v0.32 was used to remove double-digest RAD adapters and to trim low-quality bases; it accepts quality scores in Phred+33 or Phred+64 encoding. The read mapping and variant calling steps of Stacks account for base quality, so minimal trimming of the data is needed. Typically, quality trimming only needs to be performed once. Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a real problem depending on the library preparation and downstream application.

There are two major modes of the program: Paired end mode and Single end mode. The paired

end mode will maintain correspondence of read pairs and also use the additional information

contained in paired reads to better find adapter or PCR primer fragments introduced by the

library preparation process. Trimmomatic works with FASTQ files (using phred + 33 or phred

+ 64 quality scores, depending on the Illumina pipeline used). Files compressed using either

„gzip‟ or „bzip2‟ are supported, and are identified by use of „.gz‟ or „.bz2‟ file extensions.

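Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The trimming steps and their associated parameters are supplied on the command line:

java -jar /opt/software/Trimmomatic-0.32/trimmomatic-0.32.jar PE 1_R1_001.fastq 1_R2_001.fastq 1_R1_forward_paired.fastq 1_R1_forward_unpaired.fastq 1_R2_reverse_paired.fastq 1_R2_reverse_unpaired.fastq ILLUMINACLIP:TrueSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Here ILLUMINACLIP removes adapter sequences, LEADING and TRAILING remove bases with quality below 3 from the start and end of each read, SLIDINGWINDOW cuts the read once the average quality within a 4-base window falls below 15, and MINLEN drops reads shorter than 36 bases.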
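To make the effect of the quality-trimming parameters concrete, the sketch below applies simplified versions of the LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps to a single read. It is an illustration only: it assumes Phred+33 encoding, omits adapter clipping, and cuts at the start of the first failing window, which may differ slightly from what Trimmomatic actually does. The function names are hypothetical.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_read(seq, qual, leading=3, trailing=3, window=4, min_avg=15, minlen=36):
    """Return (seq, qual) after trimming, or None if the read is dropped."""
    scores = phred33(qual)

    # LEADING: drop low-quality bases from the 5' end.
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    # TRAILING: drop low-quality bases from the 3' end.
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: cut where the window average first falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(scores[i:i + window]) / window < min_avg:
            end = i
            break

    # MINLEN: discard reads that became too short.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]


# Hypothetical reads: 'I' encodes quality 40, '#' quality 2, '+' quality 10.
kept = trim_read("A" * 50, "I" * 40 + "#" * 10)
dropped = trim_read("A" * 50, "I" * 20 + "+" * 30)
```

In the first example only the ten low-quality trailing bases are removed; in the second the read is cut after 20 bases and then discarded because it is shorter than 36 bases.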

3.3.1 Trimmomatic Report

Paired End Mode: For paired-end data, two input files, and 4 output files are specified, 2 for

the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired'

output where a read survived, but the partner read did not.

Fig12. Trimming report

3.4. Indexing

The reference genome must first be indexed with bowtie2-build. From the directory containing the genome .fna file, run the bowtie2-build command:

/opt/software/bowtie2/bowtie2-build GCA_000003055.5_Bos_taurus_UMD_3.1.1_genomic\(1\).fna output

This command creates six files with the .bt2 file extension, which Bowtie 2 then uses to map the data.
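A quick way to confirm that indexing finished is to check that all six index files exist. The sketch below assumes the index basename "output" used in the command above and the standard small-index file names; the helper names are hypothetical.

```python
from pathlib import Path

# Suffixes of the six files written for a small Bowtie 2 index.
# (Very large genomes produce ".bt2l" files instead; not handled here.)
INDEX_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def expected_index_files(basename="output"):
    """List the index file names expected for a given basename."""
    return [basename + suffix for suffix in INDEX_SUFFIXES]


def missing_index_files(directory, basename="output"):
    """Return the expected index files that are not present in directory."""
    folder = Path(directory)
    return [name for name in expected_index_files(basename)
            if not (folder / name).is_file()]
```

An empty list returned by missing_index_files(".") would indicate that the index is complete.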

3.5. Sequence Alignment Map

The trimmed, paired FASTQ files were mapped to the indexed reference genome with bowtie2-align-s:

/opt/software/bowtie2/bowtie2-align-s -x output -1 1_R1_forward_paired.fastq -2 1_R2_reverse_paired.fastq -S samfile_1.sam

The result is a SAM alignment file. SAM stands for Sequence Alignment/Map format, and BAM is the binary version of a SAM file. SAM is a text-based format, developed by Heng Li, for storing biological sequences aligned to a reference sequence.[1] It is widely used for storing data, such as nucleotide sequences, generated by next-generation sequencing technologies. "The format supports short and long reads (up to 128Mbp) produced by different sequencing platforms and is used to hold mapped data within the GATK and across the Broad Institute, the Sanger Centre, and throughout the 1000 Genomes project. Sequence Alignment/Map (SAM) format for alignment of nucleotide sequences (e.g. sequencing reads) to (a) reference sequence(s). May contain base-call and alignment qualities and other data." [2]

The SAM format consists of a header section and an alignment section.[1] The binary representation of a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and edited with the software SAMtools.[1] The header section, if present, must precede the alignment section. Header lines begin with the '@' symbol, which distinguishes them from the alignment section. Each alignment line has 11 mandatory fields, as well as a variable number of optional fields. The SAM flag is a little more difficult to decipher: its value is a bitwise flag, with each binary bit corresponding to a certain property of the alignment (see the format specification for more information). For example, if the 0x10 bit is set, then the read is aligned as the reverse complement (i.e. it maps to the - strand). Usually, the removal of duplicate reads or of non-unique alignments is handled by the downstream analysis program.
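The bitwise flag and the 11 mandatory fields can be illustrated with a short parser. This is a minimal sketch, not a replacement for SAMtools; the bit values and field names follow the SAM specification, while the function names and the example alignment line are made up for illustration.

```python
# Flag bits defined by the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

MANDATORY_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def parse_alignment(line):
    """Split one alignment line into mandatory and optional fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("an alignment line needs 11 mandatory fields")
    record = dict(zip(MANDATORY_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ"):
        record[key] = int(record[key])
    record["optional"] = columns[11:]
    # The 0x10 bit marks a read aligned as the reverse complement.
    record["strand"] = "-" if record["FLAG"] & 0x10 else "+"
    return record


# Hypothetical alignment line, for illustration only.
example = parse_alignment("read1\t16\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
```

Header lines (those starting with '@') should be skipped before calling parse_alignment.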

3.6. Stacks Pipeline

Several molecular approaches have been developed to focus short reads to specific, restriction-

enzyme anchored positions in the genome. Reduced representation techniques such as CRoPS,

RAD-seq, GBS, double-digest RAD-seq, and 2bRAD effectively subsample the genome of

multiple individuals at homologous locations, allowing for single nucleotide polymorphisms

(SNPs) to be identified and typed for tens or hundreds of thousands of markers spread evenly

throughout the genome in large numbers of individuals. This family of reduced representation

genotyping approaches has generically been called genotype-by-sequencing (GBS) or

Restriction-site Associated DNA sequencing (RAD-seq). Stacks is designed to work with any

restriction-enzyme based data, such as GBS, CRoPS, and both single and double digest RAD.

Stacks is designed as a modular pipeline to efficiently curate and assemble large numbers of

short-read sequences from multiple samples. Stacks identifies loci in a set of individuals, either

de novo or aligned to a reference genome (including gapped alignments), and then genotypes

each locus. Stacks incorporates a maximum likelihood statistical model to identify sequence

polymorphisms and distinguish them from sequencing errors. Stacks employs a Catalog to

record all loci identified in a population and matches individuals to that Catalog to determine

which haplotype alleles are present at every locus in each individual.

Stacks is implemented in C++ with wrapper programs written in Perl. The core algorithms are

multithreaded via OpenMP libraries and the software can handle data from hundreds of

individuals, comprising millions of genotypes. Stacks incorporates a MySQL database

component linked to a web front end that allows efficient data visualization, management and

modification.

Stacks proceeds in five major stages:

• First, reads are demultiplexed and cleaned by the process_radtags program.

• The next three stages comprise the main Stacks pipeline: building loci (ustacks/pstacks), creating the catalog of loci (cstacks), and matching against the catalog (sstacks).

• In the fifth stage, either the populations or the genotypes program is executed, depending on the type of input data. Based on my data, I executed the populations program.

This flow is diagrammed in the following figure.

Fig13. Stacks diagram

The goal in Stacks is to assemble loci in large numbers of individuals in a population or genetic

cross, call SNPs within those loci, and then read haplotypes from them. Therefore Stacks wants

data that is a uniform length, with coverage high enough to confidently call SNPs. Although it

is very useful in other bioinformatic analyses to variably trim raw reads, this creates loci that

have variable coverage, particularly at the 3’ end of the locus. In a population analysis, this

results in SNPs that are called in some individuals but not in others, depending on the amount

of trimming that went into the reads assembled into each locus, and this interferes with SNP

and haplotype calling in large populations.

3.6.1. Protocol Type

Stacks supports all the major restriction-enzyme digest protocols, such as RAD-seq, double-digest RAD-seq, and GBS, among others. For double-digest RAD data that have been paired-end sequenced, Stacks treats the loci built from the single-end and paired-end reads as two independent loci. The Stacks developers state that a future release will support merging these two loci into a single haplotype.

3.6.2. Sequencer Type

Stacks is optimized for short-read, Illumina-style sequencing. There is no inherent limit on sequence length, although the source code currently has a hard-coded limit of 1024 bp for efficiency reasons; this limit could be raised if the technology warranted it. Stacks can also be used with data produced by the Ion Torrent platform, but that platform produces reads of multiple lengths, so to use such data with Stacks the reads have to be truncated to a particular length and reads shorter than the chosen length discarded. The process_radtags program can truncate the reads from an Ion Torrent run. Other sequencing technologies could be used in theory, but the cost relative to the number of reads obtained is often prohibitive for building stacks and calling SNPs.

3.6.3. Paired-End Reads

Stacks does not directly support paired-end reads where the paired-end read is not anchored by a second restriction enzyme. In the case of double-digest RAD, both the single-end and paired-end reads are anchored by a restriction enzyme and can be assembled as independent loci. With protocols such as RAD, where the molecules are sheared, the paired-end reads do not stack up and therefore cannot be used directly. However, they can be used indirectly, for example by building contigs out of the paired-end reads that can then be used to build phylogenetic trees or to identify orthologous genes, and Stacks includes some tools to help do that.

3.6.4. Run The Pipeline

The simplest way to run the pipeline is to use one of the two wrapper programs provided: denovo_map.pl if no reference genome is available, and ref_map.pl if one is. In each case the program is given a list of the samples that were demultiplexed in the first step, along with several command-line options that control the internal algorithms. Because a reference genome was available, I used ref_map.pl. ref_map.pl expects data that have been aligned to a reference genome, and accepts either SAM or BAM files.

3.7. ref_map pipeline

The ref_map.pl program will execute the Stacks pipeline by running each of the Stacks

components individually. It is the simplest way to run Stacks and it handles many of the

details, such as sample numbering and loading data to the MySQL database, if desired.

The ref_map.pl program expects data to have been aligned to a reference genome, and can accept data from any aligner that can produce SAM or BAM formatted files. The program performs several stages, including:

• Running pstacks on each of the samples specified, assembling loci according to the

alignment positions provided for each read, and calling SNPs in each sample.

• Executing cstacks to create a catalog of all loci across the population (or from just the

parents if processing a genetic map). Loci from different samples are matched up across

the data set according to alignment position.

• Next, sstacks will be executed to match each sample against the catalog. In the case of a

genetic map, the parents and progeny are matched against the catalog.

• In the case of a population analysis, the populations program will be run to generate

population-level summary statistics. If you specified a population map (-O option) it

will be supplied to populations. If you are analyzing a genetic map,

the genotypes program will be executed to generate a set of markers and a set of initial

genotypes for export to a linkage mapping program.

• Computation is now complete. If database interaction is enabled, ref_map.pl will upload the results of each stage of the analysis: individual loci, the catalog, matches against the catalog, and genotypes or summary statistics into the database.

• Lastly, if database interaction is enabled, index_radtags.pl will be run to build a

database index to speed up access to the database and enable web-based filtering.

After the SAM files were created, the Stacks pipeline was run through ref_map.pl with the parameters -b (batch ID), -B (the MySQL database into which the data are loaded), -s (an individual sample file) and -o (the output path). ref_map.pl executes the pipeline, running pstacks instead of ustacks, taking the aligned reads as assembled loci and calling SNPs in each locus. ref_map.pl then runs the rest of the pipeline in the same way as the de novo pipeline, except that the -g option is passed to cstacks and sstacks so that their matching algorithms match on genomic location rather than sequence similarity.
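With many samples, the ref_map.pl command line becomes long, so it can help to assemble it programmatically. The sketch below uses only the four options named above (-b, -B, -s, -o); the helper name, the database name "buffalo_radtags" and the output directory are hypothetical, and the resulting command is printed rather than executed.

```python
import shlex


def build_ref_map_command(sam_files, batch_id, out_dir, database=None):
    """Assemble a ref_map.pl argument list with one -s option per SAM file."""
    if not sam_files:
        raise ValueError("at least one SAM file is required")
    command = ["ref_map.pl", "-b", str(batch_id), "-o", out_dir]
    if database is not None:
        # -B names the MySQL database that receives the results.
        command += ["-B", database]
    for sam in sam_files:
        command += ["-s", sam]
    return command


# Illustrative values only: two of the sample files and batch 1001.
cmd = build_ref_map_command(
    ["samfile_1.sam", "samfile_24.sam"],
    batch_id=1001,
    out_dir="./stacks_out",
    database="buffalo_radtags",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps file names with spaces or parentheses intact if the command is later passed to subprocess.run.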

Output:

• Building loci generates 3 files per sample: sample_alleles.tsv, sample_snps.tsv and sample_tags.tsv

• Cataloguing of observed SNPs: batch_1001.catalog.alleles.tsv, batch_1001.catalog.snps.tsv and batch_1001.catalog.tags.tsv

• Verifying individual samples against the catalogue: batch_1001.catalog.matches.tsv

Because the database and web interface (MySQL and the Apache web server) were installed, ref_map.pl could upload the output from the pipeline to the database for viewing in a web browser.

3.8 The Stacks Web Interface

To visualize data, Stacks uses a web-based interface (written in PHP) that interacts with a

MySQL database server. MySQL provides various functions to store, sort, and export data

from a database. The output from the Stacks pipeline is meant to be loaded into

a MySQL database and viewed online, facilitating data mining, and data correction. A database

schema is provided along with a set of PHP files to display the results of the Stacks pipeline,

resulting in an interface like this.

CHAPTER IV

RESULTS AND DISCUSSION

RESULTS

Double-digest RAD data from the buffalo genome for three traits were analysed to identify and annotate genetic variants with Stacks, the first generally available, widely used pipeline for the analysis of ddRADseq data. As described in Section 3.6, Stacks works best with reads of uniform length and with coverage high enough to call SNPs confidently, because variably trimmed reads produce loci with uneven coverage and SNPs that are called in some individuals but not in others.

Table. 4 Milk yield trait

Id  Type    Unique Stacks  Polymorphic Loci  SNPs Found  Source
1   Sample  1240           6                 12          samfile_1
2   Sample  19789          427               488         samfile_24
3   Sample  25654          525               609         samfile_27
4   Sample  15046          322               364         samfile_32
5   Sample  36542          859               1077        samfile_40
6   Sample  23564          502               633         samfile_42
7   Sample  11437          220               272         samfile_60
8   Sample  61262          1975              2311        samfile_63
9   Sample  42941          1269              1466        samfile_68
10  Sample  23460          439               520         samfile_73
11  Sample  73940          2688              3242        samfile_75
12  Sample  46808          1241              1541        samfile_25
13  Sample  16216          261               307         samfile_28
14  Sample  24744          463               562         samfile_29
15  Sample  59261          1822              2140        samfile_35
16  Sample  53680          1879              2205        samfile_36
17  Sample  52856          1536              1855        samfile_37
18  Sample  31352          771               914         samfile_39
19  Sample  38852          936               1070        samfile_59
20  Sample  1159           12                14          samfile_61
21  Sample  3237           42                52          samfile_62
22  Sample  1624           13                14          samfile_64
23  Sample  55711          1857              2351        samfile_67
24  Sample  39765          1077              1286        samfile_71
25  Sample  23716          423               497         samfile_74

(Milk yield trait samples: the unique stacks, polymorphic loci and SNPs found are given for each individual sample. A total of 25802 SNPs were found across the 25 samples.)

Table. 5 Marker

(The table above reports the total genotypes, genotype frequencies, mean log likelihood and genotype map.)

SNPs: (The sequence type is primary. A Stacks depth of 5x is the number of reads contained in the locus that matched the catalog. SNPs were found at columns 4, 5, 6 and 7 of the sequence; the four nucleotides at these positions, CATG, form a haplotype. CATG is the major allele and AGGC is the minor allele. Deleveraged flag: if "1", this stack was processed by the deleveraging algorithm and was broken down from a larger stack. Blacklisted flag: if "1", this stack was still confounded despite processing by the deleveraging algorithm. Lumberjackstack flag: if "1", this stack was set aside due to having an extreme depth of coverage.)

Fig14. SNPs

Haplotypes: (Two samples, samfile_1 and samfile_67, match. Genotype frequency a,b: 2 (100.0%). Two haplotypes were found, and the alleles are shown for each column. chr: GK000005, 102.47 Mb, +, LnL: -24.085.)

Fig15. Haplotypes

Table. 6 Lactation period trait

Id  Type    Unique Stacks  Polymorphic Loci  SNPs Found  Source
1   Sample  1240           6                 12          samfile_1
2   Sample  4789           68                76          samfile_2
3   Sample  9200           145               166         samfile_3
4   Sample  13724          236               278         samfile_4
5   Sample  8096           116               142         samfile_5
6   Sample  6139           90                106         samfile_6
7   Sample  19904          451               527         samfile_7
8   Sample  17745          342               418         samfile_8
9   Sample  6516           99                111         samfile_16
10  Sample  12879          248               282         samfile_65
11  Sample  36858          1033              1213        samfile_70
12  Sample  11922          186               219         samfile_9
13  Sample  2242           14                16          samfile_10
14  Sample  5501           64                92          samfile_11
15  Sample  20139          434               511         samfile_12
16  Sample  2613           23                29          samfile_13
17  Sample  8351           97                116         samfile_14
18  Sample  16160          327               376         samfile_15
19  Sample  39765          1077              1286        samfile_71
20  Sample  73940          2688              3242        samfile_75

(Lactation period trait samples: the unique stacks, polymorphic loci and SNPs found are given for each individual sample. A total of 9218 SNPs were found across the 20 samples.)

SNPs: (Four SNPs were identified in the individual sequence.)

Fig 16. SNPs

Haplotypes: (Two haplotypes were identified. chr: GK000005.2, 102.47 Mb, +, LnL: -16.48. Genotype frequencies aa: 1 (50.0%) and ab: 1 (50.0%). Two samples, samfile_1 and samfile_8, match.)

Fig 17. Haplotypes

Table. 7 SNP Summary Statistics

No.  Pop         BP         Column  Allele 1  Allele 2  P        N  Obs Het  Obs Hom  Exp Het  Exp Hom  π      FIS
1.   defaultpop  102471549  4       A         C         0.50000  1  1.000    0.000    0.500    0.500    1.000  0
2.   defaultpop  102471550  5       A         G         0.50000  1  1.000    0.000    0.500    0.500    1.000  0
3.   defaultpop  102471551  6       G         T         0.50000  1  1.000    0.000    0.500    0.500    1.000  0
4.   defaultpop  102471552  7       C         G         0.50000  1  1.000    0.000    0.500    0.500    1.000  0

(Pop = population. BP = base pair; if aligned to a reference genome, this is the base-pair position of the particular SNP. P = mean frequency of the most frequent allele at each locus in this population. N = number of individuals sampled at this site. Obs Het = mean observed heterozygosity in this population. Obs Hom = mean observed homozygosity in this population. Exp Het = mean expected heterozygosity in this population. Exp Hom = mean expected homozygosity in this population. π = an estimate of nucleotide diversity. FIS = the inbreeding coefficient of an individual (I) relative to the subpopulation (S).)
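The values in Table 7 can be reproduced from a single heterozygous individual. The sketch below uses the standard textbook definitions of these statistics (expected heterozygosity as 1 minus the sum of squared allele frequencies, π from pairwise allele comparisons, and FIS as 1 - Obs Het / π); that Stacks computes them in exactly this way is an assumption here, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def site_statistics(genotypes):
    """Summary statistics for one site.

    genotypes: list of two-letter diploid genotypes, e.g. ["AC"].
    """
    n_individuals = len(genotypes)
    allele_counts = Counter("".join(genotypes))
    n_alleles = sum(allele_counts.values())  # two alleles per individual

    p = max(allele_counts.values()) / n_alleles
    obs_het = sum(1 for g in genotypes if g[0] != g[1]) / n_individuals
    exp_het = 1 - sum((c / n_alleles) ** 2 for c in allele_counts.values())
    # pi: probability that two alleles drawn without replacement differ.
    pi = 1 - sum(comb(c, 2) for c in allele_counts.values()) / comb(n_alleles, 2)
    fis = 0.0 if pi == 0 else 1 - obs_het / pi

    return {
        "N": n_individuals, "P": p,
        "ObsHet": obs_het, "ObsHom": 1 - obs_het,
        "ExpHet": exp_het, "ExpHom": 1 - exp_het,
        "pi": pi, "Fis": fis,
    }


# First row of Table 7: one individual, heterozygous A/C.
row1 = site_statistics(["AC"])
```

With one heterozygous individual the two alleles always differ, which gives π = 1 and therefore FIS = 0, in agreement with the tabulated row.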

Table. 8 Haplotype Summary Statistics

No.  Pop         BP         N  Haplotype Cnt  Gene Diversity  Haplotype Diversity
1.   defaultpop  102471545  4  2              0.500           2.000

(N = number of alleles/haplotypes present at this locus. Haplotype Cnt = number of distinct haplotypes observed at this locus. Gene Diversity = the proportion of polymorphic loci across the genome. Haplotype Diversity is controlled by a variety of processes, including mutation, recombination, marker ascertainment and demography.)

Table. 9 Hapstats Summary Statistics

# Batch ID  Locus ID  Chr         BP      Pop ID      N   Haplotype Cnt  Gene Diversity  Smoothed Gene Diversity  Smoothed Gene Diversity P-value  Haplotype Diversity  Smoothed Haplotype Diversity  Smoothed Haplotype Diversity P-value  Haplotypes
4           137595    GJ058424.1  30072   defaultpop  6   3              0.7333          0                        0                                1.4                  0                             0                                     CCT:2;TAC:1;TAT:3
4           52900     GJ058425.1  107783  defaultpop  4   2              0.6667          0                        0                                2                    0                             0                                     ATA:2;CCT:2
4           80469     GJ058435.1  65466   defaultpop  10  2              0.5333          0                        0                                1.6                  0                             0                                     CCT:4;TAG:6
4           27018     GJ058435.1  65543   defaultpop  8   2              0.5357          0                        0                                0.5357               0                             0                                     A:5;G:3
4           80470     GJ058439.1  4205    defaultpop  6   2              0.5333          0                        0                                2.1333               0                             0                                     ACCG:2;GTGT:4
4           52904     GJ058444.1  49176   defaultpop  2   2              1               0                        0                                1                    0                             0                                     C:1;T:1
4           52905     GJ058444.1  49319   defaultpop  2   2              1               0                        0                                1                    0                             0                                     A:1;G:1
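The Gene Diversity column of Table 9 can be recomputed from the Haplotypes column. The sketch below assumes the unbiased estimator n/(n-1) × (1 - Σp²), where n is the number of haplotype copies and p the frequency of each haplotype; this formula reproduces the tabulated values, but that it is the exact formula implemented in Stacks is an assumption. The function names are hypothetical.

```python
def parse_haplotypes(field):
    """Parse a field such as 'CCT:2;TAC:1;TAT:3' into {haplotype: count}."""
    counts = {}
    for item in field.split(";"):
        haplotype, count = item.split(":")
        counts[haplotype] = int(count)
    return counts


def gene_diversity(counts):
    """Unbiased gene diversity from haplotype counts."""
    n = sum(counts.values())
    if n < 2:
        raise ValueError("at least two haplotype copies are required")
    sum_squared = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_squared)


# Haplotype strings copied from Table 9.
for field in ("CCT:2;TAC:1;TAT:3", "ATA:2;CCT:2", "CCT:4;TAG:6", "A:5;G:3", "C:1;T:1"):
    counts = parse_haplotypes(field)
    print(field, sum(counts.values()), len(counts), round(gene_diversity(counts), 4))
```

The second and third printed values correspond to the N and Haplotype Cnt columns, so the same loop also serves as a consistency check on the reconstructed table.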

Table. 10 Sumstats.Summary statistics

(Calculating summary statistics, such as heterozygosity, π, and FIS.)

Table. 11 Sumstats_Summary.Summary Statistics

(There are two tables in this file containing the same headings. The first table, labeled "Variant", reports these values calculated at only the variable sites in each population. The second table, labeled "All positions", reports these values calculated at all positions, both variable and fixed, in each population.)

Table. 12 Marker

Table. 13 Age at first Calving trait

Id  Type    Unique Stacks  Polymorphic Loci  SNPs Found  Source
1   Sample  24375          594               723         samfile_17
2   Sample  5579           87                99          samfile_19
3   Sample  17167          345               389         samfile_47
4   Sample  40380          1034              1205        samfile_48
5   Sample  19424          374               441         samfile_49
6   Sample  36098          891               1028        samfile_50
7   Sample  3459           36                42          samfile_51
8   Sample  81944          2994              3633        samfile_52
9   Sample  34541          866               1089        samfile_55
10  Sample  36350          975               1125        samfile_57
11  Sample  22824          450               546         samfile_58
12  Sample  22065          468               563         samfile_18
13  Sample  7403           96                120         samfile_20
14  Sample  15038          271               322         samfile_21
15  Sample  10624          158               207         samfile_22
16  Sample  36343          988               1195        samfile_43
17  Sample  28934          674               795         samfile_44
18  Sample  9812           133               159         samfile_45
19  Sample  28248          798               913         samfile_46
20  Sample  31060          616               738         samfile_53
21  Sample  32280          828               954         samfile_54
22  Sample  48987          1427              1628        samfile_56

(Age at first calving trait samples: the unique stacks, polymorphic loci and SNPs found are given for each individual sample. A total of 17914 SNPs were found across the 22 samples.)
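The per-trait SNP totals can be checked by summing the "SNPs Found" columns. The sketch below copies the counts from Tables 4, 6 and 13; the variable names are illustrative only. Summing the 22 rows listed in Table 13 gives 17914, the figure also quoted in the Conclusion.

```python
# "SNPs Found" values copied row by row from Tables 4, 6 and 13.
snps_found = {
    "milk_yield": [
        12, 488, 609, 364, 1077, 633, 272, 2311, 1466, 520, 3242, 1541, 307,
        562, 2140, 2205, 1855, 914, 1070, 14, 52, 14, 2351, 1286, 497,
    ],
    "lactation_period": [
        12, 76, 166, 278, 142, 106, 527, 418, 111, 282, 1213, 219, 16, 92,
        511, 29, 116, 376, 1286, 3242,
    ],
    "age_at_first_calving": [
        723, 99, 389, 1205, 441, 1028, 42, 3633, 1089, 1125, 546, 563, 120,
        322, 207, 1195, 795, 159, 913, 738, 954, 1628,
    ],
}

sample_counts = {trait: len(counts) for trait, counts in snps_found.items()}
totals = {trait: sum(counts) for trait, counts in snps_found.items()}

for trait in snps_found:
    print(trait, sample_counts[trait], totals[trait])
```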

Table. 13.1 Marker

(The table above reports the total genotypes, genotype frequencies, mean log likelihood and genotype map.)

SNPs: (Four SNPs were identified in the individual sequence.)

Fig18. SNPs

Haplotypes: (Two haplotypes were identified. chr: GK000002.2, 36.77 Mb, -, LnL: -18.745. Genotype frequencies aa: 1 (50.0%) and ab: 1 (50.0%). Two samples, samfile_17 and samfile_54, match. Alleles were identified from each SNP column.)

Fig 19. Haplotypes

DISCUSSION

In the present study, we generated 47 Gb of sequence data by targeted ddRAD sequencing. The data were mapped against the cattle genome assembly with an overall mapping rate of ~98%. The mapping rate was higher than that reported in an earlier study [23], mainly because of the experimental design: we targeted coding regions, which are more conserved than other parts of the genome, and then detected SNPs. Stacks used a Catalog to record all loci identified in the population and matched individuals against that Catalog to determine which haplotype alleles are present at every locus in each individual, as shown in the figures above. I calculated summary statistics such as heterozygosity, π and FIS with the populations program, and obtained these values both at the variant positions and at all positions (variant and fixed). The total genotypes, genotype frequencies, mean log likelihood and genotype map are reported in the results.

CHAPTER V

CONCLUSION

ddRAD sequencing was developed to focus short reads on specific, restriction-enzyme anchored positions in the genome. It combines the power of short-read sequencing technology with reduced-representation genome coverage to call sequence variation in the progeny of a segregating mapping population. Our investigation shows that Stacks, which is designed as a modular pipeline to efficiently curate and assemble large numbers of short-read sequences from multiple samples, is a useful tool for ddRAD data and, by extension, for related techniques such as RADseq and GBS. Stacks identifies loci in a set of individuals, either de novo or aligned to a reference genome (including gapped alignments), and then genotypes each locus. It incorporates a maximum likelihood statistical model to identify polymorphic loci (0.3585%) and distinguish them from sequencing errors. It employs a Catalog to record all loci identified in a population and matches individuals to that Catalog to determine which haplotype alleles are present at every locus in each individual. The total numbers of SNPs found for the three important buffalo traits, (1) milk yield, (2) lactation period and (3) age at first calving, were 25802, 9218 and 17914, respectively. For each trait, the results covered tags 1 to 36947, chromosome location GJ057537.1 from 0 Mb to 159 Mb, type exon, 1 to 100 alleles, 1 to 1000 matching samples and LnL from -500 to 0. The populations program can be used to compute population genetic measures such as FIS and π within populations and FST between populations.

REFERENCE

• Rudolph, M. C. et al., Metabolic regulation in the lactating mammary gland: a lipid synthesizing machine. Physiol. Genomics, 2007, 28, 323–336.

• Ding, X. et al., A novel single nucleotide polymorphism in exon 7 of LPL gene and its association with carcass traits and visceral fat deposition in yak (Bos grunniens) steers. Mol. Biol. Rep., 2012, 39, 669–673.

• Van Horn, C. G., Caviglia, J. M., Li, L. O., Wang, S., Granger, D. and Coleman, R. A., Characterization of recombinant long-chain rat acyl-CoA synthetase isoforms 3 and 6: identification of a novel variant of isoform 6. Biochemistry, 2005, 44, 1635–1642.

• Mashek, D. G. and Coleman, R. A., Cellular fatty acid uptake: the contribution of metabolism. Curr. Opin. Lipidol., 2006, 17, 274–278.

• Mercade, A. et al., Characterization of the porcine acyl-CoA synthetase long-chain 4 gene and its association with growth and meat quality traits. Anim. Genet., 2006, 37, 219–224.

• Schwehm, J. M., Kristyanne, E. S., Biggers, C. C. and Stites, W. E., Stability effects of increasing the hydrophobicity of solvent-exposed side chains in staphylococcal nuclease. Biochemistry, 1998, 37, 6939–6948.

• Manjithaya, R. R. and Dighe, R. R., The 3′ messenger RNA untranslated region of bovine follicle-stimulating hormone downregulates reporter expression: involvement of Au-rich elements and transfactors. Biol. Reprod., 2004, 71, 1158–1166.

• Rao, Y. S., Wang, Z. F., Chai, X. W., Nie, Q. H. and Zhang, X. Q., Relationship between 5′ UTR length and gene expression pattern in chicken. Genetica, 2013, 141, 311–318.

• Cohen-Zinder, M. et al., Identification of a missense mutation in the bovine ABCG2 gene with a major effect on the QTL on chromosome 6 affecting milk yield and composition in Holstein cattle. Genome Res., 2005, 15, 936–944.

• Robenek, H. et al., Butyrophilin controls milk fat globule secretion. Proc. Natl. Acad. Sci. USA, 2006, 103, 10385–10390.

CURRICULUM VITAE

Anjan kumar Pradhan

Email id:- [email protected]

Mob No:- 9668790024

OBJECTIVE

To enhance my performance for better output in a disciplined, organized and progressive way, applying sincerity and creativity to the hard work entrusted to me.

ACADEMIC QUALIFICATION

QUALIFICATION  SPECIALISATION  UNIVERSITY/INSTITUTION                           YEAR OF PASSING  % OF MARK
M.Sc.          Bioinformatics  Orissa University Of Agriculture and Technology  continuing
B.Sc.          Zoology(hns)    FakirMohan University                            2015             53.83
+2science      Science         SubarnaRekha Mahavidyalaya                       2012             53.5
10th                           GhantuaSahajog High School                       2010             71.5

BIOLOGICAL TOOLS AND DATABASES KNOWN

• Pymol, PSI-PRED, RM2TS, RASPD, parDOCK, Genemark, Glimmer, Easygene, GOR-V, Geno3D, Phyre2, Autodock, VMD, BLAST, MEGA.

• GenBank, EMBL, PDB, NCBI, NDB, Uniprot-KB, SWISS-PROT, Prosite, Pfam, DDBJ.

COMPUTER LANGUAGE KNOWN

• PERL, CGI-Perl, PHP, HTML, C++, JAVA

AREA OF INTEREST

• I have an interest in genomics, proteomics, molecular modeling, drug design and cancer biology.

PERSONAL DETAILS

Father’s Name:- Mr. Satyajit Pradhan

Date Of Birth:- 21/01/1995

Nationality:- Indian

Language:- English, Hindi,Odia

Hobbies:- Reading newspaper, current affairs, browsing internet and volleyball

PERMANENT ADDRESS

AT-Bishnupur,PO-Devog, DIST-Balasore, STATE-Odisha, PIN-756023

SEMINAR/WORKSHOP ATTENDED

• Attended workshop on “Supercomputing Facility for Bioinformatics &

Computational Biology” during 12th Sep.,2016 to 22nd Sep.,2016 organized by Indian Institute Of Technology, Delhi.

• Attended national seminar on “Microbial Technology Prospects & Application” during 25-26 Dec, 2015 organized by Orissa University Of Agriculture and Technology, Bhubaneswar.

DECLARATION

I hereby declare that all the above mentioned information is true to the best of my knowledge and belief.

Date:

BHUBANESWAR Anjan Kumar Pradhan