Genomes Third Edition Chapter 5: Understanding a Genome Sequence Copyright © Garland Science 2007 Terry Brown

Understanding A Genome Sequence

• Obtaining a genome sequence is not the ultimate goal; understanding what the sequence does is.

• The major challenge is to determine which portions of the genome are expressed.

• Finding a gene is not easy; it requires both:
– bioinformatics
– molecular biology techniques

• To locate a gene in genomic DNA means to establish:
– that the gene is expressed in the organism
– if the organism is a eukaryote, how the gene is divided into exons and introns
– which regulatory sequences control the gene
– where the gene starts and ends in the genomic sequence

Locating the Genes in A Genome Sequence

• Once the sequence of a genome, or part of a genome, is available, genes can be found:
– by analyzing the sequence by computer (bioinformatics)
– or by locating the genes experimentally

• Genes can be located along a piece of DNA by inspecting the sequence for:
– special features associated with genes
– such as start and stop codons

• Sequence inspection is a powerful tool and is usually the first method applied to a new sequence

The Coding Regions of Genes are Open Reading Frames

• Genes that code for proteins contain open reading frames (ORFs)

• An ORF is a series of codons that specifies the amino acid sequence of the protein

• The ORF:
– begins with a start codon (usually ATG)
– ends with a termination codon (TAA, TAG or TGA)

• The search for ORFs must cover both DNA strands

• That gives six reading frames to scan, which is an easy task for a computer

• The success of ORF scanning depends on how often termination codons occur by chance:
– If the DNA sequence is random and the GC content is about 50%, each of the three termination codons should appear by chance about once every 64 bp (4³ = 64)
– If the GC content is higher than 50%, the AT-rich termination codons are rarer, but one should still appear every 100-200 bp
– So a chance (non-coding) reading frame should rarely run for more than about 50 codons before it is terminated
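The chance-termination argument above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every base is drawn independently, with G and C equally frequent and A and T equally frequent; that independence model and the function names are illustrative assumptions, not part of the original slides.

```python
def stop_codon_probability(gc_content: float) -> float:
    """Probability that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) and P(A) = P(T).
    """
    p_gc = gc_content / 2        # probability of G (or of C)
    p_at = (1 - gc_content) / 2  # probability of A (or of T)
    p_taa = p_at * p_at * p_at
    p_tag = p_at * p_at * p_gc
    p_tga = p_at * p_gc * p_at
    return p_taa + p_tag + p_tga


def mean_codons_between_stops(gc_content: float) -> float:
    """Average number of codons read before a chance stop codon."""
    return 1 / stop_codon_probability(gc_content)


for gc in (0.5, 0.6, 0.7):
    p = stop_codon_probability(gc)
    print(f"GC {gc:.0%}: P(stop) = {p:.4f}, "
          f"about {mean_codons_between_stops(gc):.0f} codons between stops")
```

At 50% GC each stop codon has probability 1/64 per codon position, so chance reading frames are short; as the GC content rises the stops become rarer and chance ORFs become longer, which is why a length cut-off well above 50 codons is used.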

• The average gene length is:
– 317 codons for E. coli
– 483 codons for Saccharomyces cerevisiae
– about 450 codons for humans

• So a program can be written to select any ORF longer than, say, 100 codons

• In practice this is very effective for the E. coli genome

• But it is much less effective for the genomes of higher eukaryotes because of introns.
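A six-frame ORF scan with a length cut-off, as described above, can be sketched as follows. This is a simplified illustration, not a production gene finder: it takes the first ATG in each frame as the start, ignores introns, and reports minus-strand coordinates relative to the reverse-complemented sequence. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based, end-exclusive, and refer to the strand that was scanned.
    n_codons excludes the stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        results.append((strand, frame, start, i + 3, n_codons))
                    start = None                   # look for the next ATG
    return results
```

For example, `find_orfs(genome_fragment)` with the default cut-off keeps only ORFs of 100 codons or more, while `min_codons` can be lowered to inspect the shorter chance ORFs that the cut-off is designed to discard.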

Simple ORF scans are less effective with DNA of higher eukaryotes

• ORF scanning is quite effective for bacterial genomes but less effective for higher eukaryotic genomes because of:
– introns
– long intergenic regions, within which chance ORFs can be mistaken for genes

• Intergenic DNA makes up only 11% of the E. coli genome

• But 62% of the human genome

Figure 5.3 Genomes 3 (© Garland Science 2007)

• Introns are the main challenge for bioinformaticians

• The problem can be partially solved by three strategies:
– Codon bias
– Exon-intron boundaries
– Upstream regulatory sequences

Codon Bias

• Codon bias is based on the observation that not all codons are used equally often in the genes of a particular organism

• For example, leucine is specified by six codons (TTA, TTG, CTT, CTC, CTA and CTG)

• In human genes it is most often coded by CTG and only rarely by TTA or CTA

• Similarly, human genes use GTG four times more frequently than GTA for valine.

• This bias can be built into a program to aid the search for exons in the genome
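As a rough illustration of how codon bias can be measured, the sketch below counts the codons of a candidate ORF and reports how the six leucine codons are shared out. The function names are hypothetical; a real exon finder would compare such fractions with a reference codon usage table for the organism, which is not shown here.

```python
from collections import Counter

LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def codon_counts(orf: str) -> Counter:
    """Count the in-frame codons of a sequence, ignoring trailing bases."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def synonymous_fractions(orf: str, codons=LEUCINE_CODONS) -> dict:
    """Fraction of each synonymous codon among those used for one amino acid."""
    counts = codon_counts(orf)
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

An ORF whose leucine codons are mostly CTG looks more like a genuine human exon than one that uses TTA and CTA heavily; the same comparison can be repeated for every amino acid with more than one codon.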

Exon-Intron Boundaries

• To help find exons and introns, the distinctive sequence features at their boundaries can be searched for.

• These features are not very distinctive, however:
– The upstream exon-intron boundary is usually described as 5’-AG↓GTAAGT-3’
– This is only a consensus sequence, so many variants are found
– The downstream intron-exon boundary is even less well defined: 5’-PyPyPyPyPyPyNCAG↓-3’
– Py means either pyrimidine (T or C) and N means any nucleotide (A, T, C or G)

• At present it is very difficult to locate most exon-intron boundaries in this way
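The two consensus sequences can be turned into simple pattern searches. The sketch below matches the consensus exactly, which is an assumption made for brevity: as the text notes, real splice sites vary, so an exact-match scan will miss many genuine boundaries and practical tools use weighted scoring instead. Function names are hypothetical.

```python
import re

# Lookaheads are used so that overlapping candidate sites are all reported.
DONOR = re.compile(r"(?=AGGTAAGT)")              # 5'-AG|GTAAGT-3'
ACCEPTOR = re.compile(r"(?=[CT]{6}[ACGT]CAG)")   # 5'-PyPyPyPyPyPyNCAG|-3'


def donor_sites(seq: str):
    """0-based positions of the first intron base (the G of GTAAGT)."""
    return [m.start() + 2 for m in DONOR.finditer(seq.upper())]


def acceptor_sites(seq: str):
    """0-based positions of the first exon base after the ...NCAG cut."""
    return [m.start() + 10 for m in ACCEPTOR.finditer(seq.upper())]
```

Pairing each donor position with a downstream acceptor position gives candidate introns, which can then be checked against other evidence such as codon bias in the flanking exons.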


Upstream regulatory sequences

• Regulatory sequences also have distinctive features, which enable them to bind particular transcription factors.

• These features can be searched for in order to find the genes downstream of them.

• The problem is again variability: the regulatory regions of most genes are variable, which makes it difficult to rely on this technique

CpG islands

• Many human genes have a CpG island upstream of the gene

• These islands are about 1 kb long

• Within the island the GC content is greater than the genome average

• Some 40-50% of human genes have a CpG island of this type

• So if an ORF is found downstream of such a region, there is a good chance that it is a genuine gene expressed in humans.
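A sliding-window scan is one simple way to look for regions of raised GC content and CpG frequency. The window size, step and thresholds below (GC fraction above 0.5, observed/expected CpG ratio above 0.6) are assumptions chosen for illustration; they do not come from the slides, and the function names are hypothetical.

```python
def gc_fraction(window: str) -> float:
    """Fraction of G and C bases in a window."""
    return (window.count("G") + window.count("C")) / len(window)


def cpg_obs_exp(window: str) -> float:
    """Observed CpG count divided by the count expected from C and G content."""
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def cpg_windows(seq: str, window: int = 200, step: int = 50,
                min_gc: float = 0.5, min_ratio: float = 0.6):
    """Start positions of windows that pass both CpG-island criteria."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_fraction(w) > min_gc and cpg_obs_exp(w) > min_ratio:
            hits.append(start)
    return hits
```

Overlapping hits can be merged into candidate islands, and any ORF lying just downstream of a candidate can be given extra weight as a likely gene.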

Locating Genes for Functional RNA

• ORF scanning is appropriate for finding protein-coding genes.

• But genes for functional RNAs, such as rRNA and tRNA, are not made up of codons

• Functional RNAs do have their own distinctive features

• They can fold into complex secondary structures

• e.g. tRNA folds into a cloverleaf

• This is due to intramolecular base pairing

• These folding patterns provide a wealth of information that can be used to search for the genes.

• This type of search has proven quite effective for functional RNA genes

• Some RNAs, such as siRNA and miRNA, do not form complex secondary structures

• Most of them contain simpler stem-loop structures (hairpins)

• These also form by Watson-Crick base pairing

• Programs can apply thermodynamic rules to assess:
– the stability of such secondary structures
– the size of the loop
– the size of the stem
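A stem-loop search can be sketched as a hunt for a short segment followed, after a loop of limited size, by its reverse complement. This simplified version allows only perfect Watson-Crick pairs and a fixed stem length; it does not compute any thermodynamic stability, which real programs do. All names and default values are illustrative assumptions.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def find_hairpins(rna: str, stem: int = 4, min_loop: int = 3, max_loop: int = 8):
    """Return (start, stem_length, loop_length) for perfect stem-loops.

    `start` is the 0-based position of the first base of the 5' arm.
    DNA input is accepted and converted to RNA (T -> U).
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem - min_loop + 1):
        left_arm = rna[i:i + stem]
        # The 3' arm must be the reverse complement of the 5' arm.
        right_arm = left_arm.translate(RNA_COMPLEMENT)[::-1]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if rna[j:j + stem] == right_arm:
                hits.append((i, stem, loop))
    return hits
```

Candidates found this way would normally be ranked by stem length and loop size, then passed to a folding program for a proper stability estimate.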

• The regulatory regions of functional RNA genes can also be searched for

• The success rate can be increased by rigorously searching the regions left over after the protein-coding genes have been identified

Figure 5.6b Genomes 3 (© Garland Science 2007)

Homology searches and comparative genomics give an extra dimension to sequence inspection

• Most gene-finding software designed so far can find 95% of ORFs in eukaryotes

• But it still makes frequent mistakes in positioning exon-intron boundaries

• It also finds spurious ORFs, which is a major problem

• These problems can be overcome to a certain degree by a homology search

• The DNA under study is searched against a database of known genes to find any match

• A match indicates that the gene under study is evolutionarily related to the known gene

• So a homology search can help to assign a function to an entirely new sequence

Comparative Genomics

• A more precise form of homology search is possible when genome sequences are available for two or more related species

• Related species have genomes that share similarities inherited from their common ancestor

• Which have since diverged independently of each other

• Selection pressure keeps coding sequences more conserved than intergenic regions

• Therefore homologous genes can be easily identified by comparing the genomes

• An ORF that has no clear homolog in the related genomes can be discounted as almost certainly a chance sequence rather than a real gene

• The comparative genomics approach has been very successful for the Saccharomyces cerevisiae genome

• Because complete or partial sequences are available for about 16 related species

• The comparative analysis has authenticated many S. cerevisiae ORFs

• About 500 putative ORFs have been removed from the S. cerevisiae gene catalogue by this analysis

• The analysis can be made much more powerful by taking account of synteny

• Synteny is the conserved gene order displayed by genomes of related species.
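One simple way to look for synteny is to ask which pairs of neighbouring genes in one genome are also neighbours in the other. The sketch below assumes that homologous genes have already been given the same name in both gene lists; that naming, the gene identifiers in the usage note and the function name are hypothetical.

```python
def conserved_neighbours(order_a, order_b):
    """Gene pairs that are adjacent in both genomes, in either orientation.

    order_a and order_b are lists of gene names in chromosomal order.
    Pairs are returned in the order they occur in order_a.
    """
    pairs_b = {frozenset(pair) for pair in zip(order_b, order_b[1:])}
    return [
        (x, y)
        for x, y in zip(order_a, order_a[1:])
        if frozenset((x, y)) in pairs_b
    ]
```

For example, comparing `["g1", "g2", "g3", "g4", "g5"]` with `["g2", "g1", "g3", "g4", "g9"]` returns the pairs `("g1", "g2")` and `("g3", "g4")`. A putative ORF whose neighbours are conserved in a related genome is more likely to be a genuine gene than one found in an unconserved context.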

Automatic annotation of genome sequences

• One great advantage of bioinformatics techniques for gene identification is that the analytical programs can be combined into one integrated system.

• So different approaches, such as:
– ORF scanning
– Codon bias
– Regulatory sequence analysis
– Exon-intron boundary searches
– Homology to known genes
– Functional RNA analysis
– Comparative genomics

• can be integrated to annotate genomic sequences automatically

Experimental Techniques for gene location

• Gene finding with bioinformatics tools is a good first approach

• Where it does not help, genes can be found by testing for their expression as mRNA

• Hybridization techniques make it possible to find:
– transcription start and stop sites
– exon-intron boundaries
– termination sites

• In northern hybridization, an RNA preparation is separated by agarose gel electrophoresis and transferred to a nitrocellulose membrane, which is then probed with the DNA under study

• Northern hybridization indicates the number of genes present in the DNA fragment

• The technique has some limitations:
– A single gene can give multiple bands because of alternative splicing
– The RNA preparation may not represent the whole organism, so a gene that is not expressed in the sampled tissue or at the sampled time will be missed

Zoo-blotting: Locating Genes

• Similar in principle to northern hybridization, but the fragment is hybridized to DNA rather than RNA

• The DNA fragment is used to probe DNA preparations from a number of related species

• If the fragment hybridizes to DNA from other species, it contains sequence conserved between them, which suggests that it carries a gene

• The rationale is the same as for a homology search

cDNA sequencing enables genes to be mapped within DNA fragments

• The mRNA expressed in a cell at a particular stage and under particular conditions can be copied into DNA by reverse transcription; the product is called cDNA
– The cDNAs are often incomplete because the enzyme sometimes leaves the template before reaching its end

• Full-length cDNAs can be obtained by Rapid Amplification of cDNA Ends (RACE)

• RACE enables the start and end points of a gene to be identified with precision.

• The full-length cDNA sequence also enables the exon-intron boundaries to be delineated

Heteroduplex Analysis: Positions of Exons and Introns

• If the DNA fragment is cloned in an M13 vector, it can be obtained as single-stranded DNA

• This DNA can be hybridized with an mRNA preparation

• DNA-RNA hybrids form only in the exonic regions

• Introns in the DNA do not hybridize, because mRNA does not contain introns.

• The single-stranded regions are digested with S1 nuclease, which degrades any single-stranded DNA or RNA

• This leaves only the heteroduplex, which can be sized on an agarose gel to locate the exon boundaries

Determining the Function of Individual Genes

• After the coding regions/genes in a genome have been located, the next target is to assign functions to them

• Function can be determined in much the same way as gene location, i.e. by:
– Bioinformatics tools (homology searching)
– Experimental techniques (biochemical analysis)

• Homology reflects evolutionary relationships

• Homologous genes are of two types:
– Orthologous: homologous genes in different organisms
– Paralogous: homologous genes in the same organism

Homology analysis can provide information on the function of an entire gene or of segments within it

• For finding the function of a gene, a homology search with the DNA sequence is less informative than one with the protein sequence

• DNA has four nucleotides whereas proteins have twenty amino acids

• Therefore genes that are not homologous appear more distant when their amino acid sequences are compared.

• A homology search scores each alignment; the score can be calculated in two ways:
– By counting the positions at which the same amino acid is present in both sequences and converting the count to a percentage: this is the identity
– By also scoring non-identical amino acids according to how closely related they are, using a substitution matrix: this gives the degree of similarity between the two sequences
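The difference between the two scores can be seen with a small calculation on two sequences that are already aligned and have no gaps. The grouping of "similar" amino acids below is a toy assumption made for illustration; it stands in for a real substitution matrix, which assigns a graded score to every pair of amino acids. Function names are hypothetical.

```python
# Toy groups of chemically similar amino acids (illustrative assumption only).
SIMILAR_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def _check_aligned(a: str, b: str) -> None:
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and equal in length")


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions with the same amino acid."""
    _check_aligned(a, b)
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


def percent_similarity(a: str, b: str) -> float:
    """Percentage of positions that are identical or in the same toy group."""
    _check_aligned(a, b)

    def similar(x: str, y: str) -> bool:
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    return 100 * sum(similar(x, y) for x, y in zip(a, b)) / len(a)
```

For the aligned pair `MKLVDEAG` and `MRIVEEGA`, identity is 37.5% while similarity under this toy grouping is 75%, showing how conservative substitutions raise the similarity score above the identity score.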

Nucleotide identity 76% (Homologous)
Amino acid identity 28% (Not Homologous)

• The search is usually made with standard BLAST (Basic Local Alignment Search Tool)

• Standard BLAST is efficient at finding homologs with more than about 30-40% sequence similarity

• PSI-BLAST (Position-Specific Iterated BLAST) can find more distantly related sequences that a standard BLAST search misses

• Homology searching is of immense importance for understanding the function of a gene

• Some limitations of the analysis should be kept in mind:
– Some proteins in the databases have been assigned incorrect functions
– Unrelated sequences may be similar over part of their length, for example because they share a domain
– Homologous genes may perform very different biological functions

Using Homology Searching to Assign the Function to Human Disease Genes

Assigning Gene Function by Experimental Analysis

• Homology searching is not a panacea that can identify the function of all new genes.

• Therefore, experimental methods are needed to complement and extend the results of homology studies.

• Reverse genetics is an approach to discovering the function of a gene by analyzing the phenotypic effects of specific gene sequences obtained by DNA sequencing. It proceeds in the opposite direction to the forward genetic screens of classical genetics.

• Forward genetics seeks the genetic basis of a phenotype or trait, whereas reverse genetics seeks the phenotypes that arise from particular genes

Functional Analysis by Gene Inactivation

• Functional analysis of a gene can be performed by:
– Inactivation of the gene
– Overexpression of the gene
– Mutation analysis of the gene

• Gene inactivation is a powerful tool for elucidating the function of a gene and is a reverse genetics approach

Individual Genes can be Inactivated by Homologous Recombination

• A gene can be inactivated in a sequence-specific manner by homologous recombination

• For S. cerevisiae this strategy has revealed the function of many genes.

• A vector is constructed that carries a suitable, expressible antibiotic-resistance gene, such as the kanamycin-resistance gene (kanR).

• The gene under study is replaced by the kanamycin-resistance gene through homologous recombination.

• The resulting cells can be selected on kanamycin plates and the effect of the inactivation can be studied.

Homologous Recombination in mammalian systems

• In S. cerevisiae it is easy to study the effect of gene replacement by homologous recombination, because it is a unicellular organism

• In multicellular organisms such as humans and mice it is much more difficult, because the gene under study must be replaced in every cell of the organism so that its function in any cell type can be elucidated.

• The mouse, a model organism for humans because of its genetic similarity, can be engineered so that all of its cells contain the inactive gene.

• Embryonic stem cells are engineered in the same way as S. cerevisiae cells and then mixed with an early embryo, so that a chimera is generated

• The chimeras are then bred so that two gametes carrying the inactive gene can combine, giving rise to a homozygous animal in which both copies of the gene are inactive.

• Mice of this type are called knockout mice.

Gene Inactivation without Homologous Recombination

• Transposon tagging is a way to inactivate a gene without homologous recombination

• Genetically engineered transposons can be made that change position in response to a particular external stimulus.

• This strategy has helped greatly in understanding the function of many genes of Drosophila melanogaster

• The weakness of this approach is that transposition is not predictable, so many recombinants must be analyzed to find one in which the target gene is inactivated

RNA interference (RNAi)

• RNAi is a very powerful technique that can be used to silence many genes without changing the genetic makeup of the organism.

• It targets mRNA in a sequence-specific manner.

• All 19,000 predicted genes of Caenorhabditis elegans have been analyzed for function in this way.

• It also has the potential to be used in higher animals and has already been used with human cell lines and in mice.

• To work at the level of the whole organism, the RNAi effect must be potent and must be produced at a high level in every cell, so that the gene is silenced in all cells.

Gene Overexpression can also be used to Assess Function

• Gene knockout and silencing are not the only ways to elucidate the function of a gene

• Gene overexpression can also reveal the function of a gene.

• The gene dosage in the cell is increased, and the resulting effects help in understanding the function.

• Vectors can be constructed that overexpress a target gene from a strong promoter

• These are also multicopy vectors, so that the gene dosage is increased.

• This approach has also helped to elucidate the function of many genes.

More Detailed Studies of the Activity of a Protein Coded by an Unknown Gene

• Gene inactivation and gene overexpression can determine the general function of a gene

• But they cannot provide detailed information about that function, such as:
– Which part of the protein is involved in which activity or in regulation?
– Where is the protein expressed?
– When is the protein needed by the cell?

• A more detailed analysis is needed to gain insight into these aspects.

Directed Mutagenesis can be used to Probe Gene Function in Detail

• Site-directed mutagenesis is a technique in which a protein is altered at a chosen position by mutating its gene

• For example, an active domain can be changed to make the protein faster-acting or more heat tolerant

• The technique is widely used in protein engineering

• Where the aim is to develop novel proteins with properties better suited to industrial or clinical use.

Site Directed Mutagenesis

• There are three common ways to make site-specific mutations in a gene:
– Oligonucleotide-directed mutagenesis
– Artificial gene synthesis
– PCR
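Whichever laboratory method is used, the designed change is a replacement of one codon in the coding sequence. The sketch below models only that sequence edit, for example when planning a mutagenic oligonucleotide; it does not represent the laboratory procedure itself, and the function name is hypothetical.

```python
def mutate_codon(coding_seq: str, codon_number: int, new_codon: str) -> str:
    """Return the coding sequence with one codon replaced.

    codon_number is 1-based, so codon 1 is normally the ATG start codon.
    """
    if len(new_codon) != 3:
        raise ValueError("new_codon must be exactly three bases")
    start = (codon_number - 1) * 3
    if start < 0 or start + 3 > len(coding_seq):
        raise IndexError("codon_number is outside the coding sequence")
    return coding_seq[:start] + new_codon.upper() + coding_seq[start + 3:]
```

For instance, `mutate_codon("ATGCTGGTATAA", 3, "GTG")` returns `"ATGCTGGTGTAA"`, changing the third codon from GTA to GTG while leaving the rest of the reading frame intact.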

Reporter Genes and Immunocytochemistry can be used to Locate Where and When Genes are Expressed

• We can determine experimentally where in an organism a protein is expressed

• And at what stage it is expressed, by examining its expression visually

• When and where a gene is expressed can be followed with a reporter gene placed under the same regulatory sequences as the target gene

• And the location of the protein itself can be found by immunocytochemistry


Case Study: Annotation of the Saccharomyces cerevisiae Genome Sequence

• We have studied different techniques for determining the position and expression of the genes of an organism.

• Now we will see how these techniques were applied to the S. cerevisiae genome sequence to elucidate the positions and functions of its genes.

• The choice of techniques depends on several considerations:
– Type of genome
– Availability of related sequences
– Ease of experimentation for elucidating gene function


Annotation of the yeast genome sequence

• The S. cerevisiae sequencing project was completed in 1996.

• The initial analysis, with a cut-off of 100 codons for potential genes, identified 6274 ORFs.

• About 30% of these were known to be genuine genes, because they had been identified by conventional genetic analysis before the genome was sequenced.

• The remaining 70% were studied by homology analysis once the genome had been completely sequenced.

• The results showed:

• Almost 30% of the genes could be assigned a function after homology searching of the sequence databases.
– About half of these were clear homologs of genes whose function was already known.
– The other half had less striking similarities, in many cases restricted to particular domains, so the matches were of limited usefulness

Table 5.2 part 1 of 2 Genomes 3 (© Garland Science 2007)

• For some genes the homology search identified the function exactly, e.g. the different subunits of DNA polymerase

• For other genes the function suggested by homology was puzzling, e.g. a homolog of a bacterial nitrogen fixation gene, which turned out to be involved in the synthesis of metal-containing proteins, a group that includes the bacterial nitrogen-fixation proteins.

• About 10% of all S. cerevisiae genes had homologs in the databases, but the homologs were themselves of unknown function, so the function of these genes could not easily be assigned. Such genes are called orphan families

• The remaining yeast genes, about 30% of the total, had no homologs in the databases.

• About 7% of the ORFs were questionable ORFs that might not be real genes.

• The remainder looked like genes but were unique, and are called single orphans
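The percentages quoted above can be turned into approximate ORF counts. This tally assumes that every percentage is a share of the 6274 ORFs found in the initial analysis, and that the questionable ORFs (7%) form part of the 30% with no database homolog; both readings are assumptions, and the counts are rounded estimates rather than figures from the original annotation.

```python
TOTAL_ORFS = 6274

# Shares of the initial ORF set, as read from the percentages above.
shares = {
    "identified before sequencing": 0.30,
    "function assigned by homology": 0.30,
    "orphan families": 0.10,
    "no homolog in the databases": 0.30,
}

questionable = 0.07
single_orphans = shares["no homolog in the databases"] - questionable

for category, share in shares.items():
    print(f"{category}: about {round(TOTAL_ORFS * share)} ORFs")
print(f"questionable ORFs: about {round(TOTAL_ORFS * questionable)}")
print(f"single orphans: about {round(TOTAL_ORFS * single_orphans)}")
```

On this reading the four categories account for the whole ORF set, and roughly 23% of the ORFs are single orphans whose status as genes had to be settled experimentally.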

• After the initial annotation of the S. cerevisiae genome, two questions remained:
– How many of the single orphans are genuine genes?
– Are there genuine genes shorter than 100 codons?

• Although there were just 6274 ORFs longer than 100 codons, there were 100,000 ORFs of 15 codons or less, many of them with the codon bias typical of S. cerevisiae.

• Therefore the potential for finding new genes was quite high.

• So experimental work was carried out to elucidate the function of the genes


Experiments to find the function of genes in the S. cerevisiae genome

• Comparative genomics
– By comparing genes in closely related genomes

• Sequencing cDNA libraries
– Which shows which genes are transcribed

• Transposon tagging
– Gene inactivation, so that the function of the genes can be found
– The strategy was made robust by using transposon tagging with molecular barcodes

• These experiments are continuing, but their results so far have reduced the yeast gene catalogue to about 6120 genes:
– By removing many long ORFs that were previously thought to be genes
– By adding some ORFs shorter than 100 codons
