5
Proc. Natt Acad. Sci. USA Vol. 78, No. 12, pp. 7665-7669, December 1981 Genetics Enhanced graphic matrix analysis of nucleic acid and protein sequences (sequence homology/transposition/secondary structure/immunoglobulin/microcomputer) JACOB V. MAIZEL, JR., AND ROBERT P. LENK Section on Molecular Structure, Laboratory of Molecular Genetics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland 20205 Communicated by Philip Leder, September 23, 1981 ABSTRACT The enhanced graphic matrix procedure ana- lyzes nucleic acid and amino acid sequences for features of possible biological interest and reveals the spatial patterns of such features. When a sequence is compared to itself the technique shows regions of self-complementarity, direct repeats, and palindromic subse- quences. Comparison of two different sequences, exemplified by immunoglobulin K light chain genes, by using colored graphic matrices showed domains of similarity, regions of divergence, and features explainable by transpositions. Analysis of mouse constant domain immunoglobulin sequences revealed self-complementary regions that can be used to fold the molecule into a structure con- sistent with electron microscopic observations. Computer trans- lation of nucleic acid sequences into all possible amino acid se- quences followed by graphic matrix analysis provides a way to detect the imost likely protein encoding regions and can predict the correct reading frames in sequences in which splicing patterns are not defined. Application of this technique to regions of simian virus 40 and polyoma virus demonstrates the frames of translation and shows the agreement of sequences determined in separate labo- ratories with different virus isolates. The graphic matrix tech- nique can also be used to assemble fragmentary sequences during determination, to display local variations in base composition, to detect distant evolutionary relationships, and to display intragenic variation in rates of evolution. distances (9), these programs are limited in the body of data they can handle, because the computations increase as a power of the sequence length. Available resources are exhausted by se- quence lengths on the order of 103 to 104 nucleotides. An even more severe limitation exists for programs that attempt to pre- dict the folding of single-stranded nucleic acids by evaluating intra-strand self-complementarity. This paper describes a simple sequence comparison based on a graphic matrix display. Graphic techniques were used by Tinoco et al. (10) to find regions of self-complementarity in RNA and by Gibbs and McIntyre (11), Fitch (12), and McLachlan (13) to determine regions of homology in amino acid and nucleic acid sequences. The refinements described here have already been applied to globin (14) and immunoglobulin gene se- quences (15, 16). These programs enable the approach to be used with sequences of more than 10,000 bases and are en- hanced by various subroutines that detect partial homology be- tween sequences. A large graphic matrix can be compressed into a single page for a broader overview. Additional programs permute sequences, either by complementing or translating them into amino acids, and use color to increase the information content of the graphic matrix. The recent development of techniques for rapid determination of nucleic acid sequences (1, 2) has allowed complete specifi- cation of a number of cellular genes and the entire chromosomes of several viruses. The data, consisting simply of finite arrays of nucleotides, embodies the intrinsic blueprint of an organism. Transmission and correlation of this large and growing collection of data is ideally handled by computer. Several widely distributed computer programs and their de- scendents are in use (3-7). A review of some programs for nu- cleic acid analysis has appeared (8). Most frequently used are programs that search for occurrences of short subsequences that are used by enzymes as signals to recognize, modify, and express nucleic acids, that determine the frequency and locations of short strings of nucleotides, and that translate nucleic acid se- quences into amino acid sequences or complementary polynu- cleotide strands. Another category of analysis, originally applied to amino acid sequences, uses programs that collate large stretches of se- quence data to determine imperfect homologies and thereby show evolutionary relationships. Often the approach is to gen- erate a matrix with a numerical calculation for each nucleotide comparison. Aside from sophisticated caveats, such as the rel- ative efficacy of various algorithms for measuring evolutionary MATERIALS AND METHODS The graphic matrix programs are written in Hewlett-Packard extended BASIC specifically for a Hewlett-Packard 9845 desk- top computer with 64 kilobytes of memory and integral dot matrix printer. They could be adapted to other microcompu- ters. Ancillary equipment includes a model 9872A four-color plotter, and a model 9885 flexible disc drive. Linkage with the data storage facilities at the National Institutes of Health, the Stanford Molgen project, and the Dayhoff Nucleic Acid Se- quence project is through a communications-terminal emulat- ing program, acoustic coupling device, and standard telephone. The basic program, called REVEAL, functions by projecting two sequences to be compared along the horizontal and vertical axes of a two-dimensional array. The matrix is constructed by marking the position corresponding to every element with a dot where the bases corresponding to the vertical and horizontal coordinates are identical. A nucleic acid sequence comparison of one base at a time consists of only four kinds of horizontal rows, one for each of the four types of bases (Fig. 1). this feature is utilized to simplify the actual computation by storing each type of row and recalling it when the corresponding base is en- countered along the vertical column. A comparison of two 1000- base sequences, one base at a time, is completed and printed in 1-2 min. For analysis of three bases at a time, using stored rows for 64 triplets, about 5 min is required. The utility of the graphic matrix procedure is enhanced by a filtration procedure in which n bases (n is an odd number) are The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertise- ment" in accordance with 18 U. S. C. §1734 solely to indicate this fact. 7665

Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

Embed Size (px)

Citation preview

Page 1: Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

Proc. Natt Acad. Sci. USAVol. 78, No. 12, pp. 7665-7669, December 1981Genetics

Enhanced graphic matrix analysis of nucleic acid andprotein sequences

(sequence homology/transposition/secondary structure/immunoglobulin/microcomputer)

JACOB V. MAIZEL, JR., AND ROBERT P. LENKSection on Molecular Structure, Laboratory of Molecular Genetics, National Institute of Child Health and Human Development, National Institutes of Health,Bethesda, Maryland 20205

Communicated by Philip Leder, September 23, 1981

ABSTRACT The enhanced graphic matrix procedure ana-lyzes nucleic acid and amino acid sequences for features ofpossiblebiological interest and reveals the spatial patterns ofsuch features.When a sequence is compared to itselfthe technique shows regionsof self-complementarity, direct repeats, and palindromic subse-quences. Comparison of two different sequences, exemplified byimmunoglobulin K light chain genes, by using colored graphicmatrices showed domains of similarity, regions of divergence, andfeatures explainable by transpositions. Analysis ofmouse constantdomain immunoglobulin sequences revealed self-complementaryregions that can be used to fold the molecule into a structure con-sistent with electron microscopic observations. Computer trans-lation of nucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides a way todetect theimost likely protein encoding regions and can predict thecorrect reading frames in sequences in which splicing patterns arenot defined. Application ofthis technique to regions ofsimian virus40 and polyoma virus demonstrates the frames of translation andshows the agreement of sequences determined in separate labo-ratories with different virus isolates. The graphic matrix tech-nique can also be used to assemble fragmentary sequences duringdetermination, to display local variations in base composition, todetect distant evolutionary relationships, and to display intragenicvariation in rates of evolution.

distances (9), these programs are limited in the body ofdata theycan handle, because the computations increase as a power ofthe sequence length. Available resources are exhausted by se-quence lengths on the order of 103 to 104 nucleotides. An evenmore severe limitation exists for programs that attempt to pre-dict the folding of single-stranded nucleic acids by evaluatingintra-strand self-complementarity.

This paper describes a simple sequence comparison basedon a graphic matrix display. Graphic techniques were used byTinoco et al. (10) to find regions ofself-complementarity in RNAand by Gibbs and McIntyre (11), Fitch (12), and McLachlan(13) to determine regions ofhomology in amino acid and nucleicacid sequences. The refinements described here have alreadybeen applied to globin (14) and immunoglobulin gene se-quences (15, 16). These programs enable the approach to beused with sequences of more than 10,000 bases and are en-hanced by various subroutines that detect partial homology be-tween sequences. A large graphic matrix can be compressedinto a single page for a broader overview. Additional programspermute sequences, either by complementing or translatingthem into amino acids, and use color to increase the informationcontent of the graphic matrix.

The recent development of techniques for rapid determinationof nucleic acid sequences (1, 2) has allowed complete specifi-cation ofa number ofcellular genes and the entire chromosomesof several viruses. The data, consisting simply of finite arraysofnucleotides, embodies the intrinsic blueprint ofan organism.Transmission and correlation ofthis large and growing collectionof data is ideally handled by computer.

Several widely distributed computer programs and their de-scendents are in use (3-7). A review of some programs for nu-cleic acid analysis has appeared (8). Most frequently used areprograms that search for occurrences ofshort subsequences thatare used by enzymes as signals to recognize, modify, and expressnucleic acids, that determine the frequency and locations ofshort strings of nucleotides, and that translate nucleic acid se-quences into amino acid sequences or complementary polynu-cleotide strands.

Another category ofanalysis, originally applied to amino acidsequences, uses programs that collate large stretches of se-quence data to determine imperfect homologies and therebyshow evolutionary relationships. Often the approach is to gen-erate a matrix with a numerical calculation for each nucleotidecomparison. Aside from sophisticated caveats, such as the rel-ative efficacy of various algorithms for measuring evolutionary

MATERIALS AND METHODSThe graphic matrix programs are written in Hewlett-Packardextended BASIC specifically for a Hewlett-Packard 9845 desk-top computer with 64 kilobytes of memory and integral dotmatrix printer. They could be adapted to other microcompu-ters. Ancillary equipment includes a model 9872A four-colorplotter, and a model 9885 flexible disc drive. Linkage with thedata storage facilities at the National Institutes of Health, theStanford Molgen project, and the Dayhoff Nucleic Acid Se-quence project is through a communications-terminal emulat-ing program, acoustic coupling device, and standard telephone.The basic program, called REVEAL, functions by projecting

two sequences to be compared along the horizontal and verticalaxes of a two-dimensional array. The matrix is constructed bymarking the position corresponding to every element with a dotwhere the bases corresponding to the vertical and horizontalcoordinates are identical. A nucleic acid sequence comparisonof one base at a time consists of only four kinds of horizontalrows, one for each ofthe four types ofbases (Fig. 1). this featureis utilized to simplify the actual computation by storing eachtype ofrow and recalling it when the corresponding base is en-countered along the vertical column. A comparison oftwo 1000-base sequences, one base at a time, is completed and printedin 1-2 min. For analysis of three bases at a time, using storedrows for 64 triplets, about 5 min is required.The utility of the graphic matrix procedure is enhanced by

a filtration procedure in which n bases (n is an odd number) are

The publication costs ofthis article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertise-ment" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

7665

Page 2: Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

7666 Genetics: Maizel and Lenk

GGGRTC

RC

GTRTGC

RTTRGC

RTRC

RTC

RC

GC

GG

B

C

TTA AC*GG*CTA

G GGATCAC GCATGACGCGG

CGGGRTCRCGTRTGCRTTRCARTRCRTCRCG GGC

C __1 C C 1CG 7GGG G Gm GG

EFc1CC R1 C-G GGGG G,

iTT FTT

T -i3 1T-GA[G [GtGG m' Lm

9_T_IR~~~~~~~~~~FE RIER1_R: RR§_

T mTlG G, FGTE' mYR [EE R FR E

T

G FGTGGI I' mC C Ct IC a

T AMTCR r !/y[1[

T [GT,mGR RSRBg I[ EG

T L FTTTVC ,' mC, [m~ OCOCr[

FIG. 1. Demonstration of the graphic matrix technique using ar-tificial sequences. (A) Self-comparison of a sequence reveals direct re-peats (solid line). (C) Comparison of a sequence and its reversed andcomplemented sequence reveals self-complementary regions (solidline). The diagonal of 2-fold symmetry is shown as a dashed line.(B) Folding of a single-stranded polynucleotide based on self-complementarity.

collated at each step. The program allows for a minimal numberof homologous matches, m, out of n bases to be scored by en-

tering a dot in the position of the central coordinate. In the

ensuing step the horizontal n-base subsequence is stepped over

one base and the homologies are assessed again. At the end ofa line the vertical subsequence is dropped by one base and theprocess is repeated. Considerably greater time is needed forthis type ofanalysis in comparison with the high-speed base-for-base comparison above.

Additional subroutines permute sequences in one of severalways including translation into amino acids in three readingframes, generating a sequence's complement, or allowing theweak interaction of GOT or G&U base pairs to contribute to re-

verse complement matching.

RESULTSDemonstration of Features in a Hypothetical Polynucleo-

tide by Using the Graphic Matrix. Fig. 1 shows the analysis ofa hypothetical nucleic acid to demonstrate the concepts and themain features revealed by the graphic matrix technique. If a

sequence is compared with itself, as in Fig. 1A, a perfect di-agonal arises from the match of each base with itself. Further-more, repeated regions within the sequence appear as parallellines-for example, the subsequence A-T-C-A-C-G starting atbases 4 and 25. A self-comparison matrix has identical halvesabove and below the main diagonal, but comparison ofdifferentsequences does not. Alphabetic palindromes are recognized asdiagonal lines in the lower left to upper right direction. An ex-ample is A-C-G-T-A and A-T-G-C-A starting at positions 7 and11. No biological relevance has been attached to this type offeature; therefore, it provides a measure of the "chance" oc-currence of matches of various lengths.

Comparison of the sequence with its reversed complementreveals regions of self-complementarity that can be involved inthe formation of secondary structure in single-stranded mole-cules as shown in Fig. 1C. This function has increasing impor-tance as secondary, and higher order, structures are implicatedin the expression of single-stranded nucleic acids. Self-comple-mentarity is revealed by diagonal lines in the upper left to lowerright direction-for example, G-T-A-T-G-C beginning at po-sition 9 is self-complementary to G-C-A-T-A-C beginning atposition 19. There is a 2-fold symmetry on either side of thenonmatching diagonal (dashed line) that runs from the lower leftto the upper right corners that is useful in recognizing the re-

gions ofself-complementarity. The two subsequences that couldpair are identified by projecting vertical lines from the ends ofthe two diagonal segments to the horizontal (forward) sequence.Then the complementary regions can be joined and the mole-cule folded as shown in Fig. 1B. From even a small sequence,a large number of possible combinations can occur. The eval-uation of all their stabilities in order to choose the most stableone is a large task. Programs to search and evaluate these pos-sibilities in a rigorous manner are an active subject of investi-gation (9, 17, 18).

Comparison of Two Related Sequences: Mouse K ChainVariable Regions. Fig. 2 Left is a heterologous comparison ofportions of K-2 (K2) and K-41 (K41) variable regions of twosubgroups from the mouse immunoglobulin system. These sub-sequences are of the germ-line genome and contain the leader,intervening, and coding sequences (19-21). The sequence isdisplayed with each of the four bases assigned a different color,allowing direct reading of the homologous sequences from thegraphic matrix. Local variations in base composition are readilyapparent. Runs of repeated bases appear as lines or rectanglesof a single color.

Lines ofpartial homology are seen in Fig. 2 Left between theregions ofbases 540-590 ofK2 and 300-350 ofK41 and to vary-ing degrees between 620-850 of K2 and 380-600 of K41. The

A

Proc. Natl. Acad. Sci. USA 78 (1981)

Page 3: Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

Genetics:MaizelandLenk~~~~~Proc.Natl. Acad. Sci. USA 78 (1981) 7667

K2-Leader InterveningVrm~400 500 600 700 $00

A

r N~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

30--*0f

400<

501 615 620

1530 \

6001 35

K2-Iiade intervening VarnablvI,

100

I- A

r-4v

I 1, .0)r-

-200 5--

2

I

in.to

300 ---q

.XAhl-A

-400

K.

-500

B~~~~~~~~-0

FIG. 2. Graphic matrix Of two K chain variable regions from the mouse. K chain variable region 2 is on the horizontal axis, and region 41 is onthe vertical (Left). Each dot represents one base: black, T-T; red, A-A; green, C-C; blue, G-G. (Right) The same sequences filtered by collating ninenucleotides at a step. Black, all nine bases are homologous; red, eight of the nine; blue, seven of nine; green, six of nine. Absence of a dot indicatesfive or fewer matches.

overall homology is 68% and typical of partially divergent se-quences. Fig. 2 Right illustrates a matrix in which the extentofhomology in nine bases is indicated by the color ofa dot placedat the center of the group. This method effectively removesmany of the chance background homologies. In addition to themain diagonal, short regions of enhanced homology are seenbetween 633-646 of K2 and 191-204 of K41 and between611-622 of K2 and 528-539 of K41. They coincide with placeswhere the main diagonal has vanished. It is possible to read thesequences directly from the colored matrix, especially Fig. 2Left, Insets A and B. They are:

A B

K2-633 AAGTGGGAATATTC 611 CTGTCACCATCA

K41-191 AATTGGGAA1TTTC 528 CTCTCAGCATCA

On the basis of a numerical program (5, 6), these homologiesare significantly improbable (4.8 x 10'6 and 1.8 X 10-6); there

are no other homologies with probabilities less than 5 x 10-5.

Transpositions, or duplications followed by mutation, of por-

tions of the K41 sequence bases 390-191 and 378-538 would

account for this result. The former event would involve a move

from a coding region to an intervening sequence; the latter

would involve a move of a coding region to another coding re-

gion. It is not known whether the events occurred by reciprocal

exchange of sequence at these sites or by segmental gene con-

version (22). A role for such transpositions in the function of the

immunoglobulin gene remains to be found, but because the

germ-line and a differentiated gene have similar sequences, it

is not necessary to imagine a role as a somatic generator of

diversity.

Secondary Structure of RNA Inferred from the GraphicMatrix. The filtration subroutine can reveal regions of reverse

complementarity within a sequence and facilitate the selection

ofregions for combination into a folded structure. Fig. 3A dem-

onstrates the reverse complements ofthe mouse Kchain in RNA

(16, 23). Two filtrations are superimposed. The small dots showperfect complementary segments of three nucleotides; Xs in-dicate regions where at least 9 of 11 bases are complementary.

Fig. 3B illustrates how these regions of reverse complemen-tarity might pair in a folded molecule. Each pair ofcomplementsis projected onto the horizontal axis; a hypothesized structureis shown below it. The electron micrograph in Fig. 3C, similarto that in ref. 19, shows a typical structure of K chain mRNAhybridized to a DNA fragment containing the Kvariable regiongene. Because the cloned fragments do not contain J or constantregion sequences, the unhybridized constant region projectsfrom the resulting R loop. The resolution of the electron mi-croscopic procedure used here does not allow confirmation ofthe specific structure illustrated but both the size and locationof the hairpin are consistent with the prediction.

Protein Coding Regions Inferred from the Graphic Matrix.An accessory filtration and compressed plot procedure gives ausefuil overview of sequences too long for examination in a 1:1matrix reduced to page size. Fig. 4A is a comparison of the se-quences of simian virus 40 (24) and polyoma virus (25), two dis-tandly related papovaviruses of more than 5000 base pairs each.Here the DNA sequences are compressed by a combination ofcollating five bases at a time for perfect match and by scalingthe plotted display to page size. Regions of interspecies ho-mology are distributed selectively along the genomes. Highestinterspecies sequence conservation is shown by the gene forcoat protein VP1. Note that an insertion or deletion betweenthe early and late regions of either one or both genomes has ledto a shift of the homology from the diagonal.The imperfect homology of the early region genes is also re-

flected in the computer-translated amino acid sequences, asdemonstrated in Fig. 4B. Nine graphic matrices comparing allframes of translation for the early regions of simian virus 40 andpolyoma virus are shown. One comparison (frame 3 of simianvirus 40 versus frame 3 of polyoma virus) shows strikinglygreater homology than do the eight other translation compari-sons or the direct nucleic acid (Fig. 4A). This region encodes

Genetics: Maizel and Lenk

Page 4: Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

7668 Genetics: Maizel and Lenk

1 2 3 4 5A 0

A III L

U)

Polyoma10 20 30 40 50i-L LL11LIii1iI1I1iiiIIIlIlII1ili.1 .-

10-2,,.,.., .'<4''C

201

30~.-iaAl- t: A- 8 K *t t.,0_ _

50 ".;.'--'C9y.j«

J T T 1 1 1 1. 1 III t

mV VP2

earl, RNA T

T T -

B BFrame 1

, , I

a.-4 -. .. . ;, .

! .. d...>.,,

A,..2 ag o A -

C

FIG. 3. Secondary structure of the transcript of the mouse K chainconstant region. The sequence is that of the appropriate J region ap-pended to the constant region sequence. (A) Reverse complement plotof this sequence. x, Nine of a block of 11 nucleotides are complemen-tary; -, three of three are complements. (B) Folded structure on thebasis of the self-complementarity regions noted in A. (C) Sketch andelectron micrograph of mouse K chain mRNA hybridized to a plasmidcontaining the variable region gene only.

the portion of the large tumor antigen acquired by splicingframe 2 into frame 3 to circumvent the small tumor antigen ter-mination codons. Another region of increased homology can befound in the first 100-200 amino acids of frame 2 versus frame2 corresponding to the small tumor antigen in an unfilteredcomparison (data not shown). Increased homology at the aminoacid level as opposed to nucleic acid level was noted previously(26).

Used in this manner, the graphic matrix alone reveals po-tential coding regions in two related sequences even if no otherdata are available. This relies on the simple hypothesis that an

enhanced homology in one set of frames implies evolutionaryconservation ofamino acid sequence. This is most likely to occurif real proteins exist for those frames of translation. The tech-nique could also aid in predicting the correct choices amongmany possible alternatives for coding regions in genes that haveintervening sequences.

DISCUSSION

The graphic matrix technique takes advantage of the sensitivediscrimination of the eye and mind in detecting patterns in nu-

cleic acid and protein sequences. Most of the deductions madefrom the sequences presented here can be discovered, but onlywith difficulty, either by aligning the sequence data manuallyor by performing a numerical analysis. Both approaches are

expedited by the graphic matrix. For example, several programs

c

0

0) C1

>0)Ec

0

(n

cog0)

E

Polyoma early regionFrame 2

.1i ,,i ,. n .; , ' s - tis me. .. ; i. . . ............... ........ ^ . . ...t , . . . . .. ... , 7,,, . . __ . A - All A.),-- ....... -I.,-i - - - ' i, .'' a. 'S . ,. _'; 'z, , ',"- ...... i.,. , .. '' ,. - . -' , -:: _'-', ; 2 ' , ; .. 0 , ,-,<., Ad ;. ';'' -' ' 0 ': ' .- . ;''. _A. . . + @, i,, tsa . , t X -.. t. *Zb if + *IS; @,:9' .... .."8'i'>' ;' ';- _. ' ;J .- t; '_',,'; ' ;- j- ............................................... '' I'-;- tt ,,z, If ', ^ ' ,_';;' X'.-h.',, s ,; ;L- w . ' . ._ -*;; . '. is -: _t>,5-1^ _ As; 4 ', ,; ' * C ! :* ' .................................................... * ^ a *t-*-*ii;, Or, I;_ .'z'-,'s's; .;,-s.s-.;,';ows-,,'J'x, s*e'*Fi,', i; -I'm ;b'' An.-',.'-a ......... 4 A-\,",: -. a . qW'v, s: . . ; .> . ; :,: . - { . . r ¢. . ^ . t '.> .,; ,,.:..<; 9;, ...................... iz ';t: ...... ,.: . -.' .4\+* . ......... \ . i 0 { _. 4 i . t ti .... .. , i. . . - . - ;- k , .. . - . - .. . . . .- ....... -.; . ,...E * 5 i .;-.i . S - .

_ *'".' 'a z'' 0' J' , if;' '' '-.,4, < Ma .v - ;, a.. .. . : > _

_; i,, ,,ij" :'t ,' t:2(- ,, Be, 9, :NX_n N> s . , A; * l j., in ;0, anro Frame 3

FIG. 4. Compressed graphic matrix of comparison of polyoma vi-rus with simian virus 40 (SV40) sequences. (A) Entire nucleic acid se-

quences comprising 5226 x 5160 = 26,966,160 individual elements.A dot indicates that five bases in a row are homologous. The early andlate regions of the simian virus 40 sequence were rearranged from thepublished form to bring both sequences into the same sense and ori-entation. Features of the genome indicated along the margin: t, smalltumor antigen; mT1 and mT2, the two portions of the genome which,when spliced, encode the middle-size tumor antigen of polyoma virus;T1 and T2, portions of large tumor antigen RNA, the primary tran-script of the early region of the genomes; VP1, VP2, and VP3, the threevirus coat proteins; 0, origin of replication. (B) The early regions ofthe two sequences compared after translating each into its three read-ing frames and concatenating them (frame 1 starts with the first baseof each sequence, frame 2 with the second, and frame 3 with the third).This results in nine possible pairwise comparisons.

exist for predicting the secondary structure of RNA but theyeither require large amounts ofcomputer time or take uncertainshortcuts (18, 26, 27). The graphic matrix can be used to guidenumerical programs by eliminating unfavorable regions andthereby shortening calculation times. Existing numerical meth-ods to produce "best" alignments of homologous sequences do

0

A

1 -

2-

3-

4-

5-

I .,sx*.<n_ _ 41': ''' '. ': ::: : s ''

0 4- j- -- I... ''. 5'. 4.::.F7-F7.7 112 .,. m. '. -::, -- 11 !:. A-:. 4. -

it

t

L.I-

.1'I

A

~tI

.'

Proc. Natl. Acad. Sci. USA 78 (1981)

I, :.: - ... .w. M.-..a,-.l .,I!:, .4 ..:t ..'" .1,'- !PI P.: -. -. :"I .,.. . -."-

WT1J mTI1

Page 5: Enhanced matrix analysis sequences - Proceedings of the ... · PDF filelation ofnucleic acid sequences into all possible amino acid se-quences followed by graphic matrix analysis provides

Proc. Natl. Acad. Sci. USA 78 (1981) 7669

not deal with transposed regions within sequences, as exem-plified in the K gene comparisons (Fig. 2), but the graphic matrixtechnique reveals them readily.

Another application, not shown here, utilizes pairwise anal-yses on transparent sheets overlaid to reveal similarities in pat-terns of repeats or reverse complements. Related patterns ofrepeats or reverse complements can occur in sequences that areotherwise dissimilar, suggesting that general features ratherthan specific sequences may be recognized by certain processes(28).The graphic matrix method has proved to be ofgreat practical

value in ordering fragments of DNA in the sequence deter-mination process (15). Fragments are compared in forward andreverse complement forms and overlapping fragmentary se-quences are arranged in order by generating the comparisonmatrices and observing the overlaps in the form ofhomologousregions between ends ofthe fragments. Direct visual inspectiondeals successfully with imperfectly repeated sequences whichmay confuse programs designed to order fragments automati-cally (5, 8). Comparison with a previously known nucleic acidor amino acid sequence, even if from a different species, canaid in ordering the fragments.Due to its simplicity and versatility, the graphic matrix can

be adapted with ease to reveal various properties and patternsin biological sequences.

We gratefully acknowledge the expert assistance of Terri Broderickin preparation of this manuscript. We thank Jon Seidman for providingsequences and electron micrographs and encouraging the developmentof these programs, M. Waterman (Los Alamos Scientific Laboratories)and B. Bush and C. Leventhal (Columbia University) for introductionto the literature on matrix methods, Margery Sullivan and BarbaraNorman for technical assistance, and the Stanford Molgen and the Na-tional Biomedical Research Foundation data base projects.

1. Maxam, A. M. & Gilbert, W. (1977) Proc. Natl Acad. Sci. USA74, 560-564.

2. Sanger, F., Nicklen, S. & Coulson, A. R. (1977) Proc. NatL Acad.Sci. USA 74, 5463-5467.

3. Staden, R. (1977) Nucleic Acids Res. 4, 4037-4051.4. Staden, R. (1978) Nucleic Acids Res. 5, 1013-1015.5. Staden, R. (1979) Nucleic Acids Res. 6, 2601-2610.6. Korn, L. J., Queen, C. L. & Wegman, M. N. (1977) Proc. Natl

Acad. Sci. USA 74, 4401-4405.7. Queen, C. L. & Korn, L. J. (1980) Methods Enzymol 65,

595-609.8. Gingeras, T. R. & Roberts, R. J. (1980) Science 209, 1322-1328.9. Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976) Adv.

Math. 20, 367-387.10. Tinoco, I., Jr., Uhlenbeck, 0. C. & Levine, M. D. (1971) Nature

(London) 230, 362-367.11. Gibbs, A. J. & McIntyre, G. A. (1970) Eur. J. Biochem. 16, 1-11.12. Fitch, W. M. (1969) Biochem. Genet. 3, 99-108.13. McLachlan, A. D. (1971)]. MoL Biowl 61, 409-424.14. Konkel, D. A., Maizel, J. V., Jr., & Leder, P. (1979) Cell 18,

865-873.15. Hieter, P. A., Max, E. E., Seidman, J. G., Maizel, J. V., Jr., &

Leder, P. (1980) Cell 22, 197-202.16. Max, E. E., Maizel, J. V., Jr., & Leder, P. (1981)J. Biol Chem.

256, 5116-5120.17. Nussinov, R. & Jacobson, A. B. (1980) Proc. Natl Acad. Sci. USA

77, 6309-6313.18. Zuker, M. & Stiegler, P. (1981) Nucleic Acids Res. 9, 133-148.19. Seidman, J. G., Leder, A., Edgell, M. H., Polsky, F., Tilghman,

S. M., Tiemeier, D. C. & Leder, P. (1978) Proc. Natl Acad. Sci.USA 75, 3881-3885.

20. Nishioka, Y. & Leder, P. (1980) J. Biol Chem. 255, 3691-3694.21. Seidman, J. G., Max, E. E. & Leder, P. (1979) Nature (London)

280, 370-375.22. Seidman, J. G., Leder, A., Nau, M., Norman, B. & Leder, P.

(1978) Science 202, 11-15.23. Hamlyn, P. H., Brownlee, G. G., Cheng, C., Gait, M. J. & Mil-

stein, C. (1978) Cell 15, 1067-1075.24. Reddy, V. B., Thimmappaya, B., Dhar, R., Subramanian, K. N.,

Zain, B. S., Pan, J., Ghosh, P. K., Celma, M. L. & Weissman,S. M. (1978) Science 200, 494-502.

25. Soeda, E., Arrand, J., Smolar, N., Walsh, J. & Griffin, B. (1980)Nature (London) 283, 445-453.

26. Friedman, T., Doolittle, R. F. & Walter, G. (1978) Nature (Lon-don) 274, 291-293.

27. Sellers, P. H. (1974) SIAM J. Appl Math. 26, 787-793.28. Studnicka, G. M., Eisenling, F. A. & Lake, J. A. (1981) Nucleic

Acids Res. 9, 1885-1904.

Genetics: Maizel and Lenk