The Increasing Dot Plot and Arc Diagrams

Arc Diagrams and the Increasing Dot Plot: Visualizing Repetitions in Genomic Sequences

MXMLLN

Abstract

This paper introduces the Increasing Dot Plot, an alternative way of implementing the technique from Martin Wattenberg's Arc Diagrams: Visualizing Structure in Strings [ARC02]. The technique handles significantly larger sequences than the suffix tree approach, while sacrificing only an uncommon use case that is more relevant to music visualization. The two techniques are compared, and the Increasing Dot Plot is used to visualize a diverse set of genomes, chromosomes, genes, and proteins.

Keywords: visualization, text visualization, dot plot, bioinformatics, computational biology

Introduction and Previous Work

Wattenberg [ARC02] introduces an approach that has become a standard way to visualize repetitions in strings. Although the paper does apply the technique to DNA sequences, the author suggests that point mutations make the approach ill-suited for this application. Nevertheless, a number of publications explore repetitions in genomic sequences, including [SZC08] and Miropeats (1995). This conflicting evidence suggests that further exploration of this space is needed.

Wattenberg implements Arc Diagrams using a suffix tree. However, this approach may not be well suited to DNA sequences, where an individual chromosome can contain hundreds of millions of base pairs (Mb < n < Gb). Nevertheless, the suffix tree's robustness has made it the standard algorithm for identifying repetitions in Bioinformatics. As a result, much research has gone into finding more efficient ways of storing the strings at each suffix tree node so that full genomes can be handled [HUO07].

The Increasing Dot Plot

Despite the success of suffix trees, their implementation is far more complex than the main technique they replace: the Dot Plot. The Dot Plot can be used to visualize point similarities in a sequence by creating an n x n matrix, where n is the length of the input string, and comparing each character to every other character in the string. If two characters match, a 1 is stored in the cell; otherwise, a 0 is stored. The Dot Plot is usually visualized in black and white, for 1 and 0 respectively, and gives some impression of the similarity between different parts of the sequence. Although the technique is still used in practice, its actual utility is limited. Even so, the additional insight gained from the suffix tree does not justify its increased complexity over the Dot Plot. Thus, a new technique was created.
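As a point of reference, the binary Dot Plot described above can be sketched in a few lines. This is only an illustrative sketch, not the code used for this paper; the function name dot_plot and the sample string are assumptions made for the example.

```python
def dot_plot(seq):
    """Return the n x n binary Dot Plot of seq as a list of rows.

    Cell [x][y] is 1 when seq[x] == seq[y], otherwise 0.
    """
    n = len(seq)
    return [[1 if seq[x] == seq[y] else 0 for y in range(n)] for x in range(n)]


# Render matches as '#' and mismatches as '.', one matrix row per line.
for row in dot_plot("ACGAC"):
    print("".join("#" if cell else "." for cell in row))
```

Both the time and the memory used by this version grow with the square of the sequence length, which is the limitation discussed in the implementation section.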

Drawing inspiration from the Needleman-Wunsch dynamic programming algorithm for sequence alignment, a variation of the Dot Plot was created. Instead of simply entering a binary value at each matrix position, a simple function is used that draws on previous cell values (comparison(a) is the character comparison, where a is the binary result):

matrix[x, y] = 0                         if comparison = 0 (the characters differ)

matrix[x, y] = 1 + matrix[x-1, y-1]      if comparison = 1 (the characters match)

In other words, if two characters are equal, the comparison result is added to the previous entry on the diagonal; otherwise, a 0 is stored. Thus, in contrast to a binary matrix, the matrix holds a range of non-negative integer values, where longer repetitions produce longer runs of increasing numbers along a single diagonal. This Increasing Dot Plot technique can be visualized with a heat map. Figure 2 compares the Dot Plot with the Increasing Dot Plot variation using protein d1btea_ 7.7.1.4.1 Extracellular domain of the type II activin receptor {Mouse (Mus musculus)}. Notice that the heat map on the right, with the default parameters, reveals 4 different values, including the two visualized in the dot plot on the left. The distracting main diagonal has also been removed in the Increasing Dot Plot implementation.
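The recurrence can be illustrated with a full-matrix sketch. It assumes the main diagonal is simply left at zero, mirroring the removal mentioned above; the function name increasing_dot_plot and the test string are hypothetical and not taken from the original implementation.

```python
def increasing_dot_plot(seq):
    """Full-matrix Increasing Dot Plot (quadratic memory, illustration only).

    matrix[x][y] = 1 + matrix[x-1][y-1] when seq[x] == seq[y], else 0.
    The main diagonal (x == y) is left at 0 so it does not dominate the plot.
    """
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y or seq[x] != seq[y]:
                continue  # cell stays 0
            previous = matrix[x - 1][y - 1] if x > 0 and y > 0 else 0
            matrix[x][y] = 1 + previous
    return matrix


# "ABC" repeats at offsets 0 and 3, so one diagonal counts 1, 2, 3.
for row in increasing_dot_plot("ABCABC"):
    print(" ".join(str(cell) for cell in row))
```

The largest value on a diagonal equals the length of the repetition that ends there, which is what makes a heat map of the matrix more informative than the binary plot.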

Implementation: Dot Plot Space Limitations and Efficiencies

Dot Plots require O(n^2) time and space. Thus, without efficient memory management, Dot Plots are limited to several thousand characters [Matlab was not able to handle sequences of more than 20 KB of bases]. Thankfully, Increasing Dot Plots can drastically reduce their memory footprint to allow for sequences of over 60 MB, a 3,000-fold increase.

Storing only the top half of the matrix, a common trick in symmetric matrix operations, allows for a small gain. Since the top half of the matrix exactly mirrors the bottom half, only one needs to be computed. However, this method really only improves time efficiency, roughly halving the running time. The key to space efficiency is that, for repetitions, the final number in an increasing chain has all the information needed to create an arc in the final visualization. For a repetition of length 20, the values along the diagonal climb one step at a time up to 20, and subtracting the final number from the end index points to the starting index. As a result, an arc diagram can be created from an array storing start and end index pairs. As long as the arc array's size is insignificant compared to the time complexity, its additional memory requirements are negligible. This constraint can be encoded by setting a minimum repetition length, which also serves to remove noise and clean the data.
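The index arithmetic can be made concrete with a small sketch. The convention is an assumption of this example: indices are 0-based and the end indices refer to the last matching pair of characters, so one is added back after subtracting the run length. The name run_starts is hypothetical.

```python
def run_starts(end_x, end_y, length):
    """Recover where a repetition begins from where it ends.

    end_x and end_y are the 0-based, inclusive indices of the last matching
    pair on the diagonal; length is the final value of the increasing chain.
    """
    return end_x - length + 1, end_y - length + 1


# A chain that reaches 20 at cell (119, 519) started at cell (100, 500).
print(run_starts(119, 519, 20))
```

With an exclusive end index the "+ 1" disappears and the start is exactly the end index minus the final number, as worded above.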

To use only the repetition information effectively, all other data is discarded. The comparison function only requires information from the previous row, namely the value along the diagonal. When a match is found, the previous diagonal cell's value is added to the comparison result. When a match is not found and the previous diagonal cell is at least the minimum repetition size, its data is added to the arc array. In addition, the final cell in the row must also be checked for a repetition, since its chain can no longer continue. At this point, all the information from the previous row has been used, is no longer needed, and can be swapped with the current row to save space. Consequently, the Increasing Dot Plot really only uses 3 n-sized elements: two one-dimensional arrays and the original string. This results in O(n) [linear] space complexity.
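The row-swapping scheme just described might look roughly like the following. This is a sketch under stated assumptions rather than the Processing code used for the paper: the name find_arcs, the 0-based indexing, and the (start_a, start_b, length) tuple layout are choices made for the example, and only the upper half of the matrix is visited.

```python
def find_arcs(seq, min_len=3):
    """Collect repetitions using two rows instead of an n x n matrix.

    Returns (start_a, start_b, length) tuples with 0-based start indices.
    """
    n = len(seq)
    prev = [0] * n  # row x - 1 of the Increasing Dot Plot
    cur = [0] * n   # row x, overwritten on every pass
    arcs = []
    for x in range(n - 1):
        for y in range(x + 1, n):        # upper half only
            if seq[x] == seq[y]:
                cur[y] = 1 + prev[y - 1]  # extend the chain on the diagonal
            else:
                cur[y] = 0
                run = prev[y - 1]         # the chain ended one cell earlier
                if run >= min_len:
                    arcs.append((x - run, y - run, run))
        # A chain that reaches the last column cannot continue any further.
        run = cur[n - 1]
        if run >= min_len:
            arcs.append((x - run + 1, n - run, run))
        prev, cur = cur, prev             # reuse the old row as scratch space
    return arcs


print(find_arcs("ABCABC"))    # [(0, 3, 3)]
print(find_arcs("ABCXABCY"))  # [(0, 4, 3)]
```

Apart from the arc list, memory use is two integer arrays of length n plus the input string, matching the three n-sized elements counted above. The running time remains quadratic.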

Other Implementation Details

The Increasing Dot Plot was implemented in Processing, chosen for its rapid prototyping strengths, with available memory increased to 1 GB. The program was tested on a low-end laptop (2.13 GHz processor with 4 GB of RAM). Arc diagrams were drawn to the specifications outlined in [ARC02], though red was used as the arc fill color to differentiate the results. Unlike

Wattenberg, sequence characters are shown if n

Results: Biological Arc Diagrams

The original plan was to survey genomes of viruses and the biological kingdoms. Although the program was able to store a complete sequence of over 60 MB (chromosome 11 of Equus caballus: NC_009154.2), actually creating a diagram would take too long. For example, the complete virus genome of Murid Herpesvirus 1 (230,278 base pairs (bp)) takes approximately 8 minutes to run (Figure gi|21716071 above). Sequences of even 1 or 2 million characters already take many hours to complete, due to the O(n^2) time complexity. Thus, the original plan of sampling a large set of genomes was too ambitious. Instead, a small set of genomes, genes, and proteins is shown, where genome and gene inputs are mostly nucleotide bases and protein inputs are exclusively amino acids.

As with many bioinformatics programs, choosing an appropriate minimum repetition length parameter is extremely important. The length largely determines how many arcs the diagram will include. Another parameter was introduced to make sure a sufficient number of repetitions is found. If the minimum number of repetitions is not reached, the program decrements the minimum length by one and restarts. Smaller protein sequences (n < 1,000) are run with minimum:3, repetitions:0 (Sequence d1ush_2). For these amino acid sequences, the first pass often finds no repetitions and the minimum is automatically reduced to length 2, for which there are too many results. Consequently, proteins might not be a good target for this application. All larger sequences (n > 1,000) were run with minimum:20, repetitions:10. These parameters usually work, except in the case below of the fungus Encephalitozoon intestinalis ATCC 50506 chromosome I (NC_014415.1, 160,332 bp), where the program had to reduce the minimum down to length 15, after which 27 repetitions were found. Creating a function to compute the default parameters for various sequence lengths would have been useful, but was outside the scope of the project.
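The decrement-and-restart behaviour can be sketched as a small loop around any repetition finder. Everything named here is hypothetical: naive_repeats is a deliberately simple stand-in finder, not the Increasing Dot Plot, and the sketch assumes the count is compared with "at least", which the text above does not specify.

```python
def naive_repeats(seq, length):
    """Stand-in finder: all index pairs (i, j), i < j, whose substrings of
    the given length are equal. Used only to keep the example self-contained."""
    last = len(seq) - length + 1
    return [(i, j)
            for i in range(last)
            for j in range(i + 1, last)
            if seq[i:i + length] == seq[j:j + length]]


def relax_min_length(seq, finder, min_len, min_reps, floor=1):
    """Lower min_len one step at a time until finder returns at least
    min_reps repetitions, or until the floor is reached."""
    while True:
        arcs = finder(seq, min_len)
        if len(arcs) >= min_reps or min_len <= floor:
            return min_len, arcs
        min_len -= 1  # too few repetitions: relax the length and rerun


# No repetition of length 4 exists, so the search settles on length 3.
print(relax_min_length("ABCDABCE", naive_repeats, min_len=4, min_reps=1))
```

Because each restart repeats the full quadratic scan, every step down in minimum length adds a complete run to the total time for long sequences.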

Discussion: The Increasing Dot Plot vs. Arc Diagrams

As described in [ARC02], traditional Arc Diagrams do not visualize every pair of repetitions. The implementation and definitions given there were a bit confusing, but would probably amount to not showing the middle two arcs connecting the first and third sets of AABB, as well as the second and fourth sets (Sequence 01010101 below). Wattenberg describes making two passes through the suffix tree to get the final visualization. Making another pass through the arc array may eliminate this difference between the techniques.
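One possible form of such an extra pass is sketched below. It assumes the intended effect is to connect only consecutive occurrences of the same substring; that reading is an assumption of this sketch and not a restatement of the definitions in [ARC02]. The function name and the (start_a, start_b, length) tuple layout are likewise hypothetical.

```python
from collections import defaultdict


def consecutive_arcs(seq, arcs):
    """Keep only arcs between neighbouring occurrences of each substring.

    arcs holds (start_a, start_b, length) tuples with 0-based starts.
    """
    starts_by_motif = defaultdict(set)
    for start_a, start_b, length in arcs:
        motif = seq[start_a:start_a + length]
        starts_by_motif[motif].update((start_a, start_b))

    kept = []
    for motif, starts in starts_by_motif.items():
        ordered = sorted(starts)
        # Pair each occurrence with the next one only.
        for left, right in zip(ordered, ordered[1:]):
            kept.append((left, right, len(motif)))
    return sorted(kept)


# Three occurrences give three pairwise arcs; the pass drops the long one.
all_pairs = [(0, 4, 3), (0, 8, 3), (4, 8, 3)]
print(consecutive_arcs("llsXllsYlls", all_pairs))  # [(0, 4, 3), (4, 8, 3)]
```

Applied to the appendix listing, this would keep the lls arcs 53 to 111 and 111 to 322 and drop the arc from 53 to 322.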

The Shape of Song visualizations produced by Arc Diagrams with music scores as input are clearly different from those of biological sequences. First, natural strings do not have the layering seen in music, where a song has short consecutive repetitions as well as much larger ones. Biological text seems to have far fewer repetitions, and very few repetitions that recur across multiple pairs. The result is simply a less attractive, purely practical image. In addition, the much greater length of genomes may hide elements of the visualization: Murid Herpesvirus 1 (Figure gi|21716071, previously shown) has 51 repetitions, but only two larger arcs are visible. Additionally, variation in repetition length is mostly invisible for long sequences and large repetition lengths. Regardless, both pictures do present a starting point for analysis.

Conclusion

Arc Diagrams are very relevant to genomic sequences, contrary to what Wattenberg suggested. With respect to

the new technique introduced, Increasing Dot Plots are significantly more informative than standard Dot Plots. In

practice, hopefully Increasing Dot Plots will replace their predecessors altogether, except for a few small applications.

Nevertheless, Increasing Dot Plots are just one additional tool in the much larger Bioinformatics toolbox.

References:

[ARC02]: Wattenberg, M. (2002) Arc Diagrams: Visualizing Structure in Strings. Proceedings of the IEEE

Symposium on Information Visualization. IEEE Computer Society. (http://www.turbulence.org/works/Song)

[SZC08]: Szczesny, P., and A. Lupas. 2008. Domain annotation of trimeric autotransporter adhesins - daTAA. Bioinformatics 24:1251-1256.

[HUO07]: Huo, H., and V. Stojkovic. 2007. A Suffix Tree Construction Algorithm for DNA Sequences. IEEE 7th International Symposium on BioInformatics & BioEngineering, Harvard School of Medicine, Boston, MA, October 14-17, Vol. II, pp. 1178-1182.

Appendix: Example arc array reference file (minimum repetitions: 3) with the triple repetition lls

d1ush_2 4.145.1.2.1 (26-362) 5'-nucleotidase (syn. UDP-sugar hydrolase), N-terminal domain {Escherichia coli}

Bases: 337

Repetitions: 12

tvl(3): 9, 98

eyg(3): 24, 27

aae(3): 45, 234

lls(3): 53, 111

lls(3): 53, 322

ign(3): 88, 152

lls(3): 111, 322

lfk(3): 125, 131

efr(3): 162, 275

kpd(3): 182, 250

nge(3): 197, 278

aen(3): 235, 314
