The Increasing Dot Plot and Arc Diagrams

  • Arc Diagrams and the Increasing Dot Plot: Visualizing Repetitions in Genomic Sequences

    MXMLLN

    Abstract

    The Increasing Dot Plot is introduced as an alternative implementation method for Martin Wattenberg's
    "Arc Diagrams: Visualizing Structure in Strings" [ARC02]. The technique can handle significantly larger
    sequences than the suffix tree approach, while sacrificing only an uncommon use case that is more relevant to
    music visualizations. The two techniques are compared, and the Increasing Dot Plot is used to visualize a
    diverse set of genomes, chromosomes, genes, and proteins.

    Keywords: visualization, text visualization, dot plot, bioinformatics, computational biology

    Introduction and Previous Work

    Wattenberg [ARC02] introduces an approach that has standardized how repetitions in strings are visualized.
    Although the paper does apply the technique to DNA sequences, Wattenberg suggests that point mutations make
    the approach ill-suited for this application. Nevertheless, a number of publications explore repetitions in
    genomic sequences, including [SZC08] and Micropeats (1995). This conflicting evidence suggests that further
    exploration of this space is needed.

    Wattenberg implements Arc Diagrams using a suffix tree. However, this approach may not be well suited to
    DNA sequences, whose individual chromosomes can contain hundreds of millions of base pairs (Mb < n < Gb).
    Nevertheless, its robustness has made it the standard algorithm for identifying repetitions in bioinformatics.
    As a result, much research has gone into finding more efficient ways of storing the strings at each suffix
    tree node in order to handle full genomes [HUO07].

  • The Increasing Dot Plot

    Despite the success of suffix trees, their implementation is far more complex than the main technique they
    are replacing: the Dot Plot. The Dot Plot visualizes point similarities in a sequence by creating an n x n
    matrix, where n is the length of the input string, and comparing each character to every other character in
    the string. If two characters match, a 1 is stored in the cell; otherwise, a 0 is stored. The Dot Plot is
    usually rendered in black and white, for 1 and 0 respectively, and gives some impression of the similarity
    between different parts of the sequence. Although the technique is still used in practice, its actual utility
    is very small. Even so, the gains in insight from the suffix tree do not justify its increased complexity
    over the Dot Plot. Thus, a new technique was created.
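    The standard Dot Plot described above can be sketched in a few lines of Python. This is only an
    illustration, not the paper's code: the function name dot_plot and the sample string are made up for the
    example.

```python
def dot_plot(seq: str) -> list[list[int]]:
    """Build the n x n binary matrix: 1 where seq[x] == seq[y], otherwise 0."""
    n = len(seq)
    return [[1 if seq[x] == seq[y] else 0 for y in range(n)] for x in range(n)]


# Hypothetical sample input; '#' marks a match and '.' a mismatch.
for row in dot_plot("GATTA"):
    print("".join("#" if cell else "." for cell in row))
```

    Every cell is stored, so memory grows with the square of the sequence length.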

    Taking inspiration from the Needleman-Wunsch dynamic programming sequence alignment algorithm, a variation
    of the Dot Plot was created. Instead of simply entering a binary number at each matrix position, a simple
    function is used that draws on previous cell values (comparison is the binary result of comparing the
    characters at positions x and y):

    matrix[x, y] = 0                         if comparison = 0
    matrix[x, y] = 1 + matrix[x-1, y-1]      if comparison = 1

    In other words, if two characters are equal, the comparison result is added to the previous entry along the
    diagonal; otherwise a 0 is stored. Thus, in contrast to a binary matrix, the matrix holds a range of
    non-negative integer values, where longer repetitions produce larger consecutive numbers along a single
    diagonal. This Increasing Dot Plot can be visualized with a heat map. Figure 2 compares the Dot Plot with the
    Increasing Dot Plot variation using protein d1btea_ 7.7.1.4.1, Extracellular domain of the type II activin
    receptor {Mouse (Mus musculus)}. Notice that the heat map on the right, with the default parameters, reveals
    four different values, including the two visualized in the Dot Plot on the left. The distracting main
    diagonal has also been removed in the Increasing Dot Plot implementation.
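    As a minimal sketch of the recurrence, the full-matrix version might look like the following in Python. The
    function name is hypothetical, and zeroing the main diagonal is only one assumed way of "removing" it; the
    paper does not say how this was done.

```python
def increasing_dot_plot(seq: str) -> list[list[int]]:
    """Full-matrix Increasing Dot Plot: a match extends the value on the diagonal."""
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            # Assumption: the main diagonal (x == y) is simply left at 0.
            if x != y and seq[x] == seq[y]:
                prev = matrix[x - 1][y - 1] if x > 0 and y > 0 else 0
                matrix[x][y] = 1 + prev
    return matrix


m = increasing_dot_plot("ABCABC")
print(m[3][0], m[4][1], m[5][2])  # the chain along one diagonal for the repeated "ABC"
```

    The values along a diagonal count how many consecutive characters have matched so far, which is what a heat
    map of the matrix would display.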

  • Implementation: Dot Plot Space Limitations and Efficiencies

    Dot Plots require O(n^2) time and space. Thus, without efficient memory management, Dot Plots are limited to
    several thousand characters [Matlab was not able to handle sequences of more than 20 kb]. Thankfully,
    Increasing Dot Plots can drastically reduce their memory footprint to allow for sequences of over 60 Mb, a
    3,000-fold increase.
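    The figures quoted above can be checked with a little arithmetic. The sketch below counts stored elements
    only (not bytes) and assumes the three n-sized elements that this section goes on to describe; the variable
    names are illustrative.

```python
matlab_limit = 20_000       # bases, the full-matrix limit quoted in the text
target_length = 60_000_000  # bases, the sequence length reached with the reduced footprint

fold_increase = target_length // matlab_limit
full_matrix_cells_at_limit = matlab_limit ** 2      # n x n cells at 20,000 bases
full_matrix_cells_at_target = target_length ** 2    # n x n cells at 60,000,000 bases
linear_elements_at_target = 3 * target_length       # three n-sized elements

print(fold_increase)
print(full_matrix_cells_at_limit, linear_elements_at_target)
```

    Under these assumptions, the linear layout at 60 million bases stores fewer elements than the full matrix
    does at 20,000 bases, which is consistent with the reported jump in sequence length.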

    Storing only the top half of the matrix, a common trick in symmetric matrix operations, allows for a small
    gain. Since the top half of the matrix exactly mirrors the bottom half, only one needs to be computed.
    However, this method really only improves time efficiency, reducing it by a constant factor of 1/2. The key
    to space efficiency is that, for repetitions, the final number in an increasing chain carries all the
    information needed to create an arc in the final visualization. For a repetition of length 20, the diagonal
    increases from 0 to 20 in steps of one, and subtracting the final number from the end index points to the
    starting index. As a result, an arc diagram can be created from an array storing start and end index pairs.
    As long as the arc array stays small, its additional memory requirements are negligible. This constraint can
    be enforced by setting a minimum repetition length, which also serves to remove noise and clean the data.
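    The index arithmetic can be made concrete with a small sketch. It assumes 0-based indices and an inclusive
    end index, so the start is end - value + 1; with an exclusive end index the start would be end - value, as
    worded in the text. The function name and sample string are hypothetical.

```python
def repetition_span(end_index: int, chain_value: int) -> tuple[int, int]:
    """Return (start, end) of one copy of a repetition, both 0-based and inclusive."""
    return end_index - chain_value + 1, end_index


seq = "ABCxABCy"
# The second copy of "ABC" ends at index 6, where the diagonal chain reaches 3.
start, end = repetition_span(6, 3)
print(start, end, seq[start:end + 1])
```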

    To make effective use of only the repetition information, all other data is discarded. The comparison
    function only requires information from the previous row, namely the value along the diagonal. When a match
    is found, one plus the previous diagonal cell is stored. When a match is not found and the previous cell is
    at least the minimum repetition size, its data is added to the arc array. In addition, the final cell in each
    row must also be checked for a repetition, since its chain can no longer continue. At this point, all the
    information from the previous row has been used and is no longer needed, so its storage can be swapped with
    the current row to save space. Consequently, the Increasing Dot Plot really only uses three n-sized elements:
    two one-dimensional arrays and the original string. This results in O(n) [linear] space complexity.
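    The two-row procedure described above can be sketched as follows. This is an illustrative reconstruction in
    Python rather than the original Processing code: the name find_arcs, the (first_start, second_start, length)
    tuple layout, and the 0-based inclusive indexing are all assumptions. Only the upper half of the matrix
    (y > x) is visited.

```python
def find_arcs(seq: str, min_len: int) -> list[tuple[int, int, int]]:
    """Collect (first_start, second_start, length) for repetitions of at least min_len."""
    n = len(seq)
    prev = [0] * n  # chain values for row x - 1
    curr = [0] * n  # chain values for row x
    arcs = []

    def emit(x: int, y: int, k: int) -> None:
        # A chain of value k ending at cell (x, y) covers seq[x-k+1 .. x] and seq[y-k+1 .. y].
        arcs.append((x - k + 1, y - k + 1, k))

    for x in range(n):
        for y in range(x + 1, n):
            diag = prev[y - 1]  # value at cell (x - 1, y - 1)
            if seq[x] == seq[y]:
                curr[y] = 1 + diag
            else:
                curr[y] = 0
                if diag >= min_len:  # the chain on this diagonal ended in the previous row
                    emit(x - 1, y - 1, diag)
        # The last cell of a row cannot be extended, so check it immediately.
        if x + 1 < n and curr[n - 1] >= min_len:
            emit(x, n - 1, curr[n - 1])
        prev, curr = curr, prev  # reuse the old row's storage for the next row
    return arcs


print(find_arcs("ABCxABCy", 3))
```

    Apart from the arc list, the sketch keeps only the two row arrays and the input string, which matches the
    three n-sized elements counted above. Running time is still quadratic in the sequence length.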

    Et Cetera: Implementation Details

    The Increasing Dot Plot was implemented in Processing, chosen for its rapid prototyping strengths, with its
    memory limit increased to 1 GB. The program was tested on a low-end laptop (2.13 GHz processor with 4 GB of
    RAM). Arc diagrams were drawn to the specifications outlined in [ARC02], though red was used as the arc fill
    color to differentiate the results. Unlike

    Wattenberg, sequence characters are shown if n

  • Results: Biological Arc Diagrams

    The original plan was to survey genomes of viruses and the biological kingdoms. Although the program was
    able to completely store a sequence of over 60 Mb (chromosome 11 of Equus caballus: NC_009154.2), actually
    creating a diagram would take too long. For example, the complete virus genome of Murid herpesvirus 1
    (230,278 base pairs (bp)) takes approximately 8 minutes to run (Figure gi|21716071 above). Sequences of even
    1 or 2 million characters already take many hours to complete, due to the O(n^2) time complexity. Thus, the
    original plan of sampling a large set of genomes was too ambitious. Instead, a small set of genomes, genes,
    and proteins is shown, where genome and gene inputs are mostly nucleotide bases and protein inputs are
    exclusively amino acids.

  • As with many bioinformatics programs, choosing an appropriate minimum repetition length parameter is
    extremely important. The length largely determines how many arcs the diagram will include. Another
    parameter, the minimum number of repetitions, was introduced to make sure a sufficient number of repetitions
    is found. If the minimum number of repetitions is not reached, the program decrements the minimum length by
    one and restarts. Smaller protein sequences (n < 1,000) are run with minimum:3, repetitions:0 (Sequence
    d1ush_2). These amino acid sequences often yield no repetitions on the first pass and are automatically
    reduced to length 2, for which there are too many results. Consequently, proteins might not be a good target
    for this application. All larger sequences (n > 1,000) were run with minimum:20, repetitions:10. These
    parameters usually work, except in the case below of the fungus Encephalitozoon intestinalis ATCC 50506
    chromosome I (NC_014415.1, 160,332 bp), where the program had to reduce the minimum down to length 15, after
    which 27 repetitions were found. Creating a function to compute the default parameters for various sequence
    lengths would have been useful, but was outside the scope of the project.
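    The decrement-and-restart behaviour can be sketched as a small loop around any repetition finder. Everything
    named here is hypothetical: naive_repeats is a brute-force stand-in used only to keep the example
    self-contained, and the floor of 1 on the minimum length is an assumption the paper does not state.

```python
from typing import Callable

Arc = tuple[int, int, int]  # (first_start, second_start, length), 0-based


def naive_repeats(seq: str, min_len: int) -> list[Arc]:
    """Brute-force stand-in finder: maximal matching runs between positions i < j."""
    n = len(seq)
    arcs = []
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i] != seq[j]:
                continue
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue  # this pair continues an earlier run, so it is not a run start
            k = 0
            while j + k < n and seq[i + k] == seq[j + k]:
                k += 1
            if k >= min_len:
                arcs.append((i, j, k))
    return arcs


def find_with_fallback(seq: str, min_len: int, min_count: int,
                       finder: Callable[[str, int], list[Arc]] = naive_repeats,
                       floor: int = 1) -> tuple[int, list[Arc]]:
    """Lower min_len by one and rerun until at least min_count repetitions are found."""
    while True:
        arcs = finder(seq, min_len)
        if len(arcs) >= min_count or min_len <= floor:
            return min_len, arcs
        min_len -= 1


print(find_with_fallback("ABCxABCy", min_len=5, min_count=1))
```

    Each restart repeats the whole quadratic search, so a starting minimum that is far too high multiplies the
    running time.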

    Discussion: The Increasing Dot Plot vs. Arc Diagrams

    As described in [ARC02], traditional Arc Diagrams do not visualize every pair of repetitions. The
    implementation and definitions there were somewhat confusing, but would probably amount to not showing the
    middle two arcs connecting the first and third sets of AABB, as well as the second and fourth sets (Sequence
    01010101 below). Wattenberg describes making two passes through the suffix tree to get the final
    visualization. Making another pass through the arc array may eliminate this difference between the
    techniques.

  • The Shape of Song visualizations produced by Arc Diagrams with music scores as input are clearly different
    from those of biological sequences. Firstly, natural strings do not have the layering seen in music, where a
    song has short consecutive repetitions as well as much larger repetitions. Biological text seems to have far
    fewer repetitions, and very few repetitions shared among multiple pairs. What results is simply a less
    attractive, purely practical image. In addition, the significantly greater length of genomes may hide
    elements of the visualization: Murid herpesvirus 1 (Figure gi|21716071, previously shown) has 51
    repetitions, but only two larger arcs are visible. Additionally, variation in repetition length is mostly
    invisible for long sequences and large repetition lengths. Regardless, both pictures do present a starting
    point for analysis.

    Conclusion

    Arc Diagrams are very relevant to genomic sequences, contrary to what Wattenberg suggested. With respect to
    the new technique introduced, Increasing Dot Plots are significantly more informative than standard Dot
    Plots. Ideally, Increasing Dot Plots will in practice replace their predecessors altogether, except in a few
    small applications. Nevertheless, Increasing Dot Plots are just one additional tool in the much larger
    bioinformatics toolbox.

    References:

    [ARC02]: Wattenberg, M. 2002. Arc Diagrams: Visualizing Structure in Strings. Proceedings of the IEEE
    Symposium on Information Visualization. IEEE Computer Society. (http://www.turbulence.org/works/Song)

    [SZC08]: Szczesny, P., and A. Lupas. 2008. Domain annotation of trimeric autotransporter adhesins - daTAA.
    Bioinformatics 24:1251-1256.

    [HUO07]: Huo, H., and V. Stojkovic. 2007. A Suffix Tree Construction Algorithm for DNA Sequences. IEEE 7th
    International Symposium on BioInformatics & BioEngineering, Harvard School of Medicine, Boston, MA,
    October 14-17, Vol. II, pp. 1178-1182.

  • Appendix: Example arc array reference file (minimum repetitions: 3) with the triple repetition lls

    d1ush_2 4.145.1.2.1 (26-362) 5'-nucleotidase (syn. UDP-sugar hydrolase), N-terminal domain {Escherichia coli}

    Bases: 337

    Repetitions: 12

    tvl(3): 9, 98

    eyg(3): 24, 27

    aae(3): 45, 234

    lls(3): 53, 111

    lls(3): 53, 322

    ign(3): 88, 152

    lls(3): 111, 322

    lfk(3): 125, 131

    efr(3): 162, 275

    kpd(3): 182, 250

    nge(3): 197, 278

    aen(3): 235, 314
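    A reference file in this layout could be read back with a short parser. This sketch assumes each entry has
    the form motif(length): position, position, and that the two numbers are the positions of the two copies;
    the paper does not define the format, and the function names are hypothetical. The sample is taken from the
    listing above.

```python
import re
from collections import defaultdict

# Matches entries such as "lls(3): 53, 111"; header lines like "Bases: 337" do not match.
ENTRY = re.compile(r"^\s*(\w+)\((\d+)\):\s*(\d+),\s*(\d+)\s*$")


def parse_arc_lines(text: str) -> list[tuple[str, int, int, int]]:
    """Return (motif, length, first_position, second_position) for each matching line."""
    arcs = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            motif, length, first, second = match.groups()
            arcs.append((motif, int(length), int(first), int(second)))
    return arcs


def positions_by_motif(arcs: list[tuple[str, int, int, int]]) -> dict[str, list[int]]:
    """Group the distinct positions at which each motif occurs."""
    positions = defaultdict(set)
    for motif, _length, first, second in arcs:
        positions[motif].update((first, second))
    return {motif: sorted(found) for motif, found in positions.items()}


sample = """\
Bases: 337
lls(3): 53, 111
lls(3): 53, 322
lls(3): 111, 322
lfk(3): 125, 131
"""
print(positions_by_motif(parse_arc_lines(sample)))
```

    Grouping by motif makes the triple repetition visible: the three lls entries pair up the same three
    positions.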