Alignment-free sequence comparison
Analysis of Biological Sequences 140.638
Why not just align the sequences?
• Alignment scoring can be arbitrary
• current alignment algorithms are not scalable: sequence alignment on a large scale is tedious and slow (especially for short-read sequencing)
• sequences may not align to each other well enough to give recognizable distances (gaps etc.)
• alignment algorithms generally assume collinear sequences
• below the “twilight zone” of 60-65% identity (nucleotide) or 20-35% identity (protein), alignments are not accurate
• memory- and time-consuming (prohibitive for multiple genomes)
• algorithms make implicit assumptions about the evolutionary trajectories of the sequences being compared
resolution-free sequence comparison methods
• word counting/composition comparison
• Universal sequence maps (CGR)
• Kolmogorov complexity
• Complete composition vectors
word-based distance
word-based distances
• word size 2-6 works well for protein comparisons
• 8-10 nt words are useful for DNA or RNA
• long words (~25 nt) can distinguish very closely related bacterial species in metagenomics applications
word-based approach
word-based distances
• determine relative frequency of each word:
O_ij = number of times word j appears in sequence i
f_ij = relative frequency of word j in sequence i (O_ij divided by the total number of words counted in sequence i)
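The counting step can be sketched in a few lines of Python. This is a minimal illustration, assuming overlapping words (a sliding window of width k); the function names `word_counts` and `word_frequencies` are made up for this example and do not come from the slides.

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq (the O_ij values)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def word_frequencies(seq, k):
    """Divide each count by the total number of words (the f_ij values)."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Example: the 3-letter words of a short toy sequence.
counts = word_counts("ACGACGT", 3)
freqs = word_frequencies("ACGACGT", 3)
```

Here "ACGACGT" contains five overlapping 3-letter words, and ACG occurs twice.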
word-based distances
comparing two sequences x and y:
used this method to compare mitochondrial DNA from primate species
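The distance formula from this slide did not survive extraction. One common choice for comparing two word-frequency vectors is the squared Euclidean distance; the sketch below assumes that choice, which the slide text does not confirm. The function names are hypothetical.

```python
from collections import Counter

def word_frequencies(seq, k):
    """Relative frequency of each overlapping word of length k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def squared_euclidean(x, y, k):
    """ASSUMED distance: sum over words of (f_x - f_y)^2."""
    fx, fy = word_frequencies(x, k), word_frequencies(y, k)
    words = set(fx) | set(fy)  # words missing from one sequence have frequency 0
    return sum((fx.get(w, 0.0) - fy.get(w, 0.0)) ** 2 for w in words)

same = squared_euclidean("ACGT", "ACGT", 2)
different = squared_euclidean("AAAA", "CCCC", 2)
```

Identical sequences give a distance of 0; two sequences with no words in common give the largest values.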
improved word-based methods
a single change in a word creates a new word -- biologically realistic?
instead use word neighborhoods, e.g. CATTATT, CATTATA, CATTAAT, ...
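A simple way to picture a word neighborhood is the set of words within one substitution of the original. The sketch below assumes exactly that (Hamming distance of at most 1 over the DNA alphabet); the neighborhoods used by a particular published score may be defined or weighted differently.

```python
def neighborhood(word, alphabet="ACGT"):
    """Return the word itself plus every word differing at exactly one position."""
    neighbors = {word}
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                neighbors.add(word[:i] + letter + word[i + 1:])
    return neighbors

hood = neighborhood("CATTATT")
```

For a 7-letter DNA word this gives 1 + 7 × 3 = 22 words, including CATTATA and CATTAAT from the example above.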
N2 similarity score
• defines a (potentially weighted) set of words that are the “neighborhood” of any word
• compute word neighborhood counts
• correct for inter-variable frequency (e.g. observations of CAAAA and AAAAA are strongly correlated)
• correct for word covariance
• normalize so that all word frequencies sum to one
N2 similarity score: distinguishing enhancers
D2 score
if XW is the count of word W in the sequence X,
D2 is Poisson distributed
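The formula itself is missing from the extracted slide. The usual definition of D2, assumed here, is the sum over all words W of X_W × Y_W, where X_W and Y_W are the counts of W in sequences X and Y. A minimal sketch with hypothetical function names:

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k):
    """D2 = sum over words W of X_W * Y_W (assumed standard definition)."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    # Words absent from either sequence contribute zero, so only shared words matter.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

shared = d2("ACGACG", "ACGT", 3)
none = d2("AAAA", "CCCC", 2)
```

In the first call the only shared word is ACG, seen twice in one sequence and once in the other.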
improvements to D2 score
D2S is normally distributed, D2* is the sum of independent normally distributed variables
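The slide does not give the formulas. The sketch below uses the commonly cited forms, stated here as assumptions: counts are centered as X~_W = X_W - n' p_W, with n' = n - k + 1 and p_W the product of letter probabilities, and then D2S = Σ X~_W Y~_W / sqrt(X~_W² + Y~_W²) and D2* = Σ X~_W Y~_W / (sqrt(n' m') p_W). Estimating letter probabilities from the two sequences pooled together is also an assumption of this example.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_variants(x, y, k, alphabet="ACGT"):
    """Return (D2S, D2*) under an i.i.d. letter model (ASSUMED formulas)."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    n_bar, m_bar = len(x) - k + 1, len(y) - k + 1
    letters = Counter(x + y)                      # pooled letter counts
    total = sum(letters[a] for a in alphabet)
    p = {a: letters[a] / total for a in alphabet}
    d2s = d2star = 0.0
    for tup in product(alphabet, repeat=k):       # all 4^k words, so keep k small
        w = "".join(tup)
        pw = prod(p[a] for a in w)
        if pw == 0.0:
            continue                              # word impossible under the model
        xt = cx[w] - n_bar * pw                   # centered counts
        yt = cy[w] - m_bar * pw
        d2star += xt * yt / (sqrt(n_bar * m_bar) * pw)
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2s += xt * yt / denom
    return d2s, d2star

d2s_value, d2star_value = d2_variants("AC", "AC", 2)
```

Because the loop visits every possible word, this is only practical for short words; real tools use sparse or hashed representations.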
metagenomics with 5-tuples
k-tuple scores and metagenomics
metagenomics with D2*
clustering of gut bacteria from foregut fermenters, hindgut fermenters, and carnivores
more metagenomics
speeding up kmer distances
CAFE workflow
source sequences can be whole genomes, contigs, or short reads
CAFE results
resolution-free sequence comparison methods
word counting/composition comparison
Complete composition vectors
Universal sequence maps (CGR)
Kolmogorov complexity
Universal sequence maps
Chaos theory / Chaos game representation
Iterative functions to represent biological sequences
Can be generalized to any order alphabet (thus “universal”)
Chaos game representation
Plot sequence in a square with vertices labeled A, C, T, G
1st nt is plotted halfway between the center of the square and the vertex labeled with that nt
Subsequent nts are plotted halfway between the previous dot and the vertex labeled with the new nt
-> 2D plot representing 1° DNA sequence for ANY length
patterns are usually fractal
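The plotting rule translates directly into a short loop. The sketch below assumes one particular corner layout (A bottom-left, C top-left, G top-right, T bottom-right) on the unit square; the slides do not fix the layout, and any assignment of the four letters to the four corners works the same way.

```python
# ASSUMED corner layout on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return one (x, y) point per nucleotide, starting from the square's center."""
    x, y = 0.5, 0.5
    points = []
    for nt in seq:
        cx, cy = CORNERS[nt]
        # Move halfway from the previous point toward the corner for this nucleotide.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

points = cgr_points("ACG")
```

With this layout the sequence ACG lands at (0.25, 0.25), then (0.125, 0.625), then (0.5625, 0.8125).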
Chaos game representation
[Figure: CGR square with labeled vertices; example sequence ACG]
Chaos game representation
Sierpiński Sieve
Chaos game representation
Genomic signature: dinucleotide & trinucleotide relative abundance profiles distinguish between organisms and sequence segments, and can be used in phylogenetic analysis
We see less variation of CGR along genomes than between genomes -> related to genomic signature?
Chaos game representation
What determines the pattern in a CGR?
Short nucleotide frequencies don’t solely determine the pattern
For a DNA sequence, one can construct a simulated sequence with the same length and nucleotide composition
IF CGRs are the same, then nucleotide and dinucleotide frequencies are all that’s important
making CGR computable
• Hatje & Kollmar divide the CGR grid to outline short oligos & then get frequencies of those oligos
• Almeida et al proved that the length of the common prefix between two CGR is the dissimilarity distance
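The grid idea can be illustrated as follows: if the CGR square is cut into 2^k × 2^k cells, every point that has at least k nucleotides behind it falls into the cell belonging to its last k nucleotides, so counting points per cell gives oligo counts. This is a sketch of the general principle with an assumed corner layout and a hypothetical function name, not a reproduction of any published implementation.

```python
from collections import Counter

# ASSUMED corner layout on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_grid_counts(seq, k):
    """Count CGR points in each cell of a 2^k x 2^k grid (one cell per k-mer)."""
    size = 2 ** k
    x, y = 0.5, 0.5
    cells = Counter()
    for i, nt in enumerate(seq):
        cx, cy = CORNERS[nt]
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= k - 1:  # only points preceded by a full k-mer identify a cell
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            cells[(row, col)] += 1
    return cells

cells = cgr_grid_counts("ACGACGT", 3)
```

For "ACGACGT" and k = 3 the two ACG occurrences share one cell and the other three words each get their own.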
3D Chaos game representation (HPV)
computing feature vectors from 3D CGR (HPV)
Kolmogorov complexity
K(x) is the length of the shortest binary program that can compute the string x on a binary computer
K(x|y) is the length of the shortest binary program that can compute the string x, given the information from y
NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)] is the normalized information distance (0 ≤ NID ≤ 1)
It can be shown that NID(x,y) can express all other distances between x and y!
... Not computable though :(
Kolmogorov complexity
NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)]
NCD = normalized compression distance, approximation to NID
NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]
Where C(x) is the compressibility of the string x, C(xy) is the compressibility of the string x concatenated to y etc. Can just use gzip!
x: AAAAAAA
y: ACGAATA
xy: AAAAAAAACGAATA
yx: ACGAATAAAAAAAA
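The NCD formula can be tried out with Python's standard `gzip` module. The sketch below takes C(·) to be the compressed size in bytes, which is an assumption about what the slides mean by "compressibility". The test sequences are random DNA generated only for illustration, because very short strings like the 7-letter example are dominated by gzip's fixed header overhead.

```python
import gzip
import random

def compressed_size(data: bytes) -> int:
    """C(x): number of bytes after gzip compression (ASSUMED meaning of C)."""
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    """NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)]) / max[C(x),C(y)]."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = min(compressed_size(bx + by), compressed_size(by + bx))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Illustrative data only: two unrelated random sequences.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000))
b = "".join(rng.choice("ACGT") for _ in range(2000))
self_distance = ncd(a, a)    # a sequence against itself
cross_distance = ncd(a, b)   # two unrelated sequences
```

A sequence compared with itself should score much lower than two unrelated sequences, since the second copy compresses almost for free.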
>seq1TAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTACATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTACTAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTA
CATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTAC
>seq2ATTTATAGAGAAGCCAGTGTTAAGCCGTACTTAAGGTTCACATTTGTAATGAAATAGGTAACTGGGCCTCCACAAGTTCCATGGGAATCGCAGACTAACCATTTGGTTTTCCTCTGCCTCATTTTCTCCTCCTCCTCCTGCTCCTCCTCTTCCTCCTCCCCTCTCTTTAGCATCCTCCTCCTCCTTCTTCTTCTACATCCTCCTTTTCCTCTTCCTCCTCCATCTTCTCCTCTCCTTCTCCTCTTCCTCCCCTTCTTCATCTATTCATTCTTCCTTGAGCCTCCTGGCCCACTAGGGCCCTTCTATCTTGCATCACCTCTGCCCTCTCAAGGCATGCAATATCCTGTATCTCATTCTTCCTTTAGTTCAGCTGCCTTCTCTTCACATGGTGGTCTATCTTGGGCTGTCTGCTCAGACCACATCTCACCCAATTTCCTTGCTACATTCCCAGTGGACAAGCCCGGTGATTCACTCTTGATCTTTGGACAATATTCAGAATGAAGCAGGAAGAAAGCAAGCGGTAGTCTTTTGTGAGTACCTAAGTCTTCATTTTTCTTCAGGTCCTTTCTTATTGCCTTTAAGAGGAACATAATTCTTCATCAGCTATCATAGCCTCAGAGCAAGCCTTGTCACTTGGAGCTGTATCTTCAGGTTTCACCTTTTCCTTTGTAGGCATGAAGGTCCTCTCCAAGAACTCAGCAAAGCTGACTGGACCCAGGCATTTCTTTCTGTTCTCCTGGAAGTCTGCAGGAAGACAGCTCCTGGGCCTTTTCTTCCTCCAGCCAACCCAGTCTCCTTCACCCAAGGTGACCCATGGCGTGCGGGGAGAAGGGGGGCTCTATCTGAGTGGGCTTTTTCCTGAGTCCAAACCAGATGCTTCCTTCTCCATACGATTGTCAGCTGGCTTCACTTTTCATATTATTTTAAGCTTTAATTATTTTTCTCTCCTTGCAGAGCAACAATTGTGGTAATAAAACCAGATACCAACTCTTATCTCAGGTTAGTAATAAAGTTGTTGCCTACTATCTAGAAATGTACCTGCCTTTTCTTTTTTCTTTCCTTTTCTCTTTCCTTTCCTTTCCTTTCCTTTCCTTTTCGTTTCTTTTCTTTTCTGTAAAATGTGGCAATTTACAGGTTGGGATGTATCACCGTTGGTGGAGTGTTTACCTAGCTAGTATATACAAAGCCCTTGTTTAAATTCCTAGCACTGGGTAGGTATGGTGACTCGTGTCTGTAATCTCAGAACTCTACAGGTAGACATGTGGGAAGCAGAAATTCATCCTCAGCATACAGTGAGTTTCAAGTTGGCCTGACCCAGAAGAACTCAGGGGAAAAAAGCTGATGTCTTTTCTCTCTCTCTCTCTCTCTTTCTCTCTCTCTCTCTCTCTCTCTCTCTTTCTCTCTTTCATAATTCTTTTGGTAGAGAGAAGGAAAGAGATGAACATGTATTAAGTTCCCTGGTATCTACCAAATTTGTGTATTACTTGTCGGTTAATATTATAACAAACATTAAATTGTATTCAGAACCATATTTTGATTATTATCTTTGTGTGCTTTGGATCTCACGACAGTAATAGTTACCTGAGGTGCTTAACTACCGTTTCTGTGACAGTAAATTATTTAAGTTTACTCTCTCCCTCTACAGCCCAACAGTGTGTAGTTTGTATGGTTCATTTGTTGTTGGCTTGTTGTTATTGATGTTGTTTGTGTTGCTGATGCTAGAGTCTGGGGCCTTGGACATATTCGGCAGGCAAATGCTCCACCACTGAGCCTCCAGCCACTTTGCTGGAGGTTTTTGTAGCTGTAGATTGTAATGAAGAAGTTTTTCATCTTTTATATTTGAAAAAGATACCACGGCACGATACACAGCTACAACCAATGCACTAAGATAAATAACCAACCCAACAGAGTGACATTATGATGCAGTAGTTGTAAGAATCAATTTAAAAGATATATCACTTCATCCTTGGGTTTGCCTATGTTCTCATCTGTGAG
ATTTAAAATCTTTTGAAACATTGAATGAAGCCTCTCATCTATCATCAACTGCCATTAAATATCACATATTCACAGCTGGAGAAATGGACCAGCCGACATCCGGAA
sequence comparison by gzip
uncompressed files:
bytes  filename
2130   seq1
2130   seq2
4260   seq1.seq1
4260   seq2.seq2
4260   seq1.seq2
4260   seq2.seq1
gzip-compressed files:
bytes  filename
422    seq1.gz
735    seq2.gz
443    seq1.seq1.gz
766    seq2.seq2.gz
1104   seq1.seq2.gz
1088   seq2.seq1.gz
Here C(x) and C(y) are taken from the self-concatenated files seq1.seq1 and seq2.seq2, so every file in the calculation has the same uncompressed size (4260 bytes):
NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]
= (min(1104/4260, 1088/4260) - min(766/4260, 443/4260))/max(766/4260, 443/4260)
= (1088/4260 - 443/4260) / (766/4260)
= 0.84
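The arithmetic can be checked with a few lines of Python. The byte counts are the ones listed for the gzip example; the variable names are made up for this check, and C(x), C(y) are read from the self-concatenated files as in the calculation shown.

```python
UNCOMPRESSED = 4260  # bytes in each concatenated file

# gzip-compressed sizes in bytes, as listed in the example.
compressed = {
    "seq1.seq1": 443,
    "seq2.seq2": 766,
    "seq1.seq2": 1104,
    "seq2.seq1": 1088,
}

# Compression ratio for each file.
c = {name: size / UNCOMPRESSED for name, size in compressed.items()}

numerator = min(c["seq1.seq2"], c["seq2.seq1"]) - min(c["seq1.seq1"], c["seq2.seq2"])
ncd_value = numerator / max(c["seq1.seq1"], c["seq2.seq2"])
print(round(ncd_value, 2))
```

The common factor of 4260 cancels, so the result is simply (1088 - 443) / 766.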