Alignment-free sequence comparison Analysis of Biological Sequences 140.638

Source: biostat.jhsph.edu/bstcourse/bio638/notes/alignment_free.pdf

Why not just align the sequences?

• Alignment scoring can be arbitrary
• current alignment algorithms are not scalable: tedious and slow to do sequence alignment on a large scale (especially short read sequencing)
• sequences may not align to each other well enough to give recognizable distances (gaps etc.)
• alignment algorithms assume generally collinear sequences

Why not just align the sequences?

• below the “twilight zone” of 60-65% identity (nucleotide) or 20-35% identity (protein), alignments are not accurate
• memory and time consuming (prohibitive for multiple genomes)
• algorithms make implicit assumptions about evolutionary trajectories of sequences being compared

resolution-free sequence comparison methods

• word counting/composition comparison
• Universal sequence maps (CGR)
• Kolmogorov complexity
• Complete composition vectors

word-based distance

word-based distances

• word size 2-6 works well for protein comparisons
• 8-10 nt words useful for DNA or RNA
• long words (~25 nt) can distinguish very closely related bacterial species in metagenomics applications

word-based approach

word-based distances

• determine relative frequency of each word:

Oij = # times word Oj appears in sequence i

fij = frequency of word Oj in sequence i
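A minimal Python sketch of the counting step described above. It assumes that words are counted at every overlapping position and that the relative frequency is the count divided by the total number of words in the sequence; the function names are illustrative and do not come from the slides.

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def word_frequencies(seq, k):
    """Relative frequency of each observed word (count / total word count)."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# "ACGACGT" contains the 3-letter words ACG, CGA, GAC, ACG, CGT.
counts = word_counts("ACGACGT", 3)
freqs = word_frequencies("ACGACGT", 3)
```

Words that never occur are simply absent from the dictionary, which keeps the representation sparse for long words.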

word-based distances

comparing two sequences x and y:

used this method to compare mitochondrial DNA from primate species
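One simple way to compare two word-frequency vectors is the squared Euclidean distance. That choice is an assumption made here for illustration, not necessarily the formula used on the slide, and the helper names are hypothetical.

```python
from collections import Counter

def word_frequencies(seq, k):
    """Relative frequency of each overlapping word of length k."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def squared_euclidean(fx, fy):
    """Sum of squared frequency differences over all words seen in either sequence."""
    words = set(fx) | set(fy)
    return sum((fx.get(w, 0.0) - fy.get(w, 0.0)) ** 2 for w in words)

fx = word_frequencies("ACGTACGT", 2)
fy = word_frequencies("AAAAAAAA", 2)
d_same = squared_euclidean(fx, fx)  # identical word profiles
d_diff = squared_euclidean(fx, fy)  # profiles with no shared words
```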

improved word-based methods

a single change in a word creates a new word -- biologically realistic?

instead use word neighborhoods, e.g. CATTATT, CATTATA, CATTAAT...
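A small sketch of one possible neighborhood definition: the word itself plus every word that differs from it at exactly one position. The one-mismatch rule and the function name are assumptions for illustration; a real method may define or weight neighborhoods differently.

```python
def hamming_neighborhood(word, alphabet="ACGT"):
    """Return the word plus all words at Hamming distance 1 from it."""
    neighbors = {word}
    for i, original in enumerate(word):
        for letter in alphabet:
            if letter != original:
                # Substitute a single position and keep the rest unchanged.
                neighbors.add(word[:i] + letter + word[i + 1:])
    return neighbors

hood = hamming_neighborhood("CATTATT")
```

For a DNA word of length 7 this gives 1 + 7 × 3 = 22 words, including CATTATA and CATTAAT from the example above.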

N2 similarity score

N2 similarity score

• defines a (potentially weighted) set of words that are the “neighborhood” of any word

• compute word neighborhood counts
• correct for inter-variable frequency (e.g. observations of CAAAA and AAAAA are strongly correlated)
• correct for word covariance
• normalize so that all word frequencies sum to one

N2 similarity score: distinguishing enhancers

D2 score

if XW is the count of word W in the sequence X,

D2 is Poisson distributed
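The D2 statistic is usually defined as the sum, over all words W of length k, of the product of the word counts in the two sequences: D2 = Σ_W X_W · Y_W. A minimal sketch under that definition, with illustrative function names:

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k):
    """D2 = sum over shared words of (count in x) * (count in y)."""
    cx, cy = word_counts(x, k), word_counts(y, k)
    # Words missing from either sequence contribute zero, so only shared words matter.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

d2_identical = d2("ACGT", "ACGT", 2)  # AC, CG, GT each match once
d2_disjoint = d2("AAAA", "CCCC", 2)   # no shared words
d2_repeat = d2("AAAA", "AAA", 2)      # AA occurs 3 times and 2 times
```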

improvements to D2 score

D2S is normally distributed, D2* is the sum of independent normally distributed variables
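A sketch of the two variants, assuming the commonly cited definitions: counts are first centered as X̃_W = X_W − n·p_W (n = number of word positions, p_W = word probability under an i.i.d. letter model), then D2S = Σ_W X̃_W Ỹ_W / sqrt(X̃_W² + Ỹ_W²) and D2* = Σ_W X̃_W Ỹ_W / (sqrt(n·m)·p_W). The i.i.d. background model, the `letter_probs` parameter, and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_variants(x, y, k, letter_probs):
    """Return (D2S, D2*) using an i.i.d. background with the given letter probabilities."""
    n, m = len(x) - k + 1, len(y) - k + 1  # number of word positions in each sequence
    cx, cy = word_counts(x, k), word_counts(y, k)
    d2s = d2star = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        word = "".join(letters)
        p_w = prod(letter_probs[a] for a in letters)  # word probability under i.i.d. letters
        xt = cx[word] - n * p_w  # centered count in x
        yt = cy[word] - m * p_w  # centered count in y
        d2star += xt * yt / (sqrt(n * m) * p_w)
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:  # skip words whose centered counts are both zero
            d2s += xt * yt / denom
    return d2s, d2star

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
d2s_self, d2star_self = d2_variants("ACGTACGT", "ACGTACGT", 2, uniform)
```

The loop visits all |alphabet|^k words, so this direct form is only practical for short words.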

metagenomics with 5-tuples

k-tuple scores and metagenomics

metagenomics with D2*

clustering of gut bacteria from foregut fermenters, hindgut fermenters, and carnivores

more metagenomics

speeding up kmer distances

CAFE workflow

source sequences can be whole genomes, contigs, or short reads

CAFE results

resolution-free sequence comparison methods

• word counting/composition comparison
• Complete composition vectors
• Universal sequence maps (CGR)
• Kolmogorov complexity

Universal sequence maps

• Chaos theory / Chaos game representation
• Iterative functions to represent biological sequences
• Can be generalized to any order alphabet (thus “universal”)

Chaos game representation

• Plot sequence in a square with vertices labeled A, C, T, G
• 1st nt is plotted halfway between the center of the square and the vertex labeled with that nt
• Subsequent nts are plotted halfway between the previous dot and the vertex labeled with the new nt
• -> 2D plot representing primary (1°) DNA sequence of ANY length
• patterns are usually fractal
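The plotting rule above can be sketched in a few lines of Python. The placement of the letters on the corners of the unit square (A bottom-left, C top-left, G top-right, T bottom-right) is an assumption; any fixed assignment follows the same rule.

```python
# Assumed corner layout on the unit square; the slides may use a different one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return one (x, y) point per nucleotide, starting from the square's center."""
    x, y = 0.5, 0.5
    points = []
    for nt in seq:
        cx, cy = CORNERS[nt]
        # Move halfway from the previous point to the vertex of the current nucleotide.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

points = cgr_points("ACG")
```

Each new point depends mostly on the most recent nucleotides, which is why sequences sharing a suffix land close together in the plot.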

Chaos game representation

[Figure: CGR square with vertices labeled A, C, T, G, plotting the example sequence ACG]

Chaos game representation

Chaos game representation

Sierpiński Sieve

Chaos game representation

Genomic signature: dinucleotide & trinucleotide relative abundance profiles distinguish between organisms and sequence segments, and can be used in phylogenetic analysis

We see less variation of CGR along genomes than between genomes -> related to genomic signature?

Chaos game representation

What determines the pattern in a CGR?
• Short nucleotide frequencies don’t solely determine the pattern
• For a DNA sequence, one can construct a simulated sequence with the same length and nucleotide composition
• IF CGRs are the same, then nucleotide and dinucleotide frequencies are all that’s important

making CGR computable

• Hatje & Kollmar divide the CGR grid to outline short oligos & then get frequencies of those oligos

• Almeida et al. proved that the length of the common prefix between two CGRs is the dissimilarity distance
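Dividing the CGR square into a 2^k × 2^k grid gives one cell per word of length k, so counting points per cell yields word counts. The sketch below illustrates that general idea only; it is not the implementation of Hatje & Kollmar, and the corner layout and function names are assumptions.

```python
from collections import Counter

# Assumed corner layout on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return one CGR point per nucleotide, starting from the center of the square."""
    x, y = 0.5, 0.5
    points = []
    for nt in seq:
        cx, cy = CORNERS[nt]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def fcgr(seq, k):
    """Count CGR points in each cell of a 2**k x 2**k grid."""
    size = 2 ** k
    grid = Counter()
    # The first k-1 points have fewer than k nucleotides behind them, so skip them.
    for x, y in cgr_points(seq)[k - 1:]:
        grid[(int(x * size), int(y * size))] += 1
    return grid

# "ACGACG" contains the 2-letter words AC, CG, GA, AC, CG.
grid = fcgr("ACGACG", 2)
```

With this layout the two AC points fall in cell (0, 2), the two CG points in cell (2, 3), and the GA point in cell (1, 1).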

3D Chaos game representation (HPV)

computing feature vectors from 3D CGR (HPV)

Kolmogorov complexity

• K(x) is the length of the shortest binary program that can compute the string x on a binary computer
• K(x|y) is the length of the shortest binary program that can compute the string x, given the information from y

NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)] is the normalized information distance (0 ≤ NID ≤ 1)
• Can be shown that NID(x,y) can express all other distances between x and y!
• . . . Not computable though :(

Kolmogorov complexity

NID(x,y) = max[K(x|y), K(y|x)]/max[K(y), K(x)]

NCD = normalized compression distance, approximation to NID

NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]

Where C(x) is the compressed size of the string x, C(xy) is the compressed size of the string x concatenated to y, etc. Can just use gzip!

x: AAAAAAA
y: ACGAATA
xy: AAAAAAAACGAATA
yx: ACGAATAAAAAAAA
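A minimal sketch of the NCD formula in Python. It uses the standard-library `zlib` module, which applies the same DEFLATE compression as gzip but without the gzip file header; that substitution, the random test sequences, and the function names are assumptions for illustration.

```python
import random
import zlib

def compressed_size(data):
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)]) / max[C(x),C(y)]."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = min(compressed_size(x + y), compressed_size(y + x))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Two unrelated random DNA strings (fixed seed so the run is repeatable).
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
b = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

ncd_self = ncd(a, a)   # a sequence compared with itself: expected to be small
ncd_other = ncd(a, b)  # unrelated sequences: expected to be larger
```

Very short inputs such as the 7-letter strings above are dominated by compressor overhead, so the distance only becomes informative for longer sequences.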

>seq1TAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTACATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTACTAGAAATAAATGGAAAGTCAGTAAATGTGTGGCCTGTTAAAATTCTTGGAGAATATACATCACCACTTTCCTCCAAAAATGGGAATAGAATTAGTTCGAATAATTTAGAGAAAAGCACCAACAAACAAATCCACTCAGAATTCTCCATTTCTAGATTGCCCAGAACTAGGCCACGGCAGCTGGGTTCTGAGCAAGACAGTGAGGTTTTCCCTTCCGACCAGGGTGTCAAGAAGAATTGTAAGCAGATTGAATCTGCTAAATTATTACCTGATACACCCGTTCAATTCATACCTCCAAATACATTGAACCTTCGTAGCTTTACCAAGATCATAAAGAGACTGGCTGAACTGCATCCAGAAGTCAGCAGAGACCATATTATAAATGCACTTCAGGAAGTGAGAATAAGACATAAAGGTTTTCTGAATGGCTTATCTATTACTACTATTGTGGAGATGACTTCATCTCTTCTGAAAAACTCTGCTTCCAGTTAGGAATTCAAAAAACAATAAAGAGAACTTCCTTGGAAAGTGTGTTTCCTCCTTCAGAGAATGTTCTACAGCACTTAGGAAAAAGTAGTAATAACAAGATGATGTAATTAAATAGGCTCTATAAATGGGCTAAGCTGTTAAAATATTCTACTTTATATCCCTCCTTTAAAATCTAGCAACAGTTGTCTATACAATATTAAGATCTTCTCTATATATTTAAAGTTAAAATATAATTTTTAATAAGTTTTTAAATTTTTTTATTTCAATTTTGTTACTTAGAACATTAAGATGCATATTTGTGATCTAAAGAAATTGTCTTGTCCATTTTAAAAACCTTTATTAAGTCACTTTTAAAATGTATTGACCAAGAAGGAGGTTTGTTGTTACATCAATGTTTGTGAAATGATTTCCATACATAAAAAATGTAATTTACCTGAACTTTGTCTTAAGACTCTTA
CATTGGATTATAGGATAACAGATAAATAAACTGTATAGATACATTCAGTATCATACAACATTTTGGAATGTGTATGCTTTCAGGCTTCCAAGATAATTAAATTAC

>seq2ATTTATAGAGAAGCCAGTGTTAAGCCGTACTTAAGGTTCACATTTGTAATGAAATAGGTAACTGGGCCTCCACAAGTTCCATGGGAATCGCAGACTAACCATTTGGTTTTCCTCTGCCTCATTTTCTCCTCCTCCTCCTGCTCCTCCTCTTCCTCCTCCCCTCTCTTTAGCATCCTCCTCCTCCTTCTTCTTCTACATCCTCCTTTTCCTCTTCCTCCTCCATCTTCTCCTCTCCTTCTCCTCTTCCTCCCCTTCTTCATCTATTCATTCTTCCTTGAGCCTCCTGGCCCACTAGGGCCCTTCTATCTTGCATCACCTCTGCCCTCTCAAGGCATGCAATATCCTGTATCTCATTCTTCCTTTAGTTCAGCTGCCTTCTCTTCACATGGTGGTCTATCTTGGGCTGTCTGCTCAGACCACATCTCACCCAATTTCCTTGCTACATTCCCAGTGGACAAGCCCGGTGATTCACTCTTGATCTTTGGACAATATTCAGAATGAAGCAGGAAGAAAGCAAGCGGTAGTCTTTTGTGAGTACCTAAGTCTTCATTTTTCTTCAGGTCCTTTCTTATTGCCTTTAAGAGGAACATAATTCTTCATCAGCTATCATAGCCTCAGAGCAAGCCTTGTCACTTGGAGCTGTATCTTCAGGTTTCACCTTTTCCTTTGTAGGCATGAAGGTCCTCTCCAAGAACTCAGCAAAGCTGACTGGACCCAGGCATTTCTTTCTGTTCTCCTGGAAGTCTGCAGGAAGACAGCTCCTGGGCCTTTTCTTCCTCCAGCCAACCCAGTCTCCTTCACCCAAGGTGACCCATGGCGTGCGGGGAGAAGGGGGGCTCTATCTGAGTGGGCTTTTTCCTGAGTCCAAACCAGATGCTTCCTTCTCCATACGATTGTCAGCTGGCTTCACTTTTCATATTATTTTAAGCTTTAATTATTTTTCTCTCCTTGCAGAGCAACAATTGTGGTAATAAAACCAGATACCAACTCTTATCTCAGGTTAGTAATAAAGTTGTTGCCTACTATCTAGAAATGTACCTGCCTTTTCTTTTTTCTTTCCTTTTCTCTTTCCTTTCCTTTCCTTTCCTTTCCTTTTCGTTTCTTTTCTTTTCTGTAAAATGTGGCAATTTACAGGTTGGGATGTATCACCGTTGGTGGAGTGTTTACCTAGCTAGTATATACAAAGCCCTTGTTTAAATTCCTAGCACTGGGTAGGTATGGTGACTCGTGTCTGTAATCTCAGAACTCTACAGGTAGACATGTGGGAAGCAGAAATTCATCCTCAGCATACAGTGAGTTTCAAGTTGGCCTGACCCAGAAGAACTCAGGGGAAAAAAGCTGATGTCTTTTCTCTCTCTCTCTCTCTCTTTCTCTCTCTCTCTCTCTCTCTCTCTCTTTCTCTCTTTCATAATTCTTTTGGTAGAGAGAAGGAAAGAGATGAACATGTATTAAGTTCCCTGGTATCTACCAAATTTGTGTATTACTTGTCGGTTAATATTATAACAAACATTAAATTGTATTCAGAACCATATTTTGATTATTATCTTTGTGTGCTTTGGATCTCACGACAGTAATAGTTACCTGAGGTGCTTAACTACCGTTTCTGTGACAGTAAATTATTTAAGTTTACTCTCTCCCTCTACAGCCCAACAGTGTGTAGTTTGTATGGTTCATTTGTTGTTGGCTTGTTGTTATTGATGTTGTTTGTGTTGCTGATGCTAGAGTCTGGGGCCTTGGACATATTCGGCAGGCAAATGCTCCACCACTGAGCCTCCAGCCACTTTGCTGGAGGTTTTTGTAGCTGTAGATTGTAATGAAGAAGTTTTTCATCTTTTATATTTGAAAAAGATACCACGGCACGATACACAGCTACAACCAATGCACTAAGATAAATAACCAACCCAACAGAGTGACATTATGATGCAGTAGTTGTAAGAATCAATTTAAAAGATATATCACTTCATCCTTGGGTTTGCCTATGTTCTCATCTGTGAG
ATTTAAAATCTTTTGAAACATTGAATGAAGCCTCTCATCTATCATCAACTGCCATTAAATATCACATATTCACAGCTGGAGAAATGGACCAGCCGACATCCGGAA

sequence comparison by gzip

bytes  filename
2130   seq1
2130   seq2
4260   seq1.seq1
4260   seq2.seq2
4260   seq1.seq2
4260   seq2.seq1

sequence comparison by gzip

bytes  filename
2130   seq1
2130   seq2
4260   seq1.seq1
4260   seq2.seq2
4260   seq1.seq2
4260   seq2.seq1

bytes  filename
422    seq1.gz
735    seq2.gz
443    seq1.seq1.gz
766    seq2.seq2.gz
1104   seq1.seq2.gz
1088   seq2.seq1.gz

NCD(x,y) = (min[C(xy),C(yx)] - min[C(x),C(y)])/max[C(x),C(y)]
= (min(1104/4260, 1088/4260) - min(766/4260, 443/4260))/max(766/4260, 443/4260)
= (1088/4260 - 443/4260) / (766/4260)
= 0.84

Here C(x) and C(y) are taken from the gzipped self-concatenations seq1.seq1.gz and seq2.seq2.gz, and every compressed size is divided by the 4260-byte uncompressed size.
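The arithmetic can be checked directly from the listed gzip sizes; the common 1/4260 factor cancels. The second value, which uses the single-copy files seq1.gz and seq2.gz for C(x) and C(y), is shown only for comparison and is not the calculation on the slide.

```python
# Compressed sizes in bytes, as listed above.
sizes = {
    "seq1.gz": 422,
    "seq2.gz": 735,
    "seq1.seq1.gz": 443,
    "seq2.seq2.gz": 766,
    "seq1.seq2.gz": 1104,
    "seq2.seq1.gz": 1088,
}

c_concat = min(sizes["seq1.seq2.gz"], sizes["seq2.seq1.gz"])

# As on the slide: C(x), C(y) come from the self-concatenated files.
cx, cy = sizes["seq1.seq1.gz"], sizes["seq2.seq2.gz"]
ncd_slide = (c_concat - min(cx, cy)) / max(cx, cy)

# For comparison only: C(x), C(y) from the single-copy files.
sx, sy = sizes["seq1.gz"], sizes["seq2.gz"]
ncd_single = (c_concat - min(sx, sy)) / max(sx, sy)

print(round(ncd_slide, 2), round(ncd_single, 2))  # prints 0.84 0.91
```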