Advancing Science with DNA Sequence
Sequence Clustering
MGM Workshop, September 26, 2011
Reducing Search Space in Protein and DNA/RNA Sequence Analysis
Denis Kaznadzey, GBP
Sequence clustering
To deal with a huge variety of individual ‘objects’:
- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify ‘leftovers’
- Occasionally review entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?
Sequence clustering
Taxonomical Classification vs. Continuity of the Great Chain of Being
Even if reductionist, classification is a tool for studying the world, and biology in particular.
When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.
Carl Linnaeus Georges Buffon
Sequence clustering
In modern biology, the most abundant type of data is sequence:
• Genomic DNA
• RNA (through RNA-Seq)
• Derived proteins
The primary feature is primary structure, but classification criteria depend on the application.
Sequence Clustering
Select Applications in Genomic Sciences:
• Genome Assembly: Binning, Scaffolding
• Transcriptomics: EST (read) clustering
• Protein Function and Evolution studies: Protein families
• Phylogenetic profiling: OTUs
Sequence Clustering
In Metagenomics:
Primary tasks:
• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities
Based on sequence comparison.
Sequence Clustering
- Any clustering is based on a distance in some metric.
- Initial clustering is based on pair-wise distances.
- Subsequent classification is based on distances from object to clusters:
  - Representative
  - Set of representatives (all members, at the extreme)
  - Other measure, possibly unrelated to the initial one.
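The two object-to-cluster options above can be sketched in a few lines. This is only an illustration: the Hamming distance, the choice of the first member as representative, and all function names are assumptions made for the example, not part of the original material.

```python
def hamming(a: str, b: str) -> int:
    """Toy distance (an assumption for this sketch): mismatching positions."""
    if len(a) != len(b):
        raise ValueError("hamming() expects equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def dist_to_representative(obj, cluster, dist=hamming):
    """Distance to one representative; here, arbitrarily, the first member."""
    return dist(obj, cluster[0])


def dist_to_all_members(obj, cluster, dist=hamming):
    """Extreme case: every member is a representative; use the closest one."""
    return min(dist(obj, member) for member in cluster)


def assign(obj, clusters, obj_to_cluster):
    """Index of the cluster with the smallest object-to-cluster distance."""
    return min(range(len(clusters)),
               key=lambda i: obj_to_cluster(obj, clusters[i]))


clusters = [["AAAA", "GGTT"], ["AATT", "AACC"]]
query = "GGTC"
by_rep = assign(query, clusters, dist_to_representative)  # compares 2 sequences
by_all = assign(query, clusters, dist_to_all_members)     # compares all 4
```

In this toy case the two measures disagree (`by_rep` picks the second cluster, `by_all` the first), which is the trade-off between cost and accuracy when choosing how a cluster is represented.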
Sequence Clustering
Once a distance measure is chosen and distances are obtained / computed:
• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)
  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.
• However, options for large-volume clustering are limited by the performance of the algorithms.
• Single-linkage can be computed very efficiently
• (The method for pledging new sequences to clusters may be computationally more intensive)
Sequence clustering
Most efficient clustering: transitive-closure based.
• Requires ‘boolean’ distances (two sequences can be linked or not linked)
• Requires number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)
Sequence clustering
Practical Transitive Closure algorithm:
Allocate array of sequence numbers A [0..N-1]; initialize A [n] = n
Phase I: connect linked vertices through vertex of smallest index
  For each edge (m, n):
    While A [n] != n:
      n = A [n]
    While A [m] != m:
      m = A [m]
    A [max (m, n)] = min (m, n)
Phase II: propagate smallest indices as cluster identifiers
  For each n from 0 to N-1:
    If A [n] != A [A [n]]:
      A [n] = A [A [n]]
Phase III: collect clusters (implementation dependent)
  Count number of distinct cluster “id”s => M (1 pass)
  Allocate array of sizes; count size of each cluster (1 pass)
  Allocate array of clusters; fill it in (1 pass)
Example (7 sequences, edges added one at a time):
0 1 2 3 4 5 6   initial
0 1 2 1 4 5 6   +(1,3)
0 1 2 1 4 5 5   +(5,6)
0 1 2 1 4 1 5   +(6,1)
0 1 2 1 4 1 1   after Phase II
Clusters: (0); (1,3,5,6); (2); (4)
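The three phases can be written as a short Python sketch. The function name and the use of a dictionary in Phase III are choices made for this illustration; the slide leaves Phase III implementation dependent.

```python
from collections import defaultdict


def transitive_closure_clusters(n_nodes, edges):
    """Single-linkage clusters from boolean links (sketch of the three phases)."""
    # a[i] points toward the smallest index known to be linked to i;
    # a node with a[i] == i is the root (identifier) of its cluster.
    a = list(range(n_nodes))

    # Phase I: connect linked vertices through the vertex of smallest index.
    for m, n in edges:
        while a[n] != n:
            n = a[n]
        while a[m] != m:
            m = a[m]
        a[max(m, n)] = min(m, n)

    # Phase II: propagate roots. Pointers always go to smaller indices,
    # so one increasing pass is enough: a[a[i]] is already a root.
    for i in range(n_nodes):
        a[i] = a[a[i]]

    # Phase III: collect members per cluster identifier.
    members = defaultdict(list)
    for i, root in enumerate(a):
        members[root].append(i)
    return a, sorted(members.values())


labels, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
print(labels)    # [0, 1, 2, 1, 4, 1, 1]
print(clusters)  # [[0], [1, 3, 5, 6], [2], [4]]
```

The printed values reproduce the worked example: the label array after Phase II and the four resulting clusters.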
Sequence clustering
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
Example (read/EST clustering): % identity + length + arrangement
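A boolean link rule of this kind might look like the sketch below. Everything specific here is an assumption for illustration: the `Match` record, the threshold values, and the reading of "arrangement" as an end-to-end (dovetail or containment) overlap on the same strand.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical pairwise alignment record (0-based, end-exclusive)."""
    identity: float  # fraction of identical positions in the aligned region
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int


def is_linked(m, min_identity=0.95, min_length=40, max_overhang=10):
    """True if two reads should be linked; thresholds are illustrative only."""
    aligned = min(m.q_end - m.q_start, m.s_end - m.s_start)
    if m.identity < min_identity or aligned < min_length:
        return False
    # Arrangement rule: on each side of the alignment, at least one of the
    # two sequences must (nearly) end, as in a true overlap or containment.
    left_overhang = min(m.q_start, m.s_start)
    right_overhang = min(m.q_len - m.q_end, m.s_len - m.s_end)
    return left_overhang <= max_overhang and right_overhang <= max_overhang


dovetail = Match(0.98, 40, 100, 100, 0, 60, 120)   # end of query meets start of subject
internal = Match(0.99, 30, 80, 200, 50, 100, 200)  # local hit inside both reads
weak = Match(0.80, 40, 100, 100, 0, 60, 120)       # right arrangement, low identity
```

Only `dovetail` passes; `internal` is the kind of match a shared repeat produces, which is why an arrangement rule helps on top of identity and length thresholds.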
Computing similarity measure:
- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
- K-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-BLAST, IMPALA
- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
Sequence clustering
Distance computing is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
- For large data sets, only k-mer and suffix array measures are practical.
- However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
- For boolean distance, iterative similarity detection is possible: fast binning, then slow comparing. (No off-the-shelf implementations(?))
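The greedy idea can be sketched as follows: each sequence is compared only against current cluster representatives, never against all other sequences. This is a toy illustration in the spirit of the bundled tools, not the actual algorithm of CD-HIT or UCLUST; the k-mer Jaccard similarity, the threshold, and the function names are assumptions.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by all k-mers seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Greedy incremental clustering; returns lists of sequence indices."""
    reps = []      # k-mer set of each cluster's representative
    clusters = []  # member indices, representative first
    # Longest sequences first, so that representatives tend to be longest.
    for i in sorted(range(len(seqs)), key=lambda j: -len(seqs[j])):
        ks = kmers(seqs[i], k)
        for c, rep in enumerate(reps):
            if jaccard(ks, rep) >= threshold:
                clusters[c].append(i)  # pledge to the first matching cluster
                break
        else:
            reps.append(ks)            # no match: becomes a new representative
            clusters.append([i])
    return clusters


seqs = ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC", "TTTTGGGGC"]
clusters = greedy_cluster(seqs)
print(clusters)  # [[0, 1], [2, 3]]
```

The number of comparisons is bounded by sequences times representatives rather than by all pairs, which is what leaves room for a slower, more sensitive measure in place of `jaccard`.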
Sequence clustering
Boolean distance clustering killer:
CLUSTER AGGREGATION.
In large clusters, even a small number of random links leads to huge conglomerates.
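A small simulation shows the effect with transitive-closure clustering. The data are invented for the illustration: two families of 50 sequences each, linked internally as chains, plus one spurious link between them.

```python
from collections import Counter


def cluster_labels(n_nodes, edges):
    """Transitive-closure labels: smallest index in each connected group."""
    a = list(range(n_nodes))
    for m, n in edges:
        while a[n] != n:
            n = a[n]
        while a[m] != m:
            m = a[m]
        a[max(m, n)] = min(m, n)
    for i in range(n_nodes):
        a[i] = a[a[i]]
    return a


def largest_cluster(n_nodes, edges):
    """Size of the biggest cluster produced by the given links."""
    return max(Counter(cluster_labels(n_nodes, edges)).values())


# Two genuine families: nodes 0..49 and nodes 50..99.
family_a = [(i, i + 1) for i in range(0, 49)]
family_b = [(i, i + 1) for i in range(50, 99)]
true_edges = family_a + family_b

before = largest_cluster(100, true_edges)
# One spurious link (e.g. a shared adapter or repeat) between the families.
after = largest_cluster(100, true_edges + [(17, 83)])
print(before, after)  # 50 100
```

A single false link out of 99 is enough to merge everything, because a boolean link carries no weight that would let the clustering discount it.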
Common causes:
1) Contamination with standard constructs
2) Repeats
3) Chimeras
4) Spurious similarities (low-complexity zones, etc.)
Sequence clustering
Fighting aggregation
- Vector / adapter trimming:
  - Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.
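To give a feel for what low-complexity masking does, here is a minimal sketch based on Shannon entropy in a sliding window. It is not the algorithm used by SEG, DUST, or WindowMasker; the window size, the entropy threshold, and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the symbol composition of a string."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.0):
    """Replace with 'N' every position covered by a low-entropy window."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


complex_part = "ACGTACGTACGT"
seq = complex_part + "A" * 20 + complex_part
masked = mask_low_complexity(seq)
```

The poly-A run is fully masked, along with a few flanking bases covered by windows that are still mostly `A`. Masked regions can then be excluded from seeding so that they do not create spurious links.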
Sequence clustering
- Repeat detection / masking:
  - Regular (tandem) repeats:
    - Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)
    - Post-search detection based on similarity properties (multiple parallel threads)
  - Irregular (long) repeats:
    - Database-based: RepeatMasker
    - De novo: RepeatScout, orrb, PILER, etc. Require genome as input, construct database.
Sequence clustering
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in native arrangement are more frequent.
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to originating phyla than the entire chimera.
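The ‘arms’ idea can be illustrated with a toy check: split the query in two, find the closest reference for each arm, and compare with the closest reference for the whole sequence. This is a simplified sketch, not the method of ChimeraSlayer or Bellerophon. It assumes pre-aligned, equal-length sequences, a breakpoint fixed at the midpoint, and an arbitrary margin; all names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions (pre-aligned, equal-length strings)."""
    if len(a) != len(b) or not a:
        raise ValueError("identity() expects non-empty, equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_parent(query, refs):
    """(name, identity) of the reference closest to the query."""
    name = max(refs, key=lambda r: identity(query, refs[r]))
    return name, identity(query, refs[name])


def looks_chimeric(query, refs, margin=0.1):
    """True if both arms fit different parents better than the whole fits any."""
    mid = len(query) // 2  # assumption: breakpoint at the midpoint
    left_refs = {n: s[:mid] for n, s in refs.items()}
    right_refs = {n: s[mid:] for n, s in refs.items()}
    left_name, left_id = best_parent(query[:mid], left_refs)
    right_name, right_id = best_parent(query[mid:], right_refs)
    _, whole_id = best_parent(query, refs)
    return (left_name != right_name
            and min(left_id, right_id) >= whole_id + margin)


refs = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "TTGACCATGGTTGACCATGG",
}
chimera = refs["A"][:10] + refs["B"][10:]
```

Here `looks_chimeric(chimera, refs)` is true, while a genuine sequence such as `refs["A"]` is not flagged. A real tool would scan candidate breakpoints and use proper alignments instead of a fixed midpoint.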
Sequence clustering
Detecting chimeric sequences
• Similarity coverage based: Mira assembler
Sequence clustering
Detecting chimeric sequences
• Similarity graph topology based: dchim
[Figure panels: Alignment view; Connectivity view]
Protein Clusters: various criteria
- Primary structure similarity
- Close evolutionary relationship
- Similarity in physical properties
- 3-D structure similarity
- Similar fold arrangement
- Domain structure similarity
- Common or similar functions
- etc.
Sequence clustering
Functional and structural classifications in IMG
Sequence clustering
Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species.
Position-specific scoring matrices and profile-HMMs provide better sensitivity, but are MUCH SLOWER.
For individual genomes (10^3 to 5×10^4 proteins), they could be used with massively parallel computations (while the number of genomes is within the thousands).
For metagenomes, they cannot be used with foreseeable computing resources.
Sequence clustering
Functional annotation of metagenome genes through protein clusters (under development):
- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes.
- Build HMMs for each cluster, compose model database.
- Pledge metagenome proteins to clusters by matching to models.
- Cluster unpledged proteins, build models, update model database.
- Balance model database by creating a model tree: aggregating small related clusters and dissecting large ones.
- Perform hierarchical searches through the profile tree.
Sequence clustering
Clustering reduces search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.
It improves only searches within the parameter space used for clustering (structure-based clusters are not useful when searching for certain codon usage, etc.).
However, for proteins, which form dense relationship networks, clustering is a great tool.
Thank you!