An approach for detection and
correction of errors in 16S RNA
fragments from parallel
metagenomic sequencing
Milko Krachunov2, Valeria Simeonova2, Ivan Popov1, Irena
Avdjieva1, Paweł Szczęsny3, Urszula Zelenkiewicz3, Piotr
Zelenkievicz3, Dimitar Vassilev1
1Bioinformatics group; AgroBioInstitute, Bulgaria
2Faculty of mathematics and informatics; Sofia University St. Kliment Ohridski, Bulgaria
3Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
Introduction
Problems in parallel sequencing
Errors of mathematical and biological origin are introduced
in the assembled reads
It is impossible to resequence the same sample
There is no easy way to differentiate between a “sequencing
error” and a “rare variant”
All existing solutions target very specific cases
Even a low level of errors is significant and introduces a high level
of error into the prediction of biodiversity
Parallel sequencing produces large amounts of data that
need to be analysed.
Input data
Shared characteristics:
Metagenomic source
From the highly conserved 16S RNA
Including highly variable regions, which are species-specific, so the same hypervariable region is sequenced for many of the
microorganisms
Usually tens of thousands of short reads (length 300-500 bp)
Specific test data characteristics:
Two runs, of 22325 and 34717 fragments
Aims and common tasks
1. Aims:
① Investigating biodiversity (mostly of microbes)
② Identifying already known organisms (such as bacteria)
③ Generating an evolutionary tree for the current data set
2. Tasks:
① Develop an approach for finding and correcting the errors that
have occurred in the reads by studying the correspondence
between similar sequences (Algorithm 1)
② Develop an approach for fine-tuning the accuracy of the
algorithm for estimating the sequence error level and detecting SNPs and species
in NGS data (Algorithm 2)
③ Develop free/open-source software packages for both
approaches, which can run independently of the data and
of the algorithm used for defining errors, SNPs and species.
Problem solving
INTERACTION BETWEEN BOTH APPROACHES
Assume that the data set doesn't include any errors or SNPs, and that all sequences are from one species
Generate artificial sequence errors, SNPs and different species by using different levels of errors
The aim is to obtain the accuracy of the algorithm
Apply Algorithm 1 with different parameters to identify the generated “mutations”
Tune Algorithm 1 using Algorithm 2
Apply Algorithm 1 with the new set of predicted parameters, following the initial assumption
Estimate the accuracy level of Algorithm 1
Decide whether to apply Algorithm 1 to the data set to get the error levels etc., or to tune it again
① DEVELOP AN APPROACH FOR
FINDING AND CORRECTING THE
ERRORS THAT HAVE OCCURRED IN
THE READS BY STUDYING THE
CORRESPONDENCE BETWEEN
SIMILAR SEQUENCES
PROBLEM SOLVING
Overview of the approach
The data is filtered, removing any ambiguous reads (e.g.
marked by the sequencer)
The data is aligned so that any corresponding regions occupy
the same position in all reads
Error evaluation for each base in each read is performed by
finding its occurrence rate weighted with the local similarity in
the region
Optionally, sequences that are globally dissimilar are
weighted lower
Approaches for alignment
Problem: free/open-source tools for multiple sequence
alignment (e.g. MUSCLE) don't scale to datasets over a few
thousand reads
Construct a hierarchical tree and perform pairwise alignment
at each branching – preliminary results not promising,
slow, O(n³)
Cluster the dataset and align clusters independently
Hierarchical clustering – slow in tests, O(n³)
Suffix tree-based clustering – much faster
Proprietary tools presumed to use suffix trees, such as CAP3, give
near-perfect alignment of the test dataset (CAP3 is free of
charge only for non-profit projects)
Preliminary error correction
approach
Create a position window with radius n around each base that is evaluated (e.g. 5 characters before and after the base)
Find all reads that contain exactly the same region in the given window
Calculate the occurrence rate of this base and assume it is an error if the rate is below a given threshold
To fix the error, replace the base with the most frequent base in the same subset of reads
Result: even with sub-optimal alignment, this leads to some improvement in the number of organisms recognized by external software (ClaMS)
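The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the reads are assumed to be already aligned to equal length, the "same region" is taken to mean the window around the base excluding the base itself, and the function name `correct_base` and the default `threshold` are hypothetical.

```python
from collections import Counter

def correct_base(reads, read_idx, pos, radius=5, threshold=0.05):
    """Return the (possibly corrected) base at reads[read_idx][pos].

    reads: aligned strings of equal length (assumption of this sketch).
    """
    target = reads[read_idx]
    lo, hi = max(0, pos - radius), min(len(target), pos + radius + 1)
    # Context = window around the evaluated base, excluding the base itself.
    left, right = target[lo:pos], target[pos + 1:hi]
    # Bases found at `pos` in all reads that share exactly this context.
    peers = [r[pos] for r in reads
             if r[lo:pos] == left and r[pos + 1:hi] == right]
    counts = Counter(peers)  # always contains the target read itself
    rate = counts[target[pos]] / len(peers)
    if rate < threshold:
        # Treat as an error: replace with the most frequent base.
        return counts.most_common(1)[0][0]
    return target[pos]
```

Calling the function for every position of every read gives a full correction pass; the threshold decides how rare a base must be before it is treated as an error rather than a rare variant.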
The proposed approach
Create a large window around each evaluated base (the window is limited for performance reasons)
Calculate a similarity score between the evaluated sequence and every other sequence
For the score, give bases farther from the middle a lower weight
Use the scores to calculate a weighted occurrence rate using all sequences
Use the same weighting approach when finding the base that's most common in the similar reads and use it to replace the incorrect one
Sequences that don't match or have an accidental match would be discarded because of their negligible score
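A minimal sketch of the weighted variant follows. The slides do not specify the weighting function, so the linear decay in `position_weights` and the `power` exponent (used to push the score of poorly matching reads towards zero) are assumptions made purely for illustration; all names are hypothetical and the reads are assumed to be aligned to equal length.

```python
from collections import defaultdict

def position_weights(radius):
    """Weights that decay linearly with distance from the evaluated base."""
    return [1.0 - abs(d) / (radius + 1) for d in range(-radius, radius + 1)]

def similarity(a, b, pos, radius):
    """Weighted fraction of matching window positions, excluding pos itself."""
    score = total = 0.0
    window = range(pos - radius, pos + radius + 1)
    for w, i in zip(position_weights(radius), window):
        if i == pos or i < 0 or i >= len(a):
            continue
        total += w
        if a[i] == b[i]:
            score += w
    return score / total if total else 0.0

def weighted_correct(reads, read_idx, pos, radius=20, threshold=0.05, power=4):
    """Return the (possibly corrected) base, using similarity-weighted votes."""
    target = reads[read_idx]
    votes = defaultdict(float)
    for r in reads:
        # Raising to a power makes weakly similar reads count for almost nothing.
        votes[r[pos]] += similarity(target, r, pos, radius) ** power
    total = sum(votes.values())
    if total == 0:
        return target[pos]
    if votes[target[pos]] / total < threshold:
        return max(votes, key=votes.get)
    return target[pos]
```

Unlike the exact-match window, every read contributes here, but a read that agrees with the evaluated one only by accident receives a score close to zero and has practically no influence on the vote.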
② DEVELOP AN APPROACH FOR
FINE TUNING THE ACCURACY OF
THE ALGORITHM FOR SEQUENCE
ERROR ESTIMATION LEVEL, SNPs
AND SPECIES DETECTING IN NGS
DATA.
PROBLEM SOLVING
According to Accuracy and
quality assessment of 454 GS-FLX
Titanium pyrosequencing [1]:
• As control (reference) sequences we will use
the main data set, estimating the accuracy of
Algorithm 1 on deviations generated artificially
beforehand, in three ranges that define:
• Whether it is an SNP: ~0.01 % (it may vary)
• Whether it is a different species (DS): ~40-60 %
• Whether it is a sequence error: in the range between
SNP and DS, and greater than DS
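One way to produce such artificial deviations is to substitute bases at a chosen rate and remember where the changes were made, so that the positions flagged by Algorithm 1 can later be scored against them. The sketch below is an assumption-laden illustration: it models substitutions only (no insertions or deletions), and the names `mutate` and `detection_accuracy` are hypothetical.

```python
import random

def mutate(seq, rate, rng, alphabet="ACGT"):
    """Substitute each base by a different one with probability `rate`.

    Returns the mutated sequence and the list of changed positions.
    """
    out, changed = [], []
    for i, base in enumerate(seq):
        if base in alphabet and rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
            changed.append(i)
        else:
            out.append(base)
    return "".join(out), changed

def detection_accuracy(flagged, injected):
    """Precision and recall of flagged positions against injected ones."""
    flagged, injected = set(flagged), set(injected)
    hits = len(flagged & injected)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(injected) if injected else 1.0
    return precision, recall
```

With `rate=0.0001` the function imitates the SNP range quoted above, and with a rate of 0.4 to 0.6 the "different species" range; a seeded `random.Random` keeps the generated test sets reproducible.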
• Estimating the accuracy of Algorithm 1 also depends on the variables that characterize a sequence error:
• the presence of homopolymers
• position in the sequence
• size of the sequence
• spatial localization in PT plates for insertion and deletion errors – this cannot be computed; it is provided by the sequencer
• reference sequence type
• the distance of beads to the region center
• the distance of beads to the plate center
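The variables that can be computed from the read itself (homopolymers, position, size) are straightforward to extract. The following sketch shows one possible way to do it; the minimum run length of 3 and the names `homopolymer_runs` and `base_features` are assumptions for illustration, and the plate-related variables are left out because they come from the sequencer.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases >= min_len."""
    runs, start = [], 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((start, base, length))
        start += length
    return runs

def base_features(seq, pos, min_len=3):
    """Computable per-base variables: homopolymer membership, position, size."""
    in_run = any(s <= pos < s + n
                 for s, _, n in homopolymer_runs(seq, min_len))
    return {
        "in_homopolymer": in_run,
        "relative_position": pos / len(seq),
        "read_length": len(seq),
    }
```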
• Since their results show that the error rate
was heterogeneous along the length of the
sequence, we propose to analyze it in several
regions, according to the analysis of
distributions.
• Another good idea is to group the sequence
errors by whether they occur as
insertions, deletions, mismatches or ambiguous
base calls.
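Grouping by error type is simple once a read is aligned against its reference. The sketch below assumes both sequences are aligned to the same length, with `-` marking gaps and `N` marking ambiguous base calls; these conventions and the function name are assumptions, not something the slides specify.

```python
from collections import Counter

def classify_differences(reference, read, gap="-", ambiguous="N"):
    """Count difference types between two aligned, equal-length sequences."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, read_base in zip(reference, read):
        if ref_base == read_base:
            continue
        if read_base == ambiguous:
            counts["ambiguous"] += 1
        elif ref_base == gap:
            counts["insertion"] += 1   # read has a base the reference lacks
        elif read_base == gap:
            counts["deletion"] += 1    # read lacks a base the reference has
        else:
            counts["mismatch"] += 1
    return counts
```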
Defining the training set
• With all this information available, we will organize it as
an ANN training set for predicting the parameters of Algorithm 1 using Artificial Neural Networks. This way we will have several groups of input data:
• Sequence error parameters
• Kind of produced “error”: sequence error (sError), SNP, different species
• Where the “error” occurs: in/del, mismatches, ambiguous base call
• Parameters of Algorithm 1
• Another option is to also include a row of averaged data for each group of Algorithm 1's parameters. This will provide a guideline for the rest.
How to choose the best
architecture for the ANN?
Choosing the best architecture is a complex task.
We propose to use a Genetic Algorithm (GA) for this. The role of the GA is to estimate the internal error of the ANN for different topologies. We are interested only in the best topology, meaning the one with the smallest maximum internal error. Another option is to use the smallest average error; it is not clear which of the two is better.
The back-propagation algorithm will be the core of the training, because it is well suited to such prediction tasks.
How to choose which results
are the best?
• After training and prediction we can apply 3D PCA
to the results as a method for visualizing
their clustering. It is very good for taking
a quick snapshot of the different groups. Of course,
we are open to other visualization methods
for graphical analysis.
What can we expect from
Algorithm 2?
• ANNs and GAs are Artificial Intelligence
algorithms. They are widely and successfully used
for prediction, classification and clustering
problems.
• We expect that this will be:
• A fine-tuning tool, because we will work
only with digital data and a big training set, using the latest IT technologies.
• A tool that uses less CPU
• A tool that gives quick results
Current
Realization
General approach in metagenomic
biodiversity studies
454 Sequencing
Filtering/Denoising
Multiple alignment
Distance matrix
OTU clusters with abundance count
Our approach:
A. Raw data
B. Correction with SHREC
C. Correction with our method
Classification, evaluation
and comparison
A. Raw data characteristic
and processing
• Two separate runs of metagenomic 16S RNA
fragments, sequenced on the 454 platform and converted to FASTA
format:
• run 02 – 46429 short reads
• run 04 – 41386 short reads
• Our task – extract, denoise and correct only the quality reads.
A1
• 454 output data
A2
• Filtering by quality and length (300-500 nucleotides)
• run 02 – 22325 short reads
• run 04 – 34717 short reads
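The length part of step A2 can be sketched as follows. This is a minimal illustration that reads FASTA records and keeps those between 300 and 500 nucleotides; quality filtering is not shown because it needs the sequencer's quality scores, which the FASTA format does not carry. The function names are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_len=300, max_len=500):
    """Keep only records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]
```

An open file object can be passed directly as `lines`, since iterating over it yields one line at a time.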
[Figure: raw data length histograms for Run 02 and Run 04]
B. Correction with SHREC
A2
• Filtered data
B1
• Restrict quality parts of fragments only
• Cut to length of 300 nucleotides
• Remove fragments that have Ns in their sequence
B2
• Correction with SHREC:
• run 02 – 11211 short reads
• run 04 – 30149 short reads
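The B1 preparation (cutting to 300 nucleotides and removing fragments with Ns) might look like the sketch below. Checking for N after trimming follows the order of the bullets; dropping reads that are shorter than the target length is an extra assumption of this sketch, and the function name is hypothetical. Running SHREC itself is not shown.

```python
def trim_and_drop_ambiguous(seqs, length=300, ambiguous="N"):
    """Cut each read to `length` bases; drop short reads and reads containing N."""
    kept = []
    for s in seqs:
        s = s[:length].upper()
        if len(s) == length and ambiguous not in s:
            kept.append(s)
    return kept
```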
C. Correction with our method:
A2
• Filtered data => restrict quality parts of fragments only
C1
• Calculate distance matrix using Levenshtein distances
• Cluster data pool into smaller groups using the distance matrix
• Alignment with MUSCLE:
• Multiple alignment of sequences in each group
• Alignment of each group with the others
C2
• Calculate occurrence rate [%] for each base (A/T/G/C) for each position of the short reads
• Determine a value [%] under which a base is regarded as "error"
• Scan reads for errors
• Replace errors with the base that has the highest occurrence rate for the current position among the neighbors
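Step C1 relies on Levenshtein (edit) distances to split the read pool into groups small enough for MUSCLE. The sketch below shows the standard dynamic-programming distance together with a simple greedy grouping. The greedy scheme is a stand-in chosen for brevity, not the clustering used in the project, which works from a full distance matrix; `max_dist` and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def greedy_clusters(seqs, max_dist):
    """Put each read into the first cluster whose representative is within max_dist."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if levenshtein(rep, s) <= max_dist:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return [members for _, members in clusters]
```

Each resulting group can then be aligned on its own, which keeps the multiple alignment within the size range that MUSCLE handles well.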
Classification and performance evaluation
1
• At every step sequences are classified with ClaMS:
• classification against ClaMS' own database
• classification of the two runs against each other
2
• Comparison of classification results made for the last
step for each variant (A, B, C)
ClaMS parameters:
Distance cut-off: 0.05
Signature type: DBC
k-mer length: 3
Existing taxonomy: 4th Level
Result comparison
• We have comparison results for the two sequencer runs, which are similar in most cases.
• The next three slides will show the differences in finding some specific classes of organisms and organisms, using four different error correction variants: Original (i.e. without error correction), SHREC, and ours – AF2 and ANS.
• The levels of classification are derived from ClaMS, where the 0th level is the root, the 1st level gives the classification classes Bacteria, Archaea and Eukaryota, and the 2nd and 3rd levels are more specific and present a more detailed classification of contigs.
Improvements
Fuzzy estimation of coincidence within the window
Increasing the size of the window and
assigning higher weights to the positions
close to the examined one
Creating a preliminary evolutionary tree and
excluding the organisms that are far away
when making the estimation.
Realization Problems
• The result is strongly dependent on:
• The data preprocessing
• The alignment approach
• Every approach for error correction also
alters some of the true short reads.
Conclusions &
Future work
Used software and papers
Conclusion
• Working with large amounts of data is easier when the data are
clustered into groups before alignment;
• Modifying existing or creating new algorithms allows more
data to be preserved according to custom requirements;
• The error correction method is important for better
classification of the data.
• The detected presence of organism classes depends on the error
correction method used.
Future work
• Improvement and validation of our approach;
• Attaining more precise error detection, correction and
classification of metagenomic data
• Comparison of the results with those of other problem solvers
• Studying different approaches and
comparing their effect on the final results
• Estimating the significance of preserving rare mutations vs
removing the “wrong” short reads.
• Creating an approach for measuring the quality of the generated
evolutionary tree
• Using the tree estimate in the error evaluation
• Comparison with public databases
• Working on Algorithm 2
Sphere of Application
• Estimating the biodiversity
• Creating an evolutionary tree
Software used in this project:
• Python: http://www.python.org/
• Cython: http://cython.org/
• MEGA (Molecular Evolutionary Genetics Analysis):
http://www.megasoftware.net/
• Muscle: http://www.drive5.com/muscle/
• SHREC (SHort Read Error Correction method):
http://ww2.cs.mu.oz.au/~schroder/shrec_www/
• ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/
• NINJA (modified):
http://nimbletwist.com/software/ninja/index.html
• R-package: http://www.r-project.org/
Papers
1. Gilles, A., Meglecz, E., Pech, N., Ferreira, S., Malausa, T., and Martin, J. F. (2011). Accuracy and
quality assessment of 454 GS-FLX Titanium
pyrosequencing. BMC Genomics, 12(1):245.