Presentation for Warsaw

An approach for detection and correction of errors in 16S RNA fragments from parallel metagenomic sequencing

Milko Krachunov2, Valeria Simeonova2, Ivan Popov1, Irena Avdjieva1, Paweł Szczęsny3, Urszula Zelenkiewicz3, Piotr Zelenkiewicz3, Dimitar Vassilev1

1Bioinformatics group; AgroBioInstitute, Bulgaria

2Faculty of Mathematics and Informatics; Sofia University St. Kliment Ohridski, Bulgaria

3Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland

Introduction

Problems in parallel sequencing

Errors of both mathematical and biological origin are introduced into the assembled reads

It's impossible to resequence the same sample again

There's no easy way to differentiate between a “sequencing error” and a “rare variant”

All existing solutions target very specific cases

Even a low error rate is significant and leads to large errors in the predicted biodiversity

Parallel sequencing produces large amounts of data that need to be analysed.

Input data

Shared characteristics:

Metagenomic source

From the highly conserved 16S RNA

Including hypervariable regions, which are species-specific, so the same hypervariable region is sequenced for many of the microorganisms

Usually tens of thousands of short reads (length 300-500 bp)

Specific test data characteristics:

Two runs of 22325 and 34717 fragments

Aims and common tasks

1. Aims:

① Investigating biodiversity (mostly microbial)

② Identifying already known organisms (e.g. bacteria)

③ Generating an evolutionary tree for the current data set

2. Tasks:

① Develop an approach for finding and correcting the errors that have occurred in the reads by studying the correspondence between similar sequences (Algorithm 1)

② Develop an approach for fine-tuning the accuracy of the algorithm for estimating the sequence error level and for detecting SNPs and species in NGS data (Algorithm 2)

③ Develop free/open-source software packages for both approaches, which can work independently of the data and of the algorithm used for defining errors, SNPs and species.

Problem solving

Interaction between both approaches

Assume that the data set doesn't include any errors or SNPs and that all sequences are from one species

Generate artificial sequence errors, SNPs and different species at different error levels

Apply Algorithm 1 with different parameters to identify the generated “mutations”; the aim is to obtain the accuracy of the algorithm

Tune Algorithm 1 using Algorithm 2

Apply Algorithm 1 with the new set of predicted parameters, under the initial assumption

Estimate the accuracy level of Algorithm 1

Decide whether to apply Algorithm 1 to the data set to get the error levels etc., or to tune it again
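The step of generating artificial errors and then measuring how well they are found can be sketched as below. This is only an illustration under stated assumptions: it injects substitutions only (no insertions, deletions, SNPs or foreign species), and the names `inject_substitutions` and `score_detection` are hypothetical, not part of the project software.

```python
import random

BASES = "ACGT"

def inject_substitutions(read, rate, rng):
    """Substitute each base with probability `rate`.

    Returns the mutated read and the list of mutated positions, so that
    a detector can later be scored against the known "truth".
    """
    out = list(read)
    positions = []
    for i, base in enumerate(read):
        if rng.random() < rate:
            # Always pick a base different from the original one.
            out[i] = rng.choice([b for b in BASES if b != base])
            positions.append(i)
    return "".join(out), positions

def score_detection(true_positions, flagged_positions):
    """Precision and recall of flagged positions against injected ones."""
    true_set, flagged = set(true_positions), set(flagged_positions)
    hits = len(true_set & flagged)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(true_set) if true_set else 1.0
    return precision, recall
```

Running Algorithm 1 on the mutated reads with several parameter sets and comparing the flagged positions with the injected ones gives one accuracy value per parameter set, which is the kind of data the tuning step needs.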

① DEVELOP AN APPROACH FOR FINDING AND CORRECTING THE ERRORS THAT HAVE OCCURRED IN THE READS BY STUDYING THE CORRESPONDENCE BETWEEN SIMILAR SEQUENCES

PROBLEM SOLVING

Overview of the approach

The data is filtered, removing any ambiguous reads (e.g. marked by the sequencer)

The data is aligned so that any corresponding regions occupy the same position in all reads

Error evaluation for each base in each read is performed by finding its occurrence rate weighted by the local similarity in the region

Optionally, sequences that are globally dissimilar are weighted lower

Approaches for alignment

Problem: free/open-source tools for multiple sequence alignment (e.g. muscle) don't scale to datasets over a few thousand reads

Construct a hierarchical tree and perform pairwise alignment at each branching – preliminary results not promising, slow, O(n³)

Cluster the dataset and align clusters independently

Hierarchical clustering – slow in tests, O(n³)

Suffix tree-based clustering – much faster

Proprietary tools presumed to use suffix trees, such as CAP3, give near-perfect alignment of the test dataset (CAP3 is free of charge only for non-profit projects)

Preliminary error correction approach

Create a position window with radius n around each base that is evaluated (e.g. 5 characters before and after the base)

Find all reads that contain exactly the same region in the given window

Calculate the occurrence rate of this base and assume it is an error if it is below a given threshold

Replace it with the most frequent base in the same subset to fix the error

Result: Even with sub-optimal alignment, it leads to some improvement in the number of organisms recognized by external software (ClaMS)
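A minimal sketch of the window-based steps above, assuming the reads are already aligned to equal length and that "the same region" means the flanks of the window with the evaluated base itself left out (otherwise the base would always agree with itself). The function name and the default threshold are illustrative assumptions.

```python
from collections import Counter

def correct_exact_window(reads, radius=5, threshold=0.05):
    """Correct rare bases using reads whose flanking window is identical.

    reads: list of aligned strings of equal length.
    Returns a new list of corrected reads; counts always use the
    original, uncorrected reads.
    """
    corrected = []
    for read in reads:
        fixed = list(read)
        for pos in range(len(read)):
            lo = max(0, pos - radius)
            hi = min(len(read), pos + radius + 1)
            left, right = read[lo:pos], read[pos + 1:hi]
            # Bases seen at `pos` in every read with the same flanks.
            counts = Counter(
                other[pos] for other in reads
                if other[lo:pos] == left and other[pos + 1:hi] == right
            )
            total = sum(counts.values())  # at least 1: the read itself
            if counts[read[pos]] / total < threshold:
                fixed[pos] = counts.most_common(1)[0][0]
        corrected.append("".join(fixed))
    return corrected
```

This sketch compares every read with every other read at every position, so it is only practical on small clusters.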

The proposed approach

Create a large window around each evaluated base (for performance reasons)

Calculate a similarity score between the evaluated sequence and every other sequence

For the score, give bases farther from the middle a lower weight

Use the scores to calculate a weighted occurrence rate using all sequences

Use the same weighting when finding the base that is most common in the similar reads, and use that base to replace the incorrect one

Sequences that don't match, or that match only by accident, are effectively discarded because of their negligible score
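The weighted variant can be sketched as follows. The slides do not specify the weighting, so the linear fall-off of the weights and the `sharpness` exponent that pushes weak matches toward a negligible score are assumptions of this sketch, as are all function names.

```python
def position_weights(radius):
    """Weights for distances 1..radius: 1.0 next to the evaluated base,
    falling linearly to 1/radius at the window edge (assumed shape)."""
    return [1.0 - (d - 1) / radius for d in range(1, radius + 1)]

def similarity(read, other, pos, radius):
    """Weighted fraction (0..1) of matching bases around `pos`,
    excluding `pos` itself."""
    matched = total = 0.0
    for d, w in enumerate(position_weights(radius), start=1):
        for i in (pos - d, pos + d):
            if 0 <= i < len(read):
                total += w
                if read[i] == other[i]:
                    matched += w
    return matched / total if total else 0.0

def weighted_correct(reads, radius=20, threshold=0.05, sharpness=8):
    """Replace bases whose weighted occurrence rate is below `threshold`."""
    corrected = []
    for read in reads:
        fixed = list(read)
        for pos in range(len(read)):
            votes = {}
            for other in reads:
                # The exponent makes poorly matching reads count for little.
                score = similarity(read, other, pos, radius) ** sharpness
                votes[other[pos]] = votes.get(other[pos], 0.0) + score
            total = sum(votes.values())
            if total > 0 and votes[read[pos]] / total < threshold:
                fixed[pos] = max(votes, key=votes.get)
        corrected.append("".join(fixed))
    return corrected
```

Unlike exact window matching, every read contributes a vote here, so a single mismatch elsewhere in the window lowers a read's influence instead of excluding it.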

② DEVELOP AN APPROACH FOR FINE TUNING THE ACCURACY OF THE ALGORITHM FOR SEQUENCE ERROR LEVEL ESTIMATION AND FOR SNP AND SPECIES DETECTION IN NGS DATA.

PROBLEM SOLVING

According to Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing [1]:

• As control (reference) sequences we will use the main data set, estimating the accuracy of Algorithm 1 on deviations that were artificially generated beforehand, with three types of ranges defining:

• Whether it is an SNP: ~0.01 % (it may vary)

• Whether it is a different species (DS): ~40-60 %

• Whether it is a sequence error: within the limits between SNP and DS, and greater than DS




• Estimating the accuracy of Algorithm 1 is also connected to the computed variables that define the sequence error:

• the presence of homopolymers

• position in the sequence

• size of the sequence

• spatial localization in PT plates for insertion and deletion errors – this is not a computable element; it is provided by the sequencer

• reference sequence type

• the distance of beads to the region center

• the distance of beads to the plate center




• As their results show that the error rate was heterogeneous along the length of the sequence, we propose to analyze it in several regions, according to the analysis of distributions.

• Another good idea is to group the sequence errors by whether they occurred in insertions, deletions, mismatches or ambiguous base calls.
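Grouping differences by type can be sketched as below, assuming a read and its reference are already aligned to the same length with "-" marking gaps and "N" marking an ambiguous base call. The function name and the category labels are illustrative.

```python
from collections import Counter

def classify_differences(reference, read):
    """Count difference types between two aligned, equal-length strings."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, read_base in zip(reference, read):
        if ref_base == read_base:
            continue
        if read_base == "N":
            counts["ambiguous"] += 1   # ambiguous base call in the read
        elif ref_base == "-":
            counts["insertion"] += 1   # extra base in the read
        elif read_base == "-":
            counts["deletion"] += 1    # base missing from the read
        else:
            counts["mismatch"] += 1    # substitution
    return counts
```

Applying the same function to separate slices of the alignment gives per-region counts, which matches the proposal to analyze the error rate in several regions.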

Defining the Training set

• As we have all this information, we will organize it as a training set for predicting the parameters of Algorithm 1 using Artificial Neural Networks (ANN). This way we will have several groups of input data:

• sequence error parameters

• Kind of produced “error”: sError, SNP, different species

• Where the “error” occurred: in/del, mismatches, ambiguous base call

• Parameters of Algorithm 1

• Another option is to also include an average row of data for each group of Algorithm 1's parameters. This way we will provide a guideline for the rest.
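One possible layout for a training row is sketched below. The concrete fields are assumptions chosen from the groups listed above; in particular `window_radius` and `threshold` are hypothetical stand-ins for the parameters of Algorithm 1, and treating them as the prediction targets is an assumption of this sketch.

```python
from dataclasses import dataclass

ERROR_KINDS = ("sError", "SNP", "different species")
ERROR_SITES = ("in/del", "mismatch", "ambiguous base call")

def one_hot(value, choices):
    """Encode a categorical value as a list of 0.0/1.0 flags."""
    return [1.0 if value == c else 0.0 for c in choices]

@dataclass
class TrainingRow:
    in_homopolymer: bool       # sequence error parameter
    relative_position: float   # position / sequence length, 0..1
    sequence_length: int
    kind: str                  # one of ERROR_KINDS
    site: str                  # one of ERROR_SITES
    window_radius: int         # hypothetical Algorithm 1 parameter
    threshold: float           # hypothetical Algorithm 1 parameter

    def inputs(self):
        """Numeric input vector for the network."""
        return ([float(self.in_homopolymer), self.relative_position,
                 float(self.sequence_length)]
                + one_hot(self.kind, ERROR_KINDS)
                + one_hot(self.site, ERROR_SITES))

    def targets(self):
        """Values the network is asked to predict."""
        return [float(self.window_radius), self.threshold]
```

In practice the sequence length would probably need scaling to a range comparable with the other inputs before training.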

How to choose the best architecture for ANN?

Choosing the best architecture is a complex task.

We propose to use a Genetic Algorithm (GA) for this. The role of the GA is to estimate the internal error of the ANN for different topologies. We are interested only in the best topology, i.e. the one with the smallest maximum internal error. Another option is to use the smallest average error; it is not clear which criterion is better.

The back-propagation algorithm will be the core of the training, because it is well suited to such prediction tasks.

How to choose which results are the best?

• After training and prediction we can apply 3D PCA to the results as a method for visualizing how they cluster. It is very good for taking a quick snapshot of the different groups. Of course, we are open to other visualization methods for graphical analysis.

What can we expect from Algorithm 2?

• ANN and GA are Artificial Intelligence algorithms. They are widely and successfully used for prediction, classification and clustering problems.

• We expect that this will be:

• A tool for fine-tuning, because we will work only with digital data and a big training set, using the latest IT technologies.

• A tool that uses less CPU

• A tool that gives quick results

Current Realization

General approach in metagenomic biodiversity studies

454 Sequencing

Filtering/Denoising

Multiple alignment

Distance matrix

OTU clusters with abundance count

Our approach:

A. Raw data

B. Correction with SHREC

C. Correction with our method

Classification, evaluation and comparison

A. Raw data characteristics and processing

• Two separate runs of metagenomic 16S RNA fragments, sequenced on the 454 platform and converted to FASTA format:

• run 02 – 46429 short reads

• run 04 – 41386 short reads

• Our task – extract, denoise and correct only the quality reads.

A1

• 454 output data

A2

• Filtering by quality and length (300-500 nucleotides)

• run 02 – 22325 short reads

• run 04 – 34717 short reads
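The length part of this filtering step can be sketched as below. Quality filtering is not covered, because FASTA carries no quality scores. The function names are illustrative and are not the scripts used in the project.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_len=300, max_len=500):
    """Keep records whose sequence length is within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]
```

Since any iterable of lines is accepted, an open file object can be passed directly, e.g. `filter_by_length(parse_fasta(open(path)))`, where `path` is a placeholder for the FASTA file of one run.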

Raw data length histogram (panels: Run 02, Run 04)

B. Correction with SHREC

A2

• Filtered data

B1

• Keep only the quality parts of fragments

• Cut to a length of 300 nucleotides

• Remove fragments that have Ns in their sequence

B2

• Correction with SHREC:

• run 02 – 11211 short reads

• run 04 – 30149 short reads

C. Correction with our method:

A2

• Filtered data => keep only the quality parts of fragments

C1

• Calculate distance matrix using Levenshtein distances

• Cluster the data pool into smaller groups using the distance matrix

• Alignment with MUSCLE:

• Multiple alignment of sequences in each group

• Alignment of each group with the others

C2

• Calculate occurrence rate [%] for each base (A/T/G/C) for each position of the short reads

• Determine a value [%] under which a base is regarded as "error"

• Scan reads for errors

• Replace errors with the base that has the highest occurrence rate for the current position among the neighbors
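The distance and clustering part of step C1 can be sketched as below. The edit distance is the standard Levenshtein dynamic programme. The greedy clustering is a simplified stand-in chosen for brevity: it avoids building the full distance matrix and is not necessarily the clustering used in the project. `max_distance` is a hypothetical parameter.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def greedy_cluster(reads, max_distance):
    """Put each read into the first cluster whose representative
    (its first member) is within `max_distance`; otherwise start a
    new cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if levenshtein(read, cluster[0]) <= max_distance:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters
```

Each resulting group is small enough to be passed to a multiple aligner on its own, which is the point of clustering before alignment.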

Classification and performance evaluation

1

• At every step sequences are classified with ClaMS:

• classification against ClaMS' own database

• classification of the two runs against each other

2

• Comparison of the classification results from the last step of each variant (A, B, C)

ClaMS parameters:

Distance cut-off: 0.05

Signature type: DBC

k-mer length: 3

Existing taxonomy: 4th Level

Result comparison

• We have comparison results for the two sequencer runs, which are similar in most cases.

• The next three slides show the differences in finding some specific classes of organisms, and individual organisms, using four different error correction methods, grouped into: Original (i.e. without error correction), SHREC, and ours – AF2 and ANS.

• The levels of classification are derived from ClaMS, where: the 0th level is the root, the 1st level gives the classification classes Bacteria, Archaea and Eukaryota, and the 2nd and 3rd levels are more specific and present a more detailed classification of contigs.

Improvements

Fuzzy estimation of coincidence in the window

Increasing the size of the window and assigning higher weights to the positions close to the examined one

Creating a preliminary evolutionary tree and excluding distant organisms when making the estimation.

Realization Problems

• The result is strongly dependent on:

• The data preprocessing

• The alignment approach

• Every approach for error correction also modifies some of the correct short reads.

Conclusions & Future work

Used software and papers

Conclusion

• Working with large amounts of data is easier when they are clustered into groups before alignment;

• Modifying existing algorithms or creating new ones allows more data to be preserved according to custom requirements;

• The error correction method is important for better classification of the data.

• The detected presence of organism classes depends on the error correction method used.

Future work

• Improvement and validation of our approach;

• Attaining more precise error detection, correction and classification of metagenomic data

• Comparison of the results with those of other solutions to the problem

• Studying different approaches and comparing their effect on the final results

• Estimating the significance of preserving rare mutations vs removing the “wrong” short reads.

• Creating an approach for measuring the quality of the generated evolutionary tree

• Using the estimation of the tree in the error evaluation

• Comparison with public databases

• Working on Algorithm 2

Sphere of Application

• Estimating the biodiversity

• Creating an evolutionary tree

Software used in this project:

• Python: http://www.python.org/

• Cython: http://cython.org/

• MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/

• Muscle: http://www.drive5.com/muscle/

• SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/

• ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/

• NINJA (modified): http://nimbletwist.com/software/ninja/index.html

• R-package: http://www.r-project.org/

Papers

1. Gilles, A., Meglecz, E., Pech, N., Ferreira, S., Malausa, T., and Martin, J. F. (2011). Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics, 12(1):245.

Technology
