
High Sensitivity Sanger Sequencing for Minor Variant Detection


Harrison Leong, Edgar Schreiber, Stephan Berosik, Shiaw-Min Chen, Wallace George, Jeffrey Marks, Stephanie Schneider

ThermoFisher Scientific, Genetic Sciences Division, 200 Oyster Point Blvd., South San Francisco, CA, 94080

RESULTS

Table 1 shows the sensitivity and specificity for allele frequencies of 5% and 10%. Although we have been able to detect variants at allele frequencies of 0.6125%, 1%, 1.25%, 2%, and 2.5%, the algorithm did not meet the LOD criteria of 95% sensitivity and 99% specificity at these extremely low levels.
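The LOD criteria above reduce to two ratios computed from confusion counts. The following is a minimal sketch of that arithmetic; the function names and the example counts are hypothetical illustrations, not values from this study.

```python
def sensitivity(tp, fn):
    """Fraction of true variants that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-variants that were correctly left uncalled."""
    return tn / (tn + fp)


def meets_lod(tp, fn, tn, fp, min_sens=0.95, min_spec=0.99):
    """Check the LOD criteria quoted in the text: >= 95% sensitivity and >= 99% specificity."""
    return sensitivity(tp, fn) >= min_sens and specificity(tn, fp) >= min_spec


# Hypothetical counts chosen only to illustrate the check (not study data).
print(meets_lod(tp=96, fn=4, tn=9990, fp=10))   # True
print(meets_lod(tp=90, fn=10, tn=9990, fp=10))  # False: sensitivity is 90%
```

Because non-variant positions vastly outnumber variant positions in a trace, a small number of false positives barely moves specificity, which is one reason both criteria are checked together.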

INTRODUCTION

Detecting minor genetic variants has become essential to cancer and infectious disease management. Many have turned to next generation sequencing to fill this need, given the common perception that the limit of detection (LOD) for Sanger sequencing is somewhere between 15% and 25% [1,2,3]. We have discovered a software algorithmic solution that reduces this detection limit to 5%, and we have demonstrated detection at even lower allele frequencies. Standard Sanger sequencing protocols can be used, and the method generates the familiar electropherogram data display with noise substantially reduced. This opens up an alternative for detecting low-level somatic variants. The key observation that enabled this development is that the noise underlying Sanger sequencing fluorescence data (traces) appears to be highly correlated with the primary sequence in the data. Figure 1 shows the electropherograms from two different samples: the control sample has the same primary sequence as the test sample, which contains a few minor variants.

CONCLUSIONS

It should now be possible to achieve a reference-based limit of detection of 5% allelic proportion with standard Sanger sequencing protocols. Existing protocols for visually reviewing the results can also be used and are enhanced because the algorithm generates results in the form of familiar electropherograms for which the noise has been substantially diminished. These two features of the algorithm may give Sanger sequencing performance and/or economic advantages in some molecular diagnostic applications that require finding minor genetic variants.

NOTE: Results on clinical samples can be found at www.thermofisher.com/sangeroncology. The algorithm has been embedded within ThermoFisher Scientific’s Minor Variant Finder software (www.thermofisher.com/mvf).

REFERENCES

1. Lin, M.T. et al. (2014), American Journal of Clinical Pathology, June 2014; 141:856-866.

2. Jancik, S. et al. (2012), Journal of Experimental & Clinical Cancer Research, 2012; 31:79:1-13.

3. Tsiatis, A.C. et al. (2010), Journal of Molecular Diagnostics, July 2010; 12:4:425-432.

4. Wang, G. and Guo, L. (2013), Journal of Applied Mathematics, 2013; article 696491.


Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com

TT27

These are the key steps in the noise minimization algorithm:

a)  for each of control and test traces, find the range of base positions where the sequence data quality is consistently high;

b)  find the intersection of the high-sequence-quality ranges between the control and test sample traces; do the following within that intersection:

c)  remove the trace components associated with the primary bases leaving the non-primary traces;

d)  locally expand or contract and/or strengthen or weaken the non-primary traces of the control sample to maximize correspondence between the non-primary traces of the control and test samples;

e)  subtract the manipulated non-primary traces from the test sample traces;

f)  suppress non-primary peaks that are obviously not variant peaks (set them to zero) based on several peak characteristics such as amplitude, width, alignment with the primary trace peak, etc.

This process is applied to traces from both forward and reverse sequencing reactions. The outcome is a pair of noise-minimized traces (forward and reverse) for the test sample that can be displayed for review in the familiar electropherogram format. These traces are passed to the second stage of the algorithm, which detects variants automatically.
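The noise minimization steps can be illustrated with a much-simplified sketch. It assumes each non-primary trace is a plain list of fluorescence values, and it covers only the range intersection of step (b), a windowed amplitude adjustment standing in for step (d), the subtraction of step (e), and an amplitude-only version of step (f). The local expansion and contraction along the time axis and the other peak characteristics are omitted. All function names, the window size, and the amplitude cutoff are hypothetical.

```python
def intersect_ranges(a, b):
    """Intersection of two inclusive (start, end) base-position ranges, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def best_gain(control, test):
    """Least-squares gain g that minimizes sum((test - g * control) ** 2)."""
    denom = sum(c * c for c in control)
    if denom == 0:
        return 0.0
    return sum(c * t for c, t in zip(control, test)) / denom


def minimize_noise(control, test, window=4, min_amplitude=5.0):
    """Subtract a locally rescaled copy of the control noise from the test trace."""
    residual = []
    for i in range(0, len(test), window):
        c = control[i:i + window]
        t = test[i:i + window]
        g = best_gain(c, t)  # strengthen or weaken the control locally
        residual.extend(tv - g * cv for cv, tv in zip(c, t))
    # Amplitude-only suppression: set small or negative residuals to zero.
    return [r if r >= min_amplitude else 0.0 for r in residual]


# Toy non-primary traces: the test trace carries one extra peak at index 1.
control = [10.0, 0.0, 10.0, 0.0, 8.0, 16.0, 8.0, 4.0]
test = [10.0, 40.0, 10.0, 0.0, 16.0, 32.0, 16.0, 8.0]

print(intersect_ranges((50, 400), (80, 350)))  # (80, 350)
print(minimize_noise(control, test))  # [0.0, 40.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

In this toy case the extra peak falls where the control is zero, so it does not bias the fitted gain. With real traces a variant peak would pull the least-squares gain upward, which is one reason a practical method needs a more careful correspondence step than this sketch provides.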

AUTOMATED VARIANT DETECTION

For variant detection, the forward and reverse noise-minimized traces of the test sample are examined for any remaining peaks. These peaks are scrutinized by a set of five interconnected multivariate classification functions that decide whether there is a bona fide variant at a given base position and, if so, its base identity. The final thresholds of four of these functions are optimized for classification accuracy using an algorithm based on swarm theory [4].
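To make the threshold optimization concrete, here is a minimal particle swarm sketch that tunes a single threshold for classification accuracy. It is a generic, textbook-style particle swarm and not the specific algorithm of reference 4; the one-dimensional search, the evenly spaced starting positions, the coefficient values, and the toy scores are all assumptions made for illustration.

```python
import random


def accuracy(threshold, scores, labels):
    """Fraction of peaks classified correctly when score >= threshold means 'variant'."""
    correct = sum((s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(scores)


def pso_threshold(scores, labels, n_particles=11, n_iter=30, seed=0,
                  inertia=0.6, c_personal=1.5, c_global=1.5):
    """Search [0, 1] for the threshold with the highest accuracy."""
    rng = random.Random(seed)
    # Evenly spaced starting positions (an illustrative choice, not a requirement).
    pos = [i / (n_particles - 1) for i in range(n_particles)]
    vel = [rng.uniform(-0.1, 0.1) for _ in pos]
    best_pos = pos[:]
    best_fit = [accuracy(p, scores, labels) for p in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g], best_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            # Velocity blends momentum, pull toward the particle's own best,
            # and pull toward the swarm's best.
            vel[i] = (inertia * vel[i]
                      + c_personal * rng.random() * (best_pos[i] - pos[i])
                      + c_global * rng.random() * (g_pos - pos[i]))
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))
            fit = accuracy(pos[i], scores, labels)
            if fit > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i], fit
            if fit > g_fit:
                g_pos, g_fit = pos[i], fit
    return g_pos, g_fit


# Toy pre-thresholded classifier scores (hypothetical): four non-variants, four variants.
scores = [0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90]
labels = [False, False, False, False, True, True, True, True]
threshold, acc = pso_threshold(scores, labels)
print(threshold, acc)
```

Accuracy is a step function of the threshold, so gradient-based optimizers have nothing to follow; a population-based search such as this one only needs to evaluate the objective, which is a plausible reason to prefer it for threshold tuning.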

DATA FOR DEVELOPING AND TESTING THE METHOD

Samples came from 22 amplicons associated with eight different genes: TP53, KRAS, BRAF, EGFR, FLT3, RB1, CDH1, and ERBB2. Many of these were extracted from formalin-fixed, paraffin-embedded samples. Some were commercially available reference standards (Acrometrix); others were quantified using the RNase-P quantitative polymerase chain reaction assay and serially diluted. Allelic proportions spanned 0.6125% to 50%. These samples were amplified, sequenced, and pre-processed using standard protocols and tools for fluorescent dye terminator Sanger sequencing from Applied Biosystems™. One third of these data were used for developing the algorithms. The remaining two thirds were used for evaluating the performance characteristics of the method.

Figure 1. Noise underlying two different samples looks very similar when their primary sequences are the same. [Panels: Control Sample; Test Sample with Variants]

Figure 1: Electropherograms from two different physical samples showing the underlying noise; note the close similarity between the two. Only the bottom 200 relative fluorescence units (RFUs) are shown. The primary peaks reach about 1000 RFUs.

A two-part algorithm has been developed to exploit this observation. The first part minimizes the noise that underlies the traces. The second part detects variants, if any, in the noise-minimized traces. This communication describes the algorithmic details and shows test results.

MATERIALS AND METHODS

NOISE MINIMIZATION

For noise minimization, a model of the noise in the traces of the test sample is made from traces of the control sample and this model is subtracted from the traces of the test sample.

The key steps of the variant detection algorithm, including construction of the classification engine, are as follows:

a)  Compute metrics on trace peaks such as the location of a peak relative to that of the nearest primary base, symmetry of the peak, sharpness of a peak relative to that of its nearest primary base, etc.;

b)  Classify the largest non-primary peaks based on each peak metric alone to the degree that a peak can be unambiguously classified in this manner;

c)  For the peaks that cannot be classified in step (b), construct two discriminant functions, one based on peak metrics that combine forward and reverse information (x-strand), one based on peak metrics that do not combine the two (s-strand). Within each of these two categories, generate discriminant functions for all possible combinations of metrics belonging to the category and choose the function with the highest performance;

d)  Use the s-strand classifier to generate additional peak metrics; e.g., the probability ratio between variant and non-variant peaks based on s-strand pre-thresholded output;

e)  Make two additional discriminant functions: one for peaks categorized as variants by the x-strand classifier and the second for peaks categorized as non-variants by the x-strand classifier. Metrics of steps (c) and (d) are used to create these discriminant functions using the feature selection process of (c);

f)  Use a global optimization algorithm (one based on swarm theory was used) to find optimum final threshold values for the four discriminant functions of steps (c) and (e).

The classifiers of step (e) deliver the final judgment on whether a peak is associated with a variant or non-variant. Figure 2 summarizes the complete classification engine.
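The exhaustive search in step (c), which builds a discriminant function for every combination of metrics and keeps the best one, can be sketched as follows. The poster does not specify the form of the discriminant functions, so a simple nearest-centroid linear discriminant is assumed here. The toy metric values are hypothetical, and scoring on the training rows is a simplification (the study held out two thirds of its data for evaluation).

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_discriminant(X, y, features):
    """Linear discriminant through the midpoint of the two class centroids (assumed form)."""
    def pick(row):
        return [row[j] for j in features]

    mu_v = centroid([pick(r) for r, lab in zip(X, y) if lab])
    mu_n = centroid([pick(r) for r, lab in zip(X, y) if not lab])
    w = [a - b for a, b in zip(mu_v, mu_n)]
    bias = -sum(wi * (a + b) / 2 for wi, a, b in zip(w, mu_v, mu_n))

    def score(row):
        # Positive score means 'variant', negative means 'non-variant'.
        return sum(wi * row[j] for wi, j in zip(w, features)) + bias

    return score


def best_subset(X, y, n_features):
    """Try every non-empty combination of metrics; return (accuracy, feature indices)."""
    best = None
    for k in range(1, n_features + 1):
        for features in combinations(range(n_features), k):
            score = fit_discriminant(X, y, features)
            acc = sum((score(r) > 0) == lab for r, lab in zip(X, y)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, features)
    return best


# Hypothetical peak metrics: column 0 separates the classes, columns 1 and 2 do not.
X = [[0.90, 0.5, 0.1], [0.80, 0.2, 0.9], [0.85, 0.7, 0.4],
     [0.20, 0.6, 0.2], [0.10, 0.3, 0.8], [0.15, 0.5, 0.5]]
y = [True, True, True, False, False, False]
print(best_subset(X, y, n_features=3))  # (1.0, (0,))
```

The number of combinations grows as 2^n - 1 for n metrics, so an exhaustive search of this kind is practical only when each category holds a modest number of metrics.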

Figure 2. The classification engine for variant detection. [Flowchart boxes: INPUT DATA (forward control, forward test, reverse control, reverse test); Fwd and Rvs test, noise minimized; Single-strand metrics (peak height, width, sharpness, symmetry, signal to noise, etc.); Cross-strand metrics, combining fwd and rvs information (base complementarity, relative peak amplitude, relative width, etc.); Signal to noise outliers are variant candidates; Classifier for clear cut cases; Classifier based on cross-strand metrics; Meta-metrics from classifier based on single-strand metrics (var/non-var probability ratio, pre-thresholded score, etc.); Classifier to override cross-strand variant calls (all metrics); Classifier to override cross-strand non-variant calls (all metrics); OUTPUT RESULTS (variant locations and base identities)]

Figure 2: Trace data enters at the upper left and detected variants, if any, are reported out at the bottom. The figure illustrates that the decision making process is layered so that easy decisions are made first and only the trace peaks that cannot be clearly classified are funneled down into the deeper levels of analysis. This allows the classifier at each level to concentrate on a smaller set of the data which may have a simpler statistical structure.
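The layered decision flow can be sketched as a short cascade. The control flow follows the description of the classification engine: clear-cut cases are settled first, then the cross-strand classifier routes each remaining peak to one of two override classifiers, which give the final verdict. The stand-in classifiers, metric names, and cutoff values below are hypothetical placeholders, not the trained functions.

```python
def classify_peak(peak, clear_cut, cross_strand, override_variant, override_non_variant):
    """Return True if the peak is judged a variant, False otherwise."""
    verdict = clear_cut(peak)  # True, False, or None when the case is ambiguous
    if verdict is not None:
        return verdict
    if cross_strand(peak):
        # Provisional variant call: a second classifier may overturn it.
        return override_variant(peak)
    # Provisional non-variant call: a second classifier may overturn it.
    return override_non_variant(peak)


# Hypothetical stand-in classifiers operating on a dict of peak metrics.
def clear_cut(p):
    if p["snr"] > 20:
        return True
    if p["snr"] < 2:
        return False
    return None


def cross_strand(p):
    return p["strand_agreement"] >= 0.5


def override_variant(p):
    return p["s_ratio"] >= 0.2


def override_non_variant(p):
    return p["s_ratio"] >= 5.0


stages = (clear_cut, cross_strand, override_variant, override_non_variant)
peaks = [
    {"snr": 30, "strand_agreement": 0.9, "s_ratio": 1.0},
    {"snr": 8, "strand_agreement": 0.1, "s_ratio": 1.0},
    {"snr": 8, "strand_agreement": 0.1, "s_ratio": 9.0},
]
print([classify_peak(p, *stages) for p in peaks])  # [True, False, True]
```

Passing the stages in as functions keeps the cascade independent of how each classifier is built, so any stage can be retrained or replaced without touching the control flow.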

Figure 3 shows results of applying noise minimization to the forward sequencing orientation of a sample with three variants at an allele frequency of 1.25%. The central panel shows the traces before minimization. The process clearly reveals the three variant peaks. The red marks in the bottom panel indicate where the automated variant detection algorithm called out variants.

Figure 3. Noise minimized trace example (bottom panel). [Panels: Control Sample; 1.25% Variant Test Sample; 1.25% Test Sample, Noise Minimized. Annotations: "KB Basecaller misses the variants"; "algorithm finds the variants"]

Figure 3: Noise minimization reveals 1.25% minor variants deeply embedded in the noise underlying Sanger trace data. The high similarity in the noise between the control (top panel) and test (middle panel) traces allows much of the noise to be removed (bottom panel).

TABLE 1: Algorithm performance for allele frequencies meeting LOD criteria

Variant Level | Sensitivity | Specificity | Datasets | Total True Variants | Total True Non-variants

5% | 95.9% | 99.8% | 704 | 785 | 229623

10% | 98.8% | 99.8% | 454 | 503 | 163037