6
Instant spectral assignment for advanced decision tree-driven mass spectrometry Derek J. Bailey a,b , Christopher M. Rose a,b , Graeme C. McAlister a , Justin Brumbaugh b,c , Pengzhi Yu c , Craig D. Wenger a , Michael S. Westphall a,b , James A. Thomson c,d,e,1 , and Joshua J. Coon a,b,f a Departments of Chemistry, f Biomolecular Chemistry, d Cell and Regenerative Biology, b Genome Center of Wisconsin, University of Wisconsin, Madison, WI 53706; e Department of Molecular, Cellular, and Developmental Biology, University of California, Santa Barbara, CA 93106; and c Morgridge Institute for Research, Madison, WI 53715 Contributed by James A. Thomson, April 3, 2012 (sent for review January 23, 2012) We have developed and implemented a sequence identification algorithm (inSeq) that processes tandem mass spectra in real-time using the mass spectrometers (MS) onboard processors. The inSeq algorithm relies on accurate mass tandem MS data for swift spectral matching with high accuracy. The instant spectral processing tech- nology takes 16 ms to execute and provides information to enable autonomous, real-time decision making by the MS system. Using inSeq and its advanced decision tree logic, we demonstrate (i) real- time prediction of peptide elution windows en masse (3 min width, 3,000 targets), (ii) significant improvement of quantitative precision and accuracy (~3x boost in detected protein differences), and (iii) boosted rates of posttranslation modification site localiza- tion (90% agreement in real-time vs. offline localization rate and an approximate 25% gain in localized sites). The decision tree logic en- abled by inSeq promises to circumvent problems with the conven- tional data-dependent acquisition paradigm and provides a direct route to streamlined and expedient targeted protein analysis. T he shotgun sequencing method has rapidly evolved over the past two decades (1, 2). In this strategy eluting peptide cations have their mass-to-charge (mz) values measured in the MS 1 scan. Then precursor mz values are selected for a series of se- quential tandem MS events (MS 2 ). This succession is cycled for the duration of the analysis. The process, called data-dependent acquisition (DDA), is at the very core of shotgun analysis and has not changed for over 15 y, however, MS hardware has. Major im- provements in MS sensitivity, scan rate, mass accuracy, and reso- lution have been achieved. Orbitrap hybrid systems, for example, routinely achieve low ppm mass accuracy with MS/MS repetition rates of 510 Hz (3, 4). Constant operation of such systems gen- erates hundreds of thousands of spectra in hours. These MS 2 spectra are then mapped to sequence using database search algo- rithms (57). The DDA sampling strategy offers an elegant simplicity and has proven highly useful for discovery-driven proteomics. Of re- cent years, however, emphasis has shifted from identification to quantification-often with certain targets in mind. In this context, faults in the DDA approach have become increasingly evident. There are two primary limitations of the DDA approach: First, is poor run-to-run reproducibility and, second, is the inability to effectively target peptides of interest (8). Hundreds of peptides often coelute so that low-level signals often are selected in one run and not the next, and selecting mz peaks to sequence by abundance certainly does not offer the opportunity to inform the system of preselected targets. Several DDA add-ons and alternatives have been examined. Sampling depth, for example, can be increased by preventing se- lection of an mz value identified in a prior technical replicate (PAnDA) (9). Irreproducibility can be somewhat countered by in- forming the DDA algorithm of the precursor mz values of desired targets (inclusion list)if observed, this can ensure their selection for MS 2 . Frequently, however, low abundance peptides may not have precursor signals above noise so that a MS 2 scan, which is requisite for identification, is never triggered. This conundrum is avoided altogether in the data-independent acquisition approach (DIA) (10). Here, no attention is paid to precursor abundance or even presence; instead, consecutive mz isolation windows are dis- sociated and mass analyzed. A main drawback of DIA is that it requires significantly more instrument analysis time as MS 2 scans from every mz window must be collected (11). As such, DDA analysis remains the preeminent method for MS data acquisition. Besides improvements in MS analyzer performance, numerous alternative dissociation methods and scan types have recently ad- vanced. These include collision, electron and photon-based frag- mentation [i.e., High-Energy Collisional Dissociation (HCD), Electron-Transfer Dissociation (ETD), Infrared Multiphoton Dis- sociation, etc.], specialized quantification scans (i.e., QuantMode), or simply analysis using varied precursor ion targets, mz accuracy, etc. (1217). Each of these techniques shows applicability and superlative performance for a subset of peptide precursors. The result is a dizzying alphabet soup of techniques, scan types, and parameter space that is not easily integrated into the current data acquisition paradigm. Recently, we introduced a decision tree (DT) algorithm that used precursor m, z, and mz to automatically de- termine, in real time, whether to employ CAD or ETD during MS 2 (18). The approach significantly improved sequencing success rates and was an important step in a movement toward development of informed acquisition. Here we describe the next advance in DT acquisition technol- ogyinstant sequence confirmation (inSeq). The inSeq algorithm processes MS 2 spectra at the moment of collection using the MS systems onboard processing power. With sequence in hand, the MS acquisition system can process this knowledge to make auton- omous, real-time decisions about what type of scan to trigger next. Here, with the inSeq instant identification algorithm, we ex- tend our simple DT method by adding several different decision nodes. These nodes enable automated functionalities including real-time elution prediction, advanced quantification, posttran- slation modification (PTM) localization, large-scale targeted pro- teomics, and increased proteome coverage, among others. This technology provides a direct pathway to transform the current passive data collection paradigm. Specifically, knowing the iden- tity of a peptide that is presently eluting into the MS system per- mits an ensemble of advanced automated decision-making logic. Results Instant Sequence Confirmation (inSeq). To develop an advanced DT acquisition scheme that can seamlessly incorporate the myriad of specialized procedures and scans available on modern day MS systems, we must expedite the spectral analysis processi.e., Author contributions: D.J.B., C.M.R., G.C.M., M.S.W., J.A.T., and J.J.C. designed research; D.J.B., C.M.R., G.C.M., J.B., P.Y., and C.D.W. performed research; D.J.B., C.M.R., G.C.M., and J.J.C. analyzed data; and D.J.B., C.M.R., and J.J.C. wrote the paper. The authors declare no conflict of interest. 1 To whom correspondence should be addressed. E-mail: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/ doi:10.1073/pnas.1205292109/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1205292109 PNAS May 29, 2012 vol. 109 no. 22 84118416 CHEMISTRY

Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

  • Upload
    trinhtu

  • View
    215

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

Instant spectral assignment for advanced decisiontree-driven mass spectrometryDerek J. Baileya,b, Christopher M. Rosea,b, Graeme C. McAlistera, Justin Brumbaughb,c, Pengzhi Yuc, Craig D. Wengera,Michael S. Westphalla,b, James A. Thomsonc,d,e,1, and Joshua J. Coona,b,f

aDepartments of Chemistry, fBiomolecular Chemistry, dCell and Regenerative Biology, bGenome Center of Wisconsin, University of Wisconsin, Madison, WI53706; eDepartment of Molecular, Cellular, and Developmental Biology, University of California, Santa Barbara, CA 93106; and cMorgridge Institute forResearch, Madison, WI 53715

Contributed by James A. Thomson, April 3, 2012 (sent for review January 23, 2012)

We have developed and implemented a sequence identificationalgorithm (inSeq) that processes tandem mass spectra in real-timeusing the mass spectrometer’s (MS) onboard processors. The inSeqalgorithm relies on accurate mass tandemMS data for swift spectralmatching with high accuracy. The instant spectral processing tech-nology takes ∼16 ms to execute and provides information to enableautonomous, real-time decision making by the MS system. UsinginSeq and its advanced decision tree logic, we demonstrate (i) real-time prediction of peptide elution windows en masse (∼3 minwidth, 3,000 targets), (ii) significant improvement of quantitativeprecision and accuracy (~3x boost in detected protein differences),and (iii) boosted rates of posttranslation modification site localiza-tion (90% agreement in real-time vs. offline localization rate and anapproximate 25% gain in localized sites). The decision tree logic en-abled by inSeq promises to circumvent problems with the conven-tional data-dependent acquisition paradigm and provides a directroute to streamlined and expedient targeted protein analysis.

The shotgun sequencing method has rapidly evolved over thepast two decades (1, 2). In this strategy eluting peptide cations

have their mass-to-charge (m∕z) values measured in the MS1

scan. Then precursor m∕z values are selected for a series of se-quential tandem MS events (MS2). This succession is cycled forthe duration of the analysis. The process, called data-dependentacquisition (DDA), is at the very core of shotgun analysis and hasnot changed for over 15 y, however, MS hardware has. Major im-provements in MS sensitivity, scan rate, mass accuracy, and reso-lution have been achieved. Orbitrap hybrid systems, for example,routinely achieve low ppm mass accuracy with MS/MS repetitionrates of 5–10 Hz (3, 4). Constant operation of such systems gen-erates hundreds of thousands of spectra in hours. These MS2

spectra are then mapped to sequence using database search algo-rithms (5–7).

The DDA sampling strategy offers an elegant simplicity andhas proven highly useful for discovery-driven proteomics. Of re-cent years, however, emphasis has shifted from identification toquantification-often with certain targets in mind. In this context,faults in the DDA approach have become increasingly evident.There are two primary limitations of the DDA approach: First,is poor run-to-run reproducibility and, second, is the inability toeffectively target peptides of interest (8). Hundreds of peptidesoften coelute so that low-level signals often are selected in onerun and not the next, and selecting m∕z peaks to sequence byabundance certainly does not offer the opportunity to informthe system of preselected targets.

Several DDA add-ons and alternatives have been examined.Sampling depth, for example, can be increased by preventing se-lection of an m∕z value identified in a prior technical replicate(PAnDA) (9). Irreproducibility can be somewhat countered by in-forming theDDA algorithm of the precursorm∕z values of desiredtargets (inclusion list)—if observed, this can ensure their selectionfor MS2. Frequently, however, low abundance peptides may nothave precursor signals above noise so that a MS2 scan, which isrequisite for identification, is never triggered. This conundrum

is avoided altogether in the data-independent acquisition approach(DIA) (10). Here, no attention is paid to precursor abundance oreven presence; instead, consecutivem∕z isolation windows are dis-sociated and mass analyzed. A main drawback of DIA is that itrequires significantly more instrument analysis time as MS2 scansfrom every m∕z window must be collected (11). As such, DDAanalysis remains the preeminent method for MS data acquisition.

Besides improvements in MS analyzer performance, numerousalternative dissociation methods and scan types have recently ad-vanced. These include collision, electron and photon-based frag-mentation [i.e., High-Energy Collisional Dissociation (HCD),Electron-Transfer Dissociation (ETD), Infrared Multiphoton Dis-sociation, etc.], specialized quantification scans (i.e., QuantMode),or simply analysis using varied precursor ion targets,m∕z accuracy,etc. (12–17). Each of these techniques shows applicability andsuperlative performance for a subset of peptide precursors. Theresult is a dizzying alphabet soup of techniques, scan types, andparameter space that is not easily integrated into the current dataacquisition paradigm. Recently, we introduced a decision tree (DT)algorithm that used precursor m, z, and m∕z to automatically de-termine, in real time, whether to employ CAD or ETD duringMS2

(18). The approach significantly improved sequencing success ratesand was an important step in a movement toward development ofinformed acquisition.

Here we describe the next advance in DTacquisition technol-ogy—instant sequence confirmation (inSeq). The inSeq algorithmprocesses MS2 spectra at the moment of collection using the MSsystem’s onboard processing power. With sequence in hand, theMS acquisition system can process this knowledge to make auton-omous, real-time decisions about what type of scan to triggernext. Here, with the inSeq instant identification algorithm, we ex-tend our simple DT method by adding several different decisionnodes. These nodes enable automated functionalities includingreal-time elution prediction, advanced quantification, posttran-slation modification (PTM) localization, large-scale targeted pro-teomics, and increased proteome coverage, among others. Thistechnology provides a direct pathway to transform the currentpassive data collection paradigm. Specifically, knowing the iden-tity of a peptide that is presently eluting into the MS system per-mits an ensemble of advanced automated decision-making logic.

ResultsInstant Sequence Confirmation (inSeq). To develop an advanced DTacquisition scheme that can seamlessly incorporate the myriad ofspecialized procedures and scans available on modern day MSsystems, we must expedite the spectral analysis process—i.e.,

Author contributions: D.J.B., C.M.R., G.C.M., M.S.W., J.A.T., and J.J.C. designed research;D.J.B., C.M.R., G.C.M., J.B., P.Y., and C.D.W. performed research; D.J.B., C.M.R., G.C.M.,and J.J.C. analyzed data; and D.J.B., C.M.R., and J.J.C. wrote the paper.

The authors declare no conflict of interest.1To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1205292109/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1205292109 PNAS ∣ May 29, 2012 ∣ vol. 109 ∣ no. 22 ∣ 8411–8416

CHEM

ISTR

Y

Page 2: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

from offline to real-time. There are two obvious pathways to in-corporate real-time spectral analysis within an MS system. Thefirst approach exports spectra for processing with an external com-puting system followed by import of the search outcome (19). Asecond, more elegant strategy, is to perform all computation with-in the MS onboard computing system (20). The former approachcircumvents complications in accessing instrument firmware andallows for the use of more sophisticated processing power; how-ever, a serious constraint is the time required for import/export ofthe information (i.e., ∼40 ms). We have pursued technologies andcomputational algorithms that integrate real-time spectral analysisinto the MS system’s onboard processors and firmware. We callthis method inSeq. Experimental details (e.g., peptide candidates,scan sequences, etc.) are transferred on demand along with theinstrument’s method file to the instrument before the experimentcommences allowing for flexibility in experimental design withminimum configuration. To establish robustness across platforms,we implemented inSeq on two distinct MS systems (operating withdifferent code bases)—a dual-cell, quadrupole, linear ion trap-Orbitrap hybrid (LTQ-Velos Orbitrap) and a quadrupole mass fil-ter-Orbitrap hybrid (Q-Exactive). In both cases, we modified andextended the instrument firmware to quickly (∼<20 ms in thecase of themore modern Q-Exactive system) and accurately [<2%false discovery rate (FDR)] mapMS/MS spectra to sequence. Theembedded peptide database-matching algorithm processes MS/MS scans immediately (Fig. 1 A–B) by comparing product ionspresent in the MS/MS scan to those from peptide candidatespreloaded onto the instrument’s firmware. Note the candidatesequences are first filtered so that only sequences whose massis within a small window (e.g., 5–50 ppm) of the sampled precursorneutral mass are considered (Fig. 1C). For each candidate se-quence the number of þ1 product ions (þ2 ions are included forprecursors > þ2) that matched the spectrum at a mass tolerance<10 ppm is recorded (Fig. 1C). Next, it uses straightforward scor-ing metrics providing sufficient evidence for the confirmation of aputative sequence without burdening the system with nonessentialcalculations. On both MS platforms, the real-time confirmationalgorithm was expediently executed and required no hardwaremodification taking an average of 16 ms to perform (Q-Exactive,Fig. S1). To confirm that this small overhead does not affect theoverall duty cycle, we compared the number of MS/MS scans per-

formed when inSeq was and was not operating (9,076 DDA vs.8,908 DDA with inSeq, ∼1.6%). The number of MS1 scans forthe peptide IVGIVSGELNNAAAK within its elution profilefurther demonstrates the negligible impact on duty cycle as 20MS1 scans were taken with inSeq inactive as compared to 19 scanswith inSeq active (Fig. S2).

To characterize the inSeq algorithm, we performed a nHPLC-MS/MS experiment on tryptic peptides derived from human em-bryonic stem cells. A database consisting of all theoretical trypticpeptides (up to three missed cleavages, 6–50 in length) containedwithin the human proteome was uploaded to the instrument’s (Q-Exactive) onboard computer. A DDA method was employed andanalysis proceeded as usual except following each MS/MS scanthe inSeq algorithm was executed and the results logged. Thismanifest of instant identifications was then compared to thosemade postacquisition via traditional database searching at a1% FDR (reverse-decoy method). We assumed the conventionalpostacquisition approach to represent the true answer and com-pared the number of correct instant spectral identifications as afunction of matched product ions (Fig. 1D). From these data weconclude the detection of >6 product ions at high mass accuracy(<10 ppm) by the inSeq algorithm produces the correct sequenceidentification >98% of the time. To determine the impact of inSeqon depth of protein coverage, we compared OMSSA identifica-tions with inSeq identifications (species with >6 matching peaks)(Fig. S3). Traditional postacquisition searching identified morepeptides than inSeq (11,095 vs. 7,910, respectively) indicatingstrong initial performance but also room for further developmentof a more sophisticated real-time scoring algorithm.

The inSeq method represents a straightforward approach tocorrelate sequence to spectrum and is positioned to becomean essential technology in transforming the current passive datacollection paradigm. Specifically, learning the identity of a pep-tide that is presently eluting into the MS system permits an en-semble of advanced automated decision-making logic. Theseconcepts build upon our previous development of the data-de-pendent DT method. There we embedded an onboard algorithmto make real-time decisions of which fragmentation method toengage based on precursor charge (z) and m∕z. Here, with theinSeq instant identification algorithm, we extend our simple DTmethod by adding several different decision nodes (Fig. S4).

A

F E

B C D

Fig. 1. Progression of the inSeq logic. (A) A nHPLC-MS/MS chromatogram at 48.35 min along with an MS2 scan (B) that was acquired at that time followingdissociation of a þ2 feature of m∕z 737.86. Upon collection of the MS2 scan, inSeq groups all peptide candidates (in silico, n ¼ 94) whose theoretical mass arewithin 30 ppm of the experimentally determined precursor neutral mass 1473.756 (C). Then inSeq performs in silico fragmentation to produce a theoreticalproduct ion series for each of the 94 candidates and proceeds to compare each to the experimental spectrum (<10 ppm mass accuracy). (D) Plot of inSeqidentifications compared to conventional postacquisition searching. When >6 fragment ions match, inSeq agrees (True Positive) >98% of the time.

8412 ∣ www.pnas.org/cgi/doi/10.1073/pnas.1205292109 Bailey et al.

Page 3: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion, large-scale targeted proteomics, and increased proteomecoverage, among others (Fig. 1F).

Predicting Peptide Elution. Liquid chromatography is the conven-tional approach to fractionate highly complex peptide mixturesprior to measurement by MS. The highest MS sensitivity isachieved when one tunes the MS system to detect a given target(i.e., execute MS/MS) regardless of its presence in the precedingMS1 event (i.e., selected reaction monitoring). SRM measure-ments deliver sensitivity and reproducibility at the cost of band-width. Specifically, if one does not know the elution time of atarget, the duration of the nHPLC-MS/MS analysis must be dedi-cated to conditions for that specific entity. If elution times areknown, then multiple SRM scan events can be programmed allow-ing for detection of multiple targets; however, chromatographicconditions must remain identical or the scheduled SRM elutionwindows will no longer align. Still, the bandwidth of that approachis low, ∼100 peptide targets per nHPLC-MS/MS analysis, andcompiling such an experiment is highly laborious (21).

We surmised that inSeq could inform the MS system withouthuman intervention, of which peptide targets are most likely toelute subsequently. Such capability could enable robust, large-scale targeting (>500 per analysis) in an automated manner.Our approach relies upon relative peptide elution order and, con-sequently, bypasses the use of absolute retention times that shiftdepending on chromatographic conditions and are not directlyportable frommultiple disparate experiments. Peptide elution or-der can be obtained in two ways. First, discovery experiments canbe employed to determine retention order by normalizing themeasured retention time for each detected peptide sequence.Second, the relative hydrophobicity for any sequence can be the-oretically determined using existing software (e.g., SSRCalc) (22–24). In our experience, experimentally determined retention or-der offers better precision, though it requires prior knowledgethat may not be available. However, retention order is deter-mined, the real-time confirmation algorithm maintains a rollingaverage of the calculated elution order (CEO—a number de-scribing the relative elution order of a target peptide) so that tar-get peptides having nearby CEOs are specifically pursued(Fig. 2B). Fig. 2 presents an overview of this approach. This ex-ample, 60.39 min into the chromatograph, highlights the last fiveinSeq identified peptides and their average CEO (26.926 a.u.).The onboard algorithm then computes an asymmetric CEO win-dow (5 a.u., 24.926–29.926) that presents a short list of desiredtargets having CEOs within that range (Fig. 2C). With this infor-mation the MS system can trigger specialized MS2 scans specificto this refined target subset. Note that as targets are identified,the CEOwindow is dynamically adjusted so targets come into andout of the range precisely when they are eluting.

To test this technology, we performed a DDA nHPLC-MS/MSexperiment where tryptic peptides from a human ES cell samplewere separated over a 60-min gradient. Following data collection,the resulting MS/MS spectra were mapped to sequence using da-tabase searching (1% FDR). The unique peptide identifications(4,237) were sorted by observed retention time—this orderingthen served as the CEO. Three thousand of these peptides wererandomly selected as “targets” and loaded onto the instrumentfirmware (Velos-Orbitrap), along with their respective CEO, asa database for inSeq. The sample was then reanalyzed with inSeqactivated but with a doubled gradient length (120 min). Fig. 2Ddisplays the CEO window as calculated in real-time by the MSsystem (inSeq) plotted beside the actual elution time of identifiedpeptides. Greater than 95% of the peptides (2,889) fell within therolling CEO window and were identified by inSeq and postacqui-sition searching. At our present capability we can achieve windowwidths similar to those used in absolute scheduling type experi-ments (∼3–6 min) on a scale that is 30x larger (e.g., 3,000 targetsvs. 100) with minimal effort (21, 25). Furthermore, we demon-strate that our approach adapts to different chromatographicconditions with no negative effects (Fig. 2D). The key to the highportability and simplicity of our algorithm is the use of inSeq forcontinual, real-time realignment.

Improvement of Quantitative Accuracy. The method of stable iso-tope labeling has greatly propelled large-scale, quantitative ana-lysis (26–31). While generally robust, these techniques can yieldspotty data for certain peptide and protein groups—mainly thosepresent at low abundances. For SILAC, low signal-to-noise (S/N)precursor peaks in the MS1 scan often result in omission of thatparticular feature or quantitative imprecision, if included (32).For isobaric tagging, low intensity reporter ion signals (MS2) in-duce similar shortcomings (33). We surmised that inSeq could beemployed to counter these limitations.

First, we developed an inSeq module to improve the quality ofisobaric label-based measurements. The module analyzes MS2

spectra using inSeq and, when a peptide of interest is detected,the quality of quantitative data is assessed. Should the reporterion signal fall below a specified threshold, inSeq triggers follow-up scans to generate increased signal at the very instant the targetpeptide is eluting. In one implementation, we instructed inSeq toautomatically trigger three quantitative scans using the recentlydeveloped QuantMode (QM) method to generate superior qual-ity quantitative data on targets of high value (17). The trio of QMscans are then summed offline.

To assess this decision node, we analyzed a sample comprisingthree biological replicates of human embryonic stem cells pre-and 2 d post bone morphogenetic protein 4 (BMP4) treatment(i.e., TMT 6-plex, three pretreatment and three post BMP4 treat-ment cell populations). BMP4, a growth factor that inducescontext-dependent differentiation in pluripotent stem cells, is

A B C

D

Fig. 2. Elution order prediction using inSeq. (A) Experimental chromatogram, 60.39 min into a 120-min gradient obtained while inSeq was recording instantidentifications. (B) The inSeq CEO obtained by averaging the CEOs of the preceding five instantly identified peptides. (C) The asymmetric window (5 a.u.)surrounding the instant CEO average (μ ¼ 26.926) and their corresponding sequences. (D) This analysis is repeated following each new peptide confirmationto realign the CEO window constantly based on current chromatographic conditions.

Bailey et al. PNAS ∣ May 29, 2012 ∣ vol. 109 ∣ no. 22 ∣ 8413

CHEM

ISTR

Y

Page 4: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

widely used to study differentiation to biologically relevant celllineages such as mesoderm and endoderm (34–36). Whenevera target peptide was identified by inSeq, three QM scans wereimmediately executed. This ensured that all identified peptideshad the same number of quantitation scans enabling a direct com-parison for analyzing multiple QM scans within this experiment.Fig. 3A demonstrates the benefit of summing isobaric tag inten-sities from one, two, or three consecutive quantitation scansfor an inSeq identified target peptide having the sequenceFCADHPFLFFIR from the protein SERPINB8. Here the ratioof change between control and treatment cell lines measured inone QM scan is large (5.86) but not significant (P ¼ 0.067, Stu-dent’s t test with Storey correction) (37). Note significance testingwas accomplished by assessing variation within the three biologi-cal replicates of treatment and control cell lines. The measuredratio remains relatively unchanged (5.22 and 5.43) as reporter tagsignal from additional quantitation scans are added; however, thecorresponding P values decrease to 0.014 and 0.012 when two orthree quantitation scans are summed. By plotting the log2 ratio ofquantified proteins from the three biological replicates againstthe average intensity of isobaric labels (Fig. 3B), we demonstratethis improved significance results from boosted reporter S/N. Ide-ally, this log2 ratio would be zero indicating perfect biological re-plication; however, when only one quantitation scan is employed,this ratio severely deviates from zero with decreasing tag inten-sity. To improve overall data quality and to omit potentially er-roneous measurements, we and others employ arbitrary reportersignal cutoffs (dashed vertical line in Fig. 3B). Summation of ad-ditional quantitation scans increases the average reporter tag in-tensity raising nearly all of the protein measurements above theintensity cutoff value (74, 9, and four proteins omitted using one,two, and three quantification scans, respectively). This quantifi-cation decision node also increased the number of proteins within25% of perfect biological replication (horizontal dashed line).

To determine if the method could improve the number of sta-tistically significant differences between the cell populations, wecalculated the log2 ratio of treated vs. control (i.e., 2 d∕0 d) foreach of the 596 quantified proteins (P < 0.05, Student’s t testwith Storey correction, Fig. 3C). Only 28 proteins display signifi-cant change when one QM scan is used. By simply adding thereporter tag signal from additional scans the number of signifi-cantly changing proteins increases nearly threefold from 28 to91 when all three QM scans are analyzed together.

Many stable isotope incorporation techniques measure heavyand light peptide pairs in MS1 (e.g., SILAC). This approach, ofcourse, requires the detection of both partners. Note, low abun-dance peptides are often identified with low, or no, precursor sig-nal in the MS1. We supposed that addition of another inSeqdecision node could circumvent this problem. We cultured human

embryonic stem cells in light and heavy media. Protein extract fromthese cultures was mixed 5∶1 (light:heavy) before digestion over-night with LysC. The SILAC node was developed to select precur-sors from an MS1 scan only if the monoisotopic mass was within30 ppm of any target on a list that contained 4,000 heavy and lightpeptides from a previous discovery run. Targets were selected onlyif the SILAC ratio deviated from the expected ratio of 5 by 25%,i.e., the subset containing the most error. Following MS/MS, theresulting spectra were analyzed using inSeq. When a target of in-terest was identified, inSeq instructed the system to record imme-diately a SIM scan surrounding the light/heavy pair with a small,charge-dependent isolation window (∼8–10 Th).

The average ratio of the light and heavy peptides subtly,but significantly, shifted from 4.47 under normal analysis to 5.34for the inSeq triggered SIM scans (Student’s t test, p value <6 × 10−20). More importantly, the number of useable measure-ments, i.e., when both partners of the pair are observed, increasedby ∼20% (2,887 under normal analysis to 3,548 with inSeq,Fig. S5A). Fig. S5B displays an example of the inSeq-triggeredSIM scan and the increase in S/N and accuracy it affords. Here,the MS/MS scan of the light partner was mapped, in real-time tothe sequence IEELDQENEAALENGIK. This event triggered ahigh-resolution SIM scan (8 Th window) that led to the ratioof 4.99∶1 (correct ratio 5∶1). Here, gas phase enrichment wasessential to quantify the relative abundance as the isotopic envel-ope of the heavy partner was not observed even with extensivespectral averaging of successive MS1 scans (∼30 s, Fig. S5B).Whether for MS1 or MS2 centric methods, we conclude that in-Seq technology will significantly improve the quality of quantita-tive data with only a minimal impact on duty cycle.

Posttranslational Modification Site Localization. The presence ofPTMs on proteins plays a major role in cellular function and sig-naling. Unambiguous localization of PTMs to residues demandsobservation of product ions resulting from cleavage of the residuesadjacent to the site of modification, i.e., site-determining frag-ments (SDFs). In a typical analysis, only about half of the identi-fied phosphorylation sites can be mapped with single amino acidresolution stymying systems-level data analysis. We reasoned thatinSeq could be leveraged to boost PTM localization rates by dy-namically modifying MS2 acquisition conditions when necessary.As such, we developed an online PTM localization decision nodeto determine, within milliseconds, whether a MS/MS spectrumcontains SDFs to localize unambiguously the PTM. Should SDFsbe lacking, inSeq instantly orchestrates further interrogation.

The PTM localization node is engaged when inSeq confirmsthe detection of a PTM-bearing peptide. After the sequence isconfirmed, inSeq assesses the confidence with which the PTM(s) can be localized to a particular amino acid residue. This

A B C

Fig. 3. Improved quantitative outcomes using in-Seq in isobaric tagging. An inSeq decision nodewas written to so that a real-time identificationof a target sequence prompted automatic acquisi-tion of three consecutive QuantMode (QM) scans.(A) Summing the reporter ion tag intensity fromone, two, or three QM scans greatly improves thestatistical significance of the measurement. (B) Sum-mation of QM technical replicates reduces the var-iation in biological replicate measurement byincreasing reporter ion S/N. (C) QM-triggered scansby inSeq increase the number of significantly chan-ging proteins from 28 to 91.

8414 ∣ www.pnas.org/cgi/doi/10.1073/pnas.1205292109 Bailey et al.

Page 5: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

procedure is accomplished by computing an online probabilityscore similar to postacquisition PTM localization software—i.e., AScore (38). Briefly, inSeq compares all possible peptide iso-forms against the MS/MS spectrum. For each SDF, the number ofmatches at <10 ppm tolerance is counted, and an AScore is cal-culated (inSeq uses similar math). If the AScore of the best fittingisoform is above 13 (p < 0.05), the PTM is declared localized.When the AScore is below 13, however, inSeq triggers furthercharacterization of the eluting precursor until the site has beendeemed localized or all decision nodes have been exhausted.Additional characterization can include many procedures suchas acquisition of MS/MS spectra using different fragmentationmethods (e.g., CAD, HCD, ETD, PD, etc.), varied fragmentationconditions (e.g., collision energy, reaction time, laser fluence,etc.), increased spectral averaging, MSn, pseudo MSn, modifieddynamic exclusion, and altered automatic gain control target va-lues among others (39).

To obtain proof-of-concept results, we wrote a simple inSeqnode that triggered an ETDMS/MS scan of phosphopeptides thatwere not localized following HCD MS2. In one example (Fig. 4),the sequence, RNSSEASSGDFLDLK was confirmed to containa phosphoryl group; however, the inSeq algorithm could notconfidently localize the PTM to any of the four Ser residues(AScore ¼ 0). Next, inSeq triggered an ETD MS2 scan of thesame precursor (Fig. 4B). The resulting spectrum was then ana-lyzed for the presence of the SDFs, c3∕z•12. Both of these frag-ments were present and the site was localized to Ser 3 with anAScore of 31.0129 (p < 0.00079). Postacquisition analysis con-firmed the results of our online inSeq approach—both spectra(HCD and ETD) were confidently identified, and their calculatedAScores were 0 and 45.58, respectively. When compared on a glo-bal scale, 993 of the 1,134 inSeq-identified phosphopeptides hadlocalization judgments that matched postacquisition AScore ana-lysis (Fig. 4). This slight difference is the result of using differentlocalization algorithms for online and postacquisition analysis. Pri-marily, the postacquisition method considers fragment ions oneither side of the site-determining fragments separately, whereasthe inSeq method does not perform this extra step for simplicity(30, 38). These data demonstrate that our localization node ishighly effective at instantaneously determining whether a PTMsite can be localized. Unfortunately, only marginal gains wereachieved in this basic implementation as most precursors weredoubly charged and, therefore, not effectively sequenced by ETD.Next, we modified the inSeq decision node to incorporate a dis-sociation method DT. Here a follow-up ETD or combination iontrap CAD/HCD scan was triggered depending upon precursorcharge (z) andm∕z. With the slightly evolved algorithm, the inSeqmethod detected 998 phosphopeptides in a single shotgun experi-ment. It determined that 324 of these identifications lacked theinformation to localize the PTM site and, in those cases, triggeredthe new dissociation decision node. Seventy-eight of these unlo-calizable sites were confidently mapped with this technique—salvaging nearly 25% of the unlocalized sites (Fig. 4D). Theseencouraging results demonstrate that inSeq has great promiseto curtail the problem of PTM localization in a highly automatedfashion. We note there are dozens of parameters to explore in thecontinued advancement of this PTM localization decision node.

DiscussionHere we describe an instant sequencing algorithm (inSeq) thatoperates using the preexisting processors of the MS. Rapidreal-time sequencing affords several data acquisition opportu-nities. To orchestrate these opportunities, we constructed anadvanced decision tree logic that extends our earlier use of themethod to intelligently select dissociation type. The approach cancircumvent long-standing problems with the conventional DDAparadigm. We provide three such examples herein. First, we de-monstrate that knowledge of which peptide sequences are eluting

can facilitate the prediction of soon-to-elute targets. This methodshows strong promise to revolutionize the way in which targetedproteomics is conducted. Second, we used quantitative decisionnodes that fired when inSeq detected a peptide sequence of in-terest. With SILAC or isobaric tagging, significant gains in quan-titative outcomes were documented. Third, we endowed inSeqwith an instant PTM site-localization algorithm to determinewhether or not to initiate more rigorous follow-up at the very in-stant the peptide of interest was eluting. We show that the inSeqsite localizer is highly effective (90% agreement with postacquisi-tion analysis) and that triggering a simple dissociation method

Fig. 4. PTM localization rates increase with inSeq. Following MS2 (HCD) ofthe singly phosphorylated precursor RNsSEASSGDFLDLK, inSeq could not findsufficient information to confidently localize the modification to either Ser 3or 4 (A, AScore ¼ 0). An ETD MS2 scan event is immediately triggered by in-Seq on the same precursor (B). This spectrum was assigned an AScore of31.0129 (phosphorylation on Ser3) and was considered confidently loca-lized—note the SDFs c3 and z•

12 ions. (C) Globally, the inSeq localization cal-culation agreed with offline analysis using the actual AScore algorithm. (D)Using a simple dissociation method DT, inSeq produced a confidently loca-lized phosphorylation site for 78 of 324 unlocalizable sites.

Bailey et al. PNAS ∣ May 29, 2012 ∣ vol. 109 ∣ no. 22 ∣ 8415

CHEM

ISTR

Y

Page 6: Instant spectral assignment for advanced decision tree ... · These nodes enable automated functionalities including: real-time elution prediction, advanced quantification, PTM localiza-tion,

DTcan improve site localization by ∼25%. Further developmentwill doubtless deliver additional gains.

Targeted proteomics is an area of increasing importance. Fol-lowing discovery analysis it is natural to cull the list of several thou-sand detected proteins to several hundred key players. In an idealworld, these key proteins are then monitored in dozens or evenhundreds of samples with high sensitivity and reproducibility with-out rigorous method development and can be expediently per-formed. We envision that advanced DT analysis with inSeq couldoffer such a platform. Using the retention time prediction algo-rithm we introduced here, one can foresee the inSeq algorithmquickly and precisely monitoring hundreds of peptides withoutthe extensive labor and preplanning required by the selected reac-tion monitoring (SRM) technique, current state-of-the-art (40).Other possibilities include automated pathway analysis whereuser-defined proteins within a collection of pathways are simplyuploaded to the MS system. Then, inSeq automatically determinesthe best peptides to track their retention times and constructs themethod. Two key advantages over current SRM technology makethis operation possible. First, knowledge of specific fragmentationtransitions is not necessary as all products are monitored with highmass accuracy. Second, precise elution time scheduling is not ne-cessary as inSeq can use CEO, experimental or theoretical, to dy-namically adjust the predicted elution of targets. In this fashion themost tedious components of the SRM workflow can be avoided.

Materials and MethodsCell Culture, Cell Lysis, Digestion, Isobaric Labeling, and Phosphopeptide Enrich-ment. For all experiments, cells were lysed by sonication, protein extracted,and digested by LysC or LysC/Trypsin before desalting. For isobaric-labeledexperiments, peptides were labeled, mixed, and desalted before nHPLC-MS/MS analysis. For phosphopeptide experiments, phosphopeptides wereenriched via immobilized metal affinity chromatography. Further detailsare provided in SI Text.

inSeq Configuration, nHPLC, MS, Database Searching, Phosphosite Localization,and Peptide and Protein Quantification. All experiments were performed onLTQ Orbitrap Velos and Q-Exactive mass spectrometers (Thermo Fisher Scien-tific). All MS/MS spectra were searched using OMSSA (Open Mass Spectrome-try Search Algorithm). Peptide FDR, protein FDR, and peptide and proteinquantification using isobaric labels was performed using Coon OMSSA Proteo-mics Analysis Software Suite. Phosphosite localization and SILAC quantitationwas performed using in-house software. Additional details are provided in theSI Text.

ACKNOWLEDGMENTS. We thank Steve Gygi for assistance in analyzing theSILAC datasets and A.J. Bureta for figure illustrations. We also thank JaeSchwartz, John Syka, Mike Senko, and Jens Griep-Raming for helpful discus-sions. This work was funded by the grants from the National Institutes ofHealth (NIH) (GM081629 and HG004952), the National Science Foundation(NSF) (DBI-0701846), and Thermo Fisher Scientific. J.B. was funded by anNSF Graduate Research Fellowship and NIH traineeship (5T32GM08349).C.M.R. was also funded by an NSF Graduate Research Fellowship and NIHTraineeship (T32GM008505).

1. WashburnMP,Wolters D, Yates JR (2001) Large-scale analysis of the yeast proteome bymultidimensional protein identification technology. Nat Biotechnol 19:242–247.

2. Nilsson T, et al. (2010)Mass spectrometry in high-throughput proteomics: ready for thebig time. Nat Methods 7:681–685.

3. Wenger CD, McAlister GC, Xia QW, Coon JJ (2010) Sub-part-per-million Precursor andProduct Mass Accuracy for High-throughput Proteomics on an Electron Transfer Dis-sociation-enabled Orbitrap Mass Spectrometer. Mol Cell Proteomics 9:754–763.

4. Michalski A, et al. (2011) Mass spectrometry-based proteomics using Q Exactive, ahigh-performance benchtop quadrupole orbitrap mass spectrometer.Mol Cell Proteo-mics 10:M111 011015.

5. Eng JK, Mccormack AL, Yates JR (1994) An approach to correlate tandemmass-spectraldata of peptides with amino-acid-sequences in a protein database. J Am Soc MassSpectr 5:976–989.

6. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS (1999) Probability-based protein iden-tification by searching sequence databases usingmass spectrometry data. Electrophor-esis 20:3551–3567.

7. Geer LY, et al. (2004) Open mass spectrometry search algorithm. J Proteome Res3:958–964.

8. Liu HB, Sadygov RG, Yates JR (2004) A model for random sampling and estimation ofrelative protein abundance in shotgun proteomics. Anal Chem 76:4193–4201.

9. Hoopmann MR, Merrihew GE, von Haller PD, MacCoss MJ (2009) Post analysis dataacquisition for the iterative MS/MS sampling of proteomics mixtures. J ProteomeRes 8:1870–1875.

10. Venable JD, Dong MQ, Wohlschlegel J, Dillin A, Yates JR (2004) Automated approachfor quantitative analysis of complex peptide mixtures from tandem mass spectra. NatMethods 1:39–45.

11. Panchaud A, et al. (2009) Precursor acquisition independent from ion count: how todive deeper into the proteomics ocean. Anal Chem 81:6481–6488.

12. McAlister GC, Phanstiel DH, Brumbaugh J,Westphall MS, Coon JJ (2011) Higher-energycollision-activated dissociation without a dedicated collision cell. Mol Cell Proteomics10:o111 009456.

13. Olsen JV, et al. (2007) Higher-energy C-trap dissociation for peptide modification ana-lysis. Nat Methods 4:709–712.

14. Syka JEP, Coon JJ, Schroeder MJ, Shabanowitz J, Hunt DF (2004) Peptide and proteinsequence analysis by electron transfer dissociation mass spectrometry. Proc Natl AcadSci USA 101:9528–9533.

15. Little DP, Speir JP, Senko MW, Oconnor PB, Mclafferty FW (1994) Infrared multiphotondissociation of large multiply-charged ions for biomolecule sequencing. Anal Chem66:2809–2815.

16. Olsen JV, Mann M (2004) Improved peptide identification in proteomics by two con-secutive stages of mass spectrometric fragmentation. Proc Natl Acad Sci USA101:13417–13422.

17. Wenger CD, et al. (2011) Gas-phase purification enables accurate, multiplexed pro-teome quantification with isobaric tagging. Nat Methods 8:933–935.

18. Swaney DL, McAlister GC, Coon JJ (2008) Decision tree-driven tandem mass spectro-metry for shotgun proteomics. Nat Methods 5:959–964.

19. Graumann J, Scheltema RA, Zhang Y, Cox J, Mann M (2011) A framework for intelli-gent data acquisition and real-time database searching for shotgun proteomics. MolCell Proteomics 11:M111 013185.

20. Bailey DJ, et al. (2011) How High Mass Accuracy Measurements Will TransformTargeted Proteomics. Proceedings of the 59th ASMS Conference on Mass Spectrome-try and Allied Topics (ASMS, Denver, CO) ThOE:278.

21. Yan W, et al. (2011) Index-ion triggered MS2 ion quantification: a novel proteomicsapproach for reproducible detection and quantification of targeted proteins in com-plex mixtures. Mol Cell Proteomics 10:M110 005611.

22. Krokhin OV, et al. (2004) An improved model for prediction of retention times of tryp-tic peptides in ion pair reversed-phase HPLC—Its application to protein peptide map-ping by off-line HPLC-MALDI MS. Mol Cell Proteomics 3:908–919.

23. Krokhin OV (2006) Sequence-specific retention calculator Algorithm for peptide reten-tion prediction in ion-pair RP-HPLC: Application to 300-and 100-angstrom pore sizeC18 sorbents. Anal Chem 78:7785–7795.

24. Kiyonami R, et al. (2011) Increased selectivity, analytical precision, and throughput intargeted proteomics. Mol Cell Proteomics 10:M110 002931.

25. Schmidt A, et al. (2008) An Integrated, Directed Mass Spectrometric Approach forIn-depth Characterization of Complex Peptide Mixtures. Mol Cell Proteomics

7:2138–2150.26. Gruhler A, et al. (2005) Quantitative phosphoproteomics applied to the yeast phero-

mone signaling pathway. Mol Cell Proteomics 4:310–327.27. Choe L, et al. (2007) 8-Plex quantitation of changes in cerebrospinal fluid protein

expression in subjects undergoing intravenous immunoglobulin treatment for Alzhei-mer’s disease. Proteomics 7:3651–3660.

28. de Godoy LMF, et al. (2008) Comprehensive mass-spectrometry-based proteome quan-tification of haploid versus diploid yeast. Nature 455:1251–U1260.

29. Xiao KH, et al. (2010) Global phosphorylation analysis of beta-arrestin-mediatedsignaling downstream of a seven transmembrane receptor (7TMR). Proc Natl AcadSci USA 107:15299–15304.

30. Phanstiel DH, et al. (2011) Proteomic and phosphoproteomic comparison of human ESand iPS cells. Nat Methods 8:821–U884.

31. Lee MV, et al. (2011) A dynamic model of proteome changes reveals new roles fortranscript alteration in yeast. Mol Syst Biol 7:514.

32. Bakalarski CE, et al. (2008) The impact of peptide abundance and dynamic range onstable-isotope-based quantitative proteomic analyses. J Proteome Res 7:4756–4765.

33. Zhang Y, et al. (2010) A robust error model for iTRAQ quantification reveals divergentsignaling between oncogenic FLT3 mutants in acute myeloid leukemia. Mol Cell Pro-teomics 9:780–790.

34. Lengerke C, et al. (2008) BMP andWnt specify hematopoietic fate by activation of theCdx-Hox pathway. Cell Stem Cell 2:72–82.

35. Yang L, et al. (2008) Human cardiovascular progenitor cells develop from a KDR plusembryonic-stem-cell-derived population. Nature 453:524–U526.

36. Yu PZ, Pan GJ, Yu JY, Thomson JA (2011) FGF2 Sustains NANOG and switches the out-come of BMP4-induced human embryonic stem cell differentiation. Cell Stem Cell8:326–334.

37. Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc NatlAcad Sci USA 100:9440–9445.

38. Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP (2006) A probability-based approachfor high-throughput protein phosphorylation analysis and site localization. Nat Bio-technol 24:1285–1292.

39. Schroeder MJ, Shabanowitz J, Schwartz JC, Hunt DF, Coon JJ (2004) A neutral loss ac-tivation method for improved phosphopeptide sequence analysis by quadrupole iontrap mass spectrometry. Anal Chem 76:3590–3598.

40. Picotti P, et al. (2009) High-throughput generation of selected reaction-monitoringassays for proteins and proteomes. Nat Methods 7:43–U45.

8416 ∣ www.pnas.org/cgi/doi/10.1073/pnas.1205292109 Bailey et al.