HAL Id: hal-02492494
https://hal.univ-lorraine.fr/hal-02492494
Submitted on 28 Oct 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Machine learning of reverse transcription signatures of variegated polymerases allows mapping and discrimination of methylated purines in limited transcriptomes

Stephan Werner, Lukas Schmidt, Virginie Marchand, Thomas Kemmer, Christoph Falschlunger, Maksim Sednev, Guillaume Bec, Eric Ennifar, Claudia Höbartner, Ronald Micura, et al.

To cite this version: Stephan Werner, Lukas Schmidt, Virginie Marchand, Thomas Kemmer, Christoph Falschlunger, et al. Machine learning of reverse transcription signatures of variegated polymerases allows mapping and discrimination of methylated purines in limited transcriptomes. Nucleic Acids Research, Oxford University Press, 2020, 48 (7), pp. 3734-3746. 10.1093/nar/gkaa113. hal-02492494


3734–3746 Nucleic Acids Research, 2020, Vol. 48, No. 7
Published online 25 February 2020
doi: 10.1093/nar/gkaa113

Machine learning of reverse transcription signatures of variegated polymerases allows mapping and discrimination of methylated purines in limited transcriptomes

Stephan Werner1,†, Lukas Schmidt1,†, Virginie Marchand2, Thomas Kemmer3, Christoph Falschlunger4, Maksim V. Sednev5, Guillaume Bec6, Eric Ennifar6, Claudia Höbartner5, Ronald Micura4, Yuri Motorin2,7, Andreas Hildebrandt3,* and Mark Helm1,*

1 Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-University Mainz, Staudingerweg 5, 55128 Mainz, Germany
2 Epitranscriptomics and Sequencing (EpiRNA-Seq) Core Facility, UMS2008 IBSLor CNRS-Université de Lorraine-INSERM, Biopôle, 9 Avenue de la Forêt de Haye, 54505 Vandœuvre-lès-Nancy, France
3 Institute of Computer Science, Johannes Gutenberg-University Mainz, Staudingerweg 9, 55128 Mainz, Germany
4 Department of Organic Chemistry, Leopold Franzens University Innsbruck, Innrain 80/82, 6020 Innsbruck, Austria
5 Institute of Organic Chemistry, Julius Maximilian University Würzburg, Am Hubland, 97074 Würzburg, Germany
6 IBMC, UPR9002 CNRS-Université de Strasbourg, 2 Allée Konrad Roentgen, 67084 Strasbourg, France
7 IMoPA, UMR7365 CNRS-Université de Lorraine, Biopôle, 9 Avenue de la Forêt de Haye, 54505 Vandœuvre-lès-Nancy, France

Received October 10, 2019; Revised February 10, 2020; Editorial Decision February 11, 2020; Accepted February 19, 2020

ABSTRACT

Reverse transcription (RT) of RNA templates containing RNA modifications leads to synthesis of cDNA containing information on the modification in the form of misincorporation, arrest, or nucleotide skipping events. A compilation of such events from multiple cDNAs represents an RT-signature that is typical for a given modification, but, as we show here, depends also on the reverse transcriptase enzyme. A comparison of 13 different enzymes revealed a range of RT-signatures, with individual enzymes exhibiting average arrest rates between 20 and 75%, as well as average misincorporation rates between 30 and 75% in the read-through cDNA. Using RT-signatures from individual enzymes to train a random forest model as a machine learning regimen for prediction of modifications, we found strongly variegated success rates for the prediction of methylated purines, as exemplified with N1-methyladenosine (m1A). Among the 13 enzymes, a correlation was found between read length, misincorporation, and prediction success. Inversely, low average read length was correlated to high arrest rate and lower prediction success. The three most successful polymerases were then applied to the characterization of RT-signatures of other methylated purines. Guanosines featuring methyl groups on the Watson–Crick face were identified with high confidence, but discrimination between m1G and m2,2G was only partially successful. In summary, the results suggest that, given sufficient coverage and a set of specifically optimized reaction conditions for reverse transcription, all RNA modifications that impede Watson–Crick bonds can be distinguished by their RT-signature.

*To whom correspondence should be addressed. Tel: +49 6131 39 25731; Fax: +49 6131 39 20373; Email: [email protected]. Correspondence may also be addressed to Andreas Hildebrandt. Tel: +49 6131 39 23334; Fax: +49 6131 39 23534; Email: [email protected].
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

With the discovery of retroviral reverse transcriptases (RT) in 1970 by Howard Temin (1) and David Baltimore (2), the possibility of synthesizing cDNA copies of RNA substrates revolutionized the field of molecular biology and found application in various analytical and biotechnological methodologies, including transcriptome profiling, RNA structure mapping, RT-PCR of RNA fragments and RNA-sequencing (RNA-Seq) (3,4).

The most widely applied reverse transcriptases derive from the Avian Myeloblastosis Virus (AMV) and the Moloney Murine Leukemia Virus (MMLV) (5). In general, known reverse transcriptases exhibit two catalytic activities: a DNA polymerase activity and an associated RNase activity. The DNA polymerase activity copies both RNA and DNA templates, whereas the RNase activity, termed RNase H, degrades RNA in RNA–DNA hybrid duplexes. The mature enzymes are formed by proteolytic processing and differ in their subunit structures. AMV RT is a heterodimer in which both subunits have DNA polymerase and RNase H activity, while MMLV RT is a monomer that also exhibits both activities (6).

Today's commercially available reverse transcriptases are generally recombinant variants of the native AMV and MMLV enzymes with improved fidelity, elevated temperature optima, reduced RNase H activity and higher processivity, thus addressing limitations such as impairment through higher-order RNA structures, primer mispairing or relatively high error rates (7). Another well-known and well-researched reverse transcriptase is the heterodimeric HIV-RT, composed of two related subunits. The larger subunit p66, carrying the polymerase and RNase H domains, plays the catalytic role, whereas the smaller subunit p51 plays a structural role (8). HIV-RT serves as a major target for antiretroviral drug combination therapies and is therefore mostly used in drug development studies. Recently, HIV-RT was used in an evolution platform to find a reverse transcriptase variant with robust read-through and high mutation rates at modified sites during reverse transcription, allowing site-specific identification (9). There are also reports of increased reverse transcriptase activity from some DNA polymerases, such as that from Thermus thermophilus (Tth) (10), and a recent study of an N-terminally shortened form of DNA polymerase I from Thermus aquaticus (Klentaq) (11) shows the possibility of targeted polymerase engineering to evolve reverse transcriptase activity (12,13). In follow-up studies, Klentaq variants for specific modification analysis were developed (14,15).

Until the development of nanopore-based direct RNA-Sequencing methods (16–19), reverse transcription of RNA templates was a mandatory step in library preparation for RNA-Sequencing (RNA-Seq), a widely used application for RNA sequence elucidation, including the detection of RNA modifications, the so-called modification calling (20). In this step, the information about modification type and position may be erased, partially or completely, since the newly synthesized cDNA is composed only of the four canonical deoxynucleotides.

Of the >150 different types of post-transcriptional modifications (21), only a very limited number can alter cDNA synthesis due to chemical alterations on their Watson–Crick face and therefore leave traces in RNA-Seq data. The emerging characteristic patterns, which we previously termed RT-signatures, derive from RT termination products and misincorporated nucleotides, so-called mismatches, at modified sites and can be utilized to accurately detect RNA modifications, as successfully applied to m1A (22). The latter is a modification featuring a methyl group on the Watson–Crick face of adenosine, which interferes with proper base pairing in RNA structure (23–25) as well as during cDNA synthesis by reverse transcription (26).

Two studies on transcriptome-wide m1A detection, based on the enrichment of m1A-containing RNA fragments with specific m1A antibodies followed by mapping at single-nucleotide resolution, were published in 2017 (27,28). Relying on the property of m1A to produce typical misincorporation and truncation profiles with different reverse transcriptases, these studies arrived at quite different outcomes in terms of the number of detected and proposed m1A sites. A recently published study on the cross-reactivity of the most commonly used m1A-binding antibody in transcriptome-wide mapping studies provides a possible explanation for the strongly varying numbers of m1A sites (29,30). Besides this, the two methods differ in their experimental design and bioinformatic processing pipeline, but both use the thermostable group II intron reverse transcriptase (TGIRT) for cDNA synthesis. This enzyme is encoded by group II introns, mobile ribozymes that use the RT to stabilize the RNA structure for splicing and afterwards to convert the integrated intron RNA back into DNA (31). The same reverse transcriptase was used in a study on high-throughput tRNA sequencing, in which the authors focused on N3-methylcytidine (m3C) and N1-methylguanosine (m1G), two other modifications with an altered Watson–Crick edge, in addition to m1A (32). Furthermore, RT-signatures at two N6,N6-dimethyladenosine (m6,6A) residues in yeast 18S rRNA are known and reported to be inconsistent and partially indistinguishable from the signature of m1A (22,33). This raises the question whether purine modifications with an altered Watson–Crick face, such as m1G, m2G, m2,2G, m1A, m6A or m6,6A, show characteristic and distinguishable RT-signatures in RNA-Seq data, and furthermore whether they can be clearly detected using computer-based analytical methods.

Central and common problems in modification calling from RNA-Seq data are rooted in statistics and data treatment. As outlined above, interpretation of RNA-Seq datasets is frequently controversial, as different groups apply variegated statistical methods, but also because, in the absence of commonly accepted standards, many thresholds for modification calling are set arbitrarily or according to idiosyncratic reasoning. Consequently, there is a need to establish neutral and unbiased procedures for threshold determination. Indeed, a number of bioinformatics tools can be applied to assess the robustness and accuracy of modification calling based on RT-signatures. This concerns in particular the distinctness of actual modification sites from non-modified sites in RNA-Seq data. For this purpose, prediction models from machine learning are used, which can be implemented for classification and regression analysis. The use of machine learning models, trained to distinguish a certain class by different characteristics, allows a prediction without thresholds or cutoffs set by the user. The decision whether a modification is present or not is entirely left to the model, making the analyses comparable and less susceptible to bias, as the model compromises between sensitivity (recall) and precision, whereas user-defined thresholds and cutoffs tend to shift this balance to one side.

Two examples of such supervised learning models for pattern recognition are support-vector machines (SVM) (34) and random forests (RF) (35), commonly used for binary classification. Random forest models were chosen as the prediction tool in the study by Hauenschild et al. (22): decision trees, the building blocks of random forests, were constructed from training datasets and then applied to test data for m1A recognition according to the occurring RT-signatures. Random forests consist of a large number of individual decision trees, each trained with a different subset of the training data (bagging), with each node of each decision tree being split using the feature with the best splitting properties (out of a random feature sampling) from the data. Operating as an ensemble, all decision trees contribute to the overall classification, forming the model's prediction by majority voting (35).

Identifying the predictive performance and statistical classification is of particular importance in order to assess the results obtained. Therefore, a confusion matrix is created to visualize the performance of a classification algorithm (36). For binary classification, a confusion matrix consists of two rows and two columns reporting the number of false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN), comparing the instances in a predicted class with the instances in an actual class. For more detailed analysis, the sensitivity, also called true positive rate (TPR), the specificity or true negative rate (TNR), and the precision or positive predictive value (PPV) can be calculated. By plotting the true positive rate against the false positive rate (FPR), calculated as (1 − specificity), a receiver operating characteristic (ROC) curve can be generated as a two-dimensional graphical plot to illustrate the classifier performance (37). For comparison purposes, the area under the ROC curve (AUC) as a scalar value is a good measure for the classifier performance (38,39). Another quality parameter is the Matthews Correlation Coefficient (MCC), which serves to study the correlation between observed and predicted binary classifications and can be used even if the classes are of very different sizes (40).
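The performance measures named above follow directly from the four confusion-matrix counts. The following is a minimal sketch in Python, not the authors' implementation; the function name and the example counts are hypothetical and chosen only to illustrate a dataset in which modified sites are rare.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Derive standard scores from binary confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Guard against empty classes; 0.0 is an arbitrary fallback (assumption).
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (TPR, recall)
    specificity = ratio(tn, tn + fp)   # true negative rate (TNR)
    precision = ratio(tp, tp + fp)     # positive predictive value (PPV)
    fpr = 1.0 - specificity            # false positive rate, x-axis of a ROC curve
    # Matthews correlation coefficient, usable for classes of unequal size
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denominator)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "fpr": fpr,
        "mcc": mcc,
    }


# Hypothetical counts: 20 modified sites among 1000 positions.
metrics = classification_metrics(tp=18, fp=5, tn=975, fn=2)
```

With such imbalanced counts the specificity stays close to 1 even though roughly one in five positive calls is wrong, which is one reason why precision and MCC are reported alongside it.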

Here, we present an RT-signature analysis of various purine modifications featuring methyl groups on the Watson–Crick face known to impede reverse transcription. The analyses uncovered a strong dependence of RT-signatures on the reverse transcriptase enzyme, and, most interestingly, a correlation between different RT-signature characteristics and success rates of random forest models for prediction of modified sites. Different reverse transcriptase enzymes produce substantial differences in mismatch, arrest and read-through feature content, and the variegated RT-signatures lead to highly variable machine learning performances. As exemplified with N1-methyladenosine (m1A), the choice of reverse transcriptase can strongly change, and increase, the detection rates obtained by prediction, because the individual characteristics differ in importance. These findings help to enable a more precise detection of modified sites by selecting the best enzymes and using a set of specifically optimized reaction conditions, and can therefore be used for future modification studies.

MATERIALS AND METHODS

The synthetic nucleic acids for purine modification analysis (m6A, m6,6A, m1G and m2,2G) were synthesized according to previously published protocols (41,42). The m1A oligonucleotides and all other synthetic nucleic acids were from IBA (Göttingen, Germany). Details including sequence information are given in Supplement Table S1. For the studies, total tRNA from Saccharomyces cerevisiae from Roche was used (Sigma-Aldrich, Germany). The amount and quality of the RNA was verified by UV absorption (NanoDrop 2000, Thermo Fisher Scientific, Germany), polyacrylamide gel electrophoresis (PAGE) and Agilent RNA ScreenTape analysis (for Agilent TapeStation 4200).

Library preparation & sequencing

Libraries were prepared according to a previously published protocol (22,43) with necessary adaptations for the reverse transcription and 3′-tailing steps (details including sequence information of the synthetic nucleic acids used are given in Supplement Table S1). Briefly, total tRNA from Saccharomyces cerevisiae and the synthetic revolver oligos (the corresponding m1A, m1G and m2,2G revolver oligos as synthesized, and the m6A and m6,6A revolver oligos mixed in equal amounts, respectively) were dephosphorylated with FastAP Thermosensitive Alkaline Phosphatase (Thermo Fisher Scientific, USA) as previously described. Then, a pre-adenylated 3′-adapter was ligated to the 3′-end of the dephosphorylated RNA using T4 RNA ligase 2 truncated (New England Biolabs, Germany) and T4 RNA ligase (Thermo Fisher Scientific, USA). After purification with 5′-Deadenylase (New England Biolabs, Germany) and Lambda exonuclease (Thermo Fisher Scientific, USA) to remove non-ligated pre-adenylated adapter, an ethanol precipitation was performed. The RNA pellet was re-dissolved and used for reverse transcription. The reverse transcription step was performed with slight adaptation of the previously published workflow as follows:

Reverse transcription. For the reverse transcription step, we used 13 different reverse transcriptases (see Table 1), and the reaction was performed according to the respective manufacturer's protocol (see Supplement Table S2). In general, the reverse transcription mixture (20 µl) comprised the RT primer (IBA, Germany; see Supplement Table S1 for sequence) at a final concentration of 5 µM and the respective RT reaction buffer (1×). The mixture was heated to 75°C for 5 min for heat denaturation, then chilled on ice. Then, dNTP mix (0.5 mM final concentration) and, according to the protocol, BSA, DTT and/or MgCl2 (see Supplement Table S2) were added to the mixture. After addition of the RT enzyme, the reaction was performed at 45°C for 1 h, with one exception: for the DNA polymerase variant Volcano (RT #13), the reaction was performed at 60°C for 1 h.

Table 1. Reverse transcriptases (RTs)

Reverse transcriptase     Origin    Cat. No.     Supplier
#1  M-MuLV                MMLV      M0253S       New England Biolabs
#2  AMV                   AMV       M0277S       New England Biolabs
#3  ProtoScript® II       MMLV      M0368S       New England Biolabs
#4  GoScript™             MMLV      A5003        Promega
#5  SuperScript™ III      MMLV      18080044     Thermo Fisher Scientific
#6  RevertAid™            MMLV      EP0441       Thermo Fisher Scientific
#7  AccuScript™           MMLV      600089       Agilent
#8  AffinityScript™       MMLV      600107       Agilent
#9  M-MLV                 MMLV      M1701        Promega
#10 MonsterScript™        MMLV      MSTA5110     Epicentre®
#11 EpiScript™            MMLV      ERT12910K    Epicentre®
#12 SuperScript™ IV       MMLV      18090050     Thermo Fisher Scientific
#13 Volcano               Klentaq   #8100S       myPOLS Biotec

After reverse transcription, excess RT primer was digested using Lambda exonuclease (Thermo Fisher Scientific, USA) and Exonuclease I (Thermo Fisher Scientific, USA). Residual dNTPs were dephosphorylated with FastAP Thermosensitive Alkaline Phosphatase (Thermo Fisher Scientific, USA). Next, RNA was degraded by addition of NaOH, the sample was neutralized with acetic acid and the cDNA was ethanol precipitated. The 3′-tailing and ligation steps of the cDNA were performed with slight adaptations to the previously published workflow as follows:

3′-Tailing and ligation of cDNA. For the purine modification studies, adenosine triphosphate (ATP) was used in the tailing step, and a 5′-adapter with TT sequence overhang (5′-adapter strand 2) was used accordingly to create the double-stranded DNA adapter for the second adapter ligation. After re-dissolving the cDNA pellet, the 3′-cDNA tailing was performed using terminal deoxynucleotidyl transferase (TdT) (Thermo Fisher Scientific, USA) in a 10 µl reaction volume. The reaction mixture included TdT buffer (1×), rATP (final concentration: 1.25 mM) and TdT (final concentration: 1 U/µl). The double-stranded DNA adapter was prepared by annealing equimolar amounts of the synthetic single-stranded oligonucleotides 5′-adapter strand 2 and 5′-adapter strand 3 (see Supplement Table S1 for sequences) at 90°C for 5 min, with subsequent cooling to 22°C and further incubation for 10 min. For the side-by-side m1A RT-signature study, cytidine triphosphate (CTP), 5′-adapter strand 1 with GG sequence overhang and 5′-adapter strand 3 were used instead, according to the previously published protocol (43). The ligation of the double-stranded DNA adapter (final concentration 1.25 µM) was then performed overnight at 4°C in a total volume of 40 µl with T4 DNA ligase HC 30 U/µl (Thermo Fisher Scientific, USA) and ATP (Thermo Fisher Scientific, USA) at final concentrations of 1.5 Weiss U/µl DNA ligase and 10 µM ATP in 50 mM Tris–HCl (pH 7.4) and 20 mM MgCl2 reaction buffer. Ligated DNA was then ethanol precipitated, and the ligation product was loaded on a 10% denaturing polyacrylamide gel; size selection between 40 and 150 nt was performed to purify ligated DNA from non-ligated DNA adapters.

After elution and ethanol precipitation, the samples were used for a final PCR amplification with Taq polymerase (Rapidozym, Germany) and P7 and P5 PCR primers. PCR products were ethanol precipitated and loaded on a 10% denaturing polyacrylamide gel for size selection between 150 and 300 nt. After elution, ethanol precipitation and re-dissolving, the libraries were loaded on an Agilent High Sensitivity DNA chip (for Agilent Bioanalyzer 2100) to check for the presence of adapter dimers. Each library was then quantified by fluorometry using the Qubit dsDNA HS assay kit (Thermo Fisher Scientific, USA) and finally sequenced after pooling on a MiSeq in 2 × 75 bp paired-end mode. Demultiplexing was performed and the resulting FastQ files were inspected for a quality check using FastQC.

Data analysis

The sequencing data were processed using multiple bioinformatic analysis workflows. These workflows entail functionalities for standard procedures such as adapter trimming and reference mapping, but also provide algorithms for the extraction of relevant information from the alignment data, automated machine learning and prediction for modification calling, as well as visualization, based on our previously published software package (44). Described below are the workflows used in the context of this study. Of note, these workflows were integrated into the Galaxy platform, whose graphical user interface provides easy accessibility, adjustability and reproducibility (45). A version of this Galaxy platform including many of the workflows used in this study is available for download and is described in Schmidt et al. (46).

Standard workflow

The standard workflow performs automated trimming, mapping and feature extraction and was used to analyze all datasets. In general, the workflow is derived and adapted from Hauenschild et al. (44) and implemented in Galaxy.

Trimming. In the first step, auxiliary sequences such as adapters, unique molecular identifiers (UMIs) and tailing bases were removed from the reads using the Cutadapt software (version 1.16) (47). Due to the specifications of the library preparation, this step was performed separately for forward and reverse reads, as they require different settings. The trimming step consisted of two phases: removal of the Illumina adapters and trimming of the UMI sequences, 10 bases (nine of them random) located at the leftmost (forward read) or rightmost position (reverse read). In addition, the remaining sequences were filtered by length, with the required minimum length set to 10.
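The second trimming phase and the length filter can be illustrated with a short sketch. This is plain Python rather than Cutadapt, adapter removal is not shown, and the function and constant names are hypothetical; only the two values of 10 are taken from the description above.

```python
UMI_LENGTH = 10   # from the text: 10 bases, nine of them random
MIN_LENGTH = 10   # from the text: minimum read length after trimming


def trim_umi(sequence, is_forward):
    """Split off the UMI and apply the length filter.

    Returns (umi, trimmed_read), or None if the trimmed read is too short.
    The UMI sits at the leftmost position of forward reads and at the
    rightmost position of reverse reads.
    """
    if is_forward:
        umi, rest = sequence[:UMI_LENGTH], sequence[UMI_LENGTH:]
    else:
        umi, rest = sequence[-UMI_LENGTH:], sequence[:-UMI_LENGTH]
    if len(rest) < MIN_LENGTH:
        return None  # read discarded by the length filter
    return umi, rest


# Hypothetical reads for illustration only.
forward = trim_umi("ACGTACGTAC" + "GGGTTTCCCAAA", is_forward=True)
reverse = trim_umi("GGGTTTCCCAAA" + "ACGTACGTAC", is_forward=False)
too_short = trim_umi("ACGTACGTAC" + "GGG", is_forward=True)
```

Handling the two read directions in one function mirrors the statement that forward and reverse reads need different settings, although the actual Cutadapt options used by the authors are not given in the text.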

Alignment. Mapping of the reads to the reference sequence was performed using the Bowtie2 alignment software (48). As in the trimming step, mapping was applied separately for forward and reverse reads. The settings for Bowtie2 allowed for one mismatch ('-N 1' option) in a seed sequence of six bases ('-L 6') to account for the high number of expected mismatches from remaining tailing bases and from the high amounts of modified ribonucleotides contained in the underlying tRNA samples. The tailing bases are added enzymatically with TdT at the 3′-end of the cDNA during library preparation, as an anchor for the sequence overhang of the DNA adapter; since an excess is possible, an additional overhang trimming step was included in post-processing after alignment. The two alignment files from forward and reverse reads were stored in BAM format and then merged into a single file. The BAM file was then sorted and indexed using the SAMtools package (49).
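For orientation, the two Bowtie2 settings named above could be passed on the command line roughly as follows. This is a sketch under assumptions: only '-N 1' and '-L 6' come from the text, the index prefix and file names are placeholders, any further options the authors may have used are unknown, and the command is only assembled here, not executed.

```python
def bowtie2_command(index_prefix, fastq_path, sam_path):
    """Assemble a Bowtie2 call for unpaired reads with the described seed settings."""
    return [
        "bowtie2",
        "-N", "1",           # allow one mismatch in the seed (from the text)
        "-L", "6",           # seed length of six bases (from the text)
        "-x", index_prefix,  # reference index prefix (placeholder)
        "-U", fastq_path,    # unpaired input reads (placeholder)
        "-S", sam_path,      # SAM output file (placeholder)
    ]


# Forward and reverse reads are mapped separately, as described above.
forward_cmd = bowtie2_command("yeast_tRNA", "forward_trimmed.fastq", "forward.sam")
reverse_cmd = bowtie2_command("yeast_tRNA", "reverse_trimmed.fastq", "reverse.sam")
```

Each list can be handed to `subprocess.run` on a system where Bowtie2 is installed; conversion to BAM, merging, sorting and indexing with SAMtools would follow as separate steps.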

Post-processing and feature extraction. The alignment data were converted into pileup format, as this format allows relevant information to be extracted for each position in the reference. Remaining tailing bases were removed from the alignment using a custom Python script (overhang trimming). The format was then converted to a tab-separated text file termed 'Profile', containing information on multiple defined features such as the coverage, arrest rate, mismatch rate, mismatch composition and jump rate for each individual position within the reference. The features are defined as follows:

Coverage: Describes the number of reads/nucleobases that were mapped to the position in the reference.

Arrest rate: The arrest rate of a given position x is defined by the relative number of reads that start at the neighboring position x + 1. Since the arrest is given as a percentage in dependence of the coverage, it can also be described as the drop in coverage in relation to the preceding position.

Mismatch rate and mismatch composition (individual misincorporation rates): The mismatch rate is the relative number of nucleobases mapped to a given position that do not match the nucleobase shown in the reference. The Profile not only contains the mismatch rate, but also gives detailed information on the alignment numbers for each type of base (A, C, G, T) and unknown read bases (N). For each individual nucleobase, the amount of mismatch relative to the overall mismatch is also calculated, thus resulting in values for A-, C-, G- and T-mismatch.

Jump rate: A jump is represented by deletions occurring at the position in question or shortly thereafter. Jumps are presented as percentage of deletions relative to the overall coverage.
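The feature definitions above can be made concrete with a small per-position calculation. The sketch below is not the authors' script. It rests on several assumptions that the text leaves open: coverage is taken as the number of mapped A/C/G/T bases, unknown bases (N) are ignored, the arrest rate is normalized to the coverage at position x + 1, and only deletions at the position itself are counted as jumps. All names and numbers are hypothetical.

```python
def position_features(ref_base, base_counts, deletions, starts_next, coverage_next):
    """Compute RT-signature features for one reference position x.

    ref_base      -- reference nucleobase at x ('A', 'C', 'G' or 'T')
    base_counts   -- read-base counts at x, e.g. {'A': 40, 'C': 5, ...}
    deletions     -- number of reads showing a deletion at x
    starts_next   -- number of reads starting at the neighboring position x + 1
    coverage_next -- coverage at position x + 1 (assumed denominator for arrest)
    """
    coverage = sum(base_counts.values())
    mismatches = sum(n for base, n in base_counts.items() if base != ref_base)
    mismatch_rate = mismatches / coverage if coverage else 0.0
    # Share of each non-reference base among all mismatches
    composition = {
        base: (n / mismatches if mismatches else 0.0)
        for base, n in base_counts.items()
        if base != ref_base
    }
    arrest_rate = starts_next / coverage_next if coverage_next else 0.0
    jump_rate = deletions / coverage if coverage else 0.0
    return {
        "coverage": coverage,
        "mismatch_rate": mismatch_rate,
        "mismatch_composition": composition,
        "arrest_rate": arrest_rate,
        "jump_rate": jump_rate,
    }


# Hypothetical counts at a modified adenosine.
example = position_features(
    "A", {"A": 40, "C": 5, "G": 10, "T": 45},
    deletions=5, starts_next=30, coverage_next=120,
)
```

In this made-up example, 60 of 100 mapped bases deviate from the reference, three quarters of those mismatches are T, a quarter of the reads reaching x + 1 stop there, and 5% of the coverage shows a deletion.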

Machine learning and prediction

In order to determine whether a given position in the alignment is modified or not, a machine learning algorithm based on a random forest (RF) model was used (35). Our supervised prediction workflow is modeled on our earlier study on m1A RT-signatures by Hauenschild et al. (22,43) to allow comparison. Here, we used the random forest model integrated into the Python 'scikit-learn' package (version 0.20.2) (50). In general, random forest models generate a multitude of decision trees based on the available training data by separating instances from two predefined and known classes (positive and negative class) according to a number of features describing each instance (mismatch rate, arrest rate, jumps, etc.). For the prediction process (i.e. classifying an instance whose class is not known), an instance is put through every individual decision tree, where it is eventually assigned a class. The instance is then classified according to the 'majority vote' of all individual decision trees.

Because of the special properties of this study, such as the number of modified instances being generally quite low compared to non-modified instances, the model had to be adjusted. As a first measure, for the training process, the machine learning model was given positive and negative classes (i.e. modified and unmodified instances) in a ratio of 1:1. Secondly, the model was enhanced by k-fold stratified cross-validation. This approach splits the available training dataset into k subsets. The random forest is then trained on k − 1 subsets and performs a prediction on the remaining subset. The whole process is performed k times, with a different subset being left out in each iteration. In our studies, we set k to 5. With this approach, the random forest model makes use of all available data, as opposed to simply splitting the data into training and test samples, in which case the test sample would not be used for training. Accordingly, cross-validation is especially useful for datasets of low sample sizes. To counteract any possible biases resulting from the 1:1 ratio of modified and unmodified instances and the resulting reduction of the number of unmodified instances considered in the training process, the whole cross-validation process was repeated 10 times. In each iteration, the unmodified instances were randomly selected.

This approach not only generated a trained random forest as outcome, but also allowed the evaluation of the model's predictive accuracy. Assessment scores for the random forest performance such as the AUC, MCC, sensitivity and specificity were calculated and averaged over the 10 iterations of the training process.
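The resampling scheme described above (1:1 undersampling of unmodified instances, stratified 5-fold cross-validation, 10 repetitions with a fresh random selection each time) can be sketched without any machine learning library. The code below only generates the training and test splits; it is not the authors' scikit-learn workflow, and the function name, the seed and the instance labels are hypothetical.

```python
import random


def balanced_stratified_splits(positives, negatives, k=5, repeats=10, seed=0):
    """Yield (repeat, fold, train, test) with a 1:1 class ratio in every fold.

    Each item in train/test is an (instance, label) pair with label 1 for
    modified and 0 for unmodified. Requires len(negatives) >= len(positives).
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        # Fresh random undersampling of the unmodified class in each repetition
        sampled_negatives = rng.sample(negatives, len(positives))
        shuffled_positives = positives[:]
        rng.shuffle(shuffled_positives)
        # Deal both classes round-robin into k folds to keep them stratified
        folds = [[] for _ in range(k)]
        for i, instance in enumerate(shuffled_positives):
            folds[i % k].append((instance, 1))
        for i, instance in enumerate(sampled_negatives):
            folds[i % k].append((instance, 0))
        for fold in range(k):
            test = folds[fold]
            train = [pair for j, part in enumerate(folds) if j != fold for pair in part]
            yield repeat, fold, train, test


# Hypothetical instances: 20 modified and 500 unmodified positions.
modified = ["mod_%d" % i for i in range(20)]
unmodified = ["unmod_%d" % i for i in range(500)]
splits = list(balanced_stratified_splits(modified, unmodified))
```

In a full workflow, a classifier would be fitted on `train` and scored on `test` for each of the 50 splits, and the scores would then be averaged over the repetitions.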

For the prediction process in this study, we trained a random forest model based on the instances of a specific modification from two independent tRNA datasets and subsequently performed classification on a third tRNA dataset.

RESULTS

Side-by-side comparison of 13 RT-signatures at known m1A sites

To assess the influence of the enzyme on RT-signatures, we used the previously established workflow to detect variations in m1A-related signatures of a total of 13 enzymes (see Table 1). These 13 reverse transcriptases represent a selection of RTs that were commercially available at the beginning of the study (without claiming completeness). Briefly, yeast total tRNA was subjected to RNA-Seq based on a library preparation protocol designed to capture misincorporation events as well as abortive cDNA (43). The reads were mapped onto a reference set consisting of yeast tRNA sequences, compiled from the Modomics (21) and tRNA database (51), containing sequences of known modification status.


Of all m1A sites listed in Modomics and the tRNA database, those for which the coverage was insufficient to extract a statistically relevant RT-signature were excluded from further analysis. At the remaining m1A sites, signatures were extracted which are composed of the following features: (i) the overall mismatch incorporation of nucleotides into the cDNA, (ii) the individual mismatch compositions (plotted in Supplement Figure S1) and (iii) coverage drops between m1A and the position to its 3′ due to RT-arrests. In addition to these previously established features (22), (iv) a new feature termed ‘jump rate’ was established. This refers to a certain amount of deletions in the aligned data, specifically detected at m1A sites. Such jumps are known from DNA polymerases; proposed mechanisms include strand slippage (52,53) and strand misalignments during DNA synthesis (54–56). A comprehensive illustration and the data of the distribution of RT-signature features (i, iii and iv) in the different enzymes at m1A sites are shown in Figure 1A and Supplement Figure S2. Of note, all datasets used contained at least 300 000 reads per RT-enzyme (per replicate), and comparative analyses with progressively smaller datasets indicated no significant changes even at read numbers as low as 50 000 reads (see Supplement Figure S10). From Figure 1A, it is apparent that the 13 reverse transcriptases display a wide distribution of features i, iii and iv, which is bracketed between RT #12, with highest mismatch rate, highest jump rate and lowest arrest rate on the one hand, and RTs #4 and #10, with lowest mismatch rates, lowest jump rate and highest arrest rate, on the other hand. A distribution along the diagonal, which is immediately obvious upon visual inspection, strongly suggests an inverse correlation of arrest and mismatch rates, and indeed, the corresponding Pearson correlation coefficient (PCC) is −0.77. Further correlations, also including the overall average read lengths (see Supplement Figure S3), are shown in the correlation matrix in Figure 1B.
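As a worked illustration of features (i) to (iv), the sketch below derives a signature from per-site read counts. The normalisations follow the figure legends (arrest relative to reads covering the +1 position, mismatch and jump relative to reads covering the modified position); the function name, the input layout and the counts are hypothetical, and details such as how deletions enter the coverage are assumptions.

```python
def rt_signature(cov_site, cov_plus1, mismatches, deletions):
    """Compute RT-signature features for one candidate site.

    cov_site   -- reads covering the candidate position
    cov_plus1  -- reads covering the 3' adjacent (+1) position
    mismatches -- mismatch counts per called base at the site, e.g. {"C": .., "G": .., "T": ..}
    deletions  -- reads showing a deletion ('jump') at the site
    """
    if cov_site <= 0 or cov_plus1 <= 0:
        raise ValueError("coverage must be positive")
    n_mismatch = sum(mismatches.values())
    composition = {
        base: (count / n_mismatch if n_mismatch else 0.0)
        for base, count in mismatches.items()
    }
    return {
        "mismatch_rate": n_mismatch / cov_site,          # feature (i)
        "mismatch_composition": composition,             # feature (ii)
        # Feature (iii): coverage drop from +1 to the site, relative to +1.
        "arrest_rate": max(cov_plus1 - cov_site, 0) / cov_plus1,
        "jump_rate": deletions / cov_site,               # feature (iv)
    }

# Hypothetical counts for a single site.
sig = rt_signature(
    cov_site=800,
    cov_plus1=1000,
    mismatches={"C": 40, "G": 40, "T": 320},
    deletions=40,
)
print(sig)
```

With these counts, 200 of the 1000 reads covering the +1 position do not extend across the site, and half of the reads covering the site carry a mismatch.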

Interestingly, the jump rate shows a positive PCC with the mismatch rate (0.85), and consequently a negative correlation (−0.76) with the arrest rate. On the molecular level, this suggests a mechanistic relation between a generally high elongation activity of a given enzyme on the one hand, and successful bypassing of m1A sites on the other, which is reflected in low arrest rates and correspondingly higher jump and misincorporation rates, as well as in the corresponding correlations with the average read length.
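The Pearson correlation coefficients quoted here follow the usual definition, which can be written out as a short sketch. The per-enzyme values below are invented for illustration and do not reproduce the coefficients reported in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical average rates for four enzymes (fractions, not percentages).
arrest_rates = [0.10, 0.20, 0.30, 0.40]
mismatch_rates = [0.60, 0.45, 0.35, 0.20]

r = pearson(arrest_rates, mismatch_rates)
print(round(r, 3))
```

A value close to −1 indicates that enzymes with high arrest rates tend to show low mismatch rates, which is the pattern described for the diagonal in Figure 1A.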

Random forest performance and feature importance of individual RTs

Based on the known m1A sites in yeast tRNA, the signatures with their characteristic features were used as input for machine learning analysis. As the machine learning model, a random forest implementation was trained on m1A and non-m1A instances to perform binary classification, and thus conduct modification calling without external input of threshold values. The quality of the classification was compared among all 13 RTs by determining the area-under-the-curve value (AUC) of a receiver operating characteristic (ROC) curve. The calculated AUC results from a machine learning approach applying 10 repetitions of a 5-fold cross-validation.

AUC values are frequently used for such quality assessments, as they provide a measure of the balance between sensitivity and specificity in modification calling. To safeguard against inadvertent optimization by the random forest model of parameters not reflected by the AUC, we also calculated the MCC (Matthews correlation coefficient) (40) as an additional quality parameter, which accounts for true and false positives and negatives and serves as a correlation coefficient between observed and predicted binary classifications. In the remainder of the manuscript, we will report and discuss AUC values only, since the correlation between the AUC and MCC results reported throughout this manuscript was 0.9016 (see Supplement Table S3).

Overall, AUCs for each individual RT varied substantially, in a range of 0.9764–0.9964, with the lowest AUC value registered for RT #9 and the highest recorded for RT #12 (see Figure 2 and Supplement Table S3). It is important to stress that even variations at the third and fourth significant digit are extremely relevant for modification calling; in an application of a trained random forest model to an epitranscriptome of 10⁷ adenosines, the difference between 0.998 and 0.999 is projected to cause an additional 10⁴ errors. We therefore inspected the features of the random forest in some detail. Figure 2 and Supplement Table S4 show the feature importance, i.e. the relevance of the individual feature for the decision making in the training process. These include the previously discussed features (i–iv), with the individual misincorporation rates (ii) separated according to the three nucleotides (iiC, iiT, iiG). The variegated levels of feature importance, represented as blue shades, provide interesting insight into the ‘strategy’ by which different RTs (in combination with the corresponding random forest) arrive at optimum prediction performance. This analysis of the feature importance supports the intuitive (and hence anticipated) claim that certain polymerases allow detection of m1A (and probably other modifications) via RT-arrest, while others provide significant information in the form of misincorporation. More specific is the finding that an increasing feature importance of the jump rate (iv) correlates with a decline in importance of the arrest rate (iii). This was most notable for RT #12, which stood out because of the very low importance of its arrest rate, contrasted by high importance of jump and mismatch features. Indeed, higher arrest rates were found to be associated with lower AUC values in general, indicated by a negative PCC (−0.52) between the two values (see Figure 1B). A plausible explanation might be the low coverage generally associated with RT-enzymes of high arrest rate, which, in turn, renders the sites toward the 5′-end less amenable to statistically significant analysis.
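The projection of 10⁴ additional errors can be retraced with a back-of-the-envelope calculation. The sketch below makes the simplifying assumption, implicit in the estimate, that the performance score is read as the fraction of correctly called sites; an AUC is not literally an accuracy, so this is an order-of-magnitude illustration only.

```python
def projected_errors(n_sites, score):
    """Expected number of miscalled sites if `score` is read as fraction correct."""
    return n_sites * (1.0 - score)

n_adenosines = 10**7  # order of magnitude stated in the text

errors_low = projected_errors(n_adenosines, 0.998)
errors_high = projected_errors(n_adenosines, 0.999)
extra = errors_low - errors_high
print(round(errors_low), round(errors_high), round(extra))
```

Under this reading, a change of 0.001 in the score corresponds to 10⁷ × 0.001 = 10⁴ sites.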

In contrast to the arrest rate, the correlation of mismatch and jump rates with the AUC was positive (0.85 and 0.78). Feature importance of the individual mismatches was very low to negligible for C and G mismatch. Only T mismatch levels played a role in the machine learning process, as is already implied by the prominent levels of T mismatch at m1A positions (≈50%) (plotted in Supplement Figure S1).
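The legend of Figure 2 describes feature importance as the mean loss in classification accuracy when the values of a feature are permuted. A minimal sketch of that idea is given below; the stand-in classifier, the toy instances and the number of permutations are assumptions for illustration and do not correspond to the trained random forests of this study.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of instances for which the model returns the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean loss in accuracy after shuffling one feature across instances."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    losses = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        losses.append(baseline - accuracy(model, permuted, labels))
    return sum(losses) / n_repeats

def toy_model(row):
    """Hypothetical stand-in classifier that only looks at the mismatch rate."""
    return int(row["mismatch_rate"] > 0.3)

# Toy instances: four modified (label 1) and four unmodified (label 0) sites.
rows = [
    {"mismatch_rate": 0.55, "arrest_rate": 0.20},
    {"mismatch_rate": 0.48, "arrest_rate": 0.35},
    {"mismatch_rate": 0.60, "arrest_rate": 0.10},
    {"mismatch_rate": 0.52, "arrest_rate": 0.30},
    {"mismatch_rate": 0.02, "arrest_rate": 0.05},
    {"mismatch_rate": 0.01, "arrest_rate": 0.25},
    {"mismatch_rate": 0.03, "arrest_rate": 0.15},
    {"mismatch_rate": 0.02, "arrest_rate": 0.02},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

importance = {
    f: permutation_importance(toy_model, rows, labels, f)
    for f in ("mismatch_rate", "arrest_rate")
}
print(importance)
```

Because the stand-in classifier ignores the arrest rate, permuting that feature costs no accuracy, whereas permuting the mismatch rate does; this mirrors how a feature that a model does not rely on receives low importance.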

Figure 1. (A) Scatter plot showing the average m1A signatures of 13 RTs at 26 m1A sites in yeast cytosolic tRNA. Error bars show standard deviations of arrest and mismatch rates across 3 sequencing runs, i.e. triplicates. The colour-code represents the jump rate. Arrest rate percentages refer to the reads covering the 3′ adjacent position of m1A (+1 position). Mismatch and jump rate percentages refer to the reads covering the m1A position. (B) Pearson correlation coefficient (PCC) matrix heat map. Evaluation of interrelations and mutual influences by showing the positive and negative correlations between the RT-signature features arrest, mismatch and jump rate as well as the random forest performance measure AUC (Area Under Curve) and the read length (see also Supplement Figure S9 for additional information on TGIRT and HIV-RT).

Figure 2. Bar plot with random forest performance and feature importance by RT. Classification performance is represented as Area Under Curve (AUC) of Receiver Operating Characteristic (ROC). Colour-code represents the feature importance for the classification. Data was averaged from triplicates. Jump = jump rate. C, T, G = mismatch components, which add up to 100%. Mismatch = mismatch rate. Arrest = arrest rate. Percentages represent feature importance in random forest analysis, i.e. the mean loss in classification accuracy if the values of the respective feature are permuted (see also Supplement Figure S9 for additional information on TGIRT and HIV-RT).

While the above considerations result from perusal of the ensemble of polymerases, a closer look at the features of the top three performing enzymes revealed interesting details with potential impact on prediction quality. RT #11, for example, despite being rather processive, uses arrest events (15.952%; dark blue colour in Figure 2) to almost the same degree as mismatches (17.112%), whereas RT #3 also uses both features about equally (arrest 18.253% and mismatch 16.933%), but with a slight preference for arrest. Finally, RT #12 is the most processive among all RTs, which presumably leads to a high percentage of jumps, as already pointed out (see Figure 1A). The latter feature is exploited by the random forest, which heavily relies on jumps (16.020%) for modification calling (see Figure 2 and Supplement Table S4). In summary, and against the general trend that associates success with processivity, the three best enzymes according to their individual random forest performance use three different strategies for optimal modification calling results. We therefore selected these polymerases with their different characteristics, and compared their behaviour on other purine modifications as well (vide infra). To guard against a training bias potentially originating from the use of m1A instances in the structural context of a tRNA, we performed experiments using synthetic m1A-containing oligos (Supplement Table S9) designed according to the ‘revolver’ concept, i.e. with a degenerate nucleotide at the +1 position to ablate the previously identified influence of this position on mismatch composition (22). Performance parameters obtained from random forests trained with 26 native m1A sites were indistinguishable from those trained with an additional 56 synthetic sites (Supplement Table S10), suggesting that the results were not impacted by any potential overfitting.

To account for other m1A detection data published during or after our study, mainly based on two other reverse transcriptases, TGIRT (27,28) and HIV-RT (9), we retrospectively collected data on m1A RT-signatures and the m1A random forest performance for these RTs to allow evaluation and comparison (see Supplement Figure S9 for experimental and analytical details). The reverse transcriptases were subjected to the same library preparation and data analysis workflow, whereby for TGIRT the manufacturer’s buffer composition (RT #14a; as applied in (28)) as well as an adapted buffer composition (as applied in (27)) were tested. The adapted TGIRT (RT #14b), as well as the wildtype HIV-RT (RT #15), led to mismatch and arrest patterns at m1A sites which were in the range of RT #12, while lacking the distinct jump characteristics. In comparison, RT #14a provides RT-signatures with comparable mismatch rates, also lacking increased jump rates, but with much higher arrest rates (see Supplement Figure S9A and B). The average read length provided by RT #14b exceeds the values of all other RTs, indicating a high read-through capability. RT #15 and RT #14a are in line with the values of the other RTs. In terms of machine learning performance, RT #14b achieves an AUC in the range of the best performing RT #12, while RT #14a and RT #15 do not reach the AUC values of the best RTs (see Supplement Figure S9D). The behaviour of these RTs with respect to feature importance in machine learning corresponds to the findings of our previous RT analysis, in that more distinct features lead to stronger feature importance in random forest models, while weak or missing features play only a minor role.

Machine learning for modified guanosines

Similar to m1A, the methylated purines m1G and m22G possess one methyl-group pointed towards the Watson–Crick face. The fact that m1G and m22G are less abundant in tRNA than m1A presents a problem for valid training of a random forest model, which requires an adequate number of training instances for unbiased predictions. To increase the number of usable instances, we pooled m1G and m22G sites in a first training run, which was designed to separate the modified guanosines from non-modified guanosines (see Figure 3 and Supplement Figure S4A). This left the option to distinguish between m1G and m22G at identified sites later on, using divergence of particular features in their respective signatures. Analysis was conducted on RNA-Seq data from the three reverse transcriptases (#3, #11 and #12, as identified above) as well as from RT #5 for comparison with previous data (22). In full analogy to the previous evaluation of m1A sites, the corresponding reads were mapped onto a reference set consisting of tRNA sequences from Saccharomyces cerevisiae, compiled from the Modomics (21) and tRNA database (51). The data was inspected for signatures at known m1G and m22G sites. We found that many m1G and m22G sites, which are typically present near the 5′-end at positions 9 and 26, displayed low coverage as a result of their location downstream of other RT-blocking modifications such as m1A or bulky modifications at position 34 and 37.

Figure 3. Prediction scheme for guanosine modifications. A first random forest model was trained (10 repetitions of 5-fold cross-validation) separately by RT on m1G & m22G sites from two replicates of total tRNA samples from Saccharomyces cerevisiae (RNA data) to distinguish modified (positive class) from unmodified (negative class) guanosines. The trained random forest was then used to make a prediction on guanosines from a corresponding third yeast data set. In this case, the prediction is a binary classification, the two classes being ‘m1G, m22G’ or ‘non-modified Gs’. Guanosines which were classified as modified guanosines (m1G and m22G) in the first prediction were then written into an output file (RNA data modified Gs). Then, two random forest models were trained (10 repetitions of 5-fold cross-validation) separately by RT on either m1G or m22G sites from two replicates of total tRNA samples from S. cerevisiae (RNA data) as positive class to separate these guanosine modifications from each other (the respective other modification together with non-modified Gs (1:1 ratio) served as negative class). The trained models were then used to make a prediction on the output file from the first prediction (RNA data modified Gs).

We prepared three independent RNA-Seq datasets from three replicates of yeast total tRNA for each of the four reverse transcriptases, each total tRNA dataset (each replicate) containing 988 guanosines, including 16 m1G and 22 m22G sites. In general, two replicates were used for random forest training and the remaining third replicate for testing (RF #1). In order to collect a sufficient number of sites for random forest training, we set a relatively low coverage threshold of 20 reads. After removal of the low-coverage instances from the two training datasets, between 22 (68.8%) and 32 (100%) instances for m1G (dependent on the RT used; see Supplement Table S5) and between 20 (45.5%) and 38 (86.4%) instances for m22G passed this coverage threshold (e.g. 32 m1G and 38 m22G for RT #12, see Supplement Figure S4B). Along with a matching number of unmodified guanosines as negatives, these were used for training a random forest model in full analogy to those used for m1A. In a first random forest model, we trained and tested for a joint separation of m1G and m22G from other guanosines, so m1G and m22G sites were used together for random forest training. Overall, the model prediction scored AUC values between 0.9439 and 0.9801 for the different RTs (e.g. 0.9801 for RT #12), which compares favorably to our first runs on m1A (22). After filtering to remove low-coverage instances (coverage >20), we could correctly classify 80% of the m1G instances and 90% of the m22G instances (average percentages for all RTs used) in the respective test datasets (see Supplement Table S5).
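The coverage filter and the quoted percentages can be retraced with a short sketch. Two assumptions are made here: the threshold is read as ‘at least 20 reads’ (the text also states ‘coverage >20’), and the percentages are taken relative to the instances in the two training replicates, i.e. 2 × 16 = 32 for m1G and 2 × 22 = 44 for m22G, which appears consistent with the reported values. Site names and coverages are hypothetical.

```python
MIN_COVERAGE = 20  # reads; threshold stated in the text

def passes_coverage(instance, min_coverage=MIN_COVERAGE):
    """Assumption: an instance is kept if its coverage is at least the threshold."""
    return instance["coverage"] >= min_coverage

def retained_percentage(n_passed, n_sites_per_replicate, n_replicates=2):
    """Share of instances from the training replicates that pass the filter."""
    return 100.0 * n_passed / (n_sites_per_replicate * n_replicates)

# Hypothetical instances for illustration only.
sites = [
    {"site": "G9", "coverage": 150},
    {"site": "G26", "coverage": 12},
    {"site": "G37", "coverage": 20},
]
kept = [s["site"] for s in sites if passes_coverage(s)]

m1g_low = retained_percentage(22, 16)    # 22 of 32 m1G instances
m22g_low = retained_percentage(20, 22)   # 20 of 44 m22G instances
m22g_high = retained_percentage(38, 22)  # 38 of 44 m22G instances
print(kept, f"{m1g_low:.2f} {m22g_low:.2f} {m22g_high:.2f}")
```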

Besides m1G and m22G, other instances of Watson–Crick blocking modifications were classified as modified, filtering out non-modified guanosines in the process. An interesting instance of a ‘false’ positive concerned position 37 in yeast tRNAPhe, namely the wybutosine modification (see Supplement Figure S5). This hypermodification is generated in a sophisticated multistep biosynthesis starting with m1G at G37 (57). Clearly, the ‘Y-base’ features a blocked Watson–Crick face as well, and some similarities of its RT-signature to m1G and m22G were to be expected. We conclude that purines with modifications blocking the Watson–Crick face can be successfully identified in a sequence space of 10⁴–10⁵.

In keeping with Figure 3 (see also Supplement Figure S4A), a second step (RF #2) was conducted with two independent random forests, which were trained separately on m1G and m22G, respectively. The respective other modification together with non-modified guanosines served as negative class (1:1 composition). The model predictions scored AUC values between 0.9226 and 0.9647 for m1G and between 0.9433 and 0.9820 for m22G, depending on the RT (e.g. 0.9647 for m1G and 0.9801 for m22G with RT #12). The trained models were then used in an attempt to distinguish the m1G and m22G instances identified in the first-step prediction and to reduce the number of false positives; to this end, the trained random forest models were tested on all instances classified as modified in the first step RF #1 (e.g. 67 instances for RT #12). On average, the four different RTs could identify and correctly classify 93.8% of the instances for m1G (see Supplement Table S6) and 88.7% of the instances for m22G (see Supplement Table S7) from the sites previously classified as modified (in RF #1) (see Supplement Figure S4B). With the second random forest model, the number of false positives could be reduced for all RTs. Despite this, the resulting classification did not completely separate the two modified guanosines from each other and still included a certain number of false positives for the respective modification. Even though in the case of RT #5 and #12 all m1G and m22G sites from the first step were successfully retrieved and separated, the number of false positives still did not allow an accurate detection. Given that we had to set a relatively low coverage threshold, we wondered if low coverage might blur the characteristics in the RT-signatures to a degree that the random forest model suffers in prediction quality. Indeed, this is in line with an analysis of pseudouridine modification calling after CMCT treatment (58), and we thus hypothesized that the low coverage might be responsible for the lack of distinction between the RT-signatures of m22G and m1G.
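The two-step scheme (RF #1 followed by the two RF #2 models) can be summarised as a simple cascade. In the sketch below, the three classifiers are hypothetical threshold rules standing in for trained random forests, and the way conflicting second-step calls are reported as ‘unresolved’ is an assumption; the text does not specify how such cases were handled.

```python
def two_step_call(instances, rf1, rf_m1g, rf_m22g):
    """Step 1: modified vs unmodified. Step 2: one-vs-rest calls on step-1 positives."""
    calls = {}
    for name, features in instances.items():
        if not rf1(features):
            calls[name] = "unmodified"
            continue
        is_m1g, is_m22g = rf_m1g(features), rf_m22g(features)
        if is_m1g and not is_m22g:
            calls[name] = "m1G"
        elif is_m22g and not is_m1g:
            calls[name] = "m22G"
        else:
            # Both or neither second-step model fired (assumed handling).
            calls[name] = "modified (unresolved)"
    return calls

# Hypothetical stand-ins for the trained models; thresholds are illustrative.
def rf1(f):
    return f["mismatch_rate"] > 0.1 or f["arrest_rate"] > 0.3

def rf_m1g(f):
    return f["mismatch_rate"] < 0.5

def rf_m22g(f):
    return f["mismatch_rate"] > 0.3

instances = {
    "G9": {"mismatch_rate": 0.20, "arrest_rate": 0.50},
    "G26": {"mismatch_rate": 0.70, "arrest_rate": 0.60},
    "G10": {"mismatch_rate": 0.40, "arrest_rate": 0.50},
    "G50": {"mismatch_rate": 0.01, "arrest_rate": 0.02},
}
calls = two_step_call(instances, rf1, rf_m1g, rf_m22g)
print(calls)
```

The ‘unresolved’ outcome corresponds to the situation described above, in which overlapping signatures prevent a clean separation of the two modified guanosines.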

Synthetic RNAs yield clear RT-signatures

To verify this hypothesis, we synthesized and tested short oligoribonucleotides of a type previously published for m1A analyses (22). By analogy, these oligos are referred to as ‘revolver’ oligos according to the so-called revolver assay from our previous study (see Figure 3 in Hauenschild et al. (22)). They contained a central position occupied by the modified nucleotide under investigation, whereas the remaining sequence was identical. Because of the known influence of the neighbouring base (+1 position, 3′ to the modification), four derivatives were designed and synthesized, featuring all four possible neighboring bases downstream of the respective modification. Figure 4 shows signals of the four enzymes, arranged for side-by-side comparison of m1A, m1G and m22G. Data on m62A, the fourth methylated purine with a blocked Watson–Crick edge, were omitted here, but are shown in the supplement (see Supplement Figure S6). In brief, our data for synthetic m62A revolver oligos confirmed our previous observations at two residues in yeast 18S rRNA (22), showing inconsistent RT-signatures. Therefore, modification calling based on machine learning models is not applicable, and no further investigations of this modification were performed. In passing, we also verified that m2G and m6A did not produce any appreciable RT-signature (see Supplement Figure S7), which is in keeping with the methyl group pointing away from the Watson–Crick face during reverse transcription.

The two guanosine modifications m1G and m22G produced consistent characteristic signatures for the different selected reverse transcriptases in the revolver oligo analyses. In general, the signatures for m1G and m22G show high arrest rates, which were comparable to that of m1A. RT #3 and RT #5 featured the highest values, while the lowest values originated from the highly processive RT #12 (see Supplement Figure S8A). Interestingly, for each RT-enzyme, the mismatch rate at m22G sites was higher than at m1G sites, such that one might anticipate a successful application of this feature in a random forest model (compare feature importance in Figure 2). A potential reason for the failure of our random forest approach became apparent upon comparison of the individual modified sites in native tRNA and in the synthetic revolver oligos. Figure 5 shows plots of arrest rate versus mismatch rate for individual instances (technical triplicates for revolver oligos) of the four RT enzymes. The results are colour-coded to allow visual distinction of single instances of m1G versus m22G. Visual inspection clearly shows that the oligo-derived signals allow clear distinction between m1G and m22G in the identical sequence context of the revolver assay. In contrast, signals from native tRNAs only show a general trend of m22G towards higher mismatch and arrest rates (see also Supplement Figure S8B), but lack clear separation from m1G instances, thereby providing a plausible hypothesis as to why the machine learning approach might be impractical using only native tRNA.

Figure 4. Examples for RT-signatures of m1A, m1G and m22G by RT from revolver oligo analyses and the expected structural impairment of the Watson–Crick base-pairing. Graphs from the revolver oligo with a neighboring A, 3′ adjacent (+1 position) to the modified site at position 9, are shown. Sites with error rates of more than 10% are highlighted with yellow arrows. Colored bars indicate the nature of the reads. The mismatch rate is depicted as a black cross and the arrest rate as a red line. The modified site is shown at position 9 in the middle of the considered sequence. In general, arrest rate percentages refer to the reads covering the 3′ adjacent position of m1A/m1G/m22G (+1 position). Mismatch rate percentages refer to the reads covering the modified position. Note that statements on average values stated in the text may differ from these individual signatures.

Figure 5. m1G and m22G RT-signature comparison for the analysed reverse transcriptases RT #3, #5, #11 and #12. (A) Revolver oligo. Dot plots of mismatch and arrest RT-signatures at m22G (dots) and m1G (triangles) sites in revolver oligos for base configurations guanosine (orange), cytidine (blue), uridine (red, T in mapping profile) and adenosine (green) at position +1. Mismatch and arrest rates are given in percentage. Data was averaged from triplicates; error bars show standard deviations of arrest and mismatch rates. (B) Total tRNA from Saccharomyces cerevisiae. Dot plots of mismatch and arrest RT-signatures at m22G (black dots) and m1G (gray triangles) sites in total tRNA. Data was averaged from triplicates; error bars show standard deviations of arrest and mismatch rates. m1G and m22G sites which are present in all three total tRNA replicates and show a coverage of at least 20 reads in at least two replicates are shown (see Supplement Table S8 for more details). In general, arrest rate percentages refer to the reads covering the 3′ adjacent position of m1G/m22G (+1 position). Mismatch rate percentages refer to the reads covering the modified position.

DISCUSSION

The central objective of this work concerns the performance of machine learning approaches for automated modification calling. We advocate the use of, e.g., a random forest model as an unbiased entity, which determines threshold values without human input, yet leaves open the possibility to read out this parameter a posteriori. Here, we compared 13 commercially available reverse transcriptases of different origin, based on their individual behaviour at m1A sites in tRNA of S. cerevisiae. The presented data show large variance of RT-signatures, illustrated by the differently expressed features arrest and mismatch. In addition, some of the reverse transcriptases show an increased susceptibility to deletions at modified sites, termed jumps, which we used to extend the RT-signature feature set. We taught these characteristics to a random forest model for supervised prediction and identification. In addition, to compare our selected RTs with TGIRT and HIV-RT, used in other m1A detection studies, we retrospectively collected data about their m1A-dependent RT-signatures and their application in machine learning models. Interestingly, we found strong correlations between the different RT-signature characteristics and the random forest performance. In particular, higher read-through, reached by increased mismatch and jump rates, led to higher performance. Vice versa, increased arrest rates impeded distinguishability and thus detectability. Inadvertent arrest, occurring due to other modifications, RNA degradation or secondary and tertiary RNA structures, may introduce strong background noise, thus complicating and possibly even hindering a prediction. Indeed, especially in order to distinguish modifications from each other, a test setup should be chosen that aims at maximally diverse RT-signatures and therefore provides signatures with the lowest possible arrest rates. Despite a highly significant improvement in the AUC values, from previously 0.94 (22) to now 0.9964, the sobering truth is that, in the present state, the signal quality does not allow meaningful mapping of eukaryotic transcriptomes, which contain in the order of 10⁷ adenosines. The problem is compounded by the fact that a significant number of m1A sites could plausibly be modified sub-stoichiometrically, which would diminish signal strength. The only dataset featuring a potentially sufficient statistical basis would be that used by Grozhik et al. (29,30), who, significantly, conclude a very low number of m1A occurrences in human mRNAs. In further experiments, we then applied the three best performing RTs to other purine modifications with chemical alterations on their Watson–Crick face, very likely impeding cDNA synthesis, to evaluate the hypothesis that, in general, every modification can be distinguished and detected. Indeed, our data on m1G and m22G instances suggest that the apparent limitations are most likely due to coverage at the analysed modified sites and that, with an optimized experimental setup, e.g. selection of the reverse transcriptase and sufficient sequencing output, the modifications can be predicted and differentiated. Our study also indicates that synthetic oligonucleotides provide clearer RT-signatures compared to the more variegated patterns from native RNA, which has to be considered. Taking all these findings into account, we could show that a prospective large-scale modification prediction depends on a carefully selected workflow, including an optimized experimental setup and unbiased, comparable data evaluation. Considering that the distinctness of RT-signatures is the major key to a strong and successful prediction, thorough RT screening and reaction setup for specific pattern generation at modified sites will help the further development of modification calling pipelines.


ABBREVIATIONS

A selected list of abbreviations is available in the Supplementary Data.

DATA AVAILABILITY

The bioinformatics modules used for this study and implemented in Galaxy (46) are available in the GitHub repository (https://github.com/HelmGroup/Galaxy modification calling).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We are grateful to Lilia Ayadi and Valerie Igel-Bourguignon (EpiRNA-Seq Core Facility, UMS2008 IBSLor, CNRS-Université de Lorraine-INSERM, Nancy, France) for excellent technical assistance in MiSeq sequencing. We further thank Dr Lyudmil Tserovski and Dr Ralf Hauenschild for their contributions during early stages of this project.

FUNDING

DFG [SPP1784, HE3397/13-2 and HE3397/14-2 to M.H., HO4436/2-2 to C.H.]; Austrian Science Fund (FWF) [P27947 and P31691 to R.M.]; FRCR funding EpiARN from Grand Est Region, France (to Y.M.); EPITRAN COST Initiative [CA16120]. The open access publication charge for this paper has been waived by Oxford University Press – NAR Editorial Board members are entitled to one free paper per year in recognition of their work on behalf of the journal.

Conflict of interest statement. None declared.

REFERENCES1. Temin,H.M. and Mizutani,S. (1970 ) RNA-dependent DNA

polymerase in virions of Rous sarcoma virus . Nature , 226 , 1211–1213 .

2. Baltimore,D. (1970) RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature, 226, 1209–1211.

3. Mayer,G., Muller,J. and Lunse,C.E. (2011) RNA diagnostics: real-time RT-PCR strategies and promising novel target RNAs. Wiley Interdiscip. Rev. RNA, 2, 32–41.

4. Wang,Z., Gerstein,M. and Snyder,M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10, 57–63.

5. Konishi,A., Yasukawa,K. and Inouye,K. (2012) Improving the thermal stability of avian myeloblastosis virus reverse transcriptase alpha-subunit by site-directed mutagenesis. Biotechnol. Lett., 34, 1209–1215.

6. Tanese,N. and Goff,S.P. (1988) Domain structure of the Moloney murine leukemia virus reverse transcriptase: mutational analysis and separate expression of the DNA polymerase and RNase H activities. PNAS, 85, 1777–1781.

7. Mohr,S., Ghanem,E., Smith,W., Sheeter,D., Qin,Y., King,O., Polioudakis,D., Iyer,V.R., Hunicke-Smith,S., Swamy,S. et al. (2013) Thermostable group II intron reverse transcriptase fusion proteins and their use in cDNA synthesis and next-generation RNA sequencing. RNA, 19, 958–970.

8. Hu,W.S. and Hughes,S.H. (2012) HIV-1 reverse transcription. Cold Spring Harb. Perspect. Med., 2, a006882.

9. Zhou,H., Rauch,S., Dai,Q., Cui,X., Zhang,Z., Nachtergaele,S., Sepich,C., He,C. and Dickinson,B.C. (2019) Evolution of a reverse transcriptase to map N(1)-methyladenosine in human messenger RNA. Nat. Methods, 16, 1281–1288.

10. Myers,T.W. and Gelfand,D.H. (1991) Reverse transcription and DNA amplification by a Thermus thermophilus DNA polymerase. Biochemistry, 30, 7661–7666.

11. Barnes,W.M. (1992) The fidelity of Taq polymerase catalyzing PCR is improved by an N-terminal deletion. Gene, 112, 29–35.

12. Blatter,N., Bergen,K., Nolte,O., Welte,W., Diederichs,K., Mayer,J., Wieland,M. and Marx,A. (2013) Structure and function of an RNA-reading thermostable DNA polymerase. Angew. Chem. Int. Ed. Engl., 52, 11935–11939.

13. Sauter,K.B. and Marx,A. (2006) Evolving thermostable reverse transcriptase activity in a DNA polymerase scaffold. Angew. Chem. Int. Ed. Engl., 45, 7633–7635.

14. Aschenbrenner,J. and Marx,A. (2016) Direct and site-specific quantification of RNA 2′-O-methylation by PCR with an engineered DNA polymerase. Nucleic Acids Res., 44, 3495–3502.

15. Aschenbrenner,J., Werner,S., Marchand,V., Adam,M., Motorin,Y., Helm,M. and Marx,A. (2018) Engineering of a DNA polymerase for direct m6A sequencing. Angew. Chem. Int. Ed., 57, 417–421.

16. Byrne,A., Beaudin,A.E., Olsen,H.E., Jain,M., Cole,C., Palmer,T., DuBois,R.M., Forsberg,E.C., Akeson,M. and Vollmers,C. (2017) Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun., 8, 16027.

17. Garalde,D.R., Snell,E.A., Jachimowicz,D., Sipos,B., Lloyd,J.H., Bruce,M., Pantic,N., Admassu,T., James,P., Warland,A. et al. (2018) Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods, 15, 201–206.

18. Liu,H., Begik,O., Lucas,M.C., Ramirez,J.M., Mason,C.E., Wiener,D., Schwartz,S., Mattick,J.S., Smith,M.A. and Novoa,E.M. (2019) Accurate detection of m(6)A RNA modifications in native RNA sequences. Nat. Commun., 10, 4079.

19. Smith,A.M., Jain,M., Mulroney,L., Garalde,D.R. and Akeson,M. (2019) Reading canonical and modified nucleobases in 16S ribosomal RNA using nanopore native RNA sequencing. PLoS One, 14, e0216709.

20. Helm,M. and Motorin,Y. (2017) Detecting RNA modifications in the epitranscriptome: predict and validate. Nat. Rev. Genet., 18, 275–291.

21. Boccaletto,P., Machnicka,M.A., Purta,E., Piatkowski,P., Baginski,B., Wirecki,T.K., de Crecy-Lagard,V., Ross,R., Limbach,P.A., Kotter,A. et al. (2018) MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res., 46, D303–D307.

22. Hauenschild,R., Tserovski,L., Schmid,K., Thuring,K., Winz,M.L., Sharma,S., Entian,K.D., Wacheul,L., Lafontaine,D.L., Anderson,J. et al. (2015) The reverse transcription signature of N-1-methyladenosine in RNA-Seq is sequence dependent. Nucleic Acids Res., 43, 9950–9964.

23. Helm,M., Brule,H., Degoul,F., Cepanec,C., Leroux,J.P., Giege,R. and Florentz,C. (1998) The presence of modified nucleotides is required for cloverleaf folding of a human mitochondrial tRNA. Nucleic Acids Res., 26, 1636–1643.

24. Helm,M., Giege,R. and Florentz,C. (1999) A Watson-Crick Base-Pair-Disrupting Methyl Group (m1A9) is sufficient for cloverleaf folding of human mitochondrial tRNA(lys). Biochemistry, 38, 13338–13346.

25. Lempereur,L., Nicoloso,M., Riehl,N., Ehresmann,C., Ehresmann,B. and Bachellerie,J.P. (1985) Conformation of yeast 18S rRNA. Direct chemical probing of the 5′ domain in ribosomal subunits and in deproteinized RNA by reverse transcriptase mapping of dimethyl sulfate-accessible. Nucleic Acids Res., 13, 8339–8357.

26. Motorin,Y., Muller,S., Behm-Ansmant,I. and Branlant,C. (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol., 425, 21–53.

27. Li,X., Xiong,X., Zhang,M., Wang,K., Chen,Y., Zhou,J., Mao,Y., Lv,J., Yi,D., Chen,X.W. et al. (2017) Base-Resolution mapping reveals distinct m(1)A Methylome in nuclear- and Mitochondrial-Encoded transcripts. Mol. Cell, 68, 993–1005.

28. Safra,M., Sas-Chen,A., Nir,R., Winkler,R., Nachshon,A., Bar-Yaacov,D., Erlacher,M., Rossmanith,W., Stern-Ginossar,N. and Schwartz,S. (2017) The m1A landscape on cytosolic and mitochondrial mRNA at single-base resolution. Nature, 551, 251–255.

29. Grozhik,A.V., Olarerin-George,A.O., Sindelar,M., Li,X., Gross,S.S. and Jaffrey,S.R. (2019) Antibody cross-reactivity accounts for widespread appearance of m(1)A in 5′UTRs. Nat. Commun., 10, 5126.

30. Helm,M., Lyko,F. and Motorin,Y. (2019) Limited antibody specificity compromises epitranscriptomic analyses. Nat. Commun., 10, 5669.

31. Lambowitz,A.M. and Zimmerly,S. (2011) Group II introns: mobile ribozymes that invade DNA. Cold Spring Harb. Perspect. Biol., 3, a003616.

32. Zheng,G., Qin,Y., Clark,W.C., Dai,Q., Yi,C., He,C., Lambowitz,A.M. and Pan,T. (2015) Efficient and quantitative high-throughput tRNA sequencing. Nat. Methods, 12, 835–837.

33. Ryvkin,P., Leung,Y.Y., Silverman,I.M., Childress,M., Valladares,O., Dragomir,I., Gregory,B.D. and Wang,L.S. (2013) HAMR: high-throughput annotation of modified ribonucleotides. RNA, 19, 1684–1692.

34. Cortes,C. and Vapnik,V. (1995) Support-Vector networks. Mach. Learn., 20, 273–297.

35. Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.

36. Stehman,S.V. (1997) Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ., 62, 77–89.

37. Fawcett,T. (2006) An introduction to ROC analysis. Pattern Recogn. Lett., 27, 861–874.

38. Hanley,J.A. and McNeil,B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.

39. Bradley,A.P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn., 30, 1145–1159.

40. Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.

41. Hobartner,C., Kreutz,C., Flecker,E., Ottenschlager,E., Pils,W., Grubmayr,K. and Micura,R. (2003) The synthesis of 2′-O-[(triisopropylsilyl)oxy] methyl (TOM) phosphoramidites of methylated ribonucleosides (m(1)G, m(2)G, m(2)(2)G, m(1)I, m(3)U, m(4)C, m(6)A, m(2)(6)A) for use in automated RNA solid-phase synthesis. Monatsh. Chem., 134, 851–873.

42. Micura,R., Pils,W., Hobartner,C., Grubmayr,K., Ebert,M.O. and Jaun,B. (2001) Methylation of the nucleobases in RNA oligonucleotides mediates duplex-hairpin conversion. Nucleic Acids Res., 29, 3997–4005.

43. Tserovski,L., Marchand,V., Hauenschild,R., Blanloeil-Oillo,F., Helm,M. and Motorin,Y. (2016) High-throughput sequencing for 1-methyladenosine (m(1)A) mapping in RNA. Methods, 107, 110–121.

44. Hauenschild,R., Werner,S., Tserovski,L., Hildebrandt,A., Motorin,Y. and Helm,M. (2016) CoverageAnalyzer (CAn): A tool for inspection of modification signatures in RNA sequencing profiles. Biomolecules, 6, 42.

45. Afgan,E., Baker,D., Batut,B., van den Beek,M., Bouvier,D., Cech,M., Chilton,J., Clements,D., Coraor,N., Gruning,B.A. et al. (2018) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res., 46, W537–W544.

46. Schmidt,L., Werner,S., Kemmer,T., Niebler,S., Kristen,M., Ayadi,L., Johe,P., Marchand,V., Schirmeister,T., Motorin,Y. et al. (2019) Graphical workflow system for modification calling by machine learning of reverse transcription signatures. Front. Genet., 10, 876.

47. Martin,M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17, 10.

48. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.

49. Li,H., Handsaker,B., Wysoker,A., Fennell,T., Ruan,J., Homer,N., Marth,G., Abecasis,G. and Durbin,R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.

50. Pedregosa,F., Varoquaux,G., Gramfort,A., Michel,V., Thirion,B., Grisel,O., Blondel,M., Prettenhofer,P., Weiss,R., Dubourg,V. et al. (2011) Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.

51. Juhling,F., Morl,M., Hartmann,R.K., Sprinzl,M., Stadler,P.F. and Putz,J. (2009) tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res., 37, D159–D162.

52. Streisinger,G., Okada,Y., Emrich,J., Newton,J., Tsugita,A., Terzaghi,E. and Inouye,M. (1966) Frameshift mutations and the genetic code. This paper is dedicated to Professor Theodosius Dobzhansky on the occasion of his 66th birthday. Cold Spring Harb. Symp. Quant. Biol., 31, 77–84.

53. Banavali,N.K. (2013) Partial base flipping is sufficient for strand slippage near DNA duplex termini. J. Am. Chem. Soc., 135, 8274–8282.

54. Kunkel,T.A. and Soni,A. (1988) Mutagenesis by transient misalignment. J. Biol. Chem., 263, 14784–14789.

55. Fujii,S., Akiyama,M., Aoki,K., Sugaya,Y., Higuchi,K., Hiraoka,M., Miki,Y., Saitoh,N., Yoshiyama,K., Ihara,K. et al. (1999) DNA replication errors produced by the replicative apparatus of Escherichia coli. J. Mol. Biol., 289, 835–850.

56. Efrati,E., Tocco,G., Eritja,R., Wilson,S.H. and Goodman,M.F. (1997) Abasic translesion synthesis by DNA polymerase beta violates the “A-rule”. Novel types of nucleotide incorporation by human DNA polymerase beta at an abasic lesion in different sequence contexts. J. Biol. Chem., 272, 2559–2569.

57. Perche-Letuvee,P., Molle,T., Forouhar,F., Mulliez,E. and Atta,M. (2014) Wybutosine biosynthesis: structural and mechanistic overview. RNA Biol., 11, 1508–1518.

58. Safra,M., Nir,R., Farouq,D., Vainberg Slutskin,I. and Schwartz,S. (2017) TRUB1 is the predominant pseudouridine synthase acting on mammalian mRNA via a predictable and conserved code. Genome Res., 27, 393–406.
