32
1 Transposable Element-Gene Splicing Modulates the Transcriptional Landscape of Human 1 Pluripotent Stem Cells 2 Isaac A. Babarinde 1 , Gang Ma 1 , Yuhao Li 1 , Boping Deng 1,2 , Zhiwei Luo 3 , Hao Liu 4 , Mazid Md. 3 Abdul 4 , Carl Ward 4 , Minchun Chen 1 , Xiuling Fu 1 , Martha Duttlinger 1 , Jiangping He 5,6,7 , Li Sun 1 , 4 Wenjuan Li 4 , Qiang Zhuang 1 , Jon Frampton 2 , Jean-Baptiste Cazier 2,8 , Jiekai Chen 5,6,7 , Ralf Jauch 9 , 5 Miguel A. Esteban 4,5 , Andrew P. Hutchins 1# 6 7 1 Department of Biology, Southern University of Science and Technology, Shenzhen, Guangdong 8 518055, China. 9 2 Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK. 10 3 Guangzhou Medical University, Guangzhou 511436, China. 11 4 Laboratory of Integrative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese 12 Academy of Sciences, Guangzhou 510530, China. 13 5 Guangzhou Regenerative Medicine and Health-Guangdong Laboratory (GRMH-GDL), 510530 14 Guangzhou, China. 15 6 Key Laboratory of Regenerative Biology of the Chinese Academy of Sciences and Guangdong 16 Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of 17 Biomedicine and Health, Chinese Academy of Sciences, 510530 Guangzhou, China. 18 7 Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of 19 Biomedicine and Health, Chinese Academy of Sciences, 511436 Guangzhou, China. 20 8 Centre for Computational Biology, University of Birmingham, Birmingham, UK. 21 9 School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 22 Hong Kong SAR, China. 23 #Correspondance: [email protected] 24 25 . CC-BY-ND 4.0 International license (which was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint this version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608 doi: bioRxiv preprint

Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

1

Transposable Element-Gene Splicing Modulates the Transcriptional Landscape of Human 1

Pluripotent Stem Cells 2

Isaac A. Babarinde1, Gang Ma1, Yuhao Li1, Boping Deng1,2, Zhiwei Luo3, Hao Liu4, Mazid Md. 3

Abdul4, Carl Ward4, Minchun Chen1, Xiuling Fu1, Martha Duttlinger1, Jiangping He5,6,7, Li Sun1, 4

Wenjuan Li4, Qiang Zhuang1, Jon Frampton2, Jean-Baptiste Cazier2,8, Jiekai Chen5,6,7, Ralf Jauch9, 5

Miguel A. Esteban4,5, Andrew P. Hutchins1# 6

7

1Department of Biology, Southern University of Science and Technology, Shenzhen, Guangdong 8

518055, China. 9

2Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK. 10

3Guangzhou Medical University, Guangzhou 511436, China. 11

4Laboratory of Integrative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese 12

Academy of Sciences, Guangzhou 510530, China. 13

5Guangzhou Regenerative Medicine and Health-Guangdong Laboratory (GRMH-GDL), 510530 14

Guangzhou, China. 15

6Key Laboratory of Regenerative Biology of the Chinese Academy of Sciences and Guangdong 16

Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of 17

Biomedicine and Health, Chinese Academy of Sciences, 510530 Guangzhou, China. 18

7Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of 19

Biomedicine and Health, Chinese Academy of Sciences, 511436 Guangzhou, China. 20

8Centre for Computational Biology, University of Birmingham, Birmingham, UK. 21

9School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 22

Hong Kong SAR, China. 23

#Correspondance: [email protected] 24

25

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 2: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

2

Abstract 26

Transposable elements (TEs) occupy nearly 50% of mammalian genomes and are both potential dangers 27

to genome stability and functional genetic elements. TEs can be expressed and exonised as part of a 28

transcript, however, their full contribution to the transcript splicing remains unresolved. Here, guided by 29

long and short read sequencing of RNAs, we show that 26% of coding and 65% of non-coding 30

transcripts of human pluripotent stem cells (hPSCs) contain TEs. Different TE families have unique 31

integration patterns with diverse consequences on RNA expression and function. We identify hPSC-32

specific splicing of endogenous retroviruses (ERVs) as well as LINE L1 elements into protein coding 33

genes that generate TE-derived peptides. Finally, single cell RNA-seq reveals that proliferating hPSCs 34

are dominated by ERV-containing transcripts, and subpopulations express SINE or LINE-containing 35

transcripts. Overall, we demonstrate that TE splicing modulates the pluripotency transcriptome by 36

enhancing and impairing transcript expression and generating novel transcripts and peptides. 37

38

Keywords: transposable elements, pluripotent stem cells, long non-coding RNA 39

40

41

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 3: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

3

Introduction 42

TEs are repetitive sequences of DNA that make up the majority of mammalian DNA (Jurka et al. 2007; 43

Hutchins and Pei 2015). Whilst the majority of TEs are incapable of retrotransposition, there is growing 44

evidence that inactive incorporated TEs are epigenetically regulated and can alter splicing patterns and 45

generate gene-TE chimeric transcripts that can drive disease and oncogene activity (Jang et al. 2019; 46

Clayton et al. 2020). However, paradoxically, TE expression is also associated with developmental 47

competency and evolutionary innovation (Kunarso et al. 2010; Macfarlan et al. 2012; Fort et al. 2014; 48

Wang et al. 2014; Theunissen et al. 2016; Bourque et al. 2018). TEs are mainly thought to be suppressed 49

by DNA methylation and histone H3K9me3 in somatic cells (Feng et al. 2010; Jonsson et al. 2019), but 50

in the mammalian embryo DNA is demethylated, and human pluripotent stem cells (hPSCs) have 51

reduced epigenetic repression, looser chromatin and a permissive environment for transcript and TE 52

expression (Bulut-Karslioglu et al. 2018). Consequently, TEs are more active during embryogenesis and 53

in hPSCs, compared to somatic cells (Fort et al. 2014), and are expressed in a stage-specifically manner 54

during embryogenesis (Goke et al. 2015; Grow et al. 2015). 55

In addition to their epigenetic activity, TEs are active genetically and can alter transcript splicing 56

patterns by becoming exonised into transcripts (Macfarlan et al. 2012; Wang et al. 2014; Goke et al. 57

2015; Grow et al. 2015). TEs make a major contribution to the sequences of long-non-coding RNAs 58

(lncRNAs) (Lev-Maor et al. 2008; Kelley and Rinn 2012; Kapusta et al. 2013; Naville et al. 2016), and 59

whilst lncRNAs have roles in normal biological processes (Goff and Rinn 2015; Bourque et al. 2018; 60

Lu et al. 2020), and disease (Wapinski and Chang 2011; Carlevaro-Fita et al. 2020), the roles of TEs 61

inside the lncRNAs has received less attention. One model suggests that TEs act as functional domains 62

within lncRNAs, something akin to globular domains of proteins (Johnson and Guigo 2014; Chishima 63

et al. 2018). However, the full contribution of TEs to transcript splicing in both the normal and disease 64

state remains unexplored. Interestingly, the splicing of TEs into normal pluripotency genes has been 65

observed in cancer (Jang et al. 2019), suggesting TE-splicing can contribute to human disease. 66

Consequently, it is important to understand how TEs contribute to the coding and non-coding 67

transcriptome (You et al. 2017; Ma et al. 2019; Morillon and Gautheret 2019), if hPSCs are to be used 68

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 4: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

4

in cell replacement therapy (Fort et al. 2014; Schumann et al. 2019). However, due to limitations in 69

short-read based sequencing, accurate transcriptome maps have been elusive. Much of this is due to the 70

relatively low expression of the TEs and lncRNAs (Lagarde et al. 2017), the repetitive nature of TEs, 71

and problems in using short-reads to assemble longer transcripts (Steijger et al. 2013; Babarinde et al. 72

2019). 73

To explore the splicing patterns of TEs in the normal, non-diseased, state, we took advantage 74

of the large number of hPSC short-read-RNA-seq samples, which we supplemented with long-read-75

RNA-seq to build a map of the gene-TE transcriptome of hPSCs. Our analysis shows that TEs are 76

expressed and spliced into both coding and non-coding RNAs, and had strong effects on gene 77

expression. TEs make up a substantial fraction of the non-coding transcript complement, and they also 78

alter the peptide output of the cell, producing fragments of endogenous viral proteins and small peptides. 79

Finally, using single-cell RNA-seq we show that the hPSC-state is dominated by HERVH and LTR7-80

containing transcripts, which are associated with proliferating cells and are rapidly lost on differentiation 81

as SINE and LINE-containing transcripts become dominant. 82

83

Results 84

Transcript assembly from short and long reads 85

To build a hPSC transcriptome, we started with 197 publicly available short read (SR) paired-end-only 86

hPSC RNA-seq data samples. To quality control the data, we mapped the hPSC samples to the hg38 87

genome assembly using HISAT2 (Kim et al. 2019). Only samples with a minimum mapping rate of 70% 88

were retained, leaving 171 RNA-seq samples (Supplemental Fig. S1A). There are widespread errors in 89

metadata annotation of publically available transcriptome data (Toker et al. 2016), consequently, to 90

ensure that the samples were undifferentiated hPSCs, we measured gene-level expression (Hutchins et 91

al. 2017). hPSC samples were removed if they were outliers in a co-correlation analysis, or did not 92

express hPSC-marker genes, or expressed differentiation-specific genes (Supplemental Fig. S1B and 93

C). This left 150 samples that passed our quality control metrics (Supplemental Table S1, and 94

Supplemental Fig. S1D). We used the 150 confirmed hPSC samples to assemble transcripts using 95

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 5: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

5

HISAT2/StringTie (Pertea et al. 2015; Kim et al. 2019) (Supplemental Fig. S1E). In total ~13 billion 96

reads, with a median mapped fragment size of 225 bp of which 92% could be aligned to the hg38 genome 97

assembly. StringTie (Pertea et al. 2015) assembled an initial set of 279,051 transcripts, which was 98

reduced to 272,268 after removing transcripts shorter than 200 bp (Supplemental Fig. S1F). This 99

dataset comprised the transcripts assembled from the SR data. 100

Transcript assembly from SR can be problematic (Wang et al. 2019), especially if the transcripts 101

contain transposable elements (Babarinde et al. 2019). This is reflected in the large number of single-102

exon fragments from the SR assembly (72,902; 27%). Although some of these fragments might be 103

genuine single-exon transcripts, many are likely to be fragmentary assemblies from SR. To address this, 104

we augmented our assembly with SMRT long-read (LR) sequencing using the PacBio platform. We 105

sequenced the hESC cell line H1 and the iPSC cell line S0730 in duplicate (Zhou et al. 2011). Transcripts 106

were identified from long reads using the isoseq pipeline, and were aligned to the genome using GMAP 107

(Wu and Watanabe 2005) (Supplemental Fig. S1E). Long read assembly produced 53,168 unique 108

transcripts that were at least 200bp long (Supplemental Fig. S1F). 109

Both SR-based and LR-based transcriptomes have disadvantages: SR-based assemblies tend to 110

be unreliable (Steijger et al. 2013), and LR-based assemblies can detect rare transcripts and are poor at 111

quantifying expression. Consequently, to arrive at a transcriptome representation, we set two expression 112

level conditions that a transcript must pass: (1) Expression >=0.1 TPM (transcripts per million) in at 113

least 50 of the samples. (2) The average per-base coverage reported by StringTie must be greater than 1 114

for all exons. Additionally, we also deleted single-exon transcripts that appear to be derived from 115

incomplete splicing, if they shared edges near perfectly with an annotated intron (Supplemental Fig. 116

S1G). The final assembly contained 101,492 transcripts, of which 13,177 (13%) were single-exons. 117

Using these criteria, 71% of the LR transcripts were retained, but only 17% of the SR transcripts were 118

kept in the final assembly (Supplemental Fig. S1F, H). To validate our assembly pipeline, we designed 119

RT-PCR primers spanning an intron (Supplemental Table S2) and could detect 31 out of 40 transcripts 120

including 8 out of 9 long-read transcripts (Supplemental Fig. S2A). We confirmed the accuracy of the 121

5’ ends of our transcript assembly using iPSC and hESC deepCAGE data (Consortium et al. 2014) (Fig. 122

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 6: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

6

1A, S2B). There was a linear relationship between deepCAGE expression and RNA-seq TPM for the 123

transcripts we could unambiguously assign a deepCAGE cluster (Supplemental Fig. S2C). Our final 124

transcript assembly contained 58,201 transcripts with SR support, 6,881 with LR support and 36,410 125

with both SR and LR support (Fig. 1B and Supplemental Table S3). 126

127

Identification of novel transcripts and isoforms 128

We next classified the hPSC transcripts based on the GENCODE transcriptome. We defined three 129

categories: Matching (all internal splices match a GENCODE transcript and at least 75% total exon 130

length overlap); Variant (with an exon overlapping any exon from a GENCODE transcript but not 131

completely matching), and Novel (with no exon overlapping any GENCODE exon) (Fig. 1C). Whilst 132

most transcripts shared a splicing pattern with a GENCODE transcript, 26,846 transcripts were variant 133

or novel (Fig. 1C). For each assembled transcript, we defined completeness as the percent of exons or 134

splices of the closest GENCODE transcript that were correctly retrieved (Supplemental Fig. S2D). 135

Transcripts supported by both LR and SR tend to have higher GENCODE exon and splice completeness 136

(Supplemental Fig. S2E), although there remains a sizeable number of transcripts supported only by 137

SR, suggesting our LR data set has not fully saturated the transcriptome (Supplemental Fig. S2F). The 138

known transcripts tended to be expressed more strongly than novel transcripts, as measured by both 139

RNA-seq and deepCAGE (Fig. 1D). 140

141

A census of coding and non-coding transcripts enriched in pluripotent stem cells 142

FEELnc was used to discriminate coding and non-coding transcripts (Wucher et al. 2017), of which 143

28,699 (28%) were non-coding, and the majority of novel transcripts were non-coding (Fig. 1E, and 144

Supplemental Fig. S2G-H). Conversely, the majority (~81%) of the GENCODE-matching transcripts 145

were protein-coding. As expected, the expression of lncRNAs was lower than protein coding transcripts 146

(Fig. 1F, and Supplemental Fig. S2I). Genes can be expressed in both a cell type-independent and cell 147

type-specific manner (Hutchins et al. 2017). To identify the transcripts important for hPSCs function, 148

we measured the expression of our transcript assembly using the BodyMap dataset (Derrien et al. 2012), 149

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 7: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

7

and divided the transcripts into hPSC-enriched, -nonspecific or -depleted, based on the Z-score against 150

the BodyMap expression dataset (Fig. 2A, and Supplemental Fig. S2J). This approach could recover 151

known hPSC-enriched transcripts such as NANOG, POU5F1 and SALL4 (Fig. 2B). There was no 152

specific bias in coding and non-coding transcripts between the hPSC expression categories (Fig. 2C). 153

However, novel transcripts were more likely to be enriched in hPSCs (Supplemental Fig. S2K). Finally, 154

non-coding transcripts had a lower level of expression compared to coding transcripts across all 155

categories (Fig. 2D). Using LR and SR we have generated an accurate transcriptome for hPSCs, that 156

describes known and novel transcripts, coding and non-coding, and hPSC-enriched and -depleted 157

transcripts. We will use this transcriptome to explore the effect of TEs. 158

159

TEs are widely spliced into protein coding genes and modulate expression levels 160

TEs are found in the untranslated regions (UTRs) of coding genes (Bourque et al. 2018), and contribute 161

to lncRNAs (Kelley and Rinn 2012; Kapusta et al. 2013). Additionally, deepCAGE data has revealed 162

pluripotent-specific TSSs (transcription start sites) originating inside TEs (Faulkner et al. 2009; Fort et 163

al. 2014). We searched our assembled transcriptome using nhmmer (Wheeler and Eddy 2013), using the 164

DFAM models of TEs (Hubley et al. 2016), and identified 37,493 transcripts (37%) that contained at 165

least one TE (Supplemental Table S4). About 22% of the matching transcripts contained a TE, which 166

was similar to GENCODE, however the variant coding transcripts had a higher percentage of TE-167

containing transcripts in hPSCs (Fig. 3A). There was a very small number of novel coding transcripts 168

(411), of which 139 (34%) were predicted to contain a TE. As this number is close to the false positive 169

rate to detect coding transcripts, we do not explore these further (Fig. 1E, Supplemental Fig. S2G, and 170

Supplemental Table S3). Surprisingly, hPSC-enriched transcripts were less likely than hPSC-171

nonspecific or hPSC-depleted transcripts to contain a TE (Fig. 3B). This effect was not unique to the 172

subset of novel transcripts assembled by us, as the same pattern is observed for transcripts matching the 173

GENCODE annotations (Fig. 3C). This is unexpected as TEs are thought to be more actively expressed 174

in the epigenetically relaxed hPSCs. 175

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 8: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

8

We next analyzed TE insertion frequency into the UTRs or CDS of coding transcripts (Fig. 3D). 176

We annotated the CDS from either the GENCODE reference or using the longest open reading frame 177

(ORF). There was an enrichment of TEs in the UTRs, particularly inside the 3’UTRs, and a depletion 178

in the CDS (Fig. 3D). We next asked whether integration sites vary across TE families. LINE:L1, L2 179

and SINE:Alu elements were specifically enriched in both UTRs, but particularly in the 3’UTRs (Fig. 180

3E, and Supplemental Fig. S3A, S3B). LINEs and SINEs were robustly depleted in the CDS, compared 181

to the UTRs (Fig. 3E), and most LINE:L1 TEs were biased to the UTRs (Fig. 3E). For example, 182

LINE:L1:L1M2_orf2 was strongly biased to the 3’UTR, and LINE:L1:L1HS_5end was specifically 183

enriched in the 5’UTR (Fig. 3F, and Supplemental Fig. S3C). Intriguingly, the L1HS family of LINEs 184

are active and capable of retrotransposition in humans (Beck et al. 2010), and their presence here in the 185

5’UTRs of genes suggests activity in hPSCs. 186

In a pattern similar to LINEs, several SINE family members were spliced into the 3’ ends of 187

transcripts (Fig. 3E, and Supplemental Fig. S3B), whilst HERVH/LTR7 TEs were specifically 188

enriched in the 5’UTR (Fig. 3F). Previous analysis of deepCAGE data showed that HERVH/LTR7 acts 189

as a hPSC-specific transcription start site (Fort et al. 2014). Our assembly included previously reported 190

LTR7/HERVH initiating transcripts (Wang et al. 2014) (Supplemental Fig. S3D). Additionally, we 191

could assemble transcripts with HERVH spliced throughout the transcripts, including the CDS and 192

3’UTR (Fig. 3F, and Supplemental Fig. S3E), indicating that HERVH can function as an exonised 193

unit. LTR7, the HERVH-accompanying LTR, was restricted to only the 5’UTR (Fig. 3F). 194

We next looked at the effect of TE insertions on transcript expression. Non-coding transcripts 195

containing TEs had significantly lower expression levels, however coding transcripts were overall 196

unaffected by the presence of a TE (Fig. 3G). It was previously reported that retrotransposons in the 197

3’UTR lead to decreased isoform expression (Faulkner et al. 2009), hence we explored if TE location 198

within a coding transcript influences expression. Transcripts with a TE in the 5’UTR or CDS had 199

significantly lower expression, however those with a TE inside the 3’UTR were significantly (though 200

modestly) up-regulated (Fig. 3H). We next looked at the individual subtypes of TEs, and found that 201

most subtypes of TE led to a lower level of expression (Fig. 3I). Transcripts with HERVHs, or L1 LINEs 202

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 9: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

9

had significantly lower expression no matter the location of their insertion (Fig. 3I, J). However, some 203

TEs could be inserted into the 3’UTR, and expression would be higher. For example, SINEs and the 204

DNA:hAT-Charlie family (Fig. 3I, K). Overall, the effect of TEs in transcripts is both transcript context 205

and TE-family dependent, and whilst LTRs tend to be uniformly deleterious for expression, some SINEs 206

and DNA transposons in the 3’UTR are associated with enhanced expression. 207

208

TE splicing affects the proteome of pluripotent cells 209

We next analyzed whether TE insertions alter the amino acid sequences of coding transcripts. We 210

divided TE insertions into classes based upon the TE insertions effect on the CDS, and counted the 211

number of TE insertions (Fig. 4A). We did not detect any in-frame TE insertions and found only 63 212

frame-shift insertions (Fig. 4A). A sizeable number of TE insertions led to a premature STOP or to the 213

introduction of a new ATG that would disrupt the pre-existing CDS. However, the most common type 214

of TE insertion caused a disruption or frame-shift that led to a longer CDS becoming the most likely 215

predicted CDS (Fig. 4A). These transcripts are predicted as coding, however they may not yield a 216

peptide, as they may lack productive translation or peptide products may be rapidly degraded. 217

Additionally, some TE types, particularly the SINE/Alu and SVA TEs, often have coding-like 218

signatures, but do not encode proteins (Jungreis et al. 2018). To estimate how many of these TE modified 219

CDSs are translated, we extracted the peptide sequences of the putative CDS, deleted any regions that 220

had a >95% BLAST hit against any GENCODE protein, removed peptides shorter than 20 amino acids 221

and retained only those that had at least one Trypsin/Lys-C cleavage site. These strict criteria mean that 222

we would only detect peptides derived from the impact of the TE, or from TE-derived sequences. We 223

then used the HipSci proteomics LC-MS/MS data (Kim and Pevzner 2014; Kilpinen et al. 2017) to 224

search for peptides. Overall, we could detect at least one peptide match for ~17-20% of transcripts (Fig. 225

4B and Supplemental Table S5), indicating that at least some TE-modified transcripts are translated 226

(Supplemental Fig. S4A). The majority of peptides were not derived directly from the TE, but 227

correspond to novel ORFs outside the TE sequence (Fig. 4C). Yet, a small number of peptides are 228

directly encoded by TEs (Fig. 4D). Many of the peptides corresponded to gag, pol and env but not pro, 229

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 10: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

10

of a putative progenitor HERVK (Dewannieux et al. 2006) (23 unique peptides in total) (Supplemental 230

Fig. S4B). These results are consistent with reports that some human cells synthesize viral-like 231

structures, particularly in the placenta (Kammerer et al. 2011), and HERVK RNAs and proteins are 232

expressed in hPSCs (Fuchs et al. 2013). Our data suggests that these viral proteins are derived from up 233

to nine transcripts (Supplemental Table S5). Interestingly, we find evidence for coding potential for a 234

variant hPSC-enriched transcript PCAT14 (prostate cancer-associated transcript 14) that has hitherto 235

been considered non-coding. We observed 13 unique HERVK peptides derived from PCAT14 that 236

encode a near-complete product of gag and fragments of pol (Fig. 4E, Supplemental Fig. S5A-C and 237

Supplemental Table S5). We did not observe any peptides from HERVH, in agreement with a previous 238

report that the ORFs are not functional (Santoni et al. 2012). In addition to HERVs, we also detected 239

peptides derived from SINEs, which was unexpected as SINEs do not encode proteins and rely on LINEs 240

for retrotransposition. Using BLAST to search for matching peptides against the human non-redundant 241

protein set produced no significant hits. In summary, these data indicate that TE insertions modulate the 242

coding transcriptome of pluripotent cells, and can give rise to novel peptides and the ectopic expression 243

of viral proteins. 244

245

TEs wire the lncRNA complement of hPSCs 246

Previous reports show that TEs constitute a major part of lncRNAs (Kelley and Rinn 2012; Kapusta et 247

al. 2013). Consistently, we observed a large number of non-coding transcripts containing at least 1 TE 248

(65%), slightly less that the 83% reported in a smaller set of lncRNAs (Kelley and Rinn 2012). Of the 249

transcripts matching GENCODE, 45% had a TE, whilst the variant and novel transcripts were more 250

likely to contain a TE (Fig. 5A). Interestingly, and in contrast to coding transcripts, novel and variant 251

non-coding transcripts containing a TE were more likely to be enriched in hPSCs (Fig. 5B). 252

We next looked at the position of TEs inside lncRNA sequences. TEs are spliced into lncRNAs, 253

and are found anywhere from the TSS to the TTS (transcription termination site) with a slight bias 254

towards the 3’ end that was more pronounced in hPSC-enriched transcripts (Fig. 5C). Interestingly, 255

compared to GENCODE, hPSC transcripts were more enriched with TEs, and hPSC-enriched transcripts 256

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 11: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

11

contained even more TEs (Fig. 5C, 5D, and Supplemental Fig. S6A). This indicates that the 257

contribution of TEs to non-coding transcripts is increased in hPSCs. At the TE family level, LINEs were 258

enriched in lncRNAs, such as LINE:L1:L1PA3_3end and LINE:L1:L1Hs_5end (Fig. 5E, and 259

Supplemental Fig. S6B). The L1HS family are active and capable of retrotransposition in humans, and 260

although we did not observe intact L1HS sequences or peptides, that L1HS TE fragments are contained 261

inside lncRNAs suggests they are under active epigenetic regulation (Fig. 3F and 5E). In addition to 262

LINEs, the SINEs, DNA, SVA and Satellite type repeats/TEs were also enriched (Supplemental Fig. 263

S6A, S6C-E). However, overall, the LTRs had the most complex patterns of splicing (Fig. 5E), and 264

especially the ERV1 and ERVK families of TEs (Supplemental Fig. S6F, S6G). LTR7, the LTR for 265

HERVH, can function as a hPSC-specific TSS (Fort et al. 2014). However, unlike coding transcripts, 266

LTR7 not was biased to the TSS and there was splicing of LTR7 and HERVH along the entirety of the 267

lncRNAs (Fig. 5E). There were similarly complex patterns of other LTRs, for example HERVFH21 268

and HERVH48 were found in the middle of transcripts, and not at the TSS or TTS (Supplemental Fig. 269

S6G). Overall, these results indicate that TEs are more likely to be spliced into hPSC-enriched non-270

coding transcripts, which is in contrast to hPSC-enriched coding transcripts which are depleted of TEs 271

(Fig. 3A, 3B and 5B). 272

Finally, we looked at the effect of TE insertions on lncRNA expression. In contrast to coding 273

transcripts, the presence of a TE inside the lncRNA was more likely to be deleterious for expression 274

(Fig. 3G). Many of the hPSC-specific transcripts with TEs had significantly reduced expression 275

compared to transcripts containing no TE (Fig. 5F). Additionally, multiple TE insertions resulted in 276

decreased expression, an effect not seen in coding transcripts (Fig. 5G). These results indicate that in 277

direct contrast to coding transcripts TE insertions are deleterious for non-coding transcript expression. 278

279

Single cell RNA-seq expression of lincRNAs, heterogeneity in TE splicing 280

We took advantage of our hPSC-specific high-quality transcript assembly to look at single cell RNA-281

seq (sc-RNA-seq) data. Sc-RNA-seq technology can measure specific expression of genes in individual 282

cells. However, transcript identity can be ambiguous as the reads are heavily biased to the 3’ ends. 283

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 12: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

12

Consequently, we reduced the transcripts to those with unique non-overlapping strand-specific 3’ ends, 284

and collapsed overlapping transcript 3’ends to a single transcript. This resulted in a total set of 88,520 285

transcripts (87% of the total superset of transcripts). We generated two sc-RNA-seq datasets, one from 286

WIBR3 hESCs and another from S0730/c11 iPSCs (Zhou et al. 2011), supplemented with 5 samples 287

from WTC iPSCs from E-MTAB-6687 (Nguyen et al. 2018), and two UCLA1 line hESC samples from 288

GSE140021 (Chen et al. 2019). We aligned the reads to the hg38 genome assembly and annotated the 289

reads to our 3’end transcript database. On average 50-70% of the reads could be aligned to this 3’ end 290

transcriptome (in comparison 60-70% align to GENCODE for the same samples). The majority of the 291

3’ ends show strand-specific read pileups (Supplemental Fig. S7A) indicating that our transcript 292

assembly has accurate 3’ ends and can be used for scRNA-seq. After filtering and normalization, we 293

retrieved 30,001 cells, and 35,400 transcript ends. Projection of the cells into a UMAP (Uniform 294

Maniform and Approximation Projection) plot showed the hESCs and iPSCs mixed well, and there was 295

no bias in the replicate samples (Supplemental Fig. S7B, S7C). We could detect five major clusters of 296

cells (Fig. 6A). GO (gene ontology) analysis for differentially regulated transcripts specific to each 297

cluster and marker genes expression indicated that all clusters except clusters 3 and 4 were pluripotent 298

(Fig. 6B, 6C, and Supplemental Fig. S7D). Analysis of the expression of G1, S and G2/M-specific 299

genes indicated that clusters 0 and 1 had higher expression of G2/M and proliferative genes (Fig. 6D, 300

E), suggesting these are highly proliferative cells identified previously (Nguyen et al. 2018; Lau et al. 301

2020), positive for EPCAM, CD9 and PODXL (GCTM2) (Fig. 6E, and Supplemental Fig. S7E, S7F). 302

Conversely, clusters 2 and 4 showed reduced levels of G2/M (Fig. 6D). Intriguingly, our data contained 303

two novel clusters of cells (cluster 3 and 4), that was predominantly made up of hESCs, rather than 304

iPSCs (Supplemental Fig. S7B). GO analysis suggested that these cells were undergoing differentiation 305

(Fig. 6B). Marker genes for a specific lineage were not evident, but there was a strong shift in epithelial 306

and mesenchymal genes, as CDH1 and EPCAM were downregulated and CDH2 and VIM were up-307

regulated (Supplemental Fig. S7E, S7G). This is reminiscent of the epithelial-mesenchymal transition 308

we saw previously in the early stage of hepatocyte differentiation (Li et al. 2017). The main difference 309

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 13: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

13

between the two clusters was in cell cycle activity, as many cells were in G2/M in cluster 3, but cluster 310

4 showed reduced cell cycle activity (Fig. 6D). 311

Finally, we looked at the composition of TEs and transcripts specific to each of the clusters. 312

Clusters 0, 1 and 2 represent the bulk hPSCs, and exist on a spectrum, from highly proliferating (cluster 313

1) to medium proliferation (cluster 0) and relatively low proliferation (cluster 2). All three clusters were 314

rich in hPSC-enriched transcripts and TE-containing transcripts (Fig. 6F). We measured the frequency 315

of TEs inside the transcripts, and observed a large number of HERVH TEs, and some LINE:L1 and 316

SINE:Alu-containing transcripts specific to cluster 0, 1 and 2 (Fig. 6G, S8A, B). Proliferative cluster 1 317

showed an increase in SINE:Alu containing transcripts (Fig. 6G). The two differentiating clusters, 3 318

and 4, were almost entirely depleted of LTR-containing transcripts (Fig. 6G), and were only enriched 319

for SINE:Alu, SINE:MIR and LINE:L1-containing transcripts (Fig. 6G, H, S8C, D). Overall, we show 320

that distinct subsets of cells within a hPSC culture contain transcripts with specific patters of TE 321

expression. HERVH TEs are restricted to clusters 0, 1 and 2, and are inversely correlated with 322

proliferation. Proliferating cells tend to express transcripts containing SINEs and LINEs, and cells 323

undergoing an EMT, and starting to differentiate, lose expression of HERVH-containing transcripts 324

(Fig. 6I). These results demonstrate the heterogenous patterns of TE expression in single cells in cultures 325

of hPSCs. 326

327

Discussion 328

TEs constitute the majority of the DNA sequence of the human genome (Hutchins and Pei 2015). They 329

have long been considered as a part, involved in regulatory functions, and providing an evolutionary 330

substrate (Bourque et al. 2018). However, TEs ultimately need to be suppressed and controlled due to 331

their implications in human disease and effects on genome stability (Beck et al. 2010; Hutchins and Pei 332

2015; Tam et al. 2019). Here, using a combination of short and long read RNA-seq data, we show that 333

TEs are an integral part of the hPSC-transcriptome, contributing to 26% of coding transcripts and 65% 334

of non-coding lncRNA transcripts. The insertion of TEs into coding transcripts is poorly tolerated in the 335

5’UTR and CDS, but are tolerated or even beneficial for expression inside the 3’UTR. For non-coding 336

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 14: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

14

transcripts the effect of TEs is uniformly deleterious, and the higher the number of TEs, the lower the 337

level of gene expression. 338

We found potential coding sequences in our catalog of transcripts; particularly peptides derived 339

from TE insertions. Our estimates of coding transcripts containing a TE suggests that around 20% of 340

these transcripts are capable of generating peptides, at least transiently, although most peptides are not 341

derived directly from the TE sequence. As lincRNAs have been observed to contain potential CDSs 342

(Cabili et al. 2011), it is likely that the non-peptide producing transcripts retain coding signatures but 343

not coding potential. Ultimately, the distinction between coding and non-coding transcripts remains 344

fuzzy, with a large number of genes with unclear designation (Abascal et al. 2018). It is also possible 345

that non-coding transcripts can become coding in a specific cell type. Considering the increasing interest 346

in the biological functions of alternate proteins and small peptides (sCDSs) in other biological contexts 347

(Makarewich and Olson 2017), it will be important to explore how both sCDSs and TE-derived peptides 348

are contributing to hPSCs. One class of TE-containing coding transcript that we could detect peptides 349

for were HERVH-derived viral proteins. Viral proteins have been observed in some cell types, for 350

example HERVW env proteins have been found in placental cells, and are thought to help mediate 351

trophoblast invasion into uterine tissues (Frendo et al. 2003; Chuong et al. 2013). hPSCs express 352

HERVH and HERVEs containing RNAs, but we did not observe any peptides. Only HERVK ERVs 353

showed evidence of gag, pol and env translation. 354

Our catalog of transcripts from hPSCs demonstrates how the complement of TEs is integrated 355

into coding and non-coding transcriptome. Interestingly, the splicing of TEs into pluripotency transcripts 356

has been seen in cancer cells. For example, an AluJb is spliced into the pluripotency gene LIN28B and 357

MLT1J into SALL4 (Yu et al. 2007; Yang et al. 2010; Jang et al. 2019). The splicing of these TEs is 358

thought to convert the pluripotent gene into an oncogene (Jang et al. 2019). Intriguingly, in our hPSC-359

specific dataset we did not observe these specific TE-transcript splicing patterns (Supplemental Fig. 360

S9A-C), suggesting that these are cancer-specific effects, and implying that there is a normal set of TEs 361

that are spliced into transcripts (Fort et al. 2014), and a set that is associated with disease. Indeed, hPSC-362

specific HERVs were not associated with the expression of pluripotency-related transcripts in human 363

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 15: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

15

cancers (Zapatka et al. 2020), again suggesting that TE splicing has a normal and diseased pattern. 364

Related to this, the observation of proteins derived from HERVK is noteworthy considering their link 365

with various cancers (Subramanian et al. 2011). The application of long-read sequencing to human 366

diseases, and specifically cancers will be important to shed light on the influence of TEs on transcript 367

splicing in human disease. Overall, our study reveals how the transcribed TE complement of the cell 368

contributes to both coding and non-coding transcriptome of hPSCs, and provides a model of the normal 369

TE transcriptome for comparison to diseased TE expression. 370

371

Competing Interests 372

The authors declare no competing interests 373

374

Authors’ contributions 375

A.P.H conceived and funded the project, I.A.B., Y.H.L. and A.P.H. performed the bioinformatic 376

analysis. C.M.C., X.L.F., M.G., S.L. and M.D. performed the long-read RNA-seq and validation. 377

B.P.D., Z.L., M.M.A., C.W. performed the scRNA-seq. A.P.H. and I.A.B. wrote the manuscript with 378

assistance from R.J, M.A.E, J.F, J.B.C and C.J. All authors read and approved the manuscript. 379

380

Acknowledgements 381

This work was supported by the National Natural Science Foundation of China (31970589, 382

31850410463 to A.P.H., 31801217 to Z.Q., 31850410486 to I.A.B.), the Science and Technology 383

Planning Project of Guangdong Province (2019A050510004 to A.P.H. and R.J.), the National Key 384

Research and Development Program of China (2016YFA0100102 and 2018YFA0106903 to M.A.E.), 385

and the Shenzhen Peacock plan (201701090668B to A.P.H.). Supported by Center for Computational 386

Science and Engineering of Southern University of Science and Technology. 387

388

Accessions 389

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 16: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

16

The long-read RNA-seq was deposited in the Sequence Read Archive (SRA) with the accession number: 390

PRJNA631047, and the scRNA-seq data with the accession number: PRJNA631808. 391

392

References 393

Abascal F, Juan D, Jungreis I, Kellis M, Martinez L, Rigau M, Rodriguez JM, Vazquez J, Tress ML. 394 2018. Loose ends: almost one in five human genes still have unresolved coding status. Nucleic 395 Acids Res 46: 7070-7084. 396

Babarinde IA, Li Y, Hutchins AP. 2019. Computational Methods for Mapping, Assembly and 397 Quantification for Coding and Non-coding Transcripts. Comput Struct Biotechnol J 17: 628-637. 398

Beck CR, Collier P, Macfarlane C, Malig M, Kidd JM, Eichler EE, Badge RM, Moran JV. 2010. LINE-1 399 retrotransposition activity in human genomes. Cell 141: 1159-1170. 400

Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, Imbeault M, Izsvak Z, Levin 401 HL, Macfarlan TS et al. 2018. Ten things you should know about transposable elements. 402 Genome Biol 19: 199. 403

Bulut-Karslioglu A, Macrae TA, Oses-Prieto JA, Covarrubias S, Percharde M, Ku G, Diaz A, McManus 404 MT, Burlingame AL, Ramalho-Santos M. 2018. The Transcriptionally Permissive Chromatin 405 State of Embryonic Stem Cells Is Acutely Tuned to Translational Output. Cell Stem Cell 22: 406 369-383 e368. 407

Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. 2011. Integrative annotation 408 of human large intergenic noncoding RNAs reveals global properties and specific subclasses. 409 Genes Dev 25: 1915-1927. 410

Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 411 Interpretation G, Johnson R, Consortium P. 2020. Cancer LncRNA Census reveals evidence 412 for deep functional conservation of long noncoding RNAs in tumorigenesis. Commun Biol 3: 56. 413

Chen D, Sun N, Hou L, Kim R, Faith J, Aslanyan M, Tao Y, Zheng Y, Fu J, Liu W et al. 2019. Human 414 Primordial Germ Cells Are Specified from Lineage-Primed Progenitors. Cell Rep 29: 4568-4582 415 e4565. 416

Chishima T, Iwakiri J, Hamada M. 2018. Identification of Transposable Elements Contributing to Tissue-417 Specific Expression of Long Non-Coding RNAs. Genes (Basel) 9. 418

Chuong EB, Rumi MA, Soares MJ, Baker JC. 2013. Endogenous retroviruses function as species-419 specific enhancer elements in the placenta. Nat Genet 45: 325-329. 420

Clayton EA, Rishishwar L, Huang TC, Gulati S, Ban D, McDonald JF, Jordan IK. 2020. An atlas of 421 transposable element-derived alternative splicing in cancer. Philos Trans R Soc Lond B Biol Sci 422 375: 20190342. 423

Consortium F the RP Clst Forrest AR Kawaji H Rehli M Baillie JK de Hoon MJ Haberle V Lassmann T 424 et al. 2014. A promoter-level mammalian expression atlas. Nature 507: 462-470. 425

Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, 426 Knowles DG et al. 2012. The GENCODE v7 catalog of human long noncoding RNAs: analysis 427 of their gene structure, evolution, and expression. Genome Res 22: 1775-1789. 428

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 17: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

17

Dewannieux M, Harper F, Richaud A, Letzelter C, Ribet D, Pierron G, Heidmann T. 2006. Identification 429 of an infectious progenitor for the multiple-copy HERV-K human endogenous retroelements. 430 Genome Res 16: 1548-1556. 431

Faulkner GJ, Kimura Y, Daub CO, Wani S, Plessy C, Irvine KM, Schroder K, Cloonan N, Steptoe AL, 432 Lassmann T et al. 2009. The regulated retrotransposon transcriptome of mammalian cells. Nat 433 Genet 41: 563-571. 434

Feng S, Jacobsen SE, Reik W. 2010. Epigenetic reprogramming in plant and animal development. 435 Science 330: 622-627. 436

Fort A, Hashimoto K, Yamada D, Salimullah M, Keya CA, Saxena A, Bonetti A, Voineagu I, Bertin N, 437 Kratz A et al. 2014. Deep transcriptome profiling of mammalian stem cells supports a regulatory 438 role for retrotransposons in pluripotency maintenance. Nat Genet 46: 558-566. 439

Frendo JL, Olivier D, Cheynet V, Blond JL, Bouton O, Vidaud M, Rabreau M, Evain-Brion D, Mallet F. 440 2003. Direct involvement of HERV-W Env glycoprotein in human trophoblast cell fusion and 441 differentiation. Mol Cell Biol 23: 3566-3574. 442

Fuchs NV, Loewer S, Daley GQ, Izsvak Z, Lower J, Lower R. 2013. Human endogenous retrovirus K 443 (HML-2) RNA and protein expression is a marker for human embryonic and induced pluripotent 444 stem cells. Retrovirology 10: 115. 445

Goff LA, Rinn JL. 2015. Linking RNA biology to lncRNAs. Genome Res 25: 1456-1465. 446

Goke J, Lu X, Chan YS, Ng HH, Ly LH, Sachs F, Szczerbinska I. 2015. Dynamic transcription of distinct 447 classes of endogenous retroviral elements marks specific populations of early human 448 embryonic cells. Cell Stem Cell 16: 135-141. 449

Grow EJ, Flynn RA, Chavez SL, Bayless NL, Wossidlo M, Wesche DJ, Martin L, Ware CB, Blish CA, 450 Chang HY et al. 2015. Intrinsic retroviral reactivation in human preimplantation embryos and 451 pluripotent cells. Nature 522: 221-225. 452

Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. 2016. The Dfam 453 database of repetitive DNA families. Nucleic Acids Res 44: D81-89. 454

Hutchins AP, Pei D. 2015. Transposable elements at the center of the crossroads between 455 embryogenesis, embryonic stem cells, reprogramming, and long non-coding RNAs. Sci Bull 60: 456 1722–1733. 457

Hutchins AP, Yang Z, Li Y, He F, Fu X, Wang X, Li D, Liu K, He J, Wang Y et al. 2017. Models of global 458 gene expression define major domains of cell type and tissue identity. Nucleic Acids Res 45: 459 2354-2367. 460

Jang HS, Shah NM, Du AY, Dailey ZZ, Pehrsson EC, Godoy PM, Zhang D, Li D, Xing X, Kim S et al. 461 2019. Transposable elements drive widespread expression of oncogenes in human cancers. 462 Nat Genet 51: 611-617. 463

Johnson R, Guigo R. 2014. The RIDL hypothesis: transposable elements as functional domains of long 464 noncoding RNAs. RNA 20: 959-976. 465

Jonsson ME, Ludvik Brattas P, Gustafsson C, Petri R, Yudovich D, Pircs K, Verschuere S, Madsen S, 466 Hansson J, Larsson J et al. 2019. Activation of neuronal genes via LINE-1 elements upon global 467 DNA demethylation in human neural progenitors. Nat Commun 10: 3182. 468

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 18: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

18

Jungreis I, Tress ML, Mudge J, Sisu C, Hunt T, Johnson R, Uszczynska-Ratajczak B, Lagarde J, Wright 469 J, Muir P et al. 2018. Nearly all new protein-coding predictions in the CHESS database are not 470 protein-coding. bioRxiv doi:10.1101/360602: 360602. 471

Jurka J, Kapitonov VV, Kohany O, Jurka MV. 2007. Repetitive sequences in complex genomes: 472 structure and evolution. Annu Rev Genomics Hum Genet 8: 241-259. 473

Kammerer U, Germeyer A, Stengel S, Kapp M, Denner J. 2011. Human endogenous retrovirus K 474 (HERV-K) is expressed in villous and extravillous cytotrophoblast cells of the human placenta. 475 J Reprod Immunol 91: 1-8. 476

Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C. 2013. 477 Transposable elements are major contributors to the origin, diversification, and regulation of 478 vertebrate long noncoding RNAs. PLoS Genet 9: e1003470. 479

Kelley D, Rinn J. 2012. Transposable elements reveal a stem cell-specific class of long noncoding 480 RNAs. Genome Biol 13: R107. 481

Kilpinen H, Goncalves A, Leha A, Afzal V, Alasoo K, Ashford S, Bala S, Bensaddek D, Casale FP, Culley 482 OJ et al. 2017. Common genetic variation drives molecular heterogeneity in human iPSCs. 483 Nature 546: 370-375. 484

Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and 485 genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37: 907-915. 486

Kim S, Pevzner PA. 2014. MS-GF+ makes progress towards a universal database search tool for 487 proteomics. Nat Commun 5: 5277. 488

Kunarso G, Chia NY, Jeyakani J, Hwang C, Lu X, Chan YS, Ng HH, Bourque G. 2010. Transposable 489 elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet 490 42: 631-634. 491

Lagarde J, Uszczynska-Ratajczak B, Carbonell S, Perez-Lluch S, Abad A, Davis C, Gingeras TR, 492 Frankish A, Harrow J, Guigo R et al. 2017. High-throughput annotation of full-length long 493 noncoding RNAs with capture long-read sequencing. Nat Genet 49: 1731-1740. 494

Lau KX, Mason EA, Kie J, De Souza DP, Kloehn J, Tull D, McConville MJ, Keniry A, Beck T, Blewitt ME 495 et al. 2020. Unique properties of a subset of human pluripotent stem cells with high capacity for 496 self-renewal. Nat Commun 11: 2420. 497

Lev-Maor G, Ram O, Kim E, Sela N, Goren A, Levanon EY, Ast G. 2008. Intronic Alus influence 498 alternative splicing. PLoS Genet 4: e1000204. 499

Li Q, Hutchins AP, Chen Y, Li S, Shan Y, Liao B, Zheng D, Shi X, Li Y, Chan WY et al. 2017. A sequential 500 EMT-MET mechanism drives the differentiation of human embryonic stem cells towards 501 hepatocytes. Nat Commun 8: 15166. 502

Lu JY, Shao W, Chang L, Yin Y, Li T, Zhang H, Hong Y, Percharde M, Guo L, Wu Z et al. 2020. Genomic 503 Repeats Categorize Genes with Distinct Functions for Orchestrated Regulation. Cell Rep 30: 504 3296-3311 e3295. 505

Ma L, Cao J, Liu L, Du Q, Li Z, Zou D, Bajic VB, Zhang Z. 2019. LncBook: a curated knowledgebase of 506 human long non-coding RNAs. Nucleic Acids Res 47: D128-D134. 507

Macfarlan TS, Gifford WD, Driscoll S, Lettieri K, Rowe HM, Bonanomi D, Firth A, Singer O, Trono D, 508 Pfaff SL. 2012. Embryonic stem cell potency fluctuates with endogenous retrovirus activity. 509 Nature 487: 57-63. 510

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 19: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

19

Makarewich CA, Olson EN. 2017. Mining for Micropeptides. Trends Cell Biol 27: 685-696. 511

Morillon A, Gautheret D. 2019. Bridging the gap between reference and real transcriptomes. Genome 512 Biol 20: 112. 513

Naville M, Warren IA, Haftek-Terreau Z, Chalopin D, Brunet F, Levin P, Galiana D, Volff JN. 2016. Not 514 so bad after all: retroviruses and long terminal repeat retrotransposons as a source of new 515 genes in vertebrates. Clin Microbiol Infect 22: 312-323. 516

Nguyen QH, Lukowski SW, Chiu HS, Senabouth A, Bruxner TJC, Christ AN, Palpant NJ, Powell JE. 517 2018. Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular 518 heterogeneity and cell state transitions between subpopulations. Genome Res 28: 1053-1066. 519

Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. 2015. StringTie enables 520 improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33: 290-295. 521

Santoni FA, Guerra J, Luban J. 2012. HERV-H RNA is abundant in human embryonic stem cells and a 522 precise marker for pluripotency. Retrovirology 9: 111. 523

Schumann GG, Fuchs NV, Tristan-Ramos P, Sebe A, Ivics Z, Heras SR. 2019. The impact of 524 transposable element activity on therapeutically relevant human stem cells. Mob DNA 10: 9. 525

Steijger T, Abril JF, Engstrom PG, Kokocinski F, Consortium R, Hubbard TJ, Guigo R, Harrow J, Bertone 526 P. 2013. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10: 1177-527 1184. 528

Subramanian RP, Wildschutte JH, Russo C, Coffin JM. 2011. Identification, characterization, and 529 comparative genomic distribution of the HERV-K (HML-2) group of human endogenous 530 retroviruses. Retrovirology 8: 90. 531

Tam OH, Ostrow LW, Gale Hammell M. 2019. Diseases of the nERVous system: retrotransposon 532 activity in neurodegenerative disease. Mob DNA 10: 32. 533

Theunissen TW, Friedli M, He Y, Planet E, O'Neil RC, Markoulaki S, Pontis J, Wang H, Iouranova A, 534 Imbeault M et al. 2016. Molecular Criteria for Defining the Naive Human Pluripotent State. Cell 535 Stem Cell 19: 502-515. 536

Toker L, Feng M, Pavlidis P. 2016. Whose sample is it anyway? Widespread misannotation of samples 537 in transcriptomics studies. F1000Res 5: 2103. 538

Wang J, Xie G, Singh M, Ghanbarian AT, Rasko T, Szvetnik A, Cai H, Besser D, Prigione A, Fuchs NV 539 et al. 2014. Primate-specific endogenous retrovirus-driven transcription defines naive-like stem 540 cells. Nature 516: 405-409. 541

Wang X, You X, Langer JD, Hou J, Rupprecht F, Vlatkovic I, Quedenau C, Tushev G, Epstein I, 542 Schaefke B et al. 2019. Full-length transcriptome reconstruction reveals a large diversity of RNA 543 and protein isoforms in rat hippocampus. Nat Commun 10: 5009. 544

Wapinski O, Chang HY. 2011. Long noncoding RNAs and human disease. Trends Cell Biol 21: 354-545 361. 546

Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29: 547 2487-2489. 548

Wu TD, Watanabe CK. 2005. GMAP: a genomic mapping and alignment program for mRNA and EST 549 sequences. Bioinformatics 21: 1859-1875. 550

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 20: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

20

Wucher V, Legeai F, Hedan B, Rizk G, Lagoutte L, Leeb T, Jagannathan V, Cadieu E, David A, Lohi H 551 et al. 2017. FEELnc: a tool for long non-coding RNA annotation and its application to the dog 552 transcriptome. Nucleic Acids Res 45: e57. 553

Yang J, Gao C, Chai L, Ma Y. 2010. A novel SALL4/OCT4 transcriptional feedback network for 554 pluripotency of embryonic stem cells. PLoS One 5: e10766. 555

You BH, Yoon SH, Nam JW. 2017. High-confidence coding and noncoding transcriptome maps. 556 Genome Res 27: 1050-1062. 557

Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, 558 Ruotti V, Stewart R et al. 2007. Induced pluripotent stem cell lines derived from human somatic 559 cells. Science 318: 1917-1920. 560

Zapatka M, Borozan I, Brewer DS, Iskar M, Grundhoff A, Alawi M, Desai N, Sultmann H, Moch H, 561 Pathogens P et al. 2020. The landscape of viral associations in human cancers. Nat Genet 52: 562 320-330. 563

Zhou T, Benda C, Duzinger S, Huang Y, Li X, Li Y, Guo X, Cao G, Chen S, Hao L et al. 2011. Generation 564 of induced pluripotent stem cells from urine. J Am Soc Nephrol 22: 1221-1228. 565

566

567

568

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 21: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

60

A CB

0

20

40

60

Num

ber (

×000

)

Mat

chin

g

Varia

nt

Nov

el

log2

(TPM

)

-5

0

5

10

15

LRSR LR+SR

Num

ber (

×000

)

0

20

40

% transcripts

RNA-seq DeepCAGE

GENCODE

All

Matching

Variant

Novel

Coding Noncoding

20% 40%0% 60% 80% 100%

log2

(nor

mal

ized

cou

nt)

Cod

ing

Non

codi

ng

Cod

ing

Non

codi

ng

Mat

chin

g

Varia

nt

Nov

el

Mat

chin

g

Varia

nt

Nov

el

D E F

Norm

alize

d RNA

-seq

coun

t

Norm

alize

d dee

pCAG

E co

unt

2.0Kb 2.0KbTTSTSS

20

40

6060

40

20

0

deepCAGERNA-seq

-5

0

5

10

20

15

RNA-seq DeepCAGE

log2

(TPM

)

-5

0

5

10

15

log2

(nor

mal

ized

cou

nt)

-5

0

5

10

20

15

72794(71.7%)

28698(28.3%)

60605(81.2%)

11725(60.0%)

464(6.4%)

6806(93.6%)

7851(40.0%)

14041(18.8%)

99883(69.6%)

43689(30.4%)

Figure 1

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 22: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

21

Figure 1 – Combining short-read RNA-seq and long read RNA-seq to assemble a high-quality 569

hPSC transcriptome. 570

(A) Pileup of deepCAGE and RNA-seq reads across the lengths of the transcripts from TSS to 571

TTS and the flanking 2 kb regions. 572

(B) Number of transcripts (thousands) supported by short-read (SR), long read (LR) only, or both 573

(SR+LR). 574

(C) Number of transcripts defined as matching (internal exon structure matches exactly to a 575

GENCODE transcript), variant (shares any exon or overlapping exon segment with a 576

GENCODE transcript) or novel (does not share any exon base pair with a GENCODE 577

transcript). 578

(D) Violin plots showing RNA-seq log2 transcripts per million (TPM) and deepCAGE log2 579

normalized tag count expression for all unambiguous TSSs in the hPSC assembled 580

transcriptome. 581

(E) Percentage of coding and non-coding transcripts by transcript class. 582

(F) Expression level of coding and non-coding transcripts, for short-read RNA-seq (left violins) 583

or deepCAGE data (right violins). 584

585

586

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 23: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

hPSC Hapmap tissues

Enric

hed

Non

spec

ific

Dep

lete

d

SALL4

NANOGZIC3

DNMT3B

GDAP1

SPATA20

SEC61G

PTPRS

WBP1PDK4

TMEM176B

APOL3

log2 (TPM)

-2 -1 0 1 2

Den

sity

0.0

0.1

0.2

0.3

0.4

0.5

All

Noncoding

Z-score in the bodymap expression-4 -2 0 2 4

Enriched

Nonspecific

Depleted

0% 20% 40% 60% 80% 100%

% transcripts

log2

(TPM

)

Coding Noncoding

A B

C

-5

0

5

10 hPSC

Adip

ose

Adre

nal g

land

Brai

nBr

east

Col

onH

eart

Kidn

eyLi

ver

Lung

Lym

ph n

ode

Ova

ryPr

ostra

teSk

elet

al m

uscl

eTe

stis

Thyr

oid

Whi

te b

lood

cel

ls

Enric

hed

Unb

iase

d

Dep

lete

d

Enric

hed

Unb

iase

d

Dep

lete

d

-5

0

5

10

15

20RNA-seq DeepCAGE

Coding Noncoding

Coding

D

POU5F1CLDN6

ZFP42

LIN28A

LIN28B

NUCB1POLD1

NELFE

DRAP1

ATG101

CD14LAPTM5

KLF2

VSIRACSL1NFICAPOL6

3 4

ZIC2

log2

(nor

mal

ized

cou

nt)

Nonspecific62,913 (62.0%)

Depleted10,984 (10.8%)

Enriched27,595 (27.2%)

15

7986 (72.7%) 2998 (27.3%)

18030 (28.7%)44883 (71.3%)

19925 (72.2%) 7670 (27.8%)

Figure 2

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 24: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

22

Figure 2 – Transcript dynamics of the hPSC transcriptome. 587

(A) Z-score of all, coding, and non-coding transcripts against the Human BodyMap 2.0 dataset 588

(Derrien et al. 2012). 589

(B) Heatmap showing expression of selected hPSC-enriched, -nonspecific and -depleted transcripts 590

in hPSCs and the BodyMap samples. 591

(C) Percent of the coding and non-coding transcripts in the indicated hPSC expression categories. 592

(D) Violin plots showing the RNA-seq expression levels (left panel; TPM: transcripts per million) 593

or deepCAGE data (right panel; in normalized read counts) for the indicated expression classes 594

for coding and non-coding transcripts. 595

596

597

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 25: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

− 5 0 5

3' UTRCDS

5' UTRno TE

*4.0e-021.1e-01*9.6e-15

SINE:Alu:AluSx

− 5 0 5

3' UTRCDS

5' UTRno TE

*3.9e-026.6e-027.7e-02

DNA:hAT-Charlie:MER117

− 5 0 5

3' UTRCDS

5' UTRno TE

*4.6e-021.6e-01*3.3e-09

SINE:Alu:AluJr

Log2 (Transcripts per million)

p-value

− 5 0 5

3' UTRCDS

5' UTRno TE

*2.2e-059.0e-02*2.6e-04

LINE:L1:L1M2_orf2− 5 0 5

3' UTRCDS

5' UTRno TE

*3.7e-02*1.8e-03*1.5e-05

LINE:L1:L1HS_5end− 5 0 5

3' UTRCDS

5' UTRno TE

*1.8e-022.7e-01*3.8e-05

LTR:ERV1:HERVH

Log2 (Transcripts per million)

p-value

5'UTR CDS 3'UTR 5'UTR CDS 3'UTR

TEfre

quen

cyno

rmalize

dto

numbe

roftranscripts

Protein coding RNAsAll transcripts Variant-only

0.0

0.1

0.2

0.3

0.0

0.1

0.2

0.3

LINE:L1

LINE:L2

0.000

0.005

0.010

0.015

LTR:ERV1

SINE:Alu

0.00

0.01

0.02

0.03

TEfre

quen

cyno

rmalize

dto

numbe

roftranscripts

5'UTR CDS 3'UTR 5'UTR CDS 3'UTR

0.00

0.05

0.10

0.15

0.00

0.02

0.04

0.06

GENCODEhPSC enrichedhPSC nonspecifichPSC depleted

Log2 (Transcripts per million)Significantly upSignificantly down

Significantly upSignificantly down

Significantly upSignificantly down

p-value

− 2 0 2

TEno TE

*9.2e-74

Protein coding

Non-coding

− 5 0 5

TEno TE

1.7e-01

Protein coding

− 5 0 5

3' UTRCDS

5' UTRno TE

*4.1e-02*2.9e-30*8.9e-45

Log2 (Transcripts per million)

p-value

0% 50% 100%

GENCODE

Matching

Variant

% TranscriptsContains TENo TE

5458 (45%)

13332 (22%)

17825 (21%)

6694 (55%)

46899 (78%)

66207 (79%)

Contains TENo TE

0% 50% 100%

hPSC enriched

hPSC nonspecific

hPSC depleted

2453 (21%) 9408 (79%)

14594 (26%) 41018 (74%)

1882 (35%) 3439 (65%)

% Transcripts

hPSC enrichedhPSC nonspecifichPSC depleted

Matching

Variant

No TE

TE

No TE

TE

0% 50% 100%

1269 (21%) 3361 (62%) 828 (15%)

2371 (35%) 3715 (55%) 608 (9%)

2815 (21%) 8512 (64%) 2005 (15%)

13284 (28%) 29094 (62%) 4521 (10%)

% Transcripts

LTR:ERV1:LTR7

LTR:ERV1:HERVH

0.0000

0.0025

0.0050

0.0075

LINE:L1:L1HS_5end

5'UTR CDS 3'UTR

LINE:L1:L1M2_orf2

SINE:Alu:AluJr

0.000

0.002

0.004SINE:Alu:FRAM

5'UTR CDS 3'UTR

TEfre

quen

cyno

rmalize

dto

numbe

roftranscripts

0.000

0.005

0.010

0.000

0.005

0.010

0.000

0.002

0.004

0.000

0.005

0.010

0.015

A

E

Figure 3

B

C

H

D

G

I

F

J K

5'UTR

CDS

3'UTR

LTR:GypsyDNA:hAT-BlackjackLINE:RTE-XLINE:CR1LINE:L2DNA:hAT-CharlieRetroposon:SVASINE:AluDNA:TcMar-TiggerLINE:L1LTR:ERVKDNA:TcMar-MarinerLTR:ERVL-MaLRLTR:ERVLSINE:MIRLTR:ERV1DNA:hAT-Tip100

-1 0 1Log2 (fold-change)

versus no TE

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 26: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

23

Figure 3 – TEs spliced into coding sequence genes remodel the expression dynamics of 598

pluripotent stem cells. 599

(A) Proportion of protein coding transcripts in the GENCODE database, of hPSC-expressing 600

transcripts matching (to GENCODE), or variant (share transcript structure with a known 601

transcript), and the proportions that contain at least 1 TE or no TEs. 602

(B) Proportion of coding transcripts that are enriched, nonspecific or depleted in hPSCs 603

(C) Proportions of matching or variant transcripts with or without a TE that are hPSC-enriched, -604

nonspecific or -depleted. 605

(D) Relative density of all TE insertions in all or variant transcripts’ 5’ UTR, CDS or 3’UTR. Each 606

plot is normalized to the length of the transcript region. The blue line indicates all GENCODE 607

transcripts, the red, green and yellow lines indicate hPSC-specific expression categories. 608

(E) Density plots (as in panel D) for LINE L1, L2, SINE Alu, and LTR ERV1 TEs. 609

(F) Density plots for the indicated LTR, LINE and SINE types. 610

(G) Box plot for the expression levels of protein coding and non-coding transcripts with or without 611

a TE in its sequence. p-value is from a Welch’s t-test versus no TE-containing transcripts. 612

(H) Box plot for the expression levels of protein coding transcripts with or without a TE in the 613

5’UTR, CDS or 3’UTR. p-value is from a Welch’s t-test versus no TE-containing transcripts. 614

(I) Heatmap showing the fold-change of transcripts containing one or more of the indicated 615

subtypes of TE versus no TE-containing transcripts. 616

(J) Box plot for the expression levels of the transcripts containing the indicated TEs in the 5’UTR, 617

CDS or 3’UTR. p-value is from a Welch’s t-test versus no TE-containing transcripts. 618

(K) As in panel J, but for example transcripts with TEs in the 3’UTR that lead to higher levels of 619

expression. p-value is from a Welch’s t-test versus no TE-containing transcripts. 620

621

622

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 27: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

Long-read novel coding 12

Novel CDS

Known CDS

TE insertion

In-frame insertion 0

Frame-shift insertion 63

New STOP 214

New ATG 86

Alternate CDS 1117

CDS disruptions

mRNA Number oftranscriptsin class

D

FoundNot found

Number of proteins with >20 unmaskedamino acids and at least one cleavage site

Frameshift insertion

Long-read novel coding

0 1000

new STOPnew ATG

Alternate CDS3/11 (27%)

181/1050 (17%)42/198 (21%)

17/84 (20%)12/58 (20%)

2 ormore unique peptides1 Unique peptide

0 50 100 150 200

Inside TE

Outside TE

15(15%)

70(70%)

32(32%)

145(145%)

Number of genes with unique peptides thatderive from inside or outside a TE sequence

Number of peptides derived from thesequences from the indicated TEs

0 10

DNA:TcMar-Tigger:Tigger1DNA:TcMar-Tigger:Tigger2

LINE:L1:L1M4c_5endLINE:L1:L1M5_orf2

LINE:L1:L1MC4_3endLINE:L1:L1MC5a_3endLINE:L1:L1ME4a_3end

LINE:L1:L1P1_orf2LTR:ERV1:ERV24B_Prim

LTR:ERV1:HERVFH21LTR:ERV1:LTR8

LTR:ERV1:PABL_B-intLTR:ERV1:PRIMA41-int

LTR:ERVK:HERVKLTR:ERVK:HERVK11LTR:ERVK:HERVK13

LTR:ERVK:HERVK14CLTR:ERVK:LTR5

LTR:ERVL-MaLR:MLT1A1LTR:ERVL-MaLR:MLT1J-intLTR:ERVL-MaLR:THE1-intLTR:ERVL:ERV3-16A3_I

LTR:ERVL:ERV3-16A3_LTRLTR:ERVL:LTR16A

Retroposon:SVA:SVA_ARetroposon:SVA:SVA_C

SINE:Alu:AluJr4SINE:Alu:AluScSINE:Alu:AluSp

SINE:Alu:AluSq2SINE:Alu:AluSx1SINE:Alu:AluSx3SINE:Alu:FLAM_A

SINE:MIR:MIRSINE:MIR:MIRc

rRNA:.:LSU-rRNA_Hsa

3321111111122139311132111421311111311

A

Figure 4

B

C E

01975 3847

8718

Coding Seq

DNA/SVAs

LTRs

TEsSINE

s

LINEs

LTR5BHERVKLTR5

LTR5B HERVK HERVK11 HERVKLTR5B

MER11B

MER11A

ENST00000626825.1(+) HPSCSR.312557.3PCAT14

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 28: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

24

Figure 4 – TEs disrupt coding sequences and in limited instances code for peptides 623

(A) Schematic of the type of TE insertions into the coding sequences, and the number of transcripts 624

in each class. Grey regions indicate novel CDS caused by the TE insertion, black indicates the 625

known CDS, dotted lines indicate where the CDS would be if the TE was not present, and red 626

indicates the TE insertion. Transcripts were assigned to each category from top to bottom by 627

classification. 628

(B) Number of novel CDS peptide fragments that can be detected in the HipSci mass spectrometry 629

dataset. A protein was considered found if at least 1 unique peptide mapped to it. 630

(C) Number and percentage of transcripts with a unique peptide derived from inside or outside the 631

TE sequence. 632

(D) Numbers of unique peptides derived from each type of TE. 633

(E) Example domain map for PCAT14, a variant transcript enriched in hPSCs. The location of TEs 634

in the LINE, SINE LTR or DNA/SVAs is indicated on the horizontal lines. The coding sequence 635

is indicated using the thick black bar. The red vertical bars indicate peptides that map inside a 636

TE, the green vertical bar indicates a peptide mapping outside a TE. 637

638

639

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 29: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

− 5 0 5

109876543210

*1.3e-276.4e-014.3e-011.1e-012.9e-014.9e-01*2.6e-021.9e-01*2.3e-032.2e-01

Log2 (Transcripts per million)

Protein coding p-value

Log2 (Transcripts per million)

lncRNAs

− 5 0

109876543210

*3.5e-171*2.1e-09*3.9e-11*2.1e-10*4.6e-18*2.7e-15*3.9e-15*3.6e-18*4.8e-07*2.7e-02

p-value

NumberofTEs

intranscriptG

TSS TTS

All lncRNAs0.6

0.2

LINE:L1:L1HS_5end0.02

0.00

LINE:L1:L1PA3_3end

0.003

0.000

LTR:ERV1:HERV-Fc20.006

0.000

LTR:ERV1:HERVH0.1

0.0

LTR:ERV1:LTR7

0.04

0.00

LTR:ERV1:LTR7Y0.04

0.00TSS TTS

TEdensity(normalizedtototalnum

beroftranscripts)

LINE:L1

0.1

0.05

LINE:L20.03

0.00TE

density(normalizedtototalnum

beroftranscripts)

LTR:ERV10.2

0.1

LTR:ERVK

0.002

0.0010

TSS TTS

GENCODEhPSC enrichedhPSC nonspecifichPSC depleted

GENCODEhPSC enrichedhPSC nonspecifichPSC depleted

GENCODEhPSC enrichedhPSC nonspecifichPSC depleted

A

B

Figure 5

C

F

D

E

Matching

Variant

No TE

TE

No TE

TE

NovelNo TE

TE

1713 (22%) 5297 (69%) 667 (9%)

379 (27%) 829 (59%) 208 (15%)

1411 (22%) 4172 (66%) 764 (12%)

1977 (26%) 4795 (62%) 957 (12%)

270 (26%) 667 (65%) 94 (9%)

1916 (43%) 2263 (50%) 306 (7%)

0% 50% 100%% Transcripts

hPSC enrichedhPSC nonspecifichPSC depleted

0% 50% 100%% Transcripts

GENCODE

Matching

Variant

Novel

Contains TENo TE

4485 (81%)

7729 (85%)

6347 (45%)

37667 (50%)

1031 (19%)

1416 (15%)

7677 (55%)

37520 (50%)

−2.5 0.0 2.5

SINE:Alu:AluSgSINE:Alu:AluSpSINE:Alu:AluYSINE:Alu:FRAM

LINE:L2:L2LINE:L1:L1M2_orf2LINE:L1:L1M5_orf2

LINE:L1:L1ME1_3endLINE:L1:L1HS_5endLINE:L1:L1M1_5endLINE:L1:L1M3_orf2

LINE:L1:L1MC4_5endLINE:L1:L1P1_orf2LINE:L1:L1P4_orf2

LINE:L1:L1PA3_3endLINE:L1:L1PA4_3end

LTR:ERVL-MaLR:MLT1BLTR:ERV1:LTR7

LTR:ERV1:LTR7BLTR:ERV1:LTR7YLTR:ERV1:LTR8

LTR:ERV1:LTR12CLTR:ERV1:LTR12ELTR:ERV1:HERVH

LTR:ERV1:HERVFH21LTR:ERV1:HERVH48LTR:ERV1:HERV-Fc2LTR:ERV1:HERVS71LTR:ERV1:MER50-intLTR:ERV1:HUERS-P3b

LTR:ERV1:MER110LTR:ERVK:HERVK

LTR:ERVL-MaLR:MST-intLTR:ERVL-MaLR:THE1-int

No TE

*2.4e-48*1.2e-59*2.2e-72*2.9e-87*8.0e-64*2.2e-15*1.7e-81*3.3e-26*4.4e-26*7.2e-08*2.5e-15*4.1e-51*1.9e-23*2.0e-13*4.6e-15*3.7e-19*3.0e-301.6e-01*1.4e-05*2.7e-03*9.5e-07*2.1e-09*2.6e-064.4e-01*1.3e-027.7e-019.0e-016.7e-012.7e-012.8e-016.8e-024.9e-01*1.7e-07*2.0e-06

Log2 (Transcripts per million)Significantly upSignificantly down

Significantly upSignificantly down

q-value

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 30: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

25

Figure 5 – Widespread type-specific splicing of TEs into hPSC-specific lncRNAs. 640

(A) Frequency of TE insertions in GENCODE, and hPSC-matching, variant and novel lncRNAs. 641

(B) Frequency of hPSC expression classes for matching, novel and variant lncRNAs, including or 642

excluding a TE. 643

(C) Frequency of TEs across the normalized length of all lncRNAs. TSS=Transcription start site; 644

TTS=Transcription termination site. Frequency is normalized to the total number of transcripts 645

in each expression class. 646

(D) Examples frequency showing the insertion density frequency for LINE L1, L2 and ERV1 and 647

ERVK TEs. 648

(E) As in panel C and D, except showing selected TE types. 649

(F) Box plots showing the expression of lncRNAs containing the indicated TE types, versus all 650

transcripts with no TE. q-value is from a Welch’s t-test with Bonferroni-Hochberg multiple test 651

correction (for all families of TE). Red boxplots indicate significantly up, blue indicates 652

significantly down, grey no significant change. 653

(G) Box plots showing the effect of multiple TE insertions on non-coding and protein-coding 654

transcripts. p-value is from a Welch’s t-test versus transcripts containing zero TEs. 655

656

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 31: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

0 1

SINE:MIRSINE:Alu

Retroposon:SVALTR:ERVL-MaLR

LTR:ERVLLTR:ERVKLTR:ERV1LINE:L2LINE:L1

DNA:hAT-CharlieDNA:TcMar-Tigger

0 1 0 1 0 1 0 1

Cluster 0Pluripotency

Cluster 1Proliferation

Cluster 2Pluripotency

Cluster 3Differentiation

Cluster 4Quiescence

Cluster

GO:0017148 negative regulation of translationGO:0007369 gastrulationGO:0090068 positive regulation of cell cycle processGO:0045740 positive regulation of DNA replicationGO:0080154 regulation of fertilizationGO:1901992 positive regulation of mitotic cell cycle phase transitionGO:0001825 blastocyst formationGO:0001708 cell fate specificationGO:0040019 positive regulation of embryonic developmentGO:0050807 regulation of synapse organizationGO:0045669 positive regulation of osteoblast differentiationGO:0072073 kidney epithelium development

*** *

* *

***

* *** *

***

0 1 2 3 4

-Log10(P-value)2 4 6 8 10

HPSCSR.267787.92PIF1

0 3Normalizedexpression

HPSCSR.281890.135TOP2A

HPSCSR.221584.174MALAT1

HPSCSR.160725.1PODXL (GCTM2)

0

1

23

4

% Cells in clusters

LTR:ERV1:HERVHLTR:ERV1:HERVS71

SINE:AluLTR:ERVLLINE:L1

SINE:AluSINE:MIR

SINE:AluLTR:ERV1:HERVH

LINE:L1

LTR:ERV1:HERVH/LTR7SINE:AluLINE:L1

SINE:AluLINE:L1

Differentiation/EMT

Proliferation

Cluster 2(7.8%)

Cluster 1(17%)

Cluster 0(71%)

Cluster 4(0.5%)

Cluster 3(2.1%)

Cell cycle exitQuiescence

TEno TE

0 100

TEno TE

0 50

TEno TE

0 500

TEno TE

0 1000

TEno TE

0 10

Number of transcripts

EnrichedNonspecific

Depleted

0 100

1

EnrichedNonspecificDepleted

0 50

2

EnrichedNonspecificDepleted

0 5

0

EnrichedNonspecific

Depleted

0 500

3

EnrichedNonspecific

Depleted

0 500

4

Number of transcripts

Cluster

F

G

Number of TEs per number of transcripts in clusterNumber of TEs per number

of transcripts in cluster

0.00 0.10 0.20

H

HPSCSR.67426.5DPPA4

HPSCSR.210694.8UTF1

0 3Normalizedexpression

S G2/MG1

0

1

2

3

4

0% 50% 100%

Cluster

% of cells

94%

62%

89%

68%

75%

1%

3%

0%

2%

2%

5%

34%

11%

30%

23%

A B

Figure 6

D

C

I

E

0 1 2 3 4DNA:hAT-Charlie:MER112DNA:hAT-Charlie:MER58ADNA:hAT-Tip100:MER63CDNA:hAT-Tip100:MER63DLINE:L1:L1M5_orf2LINE:L1:L1MA8_3endLINE:L1:L1ME2z_3endLINE:L1:L1ME4a_3endLINE:L2:L2LINE:L2:L2c_3endLTR:ERV1:HERV-Fc2LTR:ERV1:HERVFH21LTR:ERV1:HERVHLTR:ERV1:HERVS71LTR:ERV1:LTR36LTR:ERV1:LTR38LTR:ERV1:LTR6BLTR:ERV1:LTR7LTR:ERV1:LTR7BLTR:ERV1:LTR7YLTR:ERV1:MER51BLTR:ERVL:MER21ALTR:ERVL:MER21BLTR:ERVL:MER21CRetroposon:SVA:SVA_FSINE:Alu:AluJbSINE:Alu:AluJrSINE:Alu:AluJr4SINE:Alu:AluSc8SINE:Alu:AluSgSINE:Alu:AluSpSINE:Alu:AluSq2SINE:Alu:AluSxSINE:Alu:AluSx1SINE:Alu:AluSzSINE:Alu:AluYSINE:Alu:AluYcSINE:Alu:AluYk3SINE:Alu:FLAM_ASINE:Alu:FLAM_CSINE:Alu:FRAMSINE:Alu:PB1D11SINE:MIR:MIRSINE:MIR:MIRb

Cluster

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint

Page 32: Transposable Element-Gene Splicing Modulates the ... · 7/26/2020  · 411 Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 412 Interpretation

26

Figure 6 – Single cell expression reveals TE containing transcripts are specifically expressed in 657

subpopulations within hPSC cultures 658

(A) UMAP projection and clustering of scRNA-seq of hPSC data. Clustered using Leiden 659

(resolution=1.0) and clusters showing no differentially regulated genes were iteratively merged. 660

(B) Gene ontology analysis of the genes significantly associated with the indicated clusters (See 661

Supplemental Table S6 for the list of transcripts). 662

(C) UMAP plots colored by the expression level of two pluripotency genes, UTF1 and DPPA4. 663

(D) Estimates of cell cycle phase of the indicated UMAP clusters based on enrichment of cell cycle 664

genes. 665

(E) UMAP plot colored by expression of the G2/M-associated genes PIF1, TOP2A, MALAT1 and 666

PODXL. 667

(F) Number of transcripts in each cluster that is enriched, nonspecific or depleted in hPSCs (Left 668

bar charts), and number of transcripts containing a TE (right bar charts). 669

(G) Number of TEs types per number of transcripts in the indicated clusters. 670

(H) Heatmap indicating the number of TE families per number of transcripts in the indicated clusters 671

(I) Schematic indicating the relationship between the hPSC cell sub-populations. The % of cells in 672

each cluster is indicated, and the major TE types expressed are indicated. 673

674

.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint