Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
1
Transposable Element-Gene Splicing Modulates the Transcriptional Landscape of Human 1
Pluripotent Stem Cells 2
Isaac A. Babarinde1, Gang Ma1, Yuhao Li1, Boping Deng1,2, Zhiwei Luo3, Hao Liu4, Mazid Md. 3
Abdul4, Carl Ward4, Minchun Chen1, Xiuling Fu1, Martha Duttlinger1, Jiangping He5,6,7, Li Sun1, 4
Wenjuan Li4, Qiang Zhuang1, Jon Frampton2, Jean-Baptiste Cazier2,8, Jiekai Chen5,6,7, Ralf Jauch9, 5
Miguel A. Esteban4,5, Andrew P. Hutchins1# 6
7
1Department of Biology, Southern University of Science and Technology, Shenzhen, Guangdong 8
518055, China. 9
2Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK. 10
3Guangzhou Medical University, Guangzhou 511436, China. 11
4Laboratory of Integrative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese 12
Academy of Sciences, Guangzhou 510530, China. 13
5Guangzhou Regenerative Medicine and Health-Guangdong Laboratory (GRMH-GDL), 510530 14
Guangzhou, China. 15
6Key Laboratory of Regenerative Biology of the Chinese Academy of Sciences and Guangdong 16
Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of 17
Biomedicine and Health, Chinese Academy of Sciences, 510530 Guangzhou, China. 18
7Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of 19
Biomedicine and Health, Chinese Academy of Sciences, 511436 Guangzhou, China. 20
8Centre for Computational Biology, University of Birmingham, Birmingham, UK. 21
9School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 22
Hong Kong SAR, China. 23
#Correspondance: [email protected] 24
25
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
2
Abstract 26
Transposable elements (TEs) occupy nearly 50% of mammalian genomes and are both potential dangers 27
to genome stability and functional genetic elements. TEs can be expressed and exonised as part of a 28
transcript, however, their full contribution to the transcript splicing remains unresolved. Here, guided by 29
long and short read sequencing of RNAs, we show that 26% of coding and 65% of non-coding 30
transcripts of human pluripotent stem cells (hPSCs) contain TEs. Different TE families have unique 31
integration patterns with diverse consequences on RNA expression and function. We identify hPSC-32
specific splicing of endogenous retroviruses (ERVs) as well as LINE L1 elements into protein coding 33
genes that generate TE-derived peptides. Finally, single cell RNA-seq reveals that proliferating hPSCs 34
are dominated by ERV-containing transcripts, and subpopulations express SINE or LINE-containing 35
transcripts. Overall, we demonstrate that TE splicing modulates the pluripotency transcriptome by 36
enhancing and impairing transcript expression and generating novel transcripts and peptides. 37
38
Keywords: transposable elements, pluripotent stem cells, long non-coding RNA 39
40
41
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
3
Introduction 42
TEs are repetitive sequences of DNA that make up the majority of mammalian DNA (Jurka et al. 2007; 43
Hutchins and Pei 2015). Whilst the majority of TEs are incapable of retrotransposition, there is growing 44
evidence that inactive incorporated TEs are epigenetically regulated and can alter splicing patterns and 45
generate gene-TE chimeric transcripts that can drive disease and oncogene activity (Jang et al. 2019; 46
Clayton et al. 2020). However, paradoxically, TE expression is also associated with developmental 47
competency and evolutionary innovation (Kunarso et al. 2010; Macfarlan et al. 2012; Fort et al. 2014; 48
Wang et al. 2014; Theunissen et al. 2016; Bourque et al. 2018). TEs are mainly thought to be suppressed 49
by DNA methylation and histone H3K9me3 in somatic cells (Feng et al. 2010; Jonsson et al. 2019), but 50
in the mammalian embryo DNA is demethylated, and human pluripotent stem cells (hPSCs) have 51
reduced epigenetic repression, looser chromatin and a permissive environment for transcript and TE 52
expression (Bulut-Karslioglu et al. 2018). Consequently, TEs are more active during embryogenesis and 53
in hPSCs, compared to somatic cells (Fort et al. 2014), and are expressed in a stage-specifically manner 54
during embryogenesis (Goke et al. 2015; Grow et al. 2015). 55
In addition to their epigenetic activity, TEs are active genetically and can alter transcript splicing 56
patterns by becoming exonised into transcripts (Macfarlan et al. 2012; Wang et al. 2014; Goke et al. 57
2015; Grow et al. 2015). TEs make a major contribution to the sequences of long-non-coding RNAs 58
(lncRNAs) (Lev-Maor et al. 2008; Kelley and Rinn 2012; Kapusta et al. 2013; Naville et al. 2016), and 59
whilst lncRNAs have roles in normal biological processes (Goff and Rinn 2015; Bourque et al. 2018; 60
Lu et al. 2020), and disease (Wapinski and Chang 2011; Carlevaro-Fita et al. 2020), the roles of TEs 61
inside the lncRNAs has received less attention. One model suggests that TEs act as functional domains 62
within lncRNAs, something akin to globular domains of proteins (Johnson and Guigo 2014; Chishima 63
et al. 2018). However, the full contribution of TEs to transcript splicing in both the normal and disease 64
state remains unexplored. Interestingly, the splicing of TEs into normal pluripotency genes has been 65
observed in cancer (Jang et al. 2019), suggesting TE-splicing can contribute to human disease. 66
Consequently, it is important to understand how TEs contribute to the coding and non-coding 67
transcriptome (You et al. 2017; Ma et al. 2019; Morillon and Gautheret 2019), if hPSCs are to be used 68
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
4
in cell replacement therapy (Fort et al. 2014; Schumann et al. 2019). However, due to limitations in 69
short-read based sequencing, accurate transcriptome maps have been elusive. Much of this is due to the 70
relatively low expression of the TEs and lncRNAs (Lagarde et al. 2017), the repetitive nature of TEs, 71
and problems in using short-reads to assemble longer transcripts (Steijger et al. 2013; Babarinde et al. 72
2019). 73
To explore the splicing patterns of TEs in the normal, non-diseased, state, we took advantage 74
of the large number of hPSC short-read-RNA-seq samples, which we supplemented with long-read-75
RNA-seq to build a map of the gene-TE transcriptome of hPSCs. Our analysis shows that TEs are 76
expressed and spliced into both coding and non-coding RNAs, and had strong effects on gene 77
expression. TEs make up a substantial fraction of the non-coding transcript complement, and they also 78
alter the peptide output of the cell, producing fragments of endogenous viral proteins and small peptides. 79
Finally, using single-cell RNA-seq we show that the hPSC-state is dominated by HERVH and LTR7-80
containing transcripts, which are associated with proliferating cells and are rapidly lost on differentiation 81
as SINE and LINE-containing transcripts become dominant. 82
83
Results 84
Transcript assembly from short and long reads 85
To build a hPSC transcriptome, we started with 197 publicly available short read (SR) paired-end-only 86
hPSC RNA-seq data samples. To quality control the data, we mapped the hPSC samples to the hg38 87
genome assembly using HISAT2 (Kim et al. 2019). Only samples with a minimum mapping rate of 70% 88
were retained, leaving 171 RNA-seq samples (Supplemental Fig. S1A). There are widespread errors in 89
metadata annotation of publically available transcriptome data (Toker et al. 2016), consequently, to 90
ensure that the samples were undifferentiated hPSCs, we measured gene-level expression (Hutchins et 91
al. 2017). hPSC samples were removed if they were outliers in a co-correlation analysis, or did not 92
express hPSC-marker genes, or expressed differentiation-specific genes (Supplemental Fig. S1B and 93
C). This left 150 samples that passed our quality control metrics (Supplemental Table S1, and 94
Supplemental Fig. S1D). We used the 150 confirmed hPSC samples to assemble transcripts using 95
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
5
HISAT2/StringTie (Pertea et al. 2015; Kim et al. 2019) (Supplemental Fig. S1E). In total ~13 billion 96
reads, with a median mapped fragment size of 225 bp of which 92% could be aligned to the hg38 genome 97
assembly. StringTie (Pertea et al. 2015) assembled an initial set of 279,051 transcripts, which was 98
reduced to 272,268 after removing transcripts shorter than 200 bp (Supplemental Fig. S1F). This 99
dataset comprised the transcripts assembled from the SR data. 100
Transcript assembly from SR can be problematic (Wang et al. 2019), especially if the transcripts 101
contain transposable elements (Babarinde et al. 2019). This is reflected in the large number of single-102
exon fragments from the SR assembly (72,902; 27%). Although some of these fragments might be 103
genuine single-exon transcripts, many are likely to be fragmentary assemblies from SR. To address this, 104
we augmented our assembly with SMRT long-read (LR) sequencing using the PacBio platform. We 105
sequenced the hESC cell line H1 and the iPSC cell line S0730 in duplicate (Zhou et al. 2011). Transcripts 106
were identified from long reads using the isoseq pipeline, and were aligned to the genome using GMAP 107
(Wu and Watanabe 2005) (Supplemental Fig. S1E). Long read assembly produced 53,168 unique 108
transcripts that were at least 200bp long (Supplemental Fig. S1F). 109
Both SR-based and LR-based transcriptomes have disadvantages: SR-based assemblies tend to 110
be unreliable (Steijger et al. 2013), and LR-based assemblies can detect rare transcripts and are poor at 111
quantifying expression. Consequently, to arrive at a transcriptome representation, we set two expression 112
level conditions that a transcript must pass: (1) Expression >=0.1 TPM (transcripts per million) in at 113
least 50 of the samples. (2) The average per-base coverage reported by StringTie must be greater than 1 114
for all exons. Additionally, we also deleted single-exon transcripts that appear to be derived from 115
incomplete splicing, if they shared edges near perfectly with an annotated intron (Supplemental Fig. 116
S1G). The final assembly contained 101,492 transcripts, of which 13,177 (13%) were single-exons. 117
Using these criteria, 71% of the LR transcripts were retained, but only 17% of the SR transcripts were 118
kept in the final assembly (Supplemental Fig. S1F, H). To validate our assembly pipeline, we designed 119
RT-PCR primers spanning an intron (Supplemental Table S2) and could detect 31 out of 40 transcripts 120
including 8 out of 9 long-read transcripts (Supplemental Fig. S2A). We confirmed the accuracy of the 121
5’ ends of our transcript assembly using iPSC and hESC deepCAGE data (Consortium et al. 2014) (Fig. 122
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
6
1A, S2B). There was a linear relationship between deepCAGE expression and RNA-seq TPM for the 123
transcripts we could unambiguously assign a deepCAGE cluster (Supplemental Fig. S2C). Our final 124
transcript assembly contained 58,201 transcripts with SR support, 6,881 with LR support and 36,410 125
with both SR and LR support (Fig. 1B and Supplemental Table S3). 126
127
Identification of novel transcripts and isoforms 128
We next classified the hPSC transcripts based on the GENCODE transcriptome. We defined three 129
categories: Matching (all internal splices match a GENCODE transcript and at least 75% total exon 130
length overlap); Variant (with an exon overlapping any exon from a GENCODE transcript but not 131
completely matching), and Novel (with no exon overlapping any GENCODE exon) (Fig. 1C). Whilst 132
most transcripts shared a splicing pattern with a GENCODE transcript, 26,846 transcripts were variant 133
or novel (Fig. 1C). For each assembled transcript, we defined completeness as the percent of exons or 134
splices of the closest GENCODE transcript that were correctly retrieved (Supplemental Fig. S2D). 135
Transcripts supported by both LR and SR tend to have higher GENCODE exon and splice completeness 136
(Supplemental Fig. S2E), although there remains a sizeable number of transcripts supported only by 137
SR, suggesting our LR data set has not fully saturated the transcriptome (Supplemental Fig. S2F). The 138
known transcripts tended to be expressed more strongly than novel transcripts, as measured by both 139
RNA-seq and deepCAGE (Fig. 1D). 140
141
A census of coding and non-coding transcripts enriched in pluripotent stem cells 142
FEELnc was used to discriminate coding and non-coding transcripts (Wucher et al. 2017), of which 143
28,699 (28%) were non-coding, and the majority of novel transcripts were non-coding (Fig. 1E, and 144
Supplemental Fig. S2G-H). Conversely, the majority (~81%) of the GENCODE-matching transcripts 145
were protein-coding. As expected, the expression of lncRNAs was lower than protein coding transcripts 146
(Fig. 1F, and Supplemental Fig. S2I). Genes can be expressed in both a cell type-independent and cell 147
type-specific manner (Hutchins et al. 2017). To identify the transcripts important for hPSCs function, 148
we measured the expression of our transcript assembly using the BodyMap dataset (Derrien et al. 2012), 149
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
7
and divided the transcripts into hPSC-enriched, -nonspecific or -depleted, based on the Z-score against 150
the BodyMap expression dataset (Fig. 2A, and Supplemental Fig. S2J). This approach could recover 151
known hPSC-enriched transcripts such as NANOG, POU5F1 and SALL4 (Fig. 2B). There was no 152
specific bias in coding and non-coding transcripts between the hPSC expression categories (Fig. 2C). 153
However, novel transcripts were more likely to be enriched in hPSCs (Supplemental Fig. S2K). Finally, 154
non-coding transcripts had a lower level of expression compared to coding transcripts across all 155
categories (Fig. 2D). Using LR and SR we have generated an accurate transcriptome for hPSCs, that 156
describes known and novel transcripts, coding and non-coding, and hPSC-enriched and -depleted 157
transcripts. We will use this transcriptome to explore the effect of TEs. 158
159
TEs are widely spliced into protein coding genes and modulate expression levels 160
TEs are found in the untranslated regions (UTRs) of coding genes (Bourque et al. 2018), and contribute 161
to lncRNAs (Kelley and Rinn 2012; Kapusta et al. 2013). Additionally, deepCAGE data has revealed 162
pluripotent-specific TSSs (transcription start sites) originating inside TEs (Faulkner et al. 2009; Fort et 163
al. 2014). We searched our assembled transcriptome using nhmmer (Wheeler and Eddy 2013), using the 164
DFAM models of TEs (Hubley et al. 2016), and identified 37,493 transcripts (37%) that contained at 165
least one TE (Supplemental Table S4). About 22% of the matching transcripts contained a TE, which 166
was similar to GENCODE, however the variant coding transcripts had a higher percentage of TE-167
containing transcripts in hPSCs (Fig. 3A). There was a very small number of novel coding transcripts 168
(411), of which 139 (34%) were predicted to contain a TE. As this number is close to the false positive 169
rate to detect coding transcripts, we do not explore these further (Fig. 1E, Supplemental Fig. S2G, and 170
Supplemental Table S3). Surprisingly, hPSC-enriched transcripts were less likely than hPSC-171
nonspecific or hPSC-depleted transcripts to contain a TE (Fig. 3B). This effect was not unique to the 172
subset of novel transcripts assembled by us, as the same pattern is observed for transcripts matching the 173
GENCODE annotations (Fig. 3C). This is unexpected as TEs are thought to be more actively expressed 174
in the epigenetically relaxed hPSCs. 175
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
8
We next analyzed TE insertion frequency into the UTRs or CDS of coding transcripts (Fig. 3D). 176
We annotated the CDS from either the GENCODE reference or using the longest open reading frame 177
(ORF). There was an enrichment of TEs in the UTRs, particularly inside the 3’UTRs, and a depletion 178
in the CDS (Fig. 3D). We next asked whether integration sites vary across TE families. LINE:L1, L2 179
and SINE:Alu elements were specifically enriched in both UTRs, but particularly in the 3’UTRs (Fig. 180
3E, and Supplemental Fig. S3A, S3B). LINEs and SINEs were robustly depleted in the CDS, compared 181
to the UTRs (Fig. 3E), and most LINE:L1 TEs were biased to the UTRs (Fig. 3E). For example, 182
LINE:L1:L1M2_orf2 was strongly biased to the 3’UTR, and LINE:L1:L1HS_5end was specifically 183
enriched in the 5’UTR (Fig. 3F, and Supplemental Fig. S3C). Intriguingly, the L1HS family of LINEs 184
are active and capable of retrotransposition in humans (Beck et al. 2010), and their presence here in the 185
5’UTRs of genes suggests activity in hPSCs. 186
In a pattern similar to LINEs, several SINE family members were spliced into the 3’ ends of 187
transcripts (Fig. 3E, and Supplemental Fig. S3B), whilst HERVH/LTR7 TEs were specifically 188
enriched in the 5’UTR (Fig. 3F). Previous analysis of deepCAGE data showed that HERVH/LTR7 acts 189
as a hPSC-specific transcription start site (Fort et al. 2014). Our assembly included previously reported 190
LTR7/HERVH initiating transcripts (Wang et al. 2014) (Supplemental Fig. S3D). Additionally, we 191
could assemble transcripts with HERVH spliced throughout the transcripts, including the CDS and 192
3’UTR (Fig. 3F, and Supplemental Fig. S3E), indicating that HERVH can function as an exonised 193
unit. LTR7, the HERVH-accompanying LTR, was restricted to only the 5’UTR (Fig. 3F). 194
We next looked at the effect of TE insertions on transcript expression. Non-coding transcripts 195
containing TEs had significantly lower expression levels, however coding transcripts were overall 196
unaffected by the presence of a TE (Fig. 3G). It was previously reported that retrotransposons in the 197
3’UTR lead to decreased isoform expression (Faulkner et al. 2009), hence we explored if TE location 198
within a coding transcript influences expression. Transcripts with a TE in the 5’UTR or CDS had 199
significantly lower expression, however those with a TE inside the 3’UTR were significantly (though 200
modestly) up-regulated (Fig. 3H). We next looked at the individual subtypes of TEs, and found that 201
most subtypes of TE led to a lower level of expression (Fig. 3I). Transcripts with HERVHs, or L1 LINEs 202
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
9
had significantly lower expression no matter the location of their insertion (Fig. 3I, J). However, some 203
TEs could be inserted into the 3’UTR, and expression would be higher. For example, SINEs and the 204
DNA:hAT-Charlie family (Fig. 3I, K). Overall, the effect of TEs in transcripts is both transcript context 205
and TE-family dependent, and whilst LTRs tend to be uniformly deleterious for expression, some SINEs 206
and DNA transposons in the 3’UTR are associated with enhanced expression. 207
208
TE splicing affects the proteome of pluripotent cells 209
We next analyzed whether TE insertions alter the amino acid sequences of coding transcripts. We 210
divided TE insertions into classes based upon the TE insertions effect on the CDS, and counted the 211
number of TE insertions (Fig. 4A). We did not detect any in-frame TE insertions and found only 63 212
frame-shift insertions (Fig. 4A). A sizeable number of TE insertions led to a premature STOP or to the 213
introduction of a new ATG that would disrupt the pre-existing CDS. However, the most common type 214
of TE insertion caused a disruption or frame-shift that led to a longer CDS becoming the most likely 215
predicted CDS (Fig. 4A). These transcripts are predicted as coding, however they may not yield a 216
peptide, as they may lack productive translation or peptide products may be rapidly degraded. 217
Additionally, some TE types, particularly the SINE/Alu and SVA TEs, often have coding-like 218
signatures, but do not encode proteins (Jungreis et al. 2018). To estimate how many of these TE modified 219
CDSs are translated, we extracted the peptide sequences of the putative CDS, deleted any regions that 220
had a >95% BLAST hit against any GENCODE protein, removed peptides shorter than 20 amino acids 221
and retained only those that had at least one Trypsin/Lys-C cleavage site. These strict criteria mean that 222
we would only detect peptides derived from the impact of the TE, or from TE-derived sequences. We 223
then used the HipSci proteomics LC-MS/MS data (Kim and Pevzner 2014; Kilpinen et al. 2017) to 224
search for peptides. Overall, we could detect at least one peptide match for ~17-20% of transcripts (Fig. 225
4B and Supplemental Table S5), indicating that at least some TE-modified transcripts are translated 226
(Supplemental Fig. S4A). The majority of peptides were not derived directly from the TE, but 227
correspond to novel ORFs outside the TE sequence (Fig. 4C). Yet, a small number of peptides are 228
directly encoded by TEs (Fig. 4D). Many of the peptides corresponded to gag, pol and env but not pro, 229
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
10
of a putative progenitor HERVK (Dewannieux et al. 2006) (23 unique peptides in total) (Supplemental 230
Fig. S4B). These results are consistent with reports that some human cells synthesize viral-like 231
structures, particularly in the placenta (Kammerer et al. 2011), and HERVK RNAs and proteins are 232
expressed in hPSCs (Fuchs et al. 2013). Our data suggests that these viral proteins are derived from up 233
to nine transcripts (Supplemental Table S5). Interestingly, we find evidence for coding potential for a 234
variant hPSC-enriched transcript PCAT14 (prostate cancer-associated transcript 14) that has hitherto 235
been considered non-coding. We observed 13 unique HERVK peptides derived from PCAT14 that 236
encode a near-complete product of gag and fragments of pol (Fig. 4E, Supplemental Fig. S5A-C and 237
Supplemental Table S5). We did not observe any peptides from HERVH, in agreement with a previous 238
report that the ORFs are not functional (Santoni et al. 2012). In addition to HERVs, we also detected 239
peptides derived from SINEs, which was unexpected as SINEs do not encode proteins and rely on LINEs 240
for retrotransposition. Using BLAST to search for matching peptides against the human non-redundant 241
protein set produced no significant hits. In summary, these data indicate that TE insertions modulate the 242
coding transcriptome of pluripotent cells, and can give rise to novel peptides and the ectopic expression 243
of viral proteins. 244
245
TEs wire the lncRNA complement of hPSCs 246
Previous reports show that TEs constitute a major part of lncRNAs (Kelley and Rinn 2012; Kapusta et 247
al. 2013). Consistently, we observed a large number of non-coding transcripts containing at least 1 TE 248
(65%), slightly less that the 83% reported in a smaller set of lncRNAs (Kelley and Rinn 2012). Of the 249
transcripts matching GENCODE, 45% had a TE, whilst the variant and novel transcripts were more 250
likely to contain a TE (Fig. 5A). Interestingly, and in contrast to coding transcripts, novel and variant 251
non-coding transcripts containing a TE were more likely to be enriched in hPSCs (Fig. 5B). 252
We next looked at the position of TEs inside lncRNA sequences. TEs are spliced into lncRNAs, 253
and are found anywhere from the TSS to the TTS (transcription termination site) with a slight bias 254
towards the 3’ end that was more pronounced in hPSC-enriched transcripts (Fig. 5C). Interestingly, 255
compared to GENCODE, hPSC transcripts were more enriched with TEs, and hPSC-enriched transcripts 256
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
11
contained even more TEs (Fig. 5C, 5D, and Supplemental Fig. S6A). This indicates that the 257
contribution of TEs to non-coding transcripts is increased in hPSCs. At the TE family level, LINEs were 258
enriched in lncRNAs, such as LINE:L1:L1PA3_3end and LINE:L1:L1Hs_5end (Fig. 5E, and 259
Supplemental Fig. S6B). The L1HS family are active and capable of retrotransposition in humans, and 260
although we did not observe intact L1HS sequences or peptides, that L1HS TE fragments are contained 261
inside lncRNAs suggests they are under active epigenetic regulation (Fig. 3F and 5E). In addition to 262
LINEs, the SINEs, DNA, SVA and Satellite type repeats/TEs were also enriched (Supplemental Fig. 263
S6A, S6C-E). However, overall, the LTRs had the most complex patterns of splicing (Fig. 5E), and 264
especially the ERV1 and ERVK families of TEs (Supplemental Fig. S6F, S6G). LTR7, the LTR for 265
HERVH, can function as a hPSC-specific TSS (Fort et al. 2014). However, unlike coding transcripts, 266
LTR7 not was biased to the TSS and there was splicing of LTR7 and HERVH along the entirety of the 267
lncRNAs (Fig. 5E). There were similarly complex patterns of other LTRs, for example HERVFH21 268
and HERVH48 were found in the middle of transcripts, and not at the TSS or TTS (Supplemental Fig. 269
S6G). Overall, these results indicate that TEs are more likely to be spliced into hPSC-enriched non-270
coding transcripts, which is in contrast to hPSC-enriched coding transcripts which are depleted of TEs 271
(Fig. 3A, 3B and 5B). 272
Finally, we looked at the effect of TE insertions on lncRNA expression. In contrast to coding 273
transcripts, the presence of a TE inside the lncRNA was more likely to be deleterious for expression 274
(Fig. 3G). Many of the hPSC-specific transcripts with TEs had significantly reduced expression 275
compared to transcripts containing no TE (Fig. 5F). Additionally, multiple TE insertions resulted in 276
decreased expression, an effect not seen in coding transcripts (Fig. 5G). These results indicate that in 277
direct contrast to coding transcripts TE insertions are deleterious for non-coding transcript expression. 278
279
Single cell RNA-seq expression of lincRNAs, heterogeneity in TE splicing 280
We took advantage of our hPSC-specific high-quality transcript assembly to look at single cell RNA-281
seq (sc-RNA-seq) data. Sc-RNA-seq technology can measure specific expression of genes in individual 282
cells. However, transcript identity can be ambiguous as the reads are heavily biased to the 3’ ends. 283
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
12
Consequently, we reduced the transcripts to those with unique non-overlapping strand-specific 3’ ends, 284
and collapsed overlapping transcript 3’ends to a single transcript. This resulted in a total set of 88,520 285
transcripts (87% of the total superset of transcripts). We generated two sc-RNA-seq datasets, one from 286
WIBR3 hESCs and another from S0730/c11 iPSCs (Zhou et al. 2011), supplemented with 5 samples 287
from WTC iPSCs from E-MTAB-6687 (Nguyen et al. 2018), and two UCLA1 line hESC samples from 288
GSE140021 (Chen et al. 2019). We aligned the reads to the hg38 genome assembly and annotated the 289
reads to our 3’end transcript database. On average 50-70% of the reads could be aligned to this 3’ end 290
transcriptome (in comparison 60-70% align to GENCODE for the same samples). The majority of the 291
3’ ends show strand-specific read pileups (Supplemental Fig. S7A) indicating that our transcript 292
assembly has accurate 3’ ends and can be used for scRNA-seq. After filtering and normalization, we 293
retrieved 30,001 cells, and 35,400 transcript ends. Projection of the cells into a UMAP (Uniform 294
Maniform and Approximation Projection) plot showed the hESCs and iPSCs mixed well, and there was 295
no bias in the replicate samples (Supplemental Fig. S7B, S7C). We could detect five major clusters of 296
cells (Fig. 6A). GO (gene ontology) analysis for differentially regulated transcripts specific to each 297
cluster and marker genes expression indicated that all clusters except clusters 3 and 4 were pluripotent 298
(Fig. 6B, 6C, and Supplemental Fig. S7D). Analysis of the expression of G1, S and G2/M-specific 299
genes indicated that clusters 0 and 1 had higher expression of G2/M and proliferative genes (Fig. 6D, 300
E), suggesting these are highly proliferative cells identified previously (Nguyen et al. 2018; Lau et al. 301
2020), positive for EPCAM, CD9 and PODXL (GCTM2) (Fig. 6E, and Supplemental Fig. S7E, S7F). 302
Conversely, clusters 2 and 4 showed reduced levels of G2/M (Fig. 6D). Intriguingly, our data contained 303
two novel clusters of cells (cluster 3 and 4), that was predominantly made up of hESCs, rather than 304
iPSCs (Supplemental Fig. S7B). GO analysis suggested that these cells were undergoing differentiation 305
(Fig. 6B). Marker genes for a specific lineage were not evident, but there was a strong shift in epithelial 306
and mesenchymal genes, as CDH1 and EPCAM were downregulated and CDH2 and VIM were up-307
regulated (Supplemental Fig. S7E, S7G). This is reminiscent of the epithelial-mesenchymal transition 308
we saw previously in the early stage of hepatocyte differentiation (Li et al. 2017). The main difference 309
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
13
between the two clusters was in cell cycle activity, as many cells were in G2/M in cluster 3, but cluster 310
4 showed reduced cell cycle activity (Fig. 6D). 311
Finally, we looked at the composition of TEs and transcripts specific to each of the clusters. 312
Clusters 0, 1 and 2 represent the bulk hPSCs, and exist on a spectrum, from highly proliferating (cluster 313
1) to medium proliferation (cluster 0) and relatively low proliferation (cluster 2). All three clusters were 314
rich in hPSC-enriched transcripts and TE-containing transcripts (Fig. 6F). We measured the frequency 315
of TEs inside the transcripts, and observed a large number of HERVH TEs, and some LINE:L1 and 316
SINE:Alu-containing transcripts specific to cluster 0, 1 and 2 (Fig. 6G, S8A, B). Proliferative cluster 1 317
showed an increase in SINE:Alu containing transcripts (Fig. 6G). The two differentiating clusters, 3 318
and 4, were almost entirely depleted of LTR-containing transcripts (Fig. 6G), and were only enriched 319
for SINE:Alu, SINE:MIR and LINE:L1-containing transcripts (Fig. 6G, H, S8C, D). Overall, we show 320
that distinct subsets of cells within a hPSC culture contain transcripts with specific patters of TE 321
expression. HERVH TEs are restricted to clusters 0, 1 and 2, and are inversely correlated with 322
proliferation. Proliferating cells tend to express transcripts containing SINEs and LINEs, and cells 323
undergoing an EMT, and starting to differentiate, lose expression of HERVH-containing transcripts 324
(Fig. 6I). These results demonstrate the heterogenous patterns of TE expression in single cells in cultures 325
of hPSCs. 326
327
Discussion 328
TEs constitute the majority of the DNA sequence of the human genome (Hutchins and Pei 2015). They 329
have long been considered as a part, involved in regulatory functions, and providing an evolutionary 330
substrate (Bourque et al. 2018). However, TEs ultimately need to be suppressed and controlled due to 331
their implications in human disease and effects on genome stability (Beck et al. 2010; Hutchins and Pei 332
2015; Tam et al. 2019). Here, using a combination of short and long read RNA-seq data, we show that 333
TEs are an integral part of the hPSC-transcriptome, contributing to 26% of coding transcripts and 65% 334
of non-coding lncRNA transcripts. The insertion of TEs into coding transcripts is poorly tolerated in the 335
5’UTR and CDS, but are tolerated or even beneficial for expression inside the 3’UTR. For non-coding 336
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
14
transcripts the effect of TEs is uniformly deleterious, and the higher the number of TEs, the lower the 337
level of gene expression. 338
We found potential coding sequences in our catalog of transcripts; particularly peptides derived 339
from TE insertions. Our estimates of coding transcripts containing a TE suggests that around 20% of 340
these transcripts are capable of generating peptides, at least transiently, although most peptides are not 341
derived directly from the TE sequence. As lincRNAs have been observed to contain potential CDSs 342
(Cabili et al. 2011), it is likely that the non-peptide producing transcripts retain coding signatures but 343
not coding potential. Ultimately, the distinction between coding and non-coding transcripts remains 344
fuzzy, with a large number of genes with unclear designation (Abascal et al. 2018). It is also possible 345
that non-coding transcripts can become coding in a specific cell type. Considering the increasing interest 346
in the biological functions of alternate proteins and small peptides (sCDSs) in other biological contexts 347
(Makarewich and Olson 2017), it will be important to explore how both sCDSs and TE-derived peptides 348
are contributing to hPSCs. One class of TE-containing coding transcript that we could detect peptides 349
for were HERVH-derived viral proteins. Viral proteins have been observed in some cell types, for 350
example HERVW env proteins have been found in placental cells, and are thought to help mediate 351
trophoblast invasion into uterine tissues (Frendo et al. 2003; Chuong et al. 2013). hPSCs express 352
HERVH and HERVEs containing RNAs, but we did not observe any peptides. Only HERVK ERVs 353
showed evidence of gag, pol and env translation. 354
Our catalog of transcripts from hPSCs demonstrates how the complement of TEs is integrated 355
into coding and non-coding transcriptome. Interestingly, the splicing of TEs into pluripotency transcripts 356
has been seen in cancer cells. For example, an AluJb is spliced into the pluripotency gene LIN28B and 357
MLT1J into SALL4 (Yu et al. 2007; Yang et al. 2010; Jang et al. 2019). The splicing of these TEs is 358
thought to convert the pluripotent gene into an oncogene (Jang et al. 2019). Intriguingly, in our hPSC-359
specific dataset we did not observe these specific TE-transcript splicing patterns (Supplemental Fig. 360
S9A-C), suggesting that these are cancer-specific effects, and implying that there is a normal set of TEs 361
that are spliced into transcripts (Fort et al. 2014), and a set that is associated with disease. Indeed, hPSC-362
specific HERVs were not associated with the expression of pluripotency-related transcripts in human 363
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
15
cancers (Zapatka et al. 2020), again suggesting that TE splicing has a normal and diseased pattern. 364
Related to this, the observation of proteins derived from HERVK is noteworthy considering their link 365
with various cancers (Subramanian et al. 2011). The application of long-read sequencing to human 366
diseases, and specifically cancers will be important to shed light on the influence of TEs on transcript 367
splicing in human disease. Overall, our study reveals how the transcribed TE complement of the cell 368
contributes to both coding and non-coding transcriptome of hPSCs, and provides a model of the normal 369
TE transcriptome for comparison to diseased TE expression. 370
371
Competing Interests 372
The authors declare no competing interests 373
374
Authors’ contributions 375
A.P.H conceived and funded the project, I.A.B., Y.H.L. and A.P.H. performed the bioinformatic 376
analysis. C.M.C., X.L.F., M.G., S.L. and M.D. performed the long-read RNA-seq and validation. 377
B.P.D., Z.L., M.M.A., C.W. performed the scRNA-seq. A.P.H. and I.A.B. wrote the manuscript with 378
assistance from R.J, M.A.E, J.F, J.B.C and C.J. All authors read and approved the manuscript. 379
380
Acknowledgements 381
This work was supported by the National Natural Science Foundation of China (31970589, 382
31850410463 to A.P.H., 31801217 to Z.Q., 31850410486 to I.A.B.), the Science and Technology 383
Planning Project of Guangdong Province (2019A050510004 to A.P.H. and R.J.), the National Key 384
Research and Development Program of China (2016YFA0100102 and 2018YFA0106903 to M.A.E.), 385
and the Shenzhen Peacock plan (201701090668B to A.P.H.). Supported by Center for Computational 386
Science and Engineering of Southern University of Science and Technology. 387
388
Accessions 389
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
16
The long-read RNA-seq was deposited in the Sequence Read Archive (SRA) with the accession number: 390
PRJNA631047, and the scRNA-seq data with the accession number: PRJNA631808. 391
392
References 393
Abascal F, Juan D, Jungreis I, Kellis M, Martinez L, Rigau M, Rodriguez JM, Vazquez J, Tress ML. 394 2018. Loose ends: almost one in five human genes still have unresolved coding status. Nucleic 395 Acids Res 46: 7070-7084. 396
Babarinde IA, Li Y, Hutchins AP. 2019. Computational Methods for Mapping, Assembly and 397 Quantification for Coding and Non-coding Transcripts. Comput Struct Biotechnol J 17: 628-637. 398
Beck CR, Collier P, Macfarlane C, Malig M, Kidd JM, Eichler EE, Badge RM, Moran JV. 2010. LINE-1 399 retrotransposition activity in human genomes. Cell 141: 1159-1170. 400
Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, Imbeault M, Izsvak Z, Levin 401 HL, Macfarlan TS et al. 2018. Ten things you should know about transposable elements. 402 Genome Biol 19: 199. 403
Bulut-Karslioglu A, Macrae TA, Oses-Prieto JA, Covarrubias S, Percharde M, Ku G, Diaz A, McManus 404 MT, Burlingame AL, Ramalho-Santos M. 2018. The Transcriptionally Permissive Chromatin 405 State of Embryonic Stem Cells Is Acutely Tuned to Translational Output. Cell Stem Cell 22: 406 369-383 e368. 407
Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. 2011. Integrative annotation 408 of human large intergenic noncoding RNAs reveals global properties and specific subclasses. 409 Genes Dev 25: 1915-1927. 410
Carlevaro-Fita J, Lanzos A, Feuerbach L, Hong C, Mas-Ponte D, Pedersen JS, Drivers P, Functional 411 Interpretation G, Johnson R, Consortium P. 2020. Cancer LncRNA Census reveals evidence 412 for deep functional conservation of long noncoding RNAs in tumorigenesis. Commun Biol 3: 56. 413
Chen D, Sun N, Hou L, Kim R, Faith J, Aslanyan M, Tao Y, Zheng Y, Fu J, Liu W et al. 2019. Human 414 Primordial Germ Cells Are Specified from Lineage-Primed Progenitors. Cell Rep 29: 4568-4582 415 e4565. 416
Chishima T, Iwakiri J, Hamada M. 2018. Identification of Transposable Elements Contributing to Tissue-417 Specific Expression of Long Non-Coding RNAs. Genes (Basel) 9. 418
Chuong EB, Rumi MA, Soares MJ, Baker JC. 2013. Endogenous retroviruses function as species-419 specific enhancer elements in the placenta. Nat Genet 45: 325-329. 420
Clayton EA, Rishishwar L, Huang TC, Gulati S, Ban D, McDonald JF, Jordan IK. 2020. An atlas of 421 transposable element-derived alternative splicing in cancer. Philos Trans R Soc Lond B Biol Sci 422 375: 20190342. 423
Consortium F the RP Clst Forrest AR Kawaji H Rehli M Baillie JK de Hoon MJ Haberle V Lassmann T 424 et al. 2014. A promoter-level mammalian expression atlas. Nature 507: 462-470. 425
Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, 426 Knowles DG et al. 2012. The GENCODE v7 catalog of human long noncoding RNAs: analysis 427 of their gene structure, evolution, and expression. Genome Res 22: 1775-1789. 428
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
17
Dewannieux M, Harper F, Richaud A, Letzelter C, Ribet D, Pierron G, Heidmann T. 2006. Identification 429 of an infectious progenitor for the multiple-copy HERV-K human endogenous retroelements. 430 Genome Res 16: 1548-1556. 431
Faulkner GJ, Kimura Y, Daub CO, Wani S, Plessy C, Irvine KM, Schroder K, Cloonan N, Steptoe AL, 432 Lassmann T et al. 2009. The regulated retrotransposon transcriptome of mammalian cells. Nat 433 Genet 41: 563-571. 434
Feng S, Jacobsen SE, Reik W. 2010. Epigenetic reprogramming in plant and animal development. 435 Science 330: 622-627. 436
Fort A, Hashimoto K, Yamada D, Salimullah M, Keya CA, Saxena A, Bonetti A, Voineagu I, Bertin N, 437 Kratz A et al. 2014. Deep transcriptome profiling of mammalian stem cells supports a regulatory 438 role for retrotransposons in pluripotency maintenance. Nat Genet 46: 558-566. 439
Frendo JL, Olivier D, Cheynet V, Blond JL, Bouton O, Vidaud M, Rabreau M, Evain-Brion D, Mallet F. 440 2003. Direct involvement of HERV-W Env glycoprotein in human trophoblast cell fusion and 441 differentiation. Mol Cell Biol 23: 3566-3574. 442
Fuchs NV, Loewer S, Daley GQ, Izsvak Z, Lower J, Lower R. 2013. Human endogenous retrovirus K 443 (HML-2) RNA and protein expression is a marker for human embryonic and induced pluripotent 444 stem cells. Retrovirology 10: 115. 445
Goff LA, Rinn JL. 2015. Linking RNA biology to lncRNAs. Genome Res 25: 1456-1465. 446
Goke J, Lu X, Chan YS, Ng HH, Ly LH, Sachs F, Szczerbinska I. 2015. Dynamic transcription of distinct 447 classes of endogenous retroviral elements marks specific populations of early human 448 embryonic cells. Cell Stem Cell 16: 135-141. 449
Grow EJ, Flynn RA, Chavez SL, Bayless NL, Wossidlo M, Wesche DJ, Martin L, Ware CB, Blish CA, 450 Chang HY et al. 2015. Intrinsic retroviral reactivation in human preimplantation embryos and 451 pluripotent cells. Nature 522: 221-225. 452
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. 2016. The Dfam 453 database of repetitive DNA families. Nucleic Acids Res 44: D81-89. 454
Hutchins AP, Pei D. 2015. Transposable elements at the center of the crossroads between 455 embryogenesis, embryonic stem cells, reprogramming, and long non-coding RNAs. Sci Bull 60: 456 1722–1733. 457
Hutchins AP, Yang Z, Li Y, He F, Fu X, Wang X, Li D, Liu K, He J, Wang Y et al. 2017. Models of global 458 gene expression define major domains of cell type and tissue identity. Nucleic Acids Res 45: 459 2354-2367. 460
Jang HS, Shah NM, Du AY, Dailey ZZ, Pehrsson EC, Godoy PM, Zhang D, Li D, Xing X, Kim S et al. 461 2019. Transposable elements drive widespread expression of oncogenes in human cancers. 462 Nat Genet 51: 611-617. 463
Johnson R, Guigo R. 2014. The RIDL hypothesis: transposable elements as functional domains of long 464 noncoding RNAs. RNA 20: 959-976. 465
Jonsson ME, Ludvik Brattas P, Gustafsson C, Petri R, Yudovich D, Pircs K, Verschuere S, Madsen S, 466 Hansson J, Larsson J et al. 2019. Activation of neuronal genes via LINE-1 elements upon global 467 DNA demethylation in human neural progenitors. Nat Commun 10: 3182. 468
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
18
Jungreis I, Tress ML, Mudge J, Sisu C, Hunt T, Johnson R, Uszczynska-Ratajczak B, Lagarde J, Wright 469 J, Muir P et al. 2018. Nearly all new protein-coding predictions in the CHESS database are not 470 protein-coding. bioRxiv doi:10.1101/360602: 360602. 471
Jurka J, Kapitonov VV, Kohany O, Jurka MV. 2007. Repetitive sequences in complex genomes: 472 structure and evolution. Annu Rev Genomics Hum Genet 8: 241-259. 473
Kammerer U, Germeyer A, Stengel S, Kapp M, Denner J. 2011. Human endogenous retrovirus K 474 (HERV-K) is expressed in villous and extravillous cytotrophoblast cells of the human placenta. 475 J Reprod Immunol 91: 1-8. 476
Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C. 2013. 477 Transposable elements are major contributors to the origin, diversification, and regulation of 478 vertebrate long noncoding RNAs. PLoS Genet 9: e1003470. 479
Kelley D, Rinn J. 2012. Transposable elements reveal a stem cell-specific class of long noncoding 480 RNAs. Genome Biol 13: R107. 481
Kilpinen H, Goncalves A, Leha A, Afzal V, Alasoo K, Ashford S, Bala S, Bensaddek D, Casale FP, Culley 482 OJ et al. 2017. Common genetic variation drives molecular heterogeneity in human iPSCs. 483 Nature 546: 370-375. 484
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. 2019. Graph-based genome alignment and 485 genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37: 907-915. 486
Kim S, Pevzner PA. 2014. MS-GF+ makes progress towards a universal database search tool for 487 proteomics. Nat Commun 5: 5277. 488
Kunarso G, Chia NY, Jeyakani J, Hwang C, Lu X, Chan YS, Ng HH, Bourque G. 2010. Transposable 489 elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet 490 42: 631-634. 491
Lagarde J, Uszczynska-Ratajczak B, Carbonell S, Perez-Lluch S, Abad A, Davis C, Gingeras TR, 492 Frankish A, Harrow J, Guigo R et al. 2017. High-throughput annotation of full-length long 493 noncoding RNAs with capture long-read sequencing. Nat Genet 49: 1731-1740. 494
Lau KX, Mason EA, Kie J, De Souza DP, Kloehn J, Tull D, McConville MJ, Keniry A, Beck T, Blewitt ME 495 et al. 2020. Unique properties of a subset of human pluripotent stem cells with high capacity for 496 self-renewal. Nat Commun 11: 2420. 497
Lev-Maor G, Ram O, Kim E, Sela N, Goren A, Levanon EY, Ast G. 2008. Intronic Alus influence 498 alternative splicing. PLoS Genet 4: e1000204. 499
Li Q, Hutchins AP, Chen Y, Li S, Shan Y, Liao B, Zheng D, Shi X, Li Y, Chan WY et al. 2017. A sequential 500 EMT-MET mechanism drives the differentiation of human embryonic stem cells towards 501 hepatocytes. Nat Commun 8: 15166. 502
Lu JY, Shao W, Chang L, Yin Y, Li T, Zhang H, Hong Y, Percharde M, Guo L, Wu Z et al. 2020. Genomic 503 Repeats Categorize Genes with Distinct Functions for Orchestrated Regulation. Cell Rep 30: 504 3296-3311 e3295. 505
Ma L, Cao J, Liu L, Du Q, Li Z, Zou D, Bajic VB, Zhang Z. 2019. LncBook: a curated knowledgebase of 506 human long non-coding RNAs. Nucleic Acids Res 47: D128-D134. 507
Macfarlan TS, Gifford WD, Driscoll S, Lettieri K, Rowe HM, Bonanomi D, Firth A, Singer O, Trono D, 508 Pfaff SL. 2012. Embryonic stem cell potency fluctuates with endogenous retrovirus activity. 509 Nature 487: 57-63. 510
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
19
Makarewich CA, Olson EN. 2017. Mining for Micropeptides. Trends Cell Biol 27: 685-696. 511
Morillon A, Gautheret D. 2019. Bridging the gap between reference and real transcriptomes. Genome 512 Biol 20: 112. 513
Naville M, Warren IA, Haftek-Terreau Z, Chalopin D, Brunet F, Levin P, Galiana D, Volff JN. 2016. Not 514 so bad after all: retroviruses and long terminal repeat retrotransposons as a source of new 515 genes in vertebrates. Clin Microbiol Infect 22: 312-323. 516
Nguyen QH, Lukowski SW, Chiu HS, Senabouth A, Bruxner TJC, Christ AN, Palpant NJ, Powell JE. 517 2018. Single-cell RNA-seq of human induced pluripotent stem cells reveals cellular 518 heterogeneity and cell state transitions between subpopulations. Genome Res 28: 1053-1066. 519
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. 2015. StringTie enables 520 improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33: 290-295. 521
Santoni FA, Guerra J, Luban J. 2012. HERV-H RNA is abundant in human embryonic stem cells and a 522 precise marker for pluripotency. Retrovirology 9: 111. 523
Schumann GG, Fuchs NV, Tristan-Ramos P, Sebe A, Ivics Z, Heras SR. 2019. The impact of 524 transposable element activity on therapeutically relevant human stem cells. Mob DNA 10: 9. 525
Steijger T, Abril JF, Engstrom PG, Kokocinski F, Consortium R, Hubbard TJ, Guigo R, Harrow J, Bertone 526 P. 2013. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10: 1177-527 1184. 528
Subramanian RP, Wildschutte JH, Russo C, Coffin JM. 2011. Identification, characterization, and 529 comparative genomic distribution of the HERV-K (HML-2) group of human endogenous 530 retroviruses. Retrovirology 8: 90. 531
Tam OH, Ostrow LW, Gale Hammell M. 2019. Diseases of the nERVous system: retrotransposon 532 activity in neurodegenerative disease. Mob DNA 10: 32. 533
Theunissen TW, Friedli M, He Y, Planet E, O'Neil RC, Markoulaki S, Pontis J, Wang H, Iouranova A, 534 Imbeault M et al. 2016. Molecular Criteria for Defining the Naive Human Pluripotent State. Cell 535 Stem Cell 19: 502-515. 536
Toker L, Feng M, Pavlidis P. 2016. Whose sample is it anyway? Widespread misannotation of samples 537 in transcriptomics studies. F1000Res 5: 2103. 538
Wang J, Xie G, Singh M, Ghanbarian AT, Rasko T, Szvetnik A, Cai H, Besser D, Prigione A, Fuchs NV 539 et al. 2014. Primate-specific endogenous retrovirus-driven transcription defines naive-like stem 540 cells. Nature 516: 405-409. 541
Wang X, You X, Langer JD, Hou J, Rupprecht F, Vlatkovic I, Quedenau C, Tushev G, Epstein I, 542 Schaefke B et al. 2019. Full-length transcriptome reconstruction reveals a large diversity of RNA 543 and protein isoforms in rat hippocampus. Nat Commun 10: 5009. 544
Wapinski O, Chang HY. 2011. Long noncoding RNAs and human disease. Trends Cell Biol 21: 354-545 361. 546
Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29: 547 2487-2489. 548
Wu TD, Watanabe CK. 2005. GMAP: a genomic mapping and alignment program for mRNA and EST 549 sequences. Bioinformatics 21: 1859-1875. 550
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
20
Wucher V, Legeai F, Hedan B, Rizk G, Lagoutte L, Leeb T, Jagannathan V, Cadieu E, David A, Lohi H 551 et al. 2017. FEELnc: a tool for long non-coding RNA annotation and its application to the dog 552 transcriptome. Nucleic Acids Res 45: e57. 553
Yang J, Gao C, Chai L, Ma Y. 2010. A novel SALL4/OCT4 transcriptional feedback network for 554 pluripotency of embryonic stem cells. PLoS One 5: e10766. 555
You BH, Yoon SH, Nam JW. 2017. High-confidence coding and noncoding transcriptome maps. 556 Genome Res 27: 1050-1062. 557
Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S, Nie J, Jonsdottir GA, 558 Ruotti V, Stewart R et al. 2007. Induced pluripotent stem cell lines derived from human somatic 559 cells. Science 318: 1917-1920. 560
Zapatka M, Borozan I, Brewer DS, Iskar M, Grundhoff A, Alawi M, Desai N, Sultmann H, Moch H, 561 Pathogens P et al. 2020. The landscape of viral associations in human cancers. Nat Genet 52: 562 320-330. 563
Zhou T, Benda C, Duzinger S, Huang Y, Li X, Li Y, Guo X, Cao G, Chen S, Hao L et al. 2011. Generation 564 of induced pluripotent stem cells from urine. J Am Soc Nephrol 22: 1221-1228. 565
566
567
568
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
60
A CB
0
20
40
60
Num
ber (
×000
)
Mat
chin
g
Varia
nt
Nov
el
log2
(TPM
)
-5
0
5
10
15
LRSR LR+SR
Num
ber (
×000
)
0
20
40
% transcripts
RNA-seq DeepCAGE
GENCODE
All
Matching
Variant
Novel
Coding Noncoding
20% 40%0% 60% 80% 100%
log2
(nor
mal
ized
cou
nt)
Cod
ing
Non
codi
ng
Cod
ing
Non
codi
ng
Mat
chin
g
Varia
nt
Nov
el
Mat
chin
g
Varia
nt
Nov
el
D E F
Norm
alize
d RNA
-seq
coun
t
Norm
alize
d dee
pCAG
E co
unt
2.0Kb 2.0KbTTSTSS
20
40
6060
40
20
0
deepCAGERNA-seq
-5
0
5
10
20
15
RNA-seq DeepCAGE
log2
(TPM
)
-5
0
5
10
15
log2
(nor
mal
ized
cou
nt)
-5
0
5
10
20
15
72794(71.7%)
28698(28.3%)
60605(81.2%)
11725(60.0%)
464(6.4%)
6806(93.6%)
7851(40.0%)
14041(18.8%)
99883(69.6%)
43689(30.4%)
Figure 1
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
21
Figure 1 – Combining short-read RNA-seq and long read RNA-seq to assemble a high-quality 569
hPSC transcriptome. 570
(A) Pileup of deepCAGE and RNA-seq reads across the lengths of the transcripts from TSS to 571
TTS and the flanking 2 kb regions. 572
(B) Number of transcripts (thousands) supported by short-read (SR), long read (LR) only, or both 573
(SR+LR). 574
(C) Number of transcripts defined as matching (internal exon structure matches exactly to a 575
GENCODE transcript), variant (shares any exon or overlapping exon segment with a 576
GENCODE transcript) or novel (does not share any exon base pair with a GENCODE 577
transcript). 578
(D) Violin plots showing RNA-seq log2 transcripts per million (TPM) and deepCAGE log2 579
normalized tag count expression for all unambiguous TSSs in the hPSC assembled 580
transcriptome. 581
(E) Percentage of coding and non-coding transcripts by transcript class. 582
(F) Expression level of coding and non-coding transcripts, for short-read RNA-seq (left violins) 583
or deepCAGE data (right violins). 584
585
586
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
hPSC Hapmap tissues
Enric
hed
Non
spec
ific
Dep
lete
d
SALL4
NANOGZIC3
DNMT3B
GDAP1
SPATA20
SEC61G
PTPRS
WBP1PDK4
TMEM176B
APOL3
log2 (TPM)
-2 -1 0 1 2
Den
sity
0.0
0.1
0.2
0.3
0.4
0.5
All
Noncoding
Z-score in the bodymap expression-4 -2 0 2 4
Enriched
Nonspecific
Depleted
0% 20% 40% 60% 80% 100%
% transcripts
log2
(TPM
)
Coding Noncoding
A B
C
-5
0
5
10 hPSC
Adip
ose
Adre
nal g
land
Brai
nBr
east
Col
onH
eart
Kidn
eyLi
ver
Lung
Lym
ph n
ode
Ova
ryPr
ostra
teSk
elet
al m
uscl
eTe
stis
Thyr
oid
Whi
te b
lood
cel
ls
Enric
hed
Unb
iase
d
Dep
lete
d
Enric
hed
Unb
iase
d
Dep
lete
d
-5
0
5
10
15
20RNA-seq DeepCAGE
Coding Noncoding
Coding
D
POU5F1CLDN6
ZFP42
LIN28A
LIN28B
NUCB1POLD1
NELFE
DRAP1
ATG101
CD14LAPTM5
KLF2
VSIRACSL1NFICAPOL6
3 4
ZIC2
log2
(nor
mal
ized
cou
nt)
Nonspecific62,913 (62.0%)
Depleted10,984 (10.8%)
Enriched27,595 (27.2%)
15
7986 (72.7%) 2998 (27.3%)
18030 (28.7%)44883 (71.3%)
19925 (72.2%) 7670 (27.8%)
Figure 2
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
22
Figure 2 – Transcript dynamics of the hPSC transcriptome. 587
(A) Z-score of all, coding, and non-coding transcripts against the Human BodyMap 2.0 dataset 588
(Derrien et al. 2012). 589
(B) Heatmap showing expression of selected hPSC-enriched, -nonspecific and -depleted transcripts 590
in hPSCs and the BodyMap samples. 591
(C) Percent of the coding and non-coding transcripts in the indicated hPSC expression categories. 592
(D) Violin plots showing the RNA-seq expression levels (left panel; TPM: transcripts per million) 593
or deepCAGE data (right panel; in normalized read counts) for the indicated expression classes 594
for coding and non-coding transcripts. 595
596
597
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
− 5 0 5
3' UTRCDS
5' UTRno TE
*4.0e-021.1e-01*9.6e-15
SINE:Alu:AluSx
− 5 0 5
3' UTRCDS
5' UTRno TE
*3.9e-026.6e-027.7e-02
DNA:hAT-Charlie:MER117
− 5 0 5
3' UTRCDS
5' UTRno TE
*4.6e-021.6e-01*3.3e-09
SINE:Alu:AluJr
Log2 (Transcripts per million)
p-value
− 5 0 5
3' UTRCDS
5' UTRno TE
*2.2e-059.0e-02*2.6e-04
LINE:L1:L1M2_orf2− 5 0 5
3' UTRCDS
5' UTRno TE
*3.7e-02*1.8e-03*1.5e-05
LINE:L1:L1HS_5end− 5 0 5
3' UTRCDS
5' UTRno TE
*1.8e-022.7e-01*3.8e-05
LTR:ERV1:HERVH
Log2 (Transcripts per million)
p-value
5'UTR CDS 3'UTR 5'UTR CDS 3'UTR
TEfre
quen
cyno
rmalize
dto
numbe
roftranscripts
Protein coding RNAsAll transcripts Variant-only
0.0
0.1
0.2
0.3
0.0
0.1
0.2
0.3
LINE:L1
LINE:L2
0.000
0.005
0.010
0.015
LTR:ERV1
SINE:Alu
0.00
0.01
0.02
0.03
TEfre
quen
cyno
rmalize
dto
numbe
roftranscripts
5'UTR CDS 3'UTR 5'UTR CDS 3'UTR
0.00
0.05
0.10
0.15
0.00
0.02
0.04
0.06
GENCODEhPSC enrichedhPSC nonspecifichPSC depleted
Log2 (Transcripts per million)Significantly upSignificantly down
Significantly upSignificantly down
Significantly upSignificantly down
p-value
− 2 0 2
TEno TE
*9.2e-74
Protein coding
Non-coding
− 5 0 5
TEno TE
1.7e-01
Protein coding
− 5 0 5
3' UTRCDS
5' UTRno TE
*4.1e-02*2.9e-30*8.9e-45
Log2 (Transcripts per million)
p-value
0% 50% 100%
GENCODE
Matching
Variant
% TranscriptsContains TENo TE
5458 (45%)
13332 (22%)
17825 (21%)
6694 (55%)
46899 (78%)
66207 (79%)
Contains TENo TE
0% 50% 100%
hPSC enriched
hPSC nonspecific
hPSC depleted
2453 (21%) 9408 (79%)
14594 (26%) 41018 (74%)
1882 (35%) 3439 (65%)
% Transcripts
hPSC enrichedhPSC nonspecifichPSC depleted
Matching
Variant
No TE
TE
No TE
TE
0% 50% 100%
1269 (21%) 3361 (62%) 828 (15%)
2371 (35%) 3715 (55%) 608 (9%)
2815 (21%) 8512 (64%) 2005 (15%)
13284 (28%) 29094 (62%) 4521 (10%)
% Transcripts
LTR:ERV1:LTR7
LTR:ERV1:HERVH
0.0000
0.0025
0.0050
0.0075
LINE:L1:L1HS_5end
5'UTR CDS 3'UTR
LINE:L1:L1M2_orf2
SINE:Alu:AluJr
0.000
0.002
0.004SINE:Alu:FRAM
5'UTR CDS 3'UTR
TEfre
quen
cyno
rmalize
dto
numbe
roftranscripts
0.000
0.005
0.010
0.000
0.005
0.010
0.000
0.002
0.004
0.000
0.005
0.010
0.015
A
E
Figure 3
B
C
H
D
G
I
F
J K
5'UTR
CDS
3'UTR
LTR:GypsyDNA:hAT-BlackjackLINE:RTE-XLINE:CR1LINE:L2DNA:hAT-CharlieRetroposon:SVASINE:AluDNA:TcMar-TiggerLINE:L1LTR:ERVKDNA:TcMar-MarinerLTR:ERVL-MaLRLTR:ERVLSINE:MIRLTR:ERV1DNA:hAT-Tip100
-1 0 1Log2 (fold-change)
versus no TE
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
23
Figure 3 – TEs spliced into coding sequence genes remodel the expression dynamics of 598
pluripotent stem cells. 599
(A) Proportion of protein coding transcripts in the GENCODE database, of hPSC-expressing 600
transcripts matching (to GENCODE), or variant (share transcript structure with a known 601
transcript), and the proportions that contain at least 1 TE or no TEs. 602
(B) Proportion of coding transcripts that are enriched, nonspecific or depleted in hPSCs 603
(C) Proportions of matching or variant transcripts with or without a TE that are hPSC-enriched, -604
nonspecific or -depleted. 605
(D) Relative density of all TE insertions in all or variant transcripts’ 5’ UTR, CDS or 3’UTR. Each 606
plot is normalized to the length of the transcript region. The blue line indicates all GENCODE 607
transcripts, the red, green and yellow lines indicate hPSC-specific expression categories. 608
(E) Density plots (as in panel D) for LINE L1, L2, SINE Alu, and LTR ERV1 TEs. 609
(F) Density plots for the indicated LTR, LINE and SINE types. 610
(G) Box plot for the expression levels of protein coding and non-coding transcripts with or without 611
a TE in its sequence. p-value is from a Welch’s t-test versus no TE-containing transcripts. 612
(H) Box plot for the expression levels of protein coding transcripts with or without a TE in the 613
5’UTR, CDS or 3’UTR. p-value is from a Welch’s t-test versus no TE-containing transcripts. 614
(I) Heatmap showing the fold-change of transcripts containing one or more of the indicated 615
subtypes of TE versus no TE-containing transcripts. 616
(J) Box plot for the expression levels of the transcripts containing the indicated TEs in the 5’UTR, 617
CDS or 3’UTR. p-value is from a Welch’s t-test versus no TE-containing transcripts. 618
(K) As in panel J, but for example transcripts with TEs in the 3’UTR that lead to higher levels of 619
expression. p-value is from a Welch’s t-test versus no TE-containing transcripts. 620
621
622
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
Long-read novel coding 12
Novel CDS
Known CDS
TE insertion
In-frame insertion 0
Frame-shift insertion 63
New STOP 214
New ATG 86
Alternate CDS 1117
CDS disruptions
mRNA Number oftranscriptsin class
D
FoundNot found
Number of proteins with >20 unmaskedamino acids and at least one cleavage site
Frameshift insertion
Long-read novel coding
0 1000
new STOPnew ATG
Alternate CDS3/11 (27%)
181/1050 (17%)42/198 (21%)
17/84 (20%)12/58 (20%)
2 ormore unique peptides1 Unique peptide
0 50 100 150 200
Inside TE
Outside TE
15(15%)
70(70%)
32(32%)
145(145%)
Number of genes with unique peptides thatderive from inside or outside a TE sequence
Number of peptides derived from thesequences from the indicated TEs
0 10
DNA:TcMar-Tigger:Tigger1DNA:TcMar-Tigger:Tigger2
LINE:L1:L1M4c_5endLINE:L1:L1M5_orf2
LINE:L1:L1MC4_3endLINE:L1:L1MC5a_3endLINE:L1:L1ME4a_3end
LINE:L1:L1P1_orf2LTR:ERV1:ERV24B_Prim
LTR:ERV1:HERVFH21LTR:ERV1:LTR8
LTR:ERV1:PABL_B-intLTR:ERV1:PRIMA41-int
LTR:ERVK:HERVKLTR:ERVK:HERVK11LTR:ERVK:HERVK13
LTR:ERVK:HERVK14CLTR:ERVK:LTR5
LTR:ERVL-MaLR:MLT1A1LTR:ERVL-MaLR:MLT1J-intLTR:ERVL-MaLR:THE1-intLTR:ERVL:ERV3-16A3_I
LTR:ERVL:ERV3-16A3_LTRLTR:ERVL:LTR16A
Retroposon:SVA:SVA_ARetroposon:SVA:SVA_C
SINE:Alu:AluJr4SINE:Alu:AluScSINE:Alu:AluSp
SINE:Alu:AluSq2SINE:Alu:AluSx1SINE:Alu:AluSx3SINE:Alu:FLAM_A
SINE:MIR:MIRSINE:MIR:MIRc
rRNA:.:LSU-rRNA_Hsa
3321111111122139311132111421311111311
A
Figure 4
B
C E
01975 3847
8718
Coding Seq
DNA/SVAs
LTRs
TEsSINE
s
LINEs
LTR5BHERVKLTR5
LTR5B HERVK HERVK11 HERVKLTR5B
MER11B
MER11A
ENST00000626825.1(+) HPSCSR.312557.3PCAT14
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
24
Figure 4 – TEs disrupt coding sequences and in limited instances code for peptides 623
(A) Schematic of the type of TE insertions into the coding sequences, and the number of transcripts 624
in each class. Grey regions indicate novel CDS caused by the TE insertion, black indicates the 625
known CDS, dotted lines indicate where the CDS would be if the TE was not present, and red 626
indicates the TE insertion. Transcripts were assigned to each category from top to bottom by 627
classification. 628
(B) Number of novel CDS peptide fragments that can be detected in the HipSci mass spectrometry 629
dataset. A protein was considered found if at least 1 unique peptide mapped to it. 630
(C) Number and percentage of transcripts with a unique peptide derived from inside or outside the 631
TE sequence. 632
(D) Numbers of unique peptides derived from each type of TE. 633
(E) Example domain map for PCAT14, a variant transcript enriched in hPSCs. The location of TEs 634
in the LINE, SINE LTR or DNA/SVAs is indicated on the horizontal lines. The coding sequence 635
is indicated using the thick black bar. The red vertical bars indicate peptides that map inside a 636
TE, the green vertical bar indicates a peptide mapping outside a TE. 637
638
639
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
− 5 0 5
109876543210
*1.3e-276.4e-014.3e-011.1e-012.9e-014.9e-01*2.6e-021.9e-01*2.3e-032.2e-01
Log2 (Transcripts per million)
Protein coding p-value
Log2 (Transcripts per million)
lncRNAs
− 5 0
109876543210
*3.5e-171*2.1e-09*3.9e-11*2.1e-10*4.6e-18*2.7e-15*3.9e-15*3.6e-18*4.8e-07*2.7e-02
p-value
NumberofTEs
intranscriptG
TSS TTS
All lncRNAs0.6
0.2
LINE:L1:L1HS_5end0.02
0.00
LINE:L1:L1PA3_3end
0.003
0.000
LTR:ERV1:HERV-Fc20.006
0.000
LTR:ERV1:HERVH0.1
0.0
LTR:ERV1:LTR7
0.04
0.00
LTR:ERV1:LTR7Y0.04
0.00TSS TTS
TEdensity(normalizedtototalnum
beroftranscripts)
LINE:L1
0.1
0.05
LINE:L20.03
0.00TE
density(normalizedtototalnum
beroftranscripts)
LTR:ERV10.2
0.1
LTR:ERVK
0.002
0.0010
TSS TTS
GENCODEhPSC enrichedhPSC nonspecifichPSC depleted
GENCODEhPSC enrichedhPSC nonspecifichPSC depleted
GENCODEhPSC enrichedhPSC nonspecifichPSC depleted
A
B
Figure 5
C
F
D
E
Matching
Variant
No TE
TE
No TE
TE
NovelNo TE
TE
1713 (22%) 5297 (69%) 667 (9%)
379 (27%) 829 (59%) 208 (15%)
1411 (22%) 4172 (66%) 764 (12%)
1977 (26%) 4795 (62%) 957 (12%)
270 (26%) 667 (65%) 94 (9%)
1916 (43%) 2263 (50%) 306 (7%)
0% 50% 100%% Transcripts
hPSC enrichedhPSC nonspecifichPSC depleted
0% 50% 100%% Transcripts
GENCODE
Matching
Variant
Novel
Contains TENo TE
4485 (81%)
7729 (85%)
6347 (45%)
37667 (50%)
1031 (19%)
1416 (15%)
7677 (55%)
37520 (50%)
−2.5 0.0 2.5
SINE:Alu:AluSgSINE:Alu:AluSpSINE:Alu:AluYSINE:Alu:FRAM
LINE:L2:L2LINE:L1:L1M2_orf2LINE:L1:L1M5_orf2
LINE:L1:L1ME1_3endLINE:L1:L1HS_5endLINE:L1:L1M1_5endLINE:L1:L1M3_orf2
LINE:L1:L1MC4_5endLINE:L1:L1P1_orf2LINE:L1:L1P4_orf2
LINE:L1:L1PA3_3endLINE:L1:L1PA4_3end
LTR:ERVL-MaLR:MLT1BLTR:ERV1:LTR7
LTR:ERV1:LTR7BLTR:ERV1:LTR7YLTR:ERV1:LTR8
LTR:ERV1:LTR12CLTR:ERV1:LTR12ELTR:ERV1:HERVH
LTR:ERV1:HERVFH21LTR:ERV1:HERVH48LTR:ERV1:HERV-Fc2LTR:ERV1:HERVS71LTR:ERV1:MER50-intLTR:ERV1:HUERS-P3b
LTR:ERV1:MER110LTR:ERVK:HERVK
LTR:ERVL-MaLR:MST-intLTR:ERVL-MaLR:THE1-int
No TE
*2.4e-48*1.2e-59*2.2e-72*2.9e-87*8.0e-64*2.2e-15*1.7e-81*3.3e-26*4.4e-26*7.2e-08*2.5e-15*4.1e-51*1.9e-23*2.0e-13*4.6e-15*3.7e-19*3.0e-301.6e-01*1.4e-05*2.7e-03*9.5e-07*2.1e-09*2.6e-064.4e-01*1.3e-027.7e-019.0e-016.7e-012.7e-012.8e-016.8e-024.9e-01*1.7e-07*2.0e-06
Log2 (Transcripts per million)Significantly upSignificantly down
Significantly upSignificantly down
q-value
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
25
Figure 5 – Widespread type-specific splicing of TEs into hPSC-specific lncRNAs. 640
(A) Frequency of TE insertions in GENCODE, and hPSC-matching, variant and novel lncRNAs. 641
(B) Frequency of hPSC expression classes for matching, novel and variant lncRNAs, including or 642
excluding a TE. 643
(C) Frequency of TEs across the normalized length of all lncRNAs. TSS=Transcription start site; 644
TTS=Transcription termination site. Frequency is normalized to the total number of transcripts 645
in each expression class. 646
(D) Examples frequency showing the insertion density frequency for LINE L1, L2 and ERV1 and 647
ERVK TEs. 648
(E) As in panel C and D, except showing selected TE types. 649
(F) Box plots showing the expression of lncRNAs containing the indicated TE types, versus all 650
transcripts with no TE. q-value is from a Welch’s t-test with Bonferroni-Hochberg multiple test 651
correction (for all families of TE). Red boxplots indicate significantly up, blue indicates 652
significantly down, grey no significant change. 653
(G) Box plots showing the effect of multiple TE insertions on non-coding and protein-coding 654
transcripts. p-value is from a Welch’s t-test versus transcripts containing zero TEs. 655
656
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
0 1
SINE:MIRSINE:Alu
Retroposon:SVALTR:ERVL-MaLR
LTR:ERVLLTR:ERVKLTR:ERV1LINE:L2LINE:L1
DNA:hAT-CharlieDNA:TcMar-Tigger
0 1 0 1 0 1 0 1
Cluster 0Pluripotency
Cluster 1Proliferation
Cluster 2Pluripotency
Cluster 3Differentiation
Cluster 4Quiescence
Cluster
GO:0017148 negative regulation of translationGO:0007369 gastrulationGO:0090068 positive regulation of cell cycle processGO:0045740 positive regulation of DNA replicationGO:0080154 regulation of fertilizationGO:1901992 positive regulation of mitotic cell cycle phase transitionGO:0001825 blastocyst formationGO:0001708 cell fate specificationGO:0040019 positive regulation of embryonic developmentGO:0050807 regulation of synapse organizationGO:0045669 positive regulation of osteoblast differentiationGO:0072073 kidney epithelium development
*** *
* *
***
* *** *
***
0 1 2 3 4
-Log10(P-value)2 4 6 8 10
HPSCSR.267787.92PIF1
0 3Normalizedexpression
HPSCSR.281890.135TOP2A
HPSCSR.221584.174MALAT1
HPSCSR.160725.1PODXL (GCTM2)
0
1
23
4
% Cells in clusters
LTR:ERV1:HERVHLTR:ERV1:HERVS71
SINE:AluLTR:ERVLLINE:L1
SINE:AluSINE:MIR
SINE:AluLTR:ERV1:HERVH
LINE:L1
LTR:ERV1:HERVH/LTR7SINE:AluLINE:L1
SINE:AluLINE:L1
Differentiation/EMT
Proliferation
Cluster 2(7.8%)
Cluster 1(17%)
Cluster 0(71%)
Cluster 4(0.5%)
Cluster 3(2.1%)
Cell cycle exitQuiescence
TEno TE
0 100
TEno TE
0 50
TEno TE
0 500
TEno TE
0 1000
TEno TE
0 10
Number of transcripts
EnrichedNonspecific
Depleted
0 100
1
EnrichedNonspecificDepleted
0 50
2
EnrichedNonspecificDepleted
0 5
0
EnrichedNonspecific
Depleted
0 500
3
EnrichedNonspecific
Depleted
0 500
4
Number of transcripts
Cluster
F
G
Number of TEs per number of transcripts in clusterNumber of TEs per number
of transcripts in cluster
0.00 0.10 0.20
H
HPSCSR.67426.5DPPA4
HPSCSR.210694.8UTF1
0 3Normalizedexpression
S G2/MG1
0
1
2
3
4
0% 50% 100%
Cluster
% of cells
94%
62%
89%
68%
75%
1%
3%
0%
2%
2%
5%
34%
11%
30%
23%
A B
Figure 6
D
C
I
E
0 1 2 3 4DNA:hAT-Charlie:MER112DNA:hAT-Charlie:MER58ADNA:hAT-Tip100:MER63CDNA:hAT-Tip100:MER63DLINE:L1:L1M5_orf2LINE:L1:L1MA8_3endLINE:L1:L1ME2z_3endLINE:L1:L1ME4a_3endLINE:L2:L2LINE:L2:L2c_3endLTR:ERV1:HERV-Fc2LTR:ERV1:HERVFH21LTR:ERV1:HERVHLTR:ERV1:HERVS71LTR:ERV1:LTR36LTR:ERV1:LTR38LTR:ERV1:LTR6BLTR:ERV1:LTR7LTR:ERV1:LTR7BLTR:ERV1:LTR7YLTR:ERV1:MER51BLTR:ERVL:MER21ALTR:ERVL:MER21BLTR:ERVL:MER21CRetroposon:SVA:SVA_FSINE:Alu:AluJbSINE:Alu:AluJrSINE:Alu:AluJr4SINE:Alu:AluSc8SINE:Alu:AluSgSINE:Alu:AluSpSINE:Alu:AluSq2SINE:Alu:AluSxSINE:Alu:AluSx1SINE:Alu:AluSzSINE:Alu:AluYSINE:Alu:AluYcSINE:Alu:AluYk3SINE:Alu:FLAM_ASINE:Alu:FLAM_CSINE:Alu:FRAMSINE:Alu:PB1D11SINE:MIR:MIRSINE:MIR:MIRb
Cluster
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint
26
Figure 6 – Single cell expression reveals TE containing transcripts are specifically expressed in 657
subpopulations within hPSC cultures 658
(A) UMAP projection and clustering of scRNA-seq of hPSC data. Clustered using Leiden 659
(resolution=1.0) and clusters showing no differentially regulated genes were iteratively merged. 660
(B) Gene ontology analysis of the genes significantly associated with the indicated clusters (See 661
Supplemental Table S6 for the list of transcripts). 662
(C) UMAP plots colored by the expression level of two pluripotency genes, UTF1 and DPPA4. 663
(D) Estimates of cell cycle phase of the indicated UMAP clusters based on enrichment of cell cycle 664
genes. 665
(E) UMAP plot colored by expression of the G2/M-associated genes PIF1, TOP2A, MALAT1 and 666
PODXL. 667
(F) Number of transcripts in each cluster that is enriched, nonspecific or depleted in hPSCs (Left 668
bar charts), and number of transcripts containing a TE (right bar charts). 669
(G) Number of TEs types per number of transcripts in the indicated clusters. 670
(H) Heatmap indicating the number of TE families per number of transcripts in the indicated clusters 671
(I) Schematic indicating the relationship between the hPSC cell sub-populations. The % of cells in 672
each cluster is indicated, and the major TE types expressed are indicated. 673
674
.CC-BY-ND 4.0 International license(which was not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprintthis version posted July 26, 2020. . https://doi.org/10.1101/2020.07.26.220608doi: bioRxiv preprint