Statistical Challenges in Single-Cell Biology
Ascona, April 30 - May 5 2017
Organised by Niko Beerenwinkel, Peter Bühlmann, Wolfgang Huber


Driven by biotechnological developments that enable single-cell handling and measurements on a genome-wide scale, single-cell profiles are now generated widely and rapidly. Single-cell approaches allow for probing biological systems at an unprecedented level of detail. For the first time, we can manipulate single cells and investigate variation among individual cells and inter-cellular interactions on the molecular level.

This interdisciplinary workshop is intended to be a forum for the dissemination of cutting-edge biotechnological and computational developments and the identification of open data analysis problems and solutions. Targeted areas for the workshop include: novel experimental techniques for single-cell analysis, statistical models of cell-to-cell variation, data integration, and applications of single-cell genomics to somatic variation in development and disease.


Sponsors


Venue

Monte Verità
Via Collina 84
CH-6612 Ascona
tel. +41 91 785 40 40

About the Congressi Stefano Franscini (CSF)

The Congressi Stefano Franscini (CSF) is the international conference centre of the Swiss Federal Institute of Technology (ETH) in Zurich, situated in the south of Switzerland (Canton Ticino) at Monte Verità. It has been named after the Federal Councillor Stefano Franscini, a native of Ticino who, in 1854, played an important part in establishing the first Federal Institute of Technology in Switzerland, ETH Zurich. Every year, the centre hosts 20 - 25 conferences organised by professors working at Swiss universities and concerning all disciplines (sciences and humanities) taught at academic level. The centre is also open to the local population with a regular program of public events (lectures, concerts, films, etc.) organised in the context of its international conferences and/or Monte Verità’s cultural programme.

Shuttle service from Locarno Station

A free 13-seater shuttle bus to Monte Verità leaves from Locarno railway station on Sunday, April 30 at the following times: 14.05; 14.45; 15.35; 16.15; 17.05; 17.45; 18.30. The meeting point is on the right side of the train platforms in Locarno (see image).


Keynote lectures

Identifying differentially variable genes from single-cell RNA-sequencing data and applications in immunity and ageing

John Marioni

Cancer Research UK, Cambridge, UK

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels within seemingly homogeneous populations of cells. However, the data generated are subject to high levels of technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. In this presentation I will describe BASiCS (Bayesian Analysis of Single-Cell Sequencing data), an integrated Bayesian hierarchical approach that appropriately models noise and identifies both highly variable genes within a population of cells as well as genes that are differentially variable between populations of cells. I will demonstrate how BASiCS can be applied by discussing how aging impacts transcriptional dynamics using single-cell RNA-sequencing of unstimulated and stimulated naive and effector memory CD4+ T cells from young and old mice from two divergent species.
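
BASiCS itself is a Bayesian hierarchical model and is not reproduced here. As a much simpler illustration of the underlying idea (separating technical noise from genuine heterogeneity), the sketch below flags genes whose squared coefficient of variation exceeds what Poisson sampling noise alone would predict. The Poisson baseline, the function names, the cutoff and the toy counts are all assumptions made for illustration.

```python
from statistics import mean, pvariance

def excess_cv2(counts):
    """Squared coefficient of variation minus the Poisson expectation 1/mean.

    Under pure Poisson sampling noise, variance == mean, so CV^2 == 1/mean
    and the excess is close to zero. A clearly positive excess hints at
    variability beyond this (assumed) technical baseline.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    cv2 = pvariance(counts) / (m * m)
    return cv2 - 1.0 / m

def highly_variable(genes, threshold=0.5):
    """Return names of genes whose excess CV^2 is above a hypothetical cutoff."""
    return [name for name, counts in genes.items()
            if excess_cv2(counts) > threshold]

# Toy data: counts of two hypothetical genes across eight cells.
toy = {
    "gene_stable": [9, 11, 10, 10, 9, 11, 10, 10],
    "gene_bursty": [0, 0, 40, 0, 0, 38, 0, 2],
}
print(highly_variable(toy))
```

Both toy genes have the same mean, so only the spread around that mean separates them. A full model would additionally estimate cell-specific technical factors rather than assume a fixed Poisson baseline.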


From one to millions of cells: computational approaches for single-cell analysis

Peter Kharchenko

Department of Biomedical Informatics, Harvard Medical School

Over the last five years, our ability to isolate and analyze detailed molecular features of individual cells has expanded greatly. In particular, the number of cells measured by single-cell RNA-seq (scRNA-seq) experiments has gone from dozens to over a million cells, thanks to improved protocols and fluidic handling. Analysis of such data can provide detailed information on the composition of heterogeneous biological samples, and on the variety of cellular processes that altogether comprise the cellular state. Such inferences, however, require careful statistical treatment to take into account measurement noise as well as inherent biological stochasticity. I will discuss several approaches we have developed to address such problems, including error modeling techniques, statistical interrogation of heterogeneity using gene sets, and visualization of complex heterogeneity patterns, implemented in the PAGODA package. I will discuss how these approaches have been modified to enable fast analysis of very large datasets in PAGODA2, and how the flow of typical scRNA-seq analysis can be adapted to take advantage of potentially extensive repositories of scRNA-seq measurements.

Synthetic gene circuits for in situ cell classification

Kobi Benenson

Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

Classification of biological samples is typically performed by computer algorithms that process sample-derived data in order to assign a sample to a phenotypic class. Recent developments in synthetic biology and biomolecular computing open up the possibility of classifying individual live cells in situ using complex multilayered gene circuits. These circuits typically deploy multiple sensors to read out a panel of cytoplasmic biomarkers, and an information-processing module that integrates sensory data and generates an output whose intensity correlates with the cell phenotype. One potential application of such circuits is specific cell targeting in genetic diseases and cancer.

In the talk I will discuss recent experimental progress in developing foundational technologies for classifier circuits, as well as a computational framework for their automated design. In particular, I will describe novel approaches towards sensing and integration of multiple transcriptional and microRNA inputs. I will also describe novel characterization methods that enable rapid evaluation of the input-output relationship of complex circuits in mammalian cells.

Highly multiplexed analysis of the tumor ecosystem by mass cytometry

Bernd Bodenmiller

Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland

Cancer is a tissue disease. Cancer cells and normal cells form an evolving ecosystem to support tumour development. The complexity of this system is a key obstacle to healing the disease. The visualization and study of the tumour ecosystem is thus essential to enable an understanding of tumour biology, to define biomarkers, and to identify new therapeutic routes. For this purpose, we developed imaging mass cytometry (IMC), which enables the simultaneous visualization of >50 epitopes on tissues with subcellular resolution. Soon >130 markers can be visualized. We applied IMC to the analysis of hundreds of breast cancer samples. Our analysis reveals a surprising level of inter- and intra-tumour heterogeneity and identifies new diversity within known breast cancer subtypes. Furthermore, we identify cell-cell interaction motifs in the tumour microenvironment that correlate with clinical outcomes of the analyzed patients. Our results show that IMC provides high-dimensional analysis of cell type, cell state and cell-to-cell interactions in the tumour ecosystem. We envision that IMC will enable a personalized approach to diagnose disease and to guide treatment.


Identifying novel correlates of protection via single-cell analysis of antigen-specific T-cells

Raphael Gottardo

Fred Hutchinson Cancer Research Center

Antigen (Ag)-specific T-cells are rare among circulating peripheral blood mononuclear cells. Despite their critical role in containing infection, their total frequency in blood does not correlate with clinical protection; therefore, it is likely that some characteristic or quality of a subset of Ag-specific cells confers protective immunity for infectious diseases. Recent technical advances such as polychromatic flow cytometry and single-cell genomics have enabled the high-throughput quantification of genes or proteins at the single-cell level. Although many analytical tools exist for analyzing high-dimensional data, such as from gene expression arrays, very few tools are available for single-cell data, which has its own bioinformatics and statistical challenges. During this talk I will give an overview of the challenges involved in the analysis of single-cell data and show how such technologies, combined with novel computational methods, can be used to improve the characterization of Ag-specific T-cells to reveal novel correlates of protection for HIV and malaria.

Mapping genetic interactions during intestinal organoid self-organization

Prisca Liberali

FMI, University of Basel, Basel, Switzerland

Collective behaviour is a complex behaviour in which understanding of the individual single cells does not necessarily explain the collective behaviour of the population of cells. The unexplainable behaviour is the self-organization of the system. Recently, model systems have been developed from stem cells that can self-organize into organoid structures in vitro. In particular, intestinal organoids recapitulate most of the spatial and temporal processes of morphogenesis and patterning observed in intestinal tissue in animals and contain all cell types found in the intestine. To identify genes and map regulatory genetic interactions underlying self-organization during organoid development, a series of small compound screens have been performed in a 3D intestinal stem cell culture model. Extensive image analysis and advanced statistics are applied to quantify multiple features in a large number of single cells in each perturbed cell population. We are then building a statistical trajectory of organoid development in multivariate feature space. This analysis is generating a cross-comparable dataset that is particularly well suited for the inference of regulatory genetic interactions. Modules of genes that functionally interact in regulating some or multiple processes involved in self-organization have been identified. For several genes with strong effects and key roles in the regulatory genetic interaction network we performed live-cell imaging, from a single intestinal stem cell to a fully grown organoid, with a custom-made light sheet microscope. This approach is establishing a novel paradigm in genetic interaction screening applied to collective cell behavior and symmetry breaking in stem cells.

Symmetry breaking and self-organisation in mouse development

Takashi Hiiragi

EMBL, Heidelberg, Germany

A defining feature of living systems is the capacity to break symmetry and generate well-defined forms and patterns through self-organisation. Our group aims to understand the principle of multi-cellular self-organisation using a well-suited model system: early mouse embryos. Mammalian eggs lack polarity and thus symmetry is broken during early embryogenesis. This symmetry breaking results in the formation of a blastocyst consisting of two major cell types, the inner cell mass and the trophectoderm, each distinct in its position and gene expression. Our recent studies unexpectedly revealed that morphogenesis and gene expression are highly dynamic and stochastically variable during this process. Determining which signal breaks the symmetry and how the blastocyst establishes a reproducible shape and pattern despite the preceding variability remain fundamental open questions in mammalian development. We have recently developed a unique set of experimental frameworks that integrate biology and physics. With this we aim to understand how molecular, cellular and physical signals are dynamically coupled across the scales for self-organisation during early mammalian development.


Droplet-based microfluidic for single cell analysis

Valerie Taly

Paris Descartes

Droplet-based microfluidics has led to the development of highly powerful systems that represent a new paradigm in high-throughput screening, where individual assays are compartmentalized within microdroplet microreactors. By combining a decrease of assay volume and an increase of throughput, this technology goes beyond the capacities of conventional screening systems. Droplets (in the pL to nL range) are produced as independent microreactors that can be further actuated and analyzed at rates of the order of 1000 droplets per second. Added to the flexibility and versatility of platform designs, such progress in sub-nanoliter droplet manipulation allows for a level of control that was hitherto impossible [1-3].

Microfluidics has recently emerged as a major player in the single-cell era that is gradually emerging among biology laboratories, mainly due to the single-cell high-throughput handling solutions it offers. After a presentation of different microfluidic systems and strategies allowing for single-cell manipulation and analysis, the presentation will focus on compartment-based microfluidic approaches. Illustrative examples of several technologies will be presented, including applications in directed evolution, high-throughput screening and omics. Finally, recent works with high potential impact for cancer research will be presented [4,5].

References

1. Taly V, Pekin D, Abed AE, Laurent-Puig P. Detecting biomarkers with microdroplet technology. Trends Mol Med. 2012;18(7):405-16. doi:10.1016/j.molmed.2012.05.001.

2. Kelly BT, Baret JC, Taly V, Griffiths AD. Miniaturizing chemistry and biology in microdroplets. Chem Commun (Camb). 2007(18):1773-88. doi:10.1039/b616252e.

3. Taly V, Kelly BT, Griffiths AD. Droplets as Microreactors for High-Throughput Biology. Chembiochem. 2007;8(3):263-72.

4. Perkins G, Lu H, Garlan F, Taly V. Droplet-Based Digital PCR: Application in Cancer Research. Advances in Clinical Chemistry. 2017;85:43-91.

5. Microchip Diagnostics. Series Methods in Molecular Biology. Springer Protocols. 2017; 1547.


Single cell analysis of circulating tumor cell clusters

Nicola Aceto

Department of Biomedicine, University of Basel, Basel, Switzerland

Cancer patients who develop metastatic disease are currently considered incurable. Mainly, this is due to a limited understanding of the molecular mechanisms that characterize the metastatic process, and the lack of effective metastasis-suppressing agents. The metastatic cascade begins when primary tumor cells enter the circulatory system, and it is followed by their extravasation at distant sites, where they form proliferative metastatic lesions. Cancer cells in the bloodstream are referred to as circulating tumor cells (CTCs), and while technically challenging, their isolation and interrogation hold the key to understanding the principles governing the metastatic spread of cancer. For instance, using a combination of microfluidic technologies, single-cell sequencing, and molecular and computational biology, we recently found that CTC-clusters, rather than single migratory CTCs, are highly efficient precursors of metastasis in several cancer types. With the isolation and sequencing of single cells within CTC-clusters, we aim to dissect their cellular and molecular heterogeneity, to shed light on some unique features of these metastatic precursors, and to enable the development of new metastasis-suppressing therapies.

Single Cells, Big Data

Nicholas E. Navin

MD Anderson Cancer Center, Houston, TX, USA

Single cells generate big data sets. Sequencing the genome or exome of a single cell can generate terabytes of data that must be processed to mitigate technical errors that arise during whole-genome amplification. The error profiles of single-cell DNA sequencing data are unique compared to standard next-generation sequencing datasets and therefore violate many assumptions made by methods designed for such datasets. Although there has been significant progress in the last few years in developing computational methods for single-cell RNA sequencing data, statistical methods for single-cell DNA data are far behind. In this talk I will provide an overview of the experimental and computational methods our group has developed for performing single-cell DNA sequencing to measure copy number profiles, point mutations and indels in individual tumor cells. I will also discuss applications of these methods to study metastatic lineages in colon cancer and punctuated evolution in triple-negative breast cancer patients.

Latent variable models for decomposing single-cell expression variation

Oliver Stegle

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD Hinxton, Cambridge, UK

Technological advances permit assaying the transcriptome and the proteome of single cells, both in cell suspensions and increasingly in their natural spatio-temporal contexts. Cell-to-cell differences in gene expression can be driven by both observed and unmeasured factors, including technical effects such as batch, or biological processes including the cell cycle, apoptosis or cell differentiation. In this talk, I will describe latent variable models for decomposing the sources of variation in single-cell studies, thereby inferring both biological and confounding sources of variation.

In the first part of this talk, I will describe scalable sparse factor analysis models for single-cell RNA-seq. A particular focus of these methods is the ability to integrate prior annotations on biological gene sets, thereby identifying biological drivers of expression heterogeneity. In the second part I will describe methods based on Gaussian processes to model and test for spatial gene expression heterogeneity.
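
For orientation, a generic factor analysis decomposition of the kind referred to above can be written as follows. The notation is an assumption made for illustration and is not taken from the talk: y_{cg} is the expression of gene g in cell c, x_{ck} the activity of latent factor k in cell c, w_{kg} the loading of gene g on factor k, and \epsilon_{cg} residual noise with a gene-specific variance.

```latex
y_{cg} = \sum_{k=1}^{K} x_{ck}\, w_{kg} + \epsilon_{cg},
\qquad \epsilon_{cg} \sim \mathcal{N}(0, \sigma_g^2)
```

In a sparse variant most loadings w_{kg} are zero. One plausible way to use gene set annotations, again stated as an assumption, is to let them inform which genes may load on a given factor.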


Inferring early bifurcation events from single-cell RNA-Seq data

Magnus Rattray

University of Manchester

Single-cell RNA-Seq (scRNA-Seq) data can be used to uncover the dynamic processes where cells differentiate into different lineages, through the use of pseudo-time inference methods. We have developed a statistical tool for identifying gene-specific differentiation times given pseudo-time estimates for each cell. We model a branching process in time as a Gaussian process, building on a recent approach for identifying perturbations from two-sample gene expression time-series data (Yang et al. 2016). In the case of single-cell data the inference problem has to be modified to allow inference of branch identity for genes diverging prior to the global cellular branching point identified using the Monocle2 algorithm (Trapnell et al. 2016). We implement our method using the GPflow Gaussian process library, which uses TensorFlow to allow automatic differentiation of the variational objective function and to exploit rapid processing by GPUs when available (Matthews et al. 2016). Our method allows for the inference of the branching time for each gene along with an associated Bayesian credible region. We use simulated data to compare our method to a spline-based approach implemented in the BEAM method (Qiu et al. 2017) within Monocle2 and a Bayesian mixture of factor analysers approach (Campbell and Yau, 2017). We apply our method to scRNA-Seq and Drop-seq datasets to identify early differentiation genes.

Joint work with Alexis Boukouvalas and James Hensman


Contributed talks

Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT)

Ambrose Carr

Memorial Sloan Kettering Cancer Center

Single-cell RNA-seq gives access to gene expression measurements for thousands of cells, allowing discovery and characterization of cell types. However, the data is noise-prone due to experimental errors and cell type-specific biases. Current computational approaches for analyzing single-cell data involve a global normalization step which introduces incorrect biases and spurious noise and does not resolve missing data (dropouts). This can lead to misleading conclusions in downstream analyses. Moreover, a single normalization removes important cell type-specific information. We introduce a data-driven model, BISCUIT, that iteratively normalizes and clusters cells, thereby separating noise from interesting biological signals. BISCUIT is a Bayesian probabilistic model that learns cell-specific parameters to intelligently drive normalization. This approach displays superior performance to global normalization followed by clustering.

We apply BISCUIT to single-cell data on tumor-infiltrating cells from breast cancer patients, showcasing the strength of this method. BISCUIT identifies both expected and novel biological populations, while common normalization techniques failed to reveal structure, instead amplifying strong biases differentiating patients. BISCUIT enables novel characterization of multiple different subpopulations of T cells, multiple myeloid clusters, and distinct populations of regulatory T cells and dendritic cells which are not detected with other normalization methods. This indicates the appropriateness of BISCUIT for experimental data containing significant diversity in cells. BISCUIT can infer underlying co-expression patterns and cell type-specific expression from data, and therefore it has advantages beyond predicting clusters and imputing data. Specifically, BISCUIT revealed strong differences in co-expression patterns between subpopulations of T cells. Interestingly, co-expression patterns between co-receptor genes in regulatory T cells varied significantly across patients. These observations could impact immunotherapy and might in the future suggest avenues for tailoring patient-specific immunotherapy. Also, BISCUIT parameters specific to each cell type can be used to explore differences in the tumor ecosystem across patients.


Missing Data and Technical Variability in Single-Cell RNA-Sequencing Experiments

Stephanie Hicks

Dana-Farber Cancer Institute / Harvard T.H. Chan School of Public Health

Recent advances in high-throughput technology permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Sequencing (scRNA-Seq) is the most widely used of these technologies and has appeared in numerous publications. Although systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-Seq technology. Here, we examine data from published studies and find that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we show that scRNA-Seq reports more zeros than expected, and that technical variability can lead to cell-to-cell differences that can be confused with novel biological results. Finally, we demonstrate how batch effects can exacerbate the problem.
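
One way to make "more zeros than expected" concrete is to compare the observed fraction of zero counts for a gene with the fraction predicted by a reference count model. The sketch below assumes a Poisson reference with the same mean; the abstract does not state which expectation the authors used, so this choice, the function name and the toy counts are illustrative assumptions.

```python
import math

def zero_excess(counts):
    """Observed zero fraction minus the zero fraction expected under a
    Poisson model with the same mean (an assumed reference model)."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)  # P(X = 0) for Poisson(mean)
    return observed - expected

# Toy counts for one hypothetical gene across ten cells.
counts = [0, 0, 0, 0, 0, 0, 12, 9, 15, 14]
print(round(zero_excess(counts), 3))
```

A value well above zero means that the gene shows many more zeros than its average expression level would suggest under the reference model.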

Transcriptome-wide splicing quantification in single cells

Yuanhua Huang

School of Informatics, University of Edinburgh

Single-cell RNA-seq (scRNA-seq) has revolutionised our understanding of transcriptome variability, with profound implications both fundamental and translational. While scRNA-seq provides a comprehensive measurement of stochasticity in transcription, the limitations of the technology have prevented its application to dissect variability in RNA processing events such as splicing. Here we present BRIE (Bayesian Regression for Isoform Estimation), a Bayesian hierarchical model which resolves these problems by learning an informative prior distribution from multiple single cells. BRIE combines the mixture modelling approach for isoform quantification with a regression approach to learn sequence features which are predictive of splicing events. We validate BRIE on several scRNA-seq data sets, showing that BRIE yields reproducible estimates of exon inclusion ratios in single cells and provides an effective tool for differential isoform quantification between scRNA-seq data sets. BRIE therefore expands the scope of scRNA-seq experiments to probe the stochasticity of RNA processing.

Notes: This is joint work with Guido Sanguinetti. The full manuscript is available on bioRxiv: http://biorxiv.org/content/early/2017/01/05/098517
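
BRIE's full model (mixture modelling combined with regression on sequence features) is not reproduced here. The sketch below only illustrates why an informative prior helps when a single cell yields few reads: it computes the posterior mean of an exon inclusion ratio under a Beta prior and a binomial read model. The function name and the prior parameters are hypothetical placeholders standing in for a prior learned from many cells.

```python
def inclusion_ratio(inc_reads, exc_reads, prior_a=1.0, prior_b=1.0):
    """Posterior mean of the exon inclusion ratio under a Beta(prior_a, prior_b)
    prior and a binomial read model (a simplification, not BRIE's model)."""
    return (inc_reads + prior_a) / (inc_reads + exc_reads + prior_a + prior_b)

# A cell with only two informative reads, both supporting exon inclusion.
flat = inclusion_ratio(2, 0)                                # uniform prior
informed = inclusion_ratio(2, 0, prior_a=2.0, prior_b=8.0)  # hypothetical learned prior
print(flat, informed)
```

With so few reads, the estimate is dominated by the prior, which is why a well-calibrated prior can stabilise estimates in low-coverage cells.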

The two phases in gene expression regulation

Jean (Zhijin) Wu

Brown University

Transcription is a complex process enabled and regulated by many factors. We propose a latent class probabilistic model of sequencing counts that explains well the characteristics of observed single-cell transcriptomes, including the heterogeneity of the zero inflation and the sparsity and extreme skewness of expression in some genes. Using the analogy of energy levels, we consider that the expression level of a gene in a given cell may be in the ground state or an excited state. Unlike existing methods, we do not expect the expression to be zero even if the gene is in the ground state. We allow the transcript counts given the ground state to be a zero-inflated Poisson random variable, with parameters depending on the cell. Using data from all genes in all cells in an experiment, we estimate cell-specific parameters for the ground state, but gene-specific parameters for the excited states.

At the specific gene level, we present a robust and fast algorithm to estimate the gene- and cell-specific states, and differential expression in two forms: the binary state change from ground to excited, and the continuous regulation in the excited state. We report gene expression regulation in both phases and, most interestingly, expression compensation, in which the expression level in the excited state is up-regulated to compensate for the reduced proportion of cells in the excited state.

At the whole cell level, we introduce a likelihood ratio based similarity measure that captures the overall concordance of gene expression in both phases between a pair of cells. We demonstrate the use of this model to identify the major regulation phases in different systems.
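
The two-state idea can be written as a small mixture model. In the sketch below the ground state follows a zero-inflated Poisson distribution, as stated in the abstract. The excited-state distribution is not specified there, so a plain Poisson is assumed, and all function names and parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing count k with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson: an extra point mass pi_zero at zero."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

def prob_excited(k, p_excited, pi_zero, lam_ground, lam_excited):
    """Posterior probability that a gene is in the excited state given count k."""
    ground = (1.0 - p_excited) * zip_pmf(k, pi_zero, lam_ground)
    excited = p_excited * poisson_pmf(k, lam_excited)  # assumed Poisson
    return excited / (ground + excited)

# Hypothetical parameters: cell-specific ground state, gene-specific excited state.
params = dict(p_excited=0.3, pi_zero=0.5, lam_ground=0.5, lam_excited=20.0)
for k in (0, 1, 15):
    print(k, round(prob_excited(k, **params), 4))
```

Small non-zero counts remain compatible with the ground state in this sketch, which mirrors the statement that expression need not be zero when a gene is in the ground state.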


Single-cell analysis on microfluidic platforms

F. Kurth, P.S. Dittrich

Department of Biosystems Science and Engineering, ETH Zurich, Switzerland

Microfluidics is a promising technology and provides a huge toolbox for analytical and bioanalytical methods in general. With respect to cell analysis, the emergence of microfluidic platforms has paved the way to novel analytical strategies for positioning, treatment and observation of living cells, for creating chemically defined liquid environments, and for tailoring mechanical or physical conditions. In particular, the development of microfluidic platforms opened new possibilities to manipulate and investigate individual cells and has tremendously increased our knowledge of single-cell behavior [1,2]. In this presentation, our recent approaches to single-cell analysis will be presented. We developed microfluidic platforms produced by so-called multilayer soft lithography to systematically study single cells. In these microdevices, single cells are physically trapped and isolated from the environment by round-shaped valves that encapsulate the cell in a volume of a few hundred picoliters [3]. The realization of this tiny analytical chamber is the key requirement for a highly sensitive detection of the target molecules directly in the cell lysate, even when present in very low copy numbers. We integrated the platform with immunological methods such as enzyme-linked immunosorbent assays (ELISA) and competitive ELISA, which allows for the quantification of proteins [4] and other biomolecules [5]. In addition to the chemical analysis of the cell lysate, such devices were employed for fast evaluation of the effect of chemical compounds, e.g. drug candidates, on cells, with the benefit that heterogeneous responses can be easily revealed. The microfluidic platform can be employed for a wide range of molecules, for studies of various cell types including mammalian cells, yeast cells and bacteria, as well as for liposomes as artificial cell models.

Apart from the analysis of isolated single cells, we adopted microfluidic chambers for mechanobiological single-cell studies within small cell populations. We applied fluid flow induced shear stress to induce calcium entry via reputedly mechanosensitive channel classes under varying culture conditions [6]. In combination with pharmacological channel modulation, our results could identify the role of particular channels, which has only been feasible by low-throughput technologies before.

Literature

1. L. Armbrecht, P. S. Dittrich, Recent advances in single cell analysis, Anal. Chem.

2. Hummer, F. Kurth, N. Naredi-Rainer, P. S. Dittrich, Single cells in confined volumes: microchambers and microdroplets, Lab Chip 2016, 16, 447.

3. K. Eyer, P. Kuhn, C. Hanke, P. S. Dittrich, A microchamber array for single cell isolation and analysis of intracellular biomolecules, Lab Chip 2012, 12, 765.

4. K. Eyer, S. Stratz, P. Kuhn, S. K. Kuster, P. S. Dittrich, Implementing enzyme-linked immunosorbent assays on a microfluidic chip to quantify intracellular molecules in single cells, Anal. Chem. 2013, 85, 3280.

5. P. Kuhn, K. Eyer, S. Allner, D. Lombardi, P. S. Dittrich, A microfluidic vesicle screening platform: monitoring the lipid membrane permeability of tetracyclines, Anal. Chem. 2011, 83, 8877.

6. F. Kurth, A. Franco-Obregon, M. Casarosa, S. K. Kuster, K. Wuertz-Kozak, P. S. Dittrich, Transient receptor potential vanilloid 2-mediated shear-stress responses in C2C12 myoblasts are regulated by serum and extracellular matrix, FASEB J. 2015, 29, 4726.

CytoGLMM: Bayesian Hierarchical Linear Modeling for Flow Cytometry Data

Christof Seiler

Department of Statistics, Stanford University

Cytometry by time-of-flight (CyTOF) characterizes around 40 cell markers simultaneously. The number of measured cells per sample is usually around 10,000. Despite small donor sample sizes, the Nolan lab predicted surgical recovery with 32 donors and the Blish lab found associations with HIV acquisition with 33 donors. To relate CyTOF data to phenotypic outcomes, current methods cluster cells into subpopulations and then relate cluster features to outcomes (FlowMap-FR, Citrus, flowType-RchyOptimyx, FloReMi, and COMPASS). These approaches assume a discrete set of cell subpopulations. There is growing evidence suggesting that some cells adapt to counteract different threats such as viruses and cancers. To find such salient marker changes, BayesFlow extends cluster-based methods by allowing each cell to belong to a mixture of subpopulations. We propose to skip the clustering step and use a generalized linear model where the response variable is the outcome and the explanatory variables are the 40 protein expression profiles. We build a hierarchical model of CyTOF data that allows us to estimate population-level parameters and marginalize out the donor-specific parameters. Estimating donor-specific parameters is straightforward because the number of cells exceeds the number of markers, whereas estimating population-level parameters is challenging because the number of markers usually exceeds the number of donors. Therefore, it is important to borrow information through partial pooling across donors when estimating the population-level parameters. We check the model through test quantities of the observed data and the posterior predictive distribution. An R implementation using the probabilistic programming language Stan is available in our new R package CytoGLMM. We validate our approach on open access CyTOF data available on FlowRepository and ImmPort.
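The idea of borrowing information through partial pooling can be illustrated with a minimal Python sketch. It is not the CytoGLMM implementation (which is in R and Stan): it assumes one coefficient per donor has already been estimated, and the function name `partially_pool` and the between-donor variance `tau2` are hypothetical and taken as known here.

```python
import numpy as np

def partially_pool(donor_estimates, donor_variances, tau2):
    """Shrink per-donor coefficient estimates toward a population-level mean.

    donor_estimates: one estimated coefficient per donor
    donor_variances: sampling variance of each donor's estimate (> 0)
    tau2:            assumed between-donor variance (hypothetical, taken as known)
    """
    est = np.asarray(donor_estimates, dtype=float)
    var = np.asarray(donor_variances, dtype=float)
    # Population mean: precision-weighted average of the donor estimates.
    weights = 1.0 / (var + tau2)
    mu = np.sum(weights * est) / np.sum(weights)
    # Shrinkage factor: near 1 keeps the donor's own estimate, near 0 pools fully.
    keep = tau2 / (tau2 + var)
    return mu, mu + keep * (est - mu)
```

Donors whose estimates are noisy are pulled more strongly toward the population mean, which is the behaviour a hierarchical model produces automatically when the donor-level parameters share a common prior.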

Sensitive detection of rare disease-associated cell subsets via representation learning

Eirini Arvaniti

ETH Zurich

Rare cell populations play a pivotal role in the initiation and progression of diseases such as cancer. However, the identification of such subpopulations remains a difficult task. This work describes CellCnn, a representation learning approach to detect rare cell subsets associated with disease using high-dimensional single-cell measurements.

Existing approaches [1,2] address the task of detecting phenotype-associated cell populations via small variations of the following pipeline: cell populations are defined via a clustering algorithm, a cluster-based representation of each sample (e.g. in terms of cluster frequencies) is computed and, finally, a supervised learning module is used to associate this representation with the phenotype of interest. Successful application of such approaches may be compromised by the quality of the clustering result, especially for rare hard-to-detect cell types.

To overcome this limitation, CellCnn does not separate the steps of extracting a cell population representation and associating it with disease status. Combining these two tasks requires an approach that (1) is capable of operating on the basis of a set of unordered single cell measurements, (2) specifically learns representations of single cell measurements that are associated with the considered phenotype and (3) takes advantage of the possibly large number of such observations. We bring together concepts from multiple instance learning and convolutional neural networks to meet these requirements.
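Requirement (1), operating on an unordered set of cells, can be sketched as a forward pass in Python. This is an illustration under assumptions, not the CellCnn code: the function name `sample_representation`, the ReLU non-linearity and the top-k mean pooling are choices made for this example, and the filter weights would normally be learned rather than supplied.

```python
import numpy as np

def sample_representation(cells, filters, biases, k):
    """Map an unordered set of cells (n_cells x n_markers) to a fixed-length vector."""
    # Each filter scores every cell independently (equivalent to a width-1 convolution).
    activations = np.maximum(cells @ filters + biases, 0.0)  # ReLU, (n_cells, n_filters)
    # Pool over cells: mean of the k highest activations per filter.
    top_k = np.sort(activations, axis=0)[-k:, :]
    return top_k.mean(axis=0)
```

Because the pooling step only looks at the k strongest responses per filter, the output does not depend on the order of the cells, and a small k lets a filter respond to a subset that makes up only a tiny fraction of the sample.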

In this study, we apply CellCnn in a classification setting to reconstruct cell type-specific signaling responses in samples of peripheral blood mononuclear cells. We additionally apply CellCnn in a regression setting to identify abundant cell populations associated with disease onset after HIV infection, and achieve prediction accuracy comparable to a recent state-of-the-art analysis [2], at a computational cost reduced by several orders of magnitude. Finally, we demonstrate the unique ability of CellCnn to identify extremely rare (down to 0.01% frequency) phenotype-associated cell subsets by detecting memory-like NK cells associated with prior CMV infection and leukemic blasts in minimal residual disease-like situations.

1. Aghaeepour, N. et al. Critical assessment of automated flow cytometry data analysis techniques. Nat. Methods 10, 228-238 (2013).

2. Bruggner, R. V., Bodenmiller, B., Dill, D. L., Tibshirani, R. J. & Nolan, G. P. Automated identification of stratifying signatures in cellular subpopulations. Proc. Natl. Acad. Sci. U. S. A. 111, E2770-7 (2014).

Mapping genetic effects on interactions with single-cell states

Davis McCarthy

EMBL-EBI, Hinxton, UK

Technological advances have made it possible to sequence transcriptomes at single-cell resolution (scRNA-seq), at high throughput, yielding insights into development and the etiology of different tissues and cell states. Simultaneously, it has become possible to assay multiple genomic layers in large cohorts of individuals. This now allows genotyping and scRNA-seq to be carried out for large numbers of genetically distinct individuals. Resources such as the Human Induced Pluripotent Stem Cells Initiative (HipSci; www.hipsci.org) provide cell lines from hundreds of human donors that can be used to investigate genetic effects on heterogeneity in single cells, in cell types inaccessible in large numbers of human individuals.

The particular characteristics of scRNA-seq data demand that approaches developed for bulk RNA-seq be adapted and extended to map genetic effects on expression heterogeneity in single cells. The large number of repeated measurements (cells from the same individual) provides challenges and opportunities for QTL mapping. We go beyond mapping eQTLs to study variance QTLs (genetic variants that associate with variance phenotypes for a gene) and investigate interactions between expression, genetic variation and cell state, such as differentiated cell type or expression activity in particular pathways of interest.

We demonstrate the application of such methods to a dataset of more than 10,000 cells from over 60 individual human donors. In this study, HipSci iPSC lines are differentiated to definitive endoderm, with cells sampled at four time-points across the three-day differentiation experiment. We can map eQTLs and varQTLs in these data. Furthermore, using clock-time, pseudotemporal ordering of cells from expression data and independent measurements of cell-surface markers, we can define multiple cell states and map genetic effects on interactions between these states and gene expression, yielding insights into genetic effects on interactions in single-cell states in early development.
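The distinction between an eQTL and a variance QTL can be made concrete with a deliberately simple Python sketch. It is an assumption-laden illustration, not the authors' method: cells are collapsed to a per-donor mean and variance for one gene, and each summary is regressed on genotype dosage by ordinary least squares. The function name `donor_level_qtl` and its arguments are hypothetical.

```python
import numpy as np

def donor_level_qtl(expression, donor_index, dosage):
    """Regress per-donor mean and variance of one gene on genotype dosage.

    expression:  expression value of the gene in each cell
    donor_index: integer donor label (0 .. n_donors-1) for each cell
    dosage:      alt-allele dosage (0, 1 or 2) for each donor
    """
    expression = np.asarray(expression, dtype=float)
    donor_index = np.asarray(donor_index)
    dosage = np.asarray(dosage, dtype=float)
    donors = np.arange(len(dosage))
    means = np.array([expression[donor_index == d].mean() for d in donors])
    variances = np.array([expression[donor_index == d].var(ddof=1) for d in donors])
    # Least-squares slopes: change in mean (eQTL-like) and in variance
    # (varQTL-like) per additional alt allele.
    mean_slope = np.polyfit(dosage, means, 1)[0]
    var_slope = np.polyfit(dosage, variances, 1)[0]
    return mean_slope, var_slope
```

A real analysis would also need to account for unequal cell numbers per donor, technical covariates and the mean-variance relationship of count data, none of which this sketch attempts.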

Pooled CRISPR screening with single-cell transcriptome readout

Andre Rendeiro

CeMM Research Centre for Molecular Medicine of the Austrian Academy of Sciences

CRISPR-based genetic screens are accelerating biological discovery, but current methods have inherent limitations. Widely used pooled screens are restricted to simple readouts including cell proliferation and sortable marker proteins. Arrayed screens allow for comprehensive molecular readouts such as transcriptome profiling, but at much lower throughput. Here we combine pooled CRISPR screening with single-cell RNA sequencing into a broadly applicable workflow, directly linking guide RNA expression to transcriptome responses in thousands of individual cells. Our method for CRISPR droplet sequencing (CROP-seq) enables pooled CRISPR screens with single-cell transcriptome resolution, which will facilitate high-throughput functional dissection of complex regulatory mechanisms and heterogeneous cell populations.

Quantifying developmental plasticity by integrated single-cell RNA-seq and ex vivo culture

Lars Velten

EMBL

Single cell RNA-seq has emerged as a powerful tool to map cellular trajectories and branch points during development. However, due to the snapshot nature of RNA-Seq, no direct statements can be made about the reversibility of cell fate decisions and the ability of cells to transit between trajectories: the progeny of a cell that appears advanced towards a particular developmental end point might still functionally contribute to multiple lineages. To quantitatively assess cellular plasticity on scRNAseq-based developmental maps of human blood formation, we have combined high-dimensional surface marker indexing by FACS with both single cell RNA-Seq and single cell ex vivo cultivation. Our data reveal that while the position on the “map” is tightly correlated to the predominant cell type generated ex vivo, the probability of switching developmental fate gradually declines as development progresses. Our work suggests that blood formation, and possibly development in general, occurs in a Waddington landscape permissive of stochastic transitions between lineages downstream of bifurcation points, and it provides the tools required for quantitatively integrating single cell functional and transcriptomic data.

Single cell-based detection of diverse classes of genomic DNA rearrangements

Jan Korbel

European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany, and European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK.

Our laboratory has recently applied Strand-seq, a single cell based technology, to detect different classes of genomic structural variation (SVs), including copy-number unbalanced SVs and copy-neutral rearrangements. This unique technology sequences only DNA template strands present in single cells to preserve the structure and identity of individual homologues. Homologue resolution allows diverse classes of SV to be identified, including copy-neutral genomic rearrangements such as inversions and highly complex SVs that are extremely challenging, or even impossible, to identify with alternative sequencing approaches. While this places Strand-seq as a forefront method for detection of diverse classes of structural variation in genomes and in single cells, tools for integrative computational analysis of Strand-seq data in single cells are under active development in our group. We are currently pursuing extensive benchmarking and validation of these tools to verify that simple and complex DNA rearrangements can be predicted sensitively and with accuracy. Our presentation will, first and foremost, focus on novel genetic variant classes that we are able to discover in single cells using novel computational approaches operating on Strand-seq data. We have performed extensive benchmarking of our integrative Strand-seq computational analysis approaches using a set of “gold standard” lymphoblastoid cell lines enabling us to compare Strand-seq based rearrangement calls against various “3rd-generation” genomic techniques (i.e., Pacific Biosciences and Moleculo single molecule sequencing data, Bionano optical maps, and 10X Genomics sub-chromosome scale haplotypes). We have also applied this to study events formed by chromothripsis, a catastrophic DNA rearrangement process that can generate hundreds of SV breakpoints in a single cell division cycle. Modelling this event using our group’s in vitro “complex alterations after selection and transformation (CAST)” approach, we have identified extensive SV heterogeneity in chromothriptic cell lines, illustrating the new potential for studying complex genomic heterogeneity at the cell population level.

Leveraging single-cell technology for haplotyping

Tobias Marschall

Saarland University / Max Planck Institute for Informatics

Joint work with David Porubsky, Shilpa Garg, and Ashley Sanders

Humans and many other species are diploid. Every individual inherits two versions of each autosomal chromosome, called haplotypes, one from its mother and one from its father. Moving from (sequences of) genotypes to haplotypes is known as phasing or haplotyping. The knowledge of haplotypes is critical for addressing a variety of important questions in fundamental and clinical research [doi:10.1038/nrg2950].

While phasing is commonly approached by statistical inference on genotype data of large cohorts, such methods have the inherent limitation that they cannot reliably phase variants of low allele frequency, which can be of particular clinical interest.

Strand-Seq is a protocol that selectively sequences only template strands after one round of DNA replication in the presence of BrdU [doi:10.1038/nmeth.2206]. Single-cell sequencing of the daughter cells leads to sequencing reads whose directionality is informative of the haplotype of origin [doi:10.1101/gr.209841.116]. This technique is extremely powerful as it naturally provides chromosome-length haplotype information, i.e. it is able to phase over difficult regions such as centromeres, homozygosity regions or segmental duplications.

As Strand-Seq bypasses whole genome amplification, the sequencing yield from each single cell is low and many cells of the same individual need to be sequenced to obtain dense, whole-genome haplotypes. Here, we explore a hybrid strategy of combining Strand-Seq single-cell technology with long-read PacBio bulk sequencing.

We show how this data integration task can be cast as the weighted Minimum Error Correction (wMEC) problem. Solving this NP-hard problem leads to Maximum Likelihood haplotypes. We show how problem instances encountered in practice can be solved optimally using a fixed-parameter tractable algorithm [doi:10.1089/cmb.2014.0157].
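To make the wMEC objective concrete, the following Python sketch solves a tiny instance by exhaustive search. It is not the fixed-parameter tractable algorithm cited above, and it only scales to a handful of variants. It assumes that every variant is heterozygous, so the second haplotype is the complement of the first; the function name `solve_wmec` and the input layout are hypothetical.

```python
from itertools import product

def solve_wmec(fragments):
    """Exhaustively solve a tiny weighted MEC instance.

    fragments: list of dicts {variant_index: (allele, weight)}, allele in {0, 1};
               the weight is the cost of flipping that allele call.
    Returns (cost, haplotype); the second haplotype is the complement.
    """
    n_variants = 1 + max(pos for frag in fragments for pos in frag)
    best_cost, best_hap = float("inf"), None
    # Fix the first allele to 0: a haplotype and its complement describe the same phasing.
    for tail in product((0, 1), repeat=n_variants - 1):
        hap = (0,) + tail
        cost = 0
        for frag in fragments:
            # Weight of the entries that must be flipped to match each haplotype.
            to_h1 = sum(w for pos, (a, w) in frag.items() if a != hap[pos])
            to_h2 = sum(w for pos, (a, w) in frag.items() if a == hap[pos])
            cost += min(to_h1, to_h2)
        if cost < best_cost:
            best_cost, best_hap = cost, hap
    return best_cost, best_hap
```

In a hybrid setting, a fragment could be either a long read covering neighbouring variants or a Strand-seq cell covering sparse variants along a whole chromosome; both are just rows of allele calls with confidence weights.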

This novel hybrid approach allows us to generate dense and extremely accurate chromosome-length haplotypes using as few as 10 Strand-seq cells combined with only 10-fold coverage PacBio data, paving the way to make haplotype-level genomics a reality.

Reconstructing tumour mutation histories from single-cell sequencing data

Katharina Jahn

CBG, BSSE, ETH Zurich

The mutational heterogeneity observed within tumours is a key obstacle to the development of efficient cancer therapies. A thorough understanding of subclonal tumour composition and the underlying mutational history is essential to open up the design of treatments tailored to individual patients. Recent advances in next-generation sequencing offer the possibility to analyse the evolutionary history of tumours at an unprecedented resolution, by sequencing single cells. This development poses a number of statistical challenges such as elevated noise rates due to allelic drop out, missing data and contamination with doublet samples.

We present SCITE, our probabilistic approach for reconstructing tumour mutation histories from single-cell sequencing data, with a focus on two recent extensions: the explicit modelling of doublet samples and a rigorous statistical test to identify the presence of parallel mutations and mutational loss.
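How noise such as allelic drop-out and missing data can enter a probabilistic tree score is sketched below in Python. The error model (independent false positives and false negatives per matrix entry) and the names `log_likelihood`, `alpha` and `beta` are assumptions made for this illustration and are not taken from the abstract; doublets are not modelled here.

```python
import numpy as np

def log_likelihood(observed, expected, alpha, beta):
    """Log-likelihood of an observed binary mutation matrix given the
    noise-free matrix implied by a candidate mutation history.

    alpha: false-positive rate, P(observe 1 | true 0)
    beta:  false-negative rate, P(observe 0 | true 1), e.g. allelic drop-out
    Entries of `observed` equal to -1 mark missing data and are skipped.
    """
    observed = np.asarray(observed)
    expected = np.asarray(expected)
    # Probability of observing a 1 in each entry under the error model.
    p_one = np.where(expected == 1, 1.0 - beta, alpha)
    prob = np.where(observed == 1, p_one, 1.0 - p_one)
    mask = observed != -1
    return float(np.sum(np.log(prob[mask])))
```

A search over candidate histories would compare this score across trees and cell attachments; the sketch only evaluates one candidate.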

Towards a statistical theory of ontogenetics

Geoffrey Schiebinger

Broad Institute

The past decade has witnessed rapid improvement of single cell DNA sequencing technology, with growth rates that rival the computer boom of the 1990s. These techniques are beginning to shed light on the variation that exists at the single cell level in the human body: while each cell in the body has the same basic information encoded in its DNA, different cell types use it differently by expressing different genes differently over time. For example, neurons express neurotransmitters, B cells express antibody genes, and beta cells in the pancreas express insulin.

This project develops a mathematical theory of ontogeny, the process by which multiple differentiated cell types develop from a progenitor cell, such as a stem cell or fertilized egg. Just as phylogenetics reveals the relationship between species over evolutionary time, ontogenetics traces cell fates through the course of development. While phylogenetics is mature and is well studied within mathematics and statistics, there is no comprehensive statistical treatment of ontogenetics. As the data begin to accumulate, an urgent need arises for methods to map out developmental trajectories.

We address this need by developing mathematically rigorous methods to reconstruct ontogenetic trees from gene expression profiles. Our methods are based on mathematical tools from optimization and statistics, including optimal transport and mixing times of Markov chains. We analyze our methods in the context of a time-varying nonparametric mixture model.
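As an illustration of how optimal transport can couple cells sampled at two time points, here is a minimal Python sketch of entropy-regularised transport using Sinkhorn iterations. The entropic regularisation, the function name `sinkhorn` and the parameters `epsilon` and `n_iter` are assumptions for this example; the abstract does not say which formulation the authors use.

```python
import numpy as np

def sinkhorn(cost, a, b, epsilon=1.0, n_iter=500):
    """Entropy-regularised optimal transport plan between two discrete distributions.

    cost: (n, m) matrix, e.g. squared distance between expression profiles
    a, b: probability weights of the early and late cells (each sums to 1)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    K = np.exp(-np.asarray(cost, dtype=float) / epsilon)  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # rescale to match the late-time marginal
        u = a / (K @ v)     # rescale to match the early-time marginal
    return u[:, None] * K * v[None, :]
```

Row i of the returned plan can be read as a distribution over likely descendants of early cell i, which is the kind of coupling from which trajectories could then be assembled.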

By carefully examining the trajectories upstream of branch points, we uncover some of the genetic mechanisms responsible for differentiation. A better understanding of how normal development proceeds could help us understand what goes wrong in disease states such as cancer. Moreover, a deep understanding of the process by which stem cells differentiate to heal wounds or regenerate tissues could ultimately lead to new therapies and have a profound impact on human health.

Whole organism lineage tracing with genome editing

James Gagnon

Harvard University

Multicellular organisms develop from single cells by way of a lineage. However, current lineage tracing approaches scale poorly to whole, complex organisms. I have developed a lineage tracing method, GESTALT, that uses genome editing to progressively introduce and accumulate mutations in a DNA barcode. Lineage relationships can be reconstructed from the patterns of mutations shared between cells. In zebrafish, I generated thousands of lineage-informative alleles (“barcodes”) in single animals through CRISPR/Cas9 editing of a genetic target array. These barcodes are fixed in the genome of all daughter cells and describe relationships between embryonic progenitor cells and their fate in larval and adult tissues. Most cells in adult organs derive from relatively few embryonic progenitors. This phenomenon of clonal restriction occurs after embryogenesis, perhaps during mysterious mechanisms of tissue homeostasis. I will finally discuss my ongoing work connecting lineage and cell identity using single-cell RNAseq with the goals of understanding cell fate acquisition and stem cell dynamics.

Sparse Gamma/Poisson PCA to unravel the genomic diversity of single-cell expression data

Laurent Modolo

CNRS - Lyon

The development of high throughput single-cell technologies now allows the investigation of the genome-wide diversity of transcription. This diversity has shown two faces. First, the expression dynamics (gene to gene variability) can be quantified more accurately, thanks to the measurement of lowly-expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time/level. Those emerging patterns appear to be very challenging from the statistical point of view, especially to represent and to provide a summarized view of single-cell expression data. PCA is one of the most powerful frameworks to provide a suitable representation of high dimensional datasets, by searching for new axes catching the most variability in the data. Unfortunately, classical PCA is based on Euclidean distances and projections that work poorly in presence of over-dispersed counts that show zero-inflation. We propose a probabilistic PCA for single-cell expression data that relies on a sparse Gamma-Poisson model. This hierarchical model is inferred using a variational EM algorithm, and we revisit the selection of the number of axes using an integrated likelihood criterion. We show how this probabilistic framework induces a geometry that is suitable for single-cell data, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared with other standard representation methods like tSNE, and we illustrate its performance on a project that is based on transcriptomic data of CD8+ T cells. Understanding the mechanisms of an adaptive immune response is of great interest for the creation of new vaccines. We show that our method allows a better understanding of the transcriptomic diversity of T cells, which constitutes a new challenge to better characterize the short and long-term response to vaccination.

Posters

ProSolo - Discovering and Typing Genetic Variants in Single Cells

Alexander Schonhuth

Centrum Wiskunde & Informatica, Amsterdam

Calling and genotyping genetic variants from single cell sequencing experiments poses a variety of novel statistical challenges. The major source of biases is obvious: the initial amplification step, required for generating sufficient amounts of DNA for sequencing. Dealing with these biases in the discovery and genotyping of genetic variants is statistically involved and leads to computational issues that have not yet been comprehensively addressed. Exemplary questions requiring new answers are: (1) how to robustly genotype in the presence of differentially amplified alleles, (2) how to resolve ambiguously mapped reads, (3) how to identify amplification errors and (4) how to integrate several single cells and bulk sequencing experiments into a comprehensive differential analysis of genetic variants.

Here, we present PROSOLO, a latent variable framework that provides sound answers to these questions. One key observation made use of by PROSOLO is that all read-based observations relating to a putative genetic variant are conditionally independent given the (unknown) allele frequency after amplification. This crucial insight reveals that the runtime of the inherent algorithmic problem depends linearly, and not exponentially, on the number of reads aligned to the region of interest. Overcoming this crucial computational bottleneck allows for a substantially refined analysis of single cell genetic variants and provides answers to the questions listed above. In addition, PROSOLO quantifies the inherent uncertainties and allows the determination of sequential motifs leading to amplification errors.
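The computational consequence of conditional independence can be shown with a small Python sketch: once the allele frequency f is fixed, the likelihood factorises over reads, so the log-likelihood is a single sum over the reads. This is an illustration only and not the PROSOLO model; the per-read error model, the uniform grid prior and the function name `allele_frequency_posterior` are assumptions.

```python
import numpy as np

def allele_frequency_posterior(is_alt, error_prob, grid_size=101):
    """Posterior over the post-amplification alt-allele frequency f on a grid.

    is_alt:     boolean per read, True if the read shows the alternative allele
    error_prob: per-read probability (> 0) that the observed base is wrong
    """
    is_alt = np.asarray(is_alt, dtype=bool)
    error_prob = np.asarray(error_prob, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)
    f = grid[None, :]
    e = error_prob[:, None]
    # P(read shows alt | f) for every (read, f) pair.
    p_alt = f * (1.0 - e) + (1.0 - f) * e
    p_read = np.where(is_alt[:, None], p_alt, 1.0 - p_alt)
    # Conditional independence given f: log-likelihoods add up over the reads.
    log_lik = np.log(p_read).sum(axis=0)
    post = np.exp(log_lik - log_lik.max())
    return grid, post / post.sum()  # uniform prior over the grid
```

The cost is proportional to the number of reads times the number of grid points, with no enumeration over joint configurations of the reads.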

We applied PROSOLO to a large collection of single human blood cells and demonstrate the ability to call cell-specific single nucleotide variants for those cells at favorable performance rates. We further demonstrate how to identify amplification-induced nucleotide errors and how to integrate them into variant calling pipelines. Finally, our analyses yield interesting insights into hematopoiesis.

Joint with David Laehnemann, Alice McHardy, Helmholtz Center for Infection Research; Johannes Koster, Dana Farber, Harvard U.

Parameter estimation of nonlinear mixed effects models by two-stage approaches

Hans-Michael Kaltenbach

ETH Zurich

Until recently, modeling of cellular signaling largely relied on cell population data and corresponding ODE-based models described the average behavior of a population. Advances in live-cell imaging technology with fluorescent probes now allow simultaneous long-term recording of single-cell data for hundreds of cells at densely sampled time-points.

Nonlinear mixed effects models (NLMEs) provide a statistical approach to extend ODE-based models by maintaining a single deterministic model for all cells, with random effects (partially) explaining cell-to-cell heterogeneity of the observed responses by cell-specific values of parameters and initial conditions.

However, parameter estimation in NLMEs is notoriously difficult and existing approaches are tailored toward situations with few individuals and few observations per individual, typical for pharmacokinetic/pharmacodynamic (PK/PD) studies. For single-cell data, traditional approaches such as first-order conditional linearization are problematic due to the severe nonlinearity of the mean model (given as the solution of a system of ODEs), while current approaches such as stochastic approximation EM (SAEM) algorithms have heavy computational demands.

We show that for typical single-cell microscopy data, two-stage approaches that first independently estimate parameters for each individual cell and then combine these estimates and their uncertainties to yield the final NLME parameters become attractive alternatives to handle NLME models. Using a recent systems biology model of cell signaling, we demonstrate that the resulting estimates are comparable to SAEM in precision and accuracy, but computations are typically several times faster. Moreover, two-stage methods are conceptually very simple, easy to implement, can easily be adapted to incorporate, e.g., robust estimation procedures at the individual cell and the population parameter level, and their computationally more demanding first stage is easily parallelizable.
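The two-stage idea can be sketched in Python on a toy model. This is a hedged illustration, not the authors' procedure: the ODE is assumed to be simple exponential decay, dy/dt = -k y, so that stage one reduces to a log-linear fit, and stage two uses a simple method-of-moments combination. The function names `fit_cell` and `two_stage` are hypothetical.

```python
import numpy as np

def fit_cell(t, y):
    """Stage 1: fit log y = log A - k t for one cell.

    Returns the estimate of k and its squared standard error.
    """
    X = np.column_stack([np.ones_like(t), -t])
    coef, res, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    sigma2 = res[0] / (len(t) - 2)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of (log A, k)
    return coef[1], cov[1, 1]

def two_stage(t, trajectories):
    """Stage 2: combine per-cell estimates into a population mean and variance of k."""
    k, se2 = np.array([fit_cell(t, y) for y in trajectories]).T
    # Observed spread = cell-to-cell spread + estimation noise; subtract the latter.
    omega2 = max(k.var(ddof=1) - se2.mean(), 0.0)
    # Cells with uncertain estimates contribute less to the population mean.
    w = 1.0 / (omega2 + se2)
    return np.sum(w * k) / np.sum(w), omega2
```

Each call to `fit_cell` is independent of the others, which is why the first stage parallelises trivially over cells.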

PCA for zero-inflated negative binomial data

Jean-Philippe Vert

ENS Paris

Single-cell gene expression data are characterized by a large number of zero counts due to drop-out, and over-dispersion of non-zero counts. We propose a framework to fit a linear model, with or without latent factors, to such count data modeled by a zero-inflated negative binomial distribution. This allows us to perform, for example, principal component analysis (PCA) while correcting for batch effects or sequencing depth on single-cell gene expression data. I will show how the model allows us to better differentiate biological fluctuations from technical fluctuations, which are often confounded when a more naive approach is used, such as performing a standard PCA on the log counts. This is joint work with Davide Risso, Svetlana Gribkova, Fanny Perraudeau and Sandrine Dudoit.
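For readers unfamiliar with the distribution, the zero-inflated negative binomial probability mass function can be written out in a few lines of Python. The parameterisation by mean `mu`, inverse dispersion `theta` and zero-inflation probability `pi` is one common convention and is an assumption here; in the model described above these would depend on covariates and latent factors, which the sketch leaves out.

```python
import math

def zinb_pmf(y, mu, theta, pi):
    """P(Y = y) for a zero-inflated negative binomial.

    mu:    mean of the negative binomial component
    theta: inverse dispersion (variance = mu + mu**2 / theta)
    pi:    probability of an excess zero (e.g. drop-out)
    """
    log_nb = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
              + theta * math.log(theta / (theta + mu))
              + y * math.log(mu / (theta + mu)))
    nb = math.exp(log_nb)
    # Zeros come either from the excess-zero component or from the count model.
    return pi * (y == 0) + (1.0 - pi) * nb
```

Setting `pi` to zero recovers the plain negative binomial, and letting `theta` grow large approaches a Poisson distribution.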

Disentangling the different sources of variation in multi-omics single-cell sequencing

Ricard Argelaguet

European Bioinformatics Institute

Single-cell RNA sequencing is becoming a well-established routine that is revolutionising our understanding of cellular phenotypes. Interestingly, other data modalities are also starting to be assayed at the single-cell level, including epigenetics, proteomics and metabolomics, raising the question of how to jointly analyse these sets of complex high-dimensional data using a statistically rigorous framework. Here we present single-cell Group Factor Analysis (scGFA), a generalisation of traditional factor analysis that integrates multi-omics data sets and is suited to the analysis of noisy single-cell sequencing data. scGFA calculates a low dimensional representation of the data which hopefully captures an inherent structure that might be masked by the noisy high-dimensional representation. Furthermore, it disentangles the different sources of variation and calculates whether they are unique to a single omics data set or shared by multiple data sets, thereby revealing hidden sources of covariation. We applied scGFA to a recent data set of 61 embryonic stem cells generated by a technology called scMT-seq, a recent method which uses single-cell genome-wide bisulfite sequencing and RNA sequencing to perform a parallel profiling of the DNA methylation and the gene expression in single cells. Our results show the existence of several axes of variation related to known biological processes and suggest the existence of three subpopulations that are associated with different pluripotency potential and genome-wide methylation rate.

Bead based compensation to correct for channel crosstalk in mass cytometry

Helena Crowell

University of Zurich

By addressing the limit on measurable fluorescent parameters imposed by instrumentation and spectral overlap, mass cytometry (CyTOF) uses heavy metal spectrometry to allow examination of up to 100 parameters at the single cell level. While spectral overlap is significantly less pronounced in CyTOF than in flow cytometry, spillover due to detection sensitivity, isotopic impurities, and oxide formation can impede data interpretability. We designed CATALYST (Cytometry dATa anALYSis Tools) to provide tools for preprocessing and analysis of cytometry data, including compensation and, in particular, an improved implementation of the single-cell deconvolution (SCD) algorithm for debarcoding and doublet-removal (Zunder et al. 2015, Nature Protocols 10, 316-333).

The CATALYST R package is available on Github and will be submitted to Bioconductor for review shortly. Currently, CATALYST provides a user-friendly R implementation of the SCD algorithm, and a function for estimating a compensation matrix from a priori identified single positive populations, which may be preceded by estimation of an optimal trim value that minimizes the sum of population- and channel-wise squared medians upon compensation. The matrix returned by this workflow may be directly applied to the measurement data or exported, e.g. to FlowJo or Cytobank.
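What applying a compensation matrix to the measurement data amounts to can be sketched in Python, assuming a purely linear spillover model. This is not the CATALYST code (an R package); the function name `compensate` and the orientation of the spillover matrix are assumptions for this example.

```python
import numpy as np

def compensate(measured, spillover):
    """Undo linear channel crosstalk under the model measured = true @ spillover.

    measured:  (n_cells, n_channels) intensities
    spillover: (n_channels, n_channels); entry [i, j] is the fraction of
               channel i's signal that is detected in channel j (diagonal = 1)
    """
    # Solve true @ spillover = measured without forming the inverse explicitly.
    return np.linalg.solve(spillover.T, measured.T).T
```

In this sketch the spillover matrix is given; estimating it from single positive populations is the harder step described above and is not shown.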

We have demonstrated that spillover estimates are to a great extent panel-specific, thereby eliminating the need for single-stain controls in each measurement. Moreover, removal of spillover artefacts will considerably affect downstream data interpretation, for example by increasing correlation between channels using the same antibody and decreasing correlation between cross-talking channels.

Additionally, we foresee the CATALYST R package as a collector for future developments in CyTOF data processing, such as statistical methods for differential discovery.

Studying the genotypic and phenotypic evolution of tumours using single cell RNAseq

Edith Ross

Cancer Research UK Cambridge Institute, University of Cambridge

Tumour evolution leads to genetic intra-tumour heterogeneity, which poses major challenges to cancer therapy. While tumour heterogeneity has been documented in several cases, many details of the underlying evolutionary processes are still unknown.

Recent advances in single-cell sequencing technologies have triggered the development of phylogenetic methods for single cell data that take into account the noise that is inherent to this type of data and promise to reveal tumour heterogeneity at a much higher resolution. This includes our own method oncoNEM (oncological Nested Effects Models), which is based on the nested structure of mutations observed between cells and jointly infers the tree structure, the number of clones and their composition.

So far these methods have only been applied to data derived from DNA sequencing. Here, we present the results of inferring tumour phylogenies from single cell RNAseq. Using RNA instead of DNA offers the opportunity to combine the phylogenetic insights with the gene expression of cells. Since the selective forces that shape evolution act on the phenotype not the genotype, this is an important step towards understanding tumour evolution. After inferring the phylogenetic trees with oncoNEM, we correlated the gene expression patterns found in the sequencing data with the structure of the tree to identify phenotypic similarities and differences between the clones. In this talk we will present the results of two case studies.

Evolutionary history of circulating tumor cells

Ewa Szczurek

Institute of Informatics, Faculty of Informatics, Mathematics and Mechanics, University of Warsaw

In this work, we focus on the phylogeny of circulating tumor cells (CTCs). From available single cell sequencing data of CTCs, we infer phylogenetic models and apply coalescent theory to speculate about the principles of the evolutionary process that generated their genomic sequences.

Beyond comparisons of means: understanding changes in gene expression at the single-cell level

Catalina Vallejos

Alan Turing Institute and UCL

Traditional differential expression tools are limited to detecting changes in overall expression, and fail to uncover the rich information provided by single-cell level data sets. We present a Bayesian hierarchical model that builds upon BASiCS to study changes that lie beyond comparisons of means, incorporating built-in normalization and quantifying technical artifacts by borrowing information from spike-in genes. Using a probabilistic approach, we highlight genes undergoing changes in cell-to-cell heterogeneity but whose overall expression remains unchanged. Control experiments validate our method’s performance and a case study suggests that novel biological insights can be revealed. Our method is implemented in R and available at https://github.com/catavallejos/BASiCS.

Single-cell data reveals widespread recurrence of mutational hits in the life histories of tumors

Jack Kuipers

CBG, BSSE, ETH Zurich

Intra-tumor heterogeneity poses substantial challenges for cancer treatment. A tumor’s composition can be deduced by reconstructing its mutational history. Central to current approaches is the infinite sites assumption that every genomic position can only mutate once over the lifetime of a tumor. The validity of this assumption has never been quantitatively assessed. We developed a rigorous statistical framework to test the infinite sites assumption with single-cell sequencing data. Our framework accounts for the high noise and contamination present in such data. We found strong evidence for the same genomic position being mutationally affected multiple times in individual tumors for 8 out of 9 single-cell sequencing datasets from a variety of human cancers. Six cases involved the loss of earlier mutations, five of which occurred at sites unaffected by large scale genomic deletions. Two cases exhibited parallel mutation, including the dataset with the strongest evidence of recurrence, indicating convergent evolution at the base pair level. Our results refute the general validity of the infinite sites assumption and indicate that more complex models are needed to adequately quantify intra-tumor heterogeneity for more effective cancer treatment.
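For intuition about what the infinite sites assumption implies: in error-free binary data with an unmutated ancestral state, no pair of sites can show all three patterns (1,0), (0,1) and (1,1) across cells. The sketch below implements only this classical pairwise check. It is not the statistical framework described in the abstract, which additionally has to account for noise and contamination, and all names in it are hypothetical.

```python
from itertools import combinations


def conflicting_site_pairs(matrix):
    """Return pairs of sites (columns) that violate the three-gamete rule.

    matrix: list of cells, each a list of 0/1 mutation calls per site.
    Assumes error-free calls and ancestral state 0 at every site.
    """
    n_sites = len(matrix[0]) if matrix else 0
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        patterns = {(cell[i], cell[j]) for cell in matrix}
        # Under infinite sites, two mutated clades are nested or disjoint,
        # so (1,0), (0,1) and (1,1) cannot all be observed together.
        if {(1, 0), (0, 1), (1, 1)} <= patterns:
            conflicts.append((i, j))
    return conflicts


# Toy genotype matrix (4 cells x 3 sites), invented for illustration.
cells = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
print(conflicting_site_pairs(cells))
```

With real single-cell data, false positives, false negatives and allelic dropout would produce many spurious conflicts, which is why a noise-aware statistical test is needed.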

Single-cell mutation calling via phylogenetic tree inference

Jochen Singer

CBG, BSSE, ETH Zurich

Understanding the evolution and dynamics of cancer is a crucial step towards the development of appropriate cancer therapies. This is a challenging task because cancers evolve as heterogeneous tumor populations with an unknown number of subclones of varying frequencies. Typically, insights are gained through bulk sequencing. However, since mutations cannot be directly assigned to subclones in bulk data, the subclonal information needs to be deconvolved. This deconvolution is based on mutation frequencies, which makes nested and similarly sized subclones hard to identify. In contrast, single cell sequencing information provides a direct assignment of mutations to single cells. Here, the major challenges are the elevated error rates, allelic dropout and uneven coverage compared to traditional bulk sequencing data. To robustly account for these sources of noise, we first identify sites which are likely to show a mutation in at least one cell. This is achieved by efficiently computing the probabilities of all possible mutation combinations across cells. Then the phylogeny of the tumor is computed with a stochastic search over the possible tree space via a Markov chain Monte Carlo scheme. In addition to offering a maximum likelihood phylogeny and a mutation to cell assignment, we provide a confidence for the mutation calls by sampling from the posterior. In contrast to existing methods, by using evolutionary information of the tumor tissue our approach enables us to reliably call mutations for a single cell even in the absence of sequencing information, as we demonstrate on several data sets.

Employing a Mixture Nested Effects Model to account for the variance of a mixed population of single cells.

Martin Pirkl

CBG, BSSE, ETH Zurich

New technologies allow for the elaborate measurement of different traits of single cells. These data promise the opportunity to elucidate causal intra-cellular mechanisms in unprecedented detail. Such insights help us learn not only how cells generally function, but also why and at what point they cease to function properly or, in the worst case, become de-regulated. That de-regulation can lead to life-threatening diseases like cancer. The battle against those diseases benefits from our understanding of cellular function on the single cell level. We assume that all cells harbor the same underlying signaling pathways. However, the data usually produced show only a snapshot of the pathway at a different time for each single cell, leading to a high variance in the observed multi-trait phenotypes. We employ a mixture model to estimate the snapshot of the underlying graph for each cluster of cells from the whole population and combine the family of inferred networks into one consensus signaling pathway with detailed logical structures. That information can be inspected further to advance our understanding of cellular function.

Combining population and single-cell RNA-Seq to investigate determinants of successful HIV expression reactivation

Monica Golumbeanu

CBG, BSSE, ETH Zurich

HIV establishes latency in a minority of cells that are infected. These cells represent one reservoir from where the virus can reinitiate new rounds of infection and are currently considered a major obstacle to HIV cure. The switch between HIV latency and HIV production from infected cells is at the center of HIV eradication strategies, including the “shock and kill” strategy that uses pharmacological and immunological agents to purge this reservoir. To date, it has been shown that the infected cells are not equal and that HIV gene expression depends on multiple parameters. These include the viral integration site location within the host cell genome and the host cell protein composition, which is affected by the cellular activation state and by environmental conditions. To investigate heterogeneity in the transcriptional reprogramming during HIV latency and reactivation, primary human CD4+ T cells were infected with an HIV-based vector and allowed to return to a resting, latent state for about 4 weeks. Subsequently, latently infected cells were exposed to either SAHA, a histone deacetylase inhibitor, or to anti-CD3/anti-CD8 antibodies mimicking antigen-mediated T-cell receptor stimulation. RNA-Seq from bulk as well as Fluidigm-isolated single cells was performed for each condition. We explored transcriptional heterogeneity of HIV and host cell gene expression using various statistical models designed for bulk and single-cell data analysis in order to identify subpopulations of cells based on their differential expression profiles. This type of analysis aims to identify transcriptional programs leading to successful reactivation of HIV expression, and to facilitate screening of future reactivating agents.

Exploiting heterogeneity in single-cell transcriptomic analyses: how to move past comparisons of averages

Keegan Korthauer

Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health

The ability to quantify cellular heterogeneity is a major advantage of single-cell technologies. Specifically, it is now possible to elucidate gene expression dynamics that were invisible using bulk RNA-seq, such as the presence of distinct expression states. However, statistical methods often treat cellular heterogeneity as a nuisance. I will present a novel method to characterize differences in expression in the presence of distinct expression states within and among biological conditions. I will demonstrate that this framework can detect differential expression patterns under a wide range of settings. Compared to existing approaches, this method has higher power to detect subtle differences in gene expression distributions that are more complex than a mean shift, and can characterize those differences. The R package scDD implements the approach, and is available on Bioconductor.

Unraveling Cortical Development Using Population and Single-cell RNA-Seq Data

Zahra Karimaddini

Department of Biosystems Science and Engineering, ETH Zurich, Switzerland

The brain, as part of the central nervous system, is the most complex organ in the mammalian body and the mechanisms that regulate its development are poorly understood. During brain development, neural stem cells (NSCs) generate thousands of different neuronal subtypes that are organized in precise and functionally distinct layers of the cerebral cortex. This process is a prerequisite for normal brain functions and any deviation from the standard developmental path can lead to debilitating brain disorders. To unravel these mechanisms, we study changes in the expression of transcription factors and signaling components in NSCs and progenitor populations (NeuroStemX, SystemsX.ch). To this end, we use population and single-cell RNA sequencing of each population at daily intervals during mouse cortical development, obtaining a data set containing more than 100 population samples and more than 1000 single cells. This enabled us to identify a set of novel genes that characterizes NSCs and progenitor cells at the population and single cell level at distinct stages of brain development. Using machine learning methods, we identified a continuous differentiation path and, from this, determined different transcriptional states. Remarkably, we can show that the single cells follow a similar differentiation path to that predicted from transcriptional analysis at the population level. In addition, the single cells can be divided into subpopulations that emerge over time. To summarize, our gene expression data of different cell types at the population and single-cell level for daily intervals during neurogenesis, combined with the appropriate data analyses, give an unprecedented insight into the complex process of stem cell patterning and fate decision making in early brain development.

Unsupervised identification of multifurcations in high-dimensional single cell data with TreeTop

Will Macnair

ETH IMSB

Branching processes, including hematopoiesis and clonal evolution of cancer, are an important area of study. Recent advances in single-cell technology such as mass cytometry and single-cell RNAseq permit branching processes to be measured with a high level of dimensionality, via simultaneous measurement of tens to thousands of species. However, approaches to date have been restricted to points where at most three branches meet, and are limited to datasets which are well-represented by trees.

By fitting an ensemble of trees to the data and seeking points which partition these trees consistently, our algorithm TreeTop identifies points of bifurcation and their corresponding branches. TreeTop then allows analysis of transitions undergone during such branching processes, in terms of evolution of markers and transitional gated cell types.

TreeTop represents a cell population by an ensemble of trees, which connect random points evenly distributed through the dataset to represent subpopulations. Connections in each tree represent the range of possible transitions between these subpopulations. The ensemble of trees is visualized via a force-directed graph layout algorithm. To identify points of bifurcation within this ensemble of trees, we define a point of bifurcation as a point where three or more branches of a process meet. Cutting at such points induces partitions of nodes into branches which are consistent across the ensemble of trees. TreeTop uses a score evaluating the consistency of such partitions to identify points of bifurcation and their corresponding branches.

We have successfully applied TreeTop to datasets representing varying topologies: simply branching (T cell development in the thymus), circular (cell cycle) and multifurcating (healthy human bone marrow). The points of bifurcation identified fit with biological expectations. TreeTop also identifies multiple layers of bifurcation within synthetic data sampled from a hierarchy of bifurcations. TreeTop permits points of bifurcation to be explored in less well-characterized datasets.

Accurate modeling and correction of GC content and gene length bias of single-cell RNA-seq

Hubert Rehrauer

ETH Zurich and University of Zurich

There are very diverse single cell library preparation protocols in practical use that provide either 3’-end or full-length transcript coverage. Next to the obvious differences implied by these two strategies, we observe GC content and gene length effects that are specific to the individual protocols. Additionally, we also observe cell-to-cell variations within individual protocols, and for the joint analysis of heterogeneous datasets a library bias correction (LBC) is needed. While several approaches exist that deal with GC content and gene length dependent biases in RNA-seq based transcript abundances, they are not satisfactory, for three reasons. First, they do not consider the potential interdependence and interaction of both effects. Second, they are prone to introduce artifacts for genes that have partially or entirely zero counts. Third, the biases may affect a large fraction of genes such that the assumed global normalization schemes are invalid. With our work, we present a novel method for library bias correction with a true two-dimensional function that simultaneously depends on GC content and gene length. Our approach is unique in modeling not the absolute biases but the sample-specific deviations from the dataset-wide bias. The computed correction factors are precise even in the case of widespread zero counts. The presented method is useful for correcting data sets with subsets of deviating cells as well as for the joint analysis of different data sets generated with different biases.

scEM: An EM algorithm for Differential Expression Analysis of Single-Cell RNA-seq

Agus Salim

La Trobe University and Walter and Eliza Hall Institute of Medical Research

We developed an EM algorithm to perform differential expression analysis of single-cell RNA-seq data, specifically UMI-based data. For a cell type, the unobserved number of molecules for a particular gene follows a zero-inflated negative binomial (ZINB) distribution with mean parameter proportional to the size factor parameter that measures differences in starting materials across cells. The dropout phenomenon, in which only a fraction of molecules is observed due to low capture efficiency, is modelled using logistic regression whose parameters are estimated using external spike-ins. Simulation studies demonstrated that our approach has better sensitivity (at similar specificity) when compared to SCDE, MAST and BPSC. The log fold-change estimates are also robust when capture efficiency in endogenous RNA is higher than in the external spike-ins, as long as the ratio of the two capture efficiencies is fixed across cells. We demonstrated the potential applicability of our method by applying it to analyse differential expression of mouse blood cells at different stages of development.
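As a minimal illustration of the ZINB distribution named above, the sketch below evaluates its probability mass function under one common parameterization (mean, dispersion and zero-inflation weight), with the mean taken as a size factor times a gene-level rate. The parameterization, the numerical values and the function names are assumptions made for illustration; the sketch covers neither the EM algorithm nor the logistic dropout model of the abstract.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, size, pi_zero):
    """Zero-inflated NB: an extra point mass pi_zero is placed at zero."""
    base = (1.0 - pi_zero) * nb_pmf(y, mu, size)
    return pi_zero + base if y == 0 else base


# Mean proportional to a cell-specific size factor (values are made up).
size_factor, gene_rate = 1.5, 4.0
mu = size_factor * gene_rate
total = sum(zinb_pmf(y, mu, size=2.0, pi_zero=0.3) for y in range(2000))
print(total)                          # numerically close to 1
print(zinb_pmf(0, mu, 2.0, 0.3))      # inflated probability of a zero count
```

The zero-inflation weight raises the probability of observing zero above what the negative binomial alone would give, which is the feature that motivates ZINB models for sparse count data.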

DNA Methylation Dynamics of Human Hematopoietic Stem Cell Differentiation

Fabian Müller

Max Planck Institute for Informatics, Saarbrücken, Germany

Although virtually all cells in an organism share the same genome, regulatory mechanisms give rise to hundreds of different, highly specialized cell types. These mechanisms are governed by epigenetic patterns, such as DNA methylation, which determine DNA packaging, spatial organization, interactions with regulatory enzymes as well as RNA expression and which ultimately reflect the state of each individual cell. Using low-input and single-cell whole genome bisulfite sequencing, we generated genome-wide DNA methylation maps of blood stem and progenitor cells [1]. These maps enabled us to characterize cell-type heterogeneity. By aggregating methylation levels of small pools of cells and single cells across putative regulatory regions, we dissected the DNA methylation dynamics of human hematopoiesis. We observed lineage-specific DNA methylation patterns between myeloid and lymphoid progenitors and associated these patterns with regulatory elements, gene expression and chromatin accessibility. Using statistical learning, we were able to accurately infer cell types from DNA methylation signatures, and the resulting models could be used for a data-driven reconstruction of the human hematopoietic system. Our observations illustrate the power of DNA methylation analysis for the in vivo dissection of differentiation landscapes as a complementary approach to lineage tracing and in vitro differentiation assays. The generated methylome maps and analysis methods provide a comprehensive framework for studying epigenetic regulation of cell differentiation and blood-linked diseases.

[1] Farlik, M., Halbritter, F., Müller, F., et al. (2016). DNA Methylation Dynamics of Human Hematopoietic Stem Cell Differentiation. Cell Stem Cell, 19(6), 808-822. http://doi.org/10.1016/j.stem.2016.10.019

Statistical methods for differential discovery in multi-dimensional flow and mass cytometry (CyTOF) data

Lukas Weber

Institute of Molecular Life Sciences, University of Zurich

Recent technological advances in multi-dimensional flow cytometry and mass cytometry (CyTOF) enable the measurement of up to 50 protein marker expression levels per cell, allowing cell populations to be characterized in unprecedented detail. Due to the high dimensionality, significant efforts are underway to develop automated data analysis methods to replace traditional “manual gating” or visual inspection of projected data. Differential discovery experiments aim to detect biological features that vary consistently between biological samples in different conditions, such as diseased and healthy. For example, in cytometry, it may be of interest to detect differentially abundant cell populations, or differential expression of functional markers within specific cell populations, especially rare populations. We have tested or adapted multiple new as well as existing methods for performing differential discovery analyses in large-scale, multi-dimensional, small-sample cytometry data. Our methods make use of a range of statistical techniques, including empirical Bayes moderation of variances and functional data analysis. In particular, we take into account information contained in biological replicate samples, which are crucial for statistical inference. Our methods are benchmarked against existing approaches, such as Citrus and CellCnn, using simulated and experimental data, and will be available as an R/Bioconductor package.
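The variance moderation mentioned above can be illustrated with the usual empirical Bayes posterior variance, a degrees-of-freedom-weighted average of each feature's sample variance and a prior variance. In this sketch the prior parameters are fixed by hand rather than estimated from the data, and all names and numbers are hypothetical; it is not taken from the authors' package.

```python
def moderated_variance(s2, df, prior_s2, prior_df):
    """Shrink a per-feature sample variance s2 (with df degrees of freedom)
    toward a prior variance prior_s2 carrying prior_df degrees of freedom."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


# Sample variances of three features, each from 4 replicates (df = 3).
sample_variances = [0.2, 1.0, 4.0]
shrunk = [moderated_variance(s2, df=3, prior_s2=1.0, prior_df=3)
          for s2 in sample_variances]
print(shrunk)
```

With few replicates, extreme sample variances are pulled toward the prior value, which stabilizes test statistics in small-sample settings.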

Imputation of single-cell DNA methylation profiles by transferring information across cells

Chantriolnt Andreas Kapourani

University of Edinburgh, School of Informatics

New technologies enabling the measurement of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites, effectively limiting the analysis repertoire to a semi-quantitative level. Here we propose a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulphite sequencing data (scBS-seq). The method clusters individual cells based on the methylation profiles, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells. We show that the resulting imputation is highly accurate both on simulated and real data sets.

Joint work with Guido Sanguinetti

Latent Factor Analysis of scRNAseq

Daniel Wells

University of Oxford

A common challenge in the analysis of single cell RNA sequencing data is clustering of cells into biologically appropriate classes. We show that SDA (Sparse Decomposition of Arrays) is able to simultaneously identify cell types along with the genes which define them. SDA decomposes a digital gene expression matrix into components which have associated cell scores and gene loadings. In this framework a cell type can be represented as a component where the cell scores indicate which cells are members and the gene loadings indicate which genes are active. SDA uses a Bayesian framework with a sparsity prior on the gene loadings which facilitates interpretation of the cell classes in biological terms. We compare this method to other commonly applied techniques such as PCA (principal components analysis) and ZIFA (zero inflated factor analysis) using both real datasets and simulated data.

On the probability of differential distribution in single-cell RNA-seq

Michael Newton

University of Wisconsin, Madison

Accounting for the mixture of unlabeled cell populations underlying a sample of single-cell RNA-Seq profiles leads to natural structural constraints on the form of empirical Bayesian inference for changes in gene-level distributions of gene expression. We describe a computationally and statistically efficient model-based approach that delivers posterior probabilities of differential distribution through a novel discrete mixture defined jointly over expression changes per cellular type and over subtype cell proportions. We illustrate the computations both on synthetic data and on data from embryonic stem cell experiments. This is joint work with Xiuyu Wang and Christina Kendziorski.

Evaluation of methods for differential expression analysis of single-cell RNA-seq data on a collection of consistently processed public data sets

Charlotte Soneson

University of Zurich

As single-cell RNA-seq becomes increasingly widely used, the amount of publicly available data grows rapidly. This provides a useful resource for computational method development as well as for reanalysis and extension of published results. In public data repositories, typically both the raw data and a processed data set, used by the data generators for their analysis, are available. However, the procedure to obtain the processed data set is tailored to the specific application, and can be widely different between data sets. We present conquer, an open collection of consistently processed, analysis-ready single-cell RNA-seq data sets. For each data set, we provide count and TPM estimates for both genes and transcripts, as well as quality control and exploratory analysis reports to assist users in determining whether a particular data set is suitable for their purpose. To illustrate the usefulness of the repository, we use some of the data sets to perform an extensive evaluation of multiple methods for differential gene expression analysis. Several methods developed specifically for single-cell RNA-seq data were compared to existing methods developed for bulk RNA-seq differential expression analysis. Considerable differences were found between the behaviour of different methods, but also for the same method applied to different randomly selected subsets of a given data set. We further investigate the characteristics of the significant genes called with different methods, and show that gene filtering can have a substantial effect on the performance.

Discovering gene and drug connections via Image-based morphological profiling

Juan C. Caicedo

Broad Institute of MIT and Harvard

Microscopy images are now used as a source of quantitative information in a variety of biological experiments. Images acquired for studying phenotypes in cell biology capture high resolution morphological variations of cells. This morphological information can be extracted by transforming pixels into single cell measurements that encode the phenotypic changes of the experiment. The collection of single cell measurements (“profiles”) can be used to reveal similarities and differences between cell populations under different experimental conditions.

In our lab, we aim to make morphological information computable and reproducible through the development of open source software tools (such as CellProfiler and Cytominer) that can create rich quantitative profiles efficiently. In collaboration with biology labs, we use morphological profiling to address fundamental research questions such as identifying the functional impact of variants of a gene associated with cancer, identifying drugs to target diseases with known genetic basis, and uncovering novel functions for unannotated genes. These are just a few examples of the wide range of biological applications that can be approached using image-based morphological profiling. We expect image-based profiling and analysis to be powerful tools that complement well-established -omics methods to address these challenging questions.

A comparison of bioinformatics tools to analyze single-cell transcriptomes

Elisabetta Mereu

CNAG-CRG, Universitat Pompeu Fabra (UPF), Barcelona, Spain

Single cell RNA-seq has become a powerful method to explore cell-to-cell variability in transcriptome profiles with unprecedented resolution. However, single cell transcriptomics data contain many sources of technical noise (e.g. dropout events or batch effects), often hiding the real structure and challenging downstream analysis. To address this, many tools have been proposed to characterize cell heterogeneity using co-expressed gene sets and to identify differentially expressed transcripts. To evaluate the performance of such tools, we generated high quality single cell transcriptome datasets and analyzed cell heterogeneity using Pagoda, CleanCount and Seurat. To identify differentially expressed genes, we applied the statistical methods implemented in the scDE and scDD tools. We investigated the robustness of the results by comparing gene clusters and differentially expressed genes by their distribution of overlap sizes between methods. The stability of cell clustering has been assessed using bootstrap resampling, based on metrics such as Jaccard similarity coefficients. We assessed the strengths of the methods in terms of their sensitivity to detect subpopulations. We also critically discussed pipeline characteristics, such as input formats, handling of covariates, applicability and/or specific requirements and assumptions. Finally, we suggest an analysis strategy, combining these tools in order to optimize results and maximize information from single cell transcriptional profiles.
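The bootstrap-based stability assessment with Jaccard similarity coefficients can be sketched as follows. This is a generic illustration under assumptions, not the authors' pipeline: `recluster` is a hypothetical callable standing in for any clustering tool, and the toy data and parity-based clustering rule are invented.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def bootstrap_stability(cluster, all_cells, recluster, n_boot=100, seed=0):
    """Mean best-match Jaccard between a cluster and clusters of resamples.

    recluster: hypothetical callable mapping a list of cells to a list of sets.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Resample cells with replacement, then recluster the resample.
        sample = [rng.choice(all_cells) for _ in all_cells]
        target = set(cluster) & set(sample)
        scores.append(max(jaccard(target, c) for c in recluster(sample)))
    return sum(scores) / len(scores)


# Toy example: cells are integers, "clustering" splits them by parity.
cells = list(range(20))
by_parity = lambda s: [{c for c in s if c % 2 == 0},
                       {c for c in s if c % 2 == 1}]
even_cluster = {c for c in cells if c % 2 == 0}
print(jaccard({1, 2, 3}, {2, 3, 4}))
print(bootstrap_stability(even_cluster, cells, by_parity))
```

A cluster that is recovered in most resamples has a mean Jaccard score near 1, while unstable clusters score markedly lower.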

Quantifying punctuated equilibrium via stochastic modeling across cancer types

Simona Cristea

Dana-Farber Cancer Institute & Harvard School of Public Health

Understanding the evolutionary dynamics of cancers is an essential step towards clinical success. Recently, various studies have reported experimental and theoretical evidence of cancer progressing via relatively few instances of simultaneous genomic alterations (such as point mutations, copy number alterations or chromosomal rearrangements). This model of tumor evolution has been termed “punctuated”, as opposed to the “gradual”, or classical model, in which tumors sequentially accumulate genomic alterations across large periods of time. Here, we investigate these two paths of cancer progression via stochastic modeling, across various cancer types. We simulate tumor evolution by a multi-type branching process with various mutation rates and fitness distributions. After learning some of the model parameters from bulk sequencing data, we use single cell data to fit the two types of models. Based on these fits, we devise a new measure for punctuated equilibrium and propose various biological mechanisms that can explain the higher likelihood of punctuated evolution in various cancer types.
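A multi-type branching process of the kind mentioned above can be sketched in a few lines. The discrete generations, the binary division rule, the uniform fitness increments and every parameter value below are assumptions chosen for brevity; the abstract does not specify the authors' model in this detail.

```python
import random


def simulate_branching(generations, birth_prob, mutation_rate, fitness_step,
                       seed=0, max_cells=100_000):
    """Discrete-time multi-type branching process (illustrative only).

    Each cell divides into two daughters with its type's birth probability,
    otherwise it dies. Each daughter founds a new type with probability
    mutation_rate; the new type's birth probability is the parent's plus a
    uniform draw from [0, fitness_step], capped at 1.
    Returns ({type_id: cell count}, {type_id: birth probability}).
    """
    rng = random.Random(seed)
    birth = {0: birth_prob}
    counts = {0: 1}
    for _ in range(generations):
        new_counts = {}
        for type_id, n in counts.items():
            for _ in range(n):
                if rng.random() >= birth[type_id]:
                    continue  # cell dies without offspring
                for _ in range(2):
                    child = type_id
                    if rng.random() < mutation_rate:
                        child = len(birth)  # next unused type id
                        gain = rng.uniform(0.0, fitness_step)
                        birth[child] = min(1.0, birth[type_id] + gain)
                    new_counts[child] = new_counts.get(child, 0) + 1
        counts = new_counts
        if not counts or sum(counts.values()) > max_cells:
            break  # extinction, or population cap reached
    return counts, birth


counts, birth = simulate_branching(generations=12, birth_prob=0.6,
                                   mutation_rate=0.01, fitness_step=0.1)
print(len(birth), sum(counts.values()))
```

Varying the mutation rate and the fitness increments changes how quickly fitter types arise and take over, which is the kind of behaviour such simulations are used to compare against data.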

Hierarchical population model for multivariate single-cell data

Carolin Loos

Helmholtz Zentrum München - German Research Center for Environmental Health, Institute of Computational Biology, 85764 Neuherberg, Germany

Joint work with: Katharina Moeller, Fabian Fröhlich, Tim Hucho, Jan Hasenauer

A comprehensive understanding of biological systems requires the investigation of heterogeneity and its underlying sources and mechanisms. To elucidate cellular heterogeneity, mechanistic population models need to be calibrated to single-cell data, which are collected by e.g. flow cytometry or microscopy. However, the simulation and analysis of these models is challenging and therefore hinders parameter estimation and model selection. In particular, the study of cell populations that comprise multiple subpopulations remains an unsolved problem.

We present a data-driven modeling framework for heterogeneous cell populations, which combines mixture modeling with approaches for the approximation of distributions to account for several levels of heterogeneity. This method facilitates the detection of causal differences between subpopulations and between cells of the same subpopulation. Its computational efficiency allows the comparison of hundreds of competing hypotheses and thus enables a detailed study of biological processes.

We apply our method to artificial data and experimental data of NGF-induced Erk signaling in neuronal cell populations. Our approach elucidates the influence of the extracellular matrix on pain sensitization and is able to detect different levels and sources of variability. Our results suggest that the presented method will enable a better understanding of cellular heterogeneity.

Stochastic Profiling for mRNA-Seq Data

Lisa Amrhein

Helmholtz Zentrum München, Institute of Computational Biology

Acute Myeloid Leukemia (AML) is a type of blood cancer affecting the myeloid lineage. The incidence of AML increases with age, and it is the most frequent type of acute leukemia among adults. Although approximately 70% of patients achieve complete remission, very small numbers of leukemic cells remain and cannot be detected with current diagnostic techniques. Nearly everybody with AML will relapse in the end if no further postremission or consolidation therapy is given, and this relapse is almost always lethal.

AML patients frequently carry a mixture of different cancer cell types, so-called subclones, which evolve over time, so that the mixture at relapse is different from the one at diagnosis. Understanding clonal evolution and identifying rare subclones, especially for those mutations causing relapse, is still an open challenge.

We aim to parameterize transcriptional heterogeneity from mRNA-Seq counts taken from small groups of cells (e.g. 10 cells). To that end, we will extend our Stochastic Profiling Method previously proposed for microarray data. This technique infers single-cell regulatory states by mathematically deconvolving n-cell measurements. This averaging-and-deconvolution approach allows us to quantify single-cell regulatory heterogeneities while avoiding the technical measurement noise of single-cell techniques.
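The idea of deconvolving n-cell measurements can be illustrated with a toy likelihood. The sketch assumes that each cell is independently in a "high" or "low" regulatory state and contributes a Poisson count, so the pooled count is a binomial mixture of Poisson distributions. The Poisson choice, the two-state model and all names are assumptions made here to obtain a closed form; the abstract does not specify the count distribution used by the authors.

```python
import math


def poisson_pmf(y, lam):
    """Poisson pmf, with the degenerate case lam == 0 handled explicitly."""
    if lam == 0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def pooled_pmf(y, n_cells, p_high, lam_low, lam_high):
    """P(pooled count = y) for a pool of n_cells cells, each independently
    in the high state with probability p_high."""
    total = 0.0
    for k in range(n_cells + 1):  # k = number of high-state cells in the pool
        weight = (math.comb(n_cells, k)
                  * p_high ** k * (1 - p_high) ** (n_cells - k))
        rate = k * lam_high + (n_cells - k) * lam_low
        total += weight * poisson_pmf(y, rate)
    return total


def log_likelihood(pooled_counts, n_cells, p_high, lam_low, lam_high):
    """Log-likelihood of independent pooled measurements."""
    return sum(math.log(pooled_pmf(y, n_cells, p_high, lam_low, lam_high))
               for y in pooled_counts)


# Invented 10-cell pool counts; compare two candidate high-state fractions.
counts = [12, 31, 9, 50, 28, 11]
for p in (0.05, 0.2):
    print(p, log_likelihood(counts, 10, p, lam_low=1.0, lam_high=20.0))
```

Maximizing such a likelihood over the state fraction and the state-specific rates recovers single-cell parameters from pooled data, which is the essence of the deconvolution step.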

Integrative analysis of single-cell expression data reveals distinct regulatory states in bidirectional promoters

Fatemeh Behjati

Max Planck Institute for Informatics

Bidirectional promoters (BPs) are prevalent in eukaryotic genomes. However, it is poorly understood how the cell integrates different epigenomic information, such as transcription factor (TF) binding and chromatin marks, to determine directionality of gene expression at BPs. Single cell sequencing technologies are revolutionizing genetics and this project focuses on the integration of single-cell RNA data with bulk ChIP-seq and other epigenetics data for which single cell technologies are not yet established. We utilized novel human single cell RNA-seq data to reveal clusters of BP genes exhibiting various states of directionality across individual cells. For instance, we found a cluster in which both genes of a BP are highly expressed in almost all single cells. However, some BP genes are expressed in an alternating manner, where the expression of one gene always dominates the other one, and vice versa, depending on the subpopulation of cells. We integrated other levels of genomic and epigenomic information to shed light on this previously unrecognized complexity in BP gene regulation. We found unique TF motifs and binding patterns associated with these expression states. Further, we stratified different Histone Modifications (HM) on these clusters. Despite the fact that the clusters are derived from the single cell data, the bulk HM and TF profiles were consistent with those states. It is an interesting research direction to try to deconvolute bulk-seq data with single-cell expression data, although many statistical challenges are as yet unexplored.


scmap: a tool for fast and accurate mapping of cells to a reference database using scRNA-seq data

Vladimir Kiselev

Sanger Institute

In recent years, technological developments have allowed researchers to collect up to 10^6 cells. Moreover, large scale projects such as the NIH BRAIN Initiative and the Human Cell Atlas aim to generate even larger datasets. However, analyzing and integrating large single-cell RNA-seq datasets remains challenging. Here, we present scmap, a tool for fast and accurate mapping of cells to a reference database for scRNA-seq data. We demonstrate that scmap can be used for integrating different publicly available datasets collected from different platforms and labs. Moreover, we show that scmap can be used for one of the central tasks of a cell atlas: projecting new samples (e.g. from a disease model) onto an existing reference.

Bayesian Unidimensional Scaling for latent ordering and uncertainty estimation

Lan Huong Nguyen

Stanford University

Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous process, often unknown. Estimating data points' 'natural ordering' and their corresponding uncertainties can help researchers draw insights about the mechanisms involved.

We introduce a Bayesian Unidimensional Scaling (BUDS) technique which extracts dominant sources of variation in high dimensional datasets and produces their visual data summaries, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their underlying trajectory, and provides estimated uncertainty bounds. By statistically modeling dissimilarities and applying a DiSTATIS method to their posterior samples, we are able to incorporate visualizations of uncertainties in the estimated data trajectory across different regions using confidence contours for individual data points. We also illustrate the estimated overall data density across different areas by including density clouds. One-dimensional coordinates recovered by BUDS help researchers discover sample attributes or covariates that are factors driving the main variability in a dataset. We demonstrated the usefulness and accuracy of BUDS on a set of published microbiome 16S and single cell RNA-seq data. Our method effectively recovers and visualizes natural orderings present in datasets. Automatic visualization tools for data exploration and analysis are available at: https://github.com/nlhuong/visTrajectory.

Supervised classification and post-stratification with scRNAseq

Andrew McDavid

University of Rochester

To reveal latent structure in single cell RNA sequencing (scRNAseq) experiments, many unsupervised clustering methods have been developed. In some cases, however, a subset of cells may have known population classifications, and these can serve to train a predictive model. In other cases, a historical experiment, perhaps of bulk data, provides labels and expression measurements to train a model which is hoped to generalize onto a new, unlabeled scRNAseq experiment. Here I consider various basis expansions tailored for scRNAseq data, and argue that these enhance the out-of-experiment predictive performance compared to use of the raw expression values. I also review the accuracy and calibration of several procedures for supervised classification, and ways in which classification uncertainty can be propagated onto downstream analysis.


Integrated single cell data analysis for understanding mechanisms of neuronal diversity

Jean Yang

University of Sydney

Technological advances such as large scale single cell transcriptome profiling have exploded in recent years and enabled unprecedented insight into the behavior of individual cells. Identifying genes with high levels of expression using data from single cell RNA sequencing can be useful to characterize very active genes and the cells in which this occurs. In particular, single cell RNA-Seq allows for cell-specific characterization of high gene expression, as well as gene coexpression. In this talk, I will describe a versatile modeling framework to identify transcriptional states, motivated by a neuronal single cell project.

Neuronal cell systems exhibit extraordinary levels of complexity. Thus it is of great interest to explore the ways in which this neuronal diversity is generated and manifested to achieve such complexity. One such mechanism is patterns of gene transcription across neurons. We will describe how a gamma-normal mixture model is used to identify active gene expression across cells; we then use these to characterise markers for olfactory sensory neuron cell maturity and to build cell-specific coactivation networks. We found that combined analysis of multiple datasets results in more known maturity markers being identified, as well as pointing towards some novel genes that may be involved in neuronal maturation. We also observed that the cell-specific coactivation networks of mature neurons tended to have a higher centralization network measure than those of immature neurons. Finally, we will describe an approach to evaluate evidence of gene transcriptional mosaics as a mechanism for achieving diversity of neuronal cells.

Single-Cell Phenotype Classification Using Deep Convolutional Neural Networks

Beate Sick

ZHAW

Deep learning methods are currently outperforming traditional state-of-the-art computer vision algorithms in diverse applications and recently even surpassed human performance in object recognition. A short introduction to deep learning and convolutional neural network models will be given. Then we demonstrate the potential of deep learning methods for high-content screening-based phenotype classification. We trained a deep learning classifier in the form of convolutional neural networks with approximately 40,000 publicly available single-cell images from samples treated with compounds from four classes known to lead to different phenotypes. The input data consisted of multichannel images. The construction of appropriate feature definitions was part of the training and carried out by the convolutional network, without the need for expert knowledge or handcrafted features. We compare our results against the recent state-of-the-art pipeline achieved by experts in the field and demonstrate a reduced classification error rate when using a deep learning approach.

Targeted Drop-seq: a hypothesis-driven approach to single-cell transcriptomics

Andreas Gschwind

EMBL Heidelberg

Gene expression and regulation are central questions in many fields of biology, ranging from applied medical to fundamental evolutionary biology. Transcriptomic technologies such as RNA-seq became the standard in the post-genomic era to analyze gene expression and how genes are regulated. Recent advances in single-cell RNA sequencing (scRNA-seq) make it possible to measure cell-to-cell differences in gene expression, which remain elusive in conventional bulk sequencing approaches. Droplet-based technologies such as Drop-seq are emerging as powerful tools for whole transcriptome sequencing at the single-cell level. They enable the analysis of an exceptionally high number of single cells at low cost; however, the high number of assessed transcripts can decrease the power to precisely measure expression changes of individual genes. We are developing an adapted version of the Drop-seq protocol for targeted transcriptomics, where the expression of a pre-defined set of transcripts is measured exclusively. With this, we expect an increase in the sensitivity with which changes in expression of the targeted transcripts can be measured compared to conventional Drop-seq at the same sequencing depth. This approach is suitable for a wide range of hypothesis-driven applications where a group of targets can be defined. Such sets might contain genes in a specific genomic region or genes involved in a biological process of interest. One promising application could be in conjunction with single-cell perturbation methods such as CRISPR, where multiple genomic elements can be targeted by specific sgRNAs in a pooled approach. Targeted Drop-seq could then be used to monitor the resulting expression changes of genes of interest in individual cells, making it possible to test a large number of regulatory hypotheses efficiently.

3D-organoid segmentation and drug-response testing with a deep neural network

Jan Sauer

German Cancer Research Center (DKFZ)

Early detection and treatment of colorectal cancer is vital to the long-term survival of a patient. The goal of this study is to explore the response to pharmaceuticals available to treat colorectal cancer in a patient-specific way, with the aim of making a treatment recommendation. Recently, organoids have been found to be useful tools to model organs and study drug-response phenotypes on a tissue level. 3D high-content microscopy screening of organoids grown from both human and mouse colon tissue, after treatment with various FDA approved compounds, promises to show how new drugs and drug combinations may affect the growth and survival of cancerous cells. The analysis of these 3D culture images requires the development of novel software capable of accurately segmenting the individual cells of these spheroids from the background. The problem is to detect partially overlapping objects of a specific shape but with largely varying diameter in the images. To achieve this, a deep neural network (DNN) is being developed for the segmentation and the subsequent feature extraction of the images. The design of the DNN allows the detection of edges of the organoids in its first layers. In the higher levels of the network, the edge map is used to detect the organoids and to determine their size. To train this network, a small number of annotated images is required. From these training samples, a large number of small image blobs are extracted that are used to train the edge detector. After segmentation of organoids and single cells, phenotypic features are calculated and averaged to an experiment-specific feature vector. With the currently available data, we will show that we can distinguish the phenotypic effect of different drugs and have a basis for treatment recommendation.


Cell lineage tracing in zebrafish using CRISPR/Cas9 induced genetic marks

Maria Florescu

Hubrecht Institute, Utrecht University

Understanding how a zygote grows into a complex, multicellular organism is a central question in developmental biology. We present scartrace, a strategy for whole-organism lineage tracing based on Cas9 induced genetic marks, named scars, in several GFP integrations of a zebrafish line (Junker et al, BioRxiv 2016). Scarring is a dynamic process, meaning that scars are progressively introduced over multiple rounds of cell division. Using scartrace, we investigate morphogenesis in the early zebrafish embryo from scars of adult organs. We assign organ tissue to the corresponding germ layers and estimate the number of clones generating a certain tissue from bulk scarred tissue. Recently we achieved single cell scar detection, which greatly improves lineage-tracing resolution.

Measuring batch effects in single-cell RNA sequencing data

Maren Büttner

Institute of Computational Biology, Helmholtz Zentrum München

Quantification of the complete transcriptome of single cells using RNA-seq is a versatile tool for exploring heterogeneous cell populations, but suffers substantially from technical noise and batch effects. This observation has motivated the development of several batch correction and data normalisation techniques to separate technical and biological variation in the data. Measuring a batch effect amounts to quantifying the difference between the distributions in different batches to decide whether or not they are equivalent. Traditionally, this decision has been based on visual inspection of dimension-reduced representations of the data, as obtained, for example, by principal component analysis. A generalization of this approach tests the significance of a batch effect in each principal component; this has high statistical power, but is computationally inefficient and hard to interpret. We present kBET, a batch effect test based on k-nearest neighbours that tests nonparametrically for inequivalence of distributions. The batch label is known for each sample; hence a particular spatial subset (i.e. a subset of samples that potentially forms a cluster in the high-dimensional space) is hypothesised to be composed of the same batch labels (with equal fractions) as the complete data set. Being easy to interpret and computationally efficient, kBET overcomes the above-mentioned limitations while keeping high statistical power. We demonstrate its performance on various experimental and simulated data sets and show a substantial contribution of the batch effect in single-cell RNAseq data that is not apparent from inspecting projected data (as in principal component analysis, for example). Thus, kBET enables an unbiased comparison of the efficiency of recently developed batch effect correction techniques.
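The neighbourhood test described in this abstract can be illustrated with a minimal sketch. It is not the kBET implementation: the function names, the use of a Pearson chi-squared statistic, and the choice to test every sample's neighbourhood are assumptions made for illustration. For each sample, the batch composition of its k nearest neighbours is compared with the composition of the whole data set, and the fraction of rejected neighbourhoods is reported.

```python
import math
import random
from collections import Counter

def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(df + 2) = Q(df) + h**(df/2) * exp(-h) / Gamma(df/2 + 1).
    q, d = (math.erfc(math.sqrt(h)), 1) if df % 2 else (math.exp(-h), 2)
    while d < df:
        q += h ** (d / 2.0) * math.exp(-h) / math.gamma(d / 2.0 + 1.0)
        d += 2
    return min(q, 1.0)

def knn_batch_test(points, batches, k=10, alpha=0.05):
    """Fraction of samples whose k-nearest-neighbour batch composition
    differs significantly from the global batch composition.
    Assumes at least two distinct batch labels (hypothetical helper)."""
    batches = list(batches)
    n = len(points)
    labels = sorted(set(batches))
    global_frac = {b: batches.count(b) / n for b in labels}
    df = len(labels) - 1
    rejected = 0
    for i in range(n):
        # Indices of the k nearest neighbours of sample i (excluding i itself).
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        local = Counter(batches[j] for j in order[:k])
        # Pearson chi-squared statistic: observed vs expected batch counts.
        stat = sum((local[b] - k * global_frac[b]) ** 2 / (k * global_frac[b])
                   for b in labels)
        if chi2_sf(stat, df) < alpha:
            rejected += 1
    return rejected / n

# Simulated example: two batches from one distribution, then batch B shifted.
rng = random.Random(0)
mixed = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
labels = ["A"] * 50 + ["B"] * 50
shifted = [(x + 20.0, y) if i >= 50 else (x, y) for i, (x, y) in enumerate(mixed)]
print("rejection rate, well mixed:", knn_batch_test(mixed, labels))
print("rejection rate, shifted batch:", knn_batch_test(shifted, labels))
```

A rejection rate near the significance level suggests well-mixed batches, while a rate near one indicates that neighbourhoods are dominated by single batches. The brute-force neighbour search here is quadratic in the number of samples and is only meant for small examples.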

Exploring influenza A virus infection using single-cell RNA-sequencing

Lam-Ha Ly

Max-Planck-Institute for Molecular Genetics

Influenza A infected cells show a large cell-to-cell variability in the number of released progeny virions, the virus yield. To study the cellular heterogeneity of infected cells, we performed single-cell RNA-sequencing (scRNA-seq) to profile both the cellular and the viral transcriptome using the Smart-seq2 protocol. Here, we treated Madin-Darby Canine Kidney (MDCK) cell lines with influenza A viruses and classified cells as having a high or low virus yield. Additionally, as a reference, we performed bulk RNA-sequencing to profile infected and non-infected cells. This approach enables the investigation of host-pathogen cell interactions in order to explain differences in viral replication and integrity. A first computational analysis includes a comparison between bulk and single-cell samples and a differential expression analysis of high and low virus yield cells. Preliminary results will be shown.


Single cell mRNA sequencing to reveal the in vivo hierarchy of miRNA targets

Andrzej Rzepiela

ScopeM ETH

By promoting target mRNA decay, miRNAs can down-regulate mRNA targets, reduce the cell-to-cell variability of protein expression, and provide a channel through which mRNA targets can act as "competing RNAs" to influence each other's expression. The relative importance of these regulatory levels for cellular function is strongly debated, owing to the regulatory network size and the lack of in vivo measurements of miRNA-target interaction parameters. Combining single cell analysis with mathematical modeling, we derived a method for estimating Michaelis-Menten constants of endogenous miRNA targets, comprehensively and in the context of live cells. Applying the approach to two distinct miRNAs and using data from hundreds of single cell mRNA-Seq measurements, we tested the possibility to uncover a hierarchy of targets, and to find the targets that are very sensitive to the miRNA presence. We show that the sensitivity of a target to the miRNA is determined by the interplay between the rates of target-miRNA association and dissociation as well as the intrinsic and miRNA-induced target decay rates. We discuss the characteristics of the most sensitive targets that we discovered with the method. Our approach enables elucidation of complex behaviors resulting from the interactions of a large number of targets with a common regulator.

Single Cell DNA Sequencing Reveals a Late-Dissemination Model in Metastatic Colorectal Cancer

Alexander Davis

The University of Texas MD Anderson Cancer Center

Metastasis is a complex process and has been difficult to study in human patients. A major technical obstacle has been the extensive intratumor heterogeneity at the primary and metastatic tumor sites. To address this problem, we developed a highly-multiplexed single cell DNA sequencing approach to trace the metastatic lineages of two colorectal cancer (CRC) patients with matched liver metastases. Single cell copy number and mutational profiling was performed on 444 cells, in addition to bulk exome and deep-targeted sequencing. In the first patient we observed monoclonal seeding, in which a single clone had evolved a large number of mutations prior to migrating to the liver to establish the metastatic tumor. In the second patient we observed polyclonal seeding, in which two independent clones seeded the metastatic tumor after having diverged at different time points from the primary tumor lineage. The single cell data also revealed an unexpected independent tumor lineage that did not metastasize, and early progenitor clones with the first hit in APC that subsequently gave rise to both the primary and metastatic tumors. Collectively, these data reveal a late-dissemination model of metastasis in two CRC patients, and provide an unprecedented view of metastasis at single cell genomic resolution.

Modelling zeros for differential expression and feature selection in scRNASeq

Tallulah Andrews

Wellcome Trust Sanger Institute

Single cell RNA sequencing (scRNASeq) detects far fewer genes in each cell than traditional RNASeq. As a result, typical scRNASeq expression datasets contain a majority of zero values (dropouts), even after filtering out poor quality cells and undetected genes. Existing RNASeq analysis methods are ill-suited for dealing with these dropouts. We demonstrate how modelling zeros in scRNASeq data, both with and without unique molecular identifiers (UMIs), can improve the identification of biologically important genes (feature selection) and the accuracy of differential expression (DE) testing. We introduce two novel methods for differential expression testing which perform favourably against 11 published methods for both UMI-tagged and full-transcript single-cell RNASeq data. In addition, we use the relationship between gene expression and the frequency of dropouts to identify features corresponding to differentially expressed genes without a priori knowledge of cell populations. These features are robust to batch effects and they can be used to determine consistent cell types across multiple datasets from the same biological system. We apply this method to a meta-analysis of single-cell data of mammalian development and identify novel markers for distinct embryonic stages.


Investigating Microenvironment-to-cell Signaling in 3D Spheroids through Imaging Mass Cytometry

Vito Zanotelli

IMLS UZH/ Life Science Graduate School Zurich, Systems Biology Program

Joint work with: Georgi F, Schulz D, Schapiro D, Andriasyan V, Yakimovich A, Catena R, Jackson H, Bodenmiller B

Question: Every cell senses its local 3D environment and adapts its phenotype accordingly. This process is critical in tissue development and homeostasis. When deregulated, it can lead to diseases such as cancer. However, how the interactions between cells and their environment shape cellular phenotypes in tumors and contribute to tumor heterogeneity is largely unknown. Here we set out to develop a high throughput setup to quantify the influences of these interactions on the phenotypic heterogeneity in an in vitro 3D spheroid cell culture system.

Methods: We developed a workflow based on metal label barcoding. It allows an efficient coupling of 3D spheroid assays with an imaging mass cytometry (IMC) readout. This enables the simultaneous quantification of more than 40 phenotypic and functional markers at subcellular resolution in 3D microtissue slices. Firstly, this system is suitable for large scale studies on spatial relationships of complex phenotypes. Secondly, it can be used to follow perturbations mediated by small molecule inhibitors.

Results: We first demonstrate the feasibility of the barcoding approach. Based on data from unperturbed breast cancer spheroids, we show how IMC can capture phenotypic heterogeneity and coordination in the microenvironment. Finally, we explore how such data can be integrated using mathematical modeling to gain quantitative insights.

Conclusions: We present the development of a broadly applicable, scalable screening approach efficiently combining high throughput 3D tissue culture with IMC. The approach can be applied to more complex cell culture settings, including 3D co-cultures and organoids, as well as advanced perturbations, such as stimulation time courses. Combined with inhibitor screens, the technology is a solid basis to investigate spatial coordination in 3D tissue models.


Systematic analysis of cell phenotypes and cellular social networks in tissues using multiplexed image cytometry analysis toolbox (miCAT)

Denis Schapiro

University of Zurich, Institute of Molecular Life Sciences; Systems Biology PhD Program, ETH and University of Zurich

Single-cell, spatially resolved 'omics analysis of tissues is poised to transform biomedical research and clinical practice. We developed an open-source computational multiplexed image cytometry analysis toolbox (miCAT) to enable the interactive, quantitative and comprehensive exploration of single cell phenotypes, cell-to-cell interactions, microenvironments and morphological structures within intact tissues. We highlight the unique abilities of miCAT by analysis of highly multiplexed mass cytometry images of human breast cancer tissues.

FACSanadu: An open source tool for rapid visualization and quantification of flow cytometry data

Thomas Bürglin

Dept Biomedicine, University of Basel

1) European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
2) Department of Biomedicine, University of Basel, Mattenstrasse 28, CH-4058 Basel, Switzerland
* presenting author

Motivation: Flow cytometry is a fundamental technique in cell biology, yet few open source packages are available to analyze these data. Here we describe FACSanadu, an interactive package for rapid visualization and measurement of FCS data. It is the first open source package that can read data of the COPAS Biosorter.

Availability and Implementation: FACSanadu is implemented in Java and uses the Qt framework for display. Binary distributions are made for all major operating systems (Windows, Macintosh, Linux). The source code and documentation are available as free software at www.facsanadu.org.


Network-based regularized optimization for cancer survival data analysis

Susana Vinga

IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Portugal

Learning statistical models from oncological data has now become a major challenge due to the significant increase of molecular information available. The inherent high dimensionality of these 'omics datasets, where the number of features largely exceeds the number of observations, leads to ill-posed inverse problems and, consequently, to models that often lack interpretability and are prone to overfitting. In order to tackle this problem, regularized optimization has emerged as a promising approach, allowing constraints to be introduced on the structure of the solutions. These include the application of sparsity-inducing norms on the loss function, for which the l1-norm leading to the Lasso (least absolute shrinkage and selection operator) method is probably the best-known example, along with the elastic net, based on the l1 and l2-norms. More recently, several network-based regularizers have been proposed to analyze survival data when the features have a graph relationship, such as Net-Cox, which promotes parameter smoothness across the network, and DegreeCox, where the degree of each node is taken into account. Preliminary testing on breast and ovarian cancer survival data from The Cancer Genome Atlas (TCGA) illustrates some of the advantages of these methods. The comparison criteria include the concordance c-index for Cox models, and the p-value of the log-rank tests for the separation of high vs. low-risk patients. Preliminary results show an improvement of the c-index when network information is included, whereas the separation seems to decrease. However, the models estimated by cross-validation are different in terms of number of selected features, which may hamper the correct comparison of the results. Extensions to other norms are being conducted and may lead to improved survival models in terms of interpretability and overfitting control, a key aspect to improve and support clinical decision systems and prognostic assessment of oncological patients.
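For orientation, the regularized objectives mentioned in this abstract can be written out in their standard textbook form. The notation below (covariates x_i, event indicators delta_i, risk sets R(t_i), tuning parameters lambda and alpha, graph Laplacian L) is introduced here for illustration and is not taken from the abstract. The last line shows a generic Laplacian-type smoothness penalty only; the exact penalties used by Net-Cox and DegreeCox are not specified in the abstract and may differ.

```latex
% Cox partial log-likelihood for coefficients \beta
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \Bigl[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\bigl(x_j^{\top}\beta\bigr) \Bigr]

% Elastic net estimate; \alpha = 1 gives the Lasso
\hat{\beta} = \arg\min_{\beta}\;
  -\ell(\beta) + \lambda \Bigl( \alpha \lVert \beta \rVert_1
  + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \Bigr)

% Generic network smoothness penalty with graph Laplacian L (assumed form)
\hat{\beta}_{\mathrm{net}} = \arg\min_{\beta}\;
  -\ell(\beta) + \lambda\, \beta^{\top} L\, \beta
```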


Finding subtle differences in single cell RNA-seq data

Camille Stephan-Otto Attolini

IRB Barcelona

Cell populations characterized by different functional programs usually present radical differences in the expression of relatively large sets of genes. In the context of single cell RNAseq this is easily detected despite the highly variable nature of the count data.

Colon cancer stem cells have until now been considered to be a homogeneous population defined mainly by the expression of the LGR5 gene. We analyzed a large dataset of LGR5 positive (LGR5+) cells and their one-division progeny (LGR5-) in order to determine the existence of heterogeneity within the LGR5+ cells and the relation to the populations seen in LGR5- cells. We found that applying existing methodologies for SC-RNAseq led to no apparent heterogeneity among the stem cell population. We hypothesized that the subtle differences expected among these cells were masked by the high abundance of zero counts and the large differences between the expression of highly and lowly expressed genes.

Our methodology consists of quality filtering of cells and genes, a transformation of the count matrix, simple dimensionality reduction methods and unsupervised clustering. Our results suggest the existence of two populations characterized by genes associated with proliferation and with the gene Mex3a. The expression of these two gene sets is inversely correlated, as expected from experimental observations. We investigated the robustness of these clusters and found that a minimum number of cells is necessary to observe the different populations. We were also able to detect progenitors of known cell types present in the differentiated colon tissue among the LGR5- cells.

Our investigation suggests that specific methodologies may be necessary depending on the nature of the data when dealing with SC-RNAseq data.


TRACE: Reconstructing trajectories of cell cycle evolution using single-cell mass cytometry data

Maria Anna Rapsomaniki¹, Xiaokang Lun², Johanna Wagner², Bernd Bodenmiller² and Maria Rodriguez Martinez¹

¹ IBM Research Laboratory, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland
² Institute of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland

As single-cell experimental approaches become increasingly popular, cell-to-cell heterogeneity has emerged as a key determinant factor contributing to variability in gene expression and signaling responses. Mass cytometry (CyTOF) is a new proteomic technology that enables the simultaneous quantification of dozens of proteins in thousands of individual cells. In the context of cancer research, recent applications of CyTOF include the characterization of inter- and intra-tumor heterogeneity and the identification of novel cell subpopulations. However, as already demonstrated for single-cell RNA-seq, the resulting measurements are largely influenced by confounding factors, such as the cell cycle and cell volume. We present here TRACE, a novel computational approach to quantify this source of variability. TRACE first exploits a hybrid machine learning approach to classify single cells into discrete cell cycle phases according to measurements of established markers. Next, a metric embedding optimization technique creates a one-dimensional continuous marker that tracks biological pseudotime, and individual cells are subsequently ordered according to this pseudotime marker. The resulting cell cycle trajectories across perturbation time points allow us to separate cell cycle effects from experimentally induced responses, enabling the direct comparison of signaling responses through cell cycle progression. Additionally, we show that volume biases can be corrected using housekeeping gene measurements. Our approach, implemented in a simple and intuitive Graphical User Interface, was used to analyze data from various cell lines subject to different stimulations. In each case, TRACE was able to separate confounding effects from signaling responses, enabling the unbiased analysis of biological processes.


Seeing is Believing: Biology-Driven Visualizations for Single Cell High Dimensional Datasets

Yann Abraham

Janssen Pharmaceutica

Joint work with: Bertran Gerrits, Marie-Gabrielle Ludwig, Frederik Stevenaert, Greet Vanhoof, Anish Suri, Pieter Peeters, Caroline Gubser Keller

Visualization and interpretation of high dimensional data is a common challenge, and several solutions have been developed that address specific issues such as the identification of groups of points sharing similar characteristics. Most methods rely on projections to a plane, and any information on the initial dimensions is lost. When applied to biological datasets, this makes the interpretation of the resulting visualization difficult, whether one is trying to infer the nature of the different groups of points or to determine if the differences between samples are biologically relevant. To overcome this challenge, we applied to CyTOF datasets two visualizations that had previously been developed to visualize trends in high dimensional data. First, we used Radviz to project all cells in a two-dimensional space in a way that maintains the relation between each point and every channel that has been measured. Using standard visualization techniques, several conditions can be compared at the sample level to identify trends in the data. Once groups of relevant cells had been identified, we used Fan plots to visualize the contribution of each channel and to estimate the variability of each channel within the group. Fan plots can also be used to compare a given group of cells across several conditions. Both methods can be used together or in addition to other methods, including analytical ones. Compared to other methods such as t-SNE and SPADE, Radviz and Fan plots make the interpretation of complex datasets easier. They provide an important interface between analytical methods and the biology.
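As an illustration of how a Radviz projection keeps each point tied to every measured channel, the sketch below implements the generic Radviz construction. It is not the authors' code: the function name, the min-max scaling, and the fixed channel order are assumptions. Each channel is assigned an anchor on the unit circle, and each cell is placed at the average of the anchors weighted by its scaled channel values, so a cell is pulled towards the channels in which it is high.

```python
import math

def radviz(rows):
    """Project rows of channel values to 2D Radviz coordinates.

    Hypothetical helper: channels are min-max scaled and anchored in the
    order given; real applications often optimise the anchor order.
    """
    d = len(rows[0])
    # One anchor per channel, evenly spaced on the unit circle.
    anchors = [(math.cos(2 * math.pi * j / d), math.sin(2 * math.pi * j / d))
               for j in range(d)]
    # Min-max scale every channel to [0, 1] so that no channel dominates.
    lo = [min(r[j] for r in rows) for j in range(d)]
    hi = [max(r[j] for r in rows) for j in range(d)]
    coords = []
    for r in rows:
        w = [(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(d)]
        total = sum(w)
        if total == 0:
            coords.append((0.0, 0.0))  # all channels at their minimum: centre
            continue
        # Weighted average of the anchor positions.
        coords.append((sum(wj * a[0] for wj, a in zip(w, anchors)) / total,
                       sum(wj * a[1] for wj, a in zip(w, anchors)) / total))
    return coords

# Made-up intensities for four cells measured in three channels.
cells = [[5.0, 0.1, 0.2], [0.3, 4.0, 0.1], [0.2, 0.2, 6.0], [3.0, 3.0, 3.0]]
for xy in radviz(cells):
    print(xy)
```

Cells dominated by one channel land near that channel's anchor, while cells with balanced scaled values land near the centre, which is what makes the axes of the plot directly interpretable.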


Identification of Expanded Clonotypes based on TCR sequencing of Single T-Cells

An De Bondt

Janssen Research & Development, Beerse 2340, Belgium

Our aim is to characterize the TCR repertoire as a means to identify disease onset and to intercept and cure disease by developing targeted strategies. To this end we developed a platform to identify expanded T-cell clones. T-cells are a major component of the immune system, and the T-cell Receptor (TCR) plays a pivotal role in the identification of antigens. Via recombination of the TCR locus, each T-cell expresses a different TCR which is specific to a given antigen. All recombined TCR sequences across all T-cells within an individual, the ‘T-cell repertoire’, provide the specificity required for a proper immune response. The interaction between a TCR and a given antigen triggers activation and expansion of the T-cell, accompanied by a switch in gene expression that ultimately modulates the T-cell function. We have established and validated protocols for TCR sequencing on cell lines as well as on single T-cells from healthy donors. After ex vivo stimulation of peripheral blood mononuclear cells (PBMCs), we sequenced the most variable region of both chains of the receptor. The analysis of the TCR sequencing data allowed us to identify the oligoclonal expansion of T-cells. For this analysis, we evaluated tools like MiXCR and TraCeR and further developed visualisations and standardized reporting. In conclusion, we have established a platform for analyzing the immune repertoire which can be readily applied to diseases where immune infiltration is at play, such as oncology, hepatitis and diabetes.
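The idea of flagging expanded clones from single-cell TCR data can be sketched as a simple counting step. This is not the platform described above: the definition of a clonotype as an identical pair of alpha and beta chain sequences, the threshold `min_cells`, the field names and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

def expanded_clonotypes(cells, min_cells=2):
    """Return {clonotype: cell count} for clonotypes seen in at least `min_cells` cells."""
    counts = Counter(
        (cell["alpha_cdr3"], cell["beta_cdr3"])
        for cell in cells
        if cell["alpha_cdr3"] and cell["beta_cdr3"]  # skip cells missing a chain
    )
    return {clone: n for clone, n in counts.items() if n >= min_cells}

# Toy, made-up sequences for illustration only (not real data).
cells = [
    {"cell": "c1", "alpha_cdr3": "CAVSDGGSQGNLIF", "beta_cdr3": "CASSLGQGAYEQYF"},
    {"cell": "c2", "alpha_cdr3": "CAVSDGGSQGNLIF", "beta_cdr3": "CASSLGQGAYEQYF"},
    {"cell": "c3", "alpha_cdr3": "CAVSDGGSQGNLIF", "beta_cdr3": "CASSLGQGAYEQYF"},
    {"cell": "c4", "alpha_cdr3": "CAASNTGKLIF", "beta_cdr3": "CASSPGTEAFF"},
    {"cell": "c5", "alpha_cdr3": None, "beta_cdr3": "CASSPGTEAFF"},
]
expanded = expanded_clonotypes(cells)
print(expanded)
```

In practice the chain sequences would come from a reconstruction tool such as those named above, and the expansion threshold would depend on the number of cells sequenced.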


Linear mixed effects models for complex experimental designs in single cell mass cytometry

Marjolein Crabbe and Marianne Tuefferd

Janssen Research and Development BE; Translational Biomarkers, Infectious Diseases, Johnson & Johnson Pharmaceutical Research & Development

Joint work with: W. Talloen, Y. Abraham, Y. Zhang, G. Vanhoof, F. Stevenaert, K. Spittaels, J. Bollekens, J. Aerssens

Single cell profiling technologies open new perspectives in clinical research, aiming at improved insights into disease heterogeneity and detailed patient response profiles to treatments. Integrating the increasing complexity of modern clinical study designs in the data analysis strategy of single cell assessments poses, however, significant challenges. Here we propose an analysis approach for single cell patient-derived data that considers the clinical study setup, based on linear mixed effects modeling.

To illustrate this approach, a CyTOF (Cytometry by Time-Of-Flight) dataset generated from patient samples collected in a cross-sectional study that recruited multiple resolver types of chronic HBV patients is considered. Ex vivo stimulations of individual patient samples (paired stimulated versus unstimulated samples) and the potential interaction of the stimulation with the distinction in clinical resolver profiles in this study contribute to the complexity of the experimental design. In addition to 19 markers used in the CyTOF analysis to annotate different cell types, each individual cell is characterized by 18 functional channels. Hence, earlier described methods for multivariate statistical analysis of such high-dimensional datasets are challenging to apply in this complex experimental design setting.

The proposed exploratory statistical analysis of single cell CyTOF data generated in the context of more complex experimental designs assesses the advantage of mixed models in identifying functional differences within specific cell populations between the different resolver groups. Applying a univariate approach, the paired design structure embedded in the experimental study design can be exploited and the potential interaction between cohort type and stimulation accounted for, enabling a more refined analysis. Moreover, mixed models provide a flexible framework to analyze a variety of extensions, including repeated sampling in a longitudinal study design.
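A univariate mixed model of this kind can be sketched with simulated data as follows. The sketch assumes one summary value per sample (for instance, the median intensity of one functional channel in one cell population), a random intercept per patient to capture the pairing, and fixed effects for stimulation, resolver group and their interaction. The column names, effect sizes and the use of `statsmodels` are illustrative assumptions, not details from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_patients = 40
patient_effect = rng.normal(0.0, 1.0, n_patients)  # patient-level baseline shifts

rows = []
for patient in range(n_patients):
    resolver = patient % 2  # hypothetical two-group cohort label (0 or 1)
    for stim in (0, 1):     # paired unstimulated (0) and stimulated (1) samples
        signal = (1.0 + 0.8 * stim + 0.3 * resolver + 0.5 * stim * resolver
                  + patient_effect[patient] + rng.normal(0.0, 0.1))
        rows.append({"patient": patient, "resolver": resolver,
                     "stim": stim, "signal": signal})
data = pd.DataFrame(rows)

# Random intercept per patient; fixed effects include the stim x resolver interaction.
model = smf.mixedlm("signal ~ stim * resolver", data, groups=data["patient"])
fit = model.fit()
print(fit.params[["stim", "resolver", "stim:resolver"]])
```

The `stim:resolver` coefficient estimates how the stimulation response differs between the two groups. Repeating the fit per functional channel and cell population gives the univariate screen described above, and a longitudinal extension would add a time term to the formula.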


Participants

Yann Abraham, Janssen Pharmaceutica
Nicola Aceto, University of Basel
Lisa Amrhein, Helmholtz Zentrum München, Institute of Computational Biology
Tallulah Andrews, Wellcome Trust Sanger Institute
Ricard Argelaguet, European Bioinformatics Institute
Eirini Arvaniti, ETH Zurich
Niko Beerenwinkel, CBG, BSSE, ETH Zurich
Fatemeh Behjati, Max Planck Institute for Informatics
Kobi Benenson, ETH Zurich
Nicolas Bennett, Seminar for Statistics, ETH Zurich
Bernd Bodenmiller, University of Zurich
Thomas Bürglin, Dept Biomedicine, University of Basel
Peter Bühlmann, ETH Zurich
Maren Büttner, Institute of Computational Biology, Helmholtz Zentrum München
Juan C. Caicedo, Broad Institute of MIT and Harvard
Ambrose Carr, Memorial Sloan Kettering Cancer Center
Luciano Cascione, Bioinformatics Core Unit at IOR
Francesc Castro-Giner, University of Basel
Marjolein Crabbe, Janssen Research and Development BE
Simona Cristea, Dana-Farber Cancer Institute & Harvard School of Public Health
Helena Crowell, University of Zurich
Alexander Davis, The University of Texas MD Anderson Cancer Center
An De Bondt, Janssen Research & Development
Maria Florescu, Hubrecht Institute, Utrecht University
James Gagnon, Harvard University
Johann Gagnon-Bartsch, University of Michigan, Statistics
Javier Gayan, F. Hoffmann-La Roche
Monica Golumbeanu, CBG, BSSE, ETH Zurich
Raphael Gottardo, Fred Hutchinson Cancer Research Center


Andreas Gschwind, EMBL Heidelberg
Stephanie Hicks, Dana-Farber Cancer Institute / Harvard T.H. Chan School of Public Health
Takashi Hiiragi, EMBL Heidelberg
Susan Holmes, Statistics Dept, Stanford
Yuanhua Huang, School of Informatics, University of Edinburgh
Wolfgang Huber, EMBL Heidelberg
Steffen Jaensch, Johnson & Johnson
Katharina Jahn, CBG, BSSE, ETH Zurich
Hans-Michael Kaltenbach, ETH Zurich
Chantriolnt Andreas Kapourani, University of Edinburgh, School of Informatics
Zahra Karimaddini, BSSE, ETH Zurich, Switzerland
Peter Kharchenko, Harvard Medical School
Vladislav Kim, European Molecular Biology Laboratory
Vladimir Kiselev, Sanger Institute
Jan Korbel, European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany, and European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
Keegan Korthauer, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health
Jack Kuipers, CBG, BSSE, ETH Zurich
Ivo Kwee, Bioinformatics Core Unit at IOR
Griet Laenen, Open Analytics
Lisa Lamberti, University of Oxford
Prisca Liberali, FMI Basel
Carolin Loos, Helmholtz Zentrum München - German Research Center for Environmental Health, Institute of Computational Biology
Junyan Lu, European Molecular Biology Laboratory
Lam-Ha Ly, Max-Planck-Institute for Molecular Genetics
Will Macnair, ETH IMSB
John Marioni, Cancer Research UK, Cambridge
Tobias Marschall, Saarland University / Max Planck Institute for Informatics
Davis McCarthy, EMBL-EBI, Hinxton, UK
Andrew McDavid, University of Rochester
Elisabetta Mereu, CNAG-CRG, Universitat Pompeu Fabra (UPF), Barcelona, Spain
Eugenia Migliavacca, NIHS
Laurent Modolo, CNRS - Lyon
Fabian Müller, Max Planck Institute for Informatics, Saarbrücken, Germany


Nick Navin, MD Anderson
Michael Newton, University of Wisconsin, Madison
Lan Huong Nguyen, Stanford University (prof. Susan Holmes Lab)
Martin Pirkl, CBG, BSSE, ETH Zurich
Michael Prummer, NEXUS, ETHZ
Magnus Rattray, University of Manchester
Maria Anna Rapsomaniki, IBM Research Laboratory
Hubert Rehrauer, ETH Zurich and University of Zurich
Andre Rendeiro, CeMM Research Centre for Molecular Medicine of the Austrian Academy of Sciences
Edith Ross, Cancer Research UK Cambridge Institute, University of Cambridge
Andrzej Rzepiela, ScopeM ETH
Agus Salim, La Trobe University and Walter and Eliza Hall Institute of Medical Research
Jan Sauer, German Cancer Research Center (DKFZ)
Denis Schapiro, University of Zurich, Institute of Molecular Life Sciences Systems Biology PhD Program, ETH and University of Zurich
Geoffrey Schiebinger, Broad Institute
Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam
Sven Schuierer, Novartis Institutes of Biomedical Research
Christof Seiler, Department of Statistics, Stanford University
Beate Sick, ZHAW
Jochen Singer, CBG, BSSE, ETH Zurich
Charlotte Soneson, University of Zurich
Michael Stadler, Friedrich Miescher Institute for Biomedical Research
Oliver Stegle, EMBL-EBI Hinxton
Daniel Stekhoven, ETH Zurich
Camille Stephan-Otto Attolini, IRB Barcelona
Barbara Szczerba, University of Basel
Ewa Szczurek, Institute of Informatics, Faculty of Informatics, Mathematics and Mechanics, University of Warsaw
Valerie Taly, Paris Descartes
Marianne Tuefferd, Translational Biomarkers, Infectious Diseases, Johnson & Johnson Pharmaceutical Research & Development
Catalina Vallejos, Alan Turing Institute and UCL
Lars Velten, EMBL
Britta Velten, EMBL Heidelberg
Jean-Philippe Vert, ENS Paris
Susana Vinga, IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Portugal


Sijian Wang, Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, USA
Lukas Weber, Institute of Molecular Life Sciences, University of Zurich
Daniel Wells, University of Oxford
Jean (Zhijin) Wu, Brown University
Jean Yang, University of Sydney
Vito Zanotelli, IMLS UZH / Life Science Graduate School Zurich, Systems Biology Program
