Data Science Initiative - Stanford University · Data Science Research at Stanford 2017–2018

Data Science Initiative

Data Science Research at Stanford 2017–2018

Stanford Data Science Initiative

TABLE OF CONTENTS

About Stanford Data Science Initiative

Letter from the Directors

Flagship Projects
  Data Science for Personalized Medicine
  Mapping the “Social Genome”
  Privacy Preserving Internet of Things–Analytics for Human Behavior Interventions

Small Projects
  Algorithms and Foundations for Valid Data Exploration
  Big Data for Agricultural Risk Management in the United States
  AMELIE: Making Genetic Diagnostics Accessible, Reproducible, Ubiquitous
  Food Prices and Mortality
  Blockchain
  Inferring the Mass Map of the Observable Universe from 10 Billion Galaxies
  Real-Time Large-Scale Neural Identification
  Stanford Distributed Clinical Data Project and MS Azure
  MyHeart Counts
  Physics Event Reconstruction at the Large Hadron Collider
  Use of Electronic Phenotyping and Machine Learning Algorithms to Identify Familial Hypercholesterolemia Patients in Electronic Health Records

Selected Research Projects
  Application of Computing and Informatics Technologies to Problems Relevant to Medicine
  Data-Intensive Systems and Tools for Making Complex, High-Volume Data Analytics More Useful and More Accessible
  Natural Language Processing and Customer Support
  Studying the Dark Matter of the Web
  Reinforcement Learning
  Applications of Cryptography to Computer Security
  Computational Approaches to Help Address the Societal and Environmental Challenges
  Food Production, Food Security, and the Environment
  Computational Logic
  Mining and Modeling Large Social and Information Networks
  Parallel Computing
  Sensing, Reconstructing and Building an Expert System for Data Centers and Clouds
  Computer Vision, Robotic Perception and Machine Learning
  Weld
  Advanced Temporal Language Aided Search
  Human Genome

SDSI Affiliated Faculty

Corporate Members

ABOUT STANFORD DATA SCIENCE INITIATIVE

The Stanford Data Science Initiative (SDSI) is a university-wide organization focused on core data technologies with strong ties to application areas across campus. SDSI comprises methods research, infrastructure, and education.

Recently there has been a paradigm shift in the way data is used. Today researchers are mining data for patterns and trends that lead to new hypotheses. This shift is caused by the huge volumes of data available from web query logs, social media posts and blogs, satellites, sensors, and medical devices.

Data-centered research faces many challenges. Current data management and analysis techniques do not scale to the huge volumes of data that we expect in the future. New analysis techniques that use machine learning and data mining require careful tuning and expert direction. In order to be effective, data analysis must be combined with knowledge from domain experts. Future breakthroughs will often require intimate and combined knowledge of algorithms, data management, the domain data, and the intended applications.

SDSI consists of data science research, shared data and computing infrastructure, shared tools and techniques, industrial links, and education. SDSI has strong ties to groups across Stanford University such as medicine, computational social science, biology, energy, and theory.

Contact Steve Eglash, Executive Director, for more information: seglash@stanford.edu.

LETTER FROM THE DIRECTORS

The Stanford Data Science Initiative: Enabling Deep Engagement between Industry and Stanford Researchers

The world is being transformed by large-scale data and massive computation. Data-based decision making is rapidly becoming an integral part of business, science, and society. The Stanford Data Science Initiative is an interdisciplinary focal point for research that harnesses massive data, fast computation, and new machine learning techniques. SDSI is a collaborative effort between industry and academia spanning all industries, the entire world, and the whole university including methodologists from computer science, statistics, and artificial intelligence and experts in disciplines that are being transformed by data science and computation such as medicine, physics, earth science, social sciences, life sciences, education, business, and law.

SDSI’s emphasis is on research for accelerating data science, discovery, and application across the university and throughout business and society. As this brochure shows, SDSI has been funding research and enabling collaboration across all seven schools at Stanford and with industry in medicine, climate change, social science, energy, and more. Every company and industry on the planet is affected by the data science and artificial intelligence revolution. Smart companies are exploiting this revolution as an opportunity to improve competitive advantage and increase market share, revenue, and profitability. SDSI provides visibility into emerging technology, accurate assessments of current capability, and impactful insights in many specific domains.

There are many pressing questions about the use of data science in civil society. Challenges include algorithmic bias and fairness, the interpretability and accountability of decisions made by autonomous systems, security and privacy of data, decisions made using shared data, and the impact of data science on law, transportation, markets, and national defense. Autonomous vehicles, robots, intelligent agents, and other forms of automation will cause job loss and the need for education. Education is being transformed by massive online courses and automatic tutoring. SDSI strives to work with industry and academia to assure that we are asking the right questions and developing the most impactful innovations.

Euan Ashley Director, Stanford Data Science Initiative; Professor, Medicine and Genetics

Jure Leskovec Director, Stanford Data Science Initiative; Associate Professor, Computer Science

Steve Eglash Executive Director, Stanford Strategic Research Initiatives, Computer Science


Flagship Projects


DATA SCIENCE FOR PERSONALIZED MEDICINE

Michael Snyder, David Tse, Euan Ashley, Mohsen Bayati, Dan Boneh, Andrea Montanari, Ayfer Ozgur, Tsachy Weissman

Recent technological advances have enabled the collection of diverse health data at an unprecedented level. Omics information from genomes, transcriptomes, proteomes, metabolomes, DNA methylomes, and the microbiome, together with electronic medical records and data from sensors and wearable devices, provides a detailed view of disease state and of physiological and behavioral parameters at the individual level. The availability of such a massive-scale digital footprint of an individual’s health opens the door to numerous opportunities for monitoring and accurately predicting the individual’s health outcomes, in addition to customizing treatments at the individual level, hence realizing the goal of personalized medicine. A major challenge is how to efficiently collect, store, secure, and, most importantly, analyze such massive-scale and highly private data so that the accuracy of outcome predictions and treatment analysis is not impacted. The “Data Science for Personalized Health” flagship project will design a system that will address this challenge and validate it on several personalized medicine tasks. Specifically, we will 1) devise new algorithms for sampling, for imputation of missing data, and for joint processing of multiple measurements; 2) build novel frameworks to house and manage complex data in a useful and secure fashion; 3) devise new tools for the analysis of and prediction from high-dimensional, complex, longitudinal data. Using a unique dataset on 70 pre-diabetic participants, we will devise a personalized and highly accurate early detection method for diabetes, analyze the consequences of weight change, physical activity, stress, and respiratory viral infection on individuals’ digital health footprint, and ultimately predict the effect of such perturbations on individuals’ health outcomes. The research is led by an interdisciplinary team of faculty with expertise in medicine, genetics, machine learning, security, and information theory, and the tools developed will be of broad interest to other data science problems as well.


MAPPING THE “SOCIAL GENOME”

Jure Leskovec, Michael Bernstein, Amir Goldberg, Dan Jurafsky, Dan McFarland, Christopher Potts

The initial research plan is built around three interrelated levels of analysis: individual, group, and society. At each level, we are investigating the interplay between static and dynamic properties, and paying special attention to the ethical and economic issues that arise when confronting major scientific challenges like this one. Our ultimate goal is to identify ways in which scientists, engineers, community builders, and community leaders can contribute to the development of more productive, vibrant, and informed teams, online and offline communities, and societies.

The goal of this project is to develop data science tools and statistical models that bring networks and language together in order to make more and better predictions about both. Our focus is on joint models of language and network structure. This brings natural language processing and social network analysis together to provide a detailed picture not only of what is being said in a community, but also who is saying it, how the information is being transmitted through the network, how that transmission affects network structure, and, coming full circle, how those evolving structures affect linguistic expression. We plan to develop statistical models using diverse data sets, including not only online social networks (Twitter, Reddit, Facebook), but also hyperlink networks of news outlets (using massive corpora we collect on an ongoing basis) and networks of political groups, labs, and corporations.

Leskovec maintains a large collection of network and language data sets at the website of the Stanford Network Analysis Project (SNAP, http://snap.stanford.edu). The pilot work described in general terms here relies mainly on resources that have been posted on SNAP for public use. (In some cases, privacy or business concerns preclude such distribution.) Moreover, we have access to several powerful, comprehensive data sets: (i) cell phone call traces of entire countries; (ii) complete article commenting and voting from sites like CNN, NPR, FOX, and similar; (iii) a near-complete U.S. media picture: 10 billion blog posts and news articles (5 million per day over the last six years); (iv) complete Twitter, LinkedIn, and Facebook data (through direct collaboration with these companies); (v) five years of email logs from a medium-sized company.


PRIVACY PRESERVING INTERNET OF THINGS–ANALYTICS FOR HUMAN BEHAVIOR INTERVENTIONS

Philip Levis, Noah Diffenbaugh, Dan Boneh, Mark Horowitz

The high-level, long-term goal is to research how to use the Internet of Things to collect data on human behavior in a manner that preserves privacy but provides sufficient information to allow interventions which modify that behavior. We are exploring this research question in the context of water conservation at Stanford: how can smart water fixtures collect data on how students use water, such that dormitories can make interventions to reduce water use, while keeping detailed water use data private?

Towards this end, we have deployed a water use sensing network in Stanford dormitories. Our pilot deployment in the winter of 2017 showed several interesting results, for example that the average and median shower lengths are the same for men and women. More importantly, using this network we have been able to determine that placards placed within the showers suggesting lower water use are correlated with 10% shorter showers. Furthermore, while the average shower length is 8.8 minutes, there is an extremely long tail, with 20% of showers being longer than 15 minutes.

We have been asked to deploy the network again in order to measure water use within a larger dormitory. Currently, the network is purely observational: our next goal is to augment the network with real-time feedback to users, such as blinking a red light when a shower is running long. This will allow us to explore the relative efficacy of delayed (signs on doors, messages to dormitories), immediate (placards in showers) and real-time (indicator lights) interventions. Can the system do this in a way that provides the aggregate results without revealing the behavior of individuals or individual water use events?


Small Projects


ALGORITHMS AND FOUNDATIONS FOR VALID DATA EXPLORATION

James Zou

LARGE-SCALE EXPERIMENTS TO MEASURE THE SCIENCE OF DATA SCIENCE

Despite the tremendous growth of data science, we lack a systematic and quantitative understanding of how data science is done in practice. For example, if we ask 10,000 data scientists to independently explore the same dataset, how different would their sequences of analysis steps and their findings be? Are certain paths of analysis more likely to lead to biases and false discoveries? What resources and training can we provide to analysts to improve analysis accuracy? We are answering these and many other fundamental questions in the largest controlled experiments to get at the heart of the science behind data science. In collaboration with several of the most popular data science MOOCs, we are recruiting thousands of analysts to an online platform where we provide them specific datasets and track each step of their analysis. The results of these experiments could lead to insights that improve the robustness and reproducibility of data science.

ADAPTIVELY COLLECTED DATA

From scientific experiments to online A/B testing, adaptive procedures for data collection are ubiquitous in practice. Previously observed data often affects how future experiments are performed, which in turn affects which data will be collected. Such adaptivity introduces complex correlations between the data and the collection procedure, and these correlations have been largely ignored by data scientists.

We prove that under very general conditions, any adaptively collected data has negative bias, meaning that the observed effects in the data systematically underestimate the true effect sizes. As an example, consider an adaptive clinical trial where additional data points are more likely to be tested for treatments that show initial promise. Our result implies that the average observed treatment effects would be smaller than the true effects of each treatment. This is quite surprising, because folklore says that, if anything, we might over-estimate the true effect due to Winner’s Curse. We prove that the opposite is true. Moreover, we develop algorithms that effectively reduce this bias and improve the usefulness of adaptively collected data.

https://arxiv.org/abs/1708.01977
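The negative bias described above can be seen in a small simulation. The sketch below is only an illustration under stated assumptions, not the procedure from the paper: the greedy collection rule, the two arms, the standard normal rewards, and the names `greedy_run` and `estimate_bias` are all hypothetical choices. Every arm has a true mean of zero, so any nonzero average of the final sample means is bias introduced by the collection procedure.

```python
import random


def greedy_run(rng, n_arms=2, n_pulls=10):
    """Collect data adaptively from arms that all have true mean 0.

    Each arm is pulled once; every later pull goes to the arm whose
    running sample mean is currently highest (a greedy rule).
    Returns the final sample mean of each arm.
    """
    sums = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    counts = [1] * n_arms
    for _ in range(n_pulls - n_arms):
        best = max(range(n_arms), key=lambda a: sums[a] / counts[a])
        sums[best] += rng.gauss(0.0, 1.0)
        counts[best] += 1
    return [s / c for s, c in zip(sums, counts)]


def estimate_bias(n_trials=20000, seed=0):
    """Average the per-arm sample means over many independent runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        means = greedy_run(rng)
        total += sum(means) / len(means)
    # The true mean is 0, so this average is the estimated bias.
    return total / n_trials


bias = estimate_bias()
print(f"average sample-mean bias under greedy collection: {bias:.3f}")
```

An arm that starts with an unlucky low draw is rarely sampled again, so its low estimate is never corrected, while an arm that starts high keeps being sampled until it drifts back toward its true mean. That asymmetry is one intuition for why the averaged estimate comes out below zero.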


BIG DATA FOR AGRICULTURAL RISK MANAGEMENT IN THE UNITED STATES

Stefano Ermon and David Lobell

This project aims to improve in-season predictions of yields for major crops in the United States, as well as a related goal of mapping soil properties across major agricultural states, and mapping crop locations and crop types around the world. The project uses a combination of graphical models, approximate Bayesian computation, and crop simulation models to make predictions based on weather and satellite data.

This work was published at AAAI and recognized with the best student paper award and an award from the World Bank Big Data Challenge competition.

cs.stanford.edu/~ermon/group/website/papers/jiaxuan_AAAI17.pdf

www.worldbank.org/en/news/feature/2017/03/27/and-the-winners-of-the-big-data-innovation-challenge-are#

AMELIE: MAKING GENETIC DIAGNOSTICS ACCESSIBLE, REPRODUCIBLE, UBIQUITOUS

Gill Bejerano, Christopher Ré

Mendelian diseases are caused by single gene mutations. In aggregate, they affect 3% (~250M) of the world’s population. The diagnosis of thousands of Mendelian disorders has been radically transformed by genome sequencing. The potential to change so many lives for the better is held back by the associated human labor costs. Genome sequencing is a simple, fast procedure costing hundreds of dollars. The mostly manual process of determining which, if any, of the patient’s sequenced variants is responsible for their phenotypes, against an exploding body of literature, makes genetic diagnosis 10X more expensive, unsustainably slow, and incompletely reproducible.

Our project, a unique collaboration between Stanford’s Computer Science department and Stanford’s children’s hospital Medical Genetics Division, aims to develop and deploy a first-of-its-kind computer system to greatly accelerate the clinical diagnosis workflow and, additionally, to derive novel disease gene hypotheses from it. This effort will produce a proof-of-principle workflow and worldwide deployable tools to significantly improve diagnostic throughput and greatly reduce both the time expert clinicians spend reaching a diagnosis and the associated costs, thereby making genomic testing accessible, reproducible, and ubiquitous.

A first flagship analysis web portal for the project has launched at: https://AMELIE.stanford.edu


FOOD PRICES AND MORTALITY

Eran Bendavid, Sze-chuan Suen, Sanjay Basu

In the late 2000s, the prices of many staple crops sold on markets in low- and middle-income African countries tripled. Higher prices may compromise households’ ability to purchase enough food, or alternatively increase incomes for food-producing households. Despite these different potential effects, the net impact of this “food crisis” on the health of vulnerable populations remains unknown.

We extracted data on the local prices of four major staple crops—maize, rice, sorghum, and wheat—from 98 markets in 12 African countries (2002–2013), and studied their relationship to under-five mortality from Demographic and Health Surveys. Using within-country fixed effects models, distributed lag models, and instrumental variable approaches, we used the dramatic price increases in 2007–2008 to test the relationship between food prices and under-five mortality, controlling for secular trends, gross domestic product per capita, urban residence, and seasonality.
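To illustrate the within-country fixed effects idea mentioned above, here is a minimal sketch on made-up numbers. It is not the study’s model: the function name `within_estimator`, the country labels, the prices, and the noiseless outcome are all hypothetical, and the distributed lags, instruments, and control variables used in the study are left out. The point is only that subtracting each country’s own averages removes stable between-country differences before the price relationship is estimated.

```python
from collections import defaultdict


def within_estimator(rows):
    """Fixed-effects slope of y on x for rows of (group, x, y).

    Each group's mean is subtracted from its x and y values (the
    "within" transformation), then a single slope is fit to the
    demeaned data by least squares.
    """
    by_group = defaultdict(list)
    for group, x, y in rows:
        by_group[group].append((x, y))
    sxy = 0.0
    sxx = 0.0
    for obs in by_group.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Hypothetical, noiseless data: each country has its own baseline outcome,
# and the outcome moves with price at the same slope in every country.
true_slope = -2.0
baselines = {"A": 50.0, "B": 80.0, "C": 65.0}
prices = {"A": [0.2, 0.3, 0.5], "B": [0.6, 0.8, 1.1], "C": [0.4, 0.5, 0.9]}
rows = [(c, p, baselines[c] + true_slope * p) for c in prices for p in prices[c]]

slope = within_estimator(rows)
print(f"estimated within-country slope: {slope:.3f}")
```

Because the synthetic data contain no noise, the estimator recovers the slope that was built in, even though the three countries start from very different baselines.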

The prices of all four commodities tripled, on average, between 2006 and 2008. We did not find any model specification in which the increased prices of maize, sorghum, or wheat were consistently associated with increased under-five mortality. Indeed, price increases for these commodities were more commonly associated with (statistically insignificant) lower mortality in our data. A $1 increase in the price per kg of sorghum, a common African staple, was associated with 0.07–4.50 fewer child deaths per 10,000 child-months, depending on the specification (p=0.25–0.98). In rural areas, where higher food prices may benefit households that are net food producers, increasing maize prices were associated with lower child mortality compared with urban households (12.4 fewer child deaths per 10,000 child-months with each $1 increase in the price of maize; p=0.04).

We did not detect a significant overall relationship between increased prices of maize, rice, sorghum, or wheat and increased under-five mortality. There is some suggestion that food-producing areas may benefit from higher prices, while urban areas may be harmed.

BLOCKCHAIN

Dan Boneh

Boneh’s lab is working on an efficient mechanism for confidential transactions on the blockchain (joint work with Benedikt Buenz). Confidential transactions (CT) are a way for two parties to transact on the blockchain without revealing the amount of money that one party is paying the other. This capability is absolutely necessary if the blockchain is ever going to be used for business. Current CT mechanisms have a number of drawbacks; most notably, CT transactions are much larger than non-CT transactions. Our construction greatly shrinks the overhead for CT transactions.
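The text above does not describe how amounts are hidden, so the following is a sketch under an explicit assumption rather than the lab’s construction. It uses Pedersen commitments, a standard building block for confidential transactions, with deliberately tiny toy parameters (`P`, `Q`, `G`, `H` and the function `commit` are illustrative names, and numbers this small offer no security). It shows the one property that makes hidden amounts checkable: commitments multiply together the way the underlying amounts add.

```python
# Toy parameters (assumption, insecure): G and H both generate the
# subgroup of order Q = 11 inside the integers modulo P = 23.
P, Q = 23, 11
G, H = 4, 9


def commit(value, blinding):
    """Pedersen-style commitment G^value * H^blinding (mod P).

    The random blinding factor hides the value; exponents are reduced
    modulo the subgroup order Q.
    """
    return (pow(G, value % Q, P) * pow(H, blinding % Q, P)) % P


# One input of 7 units is split into outputs of 5 and 2 units.
# The output blinding factors (8 + 6 = 14) match the input's (3) modulo Q.
c_in = commit(7, 3)
c_out1 = commit(5, 8)
c_out2 = commit(2, 6)

# A verifier sees only the commitments, yet can check that the amounts balance.
balanced = c_in == (c_out1 * c_out2) % P

# Outputs of 5 and 3 units do not add up to the input, and the check fails.
unbalanced = c_in == (c_out1 * commit(3, 6)) % P

print(balanced, unbalanced)
```

In practice a balance check like this is not enough on its own: each hidden amount also needs a proof that it lies in a valid range, and proofs of that kind are a large part of why confidential transactions take more space than ordinary ones.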


INFERRING THE MASS MAP OF THE OBSERVABLE UNIVERSE FROM 10 BILLION GALAXIES

Risa Wechsler, Phil Marshall

Mapping the Universe is an activity of fundamental interest, linking as it does some of the biggest questions in modern astrophysics and cosmology: What is the Universe made of, and why is it accelerating? How do the initial seeds of structure form and grow to produce our own Galaxy? Wide field astronomical surveys, such as that planned with the Large Synoptic Survey Telescope (LSST), will provide measurements of billions of galaxies over half of the sky; we want to analyze these datasets with sophisticated statistical methods that allow us to create the most accurate map of the distribution of mass in the Universe to date. The sky locations, colors and brightnesses of the galaxies allow us to infer (approximately) their positions in 3D, and their stellar masses; the distorted apparent shapes of galaxies contain information about the gravitational effects of mass in other galaxies along the line of sight. Our proposed work is to take the first step in using all of this information in a giant hierarchical inference of our Universe’s cosmological and galaxy population model hyper-parameters, after explicit marginalization of the parameters describing millions—and perhaps billions—of individual galaxies. We will need to develop the statistical machinery to perform this inference, and implement it at the appropriate computational scale. Training and testing will require large cosmological simulations, generating plausible mock galaxy catalogs; we plan to make all of our data public to enable further investigations of this type.

REAL-TIME LARGE-SCALE NEURAL IDENTIFICATION

E.J. Chichilnisky, Andrea Montanari

Electronic interfaces to the brain are increasingly being used to treat incurable disease, and eventually may be used to augment human function. An important requirement to improve the performance of such devices is that they be able to recognize and effectively interact with the neural circuitry to which they are connected. An example is retinal prostheses for treating incurable blindness. Early devices of this form exist now, but only deliver limited visual function, in part because they do not recognize the diverse cell types in the retina to which they connect. We have developed automated classifiers for functional identification of retinal ganglion cells, the output neurons of the retina, based solely on recorded voltage patterns on an electrode array similar to the ones used in retinal prostheses. Our large collection of data—hundreds of recordings from primate retina over 18 years—made an exploration of automated methods for cell type identification possible for the first time. We trained classifiers based on features extracted from electrophysiological images (spatiotemporal voltage waveforms), inter-spike intervals (auto-correlations), and functional coupling between cells (cross-correlations), and were able to routinely identify cell types with high accuracy. Based on this work, we are now developing the techniques necessary for a retinal prosthesis to exploit this information by encoding the visual signal in a way that optimizes artificial vision.


STANFORD DISTRIBUTED CLINICAL DATA PROJECT AND MS AZURE

Philip Lavori, Balasubramanian Narasimhan, Daniel Rubin

A system has been developed at Stanford that lets distant hospitals and clinics use confidential healthcare data to create decision support applications without sharing any patient data among those institutions, thus facilitating multi-institution research studies on massive datasets. This collaboration between Microsoft and Stanford will develop an MS Azure application based on this system, providing a solution that is robust, usable, and widely deployable at many healthcare institutions.
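The general idea of computing across institutions without moving patient records can be sketched in a few lines. This is an illustration only, not the Stanford system or its interface: the function names `local_summary` and `combine`, the three sites, and the measurements are all hypothetical. Each site reduces its own records to a few totals, and only those totals leave the institution.

```python
def local_summary(values):
    """Run inside one institution: reduce patient-level values to three totals."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def combine(summaries):
    """Run by the coordinator: pooled mean and variance from site totals only."""
    n = sum(s["n"] for s in summaries)
    mean = sum(s["sum"] for s in summaries) / n
    variance = sum(s["sum_sq"] for s in summaries) / n - mean ** 2
    return {"n": n, "mean": mean, "variance": variance}


# Hypothetical measurements held at three separate sites.
site_a = [5.1, 6.3, 4.8, 7.0]
site_b = [6.9, 5.5, 6.1]
site_c = [4.2, 5.8, 6.6, 5.0, 7.4]

pooled = combine([local_summary(site) for site in (site_a, site_b, site_c)])
print(pooled)
```

The pooled mean and variance match what a single combined dataset would give, although no site ever sees another site’s records. Summaries from very small sites can still reveal information about individuals, so a real deployment would need further safeguards.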

MyHEART COUNTS

Euan Ashley

The MyHeart Counts study, launched in the spring of 2015 on Apple’s ResearchKit platform, seeks to mine the treasure trove of heart health and activity data that can be gathered in a population through mobile phone apps. Because the average adult in the U.S. checks his or her phone dozens of times each day, phone apps that target cardiovascular health are a promising tool to quickly gather large amounts of data about a population’s health and fitness, and ultimately to influence people to make healthier choices. In the first 8 months, over 40,000 users downloaded the app. Participants recorded physical activity, filled out health questionnaires, and completed a 6-minute walk test. We applied unsupervised machine learning techniques to cluster subjects into activity groups, such as those more active on weekends. We then developed algorithms to uncover associations between the clusters of accelerometry data and subjects’ self-reported health and well-being outcomes. Our results, published in JAMA Cardiology in December 2016, are in line with the accepted medical wisdom that more active people are at lower risk for diabetes, heart disease, and other health problems. However, there is more to the story, as we learned that certain activity patterns are healthier than others. For example, subjects who were active throughout the day in brief intervals had lower incidence of heart disease compared to those who were active for the same total amount of time, but got all their activity in a single longer session.

In the second iteration of our study, we aim to answer more complex research questions, focusing on gene-environment interactions as well as discovering the mechanisms that are most effective in encouraging people to lead more active lifestyles. The app now provides users with different forms of coaching as well as graphical feedback about their performance throughout the duration of the study. Additionally, we know it’s not just environmental factors that affect heart health, so MyHeart Counts, in collaboration with 23andMe, has added a function in the app to allow participants who already have a 23andMe account to seamlessly upload their genome data to our servers. This data is coded and available to approved researchers. Combining data collected through the app with genetic data allows us to do promising and exciting research in heart health.

Download the MyHeart Counts app from the iTunes store and come join our research!


PHYSICS EVENT RECONSTRUCTION AT THE LARGE HADRON COLLIDER

Ariel Schwartzman

The aim of this proposal is to develop and apply advanced data science techniques to address fundamental challenges of physics event reconstruction and classification at the Large Hadron Collider (LHC). The LHC is exploring physics at the energy frontier, probing some of the most fundamental questions about the nature of our universe. The datasets of the LHC experiments are among the largest in all of science. Each particle collision event at the LHC is rich in information, particularly in the detail and complexity of each event picture (images of 100 million pixels taken 40 million times a second), making it ideal for the application of modern machine learning techniques to extract the maximum amount of physics information. Until now, most of the methods used to extract useful information from the large datasets of the LHC have been based on physics intuition built from existing models. During the last several years, advances in artificial intelligence, computer vision, and deep learning have produced remarkable performance improvements in image classification and vision tasks, in particular through the use of deep convolutional neural networks (CNNs). Representing LHC collision events as images, a novel concept developed by SDSI, has enabled, for the first time, the application of computer vision and deep learning methods to event classification and reconstruction, resulting in impressive gains in the discovery potential of the LHC.

We plan to continue to improve physics event interpretation at the LHC by developing and applying advanced machine learning algorithms to some of the most difficult and exciting challenges in physics event reconstruction at hadron colliders, such as the identification of Higgs bosons and the mitigation of pileup (many overlapping collisions in a single event). These developments will have important implications for extracting knowledge in high energy physics. The problem also provides a setting for more general exploration of tools to find subtle correlations embedded in a large dataset.
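To make the idea of representing a collision event as an image concrete, the following is a minimal sketch rather than the project's actual pipeline. It assumes each particle is given as a hypothetical (eta, phi, pT) triple relative to a jet axis, and it sums transverse momentum into a small square grid; the function name, grid size, window width, and sample particles are all illustrative assumptions.

```python
def make_jet_image(particles, n_bins=5, half_width=1.0):
    """Bin (eta, phi, pt) triples into an n_bins x n_bins grid of summed pT.

    Coordinates are assumed to be measured relative to the jet axis, so the
    image window spans [-half_width, half_width) in both directions.
    """
    image = [[0.0] * n_bins for _ in range(n_bins)]
    bin_size = 2.0 * half_width / n_bins
    for eta, phi, pt in particles:
        # Ignore particles that fall outside the image window.
        if not (-half_width <= eta < half_width and -half_width <= phi < half_width):
            continue
        # Clamp the index to guard against floating-point rounding at the edge.
        row = min(n_bins - 1, int((eta + half_width) / bin_size))
        col = min(n_bins - 1, int((phi + half_width) / bin_size))
        image[row][col] += pt
    return image


# Illustrative particles only: (eta, phi, pT). The last one lies outside the window.
particles = [(0.0, 0.0, 50.0), (0.1, -0.1, 20.0), (-0.9, 0.9, 5.0), (3.0, 0.0, 99.0)]
jet_image = make_jet_image(particles)
```

A grid like this can then be passed to an image classifier in the same way as a grayscale picture, with pixel intensity standing in for deposited momentum.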

Below is the list of publications funded by SDSI:

Jet-Images: Computer Vision Inspired Techniques for Jet Tagging, JHEP 02 (2015) 118. https://arxiv.org/abs/1407.5675



USE OF ELECTRONIC PHENOTYPING AND MACHINE LEARNING ALGORITHMS TO IDENTIFY FAMILIAL HYPERCHOLESTEROLEMIA PATIENTS IN ELECTRONIC HEALTH RECORDS
Joshua W. Knowles, Nigam Shah

The FIND FH EHR project (Flag, Identify, Network and Deliver for Familial Hypercholesterolemia) aims to pioneer new techniques for the identification of individuals with Familial Hypercholesterolemia (FH) within electronic health records (EHRs). FH is a common but vastly underdiagnosed, inherited form of high cholesterol and coronary heart disease that is potentially devastating if undiagnosed but can be ameliorated with early identification and proactive treatment. Traditionally, patients with a phenotype (such as FH) are identified through rule-based definitions whose creation and validation are time consuming. Machine learning approaches to phenotyping are limited by the paucity of labeled training datasets. In this project, we have demonstrated the feasibility of utilizing noisy labeled training sets to learn phenotype models from the patient’s clinical record. We have searched both structured and unstructured data within EHRs to identify possible FH patients. Individuals with possible FH have been “flagged” and are being contacted in a HIPAA-compliant manner to encourage guideline-based screening and therapy. Algorithms developed have been tested in datasets from collaborating institutions and are broadly applicable to several different EHR platforms. Furthermore, the principles can be applied to multiple conditions, thereby extending the utility of this approach. The project is in partnership with the FH Foundation (www.thefhfoundation.org), a non-profit organization founded and led by FH patients that is dedicated to improving the awareness and treatment of FH.


What do I mean by “a purposeful university”? I mean a university that promotes and celebrates excellence not as an end in itself, but rather as a means to multiply its beneficial impact on society.

Marc Tessier-Lavigne, President, Stanford University

Selected Research Projects



APPLICATION OF COMPUTING AND INFORMATICS TECHNOLOGIES TO PROBLEMS RELEVANT TO MEDICINE
Russ Altman

The annual volume of safety reports submitted to the FDA adverse event reporting system (FAERS) has increased 10x over the past 15 years to 1.2 million in 2016; yet the promise of big data analysis in drug safety monitoring has not paid off because of pervasive data quality issues. Almost 60 percent of serious or life-threatening adverse event reports submitted by manufacturers in 2016 did not have all of: (1) patient age, (2) gender, (3) event date, and (4) at least one medical term describing the event. By contrast, over 80 percent of direct submissions in 2016 were reasonably complete.
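The completeness criterion described above (age, gender, event date, and at least one medical term) can be sketched as a simple check. This is an illustrative sketch only: the field names and the sample reports are invented for the example and do not reflect the actual FAERS schema or data.

```python
# Hypothetical field names standing in for the four elements named above.
REQUIRED_FIELDS = ("patient_age", "gender", "event_date", "medical_terms")


def is_complete(report):
    """Return True when all four elements are present and non-empty.

    Values are compared against explicit empty markers rather than tested for
    truthiness, so that a legitimate age of 0 still counts as present.
    """
    return all(report.get(field) not in (None, "", []) for field in REQUIRED_FIELDS)


def completeness_rate(reports):
    """Fraction of reports that contain all four required elements."""
    if not reports:
        return 0.0
    return sum(is_complete(r) for r in reports) / len(reports)


# Invented sample data for illustration only.
sample_reports = [
    {"patient_age": 54, "gender": "F", "event_date": "2016-03-01", "medical_terms": ["nausea"]},
    {"patient_age": None, "gender": "M", "event_date": "2016-05-12", "medical_terms": ["rash"]},
    {"patient_age": 71, "gender": "M", "event_date": "", "medical_terms": []},
    {"patient_age": 33, "gender": "F", "event_date": "2016-07-30", "medical_terms": ["headache", "dizziness"]},
]
rate = completeness_rate(sample_reports)
```

Computing this rate separately for manufacturer and direct submissions would give the kind of comparison quoted above.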

Patient-reported outcomes (PROs) are seen as a way to improve the quality of adverse event reports, and mobile apps that facilitate direct reporting to FAERS are currently under development. Research to date has focused on challenges in usability, leaving unanswered the question of data quality assurance, another key barrier to adoption. The impact of mobile apps on pharmacovigilance practices must be quantifiable for them to be a viable data capture technology, and currently no cost-effective means of evaluation exists.

The overall goal of this project is to use artificial intelligence to quantitatively measure the impacts of mobile app design factors on the assessment value of safety reports. Our research supports wider efforts to use mobile technology to integrate PROs into pharmacovigilance practices as a means of quality assurance testing for mobile health apps. We will leverage our automated adverse event (AE) report assessment tool and software engineering expertise to build a mobile app and companion server for event data collection and report quality assessment. This will provide us a means of measuring the impact of mobile app design features on the quality of AE reports.

To achieve our research goal we will (1) research use cases, conduct competitive landscape analysis, and establish baseline usability and quality benchmarks. We will (2) build a prototype, and validate its quality and usability against our previously established benchmarks. Finally, we will (3) measure the report quality impact of selected app features.

For more information, please see the Helix Group website: http://helix.stanford.edu/about.html



DATA-INTENSIVE SYSTEMS AND TOOLS FOR MAKING COMPLEX, HIGH-VOLUME DATA ANALYTICS MORE USEFUL AND MORE ACCESSIBLE
Peter Bailis

One of our core projects is MacroBase, a new data processing engine for analyzing high-volume monitoring and business data. While Big Data infrastructure has made it increasingly cheap to collect telemetry from automated sources and sensors such as mobile devices and manufacturing equipment, human cognition hasn’t kept pace. Therefore, to help humans understand complex behaviors in complex application deployments, MacroBase combines feature extraction, classification, and model explanation at scale. Using this combination, MacroBase has successfully delivered new, previously unknown results in domains including online services, mobile application monitoring, manufacturing, and automotive, and serves as a platform for ongoing research in scalable data analytics, including time-series classification, large-scale video processing, and data visualization.

For more information, please see the website: http://www.bailis.org

NATURAL LANGUAGE PROCESSING AND CUSTOMER SUPPORT
Dan Jurafsky

Jurafsky’s research ranges widely across computational linguistics; special interests include natural language understanding, human-human conversation, the relationship between human and machine processing, and the application of natural language processing to the social and behavioral sciences.

The Natural Language Processing Group at Stanford University is a team of faculty, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. Their work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, automatic question answering, machine translation, syntactic parsing and tagging, sentiment analysis, and models of text and visual scenes, as well as applications of natural language processing to the digital humanities and computational social sciences. They provide a widely used, integrated NLP toolkit, Stanford CoreNLP. Particular technologies include their competition-winning coreference resolution system; a high-speed, high-performance neural network dependency parser; a state-of-the-art part-of-speech tagger; a competition-winning named entity recognizer; and algorithms for processing Arabic, Chinese, French, German, and Spanish text.

For a list of Jurafsky’s recent publications, please go to this website: https://web.stanford.edu/~jurafsky



STUDYING THE DARK MATTER OF THE WEB
Michael Bernstein

Is the web a perfect reflection of reality? If it were, a quick image search for “grandma” might convince you that all grandmothers are old white ladies with glasses and frizzy hair (seriously, try it). In perhaps a more serious example, a search for “chest pain” might leave a web user with anxiety about an imminent heart attack, when that outcome is actually very unlikely.

This disconnect is the web’s dark matter: the vast canyon of human experience that goes undocumented or overshadowed by the small proportion to which we ourselves, the web’s users and content generators, give disproportionate airtime. Our goal is to quantify this dark matter across the entire web. We’ll be crawling the entire web (as archived by CommonCrawl) and categorizing webpages about a variety of topics, primarily political opinions including viewpoints on marriage equality, abortion rights, marijuana legalization, and other highly-relevant issues. We will then compare the proportions of web pages supporting these issues to the offline proportions gathered via Pew and Gallup surveys. The result will tell us what impact the web is having on content production, as well as the impact it has on people when they browse it.
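The comparison between online and offline proportions described above can be sketched as follows. This is a simplified illustration, not the project's method: it assumes pages have already been classified with hypothetical labels ("support", "oppose", "neutral"), and the survey share used here is an invented number rather than a Pew or Gallup figure.

```python
def support_share(labels):
    """Fraction of 'support' pages among pages that take a side."""
    support = labels.count("support")
    oppose = labels.count("oppose")
    total = support + oppose
    if total == 0:
        raise ValueError("no pages take a side on this issue")
    return support / total


def representation_gap(page_labels, survey_share):
    """Web share minus survey share; positive means over-represented online."""
    return support_share(page_labels) - survey_share


# Invented labels and survey share, for illustration only.
page_labels = ["support", "support", "support", "oppose", "neutral"]
gap = representation_gap(page_labels, survey_share=0.5)
```

In this toy example three of the four pages that take a side are supportive, so the viewpoint is over-represented online by 25 percentage points relative to the assumed survey share.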

Professor Bernstein’s research applies a computational lens to empower large groups of people connecting and working online. He designs crowdsourcing and social computing systems that enable people to connect toward more complex, fulfilling goals. His systems have been used to convene on-demand “flash” organizations, engage over a thousand people worldwide in open-ended research, and shed light on the dynamics of antisocial behavior online.

LINKS TO OTHER PROJECTS

http://hci.stanford.edu/msb

Visual Genome: http://visualgenome.org

Anyone Can Be A Troll: https://www.technologyreview.com/s/603489/theres-a-troll-inside-all-of-us-researchers-say

Iris: https://hackernoon.com/a-conversational-agent-for-data-science-4ae300cdc220

REINFORCEMENT LEARNING
Emma Brunskill

A goal of Brunskill’s work is to increase human potential through advancing interactive machine learning. Revolutions in storage and computation have made it easy to capture and react to sequences of decisions made and their outcomes. Simultaneously, due to the rise of chronic health conditions, and demand for educated workers, there is an urgent need for more scalable solutions to assist people to reach their full potential. Interactive machine learning systems could be a key part of the solution. To enable this, Brunskill’s lab’s work spans from advancing our theoretical understanding of reinforcement learning, to developing new self-optimizing tutoring systems that we test with learners and in the classroom. Their applications focus on education since education can radically transform the opportunities available to an individual.



APPLICATIONS OF CRYPTOGRAPHY TO COMPUTER SECURITY
Dan Boneh

Professor Boneh heads the applied cryptography group and co-directs the computer security lab. Professor Boneh’s research focuses on applications of cryptography to computer security. His work includes cryptosystems with novel properties, web security, security for mobile devices, and cryptanalysis. The Applied Crypto Group is a part of the Security Lab in the Computer Science Department at Stanford University. Research projects in the group focus on various aspects of network and computer security. In particular, the group focuses on applications of cryptography to real-world security problems.

For more information, please see the website: https://crypto.stanford.edu

COMPUTATIONAL APPROACHES TO HELP ADDRESS THE SOCIETAL AND ENVIRONMENTAL CHALLENGES
Stefano Ermon

Ermon’s group researches innovative computational approaches to help address the societal and environmental challenges of the 21st century. They combine research on the foundations of artificial intelligence and machine learning with applications in science and engineering. Their work enables computers to act intelligently and adaptively in increasingly complex and uncertain real world environments.

Please see the website for more information: https://cs.stanford.edu/~ermon/group/website

FOOD PRODUCTION, FOOD SECURITY, AND THE ENVIRONMENT
David Lobell

The Lobell research group studies the interactions between food production, food security, and the environment. Their work relies heavily on using modern sensors and quantitative methods to better understand cropping systems, both in developed and developing countries. Their research focuses on agriculture and food security, specifically on generating and using unique datasets to study rural areas throughout the world. Lobell’s projects span Africa, South Asia, Mexico, and the United States, and involve a range of tools including remote sensing, GIS, and crop and climate models.

Please see the website for more information: https://lobell-lab.stanford.edu



COMPUTATIONAL LOGIC
Michael Genesereth

Stanford Logic Group’s work aims at developing innovative logic-based technologies to realize the vision of a “Declarative Enterprise”: an enterprise in which declaratively defined business policies act as executable specifications of its business operations. Stanford Logic Group’s work in the relevant time period addresses (a) development of techniques and tools for easy creation and maintenance of user-friendly web forms based on formal encoding of laws, regulations and business policies (Smart Forms), and (b) development of techniques and tools for integrated read and write access to structured data (Jabberwocky). Below we present a short summary of the conducted research and development work.

SMART FORMS

Smart Forms technology makes it so easy to create, maintain and evaluate powerful yet user-friendly web forms that they could be created and maintained by domain experts themselves. In particular, neither the creation and maintenance of Smart Forms nor the evaluation of the data entered into a Smart Form require traditional procedural coding.

JABBERWOCKY

Jabberwocky is a browser-based explorer for integrated structured data on the Web. Jabberwocky integrates structured open data from authoritative sources and makes it easy for end users to browse as well as expressively query Jabberwocky’s data graph in ad hoc fashion. Currently, in order to find answers to their complex questions, end users have to browse multiple web sites with different designs, data presentation rationales and query capabilities. Jabberwocky fills this gap.

For more information, please go to the Stanford Logic Group’s website: http://logic.stanford.edu



MINING AND MODELING LARGE SOCIAL AND INFORMATION NETWORKS
Jure Leskovec

Leskovec’s group has developed a new state-of-the-art framework, called GraphSage, for deep learning on social and biological networks. GraphSage can be used to make predictions about individual nodes in a network, for example, predicting user behavior in a social network or drug interactions. Instead of relying on hand-engineered network statistics, GraphSage automatically learns how to incorporate information from a node’s local network neighborhood in order to make predictions. Unlike previous approaches, GraphSage is capable of scaling to networks that have billions of nodes and edges while achieving state-of-the-art results on a number of common tasks, such as content recommendation in social networks and predicting drug interactions.
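The idea of incorporating information from a node's local neighborhood can be illustrated with a minimal sketch. This is not the group's implementation: it shows a single update step using one simple choice of aggregator (a mean over neighbors), and the function names, toy graph, and hand-picked identity weights are assumptions made for the example. In a real system the weights would be learned from data.

```python
def mean_aggregate(features, neighbors, node):
    """Average the feature vectors of a node's neighbors."""
    nbrs = neighbors[node]
    dim = len(features[node])
    if not nbrs:
        return [0.0] * dim
    return [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]


def update_node(features, neighbors, node, w_self, w_neigh):
    """Combine a node's own features with its neighborhood average, then apply ReLU."""
    own = features[node]
    agg = mean_aggregate(features, neighbors, node)
    out = []
    for row_self, row_neigh in zip(w_self, w_neigh):
        value = sum(w * x for w, x in zip(row_self, own))
        value += sum(w * x for w, x in zip(row_neigh, agg))
        out.append(max(0.0, value))  # ReLU non-linearity
    return out


# Toy graph and features, for illustration only.
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [4.0, 2.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weight matrices
embedding_a = update_node(features, neighbors, "a", identity, identity)
```

Because the update depends only on a node's own features and those of its neighbors, the same function can be applied to nodes that were not seen when the weights were chosen.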

Other projects include leveraging big sensor data to understand human mobility and obesity and open-domain social media analysis. Leskovec’s group used big data from smartphones tracking the activity levels of hundreds of thousands of people around the globe to understand human mobility and its relation to obesity. Considering that an estimated 5.3 million people die from causes associated with physical inactivity every year, they looked for a simple and convenient way to measure activity across millions of people to help figure out why obesity is a bigger problem in some countries than others.

Opinion analysis of consumers is traditionally done through polls and questionnaires, which are costly, cover only a small sample of the population, and cannot provide real-time updates. Widespread use of social media allows opinion analysis to be performed at a much deeper level using automatic methods that process online discussions cost-effectively, cover significantly larger populations, and provide real-time updates as new discussions are published.

For links, please go to the website: http://snap.stanford.edu

PARALLEL COMPUTING
Kunle Olukotun

The core of the Stanford Pervasive Parallelism Lab’s research agenda is to allow the domain expert to develop parallel software without becoming an expert in parallel programming. The approach is to use a layered system based on DSLs, a common parallel compiler and runtime infrastructure, and an underlying architecture that provides efficient mechanisms for communication, synchronization, and performance monitoring.

New heterogeneous architectures continue to provide increases in achievable performance, but programming these devices to reach maximum performance levels is not straightforward. The goal of the PPL is to make heterogeneous parallelism accessible to average software developers through domain-specific languages (DSLs) so that it can be freely used in all computationally demanding applications.



SENSING, RECONSTRUCTING AND BUILDING AN EXPERT SYSTEM FOR DATA CENTERS AND CLOUDS
Balaji Prabhakar

Over the past decade, the users and operators of large cloud platforms and campus networks have desired a much more programmable network infrastructure so as to configure it to the needs of different applications and reduce the friction they can cause to each other. This has culminated in the SDN paradigm, initiated at Stanford, and now widely adopted. But it is hard to program what you do not understand: the volume, velocity and richness of network applications and traffic seem beyond the ability of direct human comprehension. What is needed is an expert system that can observe the data emitted by a network during the course of its operation, continually learn the best responses to rapidly-changing load and operating conditions, and help the network adapt to them in real-time.

COMPUTER VISION, ROBOTIC PERCEPTION AND MACHINE LEARNING
Silvio Savarese

The Computational Vision and Geometry Lab (CVGL) at Stanford is directed by Prof. Silvio Savarese. Their research addresses the theoretical foundations and practical applications of computational vision. The Lab’s interest lies in discovering and proposing the fundamental principles, algorithms and implementations for solving high level visual recognition and reconstruction problems such as object and scene understanding as well as human behavior recognition in the complex 3D world.

For more information, please see the website: http://cvgl.stanford.edu/index.html

Also see information on the social navigation robot, Jackrabbot: http://cvgl.stanford.edu/projects/jackrabbot

WELD
Matei Zaharia

Weld is a runtime for improving the performance of data-intensive applications. It optimizes across libraries and functions by expressing the core computations in libraries using a small common intermediate representation, similar to CUDA and OpenCL.

Modern analytics applications combine multiple functions from different libraries and frameworks to build complex workflows. Even though individual functions can achieve high performance in isolation, the performance of the combined workflow is often an order of magnitude below hardware limits due to extensive data movement across the functions. Weld’s take on solving this problem is to lazily build up a computation for the entire workflow, optimizing and evaluating it only when a result is needed.
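The lazy approach described above can be illustrated with a small sketch. This is not Weld's API or its intermediate representation: the `Lazy` class and its methods are hypothetical, and the example only shows the general pattern of recording operations first and then running them in a single pass, without materializing an intermediate list between steps.

```python
class Lazy:
    """Records map/filter steps and runs them in one pass when evaluated."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the step is only recorded.
        return Lazy(self._data, self._ops + (("map", fn),))

    def filter(self, predicate):
        return Lazy(self._data, self._ops + (("filter", predicate),))

    def evaluate(self):
        """Apply every recorded step to each element in a single loop."""
        result = []
        for value in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    value = fn(value)
                elif not fn(value):
                    keep = False
                    break
            if keep:
                result.append(value)
        return result


calls = []


def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x


pipeline = Lazy(range(6)).map(square).filter(lambda v: v % 2 == 0).map(lambda v: v + 1)
calls_before_evaluate = len(calls)  # still 0: building the pipeline runs nothing
result = pipeline.evaluate()
```

Deferring execution until `evaluate` is called gives the runtime a view of the whole workflow, which is what makes cross-function optimization possible in the first place.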

For more information, please see the website: https://cs.stanford.edu/~matei



ADVANCED TEMPORAL LANGUAGE AIDED SEARCH
Nigam Shah

At Stanford, we have developed a search engine, called ATLAS—for Advanced Temporal Language Aided Search—to find similar patients from the patient data in the Stanford clinical data warehouse. For example, we can search for, and, in under one second, find matching patients by searching across diagnosis, billing and procedure codes, concepts extracted from textual data, laboratory test results, vital signs, as well as visit types and duration of inpatient stays. Such rapid querying across diverse data types for cohort-building is not possible at any other academic medical center in the country.

The end goal is to solve the remaining hurdles in patient matching, automated cohort building, and statistical inference so that for a specific case, we can instantly generate a report with a descriptive summary of similar patients in Stanford’s clinical data warehouse, the common treatment choices made, and the observed outcomes after specific treatment choices. This project pursues a unique opportunity to generate actionable insights from the large amounts of health data that are routinely generated as a byproduct of clinical processes.

For more information about the Shah Lab, please see the website: https://shahlab.stanford.edu

HUMAN GENOME
Michael Snyder

The Snyder laboratory was the first to perform a large-scale functional genomics project in any organism, and has developed many technologies in genomics and proteomics. These include proteome chips, high-resolution tiling arrays for the entire human genome, methods for global mapping of transcription factor binding sites (ChIP-chip, now replaced by ChIP-seq), paired-end sequencing for mapping of structural variation in eukaryotes, de novo genome sequencing using high-throughput technologies, and RNA-Seq. These technologies have been used for characterizing genomes, proteomes and regulatory networks.

Seminal findings from the Snyder laboratory include the discovery that much more of the human genome is transcribed and contains regulatory information than was previously appreciated, and that a high diversity of transcription factor binding occurs both between and within species.

Please see the Snyder Lab website for additional information: http://snyderlab.stanford.edu/Snyder.html



Russ Altman Professor of Bioengineering, of Genetics, of Medicine (General Medical Discipline), of Biomedical Data Science and, by courtesy, of Computer Science

Euan Ashley Professor of Medicine (Cardiovascular) and, by courtesy, of Pathology at the Stanford University Medical Center

Peter Bailis Assistant Professor of Computer Science

Sanjay Basu Assistant Professor of Medicine (Primary Care and Outcomes Research) and, by courtesy, of Health Research and Policy (Epidemiology)

Mohsen Bayati Associate Professor of Operations, Information and Technology at the Graduate School of Business and, by courtesy, of Electrical Engineering

Gill Bejerano Associate Professor of Developmental Biology, of Computer Science and of Pediatrics (Genetics)

Eran Bendavid Assistant Professor of Medicine (Primary Care and Population Health)

Michael Bernstein Assistant Professor of Computer Science

Dan Boneh Professor of Computer Science and of Electrical Engineering

Emma Brunskill Assistant Professor of Computer Science

E.J. Chichilnisky Professor of Neurosurgery and of Ophthalmology and, by courtesy, of Electrical Engineering

Somalee Datta Director of Research IT, SoM – IRT Research Technology

Noah Diffenbaugh Professor of Earth System Science

Stefano Ermon Assistant Professor of Computer Science

Michael Genesereth Associate Professor of Computer Science and, by courtesy, of Law

Amir Goldberg Associate Professor of Organizational Behavior in the Graduate School of Business and, by courtesy, of Sociology

Mark Horowitz Professor of Electrical Engineering and of Computer Science

Daniel Jurafsky Professor and Chair of Linguistics Department and Professor of Computer Science

Joshua Knowles Assistant Professor of Medicine (Cardiovascular Medicine)

Philip Lavori Professor of Biomedical Data Science, Emeritus

Jure Leskovec Associate Professor of Computer Science

Philip Levis Associate Professor of Computer Science and of Electrical Engineering

David Lobell Professor of Earth System Science

Phil Marshall Staff scientist at the Kavli Institute for Particle Astrophysics and Cosmology

Daniel McFarland Professor of Education and, by courtesy, of Sociology and of Organizational Behavior at the Graduate School of Business

Andrea Montanari Professor of Electrical Engineering and of Statistics

Balasubramanian Narasimhan Senior Research Scientist in Statistics and in Biomedical Data Science

Kunle Olukotun Professor of Electrical Engineering

Ayfer Ozgur Assistant Professor of Electrical Engineering

Christopher Potts Professor of Linguistics and, by courtesy, of Computer Science

Balaji Prabhakar Professor of Electrical Engineering and of Computer Science and, by courtesy, of Management Science and Engineering and of Operations, Information and Technology at the Graduate School of Business

Christopher Ré Associate Professor of Computer Science

Daniel Rubin Associate Professor of Biomedical Data Science and of Radiology (Integrative Biomedical Imaging Informatics at Stanford), of Medicine (Biomedical Informatics Research) and, by courtesy, of Ophthalmology

Silvio Savarese Associate Professor of Computer Science

Ariel Schwartzman Associate Professor of Particle Physics and Astrophysics

Nigam Shah Associate Professor of Medicine (Biomedical Informatics Research) and of Biomedical Data Science

Michael Snyder Professor in Genetics

David Tse Professor of Electrical Engineering

Risa Wechsler Associate Professor of Physics and of Particle Physics and Astrophysics

Tsachy Weissman Professor of Electrical Engineering

Matei Zaharia Assistant Professor of Computer Science and, by courtesy, of Electrical Engineering

James Zou Assistant Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering

SDSI AFFILIATED FACULTY


CORPORATE MEMBERS

FOUNDING MEMBERS

REGULAR MEMBERS

Data Science Initiative

The Stanford Data Science Initiative is focused on core data technologies with strong ties to application areas across campus.

sdsi.stanford.edu

