Yahoo!-DAIS Seminar (CS591DAI) Orientation. ChengXiang (“Cheng”) Zhai, Department of Computer Science, University of Illinois at Urbana-Champaign


Page 1: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Yahoo!-DAIS Seminar (CS591DAI) Orientation

ChengXiang (“Cheng”) Zhai

Department of Computer Science

University of Illinois at Urbana-Champaign

Page 2: Yahoo!-DAIS Seminar (CS591DAI) Orientation
Page 3: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Basic Information about CS591DAI (=Yahoo!-DAIS seminar)

• Meets at 4-5pm, Tuesdays, 0216 SC
• 1 Credit Hour
• Miss at most two talks
• At each meeting, you can expect
    • An interesting research talk
    • Opportunity for research discussion
    • An attendance sheet for collecting signatures
    • And snacks (thanks to Yahoo!)
• Seminar coordinators: Yucheng Chen, Hao Luo
• Website: http://dais.cs.uiuc.edu/seminars.html (under construction)
• Mailing list: [email protected] (make sure you subscribe to it)

Page 4: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Prof. Emerita

Bioinformatics

The DAIS Group: Data & Information Systems

Kevin Chang Jiawei Han Saurabh Sinha Marianne Winslett Cheng Zhai

Tandy Warnow (+ Bioengineering)

Hari Sundaram (+ Advertising)

Aditya Parameswaran

Jian Peng

+ many grad/undergrad students + few postdocs/visitors

Page 5: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Landscape of DAIS Research

Scalability

Intelligence

Application impact

Health/Medical/Biology

Education

Productivity (Web, email, …)

Decision making (government, business, personal)

Search

Data/Information Access

Information analysis & Data mining

Browsing

Recommendation

Decision/Task support

Intelligent information agent

Gigabytes

Terabytes

Petabytes

Storage


Page 6: Yahoo!-DAIS Seminar (CS591DAI) Orientation

DAIS & Related Areas in CS

Scalability

Intelligence

Application impact

Health/Medical/Biology

Education

Productivity (Web, email, …)

Decision making (government, business, personal)

Search

Data/Information Access

Information analysis & Data mining

Browsing

Recommendation

Decision/Task support

Intelligent information agent

Gigabytes

Terabytes

Petabytes

Storage

Artificial Intelligence

Statistics

Human-Computer Interaction, Graphics

Systems & Networking

Parallel Computing

Theory & Algorithms


Page 7: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Overview of DAIS Faculty Research

Page 8: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Kevin C. Chang: Bridging Structured and Unstructured Data

How to bring structured/semantic-rich access to the myriad and massive unstructured data, which accounts for most of the world's information?

• How to search the Web that Google does not see?
• How to reach into and across pages that Google only takes you to?
• How to listen to the buzz of the world and make sense of it?

Page 9: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Project 1. MetaQuerier: Exploring and Integrating the Deep Web

MetaExplorer
• source discovery
• source modeling
• source indexing

MetaIntegrator
• source selection
• schema integration
• query mediation

FIND sources

QUERY sources

db of dbs

unified query interface

Page 10: Yahoo!-DAIS Seminar (CS591DAI) Orientation


Data Aware Search

Project 2. WISDM: Data-aware Search over the Web

Page 11: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Demo: Entity Search-- “university of california #location”

Page 12: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Project BigSocial: Social Data Analytics

Social Data Analytics

Page 13: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Aditya Parameswaran

Started August 2014

i.stanford.edu/~adityagp

Interests: Data Management, Mining and Algorithms

Specific topics of interest
• Interactive Data Analytics
• Crowd-powered Analytics
• Approximate Analytics
• Visual Analytics and Data Mining

Page 14: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Research Style

Data Systems

Data Mining

Theory

Data Science and Applied ML

Build Real Data Analytics Tools / Systems

Design Algorithms with Guarantees

Page 15: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Research Goal: Simplifying Data Analytics

“… (in the next few years) we project a need for 1.5 million additional analysts in the United States who can analyze data effectively …” -- McKinsey Big Data Study, 2012

How do we make it easier for novice data analysts to get insights from data?

Page 16: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Simplifying Data Analytics: Four Aspects

Unstructured, Querying, Scale, Visualizations

Crowd-Powered Analytics

Interactive Analytics

Approximate Analytics

Visual Analytics

Page 17: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Example Projects

• Crowd-powered search

• Crowd-powered data extraction & cleaning

• Interactive query synthesis

• Speculative querying and caching

• Recommending visualizations automatically

• Approximate visualizations with guarantees

• Fundamental principles: cost, latency, error

• Browsing-based query processing

Crowd

Query

Visual

Approx

Page 18: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Hari Sundaram

Prof. Hari Sundaram used a separate file for his presentation, which is available at:

http://times.cs.uiuc.edu/czhai/pub/DAIS-fall14-Hari.pdf

Page 19: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Jiawei Han: Data Mining & Information Networks

How to perform data mining effectively in massive data and in heterogeneous information networks?

How to mine structures and construct networks from unstructured real-world data?

Specific research subareas:
• Effective methods for mining heterogeneous networks
• Construction of heterogeneous networks from unstructured data
• Multi-dimensional unstructured data summarization & OLAP
• Truth discovery and outlier mining in networked data
• Spatiotemporal and cyber-physical data mining (e.g., mobile objects, sensor/mobile data mining)

Page 20: Yahoo!-DAIS Seminar (CS591DAI) Orientation


Recent Research Focus on Data Mining

Network Construction, Search and Mining on real datasets
• News Network: 10M news articles + news of last 70 years
• Computer Science Research Network: DBLP + citations + abstracts + Web pages of researchers + other related web pages
• Tweet network and other social media networks
• Bio-Medical Research Network: PubMed and other medical sources
• Networkfication of Knowledge-Bases: Wikipedia, DBPedia, Freebase
• Cyber-Physical Networks (internet of things)

Other frontiers: trajectory mining, truth finding, anomaly, …

Two ARL-funded projects in 2014-2016 for Network Science Collaborative Technology Alliance (NSCTA)

Page 21: Yahoo!-DAIS Seminar (CS591DAI) Orientation


Constructing Unified, Structured Knowledge Networks

Team: H. Ji (RPI) (lead), J. Han (UIUC), G. Cao (PSU), C. Voss (ARL), W. Wallace (RPI)

Current State-of-the-Art
• Natural language processing
• Graph-based mining
• Theory of planned behavior

Army Needs and Benefits
• Exploitation of unstructured data for improved situational analysis
• Predictive tools to understand adversarial intent

Long Term Goal: Near perfect reliable network construction through progressive source processing and network refinement

Construction of latent social and information networks requires connecting the dots in documents by linking entities and understanding human behavior

Rotations: Ji: post-doc (Hong): 3 mo. at ARL; Han: student (Liu/Brova): 3 mo. at ARL; Wallace: student (Yulia): 3 mo. at ARL; each PI: visits to ARL totaling 1 wk or more. RPI students visit UIUC for one week each.

Research Topics/Technical Approach
• Automated construction of adaptable knowledge networks
    • Preliminary network construction via NLP and social network techniques
    • Exploitation of links for network refinement
    • Streaming updates
• Multi-dimension truth analysis
    • Credibility analysis by processing the interactions of the physical and social/cognitive states of the social network and its interactions with the information network
• Information processing & social/cognitive modeling
    • Build socio-cognitive models to predict human behavior
    • User-oriented and constraint-aware information processing

Page 22: Yahoo!-DAIS Seminar (CS591DAI) Orientation


Distributed, User-Oriented Multi-Scale Network Summarization and OLAP

Team: J. Han (UIUC) (lead), J. Hendler (RPI), T. Hanratty (ARL), H. Ji (RPI), B. Welles (NEU)

Current State-of-the-Art
• Summarization on relational and text data
• Social and cognitive computing
• Online analytical processing on data cubes

Army Needs and Benefits
• Summarization/visualization matched to the user cognitive abilities for improved situational awareness
• Flexible and efficient situation analysis for diverse groups of users

Long Term Goal: Support of distributed, multi-scale, multi-genre network summarization, OLAP and situation analysis for diverse user groups

Distributed, user-oriented, multi-scale summarization of networks to support online information processing and situation analysis for diverse groups of users

Rotations: Han: student (Tao/Song): 3 mo. at ARL; Ji: student (Zhang): 3 mo. at ARL; each PI: visits to ARL, total 1 wk; 3 univ. mutual-visits: PIs + students

Research Topics/Technical Approach
• Multi-scale network summarization & aggregation
    • Creation and enrichment of multi-dimension information from text and unstructured data
    • Network cube construction: conflict resolution, topical hierarchy generation, multidimensional indexing & selective, partial cube materialization
• User-oriented adaptation of network cube views
    • User- or social network-oriented OLAP
    • Accommodate dynamic updates of underlying social and/or communication networks
• Cost- and constraint-aware network cube
    • Distributed, user- or community-oriented cube views
    • Cost-, constraint- and availability-aware drilling, search and analysis
• Cognitive modeling, visualization, and supporting human decision making
• Experimental platform: news, blogs, tweets, etc.

Page 23: Yahoo!-DAIS Seminar (CS591DAI) Orientation

From Data Mining to Mining Info. Networks


Han, Kamber and Pei, Data Mining, 3rd ed. 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012

Page 24: Yahoo!-DAIS Seminar (CS591DAI) Orientation

ChengXiang Zhai: Intelligent text information management & analysis

How can we develop intelligent algorithms and systems to help people manage and exploit large amounts of text data (e.g., Web pages, blog articles, news, email, literature…)?

Two subtopics:
• Information retrieval: How can we connect the right information with the right users at the right time with minimum or no user effort?
• Text mining: How can we automatically discover useful knowledge from text? How can we mine text data together with non-textual data in an integrative manner?

Applications: Web, biomedical, health, education, …

Research methodology:
• Emphasize general & principled solutions without manual effort
• Mainly use statistical models, machine learning, and natural language processing techniques

Page 25: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Sample Project: Latent Aspect Rating Analysis

How to infer aspect ratings?

Value Location Service …..

How to infer aspect weights?

Value Location Service

Page 26: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Solution: Latent Rating Regression Model

[Figure: Reviews + overall ratings → Aspect Segmentation (+ topic model for aspect discovery) → Aspect segments → Latent Rating Regression → Term weights → Aspect Rating → Aspect Weight]

Aspect segments (term:count):
• location:1 amazing:1 walk:1 anywhere:1
• room:1 nicely:1 appointed:1 comfortable:1
• nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1

Term weights (one row per segment):
• 0.1 0.7 0.1 0.9
• 0.0 0.9 0.1 0.3
• 0.6 0.8 0.7 0.8 0.9

Aspect Rating: 1.3, 1.8, 3.8

Aspect Weight: 0.2, 0.2, 0.6
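The numbers on this slide can be checked with a small sketch. It assumes a simplified reading of the figure: each aspect rating is the sum of term count times term weight within a segment, and the overall rating is the aspect-weighted sum of the aspect ratings. The aspect labels (`aspect_a`, `aspect_b`, `aspect_c`) and helper names are hypothetical, and pairing the 0.6 weight with the five-term segment is an assumption, since the flattened slide does not preserve which row belongs to which segment. This is not the full probabilistic model.

```python
# Term weights per aspect segment, as read off the slide. Every term has
# count 1 on the slide, so each aspect rating reduces to the sum of weights.
# Aspect labels are hypothetical: the slide layout does not say which row
# of weights belongs to which segment.
term_weights = {
    "aspect_a": [0.0, 0.9, 0.1, 0.3],
    "aspect_b": [0.1, 0.7, 0.1, 0.9],
    "aspect_c": [0.6, 0.8, 0.7, 0.8, 0.9],
}

# Aspect weights from the slide; assigning 0.6 to the five-term segment
# is an assumption.
aspect_weights = {"aspect_a": 0.2, "aspect_b": 0.2, "aspect_c": 0.6}


def aspect_rating(weights, counts=None):
    """Sum of term count * term weight for one aspect segment."""
    if counts is None:
        counts = [1] * len(weights)
    return sum(c * w for c, w in zip(counts, weights))


def overall_rating(term_weights, aspect_weights):
    """Aspect-weighted sum of the aspect ratings."""
    return sum(
        aspect_weights[name] * aspect_rating(weights)
        for name, weights in term_weights.items()
    )


ratings = {name: round(aspect_rating(w), 2) for name, w in term_weights.items()}
overall = round(overall_rating(term_weights, aspect_weights), 2)
print(ratings)
print(overall)
```

Under these assumptions the three aspect ratings come out to 1.3, 1.8 and 3.8, matching the slide, and the weighted overall rating is 2.9.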

Page 27: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Aspect-Based Opinion Summarization

Page 28: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Reviewer Behavior Analysis & Personalized Ranking of Entities

People like cheap hotels because of good value

People like expensive hotels because of good service

Query: 0.9 value 0.1 others

Non-Personalized

Personalized

Page 29: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Tandy Warnow

Page 30: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Gene Tree Estimation: first align, then construct the tree

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1

S4

S2

S3

Sounds easy, but every good approach is NP-hard, and statistical methods (based on stochastic models of evolution) are very slow.

Accuracy is essential, datasets are big, and they are also messy.

Species tree estimation is even harder, because gene trees can be different from the species tree!
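As a rough illustration of the "first align" step on this slide, the sketch below takes the four gapped sequences, confirms that removing gaps recovers the unaligned sequences, and computes pairwise mismatch fractions over columns where neither sequence has a gap. The distance measure and the function names (`degap`, `p_distance`) are assumptions chosen for illustration; the slide does not say how the tree is built from the alignment, and a distance table like this would only be the input to one family of tree-building methods.

```python
from itertools import combinations

# Aligned sequences copied from the slide ('-' marks a gap).
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def degap(seq):
    """Remove gap characters to recover the unaligned sequence."""
    return seq.replace("-", "")


def p_distance(a, b):
    """Fraction of mismatches over columns where neither sequence has a gap.

    Returns None when the two sequences share no ungapped column.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)


# All rows of an alignment must have the same length.
lengths = {len(seq) for seq in aligned.values()}

# Pairwise distances, keyed by (name, name) in insertion order.
distances = {
    (i, j): p_distance(aligned[i], aligned[j])
    for i, j in combinations(aligned, 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

S2 and S3 agree on every column they share, so their distance is zero, while S1 and S2 differ at one shared column.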

Page 31: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Avian Phylogenomics Project

• Approx. 50 species, whole genomes
• 8000+ genes, UCEs

G Zhang, BGI
MTP Gilbert, Copenhagen
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
T. Warnow, UT-Austin
Erich Jarvis, HHMI
Plus many many other people…

Challenges:
• Maximum likelihood tree estimation on multi-million-site sequence alignments
• Massive gene tree incongruence

My students and I developed a new technique (“Statistical Binning”) to enable a statistical estimation of the avian species tree, taking gene tree incongruence into account (both papers under review in Science)

Page 32: Yahoo!-DAIS Seminar (CS591DAI) Orientation

1kp: Thousand Transcriptome Project

• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)
• Gene Tree Incongruence

G. Ka-Shu Wong, U Alberta
N. Wickett, Northwestern
J. Leebens-Mack, U Georgia
N. Matasci, iPlant
T. Warnow, S. Mirarab, N. Nguyen, Md. S. Bayzid, UT-Austin
Plus many many other people…

Challenges:
• Multiple sequence alignment of datasets with > 100,000 sequences
• Gene tree incongruence

My students and I developed:
• ASTRAL: a technique to estimate species trees on large datasets (ECCB), and used it to analyze this dataset (under review in PNAS)
• UPP: new multiple sequence alignment method that can analyze up to 1,000,000 sequences (in preparation)

Page 33: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Current Projects

Computer science and mathematics issues:
• Heuristics for NP-hard optimization problems
• Graph algorithms
• Statistical estimation on messy data
• Mining sets of trees/alignments
• High Performance Computing
• Mathematical modelling
• Probabilistic analysis of algorithms

Bioinformatics Problems
• Multiple Sequence Alignment
• Gene Tree Estimation
• Species tree estimation (when gene trees conflict)
• Genome rearrangement phylogeny
• Phylogenetic network estimation
• Metagenomic data analysis

Also: Computational Historical Linguistics

Page 34: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Jian Peng: Machine learning for computational biology

Biological data

Machine learning

Knowledge

Page 35: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Disease-related genes

Functional homologs

Experiments

Biological data integration

Hypothesis

Page 36: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Biological data integration: network biology

Modeling information diffusion on biological networks

Integrating networks from multiple species

Inference of gene function from network data

Page 37: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Other computational biology projects

Protein science
• Structure prediction
• Protein folding
• Viral proteins

Translational bioinformatics
• Drug discovery and optimization
• Drug repositioning

Genomics
• Large-scale read mapping
• Algorithms for genome assembly

Page 38: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Machine learning: modeling complex data

Graphical models
• Latent variable models
• Efficient learning and inference algorithms
• Causal/correlation structures
• Applications to protein folding, gene expression analysis and biological network construction

Learning representations for heterogeneous data
• Low-dimensional embedding for network, text and molecular data sets

Learning structured prediction with complex loss functions
• Applications to biology, computer vision, speech recognition and natural language processing

Page 39: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Saurabh Sinha: Bioinformatics

How is information about us encoded in our DNA? How does this information evolve, giving rise to what Darwin called “endless forms most beautiful”?

Research questions:
• Gene regulation: How are genes turned on and off in precisely orchestrated ways?
• Comparative genomics: What can we learn by comparing genomes of tens of different species?
• Regulatory evolution: Can we build a mathematical model of evolution?
• Genomics of behavior: How does DNA encode animal behavior?


http://www.sinhalab.net/

Page 40: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Genomics of behavior: honeybee

• What causes older bees to be more aggressive than younger ones?
• What causes Africanized bees to be more aggressive than European ones?
• What causes a bee to become aggressive if you annoy it?
• DNA sequence analysis shows that origins of aggression are the same!

Page 41: Yahoo!-DAIS Seminar (CS591DAI) Orientation

Genomics of aging

Find genes associated with aging, by searching the DNA sequence for certain patterns.

Knock down one such gene; old cells became young!

Page 42: Yahoo!-DAIS Seminar (CS591DAI) Orientation

A complete bioinformatics pipeline


From cells … to data … to analysis … to hypotheses & experiments

Page 43: Yahoo!-DAIS Seminar (CS591DAI) Orientation

QUESTIONS?
