Yahoo!-DAIS Seminar (CS591DAI) Orientation
ChengXiang (“Cheng”) Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Basic Information about CS591DAI (=Yahoo!-DAIS seminar)
• Meets at 4-5pm, Tuesdays, 0216 SC
• 1 Credit Hour
• Miss at most two talks
• At each meeting, you can expect
  - An interesting research talk
  - Opportunity for research discussion
  - An attendance sheet for collecting signatures
  - And snacks (thanks to Yahoo!)
• Seminar coordinators: Yucheng Chen, Hao Luo
• Website: http://dais.cs.uiuc.edu/seminars.html (under construction)
• Mailing list: [email protected] (make sure you subscribe to it)
The DAIS Group: Data & Information Systems
Kevin Chang, Jiawei Han, Saurabh Sinha, Marianne Winslett, Cheng Zhai
Tandy Warnow (+ Bioengineering)
Hari Sundaram (+ Advertising)
Aditya Parameswaran
Jian Peng
+ many grad/undergrad students + a few postdocs/visitors
Slide labels: Prof. Emerita; Bioinformatics
Landscape of DAIS Research
[Figure: DAIS research landscape along three axes]
Scalability: Storage; Gigabytes; Terabytes; Petabytes
Intelligence: Data/Information Access; Search; Browsing; Recommendation; Information analysis & Data mining; Decision/Task support; Intelligent information agent
Application impact: Health/Medical/Biology; Education; Productivity (Web, email, …); Decision making (government, business, personal)
DAIS & Related Areas in CS
[Figure: the DAIS research landscape, annotated with related areas in CS]
Scalability: Storage; Gigabytes; Terabytes; Petabytes
Intelligence: Data/Information Access; Search; Browsing; Recommendation; Information analysis & Data mining; Decision/Task support; Intelligent information agent
Application impact: Health/Medical/Biology; Education; Productivity (Web, email, …); Decision making (government, business, personal)
Related areas in CS: Artificial Intelligence; Statistics; Human-Computer Interaction, Graphics; Systems & Networking; Parallel Computing; Theory & Algorithms
Overview of DAIS Faculty Research
Kevin C. Chang: Bridging Structured and Unstructured Data
• How to bring structured/semantic-rich access to the myriad and massive unstructured data, which accounts for most of the world's information?
• How to search the Web that Google does not see?
• How to reach into and across pages that Google only takes you to?
• How to listen to the buzz of the world and make sense of it?
Project 1. MetaQuerier: Exploring and Integrating the Deep Web
MetaExplorer
• source discovery
• source modeling
• source indexing
MetaIntegrator
• source selection
• schema integration
• query mediation
Figure labels: FIND sources; QUERY sources; db of dbs; unified query interface
Data Aware Search
Project 2. WISDM: Data-aware Search over the Web
Demo: Entity Search -- “university of california #location”
Project BigSocial: Social Data Analytics
Social Data Analytics
Aditya Parameswaran
Started August 2014 i.stanford.edu/~adityagp
Interests: Data Management, Mining and Algorithms
Specific topics of interest
• Interactive Data Analytics
• Crowd-powered Analytics
• Approximate Analytics
• Visual Analytics and Data Mining
Research Style
Data Systems; Data Mining; Theory
Data Science and Applied ML
Build Real Data Analytics Tools / Systems
Design Algorithms with Guarantees
Research Goal: Simplifying Data Analytics
“… (in the next few years) we project a need for 1.5 million additional analysts in the United States who can analyze data effectively …” -- McKinsey Big Data Study, 2012
How do we make it easier for novice data analysts to get insights from data?
Simplifying Data Analytics: Four Aspects
Unstructured, Querying, Scale, Visualizations
Crowd-Powered Analytics
Interactive Analytics
Approximate Analytics
Visual Analytics
Example Projects
• Crowd-powered search
• Crowd-powered data extraction & cleaning
• Interactive query synthesis
• Speculative querying and caching
• Recommending visualizations automatically
• Approximate visualizations with guarantees
• Fundamental principles: cost, latency, error
• Browsing-based query processing
Crowd / Query / Visual / Approx
Hari Sundaram
Prof. Hari Sundaram used a separate file for his presentation, which is available at:
http://times.cs.uiuc.edu/czhai/pub/DAIS-fall14-Hari.pdf
Jiawei Han: Data Mining & Information Networks
How to perform data mining effectively in massive data and in heterogeneous information networks?
How to mine structures and construct networks from unstructured real-world data?
Specific research subareas:
• Effective methods for mining heterogeneous networks
• Construction of heterogeneous networks from unstructured data
• Multi-dimensional unstructured data summarization & OLAP
• Truth discovery and outlier mining in networked data
• Spatiotemporal and cyber-physical data mining (e.g., mobile objects, sensor/mobile data mining)
Recent Research Focus on Data Mining
• Network Construction, Search and Mining on Real datasets
  - News Network: 10M news articles + news of last 70 years
  - Computer Science Research Network: DBLP + citations + abstracts + Web pages of researchers + other related web pages
  - Tweet network and other social media networks
  - Bio-Medical Research Network: PubMed and other medical sources
  - Networkfication of Knowledge-Bases: Wikipedia, DBPedia, Freebase
  - Cyber-Physical Networks (internet of things)
• Other frontiers: trajectory mining, truth finding, anomaly, …
• Two ARL-funded projects in 2014-2016 for Network Science Collaborative Technology Alliance (NSCTA)
Constructing Unified, Structured Knowledge Networks
Team: H. Ji (RPI) (lead), J. Han (UIUC), G. Cao (PSU), C. Voss (ARL), W. Wallace (RPI)
Current State-of-the-Art
• Natural language processing
• Graph-based mining
• Theory of planned behavior
Army Needs and Benefits
• Exploitation of unstructured data for improved situational analysis
• Predictive tools to understand adversarial intent
Long Term Goal: Near-perfect, reliable network construction through progressive source processing and network refinement
Constructing latent social and information networks requires connecting the dots in documents by linking entities and understanding human behavior
Rotations: Ji: post-doc (Hong): 3 mo. at ARL; Han: student (Liu/Brova): 3 mo. at ARL; Wallace: student (Yulia): 3 mo. at ARL; each PI: visits to ARL totaling 1 wk or more. RPI students visit UIUC for one week each.
Research Topics/Technical Approach
• Automated construction of adaptable knowledge networks
  - Preliminary network construction via NLP and social network techniques
  - Exploitation of links for network refinement
  - Streaming updates
• Multi-dimension truth analysis
  - Credibility analysis by processing the interactions of the physical and social/cognitive states of the social network and its interactions with the information network
• Information processing & social/cognitive modeling
  - Build socio-cognitive models to predict human behavior
  - User-oriented and constraint-aware information processing
Distributed, User-Oriented Multi-Scale Network Summarization and OLAP
Team: J. Han (UIUC) (lead), J. Hendler (RPI), T. Hanratty (ARL), H. Ji (RPI), B. Welles (NEU)
Current State-of-the-Art
• Summarization on relational and text data
• Social and cognitive computing
• Online analytical processing on data cubes
Army Needs and Benefits
• Summarization/visualization matched to the user cognitive abilities for improved situational awareness
• Flexible and efficient situation analysis for diverse groups of users
Long Term Goal: Support of distributed, multi-scale, multi-genre network summarization, OLAP and situation analysis for diverse user groups
Distributed, user-oriented, multi-scale summarization of networks to support online information processing and situation analysis for diverse groups of users
Rotations: Han: student (Tao/Song): 3 mo. at ARL; Ji: student (Zhang): 3 mo. at ARL; each PI: visits to ARL, total 1 wk; 3 univ. mutual-visits: PIs + students
Research Topics/Technical Approach
• Multi-scale network summarization & aggregation
  - Creation and enrichment of multi-dimension information from text and unstructured data
  - Network cube construction: conflict resolution, topical hierarchy generation, multidimensional indexing & selective, partial cube materialization
• User-oriented adaptation of network cube views
  - User- or social network-oriented OLAP
  - Accommodate dynamic updates of underlying social and/or communication networks
• Cost- and constraint-aware network cube
  - Distributed, user- or community-oriented cube views
  - Cost-, constraint- and availability-aware drilling, search and analysis
• Cognitive modeling, visualization, and supporting human decision making
  - Experimental platform: news, blogs, tweets, etc.
From Data Mining to Mining Info. Networks
Han, Kamber and Pei, Data Mining, 3rd ed., 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012
ChengXiang Zhai: Intelligent text information management & analysis
How can we develop intelligent algorithms and systems to help people manage and exploit large amounts of text data (e.g., Web pages, blog articles, news, email, literature…)?
Two subtopics:
• Information retrieval: How can we connect the right information with the right users at the right time with minimum or no user effort?
• Text mining: How can we automatically discover useful knowledge from text? How can we mine text data together with non-textual data in an integrative manner?
Applications: Web, biomedical, health, education, …
Research methodology:
• Emphasize general & principled solutions without manual effort
• Mainly use statistical models, machine learning, and natural language processing techniques
Sample Project: Latent Aspect Rating Analysis
How to infer aspect ratings?
Value Location Service …..
How to infer aspect weights?
Value Location Service
Solution: Latent Rating Regression Model
[Figure: Reviews + overall ratings, then Aspect Segmentation, then Latent Rating Regression]
Aspect segments (term:count):
  location:1 amazing:1 walk:1 anywhere:1
  nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1
  room:1 nicely:1 appointed:1 comfortable:1
Term weights:
  0.1 0.7 0.1 0.9
  0.0 0.9 0.1 0.3
  0.6 0.8 0.7 0.8 0.9
Aspect Rating: 1.3, 1.8, 3.8
Aspect Weight: 0.2, 0.2, 0.6
+ Topic model for aspect discovery
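The arithmetic behind the Latent Rating Regression figure can be sketched as a toy forward pass: each aspect rating is the weighted sum of the term counts in that aspect's segment, and the overall rating is the weighted sum of the aspect ratings. The sketch below is illustrative only. The pairing of each weight vector with a segment is an assumption (inferred from matching lengths, since the slide extraction does not preserve it), and the aspect names and function name are hypothetical labels. In the actual model, the term weights and aspect weights are latent and are inferred from reviews and overall ratings; here they are simply given.

```python
# Toy forward pass for Latent Rating Regression (illustrative sketch only).
# ASSUMPTION: the segment <-> weight-vector pairing and the aspect names
# below are guesses for illustration; they are not stated on the slide.

segments = {
    "location": {"location": 1, "amazing": 1, "walk": 1, "anywhere": 1},
    "service": {"nice": 1, "accommodating": 1, "smile": 1,
                "friendliness": 1, "attentiveness": 1},
    "room": {"room": 1, "nicely": 1, "appointed": 1, "comfortable": 1},
}

term_weights = {
    "location": [0.1, 0.7, 0.1, 0.9],
    "service": [0.6, 0.8, 0.7, 0.8, 0.9],
    "room": [0.0, 0.9, 0.1, 0.3],
}

aspect_weights = {"location": 0.2, "room": 0.2, "service": 0.6}


def aspect_rating(counts, weights):
    """Weighted sum of term counts for one aspect segment."""
    return sum(w * c for w, c in zip(weights, counts.values()))


ratings = {a: aspect_rating(segments[a], term_weights[a]) for a in segments}
overall = sum(aspect_weights[a] * ratings[a] for a in ratings)

for aspect, value in ratings.items():
    print(f"{aspect}: {value:.1f}")
print(f"overall: {overall:.1f}")
```

Under this pairing the three aspect ratings come out as 1.8, 3.8, and 1.3, which matches the set of aspect ratings shown on the slide.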
Aspect-Based Opinion Summarization
Reviewer Behavior Analysis & Personalized Ranking of Entities
People like cheap hotels because of good value
People like expensive hotels because of good service
Query: 0.9 value 0.1 others
Non-Personalized
Personalized
Tandy Warnow
Gene Tree Estimation: first align, then construct the tree
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S4, S2, S3]
Sounds easy, but every good approach is NP-hard, and statistical methods (based on stochastic models of evolution) are very slow.
Accuracy is essential, datasets are big, and they are also messy.
Species tree estimation is even harder, because gene trees can be different from the species tree!
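As a rough illustration of "first align, then construct the tree", the sketch below takes the four aligned sequences from the slide, computes pairwise p-distances (the fraction of mismatches over columns where neither sequence has a gap), and picks the four-taxon topology with the four-point method: of the three possible pairings, the one with the smallest sum of within-pair distances. This is a deliberately simple distance-based stand-in, not the maximum likelihood or statistical methods the slide refers to, and the function names are hypothetical.

```python
# Toy "align, then build the tree" sketch for four taxa (illustrative only).
# ASSUMPTION: p-distance plus the four-point method is used here as a simple
# stand-in; it is not the method used in the work described on the slides.

alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def p_distance(a, b):
    """Fraction of mismatching sites over columns where both rows are ungapped."""
    shared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not shared:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in shared) / len(shared)


def quartet_topology(aln):
    """Return the pairing ((a, b), (c, d)) with the smallest distance sum."""
    a, b, c, d = sorted(aln)
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        (p, q), (r, s) = pairing
        return p_distance(aln[p], aln[q]) + p_distance(aln[r], aln[s])

    return min(pairings, key=cost)


split = quartet_topology(alignment)
print(split)
```

On this alignment the chosen split groups S1 with S4 and S2 with S3. With real data, short overlaps between gappy rows make such distances noisy, which is one reason alignment quality matters for the downstream tree.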
Avian Phylogenomics Project
• Approx. 50 species, whole genomes
• 8000+ genes, UCEs
People: Erich Jarvis (HHMI); G. Zhang (BGI); MTP Gilbert (Copenhagen); T. Warnow, S. Mirarab, Md. S. Bayzid (UT-Austin); plus many, many other people…
Challenges:
• Maximum likelihood tree estimation on multi-million-site sequence alignments
• Massive gene tree incongruence
My students and I developed a new technique (“Statistical Binning”) to enable a statistical estimation of the avian species tree, taking gene tree incongruence into account (both papers under review in Science)
1kp: Thousand Transcriptome Project
• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)
• Gene Tree Incongruence
People: G. Ka-Shu Wong (U Alberta); N. Wickett (Northwestern); J. Leebens-Mack (U Georgia); N. Matasci (iPlant); T. Warnow, S. Mirarab, N. Nguyen, Md. S. Bayzid (UT-Austin); plus many, many other people…
Challenges:
• Multiple sequence alignment of datasets with > 100,000 sequences
• Gene tree incongruence
My students and I developed
• ASTRAL: a technique to estimate species trees on large datasets (ECCB), used to analyze this dataset (under review in PNAS)
• UPP: a new multiple sequence alignment method that can analyze up to 1,000,000 sequences (in preparation)
Current Projects
Computer science and mathematics issues:
• Heuristics for NP-hard optimization problems
• Graph algorithms
• Statistical estimation on messy data
• Mining sets of trees/alignments
• High Performance Computing
• Mathematical modelling
• Probabilistic analysis of algorithms
Bioinformatics Problems
• Multiple Sequence Alignment
• Gene Tree Estimation
• Species tree estimation (when gene trees conflict)
• Genome rearrangement phylogeny
• Phylogenetic network estimation
• Metagenomic data analysis
Also: Computational Historical Linguistics
Jian Peng: Machine learning for computational biology
[Figure labels: Biological data; Machine learning; Knowledge; Disease-related genes; Functional homologs; Experiments; Biological data integration; Hypothesis]
Biological data integration: network biology
Modeling information diffusion on biological networks
Integrating networks from multiple species
Inference of gene function from network data
Other computational biology projects
• Protein science: structure prediction, protein folding, viral proteins
• Translational bioinformatics: drug discovery and optimization, drug repositioning
• Genomics: large-scale read mapping, algorithms for genome assembly
Machine learning: modeling complex data
• Graphical models
  - Latent variable models
  - Efficient learning and inference algorithms
  - Causal/correlation structures
  - Applications to protein folding, gene expression analysis and biological network construction
• Learning representations for heterogeneous data
  - Low-dimensional embedding for network, text and molecular data sets
• Learning structured prediction with complex loss functions
  - Applications to biology, computer vision, speech recognition and natural language processing
Saurabh Sinha: Bioinformatics
• How is information about us encoded in our DNA?
• How does this information evolve, giving rise to what Darwin called “endless forms most beautiful”?
Research questions:
• Gene regulation: How are genes turned on and off in precisely orchestrated ways?
• Comparative genomics: What can we learn by comparing genomes of tens of different species?
• Regulatory evolution: Can we build a mathematical model of evolution?
• Genomics of behavior: How does DNA encode animal behavior?
http://www.sinhalab.net/
Genomics of behavior: honeybee
• What causes older bees to be more aggressive than younger ones?
• What causes Africanized bees to be more aggressive than European ones?
• What causes a bee to become aggressive if you annoy it?
• DNA sequence analysis shows that the origins of aggression are the same!
Genomics of aging
Find genes associated with aging, by searching the DNA sequence for certain patterns.
Knock down one such gene; old cells became young!
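One simple way to picture "searching the DNA sequence for certain patterns" is a motif scan: count how often a short pattern occurs near each gene and rank genes by that count. The sketch below is only an illustration of the idea. The motif, gene names, and sequences are all hypothetical, and the slides do not say which patterns or scoring method were actually used.

```python
import re

# Toy motif scan (illustrative sketch only).
# ASSUMPTION: the motif, gene names, and sequences are made up for this
# example; they do not come from the slides or from any real dataset.
MOTIF = "TGATAAG"

# A lookahead pattern lets overlapping occurrences be counted as well.
pattern = re.compile(f"(?={MOTIF})")

regulatory_regions = {
    "geneA": "ACGTTGATAAGCCTGATAAGTT",
    "geneB": "ACGTACGTACGT",
    "geneC": "TTTGATAAGAAA",
}


def motif_count(sequence):
    """Number of (possibly overlapping) motif occurrences in a sequence."""
    return len(pattern.findall(sequence))


counts = {gene: motif_count(seq) for gene, seq in regulatory_regions.items()}

# Rank candidate genes by how many motif hits their region contains.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

A real analysis would typically use position weight matrices and a statistical background model rather than exact string matches; the exact-match scan is kept here only because it is easy to follow.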
A complete bioinformatics pipeline
From cells … to data … to analysis … to hypotheses & experiments
QUESTIONS?