Current Research in Data Mining Research Group. Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign. Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing. November 8, 2014.
Current Research in Data Mining Research Group
Jiawei Han
Data Mining Research Group
Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
April 20, 2023
Outline
An Introduction to Data Mining Research Group
Mining and OLAPing Information Networks
Mining Heterogeneous Information Networks
Mining Text-Rich Information Networks
OLAPing (Multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
Taming the Web: WINACS (Integrated mining of Web structures and contents)
Mining Cyber-Physical Systems and Networks
Conclusions
Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC
Mining patterns and knowledge discovery from massive data
Data mining in heterogeneous information networks
Exploring broad applications of data mining
Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
600+ research papers in conferences and journals
Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
Project lead for NASA EventCube for Aviation Safety [2008-2012]
Director of Information Network Academic Research Center funded by Army Research Lab (ARL) [2009-2014]
New Books on Data Mining & Link Mining
Han, Kamber and Pei, Data Mining, 3rd ed., 2011
Yu, Han and Faloutsos (eds.), Link Mining, 2010
Sun and Han, Mining Heterogeneous Information Networks, 2012
Mining Heterogeneous Information Networks
RankClus/NetClus
[Figure: network linking conferences (SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI) and authors (Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice); panel labels: Objects, Ranking]
RankCompete: A Competing Random Walk Model for Rank-Based Clustering
Top-5 ranked conferences
Database | Data Mining | AI | IR
VLDB | KDD | IJCAI | SIGIR
SIGMOD | SDM | AAAI | ECIR
ICDE | ICDM | ICML | CIKM
PODS | PKDD | CVPR | WWW
EDBT | PAKDD | ECML | WSDM
Top-5 ranked terms
Database | Data Mining | AI | IR
data | mining | learning | retrieval
database | data | knowledge | information
query | clustering | reasoning | web
system | classification | logic | search
xml | frequent | cognition | text
RankClass [KDD11]
Knowledge Propagation in Heterogeneous Network
Similarity Search and Role Discovery in Information Networks
PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
Which images are most similar to me in Flickr?
Path: ITI
Path: ITIGITI
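PathSim scores two objects of the same type by the number of path instances that connect them along a symmetric meta path, normalized by each object's number of paths back to itself: s(x, y) = 2 * |paths x to y| / (|paths x to x| + |paths y to y|). The following is a minimal sketch, reading ITI as image-tag-image and assuming a small made-up image-by-tag matrix; the data and helper names are illustrative and do not come from the talk.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a dense matrix."""
    return [list(col) for col in zip(*m)]


def pathsim(commuting, x, y):
    """PathSim between objects x and y for a symmetric meta path.

    commuting[i][j] is the number of path instances from i to j.
    """
    denom = commuting[x][x] + commuting[y][y]
    return 2 * commuting[x][y] / denom if denom else 0.0


# Hypothetical toy data: rows are images, columns are tags (1 = image has tag).
image_tag = [
    [1, 1, 0],  # image 0
    [1, 1, 0],  # image 1
    [0, 1, 1],  # image 2
]

# Commuting matrix of the image-tag-image path: M = W * W^T.
iti = matmul(image_tag, transpose(image_tag))

# Similarity of every image to image 0.
scores = [pathsim(iti, 0, j) for j in range(len(image_tag))]
```

A longer symmetric path such as ITIGITI would be handled the same way, by chaining more relation matrices before calling `pathsim`.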
[Figure: a “dirty” information network (imaginary) and the cleaned/inferred adversarial network, with automatically inferred roles: Chief, Insurgent, Cell Lead]
Role Discovery in Information Networks [KDD’10]
Advisee | Top Ranked Advisor | Time | Note
David M. Blei | 1. Michael I. Jordan | 01-03 | PhD advisor, 2004
David M. Blei | 2. John D. Lafferty | 05-06 | Postdoc, 2006
Hong Cheng | 1. Qiang Yang | 02-03 | MS advisor, 2003
Hong Cheng | 2. Jiawei Han | 04-08 | PhD advisor, 2008
Sergey Brin | 1. Rajeev Motwani | 97-98 | Unofficial advisor
Meta-Paths & Their Prediction Power
List all the meta-paths in bibliographic network up to length 4
Investigate their respective power for coauthor relationship prediction
Which meta-path has more prediction power?
How to combine them to achieve the best quality of prediction?
Relationship Prediction in Heterogeneous Info Networks
Why Prediction of Co-Author Relationship in DBLP?
Prediction of relationships between different types of nodes in heterogeneous networks
E.g., what papers should Faloutsos write?
Traditional link prediction: homogeneous networks
Co-author networks in DBLP, friendship networks in Facebook
Relationship prediction
Study the roles of topological features in heterogeneous networks in predicting the co-author relationship building
Meta-path guided prediction!
Y. Sun, et al., "Co-Author Relationship Prediction in Heterog. Bibliographic Networks", ASONAM'11, July 2011
Guidance: Meta Path in Bibliographic Network
Relationship prediction: meta path-guided prediction
Meta path relationships among similar typed links share similar semantics and are comparable and inferable
[Figure: bibliographic network schema with node types paper, topic, venue, and author; relations publish/publish^-1, mention/mention^-1, write/write^-1, contain/contain^-1, cite/cite^-1]
Co-author prediction (A—P—A) using topological features also encoded by meta paths, e.g., citation relations between authors (A—P→P—A)
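One common way to turn a meta path into a topological feature is to count its path instances, which amounts to multiplying the relation matrices along the path. The sketch below does this for the co-author path (author-paper-author) and the citation path (author-paper-cites-paper-author) on a made-up network; the matrices and function names are assumptions for illustration only.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a dense matrix."""
    return [list(col) for col in zip(*m)]


def path_count(*relations):
    """Chain relation matrices left to right; entry [i][j] counts path instances."""
    result = relations[0]
    for rel in relations[1:]:
        result = matmul(result, rel)
    return result


# Hypothetical toy network: 3 authors, 3 papers.
writes = [      # writes[a][p] = 1 if author a wrote paper p
    [1, 1, 0],  # author 0
    [0, 1, 0],  # author 1
    [0, 0, 1],  # author 2
]
cites = [       # cites[p][q] = 1 if paper p cites paper q
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
written_by = transpose(writes)  # inverse relation write^-1

apa = path_count(writes, written_by)          # author-paper-author
appa = path_count(writes, cites, written_by)  # author-paper-cites-paper-author
```

Here `apa[0][1]` is the number of papers authors 0 and 1 share, and `appa[0][2]` is the number of times a paper by author 0 cites a paper by author 2. Counts like these can then be fed to a classifier as features, one per meta path.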
Case Study in CS Bibliographic Network
The learned significance for each meta path under measure “normalized path count” for HP-3hop dataset
Case Study: Predicting Concrete Co-Authors
High quality predictive power for such a difficult task
Using data in T0 =[1989; 1995] and T1 = [1996; 2002]
Predict new coauthor relationship in T2 = [2003; 2009]
iTopicModel: Model Set-Up & Objective Function
Graphical model: θi = (θi1, θi2, …, θiT): topic distribution for document xi
Structural Layer: follows the same topology as the document network
Text Layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(θi), then pick a word w ~ multi(βz)
Objective function: joint probability
X: observed text information
G: document network
Parameters
θ: topic distribution
β: word distribution
θ is the most critical; it needs to be consistent with the text as well as the network structure
Structure part and text part: can model them separately!
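The text layer's generative step (pick a topic from the document's topic distribution, then pick a word from that topic's word distribution) can be sketched as follows. This is only an illustration of the PLSA-style sampling described on the slide, with made-up parameter values and a hypothetical function name; it does not include the structural layer or any parameter estimation.

```python
import random


def generate_document(theta_i, beta, vocab, n_words, rng):
    """Draw words for one document: z ~ multi(theta_i), then w ~ multi(beta[z])."""
    topics = range(len(theta_i))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta_i)[0]  # topic for this word
        w = rng.choices(vocab, weights=beta[z])[0]   # word given the topic
        words.append((z, w))
    return words


# Hypothetical parameters: 2 topics over a 4-word vocabulary.
vocab = ["query", "index", "cluster", "pattern"]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits the first two words
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits the last two words
]
theta_i = [0.7, 0.3]       # topic distribution for document x_i

doc = generate_document(theta_i, beta, vocab, n_words=20, rng=random.Random(0))
```

Because the two toy topics use disjoint words, every sampled word reveals which topic produced it, which makes the two-stage draw easy to inspect.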
Case Study: Topic Hierarchy Building for DBLP
Probabilistic Topic Models with Network-Based Biased Propagation
Text-rich heterogeneous information network
Ubiquitous textual documents (news, papers)
Connect with users and other objects: Topic propagation
Deng, Han et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11
How to discover latent topics and identify clusters of multi-typed objects simultaneously?
How can text data and heterogeneous information network mutually enhance each other in topic modeling and other text mining tasks?
Biased Topic Propagation
Intuition: InfoNet provides valuable information
Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
Treat documents with rich text and other objects without explicit text in different ways
Topic(D): inherent text + connected U
Topic(U): connected D
Basic Criterion: (Biased Topic Propagation)
The topic of an object without explicit text depends on the topic of the documents it connects to
The topic of a document is correlated with its objects to some extent, and should be principally determined by the inherent content of its text
A simple and unbiased topic propagation does not make much sense
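The criterion above can be illustrated with one round of asymmetric propagation: objects without text take their topic entirely from connected documents, while documents keep most of their text-derived topic and move only slightly toward their neighbors. This is a rough sketch under stated assumptions; the mixing weight `bias`, the simple averaging, and all names are hypothetical and are not the update rules of the KDD'11 model.

```python
def average(vectors):
    """Element-wise mean of equal-length topic vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def propagate(doc_topics, doc_users, bias=0.2):
    """One round of biased propagation between documents (D) and text-less objects (U).

    doc_topics: {doc: topic distribution estimated from the document's own text}
    doc_users:  {doc: [connected objects without explicit text]}
    bias:       hypothetical weight given to the network side for documents
    """
    # Topic(U): determined entirely by the documents the object connects to.
    user_docs = {}
    for doc, users in doc_users.items():
        for user in users:
            user_docs.setdefault(user, []).append(doc)
    user_topics = {u: average([doc_topics[d] for d in docs])
                   for u, docs in user_docs.items()}

    # Topic(D): mostly the inherent text, nudged toward its connected objects.
    new_doc_topics = {}
    for doc, topics in doc_topics.items():
        users = doc_users.get(doc, [])
        if not users:
            new_doc_topics[doc] = list(topics)
            continue
        neighbor = average([user_topics[u] for u in users])
        new_doc_topics[doc] = [(1 - bias) * t + bias * n
                               for t, n in zip(topics, neighbor)]
    return new_doc_topics, user_topics


# Hypothetical example: two documents on different topics share one user.
doc_topics = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
doc_users = {"d1": ["u1"], "d2": ["u1"]}
new_docs, users = propagate(doc_topics, doc_users, bias=0.2)
```

With `bias=0.2` the shared user ends up split evenly between the two topics, while each document stays dominated by its own text, which is the asymmetry the slide argues for.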
Incorporating Heterogeneous Info. Network
L(C): Topic model
R(G): Biased propagation
Experiments: DBLP & NSF Awards
Data Collection: DBLP, NSF-Awards
Metrics: Accuracy (AC), Normalized mutual information (NMI)
Results
Event Cube: An Overview
[Figure: a multidimensional text database is organized into an Event Cube representation with dimensions Time (1998, 1999; 98.01, 98.02, 99.01, 99.02), Location (CA, FL, TX; LAX, SJC, MIA, AUS), and Topic (Deviation, Encounter; overshoot, undershoot, birds, turbulence), navigated by drill-down and roll-up; analysis support for the analyst includes multidimensional OLAP, ranking, cause analysis, and topic summarization/comparison]
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events
Funded by NASA (2008-2010)
Text/Topic Cube: General Idea
Heterogeneous: categorical attributes + unstructured text
How to combine? Our solution:
Record columns: ACN | Time | Location | Place | Environment | … | Event Report
Cube: Categorical Attributes
Text data: Event Report
Text/Topic Model: Unstructured Text
Measure: Term/Topic | Weight
T1 | W1
T2 | W2
T3 | W3
… | …
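The idea of a cube whose cells are addressed by categorical attributes and whose measure is a term-weight vector can be sketched as below. This is a simplified illustration, assuming raw term counts as the weights and a made-up set of reports; the field names, the airport-to-state mapping table, and both functions are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical reports: categorical attributes plus free text.
reports = [
    {"time": "1998", "airport": "LAX", "text": "overshoot runway overshoot"},
    {"time": "1998", "airport": "SJC", "text": "birds near runway"},
    {"time": "1999", "airport": "MIA", "text": "turbulence on approach"},
]
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}


def build_cube(rows, dims):
    """Group reports by the chosen dimensions; the measure is a term-count vector."""
    cube = defaultdict(Counter)
    for row in rows:
        cell = tuple(row[d] for d in dims)
        cube[cell].update(row["text"].split())
    return cube


def roll_up_location(cube):
    """Roll up (time, airport) cells to (time, state) by merging term counts."""
    rolled = defaultdict(Counter)
    for (time, airport), terms in cube.items():
        rolled[(time, AIRPORT_TO_STATE[airport])].update(terms)
    return rolled


base = build_cube(reports, dims=("time", "airport"))
by_state = roll_up_location(base)
```

Rolling up merges the text measures of the child cells, so the 1998 California cell combines the terms from the LAX and SJC reports. A topic model could replace the raw counts as the measure without changing the cube structure.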
Effective Keyword Search
TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube.
Effective OLAP Exploration
TEXplorer (submitted): Integrating keyword-based ranking and OLAP exploration
Effective Event Tracking
PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
[Figure: example query “Healthcare Reform”, with term sets such as “debate, cost, senate, …”, “pass, success, law, …”, and “benefit, profit, effective, …”]
Growing Parallel Paths (WWW 2011)
[Figure: DOM trees of Pages A to F, with tag paths such as HTML, DIV, UL, LI, A and HTML, TABLE, TR, TD, A leading to link targets B to F and U to Z; the paths are numbered 1 to 6 and the resulting path is shown]
Mapping Pages to Records (CIKM’10)
[Figure: Web pages under the cs.illinois.edu root (/people, /people/faculty, /people/faculty/jiawei-han, /people/faculty/dan-roth, /people/faculty/vikram-adve, /research, /research/areas/data) and personal sites (www.cs.illinois.edu/homes/hanj/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, rsim.cs.illinois.edu/~sadve/) are mapped to structured data records with fields such as Name, URL, and Zipcode, for faculty including Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, and Jiawei Han]
Database records can be found on link paths!
WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.
Discovery of Swarms and Periodic Patterns in Moving Object Data
A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases", SIGMOD’10 (system demo)
Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
← Bird flying paths shown on Google Earth
Mined periodic patterns by our new method →
← Convoy discovers only restricted patterns
Swarm discovers more patterns →
GeoTopic Discovery: Mining Spatial Text
[Figure: results of LDM, TDM, GeoFolk, and LGTA on geo-tagged photos with landscape (coast vs. desert vs. mountain)]
Z. Yin, et al., “GeoTopic Discovery and Comparison”, WWW'11
Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks
Most data objects are linked, forming heterogeneous information networks
Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Structures can be progressively mined from less organized data sets by info. network analysis
Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …
It is promising to mine data semantics from rich info. networks!
References for the Talk
J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks” (tutorial), KDD'10.
M. Ji, J. Han, and M. Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD'11.
Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09.
Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD'09.
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB'11.
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM'11.
C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD'10.
T. Weninger, M. Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD'11 (system demo).
X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6), 2008.