Data and Information Systems Laboratory, University of Illinois Urbana-Champaign
CS 512, Jan 18, 2010
WinaCS Project
Web Entity Extraction and Mapping
Discovering and Propagating Context
Tim Weninger
Department of Computer Science
University of Illinois Urbana-Champaign, Urbana, IL
Past, Present, Future
Past – Entity search and retrieval is one of the dreams of the Web – TBL
Present – Ranking and Retrieval, bi-directional approach
1) Information Networks
2) Web mining and Information Extraction
   a) List Finding
   b) Entity-page Discovery
   c) Entity-page Mapping
Future – InfoBase Project, Information extraction via Schema Discovery
Finding lists on the Web is Hard! (KDD Explorations Dec. 2010)
1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)
Why is finding lists important?
• Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett
• Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Sarita Adve • Tarek Abdelzaher • Vikram Adve • Gul Agha • …
• Charu Aggarwal • Deepayan Chakrabarti • Ed Chang • Kevin Chang • Olivier Chapelle • Chris Clifton • Jiawei Han • …
CORRECTION, INFERENCE, DISAMBIGUATION, RECOMMENDATION, ETC.
Our list finding algorithm (Accepted: WWW 2011)
List Finding for Entity Page Discovery
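[Figure: DOM tree of an example page. Under the HTML root, DIV elements contain UL/LI lists whose items carry links (hrefX, hrefY, hrefZ and hrefA to hrefF), and P elements carry single links (hrefG, hrefH). Two parts of the tree are marked Data Region 1 and Data Region 2.]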
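The DOM tree on this slide suggests one simple way to look for candidate lists: links that sit on the same root-to-anchor tag path are likely list members, while a link on a path of its own (such as a lone P link) is not. The following is a minimal sketch under that assumption, in the spirit of tag path clustering; it is not the WWW 2011 algorithm itself. The names `TagPathCollector`, `candidate_lists`, `VOID_TAGS` and the sample HTML are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never get an end tag


class TagPathCollector(HTMLParser):
    """Groups link targets by the root-to-anchor tag path they sit on."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # tags that are currently open
        self.groups = defaultdict(list)  # tag path -> hrefs found on that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.groups["/".join(self.stack)].append(href)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_lists(html, min_items=2):
    """Return tag paths that carry at least `min_items` links (assumed threshold)."""
    parser = TagPathCollector()
    parser.feed(html)
    return {p: hrefs for p, hrefs in parser.groups.items() if len(hrefs) >= min_items}


# Hypothetical page: one three-item list and one stand-alone link.
sample = """
<html><body>
<div><ul>
  <li><a href="hrefA">A</a></li>
  <li><a href="hrefB">B</a></li>
  <li><a href="hrefC">C</a></li>
</ul></div>
<p><a href="hrefG">G</a></p>
</body></html>
"""
lists = candidate_lists(sample)
```

Note that grouping by tag path alone merges sibling lists that share the same path, so separating regions such as Data Region 1 and Data Region 2 would need additional evidence.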
Growing Parallel Paths (Accepted: WWW 2011)
[Figure: tag paths to links on Pages A to F. Recoverable labels: list-style paths such as HTML > DIV > UL > LI > A and TABLE > TR > TD > A, paragraph-style paths such as HTML > DIV > P > A, link labels X, Y, Z, W, U, V, the numbers 1 to 6, and the captions "Path" and "Result:".]
Mapping Pages to Records (CIKM’10)
Structured Data (record fields shown: Name, URL, Zipcode):
Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, Jiawei Han
Mappings
Web Pages:
llvm.cs.uiuc.edu/~vadve/Home.html
rsim.cs.illinois.edu/~sadve/
www.cs.illinois.edu/homes/hanj/
l2r.cs.uiuc.edu/~danr/
Mapping Pages to Records (CIKM’10)
[Figure: link tree of the department site, with anchor texts and URLs as recoverable from the slide]
/ (root) [cs.illinois.edu]
  People (/people)
    Faculty (/people/faculty)
      Jiawei Han (/people/faculty/jiawei-han) > Personal Site (www.cs.illinois.edu/homes/hanj/)
      Dan Roth (/people/faculty/dan-roth) > Personal Site (l2r.cs.uiuc.edu/~danr/)
      Vikram Adve (/people/faculty/vikram-adve) > Personal Site (llvm.cs.uiuc.edu/~vadve/Home.html)
  Research
    Data Mining (URL text on the slide: "/research/research /areas/data")
      Jiawei Han > Personal Site
      Dan Roth > Personal Site
Example
Ap1 = {People, Faculty, Dan Roth, Personal Site}
Ap2 = {Research, Data Mining, Dan Roth, Personal Site}
Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}
Sorted Bag of Anchors: A_{u;v1} = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
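The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the CIKM'10 implementation: the two anchor paths and the denominators are copied from the slide, but the slide does not say what the denominators count, so `normalizers` is treated here as a given per-anchor total. The function name `sorted_bag_of_anchors` is hypothetical.

```python
from collections import Counter

# Anchor-text paths leading to Dan Roth's personal site, as listed on the slide.
ap1 = ["People", "Faculty", "Dan Roth", "Personal Site"]
ap2 = ["Research", "Data Mining", "Dan Roth", "Personal Site"]

# Denominators copied from the slide; what they count is an assumption here.
normalizers = {
    "Dan Roth": 2, "Research": 2, "Data Mining": 2,
    "Personal Site": 5, "People": 3, "Faculty": 3,
}


def sorted_bag_of_anchors(paths, totals):
    """Count each anchor over all paths, normalize, and sort by descending score."""
    bag = Counter(anchor for path in paths for anchor in path)
    scores = {anchor: count / totals[anchor] for anchor, count in bag.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = sorted_bag_of_anchors([ap1, ap2], normalizers)
for anchor, score in ranked:
    print(f"{anchor}: {score:.2f}")
```

The anchor "Dan Roth" scores highest because every counted occurrence lies on a path to the target page, whereas a generic anchor such as "Personal Site" is diluted by its larger denominator.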
CSMap
Locations of the top 25 computer science departments, automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
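Extracting and ranking 5-digit numbers can be sketched as follows. The slide does not state the ranking function, so this sketch assumes the simplest one: each entity page casts one vote for every distinct 5-digit number it contains, and the number with the most votes is taken as the department's zip code. The function name `rank_five_digit_numbers` and the page texts are hypothetical.

```python
import re
from collections import Counter

# A run of exactly five digits, not embedded in a longer number or word.
FIVE_DIGITS = re.compile(r"\b\d{5}\b")


def rank_five_digit_numbers(pages):
    """Rank 5-digit numbers by how many pages mention them (one vote per page)."""
    votes = Counter()
    for text in pages:
        votes.update(set(FIVE_DIGITS.findall(text)))
    return votes.most_common()


# Made-up text standing in for the entity pages of one department.
pages = [
    "Professor, Department of Computer Science, Urbana, IL 61801",
    "Office 2132. Mailing address: Urbana, IL 61801. Grant no. 10452.",
    "Contact: Urbana, Illinois 61801-2302, phone 217-555-0100",
]
ranking = rank_five_digit_numbers(pages)
```

Voting per page rather than per occurrence keeps a single page that repeats an unrelated number (a grant or room number, for instance) from outvoting the address shared across pages.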
Next Steps: The hard part!
Infer categories/schemas from a set of Web pages
Example:
What do these entities have in common?
Name, Address, ZipCode, Publications, Collaborators, Organizations
How can we infer this schema? Wikipedia?
How can we populate it?
Idea! Propagating schemas
Next Steps: The hardest part!
Name              Address  ZipCode  Organizations  Collaborators  Publications
Jiawei Han        A        1        FK             FK             FK
Tarek Abdelzaher  B        2        FK             FK             FK
Gerald DeJong     C        3        FK             FK             FK
Michael Heath     D        4        FK             FK             FK
This can be modeled as a heterogeneous information network.
Thus, ranking and clustering are possible.
So are semantic search, keyword search, and typal search.
Cube operations are possible
Given Inferred
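One way to picture the heterogeneous information network mentioned on this slide is as a graph whose nodes carry a type (Name, Address, ZipCode, and so on) and whose edges connect each entity to its attribute values. The following is a minimal sketch under that assumption. The rows reuse the slide's own placeholder values (A to D, 1 to 4); the FK columns are left out because the slide gives no values for them. The function name `build_network` is hypothetical.

```python
from collections import defaultdict

# Rows from the slide's table; Address and ZipCode values are the slide's placeholders.
records = [
    {"Name": "Jiawei Han", "Address": "A", "ZipCode": "1"},
    {"Name": "Tarek Abdelzaher", "Address": "B", "ZipCode": "2"},
    {"Name": "Gerald DeJong", "Address": "C", "ZipCode": "3"},
    {"Name": "Michael Heath", "Address": "D", "ZipCode": "4"},
]


def build_network(rows, key="Name"):
    """Build an undirected graph over typed nodes, each a (type, value) pair."""
    graph = defaultdict(set)
    for row in rows:
        entity = (key, row[key])
        for attr, value in row.items():
            if attr == key:
                continue
            node = (attr, value)
            graph[entity].add(node)  # entity -> attribute value
            graph[node].add(entity)  # attribute value -> entity
    return graph


network = build_network(records)
neighbors = network[("Name", "Jiawei Han")]
```

Because every node keeps its type, two entities that share a value (the same ZipCode node, for example) become connected through it, which is the kind of structure that ranking and clustering over the network would rely on.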
WinaCS – An information network based Web search engine
Questions? Challenges?