net2-project
Evolving Web, Evolving Search
Yuan Tian & Tianqi Chen
Apex Data & Knowledge Management Lab
Shanghai Jiao Tong University
Agenda
• Introduction to SJTU
• Introduction to Apex Lab
• Research
• Demo
Shanghai Jiao Tong University
• Location
• History
• Student
• Campus
Apex Lab
Director: Professor Yong Yu
Associate Professor: Guirong Xue
Research
• Web Search
• Social Web
• Semantic Search
• Machine Learning
• Image Search
Project Partners
Ph.D. Students: Haofen Wang, Jing Lu, Jia Chen, Guangcan Liu, Xian Wu, Yunbo Cao, Ruihua Song
35 Master Students
Research
• Traditional Web
• Social Web
• Semantic Web
• Machine Learning
Search on Traditional Web
Focus on how to
• improve search relevance?
• rank pages?
• integrate mining technologies into search?
• search finer-grained objects instead of documents?
Search Applications
• General search engine
• Vertical search engine
• Meta search engine
Expert Search
Expert Search (introduction)
• Treat web page as bag of words
• Queries are not fully understood
Expert Search (motivation)
Searching for Experts:
• A more and more important information need
• PM search for Dev
• Patient search for Doctor
• Student search for Professor
• ……
• Not only in Enterprise
• But also on WWW
Query -> Ranked List of Experts
An Evidence: an expert and a query co-occur in a document under certain relation constraint
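The evidence definition above can be sketched in a few lines. This is a minimal illustration, not the lab's system: the function name `rank_experts`, the single-token expert names, and the token `window` used as a stand-in for the "relation constraint" are all assumptions made for the example.

```python
from collections import defaultdict


def rank_experts(documents, experts, query_terms, window=10):
    """Score each expert by counting co-occurrences with query terms.

    An expert mention and a query term count as one piece of evidence when
    both appear in the same document within `window` tokens of each other.
    The window is an assumed stand-in for the slide's "relation constraint".
    """
    query = {t.lower() for t in query_terms}
    scores = defaultdict(int)
    for doc in documents:
        tokens = doc.lower().split()
        query_pos = [i for i, tok in enumerate(tokens) if tok in query]
        for expert in experts:
            name = expert.lower()  # assumption: expert names are single tokens
            expert_pos = [i for i, tok in enumerate(tokens) if tok == name]
            for e in expert_pos:
                scores[expert] += sum(1 for q in query_pos if abs(e - q) <= window)
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, s) for name, s in ranked if s > 0]


docs = [
    "alice wrote the ranking module for web search",
    "bob presented a poster and later alice discussed search ranking",
    "bob maintains the build scripts",
]
print(rank_experts(docs, ["alice", "bob"], ["search", "ranking"]))
```

A real system would replace the token window with richer constraints (same sentence, same section, authorship fields) and would normalize by document length.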
The Emergence of Web 2.0
Web gets social
Web 1.0 -> Web 2.0
Publishing -> Participation
Personal Websites -> Blogging
Content Management Systems -> Wikis
Britannica Online -> Wikipedia
Directories (taxonomy) -> Tagging ("folksonomy")
Lower the barrier for contribution. More people are involved. They are less professional.
Search on Web 2.0
Focus on how to
• elaborate user involved data?
• search on new social media?
Deegle
(WWW 2006, WWW 2007, SIGIR 2008)
[Screenshot: related facets, related tags, search results, related users]
Emotion Analysis on the Blog
• A blog can be a source of news, but also a stage for expressing emotion
• Enhancing blog search for different users
Blog Search
Two types of blog
Informative article
• News that is similar to the news on traditional news websites
• Technical descriptions, e.g. programming techniques
• Commonsense knowledge
• Objective comments on the events in the world
Affective article
• Diaries about personal affairs
• Self-feelings or self-emotions descriptions
Intent-driven blog search (WWW 2007)
No. | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | -0.94 | Crazy Me! I have hesitated between Acer and smuggled IBM for one week. I wouldn't have taken into account the price, quality or service if I had enough money …
3 | 1.00 | Selling IBM laptop, t22p3-900, dvd S3/, independent accelerating display card. 3550 YUAN. (Post fee not included). Please contact 30316255. We guarantee the quality. This product is only sold within Tianjing city ...
4 | -0.35 | I got a laptop from my friend this week. Although outdated, it is still a classical one in IBM enthusiast's mind. There are many second hand IBM laptops in the market. Although I have sold many IBM laptops …
5 | -0.53 | Doctor said that I should make more preparations mentally. You have stayed with me for three years, leaving without any words. Do you feel fair for me? Do you remember the moments we were together? You are heartless, I hate you! ...
No. | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | 1.00 | Biao Lin is a military talent. Stalin called him "the gifted general". Americans called him "the unbeaten general". Chiang Kai-shek called him "devil of war". Biao Lin is a special person in modern history …
3 | 0.99 | Microsoft's hotmail can only be registered with suffix "@hotmail.com" by default. You can register @msn.com by visiting…
4 | 0.95 | Yi Shang is still sending the file to me. I will practice it later. 1. Start up Instance (db2inst1) db2start; 2. Stop Instance (db2inst1) db2stop …
5 | 0.84 | Name: Lei Zhang. Student number: 5030309959. Class number: 007. The analysis and review about the tendency of Jilin Chemical Industry's stock in 2005. Date, Increasing and Decreasing ranges, Open Price, Close Price, Amount of deals …
6 | 0.01 | Recently I like reading the Buddhist Scripture. I can learn philosophies in it. It makes me comfortable. It is from ...
7 | -0.11 | It's out of my mind when I first saw it. The water seemed to be exuding from the building. There was much water on the floor of education building. Water was all around us, anywhere you can touch had water. …
8 | -0.51 | I read an article about the last emperor Po-yee today. I have watched "The Last Emperor" before, which realistically described his life without losing artistry. His love impressed me. As an emperor, he can't choose the one he loved …
9 | -0.53 | She is 164 in height with white skin, black hair and long limp leg. I like the girl who has long hair and likes sport and dancing. I like sweet girls. …
10 | -0.94 | I have many things to do at the end of this semester. There are five final examinations, Discrete Mathematics, Communication Theory, Architecture of Computer, Algorithm and Law. I know little about them. OMG! Only four weeks are left. There are also two projects, Compiler and Operation System. Compiler can be easily completed but Operation System …
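One way to read the tables above is that the informative sense score, which runs from about -1 (affective) to +1 (informative), can reorder results according to what the searcher wants. The sketch below shows only that reordering step; how the score itself is computed is not described in the slides, and the intent names "informative" and "affective" are hypothetical labels chosen for this example.

```python
def rerank(results, intent):
    """Order (snippet_no, informative_sense) pairs for a given search intent.

    Scores near +1 mark informative posts and scores near -1 mark affective
    posts. The intent names are hypothetical, not taken from the paper.
    """
    if intent not in ("informative", "affective"):
        raise ValueError("intent must be 'informative' or 'affective'")
    descending = intent == "informative"
    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(results, key=lambda r: r[1], reverse=descending)


# Scores of the five "IBM" snippets from the first table.
ibm = [(1, 1.00), (2, -0.94), (3, 1.00), (4, -0.35), (5, -0.53)]
print([no for no, _ in rerank(ibm, "informative")])
print([no for no, _ in rerank(ibm, "affective")])
```

In practice the score would more likely be blended with topical relevance than used as the only sort key.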
Our Vision of Semantic Web Search
• It covers most of the important topics in SW
• A lot of tools are built in each layer
• 10+ top papers (WWW'09, SIGMOD'09, SIGMOD'08, VLDB'07, ICDE'09, ISWC'07, etc)
Knowledge Engineering Layer
Ontology Engineering
• Orient: Integrating Ontology Engineering into Industry Tooling Environment (ISWC 2004)
Ontology Learning & Population
• EachWiki: Facilitating Semantics Reuse for Wikipedia Authoring (ISWC/ASWC 2007)
• PORE: Semi-supervised Positive Only Relation Extraction from Wikipedia (ISWC/ASWC 2007)
• HS Explorer: Unsupervised Hierarchical Semantics Explorer for Social Annotations (ISWC/ASWC 2007)
• Catriple: Extracting Triples from Wikipedia Categories (ASWC 2008)
Indexing and Search Layer
Ontology Query Engine based on DBMS
• SOR: A Practical System for OWL Ontology Storage, Reasoning and Search (VLDB 2007, SIGMOD 2008)
Annotation-based Semantic Search Engine (DB + IR)
• CE2: Towards Large Scale Annotation-based Semantic Search (CIKM 2008)
An Extension to IR index for Relational Search
• Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data (ISWC/ASWC 2007, ASWC 2008, WWW 2009, JWS)
Pattern-based RDF Store
SOR
Semantic Object Repository
Based on IBM DB2
Supports T-Box reasoning
Semplore
Extension to traditional IR engine
Ranking is considered
CE^2
Search over semantically annotated corpus
Combination of DB and IR search engines
Pattern-based RDF store
• Learning to materialize join results
• Efficient retrieval of pattern matches
• Reasonable extra space -> Significant performance increase (on some dataset)
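The space-for-time trade described for the pattern-based RDF store can be illustrated with a toy triple store. This sketch covers only materializing and reusing one two-hop join; the "learning" step that decides which patterns are worth materializing is not shown, and the class `PatternStore` and its methods are hypothetical names, not the lab's implementation.

```python
from collections import defaultdict


class PatternStore:
    """Toy triple store that can materialize a two-hop join pattern."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.materialized = {}

    def join(self, p1, p2):
        """Evaluate (?x p1 ?y) . (?y p2 ?z) with a hash join."""
        by_subject = defaultdict(list)
        for s, p, o in self.triples:
            if p == p2:
                by_subject[s].append(o)
        return sorted(
            (x, y, z)
            for x, p, y in self.triples
            if p == p1
            for z in by_subject.get(y, [])
        )

    def materialize(self, p1, p2):
        """Spend extra space once so later lookups skip the join."""
        self.materialized[(p1, p2)] = self.join(p1, p2)

    def match(self, p1, p2):
        """Return matches from the materialized copy when one exists."""
        cached = self.materialized.get((p1, p2))
        return cached if cached is not None else self.join(p1, p2)


store = PatternStore([
    ("paper1", "author", "alice"),
    ("paper2", "author", "bob"),
    ("alice", "worksAt", "SJTU"),
])
store.materialize("author", "worksAt")
print(store.match("author", "worksAt"))
```

The stored result goes stale if triples are added after `materialize` is called, so a real store would need to maintain or invalidate it.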
Query Interface and User Interaction Layer
Keyword Interface for Semantic Search
• Q2Semantic: Lightweight Ontology based Keyword Interpretation for Semantic Search (ESWC 2008, ICDE 2009)
Natural Language Interface for Semantic Search
• PANTO: A Portable Natural Language Interface to Ontologies (ESWC 2007)
Snippet Generation
• Snippet Generation for Semantic Web Search Engines (ASWC 2008)
Ontology Presentation
• ZoomRDF: Semantic-driven Fisheye Zooming for RDF Data (WWW 2010)
Q2Semantic
Structured queries vs. keyword queries
Structural data
RDF Snippet
Representation of search results
How will you know which answers are most relevant?
ZoomRDF
How to make them as a whole?
We focused on Semantic Web search
• Closed corpus / one single data source involved
• Just scale to million triples
• Uncertainty is not fully considered or used
We need Semantic Web search, however
• More than 11 million data sources (Web heterogeneity)
• More than 2 billion triples (Scalability)
• Uncertainty everywhere
Thus, we carefully consider the following topics
• Pay as you go for semantic data integration
• Semantic search engine towards billion triples
• User-friendly query Interface for Semantic Web
Hermes (2nd place Billion Triple Challenge, SIGMOD 2009, JWS)
[Figure: Hermes architecture]
1. Integrate and index data sources
2. Understand user's need
3. Search and refine
User interaction: input keywords (e.g. "Article Stanford Turing Award"), select a query, refine or navigate. The interface shows results (e.g. "Rudi Studer, Semantic Web...") and suggestions (e.g. "Affiliations...").
Components shown: Distributed Query Processing, Schema-level Mapping, Data-level Mapping, Graph Data Processing, Keyword Translation, Element Label Extraction, Keyword Mapping, Top-k Query Graph Search, Local Query Processing, Query Graph Decomposition, Result Combination & Ranking, Data Graph Summarization, Query Planning & Optimization, Graph Element Scoring, Mapping Discovery, Indexing.
Internal indices: Keyword Index, Schema Index, Mapping Index, Structure Index (graph indices).
Heterogeneous Transfer Learning
Machine Learning Team
APEXLAB
Shanghai Jiao Tong University
Machine Learning Team in APEX
Focus on machine learning and its application in Web mining and IR.
• Transfer learning
• Advertising techniques in Web
• Short text classification & clustering
• Multilingual search result integration
Outline
• Introduction to heterogeneous transfer learning
• Cross media: Text -> Image
  Clustering
  Classification
• Cross language: English -> Chinese
• Application: Visual Contextual Advertising
Traditional machine learning
• Training data and test data are in the same distribution.
[Figure: training data (news) and test data]
Transfer learning
• Transfer learning: distributions are not identical.
[Figure: training data (news) and test data]
Heterogeneous Transfer Learning
• Learning across different feature spaces.
Example text: "A fixed-wing aircraft, typically called an airplane, aeroplane or simply plane, is an aircraft capable of flight using forward motion that generates lift as the wing moves through the air…"
Example text: "An automobile, motor car or car is a wheeled motor vehicle used for transporting passengers, which also carries its own engine or motor..."
[Figure: training data (text do…, label truncated in the source) and test data]
Related Areas of Heterogeneous Learning
Multiple Domain Data, split by feature space among domains:
Heterogeneous feature space, split by instance correspondences among domains:
• Each instance in one domain has its correspondences in other domains: Multi-view Learning
• There are few or no instance correspondences among domains: Heterogeneous Transfer Learning
Homogeneous feature space, split by data distribution among domains:
• Different: Transfer Learning across Different Distributions
• Same: Traditional Machine Learning
[Figure: source domain and target domain, with example texts "Apple is a fruit that can be found …" and "Banana is the common name for…"]
Text to Images
[Dai et al. NIPS 2008] [Lin et al. APWeb 2010]
• Mining and learning from multimedia data is becoming increasingly important
• Limited by scarce labeled image data, can we use abundant text data in the Web?
• Our answer is YES
Objective
[Figure: learning from input to output through translation; labels otherwise unrecoverable]
Basic Ideas
Exploiting co-occurrence data as a bridge between text and image
Data Sets
• Documents from ODP
• Images from Caltech-256
Experimental Result
Approach 2: Naïve Bayes Way
[Lin et al. APWeb 2010]
[Figure: model with conditional probabilities P(w|c) and P(v|w)]
Text-aided Image Classification (TAIC)
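The two conditional probabilities named on this slide, P(w|c) and P(v|w), suggest chaining text and image statistics through words. The sketch below assumes that c is a class label, w a word, and v an image feature, and that the chain is marginalized as P(v|c) = sum over w of P(v|w) P(w|c). Those readings, the function names, and the toy numbers are assumptions for illustration; the paper's actual model may differ.

```python
def image_feature_given_class(p_w_given_c, p_v_given_w):
    """Compute P(v|c) = sum over w of P(v|w) * P(w|c).

    p_w_given_c: {class: {word: P(w|c)}}, e.g. estimated from labeled text.
    p_v_given_w: {word: {feature: P(v|w)}}, e.g. from co-occurrence data.
    """
    result = {}
    for c, words in p_w_given_c.items():
        dist = {}
        for w, pwc in words.items():
            for v, pvw in p_v_given_w.get(w, {}).items():
                dist[v] = dist.get(v, 0.0) + pvw * pwc
        result[c] = dist
    return result


def classify(features, p_v_given_c, prior):
    """Naive Bayes decision: argmax over c of P(c) * product of P(v|c)."""
    best, best_score = None, -1.0
    for c, dist in p_v_given_c.items():
        score = prior[c]
        for v in features:
            score *= dist.get(v, 1e-9)  # tiny floor for unseen features
        if score > best_score:
            best, best_score = c, score
    return best


# Toy, made-up probabilities purely for illustration.
p_w_given_c = {
    "animal": {"fur": 0.8, "wheel": 0.2},
    "vehicle": {"fur": 0.1, "wheel": 0.9},
}
p_v_given_w = {
    "fur": {"texture": 0.7, "circle": 0.3},
    "wheel": {"texture": 0.2, "circle": 0.8},
}
p_v_given_c = image_feature_given_class(p_w_given_c, p_v_given_w)
prior = {"animal": 0.5, "vehicle": 0.5}
print(classify(["circle", "circle"], p_v_given_c, prior))
```

For images with many features, summing log-probabilities instead of multiplying avoids numeric underflow.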
Experiments: TAIC
Data sets: 9 binary classification data sets and 5 six-class classification data sets
• Image data from Caltech-256 and Fifteen scene
• Auxiliary text data from Open Directory Project
Baseline methods
• Base classifiers: Naïve Bayes (NBC) and Support vector machine (SVM)
Evaluation 1: Classification
 | Heterogeneous TL | No-Heterogeneous TL
Average Error Rate | 0.318 | 0.334
4.8% error reduction
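The reported reduction is consistent with a relative change measured against the baseline without heterogeneous transfer. A minimal check, with `relative_reduction` as a helper name introduced here for illustration:

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - improved) / baseline


# Average error rates from the table: 0.334 without and 0.318 with heterogeneous TL.
print(f"{relative_reduction(0.334, 0.318):.1%}")  # prints 4.8%
```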
Text-aided Image Clustering
[Yang et al. ACL 2009]
• Image clustering is an effective method for increasing accessibility of image search results
[Figure: "Apple" = one image OR another image]
• But traditional clustering methods do not work well with a small amount of data
• We consider using annotated images in the social Web to help image clustering
Annotated PLSA Model for Clustering
Leveraging the auxiliary text data by using the topics as a bridge
[Figure: words from auxiliary data (from Flickr) and image features from image data, linked through topics Z]
Making the transfer…
• Log-likelihood objective function
• Two parts: image features and auxiliary text features
• Image feature to image instance correlation: A
• Word feature to image feature correlation: B
\[ \mathcal{L} = \sum_j \left[ \lambda \sum_i \frac{A_{ij}}{\sum_{j'} A_{ij'}} \log P(f_j \mid v_i) + (1 - \lambda) \sum_l \frac{B_{lj}}{\sum_{j'} B_{lj'}} \log P(f_j \mid w_l) \right] \]
The denominators are normalization terms, \lambda is the trade-off parameter, and the two summands are the likelihood terms of the two parts.
Data sets: Generated from Caltech-256 and 15-scene corpora
Baseline methods Baseline clustering methods: KMeans, PLSA and STC Strategies:
clustering on target image data only combined: clustering target image data and annotated image combined: clustering target image data and annotated image
data together and evaluate result for target image data
71
Experimental Result
[Figure: entropy of KM_Seperate, KM_Combine, PLSA_Seperate, PLSA_Combine, STC and aPLSA]
 | Heterogeneous TL | No-Heterogeneous TL
Average Entropy | 0.741 | 0.786
5.7% entropy reduction
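Entropy here is a lower-is-better clustering score, and the reported 5.7% matches the relative change (0.786 - 0.741) / 0.786. The slides do not spell out the exact entropy formula, so the sketch below uses a common definition as an assumption: the size-weighted average of the class-label entropy inside each cluster, in bits. The function name `clustering_entropy` is introduced for illustration.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average of the label entropy inside each cluster (base 2)."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    total = len(class_labels)
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        # Entropy of the true class labels that fell into this cluster.
        h = -sum((n / len(labels)) * math.log2(n / len(labels)) for n in counts.values())
        entropy += (len(labels) / total) * h
    return entropy


# Pure clusters give 0; clusters that mix two classes evenly give 1 bit.
print(clustering_entropy([0, 0, 1, 1], ["frog", "frog", "kayak", "kayak"]))
print(clustering_entropy([0, 0, 1, 1], ["frog", "kayak", "frog", "kayak"]))
```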
Clustering Results on Caltech256 [Griffin et al. TR 2007]
[Figure: image clusters for frog, kayak, bear, jesus-christ, watch]
Cross-language Classification
[Ling et al. WWW 2008]
[Figure: Text Classification. A classifier learns from labeled Chinese Web pages and classifies unlabeled Chinese Web pages]
Cross-language Classification
Much labelled data in English, but little in Chinese.
Labeled Data | English | Chinese
News | Reuters-21578 | ?
Newsgroups | 20 Newsgroups | ?
Web pages | Open Directory Project (> 1M) | Very few ODP data (< 20k, ~ 1%)
Cross-language Classification
[Figure: Cross-language Classification. A classifier learns from labeled English Web pages and classifies unlabeled Chinese Web pages]
Cross-language Classification
Information Bottleneck
• X: signals to be encoded (Web pages)
• \tilde{X}: codewords (class labels)
• Y: features related to X (terms)
Cross-language Classification
Optimization
[Figure: objective to minimize, annotated "Information betw…" (label truncated in the source) and "Minimize this distance"]
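The objective on this slide is truncated, so the following is only a sketch of the general information bottleneck idea rather than the paper's formulation. It assumes the quantity of interest is the information about the terms Y that is lost when Web pages X are replaced by class labels, written here as I(X;Y) - I(X_tilde;Y), with a hard assignment of pages to classes. All function names and the toy joint distribution are hypothetical.

```python
import math


def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as a list of rows."""
    row_sums = [sum(row) for row in joint]
    col_sums = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (row_sums[i] * col_sums[j]))
    return mi


def merge_rows(joint, assignment, n_clusters):
    """Collapse rows of p(x, y) into p(x_tilde, y) using a hard assignment."""
    merged = [[0.0] * len(joint[0]) for _ in range(n_clusters)]
    for row, cluster in zip(joint, assignment):
        for j, p in enumerate(row):
            merged[cluster][j] += p
    return merged


def information_loss(joint, assignment, n_clusters):
    """I(X;Y) - I(X_tilde;Y): information about Y lost by the encoding."""
    merged = merge_rows(joint, assignment, n_clusters)
    return mutual_information(joint) - mutual_information(merged)


# Four pages (rows) and two terms (columns); pages 0-1 use term 0, pages 2-3 use term 1.
joint = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
print(information_loss(joint, [0, 0, 1, 1], 2))  # labels that follow the terms
print(information_loss(joint, [0, 1, 0, 1], 2))  # labels that ignore the terms
```

A labeling that groups pages with similar term usage loses little information, which is the intuition behind minimizing this kind of loss when assigning class labels.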
Cross-language Classification
Performance
Application: Visual Contextual Advertising
[Chen et al. AAAI 2010]
• Previous research focused on advertising for text Web pages.
• With the booming of multimedia data, we need to recommend advertisements for these data
• Difficulty: image and text are in different feature spaces
• Use the co-occurrence data to bridge these two feature spaces
Figure illustration of Visual Contextual Advertising
Visual Contextual Advertising
[Equations lost in extraction. Surviving text: "We assume that there is …", "(based on the independent assumption)", "Where …"]
Experimental Results
• Co-occurrence data from Flickr.
• Test images from Flickr and the Fifteen scene data set
• Advertisements are crawled from the MSN search engine with queries chosen from the AOL query log.
Experimental Result
Thank you
For more details of APEXLAB http://apex.sjtu.edu.cn/apex_wiki/FrontPage
Our works http://apex.sjtu.edu.cn/apex_wiki/Papers