HUMAN EXPERTISE AND ARTIFICIAL INTELLIGENCE IN VERTICAL SEARCH
Peter Jackson & Khalid Al-Kofahi
Corporate Research & Development
HORIZONTAL VERSUS VERTICAL SEARCH
HORIZONTAL                  VERTICAL
Consumer focus              Professional focus
General interest            Specialist interests
Average user                Expert user
Shallow information need    Deep information need
THE PARADOX OF SEARCH
• The further you get from keyword indexing and retrieval, the harder it is to explain a search result
– Professional searchers demand transparency
• Tool versus appliance
• You need an ‘explanatory model’ that people can relate to and understand, even if it is actually just a cartoon of the real process
– Examples: Basic PageRank, Collaborative Filtering
• Such models don’t work so well in vertical domains
– Links aren’t always endorsements
– Sparsity of data in smaller communities
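As an illustration of the kind of ‘explanatory model’ mentioned above, here is a minimal sketch of basic PageRank by power iteration. The citation graph, the case names, and the damping and iteration parameters are hypothetical. The model treats every link as an endorsement, which is exactly the assumption that may fail in a vertical such as law, where a citation can be negative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over a dict of node -> outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: three cases cite case_c, which cites case_a.
citations = {
    "case_a": ["case_c"],
    "case_b": ["case_c"],
    "case_c": ["case_a"],
    "case_d": ["case_c"],
}
ranks = pagerank(citations)
best = max(ranks, key=ranks.get)
```

Here `best` is the most-cited case, regardless of whether the citing cases followed it or criticized it.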
RECENT TRENDS IN SEARCH
• Fragmentation of ‘horizontal’ search
– Media, location, demographics (Weber & Castillo, 2010)
• More sophisticated models of user behavior
– Post-click behaviors (Zhong, Wang, et al., 2010)
• ‘Practical semantics’ versus Semantic Web
– Maps as search results for local, micro-results
• Incorporation of domain knowledge into search
– Taxonomies, vocabularies, use cases, work flows
THE EXAMPLE OF LEGAL SEARCH
• The completeness requirement
– Recall as important as precision
• Less redundancy than on the Web
• The authority requirement
– Court superiority, jurisdiction
– Highly cited cases and statutes
• Supersession by statute or regulation
• The multi-topical nature of documents
– A case may cover many points of law but be cited for only one
– Citations can be negative as well as positive per topic
> These factors also apply to scientific documents
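To make the completeness requirement concrete, here is a minimal sketch of how precision and recall could be computed for one result list. The document identifiers and relevance judgments are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list against a relevant set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 5 documents actually relevant.
retrieved = ["case_a", "case_b", "case_c", "case_d"]
relevant = ["case_a", "case_c", "case_e", "case_f", "case_g"]
precision, recall = precision_recall(retrieved, relevant)
```

In this example half of the returned documents are relevant, but three of the five relevant documents were missed. A consumer search might tolerate that; a legal researcher who needs every controlling case might not.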
POWER LAW AND LEGAL TOPICS
POWER LAW AND WESTLAW USERS
EXPERT SEARCH
• In many verticals, there are at least two sources of expertise available for enhancing search
– Editors and authors, who generate useful metadata
– Users, who generate clickstreams and other data
• Editorial value addition improves recall especially
– Helps find both fat neck and long tail documents on a topic
• Aggregate user behavior mostly improves precision
– Power users find most relevant and important documents
• The model of expert search enables and explains the portfolio of results, rather than individual results
SOURCES OF EVIDENCE: AUTHORS & EDITORS
[Diagram: Burger King Corp. v. Rudzewicz, labeled "Headnote, KN", surrounded by CASE boxes whose text contains citations]
Issue: Long arm jurisdiction: 12 A (Key cases), 54 B (Highly Relevant)
Other count labels in the diagram: 172013 (A)28 (B); 205,3105 (A)19 (B); 354 (A)5 (B)
SOURCES OF EVIDENCE: AUTHORS & EDITORS
[Diagram: Burger King Corp. v. Rudzewicz with headnote and KN pairs (HN1 KN1, HN2 KN2, HN3 KN2, ..., HN35 KN14), linked to ALR, CJS, AMJUR and to groups of CASES, one group labeled "Another set of related cases"]
SOURCES OF EVIDENCE: USERS (I)
[Diagram: Session 1 to Session N, each with queries (Query 1 to Query N) and ACTIONS (CLICK, KEYCITE) on CASES]
Link query language to document language via click, print, and cite checking behaviors
Identify documents that are co-clicked, co-printed, etc., with the Burger King case across user sessions
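The co-click idea on this slide can be sketched as a simple count over sessions. This is a minimal illustration, assuming each session is reduced to the list of documents clicked in it. The session data and the document identifiers are hypothetical.

```python
from collections import Counter


def co_clicked(sessions, target):
    """Count how many sessions clicked each document together with `target`."""
    counts = Counter()
    for clicked_docs in sessions:
        docs = set(clicked_docs)  # count each document once per session
        if target in docs:
            counts.update(docs - {target})
    return counts


# Hypothetical sessions: each inner list holds the documents clicked in one session.
sessions = [
    ["burger_king", "case_x", "case_y"],
    ["burger_king", "case_x"],
    ["case_x", "case_z"],
    ["burger_king", "burger_king", "case_z"],
]
related = co_clicked(sessions, "burger_king")
top_related = related.most_common(1)
```

The same counting would apply to co-printed or co-KeyCited documents by changing which action populates each session list.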
SOURCES OF EVIDENCE: USERS (II)
[Diagram: Session 1 to Session N, each with queries (Query 1 to Query N) and ACTIONS (CLICK) on CASES, centered on Burger King Corp. v. Rudzewicz]
Original breach of contract and trademark infringement case turned into a civil procedure case about jurisdiction on appeal
QUERIES IN THE LAST 3 MONTHS
"personal jurisdiction"     176
"minimum contacts"          50
"forum selection clause"    39
"personal jurisdiction"     39
"forum non conveniens"      32
"choice of law"             29
USER ACTIONS: 10417    TOTAL SESSIONS: 9758
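Linking query language to document language can be sketched as grouping the queries that led to an action on each document. This is a minimal illustration, assuming the log has already been reduced to (query, document) pairs. The log entries and the whitespace and case normalization are hypothetical choices, not the system's actual processing.

```python
from collections import Counter, defaultdict


def queries_by_document(log):
    """Map each document to a Counter of normalized queries that led to it."""
    by_doc = defaultdict(Counter)
    for query, doc_id in log:
        normalized = " ".join(query.lower().split())  # fold case and whitespace
        by_doc[doc_id][normalized] += 1
    return by_doc


# Hypothetical (query, acted-on document) pairs extracted from sessions.
log = [
    ("Personal Jurisdiction", "burger_king"),
    ("personal  jurisdiction", "burger_king"),
    ("minimum contacts", "burger_king"),
    ("choice of law", "case_x"),
]
top_query = queries_by_document(log)["burger_king"].most_common(1)
```

A table like the one on this slide is what such a grouping would produce for a single case: the vocabulary users actually type, which may differ from the vocabulary of the opinion itself.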
AI & THE RANKING PROBLEM
• Supervised Machine Learning (Ranker SVM)
– Iteratively retrieve and rank documents
– Incorporate all available cues: text similarity, classifications, citations, user behavior and query logs
– All of this requires lots of data!
• Training & Validation
– Gold data: hand-crafted research reports covering a variety of legal issues
– Each report contains an issue statement, multiple queries, all seminal and highly relevant documents, and some relevant documents
• > 100K documents judged against ~400 legal issues
– System was also tested by an independent 3rd party
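The slide names a ranking SVM without giving details. Below is a minimal sketch of the standard pairwise idea behind such rankers: turn graded documents into "better minus worse" feature differences and fit a linear model with a hinge loss. It is not the authors' implementation. The three features, the grades, and the training parameters are hypothetical.

```python
def pairwise_differences(features, labels):
    """Build (better - worse) feature vectors for documents judged on one issue."""
    pairs = []
    for i, (xi, yi) in enumerate(zip(features, labels)):
        for xj, yj in zip(features[i + 1:], labels[i + 1:]):
            if yi == yj:
                continue  # equal grades carry no ordering information
            better, worse = (xi, xj) if yi > yj else (xj, xi)
            pairs.append([b - w for b, w in zip(better, worse)])
    return pairs


def train_linear_ranker(pairs, epochs=100, lr=0.1, reg=0.01):
    """Sub-gradient descent on the pairwise hinge loss with L2 shrinkage."""
    weights = [0.0] * len(pairs[0])
    for _ in range(epochs):
        for diff in pairs:
            margin = sum(w * d for w, d in zip(weights, diff))
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:  # pair is misordered or inside the margin
                weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))


# Hypothetical features per document: [text similarity, citation signal, click signal]
features = [
    [0.9, 0.8, 0.7],  # judged seminal (grade 2)
    [0.6, 0.3, 0.4],  # judged highly relevant (grade 1)
    [0.2, 0.1, 0.0],  # judged not relevant (grade 0)
]
labels = [2, 1, 0]
weights = train_linear_ranker(pairwise_differences(features, labels))
ranking = sorted(range(len(features)),
                 key=lambda i: score(weights, features[i]), reverse=True)
```

Gold data of the kind described above would supply the grades. Each cue (text similarity, classifications, citations, user behavior) would contribute one or more feature columns.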
HADOOP FOR BIG DATA PROCESSING
• At launch, query logs contained ~2 billion records
– Queries & user actions
• Relied on a Hadoop cluster to
– Run Extract, Transform, and Load (ETL) processes
– Cluster similar queries together
– Extract, normalize, collate citation contexts
• Dramatic improvement in processing times
– From tens of hours to tens of minutes
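The map, shuffle, and reduce pattern behind a job such as query clustering can be sketched in plain Python. This is an in-memory illustration of the data flow only and does not use Hadoop. The clustering key (lower-cased, sorted tokens) and the sample queries are hypothetical simplifications.

```python
from collections import defaultdict


def map_query(raw_query):
    """Map step: emit (cluster key, raw query). Key = sorted lower-cased tokens."""
    key = " ".join(sorted(raw_query.lower().split()))
    yield key, raw_query


def shuffle(mapped):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_cluster(key, raw_queries):
    """Reduce step: summarize one cluster as (key, size, distinct variants)."""
    return key, len(raw_queries), sorted(set(raw_queries))


queries = ["minimum contacts", "Contacts minimum", "choice of law", "minimum contacts"]
mapped = [pair for q in queries for pair in map_query(q)]
clusters = [reduce_cluster(k, v) for k, v in sorted(shuffle(mapped).items())]
```

Because each query is mapped independently and each key is reduced independently, the work can be split across many cores, which is what makes the reported reduction in processing time plausible.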
HADOOP: TYPICAL SPEED UPS
COMPUTATION                                    NORMAL TIME   HADOOP TIME
Building complete Westlaw dictionary           2.5 days      1 hour
Clustering similar Westlaw queries             1.5 days      3 minutes
Citation extraction from over 10 M documents   1.25 days     3 hours
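The speed-up factors implied by these timings can be worked out directly. This sketch assumes that a "day" means 24 hours of wall-clock time, which the slide does not state.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR  # assumption: 1 day = 24 hours

# (normal time, Hadoop time), both converted to minutes
timings = {
    "Building complete Westlaw dictionary": (2.5 * MINUTES_PER_DAY, 1 * MINUTES_PER_HOUR),
    "Clustering similar Westlaw queries": (1.5 * MINUTES_PER_DAY, 3),
    "Citation extraction from over 10 M documents": (1.25 * MINUTES_PER_DAY, 3 * MINUTES_PER_HOUR),
}
speedups = {task: before / after for task, (before, after) in timings.items()}
```

Under that assumption the factors are 60x, 720x, and 10x respectively, so the gain varies widely with the task.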
CLUSTER CONFIGURATION: QUERIES
• 8 machines, each with 16 cores
• Only 14 cores/machine were available for processing
– Giving a total of 112 cores
• Block size of 64 MB
– Each core processes one block at a time
• Cluster can process 7 GB at each step
• Latest cluster is twice the size: 224 cores
– Almost 1 TB of memory and over 1 PB of storage
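The figures on this slide follow from simple arithmetic, sketched below. The variable names are illustrative, and 1 GB is taken as 1024 MB.

```python
machines = 8
usable_cores_per_machine = 14   # 16 cores each, 14 available for processing
block_size_mb = 64              # each core processes one block at a time

total_cores = machines * usable_cores_per_machine
data_per_step_gb = total_cores * block_size_mb / 1024

latest_cluster_cores = 2 * total_cores  # "twice the size"
```

This gives 112 cores and 7 GB per step for the original cluster, and 224 cores for the later one.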
THE POWER OF EXPERT SEARCH
• Leverages expertise of community: authors, editors, & users
– We know why documents are linked
– We know exactly who our users are
• Metadata, authority & aggregated user data all contribute to relevance, importance & popularity
• Can still benefit from Power Law phenomena so common on the Web
• Can exploit data parallelism to achieve the same kind of scale as horizontal search
LESSONS LEARNED
• Vertical search is not just about search
– It’s about findability
• Includes navigation, recommendations, clustering, faceted classification, etc.
– It’s about satisfying a set of well-understood tasks
• Usually on enhanced content
• Usually for expert customers
• Leveraging human value addition is key
– None of the human actors set out to improve search
• Difficult to design complete solution upfront
– Need platform for experimentation and validation at scale
QUESTIONS?
• A relevant paper is downloadable from http://labs.thomsonreuters.com