8/6/2019 ask.com ppt
Large Scale Internet Search at Ask.com
Tao Yang, Chief Scientist and Senior Vice President
InfoScale 2006
Outline
Overview of the company and products
Core techniques for page ranking
  ExpertRank
Challenges in building scalable search services
  Neptune clustering middleware.
  Fault detection and isolation.
Ask.com: Focused on Delivering a Better Search Experience
Innovative search technologies.
#6 U.S. Web Property; #8 Global in terms of user coverage
28.5% reach - Active North American Audience with 48.8 million unique users
133 million global unique users for ASK worldwide sites: USA, UK, Germany, France, Italy, Japan, Spain, Netherlands.
A Division of IAC Search and Media (Formerly Ask Jeeves)
Sectors of IAC (InterActiveCorp)
Media & Advertising
Membership & Subscriptions
Services
Retailing
Emerging Businesses
IAC (InterActiveCorp)
Fortune 500 company
Create, acquire and build businesses with leading positions in interactive markets.
60 specialized & global brands
28K+ employees
$5.8 billion 2005 Revenue
$668 million 2005 OIBA (Profit)
$1.5 billion net cash
Ask.com Site Relaunching and branding in Q1 2006
Cleaner interface with a list of search tools
Site Features: Smart Answer
Topic Zooming with Search Suggestions
Site Feature: Web Direct Answer
More Site Features - Binoculars
Our Binoculars tool lets you see what a site looks like before clicking to visit it
Ask Competitive Strengths
Deeper topic view of the Internet
  Query-specific link and text analysis with behavior analysis
  Differentiated clustering technology
Natural Language Processing
  Better understanding/analysis of queries and user behavior
Integration of structured data with web search.
  Smart answers
Engine Architecture
[Figure: engine architecture diagram. Labeled components: Client queries; Traffic load balancer; Frontend (replicated); Hierarchical Result Cache; Page index; Document Abstract; Document description; Ranking; Classification; Structured DB; Neptune Clustering Middleware.]
Concept: Link-based Popularity for Ranking
A is a connectivity matrix among webpages. A(i,j)=1 for edge from i to j.
Query-independent popularity. Query-specific popularity
[Figure: example link graphs with numbered page nodes and a connectivity matrix.]
Approaches for Page Ranking
PageRank [Brin/Page98]: offline, iterative computation of query-independent popularity.
HITS [Kleinberg98, IBM Clever]:
  Build a query-based connectivity matrix on the fly.
  H, R are hub and authority weights of pages.
  Repeat until H, R converge: $R = A^{T} H = A^{T} A R$ (with $H = A R$); normalize H, R.
ExpertRank: Compute query-specific communities and ranking in real time.
  Started from Teoma and evolved at Ask.com
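The HITS iteration named above can be sketched in a few lines. This is a minimal illustration of the textbook hub/authority update, not ExpertRank and not Ask.com code; the function names `hits` and `normalize`, the tolerance, and the iteration cap are assumptions made for the example.

```python
def normalize(values):
    """Scale a vector to unit Euclidean length (left unchanged if all zero)."""
    norm = sum(v * v for v in values) ** 0.5
    return [v / norm for v in values] if norm else list(values)


def hits(adj, max_iter=100, tol=1e-9):
    """Hub/authority iteration on a 0/1 matrix with adj[i][j] = 1 for edge i -> j.

    Returns (hubs, auths). max_iter and tol are illustrative defaults.
    """
    n = len(adj)
    if n == 0:
        return [], []
    hubs = normalize([1.0] * n)
    auths = normalize([1.0] * n)
    for _ in range(max_iter):
        # R = A^T H: a page's authority sums the hub weights of pages linking to it.
        new_auths = normalize(
            [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        )
        # H = A R: a page's hub weight sums the authority of the pages it links to.
        new_hubs = normalize(
            [sum(adj[i][j] * new_auths[j] for j in range(n)) for i in range(n)]
        )
        change = max(
            abs(a - b) for a, b in zip(new_auths + new_hubs, auths + hubs)
        )
        hubs, auths = new_hubs, new_auths
        if change < tol:
            break
    return hubs, auths


# Pages 0 and 1 both link to pages 2 and 3; page 3 links to page 2.
example_adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
example_hubs, example_auths = hits(example_adj)
```

In this small graph page 2 ends up with the largest authority weight, and pages 0 and 1 share the largest hub weight.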
Steps of ExpertRank at Ask.com
1 Search the index for a query
2 Clustering for subject communities for matched results
3 Local subject-specific mining
4 Ranking with knowledge and classification
[Figure label: Local Subject Community]
Multi-stage Cluster Refinement with Integrated Link/Topic Analysis (Step 2)
Link-guided page clustering
Cluster refinement with content analysis and topic purification
Text classification and NLP
Similarity and overlapping analysis
Subject-specific ranking
Example: bat, flying mammals vs. baseball bat.
For each topic group, identify experts for page recommendation, and remove spamming links.
Derive local ranking scores
[Figure labels (Step 3): Hub; Authority]
Integrated Ranking with User Intention Analysis
Score weighting from multiple topic groups.
Authoritativeness and freshness assessment.
User intention analysis.
Result diversification.
[Figure labels (Step 4): Local Subject Community; Hub; Authority]
Scalability Challenges
Data scalability:
  From millions of pages to billions of pages.
  Clean vs. datasets with lots of noise.
Infrastructure scalability:
  Tens of thousands of machines.
  Tens of millions of users.
  Impact on response time, throughput, & availability, data center power/space/networking.
People scalability:
  From few persons to many engineers with non-uniform experience.
Examples of Scalability Problems
Mining question answers from the web.
Large-scale spammer detection.
Computing with irregular data. On-chip cache.
Large-scale memory management: 32 bits vs. 64 bits.
Incremental cluster expansion and topology management.
High throughput write/read traffic. Reliability.
Fast and reliable data propagation across networks.
Architecture optimization for low power consumption.
Updating large software & data on a live platform.
Distributed debugging across thousands of machines.
Some Lessons Learned
Data
  Data methods can behave differently with different data sizes/noise levels.
  Data-driven approaches with iterative refinement.
Architecture & Software
  Distributed service-oriented architectures
  Middleware support.
Product:
  Monitoring is as critical as others.
  Simplicity
The Neptune Clustering Middleware
A simple/flexible programming model
  Aggregating and replicating application modules with persistent data.
  Shielding complexity of service discovery, load balancing, consistency, and failover management
  Providing inter-service communication.
  Providing quality-aware request scheduling for service differentiation
Started at UCSB. Evolved with Teoma, Ask.com.
Programming Model and Cluster-level Parallelism/Redundancy in Neptune
Request-driven processing model.
SPMD model (single program/multiple data) while large data sets are partitioned and replicated.
Location-transparent service access with consistency support.
[Figure labels: Request; Service method; Provider module (x2); Service cluster; Data; Clustering by Neptune]
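The idea of location-transparent access over partitioned and replicated data can be illustrated with a small routing sketch. It is not Neptune's API: the class `ServiceCluster`, the provider naming scheme, the CRC32-based partitioning, and the random replica choice are all assumptions made for illustration.

```python
import random
import zlib


class ServiceCluster:
    """Hypothetical service cluster: data is split into partitions, each
    partition is served by several replicated provider modules."""

    def __init__(self, num_partitions, replicas_per_partition):
        # partition id -> names of the provider modules holding that partition
        self.providers = {
            p: [f"provider-p{p}-r{r}" for r in range(replicas_per_partition)]
            for p in range(num_partitions)
        }
        self.down = set()  # providers currently considered unavailable

    def partition_of(self, key):
        # Stable hash so every caller maps a key to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.providers)

    def route(self, key, rng=random):
        """Pick a live replica for the key; the caller never names a machine."""
        partition = self.partition_of(key)
        live = [n for n in self.providers[partition] if n not in self.down]
        if not live:
            raise RuntimeError(f"partition {partition} has no live replica")
        return rng.choice(live)


cluster = ServiceCluster(num_partitions=4, replicas_per_partition=2)
chosen = cluster.route("flying mammals")
```

The caller supplies only a request key. Which partition holds the data and which replica answers is decided inside the cluster object, so marking a provider as down changes routing without changing the caller.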
Neptune architecture for cluster-based services
Symmetric and decentralized:
  Each node can host multiple services, acting as a service provider.
  Each node can also subscribe to internal services from other nodes, acting as a consumer.
Support multi-tier or nested service architecture
[Figure labels: Client requests; Service consumer/provider]
Inside a Neptune Server Node
[Figure: internals of a Neptune server node. Labeled components: Network to the rest of the cluster; Service Access Point; Service Providers; Service Runtime; Service Handling Module; Service Availability Directory; Service Availability Publishing; Service Availability Subsystem; Polling Agent; Load Index Server; Service Load-balancing Subsystem; Service Consumers.]
Impact of Component Failure in Multi-tier Services
Failure of one replica: 7s - 12s
Service unavailable: 10s - 13s
[Figure labels: Front-end Service; Replica 1; Replica 2; Replica 3]
Problems that affect availability
Threads are blocked with slow service dependency.
Fault detection speed.
[Figure labels: Service A (from healthy to unresponsive); Requests Queue; Thread Pool; Service B Replica #1 (Unresponsive); Service B Replica #2 (Healthy)]
Dependency Isolation
Per-dependency management with capsules.
  Isolate their performance impact.
  Maintain dependency-specific feedback information for QoS control.
Programming support with automatic recognition of dependency states.
[Figure labels: Request queue; Main Working Thread Pool; Network Service Capsule (x3); Disk Capsule; Other Capsule]
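One way to picture per-dependency isolation is a small guard object placed in front of each dependency. The sketch below is an assumption-laden illustration, not the capsule implementation from the talk: the names `Capsule` and `CapsuleRejected`, the in-flight limit, and the consecutive-failure threshold are all hypothetical.

```python
import threading


class CapsuleRejected(Exception):
    """Raised when a capsule refuses a call instead of letting it block."""


class Capsule:
    """Hypothetical per-dependency guard: bounds concurrent calls to one
    dependency and keeps simple health feedback about it."""

    def __init__(self, name, max_in_flight, failure_threshold=3):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._consecutive_failures = 0
        self._failure_threshold = failure_threshold

    @property
    def state(self):
        # Dependency state derived from recent outcomes (illustrative rule).
        with self._lock:
            if self._consecutive_failures >= self._failure_threshold:
                return "unresponsive"
            return "healthy"

    def call(self, fn, *args):
        # Fail fast when the capsule is full, so a slow dependency cannot
        # absorb every thread of the main working pool.
        if not self._slots.acquire(blocking=False):
            raise CapsuleRejected(f"{self.name}: too many calls in flight")
        try:
            result = fn(*args)
        except Exception:
            with self._lock:
                self._consecutive_failures += 1
            raise
        else:
            with self._lock:
                self._consecutive_failures = 0
            return result
        finally:
            self._slots.release()
```

A scheduler could read `state` before dispatching work and skip or deprioritize dependencies reported as unresponsive. That policy is left out of the sketch.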
Fast Fault Detection and Information Propagation for Large-Scale Cluster-Based Services
Complex 24x7 network topology in service clusters.
Frequent events: failures, structure changes, and new services.
Yellow-page directory
  Discovery of services and their attributes
  Server aliveness
[Figure labels: Internet; Asian user; NY user; CA user; 3DNS-WAN Load Balancer; Data Center California; Data Center New York; Data Center Asia; VPN links; Level-2 Switch; Level-3 Switch]
TAMP: Topology-Adaptive Membership Protocol
Highly efficient: Optimize bandwidth, # of packets
Topology-aware:
  Form a hierarchical tree according to network topology
  Localize traffic within switches and adaptive to changes of switch architecture.
Topology-adaptive:
  Network changes: switches
Scalable: scale to tens of thousands of nodes.
Easy to operate.
Hierarchical Tree Formation Algorithm
Exploiting TTL count in IP packet for topology-adaptive design.
1. Each multicast group with a fixed TTL value performs an election;
2. Group leaders form higher level groups with larger TTL values;
3. Stop when max. TTL value is reached; otherwise, go to Step 2.
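The election-and-regroup loop can be simulated without any networking. The sketch below rests on several assumptions that are not in the slides: each node is described by a hypothetical switch path (root first, all paths the same length), a packet needs one extra TTL unit per switch level it must climb, the lowest node name wins an election, and the loop also stops early once a single leader remains.

```python
def ttl_needed(path_a, path_b):
    """Smallest TTL with which two nodes hear each other in this toy model:
    1 within the same lowest-level switch, plus 1 per level they must climb."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return 1 + (len(path_a) - common)


def form_tree(nodes, max_ttl):
    """nodes: dict of node name -> switch path. Returns one list per level,
    each holding (leader, members) pairs."""
    levels = []
    participants = sorted(nodes)
    ttl = 1
    while ttl <= max_ttl:
        groups = []
        for name in participants:
            # Join the first group whose members are reachable at this TTL.
            for group in groups:
                if ttl_needed(nodes[name], nodes[group[0]]) <= ttl:
                    group.append(name)
                    break
            else:
                groups.append([name])
        # Election rule assumed here: lowest name becomes group leader.
        levels.append([(min(g), g) for g in groups])
        participants = [min(g) for g in groups]
        if len(participants) == 1:
            break  # simplification: nothing left to merge
        ttl += 1  # leaders regroup with a larger TTL
    return levels


topology = {
    "A": ("core", "l3-x", "l2-1"),
    "B": ("core", "l3-x", "l2-1"),
    "C": ("core", "l3-x", "l2-2"),
    "D": ("core", "l3-y", "l2-3"),
}
tree = form_tree(topology, max_ttl=4)
```

With this topology, A and B share a group at TTL=1, A's group merges with C's at TTL=2, and D joins at TTL=3, which gives a three-level tree rooted at A.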
An Example of Hierarchical Tree Formation
[Figure: example tree over nodes labeled A, B, C. Multicast groups by level:
  Groups 0a, 0b, 0c: 239.255.0.20, TTL=1
  Groups 1a, 1b, 1c: 239.255.0.21, TTL=2
  Groups 2a, 2b: 239.255.0.22, TTL=3
  Group 3a: 239.255.0.23, TTL=4]
Scalability Analysis
Basic performance factors
  Failure detection time ($T_{fail\_detect}$)
  View convergence time ($T_{converge}$)
  Communication cost in terms of bandwidth ($B$)
Two metrics
  $BDP = B \times T_{fail\_detect}$: lower failure detection time with low bandwidth is desired
  $BCP = B \times T_{converge}$: lower convergence time with low bandwidth is desired
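The two products are simple to compute once the three factors are measured. The sketch below only encodes the definitions above. The class name `ProtocolProfile`, the field names, and the units (for example KB/s and seconds) are assumptions, and the sample values are placeholders rather than measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolProfile:
    """Hypothetical record of one membership protocol's measured factors."""
    name: str
    bandwidth: float       # B, communication cost
    t_fail_detect: float   # failure detection time
    t_converge: float      # view convergence time

    @property
    def bdp(self):
        # BDP = B * T_fail_detect (lower is better)
        return self.bandwidth * self.t_fail_detect

    @property
    def bcp(self):
        # BCP = B * T_converge (lower is better)
        return self.bandwidth * self.t_converge


def rank_by(profiles, metric):
    """Order profiles from best (lowest) to worst on 'bdp' or 'bcp'."""
    return sorted(profiles, key=lambda p: getattr(p, metric))


# Placeholder inputs, chosen only to exercise the arithmetic.
sample = [
    ProtocolProfile("protocol-x", bandwidth=2.0, t_fail_detect=3.0, t_converge=4.0),
    ProtocolProfile("protocol-y", bandwidth=1.0, t_fail_detect=5.0, t_converge=9.0),
]
best_on_bdp = rank_by(sample, "bdp")[0].name
```

Multiplying bandwidth by time penalizes a protocol that buys fast detection with heavy traffic as much as one that saves traffic by detecting slowly.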
Bandwidth Consumption
All-to-All & Gossip: quadratic increase
TAMP: close to linear
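A back-of-the-envelope count shows why a grouped hierarchy grows more slowly than all-to-all heartbeating. This is a simplified counting model, not the analysis behind the slide: it assumes every node's heartbeat is delivered to every other member of its group once per period, that groups have a fixed maximum size, and that one leader per group joins the next level. Gossip is not modeled.

```python
def all_to_all_deliveries(n):
    """Heartbeat deliveries per period when every node hears every other node."""
    return n * (n - 1)


def hierarchical_deliveries(n, group_size):
    """Deliveries per period in a toy hierarchy: all-to-all inside each group,
    one leader per group promoted to the next level, repeated to the root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    total = 0
    members = n
    while members > 1:
        full_groups, rest = divmod(members, group_size)
        total += full_groups * group_size * (group_size - 1)
        total += rest * (rest - 1)  # one smaller leftover group, if any
        members = full_groups + (1 if rest else 0)
    return total


flat_1000 = all_to_all_deliveries(1000)
tree_1000 = hierarchical_deliveries(1000, group_size=10)
```

Under these assumptions, 1000 nodes produce 999000 deliveries per period all-to-all and 9990 in groups of 10, and growing from 100 to 1000 nodes multiplies the grouped count by roughly 10 rather than 100.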
Failure Detection Time
Gossip: log(N) increase
All-to-All & TAMP: constant
View Convergence Time
Gossip: log(N) increase
All-to-All & TAMP: constant