ask.com ppt

Embed Size (px)

Citation preview

  • 8/6/2019 ask.com ppt

    1/43

    Large Scale InternetSearch at Ask.com

    Tao YangChief Scientist and Senior Vice President

    InfoScale 2006

  • 8/6/2019 ask.com ppt

    2/43

    Outline

    Overview of the company and products Core techniques for page ranking

    ExpertRank

    Challenges in building scalable searchservices

    Neptune clustering middleware.

    Fault detection and isolation.

  • 8/6/2019 ask.com ppt

    3/43

    Innovative search technologies.#6 U.S. Web Property; #8 Global in terms of user coverage

    28.5% reach - Active North American Audience with 48.8 millionunique users133 million global unique users for ASK worldwide sites: USA, UK,

    Germany, France, Italy, Japan, Spain, Netherlands. A Division of IAC Search and Media (Formally Ask Jeeves)

    Ask.com: Focused on Delivering a Better Search Experience

  • 8/6/2019 ask.com ppt

    4/43

    Media & Advertising

    Membership &Subscriptions

    Services

    Retailing

    EmergingBusinesses

    Sectors of IAC (InterActiveCorp)

  • 8/6/2019 ask.com ppt

    5/43

    IAC (InterActiveCorp)

    Fortune 500 companyCreate, acquire and build businesses withleading positions in interactive markets.

    60 specialized & global brands

    28K+ employees

    $5.8 billion 2005 Revenue $668 million 2005 OIBA (Profit)

    $1.5 billion net cash

  • 8/6/2019 ask.com ppt

    6/43

  • 8/6/2019 ask.com ppt

    7/43

    Ask.com Site Relaunching and branding in Q12006

    Cleaner interface with a list of search tools

  • 8/6/2019 ask.com ppt

    8/43

    Site Features: Smart Answer

  • 8/6/2019 ask.com ppt

    9/43

    Topic Zooming with Search Suggestions

  • 8/6/2019 ask.com ppt

    10/43

    Site Feature: Web Direct Answer

  • 8/6/2019 ask.com ppt

    11/43

    More Site Features - Binoculars

    Our Binoculars tool lets you see what a site lookslike before clicking to visit it

  • 8/6/2019 ask.com ppt

    12/43

    Ask Competitive Strengths

    Deeper topic view of the InternetQuery-specific link and text analysis withbehavior analysisDifferentiated clustering technology

    Natural Language ProcessingBetter understanding/analysis of queries anduser behavior

    Integration of structured data with web search.Smart answers

  • 8/6/2019 ask.com ppt

    13/43

  • 8/6/2019 ask.com ppt

    14/43

    Engine Architecture

    Neptune Clustering Middleware

    Document

    Abstract

    Frontend

    Client queries Traffic load balancer

    FrontendFrontendFrontend

    Pageindex

    Document

    AbstractDocumentAbstractDocumentdescription

    RankingRankingRankingRankingRankingRankingClassification

    HierarchicalResult Cache

    StructuredDB

    Page

    index

  • 8/6/2019 ask.com ppt

    15/43

    Concept: Link-based Popularity for Ranking

    A is a connectivity matrix among webpages. A(i,j)=1 for edge from i to j.

    Query-independent popularity. Query-specific popularity

    1

    3

    8

    12

    456 7

    9

    11 1

    11

    11

    11

    1 2 3 4 5 6 7 8 9

    1

    1

    32

    4576

    8

  • 8/6/2019 ask.com ppt

    16/43

    Approaches for Page Ranking

    PageRank :[Brin/Page98] offline computation of query-independent popularity iteratively. HITS:[Kleinberg98, IBM Clever]

    Build a query-based connectivity matrix on the fly.H, R are hub and authority weights of pages.Repeat until H, R converge. R=A H= AA R; Normalize H, R.

    ExpertRank: Compute query-specific communitiesand ranking in real time.

    Started from Teoma and evolved at Ask.com

  • 8/6/2019 ask.com ppt

    17/43

    Steps of ExpertRank at Ask.comSearch the index for a query 2 Clustering for subjectcommunities for matched

    results

    4 Ranking with knowledge andclassification

    1

    local subject-specific mining3

    Local SubjectCommunity

  • 8/6/2019 ask.com ppt

    18/43

  • 8/6/2019 ask.com ppt

    19/43

    Multi-stage Cluster Refinement with IntegratedLink/Topic Analysis

    Link-guided pageclusteringCluster refinementwith content analysisand topic purification

    Text classificationand NLP

    Similarity andoverlappinganalysis

    2

  • 8/6/2019 ask.com ppt

    20/43

    Subject-specific ranking

    Examplebat, flying mammals vs.baseball bat.

    For each topic group,identify experts for pagerecommendation, andremove spamming links.

    Derive local rankingscores

    3

    Hub

    Authority

  • 8/6/2019 ask.com ppt

    21/43

    Integrated Ranking with User Intention Analysis

    Score weighting from multiple topic groups.Authoritativeness and freshnessassessment.

    User intention analysis.Result diversification.

    Local SubjectCommunity

    Hub

    Authority

    4

  • 8/6/2019 ask.com ppt

    22/43

    Scalability Challenges

    Data scalability: From millions of pages to billions of pages.Clean vs. datasets with lots of noise.

    Infrastructure scalability: Tens of thousands of machines.Tens of Millions of usersImpact on response time, throughput, &availability,data center power/space/networking.

    People scalability: From few persons to manyengineers with non-uniform experience.

  • 8/6/2019 ask.com ppt

    23/43

  • 8/6/2019 ask.com ppt

    24/43

    Examples of Scalability Problems

    Mining question answers from web. Large-scale spammer detection. Computing with irregular data. On-chip cache. Large-scale memory management: 32 bits vs. 64 bits. Incremental cluster expansion and topology mgmt. High throughput write/read traffic. Reliability. Fast and reliable data propagation across networks. Architecture optimization for low power consumption. Update large software & data on a live platform. Distributed debugging thousands of machines.

  • 8/6/2019 ask.com ppt

    25/43

    Some of Lessons Learned

    DataData methods can behave differently withdifferent data sizes/noise levels.Data-driven approaches with iterative refinement.

    Architecture & SoftwareDistributed service-oriented architecturesMiddleware support.

    Product:Monitoring is as critical as others.Simplicity

  • 8/6/2019 ask.com ppt

    26/43

    The Neptune Clustering Middleware

    A simple/flexible programming modelAggregating and replicating application moduleswith persistent data.Shielding complexity of service discovery, load

    balancing, consistency, and failover managementProviding inter-service communication.Providing quality-aware request scheduling for service differentiation

    Started at UCSB. Evolved with Teoma, Ask.com.

  • 8/6/2019 ask.com ppt

    27/43

    Programming Model and Cluster-levelParallelism/Redudancy in Neptune

    Request-driven processing model. SPMD model (single program/multiple data) while

    large data sets are partitioned and replicated. Location-transparent service access with consistency

    support.

    Servicemethod

    Request

    Providermodule

    Providermodule

    Service cluster

    Clustering byNeptune

    Data

  • 8/6/2019 ask.com ppt

    28/43

    Neptune architecture for cluster-basedservices

    Symmetric and decentralized :Each node can host multiple services, acting as aservice provider.Each node can also subscribe internal services

    from other nodes, acting as a consumer. Support multi-tier or nested service architecture

    Client requests

    Service consumer/provider

  • 8/6/2019 ask.com ppt

    29/43

    Inside a Neptune Server Node

    N et w

    or k

    t ot h

    er e

    s t

    of t h

    e c l u

    s t er

    ServiceAccessPoint

    ServiceProviders

    ServiceRuntime

    rvice HandlingModule

    ServiceAvailabilit

    yDirectory

    ServiceAvailabilityPublishing

    ServiceAvailabilitySubsystem

    PollingAgent

    LoadIndex

    Server

    ServiceLoad-balancing

    Subsystem

    ServiceConsume

    rs

  • 8/6/2019 ask.com ppt

    30/43

    Impact of Component Failure in Multi-tierservices

    Failure of one replica: 7s - 12s Service unavailable: 10s - 13s

    Front-endService

    Replica1

    Replica2

    Replica3

  • 8/6/2019 ask.com ppt

    31/43

    Problems that affect availability Threads are blocked with slow service dependency. Fault detection speed.

    Service A(From healthy to unresponsive)

    Thread Pool

    Service BReplica #2(Healthy)

    Service BReplica #1

    (Unresponsive)

    RequestsQueue

    Service BReplica #1

    (Unresponsive)

  • 8/6/2019 ask.com ppt

    32/43

    Dependency Isolation

    Per-dependencymanagement with capsules.Isolate their performance impact.maintain dependency-specific feedbackinformation for QoScontrol.

    Programming support with

    automatic recognition of dependency states.

    Request queue

    NetworkServiceCapsule

    NetworkServiceCapsule

    NetworkServiceCapsule

    DiskCapsule

    Other

    Capsule

    MainWorking

    ThreadPool

  • 8/6/2019 ask.com ppt

    33/43

    Fast Fault Detection and Information Propagationfor Large-Scale Cluster-Based Services

    Complex 24x7 networktopology in serviceclusters.

    Frequent events: failures,structure changes, andnew services.

    Yellowpage directory

    discovery of services and their attributesServer aliveness

    Internet

    DataCenter

    California

    DataCenter

    New York

    DataCenter Asia

    3DNS-WAN Load Balancer

    Asian user

    NY user

    V P N

    V P N

    VP N

    CA user

    Level-2 Sw itch Level-2 S witch Level-2 Sw itch

    Level-3 Sw itch

    Level-2 Sw itch

    Level-3 Switch

    .. .

  • 8/6/2019 ask.com ppt

    34/43

    TAMP: T opology- Adaptive Membership Protocol

    Highly Efficient: Optimize bandwidth, # of packets Topology-aware:

    Form a hierarchical tree according to networktopologyLocalize traffic within switches and adaptive tochanges of switch architecture.

    Topology-adaptive:Network changes: switches

    Scalable: scale to tens of thousands of nodes.

    Easy to operate.

  • 8/6/2019 ask.com ppt

    35/43

    Hierarchical Tree Formation Algorithm

    Exploiting TTL count in IP packet for topology-adaptive design. Each multicast group with a fixed TLL valueperforms an election;Group leaders form higher level groups withlarger TTL values;Stop when max. TTL value is reached; otherwise,goto Step 2 .

  • 8/6/2019 ask.com ppt

    36/43

    An Example of Hiearchical Tree Formation

    Group 2a239.255.0.22

    TTL=3

    Group 2b239.255.0.22

    TTL=3

    AB

    C

    A B CGroup 1a

    239.255.0.21TTL=2

    AB C

    Group 1b239.255.0.21

    TTL=2

    Group 1c239.255.0.21

    TTL=2

    Group 0a239.255.0.20

    TTL=1

    A B C

    Group 0b239.255.0.20

    TTL=1

    Group 0c239.255.0.20

    TTL=1

    B

    Group 3a

    239.255.0.23TTL=4

  • 8/6/2019 ask.com ppt

    37/43

    Scalability Analysis

    Basic performance factorsFailure detection time ( T fail_detect )

    View convergence time ( T converge )

    Communication cost in terms of bandwidth ( B)

    Two metricsBDP = B * T fail_detect , lower failure detection time with lowbandwidth is desiredBCP = B * T converge , lower convergence time with lowbandwidth is desired

  • 8/6/2019 ask.com ppt

    38/43

  • 8/6/2019 ask.com ppt

    39/43

    Bandwidth Consumption

    All-to-All & Gossip: quadratic increase TAMP: close to linear

  • 8/6/2019 ask.com ppt

    40/43

    Failure Detection Time

    Gossip: log(N) increase All-to-All & TAMP: constant

  • 8/6/2019 ask.com ppt

    41/43

    View Convergence Time

    Gossip: log(N) increase All-to-All & TAMP: constant

  • 8/6/2019 ask.com ppt

    42/43

  • 8/6/2019 ask.com ppt

    43/43