Search at Tumblr Yufei Pan Director of Search, Tumblr 16 January 2013

Search at Tumblr (nyc search meetup)

DESCRIPTION

Yufei Pan, Director of Search at Tumblr, presenting at NYC Search & Analytics meetup in January 2014. http://www.meetup.com/NYC-Search-and-Discovery/


Page 1: Search at Tumblr (nyc search meetup)

Search at Tumblr

Yufei Pan, Director of Search, Tumblr

16 January 2013

Page 2: Search at Tumblr (nyc search meetup)

Tumblr - Follow the World’s Creators

Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog

Page 3: Search at Tumblr (nyc search meetup)

About search@tumblr

● Most important way to discover great content
  ○ 50M searches a day
● Limited search for a long time (2007-2012)
  ○ Tagged page
    ■ MySQL lookup of a single tag id
    ■ sorted by reverse chronological order
  ○ Finding blog
    ■ navigate through curated directories

Page 4: Search at Tumblr (nyc search meetup)

About search@tumblr

● Search Team
  ○ July 2012: Jak joined as the first search engineer!
● Features launched in 2013
  ○ Post search, Blog search, Theme search
  ○ Typeaheads, Recommendation, Trends

Jak, Yufei, Bennett, Beitao, Patrick, Adam

Page 5: Search at Tumblr (nyc search meetup)

Whole New Search

Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
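The slide only says that related search is driven by term co-occurrence. As a rough illustration of that idea, the sketch below counts how often two tags appear on the same post and returns the most frequent partners of a term. The function names, the sample posts, and the use of raw counts (rather than any normalized score) are assumptions, not details from the talk.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tag_lists):
    """Count how often each unordered pair of tags appears on the same post."""
    pair_counts = Counter()
    for tags in tag_lists:
        # Normalize and de-duplicate so a repeated tag is counted once per post.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def related_terms(term, pair_counts, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    term = term.strip().lower()
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical sample data: each inner list is the tag list of one post.
posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "photography"],
    ["dogs", "gif"],
]
counts = cooccurrence_counts(posts)
print(related_terms("cats", counts))
```

Raw counts favor very common terms; a production system would likely normalize them, but the slide does not say how.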

Page 6: Search at Tumblr (nyc search meetup)

Typeahead Autocompletes

Search Autocomplete, Mention Autocomplete, Tag Suggest

● Interactive guide of tumblr content
● High volume of traffic
● Low latency
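One common way to meet the high-traffic, low-latency requirement for typeahead is to precompute the top suggestions for every prefix so that serving is a single key lookup. The sketch below shows that approach with an in-memory dictionary. The function names, the popularity counts, and the prefix length cap are hypothetical; the slides do not describe the actual data layout.

```python
from collections import defaultdict


def build_typeahead_index(term_counts, k=3, max_prefix_len=10):
    """Map every prefix to its k most popular completions (built offline)."""
    buckets = defaultdict(list)
    for term, count in term_counts.items():
        for i in range(1, min(len(term), max_prefix_len) + 1):
            buckets[term[:i]].append((count, term))
    index = {}
    for prefix, entries in buckets.items():
        # Most popular first; ties broken alphabetically.
        entries.sort(key=lambda e: (-e[0], e[1]))
        index[prefix] = [term for _, term in entries[:k]]
    return index


def suggest(index, typed):
    """Serve suggestions with a single dictionary lookup."""
    return index.get(typed.strip().lower(), [])


# Hypothetical popularity counts per term.
term_counts = {
    "fashion": 900,
    "fandom": 700,
    "food": 1200,
    "fall": 300,
    "photography": 1500,
}
index = build_typeahead_index(term_counts)
print(suggest(index, "fa"))
```

The trade-off is memory for speed: every prefix stores its own short list, which keeps the online path free of any sorting or scanning.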

Page 7: Search at Tumblr (nyc search meetup)

Recommendations

Personalized Recommendation, Weekly Dashboard Digest

Page 8: Search at Tumblr (nyc search meetup)

Trends

Trending Tags Trending Blogs

Page 9: Search at Tumblr (nyc search meetup)

Theme Search

Page 10: Search at Tumblr (nyc search meetup)

Search Architecture

[Architecture diagram. Box labels recovered from the slide, grouped under the nearest heading in the extracted text; the layer labels on the diagram are Data, Offline, and Online.]

● Indices: Recent Post Index, Global Tag Index, In-Blog Tag Index, Blog Full Index, Blog Top-K Index, Personalized Blog Index, Trending Blogs, Trending Posts, Trending Tags, Related Tag Index, Typeahead Indices, Blog Top Posts, Blog Top Tags, Like Root
● Search Offline Framework: Post Notecount, Post Model, Blog Model, User Model, Follower Counts, Blog Global Rank, Blog Feedback, Two Degree
● Data: Activity Streams (Fire Geyser); Scribe logs, Sqoop tables (HDFS)
● Search Online Framework: Post Search, Typeahead, Blog Recommend, Related Tags, Blog Highlights, Blog Top Tags, Blog Search, Trending Tags, Trending Blogs, Trending Posts, Rediscover
● Also shown: Solr, MySQL, TopPost Index, Theme Index, Nginx, Linux

Page 11: Search at Tumblr (nyc search meetup)

Software Stack

● Search Online
  ○ HAProxy, Nginx, PHP
  ○ Memcache
  ○ Icinga, Scribe, OpenTSDB
● Search Data
  ○ Solr, Redis, MySQL
● Search Offline
  ○ Sqoop, Hadoop
  ○ Java, Hive, Pig, Scalding, Python

Page 12: Search at Tumblr (nyc search meetup)

Search Online Framework

SearchBase

[Class diagram. Interfaces and the implementations shown on the slide, grouped here by name suffix.]

● QueryIF: SimpleQuery, PersonalizedQuery, AdvancedPostQuery, TimeSliceQuery, TrendTagQuery
● RetrieverIF: SolrPostRetriever, MysqlPostRetriever, SMPostRetriever, TumblelogRetriever, TagTypeaheadReteriever
● SignalFetcherIF: NotecountFetcher, FollowercountFetcher, TumblelogGlobalRankFetcher, RecommendationSignalFetcher, BlogTopTagFetcher
● RankerIF: TopPostRanker, TumblelogRanker, RelatedPostRanker, TumblelogMixingRanker
● DocFetcherIF: PostFetcher, TumblelogFetcher, TagFetcher
● FilterIF: PostFilter, TumblelogFilter, TagFilter
● Other boxes: Search Flow Execution, Multi-level Caching, Search Logging, Async Execution, Search Services, Search Editorial
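The interface names on this slide (query, retriever, signal fetcher, ranker, doc fetcher, filter) suggest a staged pipeline in which each stage is swappable. The sketch below is one way such a flow could be composed. The stage order, the class names, the shared context object, and the sample data are all assumptions for illustration; the production framework is written against PHP per the software stack slide and is not shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class SearchContext:
    """State passed from stage to stage (hypothetical structure)."""
    query: str
    candidates: list = field(default_factory=list)  # candidate doc ids
    signals: dict = field(default_factory=dict)     # doc id -> signal value
    docs: list = field(default_factory=list)        # hydrated documents


class InMemoryRetriever:
    def __init__(self, postings):
        self.postings = postings

    def run(self, ctx):
        ctx.candidates = list(self.postings.get(ctx.query, []))


class NoteSignalFetcher:
    def __init__(self, notecounts):
        self.notecounts = notecounts

    def run(self, ctx):
        ctx.signals = {d: self.notecounts.get(d, 0) for d in ctx.candidates}


class NoteRanker:
    def run(self, ctx):
        # Highest signal first; doc id breaks ties.
        ctx.candidates.sort(key=lambda d: (-ctx.signals.get(d, 0), d))


class DictDocFetcher:
    def __init__(self, store):
        self.store = store

    def run(self, ctx):
        ctx.docs = [self.store[d] for d in ctx.candidates if d in self.store]


class VisibleOnlyFilter:
    def run(self, ctx):
        ctx.docs = [doc for doc in ctx.docs if not doc.get("hidden", False)]


def execute(query, stages):
    """Run each stage in order against a shared context."""
    ctx = SearchContext(query=query)
    for stage in stages:
        stage.run(ctx)
    return ctx.docs


# Hypothetical sample data.
postings = {"cats": [1, 2, 3]}
notecounts = {1: 5, 2: 40, 3: 12}
store = {
    1: {"id": 1, "title": "cat nap"},
    2: {"id": 2, "title": "cat video", "hidden": True},
    3: {"id": 3, "title": "cat gif"},
}
stages = [
    InMemoryRetriever(postings),
    NoteSignalFetcher(notecounts),
    NoteRanker(),
    DictDocFetcher(store),
    VisibleOnlyFilter(),
]
results = execute("cats", stages)
print([doc["id"] for doc in results])
```

Keeping every stage behind the same small interface is what lets one flow be reused for post search, blog search, and typeahead by swapping implementations.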

Page 13: Search at Tumblr (nyc search meetup)

[Diagram. Box labels recovered from the slide.]

● Search Task Base
● Search Batch Processing: Hive Jobs, Streaming Jobs, Scalding Jobs, Pig Jobs
● Scribe Logs, Sqoop Tables (HDFS)
● Search Data (Redis)
● Search Workflow Engine: Workflow Composition, Dependency Resolution, Automatic Versioning, Data Verification, Failure Detection/Alert, Execution Logging
● Other boxes: Term Generators, Top-K Indexer, Lucille2 Classes, Delta Propagator
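Dependency resolution in a workflow engine usually comes down to running tasks in topological order. The sketch below illustrates that with Python's standard `graphlib` module (Python 3.9+). The task names and the `run_workflow` helper are hypothetical, and nothing here reflects how Tumblr's engine is actually implemented.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run tasks so that every task starts after all of its dependencies.

    `dependencies` maps a task name to the set of tasks it depends on.
    Returns the execution log as a list of task names.
    """
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        tasks[name]()
        log.append(name)
    return log


# Hypothetical batch tasks; each one just records that it ran.
ran = []
tasks = {
    "import_logs": lambda: ran.append("import_logs"),
    "generate_terms": lambda: ran.append("generate_terms"),
    "build_topk_index": lambda: ran.append("build_topk_index"),
    "publish_to_store": lambda: ran.append("publish_to_store"),
}
dependencies = {
    "generate_terms": {"import_logs"},
    "build_topk_index": {"generate_terms"},
    "publish_to_store": {"build_topk_index"},
}
log = run_workflow(tasks, dependencies)
print(log)
```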

Page 14: Search at Tumblr (nyc search meetup)

Indexing

● 3-Tier indices
  ○ Index all posts
    ■ 600+ machines
  ○ Recent (6W) + Popular (4Y) + Existing tag table
    ■ Down to 40 machines
    ■ Minor loss in coverage
    ■ Serve up to 4K qps (non-cached)
● Lean index
  ○ Separate signals from index
    ■ Eliminate high volume re-indexing
    ■ Independent signal engineering from indexing
  ○ Separate document text from index
    ■ Dropping the memory footprint
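To illustrate why separating signals from the index eliminates high-volume re-indexing, the sketch below keeps term postings in one structure and fast-changing counters in another. A new like or reblog only touches the signal store, while the index stays unchanged. The class names and the in-memory dictionaries are assumptions standing in for whatever index and signal stores are used in production.

```python
class LeanIndex:
    """Inverted index that stores only term -> post ids (no signals, no text)."""

    def __init__(self):
        self.postings = {}

    def add(self, post_id, text):
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(post_id)

    def lookup(self, term):
        return self.postings.get(term.lower(), set())


class SignalStore:
    """Fast-changing counters kept outside the index (hypothetical)."""

    def __init__(self):
        self.notes = {}

    def increment(self, post_id, delta=1):
        self.notes[post_id] = self.notes.get(post_id, 0) + delta

    def get(self, post_id):
        return self.notes.get(post_id, 0)


def search(index, signals, term):
    """Retrieve from the index, then order by signals fetched at query time."""
    hits = index.lookup(term)
    return sorted(hits, key=lambda pid: (-signals.get(pid), pid))


index = LeanIndex()
index.add(1, "sunset photo")
index.add(2, "city photo")

signals = SignalStore()
signals.increment(2, 5)
before = search(index, signals, "photo")

# A burst of activity on post 1 changes the ranking without re-indexing.
signals.increment(1, 9)
after = search(index, signals, "photo")
print(before, after)
```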

Page 15: Search at Tumblr (nyc search meetup)

Ranking

● Quickly evolving!
● Major ranking signals in production
  ○ Global popularity
    ■ likes, reblogs, follows
  ○ Local popularity
    ■ popularity projected on <user, query>
      ● blog search: aggregated likes on query term
      ● blog recommendation: follow counts among friends
  ○ Textual relevancy
    ■ how: exact match, query proximity
    ■ where: name, title, tag, mention, body, etc.
  ○ Recency
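The slide lists the signal families but not how they are combined. A weighted sum is one simple way to blend them, sketched below with log-scaled popularity and an exponential recency decay. The weights, the half-life, the field names, and the linear form are all assumptions chosen for illustration, not the production ranking function.

```python
import math

# Hypothetical weights; the talk does not give any values.
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 3.0, "recency": 1.5}


def score(post, now, half_life_days=7.0):
    """Blend four signal families into one score (illustrative only)."""
    # Global popularity: log scaling damps very large counts.
    global_pop = math.log1p(post["likes"] + post["reblogs"])
    # Local popularity: activity from accounts the searching user follows.
    local_pop = math.log1p(post["likes_from_followed"])
    # Textual relevancy: assumed to be precomputed in the range [0, 1].
    text = post["text_match"]
    # Recency: halves every `half_life_days`.
    age_days = (now - post["created"]) / 86400.0
    recency = 0.5 ** (age_days / half_life_days)
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * text
        + WEIGHTS["recency"] * recency
    )


now = 1_700_000_000  # fixed timestamp so the example is reproducible
fresh = {
    "likes": 10,
    "reblogs": 5,
    "likes_from_followed": 2,
    "text_match": 0.8,
    "created": now - 3600,
}
stale = dict(fresh, created=now - 60 * 86400)
ranked = sorted([stale, fresh], key=lambda p: score(p, now), reverse=True)
print(ranked[0] is fresh)
```

The "What's Next" slide mentions learning to rank, which would replace hand-set weights like these with a trained model.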

Page 16: Search at Tumblr (nyc search meetup)

Duplicate Elimination (DE)

● Index-time DE
  ○ post signature
    ■ number of tags > N1
    ■ md5 hash of normalized tag list
● Search-time DE
  ○ Media DE
    ■ posts with same media hashes
  ○ Near DE
    ■ posts with tags > N2
    ■ mark as near duplicate if diff <= N3 tags
    ■ older posts selected as original
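The duplicate elimination rules on this slide can be sketched roughly as follows. The thresholds N1, N2, and N3 are not given in the talk, so the defaults below are placeholders. The normalization (trim, lowercase, sort, de-duplicate) and the reading of "diff" as the size of the symmetric difference between tag sets are also assumptions.

```python
import hashlib


def normalize_tags(tags):
    """Lowercase, trim, de-duplicate, and sort tags (assumed normalization)."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags, n1=3):
    """Index-time signature: md5 of the normalized tag list.

    Returns None when the post has n1 tags or fewer, since a short tag
    list is too weak a fingerprint. The value of n1 is a placeholder.
    """
    norm = normalize_tags(tags)
    if len(norm) <= n1:
        return None
    return hashlib.md5(",".join(norm).encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b, n2=3, n3=1):
    """Search-time check: both posts have more than n2 tags and differ by
    at most n3 tags. n2 and n3 are placeholders."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= n2 or len(b) <= n2:
        return False
    return len(a ^ b) <= n3


def pick_original(posts):
    """The older post is treated as the original."""
    return min(posts, key=lambda p: p["created"])


sig_a = post_signature(["Cats", "gif", "funny", "cute"])
sig_b = post_signature(["cute ", "FUNNY", "gif", "cats"])
print(sig_a == sig_b)
print(is_near_duplicate(["a", "b", "c", "d"], ["a", "b", "c", "d", "e"]))
```

Computing the exact signature at index time keeps the search-time work limited to the cheaper set comparison over the result page.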

Page 17: Search at Tumblr (nyc search meetup)

Search Platform

● A curvy road
  ○ Started with ElasticSearch
  ○ Switched to SolrCloud due to reliability
  ○ Ended up with Solr + Customized Clustering
● Our takes
  ○ ElasticSearch and SolrCloud have great functionality
    ■ distributed indexing and search
    ■ easy cluster management
  ○ Solr still seems much more reliable under high indexing load and search traffic.

Page 18: Search at Tumblr (nyc search meetup)

Offline Precomputation

● Benefits
  ○ Minimize the search online latency
  ○ More sophisticated/expensive computation
● Limitation
  ○ Loss of freshness
  ○ Expensive for longtail query and results
● Precomputed
  ○ Typeaheads
  ○ Related search
  ○ Blog recommendation
  ○ Top posts of Blog / User
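As an example of the offline precomputation trade-off, the sketch below builds a "top posts per blog" table in a batch step so that the online path is a single lookup. The field names, the value of k, and the plain dictionary standing in for the serving store are assumptions; the architecture slides list Redis as the search data store, but the actual job is not shown.

```python
import heapq
from collections import defaultdict


def precompute_top_posts(posts, k=2):
    """Batch step: keep the k posts with the most notes for every blog."""
    by_blog = defaultdict(list)
    for post in posts:
        by_blog[post["blog"]].append(post)
    table = {}
    for blog, items in by_blog.items():
        # Most notes first; the smaller post id wins ties.
        top = heapq.nlargest(k, items, key=lambda p: (p["notes"], -p["id"]))
        table[blog] = [p["id"] for p in top]
    return table


def top_posts(table, blog):
    """Online step: a single lookup, no ranking work at request time."""
    return table.get(blog, [])


# Hypothetical snapshot of post data produced by an earlier batch job.
posts = [
    {"id": 1, "blog": "a", "notes": 10},
    {"id": 2, "blog": "a", "notes": 50},
    {"id": 3, "blog": "a", "notes": 30},
    {"id": 4, "blog": "b", "notes": 5},
]
table = precompute_top_posts(posts)
print(top_posts(table, "a"))
```

The table is only as fresh as the last batch run, which is the "loss of freshness" limitation named on the slide.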

Page 19: Search at Tumblr (nyc search meetup)

What’s Next

● Inblog search
  ○ full text search on all posts in a blog
  ○ original posts, reblogs, likes
● Ranking
  ○ more effective and spam-resilient signals
  ○ learning to rank
● Topical interest modeling
  ○ supervised and unsupervised
  ○ blog content and user activities
  ○ interest based blog recommendation
● Content discovery
  ○ trending content in various categories

Page 20: Search at Tumblr (nyc search meetup)

Question: Are you hiring?

Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)

Q & A