Guide: Presented By: Dr. Sunnie S. Chung Kalpesh Sharma ...cis.csuohio.edu/~sschung/CIS601/FastDataEraofBigData_Sharma.pdf · CIS 601 Graduate Seminar Guide: Presented By: Dr. Sunnie

CIS 601 Graduate Seminar

Guide: Dr. Sunnie S. Chung

Presented By: Kalpesh Sharma (2660576), Dhruv Patel (2652790)

Abstract

Introduction

Real-Time Query Suggestion

First Solution

Second Solution (Deployed Solution)

Future work

Conclusion

The purpose of this paper is to present the architecture of Twitter’s real-time related query suggestion and spelling correction service.

The goal is to provide relevant related query suggestions within 10 minutes of major events.

Two solutions were built to achieve this target; the second is the one deployed.

The paper suggests future work to reduce the gap between big data and fast data.

It is important to deal not only with the volume of data but also with its velocity.

The architecture behind Twitter's real-time related query suggestion and spelling correction service is collectively called “search assistance”.

After significant breaking news events, Twitter aims to provide relevant search results, but the challenge is to provide those results in real time (i.e., within a few minutes).

Good related query suggestions provide:

Topicality - Capture relevant topics

Temporality - Capture temporal connection

Studies at Twitter concluded that it is important to provide relevant results within a window of 5-10 minutes:

Any longer - The reaction is too slow.

Any shorter - There is not enough evidence.

The first version of search assistance was built on the Hadoop platform.

Twitter has incorporated the following components on top of Hadoop:

Pig, Hive, ZooKeeper and Vertica

The first version of search assistance was written in Pig, with custom Java UDFs.

A Pig script running on the Hadoop stack aggregates user search sessions, computes term and co-occurrence statistics, and ranks related queries and spelling suggestions.
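The aggregation step above can be illustrated with a minimal sketch. It is written in Python rather than Pig, and the session format, the function names, and the conditional-frequency score are all assumptions made for illustration; the slides do not state which ranking formula the original job used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(sessions):
    """Count how often each query appears and how often two queries share a session."""
    query_counts = Counter()
    pair_counts = Counter()
    for queries in sessions:
        unique = sorted(set(queries))  # count each query once per session
        query_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return query_counts, pair_counts


def related_queries(query, query_counts, pair_counts, top_k=3):
    """Rank co-occurring queries by an assumed score: P(other | query)."""
    scores = {}
    for (a, b), n in pair_counts.items():
        if query == a:
            scores[b] = n / query_counts[query]
        elif query == b:
            scores[a] = n / query_counts[query]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical search sessions: each inner list is one user's queries.
sessions = [
    ["steve jobs", "apple"],
    ["steve jobs", "apple", "stay foolish"],
    ["apple", "iphone"],
]
query_counts, pair_counts = cooccurrence_stats(sessions)
suggestions = related_queries("steve jobs", query_counts, pair_counts)
```

Here "apple" co-occurs with "steve jobs" in both of its sessions and "stay foolish" in one, so the two are suggested in that order.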

Production Pig analytics jobs are coordinated by a workflow manager called Oink.

Oink is a job scheduler.

This solution worked well but had latency problems.

This latency was primarily attributed to two bottlenecks:

Bottleneck One: Log Import

Bottleneck Two: Hadoop (MapReduce Jobs)

Bottleneck One: Log Import

There is a delay in importing “client event logs”, on the order of terabytes, from the various Twitter clients onto HDFS.

Twitter uses Scribe for aggregating large volumes of streaming log data.

A Scribe daemon runs on every production host and is responsible for aggregating the data and sending it to the cluster.

Bottleneck Two: Hadoop

In Hadoop analytics, there were latencies associated with MapReduce jobs.

MapReduce jobs took 15-20 minutes to process just an hour of log data.

Hadoop was simply not designed for latency-sensitive jobs.

The deployed solution is an in-memory processing engine developed to meet the latency requirements of the search assistance application.

It uses only two sources for assistance:

Search Sessions

Tweets

It does not need clickthrough data or other user session data for search assistance.

Blender has a record of user search sessions.

Blender also makes all the user queries available to the query hose.

No need for client event Scribe logs and clickthrough data.

EarlyBird is Twitter’s inverted indexing engine. A fleet of these servers ingests tweets from the “firehose”.

The firehose is a streaming API that provides access to all tweets as they are published; EarlyBird uses it to update its in-memory indexes.

Twitter search assistance is provided by a custom, in-memory processing engine that consumes two sources:

The Twitter firehose - Feeds all the tweets to the backend engine.

The Blender query hose - Feeds all the user queries and search sessions to the backend engine.

The front-end is scalable.

The back-end is fault tolerant.

Every 5 minutes, computed results are persisted to HDFS.

Each backend instance is a multi-threaded application that consists of three major components:

The Stats Collector: reads the firehose and the query hose.

In-memory Stores: hold the most up-to-date statistics.

Rankers: periodically execute one or more ranking algorithms by consulting the in-memory stores for the raw features.
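The way the three components fit together can be sketched as follows. This is a single-threaded illustration only: the class and method names, the event format, and the "consecutive queries in a session" rule are hypothetical, and the sketch reads only a stand-in for the query hose, whereas the real engine is multi-threaded and also consumes the firehose.

```python
from collections import Counter, defaultdict


class InMemoryStore:
    """Holds the most recent statistics (hypothetical layout)."""

    def __init__(self):
        self.query_counts = Counter()
        self.cooccurrences = defaultdict(Counter)  # query -> follow-up query counts


class StatsCollector:
    """Consumes (session_id, query) events and updates the store."""

    def __init__(self, store):
        self.store = store
        self.last_query = {}  # session id -> previous query in that session

    def on_query(self, session_id, query):
        self.store.query_counts[query] += 1
        previous = self.last_query.get(session_id)
        if previous is not None and previous != query:
            # Assumed rule: consecutive queries in one session are related.
            self.store.cooccurrences[previous][query] += 1
        self.last_query[session_id] = query


class Ranker:
    """Reads raw counts from the store and returns the top suggestions."""

    def __init__(self, store):
        self.store = store

    def rank(self, query, top_k=3):
        candidates = self.store.cooccurrences.get(query, Counter())
        return [q for q, _ in candidates.most_common(top_k)]


store = InMemoryStore()
collector = StatsCollector(store)
ranker = Ranker(store)

# Hypothetical query hose events.
events = [
    ("s1", "steve jobs"), ("s1", "apple"),
    ("s2", "steve jobs"), ("s2", "apple"),
    ("s3", "steve jobs"), ("s3", "stay foolish"),
]
for session_id, query in events:
    collector.on_query(session_id, query)

top = ranker.rank("steve jobs")
```

Keeping the collector, store, and ranker separate mirrors the slide: in a multi-threaded version the ranker could run on a timer while collectors keep writing, with appropriate locking around the store.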

The most important future direction in data management is bridging the gap between platforms for “Big Data” and “Fast Data”.

Hedwig and Kafka present nice solutions.

Kafka, used by LinkedIn, can handle 10 billion messages each day.

A system can be built which automatically performs pruning when memory is needed.

Facebook uses a combination of ptail and Puma on top of its Scribe infrastructure.

Google’s Percolator tries to address the problem of handling “big data” and “fast data”.
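The idea of pruning automatically when memory is needed can be sketched as a bounded counter. The class name, the entry limit, and the "keep the heaviest half" policy are assumptions chosen for illustration; the slides do not describe a specific pruning policy.

```python
class PrunedCounter:
    """A counter that drops its lowest-weight entries once it grows too large."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.counts = {}

    def add(self, key, weight=1.0):
        self.counts[key] = self.counts.get(key, 0.0) + weight
        if len(self.counts) > self.max_entries:
            self.prune()

    def prune(self):
        # Assumed policy: keep the heaviest half of the allowed entries.
        keep = max(1, self.max_entries // 2)
        survivors = sorted(
            self.counts.items(), key=lambda kv: kv[1], reverse=True
        )[:keep]
        self.counts = dict(survivors)


counter = PrunedCounter(max_entries=4)
for key in ["a", "a", "a", "b", "b", "c", "d", "e"]:
    counter.add(key)
```

Adding the fifth distinct key "e" pushes the counter over its limit, so only the two heaviest keys, "a" and "b", survive.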

The paper presented two solutions for real-time related query suggestion:

The first solution used Hadoop.

The second solution used an in-memory approach.

A more generic data processing platform is needed to handle both “Big Data” and “Fast Data”.

The latest search assistance engine used by Twitter finds old things but still favors the new ones.
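One common way to keep old evidence while still favoring new evidence is to decay weights exponentially with age. The sketch below assumes a half-life parameter and a function name that do not come from the slides; it illustrates the general technique, not Twitter's actual formula.

```python
import math


def decayed_weight(weight, age_seconds, half_life_seconds):
    """Halve the weight for every half-life that has elapsed."""
    return weight * math.pow(0.5, age_seconds / half_life_seconds)


# Hypothetical evidence: an older suggestion seen 8 times ten minutes ago
# versus a newer suggestion seen 3 times just now, with a 5-minute half-life.
old_score = decayed_weight(8.0, age_seconds=600, half_life_seconds=300)
new_score = decayed_weight(3.0, age_seconds=0, half_life_seconds=300)
newer_wins = new_score > old_score
```

The older suggestion is not discarded, but after two half-lives its weight has dropped to a quarter of its original value, so the fresher suggestion ranks higher.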
