[@IndeedEng] Large scale interactive analytics with Imhotep

Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg

In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data, allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.

Large Scale Interactive Analytics with Imhotep

Tom Bergman, Product Manager

Zak Cocos, Manager, Marketing Science

We help people get jobs.

What is Imhotep?

Imhotep is a highly scalable analytics architecture for querying faceted datasets

Open sourcing Imhotep

Imhotep will be an OPEN SOURCE highly scalable analytics architecture for querying faceted datasets

People

Tools

System

Data

A Brief History of Analytics @Indeed

What's best for the job seeker?

Test & Measure EVERYTHING

Query

Query Location

Impression

Organic Impression Log Entry

Title: Front End Software Engineer
Position: 1
Clicked: 0
Country: US
Query: indeed software engineer
Location: austin
Timestamp: 2014-04-30T20:00:00

Analytics on Raw Logs

Ramses

● Search logs
● Extract metrics from matches
● Graph aggregated metrics

Input -> Query and Metric
Output -> Aggregated metrics by bucket
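The input/output contract above can be sketched in a few lines of Python. This is only an illustration of "query plus metric in, aggregated metrics by bucket out"; it is not how Ramses is implemented. The sample entries, the hourly bucket, and the function name `aggregate_by_bucket` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log entries, shaped like the organic impression log entry.
LOG = [
    {"country": "au", "clicked": 1, "timestamp": "2014-04-30T20:05:00"},
    {"country": "au", "clicked": 0, "timestamp": "2014-04-30T20:40:00"},
    {"country": "au", "clicked": 1, "timestamp": "2014-04-30T21:10:00"},
    {"country": "us", "clicked": 1, "timestamp": "2014-04-30T21:15:00"},
]


def aggregate_by_bucket(entries, query, metric, bucket_format="%Y-%m-%dT%H:00"):
    """Sum `metric` over entries matching `query`, grouped into time buckets."""
    totals = defaultdict(int)
    for entry in entries:
        # An entry matches when every field in the query has the requested value.
        if all(entry.get(field) == value for field, value in query.items()):
            moment = datetime.strptime(entry["timestamp"], "%Y-%m-%dT%H:%M:%S")
            totals[moment.strftime(bucket_format)] += metric(entry)
    return dict(sorted(totals.items()))


# Query: country:au, metric: organic clicks (assumed here to be the clicked flag).
organic_clicks = aggregate_by_bucket(LOG, {"country": "au"}, lambda e: e["clicked"])
print(organic_clicks)
```

Each bucket in the result corresponds to one point on the graph of aggregated metrics.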

How many organic clicks did we have in Australia?

QUERY
country:au

METRIC
organic_clicks

Does test group A or B have more revenue?

QUERY
testgroup:A, testgroup:B

METRIC
revenue

How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?

QUERY
from:yahoo AND country:(gb, de, jp)

METRIC
visits

Questions Ramses can’t answer

● How many unique queries in the US?
● What are the top 50 queries in the US?
● How many clicks did each of those queries receive?
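All three questions appear to need a group-by over every distinct value of a field, rather than a metric for one fixed query. A minimal sketch of that kind of computation follows; the sample data and the function name `query_report` are hypothetical, and a real system would run this over billions of documents rather than an in-memory list.

```python
from collections import Counter

# Hypothetical impressions; only the fields needed for the example.
IMPRESSIONS = [
    {"country": "US", "query": "nurse", "clicked": 1},
    {"country": "US", "query": "nurse", "clicked": 1},
    {"country": "US", "query": "nurse", "clicked": 0},
    {"country": "US", "query": "software engineer", "clicked": 1},
    {"country": "US", "query": "software engineer", "clicked": 0},
    {"country": "US", "query": "architecture", "clicked": 0},
    {"country": "GB", "query": "nurse", "clicked": 1},
]


def query_report(impressions, country, top_n):
    """Return (number of unique queries, [(query, impressions, clicks), ...])."""
    impression_counts = Counter()
    click_counts = Counter()
    for row in impressions:
        if row["country"] == country:
            impression_counts[row["query"]] += 1
            click_counts[row["query"]] += row["clicked"]
    top = [
        (query, count, click_counts[query])
        for query, count in impression_counts.most_common(top_n)
    ]
    return len(impression_counts), top


unique_queries, top_queries = query_report(IMPRESSIONS, "US", top_n=2)
print(unique_queries)  # 3
print(top_queries)     # [('nurse', 3, 2), ('software engineer', 2, 1)]
```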

Imhotep

Imhotep Origins

Imhotep began as a distributed iteration and group-by engine for building click prediction models. We then leveraged its ability to do massive group-bys and aggregates to make a real-time analytics engine.

Decision Tree Builder

We use an iterative algorithm to build decision trees level-by-level.
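One common way to make group-bys and aggregates scale is to compute partial totals on each shard of the data and then merge them. The sketch below shows that general pattern only; the talk does not describe Imhotep's internals at this level, so the shard layout and the helper names `partial_group_by` and `merge` are assumptions for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical shards: each list stands in for the documents held by one machine.
SHARDS = [
    [{"country": "us", "clicked": 1}, {"country": "gb", "clicked": 0}],
    [{"country": "us", "clicked": 1}, {"country": "us", "clicked": 0}],
    [{"country": "gb", "clicked": 1}],
]


def partial_group_by(shard, field, metric):
    """Aggregate one shard locally: sum `metric` for each value of `field`."""
    totals = Counter()
    for doc in shard:
        totals[doc[field]] += doc[metric]
    return totals


def merge(left, right):
    """Combine two partial results by adding the totals for matching groups."""
    merged = Counter(left)
    merged.update(right)
    return merged


partials = [partial_group_by(shard, "country", "clicked") for shard in SHARDS]
clicks_by_country = dict(reduce(merge, partials, Counter()))
print(clicks_by_country)
```

Because sums can be merged in any order, each shard can be processed independently and only the small partial results need to be sent over the network.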

How many Android App users with accounts older than 30 days saved at least 1 job in the past week?

What titles have the highest click-through rate for the query “Architecture” in the US?

What about the lowest click-through rate?

For job seekers who click on Google jobs in Ireland, what other companies’ jobs do they click on?
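The last question above is a two-step group-by: first find the job seekers who clicked on one company's jobs, then count the other companies those same job seekers clicked on. A minimal sketch follows. The `user` field, the sample clicks, and the function name `co_clicked_companies` are hypothetical; the slides do not show how job seekers are identified in the index.

```python
from collections import Counter

# Hypothetical click records.
CLICKS = [
    {"user": "u1", "country": "ie", "company": "Google"},
    {"user": "u1", "country": "ie", "company": "Acme"},
    {"user": "u2", "country": "ie", "company": "Google"},
    {"user": "u2", "country": "ie", "company": "Acme"},
    {"user": "u2", "country": "ie", "company": "Globex"},
    {"user": "u3", "country": "ie", "company": "Globex"},
    {"user": "u4", "country": "us", "company": "Google"},
    {"user": "u4", "country": "us", "company": "Initech"},
]


def co_clicked_companies(clicks, company, country):
    """Rank other companies by how many of `company`'s clickers also clicked them."""
    in_country = [c for c in clicks if c["country"] == country]
    # Step 1: job seekers who clicked at least one job from the given company.
    seekers = {c["user"] for c in in_country if c["company"] == company}
    # Step 2: distinct (job seeker, other company) pairs for those job seekers.
    pairs = {
        (c["user"], c["company"])
        for c in in_country
        if c["user"] in seekers and c["company"] != company
    }
    return Counter(other for _, other in pairs).most_common()


print(co_clicked_companies(CLICKS, "Google", "ie"))  # [('Acme', 2), ('Globex', 1)]
```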

Zak Cocos, Manager, Marketing Science

I also help people get jobs.

Marketing Sciences

Research, analysis, and automation team supporting marketing initiatives

Imhotep

Imhotep is a highly scalable, [soon to be] open source, analytics architecture for querying faceted datasets

Imhotep@Indeed

Ad hoc exploration

Specific analysis

Extensible infrastructure

Ad hoc exploration

Public Crunchbase Dataset

Document

Fields

Metric

Source: CrunchBase (CrunchBase 2013 Snapshot © 2013)

Imhotep Data Explorer

Interactive tool for exploring Imhotep data

Also: a badass hyperlinked pivot table

Imhotep is Large Scale

Total size of all indexes: 125TB

Jobsearch index (largest): 30TB

● Over 48 billion documents

Organic Impression

A job that was displayed as the result of a search

Title

Company Information

Description

Job Age

[Slide: a dense grid of the short field names available on each organic impression document, for example dayofweek, timeofday, serpsize, returnvisit, and totrev.]

Organic Impression Document (in the Organic Impression Index)

Title: Front End Software Engineer
Position: 1
Clicked: 0
Country: US
Query: indeed software engineer
Location: austin
Timestamp: 2014-04-30T20:00:00

Imhotep Data Explorer can’t...

● Combine results from multiple datasets
● Be easily automated

Imhotep Query Language (IQL)

● Can combine results from multiple datasets
● Allows for automation of data tools

IQL queries - requirements

● Index
● Date range
● Metrics

IQL queries - optional

● Filters
● Group by

IQL - Query structure

select count()
from organic '2013-12-05' '2013-12-10'
where country=ie and clicked=1
group by companyid

● Metrics: select count()
● Index: from organic
● Date range: '2013-12-05' '2013-12-10'
● Filters: where country=ie and clicked=1
● Groups: group by companyid
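To make the parts of the query concrete, here is a plain-Python equivalent of what the example query computes: the number of clicked organic impressions in Ireland per company within the date range. The sample documents and the function name are hypothetical, and treating the end date as exclusive is an assumption of this sketch.

```python
from collections import Counter
from datetime import date

# Hypothetical documents standing in for the "organic" index.
ORGANIC = [
    {"day": date(2013, 12, 5), "country": "ie", "clicked": 1, "companyid": 101},
    {"day": date(2013, 12, 6), "country": "ie", "clicked": 1, "companyid": 101},
    {"day": date(2013, 12, 7), "country": "ie", "clicked": 0, "companyid": 101},
    {"day": date(2013, 12, 9), "country": "ie", "clicked": 1, "companyid": 202},
    {"day": date(2013, 12, 9), "country": "us", "clicked": 1, "companyid": 202},
    {"day": date(2013, 12, 10), "country": "ie", "clicked": 1, "companyid": 202},
]


def count_by_company(docs, start, end, country):
    """select count() ... where country=<country> and clicked=1 group by companyid."""
    counts = Counter()
    for doc in docs:
        in_range = start <= doc["day"] < end  # assumption: end date is exclusive
        if in_range and doc["country"] == country and doc["clicked"] == 1:
            counts[doc["companyid"]] += 1  # count() adds one per matching document
    return dict(counts)


result = count_by_company(ORGANIC, date(2013, 12, 5), date(2013, 12, 10), "ie")
print(result)  # {101: 2, 202: 1}
```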

IQL Question

Do companies in Austin that have raised more than $10 million get more clicks on average than those that have raised less than $10 million?

Methodology

1) organic index: select companies in the US which received organic clicks

2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin

3) Join, segment, and do the math!
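Step 3 can be sketched as follows, assuming the two queries have already produced per-company results. The company names, click counts, and funding amounts are made up for the example, and `average_clicks_by_funding` is a hypothetical helper, not part of IQL.

```python
from statistics import mean

# Hypothetical result of step 1: company -> organic clicks in the US.
organic_clicks = {"Acme": 120, "Globex": 40, "Initech": 300, "Umbrella": 75}

# Hypothetical result of step 2: Austin company -> total funding in dollars.
austin_funding = {
    "Acme": 25_000_000,
    "Globex": 2_000_000,
    "Initech": 12_000_000,
    "Hooli": 50_000_000,
}


def average_clicks_by_funding(clicks, funding, threshold=10_000_000):
    """Join on company, split at the funding threshold, and average the clicks."""
    segments = {"above": [], "below": []}
    # Join: keep only companies present in both result sets.
    for company in clicks.keys() & funding.keys():
        # Segment: a company at exactly the threshold lands in "below" here.
        segment = "above" if funding[company] > threshold else "below"
        segments[segment].append(clicks[company])
    # Do the math: average clicks per segment, None when a segment is empty.
    return {name: mean(values) if values else None for name, values in segments.items()}


averages = average_clicks_by_funding(organic_clicks, austin_funding)
print(averages)
```

With this sample data the "above" segment averages 210 clicks and the "below" segment averages 40, but real conclusions would of course depend on the actual query results.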

Tom Bergman, Product Manager

I still help people get jobs.

Large Scale Interactive Analytics Platform

● 123 unique indexes
● Largest index: 30TB
● Total size: ~125TB

IQL -> Largely programmatic access
● approx 76k queries/day
● Avg time to execute: 0.67 seconds

Ramses -> Largely human
● approx 3,400 queries/day
● Avg time to execute: 4.4 seconds

Users
● 198 unique users in past month
● 25,622 unique queries in past month
● Avg 53 queries/user per day

40+ internal clients
● 6 analytics webapps
● 5 dashboards
● 10 programming/scripting shells
● 6 monitoring apps
● … and more

One tool-set for all data
● Website usage
● Operational monitoring
● Financial reporting
● Google Analytics
● Internal webapp usage
● External reports

Solving a real problem

Providing the Best Results

Show the jobs that are most interesting to our users

Clicks are a very good indicator of interest

More clicks -> More relevant
Fewer clicks -> Less relevant

Architecture

Very hard query to serve correctly

Architecture terminology has been co-opted by technology

Terminology Common to both Software and Architecture

Blueprint, Design, Framework, Infrastructure, Engineer, Project manager, Development, Technical architect, Software, Modeling, Computation, Code reviews

Architecture vs Software Titles

Architecture: Architect, CAD Designer, Project Manager
Software: Software Architect, UI Designer, Project Manager

Query Management

Indeed uses Imhotep to improve matching

Automatically detect results that should be added to or removed from queries

26,790 rules across all countries
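The slides do not say how these rules are detected. As one plausible illustration of using clicks as the relevance signal, the sketch below flags (query, title) pairs that have many impressions but an unusually low click-through rate as candidates for removal. The thresholds, the sample statistics, and the function name `suggest_removals` are all assumptions, not Indeed's actual method.

```python
# Hypothetical per-(query, title) statistics: (impressions, clicks).
STATS = {
    ("architecture", "Architect"): (1000, 80),
    ("architecture", "Software Architect"): (1000, 3),
    ("architecture", "UI Designer"): (50, 0),
}


def suggest_removals(stats, min_impressions=100, max_ctr=0.01):
    """Return (query, title) pairs with enough data and a very low click-through rate."""
    candidates = []
    for (query, title), (impressions, clicks) in stats.items():
        # Skip pairs with too few impressions to judge reliably.
        if impressions < min_impressions:
            continue
        if clicks / impressions < max_ctr:
            candidates.append((query, title))
    return sorted(candidates)


print(suggest_removals(STATS))  # [('architecture', 'Software Architect')]
```

The minimum-impressions guard reflects a design choice: a title shown only a handful of times has too little data for its click-through rate to mean much.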

Imhotep Open Source

ETA: August 1, 2014

Follow along at our blog: engineering.indeed.com

Sign up for the mailing list to get the latest updates: go.indeed.com/imhotep-announce

Q & A

Next @IndeedEng Talk: Launching Indeed Around the World

Davide Novelli, International Director
David Tulig, Tech Lead

May 28, 2014

http://engineering.indeed.com/talks

More Questions? Jason David James Jeff
