indeedeng
Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg
In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data - allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.
go.indeed.com/IndeedEngTalks
Large Scale Interactive Analytics
with Imhotep
Tom Bergman, Product Manager
Zak Cocos, Manager, Marketing Science
We help people get jobs.
What is Imhotep?
Imhotep is a highly scalable analytics architecture for querying faceted datasets
Open sourcing Imhotep
Imhotep will be an OPEN SOURCE highly scalable analytics architecture for querying faceted datasets
People
Tools
System
Data
A Brief History of Analytics
@Indeed
What's best for the job seeker?
Test & Measure EVERYTHING
Query
Query Location
Impression
Title: Front End Software Engineer
Position: 1
Clicked: 0
Country: US
Query: indeed software engineer
Location: austin
Timestamp: 2014-04-30T20:00:00
Organic Impression Log Entry
Analytics on Raw Logs
Ramses
● Search logs
● Extract metrics from matches
● Graph aggregated metrics
Input -> Query and Metric
Output -> Aggregated metrics by bucket
How many organic clicks did we have in Australia?
QUERY
country:au
METRIC
organic_clicks
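A minimal Python sketch of that query-plus-metric idea, assuming hourly buckets and a flat list of log entries. The `aggregate` helper, the sample entries, and the bucket size are illustrative assumptions, not Ramses internals.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log entries; the field names mirror the query and metric above.
LOG = [
    {"country": "au", "organic_clicks": 1, "timestamp": "2014-04-30T20:05:00"},
    {"country": "au", "organic_clicks": 0, "timestamp": "2014-04-30T20:40:00"},
    {"country": "us", "organic_clicks": 1, "timestamp": "2014-04-30T20:10:00"},
    {"country": "au", "organic_clicks": 1, "timestamp": "2014-04-30T21:15:00"},
]

def aggregate(entries, query, metric):
    """Sum `metric` per hourly bucket over entries matching every field in `query`."""
    buckets = defaultdict(int)
    for entry in entries:
        if all(entry.get(field) == value for field, value in query.items()):
            ts = datetime.strptime(entry["timestamp"], "%Y-%m-%dT%H:%M:%S")
            hour = ts.replace(minute=0, second=0)  # truncate to the hour bucket
            buckets[hour.isoformat()] += entry[metric]
    return dict(sorted(buckets.items()))

result = aggregate(LOG, {"country": "au"}, "organic_clicks")
print(result)
```

Each bucket in the returned mapping would become one point on the graph of aggregated metrics.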
Does test group A or B have more revenue?
QUERY
testgroup:A, testgroup:B
METRIC
revenue
How has traffic from Yahoo! changed over time in Great Britain, Germany, and Japan?
QUERY
from:yahoo AND country:(gb, de, jp)
METRIC
visits
● How many unique queries in the US?
● What are the top 50 queries in the US?
● How many clicks did each of those queries receive?
Questions Ramses can’t answer
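These three questions all need a group-by over a field's values rather than one aggregate per filter. A small single-machine sketch of that shape in Python, assuming impressions arrive as tuples and that "top" means most impressions; the `query_stats` helper and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical impressions as (country, query, clicked) tuples.
IMPRESSIONS = [
    ("us", "software engineer", 1),
    ("us", "nurse", 0),
    ("us", "software engineer", 0),
    ("us", "nurse", 1),
    ("us", "software engineer", 1),
    ("gb", "nurse", 1),
    ("us", "architecture", 0),
]

def query_stats(impressions, country, top_n):
    """Group one country's impressions by query; return unique count and top queries."""
    counts = Counter()   # query -> impressions
    clicks = Counter()   # query -> clicks
    for doc_country, query, clicked in impressions:
        if doc_country == country:
            counts[query] += 1
            clicks[query] += clicked
    top = [(query, n, clicks[query]) for query, n in counts.most_common(top_n)]
    return len(counts), top

unique_queries, top_queries = query_stats(IMPRESSIONS, "us", top_n=2)
print(unique_queries, top_queries)
```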
Imhotep
Began as a distributed iteration and group-by
engine for building click prediction models.
Imhotep Origins
We use an iterative algorithm to build decision
trees level-by-level.
Decision Tree Builder
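One way to picture level-by-level building on top of a group-by engine, as a single-process Python sketch. The feature names, the sample documents, and the choice of Gini impurity as the split criterion are assumptions for illustration; the talk does not state which criterion the builder uses.

```python
from collections import defaultdict

# Hypothetical training documents: binary features plus a click label.
DOCS = [
    {"mobile": 1, "top_position": 1, "clicked": 1},
    {"mobile": 1, "top_position": 0, "clicked": 0},
    {"mobile": 0, "top_position": 1, "clicked": 1},
    {"mobile": 0, "top_position": 0, "clicked": 0},
    {"mobile": 1, "top_position": 1, "clicked": 1},
    {"mobile": 0, "top_position": 0, "clicked": 0},
]
FEATURES = ["mobile", "top_position"]

def gini(clicks, total):
    """Gini impurity of a click/no-click bucket; 0.0 for an empty bucket."""
    if total == 0:
        return 0.0
    p = clicks / total
    return 2 * p * (1 - p)

def grow_one_level(docs, groups, features):
    """One iteration: group by (leaf, feature value), then split each leaf."""
    # stats[feature][(leaf, value)] = [document count, click count]
    stats = {f: defaultdict(lambda: [0, 0]) for f in features}
    for doc, leaf in zip(docs, groups):
        for f in features:
            bucket = stats[f][(leaf, doc[f])]
            bucket[0] += 1
            bucket[1] += doc["clicked"]

    best = {}  # leaf -> (weighted impurity after split, feature)
    for leaf in set(groups):
        for f in features:
            n0, c0 = stats[f][(leaf, 0)]
            n1, c1 = stats[f][(leaf, 1)]
            impurity = (n0 * gini(c0, n0) + n1 * gini(c1, n1)) / (n0 + n1)
            if leaf not in best or impurity < best[leaf][0]:
                best[leaf] = (impurity, f)

    # Regroup: each document moves to the child chosen by its leaf's feature.
    new_groups = [leaf * 2 + doc[best[leaf][1]] for doc, leaf in zip(docs, groups)]
    return new_groups, {leaf: f for leaf, (_, f) in best.items()}

groups = [0] * len(DOCS)  # every document starts in the root
groups, splits = grow_one_level(DOCS, groups, FEATURES)
print(splits, groups)
```

Calling `grow_one_level` again on the returned groups would add the next level, which is why the only primitives needed are iteration, regrouping, and grouped aggregates.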
Leveraged its ability to do massive group-bys and
aggregations to build a real-time analytics engine.
Imhotep Origins
How many Android App users with accounts
older than 30 days saved at least 1 job in the
past week?
What titles have the highest click-through rate
for the query “Architecture” in the US?
What about the lowest click-through rate?
For job seekers who click on Google jobs in
Ireland, which other companies' jobs do they
click on?
Zak Cocos, Manager, Marketing Science
I also help people get jobs.
Marketing Sciences
Research, analysis, and automation team supporting marketing initiatives
Imhotep
Imhotep is a highly scalable, [soon to be] open source, analytics architecture for querying faceted datasets
Imhotep@Indeed
Ad hoc exploration
Specific analysis
Extensible infrastructure
Ad hoc exploration
Public Crunchbase Dataset
Source: CrunchBase, CrunchBase 2013 Snapshot © 2013
[Screenshots: the same dataset view, highlighting in turn a Document, its Fields, and a Metric]
Interactive tool for exploring Imhotep data
Also: a badass hyperlinked pivot table
Imhotep Data Explorer
Imhotep is Large Scale
Total size of all indexes: 125TB
Jobsearch index (largest): 30TB
● Over 48 billion documents
Organic Impression
A job that was displayed as the result of a search
Title
Company Information
Description
Job Age
[Slide: alphabetical list of the field names available in the organic impression index]
Organic Impression Document
Title: Front End Software Engineer
Position: 1
Clicked: 0
Country: US
Query: indeed software engineer
Location: austin
Timestamp: 2014-04-30T20:00:00
Organic Impression Index
Imhotep Data Explorer can’t...
Combine results from multiple datasets
Be easily automated
Imhotep Query Language (IQL)
IQL - Imhotep Query Language
Can combine results from multiple datasets
Allows for automation of data tools
IQL queries - requirements
● Index
● Date range
● Metrics
IQL queries - optional
● Filters
● Group by
IQL - Metrics
select count()
from organic
'2013-12-05'
'2013-12-10'
where country=ie
and clicked=1
group by companyid
Metrics: select count()
IQL - Indexes
Index: from organic
IQL - Date Range
Date Range: '2013-12-05' '2013-12-10'
IQL - Filters
Filters: where country=ie and clicked=1
IQL - Groups
Groups: group by companyid
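Since IQL is meant to allow automation, a script can assemble a query from the required and optional parts. A hedged Python sketch: `build_iql` is a hypothetical helper, it follows the layout of the slide example only, and joining multiple metrics or group-by fields with commas is an assumption rather than documented IQL syntax.

```python
def build_iql(index, start, end, metrics, filters=None, group_by=None):
    """Assemble a query string laid out like the slide example (hypothetical helper)."""
    if not metrics:
        raise ValueError("at least one metric is required")
    # Required parts: metrics, index, date range.
    parts = [
        "select " + ", ".join(metrics),
        f"from {index} '{start}' '{end}'",
    ]
    # Optional parts: filters and group by.
    if filters:
        parts.append("where " + " and ".join(filters))
    if group_by:
        parts.append("group by " + ", ".join(group_by))
    return " ".join(parts)

query = build_iql(
    index="organic",
    start="2013-12-05",
    end="2013-12-10",
    metrics=["count()"],
    filters=["country=ie", "clicked=1"],
    group_by=["companyid"],
)
print(query)
```

Omitting `filters` and `group_by` leaves only the three required parts.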
IQL Question
Do companies in Austin that have raised more than $10 million get more clicks on average than those that have raised less than $10 million?
Methodology
1) organic index: select companies in the US which received organic clicks
2) crunchbase index: select companies, and the amount of funding for companies receiving investments in Austin
3) Join, segment, and do the math!
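Step 3 can be sketched in a few lines of Python once both query results are in hand. The company names, click counts, and funding amounts are invented for illustration, and joining on company name is an assumption (the real indexes may share a different key).

```python
from statistics import mean

# Hypothetical results of the two queries, keyed by company name.
organic_clicks = {"Acme": 120, "Globex": 45, "Initech": 300, "Umbrella": 80}
austin_funding_usd = {
    "Acme": 25_000_000,
    "Globex": 2_000_000,
    "Initech": 50_000_000,
    "Hooli": 8_000_000,
}

THRESHOLD_USD = 10_000_000

def average_clicks_by_segment(clicks, funding, threshold):
    """Inner-join on company, segment by funding, and average clicks per segment."""
    segments = {"over": [], "under": []}
    for company, raised in funding.items():
        if company not in clicks:  # keep only companies present in both results
            continue
        key = "over" if raised > threshold else "under"
        segments[key].append(clicks[company])
    return {key: (mean(values) if values else None) for key, values in segments.items()}

averages = average_clicks_by_segment(organic_clicks, austin_funding_usd, THRESHOLD_USD)
print(averages)
```

In this sketch a company that raised exactly $10 million lands in the "under" segment; the question on the slide leaves that boundary case open.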
Tom Bergman, Product Manager
I still help people get jobs.
Large Scale Interactive Analytics Platform
● 123 Unique Indexes
● Largest Index 30TB
● Total size ~125TB
IQL -> Largely Programmatic access
● approx 76k queries/day
● Avg time to execute 0.67 seconds
Ramses -> Largely Human
● approx 3,400 queries/day
● Avg time to execute 4.4 seconds
Users
● 198 unique users in past month
● 25,622 unique queries in past month
● Avg 53 queries/user per day
40+ internal clients
● 6 Analytics Webapps
● 5 dashboards
● 10 programming/scripting shells
● 6 monitoring apps
● … and more
One Tool-set for all data
● Website usage
● Operational Monitoring
● Financial Reporting
● Google Analytics
● Internal Webapp Usage
● External Reports
Solving a real problem
Providing the Best Results
Show the jobs that are most interesting to our users
Clicks are a very good indicator of interest
More clicks -> More Relevant
Fewer clicks -> Less Relevant
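Treating clicks as a relevance signal amounts to ranking results by click-through rate. A small Python sketch for a single query, assuming impressions are available as (title, clicked) pairs; the titles and counts are made up and `ctr_by_title` is a hypothetical helper, not Indeed's ranking code.

```python
from collections import defaultdict

# Hypothetical impressions for one query: (job title, clicked).
IMPRESSIONS = [
    ("Architect", 1), ("Architect", 1), ("Architect", 0), ("Architect", 1),
    ("Software Architect", 0), ("Software Architect", 0),
    ("Software Architect", 1), ("Software Architect", 0),
    ("CAD Designer", 1), ("CAD Designer", 0),
]

def ctr_by_title(impressions):
    """Return (title, click-through rate) pairs, highest rate first."""
    totals = defaultdict(lambda: [0, 0])  # title -> [impressions, clicks]
    for title, clicked in impressions:
        totals[title][0] += 1
        totals[title][1] += clicked
    rates = {title: clicks / shown for title, (shown, clicks) in totals.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

ranking = ctr_by_title(IMPRESSIONS)
print(ranking)
```

Titles at the bottom of such a ranking are candidates for being less relevant to the query, and titles at the top for being more relevant.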
Architecture
Very hard query to serve correctly
Architecture terminology has been co-opted by technology
Terminology Common to both Software and Architecture
Blueprint, Design, Framework, Infrastructure, Engineer, Project manager
Development, Technical architect, Software, Modeling, Computation, Code reviews
Architecture vs Software Titles
Architect, CAD Designer, Project Manager
vs
Software Architect, UI Designer, Project Manager
Query Management
Indeed uses Imhotep to improve matching
Automatically detect results that should be added or removed from queries
26,790 rules across all countries
Imhotep Open Source
Imhotep Open Source ETA: August 1, 2014
Follow along at our blog: engineering.indeed.com
Sign up for the mailing list to get the latest updates: go.indeed.com/imhotep-announce
Q & A
Next @IndeedEng Talk: Launching Indeed Around the World
Davide Novelli, International Director
David Tulig, Tech Lead
May 28, 2014
http://engineering.indeed.com/talks
More Questions? Jason, David, James, Jeff