#datapointlive
The Human Algorithm: Automating Startup Data Collection at Mattermark
Sarah Catanzaro, Head of Data at Mattermark @sarahcat21
#DPL15 | @sarahcat21
Mattermark
Mattermark is a deal intelligence platform and private company database used by:
● investors
● business and corporate development
● sales
Scale
Over 125 million private companies in the world (only about 45.5 thousand public).
Stealth
● Private companies do not have strong incentives (e.g. legal obligations) to share data. Many may have competitive incentives to obfuscate information.
● Investors may request non-disclosure.
Software-oriented approach
● A must, due to the scale of our dataset
  ○ 1.3 million companies
  ○ 16.5k investors
  ○ 110k funding events
● Leverage a lean data team
Data collection strategy
● Web scraping
● Machine learning
● Direct submission
● Manual data entry
Investors ask questions like
What startups might raise capital in the next 6 months?
What startups is Stephanie Palmeri investing in?
Our data analysts seek to understand:
● Why does this question matter?
● What data is required to answer this question?
● Where can this data be accessed?
Next, data analysts:
1. Define repeatable processes for data collection.
2. Determine whether processes can be replicated through web scraping and/or machine learning algorithms to collect data at scale.
3. Write functional specifications, reviewed by sales and engineering team members.
Next, web and/or machine learning engineers:
1. Write dev designs, reviewed by data analysts.
2. Upon implementation and marketing release, this data becomes available to customers.
3. New questions arise and the cycle starts again.
Investors ask questions like
How much funding has a company already raised?
Who were the investors at each of those rounds?
Problems with existing sources
Rely on wiki-style data collection (cannot confirm the credibility of sources).
News reports are better, but
● facts are harder to extract
● different sources report different figures
Solution: funding automation
A new framework for collecting and synthesizing funding data.
1. News article fact extraction (machine learning)
2. Funding override system (web engineering)
3. Funding confirmation email campaign (marketing)
1. News article fact extraction
Crawl RSS feeds, extract data from stories (title, text, links, etc.)
● 750+ sources
● 5,000 - 10,000 articles
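The crawl step above might look roughly like the sketch below. It is an illustration, not Mattermark's crawler: the sample feed, the `parse_feed` helper, and the field names are all assumptions, and a real crawler would fetch XML from each source URL instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed standing in for one of the crawled sources.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Tech News</title>
    <item>
      <title>Acme Robotics raises $12M Series A</title>
      <link>https://example.com/acme-series-a</link>
      <description>Acme Robotics announced a $12M Series A led by Example Ventures.</description>
    </item>
    <item>
      <title>Five tips for better stand-ups</title>
      <link>https://example.com/stand-ups</link>
      <description>Keep them short.</description>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, text, and link."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return stories


stories = parse_feed(SAMPLE_RSS)
for story in stories:
    print(story["title"], "|", story["link"])
```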
1. News article fact extraction
Classify stories about funding
● 250 articles/day
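The slides do not say which classifier is used. As a stand-in, the sketch below scores a story by how many cue words it contains. The cue list, the threshold, and both function names are assumptions made for illustration; a trained model would learn its features from labeled articles.

```python
import re

# Assumed cue words, chosen by hand for this example.
FUNDING_CUES = {"raises", "raised", "funding", "series", "seed", "round", "investors", "led"}


def funding_score(text):
    """Fraction of the cue words that appear in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & FUNDING_CUES) / len(FUNDING_CUES)


def is_funding_story(title, body, threshold=0.25):
    """Flag a story when at least `threshold` of the cue words are present."""
    return funding_score(title + " " + body) >= threshold


print(is_funding_story("Acme Robotics raises $12M Series A",
                       "The round was led by Example Ventures."))
print(is_funding_story("Five tips for better stand-ups", "Keep them short."))
```

A keyword baseline like this is mainly useful as a sanity check against whatever learned classifier replaces it.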
1. News article fact extraction
● Identify sentences containing information about investors, amount, and/or series
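One simple way to picture this step is pattern matching over sentences. The regular expressions and helper names below are assumptions for illustration only; the slides do not describe the actual model, which is presented as a machine learning component.

```python
import re

# Hypothetical patterns for the three kinds of facts named on the slide.
PATTERNS = {
    "amount": re.compile(
        r"\$\s?\d+(?:\.\d+)?\s?(?:k|m|b|thousand|million|billion)\b", re.IGNORECASE),
    "series": re.compile(r"\b(?:seed|series\s+[a-h])\b", re.IGNORECASE),
    "investor": re.compile(
        r"\b(?:led by|backed by|participation from)\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def funding_sentences(text):
    """Return (sentence, tags) pairs for sentences matching any pattern."""
    hits = []
    for sentence in split_sentences(text):
        tags = [name for name, pattern in PATTERNS.items() if pattern.search(sentence)]
        if tags:
            hits.append((sentence, tags))
    return hits


article = ("Acme Robotics builds warehouse robots. "
           "The company raised $12M in a Series A round. "
           "The round was led by Example Ventures.")
for sentence, tags in funding_sentences(article):
    print(tags, sentence)
```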
1. News article fact extraction
● Extract facts
● Match companies and investors to entities in our database
  ○ 30% of extracted articles are entered automatically
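Extraction and entity matching can be sketched as follows. Everything here is hypothetical: the `KNOWN_COMPANIES` table, the patterns, and the similarity cutoff are made up for the example, and fuzzy string matching with `difflib` is a swapped-in technique, not the matching method described in the talk. A name with no match returns `None`, which in a pipeline like this would presumably be routed to manual review.

```python
import difflib
import re

# Hypothetical slice of the company database: lowercase name -> entity id.
KNOWN_COMPANIES = {"acme robotics": 101, "globex": 102}

MULTIPLIERS = {"k": 10**3, "m": 10**6, "b": 10**9}
AMOUNT = re.compile(r"\$\s?(\d+(?:\.\d+)?)\s?([kmb])\b", re.IGNORECASE)
SERIES = re.compile(r"\bseries\s+([a-h])\b", re.IGNORECASE)


def extract_facts(sentence):
    """Pull a dollar amount and a series letter out of one sentence."""
    facts = {}
    amount = AMOUNT.search(sentence)
    if amount:
        value = float(amount.group(1)) * MULTIPLIERS[amount.group(2).lower()]
        facts["amount_usd"] = int(value)
    series = SERIES.search(sentence)
    if series:
        facts["series"] = series.group(1).upper()
    return facts


def match_company(name, cutoff=0.8):
    """Return the id of the closest known company, or None if nothing is close."""
    candidates = difflib.get_close_matches(
        name.lower(), list(KNOWN_COMPANIES), n=1, cutoff=cutoff)
    return KNOWN_COMPANIES[candidates[0]] if candidates else None


print(extract_facts("Acme Robotics raised $12M in a Series A round."))
print(match_company("Acme Robotic"), match_company("Initech"))
```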
2. Funding override system
● Identify reports about the same funding event
● Combine information from multiple reports using the Wongi rules engine
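The two bullets above can be illustrated with a small Python sketch. It does not use Wongi (a Ruby rules engine) and the rules are assumptions chosen for the example: reports count as the same event when company and series match and the dates fall within 14 days, the most frequently reported amount wins, and investor lists are unioned.

```python
from collections import Counter
from datetime import date


def same_event(a, b, window_days=14):
    """Assumed rule: same company and series, reported within the window."""
    return (a["company_id"] == b["company_id"]
            and a["series"] == b["series"]
            and abs((a["date"] - b["date"]).days) <= window_days)


def group_reports(reports):
    """Group reports by comparing each one to the first report of each group."""
    groups = []
    for report in sorted(reports, key=lambda r: r["date"]):
        for group in groups:
            if same_event(group[0], report):
                group.append(report)
                break
        else:
            groups.append([report])
    return groups


def merge(group):
    """Combine one group of reports into a single funding event."""
    amounts = Counter(r["amount_usd"] for r in group if r.get("amount_usd"))
    investors = sorted({name for r in group for name in r.get("investors", [])})
    return {
        "company_id": group[0]["company_id"],
        "series": group[0]["series"],
        "date": min(r["date"] for r in group),
        "amount_usd": amounts.most_common(1)[0][0] if amounts else None,
        "investors": investors,
        "sources": len(group),
    }


# Hypothetical reports from different news sources.
reports = [
    {"company_id": 101, "series": "A", "date": date(2015, 3, 2),
     "amount_usd": 12000000, "investors": ["Example Ventures"]},
    {"company_id": 101, "series": "A", "date": date(2015, 3, 4),
     "amount_usd": 12500000, "investors": ["Example Ventures", "Sample Capital"]},
    {"company_id": 101, "series": "A", "date": date(2015, 3, 3),
     "amount_usd": 12000000, "investors": []},
    {"company_id": 102, "series": "Seed", "date": date(2015, 3, 10),
     "amount_usd": 1000000, "investors": ["Sample Capital"]},
]
events = [merge(group) for group in group_reports(reports)]
for event in events:
    print(event)
```

Keeping each rule as its own small function makes it easier for analysts and developers to review the rules one at a time.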
3. Funding confirmation email campaign
Use CRM and HubSpot to automatically send emails to founders after equity financing.
Where we struggled
Our initial implementation of a funding override system was inefficient. Why?
Because our data analysts and developers were not aligned on functional requirements.
Solution
● Analysts must work closely with developers
  ○ Pre-spec check-ins
  ○ Analysts review dev designs to ensure that the system design addresses the use case.
● Analysts must avoid being prescriptive
● Analysts must understand data mining and machine learning concepts
Where we succeeded
Implementation of news article fact extraction was successful. Why?
Because data analysts and developers worked as service providers to each other.
1. Tighter Analyst + Dev Communication
Tiger teams: 1 ML developer, 1 web/infrastructure developer, 1 data analyst, 1 project lead
Define milestones & hold daily stand-ups.
3. Track II interactions reinforce the symbiotic relationship
● Devs lead a Python learning group
● Data analysts hold seminars on topics like admin tooling and alternative assets