Lenddo's CEO and CTO, Jeff Stewart and Naveen Agnihotri, presented at May's edition of Data Driven NYC, which focused on p2p lending.
Empowering the Emerging Market Middle Class
Big Data is not Big Database
Jeff Stewart - CEO
Naveen Agnihotri, PhD - CTO
“If you look 5 years out, every industry is going to be rethought in a social way.”
-Mark Zuckerberg, 2010
LENDDO TECH FACTS
● Founded in January 2011
● Over 500K members around the world
● Integrated with Facebook, LinkedIn, Google, Yahoo, Twitter
● Services oriented architecture (LAMP)
  ○ Front end (clients) in PHP
  ○ Services in PHP and Python
● Technical team based in NY and PH
● Data science team based in NY
Finance in the Age of Social Networks
Lenddo maintains the world's largest opt-in TrustGraph for trustworthiness and risk management
Lenddo is…
Social
Social sourcing & screening
Peer enforcement
New data sets
Algorithms
Unprecedented processing power
Real-time / ongoing risk management
Targeting, underwriting & collections
Cloud
Rich risk analytic data set
Unprecedented processing power
Global
Mobile
New datasets
24/7 engagement
New cost structures
Why Finance Works Better with Lenddo

DEMAND GENERATION
Traditional
• Negative selection bias
• Costly
With Lenddo
• Digital, fast and potentially viral
• Less expensive
• Social nature causes positive selection bias

UNDERWRITING
Traditional
• Fact verification is time consuming
• Scores incomplete or unavailable
With Lenddo
• Reduced fraud and default
• Big data and powerful algorithms
• Larger addressable market
• Easily automatable

COLLECTIONS
Traditional
• No peer enforcement
• Labor intensive
• Hard to maintain contact
With Lenddo
• Potential for peer enforcement
• Lower cost
• More points of contact
Source: http://www.kpcb.com/insights/2013-internet-trends
ID Verification is easier online
LENDDO SOCIAL DATA
● 100% infrastructure on AWS
● Store social data from all online social networks
● Opt-in social data storage grows about 10 times faster than member data
● Social data currently about 3.5 TB
● Largest table (dataset) is > 2 billion records
GOOD AND BAD BORROWERS
n=1347
CLUSTERS
LOAN SCORE IMPROVEMENT
[Figure: loan score comparison, "No NLP or network" vs. "With NLP and network"]
WORD CLUSTERS
Words associate closely together, and can be commonly associated with good or bad loans.
WORDS AND LOAN QUALITY
% Association with BAD loans
% Association with GOOD loans
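The two axes above suggest a per-word comparison of good and bad loans. The deck does not say how the percentages were computed, so the following is only a minimal sketch under one assumption: for each word, take the share of loans mentioning it that turned out good or bad. The records, words, and the `word_association` helper are all invented for illustration and are not Lenddo data or code.

```python
from collections import Counter

# Hypothetical toy data: (words seen in a member's opted-in text, loan outcome).
# These records are invented for illustration only.
LOANS = [
    ({"family", "work", "school"}, "good"),
    ({"work", "project"}, "good"),
    ({"party", "casino"}, "bad"),
    ({"work", "casino"}, "bad"),
]


def word_association(loans):
    """Return {word: (pct_good, pct_bad)} over the loans that mention the word."""
    good, bad = Counter(), Counter()
    for words, outcome in loans:
        # Count each word once per loan, under the outcome of that loan.
        target = good if outcome == "good" else bad
        target.update(words)
    result = {}
    for word in set(good) | set(bad):
        total = good[word] + bad[word]
        result[word] = (100.0 * good[word] / total, 100.0 * bad[word] / total)
    return result


ASSOCIATION = word_association(LOANS)
```

With real data, a word seen in only a handful of loans would need a minimum-count cutoff before its percentages mean anything.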
SOCIAL DATA STORAGE HISTORY
● Started with MongoDB for social data storage
● As use cases grew, we added indexes
SOCIAL DATA STORAGE
User data
Social data
SOCIAL DATA STORAGE HISTORY
● We moved to larger and larger servers
  ○ At last iteration, used cr1.8xlarge server
  ○ 32 CPUs, 244 GB RAM
  ○ Still couldn’t keep up with index size
● Data acquisition speeds increased
  ○ Provisioned IOPS to the rescue!
● Total cost of social data storage: > $10,000 per month
● And we want to grow faster!
SOCIAL DATA STORAGE HISTORY
● Simple queries (by index)
● Complex queries (by multiple indexes)
● Pull out all data for a member
● Aggregate all data for a member
● Calculate score for a member
● Aggregate all data for all members
● Calculate score for all members
?
REVELATION: 2013
It’s “BIG DATA”, not “BIG DATABASE”
SOCIAL DATA REVAMP - 2013
● Moved all data to Amazon S3
● Data model remains largely unchanged
● Hadoop compatible storage format
  ○ Avro format
  ○ Snappy compressed, chunked
● Created a small ‘cache’ type MongoDB
  ○ stores recent data temporarily
● Using DynamoDB for longer-lived data that needs to be queried all the time
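The storage format is described as Snappy compressed and chunked. Avro container files store records in blocks that are compressed independently, which is what lets Hadoop-style jobs split one file across workers. A rough sketch of that idea using only the Python standard library follows. It is not Avro: JSON lines stand in for Avro records, zlib stands in for Snappy, and the record fields and function names are hypothetical.

```python
import json
import zlib


def write_chunks(records, chunk_size):
    """Pack records into independently compressed chunks.

    JSON lines stand in for Avro records and zlib stands in for Snappy.
    """
    chunks = []
    for start in range(0, len(records), chunk_size):
        batch = records[start:start + chunk_size]
        payload = "\n".join(json.dumps(r, sort_keys=True) for r in batch)
        chunks.append(zlib.compress(payload.encode("utf-8")))
    return chunks


def read_chunk(chunk):
    """Decode one chunk without touching any other chunk."""
    text = zlib.decompress(chunk).decode("utf-8")
    return [json.loads(line) for line in text.split("\n")]


# Hypothetical records; the field names are invented for illustration.
RECORDS = [{"member_id": i % 3, "network": "example", "item": i} for i in range(10)]
CHUNKS = write_chunks(RECORDS, chunk_size=4)
LAST_CHUNK = read_chunk(CHUNKS[2])
```

Because each chunk decodes on its own, a worker can be handed any single chunk, which is the property a batch job over S3 relies on.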
NEW SOCIAL DATA USAGE
● Use the cache for data when it first arrives
  ○ Data is available for quick computations and
● Move data from cache to S3 at the end of the day
● Use EMR over S3 data for all aggregations
● Created an EMR-based map-reduce framework for the data science team
● Standard EMR jobs for common queries:
  ○ All social data for a member
  ○ Score one member
  ○ Score all members
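The three standard jobs listed above share one shape: group archived records by member, then reduce each group. A small in-process sketch of that shape is below. It does not use EMR or S3; the `map_reduce` helper, the sample records, and the `toy_score` rule are all assumptions made up for this example, and the deck does not describe the real scoring logic.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run the mapper over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical archived records; in the deck these would live on S3.
ARCHIVE = [
    {"member_id": "m1", "network": "a", "connections": 10},
    {"member_id": "m1", "network": "b", "connections": 5},
    {"member_id": "m2", "network": "a", "connections": 7},
]


def by_member(record):
    """Mapper: key every record by its member."""
    yield record["member_id"], record


def collect(_member_id, records):
    """Reducer for 'all social data for a member'."""
    return records


def toy_score(_member_id, records):
    """Reducer with a placeholder scoring rule, invented for this sketch only."""
    return sum(r["connections"] for r in records)


ALL_DATA = map_reduce(ARCHIVE, by_member, collect)
SCORES = map_reduce(ARCHIVE, by_member, toy_score)
ONE_SCORE = map_reduce(
    [r for r in ARCHIVE if r["member_id"] == "m1"], by_member, toy_score
)
```

Scoring one member and scoring all members differ only in the input filter, which is one reason a shared framework for common queries is attractive.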
WHAT DID WE GAIN?
● Peace of mind
  ○ No more database maintenance
  ○ No more periodic server upgrades
● Scalability
  ○ Storage and access remain identical for the next 10x growth
● $$$
  ○ New cost: < $3000 per month: 70% less!
  ○ Includes EMR clusters running routine jobs
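The 70% figure can be checked against the two bounds quoted in the deck. The sketch below treats "> $10,000" and "< $3000" as exact values, which is an assumption; since the old cost was above its bound and the new cost below its bound, the real saving would be at least this much.

```python
OLD_MONTHLY_COST = 10_000  # lower bound from the deck: "> $10,000 per month"
NEW_MONTHLY_COST = 3_000   # upper bound from the deck: "< $3000 per month"


def percent_saved(old_cost, new_cost):
    """Percentage reduction from old_cost to new_cost."""
    return 100.0 * (old_cost - new_cost) / old_cost


SAVING = percent_saved(OLD_MONTHLY_COST, NEW_MONTHLY_COST)
MONTHLY_DIFFERENCE = OLD_MONTHLY_COST - NEW_MONTHLY_COST
```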
Thanks!