Using Scalding for Data-Driven Product Development Sasha Ovsankin LinkedIn Presented to Scala By The Bay Aug 9, 2014

Using Scalding for Data Driven Product Development at LinkedIn

DESCRIPTION

My talk at the Scala By The Bay conference: http://www.scalabythebay.org/schedule.html

Page 1: Using Scalding for Data Driven Product Development at LinkedIn

Using Scalding for Data-Driven Product Development

Sasha Ovsankin, LinkedIn

Presented to Scala By The Bay, Aug 9, 2014

Page 4: Using Scalding for Data Driven Product Development at LinkedIn

/summary

Data-Driven Product Development

Scalding = Hadoop + Scala

Page 10: Using Scalding for Data Driven Product Development at LinkedIn

/data-driven

Your Amazing Service

Value Data

Page 11: Using Scalding for Data Driven Product Development at LinkedIn

/data-driven/linkedin

“Online” World

Web Applications

NoSQL Data Stores

ETL

“Offline” World (Hadoop)

HDFS

Hadoop Jobs

Tracking/logging

Analytics

Data Products

Messaging

Message delivery

Databases

Page 12: Using Scalding for Data Driven Product Development at LinkedIn

/linkedin/big-data/links

• “LinkedIn Big Data Ecosystem”
  – http://lnkd.in/big-data-ecosystem
• Grid Operations
  – http://lnkd.in/gridops2013

Page 13: Using Scalding for Data Driven Product Development at LinkedIn

/scalding

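http://github.com/twitter/scalding

• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

    import com.twitter.scalding._

    class WordCountJob(args : Args) extends Job(args) {
      // Read the input file one line at a time
      TextLine( args("input") )
        // Split each line into words on whitespace
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
        // Count the occurrences of each word
        .groupBy('word) { _.size }
        // Write (word, count) pairs as tab-separated values
        .write( Tsv( args("output") ) )
    }

• Succinct and powerful
• High level of abstraction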
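To illustrate the "API similar to Scala collections" point, here is a minimal sketch of the same word count written against in-memory Scala collections. It is not Scalding code and does not run on Hadoop; the object and method names are hypothetical and chosen only for this comparison.

```scala
object WordCountSketch {
  // Same shape as the Scalding job: flatMap lines into words, group, count.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // split each line on whitespace
      .filter(_.nonEmpty)            // drop empty tokens from leading whitespace
      .groupBy(identity)             // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not", "to be"))
    // Print tab-separated pairs, sorted by word for stable output
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The pipeline reads the same way in both versions; the Scalding job differs mainly in where the data lives and how the steps are executed.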

Page 14: Using Scalding for Data Driven Product Development at LinkedIn

/data-driven/problem/scaling

• Problem: Scaling
• Solution
  – Distributed processing
  – High-level description of algorithms
  – Functional programming

Page 15: Using Scalding for Data Driven Product Development at LinkedIn

…/solution/scalding

Page 16: Using Scalding for Data Driven Product Development at LinkedIn

../problem/complexity

• Problem: Complexity
• Solution
  – Consistent way of organizing data
    • Self-describing data formats (Avro)
    • File organization
  – Type safety
  – Modularization
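As a rough illustration of the type safety and modularization points, the sketch below uses plain Scala collections rather than Scalding. The `PageView` record, its fields, and the `totalDurationByPage` function are hypothetical names invented for this example; in practice the record shape would presumably follow an Avro schema.

```scala
object TypedRecordsSketch {
  // Hypothetical record type; the fields are assumptions made for illustration.
  final case class PageView(memberId: Long, page: String, durationMs: Long)

  // A small reusable step that can be tested on its own:
  // total viewing time per page.
  def totalDurationByPage(views: Seq[PageView]): Map[String, Long] =
    views
      .groupBy(_.page)
      .map { case (page, vs) => page -> vs.map(_.durationMs).sum }

  def main(args: Array[String]): Unit = {
    val views = Seq(
      PageView(1L, "/feed", 1200L),
      PageView(2L, "/feed", 800L),
      PageView(1L, "/jobs", 500L)
    )
    // A misspelled field (for example _.duration) would be rejected at compile time.
    println(totalDurationByPage(views))
  }
}
```

Because the step is an ordinary function over typed records, a field-name mistake surfaces at compile time instead of after a long-running job has started.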

Page 17: Using Scalding for Data Driven Product Development at LinkedIn

…/solution/scalding

Page 18: Using Scalding for Data Driven Product Development at LinkedIn

/linkedin/hadoop/practices

• All online data end up in HDFS
  – Avro encoding is standard
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
  • More info at http://lnkd.in/big-data-ecosystem

Page 19: Using Scalding for Data Driven Product Development at LinkedIn

../solution/scala/killer-argument

• Map & reduce -- primitives

    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500
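The REPL line above relies on a `pow` that was in scope in that session and is not shown on the slide. A minimal standalone sketch of the same map-then-reduce computation, assuming plain integer squaring in place of `pow` (the object and method names are illustrative):

```scala
object MapReducePrimitives {
  // Sum of squares of 1..n using only map and reduce.
  // Requires n >= 1, because reduce fails on an empty range.
  def sumOfSquares(n: Int): Int =
    (1 to n)
      .map(x => x * x)   // "map" step: square each element
      .reduce(_ + _)     // "reduce" step: add the squares together

  def main(args: Array[String]): Unit =
    println(sumOfSquares(1000)) // prints 333833500
}
```

The result matches the closed form n(n+1)(2n+1)/6 for n = 1000 and still fits in an `Int`.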

Page 20: Using Scalding for Data Driven Product Development at LinkedIn

/linkedin/scalding/status

• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]

Page 21: Using Scalding for Data Driven Product Development at LinkedIn

/summary

Data-Driven Product Development

Scalding = Hadoop + Scala

Page 22: Using Scalding for Data Driven Product Development at LinkedIn

/linkedin/join-us

• Work on unique and interesting problems
• Be part of great engineering community
• Use latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
  – http://linkedin.com/careers

Questions?