Scalding @ Coursera

A lightning talk I gave about how Coursera decided to use Scalding.

Daniel Chia @DanielJHChia

Software Engineer, Infrastructure

Overview

• Context

• Growing Needs

• Hive / Pig / Scalding

Technical (Online Stack)

• 100% hosted on AWS

• Service-oriented architecture

• Mix of MySQL and Cassandra for persistence

• Scala

Existing Warehouse

[Diagram; surviving label: Streaming]

Future Warehouse Flow

[Diagram; surviving labels: S3, Event Data]

Need 1: Expressive

• Joins

• Aggregations

• Secondary sort

• Multiple map-reduce
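
To make two of these needs concrete, here is a minimal sketch of an aggregation and a secondary sort using plain Scala collections. It is not Scalding code: the `Event` record, its field names, and the sample data are assumptions for illustration, and in-memory lists stand in for distributed data.

```scala
// Hypothetical event record; field names are assumptions for illustration.
case class Event(userId: String, courseId: String, timestamp: Long)

object ExpressiveNeeds {
  // Made-up sample data.
  val events: List[Event] = List(
    Event("u1", "c1", 30L),
    Event("u2", "c1", 10L),
    Event("u1", "c2", 20L),
    Event("u1", "c1", 5L)
  )

  // Aggregation: number of events per course.
  def eventsPerCourse(es: List[Event]): Map[String, Int] =
    es.groupBy(_.courseId).map { case (course, group) => (course, group.size) }

  // Secondary sort: group by user, then order each user's events by timestamp.
  def timeline(es: List[Event]): Map[String, List[Long]] =
    es.groupBy(_.userId).map { case (user, group) =>
      (user, group.sortBy(_.timestamp).map(_.timestamp))
    }
}
```

Scalding's typed API offers similarly shaped operations on grouped pipes (for example `groupBy` followed by `size` or `sortBy`), but the planner decides how they map onto map-reduce steps, so treat this only as a picture of the logic.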

Need 2: Semi-structured Data

• Increased usage of Cassandra

• Events data

{
  "timestamp": 1411359695744,
  "membershipState": "LearnerEnrolled"
}

{
  "typeName": "multipart",
  "definition": {
    "assignmentParts": {
      "id1": {
        "typeName": "plainText",
        "order": 0,
        "definition": { "prompt": "Write a sentence describing what you think about cereal." }
      },
      "id2": {
        "typeName": "richText",
        "order": 1,
        "definition": { "prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal." }
      },
      "id3": {
        "typeName": "url",
        "order": 2,
        "definition": { "prompt": "Post a link to your favorite cereal." }
      },
      "id4": {
        "typeName": "plainText",
        "order": 3,
        "definition": {

Choices

• Hive

• Pig

• Scalding

Hive

• SQL-like language

• Great for simple rollups and aggregations

• Procedural transforms difficult to express

Pig

• Mature

• Procedural

• Pig Latin + Lots of UDFs

Scalding – Pros

• Succinct

• Expressive

• All code in one language

• Re-use online data models

Scalding – Pros

• Easy to test
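
One way to read "easy to test": because job logic is ordinary Scala, it can be pulled into pure functions and checked without a cluster. The sketch below assumes a hypothetical `MembershipEvent` record modeled on the sample event earlier in the deck; the names `EnrollmentLogic`, `isEnrollment`, and `countEnrollments` are invented for illustration.

```scala
object EnrollmentLogic {
  // Hypothetical record; the two fields mirror the sample event in the slides.
  case class MembershipEvent(timestamp: Long, membershipState: String)

  // Pure predicate: no Hadoop or Scalding types involved.
  def isEnrollment(e: MembershipEvent): Boolean =
    e.membershipState == "LearnerEnrolled"

  // Pure aggregation over any in-memory sequence of events.
  def countEnrollments(es: Seq[MembershipEvent]): Int =
    es.count(isEnrollment)
}

object EnrollmentLogicCheck {
  import EnrollmentLogic._

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      MembershipEvent(1411359695744L, "LearnerEnrolled"),
      MembershipEvent(1411359695745L, "SomethingElse") // made-up state for the test
    )
    assert(countEnrollments(sample) == 1)
  }
}
```

Scalding also ships a `JobTest` helper for running a whole job against in-memory sources and sinks; it is not shown here.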

Scalding – Cons

• Have to learn Scala

• Heavier-weight for simple, experimental work

• Many layers of abstraction above MapReduce

Scalding – Example

• User event data

• Want to join with course and topic data

Scalding – Example

val events = TypedTsv … /* load data */ .toTypedPipe

val courses = TypedTsv … .toTypedPipe

val topics = TypedTsv … .toTypedPipe

Scalding – Example

events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */

Scalding – Example

events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .sketch(reducers = 100) /* sketch join, to cope with skewed keys */
  .leftJoin(topics.groupBy(_.topicId))
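
The slide code is abbreviated, so the shape of the data after each left join is easy to miss: the right-hand side becomes optional. The sketch below reproduces the two-step join with plain Scala collections rather than Scalding pipes. The case classes, the `leftJoin` helper, and `enrich` are all hypothetical names written for this illustration, and nothing here deals with skew or distribution.

```scala
// Hypothetical records; field names follow the identifiers used on the slides.
case class Event(userId: String, courseId: String)
case class Course(courseId: String, topicId: String)
case class Topic(topicId: String, name: String)

object JoinSketch {
  // Left join: keep every left row; the right side is None when no key matches.
  def leftJoin[K, L, R](left: Seq[(K, L)], right: Seq[(K, R)]): Seq[(K, (L, Option[R]))] = {
    val rightByKey: Map[K, Seq[R]] =
      right.groupBy(_._1).map { case (k, rows) => (k, rows.map(_._2)) }
    left.flatMap { case (k, l) =>
      rightByKey.get(k) match {
        case Some(rs) => rs.map(r => (k, (l, Option(r))))
        case None     => Seq((k, (l, Option.empty[R])))
      }
    }
  }

  // Events joined to courses by courseId, then to topics by the course's topicId.
  def enrich(
      events: Seq[Event],
      courses: Seq[Course],
      topics: Seq[Topic]
  ): Seq[(Event, Option[Course], Option[Topic])] = {
    val withCourse =
      leftJoin(events.map(e => (e.courseId, e)), courses.map(c => (c.courseId, c)))
    // Re-key by topicId; an event with no matching course gets the key None.
    val byTopic = withCourse.map { case (_, (e, c)) => (c.map(_.topicId), (e, c)) }
    val topicsByKey = topics.map(t => (Option(t.topicId), t))
    leftJoin(byTopic, topicsByKey).map { case (_, ((e, c), t)) => (e, c, t) }
  }
}
```

In Scalding's typed API a left join likewise yields the right value wrapped in `Option`, so a real job would need to unwrap the course before grouping by `topicId`; the slide's `_._2.topicId` appears to be shorthand for that step.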

Scalding – Wish-list

• More documentation

• Scala 2.11 soon, please?

Questions?

We’re hiring! coursera.org/jobs