©2014 LinkedIn Corporation. All Rights Reserved.
Gobblin’ Big Data with Ease
Lin Qiao
Data Analytics Infra @ LinkedIn
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
Challenges @ LinkedIn
• Large variety of data sources
• Multi-paradigm: streaming data, batch data
• Different types of data: facts, dimensions, logs, snapshots, increments, changelog
• Operational complexity of multiple pipelines
• Data quality
• Data availability and predictability
• Engineering cost
Open source solutions
• Sqoop
• Flume
• Morphline
• RDBMS vendor-specific connectors
• Aegisthus
• Logstash
• Camus
Goals
• Unified and Structured Data Ingestion Flow
  – RDBMS -> Hadoop
  – Event Streams -> Hadoop
• Higher level abstractions
  – Facts, Dimensions
  – Snapshots, increments, changelog
• ELT oriented
  – Minimize transformation in the ingest pipeline
Gobblin Usage @ LinkedIn
• Business Analytics
  – Source data for sales analysis, product sentiment analysis, etc.
• Engineering
  – Source data for issue tracking, monitoring, product release, security compliance, A/B testing
• Consumer product
  – Source data for acquisition integration
  – Performance analysis for email campaigns, ad campaigns, etc.
Key Features
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
Scalable and Robust Framework
• Scalable: Jobs are partitioned into tasks that run concurrently.
• Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Fault Tolerant: The framework gracefully deals with machine and job failures.
• Quality Assurance: Quality checking is baked in throughout the flow.
Unified computation paradigm
• Common execution flow: shared between batch ingestion and streaming ingestion pipelines.
• Shared infra components: shared job state management, job metrics store, and metadata management.
Turn Key Solution
• Built-in Exchange Protocols: Existing adapters can easily be reused for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP).
• Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQL Server, Oracle, Salesforce, HDFS, filer, and internal dropbox.
• Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets.
• Policy-driven flow execution & tuning: Flow owners just need to specify pre-defined policies for handling job failure, degree of parallelism, what data to publish, etc.
Customize Your Own Ingestion Pipeline
• Extendable Operators: Operators for extraction, conversion, quality checking, data persistence, etc. can be implemented or extended against a common API.
• Configurable Operator Flow: Configuration exposes multiple plugin points for adding customized logic and code.
Computation Model
• Gobblin standalone
  – Single process, multi-threading
  – Testing, small data, sampling
• Gobblin on Map/Reduce
  – Large datasets, horizontally scalable
• Gobblin on Yarn
  – Better resource utilization
  – More scheduling flexibility
Scalable Ingestion Flow
[Diagram: a Source produces WorkUnits; each WorkUnit runs as a Task that chains Extractor -> Converter -> Quality Checker -> Writer; a Data Publisher handles the output of all Tasks]
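The shape of this flow can be sketched in a few lines of Python. This is an illustration only, not Gobblin's API: `make_work_units`, `run_task`, `publish`, and `run_job` are hypothetical names, the transforms are toys, and a thread pool stands in for whatever actually runs the tasks.

```python
from concurrent.futures import ThreadPoolExecutor

def make_work_units(records, size):
    """Source: split the input into fixed-size work units (hypothetical)."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def run_task(work_unit):
    """Task: extract -> convert -> quality check -> write, for one work unit."""
    extracted = list(work_unit)                   # Extractor
    converted = [r.upper() for r in extracted]    # Converter (toy transform)
    checked = [r for r in converted if r]         # Quality Checker (drop empties)
    return checked                                # Writer output, kept in memory

def publish(task_outputs):
    """Data Publisher: gather the output of every task."""
    return [r for out in task_outputs for r in out]

def run_job(records, unit_size=2, workers=3):
    units = make_work_units(records, unit_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(run_task, units))  # map preserves unit order
    return publish(outputs)
```

Because each task only sees its own work unit, adding workers raises throughput without changing the per-task logic.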
Sources
• Determines how to partition work
  – Partitioning algorithm can leverage source sharding
  – Group partitions intelligently for performance
• Creates work-units to be scheduled
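One simple way to partition work is to split a range of values (for example, a watermark range) into contiguous work units. The sketch below assumes an integer range; `WorkUnit` and `partition` are hypothetical names chosen for illustration, not Gobblin classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkUnit:
    low: int    # inclusive lower bound of the range to pull
    high: int   # exclusive upper bound

def partition(low, high, max_units):
    """Split [low, high) into at most max_units contiguous work units."""
    if high <= low or max_units < 1:
        return []
    step = -(-(high - low) // max_units)   # ceiling division
    return [WorkUnit(start, min(start + step, high))
            for start in range(low, high, step)]
```

For instance, `partition(0, 10, 3)` yields the ranges [0, 4), [4, 8), and [8, 10).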
Job Management
• Job execution states
  – Watermark
  – Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
[Diagram: Job run 1, Job run 2, and Job run 3 connected through a State Store]
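To make the role of the state store concrete, here is a minimal sketch in which each run reads the previous run's watermark and pulls only newer rows. The file-backed `StateStore` class and `run_incremental` are hypothetical stand-ins; the real store's format and interface are not described on the slide.

```python
import json
import os
import tempfile

class StateStore:
    """Hypothetical file-backed store keyed by job name."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, job_name):
        return os.path.join(self.directory, job_name + ".json")

    def load(self, job_name):
        try:
            with open(self._path(job_name)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"watermark": 0}        # first run starts from the beginning

    def save(self, job_name, state):
        with open(self._path(job_name), "w") as f:
            json.dump(state, f)

def run_incremental(store, job_name, source_rows):
    """Pull only rows newer than the previous run's watermark."""
    state = store.load(job_name)
    new_rows = [r for r in source_rows if r > state["watermark"]]
    if new_rows:
        state["watermark"] = max(new_rows)
    store.save(job_name, state)
    return new_rows

store = StateStore(tempfile.mkdtemp())
first = run_incremental(store, "demo", [1, 2, 3])          # pulls everything
second = run_incremental(store, "demo", [1, 2, 3, 4, 5])   # pulls only 4 and 5
```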
Gobblin Operator Flow
• Extract Schema
• Convert Schema
• Extract Record
• Convert Record
• Check Record Data Quality
• Write Record
• Check Task Data Quality
• Commit Task Data
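The ordering of these operators within one task can be sketched as follows. This assumes schema steps run once per task, the record steps run once per record, and the task-level check is a simple reject-ratio threshold; `run_operator_flow` and `max_reject_ratio` are hypothetical names, not part of Gobblin.

```python
def run_operator_flow(source_schema, source_records, max_reject_ratio=0.5):
    # Extract the schema once, then convert it (here: lowercase field names).
    schema = [name.lower() for name in source_schema]
    written, rejected = [], 0
    for raw in source_records:                       # extract record
        record = dict(zip(schema, raw))              # convert record
        if any(v is None for v in record.values()):  # record-level quality check
            rejected += 1
            continue
        written.append(record)                       # write record (staged)
    # Task-level quality check: too many rejected records fails the task.
    task_ok = rejected <= max_reject_ratio * len(source_records)
    # Commit the staged records only if the task-level check passed.
    return written if task_ok else []
```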
Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks the high watermark
• Tracks extraction metrics
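A toy extractor with these responsibilities might look like the sketch below. `ListExtractor` is a hypothetical class over an in-memory list of `(timestamp, value)` rows, and a plain Python iterator stands in for the ResultSet iterator.

```python
class ListExtractor:
    """Hypothetical extractor over an in-memory list of (timestamp, value) rows."""

    def __init__(self, rows, low_watermark):
        self.rows = rows
        self.low_watermark = low_watermark
        self.high_watermark = low_watermark   # highest timestamp pulled so far
        self.records_read = 0                 # simple extraction metric

    def get_schema(self):
        return ("timestamp", "value")

    def __iter__(self):
        for ts, value in self.rows:
            if ts <= self.low_watermark:
                continue                      # already pulled by an earlier run
            self.high_watermark = max(self.high_watermark, ts)
            self.records_read += 1
            yield ts, value
```

After a full iteration, `high_watermark` is the value a later run would use as its low watermark.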
Converters
• Allow for schema and data transformation
  – Filtering
  – Projection
  – Type conversion
  – Structural change
• Composable: can specify a list of converters to be applied in the given order
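Composability can be illustrated with converters modeled as plain functions applied in list order. The names `drop_nulls`, `project`, `cast`, and `apply_converters` are hypothetical, and the convention that a converter returns `None` to drop a record is an assumption made for this sketch.

```python
def drop_nulls(record):          # filtering: returning None drops the record
    return None if None in record.values() else record

def project(fields):
    def convert(record):         # projection: keep only the named fields
        return {k: record[k] for k in fields}
    return convert

def cast(field, to_type):
    def convert(record):         # type conversion on one field
        return {**record, field: to_type(record[field])}
    return convert

def apply_converters(converters, record):
    """Run converters in the given order; stop early if one drops the record."""
    for convert in converters:
        record = convert(record)
        if record is None:
            return None
    return record
```

Order matters: here `drop_nulls` has to run before `cast`, otherwise `int(None)` would raise.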
Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per-record, per-task, or per-job basis
• Can specify a list of quality checkers to be applied
  – Schema compatibility
  – Audit check
  – Sensitive fields
  – Unique key
• Policy driven
  – FAIL: if the check fails, so does the job
  – OPTIONAL: if the check fails, the job continues
  – ERR_FILE: the offending row is written to an error file
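The three policies can be sketched for record-level checks as below. `run_checks` is a hypothetical helper; a list stands in for the error file, and the choice to exclude an ERR_FILE row from the normal output is an assumption, since the slide only says the row is written to an error file.

```python
def run_checks(rows, checks):
    """checks: list of (predicate, policy) pairs, policy being FAIL, OPTIONAL, or ERR_FILE."""
    passed, error_file = [], []
    for row in rows:
        keep = True
        for predicate, policy in checks:
            if predicate(row):
                continue
            if policy == "FAIL":
                raise ValueError(f"quality check failed for {row!r}")
            if policy == "ERR_FILE":
                error_file.append(row)    # stand-in for writing to an error file
                keep = False              # assumption: row is kept out of the output
                break
            # OPTIONAL: the failure is tolerated and processing continues
        if keep:
            passed.append(row)
    return passed, error_file
```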
Writers
• Writing data in Avro format onto HDFS
  – One writer per task
• Flexibility
  – Configurable compression codec (Deflate, Snappy)
  – Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
Publishers
• Determines job success based on policy
  – COMMIT_ON_FULL_SUCCESS
  – COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 into a tmp dir; the files are then moved to the final dir]
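The two commit policies can be sketched with local directories standing in for HDFS. `publish` is a hypothetical function, and the reading that partial success commits only the files of successful tasks is an assumption, not something the slide spells out.

```python
import os
import shutil
import tempfile

def publish(task_results, tmp_dir, final_dir, policy):
    """task_results maps a file name in tmp_dir to whether its task succeeded."""
    if policy not in ("COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS"):
        raise ValueError(f"unknown policy: {policy}")
    succeeded = [name for name, ok in task_results.items() if ok]
    all_ok = len(succeeded) == len(task_results)
    if policy == "COMMIT_ON_FULL_SUCCESS" and not all_ok:
        return []                                  # nothing is published
    os.makedirs(final_dir, exist_ok=True)
    for name in succeeded:                         # assumption: failed tasks stay in tmp
        shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
    return sorted(succeeded)

tmp = tempfile.mkdtemp()
final = os.path.join(tempfile.mkdtemp(), "final")
for name in ("file1", "file2", "file3"):
    open(os.path.join(tmp, name), "w").close()
results = {"file1": True, "file2": False, "file3": True}
strict = publish(results, tmp, final, "COMMIT_ON_FULL_SUCCESS")      # one task failed
lenient = publish(results, tmp, final, "COMMIT_ON_PARTIAL_SUCCESS")  # commits the rest
```

Writing to a tmp dir first means readers of the final dir never see half-written task output.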
Gobblin Compaction
• Dimensions:
  – Initial full dump followed by incremental extracts in Gobblin
  – Maintain a consistent snapshot by doing regularly scheduled compaction
• Facts:
  – Merge small files
[Diagram: Ingestion -> HDFS -> Compaction]
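For dimension data, compaction amounts to folding incremental extracts into the last full snapshot. The sketch below assumes every row carries a unique key and that increments contain only inserts and updates (no deletes); `compact` is a hypothetical name and the rows are plain dictionaries.

```python
def compact(snapshot, increments, key="id"):
    """Apply increments in order on top of a snapshot; the latest row per key wins."""
    merged = {row[key]: row for row in snapshot}
    for batch in increments:
        for row in batch:
            merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]
```

Running this on a schedule keeps the snapshot current while bounding how many increment files a reader would otherwise have to merge.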
Gobblin in Production
Production Instances
• Salesforce
• Responsys
• RightNow
• Timeforce
• Slideshare
• Newsle
• A/B testing
• LinkedIn JIRA
• Data retention
Data Volume
• > 350 datasets
• ~ 60 TB per day
Lessons Learned
• Data quality still needs a lot more work
• The small data problem is not small
• Performance optimization opportunities
• Operational traits
Gobblin Roadmap
• Gobblin on Yarn
• Streaming Sources
• Gobblin Workbench with ingestion DSL
• Data Profiling for richer quality checking
• Open source in Q4’14