The Big Data Ecosystem at LinkedIn

Jay Kreps

Me
  Background in data, not infrastructure
  LinkedIn's SNA team
  Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)

This Talk
  We are in a renaissance of data infrastructure.
  How do all these pieces fit together?

Why the current obsession with Big Data?
  The goal of modern data infrastructure is to make many small computers act like one big one.

The Old Picture

The New Picture

Polyglot persistence?

Infrastructure Icebergs
  90k lines of tooling and monitoring, 30k lines of logic
  Dedicated engineers, operations
  Training
  First three nines come from operations

This is (still) a very immature space.
  Which systems should we have?
  Good news for users, bad news for distributed systems nerds
  Filesystems take a decade to mature. Don't expect this will be easier.

Infrastructure is sculpted by applications and constraints
  Projects are defined by trade-offs
  Constraints
    Hardware
      Jeff Dean: Numbers everyone should know
      David Patterson: Latency lags bandwidth
    $$$
    Other
      Path dependence
      Complexity
      Resources

Applications
  Common categories of non-CRUD
    Recommendations & Matching
    Graphs
    Search
    Data Normalization
    News feed
    Analysis & Monitoring

Social Graph

Search

Recommendations: People

Recommendations: Jobs

Recommendations: Newsfeed

Data Normalization

Analytics

Infrastructure
  Search
    Lucene
    Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
  Social Graph
  Storage
    Oracle
    Voldemort
    Espresso
  Streams
    Databus
    Kafka
  Offline
    Hadoop & friends (Pig, Hive, Azkaban, etc.)

Three Major Paradigms
  Request/Response
    Search
    Social Graph
    Storage
  Streams
    Kafka
  Batch
    Hadoop
  Most features are multi-paradigm

Request/Response
  Search
  Social Graph
  Storage
    Voldemort
    Espresso

Request/Response Patterns
  Broker, scatter-gather
  Storage systems: only Partitioning strategy
  Latency oriented
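
The broker and scatter-gather pattern named on this slide can be sketched roughly as follows. This is a minimal in-memory illustration, not how Voldemort, Espresso, or the search stack is actually built: the `Partition` and `Broker` classes, the hash-modulo partitioning rule, and the sample data are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Hypothetical storage node holding one shard of the data in memory."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)

    def search(self, predicate):
        # Each partition scans only its own shard.
        return [(k, v) for k, v in self.rows.items() if predicate(v)]


class Broker:
    """Hypothetical broker: routes key lookups, fans out searches."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _owner(self, key):
        # Assumed partitioning strategy: stable hash of the key, modulo
        # the number of partitions.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._owner(key).put(key, value)

    def get(self, key):
        # A key lookup touches exactly one partition.
        return self._owner(key).get(key)

    def search(self, predicate):
        # Scatter the query to every partition in parallel, then gather
        # the partial results and merge them into one sorted list.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(predicate), self.partitions))
        return sorted(hit for partial in partials for hit in partial)


broker = Broker(num_partitions=4)
for name, title in [("ann", "engineer"), ("bob", "designer"), ("cat", "engineer")]:
    broker.put(name, title)

engineers = broker.search(lambda title: title == "engineer")
```

In this sketch a key lookup goes to a single partition, while a search pays the latency of the slowest partition, which is one reason request/response systems tend to be latency oriented.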

Batch: Hadoop
  Uses
    Ad hoc
    Production batch
  Ecosystem
    Hive, Pig
    Azkaban (workflow)
    Avro data
    Data in: Kafka
    Data out: Voldemort, Kafka

Why do batch if you have real-time?
  Batch advantages
    Safety
    Easy
    Throughput
    Simplicity
    Economics
  Tricky bit: engineering the data cycle

Why do streaming?
  You have to glue all these systems together
  Throughput as good as batch
  Latency much better
  Metaphor more natural for low latency than Hadoop
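
One way to picture streams as the glue between systems is an append-only log that several downstream consumers read at their own pace. The sketch below is an in-memory toy under that assumption. It is not Kafka's API: the `Log` and `Consumer` classes, the consumer names, and the sample events are all hypothetical.

```python
class Log:
    """Hypothetical append-only log; records are addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset, max_records=100):
        # Return up to max_records records starting at the given offset.
        return self._records[offset:offset + max_records]


class Consumer:
    """Hypothetical consumer that tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        # Fetch everything written since the last poll and advance.
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = Log()
log.append({"event": "page_view", "member": 1})
log.append({"event": "search", "member": 2})

# Two independent downstream systems read the same stream.
hadoop_loader = Consumer(log)
search_indexer = Consumer(log)

first_batch = hadoop_loader.poll()      # the two events written so far
log.append({"event": "page_view", "member": 3})
second_batch = hadoop_loader.poll()     # only the newly appended event
indexer_batch = search_indexer.poll()   # all three events, read later
```

Because each consumer keeps its own offset, a slow batch loader and a low-latency indexer can share one feed without coordinating with each other.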

What makes successful infrastructure systems?
  Operability and Operations
  Monitoring
  Simplicity
  Documentation
  Broad adoption
  Lazy users
  Open source

Open Source
  Data > Infrastructure
  Open source creates better code
    even with few outside contributors
  Commercial infrastructure not interesting

Open Source Projects
  We made
    Voldemort: Key/Value storage
    Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
    Kafka: Persistent, distributed data streams
    Norbert: Cluster aware RPC, load balancing, and group membership
    And others
  We stole
    Hadoop, Pig, Hive
    Lucene
    Netty, Jetty
    Zookeeper
    Avro
    Apache Traffic Server

The End
  [email protected]
  http://www.linkedin.com/in/jaykreps
  http://twitter.com/jaykreps
  http://sna-projects.com
