The Big Data Ecosystem at LinkedIn
Jay Kreps
Me
  Background in data, not infrastructure
  LinkedIn's SNA team
  Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)
This Talk
  We are in a renaissance of data infrastructure.
  How do all these pieces fit together?
Why the current obsession with Big Data?
  The goal of modern data infrastructure is to make many small computers act like one big one.
The Old Picture
The New Picture
Polyglot persistence?
Infrastructure Icebergs
  90k lines of tooling and monitoring, 30k lines of logic
  Dedicated engineers, operations
  Training
  First three nines come from operations
This is (still) a very immature space.
  Which systems should we have?
  Good news for users, bad news for distributed systems nerds
  Filesystems take a decade to mature. Don't expect this will be easier.
Infrastructure is sculpted by applications and constraints
  Projects are defined by trade-offs
Constraints
  Hardware
    Jeff Dean: Numbers everyone should know
    David Patterson: Latency lags bandwidth
  $$$
  Other
    Path dependence
    Complexity
    Resources
Applications
  Common categories of non-CRUD
    Recommendations & Matching
    Graphs
    Search
    Data Normalization
    News feed
    Analysis & Monitoring
Social Graph
Search
Recommendations: People
Recommendations: Jobs
Recommendations: Newsfeed
Data Normalization
Analytics
Infrastructure
  Search
    Lucene
    Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
  Social Graph
  Storage
    Oracle
    Voldemort
    Espresso
  Streams
    Databus
    Kafka
  Offline
    Hadoop & friends (Pig, Hive, Azkaban, etc.)
Three Major Paradigms
  Request/Response
    Search
    Social Graph
    Storage
  Streams
    Kafka
  Batch
    Hadoop
  Most features are multi-paradigm
Request/Response
  Search
  Social Graph
  Storage
    Voldemort
    Espresso
Request/Response Patterns
  Broker, scatter-gather
  Storage systems: only Partitioning strategy
  Latency oriented
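The broker and scatter-gather pattern named on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any LinkedIn system: the `Broker` and `Partition` classes, the `partition_for` helper, the CRC32 hash, and the four-partition cluster size are all hypothetical names and choices made for the example. A key lookup is routed to the single partition that owns the key, while a search is sent to every partition and the partial results are merged.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # hypothetical cluster size, chosen for the example


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash partitioning: a given key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Partition:
    """Stand-in for one node that holds a slice of the data."""

    def __init__(self):
        self.docs = {}

    def put(self, key, text):
        self.docs[key] = text

    def get(self, key):
        return self.docs.get(key)

    def search(self, term):
        # Return the keys of all local documents that contain the term.
        return [key for key, text in self.docs.items() if term in text]


class Broker:
    """Hypothetical broker that routes requests to partitions."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _owner(self, key):
        return self.partitions[partition_for(key, len(self.partitions))]

    def put(self, key, text):
        self._owner(key).put(key, text)

    def get(self, key):
        # Key lookup: exactly one partition is contacted.
        return self._owner(key).get(key)

    def search(self, term):
        # Scatter the query to every partition in parallel, then gather
        # and merge the partial results.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(term), self.partitions))
        return sorted(key for partial in partials for key in partial)


broker = Broker()
broker.put("member:1", "hadoop pig hive")
broker.put("member:2", "kafka streams")
broker.put("member:3", "kafka voldemort")
print(broker.get("member:2"))   # kafka streams
print(broker.search("kafka"))   # ['member:2', 'member:3']
```

In this sketch the latency of a scatter-gather request is set by the slowest partition, which is one reason such systems are tuned for latency rather than throughput.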
Batch: Hadoop
  Uses
    Ad hoc
    Production batch
  Ecosystem
    Hive, Pig
    Azkaban (workflow)
    Avro data
  Data in: Kafka
  Data out: Voldemort, Kafka
Why do batch if you have real-time?
  Batch advantages
    Safety
    Easy
    Throughput
    Simplicity
    Economics
  Tricky bit: engineering the data cycle
Why do streaming?
  You have to glue all these systems together
  Throughput as good as batch
  Latency much better
  Metaphor more natural for low latency than Hadoop
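One way to picture a stream acting as glue between systems is a persistent, append-only log that each downstream system reads at its own pace. The sketch below is an in-memory toy under that assumption. It is not Kafka's API: the `Log` and `Consumer` classes and their methods are hypothetical names introduced for illustration. A low-latency consumer polls after every write, while a batch-style consumer reads everything later from the same log.

```python
class Log:
    """Append-only sequence of records, addressed by integer offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store the record and return the offset it was written at.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        # Return up to max_records records starting at the given offset.
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads a log sequentially and tracks its own position."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only past what was actually read
        return batch


log = Log()
realtime = Consumer(log)   # e.g. a feed that wants low latency
batch = Consumer(log)      # e.g. a periodic load into an offline system

log.append("profile-view:1")
print(realtime.poll())     # ['profile-view:1']
log.append("profile-view:2")
print(realtime.poll())     # ['profile-view:2']
print(batch.poll())        # ['profile-view:1', 'profile-view:2']
```

Because each consumer keeps its own offset, adding another downstream system does not change the producer or the other consumers.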
What makes successful infrastructure systems?
  Operability and Operations
  Monitoring
  Simplicity
  Documentation
  Broad adoption
  Lazy users
  Open source
Open Source
  Data > Infrastructure
  Open source creates better code
    even with few outside contributors
  Commercial infrastructure not interesting
Open Source Projects
  We made
    Voldemort: Key/Value storage
    Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
    Kafka: Persistent, distributed data streams
    Norbert: Cluster aware RPC, load balancing, and group membership
    And others
  We stole
    Hadoop, Pig, Hive
    Lucene
    Netty, Jetty
    Zookeeper
    Avro
    Apache Traffic Server
The [email protected]://www.linkedin.com/in/jaykreps
http://twitter.com/jaykreps
http://sna-projects.com