Introduction to Data Engineering (with Scala)
John Nestor 47 Degrees
www.47deg.com
June 27, 2016, Galvanize
47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts
Introduction
Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists
Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• Distributed systems means no single main memory
• Must pass data across servers
• Large number of distributed components means failure is common
• Dealing with failure must be part of the fundamental architecture
Fallacies of Distributed Computing
• Peter Deutsch: https://blogs.oracle.com/jag/resource/Fallacies.html
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation
Data Engineering Requirements
Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are increasing
• Cloud computing permits easy expansion based on instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to) linearly with the number of servers
Availability
• Systems are increasingly expected to be available 24/7 with no downtime
• Any server can fail, others must be able to take over
• No downtime for maintenance. Software upgrades occur without shutting system down.
• Must avoid availability-killing features such as 2-phase commit
• SLAs are stated as a number of nines
• The best most achieve is 3 nines (8.8 hours of downtime per year)
• Most strive for 6 nines (about 30 seconds per year)
• AWS S3 claims 9 nines (about 32 msec per year)
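The nines above translate into allowed downtime by simple arithmetic: n nines of availability permit a fraction 10⁻ⁿ of the year down. A small Scala helper (the object name is illustrative) makes the conversion explicit:

```scala
// Yearly downtime allowed by an availability of n nines.
object Nines {
  private val msPerYear = 365.0 * 24 * 60 * 60 * 1000

  // n nines = (1 - 10^-n) availability, so 10^-n of the year may be down.
  def downtimeMillisPerYear(nines: Int): Double =
    msPerYear * math.pow(10, -nines)
}
```

For example, 3 nines allow about 8.76 hours per year, 6 nines about 31.5 seconds, and 9 nines about 31.5 milliseconds.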
Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 will lose at most one object out of 32K objects every 10 million years
Latency and Bandwidth
• Latency - msec to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• Speed of light is still the upper bound
• Bandwidth - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)
Data Engineering Design Patterns
Immutable Data
• Concurrent access to mutable data requires synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in code that is easier to understand and test
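A minimal Scala sketch of the point above: immutable case classes can be shared freely across threads, and an "update" produces a new value rather than mutating in place (the names here are illustrative):

```scala
// Immutable record: "updating" returns a new value; the original is unchanged.
case class User(name: String, email: String)

object ImmutableDemo {
  // Changing an email never mutates the input, so no synchronization is
  // needed even when many threads share the same User value.
  def withEmail(u: User, email: String): User = u.copy(email = email)
}
```

Immutable collections compose the same way: `0 :: xs` builds a new list without touching `xs`.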
Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
Messaging (2 of 2)
• Idempotence
• Multiple sends have same effect
• set X to 3, NOT add 2 to X
• Attach GUID, destination must handle
• In order delivery
• Waiting for an ack before sending next increases latency
• Attach sequence number, destination must handle
• Batching multiple messages together can help
• Design so order does not matter
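The GUID-based idempotent handling described above can be sketched in plain Scala. This is an illustrative in-memory version; a real destination would persist the set of seen IDs:

```scala
// At-least-once delivery made effectively exactly-once:
// the sender may resend, the receiver drops duplicates by message ID.
final case class Msg(id: String, body: String)

final class DedupReceiver {
  private var seen = Set.empty[String]          // IDs already processed
  private var processed = Vector.empty[String]  // applied effects, for inspection

  // Returns true if the message was applied, false if it was a duplicate.
  def receive(m: Msg): Boolean =
    if (seen.contains(m.id)) false
    else {
      seen += m.id
      processed :+= m.body
      true
    }

  def log: Vector[String] = processed
}
```

A resend of the same GUID is simply ignored, so the sender can retry freely after a missing ack.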
Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is “the anti-availability protocol” (Helland)
• For very large, highly available systems, AP is the only possible choice
Persistent Data (2 of 3)
• Detecting conflicts with Vector clocks
• Each server has own time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
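A standard introductory CRDT is the grow-only counter (G-Counter): each server increments its own slot, and merge takes the element-wise maximum, which is commutative, associative, and idempotent. A sketch, not a production implementation:

```scala
// G-Counter CRDT: one slot per server; merge = element-wise max.
final case class GCounter(slots: Map[String, Long]) {
  def increment(server: String): GCounter =
    GCounter(slots.updated(server, slots.getOrElse(server, 0L) + 1))

  def value: Long = slots.values.sum

  // Commutative, associative, idempotent: replicas converge
  // no matter the order or repetition of merges.
  def merge(other: GCounter): GCounter =
    GCounter((slots.keySet ++ other.slots.keySet).map { k =>
      k -> math.max(slots.getOrElse(k, 0L), other.slots.getOrElse(k, 0L))
    }.toMap)
}
```

Because each server only ever increments its own slot, taking the max per slot never loses an update, which is exactly the "keep both, then merge" resolution generalized.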
Persistent Data (3 of 3)
• Log based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
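The "database is a cache of some point in the log" idea above can be sketched as a fold over an append-only sequence of immutable events (the event names are illustrative):

```scala
// Append-only log of immutable steps; current state is derived by replay.
sealed trait Event
final case class Put(key: String, value: String) extends Event
final case class Delete(key: String) extends Event

object LogStore {
  // Rebuild the "database" (a key-value map) from any prefix of the log.
  def replay(log: Seq[Event]): Map[String, String] =
    log.foldLeft(Map.empty[String, String]) {
      case (db, Put(k, v)) => db.updated(k, v)
      case (db, Delete(k)) => db - k
    }
}
```

Replaying a shorter prefix yields the database as of that earlier point, which is why the database can be deleted and recreated from the log at any time.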
Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error-prone and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers.
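As a small illustration of lock-free composition, Scala's Futures run independent pieces of work concurrently and combine their results without any shared mutable state (a sketch using the global execution context):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  // Split the work, run the halves concurrently, combine the results.
  // No shared mutable state, so no locks or monitors are needed.
  def sum(xs: Vector[Int]): Int = {
    val (left, right) = xs.splitAt(xs.length / 2)
    val fl = Future(left.sum)
    val fr = Future(right.sum)
    val combined = for (a <- fl; b <- fr) yield a + b
    Await.result(combined, 10.seconds)
  }
}
```

The same for-comprehension style scales from two futures on one machine to fault-tolerant pipelines across servers when combined with actors.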
Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance tuning is getting harder
• We now produce massive amounts of logs and monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions
Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for upgrades to
• Application code
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of errors in any single deployment
• Requires comprehensive automation for testing and deployment
• But errors still do occur
• Although we have good methods for testing individual components, integration testing is still hard and error prone.
• Some approaches
• Roll back
• A-B testing
• Database checkpoints
Recommended Data Engineering Tools and Systems
Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)
Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if there is a commercial company providing support
Programming Language (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static catches more errors at compile-time
• Static are easier to understand and maintain
• Static requires more work writing
• Garbage collection. Safety versus performance
Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let’s avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance, productivity and reliability
• Programming languages shape the way we think
Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiles to JVM bytecode, then JIT-compiled at run time
• Statically typed, with the concise feel of an untyped language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to understand, easy to get correct, and fault-tolerant distributed computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better cooperation)
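A flavor of the bullets above in a few lines: typed, immutable records processed by pure functions, readable to engineers and scientists alike (a toy pipeline; the field names are made up, and `toDoubleOption` assumes Scala 2.13+):

```scala
// A tiny, typed, functional "pipeline": parse, validate, aggregate.
final case class Reading(sensor: String, celsius: Double)

object Pipeline {
  // Pure parsing: bad input yields None instead of throwing.
  def parse(line: String): Option[Reading] = line.split(",") match {
    case Array(s, t) => t.toDoubleOption.map(Reading(s, _))
    case _           => None
  }

  def averageBySensor(lines: Seq[String]): Map[String, Double] =
    lines.flatMap(parse)
      .groupBy(_.sensor)
      .map { case (s, rs) => s -> rs.map(_.celsius).sum / rs.size }
}
```

Malformed lines are silently dropped by the `Option`, and the same `groupBy`/`map` shape carries over almost verbatim to Spark datasets.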
Messaging
• Kafka (written in Scala)
• Reliable buffer between producers and consumers
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming
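The replayable-buffer idea behind Kafka can be sketched in a few lines of Scala. This is not Kafka's actual API, just an in-memory illustration of a topic where each consumer tracks its own offset and can re-read from any point:

```scala
// In-memory sketch of a replayable topic: producers append,
// each consumer keeps its own offset and can re-read from any point.
final class Topic[A] {
  private var log = Vector.empty[A]

  def append(a: A): Unit = synchronized { log :+= a }

  // Read everything from `offset` on; replay is just re-reading old offsets.
  def read(offset: Int): Vector[A] = synchronized { log.drop(offset) }

  def size: Int = synchronized { log.length }
}
```

Because the log is append-only and consumers choose their own offsets, a slow or restarted consumer never loses data; it simply resumes (or replays) from wherever it needs.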
Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph
Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming (RDDs)
• Dataset transforms: an experiment to combine functional programming with support for query optimization
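Spark itself needs a cluster runtime, but the functional style of its RDD API mirrors plain Scala collections; this word-count sketch uses the same map/flatMap/reduce shape as the classic RDD example, with `Seq` standing in for an RDD so it runs without a cluster:

```scala
// Word count in the functional style the RDD API popularized;
// a plain Seq stands in for an RDD so the example runs locally.
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))      // flatMap: lines -> words
      .filter(_.nonEmpty)
      .map(w => (w.toLowerCase, 1))  // map: word -> (word, 1)
      .groupBy(_._1)                 // groupBy plays the role of the shuffle
      .map { case (w, ps) => w -> ps.map(_._2).sum }
}
```

The same chain, written against an RDD or Dataset, is what Spark distributes across servers; in the Dataset case the query optimizer can also rewrite it.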
Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/Rkt, Etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft
Final Thoughts
Final Thoughts
• Scala is the best choice for both data engineers and data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow requiring better fault tolerance and better automation
• When data engineers and data scientists work closely together both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work well for both groups
Questions