Introduction to Data Engineering (with Scala)
John Nestor 47 Degrees
www.47deg.com
June 27, 2016, Galvanize
47deg.com © Copyright 2015 47 Degrees
Outline
• Introduction
• Data Engineering Requirements
• Data Engineering Design Patterns
• Recommended Data Engineering Tools and Systems
• Final Thoughts
Introduction
Typical Data Engineering Systems
• Low latency response to HTTP or REST requests
• Database reads and writes
• Run ML models
• Produce event streams for later processing
• Near real time event processing
• Simple analytics and alerts
• Analysis of server information
• Logs and metrics
• Produce data for later analysis by data scientists
Big Data
• (Much) Too big to fit on a single machine
• Must have both
• distributed computation
• distributed data (bases)
• Distributed systems means no single main memory
• Must pass data across servers
• Large number of distributed components means failure is common
• Dealing with failure must be part of the fundamental architecture
Fallacies of Distributed Computing
• Peter Deutsch: https://blogs.oracle.com/jag/resource/Fallacies.html
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn’t change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
Reactive Manifesto
• http://www.reactivemanifesto.org/
• Responsive - predictable latency
• Resilient - fault tolerant
• Elastic - (auto) scalability
• Message driven - basis of a distributed implementation
Data Engineering Requirements
Scalability
• New systems are getting bigger all the time
• Hardware is getting cheaper
• Business requirements to stay competitive are increasing
• Cloud computing permits easy expansion based on instantaneous need
• No single server is ever big enough
• Scalability goal: performance increases (close to) linearly with the number of servers
Availability
• Systems are increasingly expected to be available 24/7 with no downtime
• Any server can fail, others must be able to take over
• No downtime for maintenance. Software upgrades occur without shutting system down.
• Must avoid availability-killing features such as 2-phase commit
• SLAs are stated as a number of nines
• The best most achieve is 3 nines (8.8 hours of downtime per year)
• Most strive for 6 nines (about 30 seconds per year)
• AWS S3 claims 9 nines (about 32 msec per year)
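The nines above translate into allowed downtime by simple arithmetic: n nines of availability permit a fraction 10⁻ⁿ of the year down. A small Scala helper (the object name is illustrative) makes the conversion explicit:

```scala
// Yearly downtime allowed by an availability of n nines.
object Nines {
  private val msPerYear = 365.0 * 24 * 60 * 60 * 1000

  // n nines = (1 - 10^-n) availability, so 10^-n of the year may be down.
  def downtimeMillisPerYear(nines: Int): Double =
    msPerYear * math.pow(10, -nines)
}
```

For example, 3 nines allow about 8.76 hours per year, 6 nines about 31.5 seconds, and 9 nines about 31.5 milliseconds.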
Durability
• Losing data is never acceptable
• Since any single point can fail, we must replicate data
• Replication to
• main memory
• different server
• server in different zone
• across geo-distributed data centers
• AWS S3 will lose at most one object out of 32K objects every 10 million years
Latency and Bandwidth
• Latency - msec to process a single request
• More hops can increase latency
• Very fast network hardware can reduce latency
• Speed of light is still the upper bound
• Bandwidth - number of requests processed per sec
• More servers can increase bandwidth
• Latency Numbers Every Programmer Should Know
• main memory (0.0001 msec)
• different server (0.5 msec)
• across geo-distributed data centers (150 msec)
Data Engineering Design Patterns
Immutable Data
• Concurrent access to mutable data requires synchronization. Immutable data does not.
• Data passed between servers will be immutable
• Immutable data plus functional programming results in code that is easier to understand and test
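A minimal Scala sketch of the point above: immutable case classes can be shared freely across threads, and an "update" produces a new value rather than mutating in place (the names here are illustrative):

```scala
// Immutable record: "updating" returns a new value; the original is unchanged.
case class User(name: String, email: String)

object ImmutableDemo {
  // Changing an email never mutates the input, so no synchronization is
  // needed even when many threads share the same User value.
  def withEmail(u: User, email: String): User = u.copy(email = email)
}
```

Immutable collections compose the same way: `0 :: xs` builds a new list without touching `xs`.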
Messaging (1 of 2)
• Message sent from A to B
• A gets ack from B
• A gets no ack from B
• message never got to B
• ack from B never got to A
• What kind?
• at most once (never resend)
• at least once (resend if no ack)
• exactly once (resend idempotently if no ack)
Messaging (2 of 2)
• Idempotence
• Multiple sends have same effect
• set X to 3, NOT add 2 to X
• Attach GUID, destination must handle
• In order delivery
• Waiting for an ack before sending next increases latency
• Attach sequence number, destination must handle
• Batching multiple messages together can help
• Design so order does not matter
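The GUID-based idempotent handling described above can be sketched in plain Scala. This is an illustrative in-memory version; a real destination would persist the set of seen IDs:

```scala
// At-least-once delivery made effectively exactly-once:
// the sender may resend, the receiver drops duplicates by message ID.
final case class Msg(id: String, body: String)

final class DedupReceiver {
  private var seen = Set.empty[String]          // IDs already processed
  private var processed = Vector.empty[String]  // applied effects, for inspection

  // Returns true if the message was applied, false if it was a duplicate.
  def receive(m: Msg): Boolean =
    if (seen.contains(m.id)) false
    else {
      seen += m.id
      processed :+= m.body
      true
    }

  def log: Vector[String] = processed
}
```

A resend of the same GUID is simply ignored, so the sender can retry freely after a missing ack.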
Persistent Data (1 of 3)
• CAP theorem (pick 2)
• Consistency (ACID)
• Availability
• Partition tolerance (closely tied to fault tolerance)
• Distributed consistency solutions: 2-phase commit is “the anti-availability protocol” (Helland)
• For very large, highly available systems, AP is the only possible choice
Persistent Data (2 of 3)
• Detecting conflicts with Vector clocks
• Each server has own time
• Vector has one element for each server
• Forms a partial order
• Resolving conflicts (for example: 2 different phone numbers)
• Select the latest
• Ask someone
• Keep both
• CRDTs (generalization of keep both)
• conflict-free replicated data types
• merge must be commutative, associative, idempotent
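A standard introductory CRDT is the grow-only counter (G-Counter): each server increments its own slot, and merge takes the element-wise maximum, which is commutative, associative, and idempotent. A sketch, not a production implementation:

```scala
// G-Counter CRDT: one slot per server; merge = element-wise max.
final case class GCounter(slots: Map[String, Long]) {
  def increment(server: String): GCounter =
    GCounter(slots.updated(server, slots.getOrElse(server, 0L) + 1))

  def value: Long = slots.values.sum

  // Commutative, associative, idempotent: replicas converge
  // no matter the order or repetition of merges.
  def merge(other: GCounter): GCounter =
    GCounter((slots.keySet ++ other.slots.keySet).map { k =>
      k -> math.max(slots.getOrElse(k, 0L), other.slots.getOrElse(k, 0L))
    }.toMap)
}
```

Because each server only ever increments its own slot, taking the max per slot never loses an update, which is exactly the "keep both, then merge" resolution generalized.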
Persistent Data (3 of 3)
• Log based stores
• Sequence of transformational steps
• Each step is immutable
• Log is append only (fast sequential write to disk)
• Database is a cache of some point in the log
• Log is primary
• Database can be deleted and recreated from log
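The "database is a cache of some point in the log" idea above can be sketched as a fold over an append-only sequence of immutable events (the event names are illustrative):

```scala
// Append-only log of immutable steps; current state is derived by replay.
sealed trait Event
final case class Put(key: String, value: String) extends Event
final case class Delete(key: String) extends Event

object LogStore {
  // Rebuild the "database" (a key-value map) from any prefix of the log.
  def replay(log: Seq[Event]): Map[String, String] =
    log.foldLeft(Map.empty[String, String]) {
      case (db, Put(k, v)) => db.updated(k, v)
      case (db, Delete(k)) => db - k
    }
}
```

Replaying a shorter prefix yields the database as of that earlier point, which is why the database can be deleted and recreated from the log at any time.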
Concurrency and Distribution
• Individual servers are getting ever more cores.
• Utilization is key
• Large data applications require multiple servers
• Connections between servers are frequent points of failure
• Parallel data operations help: parallel collections, Spark
• Traditional synchronization (locks, monitors) is error-prone and very hard to get right.
• Message-based systems (Hoare’s CSP, Hewitt’s actors) are a better solution and work well across servers.
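As a small illustration of lock-free composition, Scala's Futures run independent pieces of work concurrently and combine their results without any shared mutable state (a sketch using the global execution context):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  // Split the work, run the halves concurrently, combine the results.
  // No shared mutable state, so no locks or monitors are needed.
  def sum(xs: Vector[Int]): Int = {
    val (left, right) = xs.splitAt(xs.length / 2)
    val fl = Future(left.sum)
    val fr = Future(right.sum)
    val combined = for (a <- fl; b <- fr) yield a + b
    Await.result(combined, 10.seconds)
  }
}
```

The same for-comprehension style scales from two futures on one machine to fault-tolerant pipelines across servers when combined with actors.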
Logging and Monitoring
• As systems involve more and more servers
• Detecting and locating failure is getting harder
• Understanding system performance and performance tuning is getting harder
• We now produce massive amounts of logs and monitoring data
• Making sense of this huge volume of data is hard
• For failures we need near real-time analysis
• Increasing need for data science solutions
Continuous Deployment (1 of 2)
• High availability means we can no longer shut down for upgrades to
• Application code
• Operating system upgrades and patches
• Hardware maintenance
• Automatic server failover
• Rolling upgrades
• Backward compatibility
• Messages
• Database schemas
Continuous Deployment (2 of 2)
• Deployment of lots of small changes reduces the chance of errors in any single deployment
• Requires comprehensive automation for testing and deployment
• But errors still do occur
• Although we have good methods for testing individual components, integration testing is still hard and error prone.
• Some approaches
• Roll back
• A-B testing
• Database checkpoints
Recommended Data Engineering Tools and Systems
Choices
• Open source preferred
• Personal favorites
• Widely used (best practices in leading companies)
Prefer Open Source
• “Free”
• Full source is available
• Community participation
• Can move very fast
• More responsive
• A plus if there is a commercial company providing support
Programming Language (1 of 3)
• Compiled versus interpreted
• Compiled: C, C++, Go
• Semi-compiled: Java, C#, Scala
• Interpreted: Python, Ruby, R
• Static versus dynamic type checking
• Static catches more errors at compile-time
• Static are easier to understand and maintain
• Static requires more work writing
• Garbage collection. Safety versus performance
Programming Languages (2 of 3)
• Choice of language does not matter
• I can write any algorithm in any language
• Let’s avoid pointless “language religion” wars
• Choice of language matters a lot
• Language can have a big impact on performance, productivity and reliability
• Programming languages shape the way we think
Programming Languages (3 of 3)
• Scala
• Semi-compiled: compiles to JVM bytecode, then JIT-compiled at run time
• Statically typed, with the concise feel of an untyped language
• Garbage collected
• Runs on JVM. Full ecosystem of libraries and tools available.
• Key features
• Functional plus immutable data (major advance in program quality)
• Scala Futures and Akka Actors (major advance in easy to understand, easy to get correct, and fault-tolerant distributed computation)
• Main language for Spark
• Suitable for both data engineers and data scientists (better cooperation)
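A flavor of the bullets above in a few lines: typed, immutable records processed by pure functions, readable to engineers and scientists alike (a toy pipeline; the field names are made up, and `toDoubleOption` assumes Scala 2.13+):

```scala
// A tiny, typed, functional "pipeline": parse, validate, aggregate.
final case class Reading(sensor: String, celsius: Double)

object Pipeline {
  // Pure parsing: bad input yields None instead of throwing.
  def parse(line: String): Option[Reading] = line.split(",") match {
    case Array(s, t) => t.toDoubleOption.map(Reading(s, _))
    case _           => None
  }

  def averageBySensor(lines: Seq[String]): Map[String, Double] =
    lines.flatMap(parse)
      .groupBy(_.sensor)
      .map { case (s, rs) => s -> rs.map(_.celsius).sum / rs.size }
}
```

Malformed lines are silently dropped by the `Option`, and the same `groupBy`/`map` shape carries over almost verbatim to Spark datasets.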
Messaging
• Kafka (written in Scala)
• Reliable buffer between producers and consumers
• Can replay
• Multiple producers and consumers
• Multiple topics
• Linearly scalable
• Kafka Streams
• Other
• Reactive streams
• Spark streaming
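The replayable-buffer idea behind Kafka can be sketched in a few lines of Scala. This is not Kafka's actual API, just an in-memory illustration of a topic where each consumer tracks its own offset and can re-read from any point:

```scala
// In-memory sketch of a replayable topic: producers append,
// each consumer keeps its own offset and can re-read from any point.
final class Topic[A] {
  private var log = Vector.empty[A]

  def append(a: A): Unit = synchronized { log :+= a }

  // Read everything from `offset` on; replay is just re-reading old offsets.
  def read(offset: Int): Vector[A] = synchronized { log.drop(offset) }

  def size: Int = synchronized { log.length }
}
```

Because the log is append-only and consumers choose their own offsets, a slow or restarted consumer never loses data; it simply resumes (or replays) from wherever it needs.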
Databases
• Relational: Postgres (scaling can be a problem)
• Embedded: LevelDB, MapDB
• NoSQL: Cassandra, Couchbase
• Graph: Neo4j, Titan, DataStax Enterprise Graph
Analytics
• Hadoop (let it die!)
• Spark (Written in Scala, Scala API is best)
• Trend toward SQL
• Improved performance via query optimizer
• Widely understood (but poor?) programming model
• Somewhat abandoned functional programming (RDDs)
• Dataset transforms: an experiment to combine functional programming with support for query optimization
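Spark itself needs a cluster runtime, but the functional style of its RDD API mirrors plain Scala collections; this word-count sketch uses the same map/flatMap/reduce shape as the classic RDD example, with `Seq` standing in for an RDD so it runs without a cluster:

```scala
// Word count in the functional style the RDD API popularized;
// a plain Seq stands in for an RDD so the example runs locally.
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))      // flatMap: lines -> words
      .filter(_.nonEmpty)
      .map(w => (w.toLowerCase, 1))  // map: word -> (word, 1)
      .groupBy(_._1)                 // groupBy plays the role of the shuffle
      .map { case (w, ps) => w -> ps.map(_._2).sum }
}
```

The same chain, written against an RDD or Dataset, is what Spark distributes across servers; in the Dataset case the query optimizer can also rewrite it.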
Data Center Infrastructure and Continuous Deployment
• GitHub, SBT, Artifactory, Jenkins
• Docker/Rkt, Etcd, CoreOS
• Mesos, Kubernetes
• Cloud: AWS, Google, Microsoft
Final Thoughts
Final Thoughts
• Scala is the best choice for both data engineers and data scientists
• Spark is the best choice for data analysis
• Data will continue to grow in size and importance
• The number of servers we use will continue to grow requiring better fault tolerance and better automation
• When data engineers and data scientists work closely together both benefit and better results are achieved
• We need to break down traditional silos
• We need shared tools and technologies that work well for both groups
Questions