53
CS 245: Principles of Data-Intensive Systems Instructor: Matei Zaharia cs245.stanford.edu

CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

CS 245: Principles ofData-Intensive Systems

Instructor: Matei Zahariacs245.stanford.edu

Page 2: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Course originally by Hector Garcia-Molina (1954-2019)

CS 245 2

Page 3: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

My Background

PhD in 2013

CS 245 3

Open source distributed data processing framework

Data & ML platform startup

Research in systems for ML

Page 4: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Outline

Why study data-intensive systems?

Course logistics

Key issues and themes

A bit of history

4CS 245

Page 5: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Why StudyData-Intensive Systems?

Most important computer applications must manage, update and query datasets» Bank, store, fleet controller, search app, …

Data quality, quantity & timeliness becoming even more important with AI» Machine learning = algorithms that generalize

from data

5CS 245

Page 6: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

What AreData-Intensive Systems?Relational databases: most popular type of data-intensive system (MySQL, Oracle, etc)

Many systems facing similar concerns:message queues, key-value stores, streaming systems, ML frameworks, your custom app?

CS 245 6

Goal: learn the main issues and principles that span all data-intensive systems

Page 7: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Typical System Challenges

Reliability in the face of hardware crashes, bugs, bad user input, etc

Concurrency: access by multiple users

Performance: throughput, latency, etc

Access interface from many, changing apps

Security and data privacy

CS 245 7

Page 8: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Practical Benefits of Studying These SystemsLearn how to select & tune data systems

Learn how to build them

Learn how to build apps that have to tackle some of these same challenges» E.g. cross-geographic-region billing app,

custom search engine, etc

CS 245 8

Page 9: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Scientific Interest

Interesting algorithmic and design ideas

In many ways, data systems are the highest-level successful programming abstractions

CS 245 9

Page 10: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Programming: The Dream

CS 245 10

∀𝑖 #$∈&'∪)'

𝜆𝑥. 𝑥-(… )

Working application

High-level spec

Page 11: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Programming: The Dream

CS 245 11

∀𝑖 #$∈&'∪)'

𝜆𝑥. 𝑥-(… )

Working application

High-level spec

Page 12: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Programming: The Reality

CS 245 12

Page 13: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Programming with Databases

CS 245 13

Relational algebra

Actually manages:• Durability• Concurrency• Query optimization• Security• …

High-level spec

Page 14: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Outline

Why study data-intensive systems?

Course logistics

Key issues and themes

A bit of history

14CS 245

Page 15: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Teaching Assistants

CS 245 15

Pratiksha Thaker Ben HannelPeter Kraft Wantong Jiang

Office hours will be posted on website

Page 16: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Course Format

Lectures in class

Optional textbook

Assigned paper readings (Q&A in class)

3 programming assignments

Midterm and final

CS 245 16

Page 17: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Optional Textbook

Database Systems:The Complete Book

Chapters 13-20

By the original Stanford InfoLab group (Hector Garcia-Molina, Jeff Ullman, Jennifer Widom)

CS 245 17

Page 18: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Paper Readings

A few classic or recent research papers

Read the paper before the class: we want to discuss it together!

We’ll post discussion questions on the class website 2-3 weeks before lecture

CS 245 18

Page 19: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

How Should You Read a Paper?

Read: “How to Read a Paper”

TLDR: don’t justgo through endto end; focus onkey ideas/sections

CS 245 19

Page 20: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Our First Paper

We’ll be reading part of “A History and Evaluation of System R” for next class!

Find instructions and questions on website

CS 245 20

Page 21: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Programming Assignments

Three assignments implemented in Java or Scala, and submitted online

1. Storage and access methods2. Query optimization3. Transactions and recovery

Done individually; A1 posted next week

CS 245 21

Page 22: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Midterm and Final

Written tests based on material covered in lectures, assignments and readings

Final will cover the entire course but focus on the second half

CS 245 22

Page 23: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Grading

45% Assignments (15% each)

25% Midterm

30% Final

CS 245 23

Page 24: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Keeping in Touch

Sign up for Piazza on the course website to receive announcements!

cs245.stanford.edu

CS 245 24

Page 25: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Outline

Why study data-intensive systems?

Course logistics

Key issues and themes

A bit of history

25CS 245

Page 26: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Recall: Examples ofData-Intensive SystemsRelational databases: most popular type of data-intensive system (MySQL, Oracle, etc)

Many systems facing similar concerns:message queues, key-value stores, streaming systems, ML frameworks, your custom app?

CS 245 26

Page 27: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Basic Components

CS 245 27

Logical dataset(e.g. table, graph)

Data mgmt. system Physical storage

(data structures)

Administrator

Clients / users

Queries

Page 28: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

CS 245 28

Page 29: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow

CS 245 29

Page 30: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors

CS 245 30

Page 31: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

CS 245 31

Page 32: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

CS 245 32

Page 33: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

query planning, distribution, specialized HW

CS 245 33

Page 34: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

query planning, distribution, specialized HW

Apache Kafka

CS 245 34

Page 35: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

query planning, distribution, specialized HW

Apache Kafka

Streams of opaque records

Partitions, compaction

Publish, subscribe

Durability, rescaling

CS 245 35

Page 36: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

query planning, distribution, specialized HW

Apache Kafka

Streams of opaque records

Partitions, compaction

Publish, subscribe

Durability, rescaling

Apache Spark RDDs

CS 245 36

Page 37: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

ExamplesSystem Logical

Data ModelPhysical Storage API Other Features

Relational databases

Relations(i.e. tables)

B-trees, column stores, indexes, …

SQL, ODBC Durability, transactions, query planning, migrations, …

TensorFlow Tensors NCHW, NHWC, sparse arrays, …

Python DAG construction

query planning, distribution, specialized HW

Apache Kafka

Streams of opaque records

Partitions, compaction

Publish, subscribe

Durability, rescaling

Apache Spark RDDs

Collections of Java objects

Read external systems, cache

Functional API, SQL

Distribution,query planning, transactions*

CS 245 37

Page 38: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Some Typical Concerns

Access interface from many, changing apps

Performance: throughput, latency, etc

Reliability in the face of hardware crashes, bugs, bad user input, etc

Concurrency: access by multiple users

Security and data privacy

CS 245 38

Page 39: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Example

Message queue system

CS 245 39

Producers Consumers

What should happen if two consumers read() at the same time?

Page 40: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Example

Message queue system

CS 245 40

Producers Consumers

What should happen if a consumer reads a message but then immediately crashes?

Page 41: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Example

Message queue system

CS 245 41

Producers Consumers

Can a producer put in 2 messages atomically?

Page 42: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Two Big Ideas

Declarative interfaces» Apps specify what they want, not how to do it» Example: “store a table with 2 integer columns”,

but not how to encode it on disk» Example: “count records where column1 = 5”

Transactions» Encapsulate multiple app actions into one atomic request (fails or succeeds as a whole)

» Concurrency models for multiple users» Clear interactions with failure recovery

CS 245 42

Page 43: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Declarative Interface ExamplesSQL» Abstract “table” data model, many physical

implementations» Specify queries in a restricted language that the

database can optimize

TensorFlow» Operator graph gets mapped & optimized to

different hardware devices

Functional programming (e.g. MapReduce)» Says what to run but not how to do scheduling

CS 245 43

Page 44: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Transaction Examples

SQL databases» Commands to start, abort or end transactions

based on multiple SQL statements

Apache Spark, MapReduce» Make the multi-part output of a job appear

atomically when all partitions are done

Stream processing systems» Count each input record exactly once despite

crashes, network failures, etcCS 245 44

Page 45: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Outline

Why study data-intensive systems?

Course logistics

Key issues and themes

A bit of history

45CS 245

Page 46: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Early Data Management

At first, each application did its own data management directly against storage

CS 245 46

Ye OldeBank

I’d like a computerized

account system

I have just the thing

write_block()

read_block()Stores 5 MB!

Page 47: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Problems with App Storage Management

How should we lay out and navigate data?

How do we keep the application reliable?

What if we want to share data across apps?

Every app is solving the same problems!

CS 245 47

Page 48: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Navigational Databases (1964)

CODASYL, IDS

Data is graph of records

Procedural API basedon navigating links:get department with name='Sales’get first employee in set department-employeesuntil end-of-set do {get next employee in set department-employeesprocess employee

}

CS 245 48“Data independence”: app code not tied to storage details

Page 49: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

CS 245 49

Charles W. Bachman, “The Programmer as Navigator”

Page 50: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Edgar F. (Ted) Codd

Proposed the relational DB model, with declarative queries & storage (1970)

Relation = table with unique key identifying each row

CS 245 50

Data independence++: apps don’t even specify how to execute query

Page 51: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Key Ideas in Relational DBMS

CS 245 51

Logical data model:tables with references

across them (foreign keys)

Data mgmt. system Physical storage:

raw files, B-trees,hash indexes, etc

Administrator

Clients / users

Relationalalgebra

(e.g. SQL)

Query planning, access methods, transactions, etc

Page 52: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Early Relational DBMS

IBM System R (1974): research system» Led to IBM SQL/DS in 1981

Ingres (1974): Mike Stonebraker at Berkeley» Led to PostgreSQL

Oracle database (released 1979)

CS 245 52

Next class, we’ll cover database architecture by looking at System R

Page 53: CS 245: Principles of Data-Intensive Systemsweb.stanford.edu/class/cs245/slides/01-Introduction.pdf · Kafka Streams of opaque records Partitions, compaction Publish, subscribe Durability,

Rest of the Course

We’ll explore both “big ideas” we saw, focusing on relational DBs but showing examples in other areas

• Declarative interfaces• Data independence and data storage formats• Query languages and optimization

• Transactions, concurrency & recovery• Concurrency models• Failure recovery• Distributed storage and consistency

CS 245 53Don’t forget to sign up for Piazza!