19
CS 744: SNOWFLAKE Shivaram Venkataraman Fall 2021 Goodmorning !

Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

CS 744: SNOWFLAKE

Shivaram VenkataramanFall 2021

Goodmorning!

Page 2: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

ADMINISTRIVIA

- Assignment 1 grades out!- Assignment 2 by mid-week- Midterm on Thursday! Seating layout?

→ Yien on Piazza /email

- Randomize sealing

Page 3: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

Machine Learning SQL

Applications

SparkSQL/Scope: Given a query how do you run it efficiently?

Snowflake: How do you build an elastic data warehouse?

hooking5

☒→confirmseats

online transaction--

collection of logs or

purchasesover long

time period ,analyze

._

f-Ty OLAP queriesAble to increase A collection of data

or decrease resource and youdo analytics

use as workload changes queries on it

Page 4: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

CLOUD COMPUTINGSTACK

Scalable Storage Systems

Computational Engines

Machine Learning SQL

\Amazon ECZ

pyforih .-

. I

☐ 1-Sol as a

software service

MapReduce / Spark \ Elastic MapReduce

1Amazon EC2

- VMs

GFS I

1Amazon 53

Page 5: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

SNOWFLAKE: GOALS

Software-as-a-Service

Elastic

Highly Available

Semi-Structured Data

→ Users don't need to install / manage this

→ Use resources as required

→ Fault tolerance

→ Traditional warehouses only support

JSON structured data

↳ queries on

specific fields

Page 6: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

SNOWFLAKE DESIGN"¥É:#andµ

in →

☐genius ②☐

V We are

isolated →☐from

each

other-

Datastorage

shared<

g--

acrossNws

Page 7: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

STORAGE VS COMPUTE

Shared Nothing Multi Cluster, Shared Data

Separate out storage & computeties together

-scale storage independently

CPU , diskresources

- Handle irregulararrival

⇒ some queriesneed

of querieslot of

disk ,

letsDownside

cpv,can lead to under utilization = No locality of access

CPU Ecz→

instances£7 T¥local Amazon

disk 53

Page 8: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

STORAGE: HYBRID COLUMNARAlice 32

Bob 22

Eve 24

Victor 27

Alice,32,Bob,22

Eve,24,Victor,27

Alice, Bob, 32,22

Eve, Victor,24,27

Row-oriented Hybrid Columnar

Entire partition or File

table

" "" 2 rowsin

each partition

m rangeevery

→ ☐ ☐ ☐ start offset& length

ID column

\ compressionsaves space& also network

traffic

Page 9: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

VIRTUAL WAREHOUSES

Elasticity, Isolation

Local caching, Stragglers

↳ For each user spin up some Ecz instances = VW

→ Isolation as instances separate for eachuser

→ can turn off these machines once youare

dofei.fi#.yy--toViY!erf- S3 (Partition) / P2P communication

worm"

ystragglers happen

de

data skew

IÉone machine has lot more

LRV caching policy | in local SSD work to do

Page 10: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

CLOUD SERVICES

Concurrency Control Pruning

Data / Partitions are versioned"

☐ ☐v2

☒,

- DBversion

numbers⇐÷."☐ ☐ f. les

☐ D TaboÉ→ MVCC

in S3

tA query

will see all the | → the Teg various

files at a specificversion columns in each block

"min : 22

,

max : 34

"

dnery → v1 is associated with it

all the reads come from aconn'stntremTµ→ hint or exclude fileswhich are

not goingto be

"

Snapshot Isolation"

used in to query

Page 11: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

FAULT TOLERANCE

theseare again

stateless

replicatedacross

AZS

ones .

"

Retry queries"ÉÉ÷÷:r

data centers

"Availability zone"

Page 12: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

SEMI STRUCTURED DATA

{first_name: “john”,last_name: “doe”,order_id: “1234”,

}{

first_name: “bucky”,last_name: “badger”,order_id: “52342”,order_date: “3/3/2020”,

}

Extraction operation

Flattening

Infer types, Pruning

→ Key reason /•

challenge for enterprises

→doesnt

'

strictlyfollow a ↳ access individual fieldsschema. with JSON as part of

- -

query

↳ flatten nested Json

osjuk↳

minimax

- - pruning on

Insidecolumn

to

↳ not present in each

all records

Page 13: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

TIME TRAVEL?

Multiple versions of table (MVCC)

Undo accidental deletes

Cheap to clone / snapshot a table

→ Run querieswhich say

processes date till yesterday

Page 14: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

SECURITY

Hierarchical key management

Key rotation, re-keying

Page 15: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

SUMMARY, TAKEAWAYS

Snowflake- Cloud computing à Elastic data warehouse- Key idea: Separation of compute and storage!

- Hybrid columnar storage format- Elastic compute with virtual warehouses- Pruning, semi-structured optimizations, fault tolerant

03,02, 01 ,

" "

vw per user

☐ cache

reused across

queries peruser

Page 16: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

DISCUSSIONhttps://forms.gle/buUDM9nRs6Gg9tURA

Page 17: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

We see how Snowflake leads to the design of an elastic data warehouse. If we were to similarly design an Elastic PyTorch for training how would the design look? What are some design trade-offs compared to existing PyTorch?

Data Parallel → row - based partitioning

hybrid columnar ?

1£ gad ) updatedevery

iteration⇒ overhead ?

model,

traindote

s test data

Parameterscorers

Page 18: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu
Page 19: Shivaram Venkataraman Fall 2021 - pages.cs.wisc.edu

NEXT STEPS

Next class: Midterm!