Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
CS 744: SNOWFLAKE
Shivaram VenkataramanFall 2021
Goodmorning!
ADMINISTRIVIA
- Assignment 1 grades out!- Assignment 2 by mid-week- Midterm on Thursday! Seating layout?
→ Yien on Piazza /email
- Randomize sealing
Machine Learning SQL
Applications
SparkSQL/Scope: Given a query how do you run it efficiently?
Snowflake: How do you build an elastic data warehouse?
hooking5
☒→confirmseats
online transaction--
collection of logs or
purchasesover long
time period ,analyze
._
f-Ty OLAP queriesAble to increase A collection of data
or decrease resource and youdo analytics
use as workload changes queries on it
CLOUD COMPUTINGSTACK
Scalable Storage Systems
Computational Engines
Machine Learning SQL
\Amazon ECZ
pyforih .-
. I
☐ 1-Sol as a
software service
MapReduce / Spark \ Elastic MapReduce
1Amazon EC2
- VMs
GFS I
1Amazon 53
SNOWFLAKE: GOALS
Software-as-a-Service
Elastic
Highly Available
Semi-Structured Data
→ Users don't need to install / manage this
→ Use resources as required
→ Fault tolerance
→ Traditional warehouses only support
JSON structured data
↳ queries on
specific fields
SNOWFLAKE DESIGN"¥É:#andµ
in →
☐genius ②☐
V We are
isolated →☐from
each
other-
Datastorage
shared<
g--
acrossNws
STORAGE VS COMPUTE
Shared Nothing Multi Cluster, Shared Data
Separate out storage & computeties together
-scale storage independently
CPU , diskresources
- Handle irregulararrival
⇒ some queriesneed
of querieslot of
disk ,
letsDownside
cpv,can lead to under utilization = No locality of access
CPU Ecz→
instances£7 T¥local Amazon
disk 53
STORAGE: HYBRID COLUMNARAlice 32
Bob 22
Eve 24
Victor 27
Alice,32,Bob,22
Eve,24,Victor,27
Alice, Bob, 32,22
Eve, Victor,24,27
Row-oriented Hybrid Columnar
Entire partition or File
table
" "" 2 rowsin
each partition
m rangeevery
→ ☐ ☐ ☐ start offset& length
ID column
\ compressionsaves space& also network
traffic
VIRTUAL WAREHOUSES
Elasticity, Isolation
Local caching, Stragglers
↳ For each user spin up some Ecz instances = VW
→ Isolation as instances separate for eachuser
→ can turn off these machines once youare
dofei.fi#.yy--toViY!erf- S3 (Partition) / P2P communication
worm"
ystragglers happen
de
data skew
IÉone machine has lot more
LRV caching policy | in local SSD work to do
CLOUD SERVICES
Concurrency Control Pruning
Data / Partitions are versioned"
☐ ☐v2
☒,
- DBversion
numbers⇐÷."☐ ☐ f. les
☐ D TaboÉ→ MVCC
in S3
tA query
will see all the | → the Teg various
files at a specificversion columns in each block
"min : 22
,
max : 34
"
dnery → v1 is associated with it
all the reads come from aconn'stntremTµ→ hint or exclude fileswhich are
not goingto be
"
Snapshot Isolation"
used in to query
FAULT TOLERANCE
theseare again
stateless
replicatedacross
AZS
ones .
"
Retry queries"ÉÉ÷÷:r
data centers
"Availability zone"
SEMI STRUCTURED DATA
{first_name: “john”,last_name: “doe”,order_id: “1234”,
}{
first_name: “bucky”,last_name: “badger”,order_id: “52342”,order_date: “3/3/2020”,
}
Extraction operation
Flattening
Infer types, Pruning
→ Key reason /•
challenge for enterprises
→doesnt
'
strictlyfollow a ↳ access individual fieldsschema. with JSON as part of
- -
query
↳ flatten nested Json
osjuk↳
minimax
- - pruning on
Insidecolumn
•
to
↳ not present in each
all records
TIME TRAVEL?
Multiple versions of table (MVCC)
Undo accidental deletes
Cheap to clone / snapshot a table
→ Run querieswhich say
processes date till yesterday
SECURITY
Hierarchical key management
Key rotation, re-keying
SUMMARY, TAKEAWAYS
Snowflake- Cloud computing à Elastic data warehouse- Key idea: Separation of compute and storage!
- Hybrid columnar storage format- Elastic compute with virtual warehouses- Pruning, semi-structured optimizations, fault tolerant
03,02, 01 ,
" "
vw per user
☐ cache
reused across
queries peruser
DISCUSSIONhttps://forms.gle/buUDM9nRs6Gg9tURA
We see how Snowflake leads to the design of an elastic data warehouse. If we were to similarly design an Elastic PyTorch for training how would the design look? What are some design trade-offs compared to existing PyTorch?
Data Parallel → row - based partitioning
hybrid columnar ?
1£ gad ) updatedevery
iteration⇒ overhead ?
model,
traindote
s test data
Parameterscorers
NEXT STEPS
Next class: Midterm!