Cloud Analytics Data Warehousing - GitHub Pages...Cloud Analytics Data Warehousing Marco Serafini COMPSCI 590S Lecture 19. 22 ... •Data Storage •Based on S3: high throughput, high

Cloud AnalyticsData Warehousing

Marco Serafini

COMPSCI 590SLecture 19

22

Trivia• How does Amazon make money?

• Selling books?• Entertainment?

33

Cloud Computing• Shared resources

• Multiple tenants sharing resources (with isolation)• Economy of scale

• Elastic provisioning• Can easily add and remove resources on the fly

• Pay as you go only when used• Different flavors

• IaaS, PaaS, SaaS• Public, private cloud

44

Cloud Offerings• Computing nodes

• Example: AWS EC2• Full nodes with local storage and pre-installed OS• Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…

• Storage services• Example: AWS S3• Key-value stores (put/get), file systems

• Higher-level services• Example: DBMS

55

Other Variants• Spot instances

• Allocated in real-time based on live bidding• Can be revoked any time (with notice)

• Serverless computing• Example: AWS Lambda

• Each of these services comes with own pricing

77

Storage Disaggregation• Use remote storage instead of local storage

• Network is fast• Remote and local storage can have same throughput

• Advantages: can use cloud storage services like S3• No configuration or provisioning needed• Cheaper

• Cost of disaggregated storage• Storage nodes can have weak CPUs and limited memory• Storage is cheap

88

Remote vs. Local Storage

9 9

Goals• Easily parallelize single-threaded code• Eliminate cluster management overhead

• Deployment of nodes• Installation• Configuration

• Even cloud offerings have their complexities• Many instance types• Many services

• Solution: Serverless functions

1010

Serverless Functions• Single threaded code• Invoked through HTTP requests• Cloud platform takes care of

• Deployment• Load balancing• Performance isolation

• No need to• Deploy servers• Configure clusters

1111

State and Fault Tolerance• State is lost after execution• Inputs and outputs need to be persisted• Fault tolerance

• Re-execute function• Require atomic writes to check what has succeeded

1212

Registering Functions• Registering a new Lambda function is slow• Solution

• Register a single generic Lambda function• Serialize the code that needs the be executed• Store the code (and the input data) on S3• Generic Lambda function loads code and executes it

1313

Remote Storage Scalability

1414

Semantics• Map is easy

• Execute one function per element of the list• Map + single Reducer

• E.g. parallel featurization + single-server ML• MapReduce

• Many Lambdas needed, many small intermediate files• Use Redis, an in-memory key-value store

• Parameter server• Use Redis

1515

The Cost of Scaling Up• Using more nodes does not always imply higher cost• Lower latency à lower cost per node

16

17 17

Shared-Nothing and the Cloud• Shared-nothing architecture

• Each node has its own disk and memory• All nodes are “symmetric”

• Challenges• Heterogeneous workloads

• No one-size-fits-all hardware configuration• Membership changes

• Large data shuffles when a node fails/is removed• Online upgrade

• It is similar to changing all the nodes in the system

1818

Architecture• Data Storage

• Based on S3: high throughput, high latency• Used also for intermediate data

• Virtual Warehouses• Responsible for query execution• Stateless (restarted in their entirety)• Shared cache (low latency on hot data, most data cold)

• Cloud Services• Query parsing, access control, optimization• Snapshot isolation with multi-versioning• Metadata on external key-value store

1919

Advantages• Storage on S3 is cheaper• Use expensive local disk only for hot data• All services (except storage) are stateless

• Simpler fault tolerance and membership change• Example: online upgrade

21 21

SparkSQL: Spark + DBMS• Extend Spark with

• Simple, high-level SQL-like operators• Query optimization

• No need to transfer data across systems• ETL, query processing, complex analytics in one system

2222

DataFrames• Collection of rows with homogeneous schema

• Like a table in a DBMS• Can be manipulated like an RDD

• DataFrame operations• Similar to Python Pandas or R data frames• Evaluated lazily (query planning is postponed)• Can optimize across multiple queries

2323

Advantages• Relational structure enables query optimization• In-memory caching using columnar representation

• Better compression• Mix SQL-like operators and arbitrary code

• More flexible than UDFs in DBMSs• Can optimize across multiple SQL operations

2424

Catalyst• Query optimizer of SparkSQL• Rule-based optimization

• Rule: find pattern and transform• Used for both logical and physical plans• Can customize rules

• Code generation• Directly outputs bytecode (as opposed to interpreting a plan)• Much more CPU efficient

• Flexible data sources• Can change the physical representation of DataFrames• Still use the optimizer

Documents

Cloud Analytics Data Warehousing - GitHub Pages...Cloud Analytics Data Warehousing Marco Serafini COMPSCI 590S Lecture 19. 22 ... •Data Storage •Based on S3: high throughput, high