Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Cloud AnalyticsData Warehousing
Marco Serafini
COMPSCI 590SLecture 19
22
Trivia• How does Amazon make money?
• Selling books?• Entertainment?
33
Cloud Computing• Shared resources
• Multiple tenants sharing resources (with isolation)• Economy of scale
• Elastic provisioning• Can easily add and remove resources on the fly
• Pay as you go only when used• Different flavors
• IaaS, PaaS, SaaS• Public, private cloud
44
Cloud Offerings• Computing nodes
• Example: AWS EC2• Full nodes with local storage and pre-installed OS• Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…
• Storage services• Example: AWS S3• Key-value stores (put/get), file systems
• Higher-level services• Example: DBMS
55
Other Variants• Spot instances
• Allocated in real-time based on live bidding• Can be revoked any time (with notice)
• Serverless computing• Example: AWS Lambda
• Each of these services comes with own pricing
77
Storage Disaggregation• Use remote storage instead of local storage
• Network is fast• Remote and local storage can have same throughput
• Advantages: can use cloud storage services like S3• No configuration or provisioning needed• Cheaper
• Cost of disaggregated storage• Storage nodes can have weak CPUs and limited memory• Storage is cheap
88
Remote vs. Local Storage
9 9
Goals• Easily parallelize single-threaded code• Eliminate cluster management overhead
• Deployment of nodes• Installation• Configuration
• Even cloud offerings have their complexities• Many instance types• Many services
• Solution: Serverless functions
1010
Serverless Functions• Single threaded code• Invoked through HTTP requests• Cloud platform takes care of
• Deployment• Load balancing• Performance isolation
• No need to• Deploy servers• Configure clusters
1111
State and Fault Tolerance• State is lost after execution• Inputs and outputs need to be persisted• Fault tolerance
• Re-execute function• Require atomic writes to check what has succeeded
1212
Registering Functions• Registering a new Lambda function is slow• Solution
• Register a single generic Lambda function• Serialize the code that needs the be executed• Store the code (and the input data) on S3• Generic Lambda function loads code and executes it
1313
Remote Storage Scalability
1414
Semantics• Map is easy
• Execute one function per element of the list• Map + single Reducer
• E.g. parallel featurization + single-server ML• MapReduce
• Many Lambdas needed, many small intermediate files• Use Redis, an in-memory key-value store
• Parameter server• Use Redis
1515
The Cost of Scaling Up• Using more nodes does not always imply higher cost• Lower latency à lower cost per node
16
17 17
Shared-Nothing and the Cloud• Shared-nothing architecture
• Each node has its own disk and memory• All nodes are “symmetric”
• Challenges• Heterogeneous workloads
• No one-size-fits-all hardware configuration• Membership changes
• Large data shuffles when a node fails/is removed• Online upgrade
• It is similar to changing all the nodes in the system
1818
Architecture• Data Storage
• Based on S3: high throughput, high latency• Used also for intermediate data
• Virtual Warehouses• Responsible for query execution• Stateless (restarted in their entirety)• Shared cache (low latency on hot data, most data cold)
• Cloud Services• Query parsing, access control, optimization• Snapshot isolation with multi-versioning• Metadata on external key-value store
1919
Advantages• Storage on S3 is cheaper• Use expensive local disk only for hot data• All services (except storage) are stateless
• Simpler fault tolerance and membership change• Example: online upgrade
21 21
SparkSQL: Spark + DBMS• Extend Spark with
• Simple, high-level SQL-like operators• Query optimization
• No need to transfer data across systems• ETL, query processing, complex analytics in one system
2222
DataFrames• Collection of rows with homogeneous schema
• Like a table in a DBMS• Can be manipulated like an RDD
• DataFrame operations• Similar to Python Pandas or R data frames• Evaluated lazily (query planning is postponed)• Can optimize across multiple queries
2323
Advantages• Relational structure enables query optimization• In-memory caching using columnar representation
• Better compression• Mix SQL-like operators and arbitrary code
• More flexible than UDFs in DBMSs• Can optimize across multiple SQL operations
2424
Catalyst• Query optimizer of SparkSQL• Rule-based optimization
• Rule: find pattern and transform• Used for both logical and physical plans• Can customize rules
• Code generation• Directly outputs bytecode (as opposed to interpreting a plan)• Much more CPU efficient
• Flexible data sources• Can change the physical representation of DataFrames• Still use the optimizer