To Share or Not to Share?





<ul><li><p>To Share or Not to Share? Ryan Johnson. Nikos Hardavellas, Ippokratis Pandis, Naju Mancheril, Stavros Harizopoulos**, Kivanc Sabirli, Anastasia Ailamaki, Babak Falsafi</p><p>PARALLEL DATA LABORATORY, Carnegie Mellon University</p><p>**HP LABS</p></li><li><p>Motivation For Work Sharing. Query: What is the highest undergraduate GPA? Query: What is the average GPA in the ECE dept.?</p></li><li><p>Motivation For Work Sharing: Many queries in system; similar requests; redundant work. Work sharing: detect redundant work; compute results once and share. Big win for I/O, uniprocessors: 2x speedup for TPC-H queries [hariz05].</p></li><li><p>Work Sharing on Modern Hardware. [Chart: speedup due to WS vs. number of queries, 1 CPU.] Work sharing can hurt performance!</p></li><li><p>Contributions. Observation: work sharing can hurt performance on parallel hardware. Analysis: develop intuitive analytical model of work sharing; identify trade-off between total work and critical path. Application: model-based policy outperforms static ones by up to 6x.</p></li><li><p>Outline: Introduction; Part I: Intuition and Model; Part II: Analysis and Experiments.</p></li><li><p>Challenges of Exploiting Work Sharing. Independent execution only? Load reduction from work sharing can be useful. Work sharing only? Indiscriminate application can hurt performance. To share or not to share? System- and workload-dependent; adapt decisions at runtime.</p><p>Must understand work sharing to exploit it fully.</p></li><li><p>Work Sharing vs. Parallelism: Independent Execution. [Figure: P = 4.33.]</p></li><li><p>Work Sharing vs. 
Parallelism: Shared Execution. [Figure: sharing penalty; P = 2.75 vs. P = 4.33.] Critical path now longer. Total work and critical path both important.</p></li><li><p>Understanding Work Sharing. Performance depends on two factors:</p><p>Work sharing presents a trade-off: it reduces total work but potentially lengthens the critical path.</p><p>Balance both factors or performance suffers.</p></li><li><p>Basis for a Model: Closed system (consistent high load; throughput computing; assumed in most benchmarks; fixed number of clients). Little's Law governs throughput: higher response time = lower throughput. Total work not a direct factor! Load reduction secondary to response time.</p></li><li><p>Predicting Response Time. Case 1: Compute-bound.</p><p>Case 2: Critical path-bound.</p><p>Larger bottleneck determines response time. Model provides U and Pmax.</p></li><li><p>An Analytical Model of Work Sharing: Sharing helpful when Xshared &gt; Xalone. U = requested utilization; Pmax = longest pipe stage.</p></li><li><p>Outline: Introduction; Part I: Intuition and Model; Part II: Analysis and Experiments.</p></li><li><p>Experimental Setup. Hardware: Sun T2000 Niagara with 16 GB RAM; 8 cores (32 threads); Solaris processor sets vary effective CPU count. Cordoba: staged DBMS; naturally exposes work sharing; flexible work sharing policies. 1 GB TPC-H dataset. Fixed client and CPU counts per run.</p></li><li><p>Model Validation: TPC-H Q1. Avg/max error: 5.7% / 22%.</p></li><li><p>Model Validation: TPC-H Q4. Behavior varies with both system and workload.</p></li><li><p>Exploring WS vs. Parallelism: Work sharing splits query into three parts. Independent work: per-query, parallel (contributes to total work). Serial work: per-query, serial (contributes to critical path). Shared work: computed once, free after first query. Example:</p></li><li><p>Exploring WS vs. Parallelism. [Chart: benefit from work sharing vs. number of shared queries, for 4 to 32 CPUs.] Behavior matches previously published results.</p></li><li><p>Exploring WS vs. Parallelism. [Chart build: 4 and 8 CPUs.]</p></li><li><p>Exploring WS vs. Parallelism. [Chart build: 4, 8, 16, and 32 CPUs.]</p></li><li><p>Exploring WS vs. Parallelism. [Chart: benefit from work sharing vs. shared queries, 4 to 32 CPUs.] More processors shift bottleneck to critical path.</p></li><li><p>Performance Impact of Serial Work: Critical path quickly becomes major bottleneck. [Chart: impact of serial work at 32 CPUs; benefit from work sharing vs. shared queries, 0% serial work shown.]</p></li><li><p>Model-guided Work Sharing. Integrate predictive model into Cordoba: predict benefit of work sharing for each new query; consider multiple groups of queries at once (shorter critical path, increased parallelism). Experimental setup: profile run with 2 clients, 2 CPUs; extract model parameters with profiling tools; 20 clients submit a mix of TPC-H Q1 and Q4; compare against always-share and never-share policies.</p></li><li><p>Comparison of Work Sharing Strategies: Model-based policy balances critical path and load. [Chart: queries/min at 2 CPUs for always-share, model-guided, and never-share policies.]</p></li><li><p>Related Work: Many existing work sharing schemes; identification occurs at different stages in the query's lifetime, from early to late; all allow pipelined query execution. Materialized views [rouss82] (schema design); multiple query optimization [roy00] (query compilation); staged DBMS [hariz05] (query execution); synchronized scanning [lang07] (buffer pool access). Model describes all types of work sharing.</p></li><li><p>Conclusions: Work sharing can hurt performance on highly parallel, memory-resident machines. Intuitive analytical model captures behavior: trade-off between load reduction and critical path. Model-guided work sharing highly effective: outperforms static policies by up to 6x.</p></li><li><p>References: [hariz05] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A Simultaneously Pipelined Relational Query Engine. In Proc. SIGMOD, 2005. [lang07] C. Lang, B. Bhattacharjee, T. Malkemus, S. Padmanabhan, and K. Wong. Increasing Buffer-Locality for Multiple Relational Table Scans through Grouping and Throttling. 
In Proc. ICDE, 2007. [rouss82] N. Roussopoulos. View Indexing in Relational Databases. In ACM TODS, 7(2):258-290, 1982. [roy00] P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and Extensible Algorithms for Multi Query Optimization. In Proc. SIGMOD, 2000.</p><p>Query processing workloads consist of many queries which the database system processes in parallel. These queries can be quite similar and often request redundant work. Ideally, the system could detect these redundant requests, compute each result only once, and share it among multiple queries. This technique is known as work sharing. Consider the following example: suppose we want to determine the highest GPA in a university database. The database system first decomposes our query into a tree of database operations known as a query plan. In this case the plan calls for a scan of the students table to select undergraduate students, followed by an aggregation operation to compute the maximum of their GPAs. Similarly, if we want to know the average GPA of the ECE department, the database engine must scan both the student and department tables, joining them together to identify students in the engineering department, before computing the average GPA. Both queries access the Student table; if they arrive in the system at the same time, it would be nice to detect the redundant request and perform it only once. Intuitively, work sharing in this manner should improve the performance of a loaded system by freeing up resources such as CPU and disk I/O for other uses.</p><p>Work sharing is highly effective for single-processor and I/O-bound systems. Prior research has shown that work sharing can double the performance of uniprocessors running TPC-H queries. To recap, query processing workloads consist of many similar requests that contain redundant work. The system can potentially detect this common work, perform it just once, and share the results among several consumers. 
This graph shows the results of work sharing TPC-H Q6 on a uniprocessor. TPC-H Q6 closely resembles the maximum-GPA example query we discussed a moment ago, and computes some aggregates over the output of a single table scan. We ran TPC-H Q6 on our test system and computed the speedup achieved from applying work sharing at the scan operator. The number of shared queries varies along the x-axis. As expected, work sharing improves performance by nearly 2x. However, computer architecture has seen a recent shift away from uniprocessors: all chip manufacturers currently offer chips containing multiple cores. For example, Intel's Core 2 platform currently features two or four cores, while Sun's Niagara features 8. For the foreseeable future, it appears that Moore's Law will deliver twice as many cores per chip each generation, with only modest improvements in single-threaded performance. In addition, the enormous main memory capacities of high-end machines allow many database workloads to become memory-resident, thus eliminating the I/O bottleneck. Because these two shifts impact performance in the same way as work sharing, we must re-evaluate work sharing in the new hardware landscape. Running the same TPC-H query as before on eight processors paints a very different picture. Rather than improving performance, work sharing drops speedup by a factor of seven, achieving only one fourth of the throughput achieved by independent execution. From the raw performance numbers we find that 8 processors achieve only slightly higher throughput than one. Examples like this one indicate that blindly applying work sharing can hurt performance significantly. This talk focuses on understanding the nature of work sharing so that we can avoid performance pitfalls like the one illustrated here.</p><p>Our work makes three main contributions. First, we present the unexpected observation that work sharing can lower performance on parallel hardware. 
Second, we develop an intuitive analytical model of work sharing which predicts the effect of work sharing given the current system state. The model identifies a trade-off between load reduction and critical path length which must be balanced to achieve maximum performance. Finally, we integrate the model into a work sharing policy and demonstrate how it improves performance by as much as 6x over static policies. The rest of the talk proceeds as outlined. The first section presents a high-level view of our analytical model as well as the intuition behind it. We then use the model to gain insight into the nature of work sharing and show how a database engine can use the model to apply work sharing intelligently. Based on the results we saw earlier, it might be tempting to stick with independent execution in order to avoid the quirks of work sharing. This is not ideal, however, because the load reduction offered by work sharing can be extremely useful. Unfortunately, we have also seen that indiscriminate work sharing can hurt performance. As we will see, the impact of work sharing is dynamic and depends on both the system and the current workload. An effective work sharing policy will need to adapt to changing conditions at runtime in order to achieve the highest performance. This kind of adaptive decision-making will only be successful if we understand the trade-offs that underlie work sharing. To illustrate the nature of work sharing, consider the pipelined execution of our two example queries. This figure depicts an execution timeline of the two queries executing in parallel. Each colored rectangle represents one unit of work performed by the corresponding database operator. Execution begins on the left with table scans, then progresses to the right as the pipeline fills and other operators begin working. When the scans complete on the far right, the pipeline empties and the query completes execution. 
Once the pipeline fills, these two queries together utilize about 4.5 processors. Assuming a parallel system with, say, 8 processors, the pipeline stages execute in parallel and the queries achieve the minimum response times shown by the bars. During parallel execution the redundant table scan of each query lies on the critical path of its pipeline; the operation saturates its processor, forcing other, faster pipeline stages to wait for it. On a uniprocessor, on the other hand, all operations must time-share a single processor, and the total work requested by the two queries determines the overall response times.</p><p>Notes: The aggregate operation requires only half as much work as the scan, and achieves only 50% utilization of its processor. Similarly, one of the scans requires only one third of its processor time.</p><p>When the system applies work sharing, the two queries share their common file scan and the redundant work from Query 2's scan is eliminated, as shown. In a uniprocessor system, this improves response time by 15%. However, in addition to the actual processing, each pipeline stage must also copy its results into the appropriate output buffer after applying selections and projections as necessary. As a result, though work sharing eliminates 2/3 of the work in the redundant scan, the remaining 1/3 of the scan's cost gets added onto the thread performing Q1's scan. A single thread must now pay the cost of sending results to both consumers, where that cost used to be split across two threads operating in parallel. Because the scan already lay on the critical path of the query, the extra work aggravates the bottleneck and delays both queries by the amounts shown in yellow. Looking at it another way, work sharing eliminated 60% of a processor's worth of work, but reduced available parallelism by 1.6 processors. 
The penalty occurs because we shifted work onto the critical path of the query.</p><p>In order to avoid unexpected performance loss from work sharing, we need a way to predict changes to the critical path in addition to the load reduction work sharing provides.</p><p>Based on the previous example, we hypothesize that performance will be a function of both total utilization and critical path length. Increasing either factor will tend to decrease throughput by some unknown amount. Work sharing then presents a trade-off. While it excels at reducing the total work in the system, it may also increase critical path lengths. If the cost of the increased critical path outweighs the benefits of load reduction, performance will suffer. In order to determine the relationship between work, critical path, and performance, it will be helpful to consider some of the characteristics of our workload. We assume a closed system, which features consistent, high load and focuses on throughput as the primary performance metric. Closed systems fix some number of clients which repeatedly submit requests back-to-back. This characterizes the state of the system during peak load, when there is always another request waiting and throughput is critical; throughput is less of a concern during off-peak operation because a system with idle resources can process individual requests quickly. The vast majority of database benchmarks, including TPC-H and TPC-C, assume a closed system. Given a closed system, Little's Law states that throughput will be inversely proportional to the average query response time. In other words, given the average query response time, we can use Little's Law to compute the average throughput of the system as a whole. Interestingly, the amount of work in the system does not appear anywhere in Little's Law. High throughput comes from moving queries through the system as quickly as possible, and any delay in one query holds up those in line behind it. 
Because of this, if response time increases, throughput decreases, regardless of how much work might have been eliminated. Load reduction becomes a secondary concern for performance: it is useful only when it improves response time by freeing up overloaded resources....</p></li></ul>
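The closed-system reasoning in the talk (Little's Law, plus the larger of the compute-bound and critical-path-bound bottlenecks) can be sketched in Python. The function names, the three-way work split, and all numbers below are illustrative assumptions for this sketch, not the paper's calibrated model:

```python
# Hedged sketch of the talk's intuition. In a closed system with zero
# think time, Little's Law gives throughput X = N / R, and response
# time R is set by the larger of two bottlenecks: total compute demand
# spread across the CPUs, or the longest serial pipeline stage (Pmax).

def response_time(total_work, p_max, n_cpus):
    """Larger bottleneck wins: compute-bound vs. critical-path-bound."""
    return max(total_work / n_cpus, p_max)

def throughput(n_clients, total_work, p_max, n_cpus):
    """Little's Law for a closed system: X = N / R."""
    return n_clients / response_time(total_work, p_max, n_cpus)

def share_helps(n, indep, serial, shared, n_cpus, n_clients):
    """Illustrative sharing decision for n queries with a common scan.

    Per-query work is split (as on the 'Exploring WS' slide) into
    independent (parallel), serial, and shareable parts; all values
    here are invented parameters, not measured ones.
    """
    # Independent execution: every query pays for everything itself.
    w_alone = n * (indep + serial + shared)
    p_alone = indep + serial + shared        # one query's critical path
    # Shared execution: shareable work paid once, but the per-query
    # serial work (result propagation) piles onto the sharing thread.
    w_shared = shared + n * (indep + serial)
    p_shared = shared + n * serial           # lengthened critical path
    x_alone = throughput(n_clients, w_alone, p_alone, n_cpus)
    x_shared = throughput(n_clients, w_shared, p_shared, n_cpus)
    return x_shared > x_alone                # share only if it pays off
```

With these toy parameters the model reproduces the talk's qualitative result: on one CPU the load reduction dominates and sharing wins, while on many CPUs the lengthened critical path dominates and independent execution wins.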
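The GPA motivation from the talk can be made concrete with a small sketch: one shared scan feeds both consumers, so the Student table is read once instead of twice. The table contents and helper names are invented for illustration; this is not Cordoba or QPipe code:

```python
# Illustrative work-sharing sketch: two queries share a single scan.
# Q1: highest undergraduate GPA.  Q2: average GPA in the ECE dept.
# (Both touch the Student table, so the scan is the redundant work.)
students = [
    {"name": "Ann", "dept": "ECE", "level": "ugrad", "gpa": 3.9},
    {"name": "Bob", "dept": "CS",  "level": "ugrad", "gpa": 3.4},
    {"name": "Eve", "dept": "ECE", "level": "grad",  "gpa": 3.7},
]

def scan(table, counter):
    """Single shared scan; counts rows read and yields each row once."""
    for row in table:
        counter[0] += 1
        yield row

reads = [0]
ugrad_gpas, ece_gpas = [], []
for row in scan(students, reads):     # one pass serves both consumers
    if row["level"] == "ugrad":
        ugrad_gpas.append(row["gpa"])
    if row["dept"] == "ECE":
        ece_gpas.append(row["gpa"])

max_ugrad = max(ugrad_gpas)                  # 3.9
avg_ece = sum(ece_gpas) / len(ece_gpas)      # (3.9 + 3.7) / 2 = 3.8
```

Independent execution would have read the table twice (six row reads instead of three); the shared scan halves the scan work, which is exactly the load reduction the talk weighs against the lengthened critical path.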

