OLAP scalability

1. Online Analytical Processing of Large Distributed Databases
  • Luc Boudreau, Lead Engineer, Pentaho Corporation

2. "it's all about data movement and operating on that data on the fly"
3. a relational database
4. Relational Databases
  • Static schema
  • Minimized redundancy
  • Referential integrity
  • Transactional
5. Classic RDBMS internals
  • "Shared everything" paradigm
  • Private planner
  • Multiple private processors
  • Multiple private data stores
  [Diagram: a planner / scheduler driving three processors]
6. What RDBMS are for
  • Operational data
  • Normalized models
  • Static typed data
7. What RDBMS are NOT for
  • "Full scan" aggregated computations
  • Multi-dimensional queries (think pivot)
  • Unstructured data
8. OK, so how's that different from Big Data platforms?
9. Big Data - More than a buzzword (although sometimes it's hard to tell...)
  • Big Data is not a product. It is an architecture.
10. Big Data - More than a buzzword (although sometimes it's hard to tell...)
  • A schema-less distributed storage and processing model for data.
11. Big Data
  • Schema-less
  • Programmatic queries: the "Map" of MapReduce
  • High redundancy
  • Distributed processing: the "Reduce" of MapReduce
12. Big Data
  • No referential integrity
  • Non-transactional
  • High latency
13. Classic Big Data internals
  • "Shared nothing" paradigm
  • Push the processing closer to the data
  • The query defines the schema
  [Diagram: a scheduler dispatching to three processors]
14. What Big Data is for
  • Unstructured data: keep everything
  • Distributed file system: great for archiving
  • Data is fixed: only the process evolves
15. What Big Data is for
  • Ludicrous amounts of data: keep everything, remember?
  • Made on the cheap: each processing unit is commodity hardware
16. What Big Data is NOT for
  • Low latency applications: arbitrary exploration of the data is close to impossible
  • End users: writing code is easy. Writing good code is hard.
  • Replacing your operational DB
17. Some more limitations
  • No structured query language: exploration is tedious
  • Accuracy & exactitude: the burden is put on the end user / query designer
  • No query optimizer: cannot optimize at runtime. Does exactly what you tell it to.
18. why is this so similar to NoSQL?
19. First, defining NoSQL...
  • NoSQL: the thing named after what it lacks, which has as many definitions as there are products (and which usually turns out to be some sort of key-value store)
20. Why "NoSQL"? Why all the hate?!
  • Historical reasons
  • Wrong technological choices
  • Blind faith in RDBMS scalability
  • General wishful thinking and voodoo magic
21. Why "NoSQL"? Why all the hate?!
  • "SQL" itself was never the issue
  • NoSQL projects are implementing SQL-like query languages
22. bringing structured queries to Big Data
23. Current efforts
  • Straight SQL implementations
      Greenplum: straight SQL on top of Big Data
      Hive JDBC: a hybrid of DSL & SQL
  • The Splunk approach: SQL with missing columns
  • Runtime query optimizers
      Optiq framework: SQL with Big Data federated sources
24. isn't there something better than SQL for analytics?
25. Online Analytical Processing (OLAP)
26. Widely used. Little known.
  • Your favorite corporate dashboards
  • Google Analytics & other ad-hoc tools
27. Analytics centric language
  • Multidimensional Expressions (MDX): a powerful query language for analytics
  • Forget about rows and columns: as many axes as you need
  • Slice & dice: start from everything, then progressively focus only on relevant data
28. Business domain driven
  • Hierarchical view of a multidimensional universe
29. An example
  • What are my total sales for the current year, per month, for male customers?
    with member [Measures].[Accumulated Sales] as
      Sum(YTD(), [Measures].[Store Sales])
    select
      {[Measures].[Accumulated Sales]} on columns,
      {Descendants([Time].[1997], [Time].[Month])} on rows
    from [Sales]
    where ([Customer].[Gender].[M])
30. how does that work?
31. Analytics data modeling
  • A denormalized model for performance: the data is modeled for read operations, not writes
  • High redundancy: because sometimes more is better
32. The Star model
33. The Snowflake model
34. different OLAP servers. Different beasts.
35. Relational OLAP (ROLAP)
  • Backed by a relational database: think of an MDX to SQL bridge. The aggregated data can be cached in memory or on disk.
  • Relies heavily on the RDBMS performance: figures out the proper optimizations at runtime
36. Multidimensional OLAP (MOLAP)
  • Loads everything in RAM
  • Relies on an efficient ETL platform
37. Other OLAP
  • On-disk aggregated data files: think SAS. Cubes are compiled into data files on disk.
  • Simple bridges: convert MDX straight to SQL, with limited support of MDX syntax.
38. how do they compare?
39. (there are no straight answers, sorry)
40. Where the data lives matters
    Operation                                 Latency (ns)
    L1 cache reference                        0.5
    Branch mispredict                         5
    L2 cache reference                        7
    Mutex lock/unlock                         25
    Main memory reference                     100
    Compress 1K bytes w/ cheap algorithm      3 000
    Send 2K bytes over 1 Gbps network         20 000
    Read 1 MB sequentially from memory        250 000
    Round trip within same datacenter         500 000
    Disk seek                                 10 000 000
    Read 1 MB sequentially from disk          20 000 000
    Send packet CA -> Netherlands -> CA       150 000 000
41. Optimizing for CPU
  • Java NIO blocks: use extremely compact chunks of 64 bits.
  • Primitive types: use "int" instead of "Integer"
  • BitKeys: because they are naturally CPU friendly
42. Optimizing for memory
  • Hard limits on the heap space: must pay attention to the total memory usage.
  • Inherent limitations: there can only be so many individual pointers on the heap.
43. Optimizing for networking
  • Payload optimization: batching, deltas.
  • Manageability: turning nodes on & off.
44. Optimizing for disk
  • Concurrent access: must carefully manage disk IO.
  • Inherently slooooow
45. how to deal with these issues?
46. a scalable indexing strategy
47. Cache indexing
  • Linear performance is not good enough: as N grows, full scanning takes O(N)
  • The rollup combinatorial problem: as the cache grows, reuse becomes tedious
48. The rollup combinatorial problem
    Gender   Country   Sales
    M        USA       7
    M        CANADA    8
    F        USA       4
    F        CANADA    2

    Country  Sales
    USA      11
    CANADA   10
49. The rollup combinatorial problem
  [Diagram: several cached segments of differing dimensionality (Gender x Country -> Sales, City -> Sales, Age x Country -> Sales, Age x Country -> Cost) next to a Country -> Sales table filled with question marks: which segments can be reused to compute it?]
50. PoSet & BitKeys
  • Represent the levels / values as bitkeys: because bitkeys are fast, remember?
  • The PartiallyOrderedSet: a hierarchical hash set where elements might or might not be related to one another.
51. PoSet & BitKeys
  • An example application: finding all primes in a set of integers
52. a scalable threading model
53. Concurrent cache access
  • Usage of phases: peek -> load -> rinse & repeat
  • A scalable threading model: thread safety without locks and blocks
54. A scalable threading model
  • Do things once. Do them right: the actor pattern
55. a scalable cache management strategy
56. Operating by deltas
  • All part of a whole: implicit relation between the dimensions
  • Why deltas are necessary: reducing IO
57. Cache management
  • A data block is a complex object
    Schema:[FoodMart]
    Checksum:[9cca66327439577753dd5c3144ab59b5]
    Cube:[Sales]
    Measure:[Unit Sales]
    Axes:[
      {time_by_day.the_year=(*)}
      {time_by_day.quarter=(Q1, Q2)}
      {product_class.product_family=(Bread, Soft Drinks)}]
    Excluded Regions:[
      {time_by_day.quarter=(Q1)}
      {time_by_day.the_year=(1997)}]
    Compound Predicates:[]
    ID:[9c8ba4ec39678526f4100506994c384183cd205d19dd142eae76a9fb1d74cab7]
58. a scalable sharing strategy
59. Shared Caches
  • OLAP and key-value stores don't like each other: OLAP requires a complex key. A hash is insufficient.
  • Remember the "deltas" strategy? Partially invalidating a block of data would break the hash
60. Data grids & OLAP
  • Well suited for OLAP caches: supports "rich" keys
  • Distributed and redundant: if a node goes offline, the cache data is not lost
  • In-memory grids are fast: multiplies the available heap space
61. a case study
62. Advertising data analysis
  • Interactive behavioral targeting of advertising in real time
63. Advertising data analysis
  • Low latency: the end users don't want to wait for MapReduce jobs
  • Scalability is a huge factor: we're talking petabytes of data here
64. Advertising data analysis
  • Queries are not static: we can't tell upfront what will be computed
  • Deployed in datacenters worldwide: the hashing strategy must allow "smart" data distribution
  • Almost all open source
65. [Architecture diagram: client app (olap4j), load balancer, OLAP servers (XML/A, olap4j) with an OLAP cache, analytical DB, ETL processes fed from logs and a Big Data store, message queue, monitoring & management, ETL designer]
66. A query
  • UI sends MDX to a SOAP service.
  • load balancer dispatches the query.
  • OLAP layer uses its data sources and aggregates.
  • query is answered
  [Diagram: client app (olap4j), load balancer, OLAP (XML/A, olap4j) with cache, analytical DB]
67. An update - Strategy #1
  • the ETL process updates the analytical DB.
  • a cache delta is sent to a message queue.
  • OLAP processes the message.
  • OLAP uses its index to spot the regions to invalidate.
  • aggregated cache is updated incrementally.
  [Diagram: logs and Big Data store feeding ETL, analytical DB, message queue, OLAP and its cache]
68. An update - Strategy #2
  • ETL updates the analytical DB.
  • ETL acts directly on the OLAP cache.
  • OLAP processes events from its cache.
  • OLAP updates its index
  [Diagram: logs and Big Data store feeding ETL, analytical DB, OLAP and its cache]
69. a stack built on open standards (get ready, the next slide will hurt your brains)
70. [Stack diagram, top to bottom: Java client apps with olap4j-xmla; HTTP (XMLA) through a load balancer; three olap4j servers, each layered as olap4j, jdbc, JDBC connection pool, jdbc, olap4j impl, Mondrian server manager, Mondrian cache manager, infinispan; UDP (Hot Rod) down to an infinispan data grid]
71. the UI
72. Yahoo! Cocktails
  • A Node.js implementation: runs on ManhattanJS
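
The bitkey idea from the "Cache indexing" and "PoSet & BitKeys" slides can be illustrated with a minimal Java sketch. It is not Mondrian's implementation: the class name `RollupSketch`, the bit positions and the `Row` type are all hypothetical, and a plain `long` stands in for a real bitkey. It shows only the superset test that decides whether a cached segment can be rolled up to answer a coarser query, then performs the Gender x Country to Country rollup from the "rollup combinatorial problem" slide. It does not implement the PartiallyOrderedSet index that avoids scanning every segment.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RollupSketch {
    // Hypothetical bit positions, one per dimension level (an assumption for this sketch).
    static final long GENDER  = 1L << 0;
    static final long COUNTRY = 1L << 1;
    static final long AGE     = 1L << 2;
    static final long CITY    = 1L << 3;

    /** One cached cell of a Gender x Country segment. */
    static final class Row {
        final String gender;
        final String country;
        final int sales;

        Row(String gender, String country, int sales) {
            this.gender = gender;
            this.country = country;
            this.sales = sales;
        }
    }

    /**
     * A cached segment can answer a request by rollup only if every
     * requested level is also present in the segment (superset test).
     */
    static boolean canRollUp(long segmentKey, long requestedKey) {
        return (segmentKey & requestedKey) == requestedKey;
    }

    /** Sums the sales of a Gender x Country segment down to Country. */
    static Map<String, Integer> rollUpToCountry(List<Row> rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Row row : rows) {
            totals.merge(row.country, row.sales, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        long genderByCountry = GENDER | COUNTRY;
        long ageByCountry = AGE | COUNTRY;
        long cityOnly = CITY;

        System.out.println(canRollUp(genderByCountry, COUNTRY)); // true
        System.out.println(canRollUp(ageByCountry, COUNTRY));    // true
        System.out.println(canRollUp(cityOnly, COUNTRY));        // false

        List<Row> segment = Arrays.asList(
            new Row("M", "USA", 7), new Row("M", "CANADA", 8),
            new Row("F", "USA", 4), new Row("F", "CANADA", 2));
        System.out.println(rollUpToCountry(segment)); // {USA=11, CANADA=10}
    }
}
```

The superset test is a single AND and compare per segment, which is why bitkeys are CPU friendly. The scan over candidate segments stays linear in this sketch, which is the cost the slides argue a partially ordered index is meant to avoid.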