Traditionally, big data is mostly read from disks and then processed. However, most big data systems are latency bound, which means the CPU often sits idle waiting for data to arrive. This problem is most prevalent in use cases like graph searches that need to randomly access different parts of a dataset. In-memory computing proposes an alternative model where data is loaded into or kept in memory and processed there, instead of being processed from disk. Although such designs cost more in terms of memory, the resulting systems can be orders of magnitude faster (e.g. 1000X), which can lead to savings in the long run. With rapidly falling memory prices, this cost difference is shrinking by the day. Furthermore, in-memory computing can enable use cases, like ad hoc analysis over a large dataset, that were not possible earlier. This talk will provide an overview of in-memory technology and discuss how WSO2 technologies like complex event processing can be used to build in-memory solutions. It will also provide an overview of upcoming improvements in the WSO2 platform.
In-Memory Computing
Srinath Perera Director, Research
WSO2 Inc.
Performance Numbers (based on Jeff Dean’s numbers)

Operation                            Mem Ops / Sec    If memory access were a second
L1 cache reference                   0.05             1/20th sec
Main memory reference                1                1 sec
Send 2K bytes over 1 Gbps network    200              3 min
Read 1 MB sequentially from memory   2,500            41 min
Disk seek                            1*10^5           27 hours
Read 1 MB sequentially from disk     2*10^5           2 days
Send packet CA->Netherlands->CA      1.5*10^6         17 days

Operation               Speed (MB/sec)
Hadoop Select           3
Terasort benchmark      18
Complex Query (Hadoop)  0.2
CEP                     60
CEP Complex             2.5
SSD                     300-500
Disk                    50-100
Most Big Data Apps are Latency-Bound!
Often, your app wastes CPU cycles waiting for data to arrive
Latency Lags Bandwidth
• Observation in Prof. Patterson’s 2004 keynote
• Bandwidth improves, but latency does not
• The same holds now, and the gap is widening with new systems
Handling Speed Differences in the Memory Hierarchy
1. Caching
   – E.g. processor caches, file cache, disk cache, permission cache
2. Replication
   – E.g. RAID, Content Distribution Networks (CDN), web caches
3. Prediction
   – Predict what data will be needed and prefetch it
   – Trades bandwidth for latency
   – E.g. disk caches, Google Earth
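Technique 1, caching, can be sketched in a few lines. The following is an illustrative LRU cache in plain Python, the same idea a processor cache or file cache uses to keep a small hot working set close to the CPU; it is a toy, not how any real cache is implemented.

```python
# Illustrative LRU cache: keeps the most recently used items, evicts the
# least recently used when capacity is exceeded.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None             # cache miss -> caller must go to slow storage
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" becomes most recently used
cache.put("c", 3)      # capacity exceeded -> evicts "b"
print(cache.get("b"))  # None -- a miss
```

The eviction step is exactly why caching only helps when the working set fits: once live data exceeds capacity, every access becomes a miss.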
The Above Three Do Not Always Work
• Limitations
  – Caching works only if the working set is small
  – Prefetching works only when access patterns are predictable
  – Replication is expensive and limited by the receiving machines
• Suppose you are reading and filtering 10 GB of data (at 6 bytes per record, about 1.7 billion records)
  – ~3 minutes to read the data from disk
  – 35 ms to filter a chunk on my laptop => about 1 minute to process all the data
  – Keeping the data in memory can therefore give about a 30X speedup
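The arithmetic behind that estimate can be checked back-of-envelope style. This sketch assumes ~50 MB/s sequential disk throughput (the Disk row in the table above) and the slides' figure of 35 ms to filter one 60 MB chunk in memory; with those assumptions the disk-bound pipeline is roughly 30X slower than the CPU-bound one.

```python
# Back-of-envelope: time to filter 10 GB when bound by disk reads vs.
# when the data is already in memory. Numbers are the slides' assumptions.
DATA_GB = 10
CHUNK_MB = 60
DISK_MB_PER_SEC = 50          # sequential disk throughput
FILTER_SEC_PER_CHUNK = 0.035  # 35 ms to filter one 60 MB chunk in memory

chunks = DATA_GB * 1024 / CHUNK_MB

disk_read_sec = DATA_GB * 1024 / DISK_MB_PER_SEC  # pipeline bound by disk reads
in_memory_sec = chunks * FILTER_SEC_PER_CHUNK     # pipeline bound by CPU only

speedup = disk_read_sec / in_memory_sec
print(f"disk-bound: {disk_read_sec:.0f} s, "
      f"in-memory: {in_memory_sec:.1f} s, speedup ~{speedup:.0f}X")
```

With these inputs the disk pass takes about 205 seconds versus about 6 seconds of pure filtering, i.e. a speedup in the mid-30s, consistent with the slide's "about 30X".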
Data Access Patterns in Big Data Applications
• Read from disk, process once (basic analytics)
  – Data can be prefetched; batch loading is only about 100 times faster
  – OK if processing time > data read time
• Read from disk, process iteratively (machine learning algorithms, e.g. k-means)
  – Need to load data from disk once and process it repeatedly (e.g. Spark supports this)
• Interactive (OLAP)
  – Queries are random and data may be scattered; once a query has started, the data can be loaded into memory and processed
• Random access (e.g. graph processing)
  – Very hard to optimize
• Realtime access
  – Process data as it arrives
In-Memory Computing
Four Myths
• Myths
  – It is too expensive: a 1TB RAM cluster costs 20-40k (about 1$/GB)
  – It is not durable
  – Flash is fast enough
  – It is only about in-memory DBs
• From Nikita Ivanov’s post
  – http://gridgaintech.wordpress.com/2013/09/18/four-myths-of-in-memory-computing/
Let us look at each big data access pattern and where in-memory computing can make a difference.
Access Pattern 1: Read from Disk, Process Once
• If Tp = 35 ms vs. Td = 1.2 sec with 60 MB chunks, keeping all data in memory gives about a 30X speedup
• However, this benefit shrinks as the computation gets more complex (e.g. sort)
Access Pattern 2: Read from Disk, Process Iteratively
• Very common pattern for machine learning algorithms (e.g. k-means)
• In this case, the advantages are greater
  – If we cannot hold the data fully in memory, we must offload it and read it again on every pass
  – The cost of repeatedly loading and processing is very high, so in-memory computing is much faster
• Spark lets you load the data fully into memory and process it there
Spark
• New programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Has a complete stack of products

file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
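For readers unfamiliar with the Spark operators, the snippet above is a word count. The following plain-Python sketch computes the same thing without a cluster, just to show what flatMap (split lines into words), map (pair each word with 1), and reduceByKey (sum counts per word) add up to; it is illustrative only and has none of Spark's distribution or fault tolerance.

```python
# Plain-Python equivalent of the Spark word-count pipeline above.
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split every line into words, flattening into one list
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with 1 and sum the counts per word
counts = Counter(words)

print(counts["to"])  # 4: each sample line contributes two "to"s
```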
Access Pattern 3: Interactive Queries
• Need to be responsive: < 10 sec
• Harder to predict what data is needed
• Queries tend to be simpler
• Can be made faster by a RAM cloud
  – SAP Hana
  – VoltDB
• With smaller queries, disk may still be OK; Apache Drill is an alternative
VoltDB Story
• The VoltDB team (Michael Stonebraker et al.) observed that 92% of the work in a DB is related to disk
• By building a completely in-memory database cluster, they made it 20x faster!
Distributed Cache (e.g. Hazelcast)
• Stores the data partitioned and replicated across many machines
• Used as a cache that spans multiple machines
• Key-value access
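The partition-and-replicate idea can be sketched as hashing each key to a primary node plus backups. This is a deliberately minimal illustration; real data grids such as Hazelcast use consistent hashing, partition tables, and migration logic, none of which is shown here.

```python
# Minimal sketch: assign each key to a primary node and one backup by
# hashing. Node names and replica count are made up for illustration.
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]
REPLICAS = 2  # primary plus one backup copy

def owners(key: str) -> list:
    """Return the nodes responsible for a key: primary, then backups."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]

print(owners("user:42"))  # two distinct nodes, deterministic per key
```

Because the mapping is deterministic, any client can compute where a key lives without asking a coordinator, which is what makes key-value access across many machines cheap.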
Access Pattern 4: Random Access
• E.g. graph traversal
• This is the hardest use case
• In easy cases there is a small working set that a cache can serve (e.g. checking users against a blacklist); that is not the case for most graph operations like traversal
• In the hard cases, in-memory computing is the only real solution
• Can be 1000x faster or more
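Graph traversal shows why this pattern resists caching and prefetching: a breadth-first search touches neighbor lists in an order that depends on the graph itself, so each hop is effectively a random access. A toy sketch, with a made-up four-node graph held entirely in memory:

```python
# BFS over an in-memory adjacency list. Each graph[node] lookup is a
# random access; on disk, every such hop could cost a seek.
from collections import deque

graph = {  # toy graph, entirely in memory
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def bfs(start: str) -> list:
    """Return nodes in breadth-first visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:  # unpredictable access pattern
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

print(bfs("A"))  # ['A', 'B', 'C', 'D']
```

At memory speed each hop is ~100 ns; at disk-seek speed it is ~10 ms, which is where the 1000x-or-more figure for in-memory graph processing comes from.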
Access Pattern 5: Realtime Processing
• This is already an in-memory technology, using tools like complex event processing (e.g. WSO2 CEP) or stream processing (e.g. Apache Storm)
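The essence of this pattern is that state lives in memory and events are processed as they arrive, never touching disk. A minimal sketch of one such query, a sliding-window average, of the kind a CEP engine or Storm topology would evaluate continuously (the window size and values are made up):

```python
# Sliding-window average over a stream of events, all state in memory.
from collections import deque

WINDOW = 3
window = deque(maxlen=WINDOW)  # keeps only the last WINDOW events

def on_event(value: float) -> float:
    """Ingest one event and return the current windowed average."""
    window.append(value)
    return sum(window) / len(window)

for v in [10, 20, 30, 40]:
    print(on_event(v))  # 10.0, 15.0, 20.0, 30.0
```

Because the state is a fixed-size in-memory window, each event costs a few arithmetic operations, which is how such systems sustain very high event rates.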
Faster Access to Data
• In-memory databases (e.g. VoltDB, MemSQL)
  – Provide the same SQL interface
  – Can be thought of as a fast database
  – VoltDB has been shown to be about 20X faster than MySQL
• Distributed cache
  – Can be integrated as a large cache
Load the Data Set into Memory and Analyze
• Used with interactive and random-access use cases
• Can be up to 1000x faster for some use cases
• Tools
  – Spark
  – Hazelcast
  – SAP Hana
Realtime Processing
• Realtime analytics tools
  – CEP (e.g. WSO2 CEP)
  – Stream processing (e.g. Storm)
• Can generate results within a few milliseconds to seconds
• Can process tens of thousands to millions of events per second
• Not all algorithms can be implemented this way
In-Memory Computing with the WSO2 Platform
Thank You