Some loose thoughts about the latest buzzwords: streaming computing, real-time processing, and in-memory computing.
Streaming Computing
Some thoughts and technology choices for event-driven processing
Natalino Busa - 29 Aug. 2013
Outline
● Concurrency
● Streaming computing
● Technologies
  ○ Gigaspaces
  ○ Storm
  ○ Akka
● Comparison matrix
● Opportunities
Algorithms: a tribute
Numbers and Algorithms:
9th century Persian Muslim mathematician Abu Abdullah Muhammad ibn Musa Al-Khwarizmi,
whose work built upon that of the 7th century Indian mathematician Brahmagupta.
We owe a lot to these guys!
Why do we need parallelism?
It gets bigger,
It doesn’t get much faster
BUT
We get more cores in a chip.
More cores = more parallelism
We are happy now, right?
Moore’s law
Every 18 months, the number of CPU cores doubles
Another interpretation:
Every 18 months, the number of idle CPU cores doubles
More parallelism
We trade:
Time vs (CPU, Memory, I/O)
Modern applications
Scalability:
Vertical: concurrency (use all the cores, memory and I/O of a given machine)
Horizontal: distribution (use all the machines in the cluster)
High availability:
Fault tolerance at all levels (local, distributed)
(the terminator effect: you can stop it but can’t kill it)
Streaming applications
Performance: Efficient use of resources:
CPU and memory, but also OS threads and sockets
Asynchronous:
event driven, reacts on new data
Distributed:
more machines = more performance
the algorithm is partitioned and/or replicated on the cluster
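One way to picture "partitioned on the cluster" is key-based routing: every event with the same key goes to the same worker, so each worker only holds its own share of the state. The Python sketch below is an illustration under assumptions: the worker count, the sample events and the helper names `partition_for` and `route` are hypothetical and do not come from any of the frameworks discussed in this deck.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 4  # assumed cluster size, for illustration only


def partition_for(key: str, num_workers: int = NUM_WORKERS) -> int:
    """Map a key to a worker index using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def route(events, num_workers: int = NUM_WORKERS):
    """Group (key, value) events by the worker responsible for each key."""
    per_worker = defaultdict(list)
    for key, value in events:
        per_worker[partition_for(key, num_workers)].append((key, value))
    return dict(per_worker)


# Hypothetical sample events: (key, value) pairs.
events = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
routed = route(events)
```

Adding workers spreads the keys over more machines; replication would instead send each event to more than one worker.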
What to increase?
More CPU: It helps when there is
computation involved
More MEMORY: It helps when there is
more state to keep
More I/O: It helps when there are
more messages to transfer
Streaming or batch?
[Diagram: data flows from the source system, through processing in our system, to the target system]
Natalino Busa - 12 Feb. 2013
What differentiates Streaming from Batch?
● Granularity of Data
● Granularity of Processing
Granularity impacts:
Throughput, Latency, and the Cost of the system!
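The difference in granularity can be sketched with one computation, a running total, done either once over the whole data set or updated on every event. This is a minimal Python illustration; the function names and the sample data are assumptions made for the example.

```python
from typing import Iterable, Iterator


def batch_total(events: Iterable[int]) -> int:
    """Batch: wait for the full data set, then produce a single result."""
    return sum(events)


def streaming_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming: emit an updated result after every incoming event."""
    total = 0
    for value in events:
        total += value
        yield total


events = [5, 3, 7, 1]  # hypothetical event payloads
batch_result = batch_total(events)               # one update, available at the end
stream_results = list(streaming_totals(events))  # one update per event
```

Both variants reach the same final value; the streaming one also exposes every intermediate result, at the price of doing work on each event.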
The choice is yours
1000 events/sec (1 KB/event)
running on 100 cores all day long
BATCH: Hadoop
“Wait a day, then process”
86.4 M events = 86.4 GB of data
Latency: 24 hours
Throughput: 1 update/day
STREAMING: Akka
“Do not wait”
Process 1 KB of data every millisecond
Latency: 1 ms
Throughput: 1000 updates/sec
“Both are valid options. It depends on the application domain and the requirements/specs of the target and source systems”
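The figures on this slide follow from simple arithmetic on the stated load of 1000 events/sec at 1 KB per event. A small Python sketch, assuming decimal units (1 GB = 10^6 KB) and treating the 1 ms between events as the per-event time budget:

```python
EVENTS_PER_SEC = 1000
EVENT_SIZE_KB = 1
SECONDS_PER_DAY = 24 * 60 * 60

# Batch: everything that accumulates while waiting one day.
events_per_day = EVENTS_PER_SEC * SECONDS_PER_DAY       # 86,400,000 events
data_per_day_gb = events_per_day * EVENT_SIZE_KB / 1e6   # decimal GB

batch_latency_s = SECONDS_PER_DAY          # wait a day, then process
streaming_latency_s = 1 / EVENTS_PER_SEC   # one event every millisecond

print(f"{events_per_day / 1e6:.1f} M events/day, {data_per_day_gb:.1f} GB/day")
print(f"batch latency: {batch_latency_s / 3600:.0f} h, "
      f"streaming latency: {streaming_latency_s * 1000:.0f} ms")
```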
Mapping it to existing applications
                            Traditional DB systems    Big Data (Hadoop)
Granularity of Data         256 GB                    256 GB
Granularity of Processing   1 CPU                     100 CPUs

                            Traditional mail server   Web application server
Granularity of Data         1 KB                      1 KB
Granularity of Processing   1 CPU                     100 CPUs
Technologies: Gigaspaces
Technologies: Storm
Topology
Supervising
Scaling
Technologies: Akka
Supervising: tree of actors
Topology (static and dynamic actors)
Scaling and distributed processing
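The supervision idea (a parent that restarts a failing child) can be sketched in a few lines. This is not Akka's API: it is a single-threaded Python illustration, and the class names `CounterActor` and `Supervisor`, the "boom" message and the restart strategy are all assumptions made for the example.

```python
from collections import deque


class CounterActor:
    """A tiny actor: private state, a mailbox, one message handled at a time."""

    def __init__(self):
        self.mailbox = deque()
        self.count = 0

    def receive(self, message):
        if message == "boom":
            raise ValueError("simulated failure")
        self.count += 1


class Supervisor:
    """Parent that restarts its child when the child fails on a message."""

    def __init__(self, child_factory):
        self.child_factory = child_factory
        self.child = child_factory()
        self.restarts = 0

    def tell(self, message):
        self.child.mailbox.append(message)

    def run(self):
        while self.child.mailbox:
            message = self.child.mailbox.popleft()
            try:
                self.child.receive(message)
            except Exception:
                # Restart strategy (assumed): fresh child with reset state,
                # pending messages are handed over to the new child.
                pending = self.child.mailbox
                self.child = self.child_factory()
                self.child.mailbox = pending
                self.restarts += 1


supervisor = Supervisor(CounterActor)
for msg in ["a", "b", "boom", "c"]:
    supervisor.tell(msg)
supervisor.run()
```

The failing message is dropped and the child's state is reset, while the rest of the mailbox is still processed. A real actor runtime would run many such actors concurrently and arrange supervisors in a tree.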
Technology matrix
Granularity of Data (rows) vs Granularity of Processing (columns)
              Small processing   Big processing
Small data    Akka               Akka, Gigaspaces
Big data      ?                  Storm
System end-to-end throughput
High     ~10,000 events/sec   Akka
Medium   ~100 events/sec      Storm / Gigaspaces
Low      ~10 events/sec       Scripting languages
Big Data in motion
All three are: distributed, fault-tolerant, streaming
- Storm
  ++ multi-language
  -- not user/admin friendly
  -- slow supervising
  processing elements are JVMs
  ideal when data is coarse grained
- Akka
  ++ high throughput, fine grained actors
  ++ dynamic topologies
  -- low-level, but high performance
  processing elements are small and lightweight
  ideal for millions of transactions per second
- Gigaspaces
  ++ combines memory + application distribution
  -- framework API is not very flexible
  processing elements are JVMs
  ideal for an all-in-one solution, with little customization
Opportunity: Lambda Architecture
Logic layer
Software as a Service
e.g. real-time predictor
from http://www.manning.com/marz/
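The Lambda Architecture is commonly described as a batch layer, a speed layer and a serving layer that merges the two at query time. The Python sketch below illustrates that merge with simple event counts; the function and class names, and the page-view data, are hypothetical and chosen only for the example.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute counts from the full history up to the last batch run."""
    return Counter(key for key, _ in events)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, key):
        self.counts[key] += 1


def query(key, batch_counts, speed_layer):
    """Serving: merge the precomputed batch view with the real-time view."""
    return batch_counts[key] + speed_layer.counts[key]


# Hypothetical data: (key, payload) events already covered by the batch run.
history = [("page_a", 1), ("page_b", 2), ("page_a", 3)]
batch_counts = batch_view(history)

speed = SpeedLayer()
for key in ["page_a", "page_c"]:  # events that arrived since the batch run
    speed.on_event(key)
```

A query such as `query("page_a", batch_counts, speed)` then sees both the history and the most recent events. After the next batch run the speed layer's counts for the covered period can be discarded.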
Opportunity: Batch + Streaming
[Architecture diagram, components:]
Data sources: Data Warehouses (batch), Messaging Buses (streaming)
Processing: Batch Computing, Streaming Computing
Storage: In-Memory Distributed Database
Front End Services: low-latency HTTP API services
Client: HTML5 Client / Responsive App, using FETCH (refresh) and PUSH (SSE, notifications)
Thanks
linkedin:
www.linkedin.com/in/natalinobusa
blog:
www.natalinobusa.com
twitter:
@natalinobusa