Perfect Norikra 2nd Season
Stream Processing Casual Talks #2, 2017/07/27
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.
http://norikra.github.io/
Streaming + SQL
Norikra: Schema-less Stream Processing using SQL
• Server software, written in JRuby, runs on JVM
• Open source software (GPLv2)
• http://norikra.github.io/
• https://github.com/norikra/norikra
Query:
SELECT user.age, COUNT(*) as cnt FROM events.win:time_batch(5 mins)
WHERE current="San Diego" AND attend.$0 AND attend.$1
GROUP BY user.age
Input event:
{"name":"tagomoris", "user":{"age":35, "corp":"LINE", "address":"Tokyo"}, "current":"San Diego", "speaker":true, "attend":[true,true,false, ...]}
Output events:
{"user.age":35,"cnt":5}, {"user.age":36,"cnt":8}, ...
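To make the query semantics concrete, here is a minimal Python sketch of what the example query computes over one window of events. It is not Norikra or Esper code: the helper names `get_path` and `run_batch` and the sample window are assumptions made for illustration, and the 5-minute time window is reduced to a plain list.

```python
from collections import Counter

def get_path(event, path):
    """Resolve a dotted path such as 'user.age' or 'attend.$0' in a nested event."""
    value = event
    for key in path.split("."):
        if key.startswith("$"):
            # '$0', '$1', ... address positions in a list
            index = int(key[1:])
            if not isinstance(value, list) or index >= len(value):
                return None
            value = value[index]
        else:
            if not isinstance(value, dict) or key not in value:
                return None
            value = value[key]
    return value

def run_batch(events):
    """Apply the WHERE and GROUP BY of the example query to one window of events."""
    counts = Counter()
    for ev in events:
        if (get_path(ev, "current") == "San Diego"
                and get_path(ev, "attend.$0") is True
                and get_path(ev, "attend.$1") is True):
            counts[get_path(ev, "user.age")] += 1
    # One output row per group, in first-seen order
    return [{"user.age": age, "cnt": cnt} for age, cnt in counts.items()]

# Hypothetical events that arrived within one window
window = [
    {"name": "tagomoris", "user": {"age": 35}, "current": "San Diego", "attend": [True, True, False]},
    {"name": "a", "user": {"age": 36}, "current": "San Diego", "attend": [True, True]},
    {"name": "b", "user": {"age": 35}, "current": "Tokyo", "attend": [True, True]},
    {"name": "c", "user": {"age": 35}, "current": "San Diego", "attend": [True, False]},
    {"name": "d", "user": {"age": 35}, "current": "San Diego", "attend": [True, True, True]},
]

rows = run_batch(window)
print(rows)
```

Events "b" and "c" are dropped by the WHERE clause, so only three events contribute to the counts.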
How Norikra is Perfect
• Ultra fast bootstrap
• Schema on read
• Handling complex (nested) events
• Dynamic query registration/unregistration
• Simple Web UI
• Data connector: Fluentd
• Extensible: UDF/Listener plugins
• Performance: good enough for small/middle site
Schema on Read
• Query first, Data next
• Query must know what it requires
  • field names, types of fields, ...
• Platform can ingest any data into the processor.
  A query fetches only the events that match its required schema.
[Diagram: a schema-less (mixed) data stream carries events from a billing service and events from an API endpoint. Query A reads the fields subset for query A; query B reads the fields subset for query B.]
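The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `SchemaOnReadRouter` class, the field names, and the type checks are hypothetical and do not reflect how Norikra implements it internally.

```python
def matches(event, required):
    """True if the event carries every required field with the expected type."""
    return all(
        name in event and isinstance(event[name], typ)
        for name, typ in required.items()
    )

class SchemaOnReadRouter:
    """Accepts any event; each query only sees events that satisfy its own needs."""

    def __init__(self):
        self.queries = {}  # query name -> required fields {field name: type}

    def register(self, name, required):
        self.queries[name] = required

    def route(self, event):
        """Return {query name: fields subset} for every query the event satisfies."""
        out = {}
        for name, required in self.queries.items():
            if matches(event, required):
                out[name] = {field: event[field] for field in required}
        return out

router = SchemaOnReadRouter()
router.register("query_a", {"user_id": str, "amount": int})
router.register("query_b", {"path": str, "status": int})

# Mixed events in one stream, with no schema declared up front
billing = router.route({"user_id": "u1", "amount": 300, "plan": "pro"})
api = router.route({"path": "/v1/items", "status": 200, "user_id": "u1"})
other = router.route({"foo": 1})
```

The stream itself never declares a schema; only the registered queries do, which is why queries have to come first.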
Architecture
Norikra Server (on JVM)
Esper Instance (Query Engine)
Type Definition Manager
Output Event Pool
Norikra Engine
RPC Server
mizuno (Jetty + Rack)
Rack RPC Handler
Norikra Client
msgpack-rpc-over-http
For details :)
• Norikra: Stream Processing with SQL
  http://www.slideshare.net/tagomoris/norikra-stream-processing-with-sql
• Norikra: SQL Stream Processing in Ruby
  http://www.slideshare.net/tagomoris/norikra-sql-stream-processing-in-ruby
• Norikra in Action
  http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
• Landscape of Norikra Features
  http://www.slideshare.net/tagomoris/norikra-meetup-features
• Norikra Recent Updates
  http://www.slideshare.net/tagomoris/norikra-recent-updates
Recent Updates
• v1.4.0: Jul 19, 2016
  • Add support for "-D" and "-agentlib" of JVM
  • Update msgpack version
• Previous release v1.3.1: May 7, 2015
  • Explained in "Norikra Recent Updates" slide
User Companies
• LINE Corporation
• Kayac Inc.
• Mercari, Inc.
• (and some/many others)
https://www.slideshare.net/tagomoris/how-to-make-norikra-perfect
Perfect Norikra
• All features of Norikra
  • Including "Ultra fast bootstrap"
• Compatible RPC API w/ original Norikra
• Distributed execution on any scheduler
  • YARN? Mesos? or ...?
• Automatic failover & retry for failures (HA)
• Automated optimization for load balancing
• Dynamic scaling out from 1 to 100 nodes - without any restarts/retries
MAKE Norikra
PERFECT AGAIN
Features for More Perfection
• Loading operator internal states from Batch query engines
• Sharing operator internal states between queries
Stream Processing
• Monitoring, Reporting, Alerting
• Fast recommendation
• Matching behaviors
• and ...
Handling Long Term Data/History
timeline
Website audience data
Jul 24, 2014 Purchase a car
Jul 28, 2017 ....?
Start batch query to read 3~4 years history
Offer a nice bonus to possible customer!
Browser session already expired......
Stream Processing on Long Term Data
timeline
Website audience data: processed continuously
Jul 24, 2014 Purchase a car
Jul 28, 2017 Got a nice bonus offer!
Jul 28, 2017 Got a wrong offer...
Rewrite the query & start it without past data... 3 more years required for test?
Resume/Restart of Queries
• Queries may be stopped/killed for many reasons
  • cluster version up / migration
  • troubles
• Queries may need to be modified at any time
  • wrong logic
  • data schema upgrade
  • new business requirement
What we want:
timeline
Website audience data: processed continuously
Jul 24, 2014 Purchase a car
Jul 28, 2017 Got a nice bonus offer!
Jul 28, 2017 Got a wrong offer...
Rewrite & start the query with past long history
Load "Running" Queries
Load "running" stream query from batch engines!
Submit a stream query
Query the history on batch engines & load the result as intermediate state of stream query
Start to process realtime data
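The three steps above can be sketched as follows. This is a toy Python model, not an existing Norikra feature: the `CountByKey` operator, its `load_state` method, and the shape of the batch result are assumptions made for illustration.

```python
from collections import defaultdict

class CountByKey:
    """A stream operator whose internal state is a per-user event count."""

    def __init__(self):
        self.state = defaultdict(int)

    def load_state(self, batch_rows):
        """Seed the state from (user, count) rows computed by a batch engine."""
        for user, count in batch_rows:
            self.state[user] += count

    def process(self, event):
        """Handle one realtime event and return the updated count for its user."""
        self.state[event["user"]] += 1
        return self.state[event["user"]]

# Step 1: submit a stream query (here: create the operator)
op = CountByKey()

# Step 2: query the history on a batch engine and load the result as state.
# These rows stand in for a hypothetical batch query over past years.
history = [("alice", 3), ("bob", 1)]
op.load_state(history)

# Step 3: start processing realtime data on top of the loaded state
result = op.process({"user": "alice"})
```

After loading, the operator behaves as if it had been running for the whole period covered by the batch result.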
JOINs with Past Data
Submit a stream query w/ JOIN past data
JOIN
Submit a query
Query past data from batch & load it
JOIN
Start to process realtime data w/ JOIN
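A JOIN against past data can be sketched the same way: load the batch result into a lookup table first, then join each realtime event against it. The `HistoryJoin` class and the record fields below are hypothetical and only illustrate the flow.

```python
class HistoryJoin:
    """Inner join between realtime events and past records loaded from a batch engine."""

    def __init__(self, key):
        self.key = key
        self.past = {}  # join key value -> past record

    def load(self, rows):
        """Load the batch query result before realtime processing starts."""
        for row in rows:
            self.past[row[self.key]] = row

    def process(self, event):
        """Return the joined pair, or None when no past record matches."""
        past = self.past.get(event.get(self.key))
        if past is None:
            return None
        return {"event": event, "past": past}

join = HistoryJoin(key="user")

# Hypothetical result of a batch query over old purchase history
join.load([{"user": "u1", "purchased": "car", "date": "2014-07-24"}])

hit = join.process({"user": "u1", "page": "/top"})
miss = join.process({"user": "u2", "page": "/top"})
```

Only the keyed batch result is held in memory, not three years of raw events.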
True Lambda Architecture
• Use just one DSL on both of Stream & Batch
  • SQL!
• Ingest data stream to both of Stream & Storage
• Handle time window intelligently
  • Specify time window out of DSL
  • Write once on batch, Run anywhere :D
Idempotent Operator State
• As a stream operator with realtime data
• As a loaded stream operator with past data
• Serializable operator internal states
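One possible reading of these three points, sketched in Python: an operator should reach the same internal state whether it consumed every event live, or was restored from a serialized state computed elsewhere and then consumed only the remaining events. The `SumOperator` class and the JSON serialization are assumptions for illustration, not a description of an existing implementation.

```python
import json

class SumOperator:
    """Keeps a running sum per key; its state can be dumped and loaded as JSON."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def process(self, event):
        key = event["key"]
        self.state[key] = self.state.get(key, 0) + event["value"]

    def dump(self):
        """Serialize the internal state deterministically."""
        return json.dumps(self.state, sort_keys=True)

    @classmethod
    def load(cls, serialized):
        """Rebuild an operator from a previously dumped state."""
        return cls(json.loads(serialized))

events = [
    {"key": "u1", "value": 1},
    {"key": "u2", "value": 5},
    {"key": "u1", "value": 2},
    {"key": "u2", "value": 1},
    {"key": "u3", "value": 7},
]

# Path 1: a stream operator fed with all events as realtime data
live = SumOperator()
for ev in events:
    live.process(ev)

# Path 2: state for the first three events is computed separately (standing in
# for a batch engine), serialized, loaded, and then the rest arrives as a stream
batch = SumOperator()
for ev in events[:3]:
    batch.process(ev)
resumed = SumOperator.load(batch.dump())
for ev in events[3:]:
    resumed.process(ev)

same_state = live.dump() == resumed.dump()
```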
Sharing Operators between Queries
Query A
Query B
SHARED Operators
Sharing Operators between Queries
history(stream)
history(batch: 3 - 4 years ago)
JOIN
Query A: filter + projection
Query B: filter + projection
Sharing Operators during Updating Query
history(stream)
history(batch: 3 - 4 years ago)
JOIN
Query A: filter + projection
Oops, I found a mistake in Query A!
SHARED Operators
Sharing Operators during Updating Query
history(stream)
history(batch: 3 - 4 years ago)
JOIN
Query A: filter + projection
Query A': filter + projection
I've just added updated query...
Sharing Operators during Updating Query
history(stream)
history(batch: 3 - 4 years ago)
JOIN
Query A': filter + projection
It works! I can remove the older one.
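The update sequence in these slides can be sketched as a shared upstream operator with attachable downstream queries. Everything here is hypothetical (`SharedOperator`, `attach`, `detach`, and the toy JOIN); it only illustrates that the expensive shared part runs once per event and survives a query replacement.

```python
class SharedOperator:
    """An upstream operator (e.g. the history JOIN) whose output feeds many queries."""

    def __init__(self, transform):
        self.transform = transform
        self.subscribers = {}  # query name -> filter + projection function

    def attach(self, name, query):
        self.subscribers[name] = query

    def detach(self, name):
        self.subscribers.pop(name, None)

    def process(self, event):
        joined = self.transform(event)  # computed once, shared by all queries
        return {name: query(joined) for name, query in self.subscribers.items()}

calls = []  # records how often the shared JOIN actually ran

def join_with_history(event):
    calls.append(event)
    # Toy stand-in for a JOIN against loaded history
    return {**event, "bought_car": event["user"] in {"u1"}}

def query_a(row):
    # Mistaken logic: offers the bonus to everyone
    return {"offer_to": row["user"]}

def query_a_fixed(row):
    # Corrected logic: offer only to users who bought a car
    return {"offer_to": row["user"]} if row["bought_car"] else None

shared = SharedOperator(join_with_history)
shared.attach("A", query_a)
first = shared.process({"user": "u2"})

shared.attach("A'", query_a_fixed)   # add the updated query next to the old one
second = shared.process({"user": "u2"})

shared.detach("A")                   # it works, so remove the older one
third = shared.process({"user": "u1"})
```

The shared operator and its state are never restarted during the update, which is what makes query replacement cheap.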
Perfect Stream Processing Engine
• Just same SQL on both of Batch and Stream
• Stream processor which can resume queries using batch query engine results
  • reduces memory usage of JOINs
  • reduces memory usage about historical data
• Stream Processor which can share operators between queries
  • reduces total amount of memory usage
  • makes it possible to restart/update queries anytime, casually
Perfect Norikra
Named
It still has 0 bytes. Stay tuned!
We are hiring! - Treasure Data