Debunking Common Myths in Stream Processing

  • Published on
    06-Jan-2017

Transcript

<p>Kostas Tzoumas, @kostas_tzoumas</p>
<p>Big Data Ldn, November 4, 2016</p>
<p>Stream Processing with Apache Flink</p>
<p>Debunking Some Common Myths in Stream Processing</p>
<p>Original creators of Apache Flink</p>
<p>Providers of the dA Platform, a supported Flink distribution</p>
<p>Outline</p>
<p>What is data streaming</p>
<p>Myth 1: The throughput/latency tradeoff</p>
<p>Myth 2: Exactly once not possible</p>
<p>Myth 3: Streaming is for (near) real-time</p>
<p>Myth 4: Streaming is hard</p>
<p>The streaming architecture</p>
<p>Reconsideration of data architecture</p>
<p>Better app isolation</p>
<p>More real-time reaction to events</p>
<p>Robust continuous applications</p>
<p>Process both real-time and historical data</p>
<p>[Diagram labels: app state (three instances), event log, query service]</p>
<p>What is (distributed) streaming</p>
<p>Computations on never-ending streams of data records (events)</p>
<p>Stream processor distributes the computation in a cluster</p>
<p>[Diagram labels: "Your code" (four instances)]</p>
<p>What is stateful streaming</p>
<p>Computation and state</p>
<p>E.g., counters, windows of past events, state machines, trained ML models</p>
<p>Result depends on history of stream</p>
<p>Stateful stream processor gives the tools to manage state: recover, roll back, version, upgrade, etc.</p>
<p>[Diagram labels: "Your code", state]</p>
<p>What is event-time streaming</p>
<p>Data records associated with timestamps (time series data)</p>
<p>Processing depends on timestamps</p>
<p>Event-time stream processor gives you the tools to reason about time</p>
<p>E.g., handle streams that are out of order</p>
<p>Core feature is watermarks: a clock to measure event time</p>
<p>[Diagram labels: "Your code", state, t3, t1, t2, t4, t1-t2, t3-t4]</p>
<p>What is streaming</p>
<p>Continuous processing on data that is continuously generated</p>
<p>I.e., pretty much all big data</p>
<p>It's all about state and time</p>
<p>Debunking some 
common stream processing myths</p>
<p>Myth 1: Throughput/latency tradeoff</p>
<p>Myth 1: you need to choose between high throughput or low latency</p>
<p>Physical limits</p>
<p>In reality, the network determines both the achievable throughput and latency</p>
<p>A well-engineered system achieves these limits</p>
<p>Flink performance</p>
<p>10s of millions of events per second in 10s of nodes</p>
<p>Scaled to 1000s of nodes</p>
<p>With latency in single-digit milliseconds</p>
<p>Myth 2: Exactly once not possible</p>
<p>Exactly once: under failures, system computes result as if there was no failure</p>
<p>In contrast to:</p>
<p>At most once: no guarantees</p>
<p>At least once: duplicates possible</p>
<p>Exactly once state versus exactly once delivery</p>
<p>Myth 2: Exactly once state not possible/too costly</p>
<p>Transactions</p>
<p>Exactly once is transactions: either all actions succeed or none succeed</p>
<p>Transactions are possible</p>
<p>Transactions are useful</p>
<p>Let's not start eventual consistency all over again</p>
<p>Flink checkpoints</p>
<p>Periodic asynchronous consistent snapshots of application state</p>
<p>Provide exactly-once state guarantees under failures</p>
<p>End-to-end exactly once</p>
<p>Checkpoints double as transaction coordination mechanism</p>
<p>Source and sink operators can take part in checkpoints</p>
<p>Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates</p>
<p>[Diagram label: transactional sinks]</p>
<p>State management</p>
<p>Checkpoints triple as state versioning mechanism (savepoints)</p>
<p>Go back and forth in time while maintaining state consistency</p>
<p>Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests</p>
<p>Myth 3: Streaming and real time</p>
<p>Myth 3: streaming and real-time are synonymous</p>
<p>Streaming is a new model</p>
<p>Essentially, state and time</p>
<p>Low latency/real time is the icing on the cake</p>
<p>Low latency and high latency streams</p>
<p>[Figure: timeline from 2016-3-1 12:00 am to 2016-3-12 3:00 am, with labels: partition, Stream (low latency), Batch (bounded stream), Stream (high latency)]</p>
<p>Robust continuous applications</p>
<p>Accurate computation</p>
<p>Batch processing is not an accurate computation model for continuous data</p>
<p>Misses the right concepts and primitives</p>
<p>Time handling, state across batch boundaries</p>
<p>Stateful stream processing is a better model</p>
<p>Real-time/low-latency is the icing on the cake</p>
<p>Myth 4: How hard is streaming?</p>
<p>Myth 4: streaming is too hard to learn</p>
<p>You are already doing streaming, just in an ad hoc way</p>
<p>Most data is unbounded and the code changes slower than the data</p>
<p>This is a streaming problem</p>
<p>It's about your data and code</p>
<p>What's the form of your data?</p>
<p>Unbounded (e.g., clicks, sensors, logs), or</p>
<p>Bounded (e.g., ???*)</p>
<p>What changes more often?</p>
<p>My code changes faster than my data</p>
<p>My data changes faster than my code</p>
<p>* Please help me find a great example of naturally bounded data</p>
<p>It's about your data and code</p>
<p>If your data changes faster than your code you have a streaming problem</p>
<p>You may be solving it with hourly batch jobs depending on someone else to create the hourly batches</p>
<p>You are probably living with inaccurate results without knowing it</p>
<p>It's about your data and code</p>
<p>If your code changes faster than your data you have an exploration problem</p>
<p>Using notebooks or other tools for quick data exploration is a good idea</p>
<p>Once your code stabilizes you will have a streaming problem, so you might as well think of it as such from the beginning</p>
<p>Flink in the real world</p>
<p>Flink community</p>
<p>&gt; 240 contributors, 95 contributors in Flink 1.1</p>
<p>42 meetups around the world with &gt; 15,000 members</p>
<p>2x-3x growth in 2015, similar in 2016</p>
<p>Powered by Flink</p>
<p>Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business 
process monitoring.</p>
<p>King, the creator of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.</p>
<p>Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.</p>
<p>Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.</p>
<p>See more at flink.apache.org/poweredby.html</p>
<p>30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily</p>
<p>Complex jobs of &gt; 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees</p>
<p>Largest job has &gt; 20 operators, runs on &gt; 5000 vCores in 1000-node cluster, processes millions of events per second</p>
<p>Flink Forward 2016</p>
<p>Current work in Flink</p>
<p>Ongoing Flink development</p>
<p>[Diagram labels: Connectors, Session Windows, (Stream) SQL, Library enhancements, Metric System, Operations, Ecosystem, Application Features, Metrics &amp; Visualization, Dynamic Scaling, Savepoint compatibility, Checkpoints to savepoints, More connectors, Stream SQL, Windows, Large state, Maintenance, Fine grained recovery, Side in-/outputs, Window DSL, Broader Audience, Security, Mesos &amp; others, Dynamic Resource Management, Authentication, Queryable State]</p>
<p>A longer-term vision for Flink</p>
<p>Streaming use cases</p>
<p>Application: (Near) real-time apps. Technology: Low-latency streaming</p>
<p>Application: Continuous apps. Technology: High-latency streaming</p>
<p>Application: Analytics on historical data. Technology: Batch as special case of streaming</p>
<p>Application: Request/response apps. Technology: Large queryable state</p>
<p>Request/response applications</p>
<p>Queryable state: query Flink state directly instead of pushing results in a database</p>
<p>Large state support and query API coming in Flink</p>
<p>[Diagram label: queries]</p>
<p>In summary</p>
<p>The need for streaming comes from a rethinking of data infra architecture</p>
<p>Stream processing then just becomes natural</p>
<p>Debunking 4 common myths</p>
<p>Myth 1: The 
throughput/latency tradeoff</p>
<p>Myth 2: Exactly once not possible</p>
<p>Myth 3: Streaming is for (near) real-time</p>
<p>Myth 4: Streaming is hard</p>
<p>Thank you!</p>
<p>@kostas_tzoumas @ApacheFlink @dataArtisans</p>
<p>We are hiring! data-artisans.com/careers</p>
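<p>The "effectively once" point under Myth 2 (replay after a failure combined with idempotent updates) can be illustrated with a minimal sketch. This is not Flink's API: the function names, the event shape, and the plain dict standing in for an external store such as Cassandra are all assumptions made for illustration.</p>

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record shape: (event_id, key, value).
Event = Tuple[str, str, int]


def apply_idempotent(store: Dict[str, int], event: Event) -> None:
    """Upsert keyed by event_id: writing the same event twice leaves one row."""
    event_id, _key, value = event
    store[event_id] = value


def apply_non_idempotent(counters: Dict[str, int], event: Event) -> None:
    """Increment per key: a replayed event is counted twice."""
    _event_id, key, value = event
    counters[key] = counters.get(key, 0) + value


def run_with_failure(
    events: List[Event],
    apply: Callable[[Dict[str, int], Event], None],
    checkpoint_at: int,
    fail_at: int,
) -> Dict[str, int]:
    """Process events, 'crash' after fail_at events, then replay from the checkpoint.

    The external sink is not rolled back, so events in
    events[checkpoint_at:fail_at] reach it twice (at-least-once delivery).
    """
    sink: Dict[str, int] = {}
    for event in events[:fail_at]:
        apply(sink, event)
    # Recovery: restart from the last checkpointed position and re-send.
    for event in events[checkpoint_at:]:
        apply(sink, event)
    return sink


if __name__ == "__main__":
    events = [("e1", "a", 1), ("e2", "a", 2), ("e3", "b", 3), ("e4", "a", 4)]
    print(run_with_failure(events, apply_idempotent, checkpoint_at=1, fail_at=3))
    print(run_with_failure(events, apply_non_idempotent, checkpoint_at=1, fail_at=3))
```

<p>In this sketch the idempotent sink ends up with one row per event even though two events are delivered twice, while the incrementing sink double-counts the replayed events. A real deployment would rely on the store's own upsert semantics rather than a dict.</p>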