Architecting next generation big data platform

Hadoop Application Architectures:

Architecting a Next Generation Data

PlatformStrata + Hadoop World, Singapore 2016tiny.cloudera.com/app-arch-singaporetiny.cloudera.com/app-arch-questions

Mark Grover | @mark_groverTed Malaska | @ted_malaska

Jonathan Seidman | @jseidman

Questions? tiny.cloudera.com/app-arch-questions

Logistics▪ Break at 10:30 – 11:00 AM▪Questions at the end of each section▪ Slides at tiny.cloudera.com/app-arch-singapore▪Code at https://github.com/hadooparchitecturebook/Taxi360


About the book▪@hadooparchbook▪ hadooparchitecturebook.com▪ github.com/hadooparchitecturebook▪ slideshare.com/hadooparchbook


About the presenters

▪ Technical Group Architect at Blizzard Entertainment▪ Previously Principal Solutions

Architect at Cloudera, lead architect at FINRA▪ Contributor to Apache

Hadoop, HBase, Flume, Avro, Pig, Spark, YARN, Sqoop, Kudu, Kafka

Ted Malaska



▪ Partner Software Engineer at Cloudera▪ Contributor to Apache Sqoop.▪ Previously Technical Lead on

the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data

Jonathan Seidman



▪ Software Engineer on Spark at Cloudera▪ Committer on Apache Bigtop,

PMC member on Apache Sentry(incubating)▪ Contributor to Apache Spark,

Hadoop, Hive, Sqoop, Pig, Flume

Mark Grover

Case Study OverviewInternet of Things and Entity 360


Customer 360


Connected Cars


Entity (Taxi) 360 View

Geo-location/ Traffic Data

Customer Data

Maintenance Data

Other Data Sources

Streaming Vehicle Data


What Makes Hadoop a Fit?

Data Sources Extract Transform Load

The early days…


What Makes Hadoop a Fit?

SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE

ERP,CRM,RDBMS,MACHINES FILES,IMAGES,VIDEOS,LOGS,CLICKSTREAMS EXTERNALDATASOURCES

Today…


Enabling a Range of New Use Cases…

Fraud Detection Market Transactions

Internet of Things

Network Security


Hadoop Challenges

Kafka StreamsKafka Connect

Kafka


Challenges - Architectural Considerations ▪ Reliable and scalable ingress of multiple data types and sources:

- High volume event data? Batch data?▪ Reliable and scalable storage to support multiple workloads and access patterns

- Historical data? Real-time search? Analytics▪ Processing engines (for background processing):

- Stream processing? Batch processing?▪ Data Modeling

- Modeling data for real-time random access? Analytic access? Batch access?

Case Study Requirements

Overview


Requirements▪ Allow users (technical and non-technical) to analyze and visualize data…


Requirements▪ Provide analysts with query capabilities via a standard interface…


Requirements▪ Provide developers the ability to perform batch processing on historical data…


Requirements▪ To support all this, we need:

- Reliable ingestion of streaming and batch data.- Ability to perform transformations on streaming data in flight.- Ability to perform sophisticated processing of historical data.

High level architectureWalkthrough


High level architectureSource Transport Stream

ProcessingStorage Access

Data Producer Pub-SubProcessing &

Ingestion Engine

NestedTables

IndexedCube

RelationalTables

Entity Time Series Lookup

Batch Processing

SQL

NRT REST

NRT Dashboard

Data BufferingConsiderations


But wait!

What about batch data?

Event Data BufferingOverview




Data Producer Pub-SubProcessing &

Ingestion Engine

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard


Buffering Data▪ What do we mean by “buffering” and why do we need it?

event,event,event,event,event,event…

This is bad!


Buffering Data – Message Brokers

Publisher

Publisher

Publisher

Message Queue

Subscriber

Subscriber

Subscriber




Custom Producer

or

Processing & Ingestion Engine

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT Rest

NRT Dashboard


Buffering Data – Flume vs. Kafka▪ Flume – well integrated with Hadoop.

- Great choice when ingesting data into HDFS.- Can support simple transformations.- Less coding.

▪ But…- Interface between Kafka and the streaming layer is already well defined.- Transformations are done in the stream processing layer.- We need a more general purpose system at this layer.


Kafka Connect

Kafka Connect(Source)

Kafka Connect

(Sink)


What is Kafka?▪ It’s like a message queue, right?

- Actually, it’s a “distributed commit log”.

0 1 2 3 4 5 6 7 8

Data Source

Data Consumer

A

Data Consumer

B


Topics and Partitions▪ Messages are organized into topics, and each topic is split into partitions.

- Each partition is an immutable, time-sequenced log of messages on disk.- Note that time ordering is guaranteed within, but not across, partitions.

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8

Partition 0

Partition 1

Partition 2

Data SourceTopic


Kafka Background (Physical)

ProducersProducersProducersBrockerBrockerBroker

ConsumersConsumersConsumers

ZooKeeperZooKeeperZooKeeper


Consumers

Kafka In Our Architecture

Taxi Trip Data

Producer

Kafka

taxi-trip-input

Topic

Stream Processing(Analytic)

Stream Processing(Lookup)

Stream Processing

(Search)

Stream Processing

(Long Term)


Input Events

CMT,2009-01-05 08:31:55,2009-01-05 8:37:50,1,0.90000000000000002,-73.977936999999997,40.745919000000001,,,-73.983609000000001,40.755051000000002,Credit,5.2999999999999998,0,,0.79000000000000004,0,6.0899999999999999

vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt


Kafka Considerations – Reliability▪ Different reliability levels for topics:

Taxi Trip Data

Kafka

taxi-trip-input

Twitter customer-sentiment

100% – dups are ok

(“At least once”)

<=100%(“At most

once”)


Kafka Considerations – Reliability▪ But remember there are tradeoffs…


Kafka Reliability – ReplicationProducer

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Lead

er



Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3



Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Lead

er


Kafka Reliability– ReplicationProducer

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Lead

er

Lead

er



Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Bro

ker

Parti

tion

1

Parti

tion

2

Parti

tion

3

Lead

er


Kafka Reliability – Replication▪ So how does this relate to our application?

kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 3 \--create --topic taxi-trip-input

kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 1 \--create –topic customer-sentiment


Kafka Reliability – Producers

Taxi Trip Data

Kafka

taxi_trip_input

Partition 1

Partition 2

Partition 3

Topic B

Partition 1

Partition 2

Partition 3

Message failure?

Producer

Resend message

acks=all


Kafka Reliability – Producers▪ What about duplicates?

Taxi Trip Data

Kafka

taxi_trip_input

Partition 1

Partition 2

Partition 3

Topic B

Partition 1

Partition 2

Partition 3

Producer

ID Message1000 2009-01-04 03:02:00,1,2.629,...1001 2009-01-04 03:38:00,3,4.549…1001 2009-01-04 03:38:00,3,4.549…


Kafka Scaling – Partitions

Producer

Kafka

taxi-trip-input

Partition 1

Partition 2

Partition 3

Consumer Group

Consumer

Consumer

Consumer



Producer

Kafka

taxi-trip-input

Partition 1

Partition 2

Partition 3

Consumer Group

Consumer

Consumer

Consumer

Partition 4

Partition 5

Consumer

Consumer

Higher throughput

Higher throughput

More resources (memory)

More resources

(file handles)

Producer




Custom Partitioning0 1 2

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5

Producer

0 1 2

0 1 2 3 4 5 6 7 8

0 1 2 3 4 5

Producer

3 4 5 6 7

6 7

Partitioner


Kafka Scaling – Producers

Producer

Kafka

taxi-trip-input

Partition 1

Partition 2

Partition 3

Consumer Group

Consumer

Consumer

Consumer

Partition 4

Partition 5

Consumer

Consumer

Producer


Guarding Against Message Loss▪ Producer – What happens if the producer loses connection to Kafka and the buffer overflows?- Consider a producer side buffer (e.g. Flume).▪ Source – What happens if events are lost before getting sent to producer?

- Once again use some kind of buffer to provide sufficient retention of data.

Stream ProcessingConsiderations




Custom Producer

or


NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard


Streaming agenda▪ What do we mean by streaming?▪ Streaming use-cases▪ Streaming semantics▪ Which streaming engine to choose?▪ Streaming in our use-case

What do we mean by streaming?



Constant low milliseconds & under

Low milliseconds to seconds, delay in case of failures

10s of seconds or more, re-run in case

of failures

Real-time Near real-time Batch






of failures



But, there’s no free lunch




of failures


“Difficult” architectures, lower latency

“Easier” architectures, higher latency

Streaming use-cases


Streaming Use-cases▪ Ingestion (most relevant in our use-case)▪ Simple transformations

- Decision (e.g. anomaly detection)- Enrichment (e.g. add a state based on zipcode)▪ Advanced usage

- Machine Learning- Windowing


#1 - Simple ingestion

BufferEvent e Stream

Processing Long term storage

Event e


#2 - Enrichment


Processing Storage

Event e’

e’ = enriched event e

Context store


#2 - Decision


Processing Storage

Event e’

e’ = e + decision

Rules


#3 – Advanced usage


Processing Storage

Event e’

e’ = aggregation or windowed aggregation

Model


#1 – Simple Ingestion1. Zero transformation

- No transformation, plain ingest- Keep the original format – SequenceFile, Text, etc.- Allows to store data that may have errors in the schema

2. Format transformation- Simply change the format of the field- To a structured format, say, Avro, for example- Can do schema validation

3. Atomic transformation- Mask a credit card number


#2 - Enrichment


Processing Storage

Event e’

e’ = enriched event e

Context storeNeed to store the

context somewhere


Where to store the context?1. Locally Broadcast Cached Dim Data

- Local to Process (On Heap, Off Heap)- Local to Node (Off Process)

2. Partitioned Cache- Shuffle to move new data to partitioned cache

3. External Fetch Data (e.g. HBase, Memcached)


#1a - Locally broadcast cached data

Could be On heap or Off heap


#1b - Off process cached dataData is cached on the

node, outside of process. Potentially in an external system like

Rocks DB


#2 - Partitioned cache data

Data is partitioned based on field(s) and

then cached


#3 - External fetch

Data fetched from external system


Partitioned cache + external

Streaming semantics


Delivery Types▪ At most once

- Not good for many cases- Only where performance/SLA is more important than accuracy

▪ Exactly once- Expensive to achieve but desirable

▪ At least once- Easiest to achieve


Semantics of our architectureSource System 1

DestinationsystemSource System 2

Source System 3

Ingest Extract Streaming engine

Push

Message broker


Classification of storage systems▪ File based

- S3- HDFS▪ NoSQL

- HBase- Cassandra▪ Document based

- Search▪ NoSQL-SQL

- Kudu


Classification of storage systems▪ File based

- S3- HDFS▪ NoSQL

- HBase- Cassandra▪ Document based

- Search▪ NoSQL-SQL

- Kudu

De-duplication at file level

Semantics at key/record level

Which streaming engine to choose?




Custom Producer

or


NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard

Apache Beam

Kafka Streams


Spark Streaming▪Micro batch based architecture▪ Allows stateful transformations ▪ Feature rich

- Windowing- Sessionization- ML- SQL (Structured Streaming)


Spark Streaming

DStream

DStream

DStream

Single Pass

Source Receiver RDD

Source Receiver RDD

RDD

Filter Count Print

Source Receiver RDD

RDD

RDD

Single Pass

Filter Count Print

First Batc

h

Second Batch


DStream

DStream

DStream

Single Pass

Source Receiver RDD

Source Receiver RDD

RDD

Filter Count

Print

Source Receiver RDDpartitions

RDDParition

RDD

Single PassFilter Count

Pre-first Batch

First Batc

h

Second Batch

Stateful RDD 1

Print

Stateful RDD 2

Stateful RDD 1

Spark Streaming


Flink▪ True “streaming” system, but not as feature rich as Spark▪Much better event time handling▪Good built-in backpressure support▪ Allows stateful transformations▪ Lower Latency

- No Micro Batching- Asynchronous Barrier Snapshotting (ABS)


Flink - ABS

Operator

Buffer


Operator

Buffer

Operator

Buffer

Flink - ABS

Barrier 1A Hit

Barrier 1B Still Behind


Operator

Buffer

Flink - ABS

Both Barriers Hit

Operator

Buffer

Barrier 1A Hit

Barrier 1B Still Behind


Operator

Buffer

Flink - ABSBoth Barriers

Hit

Operator

Buffer Barrier is combined and can move on

Buffer can be flushed out


Storm▪Old school▪Didn’t manage state – had to use Trident▪No good support for batch processing


Samza▪Good integration with Kafka▪Doesn’t support batch▪ Forked by Kafka Streams


Flume▪Well integrated with the Hadoop ecosystem▪ Allowed interceptors (for simple transformations)▪ Supports buffering- Memory- File- Kafka▪ But no real fault-tolerance▪No state management


Others▪ Apache Apex▪ Kafka Streams▪ Heron


Apache Beam▪ Abstraction on top of Streaming Engines▪ Best support for Google Dataflow

Streaming in our use-case


Spark Streaming▪We chose Spark Streaming because:- Same execution engine for batch and streaming- Similar code for batch and streaming- Support for security, kafka integration- Thriving community- We don’t have low millisecond requirements




Custom Producer

or

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard

Storage LayerConsiderations




Custom Producer

or

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard

Data Modeling


Structured Landing ZonesHive Relational Model

Kudu/HDFS

Hive Nested ModelHDFS

AggregationsKudu

HBase Entity Time Series

Solr

Traditional SQL

Optimized for nested Structures like JSON

Optimized Storing and mutating aggregates

Optimized Entity 360 and time base access

Optimized faceted charts and reverse index look ups


Relational▪ Everyone knows it▪ Simple▪ Very painful to do large Join▪ May lead to customers making bad queries▪ Easier to mutate


Kudu Data Models▪ Entity Summary Tables

- Quick update and access of aggregate of Entity Stats▪ Event Tables

- Number of Partitioning strategies- Partition by Entity- Partition by Hash on time


Kudu: Table Creation ExampleCREATE EXTERNAL TABLE ny_taxi_trip (

vender_id STRING,tpep_pickup_datetime TIMESTAMP,tpep_dropoff_datetime TIMESTAMP,passenger_count INT,trip_distance DOUBLE,pickup_longitude DOUBLE,pickup_latitude DOUBLE,rate_code_id STRING,store_and_fwd_flag STRING,dropoff_longitude DOUBLE,dropoff_latitude DOUBLE,payment_type STRING,fare_amount DOUBLE,extra DOUBLE,mta_tax DOUBLE,improvement_surcharge DOUBLE,tip_amount DOUBLE,tolls_amount DOUBLE,total_amount DOUBLE

)STORED AS PARQUETLOCATION 'usr/root/hive/ny_taxi_trip';


Kudu: Data PopulationSparkStreamingTaxiTripToKudu.scala


Kudu: REST APIKuduServiceLayer.scala


View Strategies

Hive Relational Model

Hive Nested Model

Mod

els

Hive Normal Views

Hive Materialized Table Views

Use in the cases where the view requires a join that is done through a shuffle

Use only for tables that filter records/columns or use for marking fields


Nested▪ Less Space than Denormalization▪ Still have tables but the cost of joins is all but gone▪ Also great for cartesian joins

- N x M vs N + M▪ Not really supported yet with Kudu or HBase with SQL


Nested Writing Example in Spark{"id": "0001","type": "donut","name": "Cake","ppu": 0.55,"batters":{"batter":[{ "id": "1001", "type": "Regular" },{ "id": "1002", "type": "Chocolate" },{ "id": "1003", "type": "Blueberry" },{ "id": "1004", "type": "Devil's Food" }]

},"topping":[{ "id": "5001", "type": "None" },{ "id": "5002", "type": "Glazed" },{ "id": "5005", "type": "Sugar" },{ "id": "5007", "type": "Powdered Sugar" },{ "id": "5006", "type": "Chocolate with Sprinkles" }]


Nested Writing Example in Sparkval jsonDF = hiveContext.read.json(jsonRDD)

jsonDF.write.parquet("./parquet")

hiveContext.createExternalTable("jsonNestedTable", "./parquet")


Nested: Taxi ExampleKuduToNestedHDFS.scala


Entity Centric Time Series▪ Partition by Entity ID▪ Order by Time▪ Allows for free windowing▪ Allows for fetching of single time window of single entity at web scale



Cust-A, 10

Cust-A, 20

Cust-A, 40

Cust-C, 10

Cust-C, 20

Cust-C, 30

Cust-C, 40

Cust-B, 10

Cust-B, 20

Cust-B, 30

Cust-B, 40

Cust-F, 20

Cust-F, 30

Cust-F, 40

Cust-D, 10

Cust-D, 20

Cust-D, 40

Cust-G, 10

Cust-G, 20

Cust-G, 30

Cust-G, 40



Cust-A, 10

Cust-A, 20

Cust-A, 40

Cust-C, 10

Cust-C, 20

Cust-C, 30

Cust-C, 40

Cust-B, 10

Cust-B, 20

Cust-B, 30

Cust-B, 40

Cust-F, 20

Cust-F, 30

Cust-F, 40

Cust-D, 10

Cust-D, 20

Cust-D, 40

Cust-G, 10

Cust-G, 20

Cust-G, 30

Cust-G, 40

Rest Call Short Scan



Cust-A, 10

Cust-A, 20

Cust-A, 40

Cust-C, 10

Cust-C, 20

Cust-C, 30

Cust-C, 40

Cust-B, 10

Cust-B, 20

Cust-B, 30

Cust-B, 40

Cust-F, 20

Cust-F, 30

Cust-F, 40

Cust-D, 10

Cust-D, 20

Cust-D, 40

Cust-G, 10

Cust-G, 20

Cust-G, 30

Cust-G, 40

Mapper Mapper Mapper



Cust-A, 10

Cust-A, 20

Cust-A, 40

Cust-C, 10

Cust-C, 20

Cust-C, 30

Cust-C, 40

Cust-B, 10

Cust-B, 20

Cust-B, 30

Cust-B, 40

Cust-F, 20

Cust-F, 30

Cust-F, 40

Cust-D, 10

Cust-D, 20

Cust-D, 40

Cust-G, 10

Cust-G, 20

Cust-G, 30

Cust-G, 40

Mapper

Mapper Mapper


HBase: Row Key Example def generateRowKey(customerTrans: CustomerTran, numOfSalts:Int): Array[Byte] = {val salt = StringUtils.leftPad(

Math.abs(customerTrans.customerId.hashCode % numOfSalts).toString, 4, "0")

Bytes.toBytes(salt + ":" +customerTrans.customerId + ":" +StringUtils.leftPad(customerTrans.eventTimeStamp.toString, 11, "0") + ":trans:" +customerTrans.transId)

}


HBase: Population ExampleSparkStreamingTaxiTripToHBase.scala


HBase: REST ExampleHBaseServiceLayer.scala


Solr: Data Model▪ Think of it like a cube on a object type

- In our case a taxi trip- Allows for rollups and aggregations from object’s point of view- Think of objects as immutable- Try to find time based events- May design more than one object type


Solr Details

1 Trip:101 1

2 Trip:102 1

3 Trip:103 1

ID Document Live

4 Trip:104 1

5 Trip:105 1

ID Field Value Documents

1 Cash 1,3

2 Credit 2

3 Debit 4,5


Single Value Aggregations▪ Get Array Lengths

ID Field Value Documents

1 Cash 1,3

2 Credit 2

3 Debit 4,5


Multi Value Aggregations▪ Ordered Merge Join

- Think like a zipper- Scans- No Lookups▪ Top N from both sides- Leaving the rest to other▪ Indexes distributed▪ No need to read document data

1 4 5 7 8 9 10 14 16

2 3 6 11 12 13 15 17 18

1 2 3 6 7 8 10 15 18

Cash

Credit

Vender A

4 5 9 11 12 13 14 16 17Vender B


Solr: Population ExampleSparkStreamingTaxiTripToSolR.scala


Storage


ProcessingAccess

Custom Producer

or

Batch ProcessingConsiderations




Custom Producer

or

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard


Why have batch processing?▪ When you need a larger context

- Say, to train a model▪ Complex periodic job that does something

- Convert data to a nested structure for reduced number of shuffles▪ In our use-case,

- Kudu -> HDFS Nested is batch processing- KMeans calculation is also in bash


Batch processing options▪ Spark (+ MLlib)▪ MapReduce (+ Mahout)▪ Flink (+ Flink ML)


Spark▪ Pretty popular▪ Much faster than MapReduce▪ Thriving community


MapReduce▪ Sloooooow


Flink▪ Pretty popular▪ Batch is a special case of Streaming▪ Developing community


In our use-case▪ We chose Spark

- We were using Spark Streaming anyways- Similar code between Spark and Spark Streaming- Thriving community

Interactive Data Access

Considerations




Custom Producer

or

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard


Types of data access▪ REST server/APIs for querying entities and aggregates▪ UI for displaying search facets▪ SQL engine

REST serversConsiderations


Why have REST server?▪ Tired of business people telling us how to access data▪ Serves as an interface between the data engineers and business folks▪ Lets business folks decide access patterns▪ Engineers to optimize those patterns▪ Brownie points from your boss▪ And, it’s not that difficult to write!


Don’t believe me?import org.mortbay.jetty.Serverimport org.mortbay.jetty.servlet.{Context, ServletHolder}…

val server = new Server(port)val sh = new ServletHolder(classOf[ServletContainer])sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass", "com.sun.jersey.api.core.PackagesResourceConfig")sh.setInitParameter("com.sun.jersey.config.property.packages", "com.hadooparchitecturebook.taxi360.server.hbase")sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true”)val context = new Context(server, "/", Context.SESSIONS)context.addServlet(sh, "/*”)server.start()server.join()


Then, write a ServiceLayer@GET@Path("vender/{venderId}/timeline")@Produces(Array(MediaType.APPLICATION_JSON))def getTripTimeLine (@PathParam("venderId") venderId:String,

@QueryParam("startTime") startTime:String = Long.MinValue.toString,@QueryParam("endTime") endTime:String = Long.MaxValue.toString):

Array[NyTaxiYellowTrip] = {


Use REST! Say no to business people!▪ Access data like so:http://<serverURL>:8080/vender/{venderId}/timeline

UIConsiderations


UI requirementsSomething that can ▪ Represent search results really well▪ Integrates with Apache Solr on Hadoop


UI options▪ Hue▪ Banana▪ Kibana


We choose Hue▪ Because it’s included▪ Please look at the others

SQL enginesConsiderations


SQL engine criteria▪ Low latency SQL access▪ Allows for high concurrency▪ JDBC/ODBC integration▪ Capable of large scale aggregation▪ Optionally integrates with Kudu for real-time updates to SQL tables


Apache Hive▪ Good JDBC integration▪ Not really low latency, even when using Tez▪ Doesn’t integrate with Kudu▪ Can run at MR, Spark, or Tez


Presto▪ Low latency SQL engine from Facebook▪ Provides JDBC/ODBC access▪ Is only in-memory, large aggregations can lead to OOM errors▪ Doesn’t integrate with Kudu


Apache Impala▪ Low latency SQL access▪ Provides JDBC/ODBC access▪ Excellent concurrency support▪ Integrates with Kudu for real-time SQL


Apache Drill▪ Similar in architecture to Impala▪ Provides JDBC/ODBC access▪ Doesn’t integrate with Kudu


Spark SQL▪ Builds on top of Spark▪ JDBC/ODBC access only via Spark Thrift Server

- Doesn’t scale well with larger number of concurrent users- Doesn’t fully provide secure access.


We choose▪ Spark SQL▪ Impala

Overall ArchitectureReview




Custom Producer


NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT Rest

NRT Dashboard




Custom Producer

or

NestedTables

IndexedCube

RelationalTables


Batch Processing

SQL

NRT REST

NRT Dashboard


Storage


Processing

Custom Producer

or

Access

Batch Processing

SQL

NRT REST

NRT Dashboard


Access


ProcessingStorage

Custom Producer

or




Custom Producer

or

Demo!


High Level of the Demo Design

Producer

KafkaTopic Foo

Partition 1

Partition 2

Partition 3

Spark Streaming

Kudu

Spark Streaming

HBase

Spark Streaming

Solr

Spark Streaming

HDFS

Kudu

HBase

Solr

HDFS

SQL

REST

REST

Hue

SQL

Where else to find us?


Other Sessions▪ Ask Us Anything session (all) – Wednesday, 2:35 PM▪ Top Five Mistakes When Writing Spark Applications (Mark/Ted) – Wednesday,

11:15 AM▪ Storage designs done right equal faster processing and access (Ted) –

Wednesday, 4:15 PM

Thank you!@hadooparchbook

tiny.cloudera.com/app-arch-singaporeJonathan Seidman | @jseidmanTed Malaska | @ted_malaskaMark Grover | @mark_grover

Engineering

Architecting next generation big data platform