Copyright © 2014 by Argyle Data Inc. All Rights Reserved. 1
Real-Time Risk Analytics at Network Speed and Hadoop Scale
When Minutes Means Millions
Agenda
• About Argyle
• Use Cases we are Focusing on
• Case Study
• Architecture
• Deep Packet Inspection
• SQL on Accumulo
Argyle Data
History
• Founded 2009
• Venture backed
• 25+ employees
• Headquartered in San Mateo, CA
Vertical Markets
• Mobile Communications, Financial Services, eCommerce, Federal
• Alliance program for vertical-market ISV app providers
Argyle Data – Our Story
• Every Enterprise App
– Will be re-written in a better Data Driven way
• Data Driven Apps
– Will be Real-Time, Network Speed and Hadoop Scale
• Proven Stack for Data Driven apps
Pattern for Real-Time Risk Applications
Minutes Means Millions
Risk App Same Common Pattern Customer
• Real-Time Call-Data
– Non-Invasive Network Packet Ingestion – Call Data
– Millions of Mixed Inserts/Reads/Second
– Real-Time Analytics – Fast and Fresh
• Real-Time SMS-Data
– Non-Invasive Network Packet Ingestion – SMS Data
– Millions of Mixed Inserts/Reads/Second
– Real-Time Analytics – Fast and Fresh
• Real-Time Operational Data
– Non-Invasive Packet/Log File Ingestion – Text
– Millions of Mixed Inserts/Reads/Second
– Real-Time Analytics – Fast and Fresh
Real-Time Fraud Detection
• Situation
– Wangiri Fraud – Missed Call
– Multi-Billion Dollar Fraud
– Next Day Call Data Record Analysis
• Solution
– Real-Time Network DPI
– Real-Time Analytics and Detection
• Scale
– Ingest All Live Call Data for Whole Country
– Non-Intrusive Tap – 10Gb/s to 100Gb/s
• Benefit
– Detect IRSF Callback Fraud in Minutes
– Data Packet Lake for Multiple Apps
Stack Shift
Old world architecture
• 24 Hour ETL/DB Process
• In-Memory Analytics
• Patchwork Quilt Systems
• App Transaction, Log Files
• Application Data Silos
• Complex Rules
• Complex App Dev
New world architecture
• Real-Time
• Petabyte Scale Analytics
• Single Hadoop Stack
• Network Packet Ingestion
• Network Packet Data Lake
• Machine Learning at Scale
• As Simple as Splunk
“62% Moving to Hadoop Infrastructure” - Gartner
ArgyleDB
Enabling Data Driven Risk Apps at Network Speed and Hadoop Scale
• Ingestion
– Network Packet Ingestion
– Deep Packet Inspection
– Storage Optimization
• Universal Schema
• Query
– Distributed SQL Optimization
• Machine Learning
[Diagram: Ingest, Query, Search, Graph, Machine Learning]
Deep Packet Inspection
A Sea of Protocols
Presto + Hive
Architecture
Presto + Accumulo
From K/V to SQL
Parallel Architecture / Data Locality
Collocate Presto-Accumulo Workers and Accumulo Nodes
Accumulo KV to Presto data model mapping
Schema-less to Schema-full
• Accumulo is schema-less
• Presto expects a predefined schema for tables
• Table definitions in ZooKeeper
• Each Presto table mapped to an Accumulo table
• Each Presto column mapped to an Accumulo colfam+colqualifier
• Use column definition to detect data type and deserialize from byte[]
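The mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Presto-Accumulo connector itself: the `ColumnDef` class, the `build_row` helper, the `cf` column family, the `age` column, and the big-endian encodings are all hypothetical, and a plain list stands in for the table definition that the slide says is kept in ZooKeeper.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDef:
    """One Presto column mapped to an Accumulo family + qualifier (hypothetical)."""
    name: str       # Presto column name
    family: str     # Accumulo column family
    qualifier: str  # Accumulo column qualifier
    type: str       # declared type, used to pick a decoder


# Assumed encodings: UTF-8 strings, big-endian 8-byte numbers.
DECODERS = {
    "varchar": lambda b: b.decode("utf-8"),
    "bigint": lambda b: struct.unpack(">q", b)[0],
    "double": lambda b: struct.unpack(">d", b)[0],
}


def build_row(table_def, entries):
    """Turn the (family, qualifier, value_bytes) cells of one row ID into a typed row."""
    by_cell = {(c.family, c.qualifier): c for c in table_def}
    row = {c.name: None for c in table_def}  # columns without a cell stay None
    for family, qualifier, raw in entries:
        col = by_cell.get((family, qualifier))
        if col is None:
            continue  # cell exists in Accumulo but is not declared in the schema
        row[col.name] = DECODERS[col.type](raw)
    return row


# Hypothetical table definition and raw cells for a single row.
TABLE1 = [
    ColumnDef("firstname", "cf", "Firstname", "varchar"),
    ColumnDef("lastname", "cf", "Lastname", "varchar"),
    ColumnDef("age", "cf", "Age", "bigint"),
]
entries = [
    ("cf", "Firstname", b"Joe"),
    ("cf", "Lastname", b"Smith"),
    ("cf", "Age", struct.pack(">q", 42)),
    ("cf", "Undeclared", b"ignored"),
]
print(build_row(TABLE1, entries))
```

Because Accumulo itself stores only bytes, the declared column type is the sole source of typing here; cells that the schema does not mention are simply skipped.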
Secondary Index
Or how to make it columnar
• Presto works well with Columnar storage
• Presto fetches individual columns, not rows
• We considered Accumulo Locality Groups
• But we decided to use a separate index table
Secondary Index Table
[Diagram: Presto Worker, Table1_index, Table1]
Secondary Index Table
Keys are prefixed with a byte for sharding data (to prevent “burning kindle”, i.e. all writes hitting a single tablet server)

Table1_index
Key                   Value
<shard_byte2>Joe      <shard_byte1>123
<shard_byte3>Smith    <shard_byte1>123

Table1
Key                 Column       Value
<shard_byte1>123    Firstname    Joe
<shard_byte1>123    Lastname     Smith
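The two-table layout of the Secondary Index Table slide can be sketched as follows. This is a minimal illustration, not Argyle's implementation: the CRC-based `shard_byte` function, the `NUM_SHARDS` value, and the `put` and `lookup` helpers are assumptions, and in-memory dicts stand in for the Accumulo tables `Table1` and `Table1_index`.

```python
import zlib

NUM_SHARDS = 16  # assumed shard count; the slides do not state one


def shard_byte(raw: bytes) -> bytes:
    """Derive a one-byte shard prefix from the key contents (assumed scheme)."""
    return bytes([zlib.crc32(raw) % NUM_SHARDS])


def sharded(text: str) -> bytes:
    """Prefix a row ID or an indexed value with its shard byte."""
    raw = text.encode("utf-8")
    return shard_byte(raw) + raw


def put(main, index, row_id, columns):
    """Write one row to the main table and one index entry per column value."""
    key = sharded(row_id)
    for column, value in columns.items():
        main[(key, column)] = value                       # Table1: key, column -> value
        index.setdefault(sharded(value), set()).add(key)  # Table1_index: value -> row key


def lookup(main, index, value):
    """Find row keys through the index, then fetch the full rows from the main table."""
    rows = []
    for key in sorted(index.get(sharded(value), ())):
        # A real table would use a range scan on the row key; a dict scan stands in here.
        rows.append({col: v for (k, col), v in main.items() if k == key})
    return rows


main, index = {}, {}
put(main, index, "123", {"Firstname": "Joe", "Lastname": "Smith"})
print(lookup(main, index, "Joe"))
```

Hashing the key content to pick the prefix spreads sequential row IDs across shards, while a reader can still recompute the prefix for an exact-match lookup; range queries over the original key order would need to fan out over every shard.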
REAL-TIME RISK ANALYTICS AT NETWORK SPEED AND HADOOP SCALE
When Minutes Means Millions