DESCRIPTION
Building analytics systems is an increasingly common requirement for BI teams at companies both big and small, and the task becomes even more challenging when analytic results have to be produced in real time. In this presentation, the team from MarkedUp Analytics shows techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.
Real Time Analytics with Cassandra, Hive, and Solr
Aaron Stannard, Founder & CEO of MarkedUp
Powerful analytics tools for native apps
Understand your audience.
Gain valuable data on your users.
Monitor your app’s health.
Log errors and crashes remotely.
Drive more sales.
Better data = more revenue.
Do we really need real-time analytics?
Real time analytics isn’t inherently superior or necessary.
Building your own real-time analytics service with Cassandra and DataStax Enterprise
Cassandra Setup on EC2
Write Strategy
Read Strategy
Analytics Schema Strategy
• All row keys should be predictable (not always possible)
• Utilize physical sortability of columns
• Use predictably sortable data types for column names (integers, dates)
• Learn to love composite keys
• Batch mutations are your friend
• Use distributed counters for real-time metrics
• Use TTL for automatic data expiration (if necessary)
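The schema points above can be illustrated with a small in-memory sketch in Python. It is not Cassandra client code: the `row_key` format, the `increment` and `read_day` helpers, and the hourly bucketing are all assumptions made for illustration. It shows a predictable composite row key (app, metric, day), integer column names that sort naturally, and one counter increment per event.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for a counter column family: row key -> {column name -> count}.
# Hypothetical layout for illustration only; this is not a Cassandra API.
counters = defaultdict(lambda: defaultdict(int))


def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Composite, predictable row key: a reader can rebuild it from the query alone."""
    return f"{app_id}:{metric}:{ts:%Y%m%d}"


def increment(app_id: str, metric: str, ts: datetime, amount: int = 1) -> None:
    """Bump the counter for the hour the event falls in (column name = hour, an int)."""
    counters[row_key(app_id, metric, ts)][ts.hour] += amount


def read_day(app_id: str, metric: str, day: datetime) -> list[tuple[int, int]]:
    """Return (hour, count) pairs in column order; integer names sort predictably."""
    return sorted(counters[row_key(app_id, metric, day)].items())


# Made-up events for the sketch.
events = [
    datetime(2013, 6, 1, 9, 15, tzinfo=timezone.utc),
    datetime(2013, 6, 1, 9, 40, tzinfo=timezone.utc),
    datetime(2013, 6, 1, 14, 5, tzinfo=timezone.utc),
]
for ts in events:
    increment("app-42", "sessions", ts)

print(read_day("app-42", "sessions", events[0]))
```

Because the row key is derived only from values known at query time, a read for one app, metric, and day touches exactly one row, and the hour columns come back already ordered.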
Time Series Schema 0: All Knowns
Time Series Schema 1: Bounded Number of Unknowns
Time Series Schema 2: Unbounded Number of Unknowns
Schema Tips
Adding Hive and Hadoop to the Mix
Mo’ data, mo’ problems
When is Hadoop necessary?
• Large volumes of data (100GB+)
• Queries require retrospective / historical analysis
• Need consistent results
• Need to perform multi-stage analysis
• Speed isn’t a concern (Hadoop is sloooooooooow)
Hadoop on easy mode: Hive
• SQL abstraction on top of Hadoop (more familiar)
• Easier to deploy and test
• Simplifies data warehousing
• Easy to automatically import data from Cassandra
• DSE eliminates need for HDFS
C* to Hive
Hive Syntax
Query: count the number of items where “key” is greater than 100
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
Hive> select key, count(1) from kv1 where key > 100 group by key;
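To see the RDBMS side of this comparison run, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in relational database. The `kv1` rows are invented for illustration, and an `order by key` clause is added only to make the printed order deterministic. The Hive side is not executed here.

```python
import sqlite3

# SQLite in memory stands in for "an RDBMS"; the table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")
rows = [(50, "a"), (150, "b"), (150, "c"), (200, "d")]
conn.executemany("INSERT INTO kv1 (key, value) VALUES (?, ?)", rows)

# Same statement as on the slide, plus ORDER BY for a stable result order.
query = "select key, count(1) from kv1 where key > 100 group by key order by key"
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The row with key 50 is filtered out by the `where` clause, and the two rows with key 150 are collapsed into a single group with a count of 2.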
Hive Tips and Tricks
• Don’t write data from Hive back to a hot Cassandra column family
• If writing data from Hive to Cassandra, use dedicated column families
• You can write to multiple places on a single Hive read (table, CSV file, etc…)
• Use sampling to test Hive queries on scaled-down data sets
How do you count millions of distinct items in real-time?
• Solr: Lucene-based indexing engine
• Part of Apache Foundation
• Full-text search
• Faceted search
• Distributed
• Integrates well with Cassandra
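As a rough illustration of what faceted search contributes to distinct counting, the Python sketch below computes facet counts over a handful of in-memory documents. It only shows what a facet query returns (one bucket per distinct field value, with a count), not how Solr computes it over its index. The `facet_counts` helper, the field names, and the documents are all hypothetical.

```python
from collections import Counter

# Hypothetical event documents; field names are made up for this sketch.
docs = [
    {"app": "app-42", "device_id": "d1", "os": "Windows 8"},
    {"app": "app-42", "device_id": "d1", "os": "Windows 8"},
    {"app": "app-42", "device_id": "d2", "os": "Windows Phone"},
    {"app": "app-7", "device_id": "d3", "os": "Windows 8"},
]


def facet_counts(documents, facet_field, **filters):
    """Count matching documents per distinct value of `facet_field`.

    `filters` are simple equality constraints, e.g. app="app-42".
    """
    matching = (
        d for d in documents
        if all(d.get(name) == value for name, value in filters.items())
    )
    return Counter(d[facet_field] for d in matching)


facets = facet_counts(docs, "device_id", app="app-42")
distinct_devices = len(facets)  # number of facet buckets = distinct values

print(dict(facets))
print(distinct_devices)
```

The number of facet buckets is the distinct count, and each bucket's value shows how many events that item produced.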
Solr Index Setup
Solr Search
Questions or Comments?
[email protected]
https://markedup.com/