Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs


DESCRIPTION

Most organizations still rely on batch and offline processing of data streams to analyze and gain insight into their business. However, in our instant-gratification world, real-time computation and analysis of streaming data is crucial for gaining insight into patterns and threats. A trend is emerging toward real-time, instant analysis of live data streams, promoting the value of logs and a move toward functional programming. This shift in technology is not about what data to store or how to store it, but about what we can do with it to see emerging patterns and trends across multiple resources, applications, services and environments. Log data represents a wealth of information, yet it is often sporadic, unstructured, scattered across the enterprise and difficult to track. These slides provide insights into some of the most helpful Big Data tools used by the largest social media and data-centric organizations for competitive trends, instant analysis and feedback from large-volume data streams. We show how using the Big Data tools Storm and Elasticsearch, together with an elastic UI, can turn application logs into real-time analytical views. You will also learn how Big Data:

• Contains data that is elastic, minimally structured, flexible and scalable
• Helps process live streams into meaningful data
• Promotes a move toward functional programming
• Affects the enterprise data architecture
• Works with real-time CEP tools like Storm for functional programming


Eric Roch, Principal & Ben Hahn, Senior Technical Architect

Perficient is a leading information technology consulting firm serving clients throughout North America.

We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty and create a more agile enterprise to better respond to new business opportunities.

About Perficient

• Founded in 1997

• Public, NASDAQ: PRFT

• 2013 revenue $373 million

• Major market locations throughout North America: Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Los Angeles, Minneapolis, New Orleans, New York City, Northern California, Philadelphia, Southern California, St. Louis, Toronto and Washington, D.C.

• Global delivery centers in China, Europe and India

• >2,100 colleagues

• Dedicated solution practices

• ~90% repeat business rate

• Alliance partnerships with major technology vendors

• Multiple vendor/industry technology and growth awards

Perficient Profile

BUSINESS SOLUTIONS
• Business Intelligence
• Business Process Management
• Customer Experience and CRM
• Enterprise Performance Management
• Enterprise Resource Planning
• Experience Design (XD)
• Management Consulting

TECHNOLOGY SOLUTIONS
• Business Integration/SOA
• Cloud Services
• Commerce
• Content Management
• Custom Application Development
• Education
• Information Management
• Mobile Platforms
• Platform Integration
• Portal & Social

Our Solutions Expertise

Eric Roch, Principal

Eric leads Perficient's national connected solutions practice

• Includes focus on SOA/integration, cloud, mobile and Big Data

• Author & industry speaker

• 25+ years of experience in various aspects of information technology including:
  • Executive-level management
  • Enterprise architecture
  • Application development

Speakers

Ben Hahn, Sr. Technical Architect

Ben Hahn is a Sr. Technical Architect

• Includes focus on transactions, logging & exceptions processing

• Author & speaker

• 20+ years of experience in various aspects of information technology including:
  • Software solutions
  • Enterprise infrastructure
  • Product management

• Open Source software community contributor

• Often defined as data that exceeds the capacities of conventional database systems because it’s too large and moves too fast for them to handle in an architecturally cohesive way. The three V’s of Big Data are:

• Volume
  • Most companies have 100 TB of data
  • Facebook ingests 500 TB in a single day
  • 40 zettabytes (that’s 43 trillion GB) of data by 2020

• Velocity
  • NYSE captures 4-5 TB of data in a single day
  • A Boeing 737 generates 243 TB in a single flight
  • The Google self-driving car generates 750 MB of data per second!

• Variety
  • Twitter, Clickstreams, Audio, Video
  • GPS, Sensor data, Facebook content
  • Infrastructure and application logs

What is Big Data?

POLL QUESTION: What is your current adoption level for big data?
• Evaluation
• Prototype
• Production

But Not Everyone is Google!

Where’s the Big Data coming from?

POLL QUESTION: Have you used open source software for big data solutions?
• Yes
• No

Machine Data definitely has the three V’s of Big Data

Machine Data is Big Data

What Can We Gain From Machine Data?

Valuable information can be mined from machine data, including:

• Transaction monitoring
• Error detection
• Behavior trends
• Audit logging
• Infrastructure states
• Anomaly detection
• Geospatial analysis
• Network analysis

Log Analysis vs. Business Analytics

• Ingest - Versus ETL
• Big Data - Bidirectional integration with Hadoop
• Query language - MapReduce function on unstructured data
• Drill anywhere - Investigate all the data versus a predefined schema or cube
• Information discovery - Discover relationships based on patterns in the data
• Ad-hoc versus dimensional - Log analysis is not based on a predefined structure derived from a point-in-time set of requirements
• Explicit logging - Versus implicit correlation

Polling Question: Do you mine machine data for business insights?
• Yes
• No

Innovations From Cloud and OSS

• Hadoop and MapReduce - Derived from Google's MapReduce and Google File System

• Storm – Distributed event processor open sourced by Twitter

• Presto - SQL query engine open sourced by Facebook, built to work with petabyte-sized data warehouses

• Google BigQuery - Run SQL-like queries against terabytes of data in seconds

• Amazon DynamoDB - NoSQL database service to store and retrieve any amount of data, and serve any level of request traffic

• Elasticsearch – Distributed full-text search engine backed by an OSS community

POLLING QUESTION: Do you plan to use cloud based solutions for big data?
• Yes
• No

• 2004 - Google published a paper called "MapReduce: Simplified Data Processing on Large Clusters", characterized by:
  • Map and shuffle key-value data pairs, then aggregate/reduce these intermediate pairs
  • Origins in the map and reduce primitives of functional languages
  • Massive parallelism and elasticity via commodity hardware
  • Fault tolerance via master-worker nodes
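The map, shuffle and reduce steps listed above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Google's or Hadoop's implementation; the log line layout and the names map_phase, shuffle_phase, reduce_phase and count_levels are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical log lines; the "<date> <time> <LEVEL> <message>" layout is assumed.
LOG_LINES = [
    "2014-04-01 10:00:01 ERROR payment timeout",
    "2014-04-01 10:00:02 INFO order created",
    "2014-04-01 10:00:03 ERROR payment declined",
    "2014-04-01 10:00:04 WARN slow response",
]


def map_phase(line):
    """Emit one intermediate (key, value) pair per line: (log level, 1)."""
    level = line.split()[2]
    return (level, 1)


def shuffle_phase(pairs):
    """Group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def count_levels(lines):
    """Compose the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map(map_phase, lines)))
```

In a real cluster the map calls and the per-key reductions would run on different machines; here they run in one process so the three phases are easy to follow.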

Big Data Processing: MapReduce


• Based on Lambda (λ) calculus
• All computational functions and data can be expressed as a series of functions and predicates of functions
• Declarative language rather than imperative
• First-class functions – Functions can be passed as arguments and returned as results, just like values. This also allows currying and partial application.
• Call by name – Function expressions are not evaluated until they are actually used.
• Recursion – Functions are defined in terms of themselves, potentially in an endless loop.
• Immutable state and values – Pure functional programming does not work with mutable variables but with immutable values as they appear at any moment in time. This has big effects on scalability and concurrency.
• Referential transparency – A function call can be replaced by its value, because functions have no side effects.
• Pattern matching – Data type matching as well as data structure composition and deep object type matching

• Erlang, Haskell, Lisp, Clojure, Scala
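Several of the concepts above can be approximated in Python, even though Python is not a pure functional language. The sketch below is illustrative only: the generator stands in for deferred evaluation rather than true call by name, and all names (apply_twice, scale, squares, Event, total) are invented for the example.

```python
from functools import partial
from typing import NamedTuple


# Functions as values: one function is passed into another as an argument.
def apply_twice(fn, value):
    return fn(fn(value))


# Partial application: fix the first argument of a two-argument function.
def scale(factor, x):
    return factor * x


double = partial(scale, 2)


# Deferred evaluation: a generator computes each item only when it is consumed.
def squares(numbers):
    for n in numbers:
        yield n * n


# Immutable values: a NamedTuple cannot be changed in place;
# _replace returns a new value and leaves the original untouched.
class Event(NamedTuple):
    level: str
    count: int


original = Event("ERROR", 1)
updated = original._replace(count=2)


# Recursion: the function is defined in terms of itself, with a base case.
def total(values):
    return 0 if not values else values[0] + total(values[1:])
```

Because total and double depend only on their arguments, a call such as total([1, 2, 3]) can be replaced by its value wherever it appears, which is the referential transparency property from the list.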

What are functional languages?

And MapReduce is Better with Functional Languages


Imperative Model: Pascal, C, Basic, etc.

Evolution (or Devolution?) of Databases


Object Oriented Programming Model: Java, C++, C#


Functional Programming Model: Scala, Clojure, F#


• Because commodity hardware in the cloud is highly elastic, the resources needed to run queries and transactions can be scaled in response to data volumes at the store level.

• Data is stored using the functional programming concept of immutability, by only appending data as point-in-time values.

• MapReduce functions can be balanced and distributed across machines as nodes fail or new nodes are added.

• First-class functions and call by name allow functions and lambda expressions to be passed into MapReduce calls as arguments, so ad-hoc functionality can be added.

• Pattern matching allows very complex pattern matches on complex structures like XML.

• Transactions use functional expressions like compare and swap operations to ensure ACIDity.

• SQL or query expressions can be reduced to MapReduce functions or lambda expressions and/or patterns and distributed in parallel across the nodes.

• Using recursion, complex structures like XML can be mapped and reduced from a single expression.
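Two of the ideas above, append-only storage of point-in-time values and compare-and-swap writes, can be sketched together. This is a toy in-memory model written for illustration, not the design of any particular database, and it shows only the atomic conditional write rather than full ACID transactions. AppendOnlyStore and its methods are hypothetical names.

```python
import threading


class AppendOnlyStore:
    """Hypothetical in-memory store: every write appends a new version."""

    def __init__(self):
        self._versions = {}  # key -> list of values, oldest first
        self._lock = threading.Lock()

    def read(self, key):
        """Return (version, value); version 0 means the key has no value yet."""
        history = self._versions.get(key, [])
        return (len(history), history[-1] if history else None)

    def compare_and_swap(self, key, expected_version, new_value):
        """Append new_value only if nothing was written since expected_version."""
        with self._lock:
            history = self._versions.setdefault(key, [])
            if len(history) != expected_version:
                return False  # another writer got there first; caller retries
            history.append(new_value)
            return True

    def history(self, key):
        """Earlier values are never overwritten, so the full history is readable."""
        return tuple(self._versions.get(key, []))
```

A writer reads the current version, computes a new value, and calls compare_and_swap with the version it read; a False result means it should re-read and try again.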

MapReduce Machine Data: What Do We Need?

• A dynamic process for parsing and mapping unstructured data to structured data in real-time

• Wide range of data formats (text, XML, JSON, CSV, EDI, etc.)

• Need intelligent pattern matching capabilities

• Ability to correlate meaningful transactional data and metrics from disparate data (reducing)

• Machine data is static and immutable. Append-only fast writes with eventual consistency are ideal

• Need fast filter, search, query capabilities to display results
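As a rough illustration of the first and fourth requirements, the sketch below parses free-text log lines into structured records with a regular expression and then correlates them by a shared transaction id. The line layout, the field names and the txn key are assumptions made for the example; real logs would need their own patterns.

```python
import re

# Assumed line layout: "<date> <time> <LEVEL> <service> key=value key=value ..."
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) "
    r"(?P<rest>.*)"
)
PAIR_PATTERN = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return a dict for a matching line, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    rest = record.pop("rest")
    # Promote every key=value pair in the remainder to a top-level field.
    record.update(dict(PAIR_PATTERN.findall(rest)))
    return record


def correlate(records, key="txn"):
    """Group parsed records that share a transaction id (the reducing step)."""
    grouped = {}
    for record in records:
        if key in record:
            grouped.setdefault(record[key], []).append(record)
    return grouped
```

Lines that do not match are returned as None instead of raising, so a stream with mixed formats can be filtered rather than halted.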

Open Source Big Data Landscape

Source: www.bigdata-startup.com

Apache Hadoop: The Elephant in the Room

• What about Apache Hadoop?

• Apache Hadoop comprises HDFS and Hadoop MapReduce, both based on Google’s GFS and MapReduce

• Batch-oriented MapReduce jobs run through Schedulers and JobTrackers

• We require real-time MapReduce processes

• We need to index, query and search data in real-time through a well-defined interface

• We can use it for secondary storage of long-term persistent logs – Lambda Architecture (Batch vs. Speed Layer)
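The batch versus speed layer split mentioned in the last bullet can be pictured with a small sketch: a batch view recomputed over the full archive, a speed view covering events the last batch run has not yet seen, and a query that merges the two. This is a simplified illustration under assumed event shapes, not Hadoop or Storm code; batch_view, speed_view and query are hypothetical names.

```python
from collections import Counter


def batch_view(archived_events):
    """Recomputed periodically over the full archive (the batch layer)."""
    return Counter(event["level"] for event in archived_events)


def speed_view(recent_events):
    """Updated incrementally over events the last batch run has not seen."""
    return Counter(event["level"] for event in recent_events)


def query(batch, speed):
    """Serve a merged answer: batch totals plus real-time increments."""
    return dict(batch + speed)
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, which keeps the real-time state small.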

Apache Storm: Use Real-time MapReduce for Machine Data Streams

• Developed by BackType, which was acquired by Twitter

• Distributed computational framework that provides real-time MapReduce functionality over streams from any data source, using the concepts of Spouts and Bolts

• Reads from any data stream using Spouts (Kafka, JMS, HTTP, etc.)

• Transactional and guaranteed message processing

• Parallelism and scalability

• Fault tolerance (Master-Worker for MapReduce)

• MapReduce topologies

• Offers real-time MapReduce jobs (or Bolts)

• Other tools: Apache Spark
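The spout and bolt idea can be pictured with plain Python generators. The sketch below does not use the Storm API and runs in a single process; it only mirrors the shape of a topology in which a spout emits tuples and each bolt consumes one stream and emits another. The input format ("<service> <LEVEL>") and all function names are assumptions made for the example.

```python
from collections import Counter


def log_spout(lines):
    """Spout: emits raw tuples from a source (here an in-memory list)."""
    for line in lines:
        yield line


def parse_bolt(stream):
    """Bolt: transforms each raw line into a (service, level) tuple."""
    for line in stream:
        parts = line.split()
        if len(parts) >= 2:
            yield (parts[0], parts[1])


def error_filter_bolt(stream):
    """Bolt: passes through only the service names of ERROR tuples."""
    for service, level in stream:
        if level == "ERROR":
            yield service


def count_bolt(stream):
    """Bolt: keeps a running count per service and emits every update."""
    counts = Counter()
    for service in stream:
        counts[service] += 1
        yield dict(counts)


def run_topology(lines):
    """Wire spout -> parse -> filter -> count; return the last emitted counts."""
    result = {}
    for result in count_bolt(error_filter_bolt(parse_bolt(log_spout(lines)))):
        pass
    return result
```

In Storm itself each stage would run as many parallel tasks across worker nodes, with the framework handling routing, acknowledgement and replay; the generator chain shows only the logical flow.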


MapReduce - Declarative and simplicity of functional languages within Storm

Elasticsearch: Distributed Document Search

• Distributed search server engine using Apache Lucene

• It’s a schema-less document store using JSON as its document format. New fields can be added dynamically. All fields are indexed by default

• Uses index shards to distribute queries and searches across the cluster. Queries and searches are run in parallel

• A cluster can host multiple indexes, which can be queried as a group or singly. Index aliases allow indexes to be added or dropped dynamically

• Append-only model using versioning. Writes are very fast, depending on the wait model (wait for all shards to be written, a quorum, or none)

• Well-defined RESTful API interface. Very powerful query features

• Other tools: Apache Solr
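To make the JSON document model and the REST interface more concrete, the sketch below builds two request bodies without contacting a server: a newline-delimited body for the _bulk endpoint and a search body that combines a full-text match with a time-range filter and a terms aggregation. The index name and the field names (message, @timestamp, level) are assumptions, and the request shapes follow recent Elasticsearch documentation, so older releases may expect different syntax.

```python
import json


def bulk_index_body(index_name, documents):
    """Build an NDJSON body for _bulk: an action line, then the document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


def recent_errors_query(text, minutes=15):
    """Build a search body: match on 'message', limited to a recent time range."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}
                ],
            }
        },
        # Assumes 'level' is mapped as a keyword-style field.
        "aggs": {"by_level": {"terms": {"field": "level"}}},
    }
```

The first body would be sent with POST to the _bulk endpoint and the second, serialized with json.dumps, to the _search endpoint of the chosen index.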


Elasticsearch: Distributed Query and searches using index shards and replicas

A Really Cool UI to Show This Off

• Kibana – Works seamlessly with Elasticsearch, queries Elasticsearch directly from JavaScript

• Everything is user driven, very little coding except some configuration settings in YAML

• Very dynamic screen interface

• Screen layout, queries, filters, graphs, histograms are saved directly to Elasticsearch

• Great design and user interface

Putting It In Action: Demo

As a reminder, please submit your questions in the chat box

We will get to as many as possible!

4/1/2014

Daily unique content about content management, user experience, portals and other enterprise information technology solutions across a variety of industries.

Perficient.com/SocialMedia

Facebook.com/Perficient

Twitter.com/Perficient

Thank you for your participation today. Please fill out the survey at the close of this session.
