Creating a Next-Generation Big Data Architecture

Published on 05-Jul-2015

DESCRIPTION

If you've spent time investigating Big Data, you quickly realize that the issues surrounding it are often complex to analyze and solve. Sheer volume, velocity, and variety change the way we think about data, including how enterprises approach data architecture. Significant reductions in the cost of processing, managing, and storing data, combined with the need for business agility and analytics, require CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach that solves the complexities of Big Data. Creating that architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:

  • Why Big Data capabilities must be strategically integrated into an enterprise's data architecture
  • How a next-generation architecture can be conceptualized
  • The key components of a robust next-generation architecture
  • How to incrementally transition to a next-generation data architecture

Transcript

  • 1. Big Data Architectural Series: Creating a Next-Generation Big Data Architecture. facebook.com/perficient | twitter.com/Perficient | linkedin.com/company/perficient

  • 2. About Perficient: Perficient is a leading information technology consulting firm serving clients throughout North America. We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty, and create a more agile enterprise that can better respond to new business opportunities.

  • 3. Perficient Profile: Founded in 1997. Public, NASDAQ: PRFT. 2013 revenue of $373 million. Major market locations: Allentown, Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Lafayette, Minneapolis, New York City, Northern California, Oxford (UK), Philadelphia, Southern California, St. Louis, Toronto, Washington, D.C. Global delivery centers in China and India. More than 2,200 colleagues. Dedicated solution practices. ~90% repeat business rate. Alliance partnerships with major technology vendors. Multiple vendor/industry technology and growth awards.

  • 4. Our Solutions Expertise: Business solutions: Business Intelligence, Business Process Management, Customer Experience and CRM, Enterprise Performance Management, Enterprise Resource Planning, Experience Design (XD), Management Consulting. Technology solutions: Business Integration/SOA, Cloud Services, Commerce, Content Management, Custom Application Development, Education, Information Management, Mobile Platforms, Platform Integration, Portal & Social.

  • 5. Our Speaker: Bill Busch, Sr. Solutions Architect, Enterprise Information Solutions, Perficient. Leads Perficient's enterprise data practice. Specializes in business-enabling BI solutions for the agile enterprise. Responsible for executive data strategy, roadmap development, and the delivery of high-impact solutions that let organizations leverage enterprise data. Bill has over 15 years of experience in executive leadership, business intelligence, data warehousing, data governance, master data management, information/data architecture, and analytics.

  • 6. Perficient's Big Data Architectural Series: Business Case; Next-Generation Architecture (today's webinar); future topics: Data Integration, Stream Processing, NoSQL, SQL on Hadoop, Data Quality, Governance, Use Cases & Case Studies.

  • 7-8. Today's Objectives: Five Architectural Roles for Hadoop; Hadoop Ecosystem: Potential vs. Reality; Realizing a Hadoop-Centric Architecture.

  • 9-10. Three Views of Big Data: "Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." The convergence of structured, unstructured, and dark data. Big Data as the evolution of data, creating data management issues similar to those IT has struggled to address for the last 20+ years.

  • 11. Common Big Data Business Use Cases: Improve Strategic Decision Making; Customer Experience Analysis; Operational Optimization; Risk and Fraud Reduction; Data Monetization; Security Event Detection and Analysis; IT Cost Management.
  • 12. Expanding Data Ecosystem: the use cases above (customer intelligence, operations, risk & fraud, data monetization, strategic development, security intelligence, IT optimization) draw on an ecosystem in which structured data is only 5-20% of the total; the rest comes from sources such as point-of-sale, text messages, contracts and regulatory documents, preferences and emotions, security access, weather, machine data, automobiles, mobile communications, geospatial data, and social data.

  • 13. Enterprise Data Architecture: Next Generation (architecture diagram).

  • 14. The Promise, Data Architecture Simplification: data integration, data hub, analytics, stream processing, data warehouse, and operational data consolidated on a single Hadoop cluster.

  • 15. The Reality, Maturity Limits the Use Cases: to realize the potential of Hadoop, recognize that multi-tenancy is in its infancy (Hadoop 2.0 and YARN; most third-party applications are only now moving to YARN), that Hive and other SQL-on-Hadoop solutions are still maturing, and that robust enterprise functionality such as security and high availability is still evolving.

  • 16. Choosing a Hadoop Distribution: there are different types of open source offerings: Apache projects only; proprietary value-add and re-development; Apache projects plus proprietary add-ons; and packaged and online solutions (IBM BigInsights, Oracle Big Data Appliance, HDInsight, and many others). Selection criteria: company philosophy, current relationships, acceptable risk, and specialized functionality.

  • 17. Quick Primer on YARN: Yet Another Resource Negotiator, sometimes referred to as MapReduce 2.0; a data operating system with fault tolerance. Why it matters: it enables multi-tenancy on Hadoop and moves processing to the data. (Image provided by Hortonworks.)

  • 18. Today's Objectives (repeated as a section divider).

  • 19. Five Common Architectural Roles: Hadoop serves Big Data use cases as an analytics platform, a data warehouse, a stream processor, a data factory, and a transactional data store.

  • 20. Enterprise Data Architecture: Next Generation (architecture diagram, revisited).

  • 21. Five Common Architectural Roles (repeated).

  • 22-23. Analytical Processing: a four-step process: 1. Source, 2. Wrangle Data, 3. Model & Tune, 4. Operationalize, mapped to architectural capabilities: data ingestion, metadata management, data access, data preparation tools, data discovery & visualization, data wrangling tools, business glossary & search, analytical tools, analytical sandbox, business-created reporting, model execution & management, and knowledge management (portal). (A minimal wrangling sketch follows this list.)

  • 24. Data Access: there are many methods of accessing Big Data: direct HDFS, NoSQL/connector, and Hive or other SQL on Hadoop. Align tools to access methods and file types. The flow runs from source files/data through a data preparation tool to tidy data, then through an analytics tool to an analytical result, with read and write access against the Hadoop cluster. (A data-access sketch also follows this list.)

  • 25. Five Common Architectural Roles (repeated).
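The "wrangle" step on slides 22-23 is easier to picture with a concrete example. A minimal pandas sketch, with invented column names, that reshapes a wide source extract into the tidy one-row-per-observation layout most analytical tools expect:

```python
# Tiny pandas sketch of the "wrangle" step: wide extract -> tidy data.
# The store/sales columns are made up for illustration.
import pandas as pd

wide = pd.DataFrame({
    "store_id": [1, 2],
    "sales_2013": [120.0, 90.5],
    "sales_2014": [130.0, 88.0],
})

# melt() reshapes the year columns into (store_id, year, sales) rows,
# the tidy layout downstream discovery/visualization tools consume.
tidy = wide.melt(id_vars="store_id", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].str.replace("sales_", "").astype(int)
print(tidy)
```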
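Slide 24's access methods can also be shown side by side in a few lines of PySpark. This is a sketch under assumed HDFS paths and Hive table names, not the webinar's own code; the NoSQL/connector path is vendor-specific and omitted:

```python
# Sketch of two of slide 24's access methods: direct HDFS reads and
# Hive/SQL on Hadoop, ending with a write back to the cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-access-sketch")
         .enableHiveSupport()   # required for the Hive access path
         .getOrCreate())

# 1. Direct HDFS access: read raw files straight off the cluster
#    (path and columns are illustrative assumptions).
raw = spark.read.csv("hdfs:///data/raw/pos_sales/",
                     header=True, inferSchema=True)

# 2. Hive / SQL on Hadoop: query a table registered in the metastore.
dims = spark.sql("SELECT store_id, region FROM common.store_dim")

# 3. Produce "tidy data" for an analytics tool and write it back to
#    the cluster, matching the read/write flow on the slide.
tidy = raw.join(dims, "store_id").groupBy("region").sum("amount")
tidy.write.mode("overwrite").parquet("hdfs:///data/published/sales_by_region/")
```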
  • 26. Data Warehouse Roles: two models for splitting processing between the Hadoop cluster and a traditional DW/DM: a hot/cold split (cold data on the Hadoop cluster, hot data in the traditional DW/DM) and a data warehouse layer approach. Push high user loads to traditional data warehouses, fully investigate DW-Hadoop connector functionality, and leverage the opportunity to use in-memory database solutions.

  • 27. Data Warehouse, Organize Your Data: decide which types of data are stored on the cluster and provide analytical sandboxes (team and individual, with quotas); Hadoop has the potential to replace information lifecycle management solutions. There is no single right answer, so clearly define usage. Typical zones on the cluster: raw data (consolidated data, streaming queues, incremental deltas), processed data (common data such as dimensions and master data; improved/modeled data; published, analytical, and aggregate data), a sandbox zone, and archived data. (A zone-layout sketch follows this list.)

  • 28. Five Common Architectural Roles (repeated).

  • 29. Stream and Event Processing: design considerations include dedicated vs. shared model; persistence of messages, logs, etc.; long-term storage; queuing; pre-load (HDFS) vs. post-load processing; micro-batch vs. one-at-a-time processing; programming language support; and the processing guarantee (at most once, at least once, exactly once). Let business requirements drive the need for streaming solutions; it is acceptable to use more than one solution as long as the roles and purposes of each are clearly defined. (A consumer sketch follows this list.)

  • 30. Five Common Architectural Roles (repeated).

  • 31. The Data Integration Challenge: volume, variety, and velocity create unique challenges for data integration; 10,000+ unique entities (or file groups) may have to be managed, while batch windows stay the same or shrink. Key point: Hadoop and Hadoop-related technologies can address these challenges, but they must be architected and governed properly.

  • 32. Data Factory & Integration: three approaches to Big Data integration: (1) tools included in the Hadoop distribution plus programming languages (Sqoop, Flume, Spark, Java, and MapReduce are examples), which can be hand-coded/scripted, runtime-configured, or generated; (2) commercial data integration packages (IBM InfoSphere BigInsights and Informatica are examples); and (3) a hybrid that leverages both Hadoop and COTS tools based on the use case. Key questions: where is processing taking place, and does the tool use the YARN resource manager?

  • 33. Define Pipelines and Stages: sources (cloud sources, RDBMS, FTP, packaged tools, object DBMS, log data, stream/message bus) are extracted with tools such as Sqoop, Kafka, Storm, a file hub, or an ETL tool; loaded and formatted into HDFS; scraped and normalized (MCF, Storm); cleansed, aggregated, and transformed (packaged ETL tool, Storm); and then distributed for data access (RDBMS/DW/in-memory DB, Hive, HBase, file extracts, NoSQL, stream output, message bus, custom code, Sqoop, ETL tools).
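One way to stand up slide 27's zones is with the stock HDFS command line. A sketch assuming a /data root and an illustrative 1 TB sandbox quota, neither of which comes from the webinar:

```python
# Create the zone layout from slide 27 with the standard HDFS CLI.
import subprocess

ZONES = [
    "/data/raw",        # consolidated data, streaming queues, deltas
    "/data/processed",  # common/modeled data, published aggregates
    "/data/sandbox",    # team and individual analytical sandboxes
    "/data/archive",    # archived data
]

for zone in ZONES:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", zone], check=True)

# A space quota keeps sandboxes from crowding out the managed zones
# (slide 27's "quotas" bullet); 1t is one terabyte of raw space.
subprocess.run(
    ["hdfs", "dfsadmin", "-setSpaceQuota", "1t", "/data/sandbox"],
    check=True,
)
```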
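Slide 29's processing guarantees come down to when offsets are committed. A sketch of at-least-once, micro-batch consumption using the kafka-python client; the topic, broker address, and process() body are illustrative assumptions:

```python
# At-least-once micro-batch consumption (slide 29) with kafka-python.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                    # hypothetical topic
    bootstrap_servers="broker:9092",  # hypothetical broker
    group_id="stream-demo",
    enable_auto_commit=False,         # we commit offsets ourselves
)

def process(records):
    for rec in records:
        print(rec.offset, rec.value)  # stand-in for real work

while True:
    batch = consumer.poll(timeout_ms=1000)  # one micro-batch
    for _, records in batch.items():
        process(records)
    # Committing only after processing yields at-least-once semantics:
    # a crash before commit re-reads and re-processes the batch.
    # Committing before processing would flip this to at-most-once.
    consumer.commit()
```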
  • 34. Big Data Integration Framework, typical services. Key guidance: in lieu of an ETL product, consider building a Big Data integration framework; Apache Falcon provides pipeline management; focus on making all components run-time configurable with metadata; this can offer significant cost savings over the long run. Typical services: load utility, metadata collection, pipeline config files, pipeline utilities (delimiter parser, data standardization, Hive publishing, MF coding converters, file joiner and transport, logging, checksum, retention, replication, late-arriving data handling, exception handling), a pipeline master (e.g., Falcon), DB copy, archival, audit, Sqoop, Flume, and the HDFS shell. (A metadata-driven sketch follows this list.)

  • 35. Five Common Architectural Roles (repeated).

  • 36. SQL on Hadoop: SQL on Hadoop is changing. Historically it focused on read functionality for analytics; a new breed of SQL-on-Hadoop engines supports BI and operational reporting as well as transaction processing. (Image provided by Splice Machine.)

  • 37. Transactions in Hive (diagram slide; a Hive ACID sketch follows this list).

  • 38. Today's Objectives (repeated as a section divider).

  • 39. Common Big Data Business Use Cases (repeated from slide 11).

  • 40. Architectural Scenarios: a matrix mapping each business use case to the architectural roles (Analytics, Data Warehouse, Stream Processing, Data Factory, Transactional Data Store*), where P marks a primary use case and s a secondary one. Strategic Decision Making: P, s. Customer Experience: P, s, P, s. Operational Optimization: P, s, s, s. Risk and Fraud Reduction: P, s, P. Data Monetization: s, s, P. Security Event Detection and Analysis: P, s, s, s. IT Cost Management: P, s, P, P. (*The transactional data store capability is just emerging within the Hadoop ecosystem; consider it only for isolated business cases and early adopters.)

  • 41. Integrating Hadoop into the Enterprise: determine business use cases; understand current tools and architecture; align business use case priorities; build the roadmap; specify the solution architecture; update and maintain the roadmap; implement the roadmap.

  • 42. Final Thoughts. Do: match the business use case to the Big Data role; clearly define a roadmap; establish clear architectural standards to drive consistency and re-use of resources; do your homework when defining a solution architecture. Don't: select an initial use case that relies on immature Hadoop functionality; leverage tools that move data off the cluster for processing and then store it back on the cluster; assume all Hadoop technologies integrate well together.

  • 43. As a reminder, please submit your questions in the chat box. We will get to as many as possible.

  • 44. Perficient.com/SocialMedia: daily unique content about content management, user experience, portals, and other enterprise information technology solutions across a variety of industries. Facebook.com/Perficient | Twitter.com/Perficient.

  • 45. Thank you for your participation today. Please fill out the survey at the close of this session.
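Slide 34's central idea, pipelines configured by metadata rather than hand-coded, can be sketched in a few lines. Everything here (the JSON layout, the field names, and the choice of Sqoop as the mover) is an assumption for illustration; a real framework would hand these settings to Sqoop, Flume, or a Falcon-managed pipeline:

```python
# Metadata-driven load utility: the pipeline is defined by a config
# file, so onboarding a new entity means editing metadata, not code.
import json
import subprocess

def run_pipeline(config_path):
    with open(config_path) as f:
        cfg = json.load(f)              # hypothetical pipeline config

    for entity in cfg["entities"]:      # scales to thousands of entities
        # Delegate the actual data movement to a stock Hadoop tool
        # (Sqoop for RDBMS sources), keyed entirely off metadata.
        subprocess.run([
            "sqoop", "import",
            "--connect", cfg["jdbc_url"],
            "--table", entity["table"],
            "--target-dir", f"/data/raw/{entity['name']}",
        ], check=True)

# run_pipeline("pipelines/sales.json")
```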
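Slide 37's Hive transactions look roughly like this in practice. A sketch using PyHive against a hypothetical HiveServer2 host; it assumes an ACID-enabled metastore and a transactional ORC table, per the Hive 0.14-era requirements, and is not the webinar's own example:

```python
# Row-level transactions in Hive (slide 37) via PyHive.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# ACID in this era requires a bucketed ORC table flagged transactional.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT, status STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true')
""")

# With ACID enabled, Hive accepts row-level changes, not just appends.
cur.execute("INSERT INTO orders VALUES (1, 'new')")
cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
```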