Creating a Next-Generation Big Data Architecture

Published on 05-Jul-2015

DESCRIPTION

If you've spent time investigating Big Data, you quickly realize that the issues surrounding it are often complex to analyze and solve. Sheer volume, velocity, and variety change the way we think about data, including how enterprises approach data architecture. Significant reductions in the cost of processing, managing, and storing data, combined with the need for business agility and analytics, require CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach that solves the complexities of Big Data. Creating that architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:

  • Why Big Data capabilities must be strategically integrated into an enterprise's data architecture
  • How a next-generation architecture can be conceptualized
  • The key components of a robust next-generation architecture
  • How to incrementally transition to a next-generation data architecture

Transcript

  • 1. Big Data Architectural Series: Creating a Next-Generation Big Data Architecture. facebook.com/perficient | twitter.com/Perficient | linkedin.com/company/perficient

  • 2. About Perficient: Perficient is a leading information technology consulting firm serving clients throughout North America. We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty, and create a more agile enterprise that can better respond to new business opportunities.

  • 3. Perficient Profile: Founded in 1997. Public, NASDAQ: PRFT. 2013 revenue of $373 million. Major market locations: Allentown, Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Lafayette, Minneapolis, New York City, Northern California, Oxford (UK), Philadelphia, Southern California, St. Louis, Toronto, Washington, D.C. Global delivery centers in China and India. More than 2,200 colleagues. Dedicated solution practices. ~90% repeat business rate. Alliance partnerships with major technology vendors. Multiple vendor/industry technology and growth awards.

  • 4. Our Solutions Expertise: Business solutions: Business Intelligence, Business Process Management, Customer Experience and CRM, Enterprise Performance Management, Enterprise Resource Planning, Experience Design (XD), Management Consulting. Technology solutions: Business Integration/SOA, Cloud Services, Commerce, Content Management, Custom Application Development, Education, Information Management, Mobile Platforms, Platform Integration, Portal & Social.

  • 5. Our Speaker: Bill Busch, Sr. Solutions Architect, Enterprise Information Solutions, Perficient. Leads Perficient's enterprise data practice. Specializes in business-enabling BI solutions for the agile enterprise. Responsible for executive data strategy, roadmap development, and the delivery of high-impact solutions that let organizations leverage enterprise data. Bill has over 15 years of experience in executive leadership, business intelligence, data warehousing, data governance, master data management, information/data architecture, and analytics.

  • 6. Perficient's Big Data Architectural Series: Business Case; Next-Generation Architecture (today's webinar); future topics: Data Integration, Stream Processing, NoSQL, SQL on Hadoop, Data Quality, Governance, Use Cases & Case Studies.

  • 7-8. Today's Objectives: Five Architectural Roles for Hadoop; Hadoop Ecosystem: Potential vs. Reality; Realizing a Hadoop-Centric Architecture.

  • 9-10. Three Views of Big Data: "Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." The convergence of structured, unstructured, and dark data. Big Data as the evolution of data, creating data management issues similar to those IT has struggled to address for the last 20+ years.

  • 11. Common Big Data Business Use Cases: Improve Strategic Decision Making; Customer Experience Analysis; Operational Optimization; Risk and Fraud Reduction; Data Monetization; Security Event Detection and Analysis; IT Cost Management.
  • 12. Expanding Data Ecosystem: the use cases above (customer intelligence, operations, risk & fraud, data monetization, strategic development, security intelligence, IT optimization) draw on an ecosystem in which structured data is only 5-20% of the total; the rest comes from sources such as point-of-sale, text messages, contracts and regulatory documents, preferences and emotions, security access, weather, machine data, automobiles, mobile communications, geospatial data, and social data.

  • 13. Enterprise Data Architecture: Next Generation (architecture diagram).

  • 14. The Promise, Data Architecture Simplification: data integration, data hub, analytics, stream processing, data warehouse, and operational data consolidated on a single Hadoop cluster.

  • 15. The Reality, Maturity Limits the Use Cases: to realize the potential of Hadoop, recognize that multi-tenancy is in its infancy (Hadoop 2.0 and YARN; most third-party applications are only now moving to YARN), that Hive and other SQL-on-Hadoop solutions are still maturing, and that robust enterprise functionality such as security and high availability is still evolving.

  • 16. Choosing a Hadoop Distribution: there are different types of open source offerings: Apache projects only; proprietary value-add and re-development; Apache projects plus proprietary add-ons; and packaged and online solutions (IBM BigInsights, Oracle Big Data Appliance, HDInsight, and many others). Selection criteria: company philosophy, current relationships, acceptable risk, and specialized functionality.

  • 17. Quick Primer on YARN: Yet Another Resource Negotiator, sometimes referred to as MapReduce 2.0; a data operating system with fault tolerance. Why it matters: it enables multi-tenancy on Hadoop and moves processing to the data. (Image provided by Hortonworks.)

  • 18. Today's Objectives (repeated as a section divider).

  • 19. Five Common Architectural Roles: Hadoop serves Big Data use cases as an analytics platform, a data warehouse, a stream processor, a data factory, and a transactional data store.

  • 20. Enterprise Data Architecture: Next Generation (architecture diagram, revisited).

  • 21. Five Common Architectural Roles (repeated).

  • 22-23. Analytical Processing: a four-step process: 1. Source, 2. Wrangle Data, 3. Model & Tune, 4. Operationalize, mapped to architectural capabilities: data ingestion, metadata management, data access, data preparation tools, data discovery & visualization, data wrangling tools, business glossary & search, analytical tools, analytical sandbox, business-created reporting, model execution & management, and knowledge management (portal). (A minimal wrangling sketch follows this list.)

  • 24. Data Access: there are many methods of accessing Big Data: direct HDFS, NoSQL/connector, and Hive or other SQL on Hadoop. Align tools to access methods and file types. The flow runs from source files/data through a data preparation tool to tidy data, then through an analytics tool to an analytical result, with read and write access against the Hadoop cluster. (A data-access sketch also follows this list.)

  • 25. Five Common Architectural Roles (repeated).
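The "wrangle" step on slides 22-23 is easier to picture with a concrete example. A minimal pandas sketch, with invented column names, that reshapes a wide source extract into the tidy one-row-per-observation layout most analytical tools expect:

```python
# Tiny pandas sketch of the "wrangle" step: wide extract -> tidy data.
# The store/sales columns are made up for illustration.
import pandas as pd

wide = pd.DataFrame({
    "store_id": [1, 2],
    "sales_2013": [120.0, 90.5],
    "sales_2014": [130.0, 88.0],
})

# melt() reshapes the year columns into (store_id, year, sales) rows,
# the tidy layout downstream discovery/visualization tools consume.
tidy = wide.melt(id_vars="store_id", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].str.replace("sales_", "").astype(int)
print(tidy)
```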
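Slide 24's access methods can also be shown side by side in a few lines of PySpark. This is a sketch under assumed HDFS paths and Hive table names, not the webinar's own code; the NoSQL/connector path is vendor-specific and omitted:

```python
# Sketch of two of slide 24's access methods: direct HDFS reads and
# Hive/SQL on Hadoop, ending with a write back to the cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-access-sketch")
         .enableHiveSupport()   # required for the Hive access path
         .getOrCreate())

# 1. Direct HDFS access: read raw files straight off the cluster
#    (path and columns are illustrative assumptions).
raw = spark.read.csv("hdfs:///data/raw/pos_sales/",
                     header=True, inferSchema=True)

# 2. Hive / SQL on Hadoop: query a table registered in the metastore.
dims = spark.sql("SELECT store_id, region FROM common.store_dim")

# 3. Produce "tidy data" for an analytics tool and write it back to
#    the cluster, matching the read/write flow on the slide.
tidy = raw.join(dims, "store_id").groupBy("region").sum("amount")
tidy.write.mode("overwrite").parquet("hdfs:///data/published/sales_by_region/")
```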
  • 26. Data Warehouse Roles: two models for splitting processing between the Hadoop cluster and a traditional DW/DM: a hot/cold split (cold data on the Hadoop cluster, hot data in the traditional DW/DM) and a data warehouse layer approach. Push high user loads to traditional data warehouses, fully investigate DW-Hadoop connector functionality, and leverage the opportunity to use in-memory database solutions.

  • 27. Data Warehouse, Organize Your Data: decide which types of data are stored on the cluster and provide analytical sandboxes (team and individual, with quotas); Hadoop has the potential to replace information lifecycle management solutions. There is no single right answer, so clearly define usage. Typical zones on the cluster: raw data (consolidated data, streaming queues, incremental deltas), processed data (common data such as dimensions and master data; improved/modeled data; published, analytical, and aggregate data), a sandbox zone, and archived data. (A zone-layout sketch follows this list.)

  • 28. Five Common Architectural Roles (repeated).

  • 29. Stream and Event Processing: design considerations include dedicated vs. shared model; persistence of messages, logs, etc.; long-term storage; queuing; pre-load (HDFS) vs. post-load processing; micro-batch vs. one-at-a-time processing; programming language support; and the processing guarantee (at most once, at least once, exactly once). Let business requirements drive the need for streaming solutions; it is acceptable to use more than one solution as long as the roles and purposes of each are clearly defined. (A consumer sketch follows this list.)

  • 30. Five Common Architectural Roles (repeated).

  • 31. The Data Integration Challenge: volume, variety, and velocity create unique challenges for data integration; 10,000+ unique entities (or file groups) may have to be managed, while batch windows stay the same or shrink. Key point: Hadoop and Hadoop-related technologies can address these challenges, but they must be architected and governed properly.

  • 32. Data Factory & Integration: three approaches to Big Data integration: (1) tools included in the Hadoop distribution plus programming languages (Sqoop, Flume, Spark, Java, and MapReduce are examples), which can be hand-coded/scripted, runtime-configured, or generated; (2) commercial data integration packages (IBM InfoSphere BigInsights and Informatica are examples); and (3) a hybrid that leverages both Hadoop and COTS tools based on the use case. Key questions: where is processing taking place, and does the tool use the YARN resource manager?

  • 33. Define Pipelines and Stages: sources (cloud sources, RDBMS, FTP, packaged tools, object DBMS, log data, stream/message bus) are extracted with tools such as Sqoop, Kafka, Storm, a file hub, or an ETL tool; loaded and formatted into HDFS; scraped and normalized (MCF, Storm); cleansed, aggregated, and transformed (packaged ETL tool, Storm); and then distributed for data access (RDBMS/DW/in-memory DB, Hive, HBase, file extracts, NoSQL, stream output, message bus, custom code, Sqoop, ETL tools).
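One way to stand up slide 27's zones is with the stock HDFS command line. A sketch assuming a /data root and an illustrative 1 TB sandbox quota, neither of which comes from the webinar:

```python
# Create the zone layout from slide 27 with the standard HDFS CLI.
import subprocess

ZONES = [
    "/data/raw",        # consolidated data, streaming queues, deltas
    "/data/processed",  # common/modeled data, published aggregates
    "/data/sandbox",    # team and individual analytical sandboxes
    "/data/archive",    # archived data
]

for zone in ZONES:
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", zone], check=True)

# A space quota keeps sandboxes from crowding out the managed zones
# (slide 27's "quotas" bullet); 1t is one terabyte of raw space.
subprocess.run(
    ["hdfs", "dfsadmin", "-setSpaceQuota", "1t", "/data/sandbox"],
    check=True,
)
```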
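Slide 29's processing guarantees come down to when offsets are committed. A sketch of at-least-once, micro-batch consumption using the kafka-python client; the topic, broker address, and process() body are illustrative assumptions:

```python
# At-least-once micro-batch consumption (slide 29) with kafka-python.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                    # hypothetical topic
    bootstrap_servers="broker:9092",  # hypothetical broker
    group_id="stream-demo",
    enable_auto_commit=False,         # we commit offsets ourselves
)

def process(records):
    for rec in records:
        print(rec.offset, rec.value)  # stand-in for real work

while True:
    batch = consumer.poll(timeout_ms=1000)  # one micro-batch
    for _, records in batch.items():
        process(records)
    # Committing only after processing yields at-least-once semantics:
    # a crash before commit re-reads and re-processes the batch.
    # Committing before processing would flip this to at-most-once.
    consumer.commit()
```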
  • 34. Big Data Integration Framework, typical services. Key guidance: in lieu of an ETL product, consider building a Big Data integration framework; Apache Falcon provides pipeline management; focus on making all components run-time configurable with metadata; this can offer significant cost savings over the long run. Typical services: load utility, metadata collection, pipeline config files, pipeline utilities (delimiter parser, data standardization, Hive publishing, MF coding converters, file joiner and transport, logging, checksum, retention, replication, late-arriving data handling, exception handling), a pipeline master (e.g., Falcon), DB copy, archival, audit, Sqoop, Flume, and the HDFS shell. (A metadata-driven sketch follows this list.)

  • 35. Five Common Architectural Roles (repeated).

  • 36. SQL on Hadoop: SQL on Hadoop is changing. Historically it focused on read functionality for analytics; a new breed of SQL-on-Hadoop engines supports BI and operational reporting as well as transaction processing. (Image provided by Splice Machine.)

  • 37. Transactions in Hive (diagram slide; a Hive ACID sketch follows this list).

  • 38. Today's Objectives (repeated as a section divider).

  • 39. Common Big Data Business Use Cases (repeated from slide 11).

  • 40. Architectural Scenarios: a matrix mapping each business use case to the architectural roles (Analytics, Data Warehouse, Stream Processing, Data Factory, Transactional Data Store*), where P marks a primary use case and s a secondary one. Strategic Decision Making: P, s. Customer Experience: P, s, P, s. Operational Optimization: P, s, s, s. Risk and Fraud Reduction: P, s, P. Data Monetization: s, s, P. Security Event Detection and Analysis: P, s, s, s. IT Cost Management: P, s, P, P. (*The transactional data store capability is just emerging within the Hadoop ecosystem; consider it only for isolated business cases and early adopters.)

  • 41. Integrating Hadoop into the Enterprise: determine business use cases; understand current tools and architecture; align business use case priorities; build the roadmap; specify the solution architecture; update and maintain the roadmap; implement the roadmap.

  • 42. Final Thoughts. Do: match the business use case to the Big Data role; clearly define a roadmap; establish clear architectural standards to drive consistency and re-use of resources; do your homework when defining a solution architecture. Don't: select an initial use case that relies on immature Hadoop functionality; leverage tools that move data off the cluster for processing and then store it back on the cluster; assume all Hadoop technologies integrate well together.

  • 43. As a reminder, please submit your questions in the chat box. We will get to as many as possible.

  • 44. Perficient.com/SocialMedia: daily unique content about content management, user experience, portals, and other enterprise information technology solutions across a variety of industries. Facebook.com/Perficient | Twitter.com/Perficient.

  • 45. Thank you for your participation today. Please fill out the survey at the close of this session.
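Slide 34's central idea, pipelines configured by metadata rather than hand-coded, can be sketched in a few lines. Everything here (the JSON layout, the field names, and the choice of Sqoop as the mover) is an assumption for illustration; a real framework would hand these settings to Sqoop, Flume, or a Falcon-managed pipeline:

```python
# Metadata-driven load utility: the pipeline is defined by a config
# file, so onboarding a new entity means editing metadata, not code.
import json
import subprocess

def run_pipeline(config_path):
    with open(config_path) as f:
        cfg = json.load(f)              # hypothetical pipeline config

    for entity in cfg["entities"]:      # scales to thousands of entities
        # Delegate the actual data movement to a stock Hadoop tool
        # (Sqoop for RDBMS sources), keyed entirely off metadata.
        subprocess.run([
            "sqoop", "import",
            "--connect", cfg["jdbc_url"],
            "--table", entity["table"],
            "--target-dir", f"/data/raw/{entity['name']}",
        ], check=True)

# run_pipeline("pipelines/sales.json")
```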
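Slide 37's Hive transactions look roughly like this in practice. A sketch using PyHive against a hypothetical HiveServer2 host; it assumes an ACID-enabled metastore and a transactional ORC table, per the Hive 0.14-era requirements, and is not the webinar's own example:

```python
# Row-level transactions in Hive (slide 37) via PyHive.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# ACID in this era requires a bucketed ORC table flagged transactional.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT, status STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true')
""")

# With ACID enabled, Hive accepts row-level changes, not just appends.
cur.execute("INSERT INTO orders VALUES (1, 'new')")
cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
```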