

VIEW POINT

By Ezra Uzosike, Senior Manager, Big Data & Analytics Platforms Digital Accelerator at Stanley Black & Decker, Inc.

T. Madasamy, Associate Vice President - Senior Principal Technology Architect at Data & Analytics CTO, Infosys Limited

Arpit Garg, Senior Technologist, Cloud & Big Data Analytics at Strategic Technology Group, Infosys Limited

Jignesh Desai, Partner Solutions Architect at AWS

APN PREMIER PARTNER INFOSYS USES AWS TO PROVIDE A META DATA DRIVEN BOUNDARYLESS DATA LAKE SOLUTION TO ITS CUSTOMER - STANLEY BLACK & DECKER, INC.

Page 2: APN Premier Partner Infosys Uses AWS to Provide a Meta ... · Introduction . With the vast volume of data, coming at ... Elasticsearch service, business intelligence insights analysis

External Document © 2018 Infosys Limited

Introduction: With vast volumes of data arriving at great velocity and growing exponentially each day, organizations need to rethink ways to create and drive business insights without boundaries. With this objective in mind, we have designed and built a simplified metadata driven boundaryless data lake on Amazon Web Services at Stanley Black & Decker, Inc.

Stanley Black & Decker, Inc., an S&P 500 Company (SWK), is a worldwide manufacturer and marketer of Hand and Power Tools, hardware, security solutions and specialty hardware products for home improvement, consumer, industrial and professional use. With its 162+ year history, Stanley Black & Decker is one of the world’s most trusted names, synonymous with high quality, value and innovation.

This point of view outlines the benefits of a simplified Meta Data Driven Boundaryless Data Lake Solution on Amazon Web Services and defines an architecture for:

• data ingestion

• dynamic and static metadata stores on the AWS DynamoDB non-relational database service

• data storage on the AWS Simple Storage Service in multiple layers (AWS S3 buckets, i.e. a landing or data-intake layer, a raw data layer, and a curated/aggregated layer)

• data validation & transformation

• data aggregation & curation using the AWS EMR managed Hadoop framework with Hive/Spark/Scala etc.

• data governance

• metadata discovery, indexing & search using Kibana on the AWS Elasticsearch service

• business intelligence insights analysis using QlikView & a data lake operations & health dashboard using the AWS QuickSight service

Quick Insight about Data Lakes: A data lake is a storage repository, similar to a real lake fed by multiple tributaries (i.e. data sources). A data lake can have multiple data sources with structured, semi-structured & unstructured data.

• Structured Data - Data that are highly normalized with a proper schema & data dictionary. These data are easily accessible using data extraction tools.

• Semi-structured Data - Data without a predefined schema, often stored in NoSQL databases as JSON or XML. These data are easily accessible but require some preparation to make them ready for data science.

• Unstructured Data - Data without a data model, such as text documents (e.g. email), pictures, audio and video.

Data Lake Infrastructure: A data lake infrastructure can provide a solid foundation for storing massive amounts of data in a central repository so it is readily available to be categorized, processed, enriched, and consumed by diverse groups within an organization such as business users, data scientists etc.

For this solution, we have built AWS CloudFormation templates & custom accelerators written in Java using the AWS SDK for Java APIs, which can run anywhere (local machine, cloud machine or AWS EMR) to stand up infrastructure on AWS. The process is fully automated and lightweight, with loosely coupled and fully customizable code.
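The CloudFormation side of this automation can be pictured with a minimal sketch. The bucket name and table schema below are illustrative stand-ins, not the project's actual templates:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal sketch - one data lake zone bucket and one metadata table
Resources:
  LandingZoneBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: datalakelandingdatazone-example   # hypothetical name
  DatasetConfigTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: dl_dataset_config
      AttributeDefinitions:
        - AttributeName: source-system-code
          AttributeType: S
        - AttributeName: source-dataset-code
          AttributeType: S
      KeySchema:
        - AttributeName: source-system-code
          KeyType: HASH
        - AttributeName: source-dataset-code
          KeyType: RANGE
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5
```

In the actual solution, such templates are complemented by the custom Java accelerator, which bulk-creates buckets, folders and configuration tables from a spreadsheet template.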

The following AWS services and tools were used in this solution:

• Amazon Simple Storage Service (S3) as the primary storage platform/repository, from the landing zone through the raw, curated and consolidated/aggregated zones, with a proper folder structure in the S3 buckets according to the data sets of each data source

• Amazon DynamoDB as Meta Data Store

• AWS EMR for ‘Compute on Demand’ and data access/provisioning capability. Spark/Scala/Hive/Sqoop/Presto/Hue/Zeppelin/Jupyter etc. are used to perform data validation/transformation/processing and analytics, and to process denormalized data sets for dashboards on the cluster, exporting them to an RDS database to avoid any heavy BI reporting load in QlikView applications

• AWS EC2 instance(s) with R Shiny (a data scientist ‘workstation’ in the cloud), granting data scientists access to the data in all layers (AWS S3) to perform analytics

• AWS Elasticsearch with Kibana, used for data discovery through elastic search, enabling users to find the data in all layers

• Airflow on AWS EC2 instance(s): a job scheduler engine to submit data validation/transformation/processing/aggregation jobs to AWS EMR clusters without the help of IT
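As a rough sketch of how such a scheduler hands work to EMR (the job name and script path below are hypothetical, and the actual pipeline may use Airflow's EMR operators instead), an EMR step for a Spark job is just a small dictionary submitted via the EMR API:

```python
# Sketch of the EMR step definition an Airflow DAG could submit via boto3's
# emr.add_job_flow_steps() or Airflow's EmrAddStepsOperator to run a
# data validation/transformation Spark job. Names and paths are illustrative.

def build_spark_step(job_name, jar_or_script, extra_args=None):
    """Build one EMR 'spark-submit' step definition as a plain dict."""
    args = ["spark-submit", "--deploy-mode", "cluster", jar_or_script]
    args += list(extra_args or [])
    return {
        "Name": job_name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive on job failure
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

step = build_spark_step(
    "aec-data-validation",                     # hypothetical job name
    "s3://datalakeadmin/scripts/validate.py",  # hypothetical script path
    ["--source-system", "aec"],
)
```

An Airflow DAG would chain such validation, transformation and aggregation steps and submit them to the cluster on a schedule.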

The Meta Data Driven Boundaryless Data Lake infrastructure is created using AWS CloudFormation templates & custom AWS SDK Java based APIs in the different environments, i.e. Test/Dev/PROD. The templates are customizable in nature.

Key theme of the post: Infosys helps a global manufacturing leader to achieve ‘Zero Distance to Analytics’ by using a Meta Data Driven Boundaryless Data Lake Solution on Amazon Web Services.

Data Lake Essential Components & Critical Capabilities MVP (Minimum Viable Product):


Figure 1 – Meta Data Driven Boundaryless Data lake Solution Infrastructure Automation on AWS Cloud

Meta Data Store (Meta Data & Data Catalog in AWS DynamoDB): Metadata, or information about data, allows the understanding of data lineage, quality, and lifecycle, and provides crucial visibility into different data sources and data sets.

Metadata falls into two categories: static and dynamic.

i. Static metadata captures the format and total number of data sets from the different sources, as well as the schema structure for each data set. The data dictionaries/schemas for each data set were collated, enriched with additional flags, and inserted into AWS DynamoDB tables using AWS SDK Java APIs in an automated fashion.

ii. Dynamic metadata captures and catalogs data processing metrics such as the total number of records extracted from the data source for each data set, the start and end dates of the data, and data integrity via an MD5 checksum. This dynamic metadata is generated from source-system batch metadata pushed along with the data files, as .csv files, to the designated AWS S3 landing zone. An S3 event trigger fires an AWS Lambda function to validate and process this dynamic metadata into an AWS DynamoDB table for each data set and data source.
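A minimal sketch of that checksum/row-count validation, assuming a simplified single-record metadata .csv (the field names mirror the sample metadata table below; the helper functions themselves are hypothetical):

```python
import csv
import hashlib
import io

def parse_meta_csv(text):
    """Parse a dynamic-metadata .csv file (one record per data set)."""
    return list(csv.DictReader(io.StringIO(text)))

def validate_file(meta, data_bytes):
    """Check data integrity: MD5 checksum and row count vs. one metadata record."""
    errors = []
    if hashlib.md5(data_bytes).hexdigest() != meta["MD5Checksum"]:
        errors.append("checksum mismatch")
    rows = data_bytes.decode(meta.get("Encoding", "UTF-8")).splitlines()
    if len(rows) != int(meta["Row_count"]):
        errors.append("row count mismatch")
    return errors  # an empty list means the file passes both checks

# Hypothetical two-row extract and its metadata record, as .csv text
data = b"A|1\nB|2\n"
meta_csv = ("Extract_name,MD5Checksum,Row_count,Encoding\n"
            "AEC_EXAMPLE_D_20170523_001.TXT,"
            + hashlib.md5(data).hexdigest() + ",2,UTF-8\n")
meta = parse_meta_csv(meta_csv)[0]
print(validate_file(meta, data))  # → []
```

In the solution, equivalent logic runs inside the Lambda function before the record is written to DynamoDB.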

[Figure 1 content: users run the AWS CloudFormation templates and the custom AWS platform tool/accelerator (AWS SDK Java) from the AWS Management Console to create the Meta Data Driven Boundaryless Data Lake infrastructure: IAM users/groups/roles, AWS S3 buckets & folder structure, and AWS DynamoDB configuration tables in bulk, by reading the configuration template AWS_S3_IAM_DynamoDB_Template.xlsx from the AWS S3 dladmin bucket. It takes a few minutes to stand up the entire infrastructure on AWS Cloud.]

Sample dynamic metadata records (for all rows: Server_ID = BGIAS400, Start_date = 20170507, End_date = 20170520, Gen_date = 20170523, Owner = Jean-Marc Breuil, MD5Checksum = 053c67287cebaae956b206176d035b12):

Domain | TableName | Description | Extract_name | Row_count | Encoding
Material | RZPMATER | Material Master | AEC_RZPMATER_D_20170523_001.TXT | 8781 | UTF-8
Plant_WH | RZPPLANT | Warehouse or Plant Information | AEC_RZPPLANT_D_20170523_001.TXT | 1 | UTF-8
BOM | RZPBOM | Bill of Materials | AEC_RZPBOM_F_20170523_001.TXT | 509482 | UTF-8
Vendor | RZPVENDO | Vendor Master | AEC_RZPVENDO_D_20170523_001.TXT | 19 | UTF-8
Inventory | RZPINVEN | Inventory Details | AEC_RZPINVEN_D_20170523_001.TXT | 60187 | UTF-8
PurchaseOrders | RZPPO | Purchase Orders Details | AEC_RZPPO_D_20170523_001.TXT | 3069 | UTF-8
Customer | RZPCUSTO | Customer Master | AEC_RZPCUSTO_D_20170523_001.TXT | 5 | UTF-8
Sales | RZPSALES | Sales Order Header and Line Details | AEC_RZPSALES_D_20170523_001.TXT | 1889 | UTF-8
ExchangeRate | RZPEXCHA | Currency Exchange | AEC_RZPEXCHA_D_20170523_001.TXT | 1 | UTF-8

DynamoDB configuration tables:

DynamoDB Table Name | Partition (Hash) Key | Partition Key Data Type | Sort (Range) Key (optional) | Sort Key Data Type | Read Throughput (default 5) | Write Throughput (default 5)
filemetadata | extractname | String | - | - | 5 | 5
global-config | key | String | - | - | 1 | 1
fieldmetadata | sscode-dscode | String | columnorder | Number | 15 | 20
datasetconfig | source-system-code | String | source-dataset-code | String | 15 | 20
datasetcatalogue | source-system-code | String | source-dataset-code | String | 1 | 1

AWS IAM users/groups and their S3 access:

User datalake-external-aec-usr: s3:PutObject, s3:GetBucketLocation on datalakelandingdatazone/aec/new/* and datalakelandingdatazone/aec/error/*

One landing-zone writer user per source system, each with s3:PutObject on its own prefix:

datalake-external-acg-usr → datalakelandingdatazone/acg/new/*
datalake-external-lawsonmac-usr → datalakelandingdatazone/lawsonmac/new/*
datalake-external-baancn-usr → datalakelandingdatazone/baancn/new/*
datalake-external-baantw-usr → datalakelandingdatazone/baantw/new/*
datalake-external-bpcsxn-usr → datalakelandingdatazone/bpcsxn/new/*
datalake-external-bpcsmi-usr → datalakelandingdatazone/bpcsmi/new/*
datalake-external-bpcsmt-usr → datalakelandingdatazone/bpcsmt/new/*
datalake-external-jda-usr → datalakelandingdatazone/jda/new/*
datalake-external-movex-usr → datalakelandingdatazone/movex/new/*
datalake-external-navision-usr → datalakelandingdatazone/navision/new/*
datalake-external-qadar-usr → datalakelandingdatazone/qadar/new/*
datalake-external-qadbr-usr → datalakelandingdatazone/qadbr/new/*
datalake-external-qadcc-usr → datalakelandingdatazone/qadcc/new/*
datalake-external-qadch-usr → datalakelandingdatazone/qadch/new/*
datalake-external-qadpe-usr → datalakelandingdatazone/qadpe/new/*
datalake-external-sapbyd-usr → datalakelandingdatazone/sapbyd/new/*
datalake-external-cmc11-usr → datalakelandingdatazone/cmc11/new/*
datalake-external-cme03-usr → datalakelandingdatazone/cme03/new/*
datalake-external-cmp10-usr → datalakelandingdatazone/cmp10/new/*
datalake-external-u�da-usr → datalakelandingdatazone/u�da/new/*

Group datalake-external-source-system-grp: s3:ListBucket, s3:GetBucketLocation on datalakelandingdatazone/* and dlrawzone/*
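For illustration, the per-source landing-zone permissions above could be expressed as an IAM policy document like the following (a sketch for the aec user; the exact policy documents used in the solution are not shown here):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LandingZoneWriteOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::datalakelandingdatazone/aec/new/*"
    },
    {
      "Sid": "BucketLocation",
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::datalakelandingdatazone"
    }
  ]
}
```

Scoping each external user to its own prefix keeps source systems from reading or overwriting each other's extracts.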

S3 buckets (all in region us-east-1):

• datalakelandingdatazone - folder structure ACG/new/, LAWSONMAC/new/, BAANCN/new/, etc.; the data received from every source system is placed in this zone and organised per source.

• dlrawzone - all the files received from every source system are placed per source system and partitioned as BASE (historical load), CURRENT/STAGING (recurring-load _F files) & per the required delta; e.g. for tablename1 the partition is monthly (yyyymm), like 201701, 201702.

• dlcuratedzone - data is picked from the raw zone, processed per domain and source system with a proper date & timestamp by applying different business rules, and saved in the curated data zone.

• dlaggregatedzone - data that is aggregated based on business rules is made available in this zone.

• temp-sbd-prod-bucket - data lake temp bucket used in data lake internal code processing.

• datalakecustomlogs - data lake custom logs, i.e. the custom Java logs for the entire data lake.

• datalakeadmin - folder structure jars/scripts/templates/; data lake admin bucket in which we place the jars/scripts/templates used in the entire data lake process.
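A small sketch of how a monthly (yyyymm) raw-zone partition can be derived from an extract file name (the key layout and helper function are illustrative, modeled on the sample extract names):

```python
import re

def raw_zone_key(source_system, table, extract_name):
    """Derive a monthly (yyyymm) raw-zone partition key for an extract file.

    The file-name pattern and key layout are illustrative, modeled on
    extracts like AEC_RZPMATER_D_20170523_001.TXT.
    """
    m = re.search(r"_(\d{8})_\d+\.TXT$", extract_name, re.IGNORECASE)
    if not m:
        raise ValueError("unrecognized extract name: " + extract_name)
    yyyymm = m.group(1)[:6]  # yyyymmdd -> yyyymm monthly partition
    return f"{source_system}/{table}/{yyyymm}/{extract_name}"

print(raw_zone_key("AEC", "RZPMATER", "AEC_RZPMATER_D_20170523_001.TXT"))
# → AEC/RZPMATER/201705/AEC_RZPMATER_D_20170523_001.TXT
```

Deterministic key derivation of this kind is what lets downstream Hive tables treat the raw zone as cleanly partitioned data.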


Data Governance: Customer trust is essential, and data governance is the core component of the data lake; it covers the overall management of data availability, usability, integrity, quality & security throughout the complete life cycle of the data in the data lake. The AWS Identity and Access Management service, Amazon Elastic Block Store (EBS) encryption & AWS Config Rules for data governance automation were used in this solution.
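As one concrete (illustrative) example of such governance automation, an AWS Config managed rule can continuously check that EBS volumes are encrypted; this CloudFormation fragment assumes a Config recorder is already set up in the account:

```yaml
Resources:
  EbsVolumesEncryptedRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: ebs-volumes-encrypted    # hypothetical rule name
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES    # AWS managed rule
```

Non-compliant volumes then surface in the Config dashboard instead of relying on manual audits.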

Data Ingestion: The ingestion process provides connectors to extract data from various data sources. It can involve a historical data load (i.e. a one-time load) or a recurring/delta data load into the data lake, i.e. a batch data load (extracting data from the various data sources at periodic intervals and moving it to the data lake landing layer or landing zone) or a streaming/real-time data load (ingesting data that are generated continuously from multiple sources such as log files, telemetry, mobile applications, IoT sensors, social networks etc.).

Data Storage: Data storage in the AWS Simple Storage Service is one of the key components of a data lake foundation. The data is categorized into different layers/zones based on the degree of processing and deviation from the source systems: landing zone, raw data zone, and curated/aggregated zone. Data storage should be highly scalable, available and low-cost, and should support compression and encryption techniques; AWS S3's intrinsic strengths in these areas were fully utilized. Additionally, stored data are accessible to data exploration & analytics tools via a Hive metastore on top of the AWS S3 buckets.

Data Validation: Data validation is key to data quality. It is based on various configurable data quality checks that use the metadata available in the metadata store, together with various business checks, e.g. identifying bad records, enforcing file naming conventions, checking whether metadata is available, and checking for mismatches between the data types in the actual data and the data types recorded in the metadata store, for each data set in the different data sources. AWS SDK Java API code was used to read data from the AWS Simple Storage Service as a stream, add AWS S3 object tags as Validated/Invalidated, and send notifications for invalidated files, along with the reason for the failure, via the AWS Simple Email Service.
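A simplified sketch of two of those checks, file-naming-convention and data-type validation against static metadata (the pattern, column names and helper functions are hypothetical):

```python
import re

# Static metadata for one hypothetical data set: expected column data types,
# as they might be recorded in the metadata store's field-metadata table.
FIELD_METADATA = {"material_id": int, "description": str}

# Hypothetical file-naming convention, modeled on extract names such as
# AEC_RZPMATER_D_20170523_001.TXT (source_table_D|F_yyyymmdd_seq.TXT).
NAME_PATTERN = re.compile(r"^[A-Z0-9]+_[A-Z0-9]+_[DF]_\d{8}_\d{3}\.TXT$")

def validate_file_name(name):
    """Check an extract file name against the naming convention."""
    return bool(NAME_PATTERN.match(name))

def validate_record(record):
    """Return data-quality errors for one parsed record (column -> value)."""
    errors = []
    for column, expected_type in FIELD_METADATA.items():
        if column not in record:
            errors.append("missing column: " + column)
        elif not isinstance(record[column], expected_type):
            errors.append("type mismatch in " + column)
    return errors

print(validate_file_name("AEC_RZPMATER_D_20170523_001.TXT"))  # → True
print(validate_record({"material_id": "oops", "description": "bolt"}))
# → ['type mismatch in material_id']
```

Because the expected types come from the metadata store rather than the code, onboarding a new data set changes only configuration.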

Data Transformation/Processing/Curation and Aggregation: Data transformation is one of the core capabilities of a data lake. Transformations are driven by various configurable flags in the static metadata available in the metadata store, e.g. removing bad records from the data, removing bad characters from the data, etc. This solution transforms data for further processing & analysis using Spark/Scala or PySpark code. Once data transformation is complete, the data is processed into external Hive tables & AWS Athena tables, with those Hive tables written back out to the Amazon Simple Storage Service in Parquet or ORC format, depending on the configuration available in the AWS DynamoDB metadata store, using an Amazon EMR cluster. After that, data curation and aggregation Spark jobs are submitted to the Amazon EMR cluster by Airflow in an automated way, i.e. a data processing pipeline.
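A toy sketch of flag-driven cleansing of this kind (the flag names and rules are hypothetical; in the real pipeline this logic runs in Spark):

```python
# Per-data-set transformation flags, as they might be read from the
# metadata store. The flag and field names here are illustrative.
DATASET_CONFIG = {"remove_bad_records": True, "strip_bad_chars": True,
                  "output_format": "parquet"}  # vs. "orc"

BAD_CHARS = str.maketrans("", "", "\x00\r")  # characters to strip

def transform(rows, config):
    """Apply flag-driven cleansing to a list of rows (lists of strings)."""
    out = []
    for row in rows:
        if config["strip_bad_chars"]:
            row = [col.translate(BAD_CHARS) for col in row]
        if config["remove_bad_records"] and any(col == "" for col in row):
            continue  # drop records with empty mandatory fields
        out.append(row)
    return out

rows = [["1001", "Drill\r"], ["", "Saw"], ["1003", "Hammer"]]
print(transform(rows, DATASET_CONFIG))  # → [['1001', 'Drill'], ['1003', 'Hammer']]
```

The output_format flag is what a Spark job would consult when deciding whether to write the curated table as Parquet or ORC.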

Data Discovery: At the time of processing data into external Hive tables and writing those tables back out to the Amazon Simple Storage Service in Parquet or ORC format, the complete metadata and various processing flags are pushed into Kibana, running on the Amazon Web Services (AWS) Elasticsearch Service, for indexing and natural-language query access.
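A sketch of the kind of metadata document that could be indexed for discovery (the field names and example S3 path are illustrative assumptions):

```python
# Sketch of a metadata document pushed to the AWS Elasticsearch Service
# index at processing time; Kibana then makes it searchable across layers.
def build_discovery_doc(source_system, dataset, zone, s3_path, fmt, flags):
    """Assemble one discovery/index document for a processed data set."""
    return {
        "source_system": source_system,
        "dataset": dataset,
        "zone": zone,                  # landing / raw / curated / aggregated
        "s3_path": s3_path,
        "storage_format": fmt,         # parquet or orc, per metadata store
        "processing_flags": flags,
    }

doc = build_discovery_doc("AEC", "RZPMATER", "curated",
                          "s3://dlcuratedzone/material/",  # hypothetical path
                          "parquet", {"validated": True, "transformed": True})
print(doc["dataset"])  # → RZPMATER
```

Indexing one such document per data set and layer is what lets users search the whole lake from Kibana rather than browsing buckets.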

Data Exploration and Visualization: For data exploration and statistical modelling, the data scientists take advantage of Jupyter or Zeppelin notebooks on AWS EMR and of R Shiny / RStudio on AWS EC2 instance(s). The solution also provisions denormalized data sets to AWS RDS Aurora databases for further visualization in QlikView applications, and incorporates an AWS QuickSight Data Lake Health dashboard for the data operations team.

Prominent Services for the Meta Data Driven Boundaryless Data Lake on Amazon Web Services: AWS Identity and Access Management (IAM), AWS Lambda, AWS Simple Storage Service, AWS RDS, AWS DynamoDB, AWS Simple Email Service, AWS Simple Queue Service, AWS EMR, AWS Elasticsearch Service with Kibana, AWS Elastic Compute Cloud (EC2), AWS CloudFormation, AWS QuickSight, AWS SDK Java APIs, AWS VPC & subnets, AWS Athena, Amazon Elastic Block Store (EBS) encryption & AWS Config Rules.


[Figure 2 content: source ERP systems (AS400, BAAN, BPCS, SAP, QAD, MOVEX, …) push flat files (.txt/.csv) of structured data (highly normalized, with a common schema, from relational transactional line-of-business applications such as SALES and BOM) to the Amazon S3 landing zone. An S3 ObjectCreate(All) event triggers Lambda functions that process the file metadata (.csv) into Amazon DynamoDB tables (dl_dataset_config, dl_file_metadata, dl_consolidated_status, dl_global_config, …), keep track of ingested data set files & ingestion stats, fetch email configuration, and send failure notifications via Amazon SES. Airflow scheduler DAGs submit data validation/transformation/processing jobs and data curation & aggregation jobs to Amazon EMR (Hive/Spark/Scala on HDFS, with Hue, Zeppelin, Sqoop and Presto on the cluster's EC2 instances), which applies S3 object tagging and writes external Hive table data through the raw zone (per source system) to the curated zone (ACCOUNTING, BILLING, BOM, COUNTRY, INVENTORY, SALES, …) and aggregated zone (BOM, SALES, …) in .parquet or .orc format. Metadata is indexed in Amazon ES; Amazon Athena and Amazon QuickSight serve the Data Lake Health dashboard and SKU Rationalization report; Amazon SQS supports the pipeline. Overall flow: Sources → Data Ingestion (batch extraction, historical + delta) → Collection & Transformation → Data Store & Analytics → Presentation layer (Material Master, Bills of Material, Service Part Inventory, Global Sales Value, Demand History Value).]

Figure 2 – Meta Data Driven Boundaryless Datalake Solution Architecture on AWS Cloud

Figure 3 – Meta Data Driven Boundaryless Datalake - Consolidated Data Lake Status Dashboard Design

[Figure 3 content: during initial setup, static metadata is loaded into the DL config datasetconfig and datasetcatalogue tables in Amazon DynamoDB. Data lake ingestion then drops 1. the physical data file and 2. the physical metadata file (file metadata .csv / dynamic metadata) into the Amazon S3 landing zone bucket (s3://datalakelandingdatazone/ss/new). Event- and stream-based Lambda functions, together with an Airflow-scheduled generic validator (which inserts validation results into the dl_datasetvalidation DynamoDB table), maintain a consolidated data lake status Hive table: they compare records on source-system & data-set code whenever a new data set is onboarded, and update the last receipt date (dl_datasetingestion table), the last successful validation date, the last successful process date, the start/end transaction dates (fetched from the Hive external tables by a custom AWS SDK Java based utility) and comments. Amazon QuickSight visualizes the consolidated status.]


DATALAKE PLATFORM / INFRASTRUCTURE AUTOMATION

• AWS CloudFormation for infrastructure & a custom Java utility using the AWS SDK for Java to set up the Meta Data Store in AWS DynamoDB, create AWS S3 bucket(s) & folders, and create the DynamoDB tables

• The data lake infrastructure process is fully automated & customizable

DATA VALIDATION / TRANSFORMATION / PROCESSING AUTOMATION

• AWS EMR with Spark/Scala, AWS Lambda (Python), AWS SQS, AWS SES, AWS S3, RDS & custom Java code using the AWS SDK

• Airflow pipeline for batch load/validation/transformation/processing

• This process is fully automated & customizable

DATA LAKE DATA DISCOVERY & ANALYTICS CONSUMPTION AUTOMATION

• AWS Elasticsearch with Kibana & connectors to consume useful data into analytics & visualization tools, i.e. QlikView & AWS QuickSight

• This process is fully automated & customizable


Meta Data Driven Boundaryless Data Lake Automation: We have divided the complete solution automation into three categories, i.e. Data Lake Platform/Infrastructure/Environment Automation, Data Lake Data Validation/Transformation/Processing Automation, and Data Discovery & Analytics Automation.

Figure 4 – Meta Data Driven Boundaryless Datalake Solution Automation Categories


Figure 5 – Meta Data Store, i.e. AWS DynamoDB, for the Boundaryless Datalake Solution: the datasetconfig table storing static metadata (showing one sample record for understanding)

The datasetcatalogue AWS DynamoDB table stores the data set catalogue (showing one sample record for understanding).


The dl_datasetconsolidated_status AWS DynamoDB table keeps track of each data set's processing status, i.e. the data set life cycle in the lake (showing one sample record for understanding).

Other Components Perspective:

• Onboarding of a completely new data source or data sets: since this solution is metadata driven, you just need to provide the metadata in the predefined templates, and the solution will take care of the rest (data validation, data transformation/processing, etc.)

• Machine learning

• Deep learning on the data sets

• Neural networks

Focus Perspective Outcome: One should focus on an MVP metadata driven boundaryless data lake solution to amplify and gain new business insights through high-end data analytics. The beauty of this solution is that it is developed using open-source technologies & AWS SDK Java/Python native APIs. The metadata driven boundaryless solution is fully customizable per customer need.

Bells and whistles which can be added later to the data lake MVP:

• Ingesting data that are generated continuously from sources such as log files, telemetry, mobile applications, IoT sensors and social networks

• Real-time analysis, including machine learning algorithms such as anomaly detection

• AWS Kinesis, AWS IoT

• A user interface for the Data Lake Ops part

Meta Data Driven Boundaryless Data Lake Benefits/Impact:

• Time to value – move quickly to analytics and use cases

• AWS-native services and an open-source solution

• A centralized and accessible data repository that different teams can rely on for their individual analytics

• Significant reduction in the cost of business analytics due to the use of cloud and the transient nature of the compute infrastructure

• The data scientist/analyst community can now perform DIY analytics on real ‘big data’ and data they understand, rather than just sample data. Better insights, better decisions, better business products.

• A complete analytical ecosystem that can now be easily expanded to support multi-tenancy

• Enables the analysis of all SKUs in all the source systems, which will help reduce the number of redundant and unused SKUs

Operating the Data Lake & Health Dashboard: For data lake operations & health monitoring, we used the AWS Lambda service for event-based triggers that update health monitoring tables in AWS DynamoDB, keeping track of each data set's validation/transformation & processing metrics, and finally surfacing those metrics in AWS QuickSight dashboards.


Figure 6 – Meta Data Driven Boundaryless Datalake Health & Operations


© 2018 Infosys Limited, Bengaluru, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Infosys acknowledges the proprietary rights of other companies to the trademarks, product names and such other intellectual property rights mentioned in this document. Except as expressly permitted, neither this documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording or otherwise, without the prior permission of Infosys Limited and/or any named intellectual property rights holders under this document.

For more information, contact [email protected]

Infosys.com | NYSE: INFY Stay Connected
About the authors

Madasamy Thandayuthapani (Author) - Sr. Principal Architect - Data & Analytics CTO and Associate Vice President at Infosys Limited

Mada is a senior technology transformation leader and strategist with 20 years of hands-on expertise in strategy, solution architecture blueprinting & roadmaps, and delivering transformation programs, with specialization in the Cloud, Analytics, DevOps, Service Assurance & App Performance Management domains.

He is enthusiastic about accepting challenging assignments and energetic in delivering on his responsibilities.

Jignesh Desai (Reviewer AWS Partner for Infosys) - Solutions Architect at Amazon Web Services

Jignesh is a Solutions Architect at Amazon Web Services. In this role, he delivers exceptional rather than merely expected results through strategic thinking, strong problem-solving skills and a unique ability to understand market needs and fill them with distinguishable solutions.

He is an experienced solution and enterprise architect with multi-million-dollar technology transformation experience.

Ezra Uzosike (Author) - Lead AI Cloud Solution Architect - Big Data Analytics at Stanley Black & Decker, Inc.

Ezra is Senior Manager, Big Data Analytics & Data Science, driving the effort to build an AWS data lake for all of Stanley Black & Decker's heterogeneous data sources, including traditional ERP and streaming IoT data. He also leads the integration of big data Hadoop and Amazon AWS tools and services (e.g. Python Anaconda clusters, R clusters, Spark, AWS Lambda, Elasticsearch and DynamoDB) into the data science mainstream, to unlock the power of the data catalogued in the data lake, demonstrating how capabilities like machine learning, along with sound statistical methods, can be leveraged to mine insights with the distributed processing power of Hadoop.

Arpit Garg (Author) - Senior Technologist /Sr. Technology Architect at STG (Strategic Technology Group), Infosys Limited

Arpit is a Senior Technologist/Sr. Technology Architect at STG, Infosys. In this role, he delivers high-end data analytics & data-on-cloud solutions, data lake architectures on the cloud, and serverless architectures & frameworks. He has hands-on experience in a diverse technology stack, including full stack/Java/Python/AWS/Azure cloud services/big data tool sets.

He has outstanding problem-solving and decision-making skills & the ability to work at a fast pace.
