
[IEEE 2012 International Conference on Computing Sciences (ICCS) - Phagwara, India (2012.09.14-2012.09.15)] 2012 International Conference on Computing Sciences - Visualizing Clouds



Visualizing Clouds on Different Stages of DWH – An Introduction to Data Warehouse as a Service

Harjeet Kaur, Prateek Agrawal, Amita Dhiman
Department of Computer Science and Engineering
Lovely Professional University, India
{harjeetkaurminhas, prateek061186, amitadhiman3001}@gmail.com

Abstract: Forward-thinking business leaders are bringing the established advantages of cloud computing, namely scalability, agility, automation and resource sharing, into their enterprise data centres. Large-scale data analytics can be provisioned on demand, as and when needed, at a price the user can afford. This paper presents an architecture for applying cloud computing to data warehouse applications and discusses the pros and cons of hosting the stages of a traditional data warehouse on a cloud. The emphasis is on offering the different stages of a data warehouse as a service (DWH-aaS), an approach that has recently gained a great deal of attention. The conclusion is supported by a brief discussion of the level of architecture adopted by some popular DWH cloud service providers.

Keywords: Cloud, Data Warehouse (DWH), Data Warehouse as a Service, DWH Cloud Services, DWH Cloud Architecture

I. INTRODUCTION

In the past couple of years, software providers have been moving more and more applications to the "cloud"; cloud computing is widely seen as the future of delivering software as a service. According to Berkeley scientists [3], "Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services." All major software vendors and many start-ups have jumped on the bandwagon and claim to be either cloud-enabled or cloud-enabling, promising a reduced time-to-market by removing or simplifying the time-consuming hardware purchasing, provisioning and deployment processes.

Cost reduction is expected in several ways. First, by adopting a “pay-as-you-go” business model, capital costs can be turned into operational costs. Second, hardware resources are utilized better (close to 100%), which makes cloud computing a strong foundation and a critical technology for green computing [4]. Furthermore, cloud computing reduces operational cost and effort by automating IT-oriented tasks. In terms of performance, cloud computing promises (virtually) infinite scalability, so that IT administrators need not worry about peak workloads. Finally, cloud computing promises improved flexibility in the utilization and management of both software and hardware. Figure 1 summarizes the benefits of cloud computing.
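The pay-as-you-go and utilization arguments above can be illustrated with a minimal Python sketch. It compares a fixed deployment sized for the peak hour with a deployment rented by the server-hour. The demand profile, the capacity per server and the price are hypothetical values chosen only for illustration; they are not taken from any vendor, and real cloud prices per server-hour may well differ from the cost of owned hardware.

```python
import math

def fixed_cost(hourly_demand, capacity_per_server, server_hour_price):
    """Cost when enough servers for the peak hour run during every hour."""
    peak_servers = max(math.ceil(d / capacity_per_server) for d in hourly_demand)
    return peak_servers * len(hourly_demand) * server_hour_price

def pay_per_use_cost(hourly_demand, capacity_per_server, server_hour_price):
    """Cost when servers are rented hour by hour to match the demand."""
    server_hours = sum(math.ceil(d / capacity_per_server) for d in hourly_demand)
    return server_hours * server_hour_price

def utilization(hourly_demand, capacity_per_server):
    """Fraction of the peak-sized capacity that is actually used."""
    peak_servers = max(math.ceil(d / capacity_per_server) for d in hourly_demand)
    provisioned = peak_servers * capacity_per_server * len(hourly_demand)
    return sum(hourly_demand) / provisioned

# Hypothetical workload: requests per hour with one short peak.
demand = [100, 100, 100, 1000, 100, 100]

print(fixed_cost(demand, capacity_per_server=100, server_hour_price=1.0))
print(pay_per_use_cost(demand, capacity_per_server=100, server_hour_price=1.0))
print(utilization(demand, capacity_per_server=100))
```

Under these assumed numbers the peak-sized deployment sits mostly idle, which is the situation the pay-as-you-go model is meant to avoid.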

Figure 1. Cloud Computing

A. Cloud Options: Two Different Classifications

There are two ways to classify clouds: by the location of the computing resources to be shared, and by the type of service provided. As far as the location of resources is concerned, three models are available: public, private and hybrid cloud.

When a service provider makes resources, such as applications and storage, available to the general public over the Internet, it is known as a public cloud. The computing infrastructure is placed at the vendor’s premises and is hosted by the cloud vendor. The customer does not know where the computing infrastructure is hosted and hence has no control over it. Upgrades are made available automatically to the entire user population sharing the service, and basic support is provided as part of a company's subscription. Examples of public clouds include IBM's Blue Cloud, Sun Cloud, Google AppEngine, Amazon Elastic Compute Cloud (EC2), and the Windows Azure services platform. Midsize firms are more interested in the public cloud: because they have on average just 5–10 full-time IT staff in-house, they are more dependent on outside resources.

For companies requiring more complete control of their data and resources, a private cloud might be more suitable. This model dedicates the computing infrastructure to a particular organization and does not share it with other organizations. Some experts believe that private clouds are not real examples of cloud computing. Private clouds are more secure but more expensive than public clouds. Which vision of the cloud (Figure 2) is right is still a matter of controversy.

[Figure 1 content: benefits of cloud computing are Reduced Cost, Increased Storage, Up-to-date Software, Environment Friendly, More Flexibility and Mobility]

2012 International Conference on Computing Sciences

978-0-7695-4817-3/12 $26.00 © 2012 IEEE

DOI 10.1109/ICCS.2012.78

356



Figure 2. Types of Cloud Computing: Public/External Cloud (off premises), Hybrid Cloud, Private/Internal Cloud (on premises)

The Election Commission of India launched a website to provide real-time results of the mega-poll. It announced arrangements showing that it was prepared to handle 80.64 billion hits in 8 hours. On the election result day, the media reported: 300,000 hits/second make Election Commission website crash [2]. Adding more servers appears to be a good idea, but the Election Commission website attracts visitors only when there is an election, i.e., ideally once in five years. A better solution is hybrid cloud computing, in which an organization uses its own computing infrastructure for normal usage but accesses the cloud for peak load requirements. This ensures that an abrupt increase in computing requirements is handled gracefully.
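As a rough check on the figures quoted above, 80.64 billion hits spread evenly over 8 hours corresponds to 80.64e9 / (8 × 3600) hits per second on average. The minimal Python sketch below computes that rate and shows the basic idea of hybrid bursting: requests are served locally up to the local capacity and the overflow is sent to the cloud. The local capacity of 50,000 hits per second is a hypothetical value, not a figure from the Election Commission.

```python
def average_rate(total_hits, hours):
    """Average hits per second over the given number of hours."""
    return total_hits / (hours * 3600)

def split_load(demand_per_sec, local_capacity_per_sec):
    """Serve what fits on local infrastructure; burst the rest to the cloud."""
    local = min(demand_per_sec, local_capacity_per_sec)
    burst = max(0, demand_per_sec - local_capacity_per_sec)
    return local, burst

# Figure quoted in the text: 80.64 billion hits in 8 hours.
print(average_rate(80.64e9, 8))

# Hypothetical local capacity of 50,000 hits per second.
print(split_load(20_000, 50_000))    # normal day: everything stays local
print(split_load(300_000, 50_000))   # result day: the overflow goes to the cloud
```

With this split the organization pays for cloud capacity only during the rare hours in which demand exceeds what its own servers can absorb.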

Another approach to classifying clouds is by the kind of service offered. IaaS, PaaS and SaaS are the most common types of service offered by vendors. Infrastructure as a Service (IaaS) offers hardware-related services using the principles of cloud computing. These include storage services (database or disk storage) and virtual servers. Leading offerings of Infrastructure as a Service are Amazon S3, Rackspace Cloud Servers, Amazon EC2, and Flexiscale. Platform as a Service (PaaS) offers a development platform on the cloud. Platforms provided by different vendors are often not compatible with one another. Typical players in PaaS are Microsoft’s Azure, Google’s App Engine, and Salesforce.com’s force.com. Software as a Service (SaaS) makes software available on the cloud: a software application hosted by the cloud vendor can be accessed by users on a pay-per-use basis. Examples are Google Docs, Microsoft’s online version of Office called BPOS (Business Productivity Online Standard Suite), and online email providers such as Google’s Gmail and Microsoft’s Hotmail.

To move a complete DWH to the cloud, any of the above-mentioned models, or a combination of them, can be used. The purpose of this paper is to introduce and elaborate data warehouse as a service (DWH-aaS). The next section presents data warehouse architectures as they are used in cloud computing today.

II. ENVISAGING CLOUD IN DATA WAREHOUSE ARCHITECTURE

A. OLAP, Data Warehouses, and Data Marts

According to Gartner, an information technology research and advisory company that provides technology-related insight, “There are two reasons for the increased activity in running data warehouses as a managed service. First, business units are dissatisfied with the IT department buying a managed service for their BI needs. Second, small companies cannot afford their own data warehouse infrastructure but have large volumes of data to process for analytics. It is a growing market” [1].

OLAP was engineered to run queries against data warehouses and data marts, environments configured for analysis and reporting. With this arrangement, operational transaction processing systems that are already coping with day-to-day operations are not overburdened by requests for data. Data warehouses and marts contain data extracted from transaction systems and are built specifically to support queries. The refresh rate of the warehouse or mart can be set according to how up to date the data needs to be.

This section presents data warehouse architectures as they are used in cloud computing today. As a starting point, the classic multi-tier architecture is explained. Then a variation of this architecture is presented in which the stages are hosted on clouds. The next section looks at how these concepts have been packaged and adopted by commercial cloud services.

A DWH architecture is founded primarily on the business processes of the enterprise. It takes into account the data consolidated across the enterprise with adequate security, data modeling and organization, the extent of query requirements, and metadata management and application. It also includes planning the warehouse staging area for optimum bandwidth utilization and full technology implementation.

B. Visualizing Clouds on Classic DW Architecture

Figure 3 depicts the spread of the cloud over the conventional data warehouse stages. It identifies the layers of the classic warehouse architecture that may be hosted on a cloud. The data staging (inbound) layer houses the extraction programs; it can be hosted on a cloud by initiating an appropriate data governance model at the data source layer itself and using the extraction logic to cleanse and summarize the data. A complete data warehouse can be designed in the cloud with the use of DaaS and paid for on a “pay-per-use” basis. On-demand storage servers also provide scalability. The next layer that can be accommodated on a cloud is reporting and analysis. Clouds can be created using internal resources to host reports from various business streams. The major components of a complete cloud model of a DWH include:

ETL Cloud: Extract, transform and load (ETL) is a process in database usage, and especially in data warehousing, that involves extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target (database or data warehouse). The benefits of creating ETL services are reduced latency in creating analytical data and a tighter link between the warehouse and the enterprise architecture. ETL services are provided by Teradata Agile Analytic Cloud, Queplix’s CloudETL, Informatica Cloud, etc.

BI Cloud: Business reporting, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas are the major applications of OLAP. OLAP (online analytical processing) is computer processing that allows users to selectively extract and view data from different perspectives. An OLAP tool is used to analyse aggregate data over time, whereas a reporting tool provides access to detailed data. Reporting and data mining, like OLAP, are subsets of the BI cloud. Cloud services in this area are available from Panorama Software, Netezza and AppNexus, Greenplum, Teradata, etc.
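The two components described above can be sketched in a few lines of Python. The example below runs a toy extract-transform-load pass and then an OLAP-style roll-up of the loaded facts along two different dimensions. It is only an illustration: the in-memory SQLite database stands in for the warehouse, and the source data, the `sales` table and the function names are hypothetical, not part of any vendor's service.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice it would come from an operational system.
SOURCE_CSV = """region,month,amount
north,2012-01,100.50
north,2012-02,80.00
south,2012-01,
south,2012-02,40.25
"""

def extract(text):
    """Extract: read raw rows from the outside source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and normalize types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue
        clean.append((row["region"].upper(), row["month"], float(row["amount"])))
    return clean

def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, month TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def totals_by(conn, dimension):
    """OLAP-style roll-up: aggregate the same facts along one dimension."""
    if dimension not in ("region", "month"):
        raise ValueError("unknown dimension")
    query = (f"SELECT {dimension}, SUM(amount) FROM sales "
             f"GROUP BY {dimension} ORDER BY {dimension}")
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
print(totals_by(conn, "region"))   # the same data viewed by region
print(totals_by(conn, "month"))    # the same data viewed by month
```

The roll-up function is what distinguishes the analytical view from plain reporting here: it returns aggregates over the loaded facts instead of the detailed rows.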

The complete architecture can be transformed into a full cloud model. Different vendors provide different levels of service: some offer a complete data warehousing solution as a service, while others offer BI services only. In fact, the data staging area and storage can be merged into a single stage in the cloud, although separate vendors can be contacted for migration, integration and creation of data marts. The next section examines popular data warehousing service providers.

III. DW CLOUD SERVICES

This section describes some popular cloud service providers for both the data staging and the data reporting layers.

A. Teradata

Teradata provides a solution based on the simplicity of a single central data warehouse and offers Data Warehouse Services [8]. Teradata migration solutions produce a single view of the business by assimilating data into a central location. Teradata’s Project Management Solutions Methodology, used by Teradata project managers, is a patented collection of integrated processes, customized tools, and quantifiable metrics covering initial strategic planning, technical implementation, user training, and customer support.

Other services include Data Integration (completeness and fulfilment of business rules); Data Quality (processes to cleanse, transform and integrate information); Data Security and Privacy (across integrated subjects); Master Data Management (maintaining the integrity and reliability of master data enterprise-wide); and Data Governance and Stewardship (aligning data warehouse initiatives with business objectives).

Teradata Managed Services include Teradata Infrastructure Services (the foundation upon which the data warehouse environment operates, including best practices and database, system and security administration); Data Integration Services (active management and maintenance of the data integration, summary build, and warehouse extract processes); and Analytical Services (management and maintenance of the BI environment).

Teradata Agile Analytic Cloud: Agile Analytic Cloud permits business users and developers to allocate a database or, technically speaking, an elastic data mart inside the Teradata system [8]. An on-demand self-service portlet, Elastic Marts Builder, lets users create a database, upload comma-separated values, and commence analysis, all within 5-10 minutes (as claimed by Teradata). Elastic mart users can upload their data and merge it with the EDW data to produce reports.
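The elastic mart workflow described above (create a database, upload comma-separated values, merge with EDW data) can be mimicked in a minimal Python sketch. This is not Teradata's portlet or API: an in-memory SQLite database stands in for the Teradata system, and the function `create_mart`, the tables `edw_customers` and `mart_campaigns`, and the data are all hypothetical.

```python
import csv
import io
import sqlite3

def create_mart(conn, table, csv_text):
    """Create a mart table from uploaded comma-separated values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Identifiers are checked because they cannot be bound as SQL parameters.
    for name in [table, *header]:
        if not name.isidentifier():
            raise ValueError(f"invalid name: {name!r}")
    columns = ", ".join(f"{name} TEXT" for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    conn.commit()

conn = sqlite3.connect(":memory:")

# Stand-in for data already held in the enterprise data warehouse (EDW).
conn.execute("CREATE TABLE edw_customers (customer_id TEXT, name TEXT)")
conn.executemany("INSERT INTO edw_customers VALUES (?, ?)",
                 [("c1", "Acme"), ("c2", "Globex")])

# Hypothetical user upload.
upload = "customer_id,campaign\nc1,spring\nc2,autumn\n"
create_mart(conn, "mart_campaigns", upload)

# Merge the uploaded mart with the EDW data for a report.
report = conn.execute(
    "SELECT e.name, m.campaign FROM mart_campaigns AS m "
    "JOIN edw_customers AS e ON e.customer_id = m.customer_id "
    "ORDER BY e.name"
).fetchall()
print(report)
```

The join in the last step corresponds to merging user-supplied data with warehouse data; everything before it is the self-service part that a business user would perform without involving IT.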

B. Greenplum’s Enterprise Data Cloud™

Greenplum provides solutions for large-scale data warehousing and analytics to companies managing terabytes to petabytes of data. Data-driven businesses around the world, including NASDAQ OMX, NYSE Euronext, Reliance Communications, Skype and Fox Interactive Media/MySpace, have adopted the Greenplum Database to support their mission-critical business functions [9]. The cloud solutions provided by Greenplum are Data Mart Consolidation and Data Warehouse and Mart Migration.

As large enterprises face the reality of supporting hundreds to thousands of individual data marts, Greenplum provides a platform to consolidate those marts onto a high-performance, lower-cost infrastructure. With the ever-increasing growth of data, companies are supporting existing data marts and warehouses that need to scale well beyond their original design considerations. Many growing companies struggle with the complexity of systems administration coupled with high costs, which forces them to migrate their data marts and warehouses to entirely new platforms. Greenplum helps such companies migrate their Oracle, Teradata, and other existing database systems to Greenplum.

C. Netezza and AppNexus

AppNexus offers an enterprise-class IaaS to build infrastructure in minutes, not months (as claimed by AppNexus). Through its association with Netezza, a provider of data warehouse appliances, AppNexus offers its customers a data warehouse service within the cloud. Netezza data warehouse appliances provide a database server-storage configuration in a system intended to execute complex queries against large volumes of stored data. The appliance exploits massively parallel processing and an architecture that puts processing inside storage to provide a brute-force solution that can deal with complex analytics against large data volumes. Netezza appliances comply with ANSI standard SQL and are certified with all industry-leading business intelligence, statistical analysis, and ETL platforms, including Business Objects, Cognos, Informatica, MicroStrategy, and SAS [6]. The AppNexus “Netezza in the Cloud” service adds rapid provisioning, scalability on demand, and pay-as-you-go pricing based on utilization to the ease of Netezza implementations.

D. Informatica Cloud

Informatica Cloud offers data integration cloud applications that let business users integrate data across cloud-based applications and on-premise systems and databases. Informatica Cloud caters to specific business processes (customer/product master synchronization, opportunity to order, etc.) and point-to-point data integration requirements (e.g. Salesforce.com to on-premise, or cloud-to-cloud end-points) [7]. Informatica Cloud Services make it quick and easy to integrate cloud-based applications, such as Salesforce CRM, with on-premise databases and application data. They can be used to integrate SaaS applications with a variety of common on-premise systems and databases, and to synchronize and replicate data between local databases and files.

The Data Loader is an integration service for Informatica customers that automates the loading and extraction of data between CRM, flat files, and relational databases and provides limited scheduling capabilities. Informatica Cloud Data Loader Service features include an intuitive Web-based integration wizard that streamlines setup, convenient Web-based administration that provides access from anywhere, direct access to relational databases and popular file formats, and drag-and-drop mapping of source and target fields.
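The idea of mapping source fields to target fields when loading a flat file can be sketched as follows. This minimal Python example is not Informatica's Data Loader or its API; the field names, the mapping `FIELD_MAP` and the helper functions are hypothetical and serve only to show what a source-to-target mapping does to each record.

```python
import csv
import io

# Hypothetical mapping from source (CRM export) fields to target columns.
FIELD_MAP = {"Account Name": "customer_name", "Billing City": "city"}

def map_record(record, field_map):
    """Rename mapped source fields to their target names; drop unmapped fields."""
    return {target: record[source] for source, target in field_map.items()}

def load_flat_file(csv_text, field_map):
    """Read a flat file and return records shaped for the target table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [map_record(row, field_map) for row in reader]

# Hypothetical flat-file export with one field that has no target column.
source = "Account Name,Billing City,Owner\nAcme,Pune,asha\nGlobex,Delhi,ravi\n"
rows = load_flat_file(source, FIELD_MAP)
print(rows)
```

A drag-and-drop mapping tool essentially builds a structure like `FIELD_MAP` for the user, so the same load logic can be reused for different sources and targets.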

IV. CONCLUSION

This paper presented the spread of the cloud over the traditional data warehouse architecture and described the services that various vendors provide in this context. Almost every layer in the architecture presented above can be hosted on a cloud. There is a separate cloud for each DWH stage, and the DWH itself can also be created on a cloud of servers. The vendors discussed above make use of different levels of cloud: some provide an ETL cloud only, others also provide storage infrastructure, and many provide just BI solutions. Which vendor a user chooses therefore depends on the user's needs and on the level of service the vendor provides. The market is still immature, and the alternative services vary greatly in both cost and performance. Table 1 summarises the kinds of DWH services offered by the major vendors discussed in the previous section.

Table 1. Services provided by DWH Cloud Service Providers

Service Provider | Services Offered
--- | ---
Teradata | DWH Migration, Enterprise Data Mgt, Analytic cloud to allocate data marts, Business Analysis
Greenplum | Data Mart Consolidation, DWH and Data Mart Migration, Enterprise Data Mgt
Netezza and AppNexus | ETL, BI
Informatica | Data Loading, Data Synchronization, Data Replication

V. REFERENCES

[1] Kossmann D, Kraska T, Loesing S, “An evaluation of alternative architectures for transaction processing in the cloud”, ACM Proceedings of international conference on Management of data, 2010, USA, pp 579-590.

[2] http://articles.timesofindia.indiatimes.com/2009-05-18/india/28190755_1_hits-new-website-nic

[3] Armbrust M, Fox A, Griffith R, Joseph A.D, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, and Zaharia M, “Above the Clouds: A Berkeley View of Cloud Computing”, UC Berkeley Reliable Adaptive Distributed Systems Laboratory, Technical Report No. UCB/EECS-2009-28, February 2009.

[4] Liu L, Wang H, Liu X, Jin X, Wen B H, Wang Q B, Chen Y, “GreenCloud: a new architecture for green data center”, ICAC-INDST '09, Proceedings of the 6th international conference industry session on Autonomic computing and communications industry session, USA, pp 29-38.

[5] Walker D M, “Overview Architecture for Enterprise Data Warehouses”, Data Management & Warehousing, Ver 1.0, UK, 2006.

[6] Netezza, http://www.netezza.com/data-warehouse-appliance-products/cloud.aspx, accessed April 2011.

[7] Informatica, http://www.informaticacloud.com, accessed April 2011.

[8] Teradata, http://www.teradata.com/t/, accessed April 2011.

[9] Greenplum, http://www.greenplum.com/, accessed April 2011.

Figure 3. Proposed DWH Cloud Architecture [labels: Transaction Processing on Cloud; ETL on Cloud; Data Marts; OLAP Cloud; BI Reporting Cloud; Data Mining Cloud]
