
Use Case: Integration and Evaluation

D6.3.1

October 2014

Business Continuity as a Service ICT FP7-609828

ORBIT ORBIT_WP6_D6.3.1 Business Continuity as a Service 27.10.2014 D6.3.1 - Use case: Scenario Definitions

www.orbitproject.eu

Document Information
Scheduled delivery: 01.10.2014
Actual delivery: 27.10.2014
Version: 1.0
Responsible Partner: Deutsche Welle (DW)
Dissemination Level: PU (Public)

Revision History
Date        Editor                Status  Version  Changes
15.09.2014  Mirko Lorenz          Draft   0.1      Document initiation
22.09.2014  Mirko Lorenz          Draft   0.2      Adding content from IBM
23.09.2014  Mirko Lorenz          Draft   0.3      Added evaluation test case and metrics for media use case
25.09.2014  Luis Tomas            Draft   0.4      Added details about testbed
66.10.2014  Doron Fediuck         Draft   0.5      Adding content from Red Hat
20.10.2014  Mirko Lorenz          Draft   0.6      Review for consistency, added final inputs
25.10.2014  Dimosthenis Kyriazis  Final   1.0      Final version

Contributors: Mirko Lorenz (DW), Tilman Wagner (DW), Petter Svärd (UMU), David Alan Gilbert (Red Hat), Joel Nider (IBM), Doron Fediuck (Red Hat)

Internal Reviewers: Emmanouel Varvarigos (CTI), Theodora Varvarigou (ICCS)

Copyright: This report is © by Deutsche Welle (DW) and other members of the ORBIT Consortium 2013-2016. Its duplication is allowed only in the integral form for anyone's personal use and for the purposes of research or education.

Acknowledgements: The research leading to these results has received funding from the EC Seventh Framework Program FP7/2007-2013 under grant agreement n° 609828.



Glossary of Acronyms
Acronym  Definition
BC       Business Continuity
D        Deliverable
DoW      Description of Work
EC       European Commission
EE       Execution Environment
ERG      European Regulatory Group
EU       European Union
FT       Fault Tolerant
GPS      Global Positioning System
GUI      Graphical User Interface
HA       High Availability
HDTV     High Definition Television
PM       Project Manager
PO       Project Officer
SaaS     Software as a Service
SDTV     Standard Definition Television
SIP      Session Initiation Protocol
SLA      Service Level Agreement
SME      Small and Medium Enterprise
SOA      Service Oriented Architecture
UC       Use Case
UMTS     Universal Mobile Telecommunications System
URI      Uniform Resource Identifier
WP       Work Package



Table of Contents
1. Executive Summary
  1.1. Brief introduction to ORBIT
  1.2. The Concept of Business Continuity
  1.3. Scope of this deliverable
  1.4. Scenarios, Implementation, Evaluation
  1.5. ORBIT Outcomes - Use Case Mapping
2. Evaluation Methodology
  2.1. Performance Metrics
    2.1.1. Business Use Case Metrics
    2.1.2. Infrastructure Use Case Metrics
    2.1.3. Media Use Case Metrics
3. Use Case Evaluation Plans
  3.1. Business Use Case Scenario
    3.1.1. Highly Available Bugzilla service evaluation
    3.1.2. Macro-benchmarks
    3.1.3. Methodology
  3.2. Infrastructure Use Case Scenario
    3.2.1. Infrastructure Use Case Evaluation
    3.2.2. Micro-benchmarks
    3.2.3. Macro-benchmarks
    3.2.4. Methodology
  3.3. Media Use Case Scenario
    3.3.1. Media Use Case Evaluation
    3.3.2. Methodology
  3.4. Testbed requirements
4. Conclusions



List of Figures
Figure 1: Hierarchy of concepts
Figure 2: ORBIT Use Cases - Scenario Development

List of Tables
Table 1: Mapping of project outcomes to scenarios
Table 2: Analysis plan of ORBIT research outcomes
Table 3: Details on the hierarchy of concepts



1. Executive Summary

1.1. Brief introduction to ORBIT

The main objective of the ORBIT project is to innovate in the space of fault-tolerant server resources to minimize the effects of outages. So far, reducing the effects of downtime has required significant investment and meticulous planning to address each common cause of downtime appropriately.

ORBIT’s novel architecture is particularly well suited for cloud-wide deployments, thus complementing existing tools available for Small and Medium Enterprises (SMEs) and service providers. ORBIT aims to provide an application-agnostic fault-tolerance solution for cloud infrastructures that makes it possible, for the first time, to migrate critical enterprise workloads to the cloud without compromising on the availability and performance of the system. ORBIT eliminates the complexity of deploying and managing fault-tolerance solutions at the application level and completely eliminates the effort cloud customers previously had to invest to deal with unreliable cloud platforms.

Concepts to reduce the effects of outages, such as the replication of servers, are already available. Still, a gap exists: currently available solutions are either expensive hardware-level solutions or application-specific ones. The ORBIT project attempts to address this gap by introducing a new paradigm of virtualized resource consolidation. In this solution, memory and I/O resources used by a guest Virtual Machine (VM) are provided by multiple external hosts instead of being limited to one single physical server.

This high-level description already defines certain characteristics of this particular project. Unlike in other research projects, the key outcome of the project can be defined at an early stage: to develop demonstrators for such a “High Availability” application with a very specific technology focus, based on the most current needs among potential users.

1.2. The Concept of Business Continuity

Although almost every organization of any size is by now highly dependent on IT availability, planning for Business Continuity and High Availability, as well as investment in fault-tolerant IT resources, is common in only a fraction of all companies.

There are a number of reasons for the relatively low level of scientific and organizational discourse. Firstly, most companies assume that they are sufficiently protected by today's Service Level Agreements, which usually guarantee uptimes of external services of up to 99.9%. The numerically low risk of 0.1% downtime seems to create a false sense of security and reliability. The point of a dedicated Business Continuity plan is to prepare for such events, which can often be costly, even if the downtime lasts only a few hours.
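To put the 0.1% figure in perspective, the downtime that an uptime guarantee still permits can be worked out directly. The following is a minimal sketch; the function name and the 365-day year are illustrative assumptions and not part of any SLA definition used in the project.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) that an uptime guarantee still permits."""
    return period_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common guarantee levels.
    for sla in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows {hours:.2f} h of downtime per year")
```

For a 99.9% guarantee this comes to roughly 8.76 hours per year, which is the "few hours" scale at which an outage can already be costly.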



Another reason for the relatively low preparedness is that implementing solutions is deemed costly and difficult. This is one very specific area where ORBIT aims to provide new options.

Finally, there is a gap of understanding how to plan against suffering from outages. The gap can be described by the two core questions to ask about this potential threat: How and why. There are two main groups of stakeholders in this area of IT – on the one hand specialized teams from the technical department who can at least tell why a system was unavailable. This is the main interest: Why did it happen? And only then this group would ask how another outage could be avoided. On the other hand, we anticipate that upper management which is usually not too interested in why an outage has happened, but

To show the relevance of Business Continuity, we repeat below the conceptual view of the project, which has also been presented in earlier deliverables of WP6.

Figure 1 provides a view on how the three concepts of Business Continuity, High Availability of IT resources and finally Fault Tolerant resources relate. In short, Business Continuity can be understood as the broader, company-wide plan to avoid outages. High Availability is part of this plan, defining which technical resources (storage, transactional servers and databases) are defined to be always available. Fault tolerance then is a function of such resources where any outage can cause disruption of services.

Figure 1: Hierarchy of concepts

The goal of the scenarios identified by the partners involved with development and use cases is both to challenge the ORBIT outcomes and innovations and to align the development work from all partners, with a focus on usability by end users. In that regard, we need to define the needs of users, the functionality the system will provide and - as a connector and enabler - the technology that the ORBIT project will deliver.



The goal of this specific deliverable is to address objectives of two different tasks identified in the Description of Work. These two tasks are as follows:

• Definition of three application scenarios
• Definition of requirements and dependencies between the WPs

The scenarios are used in ORBIT to connect the needs of users in terms of business continuity with the functionality the final integrated prototype will need and align those with technical development as can be seen in Figure 2.

Figure 2: ORBIT Use Cases - Scenario Development

Starting with the different scenarios, a use case can be created for each of them, allowing the definition of requirements for the overall project. With this information at hand the project can then go on to create demonstrators for each of these scenarios, validating the functionality and eradicating technical errors using a suitable testbed. In order to run user tests, the project also needs a usable interface, which will be created in the media use case (see Figure 3).



Figure 3: Overview and dependencies of scenarios, use cases and requirements gathering for ORBIT

1.3. Scope of this deliverable

This deliverable describes how the project will evaluate the outcomes of the work and the use cases. The document builds on the earlier deliverables from WP6:

• D6.1.1: Use Case: Scenario definitions

• D6.2.1: Use case: Integration and Experimentation

• D6.3.1: Use Case: Evaluation and Conclusions (Current Deliverable)

Because the project is still in the first year at the time of writing, D6.3.1 merely describes how the evaluation will be carried out for each use case. Conclusions on whether the software is a good fit for the intended uses will then be drawn based on evaluation results in years two and three.

1.4. Scenarios, Implementation, Evaluation

To make this deliverable easier to understand, a brief recap provides the context. Each use case goes through a step-by-step development of requirements and tests. The ORBIT project has (1) defined scenarios and (2) derived a plan for implementation from these scenarios. The results will be (3) evaluated to ensure that the requirements have been met. The project has so far organized the work towards evaluation in three layers, where each step builds on the previous one.

• In D6.1.1 (Use case: Scenario definitions) we defined three main use cases, carried out by Red Hat, IBM and DW respectively. The goal was to start from a real-world scenario of the kind that companies lacking the ORBIT technology might experience, and to find an approach by which the new zero-downtime fault-tolerant resources could help to ensure Business Continuity in a new way.

• In D6.2.1 (Use Case: Integration and Experimentation) the project described dependencies and requirements for all use cases. Furthermore, in a Face-to-Face meeting in Bonn the consortium discussed how to align the use cases for meaningful experimental set-ups.

• In D6.3.1 (Use case: Evaluation and Conclusions) we now describe the approach to evaluate the results of the work. This includes the original evaluation criteria from the DoW as well as criteria for evaluation for the use cases. As noted above conclusions will be part of the next iterations of this particular deliverable.

1.5. ORBIT Outcomes - Use Case Mapping

The following table was created at the consortium meeting in Haifa in February 2014, in cooperation between all partners. It summarizes the main outcomes of the project with respect to their demonstration through the corresponding use cases. The use case mapping has already been part of an earlier deliverable, but is included here to ensure consistency of the process from use case definitions to use case implementation and then on to evaluation.

Note that since some use cases may demonstrate the same project result, they have been mapped by taking into consideration the optimum use case to demonstrate them in order to maximize impact and use the showcase for exploitation purposes as well. Furthermore, the interfaces that will be implemented through the Media Use Case will be demonstrated in all use cases for different purposes (e.g. show to the administrators that failure is foreseen on specific services / VMs).



Media Use Case | Business Use Case | Infrastructure Use Case

ORBIT Outcome: I/O offloading
Scenario: In one VM the I/O Hypervisor will be executed. A new “disk” will be added and a user that accesses a VM (running on a different host – Host 1) will be able to access the new disk through the I/O Hypervisor

M12

ORBIT Outcome: Mapping and deeper understanding of modularized User Interfaces – to better implement Business Continuity plans and to show the benefit of fault tolerant resources.
Scenario: Show better approaches for configurable interfaces, e.g. by showing the current interface of Open Stack (and other cloud platforms) and how these could be enhanced based on new technical set-ups or needs by IT administrators.

ORBIT Outcome: Memory scale-out (post-copy live migration)
Scenario: Show through a console the live migration process. Develop a UI (probably a “meter”) that shows the progress of migration. Remains to conclude whether this “meter” will be integrated and demonstrated through OpenStack

ORBIT Outcome: High-availability resource consolidation
Scenario: In the previous use case, we will add a second VM in a different host – Host 2. The user will access the “disk” added in the previous use case, while the first VM (running on Host 1) will be disconnected.

ORBIT Outcome: I/O Scheduling

M24

ORBIT Outcome: VM Recovery over MAN
Scenario: Show data wrapper (through a running web server) and demonstrate that in MAN-scale it can be recovered with small downtime

ORBIT Outcome: Resource externalization and consolidation
Scenario: Show a VM running on a host (Host 1) and accessing an external resource (i.e. disk). We will show that it is external (e.g. through console or IOStat). Then we will show that VM is migrated (post-copy) to a different host (Host 2) and still accesses the external resource.

ORBIT Outcome: Traffic redirection and load balancing
Scenario: Show through the media use case that a failure occurs, it is detected and traffic is redirected to other VMs.

ORBIT Outcome: Failover detection and handling
Scenario: Through an end-user that uses Bugzilla in one VM but then due to a failure he switches to another VM

ORBIT Outcome: Memory scale-out (post-copy live migration)
Scenario: Show through a console the live migration process. Develop a UI (probably a “meter”) to show progress

M30

ORBIT Outcome: Disaster recovery for heavy industrial workload
Scenario: Bugzilla and Data Wrapper running in parallel in the infrastructure while disaster cases occur and are being handled by ORBIT

Table 1: Mapping of project outcomes to scenarios

Furthermore, we will present the outcomes of our research with relation to scalability boundaries and performance limitations. These will not be demonstrated through Use Cases since they should be application-agnostic. Thus a series of benchmarks will be used and their results will be analyzed, based on the following plan:

Highly Available Consolidation of Virtualized Resources

Application Transparent Virtual Machine Fault Tolerance

Metro-Area Zero Downtime Disaster Recovery

M12 Effect of varying number of VMs, networking parameters (e.g. bandwidth), VM types (e.g. tiny, small)

Effect of varying number of VMs, networking parameters (e.g. bandwidth), VM types (e.g. tiny, small)

M24 Effect of varying number of VMs, networking parameters (e.g. bandwidth), VM types (e.g. tiny, small)

Table 2: Analysis plan of ORBIT research outcomes

The main purpose of this early mapping of potential outcomes is to make connections between the use cases more visible. This model and mapping will have to be revised with each new iteration of this deliverable for integration, validation and experimentation.



2. Evaluation Methodology

To validate and demonstrate the accomplishment of the R&D effort undertaken by the project, ORBIT will implement several challenging end-to-end experiments in which the ORBIT technology will be exercised against Future Internet scenarios from the business, media and infrastructure domains.

Validation Methodology

The use cases are the main focus of WP6. Overall, this set of use cases thoroughly exercises and demonstrates the entire set of innovations as described in the DoW. The following table shows which ORBIT outcomes are exercised by each use case:

This chapter maps out more specifically how the use cases will evaluate the outcome of the prototypes created in the use cases. Given the status of the project (M12), the input below merely describes how the project aims to test the outcome; it is too early to draw specific conclusions. First test results will be available during year two of the project.

2.1. Performance Metrics

Over the course of the project we need to develop metrics to show how ORBIT contributes to better disaster recovery. These metrics can be technical, time-based, or a measure of the amount of resources needed for recovery (e.g. man-hours). The principal metrics to be used have been defined in Deliverable 6.2.1.

Below is a brief summary of how those metrics will be used for the use cases.

2.1.1. Business Use Case Metrics

As the Business Use Case focuses on live migration of data and hand-overs between system elements, the metrics focus on these particular processes, how they progress over time and how well they work. The main metrics identified so far are as follows:

• Amount of data transferred
• Transfer speed
• Quality of the transfer process in failed and successful transfers
• Number of successful hand-overs between VMs
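As an illustration of how these four metrics could be derived from recorded transfers, the following minimal sketch aggregates a list of transfer records. The record layout (`Transfer`) and the field names are hypothetical and chosen for this example only; the actual measurement format is not yet defined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transfer:
    """One recorded migration or hand-over attempt (hypothetical layout)."""
    bytes_sent: int      # amount of data moved during this attempt
    duration_s: float    # wall-clock duration in seconds
    succeeded: bool      # True if the hand-over completed


def summarize(transfers: List[Transfer]) -> Dict[str, float]:
    """Aggregate the business use case metrics over all recorded attempts."""
    total_bytes = sum(t.bytes_sent for t in transfers)
    total_time = sum(t.duration_s for t in transfers)
    successes = sum(1 for t in transfers if t.succeeded)
    return {
        "total_bytes": total_bytes,
        "mean_speed_bytes_per_s": total_bytes / total_time if total_time > 0 else 0.0,
        "successful_handovers": successes,
        "success_ratio": successes / len(transfers) if transfers else 0.0,
    }
```

The success ratio is one simple stand-in for "quality of the transfer process"; a fuller evaluation might also compare speed and data volume separately for failed and successful attempts.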

2.1.2. Infrastructure Use Case Metrics

Our benchmark (ApacheBench) is meant to evaluate the performance of the system with the inclusion of the fault tolerance layer in the I/O hypervisor, and to compare it to the performance of a virtual machine without these additions. As we will evaluate the transfer of several page sizes, we expect the number of transactions per second to decrease as the page size grows. The main metrics identified so far are as follows:

• The effect of off-loading the I/O virtualization logic to a remote server connected via low-latency and high-bandwidth interconnects.

• Latency and impact of high-availability virtualized resource consolidation by providing continuous availability even when disconnecting any single physical server.

• Scalability boundaries and performance limitations.

• Application performance, showing successful deployment and operation of the use case applications and services without any adverse effects.
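The comparison between the two set-ups can be reduced to one number per page size. The sketch below extracts the "Requests per second" line from an ApacheBench summary and computes the relative throughput loss; the helper names are illustrative, and the figures used in any call are placeholders rather than measured results.

```python
import re
from typing import Optional

# ApacheBench prints a summary line such as:
#   Requests per second:    2000.00 [#/sec] (mean)
_RPS = re.compile(r"^Requests per second:\s+([0-9.]+)", re.MULTILINE)


def requests_per_second(ab_output: str) -> Optional[float]:
    """Extract the mean requests-per-second value from ab's text output."""
    match = _RPS.search(ab_output)
    return float(match.group(1)) if match else None


def throughput_loss_percent(baseline_rps: float, ft_rps: float) -> float:
    """Relative throughput loss of the fault-tolerant set-up vs. the baseline."""
    return (baseline_rps - ft_rps) / baseline_rps * 100.0
```

Running this once per page size would give a loss curve that can be checked against the expectation that transactions per second drop as pages grow.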

2.1.3. Media Use Case Metrics

The evaluation for the media use case will be split into two test cases. One is a technical approach, in which we aim to set up a server with a typical media process of loading and transcoding videos. This set-up will later be connected to the prototype from the technical WPs to check whether outages can be avoided using the ORBIT technologies.

Set-up of technical evaluation for media use case: A (virtual) server to which videos are downloaded from the Deutsche Welle on-demand video system. The videos are downloaded via FTP or the API endpoint to the server and then transcoded into MP4. As discussed in the F2F meeting in Bonn in July 2014, this is an initial test scenario in which we can simulate unavailability of the process and apply the ORBIT zero-downtime fault-tolerant solution.

The main metrics for this evaluation scenario are:

• Server load and CPU usage
• Simulations of unavailability or fault-resiliency
• Technical workload will be that of approximately 12 new videos per day, downloaded to the server, with an average length of 7.5 minutes.
• The videos are then transcoded
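The workload described above can be sized with simple arithmetic, and the transcoding step can be sketched as a command that a test script would launch. In the sketch below the use of ffmpeg, the H.264/AAC codecs and the file names are assumptions made for illustration; the deliverable only states that videos are transcoded into MP4.

```python
from pathlib import Path
from typing import List

VIDEOS_PER_DAY = 12     # approximate number of new videos per day (from the text)
AVG_LENGTH_MIN = 7.5    # average video length in minutes (from the text)


def daily_video_minutes(videos: int = VIDEOS_PER_DAY,
                        avg_length_min: float = AVG_LENGTH_MIN) -> float:
    """Total minutes of video material that must be transcoded per day."""
    return videos * avg_length_min


def transcode_command(source: Path) -> List[str]:
    """Build (but do not run) a transcode call; tool and codecs are assumptions."""
    target = source.with_suffix(".mp4")
    return ["ffmpeg", "-y", "-i", str(source),
            "-c:v", "libx264", "-c:a", "aac", str(target)]
```

With the stated figures the server has to process about 90 minutes of video per day, so an injected outage in the middle of a transcode job is easy to schedule and to repeat.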

To set up meaningful evaluation criteria DW will consult with UMEA, IBM and Red Hat.

Dashboard metrics

With the focus on building new and better user interfaces for monitoring systems, the metrics for the Media Use Case will be more qualitative than those for the Business and the Infrastructure Use Case. The main metrics that have been identified are:

• Find solutions for requirements gathered in expert interviews.
• Minimize length of downtime of the test system.
• Minimize time for detection of downtime and resolving it.
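The two time-based dashboard metrics can be derived from three timestamps per outage. The following is a minimal sketch; the `OutageEvent` record and its field names are hypothetical and only illustrate which intervals would be reported.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OutageEvent:
    """Timestamps of one (simulated) outage; the layout is an assumption."""
    started: datetime    # service became unavailable
    detected: datetime   # monitoring noticed the outage
    resolved: datetime   # service was fully available again

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.detected

    @property
    def downtime(self) -> timedelta:
        # Total downtime is detection time plus resolution time.
        return self.resolved - self.started
```

Reporting detection and resolution separately shows whether an improvement comes from faster monitoring or from faster recovery.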



3. Use Case Evaluation Plans

3.1. Business Use Case Scenario

As listed in the DoW (Task 6.2), the Business Use Case (M4-M27) will be led by Red Hat and will focus on modern business applications. This means transactional applications providing On-Line Transactional Processing (OLTP) along with advanced On-Line Analytical Processing (OLAP) services. Such services are critical to the operation of enterprises in general and of Small and Medium Enterprises (SMEs) in particular, which are the product's end customers. In terms of functionality, a typical application may support anything from one to a few business processes in a well-defined integration scenario. This use case focuses on a Highly Available issue tracking (Bugzilla) service as a way to demonstrate the concept.

3.1.1. Highly Available Bugzilla service evaluation

In the business use case, it is important to demonstrate reasonably realistic scenarios. For this reason, we have chosen to focus the test on the use of macro-benchmarks. Although they are only a demonstration, they reflect near-real scenarios and the corresponding applications and workloads that are often used in production systems.

This kind of benchmark is used to show the behaviour of a system under a certain scenario that is likely to occur in real life. The benchmarks are repeatable and have consistent results (unlike real workloads), which makes them ideal for comparing overall system performance under a specific workload and the overhead of failing over from the active instance to the other.

3.1.2. Macro-benchmarks

In order to create a baseline, we will need to measure the data for a single VM running Bugzilla. Once we have established a single-instance measurement, we can proceed to create a similar load on a fault-tolerant VM and measure its performance. This will allow us to calculate the overhead of the fault-tolerance mechanism. With a fault-tolerant VM pair we will also measure the behaviour during failover.
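The overhead calculation can be sketched as a relative difference between the two measurement series. The snippet below assumes that request latency in milliseconds is the measured quantity and that the mean is the statistic of interest; both are illustrative choices, not decisions taken by the project.

```python
from statistics import mean
from typing import Sequence


def overhead_percent(baseline_latencies_ms: Sequence[float],
                     ft_latencies_ms: Sequence[float]) -> float:
    """Relative increase of mean latency on the fault-tolerant VM vs. the baseline VM."""
    base = mean(baseline_latencies_ms)
    fault_tolerant = mean(ft_latencies_ms)
    return (fault_tolerant - base) / base * 100.0
```

A positive value means the fault-tolerant VM is slower than the single instance under the same load; medians or percentiles could be substituted if the latency distribution turns out to be skewed.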

3.1.3. Methodology

The evaluation plan will consist of the following steps:

1. Set up the first VM with Bugzilla installed. A demo DB should be loaded into Bugzilla.

2. Set up a second VM with Bugzilla installed as a mirror (in special standby mode).

3. Write a script to issue multiple random queries using the Bugzilla API.

4. Measure load to make sure both VMs carry a similar load, and measure any latency, if it exists.

5. Fail a single instance and measure the time and latency it takes the other Bugzilla instance to stabilize.
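Steps 3 and 5 can be sketched as follows. The load generator takes the actual query as a callable (`fetch`), so no particular Bugzilla endpoint is assumed here; the stabilization rule (a fixed number of consecutive samples under a latency threshold) is likewise an assumption chosen for illustration.

```python
import random
import time
from typing import Callable, List, Optional, Sequence, Tuple


def run_random_queries(fetch: Callable[[int], object], bug_ids: Sequence[int],
                       count: int, seed: int = 0) -> List[float]:
    """Issue `count` queries for randomly chosen bug ids; return latencies in seconds."""
    rng = random.Random(seed)          # seeded so that runs are repeatable
    latencies = []
    for _ in range(count):
        bug_id = rng.choice(bug_ids)
        start = time.perf_counter()
        fetch(bug_id)                  # the caller decides how a bug is queried
        latencies.append(time.perf_counter() - start)
    return latencies


def stabilization_time(samples: Sequence[Tuple[float, float]], failure_time: float,
                       max_latency: float, window: int = 3) -> Optional[float]:
    """Seconds from the failure until latency is back under `max_latency`.

    `samples` holds (timestamp, latency) pairs in time order. The service counts
    as stable at the first of `window` consecutive samples at or below the limit.
    Returns None if that never happens.
    """
    after = [(t, lat) for t, lat in samples if t >= failure_time]
    for i in range(len(after) - window + 1):
        if all(lat <= max_latency for _, lat in after[i:i + window]):
            return after[i][0] - failure_time
    return None
```

In a real run, `fetch` would wrap a call to the web service API of the installed Bugzilla version, and failed requests could be recorded with an infinite latency so that they never count as stable.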



3.2. Infrastructure Use Case Scenario

The main goal of the Infrastructure Use Case (Task 6.4, M4-M27, led by IBM) is to support the use of machine virtualization for High Availability (HA), by exploring ways to reduce costs of such set-ups.

Machine virtualization is undoubtedly useful for HA, but does not come cheap. The performance cost of virtualization, for I/O intensive workloads in particular, can be heavy. This cost is extremely important if virtualization is always enabled at the infrastructure level, as is the case for ORBIT.

Thus, in this scenario, we will focus on the I/O performance and scalability of ORBIT. We plan to evaluate ORBIT running a set of representative I/O intensive workloads. We aim to deploy ORBIT in an N+1 model to simulate a commodity equivalent of the IBM PureFlex [1] environment, where only a single compute node and a single I/O node are used to provide HA capabilities for other N nodes. Within this task we plan on the following development steps:

1. Evaluation of typical I/O workloads, focusing on monitoring CPU, memory and I/O resource utilization.

2. Architecture, design and implementation of the use case prototype in the testbed integrating with the ORBIT infrastructure.

3. Evaluation of the ORBIT infrastructure in relation to the workloads created by the scenarios. This requires a full end-to-end setup on the testbed.

One of the main use cases of cloud infrastructure is running web services. This can be as simple as a web server serving single static pages, up to multi-tier business applications. In our scenario, we will benchmark the effective throughput of an Apache web server installation, to simulate the real-world scenario.

3.2.1. Infrastructure Use Case Evaluation

Evaluation will be performed through the use cases in order to challenge and validate the project innovations (following the available outcomes based on the 3 project development cycles). The report will provide evaluation results regarding the infrastructure, as well as regarding the user, application and technical requirements. [month 12]

The performance evaluation of the system infrastructure is dependent on the implementation of the components listed in Task 3.1. At this point in the project (month 12) we have completed approximately xx% of the overall functionality (some components are more advanced in their development than others), and we have not yet reached the stage at which evaluation is prudent. Thus, the remainder of this section describes the evaluation which is planned for the coming months, as the components mature.

[1] http://www-03.ibm.com/systems/pureflex/pureflex_overview.html



We have chosen to run both micro-benchmarks and macro-benchmarks to showcase various aspects of the system. Micro-benchmarks are used to show performance changes (improvements or degradation) in a very specific part of the system. We use them to amplify a certain behaviour to emphasize how the behaviour has changed from the baseline measurements. Micro-benchmarks are not realistic in that a real-life workload is unlikely to cause the same effect, but they are nonetheless an effective tool for pinpointing behavioural differences in the system.

To show more realistic scenarios, we have also chosen to test using macro-benchmarks. While they are still artificial, they are much closer to real-life scenarios and use applications and workloads that are often used in production systems. This kind of benchmark is used to show the behaviour of a system under a certain scenario that is likely to occur in real life. The benchmarks are repeatable and have consistent results (unlike real workloads), which makes them ideal for comparing overall system performance under a specific workload.

Our system infrastructure is the backbone of the datacenter, and all I/O traffic is affected by our system's performance. Therefore, it is necessary to evaluate the I/O traffic throughput (the amount of data reaching a target server in a given timeframe) as well as latency (the time taken for the data to be received after being sent). These two aspects give a full characterization of the system, and can be used to compare the improvements to the baseline comprehensively.

3.2.2. Micro-benchmarks

To evaluate latency, we have chosen the netperf UDP-RR micro-benchmark. This test sends a network packet to another host using the UDP/IP protocol, which has a small packet size and low overhead. Once the second host receives the packet, it immediately sends a reply (of the same kind) back to the originator. Once the packet is sent, the time is measured until a reply is received, signifying one full round trip. Only one packet is "in flight" at any one time, meaning the originator will not send a second packet until the response from the first one is received. This benchmark has been crafted to measure end-to-end latency in the system.
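The request/response pattern described here can be illustrated on a single machine with plain UDP sockets. The sketch below is not netperf and will not produce comparable numbers; it only shows the "one packet in flight" measurement loop over the loopback interface, with function names chosen for this example.

```python
import socket
import threading
import time
from typing import List


def echo_server(sock: socket.socket, count: int) -> None:
    """Reply to `count` datagrams by sending the same payload back."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def measure_round_trips(count: int = 100, payload: bytes = b"x") -> List[float]:
    """Return `count` round-trip times (seconds) over the loopback interface."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))      # let the OS pick a free port
    thread = threading.Thread(target=echo_server, args=(server, count), daemon=True)
    thread.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)             # do not hang forever if a datagram is lost
    rtts = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            client.sendto(payload, server.getsockname())
            client.recvfrom(2048)      # wait for the reply: only one packet in flight
            rtts.append(time.perf_counter() - start)
    finally:
        client.close()
        thread.join(timeout=2.0)
        server.close()
    return rtts
```

Because the next request is only sent after the previous reply has arrived, the mean of the returned values approximates end-to-end latency rather than throughput.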

To evaluate throughput, we have chosen the netperf TCP-stream micro-benchmark. This test sends a stream of TCP/IP packets of a given size to a target host. Unlike the previous UDP-RR test, many packets can be in flight simultaneously, thus reducing the time between packets to a minimum, effectively hiding the latency. The idea is to stress the system to see how much data we can send from one system to another in a given time period (throughput). In general, larger packet sizes give better results (higher throughput) because the overhead per packet is amortized over a larger amount of data. However, with today's powerful CPUs we can often saturate the bandwidth of the network link (line speed), thus hiding any possible improvements. Nevertheless, we can show improvements by looking instead at the CPU utilization required to drive a certain throughput, or by reducing packet size, thus increasing the overhead to the point where line speed can no longer be attained.
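One component of the per-packet overhead, the fixed protocol headers, can be quantified with a short calculation. The sketch below assumes minimal Ethernet, IPv4 and TCP headers without options and ignores the preamble, frame check sequence and inter-frame gap; the per-packet CPU cost, which the text is mainly concerned with, is not modelled.

```python
# Ethernet (14) + IPv4 (20) + TCP (20) header bytes, no options (assumption).
HEADER_BYTES = 14 + 20 + 20


def payload_efficiency(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of transmitted bytes that is application payload."""
    return payload_bytes / (payload_bytes + header_bytes)


if __name__ == "__main__":
    for size in (64, 256, 1024, 1460):
        print(f"{size:5d} B payload -> {payload_efficiency(size):.1%} of bytes on the wire")
```

Small payloads spend a much larger share of each packet on headers, which is why shrinking the packet size is a practical way to push the system below line speed and make differences between configurations visible.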



3.2.3. Macro-benchmarks

To show how the system would perform in a more "real-life" scenario, we have selected a sampling of more realistic applications that would likely be used in a production system. The first is the Apache webserver.

A webserver uses a networking protocol to provide documents on-demand in response to client requests. Webservers typically handle thousands of such requests per second, depending on the processing required to access the documents. In order to focus on the I/O aspects of the system, rather than on the document processing aspects, we use static documents (files) rather than dynamically created documents (scripts) in our tests. The webserver scenario combines both the throughput and latency aspects, thus testing overall system performance.

The second is the memcached server, which is generally used to cache data for web applications. It is more latency sensitive than the webserver, and moves less data per request.

3.2.4. Methodology

We plan to evaluate the system using the aforementioned benchmarks in two configurations:

1. The baseline system without any changes, representing how datacenters are designed and built today.

2. The second configuration includes the I/O hypervisor and software required to support the FTL (fault tolerance layer).

In addition, we will be able to show how the RCL (resource consolidation layer) can help fully utilize I/O devices by balancing their usage across multiple servers.



3.3. Media Use Case Scenario

Task 6.3 (M4-M27) will be led by Deutsche Welle (DW) and aims at broadening the approach and innovations created in ORBIT by focusing on additional support for planning Business Continuity in a specific domain (here: media).

The media use case evaluation will be executed in two steps. First, to draw technical conclusions, DW will set up a typical media server, which will later be equipped with the solution provided by either Red Hat or IBM.

Second, DW will work on an interface to visualize a system set-up with and without the ORBIT solution. The goal is to provide better reporting to stakeholders: to monitor potential threats that could cause outages, and/or to define what a visualization for reporting on a system's availability could look like.

The evaluation plan for the media use case is structured as follows:

1. Set up a typical media workflow on a server, to be implemented in the testbed.

2. Perform simulated outage and recovery procedures, using the ORBIT technology.

3. Visualize the metrics for the start of the outage event versus the time needed for recovery to full availability in a novel, specific visualization for integration in monitoring tools or dashboards.

4. Present the before/after results to technical administrators for additional feedback.

5. Use metrics from the IBM and Red Hat use cases to perform the same visualization, based on extended metrics for multiple instances, to create a novel type of reporting/monitoring dashboard or single elements that could be included in dashboards available on the market.
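The metrics behind step 3 (start of the outage versus time needed for recovery) can be derived from pairs of timestamps. The following is a minimal sketch, assuming that each simulated outage is recorded with a start time and a recovery time and that outages do not overlap. The `Outage` class, the `availability` function and the timestamps in the demo are hypothetical and are not part of the ORBIT software.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outage:
    started: datetime    # when the service became unavailable
    recovered: datetime  # when full availability was restored

    @property
    def recovery_time(self) -> timedelta:
        return self.recovered - self.started


def availability(outages, window_start: datetime, window_end: datetime) -> float:
    """Fraction of the window during which the service was up.

    Assumes the outages do not overlap each other.
    """
    window_s = (window_end - window_start).total_seconds()
    down_s = 0.0
    for outage in outages:
        # Clip each outage to the observation window.
        start = max(outage.started, window_start)
        end = min(outage.recovered, window_end)
        if end > start:
            down_s += (end - start).total_seconds()
    return 1.0 - down_s / window_s


if __name__ == "__main__":
    # Illustrative timestamps only, not measured data.
    day = datetime(2014, 10, 1)
    events = [Outage(day + timedelta(hours=2), day + timedelta(hours=2, minutes=36))]
    for event in events:
        print(f"outage at {event.started:%H:%M}, recovered after {event.recovery_time}")
    print(f"availability: {availability(events, day, day + timedelta(days=1)):.3%}")
```

Running the same calculation on the baseline system and on the system with the ORBIT technology would give the before/after figures that the visualization is meant to present.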

3.3.1. Media Use Case Evaluation

ORBIT Innovations as Differentiators: The media use case aims to broaden the approach taken in ORBIT by providing a way to integrate the innovations worked on in the technical WPs into a broader setting of “Business Continuity” planning. Such plans will be needed not only in media organizations, but in other business and public domains as well. This application of a customizable approach can serve as a way to explore specific needs in a specific situation that can later be applied and broadened if necessary for other areas of application.

Therefore we aim to use the media use case to describe how ORBIT can enhance the availability of media services, with a focus on visual user interfaces. In combination with the more technical aspects of the project these interfaces should enhance the following:

• Avoidance and early detection of service interruptions due to overload, failure of a system or sudden incidents

• Simpler detection of failure sources, organizing the process of searching for the cause of an error, failure or service downtime in a structured way



• Better communication of potential service disruptions

• Simpler introduction of new resources after a service disruption using ORBIT.

Avoidance of and precautions against downtime: As part of the work done in ORBIT, a visual user interface provides an overview of the system status based on the visual concept of “small multiples” and sophisticated color coding. Users can sort all views based on a number of filters to gain an understanding of the system status quickly.

Detection of possible failures: With ORBIT we introduce a system to monitor certain thresholds, with levels ranging from “good” through “some issues” and “critical” to “failure”, which can be seen on an admin screen. For example, we can define “stress levels” for individual systems, based on metrics that can differ considerably from one system to another. For storage, for example, we look at the amount of used/free disk space, the number of writes per month, etc.
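A minimal sketch of such a threshold scheme is given below. The four levels come from the text above, but the metric names, the numeric bounds and the rule that the overall status is the worst level of any metric are assumptions made for illustration, not values defined by the project.

```python
from enum import IntEnum


class Status(IntEnum):
    GOOD = 0
    SOME_ISSUES = 1
    CRITICAL = 2
    FAILURE = 3


# Hypothetical lower bounds per metric: a value at or above a bound reaches that level.
THRESHOLDS = {
    "disk_used_fraction": {
        Status.SOME_ISSUES: 0.70,
        Status.CRITICAL: 0.90,
        Status.FAILURE: 1.00,
    },
    "writes_per_month": {
        Status.SOME_ISSUES: 1e6,
        Status.CRITICAL: 5e6,
        Status.FAILURE: 1e7,
    },
}


def classify(metric: str, value: float) -> Status:
    """Return the highest level whose bound the value has reached."""
    level = Status.GOOD
    for status, bound in sorted(THRESHOLDS[metric].items()):
        if value >= bound:
            level = status
    return level


def system_status(readings: dict) -> Status:
    """Overall status is the worst level among all monitored metrics."""
    return max((classify(m, v) for m, v in readings.items()), default=Status.GOOD)


if __name__ == "__main__":
    readings = {"disk_used_fraction": 0.75, "writes_per_month": 2e5}
    for metric, value in readings.items():
        print(f"{metric}: {classify(metric, value).name}")
    print(f"overall: {system_status(readings).name}")
```

Keeping the bounds in a table separate from the classification logic reflects the point that every system may need its own "stress levels": only the table changes from one set-up to the next.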

Easier and better structured recovery: The communication among the team members in the event of failures as well as partially automated reporting are designed to motivate organizations to plan for continuity and see the benefits. This element is more or less an add-on to the work done in ORBIT, but could be one building block where administrators understand and “see” the benefits, which then could influence the exploitation of the core technical work in ORBIT.

3.3.2. Methodology

Understanding user needs

First results from initial discussions and questionnaires with administrators of systems showed that at the core Business Continuity is ripe for innovation. Many of the approaches in use today assume a level of control that is no longer valid. This is partially due to the rising complexity of interconnected systems, partially because the relevance of such systems has moved from a supporting and add-on role to the core of modern organizations.

Hardware getting more reliable, but systems are more complex

Another aspect is that the reliability and availability of modern databases and servers have become much better from a hardware point of view. Modern servers and software often no longer reach the levels of overload they did a few years ago. But systems have also become more complex, which makes it more difficult to keep track of them. This claim, made in talks by the administrators interviewed so far, must be better understood and quantified, but is a first valid assumption towards better systems.

Visualizing what a system does and how outages occurred

At the very core the problem or challenge of Business Continuity is to master the human-machine interaction. This ranges from managers, untrained in the usage of the technology, trying to read and understand reports showing the number and severity of outages to the visualization of the dependencies of technical set-ups in ways that show which system is most prone to failure, even if the system itself plays just a minor role. But it is also a question of how end-users (in the case of a media company these would be editors and users of a webpage) can be informed about issues in a compelling way. These are just examples underlining the importance of a good visualization.

Key goal: Provide better overview to avoid outages

At this stage of the project one insight is that the media use case and the development of better user interfaces are essentially non-functional requirements – with no direct connection to the functional requirements of the ORBIT development. At the least, the dependencies are not well understood or mapped at this stage.

As a preparation step the media use case must map all potentially relevant non-functional requirements, based on user needs.

Initial lists of user needs for Business Continuity

The table below is based on the visual and goes through the different levels of Business Continuity (BC), High Availability (HA) and finally Fault Tolerant Systems (FTS).

| Keyword | User Need | Complexity |
|---|---|---|
| Planning and Preparation | | |
| Business Continuity Planning | Budgeting and planning with all stakeholders, including non-technology management and users. | High |
| Responsibilities | Defining who does what to ensure BC. | Medium |
| Set-up/budgeting | Defining which systems would be needed to avoid outages from a technical/hardware perspective. | Low |
| Setting alerts/thresholds for warnings | Defining which system should alert administrators at what point. Depending on how much a system is centralized or decentralized, fault tolerant, etc., every set-up might be different. | Medium |
| SLAs | Setting Service Level Agreements is important, but is reported as an area that has improved. The main reason is that there are by now “specialists” like Akamai providing highly available, worldwide systems better than one organization could do itself. | Low |
| Business Practice | | |
| Monitoring | How to visualize the current status of a system and communicate its availability now and sometime into the future. | High |
| Warn Levels | A visual system that communicates better than current systems about issues. | Medium |
| Outage avoidance | A wider approach, where dependencies of systems are mapped and help to avoid faults. Visualization of the current system status and good visual display of potential problems are key. In talks with admins we learned that, depending on the view, a system status might look good while some users do not see any content, etc. This might call for a combination of methods, e.g. monitoring Twitter for mentions of a service or brand in combination with certain keywords. | High |
| Outage handling | Once an outage occurs: who is involved, how is the issue communicated, and how well does the process work from a communication and cost perspective. | High |
| Recovery | Depends on the issue. We learned that common causes are often logical problems (one file in a database being corrupted), which are harder to solve than failures of hardware. | |
| Reporting | Visualizing and combining causes of outages, visualizing long-term development (are there more or fewer outages and what causes them). Reported to be very time consuming right now. | |

Table 3: Details on the hierarchy of concepts

Monitoring interfaces

Working on better ways to visualize a system’s behavior and connections is a major development area, based on higher availability of data from such systems and higher importance. Such interfaces can serve a variety of needs, but should be based on either novel or typical issues of organizations when trying to avoid outages or minimize the effects of outages through novel, more fault-tolerant technologies.

For the evaluation, we will develop a novel approach to such interfaces, building on metrics from the ORBIT project where possible. Additionally, we aim to combine a variety of services and indicators that can serve as “trigger points” to initiate a warning, an action or a mail/notification sent out to recovery experts.



The monitoring interface will be developed using available services (such as “If this then that” (IFTTT)² or Zapier³) and will combine these with visualizations that should provide a view of the status (and health) of a system with and without the ORBIT technologies implemented.
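The “trigger point” idea can be sketched as a small set of rules, each pairing a condition on the current indicators with an action. This is a local illustration of the concept only: it does not call IFTTT or Zapier, and the `Rule` class, the indicator names and the in-memory `outbox` standing in for a mail or notification service are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # the "if this" part
    action: Callable[[str], None]                  # the "then that" part


def evaluate(rules: List[Rule], indicators: Dict[str, float]) -> List[str]:
    """Run the action of every rule whose condition holds; return the fired rule names."""
    fired = []
    for rule in rules:
        if rule.condition(indicators):
            rule.action(rule.name)
            fired.append(rule.name)
    return fired


if __name__ == "__main__":
    outbox: List[str] = []  # stand-in for a mail/notification service

    def notify(rule_name: str) -> None:
        outbox.append(f"notify recovery experts: {rule_name}")

    rules = [
        # Hypothetical indicators and limits, for illustration only.
        Rule("high error rate", lambda i: i.get("error_rate", 0.0) > 0.05, notify),
        Rule("outage keywords mentioned", lambda i: i.get("keyword_mentions", 0.0) >= 10, notify),
    ]
    evaluate(rules, {"error_rate": 0.08, "keyword_mentions": 3})
    print(outbox)
```

Because conditions and actions are plain callables, indicators from different sources (system metrics, external services) can be combined in one rule set without changing the evaluation loop.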

3.4. Testbed requirements

The primary testbed (Task 6.1, M6-M30, led by UMU) will be configured to support the testing and evaluation of the use cases, especially for the resource consolidation layer (RCL, WP3) and fault tolerance layer (FTL, WP4) developments. It will run the ORBIT software stack and be configured according to the needs of the use cases. The testbed consists of 6 identical servers, which are connected to each other by regular Gigabit Ethernet, by 56Gbit FDR InfiniBand, or by 40Gbit Ethernet (over InfiniBand fabric). This configuration enables both the infrastructure and the business use cases. On the one hand, it facilitates fast fault tolerance recovery and synchronization of the memory of active-passive VM pairs. On the other hand, it provides an efficient environment for the I/O consolidation server, where VMs allocated in one server can quickly access the I/O server and vice versa. Currently, the plan is to configure the servers as follows:

- 1 server will be the I/O hypervisor host (2x 40Gbit Ethernet ports).
- 2 servers for hosting VMs (1x 56Gbit IB port + 1x 40Gbit Ethernet port each).
- 3 servers for load generators (2x 40Gbit Ethernet ports each).

A more detailed description of the primary testbed can be found in Document D2.3.1. In addition, to enable the media use case, there will be a secondary testbed at another site to demonstrate multi-data-center clouds and federation of clouds (DRL, WP5) in a more realistic environment. In particular, the testbed will provide:

• A multi-site environment running the latest release of ORBIT software to evaluate functionalities under real conditions;

• Support for the execution of use cases;

• Feedback to fix bugs and improve performance;

• An environment for demonstrations, tutorials and training.

2 See: https://ifttt.com

3 See: https://zapier.com



4. Conclusions

This initial document for the Integration and Evaluation of the project outcomes mainly defines how conclusions will be reached in the 2nd and 3rd year of the project. The demonstrators, once implemented, will be put to the test in order to ensure that both technical and organizational advances can be provided.
