
Fabio Veronese, Enel.it

SEUGI 19, Florence, May 29-June 1, 2001

Uncovering Service Levels


Service Levels are so important ...
Everybody has a SL(A)M System ...
... and some reports

Introduction

Open Systems: many platforms ... more servers

... much more data

... managing complexity and scalability?


Brief review of the LDS Project:
• Approach: basic principles
• Results: state of the art

Introduction

Backstage information provided
What happened during implementation
Non-technical session, but a very practical, pragmatic approach

New concept: coverage! Make it grow ...


LDS Project

Progetto Livelli di Servizio ("Service Levels Project")

Introduction

in progress (now closing) in Enel.it

Going to talk about OS/390, UNIX, NT, Web, ...


Effective Service Level Management requires:

• Data Warehousing of the PDB
• Web reporting of results

LDS basics


Data Side
Make the PDB a Warehouse with:

• consistent, business-related metrics
• several analysis dimensions
• uniqueness, timely updates

LDS basics


Data Side
Metrics:

• Quality
• Quantity
• Availability

LDS basics

Dimensions:

• Client
• Product
• Geography
• Workload
• Environment
• Shift


Reporting Side
Reports per user:
Top Managers - CSF
Account Managers - Client, Product
Production Managers - Workload, Geography

Client model
Thin client (equipped with a web browser)
Access to a dynamic web site

LDS basics


LDS Architecture

[Architecture diagram, based on SAS IT Service Vision: SMF data accumulated on the OS/390 nodes in Roma, Milano and Napoli, together with metrics from the UNIX and NT servers (Oracle, SQL Server, SAP, call center and service data such as HW/SW rental), are transferred via FTP into platform PDBs (OS/390, NT, UNIX, other) and consolidated into a global PDB, which feeds O.L.A. reporting, the Service Contracts and S.L.A. reporting (SLAM).]


OS/390 issues
What is managed:
• 4 CECs, 5000 MIPS, 20 TB
• 8M transactions and 100K jobs per day
• 15 GB of SMF data daily
• 8 GB of SMF data transferred daily
• 8 hours of computing time daily
• several PDBs adding up to 80+ GB


OS/390 issues
This is where all the data come to stay
• Consolidated ITSV environment
• Scheduling
• Backup
• Generally high reliability

Coverage: from high to 100%
Scaling problem: huge amounts of data
Recommendation: let the mainframe do the dirty work


NT issues
What is managed:
• 500 NT servers (the most critical) out of 2000
• Exchange, IIS and departmental servers
• User-written Patrol KM
• NT data
• Exchange data
• IIS data
• one 10 KB daily transfer per server


NT issues
This is where the blue screens appear
• Unreliable resources (HW, SW, management)
• Need for "geographic reboots"
• Minimum local data processing
  - reduce network load
  - take only local decisions

Coverage: 25% (80% of critical servers)
Scaling problem: heavy daily operations
Recommendation: let NT do as little as possible


Unix issues
What is managed:
• 50+ CECs with more than 500 CPUs, 10 TB
• Oracle, BEA, CRM, DMS, CTI, SAP servers
• User-written Patrol KM
• Unix data
• Oracle data
• SAP data
• one 1 MB daily transfer per server


Unix issues
This is where the scripts rule
• mainframe-like distribution (no data-bulk problem)
• currently 4 managed Unix flavors
  - Solaris, AIX, HP-UX, SCO (waiting for Linux and Tru64)
  - effort multiplied by 4
• more reliable than NT
• less friendly than NT

Coverage: 90+% of CED servers
Scaling problem: those different flavors...
Recommendation: reduce OS-dependent code


Reporting issues
About the Web site:
• intranet site with ~1M static HTML pages (1.1 GB)
• dynamic services included (programmed downtimes, security)
• "simulated" navigation through dimension-grouped reports
• migration to Java technology (scheduled with SAS AppDev Studio)


Reporting issues
This is where the reports show
• portal approach: everything about SL here (not just metrics)
• lots of information and many pages require sound reporting schemes and coherent standards

Coverage: all data accessible via the web site
Scaling problem: more and more pages to refresh
Recommendation: dynamic site and specialized sections


Results

I think I’d better show you something ...


Concluding remarks
• You've just been through two people's two years of work
• A word or two for IT Service Vision
• The more information provided, the more data requested
• A robust architecture incorporated even unforeseen needs
• Complexity always comes at a cost
• The same problems required different answers in changing contexts (experience is crucial)
• Medium-term coverage below 50% is dangerous


Concluding remarks
Evolutions:
• Specialized SLA section (directly for the Client)
• Add non-technical data from Service Contracts
• Dynamic site

(Easy) Questions?



Uncovering Service Levels

Fabio Veronese

Enel.it

Abstract

Countless papers have stressed the importance of Service Levels, as they reflect the ability to manage resources and the information provided to customers. Every player in the IT world has to deal with them nowadays. In fact, every IT manager now has his or her own Service Level Management System, usually built on sound principles and dedicated resources, and able to provide results in the short term. However, this common view on Service Levels is now challenged. In what follows, we introduce a simple, pragmatic way to evaluate the performance of a Service Level project. We measure its "coverage", i.e. the percentage of IT resources on which the project is able to provide satisfactory Service Levels. Under the assumption (based on personal experience) that high levels of coverage are unreachable for unsound projects, we argue that the essential step from low coverage (early results) to high coverage (thorough implementation) may prove not to be an easy task, even for projects which are already under way. The paper focuses on what proves to be necessary to enhance the coverage of a well-defined Service Level project. Scaling problems and platform-related issues are also dealt with in the description of what may be considered the closing part of the ongoing Service Level project at Enel.it.



Introduction

Much has been said about the importance of Service Levels, as they reflect the ability to manage resources and the information provided to customers. The need to measure the Service Levels of an IT company originates from several factors, such as Customer Relationship Management and service quality assessment. Basically, however, it is aimed at improving the knowledge of the Information System in order to get the best out of it.

Every player in the IT world has to deal with Service Levels. Indeed, every IT manager nowadays has his or her own Service Level Management System, usually built on sound principles and dedicated resources, and able to provide some results already in the short term. But is that enough? This comfortable view of Service Levels is now challenged by complexity.

In what follows we introduce a simple, pragmatic way to evaluate the success of a Service Level project, by considering its "coverage", i.e. the percentage of IT resources on which the project is able to provide satisfactory Service Levels. Personal experience shows that high levels of coverage are unreachable for unsound projects. The essential step from low coverage (early results) to high coverage (thorough implementation) may not be an easy task, even for projects which are already under way. The paper focuses on what proves to be necessary to enhance the coverage of a well-defined Service Level project while dealing with complexity. Scaling problems and platform-related issues are dealt with in what may be considered the closing part of the ongoing Service Level project at Enel.it.

Enel.it is the company of the Enel Group whose mission is to provide IT services to the Enel Group companies.

In this article, I will first give a brief account of the principles we followed while developing our Livelli di Servizio ("Service Levels", in the following LDS) project; I will then move on to discuss some peculiar features that seem to deserve attention.
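As a purely illustrative sketch (not taken from the LDS code), coverage can be computed from a resource inventory that flags which resources already receive satisfactory Service Levels; the dataset and variable names below are invented for the example.

/* Hypothetical sketch: "coverage" as the percentage of inventoried IT      */
/* resources on which satisfactory Service Levels are provided.             */
data inventory;
   input resource :$12. platform :$8. covered;
   datalines;
ROMA-CEC1    OS/390 1
MILANO-CEC1  OS/390 1
EXCH-SRV-042 NT     0
ORA-SRV-007  UNIX   1
;
run;

proc sql;
   title 'LDS coverage (percent of resources with satisfactory Service Levels)';
   select 100*mean(covered) as overall_coverage format=6.1
     from inventory;
   select platform, 100*mean(covered) as coverage format=6.1
     from inventory
    group by platform;
quit;
title;

The by-platform breakdown mirrors how coverage is discussed later in the paper, one figure per platform plus an overall one.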

The LDS Project:

The Approach

In these days of open systems, where data are "distributed" across OS/390, UNIX and NT servers and network equipment, and there is a clear need for comprehensive, centralized reporting, effective Service Level management requires a common methodology of collection and reporting. The way to achieve this goes through specialized Data Warehouses and web reporting tools. Established Data Warehousing principles guarantee effective and consistent metrics, while company intranets assure the availability of information to users.

The data side

Applying Data Warehousing principles to the performance metrics of the Performance Data Base (in the following, PDB) has relevant implications, such as preferring business-related metrics to technical indicators, homogenizing metrics coming from different platforms, centralizing metrics in a unique logical PDB to assure integrity, and paying particular attention to the choice of metrics and dimensions.

Evaluating the performance of an Information System basically means pointing out the utilization of specific resources and verifying service quality and availability. This leads to three main metric categories:
• quantity metrics: information on resource utilization
• quality metrics: how well the service is delivered
• availability metrics: the ability to provide the service at all
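A minimal sketch of how the three categories might be derived from raw measurements is given below; the RAWMEAS table and its variables are assumptions for illustration, not the real PDB structures.

/* Illustrative only: one row per client/product/hour with CPU consumption, */
/* response time and an up/down flag, rolled up into the three categories.  */
data rawmeas;
   input client :$10. product :$10. date :date9. cpu_secs resp_time up_flag;
   format date date9.;
   datalines;
ENEL_DIST BILLING 28MAY2001 3600 0.8 1
ENEL_DIST BILLING 28MAY2001 4200 1.2 1
ENEL_DIST CRM     28MAY2001 1800 2.5 0
;
run;

proc means data=rawmeas noprint nway;
   class client product date;
   var cpu_secs resp_time up_flag;
   output out=sl_metrics(drop=_type_ _freq_)
          sum(cpu_secs)=quantity_cpu      /* quantity: resource utilization  */
          mean(resp_time)=quality_resp    /* quality: how well it is served  */
          mean(up_flag)=availability;     /* availability: share of time up  */
run;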

In order to get the most information from our metrics, it is useful to classify them on the basis of company facts, such as the company organization and geographical articulation, the company's clients and products, and the moment at which the service is provided. The information needed to describe metrics is called dimensions, and dimensions are usually organized according to a hierarchical structure. Providing cross-dimensions through the metrics unifies the language and eases multidimensional analysis, although applying predefined levels and codification to all the metrics is one of the toughest tasks in populating a PDB, especially in the case of data coming from different platforms.
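One possible way to sketch such a codification in SAS terms, assuming invented resource names and dimension values, is a set of lookup formats applied to the raw data so that every platform ends up with the same dimension levels.

/* Assumed sketch: lookup formats map raw resource names onto dimensions.   */
data pdb_raw;
   input resource :$12. cpu_secs;
   datalines;
ROMA-CEC1    52000
MILANO-CEC1  31000
EXCH-SRV-042   840
;
run;

proc format;
   value $geodim 'ROMA-CEC1', 'ROMA-CEC2' = 'Roma'
                 'MILANO-CEC1'            = 'Milano'
                 other                    = 'Unknown';
   value $clidim 'EXCH-SRV-042'           = 'Corporate Mail'
                 other                    = 'Unassigned';
run;

data pdb_coded;
   set pdb_raw;
   length geography client $20;
   geography = put(resource, $geodim.);   /* geography dimension */
   client    = put(resource, $clidim.);   /* client dimension    */
run;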

The reporting side

The objective of the LDS project is to enable users to extract information from performance data on their own. As every user is looking for different information in the same data, the reporting methodology becomes of paramount importance, since it must be able to comply with the needs of the users. LDS users may be classified on the basis of the required level of analysis, of the most interesting dimensions and of the estimated level of interactivity.

Traditional client-server reporting systems (based on the fat-client model) succeed in satisfying every user request, but they often cannot be maintained by system programmers and are too complicated for the users. Direct experience in the development of a SAS EIS application proved this approach not to be effective. In the Internet days, web sites and thin clients are very fashionable, because virtually every product has HTML output capabilities, there is no need for software distribution, and every user feels comfortable with a browser. Such technology is undoubtedly strategic; nevertheless, a static-page web site satisfies only the low-interactivity users, while the others run the risk of incurring the well-known problem of ad hoc reports (available reports never feel ad hoc).

The ultimate solution to reporting problems is a dynamic web site, which has the benefits of the intranet and satisfies interactivity needs. Such interactivity may be achieved with a step-by-step approach, resorting to technologies such as JavaScript, ASP, CGI and Java. The first two are currently in use in the LDS project, while the use of CGI and Java technologies is scheduled with the adoption of the SAS/IntrNet™ product. I should add that it is also possible to simulate a certain degree of interactivity on a static-page web site. This can be done by providing a number of reports organized by dimensions and grouped with the help of pull-down menus and combo boxes.
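As a hedged example of the "HTML output capabilities" route, the following sketch writes one static report page with ODS HTML (available from SAS V8 onwards); it reuses the hypothetical SL_METRICS table from the earlier sketch, and the file name is a placeholder rather than part of the real site.

/* Minimal static-HTML reporting sketch; output lands in the work directory. */
ods html body='availability_by_client.html';
title 'Daily Service Levels by client and product';
proc report data=sl_metrics nowd;
   column client product availability;
   define client       / group;
   define product      / group;
   define availability / analysis mean format=percent8.1 'Availability';
run;
ods html close;
title;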

The LDS Project: the facts

The LDS project is based on the functionality of ITSV®. The theoretical basis on which we started, briefly explained above, led to the architectural implementation of the project shown in Figure 1. It is probably worthwhile spending some time on the issues raised by the implementation on the various platforms.

OS/390 issues

As far as the mainframe is concerned, the OS/390 environment of Enel.it is concentrated on three nodes with 4 CECs of G5 technology, which add up to 5000 MIPS and RAID storage for a total of 19 TB. The three nodes run over 5 million transactions and a hundred thousand jobs per day, producing over 15 GB of SMF data in 100 daily downloads. The PDBs lie on the mainframe. SMF data accumulation and transfer to the single ITSV-equipped node is continuous; this process transfers 8 GB per day. On the receiving node, temporary data are collected in transient PDBs and then summarized into the global PDB. The transient PDBs account for 30 GB, while the global PDB is 35 GB.

At present, even forcing maximum parallelism, the daily phases require 7 hours, starting from midnight, to process 10 million records, including housekeeping tasks (backups, statistics, clean-ups). In the end, ITSV® proved able to manage very large amounts of data and to be flexible enough to incorporate user data. Providing daily consolidated data ready for reporting is a very ambitious goal and requires constant, close follow-up of the populating procedures (in practice a manual intervention is required about once per week, mostly due to space abends or FTP problems). I would like to stress that, in order to manage such a huge PDB effectively, it is very helpful to be familiar with the SAS language (which was a remarkable asset of the people working on the project).

The consolidated mainframe environment and the experience gained there represent the starting point for facing the challenges of distributed environments. While adopting tools and strategies typical of the distributed world (resorting to specialized software, remote data collection and preprocessing), by sending NT and Unix data to the mainframe we kept the following advantages: the presence of a consolidated ITSV® production environment, the availability of advanced management tools (for backups and scheduling) and, globally, the high reliability of the mainframe platform. At the end of the day, the mainframe is where the data must lie, so let it do the dirty work on the data.

Coverage: started high, reached 100%.
Scaling problem: huge amounts of data.
Recommendation: let the mainframe do as much as possible.

Figure 1 Architectural scheme of the LDS Project
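The daily consolidation step described above (transient PDBs summarized into the global PDB) can be sketched roughly as follows; the tables and variables are WORK placeholders and do not reflect the real ITSV PDB structures.

/* Assumed sketch of the daily consolidation: one day of detail data is     */
/* summarized and appended to a global table.                               */
data smf_today;                     /* stand-in for one day of reduced SMF  */
   input resource :$12. hour workload :$8. cpu_secs io_count;
   datalines;
ROMA-CEC1  8 BATCH 3600 120000
ROMA-CEC1  8 CICS  1800  45000
ROMA-CEC1  9 CICS  2100  51000
;
run;

proc summary data=smf_today nway;
   class resource hour workload;
   var cpu_secs io_count;
   output out=daily_sum(drop=_type_ _freq_) sum=;
run;

/* the first run creates the base table; later daily runs keep appending */
proc append base=pdb_global data=daily_sum;
run;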

NT issues

As far as the NT environment is concerned, Enel.it manages about 2000 NT servers on which BMC Patrol is installed. The architecture requires the distribution of a user-written Knowledge Module to every NT server; it collects, processes and transfers NT, Exchange and IIS metrics to the PDB on the mainframe. There, the NT data are processed, reduced and downloaded to the web site just like the OS/390 data. At present we manage 500 NT servers, including Exchange, IIS and departmental servers (the ones felt to be the most critical).

This view of the NT world might seem rather reassuring, but the notorious hardware and software reliability of NT servers normally implies many "reboots" of logically and geographically distributed resources. And since networks may prove to be even less reliable, it may be necessary to process data locally when that reduces the amount of data sent. Remember to do things and take decisions that are logically related to what is distributed, and let the mainframe do the rest.

Coverage: started minimal (a few servers), reached 25% (80% of critical servers).
Scaling problem: a lot of daily server management activity handed to operations.
Recommendation: let NT do as little as possible.
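A hypothetical sketch of the mainframe-side ingestion of one server's small daily file is shown below; the record layout is invented, since the real Knowledge Module output is not described here, and the transfer is simulated with a temporary file.

/* Sketch of reading the ~10 KB comma-delimited daily feed from one server. */
filename ntfeed temp;

data _null_;                        /* simulate one server's daily transfer */
   file ntfeed;
   put 'EXCH-SRV-042,20010528,22,0.42,1,312';
   put 'EXCH-SRV-042,20010528,23,0.55,1,298';
run;

data nt_daily;
   infile ntfeed dsd truncover;
   input server :$12. date :yymmdd8. hour cpu_util up_flag mail_delivered;
   format date date9.;
run;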

Unix issues

As far as the Unix environment is concerned, Enel.it manages more than 50 (big) Unix servers on which BMC Patrol is installed. The architecture requires the distribution of a user-written Knowledge Module to every Unix server; it collects and transfers Unix and Oracle metrics to the PDB on the mainframe. There, the Unix data are processed, reduced and downloaded to the web site just like the OS/390 data. At the moment we manage about 50 Unix servers, which include all the most significant applications on this platform (BEA, Oracle, CRM, CTI, DMS, ...).

In Enel.it such big Unix servers are (much like the mainframe) not geographically distributed and are generally much more reliable than NT. But we have 4 mainframes versus 50+ Unix servers, and only one (level of) OS/390 runs on the mainframe, while the wonderful Unix has so far come in different flavors such as Solaris, HP-UX, AIX and SCO, with the following result: the same thing must be analyzed, developed, written and tested four times, as we are usually interested in operating-system-dependent aspects. Furthermore, NT may not be very reliable, but it is certainly very friendly (you don't have Explorer on Unix ...), and when a reboot is needed, Unix is not like NT. For SAP we resorted to an external NT server data collector for simple reasons of opportunity, given that we lacked any theoretical ground for a different choice.

Coverage: started low (a few big servers), reached 90+% of CED servers (100% for Oracle and SAP).
Scaling problem: managing the different flavors of Unix.
Recommendation: reduce OS-dependent code to the minimum.
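The OS-dependent code in LDS lives in the Patrol Knowledge Modules rather than in SAS, but the recommendation can be illustrated in SAS terms: confine the platform-specific choices to one small macro, driven by the automatic SYSSCP/SYSSCPL variables, and keep everything downstream common. The directory names below are invented.

/* Illustration of the principle only, not the real LDS collectors.         */
%macro os_settings;
   %global rawdir;
   %if &sysscp = WIN %then %let rawdir = C:\lds\raw;
   %else %if %index(&sysscpl, AIX) or %index(&sysscpl, SunOS)
          or %index(&sysscpl, HP-UX) %then %let rawdir = /var/lds/raw;
   %else %let rawdir = ./raw;       /* fallback for other flavors           */
%mend os_settings;

%os_settings
%put NOTE: raw data directory resolved to &rawdir;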

Reporting issues

The production of the static pages that are downloaded to the web site is also included in the daily phases. The web site contains several hundred thousand pages, one tenth of which are updated daily (by now its size has reached 1+ GB). Some dynamic services are available on the web site, such as network security and identification support and the management of programmed service downtimes. The web site is installed on an NT server; a further NT workstation is required to act as the development platform and is equipped with AppDev Studio™.

Providing daily consolidated LDS data proved to be a critical success factor. Enabling users to quickly find answers to questions such as "how are things going after yesterday's upgrade?", "how much CPU resource can be charged to the new company?" or "what has the availability of the new application been over the last three months?", without having to plan ad hoc reports, has been a real breakthrough.

On the architectural side, adopting web technology for reporting offered a solution to well-known client problems while maintaining its intrinsic ease of access and use. Even though the "fat client" model has by now effectively been superseded, it is necessary to bring its peculiar interactivity over to the web environment, i.e. the capability to execute customized analyses and to offer original approaches to the very same data. With the continuous development of intranets and of the technologies based on them, the time has come to go beyond showcases and static-page web sites. The ITSV® macros were rather handy, when they worked, for managing the web site pages. The web site was designed as a portal on Service Levels, where all information (not only metrics) on Service Levels could be found.

Sound reporting schemes, coherent standards and a bit of flexibility are mandatory when delivering information to users who want to know how "things" are going, when the "things" that are going are many and different.

Coverage: all collected data are accessible through the web site.
Scaling problem: the number of pages that need daily refreshing is steadily increasing, as is the bulk of information to give access to.
Recommendation: plan to turn to dynamic sites with specialized sections.
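A sketch of how dimension-grouped static pages can be generated, one page per client value, is given below; the dataset, variables and file names are placeholders rather than the real LDS programs, but it illustrates why the number of pages to refresh grows with the dimensions.

/* Assumed sketch: regenerate one HTML page per client in the daily phase.  */
data sl_pages;
   input client :$12. product :$10. availability;
   datalines;
ENEL_DIST BILLING 0.997
ENEL_DIST CRM     0.989
ENEL_PROD SAP     0.995
;
run;

%macro client_pages;
   proc sql noprint;
      select distinct client into :clilist separated by ' '
        from sl_pages;
   quit;
   %let ncli = &sqlobs;

   %do i = 1 %to &ncli;
      %let cli = %scan(&clilist, &i);
      ods html body="client_&cli..html";      /* one static page per client */
      title "Service Levels - client &cli";
      proc print data=sl_pages noobs;
         where client = "&cli";
         format availability percent8.1;
      run;
      ods html close;
   %end;
   title;
%mend client_pages;

%client_pages

Adding one more dimension (product, geography, shift, ...) multiplies the loop, which is exactly the refresh-load problem that pushes towards a dynamic site.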

Concluding remarks

In general, every time we delivered some new data, further data were requested: we started with OS/390 and NT data became urgent; we delivered something about NT and suddenly UNIX and SAP were on the agenda. Incidentally, measuring 10 NT servers as a test case is not the same as measuring 100 or 1000 servers. The architecture proved to be robust and scalable enough to incorporate any new needs (at least for the moment), even though complexity always comes at a cost.

Basically, we found ourselves facing the very same problems (choosing, selecting, collecting and reporting data) in very different contexts (OS/390, NT, UNIX, centralized, distributed). The appropriate answers were never the same, nor were they always different. It has been a matter of choosing the right thing to do in the right place, where the things are the actions on data and the places are the various data sides and the reporting side. Much like in a motion picture, you finally end up assigning roles to the various kinds of servers and learning the weak points of every platform. Practical experience proved to be crucial in this process, together with the ability to manage every interesting resource, whether "physical" such as servers, storage and network subsystems, or "logical" such as instances and applications.

If a project stops anywhere below 50% coverage of the significant resources, then after the first enthusiastic reactions people will start not finding what they want, and the nice Service Level project will rapidly go awry.

The whole project was developed, tested, implemented and managed daily by two experienced system and analyst programmers who have been working exclusively on the project for two years.

The last recommendation is perhaps an obvious one, yet it is the one we will follow in the near future: if you have succeeded in implementing an effective way to manage and report Service Level data, extend it wherever you can, e.g. to the non-technical data contained in Service Contracts (workstation delivery time for Customer Service, Client Offer Request response time for Account Managers), and just keep in mind that Service Contracts need specialized report management. On the reporting side, the introduction of dynamic characteristics into the web reporting with CGI or Java tools appears to be the main way to realize the ideal model of LDS data utilization, even though such characteristics are not included in ITSV® at the moment.

I would like to thank Renzo Serena and Fiorenzo Cappellesso, who did the coding while I took care of the talking. Not surprisingly, they are the two people I referred to in the paper. Without them... etc. etc. All remaining errors are mine.