Identifying the Root Cause of Failures in IT Changes: Novel Strategies and Trade-offs
Ricardo L. dos Santos, Juliano A. Wickboldt, Bruno L. Dalmazo, Lisandro Z. Granville and Luciano P. Gaspary
Federal University of Rio Grande do Sul, Brazil
Roben C. Lunardi
Federal Institute of Rio Grande do Sul, Brazil
Outline
• Introduction
• Proposed Solution
• Diagnosis Process
• Conceptual Architecture
• Root Cause Analyzer
• Strategies for Selecting Questions
• Case Study
• Final Considerations
• Future Work
Introduction
• Context
• The complexity of IT infrastructures makes IT processes mission-critical
• ITIL (Information Technology Infrastructure Library) has become the most widely accepted approach to IT process management worldwide
• IT Change Management
• Defines how the IT infrastructure must evolve in a consistent and safe way
• Defines how changes should be conducted
• IT Problem Management
• Defines the lifecycle of IT problems
• The primary goals are
• To eliminate recurrent incidents
• To prevent the occurrence of IT problems
• To minimize the impact of problems which cannot be prevented
• To achieve these goals, identifying the root cause of failures and reusing the operator’s knowledge are fundamental
• To simplify the procedures
• To minimize financial losses
• To reduce maintenance costs
• Current Scenario
• Changes and failures have been explored in several studies
• However, these studies have some limitations, such as
• Previous data are often not considered
• The root cause of failures is not identified
• The solutions are specific to detecting software failures
• Our Goals
• Propose strategies that support the root cause identification process while keeping the approach interactive
• Each strategy must select a question, and the strategies must explore different criteria
• Compare the diagnostics generated by each strategy
Proposed Solution Diagnosis Process – Our Approach
[Figure: interactive diagnosis. Labels: Problem Report, Question Selection, Answered Question, Root Cause.]
Proposed Solution Conceptual Architecture
[Figure: conceptual architecture. Labels: Operator, Help Desk, Change Management System, Change Planner, Change Designer, Deployment System, Config. Mgmt. Database, Diagnosis System, Root Cause Analyzer, Diagnosis Log Recorder, RFC, PR, RC.]
Root Cause Analyzer
[Figure: Root Cause Analyzer. Labels: Input Processor, Question Selector, Question Verifier, PR, RFC, CI, RC, Log.]
Proposed Solution Strategies for Selecting Questions
• All strategies use the same inputs and return a single question as the result
• 4 different proposed strategies
• Strategy 1 – Only completed diagnostics
• Strategy 2 – All diagnostics
• Strategy 3 – Age of diagnostics
• Strategy 4 – Questions’ popularity
• Strategy 1 – Only completed diagnostics
• Only completed diagnostics are considered
• The calculated weights suffer no penalty
• The element weight is the sum of the completed diagnostics in which the RC was correctly identified

Root Causes   Questions   Answers   Completed Diagnostics
RC1           Q1, Q2      A1, A3    20
RC2           Q1, Q3      A2, A5    30

Q1 (associated with RC1 and RC2): 20 + 30 = 50
RC2: 30
RC1: 20
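The Strategy 1 arithmetic can be sketched in a few lines of Python. This is only an illustration: the `HISTORY` layout and the function names are hypothetical, and extending the sum to every question (not only Q1) is an assumption based on the rule stated in the bullets.

```python
# Hypothetical history copied from the example table:
# root cause -> (associated questions, completed diagnostics)
HISTORY = {
    "RC1": (["Q1", "Q2"], 20),
    "RC2": (["Q1", "Q3"], 30),
}


def strategy1_weights(history):
    """Sum the completed diagnostics of every RC a question is associated with."""
    weights = {}
    for questions, completed in history.values():
        for question in questions:
            weights[question] = weights.get(question, 0) + completed
    return weights


def select_question(weights):
    """Return the question with the greatest weight."""
    return max(weights, key=weights.get)


weights = strategy1_weights(HISTORY)
print(weights)                   # {'Q1': 50, 'Q2': 20, 'Q3': 30}
print(select_question(weights))  # Q1
```

Q1 wins because it is shared by both root causes, which matches the 20 + 30 = 50 shown in the example.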
• Strategy 2 – All diagnostics
• Completed and frustrated diagnostics are considered
• The element weight is the sum of the completed diagnostics minus the sum of the frustrated diagnostics
• A diagnostic is frustrated when the system uses at least one question associated with an RC, but a different RC is identified at the end of the process

                                    Diagnostics
Root Causes   Questions   Answers   Completed   Frustrated
RC1           Q1, Q2      A1, A3    20          10
RC2           Q1, Q3      A2, A5    30          15

Q1 (associated with RC1 and RC2): (20 + 30) – (10 + 15) = 25
RC2: 30 – 15 = 15
RC1: 20 – 10 = 10
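A minimal sketch of the Strategy 2 calculation follows, assuming the same hypothetical per-RC history layout as a Python dictionary. The field names and the per-question aggregation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical history copied from the example table.
HISTORY = {
    "RC1": {"questions": ["Q1", "Q2"], "completed": 20, "frustrated": 10},
    "RC2": {"questions": ["Q1", "Q3"], "completed": 30, "frustrated": 15},
}


def strategy2_weights(history):
    """Weight = completed minus frustrated diagnostics, summed per question."""
    weights = {}
    for record in history.values():
        net = record["completed"] - record["frustrated"]
        for question in record["questions"]:
            weights[question] = weights.get(question, 0) + net
    return weights


weights = strategy2_weights(HISTORY)
print(weights)  # {'Q1': 25, 'Q2': 10, 'Q3': 15}
```

Because frustrated diagnostics are subtracted, a weight can become negative when an element misleads the diagnosis more often than it helps.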
• Strategy 3 – Age of diagnostics
• Considers completed and frustrated diagnostics
• The element weights are penalized according to the age of the diagnostics

Age    Diagnostics Time            Penalty
1st    Up to 120 days              Not applicable
2nd    From 121 days to 150 days   10%
3rd    From 151 days to 180 days   20%
4th    From 181 days to 210 days   30%
5th    From 211 days to 240 days   40%
6th    From 241 days to 270 days   50%
7th    From 271 days to 300 days   60%
8th    From 301 days to 330 days   70%
9th    From 331 days to 360 days   80%
10th   More than 360 days          90%
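The age table can be read as a lookup from a diagnostic's age in days to the fraction of its weight that is kept. The sketch below encodes that reading. The function names are hypothetical, and treating the 10th group as starting at day 361 is an assumption, since the 9th group already ends at day 360.

```python
def age_group(days):
    """Map a diagnostic's age in days to its age group (1 to 10)."""
    if days < 0:
        raise ValueError("age in days must be non-negative")
    if days <= 120:
        return 1
    if days > 360:
        return 10
    # Groups 2 to 9 are 30-day bands starting at day 121.
    return 2 + (days - 121) // 30


def weight_fraction(days):
    """Fraction of the weight kept: 100% minus the group's penalty."""
    return (11 - age_group(days)) / 10


print(age_group(150), weight_fraction(150))  # 2 0.9
print(age_group(400), weight_fraction(400))  # 10 0.1
```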
$$\mathrm{elementWeight}(x) = \sum_{i=1}^{10} \beta_i \, (\alpha_i - \omega_i)$$
i – age of diagnostics
βi – percentage of weight to be used
αi – the number of completed diagnostics in an age group
ωi – the number of frustrated diagnostics in an age group
                                    Completed Diagnostics   Frustrated Diagnostics
Root Causes   Questions   Answers   1st age    10th age     1st age    10th age
RC1           Q1, Q2      A1, A3    1          24           4          8
RC2           Q1, Q3      A2, A5    4          15           1          2

RC2: 100% (4 - 1) + 10% (15 - 2) = 4.3
RC1: 100% (1 - 4) + 10% (24 - 8) = 1.6 (the reported value equals the 10th-age term alone; summed as written, the expression gives -1.4)
Q1 (associated with RC1 and RC2): 4.3 + 1.6 = 5.9
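The sketch below evaluates the Strategy 3 sum for the example table, with β taken from the penalty table (100% for the 1st age, 10% for the 10th). The data layout and names are hypothetical. The `clip_negative` flag is an assumption introduced only to reproduce the reported 1.6 for RC1: without it the literal sum is -1.4, and the slides do not say how negative age-group terms are handled.

```python
# Percentage of weight kept per age group: 100%, 90%, ..., 10%.
BETA = {age: (11 - age) / 10 for age in range(1, 11)}

# Hypothetical history copied from the example table (only ages 1 and 10 are given).
HISTORY = {
    "RC1": {"completed": {1: 1, 10: 24}, "frustrated": {1: 4, 10: 8}},
    "RC2": {"completed": {1: 4, 10: 15}, "frustrated": {1: 1, 10: 2}},
}


def rc_weight(record, clip_negative=False):
    """Sum beta_i * (alpha_i - omega_i) over the ten age groups."""
    total = 0.0
    for age, beta in BETA.items():
        alpha = record["completed"].get(age, 0)
        omega = record["frustrated"].get(age, 0)
        term = beta * (alpha - omega)
        if clip_negative:
            # Assumption: a negative age-group term contributes nothing.
            term = max(term, 0.0)
        total += term
    return total


rc1_literal = round(rc_weight(HISTORY["RC1"]), 2)
rc1_clipped = round(rc_weight(HISTORY["RC1"], clip_negative=True), 2)
rc2 = round(rc_weight(HISTORY["RC2"]), 2)
q1 = round(rc1_clipped + rc2, 2)

print(rc1_literal, rc1_clipped, rc2, q1)  # -1.4 1.6 4.3 5.9
```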
• Strategy 4 – Questions’ popularity
• The weights of RCs and categories are calculated as in Strategy 2
• A question’s weight considers the weight of its associated RCs and the question’s popularity
• A question’s popularity is the ratio between the number of occurrences of the question and the number of diagnostic sets selected
$$\mathrm{questionWeight}(x) = \frac{\dfrac{\alpha_x}{n} + \sum_{i=1}^{n} \beta_{RC_i} \, \alpha_{RC_i,x}}{2}$$
αx – number of occurrences of the question x in the diagnostic sets
n – number of diagnostic sets
βRCi – probability of identifying an RC
αRCi, x – number of occurrences of question x in the diagnostic set of an RC
                                    Completed Diagnostics   Frustrated Diagnostics
Root Causes   Questions   Answers   1st age    10th age     1st age    10th age
RC1           Q1, Q2      A1, A3    1          24           4          8
RC2           Q1, Q3      A2, A5    4          15           1          2

Strategy 2 weights: RC1 = (1 + 24) - (4 + 8) = 13, RC2 = (4 + 15) - (1 + 2) = 16, total 29
Q1: (2/2 + ((13/29 * 1) + (16/29 * 1))) /2 = 1
Q2: (1/2 + ((13/29 * 1) + (16/29 * 0))) /2 = 0.4741
Q3: (1/2 + ((13/29 * 0) + (16/29 * 1))) /2 = 0.5259
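The three results above can be reproduced with a short Python sketch. The data structures and the function name are hypothetical, and the probability of identifying an RC is assumed to be its Strategy 2 weight divided by the total (13/29 and 16/29), as the worked numbers suggest.

```python
# Hypothetical diagnostic sets: root cause -> questions in its set.
RC_QUESTIONS = {"RC1": ["Q1", "Q2"], "RC2": ["Q1", "Q3"]}
# Strategy 2 weights from the example: (1 + 24) - (4 + 8) and (4 + 15) - (1 + 2).
RC_WEIGHTS = {"RC1": 13, "RC2": 16}


def question_weight(question, rc_questions, rc_weights):
    """Average of the question's popularity and its RC-weighted occurrences."""
    n = len(rc_questions)
    total_weight = sum(rc_weights.values())
    occurrences = sum(qs.count(question) for qs in rc_questions.values())
    popularity = occurrences / n
    rc_term = sum(
        (rc_weights[rc] / total_weight) * qs.count(question)
        for rc, qs in rc_questions.items()
    )
    return (popularity + rc_term) / 2


results = {
    q: round(question_weight(q, RC_QUESTIONS, RC_WEIGHTS), 4)
    for q in ("Q1", "Q2", "Q3")
}
print(results)  # {'Q1': 1.0, 'Q2': 0.4741, 'Q3': 0.5259}
```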
Case Study
• In this case study some constraints were defined
• There are no changes during the executions
• The operator always provides the same answer
• One company provides some services on the Web
• The infrastructure consists of a DB Server and a Web Server
• In order to meet growing demand, 2 new servers will be installed
• Hosting Server – Will be used to host the clients’ websites
• Mail Server – Will be used to host the email services
• The CP below aims to install 2 new servers and to migrate existing services
[Figure: the CP.]
• IT infrastructure state in the company
[Figure: IT infrastructure state in the company. Annotation: A failure occurs.]
                            Calculated Weights
Categories        Level     Strat. 1   Strat. 2   Strat. 3   Strat. 4
Service           1         1083       242        157.30     242
Web Page Server   2         558        82         33.20      82
DataBase          2         519        195        127.60     195
Network           1         1058       345        188.10     345
Services          2         512        189        113.40     189
Devices           2         485        136        66.20      136
System            1         603        167        54.30      167
Computer System   2         545        153        52.90      153
Hosting Server    3         319        175        49.90      175
DB Server         3         192        -22        3.00       -22
Software          1         1115       343        126.60     343
Web Server        2         607        138        86.80      138
DB Server         2         443        169        36.20      169
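The deck states that the Question Selector picks the category with the greatest weight. As a rough illustration, the sketch below applies that rule to the level-1 rows of the table for each strategy. The dictionary layout and function name are hypothetical, and restricting the comparison to level 1 is an assumption about where the selection starts.

```python
# Level-1 categories and their weights under Strategies 1 to 4 (from the table).
LEVEL1_WEIGHTS = {
    "Service": (1083, 242, 157.30, 242),
    "Network": (1058, 345, 188.10, 345),
    "System": (603, 167, 54.30, 167),
    "Software": (1115, 343, 126.60, 343),
}


def best_category(weights_by_category, strategy):
    """Return the category with the greatest weight for strategy 1 to 4."""
    return max(
        weights_by_category,
        key=lambda category: weights_by_category[category][strategy - 1],
    )


choices = {s: best_category(LEVEL1_WEIGHTS, s) for s in (1, 2, 3, 4)}
print(choices)  # {1: 'Software', 2: 'Network', 3: 'Network', 4: 'Network'}
```

Under this reading, Strategy 1 would start from Software while the other three would start from Network, which is one way the strategies can produce different diagnostic workflows from the same data.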
• Diagnostic workflows generated
[Figure: diagnostic workflows generated by the strategies. Root cause shown: “The PHP configuration does not allow the use of the language in users’ websites”.]
Final Considerations
• The proposed solution identifies the root cause of failures and has the following features
• Reuse of the operator’s knowledge
• Interactivity between the solution and the operator
• Flexibility of the generated diagnostics
• Compatibility with the standards used by companies
• The modular structure of the solution allows organizations to adapt the system to their specific needs
• The proposed strategies generate different diagnostic workflows for the same infrastructure and failure
• Based on the results obtained, we recommend the following to IT operators
• Strategy 1 – histories with a small number of records
• Strategy 2 – large and recent histories
• Strategy 3 – histories that span at least 10 months
• Strategy 4 – data sets with a large number of popular questions
Future Work
• Explore new criteria for the selection of questions
• Confidence
• False positive and false negative rates
• Extend the process to identify root causes for other scopes
• Investigate the use of CIM classes (actions and checks) in order to improve the system bootstrapping
• Automate root cause identification of certain kinds of failures
Thank you for your attention!
Questions?
Proposed Solution Root Cause Analyzer
• Input Processor
• Identification based on categories
• Identification based on PR
• Identification based on RCs
• Question Selector
• Calculates the weights according to the strategy
• Selects the Category that has the greatest weight
• Selects the Question that has the greatest weight/level
• Question Verifier
• Obvious? Threshold: 80% with the same answer
[Figure: Root Cause Analyzer. Other labels: CI, RC, Log.]
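The Question Verifier's "Obvious?" check can be read as a test on the answers previously logged for a question. The sketch below is one possible interpretation: the function name is hypothetical, treating the threshold as inclusive (at least 80%) is an assumption, and what the system does with an obvious question is not stated in the slides.

```python
from collections import Counter


def obvious_answer(logged_answers, threshold=0.8):
    """Return the dominant answer if its share reaches the threshold, else None."""
    if not logged_answers:
        return None
    answer, count = Counter(logged_answers).most_common(1)[0]
    if count / len(logged_answers) >= threshold:
        return answer
    return None


print(obvious_answer(["A1"] * 8 + ["A2"] * 2))  # A1
print(obvious_answer(["A1"] * 7 + ["A2"] * 3))  # None
```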
Case Study
• Identified CIs and associated categories

CI                   Categories
Hosted Sites         Service, Web Page Server
DataBase Access      Service, DataBase
Web Page Access      Service, Web Page Server
PHP Interpreter      Service, Web Page Server
CMS Service          Service, Web Page Server
Logical Connection   Network, Services
Joomla               Software, Web Server
PHP                  Software, Web Server
Apache               Software, Web Server
MySQL                Software, Web Server
DB Server            System, Computer System, DB Server
Hosting Server       System, Computer System, Hosting Server
Switch               Network, Devices
Proposed Solution Information Model
[Figure: class diagram. Classes: ManagedElement, ExchangeElement, SolutionElement, SolutionCategory, Category, ServiceProblem, ServiceIncident, Problem, Question, Answer, RootCause. Associations: determinesProblem, possibleAnswers, determinesOthersQuestions, CategoryParentChild, QuestionCategory.]
[Figure: class diagram. Classes: Logical Element, EnabledLogicalElement, MessageLog, RecordLog, Problem, Question, Answer, RootCause. Associations: determinesProblem, possiblesAnswers, determinesOthersQuestions, recordedProblem, recordedQuestions, recordedAnswers.]