
Signal vs Noise: Identifying which fixes are important and impactful

Here are a few situations that illustrate what an engineer might find when opening their inbox on a typical Tuesday:

There is a post-mortem covering the last production incident, reported by the on-call engineer last night. Apparently, some users were facing issues during authentication with the following exception:

Last run failed with:
org.springframework.web.client.HttpServerErrorException: 500
    at <...>
    at org.springframework.web.client.RestTemplate.getForObject(RestTemplate.java:350)
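
For context, the snippet below is a minimal sketch of the kind of call that produces this exception, and of how catching it exposes the status code and response body. The class, method, and endpoint URL are hypothetical; only the RestTemplate call and the exception type come from the stack trace above.

import org.springframework.web.client.HttpServerErrorException;
import org.springframework.web.client.RestTemplate;

// Hypothetical client illustrating where HttpServerErrorException comes from.
public class AuthenticationClient {

    private final RestTemplate restTemplate = new RestTemplate();

    public String fetchToken(String userId) {
        // Hypothetical endpoint; any 5xx response makes getForObject() throw
        // HttpServerErrorException, as seen in the stack trace above.
        String url = "https://auth.example.com/token?user=" + userId;
        try {
            return restTemplate.getForObject(url, String.class);
        } catch (HttpServerErrorException e) {
            // Capturing the status code and response body here is what turns
            // a bare "500 at <...>" log line into something diagnosable.
            System.err.println("Auth call failed with " + e.getStatusCode()
                    + ": " + e.getResponseBodyAsString());
            throw e;
        }
    }
}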

A restart mitigated the impact, but the root cause and the risk involved are still unknown.

The monitoring system has sent out an alert about JavaScript errors affecting some users after the release last evening. A couple of users who contacted support seem to have been using Internet Explorer 9. It is still unclear whether the impact is localised to that specific browser.

The sales manager has finished the first call of the day and is reporting an issue with monthly reports, where the information presented is inconsistent with the operational views within the product. It is unclear whether this is a customer-specific issue or whether incorrect information is being presented across the board.

If you’re responsible for a web application, how do you decide which error to tackle, and in what order? How do you decide the priority of an issue, especially when many other items, such as feature requests and maintenance projects, are ongoing? Read on to find out.


Introduction

Every application has errors. The good news, though, is that not every one of them has to be fixed. At Plumbr, we collect data about user interactions and application performance for our customers. Here we present the results of several data analyses we conducted based on the monitoring data collected by Plumbr.

The conclusions drawn by our team were as follows:

1. No matter what your application is, it is bound to contain errors.

2. Not all errors that occur need to be fixed.

3. Fixing just a few select errors can bring major improvements.

Our methodology

We analyzed over 400 different applications monitored by Plumbr over a 6-month period between July 2018 and January 2019. Here is a description of the data set:

• 400 different applications monitored by our application monitoring products

• 19 billion API calls served

• 100 million failed calls over the period

• 4,500 distinct errors causing these failures.

Accepting that all systems will contain availability issues that will never get fixed is a step towards improvement. Consider the three example situations we illustrated in the previous section. For some errors, “Won’t fix!” is the only reasonable decision to make.

Engineering teams face predicaments about the right tradeoffs. Since engineering resources are limited, compromises must be made: there is always a long list of new features that business and product owners wish to deliver to improve the business.



Some statistics from the analysis

• Applications with no errors = 0

• Median number of errors per application = 13

• Minimum errors in an application = 1

• Maximum errors in an application = 1931

• Number of applications with >200 errors = 35

• Average number of errors per application = 72

Determining the errors with the biggest impact

For every call made, the HTTP response code determines the outcome: calls with a 4xx or 5xx series response code are classified as “failed”, while the rest are classified as successful. For every application, each error type was ranked by the number of failed calls it spawned. The 'relative impact' of each error was then calculated as the fraction of the failures it caused against the total number of failures, expressed as a percentage. The following distribution was derived based on the relative impact of the top three errors across all the applications (using averages):

• Error #1 accounts for 32% of total failures
• Error #2 accounts for 15% of total failures
• Error #3 accounts for 9% of total failures
• The remaining errors account for the other 44% of total failures
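
As an illustration, here is one way such a classification and ranking could be computed. The Call record and its fields are assumptions made for this sketch and are not part of Plumbr's data model.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the ranking described above: each monitored call carries an HTTP
// status code and, if it failed, an error identifier.
public class ErrorImpactRanking {

    record Call(int statusCode, String errorId) {
        boolean failed() {
            // 4xx and 5xx responses count as failed, everything else as a success.
            return statusCode >= 400;
        }
    }

    /** Maps each error id to its relative impact, as a percentage of all failures. */
    static Map<String, Double> relativeImpact(List<Call> calls) {
        List<Call> failures = calls.stream().filter(Call::failed).collect(Collectors.toList());
        if (failures.isEmpty()) {
            return Map.of();
        }
        double totalFailures = failures.size();
        return failures.stream()
                .collect(Collectors.groupingBy(Call::errorId, Collectors.counting()))
                .entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> 100.0 * e.getValue() / totalFailures));
    }

    public static void main(String[] args) {
        List<Call> calls = List.of(
                new Call(200, null),
                new Call(500, "HttpServerErrorException"),
                new Call(500, "HttpServerErrorException"),
                new Call(404, "ReportNotFound"),
                new Call(200, null));
        relativeImpact(calls).entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.printf("%s -> %.1f%% of failures%n",
                        e.getKey(), e.getValue()));
    }
}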

This may look like it proves the claim that a handful of errors dominate the impact. However, since averages are notoriously misleading when trying to make predictions for a specific situation, we decided to slice and dice the data further to verify this claim. For each application, we divided the data into one-week stretches. For each such bucket, the top errors were isolated and their impact calculated; a sketch of this weekly calculation appears after the lists below. This gave rise to the following observations:

• 50% of the time, the single most impactful error causes at least 25% of all failures

• 50% of the time, the three most impactful errors cause at least 54% of all failures

Here are a couple of additional insights:

• In 1 out of every 4 weeks, the total impact of the top-three errors is above 80%

• In 1 out of every 10 weeks, the total impact of the top-three errors falls below 20%
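
The weekly slicing behind these figures could be computed roughly as follows; the Failure record, its fields, and the bucketing by calendar week are assumptions made for this sketch rather than a description of Plumbr's actual pipeline.

import java.time.LocalDate;
import java.time.temporal.WeekFields;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the weekly analysis: for every one-week bucket, compute the share
// of failures caused by that week's single most impactful error, then take the
// median of that share across weeks.
public class WeeklyTopErrorShare {

    record Failure(LocalDate date, String errorId) {}

    /** Share (0..1) of each week's failures caused by that week's top error. */
    static List<Double> topErrorShareByWeek(List<Failure> failures) {
        WeekFields wf = WeekFields.of(Locale.ROOT);
        // Bucket failures into (year, week-of-year) stretches.
        Map<String, List<Failure>> byWeek = failures.stream()
                .collect(Collectors.groupingBy(
                        f -> f.date().getYear() + "-W" + f.date().get(wf.weekOfWeekBasedYear())));
        return byWeek.values().stream()
                .map(week -> {
                    long total = week.size();
                    long top = week.stream()
                            .collect(Collectors.groupingBy(Failure::errorId, Collectors.counting()))
                            .values().stream()
                            .max(Comparator.naturalOrder())
                            .orElse(0L);
                    return (double) top / total;
                })
                .collect(Collectors.toList());
    }

    static double median(List<Double> values) {
        List<Double> sorted = values.stream().sorted().collect(Collectors.toList());
        int n = sorted.size();
        return n % 2 == 1 ? sorted.get(n / 2) : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        List<Failure> failures = List.of(
                new Failure(LocalDate.of(2018, 7, 2), "E1"),
                new Failure(LocalDate.of(2018, 7, 3), "E1"),
                new Failure(LocalDate.of(2018, 7, 4), "E2"),
                new Failure(LocalDate.of(2018, 7, 10), "E3"),
                new Failure(LocalDate.of(2018, 7, 11), "E3"));
        List<Double> shares = topErrorShareByWeek(failures);
        System.out.printf("Median top-error share across weeks: %.0f%%%n", 100 * median(shares));
    }
}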


What fraction of total impact do the top 3 errors represent at any given time?

• Under 20% of total impact: 10.5% of weeks
• 20% - 50%: 36.7% of weeks
• 50% - 80%: 27.0% of weeks
• Over 80%: 25.8% of weeks

[Histogram: number of applications by the number of errors they contain; x-axis: number of errors (log scale), y-axis: number of applications]


Understanding the impact of the errors in your applications is an important step towards knowing what to resolve. This requires a good monitoring solution. When you have meaningful monitoring in place, you can fit the information from it into your current workflows in many ways.

Circling back to the three situations highlighted in the first section, you will realize that there will always be a short list of errors that affect a large share of users, and a long list that impacts only niche user segments.

Monitoring: The key to closing the feedback loop

Plumbr was built to help engineering teams with all this and more. It is divided into two distinct parts: Real user monitoring and Application monitoring. Each of these parts helps build a cohesive picture of what is breaking in an application and how users are affected. With this knowledge, many processes, such as sprint planning, incident management, troubleshooting, and reporting, can be improved within engineering teams.

You can sign up for a trial at plumbr.io. You can also contact the team for a demo at [email protected].
