
Article Title Page

Benchmarking of Marine Bunker Fuel Suppliers: The Good, The Bad, The Ugly

Author Details

Author 1 Name: Ole Jørgen Anfindsen
University/Institution: DNV Research & Innovation
Town/City: Høvik
Country: Norway

Author 2 Name: Grunde Løvoll
University/Institution: DNV Research & Innovation
Town/City: Høvik
Country: Norway

Author 3 Name: Thomas Mestl
University/Institution: DNV Research & Innovation
Town/City: Høvik
Country: Norway

Corresponding author: Ole Jørgen Anfindsen
Corresponding Author’s Email: [email protected]

Acknowledgments (if applicable): n/a

Biographical Details (if applicable):

Ole Anfindsen holds a dr. scient. degree (PhD) in computer science and a bachelor's degree in electronics engineering. For more than 25 years he has worked with databases and related technologies. He has been senior research scientist in Telenor R&D, visiting researcher at GTE Laboratories (Massachusetts) and Sun Microsystems Laboratories (California), as well as adjunct associate professor at the Institute of Informatics at the University of Oslo. He currently works as a researcher in the Research & Innovation department of DNV, where his main activity is directed towards data analysis, especially in the maritime area.

Grunde Løvoll has a dr. scient. degree (PhD) in physics. He has worked for 6 years as a postdoc and researcher at the Department of Physics at the University of Oslo, doing experimental studies on multiphase flow in porous materials, water diffusion in dry clay, and optical tweezers. Dr. Løvoll currently works as a researcher in DNV Research & Innovation, where his main focus is on data analysis in the maritime area.

Thomas Mestl has a dr. scient. degree (PhD) in mathematics and a degree in precision engineering. He has worked in DNV's Research Department for the last 13 years within the field of information technology. A large part of his work has been on identifying emerging technology trends, evaluating new ICT technologies (especially with respect to mobile work and information management), and identifying promising business opportunities offered by new technologies or combinations of existing ones. Currently, his main activity is directed towards data analysis, especially in the maritime area.

Structured Abstract: Purpose - This paper has two main focus areas: the construction of a realistic best practice benchmark, and the development of a methodology for comparison of individual suppliers of marine bunker fuel. As is well known in this trade, unfair business behaviors in the bunker fuel market are not uncommon, resulting in financial losses for the buyers.

Design/methodology/approach - Establishing a best practice will naturally involve some degree of subjectivity as there is no a priori correct answer to this problem. Using the concept of membership functions from fuzzy set theory, a score can be derived from a best practice benchmark histogram. The main advantages of this method are its relative independence of both sample size and the underlying distribution, and its computational efficiency.

Findings - Our methodology turns out to be more powerful than standard descriptive statistics, as it is less sensitive to outliers and is well suited for small datasets and even single numbers. When applied to data for all suppliers worldwide it turns out that the number of good suppliers is actually much lower than might be expected.

Practical implications - Bunker fuel is a major expense for ship owners, and can easily reach $30 million/year for a single container ship. There is therefore a considerable interest in the market for benchmarking of individual fuel suppliers. Our methodology is also applicable to other quality related fuel parameters.

Originality/value - To the best of our knowledge this is the first attempt to benchmark actors in the marine bunker fuel industry and to quantify their behaviors.

Keywords: benchmarking, membership functions, scoring, fuzzy clustering, supplier quality, best practice


Article Classification: Technical paper


Benchmarking of Marine Bunker Fuel Suppliers:

The Good, The Bad, The Ugly


1. Introduction

The density of marine bunker fuel can be regarded as one of its most basic parameters. It is used for fuel quantity estimation and for calculating the specific energy content of the fuel. It is also the basis for the so-called Calculated Carbon Aromaticity Index (CCAI), an important factor for ignition and for deposits in the engine. Density is also an important factor in the process of separating water or solids from bunker fuel.

For the typical ship operator the primary importance of density comes from the fact that bunker fuel is delivered by volume but paid for per ton. The conversion is done by means of the fuel density reported by the supplier. A small difference between stated and actual fuel density can quickly lead to large financial losses for the ship operator. For instance, if a density of 977 kg/m3 is stated when the actual value happens to be 960 kg/m3, this gives rise to a difference of about 34 tons when bunkering 2000 m3, the value of which, in the current market, is close to US$ 20,000 for a single bunkering.
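The arithmetic behind this example can be checked with a minimal sketch. The function name `short_lift_tons` is illustrative and not part of the original method; the only inputs are the volume and the two densities quoted in the text.

```python
def short_lift_tons(volume_m3, stated_density, actual_density):
    """Mass in metric tons that is billed but not delivered in one bunkering.

    Billed mass uses the density stated by the supplier; delivered mass
    uses the density actually measured. Densities are in kg/m3.
    """
    return volume_m3 * (stated_density - actual_density) / 1000.0


# The example from the text: 2000 m3, stated 977 kg/m3, measured 960 kg/m3.
overbilled = short_lift_tons(2000, 977, 960)
print(overbilled)  # 34.0
```

A negative result would correspond to long-lifting, i.e. the supplier delivering more mass than was billed.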

Although this example belongs in the high end of the spectrum, it is not at all hard to find even more extreme examples in real life. Many fuel suppliers exploit this opportunity for easy profit, since their stated density is usually used to calculate the quantity of the delivered fuel. Over-reporting of density, i.e. claiming that the fuel density is higher than what is actually the case, is called short-lifting, while the opposite could be termed long-lifting. Short-lifting implies that the ship operator loses money, since he pays for more fuel than he receives. Long-lifting implies that the fuel supplier loses money, and that the ship operator gets more than what he pays for.

The global market for marine bunker fuel is more than 300 million tons annually (IEA 2010, p. 618; Eyring et al 2010; IMO 2009; EPA 2008). We estimate that more than 300,000 tons of bunker fuel, i.e. about 1‰ of the global consumption, is short-lifted every year. We further estimate that the amount of long-lifting exceeds 150,000 tons. That is, on the order of half a million tons are long- or short-lifted annually. Thus, bunker fuel worth more than US$200 million appears not to be properly accounted for every year.

Both short- and long-lifting may be indications of fraudulent behavior by individual employees within the ship operator’s or bunker fuel supplier’s organization. Such behavior is, however, sufficiently widespread that a systematic and commonly accepted short-lifting practice in parts of the bunker fuel trade may be suspected. Some fuel suppliers use this tactic to consistently overstate the delivered amount and thereby improve the company’s profit margin. Many ship operators and suppliers would welcome a benchmarking of suppliers, ports, or geo-regions against some best practice.

The rest of the paper is organized as follows: In Section 2 we take a closer look at concrete examples of different density reporting strategies and discuss the difficulties associated with single number characteristics. In Section 3 we use this to characterize good suppliers and derive criteria for defining a best practice. In Section 4, a Best Practice Classifier is constructed that will assign a Best Practice Score to an individual bunkering or a supplier. We also present a series of benchmarking comparisons between regions together with an overview of how they developed over a 10-year period. This paper ends with a discussion and some promising leads for further work.

2. Investigating density reporting behavior

Table 1 gives some statistics for density deviations on a global and local basis (e.g. Canada and the US West Coast, South Asia, Middle East, and South America West) and for 4 selected suppliers (S1, S2, S3, S4) in 4 different bunker ports. The density difference, dd, is the difference between the density claimed by the supplier and the actual density measured by a fuel testing agency (e.g. DNVPS). The average density difference, $\overline{dd}$, could in principle be used to characterize the behavior of a fuel supplier (a port or a region) as good, medium or bad.

Unfortunately, most such single number quality measures have shortcomings, as they compress a wealth of information into a single number. They often wipe out (quite effectively) much of the information about the interesting behavior of a supplier. In addition, the arithmetic mean or median may be less suited for distributions that are non-normal, skewed or heavy-tailed. Also, the mean and standard deviation are very sensitive to outliers (a few unusually large or small observations) (Bhattacharyya & Johnson 1977). As an example, the mean value of ten bad bunkerings could easily be balanced by one exceptionally good one (or a typing error), while the median is less sensitive to outliers. Another problem with the mean and median is that they reveal nothing about the shape of the underlying distribution. For instance, if we only look at the mean, the geo-region South America West seems to be better than e.g. Canada & US West Coast from a short-lifting perspective, see Table 1. If we take the standard deviation into account, it is obvious that there is a higher risk of being short-lifted in South America West than in the other geo-regions, simply because the distribution is wider. The standard deviation, however, only refers to the width of the underlying distribution and not to its actual shape. As can be seen in Figure 2 the distributions are non-normal, i.e. a highly skewed middle spike combined with a very long one-sided tail.
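The sensitivity of the mean to a single outlier can be illustrated with a small sketch. The numbers below are made up for illustration and are not taken from the DNVPS data: ten bunkerings are short-lifted by 2 kg/m3 each, and one extreme sample (or typing error) is added.

```python
from statistics import mean, median

# Hypothetical density differences in kg/m3: ten short-lifted bunkerings.
bad = [2.0] * 10

# The same supplier with one extreme long-lifted sample (or a typing error).
with_outlier = bad + [-20.0]

print(mean(bad), median(bad))                    # 2.0 2.0
print(mean(with_outlier), median(with_outlier))  # 0.0 2.0
```

In this constructed case the mean suggests perfectly fair reporting, while the median still reflects the systematic offset. Neither number says anything about the shape of the distribution.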


Table 1: Standard descriptive measures of density differences for some selected geo-regions and suppliers (n = number of samples, $\overline{dd}$ = mean density difference, σdd = standard deviation of dd). Histograms for the geo-regions and suppliers are shown in Figures 1 and 2 respectively, whereas their scatter plots are shown in Figures 3 and 4. Data in this table and in the following examples is, unless otherwise stated, based on DNVPS bunkering samples of RMG380 fuel collected in 2008 (confer DNV 2010).

| | n | $\overline{dd}$ (kg/m3) | median(dd) (kg/m3) | σdd |
|---|---|---|---|---|
| Global | 43343 | 0.39 | 0.10 | 3.92 |
| Canada & US West Coast | 1919 | 0.03 | -0.10 | 2.43 |
| South Asia | 6806 | 1.22 | 0.90 | 3.35 |
| Middle East | 2990 | 1.83 | 0.70 | 4.76 |
| South America West | 565 | -0.48 | -0.90 | 6.00 |
| Supplier 1 (S1) | 129 | -0.12 | -0.10 | 0.95 |
| Supplier 2 (S2) | 239 | 2.31 | 0.90 | 4.84 |
| Supplier 3 (S3) | 71 | 2.40 | 2.60 | 1.83 |
| Supplier 4 (S4) | 145 | 2.07 | 1.50 | 2.81 |

Histograms

For a more detailed understanding of the properties of the data in Table 1 please refer to the density difference histograms of Figures 1 and 2. For comparison we have plotted a smoothed version of the global histogram (dashed line) and a smoothed version of the actual histogram (solid line). These histograms represent estimates for the underlying probability density distribution and can thus tell us something about the risk and possible amount of the short-lifting. A comparison with a reference histogram, like the global histogram, would provide the desired benchmark.

From Figure 1 it can be seen that none of the histograms seem to come from a normal distribution (the implications of this observation will not be further discussed in this paper). This can be confirmed by means of a probability plot. The different geo-regions also show significant differences in their density reporting practice. Canada & US West Coast appears better than the global average: the peak of its histogram is centered at 0 and its tails are shorter. For South Asia, the width of the histogram is similar to the global one, but its center is shifted towards short-lifting, whereas the Middle East shows a fairly heavy short-lifting tail. The histogram for South America West is especially remarkable, as the chance of actually getting the fuel density stated by the supplier appears to be slim. The rule is rather that the buyer is either short- or long-lifted, something which could not be deduced from the standard descriptive statistics.

Figure 1: Probability distribution of density reporting deviations (i.e. the difference between claimed and measured density) for 4 selected geo-regions. The histograms are (clockwise from top left): Canada & US West Coast, South Asia, Middle East, and South America West Coast. The solid lines represent the smoothed histogram while the dashed lines are the smoothed global histogram. The underlying number of samples, averages, medians, and standard deviations are given in Table 1. The histograms reveal considerable variation in density reporting.

Histograms for individual suppliers listed in Table 1 are shown in Figure 2 below. A visual comparison indicates that Supplier 1 is much better than the global average with a narrow symmetric distribution centered at 0. The three other suppliers are all heavily short-lifting with varying degrees of right-shifted and/or right-heavy distributions. Based on these histograms the suppliers might be characterized as rather bad, but any fine grained information about their underlying reporting strategy is removed by the histogram. A main disadvantage of using histograms for characterizing suppliers is that they require a considerable amount of data which could be a challenge when considering short time periods or suppliers with few data samples.


Figure 2: Probability distribution of density reporting deviations (i.e. the difference between claimed and measured density) for 4 selected suppliers in 4 different bunker ports (for more details see Table 1). The histograms reveal different reporting behavior, but histograms become noisy when the number of samples becomes too low.

Scatter plots

Scatter plots of measured vs. claimed density allow a much more fine-grained view of the underlying data. These plots may be used to unravel the various reporting strategies of the suppliers, see Figure 3 and Figure 4. Scatter plots quite effectively visualize the density reporting behavior of suppliers or groups of suppliers. Note that each dot in a scatter plot represents at least one bunkering sample. The diagonal solid line represents correct density reporting (i.e. stated = measured, in the following called the no-cheat line). The horizontal and vertical dashed lines specify the upper density limit given by the ISO8217 standard.

These scatter plots reveal some interesting patterns. Note that the range of densities of the available fuel varies between geo-regions; e.g. the fuel density range is much wider in the Middle East than in North America or South Asia. This phenomenon may be traced back to the proximity to crude oil production in the regions.

Observe also that in many bunkerings the fuel density was above the limit (dots to the right of vertical dashed line) but almost none of them were reported to lie above the limit (above horizontal dashed line). This is true for all suppliers.

From Figure 4 we may deduce that Supplier 1 could be considered as rather good, since most of his samples are on or close to the no-cheat line. This behavior seems to be dominant for most of the suppliers in the Canada & US West geo-region (note: good suppliers are found in all geo-regions). In contrast, Supplier 2 may be regarded as bad, since his stated densities cover the whole range from the no-cheat line and all the way up to maximum-cheating, i.e. the upper density limit given by the standard. This type of behavior is also visible both in the South Asia and the Middle East scatter plots.

It seems that Supplier 3 has a strategy of simply adding an offset to the real density, which is reflected in a mean density difference different from zero and a relatively low standard deviation. A fourth reporting scheme appears in Supplier 4, who has a tendency to always state a density near the limit, independently of the actual density. This could be termed the worst behavior, since such suppliers short-lift as much as possible. This behavior is not uncommon in South Asia and the Middle East. Variations of this scheme, i.e. stating a fixed fuel density that is lower than the limit, are seen in Asia, Middle East and South America West. They appear as horizontal lines in the scatter plot.

Figure 3: Scatter plot of measured vs. claimed density for the same geo-regions as in Table 1 and Figure 1. Each black dot represents (at least) one bunkering. The solid line represents the no-cheat line, i.e. bunkerings where the supplier states the density correctly (claimed = measured), whereas the dashed lines indicate the upper density limit in the ISO standard for bunker fuel (ISO8217), viz. 991 kg/m3, implicitly giving the maximum possible amount of cheating. Many dots along the upper dashed line indicate a high degree of cheating in many bunkerings. Note that in many bunkerings the fuel density was above the limit (dots to the right of vertical dashed line) but almost none of them were reported to lie above the limit (above horizontal dashed line).

Figure 4: Scatter plot of measured versus claimed density for the same suppliers as in Table 1 and Figure 2. Supplier 1 reports quite honestly, as his dots are scattered closely along the no-cheat line. In contrast, Suppliers 2 and 3 have many reportings away from this no-cheat line, but they are not as dishonest as Supplier 4, who basically reports only one density close to 991 kg/m3 irrespective of the actual fuel density.


3. The Good: Best practice benchmark

The above discussion has emphasized the need for a good benchmark for measuring the goodness in density reporting, and for distinguishing between various short-lifting and long-lifting strategies.

The scatter plots of Canada & US West Coast and Supplier 1 are examples of good density reporting behaviors that could be used as best practice references. Our interpretation of good or best practice is indicated by the grey diagonal area around the no-cheat line in Figure 5. Fair reporting and good control of the delivered density should result in a small symmetric scatter around the no-cheat line, and thus a narrow density difference (dd) histogram centered at dd = 0 (like the one for Supplier 1 in Figure 2).

The goal is to establish a best practice, and then use it as a predefined reference to which bunkerings may be compared. This best practice benchmark is given by the dd-histogram for a group of selected good suppliers.

Figure 5: Scatter plot of bunkering data from South Asia. Data points around the diagonal line (no-cheat line) indicate good or best practice behavior, i.e. fair reporting, with little or no cheating. In the area above the no-cheat line, customers get short-lifted (pay too much) whereas below the line the supplier loses money. The more dots there are above the fair line, and the further away from it they are, the less accurate the density reporting. Bunkerings far below the fair area should be considered suspicious and may indicate a bribing situation. Reportings in the grey horizontal area (reporting densities close to the upper density limit) indicate that some suppliers consciously choose a strategy of maximum density cheating. A close-up of the scatter plot near the density limit of 991 kg/m3 reveals that hardly any suppliers are willing to state that their fuel exceeds the limit even when this is clearly the case.

This best practice histogram shall represent good suppliers and should be based on many data points. Any outliers, intentional cheating, or other indications of dishonesty should be eliminated to obtain an unbiased and fair benchmark. The following criteria for deriving the best practice benchmark should therefore be chosen (there will always be a certain element of subjective judgment in this process, but the method for deriving the benchmark should as far as possible be transparent, sound, and unbiased):

1) Select some geo-regions where the scatter plots show that data are predominantly found along the no-cheat line.

2) For each selected dataset we:

a. Eliminate extreme outliers, max cheating and near limit lying; only data inside a predefined area around the no-cheat line is selected (see Figure 6 for details).

b. Eliminate any bias by centering the dd data around dd = 0.

3) The adjusted and selected dd data for all the selected sets are then merged into one large dataset.

4) Calculate the dd histogram for the dataset.

Figure 7 shows the best practice reference histogram derived from the geo-regions Biscay, Canada & US East Coast, Canada & US West Coast, US Gulf Coast, and Oceania.
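The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the band around the no-cheat line is assumed to extend half-way towards the density limit (the `fraction` parameter is hypothetical), centering is assumed to mean subtracting the regional mean, and the bin width and sample values are invented.

```python
from collections import Counter

DENSITY_LIMIT = 991.0  # ISO 8217 upper density limit in kg/m3 (from the text)


def in_fair_band(measured, claimed, fraction=0.5):
    """Step 2a: keep samples close to the no-cheat line.

    Assumption: the band half-width is `fraction` of the distance between
    the measured density and the limit, mirrored below the no-cheat line.
    """
    half_width = fraction * max(DENSITY_LIMIT - measured, 0.0)
    return abs(claimed - measured) <= half_width


def best_practice_histogram(regions, bin_width=0.5):
    """regions: dict mapping region name -> list of (measured, claimed).

    Returns {bin center: relative frequency} for the merged dd data.
    """
    merged = []
    for samples in regions.values():
        dd = [c - m for m, c in samples if in_fair_band(m, c)]
        if not dd:
            continue
        # Step 2b: remove regional bias (assumed: subtract the mean).
        offset = sum(dd) / len(dd)
        # Step 3: merge the adjusted data from all selected regions.
        merged.extend(x - offset for x in dd)
    # Step 4: histogram of the merged dd values.
    counts = Counter(round(x / bin_width) * bin_width for x in merged)
    total = sum(counts.values())
    return {center: k / total for center, k in sorted(counts.items())}


# Hypothetical samples; the third one in Region A is a max-cheating case.
demo = {
    "Region A": [(980.0, 980.5), (980.0, 979.5), (960.0, 990.9)],
    "Region B": [(975.0, 976.0), (975.0, 976.0)],
}
hist = best_practice_histogram(demo)
print(hist)
```

The smoothing that turns this histogram into the histogram function H is left out here, since the text does not specify the smoothing method.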

Figure 6: Only bunkering samples between the 2 blue solid lines will be used as basis for deriving the best practice benchmark histogram. This effectively eliminates max cheating, outliers, and ‘near limit effects’, i.e. less than complete honesty when selling too heavy fuel. The upper solid line divides the angle between the no-cheat and max-cheat lines. The lower solid line is simply the upper line mirrored around the no-cheat line, such that the permitted density deviations are the same in magnitude above and below the line.


Figure 7: Best practice dd histogram based on samples from selected geo-regions (Biscay, Canada & US East and West Coast, US Gulf Coast and Oceania) where max cheating, outliers and near limit dishonesty have been eliminated. The dashed line is the histogram function H, i.e. a smoothed version of the histogram indicating the global best practice.

Classification by membership function

Once the best practice histogram is generated, the challenge is to benchmark a supplier, a port, or a region against it. In principle, this histogram must be compared with the dd histograms for the suppliers in question, and the degree of conformance would then give the desired benchmark. Unfortunately this is a non-trivial task, and for many of the suppliers only relatively few samples are available, resulting in noisy histograms. We therefore propose a more elegant approach that is insensitive to the number of data points and to outliers, and that can even be used for a single bunkering.

The concept of a membership function (Turksen 1991; Terano et al 1987, p. 21), which is widely applied in fuzzy set theory (Lowen 1996; Self 1990), is used to achieve this benchmarking. A single number (score) is computed denoting the goodness of a specific bunkering or supplier.

An example will hopefully make this clear. Consider the task of benchmarking people into fast and slow runners, respectively. One way to do this is to set a threshold T on how fast a person should be able to run 100 m, and then categorize the people who run slower than the threshold as slow (=0) and those who run faster than the threshold as fast (=1). This sorting is achieved by a Boolean membership function B with threshold T for the measured time t on 100 m, i.e. B(T,t). However, it is quite obvious that this benchmarking will result in a crude oversimplification as there is a continuous transition from extremely fast runners to the really slow ones, and a small change in the chosen threshold could seriously alter the number of members in each category. A better approach would be to replace the Boolean function with a continuous function, assigning a continuous membership value between 0 and 1 depending on how fast they run. This is an example of a so-called membership function, and will in the following simply be denoted m.
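The difference between the two kinds of membership can be made concrete with a short sketch of the runner example. The linear ramp and its end points (`fast_s`, `slow_s`) are assumptions chosen purely for illustration; the text does not prescribe a shape for m in this example.

```python
def boolean_membership(threshold_s, time_s):
    """B(T, t): 1 ('fast') if the 100 m time is below the threshold, else 0."""
    return 1 if time_s < threshold_s else 0


def continuous_membership(time_s, fast_s=11.0, slow_s=20.0):
    """Assumed linear ramp: 1 at or below fast_s, 0 at or above slow_s."""
    if time_s <= fast_s:
        return 1.0
    if time_s >= slow_s:
        return 0.0
    return (slow_s - time_s) / (slow_s - fast_s)


# Two runners separated by 0.2 s, on either side of a 15 s threshold.
for t in (14.9, 15.1):
    print(t, boolean_membership(15.0, t), round(continuous_membership(t), 3))
```

The Boolean function puts the two runners in opposite categories, whereas the continuous function gives them nearly identical membership values, which is the behavior wanted for the density benchmark.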

The situation is analogous to our best practice density benchmark where suppliers (or bunkerings) are not grouped into crisp sets of good and bad but rather get a score indicating how close to or far away from the best practice they are. This, by the way, is also the reason why e.g. discriminant analysis (Hastie et al 2009) is unsuitable for the task at hand.

The challenge is to find a membership function for the good group, faithfully reflecting what we consider to be good. Fuzzy set theory does not provide help in determining the membership function, as all kinds of functions are used, e.g. triangular, trapezoid, Gaussian, etc. The discussion of good behavior above gives us some hints about the properties of the desired membership function. It should not be too wide, as a bad bunkering could then be regarded as good. Likewise, if it is too narrow, a good bunkering would get too low a goodness score. It is important that the membership function represents the best practice set as well as possible. The obvious choice is to derive the membership function directly from the dd histogram itself.

The membership function for good bunkerings, mG, must have a maximum value of 1 at dd = 0, i.e. mG(dd=0) = 1, and decrease continuously in both directions, i.e. a rescaling and shift of the H histogram has to be done. We therefore propose the following definition of the membership function:

$$m_G(dd) = \frac{H(dd)}{\max(H)} = \frac{H(dd)}{H(0)}$$

where the subscript G indicates that this gives a goodness scoring, and H is the smoothed (and adjusted) best practice histogram (i.e. H is the histogram function). Note that mG is a function of the distance of dd to 0, as well as the frequency of dd in the best practice. This membership function can now be applied e.g. to all n supplier samples to obtain the overall goodness benchmark,

$$b_G = \frac{1}{n} \cdot \sum_{i=1}^{n} m_G(dd_i)$$

where the summation is done over all n bunkerings for a specific supplier, port, or geo-region.
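A minimal sketch of these two definitions is given below. The histogram function H is represented here by linear interpolation through a handful of tabulated points; both the interpolation and the tabulated values are assumptions made for illustration, since the actual smoothed best practice histogram is only shown graphically.

```python
from bisect import bisect_left


def make_histogram_function(centers, heights):
    """Piecewise-linear stand-in for H(dd); zero outside the tabulated range."""
    def H(dd):
        if dd < centers[0] or dd > centers[-1]:
            return 0.0
        j = bisect_left(centers, dd)
        if centers[j] == dd:
            return heights[j]
        x0, x1 = centers[j - 1], centers[j]
        y0, y1 = heights[j - 1], heights[j]
        return y0 + (y1 - y0) * (dd - x0) / (x1 - x0)
    return H


def goodness_membership(H):
    """m_G(dd) = H(dd) / H(0), assuming the peak of H sits at dd = 0."""
    h0 = H(0.0)
    return lambda dd: H(dd) / h0


def goodness_score(m_G, dd_values):
    """b_G: mean of m_G over all bunkerings of a supplier, port or region."""
    return sum(m_G(dd) for dd in dd_values) / len(dd_values)


# Hypothetical tabulated histogram (not the values behind Figure 7).
centers = [-4.0, -2.0, 0.0, 2.0, 4.0]
heights = [0.0, 0.1, 0.4, 0.1, 0.0]

m_G = goodness_membership(make_histogram_function(centers, heights))
b_G = goodness_score(m_G, [0.0, 2.0, -2.0, 5.0])
print(b_G)  # approximately 0.375
```

Because the score is a plain average of per-sample memberships, it is defined for any number of samples, including a single bunkering.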

An interesting observation is that the scoring from the membership function mG(dd) is not (a priori) a probabilistic measure; it is a measure between 0 and 1 based on how far away a variable is from some value, here dd = 0 (see Figure 8). However, this rescaling does preserve an interesting probabilistic feature: mG(dd) is the probability of finding a value in a small interval around dd, relative to that of finding a value in an equally sized interval close to 0, given that the samples are drawn from the best practice group.

Figure 8: The solid line gives the goodness membership function, mG, which is a scaling of the best practice histogram. mB = 1 - mG gives the membership function for the opposite, i.e. bad (dashed line), which in turn can be divided into a long-lifting and a short-lifting part, mLL and mSL respectively (corresponding to negative and positive dd values). E.g. a bunkering with dd = 2.3 would get a good score of mG = 0.23 and a bad score of mB = 0.77 (with mLL = 0 and mSL = 0.77).

The Bad

Note that mG(dd) was derived from what was chosen to be the best practice. It therefore gives a score for how good a bunkering or supplier is with respect to this best practice. The complement,

mB(dd) = 1 - mG(dd),

gives a badness score, but it does not tell whether the bad score comes from short- or long-lifting. Fortunately, mB can, depending on whether a sample falls into the short- or long-lifting domain, be further divided into mSL and mLL. That is, if the dd value of a sample is positive, its mSL will be greater than zero; if the dd value of a sample is negative, its mLL will be greater than zero.

This enables us to calculate short- and long-lifting scores similar to the goodness score:

$$b_{xL} = \frac{1}{n} \cdot \sum_{i=1}^{n} m_{xL}(dd_i),$$

where the subscript xL should be SL or LL, which stands for short- or long-lifting, respectively. These scores indicate the behavior of a supplier and give the risk of being short- or long-lifted. Note, by definition:

bG + bSL + bLL = 1
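The split of the badness membership and the resulting identity can be sketched as follows. The triangular m_G used here is a stand-in chosen for brevity (an assumption, with a hypothetical half-width of 3 kg/m3); in the method itself m_G comes from the best practice histogram.

```python
def m_G(dd, half_width=3.0):
    """Stand-in triangular goodness membership (assumption for illustration)."""
    return max(0.0, 1.0 - abs(dd) / half_width)


def m_SL(dd):
    """Short-lifting membership: the badness 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def m_LL(dd):
    """Long-lifting membership: the badness 1 - m_G for negative dd, else 0."""
    return 1.0 - m_G(dd) if dd < 0 else 0.0


def scores(dd_values):
    """Return (b_G, b_SL, b_LL) as averages of the three memberships."""
    n = len(dd_values)
    b_G = sum(m_G(d) for d in dd_values) / n
    b_SL = sum(m_SL(d) for d in dd_values) / n
    b_LL = sum(m_LL(d) for d in dd_values) / n
    return b_G, b_SL, b_LL


# Hypothetical density differences in kg/m3 for one supplier.
b_G, b_SL, b_LL = scores([0.0, 1.5, 4.0, -0.6, 2.3])
print(b_G, b_SL, b_LL, b_G + b_SL + b_LL)
```

For every sample exactly one of m_SL and m_LL can be non-zero and together with m_G they add up to 1, which is why the three scores sum to 1 regardless of the shape chosen for m_G.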

Remember that the scores correspond to the degree of membership, i.e. how close a bunkering is to the good or bad benchmark. They can therefore be understood as weights corresponding to the proportion of good or bad.

The Ugly

As pointed out above, profit maximization by reporting densities at or close to the upper limit may be considered as fairly ugly behavior. The same methodology can be applied to obtain a near limit score for this behavior by constructing a membership function

mNC(claimed density) = mG(claimed density - 991)

where the subscript NC denotes Near Ceiling.

This membership function assigns a score to a bunkering according to the distance of the claimed density from the density limit and the frequency of occurrence in the benchmark. To avoid categorizing a bunkering as ugly when the measured density is actually near the limit, we combine mNC and mSL by multiplication. In so doing we exclude all reportings that are near the limit but that are actually honest. We propose the following ugly or near limit benchmark

$$b_{NC} = \frac{1}{n} \cdot \sum_{i=1}^{n} m_{SL}(dd_i) \cdot m_{NC}(\text{claimed density}_i)$$

giving the fraction of short-lifting that could be considered as near limit reporting.
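The near limit benchmark can be sketched in the same way. As before, the triangular m_G is only a stand-in (an assumption), and the three sample bunkerings are invented to show the three cases of interest.

```python
DENSITY_LIMIT = 991.0  # ISO 8217 upper density limit in kg/m3 (from the text)


def m_G(x, half_width=3.0):
    """Stand-in triangular goodness membership (assumption for illustration)."""
    return max(0.0, 1.0 - abs(x) / half_width)


def m_SL(dd):
    """Short-lifting membership: 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def m_NC(claimed_density):
    """Near Ceiling membership: m_G shifted to the density limit."""
    return m_G(claimed_density - DENSITY_LIMIT)


def near_ceiling_score(samples):
    """b_NC for a list of (measured, claimed) density pairs in kg/m3."""
    total = sum(m_SL(claimed - measured) * m_NC(claimed)
                for measured, claimed in samples)
    return total / len(samples)


# Hypothetical bunkerings:
cheating_at_ceiling = [(975.0, 990.8)]  # light fuel reported near the limit
honest_at_ceiling = [(990.8, 990.8)]    # heavy fuel reported correctly
cheating_elsewhere = [(960.0, 965.0)]   # short-lifted, but far from the limit

for case in (cheating_at_ceiling, honest_at_ceiling, cheating_elsewhere):
    print(near_ceiling_score(case))
```

Only the first case receives a high score: the product is zero whenever the bunkering is either honest or reported far from the limit.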

Further characterization of Good and Bad

In order to further characterize bunkering samples within the good, short-lifting, or long-lifting region in the scatter plot, the average density deviations in each region could be computed by weighting each bunkering sample with the corresponding score from the membership function. For instance, the mean density difference ($\overline{dd}_{SL}$) in the short-lifting area is:

$$\overline{dd}_{SL} = \frac{\sum_i dd_i \cdot m_{SL}(dd_i)}{\sum_i m_{SL}(dd_i)}$$

in kg/m3, where the index i runs over all n samples.

This means that for a given supplier we can provide information about the risk of being short-lifted, bSL, and about the expected average amount in density difference, $\overline{dd}_{SL}$. The method is easily extended to the other identified behaviors.
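The membership-weighted mean can be sketched as below. The helper name `weighted_mean_dd` and the triangular stand-in for m_G are assumptions made for illustration; passing a different membership function gives the corresponding mean for the good or long-lifting region.

```python
def m_G(dd, half_width=3.0):
    """Stand-in triangular goodness membership (assumption for illustration)."""
    return max(0.0, 1.0 - abs(dd) / half_width)


def m_SL(dd):
    """Short-lifting membership: 1 - m_G for positive dd, else 0."""
    return 1.0 - m_G(dd) if dd > 0 else 0.0


def weighted_mean_dd(dd_values, membership):
    """Mean density difference weighted by a membership function.

    Returns None when no sample carries any weight in the region.
    """
    weights = [membership(d) for d in dd_values]
    total = sum(weights)
    if total == 0:
        return None
    return sum(d * w for d, w in zip(dd_values, weights)) / total


# Hypothetical density differences in kg/m3 for one supplier.
dd_SL_mean = weighted_mean_dd([0.0, 1.5, 4.0, -0.6], m_SL)
print(dd_SL_mean)
```

Samples that are clearly short-lifted dominate the average, while samples close to dd = 0 contribute little, so the result describes how much short-lifting to expect when it occurs.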

4. Application of the benchmarks

As discussed above, the power of the scatter plot lies in the visualization of the different density reporting schemes. Several patterns, like fixed value density reporting, systematic density reporting deviations, etc., are easily spotted. The benchmarks developed above are constructed to discriminate between some of these different reporting schemes, and to quantify the risk of being short-lifted as well as the amount of short-lifting that should be expected. The benchmarks for our examples from Table 1 are given in Table 2 below.

Table 2: Standard descriptive measures together with our benchmark(s) for the geo-regions and suppliers from Table 1. The benchmarks for the data that were used to generate the best practice histogram are also included for comparison. A row, e.g. Global, is read as follows: average density difference is 0.39, std = 3.92. Benchmarking against the best practice gives the following results: 43% of the samples can be regarded as good (bG), 31% qualify as short-lifting (bSL), and 26% as long-lifting (bLL). For the short-lifting samples the average density difference is 3.31, but only 7% of them were near the ceiling.

dd (kg/m3)

σdd bG bSL bLL bNC SLdd (kg/m3)

Best Practice 0.05 1.16 0.62 0.19 0.19 0.01 1.50

Global 0.39 3.92 0.43 0.31 0.26 0.07 3.31

Canada & US West Coast 0.03 2.43 0.55 0.22 0.24 0.02 2.09

South Asia 1.22 3.35 0.41 0.52 0.07 0.26 2.44

Middle east 1.83 4.76 0.32 0.49 0.19 0.02 4.61

South America West 0.48 6.00 0.08 0.42 0.50 0.00 3.73

Supplier 1 0.12 0.95 0.71 0.09 0.20 0.02 1.70

Supplier 2 2.31 4.84 0.36 0.53 0.11 0.13 4.65

Supplier 3 2.40 1.83 0.09 0.87 0.03 0.00 2.81

Supplier 4 2.07 2.81 0.27 0.72 0.01 0.46 2.64


The samples used to generate the best practice histogram were included in the table for easy comparison. Note that the only way the good score can be 1 is when all samples are at dd=0; this explains why even the good score of the best practice is ‘only’ 0.62. The table shows that for the selected geo-regions the highest risk of being short-lifted is found in South Asia. The near-limit benchmark, bNC, confirms what is apparent from the scatter plot (Figure 3): for many suppliers it is common practice to maximize their profit by simply reporting a fuel density at or near the limit.
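The reading of Table 2 can be checked with a short script. The numbers below are copied from Table 2. The variable and function names are illustrative, and the observation that bG, bSL and bLL add up to one within rounding is a property of the published rows, not a statement taken from the text.

```python
# (bG, bSL, bLL) for each row of Table 2.
TABLE2 = {
    "Best Practice": (0.62, 0.19, 0.19),
    "Global": (0.43, 0.31, 0.26),
    "Canada & US West Coast": (0.55, 0.22, 0.24),
    "South Asia": (0.41, 0.52, 0.07),
    "Middle East": (0.32, 0.49, 0.19),
    "South America West": (0.08, 0.42, 0.50),
    "Supplier 1": (0.71, 0.09, 0.20),
    "Supplier 2": (0.36, 0.53, 0.11),
    "Supplier 3": (0.09, 0.87, 0.03),
    "Supplier 4": (0.27, 0.72, 0.01),
}

GEO_REGIONS = [
    "Canada & US West Coast",
    "South Asia",
    "Middle East",
    "South America West",
]


def partition_error(row):
    """Distance of bG + bSL + bLL from 1 for one table row."""
    return abs(sum(row) - 1.0)


# Largest deviation from 1 over all rows (rounding to two decimals allows ~0.01).
max_error = max(partition_error(row) for row in TABLE2.values())

# Geo-region with the highest short-lifting score bSL (second entry of each row).
riskiest_region = max(GEO_REGIONS, key=lambda name: TABLE2[name][1])

print(round(max_error, 2))
print(riskiest_region)  # South Asia
```

In every published row the three scores split the samples into good, short-lifted and long-lifted shares, which is why each can be read as a fraction of the bunkerings.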

South America West nicely illustrates the strong ability of the benchmark to identify the underlying behavior. Recall that for this area the mean was near zero, but the high standard deviation suggested large fluctuations in their reporting. Even so, no indications about the underlying reporting schemes, or the risk of being short- or long-lifted, can be deduced. In contrast, our benchmark reveals that the likelihood of actually getting what you paid for is rather slim, viz. around 8%. In the vast majority of the cases either short- or long-lifting takes place.

Observe also that Supplier 1 can indeed be regarded as honest, with a good score higher than best practice. Suppliers 2 and 3 have comparable average density differences, but their good and near limit benchmarks clearly separate them. A comparison of the benchmarks with the corresponding scatter plots will confirm that the benchmarks do indeed give a more accurate description of the honesty of suppliers than standard descriptive statistics.

Figure 9: Comparison of different benchmarking methods: suppliers ranked based on their mean density difference, $\overline{dd}$ (top), and their corresponding good score, bG (bottom). Observe that ranking with respect to the mean would result in about 1057 good suppliers (|$\overline{dd}$| ≤ 0.7). Our scoring with respect to best practice (0.62) reveals, however, that about 150 are definitely bad (left-hatched area), even below global average (0.43). 539 are really good (equal to or better than best practice, right-hatched area), whereas the rest are located between global average and best practice. Observe also that simply relying on the mean to characterize suppliers would label several of them as bad even though their good score is above global best practice.

Supplier ranking

In Figure 9 (top) all suppliers of RMG380 fuel worldwide are ranked with respect to their mean density difference, $\overline{dd}$. When using |$\overline{dd}$| ≤ 0.7 as a criterion for goodness, the mean would imply that there are about 1057 good suppliers. Applying this mean $\overline{dd}$ to our benchmarking method results in the continuous bell-shaped curve (blue). If $\overline{dd}$ were indeed an unbiased measure of the goodness of suppliers, then their scores should be closely scattered around this curve; this is, however, not at all the case. This discrepancy stems from the unreliability of the mean (or standard deviation) as a measure whenever the underlying distributions are non-normal or outliers have a large effect. The figure shows clearly that 150 of the apparently good suppliers are actually quite bad, i.e. even below global average (left-hatched area), whereas only about half (539) can be considered equal to or better than best practice (right-hatched area). Observe also that many of the apparently bad suppliers (those with |$\overline{dd}$| > 0.7) are actually better than their reputation, as most of them are above the bell-shaped curve and some are even above best practice, further emphasizing the need for an unbiased score like bG.
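The two ways of labelling a supplier can be contrasted in a small sketch. The thresholds (0.7 for the mean, 0.43 for global average, 0.62 for best practice) come from the text. The function names are illustrative, and the sample inputs are rows of Table 2 (regions as well as suppliers), used here only as example data.

```python
def good_by_mean(mean_dd: float, threshold: float = 0.7) -> bool:
    """Mean-based criterion: |mean dd| <= 0.7 kg/m3 counts as good."""
    return abs(mean_dd) <= threshold


def band_by_score(b_g: float, global_avg: float = 0.43, best_practice: float = 0.62) -> str:
    """Place a good score bG relative to global average and best practice."""
    if b_g >= best_practice:
        return "at or above best practice"
    if b_g < global_avg:
        return "below global average"
    return "between global average and best practice"


# (mean dd in kg/m3, bG) taken from Table 2.
ROWS = {
    "Canada & US West Coast": (0.03, 0.55),
    "South America West": (0.48, 0.08),
    "Supplier 1": (0.12, 0.71),
    "Supplier 3": (2.40, 0.09),
}

for name, (mean_dd, b_g) in ROWS.items():
    label = "good" if good_by_mean(mean_dd) else "bad"
    print(f"{name}: {label} by mean, {band_by_score(b_g)} by bG")
```

South America West shows the mismatch discussed above: its mean of 0.48 passes the mean-based criterion, yet its good score of 0.08 lies far below global average.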

Development over time

Following the development of the score of a supplier, port, or region over time may give valuable indications about what may be expected in the near future. For instance, Figure 10 shows the development of the bG score for two major ports, Singapore and Rotterdam, over the past 25 years.


Figure 10: Time series of goodness scores bG for two large ports in different geo-regions. Data from all available suppliers are included. Dots are quarterly time intervals while the stippled lines are year averages. Each dot is based on a varying number of ‘raw data points’, i.e. the number of bunkerings during the corresponding time interval.

Observe that from the beginning of the 1980s and up to the mid 1990s the quality of the density reporting was increasing. It then leveled off until 2008, when a change in behavior occurred – perhaps triggered by the onset of the global recession?
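A time series like the one in Figure 10 could be produced along the following lines. This is a sketch under assumptions: the membership function `m_g_demo` is hypothetical, bG per quarter is taken to be the average membership of the bunkerings in that quarter, and the yearly value is taken to be the mean of the quarterly scores (the caption does not say how the year averages were formed).

```python
from collections import defaultdict
from datetime import date


def m_g_demo(d: float) -> float:
    # Hypothetical "good" membership: 1 at dd = 0, falling linearly to 0 at |dd| >= 3 kg/m3.
    return max(0.0, 1.0 - abs(d) / 3.0)


def quarterly_good_scores(records, membership):
    """Map (year, quarter) to the mean membership of the bunkerings in that quarter."""
    buckets = defaultdict(list)
    for day, dd in records:
        quarter = (day.month - 1) // 3 + 1
        buckets[(day.year, quarter)].append(membership(dd))
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


def yearly_averages(quarterly):
    """Average the quarterly scores within each year (assumed aggregation)."""
    per_year = defaultdict(list)
    for (year, _quarter), score in quarterly.items():
        per_year[year].append(score)
    return {year: sum(vals) / len(vals) for year, vals in sorted(per_year.items())}


# Illustrative bunkerings (assumed data): (date, density difference in kg/m3).
records = [
    (date(2008, 1, 15), 0.0),
    (date(2008, 2, 20), 1.5),
    (date(2008, 5, 3), 3.0),
    (date(2009, 1, 10), -1.5),
]

quarterly = quarterly_good_scores(records, m_g_demo)
yearly = yearly_averages(quarterly)
print(quarterly)  # {(2008, 1): 0.75, (2008, 2): 0.0, (2009, 1): 0.5}
print(yearly)     # {2008: 0.375, 2009: 0.5}
```

Because each quarter may contain a different number of bunkerings, quarters with few samples give noisier dots than quarters with many.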

5. Discussion and concluding remarks

This paper has two main focus areas: the construction of a realistic benchmark and the development of a methodology that allows comparing one or more samples with the benchmark.

The examples given above demonstrate the capabilities of our approach. It is more powerful than standard descriptive statistics (e.g. $\overline{dd}$ and σdd), as it is less sensitive to outliers and is well suited for small datasets and even single numbers. Recall that our benchmarks give better quantifications than $\overline{dd}$ and σdd together. Further, it makes no assumptions about the data distribution: any probability distribution of the underlying data is allowed. Only some weak requirements apply to the membership function (e.g. increasing/decreasing). The methodology is quite generic and could in principle be applied to any kind of comparison task, i.e. benchmarking.
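The claim about outliers and single samples can be illustrated with a toy example. The membership function `m_g_demo` below is a hypothetical triangular shape chosen for the illustration, not the function derived from the best practice histogram, and the good score is assumed to be the average membership over the samples.

```python
from statistics import mean


def m_g_demo(d: float) -> float:
    # Hypothetical "good" membership: 1 at dd = 0, falling linearly to 0 at |dd| >= 3 kg/m3.
    return max(0.0, 1.0 - abs(d) / 3.0)


def good_score(dd, membership=m_g_demo) -> float:
    """Assumed form of bG: the average membership over all samples."""
    return mean([membership(d) for d in dd])


# Nine exact reports, then the same nine plus one gross outlier (assumed data, kg/m3).
clean = [0.0] * 9
with_outlier = clean + [50.0]

print(mean(clean), good_score(clean))                # 0.0 1.0
print(mean(with_outlier), good_score(with_outlier))  # 5.0 0.9

# The score is also defined for a single bunkering.
print(good_score([1.5]))  # 0.5
```

A single extreme value shifts the mean from 0 to 5 kg/m3, while the score drops only from 1.0 to 0.9: each sample's membership is bounded between 0 and 1, so one sample can change the score by at most 1/n.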

The fact that the benchmark is based on a probability density function, and that a probabilistic interpretation of the scoring is possible, is an aid to the user’s intuition, making it easier to understand and interpret the results.

Once a best practice histogram has been generated, a membership function can be derived, after which benchmarking is easily done. Subjectivity is only involved in the definition of what can be regarded as best practice, as there is no a priori correct answer to this problem. Our approach has been to ask: what should be expected of a good supplier? And by answering this question we have picked suppliers that best match our expectations. Outliers and incorrect claims near the density limit are of course not wanted from a good supplier, hence their removal from the best practice data set.

From a user perspective the main strengths of the presented benchmark are:

• Intuitive and easy to understand.

• Applicable for few or even singleton samples.

• Able to pinpoint different density reporting schemes.

In closing let us return to the extent and amount of global short-lifting which is estimated to be around 1.7 ton per bunkering on average. Thanks to our benchmarking methodology we can now provide a more detailed picture of the situation. First, 43% of the bunkerings could be considered to be loss neutral (bG=0.43), since they are within best practice. Second, 26% are instances of long-lifting (bLL=0.26), where the buyer gains on average 1.8 ton. Third, 31% could be regarded as short-lifting (bSL=0.31), with an average buyer loss of 2.5 ton per bunkering. This highlights the importance of choosing the right supplier.

The presented benchmark methodology is easily extendable to other (quality and economic) bunkering parameters like viscosity, sulfur or water content, as well as a series of physical and chemical properties. The methodology will be the basis for a benchmarking web tool, scheduled for release by DNVPS later this year.

Figure 11: Bunker surveyor on board a ship. Photo by DNV Petroleum Services (used with permission).


References

Bhattacharyya, G., Johnson, R. (1977), Statistical Concepts and Methods, Wiley, New York.

DNV (2010). Total fuel management, http://www.dnv.com/industry/maritime/servicessolutions/fueltesting (accessed 13. Oct. 2010).

EPA (2008), Global Trade and Fuels Assessment - Future Trends and Effects of Requiring Clean Fuels in the Marine Sector. Assessment and Standards Division Office of Transportation and Air Quality, U.S. Environmental Protection Agency. EPA420-R-08-021, November 2008.

Eyring, V., Isaksen, I.S.A., Berntsen, T., Collins, W.J., Corbett, J.J., Endresen, O., Grainger, R.G., Moldanova, J., Schlager, H., Stevenson, D.S. (2010), “Transport impacts on atmosphere and climate: Shipping”, Atmospheric Environment, Volume 44, Issue 37, December 2010, pp. 4735-4771.

Hastie, T., Tibshirani, R., Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition). Springer, New York.

IEA (2010). World Energy Outlook 2010. International Energy Agency, OECD Publishing, Paris.

IMO (2009). Prevention of Air Pollution from Ships. International Maritime Organization, Marine Environment Protection Committee. MEPC 59/INF.10, 9 April 2009.

Lowen, R. (1996), Fuzzy Set Theory, Kluwer Academic Publishers, Dordrecht.

Self, K. (1990), “Designing with fuzzy logic”, IEEE Spectrum, Vol 27, No 11, November 1990, pp. 42-44, p. 105.

Terano, T., Asai, K., Sugeno, M. (1987), Fuzzy Systems Theory and its Applications. Academic Press, San Diego.

Turksen, I.B. (1991), “Measurement of membership functions and their acquisition”, Fuzzy Sets and Systems, Vol. 40, pp. 5-38.

Figure 5 (labels recovered from the image): axes ‘Measured density’ and ‘Claimed density’, both with ticks at 961, 971, 981 and 991; regions marked ‘max. cheat area’, ‘Suspicious’, ‘Bad’ and ‘Good’; lines marked ‘Limit’.

Figure 6 (labels recovered from the image): ‘Limit’, ‘Limit = max. cheat line’ and ‘no cheat line’.

Figure 7 (labels recovered from the image): axes ‘Probability’ and ‘Density deviations’.

Figure 8 (labels recovered from the image): density difference dd = 2.3; Good: mG = 0.23; Bad: mB = 1 - 0.23 = 0.77; curves mG and mB = 1 - mG; regions ‘Short-lifting’ and ‘Long-lifting’.

Figure 9 (labels recovered from the image): axes ‘Good score’ (0 to 1), ‘claimed - measured density’ (with -0.7 and 0.7 marked) and ‘total number of suppliers’ (0 to 2500); annotations ‘Best practice score’, ‘Global average score’, ‘Ca. 1057 suppliers’, ‘150’, ‘539’, ‘Many “good suppliers” are actually quite bad!’, ‘Some “bad suppliers” are actually very good!’ and ‘Some “bad suppliers” are actually slightly better!’.
