Comput Manag Sci
DOI 10.1007/s10287-013-0175-5
ORIGINAL PAPER
Interaction between financial risk measures and machine learning methods
Jun-ya Gotoh · Akiko Takeda · Rei Yamamoto
Received: 6 October 2012 / Accepted: 31 May 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract The purpose of this article is to review the similarities and differences between financial risk minimization and a class of machine learning methods known as support vector machines, which were developed independently. By recognizing their common features, we can understand them in a unified mathematical framework. On the other hand, by recognizing their differences, we can develop new methods. In particular, employing the coherent measures of risk, we develop a generalized criterion for two-class classification. It includes existing criteria, such as the margin maximization and ν-SVM, as special cases. This extension can also be applied to other types of machine learning methods such as multi-class classification, regression and outlier detection. Although the new criterion is first formulated as a nonconvex optimization, it results in a convex optimization by employing the nonnegative $\ell_1$-regularization. Numerical examples demonstrate how the developed methods work for bond rating.
The research of the first author is partly supported by a MEXT Grant-in-Aid for Young Scientists (B) 23710176. Also, the authors appreciate the comments by two anonymous referees and Dr. Pando G. Georgiev.
J. Gotoh (B)
Department of Industrial and Systems Engineering, Chuo University, 2-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
e-mail: [email protected]

A. Takeda
Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

R. Yamamoto
Mitsubishi UFJ Trust Investment Technology Institute Co., Ltd., 4-2-6 Akasaka, Minato-ku, Tokyo 107-0052, Japan
Keywords ν-Support vector machine (ν-SVM) · Conditional value-at-risk (CVaR) · Mean-absolute semi-deviation (MASD) · Coherent measures of risk · Credit rating
Mathematics Subject Classification (2000) 62H30 · 62P05 · 90C90 · 91B28 · 91B30 · 91G40
1 Introduction
In financial risk management, classification problems play an important role. For example, failure discrimination has been a popular subject for classification since the study of Altman (1968), which is known as the first application of Fisher's linear discriminant analysis (LDA) to company bankruptcy. Such failure discrimination is considered to be a so-called two-class (or binary) classification.
On the other hand, credit rating of debtors (e.g., companies or consumers) can be accomplished by a multi-class classification, where each debtor is to be classified into a class (e.g., AA+) among more than two classes (e.g., AAA, AA+, ...). Estimating or predicting the rating of bonds or their issuers is of increasing importance, as is the credit scoring of consumer loans (see Crook et al. 2007 for consumer credit risk assessment). Recently, sovereign credit rating has gathered attention (e.g., Bennell et al. 2006). Corresponding to this increase in importance, a number of articles have been devoted to developing classification methods. The most classical is the multi-class extension of LDA and the most popular is the (ordered) logit model (see, e.g., Thomas et al. 2002; Crook et al. 2007).
In addition to those traditional statistical methods, various kinds of artificial intelligence and mathematical programming approaches have been applied to financial credit scoring (see, e.g., Bahrammirzaee 2010 for a comprehensive survey of artificial intelligence methods). In particular, since the mid 1990s, the support vector machines (SVMs), a class of machine learning methods developed by Vapnik (1995), have often been employed in financial applications (e.g., Erdal and Ekinci 2012; Huang et al. 2004; Shin et al. 2005). Most of these articles, however, have merely employed existing statistical methods.
On the other hand, well before those credit classification applications, optimization modeling had been developed in the context of portfolio selection since the advent of Markowitz's mean-variance model, in which variance is employed to represent the risk to be minimized. In addition to variance, there has been a continuous effort to study a number of measures of risk that capture various characteristics of loss distributions. Among them are semi-variance (Markowitz 1959), below-target returns (Fishburn 1977), value-at-risk (VaR), conditional value-at-risk (CVaR) (e.g., Rockafellar and Uryasev 2000) and the classes of coherent and convex measures of risk (Artzner et al. 1999; Föllmer and Schied 2002). Beyond merely measuring the risk of a position, such risk measures have been employed as objectives in optimizing a portfolio of financial assets.
It is not hard to see that there is a connection between financial risk minimization and the optimization in machine learning criteria, both of which estimate models that would achieve good out-of-sample performance. Indeed, Gotoh and Takeda (2005)
have pointed out the common mathematical structure employed both in the class of machine learning methods known as ν-support vector machines (ν-SVMs) and in the CVaR minimization. Considering that both methods have gained popularity due to their nice theoretical properties and computational tractability, this connection brings us a new perspective.
The purpose of this article is to review the similarities and differences between financial risk optimization and SVMs, by revisiting the facts discussed in Gotoh and Takeda (2005). The two methodologies have been developed independently, but they have a lot in common. By recognizing what they have in common, we can understand their mathematics in a unified framework. At the same time, by recognizing their differences, we can develop new methods. In particular, we develop a new criterion for two-class classification by employing the notion of the coherent measures of risk. The new criterion can be viewed as a generalization of both the margin maximization and ν-SVM. Besides, the idea can be applied to other types of machine learning situations such as multi-class classification, regression and outlier detection.
The structure of this article is as follows. In the next section, we briefly overview a couple of prominent criteria for SVM-based two-class linear classification. Section 3 is devoted to an overview of VaR and CVaR, followed by an exposition of the relation between the two-class ν-SVM and the CVaR minimization on the basis of Gotoh and Takeda (2005). In Sect. 4, we apply the coherent measures of risk to the two-class classification context, and discuss strategies to cope with the intractability arising from a nonconvex constraint of the associated optimization problems. Section 5 further extends the two-class method developed in Sect. 4 to other machine learning situations. In Sect. 6, numerical examples are given so as to demonstrate the performance of a couple of methods developed in the preceding sections. Finally, Sect. 7 closes the article with some concluding remarks.
Notation. We use the following notation throughout. For $x \in \mathbb{R}^n$, $\|x\|$ denotes any norm in $\mathbb{R}^n$. In particular, the $\ell_p$-norm is defined by $\|x\|_p := (\sum_{j=1}^n |x_j|^p)^{1/p}$ for $p \in [1,\infty)$, and $\|x\|_\infty := \max_{j=1,\dots,n}\{|x_j|\}$ for $p = \infty$. The $\ell_2$-norm, $\|x\|_2$, is also known as the Euclidean norm. $e_n$ denotes the vector of ones of size $n$, i.e., $e_n := (1,\dots,1)^\top \in \mathbb{R}^n$. $\Delta_m$ denotes the unit simplex in $\mathbb{R}^m$ and $\Delta_m^+$ is its (relative) interior, i.e., $\Delta_m := \{q \in \mathbb{R}^m : e_m^\top q = 1,\, q \geq 0\}$ whereas $\Delta_m^+ := \{q \in \mathbb{R}^m : e_m^\top q = 1,\, q > 0\}$. For a set $S \subset \mathbb{R}^n$, its convex hull is denoted by $\overline{S}$.
2 Formulations of support vector classification
Maximum margin criterion and hard margin formulation. To make this article self-contained, let us start with a fundamental formulation of the so-called two-class hard margin support vector classification.
Let $\{(x_1, y_1),\dots,(x_m, y_m)\}$ denote a given data set where $x_i \in \mathbb{R}^n$ describes the attributes of sample $i$ and $y_i \in \{\pm 1\}$ describes its label, $i = 1,\dots,m$. Let $m_+ := |\{i : y_i = 1\}|$ and $m_- := |\{i : y_i = -1\}|\,(= m - m_+)$. In the following, the data set is assumed to contain at least one sample of each class, i.e., $\min\{m_+, m_-\} \geq 1$. We say that the data set is linearly separable if there exists $(w, b) \in (\mathbb{R}^n \setminus \{0\}) \times \mathbb{R}$ such that for each $i = 1,\dots,m$,
Fig. 1 Two hyperplanes separating a linearly separable data set and their margins. Two separating hyperplanes are drawn in this figure. According to machine learning theory, the hyperplane with the larger margin is preferable to the other
$$\begin{cases} y_i = +1 \;\Rightarrow\; w^\top x_i > b, \\ y_i = -1 \;\Rightarrow\; w^\top x_i < b. \end{cases} \qquad (1)$$

Condition (1) can be rewritten as $y_i(w^\top x_i - b) > 0$. As indicated in Fig. 1, a linearly separable data set can be divided into two classes corresponding to its labels, i.e., $y_i = +1$ or $y_i = -1$, by an infinite number of hyperplanes $w^\top x = b$.
One of the most reasonable criteria for determining a hyperplane separating the two classes is to maximize the distance from the hyperplane to the nearest points $x_i$. It is not hard to see that if the data set is linearly separable, this criterion is formulated as a fractional optimization problem of the form:

$$\max_{w \in \mathbb{R}^n \setminus \{0\},\; b \in \mathbb{R}} \;\; \min_{i=1,\dots,m} \; \frac{y_i(w^\top x_i - b)}{\|w\|_2}. \qquad (2)$$
The fraction $y_i(w^\top x_i - b)/\|w\|_2$ is called the geometric margin of a sample $(x_i, y_i)$, and the smallest geometric margin is called the margin of a hyperplane, as illustrated in Fig. 1. The max-min formulation (2) can be reformulated into another fractional program:

$$\begin{array}{ll} \displaystyle\max_{w,b,s} & \dfrac{s}{\|w\|_2} \\[2mm] \text{subject to} & y_i(w^\top x_i - b) \geq s, \quad i = 1,\dots,m. \end{array} \qquad (3)$$
By applying the so-called Charnes-Cooper transformation, it results in a convex quadratic problem formulation called the hard margin support vector classification (HSVC):
Fig. 2 Distance under different norms. $x^1$ is the closest to a point $x$ under the $\ell_\infty$-norm and its distance is given by $|w^\top x - b|/\|w\|_1$, while $x^2$ is the closest under the $\ell_2$-norm and the distance is $|w^\top x - b|/\|w\|_2$
$$\begin{array}{ll} \displaystyle\min_{w,b} & \frac{1}{2}\|w\|_2^2 \\[1mm] \text{subject to} & y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m. \end{array} \qquad (4)$$
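For completeness, the reduction behind the Charnes-Cooper step can be sketched as follows (a standard argument, not spelled out in the original). For a linearly separable data set, the optimal $s$ in (3) is positive, and (3) is invariant under positive scaling of $(w, b, s)$; normalizing $s = 1$ via $(w, b) \leftarrow (w/s, b/s)$ gives

$$\max_{w,b}\;\; \frac{1}{\|w\|_2} \quad \text{subject to} \quad y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m,$$

and maximizing $1/\|w\|_2$ is equivalent to minimizing $\frac{1}{2}\|w\|_2^2$, which yields (4).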
The equivalence between (4) and (3) depends on the linear separability of the data set. Indeed, if a data set is not linearly separable, (4) is infeasible, whereas (3) still has a solution. It should also be noted that the choice of the $\ell_2$-norm is not necessary. Indeed, if we employ the $\ell_\infty$-norm in place of the $\ell_2$-norm for gauging the distance between two points, the distance from a point $x_i$ to a hyperplane $w^\top x = b$ is represented by

$$\frac{y_i(w^\top x_i - b)}{\|w\|_1},$$
where $\|w\|_1$ represents the $\ell_1$-norm (see Fig. 2). In general, if any norm $\|x\|$ is employed (i.e., not necessarily the $\ell_2$- or $\ell_\infty$-norm), the corresponding distance is given by $y_i(w^\top x_i - b)/\|w\|_*$ with the dual norm $\|w\|_* := \max_{x \neq 0} w^\top x / \|x\|$ (see Mangasarian 1999). Corresponding to (4), the problem can be represented as a convex optimization with the dual norm:

$$\begin{array}{ll} \displaystyle\min_{w,b} & \|w\|_* \\[1mm] \text{subject to} & y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m. \end{array} \qquad (5)$$
If the $\ell_2$-norm, $\|x\|_2$, is employed as $\|x\|$, the objective of (5) is $\|w\|_2$ and (5) is equivalent to (4).
Minimizing a norm in (4) or (5) is often interpreted as the margin maximization on the basis of the above reasoning.
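As a small numerical illustration of the geometric margin used in (2), the sketch below computes $y_i(w^\top x_i - b)/\|w\|_2$ for each sample and the margin of a hyperplane on a toy data set (all numbers are illustrative, not taken from the paper):

```python
import math

def geometric_margin(w, b, x, y):
    """Geometric margin y * (w.x - b) / ||w||_2 of one sample (x, y)."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    dot = sum(wj * xj for wj, xj in zip(w, x))
    return y * (dot - b) / norm_w

# Toy linearly separable data set (illustrative values).
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

def margin(w, b):
    """Margin of a hyperplane = smallest geometric margin over the samples."""
    return min(geometric_margin(w, b, x, y) for x, y in samples)

m1 = margin((1.0, 1.0), 2.0)  # hyperplane x_1 + x_2 = 2
m2 = margin((1.0, 0.0), 1.0)  # hyperplane x_1 = 1
```

Both hyperplanes separate the data (both margins are positive), but the first attains the larger margin ($\sqrt{2}$ vs. 1), so criterion (2) would prefer it, as in Fig. 1.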
Soft margin formulation. When the data set is not linearly separable, i.e., for any $(w, b)$ there exists a sample $i$ satisfying $y_i(w^\top x_i - b) \leq 0$, there are multiple ways of modification. The most common approach is known as the soft margin formulation or C-support vector classification (C-SVC), which is formulated by
$$\begin{array}{ll} \displaystyle\min_{w,b,z} & \frac{1}{2}\|w\|_2^2 + \dfrac{C}{m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i \geq 1, \quad z_i \geq 0, \quad i = 1,\dots,m, \end{array} \qquad (6)$$
where $C > 0$ is a user-defined parameter. Formulation (6) is always feasible regardless of whether the given data set is linearly separable or not.
In contrast to HSVC (4), C-SVC (6) is a simultaneous minimization of two objectives:

$$\frac{1}{2}\|w\|_2^2 + C \cdot \frac{1}{m}\sum_{i=1}^m \max\{1 - y_i(w^\top x_i - b),\, 0\}. \qquad (7)$$
In machine learning theory, such a two-objective formulation for estimating a model is called structural risk minimization. The second term of (7) represents the average of the degrees of misclassification, where only samples $x_i$ satisfying $y_i(w^\top x_i - b) < 1$ contribute to the average computation, and is called the empirical risk; the first term is considered to play the role of avoiding overfitting and is called the regularization term. Minimizing the regularization term is often interpreted as the margin maximization, in comparison with HSVC (4).
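The decomposition (7) can be made concrete by evaluating the regularization term and the empirical risk separately for a fixed hyperplane; the sketch below uses illustrative toy numbers, not data from the paper:

```python
def csvc_objective(w, b, C, samples):
    """Return (total, regularization term, empirical risk) of objective (7)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    hinge = [max(1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) - b), 0.0)
             for x, y in samples]
    emp_risk = sum(hinge) / len(hinge)
    return reg + C * emp_risk, reg, emp_risk

# Toy data: the fourth sample is misclassified; the fifth is correctly
# classified but lies inside the margin (0 < y*(w.x - b) < 1), so it is
# still penalized by the hinge loss.
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1),
           ((0.5, 0.9), +1), ((2.2, 0.5), +1)]
total, reg, emp = csvc_objective((1.0, 1.0), 2.0, C=1.0, samples=samples)
```

The last sample contributes a positive hinge loss despite its correct classification, which is precisely the first interpretability issue discussed next.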
In financial practice, the use of a machine learning model is required to strike a balance between accuracy and interpretability. In this sense, the interpretability of the parameter(s) can be of importance. Compared to HSVC (4), however, the interpretation of (6) is vague in the following sense:
1. First of all, the empirical risk part of (7) gauges the degree of inseparability, which is defined not only on the samples satisfying $y_i(w^\top x_i - b) < 0$, but also on those satisfying $0 \leq y_i(w^\top x_i - b) < 1$. Note that the samples $x_i$ satisfying $0 < y_i(w^\top x_i - b) \leq 1$ are penalized despite their correct classification. Besides, it is not easy to understand the implication of the value "1".
2. In addition, the parameter $C > 0$ has to be selected a priori, but its interpretation is not clear. In practice, its value is determined by data-driven methods such as cross validation.
(See Bennett and Bredensteiner 2000 for its geometric interpretation.)

Another remedy for the linearly inseparable case is known as ν-SVC (Schölkopf et al. 2000), which is formulated by
$$\begin{array}{ll} \displaystyle\min_{w,b,z,\rho} & \frac{1}{2}\|w\|_2^2 - \rho + \dfrac{1}{\nu m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i - \rho \geq 0, \quad z_i \geq 0, \quad i = 1,\dots,m, \end{array} \qquad (8)$$
where $\nu \in (0, 1]$ is a user-defined parameter. It is known that $\rho \geq 0$ is satisfied at optimality. ν-SVC (8) can be viewed as a structural risk minimization of the form:
$$\min_{w,b}\;\; \frac{1}{2}\|w\|_2^2 + \min_{\rho}\left\{ -\rho + \frac{1}{\nu m}\sum_{i=1}^m \max\{-y_i(w^\top x_i - b) + \rho,\, 0\} \right\}. \qquad (9)$$
As will be elaborated in the next section, the second term of (9) represents a special case of a risk measure known as the conditional value-at-risk in the finance literature.
A solution to (8) corresponds to that of (6) in the following sense.
Proposition 1 (Schölkopf et al. 2000) Let $(w^*, b^*, \rho^*)$ be a solution to (8) and suppose $\rho^* > 0$. Then $(w^*/\rho^*, b^*/\rho^*)$ is a solution to (6) with the parameter $C = 1/\rho^*$.
Despite the correspondence shown in the above proposition, the interpretability of ν-SVC (8) is superior to that of C-SVC (6).
Theorem 1 (ν-property, Schölkopf et al. 2000) Suppose that the solution $(w^*, b^*, z^*, \rho^*)$ to (8) satisfies $\rho^* > 0$. Then

(i) ν is an upper bound on the value defined by $\frac{1}{m}|\{i : y_i(x_i^\top w^* - b^*) < \rho^*\}|$;
(ii) ν is a lower bound on the fraction of support vectors.
A support vector (SV) of a solution $(w^*, b^*, z^*, \rho^*)$ is defined as a sample $(x_i, y_i)$ for which the KKT complementarity condition $\zeta_i^*\big(y_i(x_i^\top w^* - b^*) + z_i^* - \rho^*\big) = 0$, $\zeta_i^* \geq 0$, $y_i(x_i^\top w^* - b^*) + z_i^* - \rho^* \geq 0$ holds with $\zeta_i^* > 0$, where $\zeta_i^*$ is the optimal dual variable corresponding to the constraint $y_i(x_i^\top w - b) + z_i - \rho \geq 0$. Machine learning theory indicates that a smaller number of SVs leads to better out-of-sample performance; accordingly, the controllability of the number of SVs is a preferable property for users. In contrast, the parameter $C$ in (6) is not easy to interpret.
On the other hand, we can point out that the ν-property is understandable in the context of the CVaR minimization, which will be overviewed in the next section. The fact that ν-SVC can be viewed as a special case of a financial risk measure minimization opens the door to versatile interaction between financial risk measures and machine learning methodologies.
3 Value-at-risk and conditional value-at-risk
Next, let us overview measures of financial risk. In financial risk management, uncertainty is usually described by random variables (e.g., the uncertain loss of a position) on a certain space Ω of elementary events with a probability distribution P. In the following, we assume that a random variable, often denoted by L, represents a loss, i.e., the smaller the value of the random variable, the better. A risk measure r is a functional that maps a random loss to a real value. Throughout the paper we assume that the larger a risk measure, the riskier.
To represent the risk of a loss by a single value, various risk measures have been proposed and examined since Markowitz's introduction of variance or, equivalently, standard deviation. Since variance captures the expected deviation from the expected value of the random variable, it regards both upward and downward deviations from the expected gain as a loss.
Fig. 3 Illustration of a loss distribution and risk measures associated with CVaR
VaR. On the other hand, there has been a strongly supported view that only the lower deviation, or lower tail, of a return distribution is to be avoided, and many measures, including semi-variance (Markowitz 1959) and below-target return (Fishburn 1977), have been proposed for capturing such downside risk.
Value-at-risk (VaR) is another measure of downside risk, defined as the β-quantile of the loss, i.e.,

$$\alpha_\beta[\mathrm{L}] := \min\{\alpha : \mathrm{P}\{\mathrm{L} \leq \alpha\} \geq \beta\},$$

with $\beta \in (0, 1)$. In usual practice, the parameter β is fixed at a value close to 1, say 0.99, so as to measure the virtually-largest loss which may happen with a small probability $1 - \beta$ (see Fig. 3). Although VaR was originally developed in practice in the mid 1990s so that a manager can easily grasp the whole risk he/she owns every day, it has been pointed out that VaR has drawbacks as a risk measure. Indeed, it passes over the impact of losses larger than the quantile, and it lacks subadditivity.
CVaR. Conditional value-at-risk (CVaR) has been proposed by many authors and is now one of the most promising downside risk measures. Given a random loss L, its CVaR is defined by

$$\phi_\beta[\mathrm{L}] := \min_{\alpha}\left\{ \alpha + \frac{1}{1-\beta}\mathrm{E}[\max\{\mathrm{L} - \alpha,\, 0\}] \right\}, \qquad (10)$$

where $\mathrm{E}[\cdot]$ is the mathematical expectation with respect to P and $\beta \in (0, 1)$ is a user-defined parameter. CVaR is also known as tail value-at-risk (TVaR) or expected shortfall (ES).
Although the definition (10) may seem slightly complicated, CVaR is virtually equal to the conditional expectation $\mathrm{E}[\mathrm{L} \mid \mathrm{L} \geq \alpha_\beta[\mathrm{L}]]$ or $\mathrm{E}[\mathrm{L} \mid \mathrm{L} > \alpha_\beta[\mathrm{L}]]$ in the sense that we have

$$\mathrm{E}[\mathrm{L} \mid \mathrm{L} \geq \alpha_\beta[\mathrm{L}]] \;\leq\; \phi_\beta[\mathrm{L}] \;\leq\; \mathrm{E}[\mathrm{L} \mid \mathrm{L} > \alpha_\beta[\mathrm{L}]].$$
See Proposition 5 of Rockafellar and Uryasev (2002) for the details. Besides, an optimal α in the definition of $\phi_\beta[\mathrm{L}]$ is located in the closed interval $[\alpha_\beta[\mathrm{L}], \alpha_\beta^+[\mathrm{L}]]$, where $\alpha_\beta^+[\mathrm{L}] := \inf\{\alpha : \mathrm{P}\{\mathrm{L} \leq \alpha\} > \beta\}$; accordingly, it virtually provides the VaR of L.
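For a finite distribution, both $\alpha_\beta[\mathrm{L}]$ and $\phi_\beta[\mathrm{L}]$ can be computed exactly in a few lines, since an optimal α in (10) can be taken among the realized loss values (the kinks of the piecewise-linear objective). The toy numbers below also preview the two limiting cases discussed next:

```python
def var(losses, probs, beta, tol=1e-12):
    """Empirical VaR: min{ alpha : P{L <= alpha} >= beta }."""
    for a in sorted(set(losses)):
        if sum(p for l, p in zip(losses, probs) if l <= a) >= beta - tol:
            return a

def cvar(losses, probs, beta):
    """Empirical CVaR via (10); the optimum is attained at some loss value."""
    return min(a + sum(p * max(l - a, 0.0) for l, p in zip(losses, probs))
               / (1.0 - beta)
               for a in set(losses))

L = [1.0, 2.0, 3.0, 4.0]   # equiprobable toy losses
p = [0.25] * 4

mean_loss = cvar(L, p, 0.0)  # beta = 0: expected loss (finite case)
max_loss = cvar(L, p, 0.9)   # beta > 1 - min_i p_i: maximum loss
mid_cvar = cvar(L, p, 0.5)   # expectation over the worst half
```

Here `mean_loss` is 2.5, `max_loss` is 4.0 and `mid_cvar` is 3.5 (the average of the two worst losses), while `var(L, p, 0.5)` returns 2.0, consistent with the sandwich inequality above.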
In particular, if L is a random variable on a finite sample space, i.e., $|\Omega| = m\,(<\infty)$, then (10) is well-defined at $\beta = 0$, and we have $\phi_0[\mathrm{L}] = \mathrm{E}[\mathrm{L}] = \sum_{i=1}^m p_i L_i$, where $L_i := \mathrm{L}(\omega_i)$ and $p_i := \mathrm{P}\{\omega = \omega_i\}$, $i = 1,\dots,m$. Besides, for β sufficiently close to 1, we have $\phi_\beta[\mathrm{L}] = \max_i\{L_i\}$. In this sense, CVaR is a generalization of the expected loss and the maximum loss (see Fig. 3).

CVaR has nice properties both in theory and in computation. In terms of expected utility theory, it is known to be consistent with all risk-averse investors in the following sense.
Theorem 2 (Pflug 2000; Ogryczak and Ruszczyński 2002) Let $\mathcal{U}$ be the set of risk averse utility functions, i.e., nondecreasing and concave functions. Then, for each $\beta \in (0, 1)$, we have

$$\mathrm{E}[u(-\mathrm{L}_1)] \geq \mathrm{E}[u(-\mathrm{L}_2)] \;\text{ for all } u \in \mathcal{U} \;\;\Longrightarrow\;\; \phi_\beta[\mathrm{L}_1] \leq \phi_\beta[\mathrm{L}_2].$$
This theorem says that if all risk averse investors agree that $\mathrm{L}_1$ is never inferior to $\mathrm{L}_2$, the preference relation holds equally true of CVaR (for any β). This property is known as consistency with the second order stochastic dominance (SSD) (Pflug 2000; Ogryczak and Ruszczyński 2002). It should be emphasized that the above inequality holds independently of the distribution P. Note that neither variance (or standard deviation) nor mean-variance is consistent with SSD.
CVaR is also known to be a coherent measure of risk (Artzner et al. 1999), whose details will be summarized in the next section. Since both the consistency with SSD and the coherence are distribution-free properties, the use of CVaR is advantageous in situations where the loss cannot be assumed to follow a specific distribution. Moreover, although the use of utility functions in credit scoring problems has been proposed in the literature (e.g., Baourakis et al. 2009; Bugera et al. 2002), the use of SSD-consistent coherent risk measures is more advantageous since specifying an adequate utility function is a difficult task.
CVaR minimization. Once a risk measure $r[\cdot]$ is introduced, we can define the problem of choosing an optimal random variable $\mathrm{L}(\pi^\star)$ from among a set of random variables $\{\mathrm{L}(\pi) : \pi \in \Pi\}$, where Π is a given set of parameters. Here we assume that P is independent of π.
An important example in financial risk management is portfolio selection. For example, let us denote the random rates of return of n investable assets by $\mathrm{R} := (\mathrm{R}_1,\dots,\mathrm{R}_n)^\top$, and let $\pi := (\pi_1,\dots,\pi_n)^\top$ be the investment weight vector. The loss L of a portfolio π is usually defined by the negative portfolio return, $\mathrm{L}(\pi) = -\mathrm{R}^\top\pi$, and an optimal portfolio $\pi^\star$ is obtained via $\min\{r[\mathrm{L}(\pi)] : \pi \in \Pi\}$ for some risk measure $r[\cdot]$ and $\Pi \subset \{\pi \in \mathbb{R}^n : e_n^\top \pi = 1\}$. More specifically, the problem of obtaining a parameter $\pi^\star$ that minimizes the CVaR associated with a loss function $\mathrm{L}(\pi)$ is formulated by
$$\min_{\pi}\;\; \phi_\beta[\mathrm{L}(\pi)] \quad \text{subject to} \quad \pi \in \Pi. \qquad (11)$$
Rockafellar and Uryasev (2002) show that the CVaR minimization (11) results in a convex minimization under a convexity condition that often arises in practice.
Theorem 3 (Rockafellar and Uryasev 2002) If the loss function $\mathrm{L}(\pi)$ is convex in π, then $\phi_\beta[\mathrm{L}(\pi)]$ is convex in π, and the CVaR minimization (11) results in a convex minimization of the form:

$$\min_{\alpha,\pi}\;\; \alpha + \frac{1}{1-\beta}\mathrm{E}[\max\{\mathrm{L}(\pi) - \alpha,\, 0\}] \quad \text{subject to} \quad \pi \in \Pi, \qquad (12)$$

as long as Π is a convex set. In addition, for any optimal solution $(\alpha^\star, \pi^\star)$ to (12), it holds that $\alpha^\star \in [\alpha_\beta[\mathrm{L}(\pi^\star)], \alpha_\beta^+[\mathrm{L}(\pi^\star)]]$.

The last statement says that the CVaR minimization (12) provides, as a byproduct, an approximation of the VaR of the CVaR-minimizing loss distribution.
What is more commonly the case in practice is the minimization of the empirical version of the risk measure. Suppose that m realized samples of a random vector defining the loss (e.g., historical returns $R_i$, $i = 1,\dots,m$, in the aforementioned portfolio selection) are given. By regarding the m realizations as all the elementary events (e.g., $R_i = \mathrm{R}(\omega_i)$, $i = 1,\dots,m$) and defining the associated losses $L_i(\pi) := \mathrm{L}(\omega_i, \pi)$, $i = 1,\dots,m$, the minimization of the empirical CVaR can then be written as
$$\min_{\alpha,\pi}\;\; \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\{L_i(\pi) - \alpha,\, 0\} \quad \text{subject to} \quad \pi \in \Pi, \qquad (13)$$

where $\beta \in [0, 1)$, and $p_i$ is a reference probability satisfying $\sum_{i=1}^m p_i = 1$, $p_i > 0$, $i = 1,\dots,m$. A typical choice of p is the uniform one, i.e., $p_i = 1/m$, $i = 1,\dots,m$.
As Rockafellar and Uryasev (2000) demonstrate, if the loss is given by a linear function of π (e.g., $L_i(\pi) = -R_i^\top \pi$) and the constraint $\pi \in \Pi$ is given by a system of linear inequalities, the minimization (13) results in a linear program (LP).
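The LP arises from the standard linearization of the max terms in (13): introducing auxiliary variables $z_i \geq \max\{L_i(\pi) - \alpha,\, 0\}$ gives

$$\begin{array}{ll} \displaystyle\min_{\alpha,\pi,z} & \alpha + \dfrac{1}{1-\beta}\displaystyle\sum_{i=1}^m p_i z_i \\[2mm] \text{subject to} & z_i \geq -R_i^\top \pi - \alpha, \quad z_i \geq 0, \quad i = 1,\dots,m, \quad \pi \in \Pi, \end{array}$$

whose objective and constraints are all linear whenever Π is polyhedral.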
A relation of ν-SVC to CVaR minimization. It is easy to find a similarity between the CVaR minimization (13) and ν-SVC (8) or (9). Precisely, replacing the variable and parameters as follows:

$$\begin{cases} \rho \to -\alpha, \\ \nu \to 1 - \beta, \\ 1/m \to p_i, \quad (i = 1,\dots,m), \end{cases} \qquad (14)$$
the formulation (8) or (9) can be viewed as a special case of the minimization of the function:

$$\frac{1}{2}\|w\|_2^2 + \min_{\alpha}\left\{ \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\{-y_i(w^\top x_i - b) - \alpha,\, 0\} \right\}.$$
Namely, ν-SVC can be considered as a structural risk minimization where the empirical risk is captured by the CVaR associated with the losses $-y_i(w^\top x_i - b)$ and their probabilities $p_i$, $i = 1,\dots,m$.
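This equivalence is easy to check numerically: under the substitution (14), the ν-SVC empirical-risk term in (9) and the empirical CVaR of the losses $-y_i(w^\top x_i - b)$ coincide. A minimal sketch with illustrative toy data and a fixed hyperplane (both piecewise-linear minimizations are solved exactly over their kink points):

```python
# Toy data and hyperplane (illustrative values, not from the paper).
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((0.5, 0.9), +1)]
w, b = (1.0, 1.0), 2.0
L = [-y * (sum(wj * xj for wj, xj in zip(w, x)) - b) for x, y in samples]

m, nu = len(L), 0.5
beta = 1.0 - nu            # substitution (14): nu -> 1 - beta

# nu-SVC empirical-risk term from (9):
#   min_rho { -rho + sum_i max(L_i + rho, 0) / (nu * m) }
nusvc_term = min(-r + sum(max(l + r, 0.0) for l in L) / (nu * m)
                 for r in (-l for l in L))

# Empirical CVaR of the same losses:
#   min_alpha { alpha + sum_i max(L_i - alpha, 0) / ((1 - beta) * m) }
cvar_term = min(a + sum(max(l - a, 0.0) for l in L) / ((1.0 - beta) * m)
                for a in L)
```

Both terms evaluate to the same number (here the average of the worst half of the margin losses), realizing the identification $\rho \to -\alpha$.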
However, there is still a remaining question: unless the data set is linearly separable, does minimizing $\frac{1}{2}\|w\|_2^2$ still represent margin maximization? In conclusion, there is a gap between the margin maximization (2) and ν-SVC in that (8) or (9) can result in a meaningless solution satisfying $w = 0$ for small ν, whereas (2) has a solution $w \neq 0$. In order to see the source of the gap, Gotoh and Takeda (2005) examine the CVaR-based formulation:
$$\min_{w,b,\alpha}\;\; \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\left\{ -\frac{y_i(w^\top x_i - b)}{\|w\|_2} - \alpha,\; 0 \right\}. \qquad (15)$$
Note that this is the minimization of the CVaR where the negative geometric margin, i.e.,

$$-\frac{y_i(w^\top x_i - b)}{\|w\|_2}, \qquad (16)$$

is employed to define the loss. With the change of variables $(w, b, \alpha) \leftarrow (w/\|w\|_2,\, b/\|w\|_2,\, \alpha)$, (15) is rewritten as
$$\begin{array}{ll} \displaystyle\min_{w,b,\alpha} & \alpha + \dfrac{1}{1-\beta}\displaystyle\sum_{i=1}^m p_i \max\{-y_i(w^\top x_i - b) - \alpha,\, 0\} \\[2mm] \text{subject to} & \|w\|_2 = 1. \end{array} \qquad (17)$$
Note that (17) is equivalent to (15) in the following sense:

– If $(w^*, b^*, \alpha^*)$ is an optimal solution to (17), then $k(w^*, b^*, \alpha^*)$ is optimal to (15) for any $k > 0$;
– If $(w^*, b^*, \alpha^*)$ is an optimal solution to (15), then $(w^*/\|w^*\|_2,\, b^*/\|w^*\|_2,\, \alpha^*)$ is optimal to (17).
Pérez-Cruz et al. (2003) present an extended version of ν-SVC, termed Eν-SVC:

$$\begin{array}{ll} \displaystyle\min_{w,b,z,\rho} & -\rho + \dfrac{1}{\nu m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i - \rho \geq 0, \quad z_i \geq 0, \quad i = 1,\dots,m, \\[1mm] & \|w\|_2^2 = 1. \end{array} \qquad (18)$$
It is easy to see that Eν-SVC (18) is a special case of (17) under the change of variable and parameters as in (14). In this sense, Eν-SVC (18) can be regarded as a CVaR minimization in which the loss of the form (16) is employed. (18) has an advantage over the ordinary ν-SVC (8) in the following sense:

– For small ν, (8) or (9) can result in a meaningless solution satisfying $w = 0$. On the other hand, (18) has optimal solutions for such small ν;
Fig. 4 A bridge via the CVaR minimization between ν-SVM and margin maximization. 'Min.CVaR' indicates the optimal value of (15) or (17) (or, equivalently, Eν-SVC), and is nondecreasing in β (or nonincreasing in ν). For linearly inseparable data, 'Min.CVaR' can become positive at large β (or, equivalently, small ν). In particular, at β sufficiently close to 1 (or ν sufficiently close to 0), (15) or (17) (or Eν-SVC) provides the same hyperplane as the maximum margin criterion (2) does. On the other hand, ν-SVC results in a solution satisfying $w = 0$ in the case where 'Min.CVaR' is positive; accordingly, ν-SVC never provides the same solution as the maximum margin criterion does
– For ν under which (8) or (9) has optimal solutions satisfying $w \neq 0$, (18) has the same optimal solutions.
Besides, Gotoh and Takeda (2005) show that

– If the optimal value of (18) is negative, then the resulting hyperplane is equivalent to that obtained via (8) or (9).
– If the optimal value of (18) is positive, then (8) or (9) results in a meaningless solution satisfying $w = 0$.

Note that the above facts hold also for (15) or (17) after replacing ν and $1/m$ with $1 - \beta$ and $p_i$, respectively.

An additional advantage of the formulations (15), (17) and (18) over ν-SVC (8) is that by adopting a sufficiently large β or small ν, each of them provides the same hyperplane as the maximum margin criterion (2) does.
Proposition 2 If $\beta \in (1 - \min_i p_i,\, 1)$, then (15) and (17) are equivalent to (2) in the sense that both provide the same hyperplane. Equivalently, if $\nu \in (0, 1/m)$, (18) is equivalent to (2).

Note that this proposition implies that (15), (17) and (18) include the margin maximization as a special case.
Figure 4 illustrates the above-mentioned facts. The CVaR minimization (15) or (17)(or equivalently Eν-SVC) covers both ν-SVC and the maximum margin criterion byvarying the parameter β (or ν, respectively).
A generalization theory. A goal of classification is to predict the labels of unknown samples as accurately as possible, rather than merely to obtain a model that fits the given samples. More precisely, based on m observed samples $\{(x_1, y_1),\dots,(x_m, y_m)\}$ of a random
vector $(\mathrm{X}, \mathrm{Y})$, an estimate $(w, b)$ is sought so that $\mathrm{Y}(w^\top \mathrm{X} - b)$ would be probabilistically large. A branch of machine learning theory, known as generalization theory, provides a nonparametric bound on the associated probability. For example, by following the proof of Takeda et al. (2010) for ν-SVR, we achieve the following theoretical generalization bounds for ν-SVC and the extended ν-SVC (18) by using the empirical VaR:

$$\alpha_\beta^e(w, b) := \min\left\{ \alpha : \frac{1}{m}\big|\{i : L_i(w, b) \leq \alpha\}\big| \geq \beta \right\},$$
with $L_i(w, b) := -y_i(w^\top x_i - b)$, or the empirical CVaR:

$$\phi_\beta^e(w, b) := \min_{\alpha}\left\{ \alpha + \frac{1}{(1-\beta)m}\sum_{i=1}^m \max\{L_i(w, b) - \alpha,\, 0\} \right\}.$$
Theorem 4 Let θ be a threshold for the loss. Suppose that the random vector $(\mathrm{X}, \mathrm{Y})$ has a bounded support in the sense that X lies in a ball of radius R centered at the origin, and that the m samples $(x_i, y_i)$ are independently drawn from $(\mathrm{X}, \mathrm{Y})$. Then, for any $w$ satisfying $\alpha_\beta^e(w, b) < \theta$ and $\|w\|_2^2 = 1$, the probability of the loss $\mathrm{L}(w, b) := -\mathrm{Y}(w^\top \mathrm{X} - b)$ being greater than θ, $\mathcal{P}\{\mathrm{L}(w, b) > \theta\}$, is bounded above as

$$\mathcal{P}\{\mathrm{L}(w, b) > \theta\} \;\leq\; 1 - \beta + G\big(\alpha_\beta^e(w, b) - \theta\big),$$

with probability at least $1 - \delta$, where

$$G(\gamma) := \sqrt{\frac{2}{m}\left( \frac{4c^2}{\gamma^2}(R^2 + 2)(R^2 + \theta^2 + 1)\log_2(2m) - 1 + \log\frac{2}{\delta} \right)},$$
and c is a constant.
Note that we here use $\mathcal{P}$ instead of P because we have to distinguish the unknown probability distribution $\mathcal{P}$ on a sample space $\bar{\Omega} \supset \Omega$ from the known probability P on Ω, which is dealt with everywhere else in this article. More precisely, we here take $p_i = 1/m$, $i = 1,\dots,m$, as P.
Corollary 1 Suppose the same assumptions as in Theorem 4. Then, for any $w$ satisfying $\phi_\beta^e(w, b) < \theta$ and $\|w\|_2^2 = 1$, the probability of the loss $\mathrm{L}(w, b)$ being greater than θ, $\mathcal{P}\{\mathrm{L}(w, b) > \theta\}$, is bounded above as

$$\mathcal{P}\{\mathrm{L}(w, b) > \theta\} \;\leq\; 1 - \beta + G\big(\phi_\beta^e(w, b) - \theta\big),$$

with probability at least $1 - \delta$.
These propositions indicate that the minimization of the empirical VaR or of the empirical CVaR, coupled with a constraint on the regularization term, leads to better out-of-sample performance of the prediction. This fact has been utilized for portfolio optimization in Gotoh and Takeda (2011, 2012).
It is interesting that VaR provides a tighter bound than CVaR does (since $\alpha_\beta^e(w, b) \leq \phi_\beta^e(w, b)$). However, we do not further study VaR-based classification in this article since empirical VaR minimization is far less computationally tractable than CVaR minimization.
4 Coherent risk-based classification
It is natural to extend the CVaR-based classification (15) to classifications based on other risk measures.
4.1 Fundamental properties of coherent risk measures
First of all, we overview the basic properties of the coherent measures of risk. In order to avoid unnecessary technicalities, we limit the following discussion to distributions on a finite sample space $\Omega = \{\omega_1,\dots,\omega_m\}$.
Definition 1 (Coherent measure of risk, Artzner et al. 1999) A risk measure r is said to be coherent if it satisfies the following axioms:

1. (monotonicity) $\mathrm{L}_1 \leq \mathrm{L}_2$ (i.e., $\mathrm{L}_1(\omega) \leq \mathrm{L}_2(\omega)$ for all $\omega \in \Omega$) $\Rightarrow$ $r[\mathrm{L}_1] \leq r[\mathrm{L}_2]$;
2. (translation invariance) $r[\mathrm{L} + a] = r[\mathrm{L}] + a$ for all L and $a \in \mathbb{R}$;
3. (positive homogeneity) $r[a\mathrm{L}] = a\,r[\mathrm{L}]$ for all $a \geq 0$;
4. (subadditivity) $r[\mathrm{L}_1] + r[\mathrm{L}_2] \geq r[\mathrm{L}_1 + \mathrm{L}_2]$ for all $\mathrm{L}_1, \mathrm{L}_2$.
CVaR is known to be a coherent measure (Pflug 2000; Rockafellar and Uryasev 2002). On the other hand, VaR is not coherent since it does not satisfy subadditivity, although it satisfies axioms 1 through 3. This fact indicates the nonconvexity of the VaR-minimizing classification even when the loss is given as a linear function of $(w, b)$.
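Axiom 4 is exactly where VaR fails. The following sketch, built on a standard-style two-loan default example (toy numbers, not from the paper), exhibits an empirical VaR that violates subadditivity while the empirical CVaR respects it:

```python
def var(losses, probs, beta, tol=1e-12):
    """Empirical VaR: min{ alpha : P{L <= alpha} >= beta }."""
    for a in sorted(set(losses)):
        if sum(p for l, p in zip(losses, probs) if l <= a) >= beta - tol:
            return a

def cvar(losses, probs, beta):
    """Empirical CVaR via its minimization formula (10)."""
    return min(a + sum(p * max(l - a, 0.0) for l, p in zip(losses, probs))
               / (1.0 - beta)
               for a in set(losses))

# Joint space of two independent loans, each losing 1 with probability 0.05.
omegas = [(l1, l2, p1 * p2)
          for l1, p1 in [(0.0, 0.95), (1.0, 0.05)]
          for l2, p2 in [(0.0, 0.95), (1.0, 0.05)]]
L1 = [o[0] for o in omegas]
L2 = [o[1] for o in omegas]
Lsum = [o[0] + o[1] for o in omegas]
p = [o[2] for o in omegas]

beta = 0.95
# VaR: pooling the two loans *increases* the measured risk.
var_gap = var(Lsum, p, beta) - (var(L1, p, beta) + var(L2, p, beta))
# CVaR: subadditivity holds.
cvar_gap = cvar(Lsum, p, beta) - (cvar(L1, p, beta) + cvar(L2, p, beta))
```

Here the 0.95-VaR of each loan is 0 while that of the pooled position is 1, so diversification appears to increase VaR; the corresponding CVaR values (approximately 1, 1 and 1.05) satisfy axiom 4.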
The following two are the simplest examples of coherent risk measures:

$$r[\mathrm{L}] = \mathrm{E}[\mathrm{L}] = \sum_{i=1}^m p_i L_i \;:\; \text{expected loss},$$
$$r[\mathrm{L}] = \max\{L_1,\dots,L_m\} \;:\; \text{maximum loss}.$$

We should recall that CVaR can be viewed as a generalization of the expected loss and the maximum loss, i.e., CVaR includes both of them as two special cases: $\beta = 0$ gives the expected loss while $\beta > 1 - \min_{i=1,\dots,m} p_i$ gives the maximum loss.
Another risk measure of interest is the so-called mean absolute semi-deviation(MASD):
r[L] = E[L] + λ E[max{L − E[L], 0}]
     = ∑_{i=1}^m p_i L_i + λ ∑_{i=1}^m p_i max{ L_i − ∑_{h=1}^m p_h L_h, 0 },
where λ ≥ 0. MASD is coherent for λ ∈ [0, 1], but it represents a downside risk measure even for λ > 1; see Fisher (2001) for the coherence of MASD. Obviously, MASD reduces to the expected loss at λ = 0. The minimization of MASD can be viewed as the MAD (mean-absolute deviation) model (Konno and Yamazaki 1991), which minimizes
E[L] + λ E[|L − E[L]|] with λ ∈ [0, 1/2] : MAD
because the absolute semi-deviation E[max{L − E[L], 0}] is equal to half of the absolute deviation E[|L − E[L]|]. It is noteworthy that MASD with λ ∈ [0, 1] is an SSD-consistent risk measure (Ogryczak and Ruszczyński 1999), as is CVaR.
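The identity behind this reduction, E[max{L − E[L], 0}] = (1/2) E[|L − E[L]|], follows because deviations above the mean sum to the same amount as deviations below it; a quick numerical check on arbitrary toy losses:

```python
losses = [2.0, -1.0, 5.0, 0.0, 4.0]   # arbitrary toy losses, uniform probabilities
m = len(losses)
mean = sum(losses) / m
semi_dev = sum(max(l - mean, 0.0) for l in losses) / m   # E[max{L - E[L], 0}]
abs_dev = sum(abs(l - mean) for l in losses) / m         # E[|L - E[L]|]
assert abs(semi_dev - 0.5 * abs_dev) < 1e-12             # semi-deviation is half
```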
For other SSD-consistent coherent risk measures, see Krokhmal (2007), in which higher-moment coherent risk measures and an algorithm for solving the associated nonlinear conic programs are studied. Other coherent risk measures are studied by Chen and Wang (2008) and Delbaen (2002), for example.
Any coherent measure of risk is known to have an explicit representation as follows.
Theorem 5 (Representation theorem for coherent measures of risk; e.g., Artzner et al. 1999) A risk measure r is a coherent measure of risk if and only if there exists a set Q ⊂ Δ_m such that

r[L] = sup_{q∈Q} ∑_{i=1}^m q_i L_i.   (19)
Noting that sup{∑_{i=1}^m q_i L_i : q ∈ Q} = max{∑_{i=1}^m q_i L_i : q ∈ Q̄}, where Q̄ is the convex hull of Q, Theorem 5 indicates that any coherent measure can be characterized by a compact convex set Q ⊂ Δ_m. For example, the expected loss is characterized by a single point Q = {p}, whereas the maximum loss is characterized by the whole space of probability distributions Q = Δ_m. In this sense, the expected loss and the maximum loss are extreme cases. In particular, note that the expected loss with Q = {p} attains the smallest value among the coherent measures whose Q contains p, and that the maximum loss attains the largest value among all coherent measures.
CVaR has the dual representation with Q given by
Q_CVaR := { q ∈ R^m : e_m^⊤ q = 1, 0 ≤ q ≤ p/(1 − β) } for β ∈ [0, 1).   (20)
It is not hard to see that Q_CVaR is monotonically increasing in β, and that Q_CVaR contains the point p and is contained in Δ_m for any β ∈ [0, 1). On the other hand, MASD is characterized by
Q_MASD = { q ∈ R^m : q = p + u − (e_m^⊤ u) p, 0 ≤ u ≤ λ p } for λ ∈ [0, 1].
Q_MASD coincides with {p} when λ = 0, and is monotonically increasing in λ. In addition to the risk measures mentioned above, we can define a coherent risk measure by employing any closed convex set Q in Δ_m. For example, consider the set defined by
Q_En = { q ∈ Δ_m^+ : ∑_{i=1}^m q_i ln(q_i/p_i) ≤ C } for p ∈ Δ_m^+, C > 0,
i.e., the collection of probability distributions whose entropy relative to a reference probability p ∈ Δ_m^+ is at most C. This set also characterizes a coherent risk measure owing to the closed convexity of Q_En. In this manner, we can create an infinite number of coherent measures.
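For Q_CVaR, the maximization in (19) is a continuous-knapsack LP and can be solved greedily by loading probability mass onto the largest losses up to the caps p_i/(1 − β). The sketch below, on toy data, checks that this dual value coincides with the primal Rockafellar-Uryasev value:

```python
def cvar_dual(losses, probs, beta):
    """Maximize sum q_i l_i over Q_CVaR = {q : sum q = 1, 0 <= q <= p/(1-beta)}:
    a continuous knapsack, filled greedily from the largest loss down."""
    caps = [p / (1.0 - beta) for p in probs]
    budget, value = 1.0, 0.0
    for i in sorted(range(len(losses)), key=lambda i: losses[i], reverse=True):
        take = min(caps[i], budget)
        value += take * losses[i]
        budget -= take
        if budget <= 0.0:
            break
    return value

losses, probs, beta = [1.0, 2.0, 3.0, 4.0], [0.25] * 4, 0.5
# Primal CVaR (Rockafellar-Uryasev), scanning the breakpoints alpha in losses:
primal = min(a + sum(p * max(l - a, 0.0) for p, l in zip(probs, losses))
             / (1.0 - beta) for a in losses)
assert abs(cvar_dual(losses, probs, beta) - primal) < 1e-9   # both equal 3.5
```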
A distributionally robust extension. An important subclass of the dual representation (19) is given by a closed convex set defined with a reference probability p, denoted by Q(p). Obviously, CVaR and MASD belong to this case since Q_CVaR and Q_MASD contain p for any β ∈ [0, 1) and λ ≥ 0, respectively. The uniform probability e_m/m is usually adopted as the reference probability p, but such a choice of p is subject to uncertainty. For example, there may be a situation where several candidates p^1, . . . , p^K for p are possible. To cope with the uncertainty, the robust optimization approach usually seeks the best response to the worst case. Namely, instead of Q(p) with a single p, it employs as Q the union of Q(p) over p ∈ P, where P is a given set of uncertain reference probabilities, i.e.,
Q(P) := ∪_{p∈P} Q(p), where P is any set satisfying P ⊂ Δ_m.
The union Q(P) can be a nonconvex set, and the maximization in the dual representation (19) can then result in a nonconvex optimization. However, employing the convex hull of Q(P) in the dual representation provides a coherent risk measure, i.e.,
max_{p∈P} max_{q∈Q(p)} ∑_{i=1}^m q_i L_i = max { ∑_{i=1}^m q_i L_i : q ∈ ∪_{p∈P} Q(p) }.
In this sense, the robust version of a coherent measure of risk is itself another coherent measure.
This type of robust optimization modeling is called distributionally robust optimization. SVMs have been connected to robust modelings [e.g., Ch. 12 of Ben-Tal et al. (2009), Ch. 5 of Xanthopoulos et al. (2013) and Caramanis et al. (2012)], but much of the existing research deals with the uncertainty in measurement of the given data x_i and relates the modeling to the regularization term ‖w‖. In contrast, the distributionally robust modeling addresses the uncertainty of the reference probability p. Recently, Wang (2012) has presented a distributionally robust version of the CVaR-based classification formulation, in which P consists of several candidates for p.
In the context of financial portfolio selection, Zhu and Fukushima (2009) examine a distributionally robust CVaR minimization and show that if the uncertainty set P is given by either of the following:
P = { p ∈ R^m : p = p̄ + η, e_m^⊤ η = 0, η^L ≤ η ≤ η^U } for p̄ ∈ Δ_m, η^L, η^U ∈ R^m,
P = { p ∈ R^m : p = p̄ + Aη ≥ 0, e_m^⊤ Aη = 0, η^⊤ η ≤ 1 } for p̄ ∈ Δ_m, A ∈ R^{m×m},
the resulting portfolio optimization leads to a tractable convex optimization. We can follow the same line as theirs in the context of machine learning presented in the next subsection.
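With a finite candidate set P = {p^1, . . . , p^K}, the distributionally robust CVaR is simply the worst of the individual CVaRs. A minimal sketch with two illustrative reference probabilities:

```python
def cvar(losses, probs, beta):
    # Rockafellar-Uryasev formula; the minimum over alpha is attained
    # at one of the observed losses.
    return min(a + sum(p * max(l - a, 0.0) for p, l in zip(probs, losses))
               / (1.0 - beta) for a in losses)

losses = [1.0, 2.0, 3.0, 4.0]
candidates = [[0.25, 0.25, 0.25, 0.25],   # uniform reference probability
              [0.10, 0.20, 0.30, 0.40]]   # a skewed alternative
# Worst case over the uncertainty set: still a coherent measure, characterized
# by the convex hull of the union of the individual dual sets Q(p).
robust = max(cvar(losses, p, 0.5) for p in candidates)
assert abs(robust - 3.8) < 1e-9   # skewed candidate is the worst case (3.8 > 3.5)
```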
4.2 Formulations and solution approaches of coherent risk-based classification
We are now in a position to provide a wide class of classification problems. Although the idea can be easily extended to other types of machine learning methods, we first show the formulation for the two-class classification version.
Along the lines of the discussion in Sect. 3, we employ the negative geometric margin −Y(w^⊤X − b)/‖w‖ as the loss L, and consider the following formulation:
minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b)/‖w‖ : q ∈ Q },   (21)
for a certain closed convex set Q ⊂ Δ_m. Needless to say, the CVaR-based formulation (15) can be represented as (21) with Q = Q_CVaR defined by (20).
Using the Charnes-Cooper transformation, the coherent risk-based classification (21) can be rewritten as
minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b) : q ∈ Q }
subject to ‖w‖ = 1.   (22)
Let us examine the conditions under which (22) attains optimality. It is natural to start with the maximum loss and the expected loss since, as mentioned earlier, they respectively attain the largest and smallest optimal values of (22) in terms of Q.
Theorem 6 (i) If Q = Δ_m, i.e., the maximum loss is employed as the risk measure, (22) has an optimal solution.
(ii) Suppose that Q = {p}, i.e., the expected loss is employed as the risk measure. Then, (22) has an optimal solution if and only if ∑_{i=1}^m p_i y_i = 0.
Proof First of all, recall that the expected loss and the maximum loss are special cases of CVaR with β = 0 and β > 1 − min_i p_i, respectively. The CVaR-based version of (22), i.e., (17), can be rewritten as
minimize_w g(w)
subject to ‖w‖ = 1,   (23)
where
g(w) := minimize_{b,α,z} α + (1/(1−β)) ∑_{i=1}^m p_i z_i
        subject to z_i − y_i b + α ≥ −y_i x_i^⊤ w, i = 1, . . . , m,
                   z ≥ 0.   (24)
Observe that LP (24) is feasible for any β ∈ [0, 1) and p ∈ Δ_m^+, and accordingly, either it has an optimal solution or it is unbounded. Using LP duality, a dual problem to (24) is given by
maximize_λ −∑_{i=1}^m y_i x_i^⊤ w λ_i
subject to ∑_{i=1}^m y_i λ_i = 0, ∑_{i=1}^m λ_i = 1, 0 ≤ λ ≤ p/(1 − β).   (25)
(i) If β > 1 − min_i p_i, (25) always has a feasible solution λ satisfying λ_i = 1/(2m_+) for y_i = 1 and λ_i = 1/(2m_−) for y_i = −1. In addition, the feasible region is bounded, and accordingly, (25) has an optimal solution. By the LP duality theorem, (24) then attains a (finite) optimal value g(w). Since g(w) is continuous and ‖w‖ = 1 constitutes a compact feasible region, (23) has an optimal solution. This implies that the maximum loss minimization has an optimal solution.
(ii) If β = 0, observe that LP (25) is feasible if and only if ∑_{i=1}^m y_i p_i = 0. Consequently, the proof is complete. □
It is noteworthy that this theorem is consistent with Lemma 2.2 of Gotoh and Takeda (2005), in which the CVaR-based problem (with ℓ2-norm) is shown to have a solution for β ∈ [1 − 2 min{∑_{i:y_i=+1} p_i, ∑_{i:y_i=−1} p_i}, 1).
Besides, the following corollary is straightforward from the proof of the second statement.
Corollary 2 Suppose that Q contains a reference probability p which satisfies ∑_{i=1}^m y_i p_i = 0. Then, (22) has an optimal solution.
Conversely, if ∑_{i=1}^m p_i y_i = 0 does not hold, the boundedness depends on the coherent measure used.
On the other hand, the above condition does not ensure the uniqueness of the solution. In fact, it is easy to see that in the case β = 0, any b can be optimal. In such a case, after an optimal solution (w*, b*) is obtained by solving the optimization problem, we can redetermine b (with w* fixed) so that, for example, the largest in-sample classification accuracy is attained.
Next let us consider the case where (22) has an optimal solution. The minimization(22) has a (positively homogeneous) convex objective function in (w, b), but has a
single nonconvex norm constraint. If we cannot deal with the nonconvex constraint directly, it is reasonable to solve the relaxed counterpart given by

minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b) : q ∈ Q }
subject to ‖w‖ ≤ 1.   (26)
Similarly to Gotoh and Takeda (2005), where only CVaR with the ℓ2-norm is treated, we obtain a condition under which a solution to (26) is optimal to (22).
Theorem 7 If the optimal value of (22) is negative, any optimal solution to (26) satisfies w ≠ 0 and is also optimal to (22). On the other hand, if the optimal value of (22) is positive, (26) results in a meaningless solution satisfying w = 0.
Proof For the first statement, it is sufficient to show that an optimal solution (w*, b*) to (26) satisfies ‖w*‖ = 1. Note that the negative optimal value of (22) implies that the optimal value of (26) is also negative. To the contrary, assume that ‖w*‖ < 1. Then we have

max_{q∈Q} { −∑_{i=1}^m q_i y_i ( (w*/‖w*‖)^⊤ x_i − b*/‖w*‖ ) } < max_{q∈Q} { −∑_{i=1}^m q_i y_i (w*^⊤ x_i − b*) } < 0.
This means that the feasible solution (w*/‖w*‖, b*/‖w*‖) attains a smaller objective value than (w*, b*) does, and therefore, ‖w*‖ = 1 must hold.
Next we consider the case where the optimal value of (22) is positive. Since (w, b) = (0, 0) is a feasible solution to (26), (26) has a feasible solution whose objective value is nonpositive. Suppose that there exists such a solution (w′, b′) satisfying w′ ≠ 0. However, this contradicts the existence of an optimal solution to (22) attaining a positive value, since (w′, b′)/‖w′‖ is feasible to (22) and attains a nonpositive value. □

Two-step framework. Owing to the observation in Theorem 7, if we can obtain an optimal solution w* to (26) which satisfies the convex constraint with equality, i.e., ‖w*‖ = 1, the solution is also optimal to (22). If the constraint is not satisfied with equality, there are two possible cases. In the case where 0 < ‖w*‖ < 1 holds, we can obtain an optimal solution to (22) by the scaling (w′, b′) ← (w*, b*)/‖w*‖. Otherwise (i.e., ‖w*‖ = 0), we have to cope with the nonconvexity of (22). This two-step framework is summarized in Algorithm 1.
Algorithm 1 Prototype of the Two-step Framework
  Solve the convex minimization (26), and let (w*, b*) be an optimal solution. [First step]
  if ‖w*‖ = 1 then
    Terminate; (w*, b*) is an optimal solution to (22).
  else if 0 < ‖w*‖ < 1 then
    Terminate; (w*, b*)/‖w*‖ is an optimal solution to (22).
  else
    Solve (22) (in a heuristic manner). [Second step]
  end if
It is possible to employ a (rigorous) global optimization algorithm so as to approach an optimal solution at the second step. However, such algorithms usually spend an impractically large amount of computation time. Therefore, a reasonable choice is to
apply a heuristic algorithm which is expected to attain a near-optimal solution. One reasonable choice is the normalized linearization algorithm employed in Gotoh and Takeda (2005), where it is referred to as a cutting plane algorithm. Gotoh and Takeda (2005) report that the algorithm worked well, and we can follow the same line for the generalized problem (22). However, we introduce below a more computationally tractable strategy, which might be adequate for credit rating.

Incorporation of a priori information of experts and use of the ℓ1-norm regularization. When we apply the classification methods to credit scoring problems, we often impose (expected) sign conditions on the weights w. For example, the creditworthiness of a loan applicant should be increasing in his/her income, and that condition is achieved by imposing the constraint that the coefficient of the attribute 'income' be nonnegative. Accordingly, suppose that the signs of all attributes are arranged in advance so that they are nonnegatively related to the creditworthiness. It is then reasonable to impose the nonnegativity condition w ≥ 0.
In addition, we employ the ℓ1-norm for the norm constraint of (22). Note that this ℓ1-norm regularization corresponds to the case where the ℓ∞-norm is employed in measuring the geometric margin, as mentioned in Sect. 2.
When we impose the nonnegativity and the ℓ1-norm constraint simultaneously, the norm constraint ‖w‖ = 1 of (22) can be rewritten as the linear constraints e_m^⊤ w = 1, w ≥ 0. For example, the CVaR- and MASD-based classifications with this technique are, respectively, formulated as the LPs:
minimize_{w,b,α,z} α + (1/(1−β)) p^⊤ z
subject to z_i ≥ −y_i (w^⊤x_i − b) − α, i = 1, . . . , m,
           z ≥ 0, e_m^⊤ w = 1, w ≥ 0;   (27)

minimize_{w,b,z} −∑_{i=1}^m p_i y_i (w^⊤x_i − b) + λ ∑_{i=1}^m p_i z_i
subject to z_i ≥ −y_i (w^⊤x_i − b) + ∑_{h=1}^m p_h y_h (w^⊤x_h − b), i = 1, . . . , m,
           z ≥ 0, e_m^⊤ w = 1, w ≥ 0.   (28)
We call this strategy the nonnegative ℓ1-regularization. The advantages of employing this strategy are that (1) there is no need to treat the nonconvexity even when the optimal value is positive, and (2) users' knowledge on the expected signs can be incorporated. Besides, the use of the ℓ1-norm regularization is often recommended in the context of machine learning since it can lead to a sparse solution, i.e., a solution with many zero elements, and this function can be considered a form of variable selection. The nonnegativity is expected to further promote the sparsity. Accordingly, the above formulations, (27) and (28), have the advantage that model estimation and variable selection can be achieved at the same time via a single LP.
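As a concrete illustration, LP (27) can be passed to any LP solver. The sketch below uses scipy.optimize.linprog on a tiny synthetic two-class data set; the data, β = 0.5, and the uniform reference probability are illustrative assumptions, not the paper's experimental setting.

```python
import numpy as np
from scipy.optimize import linprog

def cvar_classify(X, y, beta, p=None):
    """Nonnegative l1-regularized CVaR classification, LP (27).
    Variable order: [w (n), b, alpha, z (m)]."""
    m, n = X.shape
    p = np.full(m, 1.0 / m) if p is None else np.asarray(p)
    c = np.concatenate([np.zeros(n + 1), [1.0], p / (1.0 - beta)])
    # z_i >= -y_i (w'x_i - b) - alpha  <=>  -y_i x_i'w + y_i b - alpha - z_i <= 0
    A_ub = np.hstack([-y[:, None] * X, y[:, None],
                      -np.ones((m, 1)), -np.eye(m)])
    A_eq = np.concatenate([np.ones(n), np.zeros(m + 2)])[None, :]   # e'w = 1
    bounds = [(0, None)] * n + [(None, None)] * 2 + [(0, None)] * m  # w, z >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n], res

X = np.array([[2.0, 0.5], [1.5, 1.0], [-1.0, -0.5], [-2.0, -1.5]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, res = cvar_classify(X, y, beta=0.5)
# w lies on the simplex and separates this (separable) toy data set.
```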
5 Extensions to other machine learning methods
In the preceding sections, we have seen that financial risk measures can be applied to two-class classification problems. The idea can be straightforwardly extended to other types of statistical methods which are treated in the framework of SVMs.
5.1 Ordered multi-class classification
One useful and straightforward extension of the two-class problem is the ordered multi-class classification, where more than two classes are treated. It is applicable to practical credit rating problems, including rating bonds or their issuers (e.g., companies and governments) and scoring loan applicants.
In order to extend the modeling used so far, let us first modify the notation as follows. Let C := {1, . . . , K} be the set of ordered labels of K classes. Without loss of generality, we suppose that class K is the most creditworthy and that the creditworthiness decreases as the class label number decreases. The m data samples are supposed to belong to one of the classes, and we let k_i denote the class label of sample i.
Our task is simply described as follows: given a data set of m labeled samples (x_i, k_i), i = 1, . . . , m, we construct K − 1 parallel hyperplanes that separate the K classes as clearly as possible.
To simplify the notation employed below, let us define
y_{i,κ} := { +1 (k_i ≥ κ + 1); −1 (k_i ≤ κ) },  i = 1, . . . , m, κ ∈ C \ {K}.
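The binary labels y_{i,κ} can be generated mechanically from the class labels; a small illustrative sketch (K = 4 chosen arbitrarily):

```python
def make_labels(k, K):
    # y[i][kappa-1] = +1 if k_i >= kappa + 1, else -1, for kappa = 1, ..., K-1
    return [[1 if ki >= kappa + 1 else -1 for kappa in range(1, K)] for ki in k]

labels = make_labels([1, 2, 4, 3], K=4)
# e.g., a class-3 sample lies above hyperplanes 1 and 2 but below hyperplane 3:
assert labels[3] == [1, 1, -1]
```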
The ordered multi-class version of linear separability can then be introduced as follows: if there exists (w, b_1, . . . , b_{K−1}) satisfying
k_i ≥ κ + 1, i = 1, . . . , m, κ ∈ C \ {K} ⇒ w^⊤x_i > b_κ,
k_i ≤ κ,    i = 1, . . . , m, κ ∈ C \ {K} ⇒ w^⊤x_i < b_κ,
we say that the labeled data set {(x_i, k_i) : i = 1, . . . , m} is linearly separable. As in the two-class case, a loss associated with the geometric margin can be defined by
−y_{i,κ} (w^⊤x_i − b_κ)/‖w‖,
with an arbitrary norm ‖ · ‖ (see Fig. 5), and the coherent risk-based minimization is formulated as
minimize_{w,b_1,...,b_{K−1}} max { −∑_{i=1}^m ∑_{κ∈C\{K}} q_{i,κ} y_{i,κ} (w^⊤x_i − b_κ)/‖w‖ : q ∈ Q }.   (29)
Similarly to the two-class case, (29) can be rewritten as
minimize_{w,b_1,...,b_{K−1}} max { −∑_{i=1}^m ∑_{κ∈C\{K}} q_{i,κ} y_{i,κ} (w^⊤x_i − b_κ) : q ∈ Q }
subject to ‖w‖ = 1.
Fig. 5 Geometric margin of the ordered multi-class problem. With the setting in this subsection, every combination of a sample x_i and a hyperplane w^⊤x = b_κ is taken into account to define the loss. If a sample is correctly classified, its geometric margins to the K − 1 hyperplanes are all positive and, accordingly, its losses are all negative. On the other hand, if a sample is misclassified in terms of a hyperplane, the corresponding loss is positive. Therefore, if a point is misclassified in terms of two hyperplanes, as is the sample x_i located at the right-hand side of the figure, two positive losses are assigned to the point. In this figure, arrows are drawn only for the misclassified samples, and their lengths indicate the size of the positive losses, while negative losses are omitted
In particular, the CVaR-based formulation is given by
minimize_{w,b,α,z} α + (1/(1−β)) ∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} z_{i,κ}
subject to z_{i,κ} ≥ −y_{i,κ} (w^⊤x_i − b_κ) − α, i = 1, . . . , m, κ ∈ C \ {K},
           z ≥ 0, ‖w‖ = 1,   (30)
where β ∈ [0, 1), whereas the MASD-based formulation is given by
minimize_{w,b,z} −∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} y_{i,κ} (w^⊤x_i − b_κ) + λ ∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} z_{i,κ}
subject to z_{i,κ} ≥ −y_{i,κ} (w^⊤x_i − b_κ) + ∑_{h=1}^m ∑_{k∈C\{K}} p_{h,k} y_{h,k} (w^⊤x_h − b_k),
           i = 1, . . . , m, κ ∈ C \ {K},
           z ≥ 0, ‖w‖ = 1,   (31)
where λ ≥ 0.

As in the two-class problems (27) and (28), the smallest usable β for (30) and λ for (31) depend on the setting of p. It is not hard to see that if p satisfies ∑_{i=1}^m p_{i,κ} y_{i,κ} = 0 for all κ = 1, . . . , K − 1, then (30) and (31) attain optimal solutions at β = 0 and λ = 0, respectively. This condition does not hold in general. However, it is satisfied, for example, in the case where we employ the
uniform probability p_{i,κ} = 1/(m(K − 1)) and each class has the same population. Even in such a case, we can find the minimum β ∈ [0, 1) and λ ≥ 0 under which (30) and (31), respectively, have optimal solutions.
For solving the above optimization problems, the same regularization strategies can be adopted as in the two-class version. In the next section, we present numerical examples where the ordered multi-class classification is applied to credit rating problems by exploiting the nonnegative ℓ1-regularization.
5.2 Further extensions
Non-ordered multi-class classification. Even when the order of classes is not defined, the coherent risk-based classification can be applied in a couple of ways. A simple way of achieving a non-ordered K-class classification is to repeat K applications of the two-class classification, each defining a hyperplane w_κ^⊤x = b_κ which divides one class, say κ, from the other K − 1 classes (Vapnik 1995). Estimating K hyperplanes (i.e., w_κ^⊤x = b_κ, κ ∈ C) by assigning the label y_i = +1 to class κ and y_i = −1 to the others, the class-assigning rule is given by
f(x) := arg max_κ { w_κ^⊤x − b_κ : κ ∈ C }.
This framework can be built on the basis of the two-class problem discussed in the preceding sections, and the coherent risk-based version can be easily achieved. On the other hand, it requires solving K optimization problems. This contrasts with the single-optimization method posed in Sect. 5.1.
A single-optimization method can also be formulated for the non-ordered case as inthe ordered case above. Indeed, Bennett and Mangasarian (1993) define the piecewise-linear separability as follows: there exist {(wκ , bκ ) : κ ∈ C} such that
w_{k_i}^⊤ x_i − b_{k_i} > w_κ^⊤ x_i − b_κ, ∀i = 1, . . . , m; ∀κ ≠ k_i.
Based on this separability, the geometric margin can be naturally extended by
((w_{k_i} − w_κ)^⊤ x_i − (b_{k_i} − b_κ))/‖w_{k_i} − w_κ‖, for i = 1, . . . , m; κ ≠ k_i.
In this case, the coherent risk-based classification is formulated by
minimize_{(w_κ),(b_κ)} max { −∑_{i=1}^m ∑_{κ≠k_i} q_{i,κ} ((w_{k_i} − w_κ)^⊤ x_i − (b_{k_i} − b_κ))/‖w_{k_i} − w_κ‖ : q ∈ Q }.
Separation by nonlinear functions. Limiting our attention to the similarity between formulations of financial optimization and machine learning, we have so far dealt only with
linear classification, where the underlying classification hyperplane is given by a linear function of the attributes. Nevertheless, it is not hard to extend the discussion above to nonlinear cases.
It is known that SVMs have an advantage in treating highly nonlinear classification without increasing the complexity of the algorithm by employing the so-called kernel trick (see, e.g., Vapnik 1995; Schölkopf and Smola 2002). That technique can also be applied to some coherent risk-based modelings as long as the ℓ2-norm is employed. It is noteworthy that even if another norm is used, we can deal with the kernel-based nonlinearity along the lines of Mangasarian (2000).
The second way is more straightforward. When a practitioner builds a classification model for credit rating in practice, he/she is likely to adopt a specific nonlinearity. For example, he/she may think that the quadratics of two certain attributes should be incorporated, or that the log-transformation should be applied to some of the attributes. In such a case, he/she can explicitly incorporate the nonlinearity by adding the nonlinearly transformed variables (e.g., x_i x_j, ln x_j) as attributes and treating a linear model. Another advantage of this strategy is that we can add further conditions on the nonlinear functions. Konno and Kobayashi (2000), for example, impose the positive semidefiniteness condition on the quadratic functions. Similar techniques can be found in utility function approaches (e.g., Baourakis et al. 2009; Bugera et al. 2002).
Employing explicit nonlinearity in the above-mentioned way is preferred when the user has prior knowledge on the underlying model and wants to incorporate it into the model. On the other hand, the kernel method is preferred when the user wants to discover an unknown nonlinear relation between the rating and the attributes.
Regression and outlier detection. So far, we have revisited existing classification methods as financial risk minimizations. A similar interpretation can be obtained for other machine learning methodologies whenever a loss is defined.
Regression is another important machine learning methodology. Its main task is to find a model Y = w_0 + w^⊤X that explains the relation between Y and X. The ordinary least squares method (OLS) is the most famous criterion. In the SVM literature, the regression version is known as support vector regression (SVR), and its development has been parallel to that of SVC. ν-SVR is formulated as
minimize_{w,w_0,ρ,z} (1/2)‖w‖_2^2 + C ( −ρ + (1/(νm)) ∑_{i=1}^m z_i )
subject to z_i ≥ |y_i − w_0 − w^⊤x_i| + ρ, i = 1, . . . , m,
           z_i ≥ 0, i = 1, . . . , m,
where C > 0 and ν ∈ (0, 1] are user-defined parameters (see, e.g., Schölkopf and Smola 2002, Sect. 9.3). It is easy to see that this can be regarded as a regularized CVaR minimization with the loss defined by the absolute residuals |y_i − w^⊤x_i − w_0|, i = 1, . . . , m. Therefore, it can be interpreted as the minimization of the expectation of the largest 100(1 − β) percent of (absolute) residuals.
If we replace the loss by the residual raised to the s-th power, a new version of the CVaR minimization is obtained:
minimize_{α,w,w_0} α + (1/(1−β)) ∑_{i=1}^m p_i max{ |y_i − w^⊤x_i − w_0|^s − α, 0 } + C′‖w‖_t^t,   (32)
where s ∈ [1, ∞) and C′ ≥ 0. It is easy to see that (32) is a generalization of OLS (β = 0, C′ = 0, s = 2), the ridge regression (β = 0, C′ > 0, s = t = 2) and the lasso (β = 0, C′ > 0, s = 2, t = 1) (see, e.g., Hastie et al. 2001). Note that for s, t ≥ 1, (32) is a convex minimization. Its coherent risk-based version is given by
minimize_{w,w_0} max_{q∈Q} ∑_{i=1}^m q_i |y_i − w^⊤x_i − w_0|^s + C′‖w‖_t^t.
Another interesting class of machine learning methods is outlier detection (also known as one-class classification). This methodology seeks to find a small number of outlying samples, x′_1, . . . , x′_k, out of all the samples x_1, . . . , x_m. In order to define the outliers, we can employ various types of loss functions. One candidate for the loss is L_i(w) := −w^⊤x_i/‖w‖. In fact, the one-class ν-SVC (see, e.g., Schölkopf and Smola 2002, Sect. 8.3) is formulated as
minimize_{w,ρ,z} (1/2)‖w‖_2^2 − νρ + (1/m) ∑_{i=1}^m z_i
subject to z_i ≥ −w^⊤x_i + ρ, i = 1, . . . , m,
           z_i ≥ 0, i = 1, . . . , m.
By following the same line as in the two-class case in Sects. 3 and 4, we can see that the one-class ν-SVC is equivalent to a convex counterpart of the following CVaR minimization of the loss L_i(w) = −w^⊤x_i/‖w‖:
minimize_{α,w} α + (1/(1−β)) ∑_{i=1}^m p_i max{ −w^⊤x_i/‖w‖ − α, 0 }.
Its coherent risk-based version can be given by
minimize_w max_{q∈Q} { −∑_{i=1}^m q_i w^⊤x_i/‖w‖ }, or
minimize_w max_{q∈Q} { −∑_{i=1}^m q_i w^⊤x_i } subject to ‖w‖ = 1.
The support vector domain description (SVDD) (Tax and Duin 1999) is another outlier detection method, formulated as
minimize_{ρ,c} −ρ + (1/ν) ∑_{i=1}^m p_i max{ ‖x_i − c‖_2^2 + ρ, 0 },
with p_i = 1/m. It is easy to see that this is the CVaR minimization of the loss L_i(c) := ‖x_i − c‖_2^2. In place of the squared Euclidean distance, another loss can be employed, e.g., L_i(c) := ‖x_i − c‖ or L_i(c) := ‖x_i − c‖_p^p for p ∈ [1, ∞). It is noteworthy that these CVaR minimizations result in convex minimizations for any configuration of the samples x_1, . . . , x_m.
Its coherent risk-based version can be given by
minimize_c max_{q∈Q} ∑_{i=1}^m q_i ‖x_i − c‖, or minimize_c max_{q∈Q} ∑_{i=1}^m q_i ‖x_i − c‖_p^p,

where p ∈ [1, ∞). It is noteworthy that these formulations are both convex minimizations for any norm ‖ · ‖ and any Q ⊂ Δ_m.
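As a simple illustration of loss-based outlier detection, fix the center c (here at the sample mean, an illustrative simplification rather than the SVDD optimum) and flag the samples whose loss ‖x_i − c‖_2^2 falls in the upper (1 − β) tail, i.e., exceeds the empirical VaR:

```python
def flag_outliers(xs, beta):
    m = len(xs)
    c = [sum(x[j] for x in xs) / m for j in range(len(xs[0]))]   # sample mean
    loss = [sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for x in xs]
    var = sorted(loss)[int(beta * m)]          # empirical beta-quantile (VaR)
    return [i for i, l in enumerate(loss) if l > var]

xs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [5.0, 5.0]]  # toy data
assert flag_outliers(xs, beta=0.75) == [4]     # only the far point is flagged
```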
6 Numerical examples
In this section, we demonstrate how the developed methods work in corporate bond rating. In particular, we examine the CVaR- and MASD-based linear models for (i) the two-class classification, (27) and (28), and (ii) the ordered six-class classification, (30) and (31), where the nonnegative ℓ1-regularization is employed so as to reduce all these optimization problems to LPs. The data set consists of financial indices and ratings of bonds of non-financial companies that were listed on the first section of the Tokyo Stock Exchange; the ratings were given by Rating and Investment Information, Inc. (R&I). We used two data sets, named the 2011 data and the 2012 data, including 405 and 393 companies, respectively. In the experiment, the 2011 data was used for tuning the parameters β and λ and for estimating models, whereas the 2012 data was used for testing the estimated models under the tuned parameters.
Table 1 summarizes the numbers of ratings of the companies. Since the number of companies was limited, we grouped them into six classes as shown in the table so as to implement the ordered six-class classification. For the two-class case, we further divided the six classes into the lowest two classes, which are considered speculative, and the other four classes.
Before applying the linear classification models, we preprocessed the attribute data (i.e., financial indices) as follows. First, the signed logarithm transformation was applied to each attribute, i.e., x_j → ln(1 + x_j) if x_j ≥ 0; x_j → −ln(1 − x_j) if x_j < 0. Next, we centered and normalized the transformed data, and truncated outlying samples by applying the following transformation to each normalized attribute x_j: min{x_j, 4} for x_j ≥ 0; max{x_j, −4} for x_j < 0.
In addition, signs of attributes were arranged so that the value of each attributewould be nondecreasing in the class labels, i.e., the larger the attribute, the higher theclass label.
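The preprocessing pipeline described above (signed log transform, standardization, truncation at ±4) can be sketched as follows; the input column is artificial:

```python
import math

def preprocess(col):
    # signed log transform: ln(1 + x) for x >= 0, -ln(1 - x) for x < 0
    logged = [math.log(1 + x) if x >= 0 else -math.log(1 - x) for x in col]
    mean = sum(logged) / len(logged)
    std = (sum((v - mean) ** 2 for v in logged) / len(logged)) ** 0.5
    # center, normalize, and truncate outlying values at +/-4
    return [min(max((v - mean) / std, -4.0), 4.0) for v in logged]

col = preprocess([0.1] * 18 + [-0.2] + [1e6])   # last entry: artificial outlier
assert col[-1] == 4.0                           # outlier truncated at 4
```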
In order to tune the parameter β for the CVaR-based methods, (27) and (30), and λ for the MASD-based methods, (28) and (31), we employed the so-called leave-one-out cross validation (LOO) using the 2011 data. For example, for β of CVaR, we prepared
Table 1 Rating and class separation
The ratings were given at the ends of March 2011 and 2012
twenty-one candidates: 0.01, 0.05, 0.10, . . . , 0.90, 0.95, 0.99. At each β, we repeated the following procedure. Leaving one sample out of the 405 samples for validation, we estimated a classification model using the remaining 404 samples and checked whether the estimated model could predict the class of the validation sample. Repeating this procedure 405 times by alternating the validation samples, the average rate of accurate prediction was computed. Repeating the above procedure for all the candidate βs, we picked the β that attained the best accuracy. As for MASD, the same procedure was adopted, where the parameter λ ≥ 0 was transformed as λ = τ/(1 − τ)
with τ ∈ [0, 1), and the same twenty-one candidates for τ as those for β of the CVaR-based method were examined. Note that MASD with τ > 0.5 does not correspond to a coherent measure, but it is still reasonable to employ it in the classification problem in the sense that MASD with larger τ puts more emphasis on the downside deviation.
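The LOO procedure itself is model-agnostic; the sketch below wires it to a trivial nearest-centroid classifier standing in for the LP-based models (the data and the classifier are illustrative only):

```python
def loo_accuracy(X, y, fit, predict):
    hits = 0
    for i in range(len(X)):                      # leave sample i out
        model = fit(X[:i] + X[i+1:], y[:i] + y[i+1:])
        hits += predict(model, X[i]) == y[i]     # validate on the held-out sample
    return hits / len(X)

def fit(X, y):                                   # class centroids
    return {lab: [sum(c) / len(c) for c in
                  zip(*[x for x, yi in zip(X, y) if yi == lab])]
            for lab in set(y)}

def predict(cents, x):                           # nearest centroid
    return min(cents, key=lambda lab:
               sum((a - b) ** 2 for a, b in zip(x, cents[lab])))

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]]
y = [-1, -1, 1, 1]
acc = loo_accuracy(X, y, fit, predict)           # 1.0 on this toy data
```

In the experiment, the role of `fit` is played by solving the LP at a given candidate β (or τ), and the whole loop is repeated for every candidate value.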
We should remember that the existence of optimal solutions depends also on the reference probability p. It is natural to employ the uniform probability, i.e., p_i = 1/m, for the CVaR- and MASD-based methods, but it may result in an unbounded solution at low βs and λs. As discussed earlier, if a weighted probability satisfying ∑_i y_i p_i = 0 (for the two-class case) or ∑_i y_{i,κ} p_{i,κ} = 0 for κ = 1, . . . , 5 (for the six-class case) is employed, the CVaR- and MASD-based formulations have optimal solutions at any nonnegative β and λ, respectively. To this end, we employed weighted probabilities of the form:
[the two-class case]: p_i = 1/(2m_+) for y_i = +1; p_i = 1/(2m_−) for y_i = −1;
[the K-class case]: p_{i,κ} = 1/(2(K − 1) ∑_{k≥κ+1} m_k) for (i, κ) : k_i ≥ κ + 1; p_{i,κ} = 1/(2(K − 1) ∑_{k≤κ} m_k) for (i, κ) : k_i ≤ κ.
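For instance, the two-class weights can be computed as below; the resulting p satisfies the balance condition ∑_i y_i p_i = 0, which by Corollary 2 guarantees an optimal solution at any β (the label vector is illustrative):

```python
def weighted_probs(y):
    # each class receives total probability 1/2, split uniformly within the class
    m_pos = sum(1 for yi in y if yi == +1)
    m_neg = len(y) - m_pos
    return [1.0 / (2 * m_pos) if yi == +1 else 1.0 / (2 * m_neg) for yi in y]

y = [+1, +1, +1, -1]
p = weighted_probs(y)
assert abs(sum(p) - 1.0) < 1e-12                           # a probability vector
assert abs(sum(yi * pi for yi, pi in zip(y, p))) < 1e-12   # balance condition
```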
Fig. 6 Average accuracy in validation of LOO. 'Exact' implies the rate of exactly predicting the true class label, whereas '±1' additionally includes the wrong but one-class-different predictions. a CVaR with β for the two-class case. b MASD with λ = τ/(1 − τ) for the two-class case. c CVaR with β for the six-class case. d MASD with λ = τ/(1 − τ) for the six-class case
Table 2 Variable selection (two-class case)

Category | No. | Financial indices | Ordered logit coef. | CVaR Uniform (β = 0.6) #/coef. | CVaR Weighted (β = 0.6) #/coef. | MASD Uniform (τ = 0.7) #/coef. | MASD Weighted (τ = 0.55) #/coef.
(a single number in a #/coef. cell is the ‘#’ count; its ‘coef.’ entry is blank)

Profitability | 1 | Operating Profit Margin | −0.067 | 0 | 0 | 0 | 0
 | 2 | Net Profit Margin | 0.121 | 0 | 0 | 0 | 0
 | 3 | After-Tax Profit Margin | 0.010 | 0 | 0 | 0 | 0
 | 4 | Gross Profit Margin | −0.064 | 405/0.125 | 405/0.085 | 405/0.093 | 0
 | 5 | Sales and General Administration Expense Margin | −0.065 | 0 | 0 | 0 | 0
 | 6 | Return on Capital Employed | 0.016 | 0 | 0 | 0 | 0
Safety | 7 | Quick Ratio (Acid Test Ratio) | | 0 | 0 | 0 | 0
 | 8 | Current Ratio | | 0 | 90 | 0 | 0
 | 9 | Capital Adequacy Ratio | | 0 | 376 | 0 | 0
 | 10 | Retained Earnings to Total Assets | 0.013 | 405/0.245 | 405/0.199 | 405/0.274 | 405/0.270
 | 11 | Debt to Total Assets Ratio | 0.007 | 0 | 0 | 0 | 0
 | 12 | Fixed Ratio | | 394 | 405/0.087 | 389 | 0
 | 13 | Cash Equivalents Sales Ratio | | 0 | 50 | 0 | 0
 | 14 | Liability Turnover Period | −0.028 | 0 | 0 | 0 | 0
 | 15 | Current Liabilities Turnover Period | 0.017 | 0 | 0 | 0 | 0
 | 16 | Fixed Liabilities Turnover Period | | 0 | 0 | 0 | 0
 | 17 | Current Expense to Current Income Ratio | −0.066 | 0 | 0 | 0 | 0
 | 18 | Interest Expense to Interest Bearing Liability Ratio | | 0 | 2 | 0 | 0
 | 19 | Interest Expense to Sales Ratio | 0.014 | 0 | 0 | 0 | 0
 | 20 | Working Capital Ratio | −0.007 | 0 | 0 | 0 | 0
Solvency | 21 | Interest Coverage | −0.010 | 0 | 0 | 0 | 0
 | 22 | Operating Cash Flow to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
Efficiency | 23 | Total Capital Turnover Ratio | 0.016 | 0 | 0 | 0 | 0
 | 24 | Asset Turnover Period | −0.067 | 0 | 0 | 0 | 0
 | 25 | Inventory Turnover Period | 0.003 | 401/0.007 | 405/0.007 | 113 | 0
 | 26 | Accounts Receivable Turnover Period | −0.004 | 0 | 34 | 0 | 0
 | 27 | Borrowed Indebtedness Turnover Period | | 0 | 0 | 0 | 0
 | 28 | Working Capital Turnover Period | | 0 | 0 | 0 | 0
 | 29 | Current Asset Turnover Period | −0.028 | 0 | 0 | 0 | 0
 | 30 | Fixed Asset Turnover Period | | 0 | 0 | 0 | 0
 | 31 | Nonoperating Income Ratio | −0.017 | 0 | 0 | 0 | 0
 | 32 | Nonoperating Expense Ratio | −0.026 | 0 | 0 | 0 | 0
 | 33 | Sales per Person | −0.007 | 0 | 0 | 0 | 0
 | 34 | Net Profit per Person | 0.013 | 386 | 403/0.009 | 2 | 0
 | 35 | Tangible Fixed Assets per Person | | 2 | 405/0.088 | 400 | 0
Size | 36 | Total Capital | | 319 | 0 | 0 | 0
 | 37 | Total Assets | −0.086 | 405/0.624 | 405/0.524 | 405/0.633 | 405/0.730
 | 38 | Sales | 0.125 | 0 | 0 | 0 | 0
 | 39 | After-Tax Profit | −0.005 | 0 | 0 | 0 | 0
 | 40 | Cash Flow | | 0 | 0 | 0 | 0
Cash Flow | 41 | Operating Cash Flow to Sales Ratio | 0.054 | 117 | 6 | 1 | 0
 | 42 | Financing Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 43 | Operating Cash Flow to Total Assets Ratio | −0.008 | 0 | 0 | 0 | 0
 | 44 | Financing Cash Flow to Total Assets Ratio | 0.004 | 0 | 0 | 0 | 0
 | 45 | Free Cash Flow to Sales Ratio | −0.032 | 0 | 0 | 0 | 0
 | 46 | Free Cash Flow to Total Assets Ratio | | 0 | 0 | 0 | 0
 | 47 | Operating Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 48 | Free Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 49 | Operating Cash Flow to Liabilities Ratio | | 0 | 0 | 0 | 0
 | 50 | Free Cash Flow to Liabilities Ratio | | 0 | 0 | 0 | 0
Growth | 51 | Sales Growth Rate | | 0 | 0 | 0 | 0

The column ‘Ordered logit’ reports the coefficients estimated by the function polr of R, which is an implementation of the ordered logit analysis. The function step was employed for the variable selection. The blank parts in the column ‘coef.’ indicate the variables removed via the selection. The coefficients were normalized by the ℓ1-norm. Negative coefficients imply the opposite of the expected sign condition. The columns named ‘#’ for CVaR and MASD report the numbers of non-zero coefficients during the 405 LOO solutions at the selected parameters β and τ, whereas the columns named ‘coef.’ report the optimal solution obtained using the whole 405 samples after determining the parameters
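The ℓ1-normalization mentioned in the footnote to Table 2 can be sketched in a few lines (the helper name is ours):

```python
# l1-normalization of an estimated weight vector, so that the coefficients
# of different methods are reported on a comparable scale.

def l1_normalize(w):
    s = sum(abs(v) for v in w)
    return [v / s for v in w] if s > 0 else list(w)

w_hat = l1_normalize([0.4, 0.0, -1.2, 0.4])  # l1-norm of the input is 2.0
```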
Table 3 Variable selection (six-class case)

Category | No. | Financial indices | Ordered logit coef. | CVaR Uniform (β = 0.65) #/coef. | CVaR Weighted (β = 0.55) #/coef. | MASD Uniform (τ = 0.7) #/coef. | MASD Weighted (τ = 0.65) #/coef.
(columns follow the same conventions as in Table 2)

Profitability | 1 | Operating Profit Margin | −0.093 | 0 | 0 | 0 | 0
 | 2 | Net Profit Margin | 0.078 | 0 | 0 | 0 | 0
 | 3 | After-Tax Profit Margin | | 0 | 0 | 0 | 0
 | 4 | Gross Profit Margin | | 405/0.047 | 405/0.042 | 405/0.031 | 395
 | 5 | Sales and General Administration Expense Margin | | 0 | 0 | 0 | 0
 | 6 | Return on Capital Employed | 0.012 | 0 | 0 | 0 | 0
Safety | 7 | Quick Ratio (Acid Test Ratio) | −0.025 | 0 | 0 | 0 | 0
 | 8 | Current Ratio | | 0 | 0 | 0 | 0
 | 9 | Capital Adequacy Ratio | | 9 | 403/0.017 | 0 | 1
 | 10 | Retained Earnings to Total Assets | 0.017 | 405/0.177 | 405/0.155 | 405/0.181 | 405/0.173
 | 11 | Debt to Total Assets Ratio | 0.011 | 0 | 0 | 0 | 0
 | 12 | Fixed Ratio | | 0 | 0 | 0 | 0
 | 13 | Cash Equivalents Sales Ratio | | 0 | 0 | 0 | 0
 | 14 | Liability Turnover Period | −0.079 | 0 | 0 | 0 | 0
 | 15 | Current Liabilities Turnover Period | 0.069 | 0 | 0 | 0 | 0
 | 16 | Fixed Liabilities Turnover Period | 0.042 | 0 | 0 | 0 | 0
 | 17 | Current Expense to Current Income Ratio | | 0 | 0 | 0 | 0
 | 18 | Interest Expense to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
 | 19 | Interest Expense to Sales Ratio | | 0 | 0 | 0 | 0
 | 20 | Working Capital Ratio | | 1 | 0 | 0 | 0
Solvency | 21 | Interest Coverage | −0.006 | 0 | 0 | 0 | 0
 | 22 | Operating Cash Flow to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
Efficiency | 23 | Total Capital Turnover Ratio | | 0 | 0 | 0 | 0
 | 24 | Asset Turnover Period | −0.056 | 0 | 0 | 0 | 0
 | 25 | Inventory Turnover Period | 0.010 | 405/0.041 | 405/0.041 | 43 | 405/0.018
 | 26 | Accounts Receivable Turnover Period | −0.010 | 0 | 0 | 0 | 0
 | 27 | Borrowed Indebtedness Turnover Period | | 0 | 0 | 0 | 0
 | 28 | Working Capital Turnover Period | 0.020 | 0 | 0 | 0 | 0
 | 29 | Current Asset Turnover Period | −0.062 | 0 | 0 | 0 | 0
 | 30 | Fixed Asset Turnover Period | | 0 | 0 | 0 | 0
 | 31 | Nonoperating Income Ratio | −0.016 | 0 | 0 | 0 | 0
 | 32 | Nonoperating Expense Ratio | −0.016 | 0 | 0 | 0 | 0
 | 33 | Sales per Person | −0.015 | 0 | 0 | 0 | 0
 | 34 | Net Profit per Person | 0.018 | 405/0.058 | 405/0.048 | 405/0.027 | 401/0.006
 | 35 | Tangible Fixed Assets per Person | | 0 | 0 | 0 | 0
Size | 36 | Total Capital | −0.052 | 0 | 0 | 0 | 0
 | 37 | Total Assets | | 405/0.597 | 405/0.606 | 405/0.675 | 405/0.681
 | 38 | Sales | 0.130 | 0 | 0 | 0 | 0
 | 39 | After-Tax Profit | | 0 | 0 | 0 | 0
 | 40 | Cash Flow | 0.008 | 0 | 0 | 0 | 0
Cash Flow | 41 | Operating Cash Flow to Sales Ratio | | 1 | 1 | 0 | 0
 | 42 | Financing Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 43 | Operating Cash Flow to Total Assets Ratio | 0.016 | 0 | 0 | 0 | 0
 | 44 | Financing Cash Flow to Total Assets Ratio | | 0 | 0 | 0 | 0
 | 45 | Free Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 46 | Free Cash Flow to Total Assets Ratio | −0.033 | 0 | 0 | 0 | 0
 | 47 | Operating Cash Flow to Current Liabilities Ratio | 0.039 | 405/0.081 | 405/0.092 | 405/0.087 | 405/0.122
 | 48 | Free Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 49 | Operating Cash Flow to Liabilities Ratio | −0.042 | 0 | 0 | 0 | 0
 | 50 | Free Cash Flow to Liabilities Ratio | 0.019 | 0 | 0 | 0 | 0
Growth | 51 | Sales Growth Rate | −0.005 | 0 | 0 | 0 | 0
Table 4 In-sample versus out-of-sample accuracy (two-class case) (%)
In-sample (2011) Out-of-sample (2012)
Logit CVaR MASD Logit CVaR MASD
Uniform Weighted Uniform Weighted Uniform Weighted Uniform Weighted
All 90.4 84.4 84.7 85.2 83.7 47.3 84.2 79.9 84.7 82.7
1   83.3 67.5 87.5 70.8 85.0 29.8 67.7 77.4 70.2 83.9
2   93.3 91.6 83.5 91.2 83.2 55.4 91.8 81.0 91.4 82.2
Table 5 In-sample versus out-of-sample accuracy (six-class case) (%)
In-sample (2011) Out-of-sample (2012)
Logit CVaR MASD Logit CVaR MASD
Uniform Weighted Uniform Weighted Uniform Weighted Uniform Weighted
Exact All 58.3 53.8 50.1 49.1 50.6 23.2 47.6 45.5 48.6 44.8
Class 1 54.9 62.0 90.1 47.9 91.5 34.3 55.7 84.3 41.4 84.3
Class 2 55.1 22.4 12.2 20.4 12.2 14.8 25.9 20.4 29.6 14.8
Class 3 57.7 55.1 23.1 62.8 24.4 11.0 37.8 20.7 56.1 18.3
Class 4 54.0 52.9 35.6 54.0 36.8 9.6 53.0 34.9 57.8 37.3
Class 5 47.2 33.3 13.9 11.1 11.1 13.2 26.3 5.3 15.8 2.6
Class 6 72.6 73.8 94.0 65.5 94.0 56.1 74.2 92.4 69.7 93.9
±1 All 92.6 88.4 82.5 86.9 80.0 48.3 86.0 78.9 86.3 77.1
Class 1 88.7 81.7 95.8 76.1 93.0 44.3 72.9 87.1 71.4 87.1
Class 2 89.8 91.8 91.8 91.8 91.8 51.9 92.6 92.6 92.6 92.6
Class 3 97.4 92.3 61.5 96.2 55.1 40.2 87.8 58.5 92.7 54.9
Class 4 92.0 90.8 71.3 92.0 70.1 37.3 88.0 66.3 90.4 63.9
Class 5 86.1 88.9 88.9 86.1 83.3 63.2 86.8 86.8 84.2 81.6
Class 6 96.4 85.7 94.0 79.8 94.0 65.2 89.4 95.5 84.8 95.5
‘Exact’ implies the rate of exactly predicting the true class label, whereas ‘±1’ additionally includes the wrong but one-class-different predictions
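The two accuracy notions used in Tables 4 and 5 can be stated compactly (a sketch with hypothetical labels; the helper names are ours):

```python
# 'Exact' counts predictions matching the true class label; '+-1' also
# counts predictions that are off by exactly one class.

def exact_accuracy(true, pred):
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def within_one_accuracy(true, pred):
    return sum(abs(t - p) <= 1 for t, p in zip(true, pred)) / len(true)

true, pred = [1, 2, 3, 4, 5, 6], [1, 3, 3, 6, 5, 5]  # hypothetical labels
```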
Figure 6 shows the accuracy ratio in the LOO method. In the two-class case, the best βs were achieved at 0.6 both with the uniform probability and the weighted probability, while the best τs were achieved at τ = 0.70 (uniform probability) and τ = 0.55 (weighted probability). In the six-class case, the best βs were achieved at 0.65 (uniform probability) and 0.55 (weighted probability), while the best τs were achieved at τ = 0.70 (uniform probability) and τ = 0.65 (weighted probability).
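The LOO validation over the candidate parameters can be written generically as follows; `train_and_predict` stands in for the actual CVaR-/MASD-based training routine (not reproduced here), and the nearest-neighbour stand-in is only to make the sketch self-contained:

```python
# Sketch of the leave-one-out (LOO) validation used to select the parameter
# (beta for the CVaR-based method, tau for the MASD-based one).

def loo_accuracy(samples, labels, param, train_and_predict):
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += train_and_predict(train_x, train_y, param, samples[i]) == labels[i]
    return hits / len(samples)

def pick_best(samples, labels, candidates, train_and_predict):
    # evaluate every candidate parameter and keep the most accurate one
    return max(candidates,
               key=lambda p: loo_accuracy(samples, labels, p, train_and_predict))

# toy stand-in classifier: 1-nearest neighbour, ignoring the parameter
def nn_predict(train_x, train_y, param, x):
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[j]

best = pick_best([0.0, 0.1, 0.9, 1.0], [-1, -1, +1, +1], [0.55, 0.60], nn_predict)
```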
We see that for the two-class case, both probability settings achieved almost the same highest accuracy, while for the six-class case, the uniform probability achieved higher accuracy under the best parameters.
Tables 2 and 3 show the estimated coefficients of the attributes under the chosen parameters β or λ. Clearly, both the CVaR- and MASD-based classifications achieved
very sparse estimates of the weight vector. These sparse solutions are the consequence of the nonnegative ℓ1-regularization and can be considered a result of variable selection. On the other hand, the coefficients estimated via the ordered logit model include many negative values, implying that the estimated model was inconsistent with the experts' economic prior knowledge.
Tables 4 and 5 report the in-sample and out-of-sample accuracy of the two- and six-class classification results, respectively. We see from the tables that both the CVaR- and MASD-based methods achieved higher out-of-sample accuracy than the logit model. In particular, their out-of-sample performance is comparable to their in-sample performance, whereas the logit model showed the highest in-sample accuracy but a poor out-of-sample performance. This indicates that the logit model overfits the in-sample data in spite of the variable selection based on AIC. On the other hand, the variable selection via the CVaR- and MASD-based methods with the nonnegative ℓ1-regularization successfully avoided the overfitting.
The use of the uniform probability showed better performance than that of the weighted probability on average (see the row named ‘All’). However, the weighted probability showed better performance for some classes, which seems to be due to the different weights assigned to the classes.
7 Concluding remarks
In this article we have demonstrated how financial risk measures relate to machine learning methods, especially to SVMs. We pointed out in the first part that CVaR minimization has much to do with ν-SVMs. Indeed, both are based on a form of empirical risk minimization and aim at good out-of-sample performance. Owing to their high degree of affinity and small difference, we can develop a new class of machine learning methods. In particular, we introduced the class of coherent risk-based methods, whose formulations can be viewed as a straightforward extension of ν-SVMs. Numerical experiments demonstrate promising results for the extension. In fact, by employing both the ℓ1-norm regularization and the nonnegative weights, both the CVaR- and the MASD-based methods can be rewritten as LPs, and they successfully provided sparse estimates of the weight vector, achieving higher accuracy than the ordered logit model.
In this paper we have concentrated on the coherent measures, but a similar extension can be achieved on the basis of convex measures (Föllmer and Schied 2002). The extension along that direction is left for future research.
References
Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23:589–609
Artzner P, Delbaen F, Eber JM, Heath D (1999) Coherent measures of risk. Math Finance 9:203–228
Bahrammirzaee A (2010) A comparative survey of artificial intelligence applications in finance: artificial neural networks, expert system and hybrid intelligent systems. Neural Comput Appl 19:1165–1195
Baourakis G, Conisescu M, van Dijk G, Pardalos PM, Zopounidis C (2009) A multicriteria approach for rating the credit risk of financial institutions. Comput Manag Sci 6:347–356
Bennell JA, Crabbe D, Thomas S, Gwilym O (2006) Modelling sovereign credit ratings: neural networks versus ordered probit. Expert Syst Appl 30:415–425
Bennett KP, Bredensteiner E (2000) Geometry in learning. In: Gorini C (ed) Geometry at work. Mathematical Association of America, Washington, DC, pp 132–145
Bennett KP, Mangasarian OL (1993) Multicategory discrimination via linear programming. Optim Methods Softw 3:27–39
Ben-Tal A, El Ghaoui L, Nemirovski A (2009) Robust optimization. Princeton University Press, Princeton
Bredensteiner EJ, Bennett KP (1999) Multicategory classification by support vector machines. Comput Optim Appl 12:53–79
Brodie J, Daubechies I, De Mol C, Giannone D, Loris I (2009) Sparse and stable Markowitz portfolios. PNAS 106:12267–12272
Bugera V, Konno H, Uryasev S (2002) Credit cards scoring with quadratic utility function. J Multi Criteria Decis Anal 11:197–211
Caramanis C, Mannor S, Xu H (2012) Robust optimization in machine learning. In: Sra S, Nowozin S, Wright SJ (eds) Optimization for machine learning. The MIT Press, Cambridge, pp 369–402
Chen Z, Wang Y (2008) Two-sided coherent risk measures and their application in realistic portfolio optimization. J Banking Finance 32:2667–2673
Crook JN, Edelman DB, Thomas LC (2007) Recent developments in consumer credit risk assessment. Eur J Oper Res 183:1447–1465
Delbaen F (2002) Coherent risk measures on general probability spaces. In: Advances in finance and stochastics: essays in honour of Dieter Sondermann. Springer, Berlin, pp 1–37
DeMiguel V, Garlappi L, Nogales FJ, Uppal R (2009) A generalized approach to portfolio optimization: improving performance by constraining portfolio norms. Manag Sci 55:798–812
Erdal HI, Ekinci A (2012) A comparison of various artificial intelligence methods in the prediction of bank failures. Comput Econ. doi:10.1007/s10614-012-9332-0
Fishburn PC (1977) Mean-risk analysis with risk associated with below-target returns. Am Econ Rev 67:116–126
Fisher T (2001) Examples of coherent risk measures depending on one-sided moments. Discussion paper, Darmstadt University of Technology
Föllmer H, Schied A (2002) Convex measures of risk and trading constraints. Finance Stoch 6:429–447
Gotoh J, Takeda A (2005) A linear classification model based on conditional geometric score. Pac J Optim 1:277–296
Gotoh J, Takeda A (2011) On the role of norm constraints in portfolio selection. Comput Manag Sci 8:323–353
Gotoh J, Takeda A (2012) Minimizing loss probability bounds for portfolio selection. Eur J Oper Res 217:371–380
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Huang Z, Chen H, Hsu CJ, Chen WH, Wu S (2004) Credit rating analysis with support vector machines and neural networks: a market comparative study. Decis Support Syst 37:543–558
Konno H, Kobayashi H (2000) Failure discrimination and rating of enterprises by semi-definite programming. Asia Pac Financ Mark 7:261–273
Konno H, Yamazaki H (1991) Mean-absolute deviation portfolio optimization model and its applications to Tokyo stock market. Manag Sci 37:519–531
Krokhmal P (2007) Higher moment coherent risk measures. Quant Finance 7:373–387
Mangasarian OL (1999) Arbitrary-norm separating plane. Oper Res Lett 24:15–23
Mangasarian OL (2000) Generalized support vector machines. In: Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D (eds) Advances in large margin classifiers. The MIT Press, Cambridge, pp 135–146
Markowitz HM (1959) Portfolio selection: efficient diversification of investments. Wiley, New York
Ogryczak W, Ruszczyński A (1999) From stochastic dominance to mean-risk models: semideviations as risk measures. Eur J Oper Res 116:33–50
Ogryczak W, Ruszczyński A (2002) Dual stochastic dominance and related mean-risk models. SIAM J Optim 13:60–78
Perez-Cruz F, Weston J, Hermann DJL, Schölkopf B (2003) Extension of the ν-SVM range for classification. In: Suykens JAK, Horvath G, Basu S, Micchelli C, Vandewalle J (eds) Advances in learning theory: methods, models and applications, vol 190. IOS Press, Amsterdam, pp 179–196
Pflug GC (2000) Some remarks on the value-at-risk and the conditional value-at-risk. In: Uryasev S (ed) Probabilistic constrained optimization: methodology and applications. Springer, Berlin, pp 278–287
Rockafellar RT, Uryasev S (2000) Optimization of conditional value-at-risk. J Risk 2:21–41
Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J Banking Finance 26:1443–1471
Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. The MIT Press, Cambridge
Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12:1207–1245
Shin KS, Lee TS, Kim H (2005) An application of support vector machines in bankruptcy prediction model. Expert Syst Appl 28:127–135
Takeda A (2009) Generalization performance of ν-support vector classifier based on conditional value-at-risk minimization. Neurocomputing 72:2351–2358
Takeda A, Gotoh J, Sugiyama M (2010) Support vector regression as conditional value-at-risk minimization with application to financial time-series analysis. In: Proceedings of the 2010 IEEE international workshop on machine learning for signal processing
Tax DMJ, Duin RPW (1999) Support vector domain description. Pattern Recognit Lett 20:1191–1199
Thomas LC, Edelman DB, Crook JN (2002) Credit scoring and its applications. SIAM, Philadelphia
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Wang Y (2012) Robust ν-support vector machine based on worst-case conditional value-at-risk minimization. Optim Methods Softw 27:1025–1038
Xanthopoulos P, Pardalos PM, Trafalis TB (2013) Robust data mining. Springer, Berlin
Yajima Y (2005) Linear programming approaches for multicategory support vector machines. Eur J Oper Res 162:514–531
Zhu S, Fukushima M (2009) Worst-case conditional value-at-risk with application to robust portfolio management. Oper Res 57:1155–1168