Comput Manag Sci
DOI 10.1007/s10287-013-0175-5
ORIGINAL PAPER
Interaction between financial risk measures and machine learning methods
Jun-ya Gotoh · Akiko Takeda · Rei Yamamoto
Received: 6 October 2012 / Accepted: 31 May 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract The purpose of this article is to review the similarities and differences between financial risk minimization and a class of machine learning methods known as support vector machines, which were developed independently. By recognizing their common features, we can understand them in a unified mathematical framework. On the other hand, by recognizing their differences, we can develop new methods. In particular, employing the coherent measures of risk, we develop a generalized criterion for two-class classification. It includes existing criteria, such as the margin maximization and ν-SVM, as special cases. This extension can also be applied to other types of machine learning methods such as multi-class classification, regression and outlier detection. Although the new criterion is first formulated as a nonconvex optimization, it results in a convex optimization by employing the nonnegative $\ell_1$-regularization. Numerical examples demonstrate how the developed methods work for bond rating.
The research of the first author is partly supported by a MEXT Grant-in-Aid for Young Scientists (B) 23710176. Also, the authors appreciate the comments by two anonymous referees and Dr. Pando G. Georgiev.
J. Gotoh (B)
Department of Industrial and Systems Engineering, Chuo University, 2-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
e-mail: [email protected]

A. Takeda
Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

R. Yamamoto
Mitsubishi UFJ Trust Investment Technology Institute Co., Ltd., 4-2-6 Akasaka, Minato-ku, Tokyo 107-0052, Japan
Keywords ν-Support vector machine (ν-SVM) · Conditional value-at-risk (CVaR) · Mean-absolute semi-deviation (MASD) · Coherent measures of risk · Credit rating
Mathematics Subject Classification (2000) 62H30 · 62P05 · 90C90 · 91B28 · 91B30 · 91G40
1 Introduction
In financial risk management, classification problems play an important role. For example, failure discrimination has been a popular subject for classification since the study of Altman (1968), which is known as the first application of Fisher's linear discriminant analysis (LDA) to company bankruptcy. Such failure discrimination is considered to be a so-called two-class (or binary) classification.
On the other hand, credit rating of debtors (e.g., companies or consumers) can be accomplished by a multi-class classification, where each debtor is to be classified into a class (e.g., AA+) among more than two classes (e.g., AAA, AA+, ...). Estimating or predicting the rating of bonds or their issuers is of increasing importance, as is the credit scoring of consumer loans (see Crook et al. 2007 for consumer credit risk assessment). Recently, sovereign credit rating has gathered attention (e.g., Bennell et al. 2006). Corresponding to this increase in importance, a number of articles have been devoted to developing classification methods. The most classical is the multi-class extension of LDA and the most popular is the (ordered) logit model (see, e.g., Thomas et al. 2002; Crook et al. 2007).
In addition to those traditional statistical methods, various kinds of artificial intelligence and mathematical programming approaches have been applied to financial credit scoring (see, e.g., Bahrammirzaee 2010 for a comprehensive survey of artificial intelligence methods). In particular, since the mid 1990s, the support vector machines (SVMs), a class of machine learning methods developed by Vapnik (1995), have often been employed in financial applications (e.g., Erdal and Ekinci 2012; Huang et al. 2004; Shin et al. 2005). Most of these articles, however, have merely employed existing statistical methods.
On the other hand, well before those credit classification applications, optimization modeling had been developed in the context of portfolio selection since the advent of Markowitz's mean-variance model, in which variance is employed to represent the risk to be minimized. In addition to variance, there has been a continuous effort to study a number of measures of risk that capture various characteristics of loss distributions. Among them are semi-variance (Markowitz 1959), below-target returns (Fishburn 1977), value-at-risk (VaR), conditional value-at-risk (CVaR) (e.g., Rockafellar and Uryasev 2000) and the classes of coherent and convex measures of risk (Artzner et al. 1999; Föllmer and Schied 2002). Beyond merely measuring the risk of a position, such risk measures have been employed as objectives in optimizing a portfolio of financial assets.
It is not hard to see that there is a connection between financial risk minimization and the optimization in machine learning criteria, both of which estimate models that would achieve good out-of-sample performance. Indeed, Gotoh and Takeda (2005)
have pointed out the common mathematical structure employed both in the class of machine learning methods known as ν-support vector machines (ν-SVMs) and in the CVaR minimization. Considering that both methods have gained popularity due to their nice theoretical properties and computational tractability, this connection brings us a new perspective.
The purpose of this article is to review the similarities and differences between financial risk optimization and SVMs, by revisiting the facts discussed in Gotoh and Takeda (2005). The two methodologies have been developed independently, but they have a lot in common. By recognizing what they have in common, we can understand their mathematics in a unified framework. At the same time, by recognizing their differences, we can develop new methods. In particular, we develop a new criterion for two-class classification by employing the notion of the coherent measures of risk. The new criterion can be viewed as a generalization of both the margin maximization and ν-SVM. Besides, the idea can be applied to other types of machine learning situations such as multi-class classification, regression and outlier detection.
The structure of this article is as follows. In the next section, we briefly overview a couple of prominent criteria for SVM-based two-class linear classification. Section 3 is devoted to an overview of VaR and CVaR, followed by an exposition of the relation between the two-class ν-SVM and the CVaR minimization on the basis of Gotoh and Takeda (2005). In Sect. 4, we apply the coherent measures of risk to the two-class classification context, and discuss strategies to cope with the intractability arising from a nonconvex constraint of the associated optimization problems. Section 5 further extends the two-class method developed in Sect. 4 to other machine learning situations. In Sect. 6, numerical examples are given so as to demonstrate the performance of a couple of methods developed in the preceding sections. Finally, Sect. 7 closes the article with some concluding remarks.
Notation. We use the following notation throughout. For $x \in \mathbb{R}^n$, $\|x\|$ denotes any norm in $\mathbb{R}^n$. In particular, the $\ell_p$-norm is defined by $\|x\|_p := (\sum_{j=1}^n |x_j|^p)^{1/p}$ for $p \in [1,\infty)$, and $\|x\|_\infty := \max_{j=1,\dots,n}\{|x_j|\}$ for $p = \infty$. The $\ell_2$-norm, $\|x\|_2$, is also known as the Euclidean norm. $e_n$ denotes the vector of ones of size $n$, i.e., $e_n := (1,\dots,1)^\top \in \mathbb{R}^n$. $\Delta_m$ denotes the unit simplex in $\mathbb{R}^m$ and $\Delta_m^+$ is its (relative) interior, i.e., $\Delta_m := \{q \in \mathbb{R}^m : e_m^\top q = 1,\, q \geq 0\}$ whereas $\Delta_m^+ := \{q \in \mathbb{R}^m : e_m^\top q = 1,\, q > 0\}$. For a set $S \subset \mathbb{R}^n$, its convex hull is denoted by $\overline{S}$.
2 Formulations of support vector classification
Maximum margin criterion and hard margin formulation. To make this article self-contained, let us start with a fundamental formulation of the so-called two-class hard margin support vector classification.
Let $\{(x_1, y_1),\dots,(x_m, y_m)\}$ denote a given data set where $x_i \in \mathbb{R}^n$ describes the attributes of sample $i$ and $y_i \in \{\pm 1\}$ describes its label, $i = 1,\dots,m$. Let $m_+ := |\{i : y_i = 1\}|$ and $m_- := |\{i : y_i = -1\}|\,(= m - m_+)$. In the following, the data set is assumed to contain at least one sample of each class, i.e., $\min\{m_+, m_-\} \geq 1$. We say that the data set is linearly separable if there exists $(w, b) \in (\mathbb{R}^n \setminus \{0\}) \times \mathbb{R}$ such that for each $i = 1,\dots,m$,
Fig. 1 Two hyperplanes separating a linearly separable data set and their margins. Two separating hyperplanes are drawn in this figure. According to machine learning theory, the hyperplane with the larger margin is preferable to the other
$$\begin{cases} y_i = +1 \;\Rightarrow\; w^\top x_i > b, \\ y_i = -1 \;\Rightarrow\; w^\top x_i < b. \end{cases} \qquad (1)$$

Condition (1) can be rewritten as $y_i(w^\top x_i - b) > 0$. As indicated in Fig. 1, a linearly separable data set can be divided into two classes corresponding to its labels, i.e., $y_i = +1$ or $y_i = -1$, by an infinite number of hyperplanes $w^\top x = b$.
One of the most reasonable criteria for determining a hyperplane separating the two classes is to maximize the distance from the hyperplane to the nearest points $x_i$. It is not hard to see that if the data set is linearly separable, this criterion is formulated as a fractional optimization problem of the form:

$$\max_{w \in \mathbb{R}^n \setminus \{0\},\; b \in \mathbb{R}} \;\; \min_{i=1,\dots,m} \; \frac{y_i(w^\top x_i - b)}{\|w\|_2}. \qquad (2)$$
The fraction $y_i(w^\top x_i - b)/\|w\|_2$ is called the geometric margin of a sample $(x_i, y_i)$, and the smallest geometric margin is called the margin of a hyperplane, as illustrated in Fig. 1. The max-min formulation (2) can be reformulated into another fractional program:

$$\begin{array}{ll} \displaystyle\max_{w,b,s} & \dfrac{s}{\|w\|_2} \\[2mm] \text{subject to} & y_i(w^\top x_i - b) \geq s, \quad i = 1,\dots,m. \end{array} \qquad (3)$$
By applying the so-called Charnes-Cooper transformation, it results in a convex quadratic problem formulation called the hard margin support vector classification (HSVC):
Fig. 2 Distance under different norms. $x^1$ is the closest to a point $x$ under the $\ell_\infty$-norm and its distance is given by $|w^\top x - b|/\|w\|_1$, while $x^2$ is the closest under the $\ell_2$-norm and the distance is $|w^\top x - b|/\|w\|_2$
$$\begin{array}{ll} \displaystyle\min_{w,b} & \frac{1}{2}\|w\|_2^2 \\[1mm] \text{subject to} & y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m. \end{array} \qquad (4)$$
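For completeness, the reduction behind the Charnes-Cooper step can be sketched as follows (a standard argument, not spelled out in the original). For a linearly separable data set, the optimal $s$ in (3) is positive, and (3) is invariant under positive scaling of $(w, b, s)$; normalizing $s = 1$ via $(w, b) \leftarrow (w/s, b/s)$ gives

$$\max_{w,b}\;\; \frac{1}{\|w\|_2} \quad \text{subject to} \quad y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m,$$

and maximizing $1/\|w\|_2$ is equivalent to minimizing $\frac{1}{2}\|w\|_2^2$, which yields (4).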
The equivalence between (4) and (3) depends on the linear separability of the data set. Indeed, if a data set is not linearly separable, (4) is infeasible, whereas (3) still has a solution. It should also be noted that the choice of the $\ell_2$-norm is not necessary. Indeed, if we employ the $\ell_\infty$-norm in place of the $\ell_2$-norm for gauging the distance between two points, the distance from a point $x_i$ to a hyperplane $w^\top x = b$ is represented by

$$\frac{y_i(w^\top x_i - b)}{\|w\|_1},$$
where $\|w\|_1$ represents the $\ell_1$-norm (see Fig. 2). In general, if any norm $\|x\|$ is employed (i.e., not necessarily the $\ell_2$- or $\ell_\infty$-norm), the corresponding distance is given by $y_i(w^\top x_i - b)/\|w\|_*$ with the dual norm $\|w\|_* := \max_{x \neq 0} w^\top x / \|x\|$ (see Mangasarian 1999). Corresponding to (4), the problem can be represented as a convex optimization with the dual norm:

$$\begin{array}{ll} \displaystyle\min_{w,b} & \|w\|_* \\[1mm] \text{subject to} & y_i(w^\top x_i - b) \geq 1, \quad i = 1,\dots,m. \end{array} \qquad (5)$$
If the $\ell_2$-norm, $\|x\|_2$, is employed as $\|x\|$, the objective of (5) is $\|w\|_2$ and (5) is equivalent to (4).
Minimizing a norm in (4) or (5) is often interpreted as the margin maximization on the basis of the above reasoning.
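As a small numerical illustration of the geometric margin used in (2), the sketch below computes $y_i(w^\top x_i - b)/\|w\|_2$ for each sample and the margin of a hyperplane on a toy data set (all numbers are illustrative, not taken from the paper):

```python
import math

def geometric_margin(w, b, x, y):
    """Geometric margin y * (w.x - b) / ||w||_2 of one sample (x, y)."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    dot = sum(wj * xj for wj, xj in zip(w, x))
    return y * (dot - b) / norm_w

# Toy linearly separable data set (illustrative values).
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

def margin(w, b):
    """Margin of a hyperplane = smallest geometric margin over the samples."""
    return min(geometric_margin(w, b, x, y) for x, y in samples)

m1 = margin((1.0, 1.0), 2.0)  # hyperplane x_1 + x_2 = 2
m2 = margin((1.0, 0.0), 1.0)  # hyperplane x_1 = 1
```

Both hyperplanes separate the data (both margins are positive), but the first attains the larger margin ($\sqrt{2}$ vs. 1), so criterion (2) would prefer it, as in Fig. 1.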
Soft margin formulation. When the data set is not linearly separable, i.e., for any $(w, b)$ there exists a sample $i$ satisfying $y_i(w^\top x_i - b) \leq 0$, there are multiple ways of modification. The most common approach is known as the soft margin formulation or C-support vector classification (C-SVC), which is formulated by
$$\begin{array}{ll} \displaystyle\min_{w,b,z} & \frac{1}{2}\|w\|_2^2 + \dfrac{C}{m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i \geq 1, \quad z_i \geq 0, \quad i = 1,\dots,m, \end{array} \qquad (6)$$
where $C > 0$ is a user-defined parameter. Formulation (6) is always feasible regardless of whether the given data set is linearly separable or not.
In contrast to HSVC (4), C-SVC (6) is a simultaneous minimization of two objectives:

$$\frac{1}{2}\|w\|_2^2 + C \cdot \frac{1}{m}\sum_{i=1}^m \max\{1 - y_i(w^\top x_i - b),\, 0\}. \qquad (7)$$
In machine learning theory, such a two-objective formulation for estimating a model is called structural risk minimization. The second term of (7) represents the average of the degrees of misclassification, where only samples $x_i$ satisfying $y_i(w^\top x_i - b) < 1$ contribute to the average computation, and is called the empirical risk; the first term is considered to play the role of avoiding overfitting and is called the regularization term. Minimizing the regularization term is often interpreted as the margin maximization, in comparison with HSVC (4).
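The decomposition (7) can be made concrete by evaluating the regularization term and the empirical risk separately for a fixed hyperplane; the sketch below uses illustrative toy numbers, not data from the paper:

```python
def csvc_objective(w, b, C, samples):
    """Return (total, regularization term, empirical risk) of objective (7)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    hinge = [max(1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) - b), 0.0)
             for x, y in samples]
    emp_risk = sum(hinge) / len(hinge)
    return reg + C * emp_risk, reg, emp_risk

# Toy data: the fourth sample is misclassified; the fifth is correctly
# classified but lies inside the margin (0 < y*(w.x - b) < 1), so it is
# still penalized by the hinge loss.
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1),
           ((0.5, 0.9), +1), ((2.2, 0.5), +1)]
total, reg, emp = csvc_objective((1.0, 1.0), 2.0, C=1.0, samples=samples)
```

The last sample contributes a positive hinge loss despite its correct classification, which is precisely the first interpretability issue discussed next.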
In financial practice, the use of a machine learning model is required to strike a balance between accuracy and interpretability. In this sense, the interpretability of the parameter(s) can be of importance. Compared to HSVC (4), however, the interpretation of (6) is vague in the following sense:
1. First of all, the empirical risk part of (7) gauges the degree of inseparability, which is defined not only on the samples satisfying $y_i(w^\top x_i - b) < 0$, but also on those satisfying $0 \leq y_i(w^\top x_i - b) < 1$. Note that the samples $x_i$ satisfying $0 < y_i(w^\top x_i - b) \leq 1$ are penalized despite their correct classification. Besides, it is not easy to understand the implication of the value "1".
2. In addition, the parameter $C > 0$ has to be selected a priori, but its interpretation is not clear. In practice, its value is determined by data-driven methods such as cross validation.
(See Bennett and Bredensteiner 2000 for its geometric interpretation.)

Another remedy for the linearly inseparable case is known as ν-SVC (Schölkopf et al. 2000), which is formulated by
$$\begin{array}{ll} \displaystyle\min_{w,b,z,\rho} & \frac{1}{2}\|w\|_2^2 - \rho + \dfrac{1}{\nu m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i - \rho \geq 0, \quad z_i \geq 0, \quad i = 1,\dots,m, \end{array} \qquad (8)$$
where $\nu \in (0, 1]$ is a user-defined parameter. It is known that $\rho \geq 0$ is satisfied at optimality. ν-SVC (8) can be viewed as a structural risk minimization of the form:
$$\min_{w,b}\;\; \frac{1}{2}\|w\|_2^2 + \min_{\rho}\left\{ -\rho + \frac{1}{\nu m}\sum_{i=1}^m \max\{-y_i(w^\top x_i - b) + \rho,\, 0\} \right\}. \qquad (9)$$
As will be elaborated in the next section, the second term of (9) represents a special case of a risk measure known as the conditional value-at-risk in the finance literature.
A solution to (8) corresponds to that of (6) in the following sense.
Proposition 1 (Schölkopf et al. 2000) Let $(w^*, b^*, \rho^*)$ be a solution to (8) and suppose $\rho^* > 0$. Then $(w^*/\rho^*, b^*/\rho^*)$ is a solution to (6) with the parameter $C = 1/\rho^*$.
Despite the correspondence shown in the above proposition, the interpretability of ν-SVC (8) is superior to that of C-SVC (6).
Theorem 1 (ν-property, Schölkopf et al. 2000) Suppose that the solution $(w^*, b^*, z^*, \rho^*)$ to (8) satisfies $\rho^* > 0$. Then

(i) ν is an upper bound on the value defined by $\frac{1}{m}|\{i : y_i(x_i^\top w^* - b^*) < \rho^*\}|$;
(ii) ν is a lower bound on the fraction of support vectors.
A support vector (SV) of a solution $(w^*, b^*, z^*, \rho^*)$ is defined as a sample $(x_i, y_i)$ for which the KKT complementarity condition $\zeta_i^*\big(y_i(x_i^\top w^* - b^*) + z_i^* - \rho^*\big) = 0$, $\zeta_i^* \geq 0$, $y_i(x_i^\top w^* - b^*) + z_i^* - \rho^* \geq 0$ holds with $\zeta_i^* > 0$, where $\zeta_i^*$ is the optimal dual variable corresponding to the constraint $y_i(x_i^\top w - b) + z_i - \rho \geq 0$. Machine learning theory indicates that a smaller number of SVs leads to better out-of-sample performance; accordingly, the controllability of the number of SVs is a preferable property for users. In contrast, the parameter $C$ in (6) is not easy to interpret.
On the other hand, we can point out that the ν-property is understandable in the context of the CVaR minimization, which will be overviewed in the next section. The fact that ν-SVC can be viewed as a special case of a financial risk measure minimization opens the door to versatile interaction between financial risk measures and machine learning methodologies.
3 Value-at-risk and conditional value-at-risk
Next, let us overview measures of financial risk. In financial risk management, uncertainty is usually described by random variables (e.g., the uncertain loss of a position) on a certain space Ω of elementary events with a probability distribution P. In the following, we assume that a random variable, often denoted by L, represents a loss, i.e., the smaller the value of the random variable, the better. A risk measure r is a functional that maps a random loss to a real value. Throughout the paper we assume that the larger a risk measure, the riskier.
To represent the risk of a loss by a single value, various risk measures have been proposed and examined since Markowitz's introduction of variance or, equivalently, standard deviation. Since variance captures the expected deviation from the expected value of the random variable, it regards both upward and downward deviations from the expected gain as a loss.
Fig. 3 Illustration of a loss distribution and risk measures associated with CVaR
VaR. On the other hand, there has been a strongly supported view that only the lower deviation, or lower tail, of a return distribution is to be avoided, and many measures, including semi-variance (Markowitz 1959) and below-target return (Fishburn 1977), have been proposed for capturing such downside risk.
Value-at-risk (VaR) is another measure of downside risk, defined as the β-quantile of the loss, i.e.,

$$\alpha_\beta[\mathrm{L}] := \min\{\alpha : \mathrm{P}\{\mathrm{L} \leq \alpha\} \geq \beta\},$$

with $\beta \in (0, 1)$. In usual practice, the parameter β is fixed at a value close to 1, say 0.99, so as to measure the virtually-largest loss which may happen with a small probability $1 - \beta$ (see Fig. 3). Although VaR was originally developed in practice in the mid 1990s so that a manager can easily grasp the whole risk he/she owns every day, it has been pointed out that VaR has drawbacks as a risk measure. Indeed, it passes over the impact of losses larger than the quantile, and it lacks subadditivity.
CVaR. Conditional value-at-risk (CVaR) has been proposed by many authors and is now one of the most promising downside risk measures. Given a random loss L, its CVaR is defined by

$$\phi_\beta[\mathrm{L}] := \min_{\alpha}\left\{ \alpha + \frac{1}{1-\beta}\mathrm{E}[\max\{\mathrm{L} - \alpha,\, 0\}] \right\}, \qquad (10)$$

where $\mathrm{E}[\cdot]$ is the mathematical expectation with respect to P and $\beta \in (0, 1)$ is a user-defined parameter. CVaR is also known as tail value-at-risk (TVaR) or expected shortfall (ES).
Although the definition (10) may seem slightly complicated, CVaR is virtually equal to the conditional expectation $\mathrm{E}[\mathrm{L} \mid \mathrm{L} \geq \alpha_\beta[\mathrm{L}]]$ or $\mathrm{E}[\mathrm{L} \mid \mathrm{L} > \alpha_\beta[\mathrm{L}]]$ in the sense that we have

$$\mathrm{E}[\mathrm{L} \mid \mathrm{L} \geq \alpha_\beta[\mathrm{L}]] \;\leq\; \phi_\beta[\mathrm{L}] \;\leq\; \mathrm{E}[\mathrm{L} \mid \mathrm{L} > \alpha_\beta[\mathrm{L}]].$$
See Proposition 5 of Rockafellar and Uryasev (2002) for the details. Besides, an optimal α in the definition of $\phi_\beta[\mathrm{L}]$ is located in the closed interval $[\alpha_\beta[\mathrm{L}], \alpha_\beta^+[\mathrm{L}]]$, where $\alpha_\beta^+[\mathrm{L}] := \inf\{\alpha : \mathrm{P}\{\mathrm{L} \leq \alpha\} > \beta\}$; accordingly, it virtually provides the VaR of L.
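For a finite distribution, both $\alpha_\beta[\mathrm{L}]$ and $\phi_\beta[\mathrm{L}]$ can be computed exactly in a few lines, since an optimal α in (10) can be taken among the realized loss values (the kinks of the piecewise-linear objective). The toy numbers below also preview the two limiting cases discussed next:

```python
def var(losses, probs, beta, tol=1e-12):
    """Empirical VaR: min{ alpha : P{L <= alpha} >= beta }."""
    for a in sorted(set(losses)):
        if sum(p for l, p in zip(losses, probs) if l <= a) >= beta - tol:
            return a

def cvar(losses, probs, beta):
    """Empirical CVaR via (10); the optimum is attained at some loss value."""
    return min(a + sum(p * max(l - a, 0.0) for l, p in zip(losses, probs))
               / (1.0 - beta)
               for a in set(losses))

L = [1.0, 2.0, 3.0, 4.0]   # equiprobable toy losses
p = [0.25] * 4

mean_loss = cvar(L, p, 0.0)  # beta = 0: expected loss (finite case)
max_loss = cvar(L, p, 0.9)   # beta > 1 - min_i p_i: maximum loss
mid_cvar = cvar(L, p, 0.5)   # expectation over the worst half
```

Here `mean_loss` is 2.5, `max_loss` is 4.0 and `mid_cvar` is 3.5 (the average of the two worst losses), while `var(L, p, 0.5)` returns 2.0, consistent with the sandwich inequality above.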
In particular, if L is a random variable on a finite sample space, i.e., $|\Omega| = m\,(<\infty)$, then (10) is well-defined at $\beta = 0$, and we have $\phi_0[\mathrm{L}] = \mathrm{E}[\mathrm{L}] = \sum_{i=1}^m p_i L_i$, where $L_i := \mathrm{L}(\omega_i)$ and $p_i := \mathrm{P}\{\omega = \omega_i\}$, $i = 1,\dots,m$. Besides, for β sufficiently close to 1, we have $\phi_\beta[\mathrm{L}] = \max_i\{L_i\}$. In this sense, CVaR is a generalization of the expected loss and the maximum loss (see Fig. 3).

CVaR has nice properties both in theory and in computation. In terms of expected utility theory, it is known to be consistent with all risk-averse investors in the following sense.
Theorem 2 (Pflug 2000; Ogryczak and Ruszczyński 2002) Let $\mathcal{U}$ be the set of risk averse utility functions, i.e., nondecreasing and concave functions. Then, for each $\beta \in (0, 1)$, we have

$$\mathrm{E}[u(-\mathrm{L}_1)] \geq \mathrm{E}[u(-\mathrm{L}_2)] \;\text{ for all } u \in \mathcal{U} \;\;\Longrightarrow\;\; \phi_\beta[\mathrm{L}_1] \leq \phi_\beta[\mathrm{L}_2].$$
This theorem says that if all risk averse investors agree that $\mathrm{L}_1$ is never inferior to $\mathrm{L}_2$, the preference relation holds equally true of CVaR (for any β). This property is known as consistency with the second order stochastic dominance (SSD) (Pflug 2000; Ogryczak and Ruszczyński 2002). It should be emphasized that the above inequality holds independently of the distribution P. Note that neither variance (or standard deviation) nor mean-variance is consistent with SSD.
CVaR is also known to be a coherent measure of risk (Artzner et al. 1999), whose details will be summarized in the next section. Since both the consistency with SSD and the coherence are distribution-free properties, the use of CVaR is advantageous in situations where the loss cannot be assumed to follow a specific distribution. Moreover, although the use of utility functions in credit scoring problems has been proposed in the literature (e.g., Baourakis et al. 2009; Bugera et al. 2002), the use of SSD-consistent coherent risk measures is more advantageous since specifying an adequate utility function is a difficult task.
CVaR minimization. Once a risk measure $r[\cdot]$ is introduced, we can define the problem of choosing an optimal random variable $\mathrm{L}(\pi^\star)$ from among a set of random variables $\{\mathrm{L}(\pi) : \pi \in \Pi\}$, where Π is a given set of parameters. Here we assume that P is independent of π.
An important example in financial risk management is portfolio selection. For example, let us denote the random rates of return of n investable assets by $\mathrm{R} := (\mathrm{R}_1,\dots,\mathrm{R}_n)^\top$, and let $\pi := (\pi_1,\dots,\pi_n)^\top$ be the investment weight vector. The loss L of a portfolio π is usually defined by the negative portfolio return, $\mathrm{L}(\pi) = -\mathrm{R}^\top\pi$, and an optimal portfolio $\pi^\star$ is obtained via $\min\{r[\mathrm{L}(\pi)] : \pi \in \Pi\}$ for some risk measure $r[\cdot]$ and $\Pi \subset \{\pi \in \mathbb{R}^n : e_n^\top \pi = 1\}$. More specifically, the problem of obtaining a parameter $\pi^\star$ that minimizes the CVaR associated with a loss function $\mathrm{L}(\pi)$ is formulated by
$$\min_{\pi}\;\; \phi_\beta[\mathrm{L}(\pi)] \quad \text{subject to} \quad \pi \in \Pi. \qquad (11)$$
Rockafellar and Uryasev (2002) show that the CVaR minimization (11) results in a convex minimization under a convexity condition that often arises in practice.
Theorem 3 (Rockafellar and Uryasev 2002) If the loss function $\mathrm{L}(\pi)$ is convex in π, then $\phi_\beta[\mathrm{L}(\pi)]$ is convex in π, and the CVaR minimization (11) results in a convex minimization of the form:

$$\min_{\alpha,\pi}\;\; \alpha + \frac{1}{1-\beta}\mathrm{E}[\max\{\mathrm{L}(\pi) - \alpha,\, 0\}] \quad \text{subject to} \quad \pi \in \Pi, \qquad (12)$$

as long as Π is a convex set. In addition, for any optimal solution $(\alpha^\star, \pi^\star)$ to (12), it holds that $\alpha^\star \in [\alpha_\beta[\mathrm{L}(\pi^\star)], \alpha_\beta^+[\mathrm{L}(\pi^\star)]]$.

The last statement says that the CVaR minimization (12) provides, as a byproduct, an approximation of the VaR of the CVaR-minimizing loss distribution.
What is more commonly the case in practice is the minimization of the empirical version of the risk measure. Suppose that m realized samples of a random vector defining the loss (e.g., historical returns $R_i$, $i = 1,\dots,m$, in the aforementioned portfolio selection) are given. By regarding the m realizations as all the elementary events (e.g., $R_i = \mathrm{R}(\omega_i)$, $i = 1,\dots,m$) and defining the associated losses $L_i(\pi) := \mathrm{L}(\omega_i, \pi)$, $i = 1,\dots,m$, the minimization of the empirical CVaR can then be written as
$$\min_{\alpha,\pi}\;\; \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\{L_i(\pi) - \alpha,\, 0\} \quad \text{subject to} \quad \pi \in \Pi, \qquad (13)$$

where $\beta \in [0, 1)$, and $p_i$ is a reference probability satisfying $\sum_{i=1}^m p_i = 1$, $p_i > 0$, $i = 1,\dots,m$. A typical choice of p is the uniform one, i.e., $p_i = 1/m$, $i = 1,\dots,m$.
As Rockafellar and Uryasev (2000) demonstrate, if the loss is given by a linear function of π (e.g., $L_i(\pi) = -R_i^\top \pi$) and the constraint $\pi \in \Pi$ is given by a system of linear inequalities, the minimization (13) results in a linear program (LP).
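The LP arises from the standard linearization of the max terms in (13): introducing auxiliary variables $z_i \geq \max\{L_i(\pi) - \alpha,\, 0\}$ gives

$$\begin{array}{ll} \displaystyle\min_{\alpha,\pi,z} & \alpha + \dfrac{1}{1-\beta}\displaystyle\sum_{i=1}^m p_i z_i \\[2mm] \text{subject to} & z_i \geq -R_i^\top \pi - \alpha, \quad z_i \geq 0, \quad i = 1,\dots,m, \quad \pi \in \Pi, \end{array}$$

whose objective and constraints are all linear whenever Π is polyhedral.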
A relation of ν-SVC to CVaR minimization. It is easy to find a similarity between the CVaR minimization (13) and ν-SVC (8) or (9). Precisely, replacing the variable and parameters as follows:

$$\begin{cases} \rho \to -\alpha, \\ \nu \to 1 - \beta, \\ 1/m \to p_i, \quad (i = 1,\dots,m), \end{cases} \qquad (14)$$
the formulation (8) or (9) can be viewed as a special case of the minimization of the function:

$$\frac{1}{2}\|w\|_2^2 + \min_{\alpha}\left\{ \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\{-y_i(w^\top x_i - b) - \alpha,\, 0\} \right\}.$$
Namely, ν-SVC can be considered as a structural risk minimization where the empirical risk is captured by the CVaR associated with the losses $-y_i(w^\top x_i - b)$ and their probabilities $p_i$, $i = 1,\dots,m$.
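This equivalence is easy to check numerically: under the substitution (14), the ν-SVC empirical-risk term in (9) and the empirical CVaR of the losses $-y_i(w^\top x_i - b)$ coincide. A minimal sketch with illustrative toy data and a fixed hyperplane (both piecewise-linear minimizations are solved exactly over their kink points):

```python
# Toy data and hyperplane (illustrative values, not from the paper).
samples = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((0.5, 0.9), +1)]
w, b = (1.0, 1.0), 2.0
L = [-y * (sum(wj * xj for wj, xj in zip(w, x)) - b) for x, y in samples]

m, nu = len(L), 0.5
beta = 1.0 - nu            # substitution (14): nu -> 1 - beta

# nu-SVC empirical-risk term from (9):
#   min_rho { -rho + sum_i max(L_i + rho, 0) / (nu * m) }
nusvc_term = min(-r + sum(max(l + r, 0.0) for l in L) / (nu * m)
                 for r in (-l for l in L))

# Empirical CVaR of the same losses:
#   min_alpha { alpha + sum_i max(L_i - alpha, 0) / ((1 - beta) * m) }
cvar_term = min(a + sum(max(l - a, 0.0) for l in L) / ((1.0 - beta) * m)
                for a in L)
```

Both terms evaluate to the same number (here the average of the worst half of the margin losses), realizing the identification $\rho \to -\alpha$.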
However, there is still a remaining question: unless the data set is linearly separable, does minimizing $\frac{1}{2}\|w\|_2^2$ still represent margin maximization? In conclusion, there is a gap between the margin maximization (2) and ν-SVC in that (8) or (9) can result in a meaningless solution satisfying $w = 0$ for small ν, whereas (2) has a solution $w \neq 0$. In order to see the source of the gap, Gotoh and Takeda (2005) examine the CVaR-based formulation:
$$\min_{w,b,\alpha}\;\; \alpha + \frac{1}{1-\beta}\sum_{i=1}^m p_i \max\left\{ -\frac{y_i(w^\top x_i - b)}{\|w\|_2} - \alpha,\; 0 \right\}. \qquad (15)$$
Note that this is the minimization of the CVaR where the negative geometric margin, i.e.,

$$-\frac{y_i(w^\top x_i - b)}{\|w\|_2}, \qquad (16)$$

is employed to define the loss. With the change of variables $(w, b, \alpha) \leftarrow (w/\|w\|_2,\, b/\|w\|_2,\, \alpha)$, (15) is rewritten as
$$\begin{array}{ll} \displaystyle\min_{w,b,\alpha} & \alpha + \dfrac{1}{1-\beta}\displaystyle\sum_{i=1}^m p_i \max\{-y_i(w^\top x_i - b) - \alpha,\, 0\} \\[2mm] \text{subject to} & \|w\|_2 = 1. \end{array} \qquad (17)$$
Note that (17) is equivalent to (15) in the following sense:

– If $(w^*, b^*, \alpha^*)$ is an optimal solution to (17), then $k(w^*, b^*, \alpha^*)$ is optimal to (15) for any $k > 0$;
– If $(w^*, b^*, \alpha^*)$ is an optimal solution to (15), then $(w^*/\|w^*\|_2,\, b^*/\|w^*\|_2,\, \alpha^*)$ is optimal to (17).
Pérez-Cruz et al. (2003) present an extended version of ν-SVC, termed Eν-SVC:

$$\begin{array}{ll} \displaystyle\min_{w,b,z,\rho} & -\rho + \dfrac{1}{\nu m}\displaystyle\sum_{i=1}^m z_i \\[2mm] \text{subject to} & y_i(w^\top x_i - b) + z_i - \rho \geq 0, \quad z_i \geq 0, \quad i = 1,\dots,m, \\[1mm] & \|w\|_2^2 = 1. \end{array} \qquad (18)$$
It is easy to see that Eν-SVC (18) is a special case of (17) under the change of variable and parameters as in (14). In this sense, Eν-SVC (18) can be regarded as a CVaR minimization in which the loss of the form (16) is employed. (18) has an advantage over the ordinary ν-SVC (8) in the following sense:

– For small ν, (8) or (9) can result in a meaningless solution satisfying $w = 0$. On the other hand, (18) has optimal solutions for such small ν;
Fig. 4 A bridge via the CVaR minimization between ν-SVM and margin maximization. 'Min.CVaR' indicates the optimal value of (15) or (17) (or, equivalently, Eν-SVC), and is nondecreasing in β (or nonincreasing in ν). For linearly inseparable data, 'Min.CVaR' can become positive at large β (or, equivalently, small ν). In particular, at β sufficiently close to 1 (or ν sufficiently close to 0), (15) or (17) (or Eν-SVC) provides the same hyperplane as the maximum margin criterion (2) does. On the other hand, ν-SVC results in a solution satisfying $w = 0$ in the case where 'Min.CVaR' is positive; accordingly, ν-SVC never provides the same solution as the maximum margin criterion does
– For ν under which (8) or (9) has optimal solutions satisfying $w \neq 0$, (18) has the same optimal solutions.
Besides, Gotoh and Takeda (2005) show that

– If the optimal value of (18) is negative, then the resulting hyperplane is equivalent to that obtained via (8) or (9).
– If the optimal value of (18) is positive, then (8) or (9) results in a meaningless solution satisfying $w = 0$.

Note that the above facts hold also for (15) or (17) after replacing ν and $1/m$ with $1 - \beta$ and $p_i$, respectively.

An additional advantage of the formulations (15), (17) and (18) over ν-SVC (8) is that by adopting a sufficiently large β or small ν, each of them provides the same hyperplane as the maximum margin criterion (2) does.
Proposition 2 If $\beta \in (1 - \min_i p_i,\, 1)$, then (15) and (17) are equivalent to (2) in the sense that both provide the same hyperplane. Equivalently, if $\nu \in (0, 1/m)$, (18) is equivalent to (2).

Note that this proposition implies that (15), (17) and (18) include the margin maximization as a special case.
Figure 4 illustrates the above-mentioned facts. The CVaR minimization (15) or (17)(or equivalently Eν-SVC) covers both ν-SVC and the maximum margin criterion byvarying the parameter β (or ν, respectively).
A generalization theory. A goal of classification is to predict the labels of unknown samples as accurately as possible, rather than merely to obtain a model that fits the given samples. More precisely, based on m observed samples $\{(x_1, y_1),\dots,(x_m, y_m)\}$ of a random
vector $(\mathrm{X}, \mathrm{Y})$, an estimate $(w, b)$ is sought so that $\mathrm{Y}(w^\top \mathrm{X} - b)$ would be probabilistically large. A branch of machine learning theory, known as generalization theory, provides a nonparametric bound on the associated probability. For example, by following the proof of Takeda et al. (2010) for ν-SVR, we achieve the following theoretical generalization bounds for ν-SVC and the extended ν-SVC (18) by using the empirical VaR:

$$\alpha_\beta^e(w, b) := \min\left\{ \alpha : \frac{1}{m}\big|\{i : L_i(w, b) \leq \alpha\}\big| \geq \beta \right\},$$
with $L_i(w, b) := -y_i(w^\top x_i - b)$, or the empirical CVaR:

$$\phi_\beta^e(w, b) := \min_{\alpha}\left\{ \alpha + \frac{1}{(1-\beta)m}\sum_{i=1}^m \max\{L_i(w, b) - \alpha,\, 0\} \right\}.$$
Theorem 4 Let θ be a threshold for the loss. Suppose that the random vector $(\mathrm{X}, \mathrm{Y})$ has a bounded support in the sense that X lies in a ball of radius R centered at the origin, and that the m samples $(x_i, y_i)$ are independently drawn from $(\mathrm{X}, \mathrm{Y})$. Then, for any $w$ satisfying $\alpha_\beta^e(w, b) < \theta$ and $\|w\|_2^2 = 1$, the probability of the loss $\mathrm{L}(w, b) := -\mathrm{Y}(w^\top \mathrm{X} - b)$ being greater than θ, $\mathcal{P}\{\mathrm{L}(w, b) > \theta\}$, is bounded above as

$$\mathcal{P}\{\mathrm{L}(w, b) > \theta\} \;\leq\; 1 - \beta + G\big(\alpha_\beta^e(w, b) - \theta\big),$$

with probability at least $1 - \delta$, where

$$G(\gamma) := \sqrt{\frac{2}{m}\left( \frac{4c^2}{\gamma^2}(R^2 + 2)(R^2 + \theta^2 + 1)\log_2(2m) - 1 + \log\frac{2}{\delta} \right)},$$
and c is a constant.
Note that we here use $\mathcal{P}$ instead of P because we have to distinguish the unknown probability distribution $\mathcal{P}$ on a sample space $\bar{\Omega} \supset \Omega$ from the known probability P on Ω, which is dealt with everywhere else in this article. More precisely, we here take $p_i = 1/m$, $i = 1,\dots,m$, as P.
Corollary 1 Suppose the same assumptions as in Theorem 4. Then, for any $w$ satisfying $\phi_\beta^e(w, b) < \theta$ and $\|w\|_2^2 = 1$, the probability of the loss $\mathrm{L}(w, b)$ being greater than θ, $\mathcal{P}\{\mathrm{L}(w, b) > \theta\}$, is bounded above as

$$\mathcal{P}\{\mathrm{L}(w, b) > \theta\} \;\leq\; 1 - \beta + G\big(\phi_\beta^e(w, b) - \theta\big),$$

with probability at least $1 - \delta$.
These propositions indicate that the minimization of the empirical VaR or of the empirical CVaR, coupled with a constraint on the regularization term, leads to better out-of-sample performance of the prediction. This fact has been utilized for portfolio optimization in Gotoh and Takeda (2011, 2012).
It is interesting that VaR provides a tighter bound than CVaR does (since $\alpha_\beta^e(w, b) \leq \phi_\beta^e(w, b)$). However, we do not further study VaR-based classification in this article since empirical VaR minimization is far less computationally tractable than CVaR minimization.
4 Coherent risk-based classification
It is natural to extend the CVaR-based classification (15) to classifications based on other risk measures.
4.1 Fundamental properties of coherent risk measures
First of all, we overview the basic properties of the coherent measures of risk. In order to avoid unnecessary technicalities, we limit the following discussion to distributions on a finite sample space $\Omega = \{\omega_1,\dots,\omega_m\}$.
Definition 1 (Coherent measure of risk, Artzner et al. 1999) A risk measure r is said to be coherent if it satisfies the following axioms:

1. (monotonicity) $\mathrm{L}_1 \leq \mathrm{L}_2$ (i.e., $\mathrm{L}_1(\omega) \leq \mathrm{L}_2(\omega)$ for all $\omega \in \Omega$) $\Rightarrow$ $r[\mathrm{L}_1] \leq r[\mathrm{L}_2]$;
2. (translation invariance) $r[\mathrm{L} + a] = r[\mathrm{L}] + a$ for all L and $a \in \mathbb{R}$;
3. (positive homogeneity) $r[a\mathrm{L}] = a\,r[\mathrm{L}]$ for all $a \geq 0$;
4. (subadditivity) $r[\mathrm{L}_1] + r[\mathrm{L}_2] \geq r[\mathrm{L}_1 + \mathrm{L}_2]$ for all $\mathrm{L}_1, \mathrm{L}_2$.
CVaR is known to be a coherent measure (Pflug 2000; Rockafellar and Uryasev 2002). On the other hand, VaR is not coherent since it does not satisfy subadditivity, although it satisfies axioms 1 through 3. This fact indicates the nonconvexity of the VaR-minimizing classification even when the loss is given as a linear function of $(w, b)$.
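Axiom 4 is exactly where VaR fails. The following sketch, built on a standard-style two-loan default example (toy numbers, not from the paper), exhibits an empirical VaR that violates subadditivity while the empirical CVaR respects it:

```python
def var(losses, probs, beta, tol=1e-12):
    """Empirical VaR: min{ alpha : P{L <= alpha} >= beta }."""
    for a in sorted(set(losses)):
        if sum(p for l, p in zip(losses, probs) if l <= a) >= beta - tol:
            return a

def cvar(losses, probs, beta):
    """Empirical CVaR via its minimization formula (10)."""
    return min(a + sum(p * max(l - a, 0.0) for l, p in zip(losses, probs))
               / (1.0 - beta)
               for a in set(losses))

# Joint space of two independent loans, each losing 1 with probability 0.05.
omegas = [(l1, l2, p1 * p2)
          for l1, p1 in [(0.0, 0.95), (1.0, 0.05)]
          for l2, p2 in [(0.0, 0.95), (1.0, 0.05)]]
L1 = [o[0] for o in omegas]
L2 = [o[1] for o in omegas]
Lsum = [o[0] + o[1] for o in omegas]
p = [o[2] for o in omegas]

beta = 0.95
# VaR: pooling the two loans *increases* the measured risk.
var_gap = var(Lsum, p, beta) - (var(L1, p, beta) + var(L2, p, beta))
# CVaR: subadditivity holds.
cvar_gap = cvar(Lsum, p, beta) - (cvar(L1, p, beta) + cvar(L2, p, beta))
```

Here the 0.95-VaR of each loan is 0 while that of the pooled position is 1, so diversification appears to increase VaR; the corresponding CVaR values (approximately 1, 1 and 1.05) satisfy axiom 4.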
The following two are the simplest examples of coherent risk measures:

$$r[\mathrm{L}] = \mathrm{E}[\mathrm{L}] = \sum_{i=1}^m p_i L_i \;:\; \text{expected loss},$$
$$r[\mathrm{L}] = \max\{L_1,\dots,L_m\} \;:\; \text{maximum loss}.$$

We should recall that CVaR can be viewed as a generalization of the expected loss and the maximum loss, i.e., CVaR includes both of them as two special cases: $\beta = 0$ gives the expected loss while $\beta > 1 - \min_{i=1,\dots,m} p_i$ gives the maximum loss.
Another risk measure of interest is the so-called mean absolute semi-deviation(MASD):
r[L] = E[L] + λ E[max{L − E[L], 0}]
     = ∑_{i=1}^m p_i L_i + λ ∑_{i=1}^m p_i max{ L_i − ∑_{h=1}^m p_h L_h, 0 },
where λ ≥ 0. MASD is coherent for λ ∈ [0, 1], but it represents a downside risk measure even for λ > 1; see Fisher (2001) for the coherence of MASD. Obviously, MASD reduces to the expected loss at λ = 0. The minimization of MASD can be viewed as the MAD (mean-absolute deviation) model (Konno and Yamazaki 1991), which minimizes
E[L] + λ E[|L − E[L]|] with λ ∈ [0, 1/2] : MAD
because the absolute semi-deviation E[max{L − E[L], 0}] is equal to half of the absolute deviation E[|L − E[L]|]. It is noteworthy that MASD with λ ∈ [0, 1] is an SSD-consistent risk measure (Ogryczak and Ruszczyński 1999), as is CVaR.
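The identity behind this reduction, E[max{L − E[L], 0}] = (1/2) E[|L − E[L]|], follows because deviations above the mean sum to the same amount as deviations below it; a quick numerical check on arbitrary toy losses:

```python
losses = [2.0, -1.0, 5.0, 0.0, 4.0]   # arbitrary toy losses, uniform probabilities
m = len(losses)
mean = sum(losses) / m
semi_dev = sum(max(l - mean, 0.0) for l in losses) / m   # E[max{L - E[L], 0}]
abs_dev = sum(abs(l - mean) for l in losses) / m         # E[|L - E[L]|]
assert abs(semi_dev - 0.5 * abs_dev) < 1e-12             # semi-deviation is half
```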
For other SSD-consistent coherent risk measures, see Krokhmal (2007), in which higher-moment coherent risk measures and an algorithm for solving the associated nonlinear conic programs are studied. Other coherent risk measures are studied by Chen and Wang (2008) and Delbaen (2002), for example.
Any coherent measure of risk is known to have an explicit representation as follows.
Theorem 5 (Representation theorem for coherent measures of risk; e.g., Artzner et al. 1999) A risk measure r is a coherent measure of risk if and only if there exists a set Q ⊂ Δ_m such that

r[L] = sup_{q∈Q} ∑_{i=1}^m q_i L_i.   (19)
Noting that sup{∑_{i=1}^m q_i L_i : q ∈ Q} = max{∑_{i=1}^m q_i L_i : q ∈ Q̄}, where Q̄ is the convex hull of Q, Theorem 5 indicates that any coherent measure can be characterized by a compact convex set Q ⊂ Δ_m. For example, the expected loss is characterized by a single point Q = {p}, whereas the maximum loss is characterized by the whole space of probability distributions Q = Δ_m. In this sense, the expected loss and the maximum loss are extreme cases. In particular, note that the expected loss with Q = {p} attains the smallest value among the coherent measures whose Q contains p, and that the maximum loss attains the largest value among all coherent measures.
CVaR has the dual representation with Q given by
Q_CVaR := { q ∈ R^m : e_m^⊤ q = 1, 0 ≤ q ≤ p/(1 − β) } for β ∈ [0, 1).   (20)
It is not hard to see that Q_CVaR is monotonically increasing in β, and that Q_CVaR contains the point p and is contained in Δ_m for any β ∈ [0, 1). On the other hand, MASD is characterized by
Q_MASD = { q ∈ R^m : q = p + u − (e_m^⊤ u) p, 0 ≤ u ≤ λ p } for λ ∈ [0, 1].
Q_MASD coincides with {p} when λ = 0, and is monotonically increasing in λ. In addition to the risk measures mentioned above, we can define a coherent risk measure by employing any closed convex set Q in Δ_m. For example, consider the set defined by
Q_En = { q ∈ Δ_m^+ : ∑_{i=1}^m q_i ln(q_i/p_i) ≤ C } for p ∈ Δ_m^+, C > 0,
i.e., the collection of probability distributions whose entropy relative to a reference probability p ∈ Δ_m^+ is at most C. This set also characterizes a coherent risk measure owing to the closed convexity of Q_En. In this manner, we can create an infinite number of coherent measures.
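For Q_CVaR, the maximization in (19) is a continuous-knapsack LP and can be solved greedily by loading probability mass onto the largest losses up to the caps p_i/(1 − β). The sketch below, on toy data, checks that this dual value coincides with the primal Rockafellar-Uryasev value:

```python
def cvar_dual(losses, probs, beta):
    """Maximize sum q_i l_i over Q_CVaR = {q : sum q = 1, 0 <= q <= p/(1-beta)}:
    a continuous knapsack, filled greedily from the largest loss down."""
    caps = [p / (1.0 - beta) for p in probs]
    budget, value = 1.0, 0.0
    for i in sorted(range(len(losses)), key=lambda i: losses[i], reverse=True):
        take = min(caps[i], budget)
        value += take * losses[i]
        budget -= take
        if budget <= 0.0:
            break
    return value

losses, probs, beta = [1.0, 2.0, 3.0, 4.0], [0.25] * 4, 0.5
# Primal CVaR (Rockafellar-Uryasev), scanning the breakpoints alpha in losses:
primal = min(a + sum(p * max(l - a, 0.0) for p, l in zip(probs, losses))
             / (1.0 - beta) for a in losses)
assert abs(cvar_dual(losses, probs, beta) - primal) < 1e-9   # both equal 3.5
```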
A distributionally robust extension. An important subclass of the dual representation (19) is given by a closed convex set defined with a reference probability p, denoted by Q(p). Obviously, CVaR and MASD belong to this case since Q_CVaR and Q_MASD contain p for any β ∈ [0, 1) and λ ≥ 0, respectively. The uniform probability e_m/m is usually adopted as the reference probability p, but such a choice of p is subject to uncertainty. For example, there may be a situation where several candidates p^1, . . . , p^K for p are possible. To cope with the uncertainty, the robust optimization approach usually seeks the best response to the worst case. Namely, instead of Q(p) with a single p, it employs as Q the union of Q(p) over p ∈ P, where P is a given set of uncertain reference probabilities, i.e.,
Q(P) := ∪_{p∈P} Q(p), where P is any set satisfying P ⊂ Δ_m.
The union Q(P) can be a nonconvex set, and the maximization in the dual representation (19) can then result in a nonconvex optimization. However, employing the convex hull of Q(P) in the dual representation provides a coherent risk measure, i.e.,
max_{p∈P} max_{q∈Q(p)} ∑_{i=1}^m q_i L_i = max { ∑_{i=1}^m q_i L_i : q ∈ ∪_{p∈P} Q(p) }.
In this sense, the robust version of a coherent measure of risk is itself another coherent measure.
This type of robust optimization modeling is called distributionally robust optimization. SVMs have been connected to robust modelings [e.g., Ch. 12 of Ben-Tal et al. (2009), Ch. 5 of Xanthopoulos et al. (2013) and Caramanis et al. (2012)], but much of the existing research deals with the uncertainty in measurement of the given data x_i and relates the modeling to the regularization term ‖w‖. In contrast, the distributionally robust modeling addresses the uncertainty of the reference probability p. Recently, Wang (2012) has presented a distributionally robust version of the CVaR-based classification formulation, in which P consists of several candidates for p.
In the context of financial portfolio selection, Zhu and Fukushima (2009) examine a distributionally robust CVaR minimization and show that if the uncertainty set P is given by either of the following:
P = { p ∈ R^m : p = p̄ + η, e_m^⊤ η = 0, η^L ≤ η ≤ η^U } for p̄ ∈ Δ_m, η^L, η^U ∈ R^m,
P = { p ∈ R^m : p = p̄ + Aη ≥ 0, e_m^⊤ Aη = 0, η^⊤ η ≤ 1 } for p̄ ∈ Δ_m, A ∈ R^{m×m},
the resulting portfolio optimization leads to a tractable convex optimization. We can follow the same line as theirs in the context of machine learning presented in the next subsection.
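With a finite candidate set P = {p^1, . . . , p^K}, the distributionally robust CVaR is simply the worst of the individual CVaRs. A minimal sketch with two illustrative reference probabilities:

```python
def cvar(losses, probs, beta):
    # Rockafellar-Uryasev formula; the minimum over alpha is attained
    # at one of the observed losses.
    return min(a + sum(p * max(l - a, 0.0) for p, l in zip(probs, losses))
               / (1.0 - beta) for a in losses)

losses = [1.0, 2.0, 3.0, 4.0]
candidates = [[0.25, 0.25, 0.25, 0.25],   # uniform reference probability
              [0.10, 0.20, 0.30, 0.40]]   # a skewed alternative
# Worst case over the uncertainty set: still a coherent measure, characterized
# by the convex hull of the union of the individual dual sets Q(p).
robust = max(cvar(losses, p, 0.5) for p in candidates)
assert abs(robust - 3.8) < 1e-9   # skewed candidate is the worst case (3.8 > 3.5)
```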
4.2 Formulations and solution approaches of coherent risk-based classification
We are now in a position to provide a wide class of classification problems. Although the idea can be easily extended to other types of machine learning methods, we first show the formulation for the two-class classification version.
Along the lines of the discussion in Sect. 3, we employ the negative geometric margin −Y(w^⊤X − b)/‖w‖ as the loss L, and consider the following formulation:
minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b)/‖w‖ : q ∈ Q },   (21)
for a certain closed convex set Q ⊂ Δ_m. Needless to say, the CVaR-based formulation (15) can be represented as (21) with Q = Q_CVaR defined by (20).
Using the Charnes-Cooper transformation, the coherent risk-based classification (21) can be rewritten as
minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b) : q ∈ Q }
subject to ‖w‖ = 1.   (22)
Let us examine the conditions under which (22) attains optimality. It is natural to start with the maximum loss and the expected loss since, as mentioned earlier, they respectively attain the largest and smallest optimal values of (22) in terms of Q.
Theorem 6 (i) If Q = Δ_m, i.e., the maximum loss is employed as the risk measure, (22) has an optimal solution.
(ii) Suppose that Q = {p}, i.e., the expected loss is employed as the risk measure. Then, (22) has an optimal solution if and only if ∑_{i=1}^m p_i y_i = 0.
Proof First of all, recall that the expected loss and the maximum loss are special cases of CVaR with β = 0 and β > 1 − min_i p_i, respectively. The CVaR-based version of (22), i.e., (17), can be rewritten as
minimize_w g(w)
subject to ‖w‖ = 1,   (23)
where
g(w) := minimize_{b,α,z} α + (1/(1−β)) ∑_{i=1}^m p_i z_i
        subject to z_i − y_i b + α ≥ −y_i x_i^⊤ w, i = 1, . . . , m,
                   z ≥ 0.   (24)
Observe that LP (24) is feasible for any β ∈ [0, 1) and p ∈ Δ_m^+, and accordingly, either it has an optimal solution or it is unbounded. Using LP duality, a dual problem to (24) is given by
maximize_λ −∑_{i=1}^m y_i x_i^⊤ w λ_i
subject to ∑_{i=1}^m y_i λ_i = 0, ∑_{i=1}^m λ_i = 1, 0 ≤ λ ≤ p/(1 − β).   (25)
(i) If β > 1 − min_i p_i, (25) always has a feasible solution λ satisfying λ_i = 1/(2m_+) for y_i = 1 and λ_i = 1/(2m_−) for y_i = −1. In addition, the feasible region is bounded, and accordingly, (25) has an optimal solution. By the LP duality theorem, (24) then attains a (finite) optimal value g(w). Since g(w) is continuous and ‖w‖ = 1 constitutes a compact feasible region, (23) has an optimal solution. This implies that the maximum loss minimization has an optimal solution.
(ii) If β = 0, observe that LP (25) is feasible if and only if ∑_{i=1}^m y_i p_i = 0. Consequently, the proof is complete. □
It is noteworthy that this theorem is consistent with Lemma 2.2 of Gotoh and Takeda (2005), in which the CVaR-based problem (with ℓ2-norm) is shown to have a solution for β ∈ [1 − 2 min{∑_{i:y_i=+1} p_i, ∑_{i:y_i=−1} p_i}, 1).
Besides, the following corollary is straightforward from the proof of the second statement.
Corollary 2 Suppose that Q contains a reference probability p which satisfies ∑_{i=1}^m y_i p_i = 0. Then, (22) has an optimal solution.
Conversely, if ∑_{i=1}^m p_i y_i = 0 does not hold, the boundedness depends on the coherent measure used.
On the other hand, the above condition does not ensure the uniqueness of the solution. In fact, it is easy to see that in the case β = 0, any b can be optimal. In such a case, after an optimal solution (w*, b*) is obtained by solving the optimization problem, we can redetermine b (with w* fixed) so that, for example, the largest in-sample classification accuracy is attained.
Next let us consider the case where (22) has an optimal solution. The minimization(22) has a (positively homogeneous) convex objective function in (w, b), but has a
single nonconvex norm constraint. If we cannot deal with the nonconvex constraint directly, it is reasonable to solve the relaxed counterpart given by

minimize_{w,b} max_q { −∑_{i=1}^m q_i y_i (w^⊤x_i − b) : q ∈ Q }
subject to ‖w‖ ≤ 1.   (26)
Similarly to Gotoh and Takeda (2005), where only CVaR with the ℓ2-norm is treated, we obtain a condition under which a solution to (26) is optimal to (22).
Theorem 7 If the optimal value of (22) is negative, any optimal solution to (26) satisfies w ≠ 0 and is also optimal to (22). On the other hand, if the optimal value of (22) is positive, (26) results in a meaningless solution satisfying w = 0.
Proof For the first statement, it is sufficient to show that an optimal solution (w*, b*) to (26) satisfies ‖w*‖ = 1. Note that the negative optimal value of (22) implies that the optimal value of (26) is also negative. To the contrary, assume that ‖w*‖ < 1. Then we have

max_{q∈Q} { −∑_{i=1}^m q_i y_i ( (w*/‖w*‖)^⊤ x_i − b*/‖w*‖ ) } < max_{q∈Q} { −∑_{i=1}^m q_i y_i (w*^⊤ x_i − b*) } < 0.
This means that the feasible solution (w*/‖w*‖, b*/‖w*‖) attains a smaller objective value than (w*, b*) does, and therefore, ‖w*‖ = 1 must hold.
Next we consider the case where the optimal value of (22) is positive. Since (w, b) = (0, 0) is a feasible solution to (26), (26) has a feasible solution whose objective value is nonpositive. Suppose that there exists such a solution (w′, b′) satisfying w′ ≠ 0. However, this contradicts the existence of an optimal solution to (22) attaining a positive value, since (w′, b′)/‖w′‖ is feasible to (22) and attains a nonpositive value. □

Two-step framework. Owing to the observation in Theorem 7, if we can obtain an optimal solution w* to (26) which satisfies the convex constraint with equality, i.e., ‖w*‖ = 1, the solution is also optimal to (22). If the constraint is not satisfied with equality, there are two possible cases. In the case where 0 < ‖w*‖ < 1 holds, we can obtain an optimal solution to (22) by the scaling (w′, b′) ← (w*, b*)/‖w*‖. Otherwise (i.e., ‖w*‖ = 0), we have to cope with the nonconvexity of (22). This two-step framework is summarized in Algorithm 1.
Algorithm 1 Prototype of the Two-step Framework
  Solve the convex minimization (26), and let (w*, b*) be an optimal solution. [First step]
  if ‖w*‖ = 1 then
    Terminate; (w*, b*) is an optimal solution to (22).
  else if 0 < ‖w*‖ < 1 then
    Terminate; (w*, b*)/‖w*‖ is an optimal solution to (22).
  else
    Solve (22) (in a heuristic manner). [Second step]
  end if
It is possible to employ a (rigorous) global optimization algorithm so as to approach an optimal solution at the second step. However, such algorithms usually spend an impractically large amount of computation time. Therefore, a reasonable choice is to
apply a heuristic algorithm which is expected to attain a near-optimal solution. One reasonable choice is the normalized linearization algorithm employed in Gotoh and Takeda (2005), where it is referred to as a cutting plane algorithm. Gotoh and Takeda (2005) report that the algorithm worked well, and we can follow the same line for the generalized problem (22). However, we introduce below a more computationally tractable strategy, which might be adequate for credit rating.

Incorporation of a priori information of experts and use of the ℓ1-norm regularization. When we apply the classification methods to credit scoring problems, we often impose (expected) sign conditions on the weights w. For example, the creditworthiness of a loan applicant should be increasing in his/her income, and that condition is achieved by imposing the constraint that the coefficient of the attribute 'income' be nonnegative. Accordingly, suppose that the signs of all attributes are arranged in advance so that they are nonnegatively related to the creditworthiness. It is then reasonable to impose the nonnegativity condition w ≥ 0.
In addition, we employ the ℓ1-norm for the norm constraint of (22). Note that this ℓ1-norm regularization corresponds to the case where the ℓ∞-norm is employed in measuring the geometric margin, as mentioned in Sect. 2.
When we impose the nonnegativity and the ℓ1-norm constraint simultaneously, the norm constraint ‖w‖ = 1 of (22) can be rewritten as the linear constraints e_m^⊤ w = 1, w ≥ 0. For example, the CVaR- and MASD-based classifications with this technique are, respectively, formulated as the LPs:
minimize_{w,b,α,z} α + (1/(1−β)) p^⊤ z
subject to z_i ≥ −y_i (w^⊤x_i − b) − α, i = 1, . . . , m,
           z ≥ 0, e_m^⊤ w = 1, w ≥ 0;   (27)

minimize_{w,b,z} −∑_{i=1}^m p_i y_i (w^⊤x_i − b) + λ ∑_{i=1}^m p_i z_i
subject to z_i ≥ −y_i (w^⊤x_i − b) + ∑_{h=1}^m p_h y_h (w^⊤x_h − b), i = 1, . . . , m,
           z ≥ 0, e_m^⊤ w = 1, w ≥ 0.   (28)
We call this strategy the nonnegative ℓ1-regularization. The advantages of employing this strategy are that (1) there is no need to treat the nonconvexity even when the optimal value is positive, and (2) users' knowledge on the expected signs can be incorporated. Besides, the use of the ℓ1-norm regularization is often recommended in the context of machine learning since it can lead to a sparse solution, i.e., a solution with many zero elements, and this function can be considered a form of variable selection. The nonnegativity is expected to further promote the sparsity. Accordingly, the above formulations, (27) and (28), have the advantage that model estimation and variable selection can be achieved at the same time via a single LP.
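As a concrete illustration, LP (27) can be passed to any LP solver. The sketch below uses scipy.optimize.linprog on a tiny synthetic two-class data set; the data, β = 0.5, and the uniform reference probability are illustrative assumptions, not the paper's experimental setting.

```python
import numpy as np
from scipy.optimize import linprog

def cvar_classify(X, y, beta, p=None):
    """Nonnegative l1-regularized CVaR classification, LP (27).
    Variable order: [w (n), b, alpha, z (m)]."""
    m, n = X.shape
    p = np.full(m, 1.0 / m) if p is None else np.asarray(p)
    c = np.concatenate([np.zeros(n + 1), [1.0], p / (1.0 - beta)])
    # z_i >= -y_i (w'x_i - b) - alpha  <=>  -y_i x_i'w + y_i b - alpha - z_i <= 0
    A_ub = np.hstack([-y[:, None] * X, y[:, None],
                      -np.ones((m, 1)), -np.eye(m)])
    A_eq = np.concatenate([np.ones(n), np.zeros(m + 2)])[None, :]   # e'w = 1
    bounds = [(0, None)] * n + [(None, None)] * 2 + [(0, None)] * m  # w, z >= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n], res

X = np.array([[2.0, 0.5], [1.5, 1.0], [-1.0, -0.5], [-2.0, -1.5]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, res = cvar_classify(X, y, beta=0.5)
# w lies on the simplex and separates this (separable) toy data set.
```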
5 Extensions to other machine learning methods
In the preceding sections, we have seen that financial risk measures can be applied to two-class classification problems. The idea can be straightforwardly extended to other types of statistical methods which are treated in the framework of SVMs.
5.1 Ordered multi-class classification
One useful and straightforward extension of the two-class problem is the ordered multi-class classification, where more than two classes are treated. It is applicable to practical credit rating problems, including rating bonds or their issuers (e.g., companies and governments) and scoring loan applicants.
In order to extend the modeling used so far, let us first modify the notation as follows. Let C := {1, . . . , K} be the set of ordered labels of K classes. Without loss of generality, we suppose that class K is the most creditworthy and that the creditworthiness decreases as the class label number decreases. The m data samples are supposed to belong to one of the classes, and we let k_i denote the class label of sample i.
Our task is simply described as follows: given a data set of m labeled samples (x_i, k_i), i = 1, . . . , m, we construct K − 1 parallel hyperplanes that separate the K classes as clearly as possible.
To simplify the notation employed below, let us define
y_{i,κ} := { +1 (k_i ≥ κ + 1); −1 (k_i ≤ κ) },  i = 1, . . . , m, κ ∈ C \ {K}.
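The binary labels y_{i,κ} can be generated mechanically from the class labels; a small illustrative sketch (K = 4 chosen arbitrarily):

```python
def make_labels(k, K):
    # y[i][kappa-1] = +1 if k_i >= kappa + 1, else -1, for kappa = 1, ..., K-1
    return [[1 if ki >= kappa + 1 else -1 for kappa in range(1, K)] for ki in k]

labels = make_labels([1, 2, 4, 3], K=4)
# e.g., a class-3 sample lies above hyperplanes 1 and 2 but below hyperplane 3:
assert labels[3] == [1, 1, -1]
```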
The ordered multi-class version of linear separability can then be introduced as follows: if there exists (w, b_1, . . . , b_{K−1}) satisfying
k_i ≥ κ + 1, i = 1, . . . , m, κ ∈ C \ {K} ⇒ w^⊤x_i > b_κ,
k_i ≤ κ,    i = 1, . . . , m, κ ∈ C \ {K} ⇒ w^⊤x_i < b_κ,
we say that the labeled data set {(x_i, k_i) : i = 1, . . . , m} is linearly separable. As in the two-class case, a loss associated with the geometric margin can be defined by
−y_{i,κ} (w^⊤x_i − b_κ)/‖w‖,
with an arbitrary norm ‖ · ‖ (see Fig. 5), and the coherent risk-based minimization is formulated as
minimize_{w,b_1,...,b_{K−1}} max { −∑_{i=1}^m ∑_{κ∈C\{K}} q_{i,κ} y_{i,κ} (w^⊤x_i − b_κ)/‖w‖ : q ∈ Q }.   (29)
Similarly to the two-class case, (29) can be rewritten as
minimize_{w,b_1,...,b_{K−1}} max { −∑_{i=1}^m ∑_{κ∈C\{K}} q_{i,κ} y_{i,κ} (w^⊤x_i − b_κ) : q ∈ Q }
subject to ‖w‖ = 1.
Fig. 5 Geometric margin of the ordered multi-class problem. With the setting in this subsection, every combination of a sample x_i and a hyperplane w^⊤x = b_κ is taken into account to define the loss. If a sample is correctly classified, its geometric margins to the K − 1 hyperplanes are all positive and, accordingly, its losses are all negative. On the other hand, if a sample is misclassified in terms of a hyperplane, the corresponding loss is positive. Therefore, if a point is misclassified in terms of two hyperplanes, as is the sample x_i located at the right-hand side of the figure, two positive losses are assigned to the point. In this figure, arrows are drawn only for the misclassified samples, and their lengths indicate the size of the positive losses, while negative losses are omitted
In particular, the CVaR-based formulation is given by
minimize_{w,b,α,z} α + (1/(1−β)) ∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} z_{i,κ}
subject to z_{i,κ} ≥ −y_{i,κ} (w^⊤x_i − b_κ) − α, i = 1, . . . , m, κ ∈ C \ {K},
           z ≥ 0, ‖w‖ = 1,   (30)
where β ∈ [0, 1), whereas the MASD-based formulation is given by
minimize_{w,b,z} −∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} y_{i,κ} (w^⊤x_i − b_κ) + λ ∑_{i=1}^m ∑_{κ∈C\{K}} p_{i,κ} z_{i,κ}
subject to z_{i,κ} ≥ −y_{i,κ} (w^⊤x_i − b_κ) + ∑_{h=1}^m ∑_{k∈C\{K}} p_{h,k} y_{h,k} (w^⊤x_h − b_k),
           i = 1, . . . , m, κ ∈ C \ {K},
           z ≥ 0, ‖w‖ = 1,   (31)
where λ ≥ 0.

As in the two-class problems (27) and (28), the smallest usable β for (30) and λ for (31) depend on the setting of p. It is not hard to see that if p satisfies ∑_{i=1}^m p_{i,κ} y_{i,κ} = 0 for all κ = 1, . . . , K − 1, then (30) and (31) attain optimal solutions at β = 0 and λ = 0, respectively. This condition does not hold in general. However, it is satisfied, for example, in the case where we employ the
uniform probability p_{i,κ} = 1/(m(K − 1)) and each class has the same population. Even in such a case, we can find the minimum β ∈ [0, 1) and λ ≥ 0 under which (30) and (31), respectively, have optimal solutions.
For solving the above optimization problems, the same regularization strategies can be adopted as in the two-class version. In the next section, we present numerical examples where the ordered multi-class classification is applied to credit rating problems by exploiting the nonnegative ℓ1-regularization.
5.2 Further extensions
Non-ordered multi-class classification. Even when the order of classes is not defined, the coherent risk-based classification can be applied in a couple of ways. A simple way of achieving a non-ordered K-class classification is to repeat K applications of the two-class classification, each defining a hyperplane w_κ^⊤x = b_κ which divides one class, say κ, from the other K − 1 classes (Vapnik 1995). Estimating K hyperplanes (i.e., w_κ^⊤x = b_κ, κ ∈ C) by assigning the label y_i = +1 to class κ and y_i = −1 to the others, the class-assigning rule is given by
f(x) := arg max_κ { w_κ^⊤x − b_κ : κ ∈ C }.
This framework can be built on the basis of the two-class problem discussed in the preceding sections, and the coherent risk-based version can be easily achieved. On the other hand, it requires solving K optimization problems. This contrasts with the single-optimization method posed in Sect. 5.1.
A single-optimization method can also be formulated for the non-ordered case as inthe ordered case above. Indeed, Bennett and Mangasarian (1993) define the piecewise-linear separability as follows: there exist {(wκ , bκ ) : κ ∈ C} such that
w_{k_i}^⊤ x_i − b_{k_i} > w_κ^⊤ x_i − b_κ, ∀i = 1, . . . , m; ∀κ ≠ k_i.
Based on this separability, the geometric margin can be naturally extended by
((w_{k_i} − w_κ)^⊤ x_i − (b_{k_i} − b_κ))/‖w_{k_i} − w_κ‖, for i = 1, . . . , m; κ ≠ k_i.
In this case, the coherent risk-based classification is formulated by
minimize_{(w_κ),(b_κ)} max { −∑_{i=1}^m ∑_{κ≠k_i} q_{i,κ} ((w_{k_i} − w_κ)^⊤ x_i − (b_{k_i} − b_κ))/‖w_{k_i} − w_κ‖ : q ∈ Q }.
Separation by nonlinear functions. Limiting our attention to the similarity between formulations of financial optimization and machine learning, we have so far dealt only with
linear classification, where the underlying classification hyperplane is given by a linear function of the attributes. Nevertheless, it is not hard to extend the discussion above to nonlinear cases.
It is known that SVMs have an advantage in treating highly nonlinear classification without increasing the complexity of the algorithm by employing the so-called kernel trick (see, e.g., Vapnik 1995; Schölkopf and Smola 2002). That technique can also be applied to some coherent risk-based modelings as long as the ℓ2-norm is employed. It is noteworthy that even if another norm is used, we can deal with the kernel-based nonlinearity along the lines of Mangasarian (2000).
The second way is more straightforward. When a practitioner builds a classification model for credit rating in practice, he/she is likely to adopt a specific nonlinearity. For example, he/she may think that the quadratics of two certain attributes should be incorporated, or that the log-transformation should be applied to some of the attributes. In such a case, he/she can explicitly incorporate the nonlinearity by adding the nonlinearly transformed variables (e.g., x_i x_j, ln x_j) as attributes and treating a linear model. Another advantage of this strategy is that we can add further conditions on the nonlinear functions. Konno and Kobayashi (2000), for example, impose the positive semidefiniteness condition on the quadratic functions. Similar techniques can be found in utility function approaches (e.g., Baourakis et al. 2009; Bugera et al. 2002).
Employing explicit nonlinearity in the above-mentioned way is preferred when the user has prior knowledge on the underlying model and wants to incorporate it into the model. On the other hand, the kernel method is preferred when the user wants to discover an unknown nonlinear relation between the rating and the attributes.
Regression and outlier detection. So far, we have revisited existing classification methods as financial risk minimizations. A similar interpretation can be obtained for other machine learning methodologies whenever a loss is defined.
Regression is another important machine learning methodology. Its main task is to find a model Y = w_0 + w^⊤X that explains the relation between Y and X. The ordinary least squares method (OLS) is the most famous criterion. In the SVM literature, the regression version is known as support vector regression (SVR), and its development has been parallel to that of SVC. ν-SVR is formulated as
minimize_{w,w_0,ρ,z} (1/2)‖w‖_2^2 + C ( −ρ + (1/(νm)) ∑_{i=1}^m z_i )
subject to z_i ≥ |y_i − w_0 − w^⊤x_i| + ρ, i = 1, . . . , m,
           z_i ≥ 0, i = 1, . . . , m,
where C > 0 and ν ∈ (0, 1] are user-defined parameters (see, e.g., Schölkopf and Smola 2002, Sect. 9.3). It is easy to see that this can be regarded as a regularized CVaR minimization with the loss defined by the absolute residuals |y_i − w^⊤x_i − w_0|, i = 1, . . . , m. Therefore, it can be interpreted as the minimization of the expectation of the largest 100(1 − β) percent of (absolute) residuals.
If we replace the loss by the residual raised to the s-th power, a new version of the CVaR minimization is obtained:
minimize_{α,w,w_0} α + (1/(1−β)) ∑_{i=1}^m p_i max{ |y_i − w^⊤x_i − w_0|^s − α, 0 } + C′‖w‖_t^t,   (32)
where s ∈ [1, ∞) and C′ ≥ 0. It is easy to see that (32) is a generalization of OLS (β = 0, C′ = 0, s = 2), the ridge regression (β = 0, C′ > 0, s = t = 2) and the lasso (β = 0, C′ > 0, s = 2, t = 1) (see, e.g., Hastie et al. 2001). Note that for s, t ≥ 1, (32) is a convex minimization. Its coherent risk-based version is given by
minimize_{w,w_0} max_{q∈Q} ∑_{i=1}^m q_i |y_i − w^⊤x_i − w_0|^s + C′‖w‖_t^t.
Another interesting class of machine learning methods is outlier detection (also known as one-class classification). This methodology seeks to find a small number of outlying samples, x′_1, . . . , x′_k, out of all the samples x_1, . . . , x_m. In order to define the outliers, we can employ various types of loss functions. One candidate for the loss is L_i(w) := −w^⊤x_i/‖w‖. In fact, the one-class ν-SVC (see, e.g., Schölkopf and Smola 2002, Sect. 8.3) is formulated as
minimize_{w,ρ,z} (1/2)‖w‖_2^2 − νρ + (1/m) ∑_{i=1}^m z_i
subject to z_i ≥ −w^⊤x_i + ρ, i = 1, . . . , m,
           z_i ≥ 0, i = 1, . . . , m.
By following the same line as in the two-class case in Sects. 3 and 4, we can see that the one-class ν-SVC is equivalent to a convex counterpart of the following CVaR minimization of the loss L_i(w) = −w^⊤x_i/‖w‖:
minimize_{α,w} α + (1/(1−β)) ∑_{i=1}^m p_i max{ −w^⊤x_i/‖w‖ − α, 0 }.
Its coherent risk-based version can be given by
minimize_w max_{q∈Q} { −∑_{i=1}^m q_i w^⊤x_i/‖w‖ }, or
minimize_w max_{q∈Q} { −∑_{i=1}^m q_i w^⊤x_i } subject to ‖w‖ = 1.
The support vector domain description (SVDD) (Tax and Duin 1999) is another outlier detection method, formulated as
minimize_{ρ,c} −ρ + (1/ν) ∑_{i=1}^m p_i max{ ‖x_i − c‖_2^2 + ρ, 0 },
with p_i = 1/m. It is easy to see that this is the CVaR minimization of the loss L_i(c) := ‖x_i − c‖_2^2. In place of the squared Euclidean distance, another loss can be employed, e.g., L_i(c) := ‖x_i − c‖ or L_i(c) := ‖x_i − c‖_p^p for p ∈ [1, ∞). It is noteworthy that these CVaR minimizations result in convex minimizations for any configuration of the samples x_1, . . . , x_m.
Its coherent risk-based version can be given by
minimize_c max_{q∈Q} ∑_{i=1}^m q_i ‖x_i − c‖, or minimize_c max_{q∈Q} ∑_{i=1}^m q_i ‖x_i − c‖_p^p,

where p ∈ [1, ∞). It is noteworthy that these formulations are both convex minimizations for any norm ‖ · ‖ and any Q ⊂ Δ_m.
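As a simple illustration of loss-based outlier detection, fix the center c (here at the sample mean, an illustrative simplification rather than the SVDD optimum) and flag the samples whose loss ‖x_i − c‖_2^2 falls in the upper (1 − β) tail, i.e., exceeds the empirical VaR:

```python
def flag_outliers(xs, beta):
    m = len(xs)
    c = [sum(x[j] for x in xs) / m for j in range(len(xs[0]))]   # sample mean
    loss = [sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for x in xs]
    var = sorted(loss)[int(beta * m)]          # empirical beta-quantile (VaR)
    return [i for i, l in enumerate(loss) if l > var]

xs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [5.0, 5.0]]  # toy data
assert flag_outliers(xs, beta=0.75) == [4]     # only the far point is flagged
```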
6 Numerical examples
In this section, we demonstrate how the developed methods work in corporate bond rating. In particular, we examine the CVaR- and MASD-based linear models for (i) the two-class classification, (27) and (28), and (ii) the ordered six-class classification, (30) and (31), where the nonnegative ℓ1-regularization is employed so as to reduce all these optimization problems to LPs. The data set consists of financial indices and ratings of bonds of non-financial companies that were listed on the first section of the Tokyo Stock Exchange; the ratings were given by Rating and Investment Information, Inc. (R&I). We used two data sets, named the 2011 data and the 2012 data, including 405 and 393 companies, respectively. In the experiment, the 2011 data was used for tuning the parameters β and λ and for estimating models, whereas the 2012 data was used for testing the estimated models under the tuned parameters.
Table 1 summarizes the numbers of ratings of the companies. Since the number of companies was limited, we grouped them into six classes as shown in the table so as to implement the ordered six-class classification. For the two-class case, we further divided the six classes into the lowest two classes, which are considered speculative, and the other four classes.
Before applying the linear classification models, we preprocessed the attribute data (i.e., financial indices) as follows. First, the signed logarithm transformation was applied to each attribute, i.e., x_j → ln(1 + x_j) if x_j ≥ 0; x_j → −ln(1 − x_j) if x_j < 0. Next, we centered and normalized the transformed data, and truncated outlying samples by applying the following transformation to each normalized attribute x_j: min{x_j, 4} for x_j ≥ 0; max{x_j, −4} for x_j < 0.
In addition, signs of attributes were arranged so that the value of each attributewould be nondecreasing in the class labels, i.e., the larger the attribute, the higher theclass label.
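The preprocessing pipeline described above (signed log transform, standardization, truncation at ±4) can be sketched as follows; the input column is artificial:

```python
import math

def preprocess(col):
    # signed log transform: ln(1 + x) for x >= 0, -ln(1 - x) for x < 0
    logged = [math.log(1 + x) if x >= 0 else -math.log(1 - x) for x in col]
    mean = sum(logged) / len(logged)
    std = (sum((v - mean) ** 2 for v in logged) / len(logged)) ** 0.5
    # center, normalize, and truncate outlying values at +/-4
    return [min(max((v - mean) / std, -4.0), 4.0) for v in logged]

col = preprocess([0.1] * 18 + [-0.2] + [1e6])   # last entry: artificial outlier
assert col[-1] == 4.0                           # outlier truncated at 4
```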
In order to tune the parameter β for the CVaR-based methods, (27) and (30), and λ for the MASD-based methods, (28) and (31), we employed the so-called leave-one-out cross validation (LOO) using the 2011 data. For example, for β of CVaR, we prepared
Table 1 Rating and class separation
The ratings were given at the ends of March 2011 and 2012
twenty-one candidates: 0.01, 0.05, 0.10, . . . , 0.90, 0.95, 0.99. At each β, we repeated the following procedure. Leaving one sample out of the 405 samples for validation, we estimated a classification model using the remaining 404 samples and checked whether the estimated model could predict the class of the validation sample. Repeating this procedure 405 times by alternating the validation samples, the average rate of accurate prediction was computed. Repeating the above procedure for all the candidate βs, we picked the β that attained the best accuracy. As for MASD, the same procedure was adopted, where the parameter λ ≥ 0 was transformed as λ = τ/(1 − τ)
with τ ∈ [0, 1), and the same twenty-one candidates for τ as those for β of the CVaR-based method were examined. Note that MASD with τ > 0.5 does not correspond to a coherent measure, but it is still reasonable to employ it in the classification problem in the sense that MASD with larger τ puts more emphasis on the downside deviation.
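The LOO procedure itself is model-agnostic; the sketch below wires it to a trivial nearest-centroid classifier standing in for the LP-based models (the data and the classifier are illustrative only):

```python
def loo_accuracy(X, y, fit, predict):
    hits = 0
    for i in range(len(X)):                      # leave sample i out
        model = fit(X[:i] + X[i+1:], y[:i] + y[i+1:])
        hits += predict(model, X[i]) == y[i]     # validate on the held-out sample
    return hits / len(X)

def fit(X, y):                                   # class centroids
    return {lab: [sum(c) / len(c) for c in
                  zip(*[x for x, yi in zip(X, y) if yi == lab])]
            for lab in set(y)}

def predict(cents, x):                           # nearest centroid
    return min(cents, key=lambda lab:
               sum((a - b) ** 2 for a, b in zip(x, cents[lab])))

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]]
y = [-1, -1, 1, 1]
acc = loo_accuracy(X, y, fit, predict)           # 1.0 on this toy data
```

In the experiment, the role of `fit` is played by solving the LP at a given candidate β (or τ), and the whole loop is repeated for every candidate value.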
We should remember that the existence of optimal solutions depends also on the reference probability p. It is natural to employ the uniform probability, i.e., p_i = 1/m, for the CVaR- and MASD-based methods, but it may result in an unbounded solution at low βs and λs. As discussed earlier, if a weighted probability satisfying ∑_i y_i p_i = 0 (for the two-class case) or ∑_i y_{i,κ} p_{i,κ} = 0 for κ = 1, . . . , 5 (for the six-class case) is employed, the CVaR- and MASD-based formulations have optimal solutions at any nonnegative β and λ, respectively. To this end, we employed weighted probabilities of the form:
[the two-class case]: p_i = 1/(2m_+) for y_i = +1; p_i = 1/(2m_−) for y_i = −1;
[the K-class case]: p_{i,κ} = 1/(2(K − 1) ∑_{k≥κ+1} m_k) for (i, κ) : k_i ≥ κ + 1; p_{i,κ} = 1/(2(K − 1) ∑_{k≤κ} m_k) for (i, κ) : k_i ≤ κ.
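For instance, the two-class weights can be computed as below; the resulting p satisfies the balance condition ∑_i y_i p_i = 0, which by Corollary 2 guarantees an optimal solution at any β (the label vector is illustrative):

```python
def weighted_probs(y):
    # each class receives total probability 1/2, split uniformly within the class
    m_pos = sum(1 for yi in y if yi == +1)
    m_neg = len(y) - m_pos
    return [1.0 / (2 * m_pos) if yi == +1 else 1.0 / (2 * m_neg) for yi in y]

y = [+1, +1, +1, -1]
p = weighted_probs(y)
assert abs(sum(p) - 1.0) < 1e-12                           # a probability vector
assert abs(sum(yi * pi for yi, pi in zip(y, p))) < 1e-12   # balance condition
```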
Fig. 6 Average accuracy in validation of LOO. 'Exact' implies the rate of exactly predicting the true class label, whereas '±1' additionally includes the wrong but one-class-different predictions. a CVaR with β for the two-class case. b MASD with λ = τ/(1 − τ) for the two-class case. c CVaR with β for the six-class case. d MASD with λ = τ/(1 − τ) for the six-class case
Table 2 Variable selection (two-class case)

Category | No. | Financial indices | Ordered logit coef. | CVaR Uniform (β = 0.6) #/coef. | CVaR Weighted (β = 0.6) #/coef. | MASD Uniform (τ = 0.7) #/coef. | MASD Weighted (τ = 0.55) #/coef.
(a single number in a #/coef. cell is the ‘#’ count; its ‘coef.’ entry is blank)

Profitability | 1 | Operating Profit Margin | −0.067 | 0 | 0 | 0 | 0
 | 2 | Net Profit Margin | 0.121 | 0 | 0 | 0 | 0
 | 3 | After-Tax Profit Margin | 0.010 | 0 | 0 | 0 | 0
 | 4 | Gross Profit Margin | −0.064 | 405/0.125 | 405/0.085 | 405/0.093 | 0
 | 5 | Sales and General Administration Expense Margin | −0.065 | 0 | 0 | 0 | 0
 | 6 | Return on Capital Employed | 0.016 | 0 | 0 | 0 | 0
Safety | 7 | Quick Ratio (Acid Test Ratio) | | 0 | 0 | 0 | 0
 | 8 | Current Ratio | | 0 | 90 | 0 | 0
 | 9 | Capital Adequacy Ratio | | 0 | 376 | 0 | 0
 | 10 | Retained Earnings to Total Assets | 0.013 | 405/0.245 | 405/0.199 | 405/0.274 | 405/0.270
 | 11 | Debt to Total Assets Ratio | 0.007 | 0 | 0 | 0 | 0
 | 12 | Fixed Ratio | | 394 | 405/0.087 | 389 | 0
 | 13 | Cash Equivalents Sales Ratio | | 0 | 50 | 0 | 0
 | 14 | Liability Turnover Period | −0.028 | 0 | 0 | 0 | 0
 | 15 | Current Liabilities Turnover Period | 0.017 | 0 | 0 | 0 | 0
 | 16 | Fixed Liabilities Turnover Period | | 0 | 0 | 0 | 0
 | 17 | Current Expense to Current Income Ratio | −0.066 | 0 | 0 | 0 | 0
 | 18 | Interest Expense to Interest Bearing Liability Ratio | | 0 | 2 | 0 | 0
 | 19 | Interest Expense to Sales Ratio | 0.014 | 0 | 0 | 0 | 0
 | 20 | Working Capital Ratio | −0.007 | 0 | 0 | 0 | 0
Solvency | 21 | Interest Coverage | −0.010 | 0 | 0 | 0 | 0
 | 22 | Operating Cash Flow to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
Efficiency | 23 | Total Capital Turnover Ratio | 0.016 | 0 | 0 | 0 | 0
 | 24 | Asset Turnover Period | −0.067 | 0 | 0 | 0 | 0
 | 25 | Inventory Turnover Period | 0.003 | 401/0.007 | 405/0.007 | 113 | 0
 | 26 | Accounts Receivable Turnover Period | −0.004 | 0 | 34 | 0 | 0
 | 27 | Borrowed Indebtedness Turnover Period | | 0 | 0 | 0 | 0
 | 28 | Working Capital Turnover Period | | 0 | 0 | 0 | 0
 | 29 | Current Asset Turnover Period | −0.028 | 0 | 0 | 0 | 0
 | 30 | Fixed Asset Turnover Period | | 0 | 0 | 0 | 0
 | 31 | Nonoperating Income Ratio | −0.017 | 0 | 0 | 0 | 0
 | 32 | Nonoperating Expense Ratio | −0.026 | 0 | 0 | 0 | 0
 | 33 | Sales per Person | −0.007 | 0 | 0 | 0 | 0
 | 34 | Net Profit per Person | 0.013 | 386 | 403/0.009 | 2 | 0
 | 35 | Tangible Fixed Assets per Person | | 2 | 405/0.088 | 400 | 0
Size | 36 | Total Capital | | 319 | 0 | 0 | 0
 | 37 | Total Assets | −0.086 | 405/0.624 | 405/0.524 | 405/0.633 | 405/0.730
 | 38 | Sales | 0.125 | 0 | 0 | 0 | 0
 | 39 | After-Tax Profit | −0.005 | 0 | 0 | 0 | 0
 | 40 | Cash Flow | | 0 | 0 | 0 | 0
Cash Flow | 41 | Operating Cash Flow to Sales Ratio | 0.054 | 117 | 6 | 1 | 0
 | 42 | Financing Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 43 | Operating Cash Flow to Total Assets Ratio | −0.008 | 0 | 0 | 0 | 0
 | 44 | Financing Cash Flow to Total Assets Ratio | 0.004 | 0 | 0 | 0 | 0
 | 45 | Free Cash Flow to Sales Ratio | −0.032 | 0 | 0 | 0 | 0
 | 46 | Free Cash Flow to Total Assets Ratio | | 0 | 0 | 0 | 0
 | 47 | Operating Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 48 | Free Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 49 | Operating Cash Flow to Liabilities Ratio | | 0 | 0 | 0 | 0
 | 50 | Free Cash Flow to Liabilities Ratio | | 0 | 0 | 0 | 0
Growth | 51 | Sales Growth Rate | | 0 | 0 | 0 | 0

The column ‘Ordered logit’ reports the coefficients estimated by the function polr of R, which is an implementation of the ordered logit analysis. The function step was employed for the variable selection. The blank parts in the column ‘coef.’ indicate the variables removed via the selection. The coefficients were normalized by the ℓ1-norm. Negative coefficients imply the opposite of the expected sign condition. The columns named ‘#’ for CVaR and MASD report the numbers of non-zero coefficients during the 405 LOO solutions at the selected parameters β and τ, whereas the columns named ‘coef.’ report the optimal solution obtained using the whole 405 samples after determining the parameters
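The ℓ1-normalization mentioned in the footnote to Table 2 can be sketched in a few lines (the helper name is ours):

```python
# l1-normalization of an estimated weight vector, so that the coefficients
# of different methods are reported on a comparable scale.

def l1_normalize(w):
    s = sum(abs(v) for v in w)
    return [v / s for v in w] if s > 0 else list(w)

w_hat = l1_normalize([0.4, 0.0, -1.2, 0.4])  # l1-norm of the input is 2.0
```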
Table 3 Variable selection (six-class case)

Category | No. | Financial indices | Ordered logit coef. | CVaR Uniform (β = 0.65) #/coef. | CVaR Weighted (β = 0.55) #/coef. | MASD Uniform (τ = 0.7) #/coef. | MASD Weighted (τ = 0.65) #/coef.
(columns follow the same conventions as in Table 2)

Profitability | 1 | Operating Profit Margin | −0.093 | 0 | 0 | 0 | 0
 | 2 | Net Profit Margin | 0.078 | 0 | 0 | 0 | 0
 | 3 | After-Tax Profit Margin | | 0 | 0 | 0 | 0
 | 4 | Gross Profit Margin | | 405/0.047 | 405/0.042 | 405/0.031 | 395
 | 5 | Sales and General Administration Expense Margin | | 0 | 0 | 0 | 0
 | 6 | Return on Capital Employed | 0.012 | 0 | 0 | 0 | 0
Safety | 7 | Quick Ratio (Acid Test Ratio) | −0.025 | 0 | 0 | 0 | 0
 | 8 | Current Ratio | | 0 | 0 | 0 | 0
 | 9 | Capital Adequacy Ratio | | 9 | 403/0.017 | 0 | 1
 | 10 | Retained Earnings to Total Assets | 0.017 | 405/0.177 | 405/0.155 | 405/0.181 | 405/0.173
 | 11 | Debt to Total Assets Ratio | 0.011 | 0 | 0 | 0 | 0
 | 12 | Fixed Ratio | | 0 | 0 | 0 | 0
 | 13 | Cash Equivalents Sales Ratio | | 0 | 0 | 0 | 0
 | 14 | Liability Turnover Period | −0.079 | 0 | 0 | 0 | 0
 | 15 | Current Liabilities Turnover Period | 0.069 | 0 | 0 | 0 | 0
 | 16 | Fixed Liabilities Turnover Period | 0.042 | 0 | 0 | 0 | 0
 | 17 | Current Expense to Current Income Ratio | | 0 | 0 | 0 | 0
 | 18 | Interest Expense to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
 | 19 | Interest Expense to Sales Ratio | | 0 | 0 | 0 | 0
 | 20 | Working Capital Ratio | | 1 | 0 | 0 | 0
Solvency | 21 | Interest Coverage | −0.006 | 0 | 0 | 0 | 0
 | 22 | Operating Cash Flow to Interest Bearing Liability Ratio | | 0 | 0 | 0 | 0
Efficiency | 23 | Total Capital Turnover Ratio | | 0 | 0 | 0 | 0
 | 24 | Asset Turnover Period | −0.056 | 0 | 0 | 0 | 0
 | 25 | Inventory Turnover Period | 0.010 | 405/0.041 | 405/0.041 | 43 | 405/0.018
 | 26 | Accounts Receivable Turnover Period | −0.010 | 0 | 0 | 0 | 0
 | 27 | Borrowed Indebtedness Turnover Period | | 0 | 0 | 0 | 0
 | 28 | Working Capital Turnover Period | 0.020 | 0 | 0 | 0 | 0
 | 29 | Current Asset Turnover Period | −0.062 | 0 | 0 | 0 | 0
 | 30 | Fixed Asset Turnover Period | | 0 | 0 | 0 | 0
 | 31 | Nonoperating Income Ratio | −0.016 | 0 | 0 | 0 | 0
 | 32 | Nonoperating Expense Ratio | −0.016 | 0 | 0 | 0 | 0
 | 33 | Sales per Person | −0.015 | 0 | 0 | 0 | 0
 | 34 | Net Profit per Person | 0.018 | 405/0.058 | 405/0.048 | 405/0.027 | 401/0.006
 | 35 | Tangible Fixed Assets per Person | | 0 | 0 | 0 | 0
Size | 36 | Total Capital | −0.052 | 0 | 0 | 0 | 0
 | 37 | Total Assets | | 405/0.597 | 405/0.606 | 405/0.675 | 405/0.681
 | 38 | Sales | 0.130 | 0 | 0 | 0 | 0
 | 39 | After-Tax Profit | | 0 | 0 | 0 | 0
 | 40 | Cash Flow | 0.008 | 0 | 0 | 0 | 0
Cash Flow | 41 | Operating Cash Flow to Sales Ratio | | 1 | 1 | 0 | 0
 | 42 | Financing Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 43 | Operating Cash Flow to Total Assets Ratio | 0.016 | 0 | 0 | 0 | 0
 | 44 | Financing Cash Flow to Total Assets Ratio | | 0 | 0 | 0 | 0
 | 45 | Free Cash Flow to Sales Ratio | | 0 | 0 | 0 | 0
 | 46 | Free Cash Flow to Total Assets Ratio | −0.033 | 0 | 0 | 0 | 0
 | 47 | Operating Cash Flow to Current Liabilities Ratio | 0.039 | 405/0.081 | 405/0.092 | 405/0.087 | 405/0.122
 | 48 | Free Cash Flow to Current Liabilities Ratio | | 0 | 0 | 0 | 0
 | 49 | Operating Cash Flow to Liabilities Ratio | −0.042 | 0 | 0 | 0 | 0
 | 50 | Free Cash Flow to Liabilities Ratio | 0.019 | 0 | 0 | 0 | 0
Growth | 51 | Sales Growth Rate | −0.005 | 0 | 0 | 0 | 0
Table 4 In-sample versus out-of-sample accuracy (two-class case) (%)
In-sample (2011) Out-of-sample (2012)
Logit CVaR MASD Logit CVaR MASD
Uniform Weighted Uniform Weighted Uniform Weighted Uniform Weighted
All 90.4 84.4 84.7 85.2 83.7 47.3 84.2 79.9 84.7 82.7
1   83.3 67.5 87.5 70.8 85.0 29.8 67.7 77.4 70.2 83.9
2   93.3 91.6 83.5 91.2 83.2 55.4 91.8 81.0 91.4 82.2
Table 5 In-sample versus out-of-sample accuracy (six-class case) (%)
In-sample (2011) Out-of-sample (2012)
Logit CVaR MASD Logit CVaR MASD
Uniform Weighted Uniform Weighted Uniform Weighted Uniform Weighted
Exact All 58.3 53.8 50.1 49.1 50.6 23.2 47.6 45.5 48.6 44.8
Class 1 54.9 62.0 90.1 47.9 91.5 34.3 55.7 84.3 41.4 84.3
Class 2 55.1 22.4 12.2 20.4 12.2 14.8 25.9 20.4 29.6 14.8
Class 3 57.7 55.1 23.1 62.8 24.4 11.0 37.8 20.7 56.1 18.3
Class 4 54.0 52.9 35.6 54.0 36.8 9.6 53.0 34.9 57.8 37.3
Class 5 47.2 33.3 13.9 11.1 11.1 13.2 26.3 5.3 15.8 2.6
Class 6 72.6 73.8 94.0 65.5 94.0 56.1 74.2 92.4 69.7 93.9
±1 All 92.6 88.4 82.5 86.9 80.0 48.3 86.0 78.9 86.3 77.1
Class 1 88.7 81.7 95.8 76.1 93.0 44.3 72.9 87.1 71.4 87.1
Class 2 89.8 91.8 91.8 91.8 91.8 51.9 92.6 92.6 92.6 92.6
Class 3 97.4 92.3 61.5 96.2 55.1 40.2 87.8 58.5 92.7 54.9
Class 4 92.0 90.8 71.3 92.0 70.1 37.3 88.0 66.3 90.4 63.9
Class 5 86.1 88.9 88.9 86.1 83.3 63.2 86.8 86.8 84.2 81.6
Class 6 96.4 85.7 94.0 79.8 94.0 65.2 89.4 95.5 84.8 95.5
‘Exact’ implies the rate of exactly predicting the true class label, whereas ‘±1’ additionally includes the wrong but one-class-different predictions
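The two accuracy notions used in Tables 4 and 5 can be stated compactly (a sketch with hypothetical labels; the helper names are ours):

```python
# 'Exact' counts predictions matching the true class label; '+-1' also
# counts predictions that are off by exactly one class.

def exact_accuracy(true, pred):
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def within_one_accuracy(true, pred):
    return sum(abs(t - p) <= 1 for t, p in zip(true, pred)) / len(true)

true, pred = [1, 2, 3, 4, 5, 6], [1, 3, 3, 6, 5, 5]  # hypothetical labels
```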
Figure 6 shows the accuracy ratio in the LOO method. In the two-class case, the best βs were achieved at 0.6 both with the uniform probability and the weighted probability, while the best τs were achieved at τ = 0.70 (uniform probability) and τ = 0.55 (weighted probability). In the six-class case, the best βs were achieved at 0.65 (uniform probability) and 0.55 (weighted probability), while the best τs were achieved at τ = 0.70 (uniform probability) and τ = 0.65 (weighted probability).
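The LOO validation over the candidate parameters can be written generically as follows; `train_and_predict` stands in for the actual CVaR-/MASD-based training routine (not reproduced here), and the nearest-neighbour stand-in is only to make the sketch self-contained:

```python
# Sketch of the leave-one-out (LOO) validation used to select the parameter
# (beta for the CVaR-based method, tau for the MASD-based one).

def loo_accuracy(samples, labels, param, train_and_predict):
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += train_and_predict(train_x, train_y, param, samples[i]) == labels[i]
    return hits / len(samples)

def pick_best(samples, labels, candidates, train_and_predict):
    # evaluate every candidate parameter and keep the most accurate one
    return max(candidates,
               key=lambda p: loo_accuracy(samples, labels, p, train_and_predict))

# toy stand-in classifier: 1-nearest neighbour, ignoring the parameter
def nn_predict(train_x, train_y, param, x):
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[j]

best = pick_best([0.0, 0.1, 0.9, 1.0], [-1, -1, +1, +1], [0.55, 0.60], nn_predict)
```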
We see that for the two-class case, both probability settings achieved almost the same highest accuracy, while for the six-class case, the uniform probability achieved higher accuracy under the best parameters.
Tables 2 and 3 show the estimated coefficients of the attributes under the chosen parameters β or λ. Clearly, both the CVaR- and MASD-based classifications achieved
very sparse estimates of the weight vector. These sparse solutions are the consequence of the nonnegative ℓ1-regularization and can be considered a result of variable selection. On the other hand, the coefficients estimated via the ordered logit model include many negative values, implying that the estimated model was inconsistent with the experts' economic prior knowledge.
Tables 4 and 5 report the in-sample and out-of-sample accuracy of the two- and six-class classification results, respectively. We see from the tables that both the CVaR- and MASD-based methods achieved higher out-of-sample accuracy than the logit model. In particular, their out-of-sample performance is comparable to their in-sample performance, whereas the logit model showed the highest in-sample accuracy but a poor out-of-sample performance. This indicates that the logit model overfits the in-sample data in spite of the variable selection based on AIC. On the other hand, the variable selection via the CVaR- and MASD-based methods with the nonnegative ℓ1-regularization successfully avoided the overfitting.
The use of the uniform probability showed better performance than that of the weighted probability on average (see the row named ‘All’). However, the weighted probability showed better performance for some classes, which seems to be due to the different weights assigned to the classes.
7 Concluding remarks
In this article we have demonstrated how financial risk measures relate to machine learning methods, especially to SVMs. We pointed out in the first part that CVaR minimization has much to do with ν-SVMs. Indeed, both are based on a form of empirical risk minimization and aim at good out-of-sample performance. Owing to their high degree of affinity and small difference, we can develop a new class of machine learning methods. In particular, we introduced the class of coherent risk-based methods, whose formulations can be viewed as a straightforward extension of ν-SVMs. Numerical experiments demonstrate promising results for the extension. In fact, by employing both the ℓ1-norm regularization and the nonnegative weights, both the CVaR- and the MASD-based methods can be rewritten as LPs, and they successfully provided sparse estimates of the weight vector, achieving higher accuracy than the ordered logit model.
In this paper we have concentrated on the coherent measures, but a similar extension can be achieved on the basis of convex measures (Föllmer and Schied 2002). The extension along that direction is left for future research.
References
Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23:589–609
Artzner P, Delbaen F, Eber JM, Heath D (1999) Coherent measures of risk. Math Finance 9:203–228
Bahrammirzaee A (2010) A comparative survey of artificial intelligence applications in finance: artificial neural networks, expert system and hybrid intelligent systems. Neural Comput Appl 19:1165–1195
Baourakis G, Conisescu M, van Dijk G, Pardalos PM, Zopounidis C (2009) A multicriteria approach for rating the credit risk of financial institutions. Comput Manag Sci 6:347–356
Bennell JA, Crabbe D, Thomas S, Gwilym O (2006) Modelling sovereign credit ratings: neural networks versus ordered probit. Expert Syst Appl 30:415–425
Bennett KP, Bredensteiner E (2000) Geometry in learning. In: Gorini C (ed) Geometry at work. Mathematical Association of America, Washington, DC, pp 132–145
Bennett KP, Mangasarian OL (1993) Multicategory discrimination via linear programming. Optim Methods Softw 3:27–39
Ben-Tal A, El Ghaoui L, Nemirovski A (2009) Robust optimization. Princeton University Press, Princeton
Bredensteiner EJ, Bennett KP (1999) Multicategory classification by support vector machines. Comput Optim Appl 12:53–79
Brodie J, Daubechies I, De Mol C, Giannone D, Loris I (2009) Sparse and stable Markowitz portfolios. PNAS 106:12267–12272
Bugera V, Konno H, Uryasev S (2002) Credit cards scoring with quadratic utility function. J Multi Criteria Decis Anal 11:197–211
Caramanis C, Mannor S, Xu H (2012) Robust optimization in machine learning. In: Sra S, Nowozin S, Wright SJ (eds) Optimization for machine learning. The MIT Press, Cambridge, pp 369–402
Chen Z, Wang Y (2008) Two-sided coherent risk measures and their application in realistic portfolio optimization. J Banking Finance 32:2667–2673
Crook JN, Edelman DB, Thomas LC (2007) Recent developments in consumer credit risk assessment. Eur J Oper Res 183:1447–1465
Delbaen F (2002) Coherent risk measures on general probability spaces. In: Advances in finance and stochastics: essays in honour of Dieter Sondermann. Springer, Berlin, pp 1–37
DeMiguel V, Garlappi L, Nogales FJ, Uppal R (2009) A generalized approach to portfolio optimization: improving performance by constraining portfolio norms. Manag Sci 55:798–812
Erdal HI, Ekinci A (2012) A comparison of various artificial intelligence methods in the prediction of bank failures. Comput Econ. doi:10.1007/s10614-012-9332-0
Fishburn PC (1977) Mean-risk analysis with risk associated with below-target returns. Am Econ Rev 67:116–126
Fisher T (2001) Examples of coherent risk measures depending on one-sided moments. Discussion paper, Darmstadt University of Technology
Föllmer H, Schied A (2002) Convex measures of risk and trading constraints. Finance Stoch 6:429–447
Gotoh J, Takeda A (2005) A linear classification model based on conditional geometric score. Pac J Optim 1:277–296
Gotoh J, Takeda A (2011) On the role of norm constraints in portfolio selection. Comput Manag Sci 8:323–353
Gotoh J, Takeda A (2012) Minimizing loss probability bounds for portfolio selection. Eur J Oper Res 217:371–380
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
Huang Z, Chen H, Hsu CJ, Chen WH, Wu S (2004) Credit rating analysis with support vector machines and neural networks: a market comparative study. Decis Support Syst 37:543–558
Konno H, Kobayashi H (2000) Failure discrimination and rating of enterprises by semi-definite programming. Asia Pac Financ Mark 7:261–273
Konno H, Yamazaki H (1991) Mean-absolute deviation portfolio optimization model and its applications to Tokyo stock market. Manag Sci 37:519–531
Krokhmal P (2007) Higher moment coherent risk measures. Quant Finance 7:373–387
Mangasarian OL (1999) Arbitrary-norm separating plane. Oper Res Lett 24:15–23
Mangasarian OL (2000) Generalized support vector machines. In: Smola AJ, Bartlett PL, Schölkopf B, Schuurmans D (eds) Advances in large margin classifiers. The MIT Press, Cambridge, pp 135–146
Markowitz HM (1959) Portfolio selection: efficient diversification of investments. Wiley, New York
Ogryczak W, Ruszczyński A (1999) From stochastic dominance to mean-risk models: semideviations as risk measures. Eur J Oper Res 116:33–50
Ogryczak W, Ruszczyński A (2002) Dual stochastic dominance and related mean-risk models. SIAM J Optim 13:60–78
Perez-Cruz F, Weston J, Hermann DJL, Schölkopf B (2003) Extension of the ν-SVM range for classification. In: Suykens JAK, Horvath G, Basu S, Micchelli C, Vandewalle J (eds) Advances in learning theory: methods, models and applications, vol 190. IOS Press, Amsterdam, pp 179–196
Pflug GC (2000) Some remarks on the value-at-risk and the conditional value-at-risk. In: Uryasev S (ed) Probabilistic constrained optimization: methodology and applications. Springer, Berlin, pp 278–287
Rockafellar RT, Uryasev S (2000) Optimization of conditional value-at-risk. J Risk 2:21–41
Rockafellar RT, Uryasev S (2002) Conditional value-at-risk for general loss distributions. J Banking Finance 26:1443–1471
Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. The MIT Press, Cambridge
Schölkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12:1207–1245
Shin KS, Lee TS, Kim H (2005) An application of support vector machines in bankruptcy prediction model. Expert Syst Appl 28:127–135
Takeda A (2009) Generalization performance of ν-support vector classifier based on conditional value-at-risk minimization. Neurocomputing 72:2351–2358
Takeda A, Gotoh J, Sugiyama M (2010) Support vector regression as conditional value-at-risk minimization with application to financial time-series analysis. In: Proceedings of the 2010 IEEE international workshop on machine learning for signal processing
Tax DMJ, Duin RPW (1999) Support vector domain description. Pattern Recognit Lett 20:1191–1199
Thomas LC, Edelman DB, Crook JN (2002) Credit scoring and its applications. SIAM, Philadelphia
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Wang Y (2012) Robust ν-support vector machine based on worst-case conditional value-at-risk minimization. Optim Methods Softw 27:1025–1038
Xanthopoulos P, Pardalos PM, Trafalis TB (2013) Robust data mining. Springer, Berlin
Yajima Y (2005) Linear programming approaches for multicategory support vector machines. Eur J Oper Res 162:514–531
Zhu S, Fukushima M (2009) Worst-case conditional value-at-risk with application to robust portfolio management. Oper Res 57:1155–1168