Applying Complementary Credit Scores to Calculate Aggregate Ranking

Journal of Corporate Finance Research / Methods                                                                              Vol. 15 | № 3 | 2021

DOI: 10.17323/j.jcfr.2073-0438.15.3.2021.5-13.
JEL classification: C38, C44, G24, G34

Applying Complementary Credit
Scores to Calculate Aggregate
Ranking
Zinaida Seleznyova
Lecturer, Research Assistant,
HSE School of Finance, HSE Financial Engineering & Risk Management Lab, National Research University Higher
School of Economics, Moscow, Russia,
zseleznyova@hse.ru, ORCID

The journal is an open access journal which means that everybody can read, download, copy, distribute, print, search, or link to the full texts of these
articles in accordance with CC Licence type: Attribution 4.0 International (CC BY 4.0 http://creativecommons.org/licenses/by/4.0/).

5                                                                                                               Higher School of Economics

Abstract
Researchers have been improving credit scoring models for decades, as an increase in the predictive ability of scoring even
by a small amount can allow financial institutions to avoid significant losses. Many researchers believe that ensembles of
classifiers or aggregated scorings are the most effective. However, ensembles outperform base classifiers by only thousandths
of a percent on unbalanced samples.
This article proposes an aggregated scoring model. In contrast to previous models, its base classifiers are focused on
identifying different types of borrowers. We illustrate the effectiveness of such scoring aggregation on real unbalanced
data.
As the effectiveness indicator we use the performance measure of the area under the ROC curve. The DeLong, DeLong and
Clarke-Pearson test is used to measure the statistical difference between two or more areas. In addition, we apply a logistic
model of defaults (logistic regression) to the data of company financial statements. This model is usually used to identify
default borrowers. To obtain a scoring aimed at non-default borrowers, we employ a modified Kemeny median, which was
initially developed to rank companies with credit ratings. Both scores are aggregated by logistic regression.
Our data cover Russian banks that existed or defaulted between July 1, 2010, and July 1, 2015. This sample of banks is highly
unbalanced, with a concentration of defaults of about 5%. The aggregation was carried out for banks with several ratings.
We show that aggregated classifiers based on different types of information significantly improve the discriminatory power
of scoring even on an unbalanced sample. Moreover, the absolute value of this improvement surpasses all the values
previously obtained from unbalanced samples.
The aggregated scoring and the approach to its construction can be applied by financial institutions to credit risk assessment
and as an auxiliary tool in the decision-making process thanks to the relatively high interpretability of the scores.

Keywords: forward stepwise selection, logistic regression, Kemeny median, credit scoring, statistical learning
For citation: Seleznyova, Z. (2021) “Applying Complementary Credit Scores to Calculate Aggregate Ranking”, Journal of
Corporate Finance Research | ISSN: 2073-0438, 15(3), pp. 5-13. doi: 10.17323/j.jcfr.2073-0438.15.3.2021.5-13.


Introduction

Scoring models have been developing for decades. Researchers have proposed and compared different approaches to data preparation for model construction and approaches to selecting factors which influence credit quality and their generation. They have also studied the best approaches to assessing credit score capability/accuracy and the credit score methods themselves. This was done to improve scoring accuracy, insofar as a gain or loss of a percentage point in accuracy can lead to multimillion profits or losses for banks and other financial institutions [1].

Over the past ten years, scholars have believed that the best practice is to use machine learning models [2] and so-called “ensembles” [3] to construct credit scores. The basic idea of an ensemble lies in the aggregation of modelled base classifiers (scores) with the help of a model/algorithm. There exist different classifications of ensembles [3–5]; however, their division into bagging, boosting, and stacking ensembles is the most common. Bagging is the combination of several independent scorings (base classifiers, weak learners) constructed in a parallel way on the basis of independent random samples. Random forests are a well-known example of bagging. Boosting is the aggregation of several successively constructed base scorings. Stacking is the combination of different base classifiers (for example, logistic regression and a decision tree) that are trained simultaneously. They are combined in the ensemble model (strong learner), which includes different voting rules, statistical models and machine learning methods.

The ensemble paradigm makes ensembles relevant: several aggregated classifiers usually show greater discriminatory power/accuracy than a single classifier [5]. Nevertheless, some researchers have shown that ensemble models sometimes fail to surpass machine learning methods in regard to certain criteria [4; 6]. Also, their practical applicability is usually limited: in most cases, they are “black boxes” which are difficult to interpret because machine learning methods and other ensembles are often used as the base classifiers. Therefore, some researchers [7] have attempted to simplify the interpretability of ensemble models, including machine learning methods.

In this paper, we focus on using complementary weak learners to calculate aggregate rankings. We chose the logistic model of defaults and a modified Kemeny median [8] as two weak classifiers of this type due to their relatively high interpretability. We consider them to be complementary for our purposes for the following reasons:

The logistic model (regression) is usually trained for defining default borrowers using corporate financial statements. In other words, the first weak learner is focused on default borrowers.

The modified Kemeny median has been proposed as a tool for credit rating aggregation. Usually, companies which have a better-than-average creditworthiness want to have credit ratings in particular because they are ready to disclose to a rating agency more information than just financial statements. So, this ranking is potentially aimed at non-default companies.

We propose to use logistic regression as the strong learner. It should be noted that logistic regressions, including ridge and lasso, were used as ensemble models in [1; 9] and proved to be superior to other methods considered in these papers.

Our study is based on a sample of banks during the period between July 1, 2010, and July 1, 2015. This sample is characterized by a low default concentration of 5.76%. Financial performance indicators, identifiers of external or government support, and ratings of credit rating agencies were used to create rankings. It was shown that the aggregation of two base classifiers focused on the identification of different types of borrowers results in an improvement of the predictive power of aggregated credit scoring in comparison to base classifiers.

The interpretability of weak and strong learners makes it possible to use aggregate rankings not only as an additional parameter for decision making in financial institutions but also to evaluate default probability in risk management [10; 11]. The proposed weak learners constitute the scientific novelty of this paper: they were trained using potentially complementary information (ratings and financial statements). We know of only one similar study [12] that trained weak learners using market indicators and financial statements. However, the ensemble did not outperform the base classifier in discriminatory power [12].

Literature Review

The number of papers devoted to credit scoring methods has grown exponentially over the past 30 years [3]. In the last five years, researchers have continued their attempts to improve credit scoring for legal entities [13–15]. However, the majority of papers make use of databases of natural persons [16]. The reason is that such databases are in open access and available for parsing. These samples have been used to compare well-known approaches to credit scoring calculation [17] and propose new ones [18]. Different ensembles [18; 19] and logistic regressions [20] have been identified as the best scoring methods. In addition, papers dedicated to the comparison of well-known methods often consider neural networks [21] and decision trees [22] to be the best.

Such a diversity of best methods is partially explained by the wide range of simultaneously applied classification quality criteria. Many authors [4; 9] agree that it is better to use several model performance measures at once. Nevertheless, other researchers [23; 24] continue to apply only conventional measures calculated on the basis of an error (confusion) matrix.

In this paper, we propose looking at credit scoring aggregation from a slightly different perspective. Usually, only one type of data is used to create base scorings: financial statements or characteristics of natural persons [25] or company market indicators [13]. Indicator categories from financial statements complement each other, and machine learning methods can be applied to assess the nonlinear relations between them. However, the creditworthiness of a company may be characterized by factors that are recorded only partially or not at all in statements. Credit ratings may potentially complement the indicators of corporate financial statements: companies disclose more information to credit rating agencies (CRAs) than one can find in the public domain [26]. In addition, companies with a better creditworthiness, all other things being equal, tend to resort to CRAs: such companies are developing and need external ratings to expand into new markets, for example. Thus, one may conjecture that the collective opinion of credit rating agencies may complement information from financial statements.

In this paper, we will use classical logistic regression as the base classifier and as the aggregated model. This practice was applied in [9; 27], which showed the advantage of this approach over base classifiers.

In order to calculate base classifiers, a preliminary preparation of data is carried out. One of the stages of preliminary preparation is parameter selection by means of forward feature selection. Nevertheless, it is necessary to describe the data sample before we explain the methodology in detail. This is due to the fact that the choice of methods depends on the data.

Data

The main data pool comprises publicly available information on 958 banks for the period between July 1, 2010, and July 1, 2015, which represents approximately 80% of all banks operating in the Russian Federation during this period. 134 of these banks had two or more ratings calculated by seven credit rating agencies: AK&M, Expert RA (EXP), National Rating Agency (NRA), RusRating (RUS), Fitch Ratings (FCH), Moody’s Analytics (MDS), and Standard & Poor’s (SNP). This data pool was formally divided into three parts: data on banks up to July 2014, data on banks after July 2014, and data on banks with two or more credit ratings.

Data on banks up to and including July 2014 comprises 70% of the observations of the main pool or 13,570 observations. The default concentration is 4.6%. In terms of default/non-default observations, this sample is highly unbalanced. It comprises indicators from bank report forms 101 and 102 and statutory requirements information (form 135) posted on the website of the Bank of Russia¹ and information on support from the Russian government or foreign banks. This sample was used to train the logistic model of defaults.

Data on banks after July 2014 consists of 4,261 observations with a default concentration of 9.25%. The list of indicators was the same as in the sample described above. This sample was used to test the logistic model of defaults.

Data on banks with two or more credit ratings is part of the two samples described above. This sample consists of observations on 134 banks. The sample size is 1,700 observations, 17 of which are defaults. This sample is also unbalanced and has a default concentration of 2.72%. In addition to the indicators described above, it includes CRA ratings. For the purposes of creating scoring ratings, rating categories were assigned numerical values, where 0 was attributed to the highest rating category of each credit rating agency (CRA). Then, the numerical value of each lower category was increased by 1. As the last two columns of Table 1 show, the number of assigned rating categories varied greatly from agency to agency.
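
The two data-preparation conventions described in the Data section, the split of observations at July 2014 and the numerical coding of rating categories (0 for the highest category, plus 1 for each lower one), can be illustrated with a minimal Python sketch. The cutoff date of July 1, 2014, the record layout, the function names, and the four-letter rating scale are assumptions made purely for illustration; the paper gives neither the agencies' scales nor the author's code.

```python
from datetime import date

# Assumed cutoff: observations dated on or before it go to the training part.
CUTOFF = date(2014, 7, 1)


def out_of_time_split(observations, cutoff=CUTOFF):
    """Split records into (train, test) by date instead of at random."""
    train = [obs for obs in observations if obs["date"] <= cutoff]
    test = [obs for obs in observations if obs["date"] > cutoff]
    return train, test


def encode_scale(categories_best_to_worst):
    """Map rating labels to integers: 0 for the best, +1 per lower category."""
    return {label: rank for rank, label in enumerate(categories_best_to_worst)}


# Made-up scale and records (hypothetical, not any agency's actual scale).
scale = encode_scale(["A", "B", "C", "D"])
records = [
    {"bank": "bank_1", "date": date(2013, 1, 1), "rating": "B"},
    {"bank": "bank_2", "date": date(2014, 7, 1), "rating": "A"},
    {"bank": "bank_3", "date": date(2015, 1, 1), "rating": "D"},
]

train, test = out_of_time_split(records)
encoded = [scale[obs["rating"]] for obs in records]
```

Because each agency gets its own mapping, the same integer can correspond to different credit quality for different agencies, which is consistent with the differing maxima reported in Table 1.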

¹ URL: https://www.cbr.ru/credit/


Table 1. Descriptive statistics of 7 CRA ratings

Variable  Number of observations  Average  Mode  Standard deviation  Min.  Max.
AK&M      209                     1.92345  2     0.67502             1     4
EXP       652                     1.63497  2     0.87912             0     6
FCH       609                     4.92939  0     3.78709             0     14
MDS       1108                    6.22563  9     3.12457             0     15
NRA       627                     3.70973  3     1.73995             0     13
RUS       246                     4.23577  6     2.57644             0     10
SNP       511                     5.04305  6     2.8694              0     21

Source: author’s calculations.

If we consider previous papers that, in one way or another, studied CRA ratings using Russian data (for example, [28]), we see that the general distribution of agency ratings has changed little. The most frequent ratings are low ratings in the investment grade or best ratings in the speculative grade. The data on ratings is taken from the RUData system². Consensus and aggregate rankings are calculated using this sample.

Because of its low default concentration and small size, the sample of banks with several ratings cannot be divided into training and test samples to create a logistic model. This is why samples of banks with one or no ratings are used in this study.

Methodology

This chapter consists of several parts. “Logistic Regression” describes the preliminary preparation of data for making a scoring using the logistic model of defaults, the logistic model itself, and ways of validating it. “Modified Kemeny Median” has a similar structure. “Aggregation” describes the mechanism for aggregating the two rankings obtained from the logistic model and the modified Kemeny median. “Model Power Indicator” describes the tool applied to verify the efficiency (power) of obtained rankings.

Logistic Regression

Linear prediction of the logistic model of defaults or the “continuous” rating of the defaults prediction model is used as the first baseline ranking (classifier) [29]. Due to its simplicity, transparency, interpretability and a relatively high discriminatory power, this scoring model continues to be the industry standard [3; 28].

Data preparation. In this paper, observations with missing data were not used for building the logistic regression. Such an approach is frequently used for calculating credit scorings [23; 24], insofar as it does not generate a bias of estimators due to an inappropriately chosen way of imputation of missing values [30]. The forward stepwise selection method was used for feature selection for the logistic model. At each step, this approach considers adding a candidate variable to the set of variables already found to be significant; the candidate is included if it is itself significant and significantly improves the model. In spite of its simplicity, this approach is still widely used to select parameters [16]. Multicollinearity was controlled by means of a correlation matrix, both at the intermediate stage of model building and at the final stage.

Logistic regression. In credit scoring problems, the logistic model may be formulated as follows: bank $i$ has rating $y_i$, which is equal to 0 if there is no bank default and 1 if there is bank default. This rating depends on the latent variable $y_i^*$, which represents the bank credit quality:

\[
y_i =
\begin{cases}
1 & \text{if } y_i^* \ge 0 \text{ (default)}, \\
0 & \text{otherwise (no default)}.
\end{cases}
\tag{1}
\]

In turn, $y_i^*$ linearly depends on $X$, the factors that may predict the bank creditworthiness. They may be continuous and categorical quantities that represent relevant financial, macroeconomic and other indicators. In this case, the probability of a bank being default or non-default is as follows, respectively:

\[
P(y_i = 1) = P(y_i^* \ge 0) = P(X_i'\beta + \varepsilon \ge 0); \qquad
P(y_i = 0) = 1 - F(X_i'\beta),
\tag{2}
\]

where $X_i'$ is the transposed matrix of factors describing the bank’s creditworthiness, $\varepsilon$ is an unobservable random component with logistic distribution, and $F$ is a logistic

² URL: https://rudata.info/


distribution function. The linear predictions are calculated as follows:

\[
R_i^{contin} = \sum_{j=1}^{n} X_i'\beta .
\]

In this paper, $R^{contin}$ is used as one of the base scorings focused on default borrowers.

Validation. The complete sample of banks is used to build the logistic model, regardless of whether they have a rating or not. This sample is divided into training and test subsamples on an out-of-time basis. This choice is deliberate: such a validation method is used in credit scoring studies [31; 32].

Modified Kemeny Median

Data preparation. Unlike the previous method, observations with missing data for certain variables were used for building a modified Kemeny median (consensus ranking). To create a consensus ranking, we used the ratings of seven rating agencies operating in Russia from July 2010 to July 2015.

Modified Kemeny median. Another base classifier is represented by the Kemeny median [8], whose application results in the so-called “consensus ranking”. This method is based on the interpretation of credit rating as a relative ranking of objects in accordance with a CRA’s opinion on the credit quality of each object. On the basis of the ratings’ specific nature as expert information, we modified the concept of Kemeny distance between rankings. This made it possible to find a unique solution that least contradicts the opinions of rating agencies with an acceptable accuracy within an acceptable time:

\[
R^{cons} = \operatorname*{arg\,min}_{R} \sum_{k=1}^{m} \varphi_k \left[ d(R, R_k) + \lambda \delta^2(R, R_k) \right],
\tag{3}
\]

where $R^{cons}$ is the resulting (aggregated) rating, $m$ is the number of aggregated ratings, $R_k$ is the $k$th rating, $d(R', R'')$ is the rank measure of distance between ratings $R'$ and $R''$ (number of contradictory rankings for all pairs of companies), $\lambda$ is the regularization parameter (relative significance of the secondary criterion), $\delta^2(R', R'')$ is the additional (secondary) criterion (shows the extent of contradiction significance), and $\varphi_k > 0$, $\sum_{k=1}^{m} \varphi_k = 1$, are weights representing the degree of confidence in the ratings of a given agency.

$R^{cons}$ is a non-strict bank ranking. Each combination of ratings is assigned its own numerical value, and so the granularity degree of $R^{cons}$ depends on the number of such combinations, and the order of each combination in $R^{cons}$ depends on its inconsistency with other ratings.

Validation. It is impossible to apply common validation measures such as cross-validation types to this method. The reason is that the modified Kemeny median is a result of a non-parametric approach that cannot be used for another sample directly without mapping.

The Kemeny median was originally a voting method that was subsequently used as an aggregator of credit ratings for banks. The collective opinion of credit rating agencies may complement information from financial statements: companies disclose to credit rating agencies information which may be absent from publicly available data. Thus, it is expected that the combination of the logistic model of defaults built on publicly available data and the ranking obtained from CRA ratings will surpass these base classifiers.

Aggregation

Logistic regression is applied as a strong classifier in this paper. The binary default/non-default variable $y_i$, defined in the same way as in equation (1), still serves as the explained (dependent) variable. However, to create an aggregated scoring, $y_i$ is predicted using the following two factors:

\[
P(y_i = 1) = F\left( \gamma_0 + \gamma_1 R_i^{contin} + \gamma_2 R_i^{cons} + \varepsilon \ge 0 \right),
\tag{4}
\]

where $\gamma_j$ is a coefficient obtained from assessing the logistic regression with the help of the maximum likelihood method and $F$ is the logistic distribution function. The aggregated scoring itself is calculated as follows:

\[
R_i^{aggregated} = \gamma_0 + \gamma_1 R_i^{contin} + \gamma_2 R_i^{cons} .
\tag{5}
\]

Model Power Indicator

We use the indicator of the area under the ROC curve (hereafter, AUCROC) as a measure of the discriminatory power of all scorings. This indicator is appropriate for unbalanced samples, in particular because it takes different errors into account [1, p. 2]. In addition, this indicator does not underrate or overrate its values due to erroneous classification or default distribution [7, p. 38]. The resulting indicator values should be interpreted as follows: the closer the AUCROC value to 1, the greater the discriminatory power of the credit indicator. This indicator is described in more detail in [33].

The statistical significance of differences between the AUCROC of base classifiers and the aggregated model is defined by means of the DeLong, DeLong and Clarke-Pearson test [34] at a 10% significance level.

Results

This section deals with the discriminatory powers of credit scorings made with the help of base classifiers and through the aggregation of scorings.

Logistic Model of Defaults

The model was trained on a sample of Russian banks from the period July 1, 2010 – July 1, 2014. The following factors were selected:


 1) Ratio of legal entities' deposits to bank assets.
 2) Regulatory ratio for the maximum size of large credit risks (Н7).
 3) Regulatory ratio for long-term liquidity (Н4).
 4) Amount of short-term loans granted.

 The AUCROC of the obtained logistic model is equal to 68.26% on the training sample. In the test sample, the AUCROC is 70.03%. The consistency of the AUCROC across these two non-overlapping samples indicates that the model is not overfitted. In the sample of banks with two or more ratings, the AUCROC is equal to 71.24% (Figure 1). The steeper rise of the ROC curve near the origin means that the default model identifies defaulting banks better.

 Figure 1. ROC and AUCROC of the logistic model on a sample of banks with two or more ratings
 [ROC plot: sensitivity versus 100% − specificity. Curves: perfect classification; random classification; logit model, AUCROC = 71.24%.]

 According to the quality criterion of scoring models from [35], this model shows good discriminatory power from a practical standpoint. This is consistent with the results of [9; 21], which built logistic regressions using unbalanced samples. In such cases, the AUCROC of the logistic model usually lies in the range 60–74%.

 Consensus Ranking
 The consensus ranking was calculated on the basis of a sample consisting of banks with two or more ratings. The consensus AUCROC is equal to 71.28% (Figure 2). This ranking identifies trustworthy borrowers better, as the right part of the ROC curve is almost horizontal. The reason is that this aggregated rating is based only on banks that have a rating at all. Having a rating is itself a positive signal for the market: the bank is not afraid of its creditworthiness assessment and can afford it in practice.

 Figure 2. ROC and AUCROC of the consensus ranking on a sample of banks with two or more ratings
 [ROC plot: sensitivity versus 100% − specificity. Curves: perfect classification; random classification; consensus ranking, AUCROC = 71.28%.]

 This ranking also has high discriminatory power from a practical standpoint and is as good as statistical models and machine learning methods in a low-default environment [36; 37]. According to the DeLong, DeLong and Clarke-Pearson test, the consensus ranking is statistically indistinguishable from the logistic model of defaults at a 10% significance level (p-value = 99.3%).

 Aggregate Ranking
 The aggregated ranking was built from the two previous rankings, with logistic regression as the aggregating model. We obtained a scoring with the AUCROC equal to 76.16% (Figure 3).
 Statistically, the ranking surpasses the two base scorings at a 10% significance level³, which shows the relevance of aggregating several baseline rankings and of ensembling. In addition, the aggregated scoring combines the best characteristics of both baseline rankings: it identifies default and non-default borrowers with similar precision. By comparison, previously proposed aggregate classifiers improved the AUCROC by no more than 3% on unbalanced samples [38; 39].
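 The aggregation step described above (two base scorings combined by a logistic regression and compared by AUCROC) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, not the paper's bank sample; `score_a` and `score_b` are hypothetical stand-ins for the logistic default score and the consensus ranking; and the logistic regression is fitted with a plain gradient descent written for this example rather than with the author's software.

```python
import math
import random

def auc(scores, labels):
    """AUCROC: probability that a defaulter outscores a non-defaulter (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.5, epochs=1000):
    """Logistic regression by batch gradient descent; returns [intercept, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(w, X):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# Synthetic, unbalanced sample (an assumption for illustration only):
# default depends on two independent risk drivers, and each base scoring sees one.
random.seed(0)
rows = []
for _ in range(400):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    is_default = 1 if u + v + random.gauss(0, 0.5) > 1.9 else 0
    rows.append((u, v, is_default))

score_a = [r[0] for r in rows]  # hypothetical stand-in for the logistic default score
score_b = [r[1] for r in rows]  # hypothetical stand-in for the consensus ranking score
y = [r[2] for r in rows]

# Second-level logistic regression on the two base scorings
X = list(zip(score_a, score_b))
w = fit_logit(X, y)
score_agg = predict(w, X)

auc_a, auc_b, auc_agg = auc(score_a, y), auc(score_b, y), auc(score_agg, y)
print(f"AUCROC base A: {auc_a:.4f}, base B: {auc_b:.4f}, aggregated: {auc_agg:.4f}")
```

 In this toy setting the two base scorings carry complementary information, so the aggregated scoring ranks borrowers better than either input. The AUCROC values here are in-sample and say nothing about the magnitudes reported in the paper.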

 3 $H_0: AUCROC^{contin} = AUCROC^{aggregated}$, p-value = 1.79%.
   $H_0: AUCROC^{cons} = AUCROC^{aggregated}$, p-value = 9.22%.
   $H_0: AUCROC^{contin} = AUCROC^{cons} = AUCROC^{aggregated}$, p-value = 0%.
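 For readers who want to reproduce pairwise comparisons of this kind, the following is a minimal sketch of the two-scoring case of the DeLong, DeLong and Clarke-Pearson test [34]. It is an illustration, not the author's code: the function name `delong_test`, the toy labels and the two score vectors are assumptions made for this example, and the joint test of three AUCROC values is not covered.

```python
import math

def _components(pos, neg):
    """Per-observation AUC components: V10 for each defaulter, V01 for each non-defaulter."""
    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01

def _cov(a, b):
    """Unbiased sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(scores_1, scores_2, labels):
    """Two-sided test of H0: AUC_1 = AUC_2 for two scorings of the same borrowers.

    Needs at least two defaulters (label 1) and two non-defaulters (label 0).
    Returns (auc_1, auc_2, z, p_value).
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    v10_1, v01_1 = _components([scores_1[i] for i in pos], [scores_1[i] for i in neg])
    v10_2, v01_2 = _components([scores_2[i] for i in pos], [scores_2[i] for i in neg])
    auc_1, auc_2 = sum(v10_1) / len(pos), sum(v10_2) / len(pos)
    # Variance of the AUC difference; the covariance terms account for the fact
    # that both scorings are evaluated on the same sample.
    var = ((_cov(v10_1, v10_1) + _cov(v10_2, v10_2) - 2 * _cov(v10_1, v10_2)) / len(pos)
           + (_cov(v01_1, v01_1) + _cov(v01_2, v01_2) - 2 * _cov(v01_1, v01_2)) / len(neg))
    diff = auc_1 - auc_2
    if var <= 0.0:
        # Degenerate case, e.g. identical scorings
        z = 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    else:
        z = diff / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_1, auc_2, z, p_value

# Toy data (hypothetical): 3 defaulters, 5 non-defaulters, two scorings
labels = [1, 1, 1, 0, 0, 0, 0, 0]
s1 = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
s2 = [0.6, 0.2, 0.5, 0.4, 0.3, 0.7, 0.1, 0.15]

auc_1, auc_2, z, p_value = delong_test(s1, s2, labels)
print(f"AUC_1={auc_1:.4f}, AUC_2={auc_2:.4f}, z={z:.3f}, p-value={p_value:.4f}")
```

 The null hypothesis is rejected at the 10% level used in this paper when the returned p-value is below 0.10. With only a handful of observations, as in this toy example, the normal approximation behind the test is rough.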


 Figure 3. ROC and AUCROC of the aggregated model and two base classifiers on a sample of banks with two or more ratings
 [ROC plot: sensitivity versus 100% − specificity. Curves: perfect classification; random classification; consensus ranking, AUCROC = 71.28%; aggregate ranking, AUCROC = 76.16%; logit model, AUCROC = 71.24%.]

 Conclusion

 Financial institutions need to identify both default and non-default contractors or customers so that their management can take informed decisions when solving risk management problems. In this paper, we propose the aggregation of credit scorings made with methods focused on different types of borrowers: the logistic model of defaults and the modified Kemeny median. Logistic regression is used as the strong learner.

 Our data sample consists of Russian banks from the period July 2010 – July 2015, including credit ratings. From a practical standpoint, the discriminatory power of the baseline rankings is high and typical for credit scorings in a low-default environment. However, their aggregation using logistic regression resulted in a significant growth in the discriminatory power of scoring. Moreover, this increment surpassed the increments of ensembles or aggregated rankings on unbalanced samples described in earlier literature. Because the applied classifiers are relatively interpretable, such a model can also be used by financial institutions for risk management.
 In further research, feature engineering techniques (for example, principal component analysis) may be applied to construct explanatory factors, provided the obtained index is interpretable. It is also possible to expand the set of base scorings by adding market scorings and other interpretable scorings obtained, for example, from discriminant analysis, decision trees, etc.

 References

 1.  Abellán J, Castellano JG. A comparative study on base classifiers in ensemble methods for credit scoring. Expert Syst Appl. 2017 May 1;73:1–10. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0957417416306947
 2.  de Castro Vieira JR, Barboza F, Sobreiro VA, Kimura H. Machine learning models for credit analysis improvements: Predicting low-income families’ default. Appl Soft Comput J. 2019 Oct 1;83:105640.
 3.  Louzada F, Ara A, Fernandes GB. Classification methods applied to credit scoring: Systematic review and overall comparison. Surveys in Operations Research and Management Science. Vol. 21. Elsevier Science B.V.; 2016. p. 117–34.
 4.  Xia Y, Liu C, Li YY, Liu N. A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Syst Appl. 2017 Jul 15;78:225–41.
 5.  Rocca J. Ensemble methods: bagging, boosting and stacking. Towards Data Science [Internet]. [cited 2020 Oct 4]. Available from: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
 6.  Ma X, Sha J, Wang D, Yu Y, Yang Q, Niu X. Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electron Commer Res Appl. 2018 Sep 1;31:24–39. Available from: https://linkinghub.elsevier.com/retrieve/pii/S156742231830070X
 7.  Hayashi Y. Application of a rule extraction algorithm family based on the Re-RX algorithm to financial credit risk assessment from a Pareto optimal perspective. Oper Res Perspect. 2016 Jan 1;3:32–42.
 8.  Buzdalin AV, Zanochkin AU, Kurbangaleev MZ, Smirnov SN. Credit ratings aggregation as a task of building consensus in the system of expert assessments. Globalnie rinki i finansoviy engineering. 2017;4(3):181–207. (In Russ.).
 9.  Zanin L. Combining multiple probability predictions in the presence of class imbalance to discriminate between potential bad and good borrowers in the peer-to-peer lending market. J Behav Exp Financ. 2020 Mar 1;25:100272.
 10. Basel Committee on Banking Supervision. International Convergence of Capital Measurement and Capital Standards: A Revised Framework, Comprehensive Version [Internet]. 2006 [cited 2020 Oct 4]. Available from: https://www.bis.org/basel_framework/
 11. Carta S, Ferreira A, Reforgiato Recupero D, Saia M, Saia R. A combined entropy-based approach for a proactive credit scoring. Eng Appl Artif Intell. 2020 Jan 1;87:103292.
 12. Tserng HP, Chen PC, Huang WH, Lei MC, Tran QH. Prediction of default probability for construction firms using the logit model. J Civ Eng Manag. 2014;20(2):247–55.
 13. Li K, Niskanen J, Kolehmainen M, Niskanen M. Financial innovation: Credit default hybrid model for SME lending. Expert Syst Appl. 2016 Nov 1;61:343–55.


 14. Barboza F, Kimura H, Altman E. Machine learning models and bankruptcy prediction. Expert Syst Appl. 2017 Oct 15;83:405–17.
 15. Maldonado S, Peters G, Weber R. Credit scoring using three-way decisions with probabilistic rough sets. Inf Sci (Ny). 2020 Jan 1;507:700–14.
 16. Dastile X, Celik T, Potsane M. Statistical and machine learning models in credit scoring: A systematic literature survey. Appl Soft Comput J. 2020 Jun 1;91:106263.
 17. Pérez-Martín A, Pérez-Torregrosa A, Vaca M. Big Data techniques to measure credit banking risk in home equity loans. J Bus Res. 2018 Aug 1;89:448–54.
 18. Ala’raj M, Abbod MF. A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Syst Appl. 2016;64.
 19. Ala’Raj M, Abbod MF. Classifiers consensus system approach for credit scoring. Knowledge-Based Syst. 2016 Jul 15;104:89–105.
 20. Fang F, Chen Y. A new approach for credit scoring by directly maximizing the Kolmogorov–Smirnov statistic. Comput Stat Data Anal. 2019 May 1;133:180–94.
 21. Teply P, Polena M. Best classification algorithms in peer-to-peer lending. North Am J Econ Financ. 2020 Jan 1;51:100904.
 22. Butaru F, Chen Q, Clark B, Das S, Lo AW, Siddique A. Risk and risk management in the credit card industry. J Bank Financ. 2016 Nov 1;72:218–39.
 23. Sousa MR, Gama J, Brandão E. A new dynamic modeling framework for credit risk assessment. Expert Syst Appl. 2016 Mar 1;45:341–51.
 24. Maldonado S, Pérez J, Bravo C. Cost-based feature selection for Support Vector Machines: An application in credit scoring. Eur J Oper Res. 2017 Sep 1;261(2):656–65.
 25. Neto R, Jorge Adeodato P, Carolina Salgado A. A framework for data transformation in Credit Behavioral Scoring applications based on Model Driven Development. Expert Syst Appl. 2017 Apr 15;72:293–305.
 26. Karminsky A, Polozov A. Handbook of Ratings: Approaches to Ratings in the Economy, Sports, and Society. Springer International Publishing; 2016. 1–356 p.
 27. Li K, Niskanen J, Kolehmainen M, Niskanen M. Financial innovation: Credit default hybrid model for SME lending. Expert Syst Appl. 2016 Nov 1;61:343–55.
 28. Karminsky AM. Credit rating and its modelling. Moscow: HSE Publishing House; 2015. 304 p. (In Russ.).
 29. Aivazyan S, Golovan S, Karminsky A, Peresetckiy A. Approaches to comparing rating scales. Prikladnaya ekonometrika. 2011;3(23):13–40. (In Russ.).
 30. Fitzmaurice G, Kenward MG. Handbook of Missing Data Methodology. Handbooks of Modern Statistical Methods. 1st ed. New York: Chapman and Hall/CRC; 2014. 600 p.
 31. Mushava J, Murray M. An experimental comparison of classification techniques in debt recoveries scoring: Evidence from South Africa’s unsecured lending market. Expert Syst Appl. 2018 Nov 30;111:35–50.
 32. Garrido F, Verbeke W, Bravo C. A Robust profit measure for binary classification model evaluation. Expert Syst Appl. 2018 Feb 1;92:154–60.
 33. Tasche D. Estimating discriminatory power and PD curves when the number of defaults is small. 2009 May 24 [cited 2018 Dec 5]. Available from: http://arxiv.org/abs/0905.3928
 34. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988 Sep;44(3):837.
 35. Pomazanov MV. Credit risk-management in a bank: internal ratings-based approach (IRB) [Internet]. 1st ed. Penikas GI, editor. Moscow: Urait; 2019 [cited 2020 Oct 4]. 265 p. Available from: https://urait.ru/book/upravlenie-kreditnym-riskom-v-banke-podhod-vnutrennih-reytingov-pvr-437044. (In Russ.).
 36. Fitzpatrick T, Mues C. An empirical comparison of classification algorithms for mortgage default prediction: Evidence from a distressed mortgage market. In: European Journal of Operational Research. Elsevier; 2016. p. 427–39.
 37. Shen F, Zhao X, Li Z, Li K, Meng Z. A novel ensemble classification model based on neural networks and a classifier optimisation technique for imbalanced credit risk evaluation. Phys A Stat Mech its Appl. 2019 Jul 15;526:121073.
 38. Xia Y, Liu C, Da B, Xie F. A novel heterogeneous ensemble credit scoring model based on bstacking approach. Expert Syst Appl. 2018 Mar 1;93:182–99.
 39. He H, Zhang W, Zhang S. A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Syst Appl. 2018 May 15;98:105–17.

Contribution of the authors: the authors contributed equally to this article.
The authors declare no conflicts of interests.
The article was submitted 06.07.2021; approved after reviewing 08.08.2021; accepted for publication 14.08.2021.
