Collection Scoring Models Development and Research Based on the Deductor Analytical Platform

This article addresses the construction and study of collection scoring models. It notes the relevance of solving this problem with intelligent modeling technologies: decision trees, logistic regression, and neural networks. The initial data for the models was a set of 14 columns and 5779 rows. The models were constructed in the Deductor platform. Each model was tested on a set of 462 records. For all models, the corresponding classification matrices were constructed, and the errors of the 1st and 2nd kind were calculated, as well as the overall error of each model. In terms of minimizing these errors, logistic regression showed the worst results.


INTRODUCTION
Currently, the number of loans issued to individuals by banks and financial organizations in Russia is growing (Shikimi, 2020; Gemzik-Salwach, 2020). Overdue credit debt is also increasing across all types of lending (Xie & Hansen, 2020; Du & Palia, 2016). This increases the load on collection agencies and on the collection departments of banks. To cope with the growing load, collection departments must work more efficiently. Efficiency here means reducing the time and material costs of working with borrowers who have overdue credit debt. Collection activity can be optimized by applying modern methods of intelligent analysis to the accumulated data (Dela Cruz Galapon, 2020) and by constructing effective collection scoring models (Shen et al., 2020; Terko et al., 2019). Such models can minimize the time spent working with clients to collect overdue debts and the total time spent on collection activities, and can maximize the profit from them. In addition, collection scoring models should make it possible to assess the chances of recovering funds from borrowers, to predict the outcome of measures taken to collect overdue debts, to segment debtors into groups by probability of debt repayment, and to develop a strategy for dealing with overdue credit debt. Therefore, constructing and studying collection scoring models in order to optimize the work of collection services is a relevant task.

METHODS
Many data mining methods can be used to construct models for estimating overdue credit debt. This work focuses on three of the most effective and frequently used methods: a decision tree (Nagra et al., 2020; Ke et al., 2017), a logistic regression (Ansori et al., 2019; Meier et al., 2008; Asar & Wu, 2020), and a neural network (Ismagilov et al., 2018; Mustafin et al., 2018; Swiderski et al., 2012).
When decision trees are used to classify loan applications, a set of rules is applied that is formed while the tree is constructed from a training set (Alqam & Zaro, 2019). An example of a decision tree is shown in Figure 1. Connections between nodes are called branches. Each node corresponds to a condition (rule) for classifying objects. At the root and intermediate nodes the tree branches according to this condition, and each leaf defines a class of objects whose attributes satisfy the conditions along the path leading to that leaf. A complete decision tree exactly reproduces the classification of the set of realizations of the vector X from which it was built into subsets of credits whose outcomes are considered «good» or «bad». In most cases, however, a truncated tree is built, by analogy with other classification problems in which this approach can be justified by various factors (for example, errors in the initial data). In a scoring application, truncation usually reduces classification accuracy. In addition, because of truncation, the tree can cover realizations of the vector X for which no lending results are available, that is, attribute combinations for which applications either never reached the bank or were not approved. The results obtained for loans with partially matching attributes, which correspond to the retained tree nodes, are extended to these realizations.
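The way a tree turns a sequence of node conditions into a class label can be sketched as follows. This is a minimal illustration only: the feature names, thresholds, and tree shape are hypothetical and are not taken from Figure 1 or from the Deductor model.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned to every object that reaches this leaf


@dataclass
class Node:
    feature: str      # attribute tested at this node
    threshold: float  # rule: go left if value <= threshold, else go right
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree, x):
    """Follow the branches from the root until a leaf is reached.

    Returns the leaf label and the list of conditions met on the way.
    """
    path = []
    node = tree
    while isinstance(node, Node):
        go_left = x[node.feature] <= node.threshold
        op = "<=" if go_left else ">"
        path.append(f"{node.feature} {op} {node.threshold}")
        node = node.left if go_left else node.right
    return node.label, path


# Hypothetical tree: features and thresholds are invented for illustration.
TREE = Node(
    feature="days_overdue", threshold=30,
    left=Leaf("good"),
    right=Node(
        feature="payments_before_delay", threshold=3,
        left=Leaf("bad"),
        right=Leaf("good"),
    ),
)

label, path = classify(TREE, {"days_overdue": 45, "payments_before_delay": 1})
print(label, path)
```

The returned path is the rule that leads to the leaf, which is what makes tree models easy to interpret in a scoring setting.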
In the regression model, the scoring function is approximated by a linear function of the components of the vector X (Aaserud et al., 2013):
$$ p = a_0 + \sum_{i=1}^{n} a_i x_i \qquad (1) $$
where $a_0$ is the free coefficient, $a_i$ ($i = 1 \dots n$) are the weight coefficients of the application features, and $x_i$ are the application features, i.e. the components of the vector X.
The coefficients $a_i$ are determined by one of the statistical estimation methods, for example, maximum likelihood estimation (Chen et al., 2019). If the proportion of «good» or «bad» loans is used as the scoring function p, then it should lie in the interval from 0 to 1. However, the value of the right side of equation (1) is not restricted to this interval. The scoring function is therefore taken to be the logarithm of the odds of a «bad» credit outcome:
$$ p = \ln\frac{q}{1 - q} = a_0 + \sum_{i=1}^{n} a_i x_i \qquad (2) $$
where q is the probability of a «bad» credit outcome. This approach is called logistic regression (Shen et al., 2020). The function p varies in the interval from -∞ to +∞ (Fig. 2). The frequencies of the «bad» credit outcome are used as estimates of q(X) for each realization of the vector X (application features) for which the bank has data on issued loans. From these data the coefficients $a_i$ are estimated, and formula (2) then gives the approximated value of the scoring function for any feasible realization of the vector X.
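The relationship between the unbounded score p and the probability q can be illustrated with a short sketch. It assumes the standard logit link, p = ln(q / (1 - q)); the coefficient and feature values are hypothetical and are not estimates from the paper's data.

```python
import math


def logit(q):
    """Log-odds of a 'bad' outcome; maps (0, 1) onto the whole real line."""
    return math.log(q / (1.0 - q))


def inverse_logit(p):
    """Recover the probability q from the score p."""
    return 1.0 / (1.0 + math.exp(-p))


def score(a0, a, x):
    """Linear scoring function p = a0 + sum(a_i * x_i)."""
    return a0 + sum(ai * xi for ai, xi in zip(a, x))


# Hypothetical coefficients and application features, for illustration only.
a0, a = -1.0, [0.8, -0.5]
x = [2.0, 1.0]

p = score(a0, a, x)    # unbounded linear score
q = inverse_logit(p)   # probability of a "bad" outcome, always in (0, 1)
print(round(p, 3), round(q, 3))
```

Whatever value the linear combination takes, the inverse transform returns a valid probability, which is the reason for modeling the log-odds rather than the proportion itself.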
Neural networks can also be used to approximate the scoring function (Swiderski et al., 2012). A neural network is a mathematical model whose parameters for a specific task are formed by training the model on a training data set. For scoring, such a sample may be the data set of previously issued loans or a part of it. The result is a piecewise linear approximation of the function p(X) that is specified algorithmically and can be calculated by the neural network for any feasible realization of the vector X.
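A forward pass through a small perceptron shows how such an approximation is evaluated. This sketch assumes one hidden layer with ReLU activations, which makes the output exactly piecewise linear in X; the activation function used in Deductor is not stated in the text, and all weights below are hypothetical.

```python
def relu(v):
    """Rectified linear unit: the source of the piecewise linear shape."""
    return max(0.0, v)


def mlp_score(x, W1, b1, w2, b2):
    """Forward pass: input -> hidden layer (ReLU) -> single linear output."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Hypothetical weights; in practice they are fitted on the training set.
W1 = [[1.0, -1.0], [0.5, 0.5]]  # one row of input weights per hidden unit
b1 = [0.0, -1.0]                # hidden-layer biases
w2 = [2.0, -1.0]                # output weights
b2 = 0.5                        # output bias

print(mlp_score([1.0, 0.0], W1, b1, w2, b2))
```

Within any region of the input space where the same hidden units are active, the output is a linear function of X; the boundaries between regions are where the pieces join.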

RESULTS AND DISCUSSION
Let us consider the construction and study of intelligent collection scoring models on the basis of the Deductor analytical platform (Lomakin et al., 2019). The initial data for constructing the models is a sample of 14 columns and 5779 rows. The structure of the sample fields is presented in Table 1.
[Table 1. Structure of the sample fields (e.g., field 7: Number of payments before delay).]
[Figure 2. Logarithm of the odds of a «bad» credit outcome versus the probability q of a «bad» credit outcome.]
As Table 1 shows, information about borrowers is stored in such diverse fields as «Credit amount», «Credit term», etc. Based on the described initial data, the collection scoring models were constructed in Deductor: a decision tree, a logistic regression, and a neural network in the form of a two-layer perceptron. Each model was tested on the data marked in the initial sample as «Testing set». The testing set consisted of 462 records, that is, about 8% of the initial data volume.
Let us consider the test results of the constructed collection scoring models, evaluate the models, compare them with each other according to various criteria, and choose the most effective one. Table 2 presents the test results of the constructed models (Sulewski, 2019). The following notation is used in the table: LR is logistic regression, DT is decision tree, NN is neural network. The value «0» means that work with the client was carried out but payments did not resume. The value «1» means that work with the client was carried out and payments resumed. Based on the data in Table 2, we calculated the errors of the 1st and 2nd kind (Zhang et al., 2017) and the overall error of each model (see Table 3). As Table 3 shows, logistic regression showed the worst results in terms of the errors of the 1st and 2nd kind and the overall error. The neural network showed the best results in minimizing these errors.
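The error measures can be computed from a classification matrix as sketched below. The convention is an assumption, since the text does not spell it out: class «1» (payments resumed) is treated as the positive class, an error of the 1st kind as a false positive, and an error of the 2nd kind as a false negative. The counts are hypothetical; they only sum to the testing set size of 462 and are not the values from Table 2.

```python
def error_rates(tp, fp, tn, fn):
    """Return (1st-kind rate, 2nd-kind rate, overall error rate).

    Assumed convention: class 1 (payments resumed) is the positive class;
    a 1st-kind error is a false positive, a 2nd-kind error a false negative.
    """
    total = tp + fp + tn + fn
    type_1 = fp / (fp + tn) if (fp + tn) else 0.0  # share of actual 0s predicted as 1
    type_2 = fn / (fn + tp) if (fn + tp) else 0.0  # share of actual 1s predicted as 0
    overall = (fp + fn) / total if total else 0.0  # share of all misclassified records
    return type_1, type_2, overall


# Hypothetical counts that sum to 462; NOT the values from Table 2.
t1, t2, overall = error_rates(tp=150, fp=30, tn=250, fn=32)
print(f"1st kind: {t1:.3f}, 2nd kind: {t2:.3f}, overall: {overall:.3f}")
```

If the opposite convention is intended (class «0» as the positive class), the two rates simply swap roles; the overall error is unaffected.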
The practical effectiveness of collection scoring models is determined not only by the errors of the 1st and 2nd kind, but also by the time spent on collecting overdue debts and by the income obtained from collection activities. We introduce the following notation: t is the time needed to collect one debt (the working time with one debtor), equal to 1 conventional unit of time; z is the cost of collecting one debt, equal to 1 cu; d is the average income from one client (if the client has resumed payments), equal to 1000 cu; TP is the number of true positive outcomes when working with debtors (the number of positive model predictions that match the actual values); FP is the number of false positive outcomes when working with debtors (the number of positive model predictions that do not match the actual values); D = TP(d - z) - FP*z is the net income from debt collection activities. Then, based on the data in Table 2 and the notation introduced, we calculate the effectiveness of the constructed models according to the «income» and «time» criteria (see Table 4). As Table 4 shows, the logistic regression model outperforms the other models in terms of time spent on collecting overdue debts. However, in terms of net income from debt collection measures, the neural network model is the best.
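The two criteria can be evaluated with a few lines of code. The income formula is the one given in the text. The time formula is an assumption: it supposes that every client the model marks as positive is worked with exactly once, so the total time is (TP + FP) * t. The counts are hypothetical and are not the values from Table 2.

```python
def net_income(tp, fp, d=1000.0, z=1.0):
    """D = TP * (d - z) - FP * z, as defined in the text."""
    return tp * (d - z) - fp * z


def time_spent(tp, fp, t=1.0):
    """Assumed: each client predicted as positive is contacted once."""
    return (tp + fp) * t


# Hypothetical counts, NOT the values from Table 2.
tp, fp = 150, 30
print(net_income(tp, fp), time_spent(tp, fp))
```

Because d is three orders of magnitude larger than z, net income is driven almost entirely by TP, while time grows with every positive prediction. This is why a model can rank best on one criterion and not on the other.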

CONCLUSIONS
Thus, this study has solved the problem of constructing intelligent collection scoring models and evaluating their effectiveness. The results showed that the logistic regression model is advisable when the goal is to minimize the time spent working with debtors. To maximize profit, however, a model based on a trained multilayer neural network is advisable. This model also showed the greatest accuracy in terms of minimizing the errors of the 1st and 2nd kind. This indicates that it is effective and can be used in practice in intelligent scoring systems (Mehdi et al., 2019).

ACKNOWLEDGMENTS
This work was performed under the Russian Government Program of Competitive Growth of Kazan Federal University.