Default Probability Prediction Model

Data Collection: In practice, data collection turned out to be a very difficult task. We planned to collect data on defaulting SMEs from the RBI's published list of defaulters, and we referred to Capitaline and CMIE Prowess for financial information on the SMEs. However, these sources were of limited use for our purpose, as they do not provide data on companies that have undergone credit restructuring or have defaulted on their commitments.

We were unable to obtain the RBI's list of defaulters. However, we could get a list of the names of firms that had filed for bankruptcy from the website of the Board for Industrial & Financial Reconstruction. The data collection procedure was as follows: after going through the entire list available there, we obtained data on 36 SME companies that had defaulted on their loans and whose relevant data was available on either Capitaline or Prowess.

The final sample consisted of 30 SMEs that defaulted on their loans (6 of the data points obtained earlier were excluded because complete data was unavailable), along with 80 SMEs that did not default. Non-defaulting SMEs were selected randomly from Prowess. Thus the entire sample had 110 data points (30 + 80), with about 2.67 non-defaulting companies per defaulting company. Because of the small number of data points, the sample could not be divided into training and testing samples. We decided to monitor performance through intra-sample testing only.

We have four variable categories. All variables under one category are subjected to factor analysis. Ideally, we would have liked to select only one variable from each category. However, in 3 categories a single variable failed to explain a significant portion of the original information; in those cases two variables were selected.

IV) Flowchart of Entire Process

Results Obtained: Factor analysis was performed category-wise for all four categories. 7 factors were obtained from the factor analysis, as follows. (For detailed results, refer to Appendix C.) (For interpretation of factor analysis results in Stata, refer to Appendix D.)

High uniqueness values were noted for net income to total assets, net income to net worth, working capital to total assets, and quick assets to sales. The 7 factors obtained from the factor analysis were used as inputs to a logit model with the default flag as the dependent variable. The logit model output shows that the liquid assets to debt ratios form the most significant category for default prediction, followed by the net income ratios. (For detailed logit model output, refer to Appendix E.)

(For interpretation of typical logit model output, refer to Appendix F.) The model built from these factors was subjected to intra-sample testing, with the following results. Based on these coefficients, model probabilities were predicted; the best results are shown in the table, for a threshold of 0.3.

The logistic function is a sigmoid-shaped function used in statistics. Its value ranges from 0 to 1. The logit function is the inverse of the logistic function, and its value for a number p between 0 and 1 is given by the formula:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \log(p) - \log(1-p)$$

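The logistic function of any number α is hence given by the inverse of the logit:

$$\operatorname{logit}^{-1}(\alpha) = \frac{1}{1 + e^{-\alpha}} = \frac{e^{\alpha}}{e^{\alpha} + 1}$$

If p is a probability, then p/(1 - p) is the corresponding odds, and the logit of the probability is the logarithm of the odds. Similarly, the difference between the logits of two probabilities is the logarithm of the odds ratio (R), thus providing a shorthand for writing the correct combination of odds ratios only by adding and subtracting:

$$\log(R) = \log\left(\frac{p_1/(1-p_1)}{p_2/(1-p_2)}\right) = \operatorname{logit}(p_1) - \operatorname{logit}(p_2)$$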
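The relationships between the logit, the logistic function and the odds ratio can be checked numerically. The following is a minimal sketch using only the Python standard library; the probabilities `p1` and `p2` are illustrative values chosen for the example and are not taken from the study's sample.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(alpha):
    """Inverse of the logit: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-alpha))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Illustrative probabilities only (assumed values, not results from the study).
p1, p2 = 0.30, 0.10

# Applying the logistic function to logit(p) recovers p.
round_trip = logistic(logit(p1))

# The log of the odds ratio equals the difference of the two logits.
log_odds_ratio = math.log(odds(p1) / odds(p2))
logit_difference = logit(p1) - logit(p2)

print(round_trip, log_odds_ratio, logit_difference)
```

The last two printed values agree up to floating-point rounding, which is the "adding and subtracting" shorthand in numerical form.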


Logistic regression (also known as the logistic model) is used for predicting the probability of occurrence of an event (usually a categorical variable) by using independent variables. The regression fits the data to a logistic function curve. The independent variables can be either numerical or categorical. A logistic function is used because, for an input of any value from negative infinity to positive infinity, the output is confined to values between 0 and 1. For logistic regression, the input to the logistic function is denoted by the variable z, which is usually defined as

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots$$

where β0 is called the intercept and β1, β2, β3, etc. are the coefficients of x1, x2, x3 respectively. The value of each regression coefficient indicates the size of the contribution of the corresponding risk factor. A positive regression coefficient means that the explanatory variable increases the probability of the outcome, while a negative regression coefficient means that the variable decreases the probability of that outcome. A large regression coefficient means that the risk factor strongly influences the probability of that outcome, while a near-zero regression coefficient means that the risk factor has little influence on the probability of that outcome.
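As an illustration of how the coefficients translate into a predicted probability, the sketch below computes z for a firm and passes it through the logistic function. The intercept, the coefficients and the two firms' factor scores are hypothetical values invented for the example; they are not the fitted values from Appendix E. The 0.3 cutoff is the threshold mentioned in the intra-sample testing, and treating a probability equal to the cutoff as a default is an assumption.

```python
import math


def linear_predictor(intercept, coefficients, features):
    """z = b0 + b1*x1 + b2*x2 + ... for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


def default_probability(intercept, coefficients, features):
    """Logistic function applied to the linear predictor z."""
    z = linear_predictor(intercept, coefficients, features)
    return 1.0 / (1.0 + math.exp(-z))


def flag_default(probability, threshold=0.3):
    """Flag a firm as a predicted defaulter (assumes >= at the cutoff)."""
    return probability >= threshold


# Hypothetical model: one positive and one negative coefficient.
INTERCEPT = -1.0
COEFFICIENTS = [0.8, -1.5]

# Hypothetical factor scores for two firms.
firm_a = [0.5, 0.2]    # z is about -0.9
firm_b = [1.5, -0.4]   # z is about  0.8

prob_a = default_probability(INTERCEPT, COEFFICIENTS, firm_a)
prob_b = default_probability(INTERCEPT, COEFFICIENTS, firm_b)

for name, prob in (("firm_a", prob_a), ("firm_b", prob_b)):
    print(name, round(prob, 3), flag_default(prob))
```

Raising the first factor score (positive coefficient) raises the predicted probability, while raising the second (negative coefficient) lowers it, which matches the interpretation of coefficient signs given above.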

Multiple discriminant analysis (MDA) is a generalization of linear discriminant analysis. It is an extension of discriminant analysis and shares many of the same assumptions and tests. MDA is used to explain a categorical dependent variable that has more than two categories, using a number of interval or dummy independent variables. It is closely related to regression analysis, principal component analysis and factor analysis, in that it tries to find a linear combination of features that characterizes or separates two or more classes of objects or events. Discriminant coefficients are calculated in a way similar to ANOVA: they depend on the between-groups sum of squares and the individual-group (within-group) sums of squares, and are chosen in such a way that the difference between groups is maximized.
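The criterion behind the discriminant coefficients can be illustrated by comparing candidate directions. The sketch below does not solve for the optimal coefficients (a full MDA would do that); it only evaluates the ratio of the between-groups sum of squares to the within-group sum of squares for a given direction. The three groups of two-dimensional points and both candidate directions are hypothetical.

```python
def project(points, direction):
    """Score each point as a linear combination of its features."""
    return [sum(x * w for x, w in zip(point, direction)) for point in points]


def mean(values):
    return sum(values) / len(values)


def separation_ratio(groups, direction):
    """Between-groups SS divided by within-group SS of the projected scores."""
    scores = [project(group, direction) for group in groups]
    grand_mean = mean([s for group in scores for s in group])
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in scores)
    within = sum(sum((s - mean(g)) ** 2 for s in g) for g in scores)
    return between / within


# Hypothetical data: three groups that differ mainly in the first feature.
groups = [
    [(1.0, 1.0), (2.0, 1.5), (1.5, 0.5)],
    [(4.0, 1.0), (5.0, 1.2), (4.5, 0.8)],
    [(8.0, 1.1), (7.5, 0.9), (8.5, 1.0)],
]

ratio_first = separation_ratio(groups, (1.0, 0.0))   # weight on feature 1 only
ratio_second = separation_ratio(groups, (0.0, 1.0))  # weight on feature 2 only

print(ratio_first, ratio_second)
```

A direction with a higher ratio separates the groups better; discriminant analysis chooses the coefficients that make this ratio as large as possible.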

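Discriminant analysis approaches the problem by assuming that the conditional probability density functions $p(x \mid y=0)$ and $p(x \mid y=1)$ are both normally distributed, with mean and covariance parameters $(\mu_0, \Sigma_0)$ and $(\mu_1, \Sigma_1)$, respectively. Under this assumption, the optimal solution is to predict points based on the log of the likelihood ratio being below some threshold T, so that a point x is assigned to class y = 0 when

$$(x-\mu_0)^T \Sigma_0^{-1} (x-\mu_0) + \ln|\Sigma_0| - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) - \ln|\Sigma_1| < T$$

and to class y = 1 otherwise. It is often useful to see the conclusion in geometrical terms: under the further assumption that both classes share the same covariance matrix, the quadratic terms cancel, and the criterion of an input being in a class y is purely a function of the projection of the multidimensional-space point onto a direction.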
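The Gaussian decision rule can be sketched for the two-dimensional case with plain Python. The score computed here is twice the log of the likelihood ratio p(x | y=1) / p(x | y=0) for two normal densities, so scores below the threshold favor class 0; that sign convention, the default threshold of 0 and all means and covariances below are assumptions made for the illustration, not estimates from the study.

```python
import math


def determinant_2x2(m):
    (a, b), (c, d) = m
    return a * d - b * c


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = determinant_2x2(m)
    return [[d / det, -b / det], [-c / det, a / det]]


def quadratic_form(x, mean, cov):
    """(x - mean)^T cov^{-1} (x - mean) for two-dimensional inputs."""
    inv = inverse_2x2(cov)
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))


def discriminant_score(x, mean0, cov0, mean1, cov1):
    """Twice the log of p(x | y=1) / p(x | y=0) under normal densities."""
    return (quadratic_form(x, mean0, cov0) + math.log(determinant_2x2(cov0))
            - quadratic_form(x, mean1, cov1) - math.log(determinant_2x2(cov1)))


def predict_class(x, mean0, cov0, mean1, cov1, threshold=0.0):
    """Assign class 0 when the score is below the threshold, else class 1."""
    score = discriminant_score(x, mean0, cov0, mean1, cov1)
    return 0 if score < threshold else 1


# Hypothetical class parameters.
MEAN0, COV0 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
MEAN1, COV1 = [3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]]

near_class0 = predict_class([0.2, -0.1], MEAN0, COV0, MEAN1, COV1)
near_class1 = predict_class([2.8, 3.1], MEAN0, COV0, MEAN1, COV1)

print(near_class0, near_class1)
```

Moving the threshold away from 0 trades off the two kinds of misclassification, in the same way that the probability cutoff does for the logit model.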