Credit Risk Models Ep 2: Machine Learning Methods for Parameter Estimation
credit risk
python
machine learning
Author
Xiaochuan Yang
Published
October 20, 2023
In a previous post, we’ve modelled the loss of a portfolio of \(d\) instruments as follows \[
L = \sum_{i=1}^d \mu_i S_i I_i,
\] where \(L\) represents the total loss, \(\mu_i\) is the exposure at default, \(S_i\) is the loss given default, and \(I_i\) is the default indicator for the \(i\)-th instrument (1 if it defaults, 0 otherwise). To compute expected loss, VaR, and other relevant metrics in risk management, it is crucial to estimate these underlying parameters accurately. In this post, our focus is on estimating \(p_i=\mathbb{E}[I_i]\), which we formulate as a machine learning problem.
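To make the role of \(p_i\) concrete: if, purely for illustration, we assume each \(S_i\) is independent of the default indicator \(I_i\), the expected loss factorises as \[
\mathbb{E}[L] = \sum_{i=1}^d \mu_i\, \mathbb{E}[S_i]\, p_i,
\] so the accuracy of the \(p_i\) estimates feeds directly into the expected-loss estimate.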
Estimating probability of default (PD)
Consider the real-world example of a bank approving loans for applicants based on their profiles. In this scenario, every applicant must fill out a comprehensive application form, including details such as their profession, age, amount of debt, monthly salary, and so on. The bank maintains records and, in retrospect, knows who has defaulted on their loans.
To formalize this process, each applicant corresponds to a vector in \(\mathbb{R}^k\), known as the feature vector, which incorporates all the information from the form (possibly encoded for categorical values, e.g., converting ‘profession’ into dummy variables). The output we aim to predict is whether the applicant is in default (1) or not (0).
This constitutes a binary classification problem. With a substantial number of input-output pairs (features and default status) available, supervised learning algorithms can be employed to learn a relationship that can subsequently be used for predictions.
Many supervised learning algorithms for binary prediction actually output probabilities (specifically, the probability of the label being 1). This is suitable for our goal, as we precisely seek to estimate probabilities.
For illustration, let’s consider a credit default risk dataset from this Kaggle competition. We set up the competition data with the fastkaggle module.
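Something along the following lines, assuming fastkaggle is installed and Kaggle API credentials are configured; the competition slug below is my assumption for this competition.

```python
from fastkaggle import setup_comp

# Download and unzip the competition data locally (a no-op if already present).
comp = 'home-credit-default-risk'  # assumed slug for the competition
path = setup_comp(comp)
```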
We primarily look into application_train.csv, which contains 308k rows and 121 input features. While there are numerous aspects to discuss regarding this dataset, for the sake of brevity, I will address two significant points here:
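A sketch of the loading step, with the `TARGET` column name taken from the competition’s data dictionary and one-hot encoding applied to the categorical columns so that every feature is numeric:

```python
import pandas as pd

# application_train.csv: one row per loan application, TARGET = 1 if the obligor defaulted.
df = pd.read_csv(path/'application_train.csv')

# One-hot encode the categorical columns; missing values remain as NaN at this stage.
X = pd.get_dummies(df.drop(columns=['TARGET']))
y = df['TARGET']
df.shape
```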
Missing values
There are many missing values in this dataset. To be more precise, 41 columns actually have half of their values missing.
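One quick way to see this, reusing the `df` frame loaded above:

```python
# Fraction of missing values per column, sorted from worst to best.
miss = df.isna().mean().sort_values(ascending=False)

# Number of columns with at least half of their values missing, and the worst offenders.
(miss >= 0.5).sum(), miss.head()
```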
This is a critical issue that needs to be addressed because many off-the-shelf machine learning models in scikit-learn, such as logistic regression, random forest, and support vector machines, cannot handle missing values represented as np.nan. There are two possible options to handle this:
Impute the missing values before feeding the data into these models. Imputation can be done using “rule-based” methods such as scikit-learn’s SimpleImputer or “learning-based” methods such as scikit-learn’s IterativeImputer (see the sketch after this list).
Use a different model that supports missing values natively. For instance, many gradient boosting implementations like HistGradientBoosting, XGBoost, LightGBM, and CatBoost handle missing values natively.
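As a sketch of the first option, a minimal imputation pipeline might look like this; median imputation and logistic regression are chosen purely for illustration, and IterativeImputer could be swapped in for a learning-based alternative.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Option 1: impute first, then fit a model that cannot handle NaN itself.
imputed_model = make_pipeline(
    SimpleImputer(strategy='median'),
    LogisticRegression(max_iter=1000),
)
```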
For the sake of providing a quick benchmark, we have opted for the second option and are using scikit-learn’s gradient boosting implementation, HistGradientBoostingClassifier. Explaining the detailed workings of gradient boosting is a vast topic that we might delve into in a future post.
Unbalanced data
Roughly 8% of the obligors go into default in this dataset, making it unbalanced. From a risk-management perspective, it is important to predict defaults (label 1) accurately. Therefore, when it comes to evaluating model performance, accuracy is not an appropriate metric: simply predicting non-default for every obligor would be correct 92% of the time, yet never correct for the defaults.
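A quick check of the class balance:

```python
# Share of each class: roughly 92% non-default (0) vs 8% default (1).
y.value_counts(normalize=True)
```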
Applying a machine learning model directly to an unbalanced dataset can lead to sub-optimal results; we’ll demonstrate this point in the next section. A simple mitigation strategy is to sub-sample the majority class to match the size of the minority class, creating a balanced dataset. In this example, the minority class comprises roughly 25k rows, so the balanced data is not too small to be useful. Obviously, the downside of this strategy is that we throw away a lot of valuable data, but let’s not worry about that at this stage of obtaining a quick benchmark.
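A minimal sketch of the sub-sampling strategy, reusing `X` and `y` from above (the random seed is arbitrary):

```python
# Sub-sample the majority class (non-defaults) down to the size of the minority class.
idx_min = y[y == 1].index
idx_maj = y[y == 0].sample(n=len(idx_min), random_state=42).index
idx_bal = idx_min.union(idx_maj)

X_bal, y_bal = X.loc[idx_bal], y.loc[idx_bal]
```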
Implementation
First we feed the whole unbalanced dataset into HistGradientBoostingClassifier.
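A sketch of this baseline; the stratified hold-out split is my own choice rather than necessarily the competition’s evaluation setup.

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# HistGradientBoostingClassifier handles NaN inputs natively, so no imputation is needed.
clf = HistGradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```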
As we can see, the precision for class 1 is only 0.02, indicating that among all applicants identified as defaults, only 2% are true defaults. Does this mean the model we chose is rubbish? Not necessarily. Let’s use the exact same model, but now with a balanced dataset created using the sub-sampling strategy.
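Roughly as follows, reusing the imports and the balanced subsample from above; `predict_proba` then gives the per-obligor PD estimate we are ultimately after.

```python
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=42)

clf_bal = HistGradientBoostingClassifier(random_state=42)
clf_bal.fit(Xb_train, yb_train)

print(classification_report(yb_test, clf_bal.predict(Xb_test)))

# Estimated probability of default for each obligor in the hold-out set.
pd_hat = clf_bal.predict_proba(Xb_test)[:, 1]
```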
Much better! All the metrics look roughly the same, hovering around 69%. While this is far from being deployable in the real world, as a baseline, it’s far more reasonable than our previous attempt.
We’ll conclude the post here. To further enhance the overall performance, it’s necessary to meticulously explore the features, engage in more feature engineering, employ clever methods of data imputation, and conduct thorough hyperparameter tuning. Give it a go!