## Can financial analysts optimize a loan portfolio based on data-centric strategies & quantitative algorithms?

Published in

·

13 min read

·

Jul 19, 2020

--

Authors: Noah Mukhtar, Shaan Kohli, Shaher yar Jahangir and Ramy Hammam

While banks have traditionally acted as the main provider of loans, the needs of smaller businesses are evolving and now come with more concerns.

2020 has brought along spikes in operating costs and increasing fluctuations in demand.

These are some of the extreme challenges SMEs face when it comes to raising capital — a crucial element of their growth. Banks often overlook these, and as a result,** this has opened up the door for the P2P Lending Market to grow.**

This is why we are going to create a **data-driven investment strategy for Prosper Marketplace, a P2P lending firm**, that will take a holistic overview from processing the raw data to inferring business outcomes.

Overall, our predictive model will act as a **prescriptive tool **that directly portrays the impact on the investor & business.

**Two fundamental steps are required for an investor to gain a better understanding of which loan to invest in:**

- Prosper’s investors will have to decide on how much money to invest (considering potential return) and allocate to other options for investment.
- Prosper’s investors will need to decide on picking “good” loans (i.e.: loans which will end up being fully paid off) in which to invest their money, and this is what our loan classification model will focus on.

However, it is important to note that there is no one ideal investment, as it is highly contingent on the risk appetite of the investor.

The datasets to be utilized in this analysis are publicly available online for download using a registered

Prosperaccount.

**Predictors**

The dataset contains comprehensive information on all loans posted annually and a set of **22 variables** that cover the different attributes of an individual loan.

These include but are not limited to the** loan amount, the amount of interest & fees paid, loan duration, a Prosper credit rating score, and loan status** with reason if completed or defaulted.

## Completed vs. Defaulted Loan: 0s vs 1s

There are 4 outcomes for a loan in Prosper’s data:

- Completed (A)
- Charged Off (B)
- Current (C)
- Defaulted (D)

To simplify our Prosper data analysis, we decided to assign a value of **0 to completed loans (A) **and **1 to the remaining 3 types of loan statuses** **(B, C, & D) **that refer to loans that have not been completed.

## How Can We Identify a “Good” Loan?

In order to acquire an intuition of identifying good loans, it is important to realize the correlation & interaction amongst all the variables.

However, the **definition of a good loan **is still quite ambiguous so it is vital that the prediction should be based on one of these goals: whether a loan will default, paid back early, time of default, or time taken to pay back if paid back early.

## Analyzing Loan Terms

The plot indicates some interesting insights for the 60-month term (the longer-term) such as the highest loan amount being approximately $15,000 for both completed loans as well as uncompleted loans. Unsurprisingly, borrowers tend to take higher loan amounts to pay back over the longer-term loans.

Whereas loans in the 36-month terms (the shorter term) tend to be of much smaller amounts and are more consistently distributed for both completed and uncompleted loans.

## Analyzing Interest Rate Distribution Per Loan Status

Another violin plot was built that shows the different levels of interest rates for each of the loan status per grade type (ranging from AA — HR: highest grade to lowest).

Higher-grade loans for both completed (0) and uncompleted (1) have lower interest rates, and this trend is consistent for both loan statuses.

## Data Leakage

This phenomenon occurs when a model is built using predictors that will not be available at the time of future predictions. Therefore, we need to disregard these predictors in order to prevent a bias in our model.

**There are two main cases on how leakage can be identified:**

1. A predictor that is highly correlated with the target variable. We will check this using a heatmap or correlation matrix of all 22 predictors.

2. The predictor information is not available during the prediction of total return, for example, the late_fees_paid variable.

As suspected, we noticed that variables that are associated with data leakage such as the interest paid and principal paid stood out to have a high correlation with loan status.

**Based on this definition of leakage and the fact that most of these won’t be available when predicting whether a loan will default or not, we disregard the following predictors:**

loan_number, late_fees, age_in_months, days_past_due, origination_date, principal_balance, principal_paid, interest_paid, late_fees_paid, debt_sales_proceeds_received, loan_default_reason, loan_default_reason_description, next_payment_due_date, next_payment_due_amount, co_borrower_application

**Calculation of Investor’s Returns**

The most significant data required in determining the potential return from a given loan is the **calculation of the total payments that were received on each of the loans**.

To build an effective investment strategy, we need to build a strong indicator variable on the return amount of each loan. **It is** **vital that the return should consider both partially paid off defaulted loans and loans that have been paid off earlier than the due date**.

**In general, there are three effective return measures that can be created using the following variables:**

1. Total payment account denoted by variable **p**

2. Total amount invested in the loan **(amount borrowed by borrower) **denoted by variable **f**

3. Nominal length of the loan in months (**term of loan**) denoted by the variable **t**

4. Actual length of the loan in months (**loan length**) denoted by the variable **m**

We are going to calculate the expected returns by scenario testing under 3 different strategies:

**1) Pessimistic (M1)**

The pessimistic approach states that when the loan is paid back, the investor still cannot re-invest it until the term of the loan expires.

Using equation 1, our team was able to create a new variable called: “**ret_PESS**”.

It was worth noting that this approach favors long-term loans for loans that default early due to the loss spreading over a greater span.

**2) Optimistic (M2)**

The optimistic approach entails that once the loan is paid back, the investor’s money is returned, and the investor can invest immediately in another loan with the same return.

Using equation 2, our team was able to create a new variable called: “**ret_OPT**”.

This method can be rephrased as the annual monthly return of the loan over the time it was active. However, it is plagued by the assumption that money can be reinvested with the same rate and that if the loan defaults early, the loss can be a drastic overestimate of negative returns. On the plus side, short and long loans are treated equally using this approach.

**3) Fixed-horizon (M3)**

The fixed-horizon approach involves calculating fixed-time returns for 3 different interest rates: 1,3 & 6 %.

A function was built, called **ret_method_3()**, to replicate equation 3 and calculate the 3 new variables for each loan in the dataset.

**This approach can be the most accurate by equalizing the differences between loan length and defaulted loans. **However, it tends to disregard the depreciation value of the money over time.

**Prosper Rating (Grade) Breakdown**

A deeper dive into the prosper credit rating (“grade”) variable was performed to explore any significant trends with each of the type of return (M1-M3).

The higher the grade, the smaller percentage of defaulting, and the lower the interest rates.

The mean return % is very identical with minor increases across the different grades. In addition, there are no negative mean return values across any of the grades.

A quick count of the number of loan status for each type shows a significantly higher number of completed loans when compared to uncompleted loans, indicating a class imbalance in the dataset.

Therefore, the most probable option is to try to rebalance the dataset to consist of 50% of completed loans and 50% of uncompleted loans. However, doing this could prove to be detrimental and cause the average return to be negative, fully going against the point of enticing potential investors to sign up to Prosper.

Thus, to capture actual real-life trends and true proportions, physically balancing the dataset would hinder a realistic understanding of the analysis.

Instead, a calibration curve measure will be used in the model-building phase to check if a constructed model is biased by the data imbalance or not.

**The predictive modeling will cover a two-stage approach:**

First, we will be constructing a binary classification model to predict the loan default probability using a variety of industry-standard algorithms and evaluating the performance for the optimal and most accurate model.

The second stage involves constructing a regression model using a variety of industry-standard regressors to predict the amount of return a loan may generate to an investor for each of the return approaches (M1-M3).

The models in both phases will be optimized for higher accuracy by hyperparameter tuning through cross-validation.

*Stage 1: Classifying Loan Outcomes*

The first model was to determine the predictive power of the **grade** and **interest rate** of a loan, which is usually the industry’s standard practice.

Predictors:Grade & Interest Rate

Model #1 — Accuracy Score: 85%

This implies that these two features individually contribute to successfully determining the loan status as much as when all the predictors are used.

However, to build a more robust model and avoid underfitting, we have to drop these two features entirely.

Predictors: All Excluding Grade & Interest Rate

Model #2 — Accuracy Score: ~85%

Using all the predictors apart from grade and interest rate, model accuracies were maximized at roughly 85%. Although Naïve Bayes provided a very high accuracy, it was unable to account for the biases presented due to the data imbalance.

The calibration curves and AUC scores for all the above models are high, indicating strong performance. The calibration code shows a measure to see if the model is affected by the imbalance. If the calibration is perfect, it means the model is not biased by the imbalanced dataset. In other words, it will not always try to predict the outcome to be the grand majority, which is the completed loans **(0) **case.

Since Random forest was the best-selected model, we decided to explore the best in class for tree algorithms, also known as the **Light GBM classifier **developed by Microsoft. It is considered the latest industry trend and winner of multiple Kaggle competitions, so we were curious to see how well it would perform in the Prosper case. As presumed, all the model metrics were very high for the Light GBM classifier.

## Feature Importance

The feature importance plots for most of the models were also generated in the classification stage.

**Our model determined “Service fees paid”, “Propser fees paid” and “amount borrowed” as the most useful at predicting our target variable: “loan_status” which matches well with investing intuition.**

## Learning Curve

A learning curve was also plotted to assess the learning rate of the optimal random forest models when compared to a sample L2 logistic regression model.

When the training size is increased by intervals of 25 data points until a very large amount, the random forest model AUC increases, meaning that the model will continue to learn and give a much higher percentage of accurate return predictions relative to the logistic model.

*Stage 2: Predicting Expected Return*

A variety of models were tested using a different set of features. The R2 values of various regression models were calculated for the different return methods.

The Random forest regressor gave the highest score across all the different types of returns when compared to the rest of the models built.

**Investment Strategies**

By leveraging both the classification and regression models to predict whether a loan will default and the estimated return, an investment strategy can be formulated to maximize an investor’s average return based on the different methodologies used to calculate the return.

**Four different investment strategies were tested:**

- Random Strategy
- Default-based
- Simple return-based
- Default and return-based strategy.

In order to depict real-life scenarios, the M3 return approach seems to be a more accurate way to measure the % return. Thus, the baseline for us to compare the different investment strategies would be under the M3 method. Nonetheless, the returns seem to be very close to one and another regardless of which investment strategy is being used when looking at the M3 return approach.

The magnitude of the % return is also highly dependent on the reinvestment interest rate (M3).

**Sensitivity Test of Portfolio Size**

Next, a sensitivity test was run with regards to the portfolio size versus the investment return percentage

As the number of loans invested in by a potential investor increases, the percentage of investment return decreases. **This is also very intuitive due to the risk factor of a loan getting defaulted increases as the number of loans invested in increases**.

In other words, **an investor may have a similar chance of getting a high return and then losing most of it due to a defaulted loan and this cycle can keep repeating causing the overall percentage of return to decrease as the number of loans invested in increases.**

In this section, we implement three different optimization models to improve an investment strategy using Prosper.

**The three different optimization methods are:**

## 1) Directly maximize total profit

A binary variable is set-up for every loan in our data set. The number of loans constraint is added (as per the maximum number of loans in our dataset) and the objective function is defined (maximize total profit).

## 2) Maximize profit with budget constraint

The second optimization model considers the budget constraint of a potential investor. Like the first model, a new budget limiting constraint is added (testing different budgets), and the optimization problem is resolved.

## 3) Maximize profit with risk-return tradeoff

The third optimization strategy involves incorporating the portfolio risk factor by considering the variance of the returns.

To achieve this, first, a clustering model is trained with an adjustable k parameter by the investor. Second, the standard deviation is computed for each of the clusters. Third, each loan can be assigned to a specific cluster based on the Euclidean distance from each cluster. Fourth, the standard deviation of the return for each loan can be estimated using the standard deviation of the cluster.

The model is then built to maximize profit with the inclusion of the risk-return tradeoff constraint. A sensitivity/penalty factor can be set by the investor to account for the risk tolerance of the investor.

**Results show that the optimal investment strategy for Prosper is maximizing profit using budget constraints with an expected return on 10.5%.**

In conclusion, we saw how information from different attributes can be utilized to create new predictors that may end up boosting the statistical significance of our model as opposed to the original counterparts. It was important to note how a model’s performance over a single partition of the data into cross-validation folds can be misleading and running our model several times to develop different iterations of the train/test splits was essential. Another important part was calibration, and how it measured whether the probabilities produced by the model were correct.

It would be interesting to see how our model would perform if tied to macroeconomic external data that would represent the underlying economy’s performance at the time such as oil prices or The World Bank’s interest rates in order to further extend the pessimistic measure’s performance.

Additionally, it would be interesting to conduct sentiment analysis over the investor to understand what their risk appetite exactly is as opposed to asking for the numerical input as they may lack the understanding of how the P2P lending market operates.

## Our Inspiration — Original Study

I am an expert in the field of data-driven investment strategies and quantitative algorithms, with a deep understanding of financial analysis and loan portfolio optimization. My expertise extends to predictive modeling, classification, regression, and optimization techniques, which are essential components of creating effective investment strategies.

Now, let's delve into the concepts discussed in the article by Noah Mukhtar, Shaan Kohli, Shaher yar Jahangir, and Ramy Hammam regarding the optimization of a loan portfolio using data-centric strategies and quantitative algorithms.

**1. P2P Lending Market Growth:**

- Traditional banks overlook the evolving needs of smaller businesses, leading to increased challenges in raising capital.
- The P2P lending market, exemplified by Prosper Marketplace, is growing as an alternative source for loans.

**2. Data-Driven Investment Strategy:**

- Aiming to create a holistic strategy for Prosper Marketplace, the authors propose a data-driven approach to analyze raw data and infer business outcomes.
- The predictive model serves as a prescriptive tool, impacting both investors and businesses.

**3. Loan Classification Model:**

- Prosper's investors face the challenge of deciding how much money to invest and choosing "good" loans.
- The article focuses on building a loan classification model to identify factors indicating a "good" loan.

**4. Predictors and Dataset:**

- The dataset includes comprehensive information on loans, with 22 variables covering attributes like loan amount, interest, duration, credit rating, and status.
- Completed vs. Defaulted loans are categorized as 0s and 1s, respectively.

**5. Data Analysis Techniques:**

- Analysis involves understanding correlations among variables to identify "good" loans.
- Violin plots illustrate insights into loan terms, interest rate distributions, and Prosper credit rating breakdown.

**6. Data Leakage:**

- Leakage is addressed by identifying predictors highly correlated with the target variable, ensuring the model uses only relevant information.

**7. Calculation of Investor's Returns:**

- Expected returns are calculated under three scenarios: Pessimistic, Optimistic, and Fixed-horizon, considering different reinvestment approaches.

**8. Predictive Modeling Stages:**

- Two-stage approach involves binary classification to predict loan default probability and regression to predict return amount based on different strategies (M1-M3).

**9. Investment Strategies:**

- Four investment strategies are tested, leveraging classification and regression models to maximize investor returns.

**10. Optimization Methods:**

- Three optimization strategies are explored to improve investment strategies, including maximizing total profit, considering budget constraints, and incorporating risk-return tradeoff.

**11. Sensitivity Test and Results:**

- A sensitivity test explores the impact of portfolio size on investment return percentage.
- The optimal investment strategy involves maximizing profit with budget constraints, achieving an expected return of 10.5%.

In conclusion, the article provides a comprehensive exploration of data-centric strategies for optimizing a loan portfolio, encompassing data analysis, predictive modeling, and optimization techniques to enhance investor returns.