Random forest model using Lending Club public dataset shows opportunity to improve adjusted return by 2.75%
Arimo recently performed a study using a public dataset provided by Lending Club, with the goal of showing how machine learning could improve investor returns. To do this we used the PredictiveEngine™ component of our Data Intelligence Platform, which makes it easy to build a variety of predictive machine learning models that scale transparently when deployed on distributed parallel computing platforms.
Lending Club is an online peer-to-peer lending company that connects borrowers with investors who have capital to lend. When a borrower submits a loan application, Lending Club reviews it and decides whether to offer a loan at a risk-adjusted rate or to reject the application. As of the 3rd quarter of 2015, more than $12 billion in loans had been issued through Lending Club.
Lending Club’s loan approval process is designed to manage risk and offer predictable return to investors. The first step is assigning a credit-worthiness rating (A-G) to borrowers, where lower-rated borrowers are offered loans at higher interest rates (Figure 1). The other risk-management tool is breaking loans up into small notes, allowing the risk of a single loan to be shared by a number of investors who purchase the notes.
Figure 1: Interest Rates Charged to Various Borrower Risk Grades Over Time
What to Predict
The public dataset contains no information on rejected loans, so we can only attempt to detect issued loans that should have been rejected. Screening out those loans would have increased investor returns by reducing the number of charged-off loans.
As discussed above, Lending Club’s grading mechanism is a way of balancing risk and profit for investors with different preferences. Since there is risk in each loan, some loans will be charged off. Adjusted Net Annualized Return (Adjusted NAR) is a measure of net return that takes into account an estimate of future losses on loans based on their current status. Lending Club publishes a description of how it calculates Adjusted NAR.
The difference between adjusted NAR and the interest rate charged across an investor’s portfolio (net of Lending Club fees) is the opportunity to improve return. Figure 2 shows this opportunity over time before any adjustment for fees.
Figure 2: Adjusted NAR versus Gross Interest Rate over time.
Our goal is to build a predictive model that detects approved loans that will charge off, using only the information known at the time of issuance. For the model’s measured performance to carry over to new loans, it is crucial to prevent information leakage by removing features that are recorded after issuance, such as the last FICO scores, from the set of predictors.
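As a rough illustration of the leakage step, the sketch below filters post-issuance fields out of a single loan record. The column names and the example record are assumptions chosen for illustration; they are not the list of features removed in the study.

```python
# Fields assumed to be recorded after issuance (illustrative names, not a complete list).
POST_ISSUANCE_COLUMNS = {
    "last_fico_range_high",
    "last_fico_range_low",
    "total_pymnt",
    "recoveries",
    "last_pymnt_d",
}


def drop_leaky_features(record):
    """Return a copy of one loan record without post-issuance fields."""
    return {k: v for k, v in record.items() if k not in POST_ISSUANCE_COLUMNS}


# Hypothetical loan record mixing issuance-time and post-issuance fields.
loan = {
    "loan_amnt": 12000,
    "grade": "C",
    "annual_inc": 58000,
    "last_fico_range_high": 549,
    "total_pymnt": 3100.0,
}
clean = drop_leaky_features(loan)
print(sorted(clean))
```

Keeping the excluded fields in one named set makes it easy to audit what the model is not allowed to see.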
An important characteristic of Lending Club’s historical data is that it is remarkably imbalanced. A majority of borrowers pay off their loans, with only about 7% of Lending Club’s loans charged off. The high ratio of paid to charged-off loans makes it difficult to build models that simultaneously minimize both kinds of error. Taking charge-off as the positive class, these are false positives (loans flagged as likely charge-offs that would have been paid as agreed) and false negatives (loans that pass the screen but are later charged off). To address this problem, we used stratified sampling techniques to train a random forest model on a balanced training set. The training set was constructed so as to adjust the class prior probabilities.
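A balanced training set of this kind can be sketched as a simple downsampling of the majority class. This is a minimal illustration on synthetic data, not the sampling scheme used in the study. The field name charged_off is hypothetical, and the correct_for_priors helper is an added assumption: it shows one standard way (a Bayes prior-shift correction) to map a score from a model trained at a 50% charge-off rate back to the original rate.

```python
import random


def balanced_sample(rows, label_key="charged_off", seed=0):
    """Downsample the larger class so both classes have equal counts."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample


def correct_for_priors(p_balanced, true_rate, train_rate=0.5):
    """Rescale a probability from a model trained at train_rate to true_rate."""
    pos = p_balanced * true_rate / train_rate
    neg = (1.0 - p_balanced) * (1.0 - true_rate) / (1.0 - train_rate)
    return pos / (pos + neg)


# Synthetic loans with a 7% charge-off rate, mirroring the imbalance described above.
rows = [{"id": i, "charged_off": 1 if i % 100 < 7 else 0} for i in range(1000)]
train = balanced_sample(rows)
print(len(train), sum(r["charged_off"] for r in train))
```

Downsampling discards most paid-off loans, so in practice one might repeat it with different seeds or weight classes instead; the sketch only shows the basic idea.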
We chose a random forest model for this task because random forests offer high classification performance when taking large numbers of features into account. We formulated the task as a binary classification problem and trained the model on Lending Club’s historical data of accepted loans to predict which loans are more likely to charge off. A loan counts as a positive example if it is charged off, in default, or late.
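The binary target described above can be sketched as a mapping from loan status to a 0/1 label. The exact status strings below are assumptions about how the dataset spells them, and the sample list is made up for illustration.

```python
# Statuses treated as the positive (charge-off risk) class; exact strings are assumed.
BAD_STATUSES = {
    "Charged Off",
    "Default",
    "Late (16-30 days)",
    "Late (31-120 days)",
}


def to_label(loan_status):
    """Return 1 for loans that are charged off, in default, or late; otherwise 0."""
    return 1 if loan_status in BAD_STATUSES else 0


statuses = ["Fully Paid", "Charged Off", "Current", "Late (31-120 days)", "Default"]
labels = [to_label(s) for s in statuses]
print(labels)
```

Labels built this way, together with the issuance-time features, would then be passed to whichever random forest implementation is in use.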
The most significant outcome of Arimo’s study was that the finished model resulted in improved ability to classify issued loans by charge-off risk. The Receiver Operating Characteristic (ROC) curve below shows the performance of our model.
Figure 3: Receiver Operating Characteristic curve
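For readers unfamiliar with how an ROC curve is built, the sketch below computes its points and the area under it from a list of true labels and model scores. The labels and scores are made-up toy values, not output from the model in this study.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, threshold swept high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so tied scores share a threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 1 = charged off, scores = predicted charge-off risk.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
curve = roc_points(labels, scores)
print(round(auc(curve), 3))
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of charge-offs above repaid loans.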
This improvement translated into a 2.75% increase in adjusted NAR over the average adjusted NAR on all loans. While 2.75% might be considered a small improvement, when considered over loan volumes exceeding $10 billion per year, this gain translates into tens of millions of dollars in additional returns to investors.
Viewing the performance of the model over time in Figure 4, it’s clear that Lending Club’s data scientists have been hard at work and the advantage of Arimo’s model gradually disappears after the beginning of 2014.
Figure 4: Predicted adjusted NAR versus actual adjusted NAR over time
The adjusted NAR improvement matters to Lending Club’s business and revenue growth because the company’s primary revenue comes from a one-time origination fee on each loan (paid by borrowers) and a 1% service charge applied to each monthly payment. If interest rates for borrowers are kept unchanged, the improved return gives investors a reason to invest more. Alternatively, if the company keeps investor returns at their current level, it can lower interest rates for borrowers and encourage them to borrow more. Either way, using a predictive model to improve adjusted NAR lets Lending Club grow its revenue by increasing the total amount of funded loans.
* Average rates for September 2015
Lending Club is a registered trademark of Lending Club, Inc.