Arimo Predictive Engine ™ Shows Opportunity to Improve Investor Returns in Peer-to-Peer Lending

Random forest model using Lending Club public dataset shows opportunity to improve adjusted return by 2.75%

Arimo recently performed a study using a public dataset provided by Lending Club, with the goal of showing how machine learning could improve investor returns. To do this we used the PredictiveEngine ™ component of our Data Intelligence Platform, which makes it easy to build a variety of predictive machine learning models that scale transparently when deployed on distributed parallel computing platforms.


Lending Club is an online peer-to-peer lending company that connects borrowers with investors who have capital to lend. When a borrower submits a loan application, Lending Club reviews it and decides whether to offer a loan at a risk-adjusted rate or to reject the application. As of the 3rd quarter of 2015, more than $12 billion in loans had been issued through Lending Club.

Lending Club’s loan approval process is designed to manage risk and offer predictable return to investors. The first step is assigning a credit-worthiness rating (A-G) to borrowers, where lower-rated borrowers are offered loans at higher interest rates (Figure 1). The other risk-management tool is breaking loans up into small notes, allowing the risk of a single loan to be shared by a number of investors who purchase the notes.


Figure 1: Interest Rates Charged to Various Borrower Risk Grades Over Time

What to Predict

The public dataset does not include information on rejected loans, so we can only attempt to detect issued loans that should have been rejected. Rejecting those loans would have increased investor returns by reducing the number of charged-off loans.

As discussed before, Lending Club’s grading mechanism is a way of balancing risk and profit for investors with different preferences. Since there is risk in each loan, some loans will be charged off. Adjusted Net Annualized Return (Adjusted NAR) is a measure of net return that takes into account an estimate of future losses on loans based on their current status. A description of Lending Club’s Adjusted NAR can be found here.

The difference between Adjusted NAR and the interest rate charged across an investor’s portfolio (net of Lending Club fees) is the opportunity to improve return. Figure 2 shows this opportunity over time before any adjustment for fees.

Figure 2: Adjusted NAR versus Gross Interest Rate over time.

Feature Engineering

Our goal is to build a predictive model that detects approved loans that will charge off, using only the information known at the time of issuance. For the model's predictions to be trustworthy, it is crucial to prevent information leakage by removing features that are only known after issuance, such as last FICO scores, from the set of predictors.
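The leakage filter described above can be sketched as follows. This is a minimal illustration that assumes each loan is held as a dictionary; the field names in `POST_ISSUANCE_FIELDS` are hypothetical examples of post-issuance columns, not the list actually removed in the study.

```python
# Hypothetical examples of fields that are only known after a loan is issued.
# The study does not list the exact features it removed.
POST_ISSUANCE_FIELDS = {
    "last_fico_range_high",
    "last_fico_range_low",
    "total_pymnt",
    "recoveries",
    "last_pymnt_d",
}


def drop_leaky_fields(record, leaky=POST_ISSUANCE_FIELDS):
    """Return a copy of the record keeping only fields known at issuance."""
    return {key: value for key, value in record.items() if key not in leaky}


# Illustrative record (made-up values).
loan = {
    "loan_amnt": 10000,
    "grade": "B",
    "fico_range_low": 690,
    "last_fico_range_high": 560,
    "total_pymnt": 3200.0,
}
features = drop_leaky_fields(loan)
```

Keeping the excluded fields in one named set makes the leakage policy easy to review and extend as new post-issuance columns are identified.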

In our analysis, we adapted the Adjusted NAR definition and loss estimates provided by Lending Club, with some modifications. Note that a loss estimate of 0% for current loans overestimates the actual return, especially at the beginning of an investment, when all loans in an investor's portfolio are current. For a more realistic loss estimate, we compute the probability that a loan is charged off within the last nine months. Using Maximum Likelihood Estimation (MLE), we estimate the loss factor for current loans as the ratio of all charged-off loans to all issued loans in the past nine months. This ratio is 0.28%, and we use it as the loss factor to compute a more reliable Adjusted NAR.
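As a rough sketch of this estimate, the loss factor can be computed as a simple ratio. The code below assumes one reading of the text: it takes the loans issued within the nine months before a reference date and counts how many of them are charged off. The record layout (`issue_date`, `status`) and the sample data are hypothetical.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def current_loan_loss_factor(loans, as_of, window_months=9):
    """Share of loans issued in the window that are charged off.

    For a Bernoulli charge-off event, this ratio is the maximum
    likelihood estimate of the charge-off probability.
    """
    recent = [
        loan for loan in loans
        if 0 <= months_between(loan["issue_date"], as_of) < window_months
    ]
    if not recent:
        return 0.0
    charged_off = sum(1 for loan in recent if loan["status"] == "Charged Off")
    return charged_off / len(recent)


# Made-up records for illustration only.
loans = [
    {"issue_date": date(2015, 3, 1), "status": "Charged Off"},
    {"issue_date": date(2015, 5, 1), "status": "Current"},
    {"issue_date": date(2015, 8, 1), "status": "Current"},
    {"issue_date": date(2015, 1, 1), "status": "Fully Paid"},
    {"issue_date": date(2014, 6, 1), "status": "Charged Off"},  # outside window
]
factor = current_loan_loss_factor(loans, as_of=date(2015, 9, 30))
```

The resulting factor would then replace the 0% loss estimate for current loans when computing Adjusted NAR.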

Data Modeling

An important characteristic of Lending Club’s historical data is that it is highly imbalanced. A majority of borrowers pay off their loans, with only about 7% of Lending Club’s loans charged off. This imbalance between paid and charged-off loans makes it difficult to build models that simultaneously minimize both false positives (borrowers who are approved but default on their loans) and false negatives (applicants who are rejected but would have paid as agreed). To address this problem, we used stratified sampling to construct a balanced training set, adjusting the class prior probabilities, and trained a random forest model on it.
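One simple way to build a balanced training set is to downsample the majority class so that each class contributes the same number of rows. The sketch below illustrates that idea only; the study does not specify its exact sampling scheme, and the `charged_off` field and synthetic rows are assumptions.

```python
import random


def balanced_sample(rows, label_key="charged_off", seed=0):
    """Draw an equal number of rows from each class (downsampling)."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    # Every class is cut down to the size of the smallest class.
    per_class = min(len(group) for group in by_class.values())
    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, per_class))
    rng.shuffle(sample)
    return sample


# Synthetic data with roughly 7% positives (100 of 1400 rows).
rows = [{"id": i, "charged_off": i % 14 == 0} for i in range(1400)]
train = balanced_sample(rows)
```

Because the balanced set no longer reflects the true charge-off rate, predicted probabilities from a model trained on it need to be interpreted against the original class priors.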

We chose a random forest model for this task because random forests offer high classification performance when taking large numbers of features into account. We formulated the task as a binary classification problem and trained the model on Lending Club’s historical data of accepted loans to predict which loans are more likely to charge off. A loan is labeled as positive if it is charged off, in default, or late.
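The binary target described above can be derived from a loan's status with a small helper. This is a sketch under assumptions: the status strings are illustrative of how such a status field might read, and the study does not say how it treated intermediate states such as grace periods.

```python
def is_bad_loan(status):
    """Label a loan as positive if it is charged off, in default, or late.

    The status strings matched here are assumptions for illustration.
    """
    normalized = status.strip().lower()
    return normalized in {"charged off", "default"} or normalized.startswith("late")


statuses = ["Fully Paid", "Charged Off", "Late (31-120 days)", "Current", "Default"]
labels = [is_bad_loan(s) for s in statuses]
```

These labels would serve as the target column when fitting the classifier on features known at issuance.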

Predictive Performance

The most significant outcome of Arimo’s study was that the finished model improved the ability to classify issued loans by charge-off risk. The Receiver Operating Characteristic (ROC) curve in Figure 3 shows the performance of our model.

Figure 3: Receiver Operating Characteristic curve
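A ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen charged-off loan receives a higher risk score than a randomly chosen paid loan. The sketch below computes that quantity directly from pairwise comparisons; the labels and scores are made-up values, not results from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only: 1 = charged off, 0 = paid.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of loans, so it suits small checks like this one; a rank-based computation is preferable on a full loan history.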

This improvement translated into a 2.75% increase in adjusted NAR over the average adjusted NAR on all loans. While 2.75% might be considered a small improvement, when considered over loan volumes exceeding $10 billion per year, this gain translates into tens of millions of dollars in additional returns to investors.

Viewing the performance of the model over time in Figure 4, it’s clear that Lending Club’s data scientists have been hard at work and the advantage of Arimo’s model gradually disappears after the beginning of 2014.

Figure 4: Predicted adjusted NAR versus actual adjusted NAR over time

The adjusted NAR improvement is critical to Lending Club’s business and revenue growth, because the company’s primary revenue comes from a one-time origination fee for each loan (paid by borrowers) and a 1% service charge applied to each monthly payment. If borrower interest rates are kept unchanged, the improved return makes investing more attractive and motivates investors to invest more. Alternatively, if the company keeps investor returns the same, it can lower interest rates for borrowers and motivate them to borrow more. Either way, by using a predictive model to improve adjusted NAR, Lending Club can grow its revenue by increasing the total amount of funded loans.

* Average rates for September 2015

Lending Club is a registered trademark of Lending Club, Inc.