Natural Language Processing for Loan Risk

Adding spaCy Word Vectors to a Keras Model

October 23, 2020
(last updated October 24, 2020)

The story so far
Exploratory data analysis
Imputing missing values
Optimizing data types
Creating document vectors
Building the pipeline
Evaluating the model
Next steps

The story so far

A few months ago, I built a neural network regression model to predict loan risk, training it with a public dataset from LendingClub. Then I built a public API with Flask to serve the model’s predictions.

Then last month, I put the model to the test and found that it can pick grade A loans better than LendingClub!

But I’m not done. Now that I’ve learned the fundamentals of natural language processing (I highly recommend Kaggle’s course on the subject), I’m going to see if I can eke out a bit more predictive power using a couple of freeform text fields in the dataset: title and desc (description).

import joblib

prev_notebook_folder = "../input/building-a-neural-network-to-predict-loan-risk/"
loans = joblib.load(prev_notebook_folder + "loans_for_nlp.joblib")
num_loans = loans.shape[0]
print(f"This dataset includes {num_loans:,} loans.")

This dataset includes 1,110,171 loans.

loans.head()

	loan_amnt	term	emp_length	home_ownership	annual_inc	purpose	dti	delinq_2yrs	cr_hist_age_mths	fico_range_low	...	tot_hi_cred_lim	total_bal_ex_mort	total_bc_limit	total_il_high_credit_limit	fraction_recovered	issue_d	title	desc
0	3600.0	36 months	10+ years	MORTGAGE	55000.0	debt_consolidation	5.91	0.0	148	675.0	...	178050.0	7746.0	2400.0	13734.0	1.0	Dec-2015	Debt consolidation	NaN
1	24700.0	36 months	10+ years	MORTGAGE	65000.0	small_business	16.06	1.0	192	715.0	...	314017.0	39475.0	79300.0	24667.0	1.0	Dec-2015	Business	NaN
2	20000.0	60 months	10+ years	MORTGAGE	63000.0	home_improvement	10.78	0.0	184	695.0	...	218418.0	18696.0	6200.0	14877.0	1.0	Dec-2015	NaN	NaN
4	10400.0	60 months	3 years	MORTGAGE	104433.0	major_purchase	25.37	1.0	210	695.0	...	439570.0	95768.0	20300.0	88097.0	1.0	Dec-2015	Major purchase	NaN
5	11950.0	36 months	4 years	RENT	34000.0	debt_consolidation	10.20	0.0	338	690.0	...	16900.0	12798.0	9400.0	4000.0	1.0	Dec-2015	Debt consolidation	NaN

5 rows × 69 columns

This post, like its predecessors, was adapted from a Jupyter Notebook, so feel free to fork my notebook on Kaggle or GitHub if you’d like to follow along.

Exploratory data analysis

There isn’t too much exploratory data analysis left to do after how thoroughly I cleaned the data in my first post, but I do have a few quick questions about the title and desc fields I’d like to answer before I move on.

How many loans use each field?
Have these fields always been included in the loan application?
What is the typical length of each field (in number of words)?

nlp_cols = ["title", "desc"]

loans[nlp_cols].describe()

	title	desc
count	1097288	71967
unique	35863	70927
top	Debt consolidation
freq	573992	23

If the most frequent desc value is empty (or maybe just whitespace), perhaps I need to convert all empty or whitespace-only values to NaN before continuing.

import re
import numpy as np

for col in nlp_cols:
    # Keep values with at least one non-whitespace character; blank out the rest.
    replace_empties = lambda x: x if re.search(r"\S", x) else np.nan
    loans[col] = loans[col].map(replace_empties, na_action="ignore")

description = loans[nlp_cols].describe()
description

	title	desc
count	1097288	71943
unique	35863	70925
top	Debt consolidation	Borrower added on 03/17/14 > Debt consolidat...
freq	573992	9

Thankfully that didn’t remove too many values, but this “Borrower added on [date]” prefix worries me now. I’ll deal with that a little later.
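As an aside, the same blank-to-NaN cleanup can be written without a Python-level lambda by using the regex mode of pandas’ `replace`. This is only a sketch on made-up sample values, not what the notebook runs; the `sample` and `cleaned` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a text column such as `desc`.
sample = pd.Series(
    ["Debt consolidation", "", "   ", "\t\n", None, " Pay off bills "],
    dtype=object,
)

# ^\s*$ only matches strings that are empty or entirely whitespace,
# so values with any visible content are left untouched.
cleaned = sample.replace(r"^\s*$", np.nan, regex=True)

print(cleaned)
print("Non-missing values left:", cleaned.count())
```

Missing values pass through unchanged, which is the same effect `na_action="ignore"` has in the `map` version.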

for col in nlp_cols:
    percentage = int(description.at["count", col] / num_loans * 100)
    print(f"`{col}` is used in {percentage}% of loans in the dataset.")

percentage = int(description.at["freq", "title"] / num_loans * 100)
print(f'The title "Debt consolidation" is used in {percentage}% of loans.')

`title` is used in 98% of loans in the dataset.
`desc` is used in 6% of loans in the dataset.
The title "Debt consolidation" is used in 51% of loans.

These fields may not be as useful as I had previously thought. Even though there are 35,863 unique titles used across the dataset, 51% of loans just use “Debt consolidation”. Maybe the titles are more descriptive in the other 49%?

And the desc field is only used with 6% of loans.
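Note that `int()` truncates, so the percentages printed above are rounded down. A quick sketch recomputing them to one decimal place, using only the counts already shown in the `describe()` output and the loan total (the `counts` and `shares` names are just for illustration):

```python
# Counts copied from the outputs above.
num_loans = 1_110_171
counts = {
    "title is present": 1_097_288,
    "desc is present": 71_943,
    'title is "Debt consolidation"': 573_992,
}

# Share of all loans for each count, as a fraction between 0 and 1.
shares = {label: count / num_loans for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The conclusions don’t change, but it’s worth knowing that the share of loans with no title is closer to 1% than 2%.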

Now to check and see when these fields were introduced.

# `issue_d` is just the month and year the loan was issued, by the way.
loans["issue_d"] = loans["issue_d"].astype("datetime64[ns]")

print("Total date range:")
print(loans["issue_d"].agg(["min", "max"]))
print("\n`title` date range:")
print(loans[["title", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))
print("\n`desc` date range:")
print(loans[["desc", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))

Total date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`title` date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`desc` date range:
min   2012-08-01
max   2016-07-01
Name: issue_d, dtype: datetime64[ns]

Neither of these fields was introduced late, but LendingClub may have stopped using the desc field after July 2016, which covers roughly the last two and a half years of the dataset.
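The min/max check only shows the endpoints. To see how usage changes in between, one option is to compute the share of loans with a description per issue year. Here’s a minimal sketch on a tiny made-up frame (the `toy` data and variable names are hypothetical; on the real data you would group `loans` the same way):

```python
import pandas as pd

# Made-up rows: a description is present in early years and missing later.
toy = pd.DataFrame(
    {
        "issue_d": pd.to_datetime(
            ["2012-08-01", "2013-04-01", "2014-01-01",
             "2016-07-01", "2017-03-01", "2018-12-01"]
        ),
        "desc": ["a", "b", None, "c", None, None],
    }
)

years = toy["issue_d"].dt.year
with_desc = toy.groupby(years)["desc"].count()  # non-missing descriptions
all_loans = toy.groupby(years)["desc"].size()   # all loans, missing or not
usage = (with_desc / all_loans).rename("desc_share")

print(usage)
```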

Now I’ll take a closer look at values in these fields.

import pandas as pd

with pd.option_context("display.min_rows", 50):
    print(loans["title"].value_counts())

Debt consolidation                       573992
Credit card refinancing                  214423
Home improvement                          64028
Other                                     56166
Major purchase                            20734
Medical expenses                          11454
Debt Consolidation                        10638
Business                                  10142
Car financing                              9660
Moving and relocation                      6806
Vacation                                   6707
Home buying                                5097
Consolidation                              4069
debt consolidation                         3310
Credit Card Consolidation                  1607
consolidation                              1538
Debt Consolidation Loan                    1265
Consolidation Loan                         1260
Personal Loan                              1040
Credit Card Refinance                      1020
Home Improvement                           1016
Credit Card Payoff                          991
Consolidate                                 947
Green loan                                  626
Loan                                        621
                                          ...
House Buying Consolidation                    1
Credit Card Deby                              1
Crdit cards                                   1
"CCC"                                         1
Loan to Moving & Relocation Expense           1
BILL PAYMENT                                  1
creit card pay off                            1
Auto Repair & Debt Consolidation              1
BMW 2004                                      1
Moving Expenses - STL to PHX                  1
 Pay off Bills                                1
Room addition                                 1
Optimistic                                    1
Consolid_loan2                                1
ASSISTANCE NEEDED                             1
My bail out                                   1
myfirstloan                                   1
second home                                   1
Just consolidating credit cards               1
Financially Sound Loan                        1
refinance loans and home improvements         1
credit cart refincition                       1
Managable Repayment Plan                      1
ccdebit                                       1
Project Pay Off Debt                          1
Name: title, Length: 35863, dtype: int64

Interesting. It seems like there’s plenty of variety in loan titles in the other 49%. A lot of them seem to directly correspond to the purpose categorical field, but not so many as to make this field useless, I think.
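Many of the listed titles differ only in capitalization. I don’t normalize them in this post, but as a rough sketch of how much the variants overlap, here are a few of the counts from the output above merged after lowercasing and collapsing whitespace (the `title_counts` subset and `normalized` name are just for illustration):

```python
from collections import Counter

# A handful of title counts copied from the value_counts() output above.
title_counts = {
    "Debt consolidation": 573_992,
    "Debt Consolidation": 10_638,
    "debt consolidation": 3_310,
    "Home improvement": 64_028,
    "Home Improvement": 1_016,
}

normalized = Counter()
for title, count in title_counts.items():
    # Lowercase, then split/join to trim and collapse runs of whitespace.
    key = " ".join(title.lower().split())
    normalized[key] += count

for key, count in normalized.most_common():
    print(f"{key}: {count:,}")
```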

Side note: I discovered at one point when perusing this column that someone entered the Konami Code as the title of their loan application, and their inclusion in this dataset means that the code apparently worked for them—they got the loan.

loans[loans["title"] == "up up down down left right left right ba"][
    ["loan_amnt", "title", "issue_d"]
]

	loan_amnt	title	issue_d
1856340	12000.0	up up down down left right left right ba	2013-04-01

loans["desc"].value_counts()

  Borrower added on 03/17/14 > Debt consolidation<br>                                                                                                                                                                                                                                        9
  Borrower added on 01/15/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 02/19/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 03/10/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 01/29/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
                                                                                                                                                                                                                                                                                            ..
  Borrower added on 01/14/13 > Credit Card consolidation<br>                                                                                                                                                                                                                                 1
  Borrower added on 03/14/14 > Debts consolidation and cash for minor improvements on condominium<br>                                                                                                                                                                                        1
  Borrower added on 03/02/14 > I lost a house and need to pay taxes nd have credit card debt thatI already pay $350 a month on and it goes nowhere.<br>                                                                                                                                      1
  Borrower added on 04/09/13 > I want to put in a conscious effort in eliminating my debt by converting high interest cards to a fixed payment that can be effectively managed by me.<br>                                                                                                    1
  Borrower added on 09/18/12 > Want to become debt free, because of several circumstances and going back to school I got into debt. I want to pay for what I have purchased without it having an effect on my credit. That is why I want to consolidate my debt and become debt free!<br>    1
Name: desc, Length: 70925, dtype: int64

Do all these descriptions start with “Borrower added on [date]”?

pattern = r"^\s*Borrower added on \d\d/\d\d/\d\d > "
prefix_count = (
    loans["desc"]
    .map(lambda x: True if re.search(pattern, x, re.I) else None, na_action="ignore")
    .count()
)
print(
    f"{prefix_count:,} loan descriptions begin with that pattern.",
    f"({description.loc['count', 'desc'] - prefix_count:,} do not.)",
)

71,858 loan descriptions begin with that pattern. (85 do not.)

Well, now I need to check those other 85.

other_desc_map = loans["desc"].map(
    lambda x: False if pd.isna(x) or re.search(pattern, x, re.I) else True
)
other_descs = loans["desc"][other_desc_map]
other_descs.value_counts()

Debt Consolidation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 2
I would like to pay off 3 different credit cards, at 12%, 17% and 22% (after initial 0% period is up).  It would be great to have everything under one loan, making it easier to pay off.  Also, once I've paid off or down the loan, I can start looking into buying a house.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     1
loan will be used for paying off credit card.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1
This loan will be used to consolidate high interest credit card debt.    Over the course of this past year my wife and I had our first child, purchase a home and received a large bonus from work.  With the new home and the child on the way I chose to spread my tax withholdings on the bonus to all checks received in 2008 this caused my monthly income to fall by $1500.  This in combination with an unexpected additional down payment for our home of $17,000 with only a weeks notice we were force to dip into our Credit Cards for the past several months.    Starting January 1, 2009 I will be able to readjust my tax withholding and start to pay off the Credit Card debt we have racked up.  This loan will help lower the interest rate during the repayment period and give one central place for payment.  My wife and I have not missed a payment or been late for the past 5 years.  My fico score is 670 mainly due to several low limit credit cards near their max.  I manage the international devision of a software company and my wife is a kindergarten teacher, combined we make 140K a year.    Thank you for your consideration and I look forward to working with you.      1
to pay off different credit cards to consolidate my debt, so I can have just one monthly payment.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ..
Hello, I would like to consolidate my debt into a lower more convenient payment. I have a very stable career of more than 20 years with the same company. My community is in a part of the country that made it through the last few years basically unscathed and has a very promising future.<br>Thank You<br>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1
consolidate my debt                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1
I am looking to pay off my credit card debts.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1
This loan is to help me payoff my credit card debt. I've done what I can to negotiate lower rates, but the interest is killing me and my monthly payments are basically just taking care of interest. Paying them off will give me the fresh start I need on my way to financial independence. Thank you.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1
I have been in business for a year and want to eliminate some personal debt and use the remainder of the loan to take care of business expenses. Also lessening the number of trade lines I have open puts me in a better position to pursue business loans since it will  be based on my personal credit. A detailed report can be created to show where exactly the funds will go and this can be provided at any time during the course of the loan.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1
Name: desc, Length: 84, dtype: int64
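To make the prefix check concrete, here’s a small sketch that applies the same pattern to a few strings taken from the outputs above. The `samples` list and `has_prefix` name are just for illustration.

```python
import re

# Same pattern as above, compiled once with case-insensitive matching.
prefix = re.compile(r"^\s*Borrower added on \d\d/\d\d/\d\d > ", re.I)

samples = [
    "  Borrower added on 03/17/14 > Debt consolidation<br>",
    "Debt Consolidation",
    "consolidate my debt",
    "  Borrower added on 01/14/13 > Credit Card consolidation<br>",
]

# True where the description starts with the "Borrower added on" prefix.
has_prefix = [bool(prefix.search(s)) for s in samples]

for s, flag in zip(samples, has_prefix):
    print(flag, "|", s)
```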

It looks like the borrower may be able to add information to the description at different points in time. I should check and see if any of those dates come after the actual issue date of the loan.

from datetime import datetime, date

for row in loans[["desc", "issue_d"]].itertuples():
    if not pd.isna(row.desc):
        month_after_issue = date(
            day=row.issue_d.day,
            month=row.issue_d.month % 12 + 1,
            year=row.issue_d.year + row.issue_d.month // 12,
        )

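        # Collect every MM/DD/YY-style date mentioned in the description.
        date_strings = re.findall(r"\d\d/\d\d/\d\d", row.desc)
        dates = []
        for string in date_strings:
            try:
                dates.append(datetime.strptime(string, "%m/%d/%y").date())
            except ValueError:
                # Skip digit groups that are not valid dates (e.g. "13/45/99").
                continue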

        for d in dates:
            if d >= month_after_issue:
                print(f"{row.issue_d} – {row.desc}")
                break

2014-01-01 00:00:00 –   Borrower added on 01/08/14 > I am tired of making monthly payments and getting nowhere.  With your help, except for my mortgage, I intend to be completely debt free by 12/31/2016.<br>
2014-01-01 00:00:00 –   Borrower added on 01/08/14 > We have been engaged for  2 1/2yrs and wanted to bring out blended family together as one. We are set to get married on 03/22/14 and we are paying for it on our own. We saved the majority of the budget unfortunately there were a few unexpected cost that we still need help with.<br>
2014-01-01 00:00:00 –   Borrower added on 01/06/14 > I am getting married 04/05/2014 and I want to have a cushion for expenses just in case.<br>
2014-01-01 00:00:00 – BR called in to push payment date to 09/19/14 because of not having the exact amount of funds in their bank account.  Payment was processing. Was able to cancel. It is within grace period.
2014-01-01 00:00:00 –   Borrower added on 01/01/14 > This loan is to consolidate my credit cards debt. I made one year this past  11/28/2013 at my current job. I considered to have job security because I'm a good employee. I make all may credit cards payments on time.<br>
2013-05-01 00:00:00 –   Borrower added on 04/27/13 > My father passed away 05/12/2012 and I had to pay for the funeral.  My mother could not afford it.  He was not ill so I could not have planned it.  I paid with what I had in my savings and the rest I had to pay with my credit cards.  I would like to pay off the CC &amp; pay one monthly payment.<br><br>  Borrower added on 04/27/13 > My paerents own the house so I do not pay rent.    The utilities, insurance and taxes, etc my mother pays.  She can afford that.  I help when needed.<br>
2013-02-01 00:00:00 –   Borrower added on 02/10/13 > I am getting married in a week (02/17/2013) and have made some large purchases across my credit cards.  I would like to consolidate all of my debt with this low rate loan.<br><br> Borrower added on 02/10/13 > I will be getting married in a week (02/17/13) and have had to make some large purchases on my CC. I am financially sound otherwise with low debt obligations.<br>
2012-12-01 00:00:00 –   Borrower added on 12/10/12 > Approximately 1 year ago I had a highefficency furnace /AC installed.  The installing Co. used GECRB to get me a loan.  If I payoff the loan within one year, I pay no interest.  The interest rate if not payed by 12/23/2012 is 26.99%.  A 6.62% rate sounds a lot better.<br>
2012-11-01 00:00:00 –   Borrower added on 11/19/12 > Looking to finish off consolidating the rest of my bills and lower my payments on my exsisting loan. Thanks!!!<br><br>  Borrower added on 11/20/12 > Thanks again for everyone who has invested thus far. With this loan it will give me the ability to have only one payment monthly besides utilities and I will be almost debt free by my wedding date of 12/13/14!! Thanks again everyone!<br>
2012-10-01 00:00:00 –   Borrower added on 10/22/12 > Need money by 10/26/2012 to purchase property on discounted APR.<br>

Good, all of the dates that come after the month the loan is issued only come up because the borrower is talking about a future event.
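The `month % 12 + 1` and `month // 12` arithmetic in that loop is easy to misread, so here’s a short sketch of the rollover logic on its own. The `month_after` helper is a hypothetical name, and it assumes the day exists in the following month, which holds here because `issue_d` always falls on the first.

```python
from datetime import date


def month_after(d):
    """Return the same day of the following month, rolling over the year."""
    return date(
        year=d.year + d.month // 12,  # only December (12 // 12 == 1) adds a year
        month=d.month % 12 + 1,       # 1..11 -> 2..12, and 12 -> 1
        day=d.day,
    )


for issued in (date(2014, 1, 1), date(2013, 11, 1), date(2012, 12, 1)):
    print(issued, "->", month_after(issued))
```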

Now to clean these desc values up a bit. I’m going to remove the “Borrower added on [date] >” prefixes and the <br> tags, since those don’t add value to the description content.

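def clean_desc(desc):
    # Leave missing values untouched.
    if pd.isna(desc):
        return desc
    # Replace each "Borrower added on MM/DD/YY > " prefix and each <br> tag with a space.
    return re.sub(r"\s*Borrower added on \d\d/\d\d/\d\d > |<br>", " ", desc).strip()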


loans["desc"] = loans["desc"].map(clean_desc)
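As a quick check of what this cleanup does, here’s a standalone sketch that applies the same substitution to a shortened, made-up description with two “Borrower added on” entries. The `clean_text` helper and `raw` string are hypothetical stand-ins.

```python
import re

# Same substitution as clean_desc, without the pandas missing-value check.
PREFIX_OR_BR = re.compile(r"\s*Borrower added on \d\d/\d\d/\d\d > |<br>")


def clean_text(text):
    return PREFIX_OR_BR.sub(" ", text).strip()


raw = (
    "  Borrower added on 11/19/12 > Thanks!!!<br><br>"
    "  Borrower added on 11/20/12 > Thanks again everyone!<br>"
)
cleaned = clean_text(raw)
print(repr(cleaned))
```

Each removed piece becomes a single space, so runs of several spaces can remain between entries. That should be harmless if the tokenizer used later treats consecutive whitespace as one separator, but I’m stating that as an assumption rather than something verified here.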

Imputing missing values

Since fewer than 2% of loans in this set are missing a title, and since most titles simply copy the loan’s purpose, I’m going to impute missing titles with their loan’s purpose.

# Assign the result back rather than calling fillna(inplace=True) on a column selection.
loans["title"] = loans["title"].fillna(
    loans["purpose"].map(lambda x: x.replace("_", " ").capitalize())
)

Since only 6% of loans use a description, I’ll just impute missing descriptions with an empty string. I’m going to wait and include that as a pipeline step a little later, though.
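To show what both imputations produce, here’s a small sketch on a made-up frame. The `toy` data is hypothetical, and the plain `fillna("")` for desc is only a stand-in for the pipeline step mentioned above, not the actual implementation.

```python
import pandas as pd

# Made-up rows: two loans are missing both title and description.
toy = pd.DataFrame(
    {
        "purpose": ["debt_consolidation", "small_business", "home_improvement"],
        "title": [None, "Business", None],
        "desc": [None, "Expanding my shop", None],
    },
    dtype=object,
)

# Turn e.g. "debt_consolidation" into "Debt consolidation" for use as a title.
fallback_titles = toy["purpose"].map(lambda x: x.replace("_", " ").capitalize())
toy["title"] = toy["title"].fillna(fallback_titles)

# Stand-in for the later pipeline step: missing descriptions become "".
toy["desc"] = toy["desc"].fillna("")

print(toy)
```

Existing titles are left alone, so the second row keeps “Business” rather than becoming “Small business”.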

Optimizing data types

I’d really love to get right to the fun part, converting these text fields into document vectors, but I ran into a problem the first several times I tried doing so. Manually adding two sets of 300-dimensional vectors to this 1,110,171-row DataFrame caused its size in memory to skyrocket, exhausting the 16 GB of memory Kaggle gives me.
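A back-of-the-envelope estimate helps show the scale. The sketch below computes only the raw storage for the vector values themselves, assuming they are held as 32-bit or 64-bit floats (that dtype choice is my assumption). Whatever else contributed to running out of memory, such as intermediate copies or per-object overhead, is not captured here.

```python
num_loans = 1_110_171
dims_per_vector = 300
num_text_cols = 2  # title and desc

# Total number of floating-point values across both sets of vectors.
n_values = num_loans * dims_per_vector * num_text_cols

GIB = 1024 ** 3
bytes_float32 = n_values * 4
bytes_float64 = n_values * 8

print(f"{n_values:,} values")
print(f"float32: {bytes_float32 / GIB:.2f} GiB")
print(f"float64: {bytes_float64 / GIB:.2f} GiB")
```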

My first attempt to fix this was optimizing my data types, which still didn’t solve the problem on its own, but it’s a worthwhile step to take anyway.
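Here’s a minimal sketch on a few made-up rows showing why these conversions save memory (the `toy` frame is hypothetical). `pd.to_numeric` with `downcast` picks the smallest numeric dtype that can hold the values, and a categorical column stores one small integer code per row instead of a full string.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "loan_amnt": [3600.0, 24700.0, 20000.0, 10400.0],
        "dti": [5.91, 16.06, 10.78, 25.37],
        "term": ["36 months", "36 months", "60 months", "60 months"],
    }
)
before = toy.memory_usage(index=False, deep=True)

# Whole-number floats can be stored as small unsigned integers.
toy["loan_amnt"] = pd.to_numeric(toy["loan_amnt"], downcast="unsigned")
# Fractional values drop from 64-bit to 32-bit floats.
toy["dti"] = pd.to_numeric(toy["dti"], downcast="float")
# Repeated strings become integer codes plus a short list of categories.
toy["term"] = toy["term"].astype("category")

after = toy.memory_usage(index=False, deep=True)
print(pd.DataFrame({"before": before, "after": after, "dtype": toy.dtypes}))
```

On a frame this small, the categorical column’s fixed overhead can outweigh its savings. The benefit shows up when a few distinct values repeat across many rows.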

After removing the issue_d column, which is no longer needed, the dataset contains five types of data: float, integer, ordinal, (unordered) categorical, and text.

from pandas.api.types import CategoricalDtype


loans = loans.drop(columns=["issue_d"])

float_cols = ["annual_inc", "dti", "inv_mths_since_last_delinq",
    "inv_mths_since_last_record", "revol_util", "inv_mths_since_last_major_derog",
    "annual_inc_joint", "dti_joint", "bc_util", "inv_mo_sin_rcnt_rev_tl_op",
    "inv_mo_sin_rcnt_tl", "inv_mths_since_recent_bc", "inv_mths_since_recent_bc_dlq",
    "inv_mths_since_recent_inq", "inv_mths_since_recent_revol_delinq", "pct_tl_nvr_dlq",
    "percent_bc_gt_75", "fraction_recovered"]
int_cols = ["loan_amnt", "delinq_2yrs", "cr_hist_age_mths", "fico_range_low",
    "fico_range_high", "inq_last_6mths", "open_acc", "pub_rec", "revol_bal",
    "total_acc", "collections_12_mths_ex_med", "acc_now_delinq", "tot_coll_amt",
    "tot_cur_bal", "total_rev_hi_lim", "acc_open_past_24mths", "avg_cur_bal",
    "bc_open_to_buy", "chargeoff_within_12_mths", "delinq_amnt", "mo_sin_old_il_acct",
    "mo_sin_old_rev_tl_op", "mort_acc", "num_accts_ever_120_pd", "num_actv_bc_tl",
    "num_actv_rev_tl", "num_bc_sats", "num_bc_tl", "num_il_tl", "num_op_rev_tl",
    "num_rev_accts", "num_rev_tl_bal_gt_0", "num_sats", "num_tl_120dpd_2m",
    "num_tl_30dpd", "num_tl_90g_dpd_24m", "num_tl_op_past_12m", "pub_rec_bankruptcies",
    "tax_liens", "tot_hi_cred_lim", "total_bal_ex_mort", "total_bc_limit",
    "total_il_high_credit_limit"]
ordinal_cols = ["emp_length"]
category_cols = ["term", "home_ownership", "purpose", "application_type"]
text_cols = nlp_cols

size_metrics = pd.DataFrame(
    {
        "previous_dtype": loans.dtypes,
        "previous_size": loans.memory_usage(index=False, deep=True),
    }
)
previous_size = loans.memory_usage(deep=True).sum()


for col_name in float_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="float")

for col_name in int_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="unsigned")

emp_length_categories = ["< 1 year", "1 year", "2 years", "3 years", "4 years",
    "5 years", "6 years", "7 years", "8 years", "9 years", "10+ years"]
emp_length_type = CategoricalDtype(categories=emp_length_categories, ordered=True)
loans["emp_length"] = loans["emp_length"].astype(emp_length_type)

for col_name in category_cols:
    loans[col_name] = loans[col_name].astype("category")


current_size = loans.memory_usage(deep=True).sum()
reduction = (previous_size - current_size) / previous_size
print(f"Reduced DataFrame size in memory by {int(reduction * 100)}%.")

size_metrics["current_dtype"] = loans.dtypes
size_metrics["current_size"] = loans.memory_usage(index=False, deep=True)
pd.options.display.max_rows = 100
size_metrics

Reduced DataFrame size in memory by 69%.

	previous_dtype	previous_size	current_dtype	current_size
loan_amnt	float64	8881368	uint16	2220342
term	object	73271286	category	1110383
emp_length	object	71853397	category	1111197
home_ownership	object	69841784	category	1110700
annual_inc	float64	8881368	float32	4440684
purpose	object	79927721	category	1111750
dti	float64	8881368	float32	4440684
delinq_2yrs	float64	8881368	uint8	1110171
cr_hist_age_mths	int64	8881368	uint16	2220342
fico_range_low	float64	8881368	uint16	2220342
fico_range_high	float64	8881368	uint16	2220342
inq_last_6mths	float64	8881368	uint8	1110171
inv_mths_since_last_delinq	float64	8881368	float32	4440684
inv_mths_since_last_record	float64	8881368	float32	4440684
open_acc	float64	8881368	uint8	1110171
pub_rec	float64	8881368	uint8	1110171
revol_bal	float64	8881368	uint32	4440684
revol_util	float64	8881368	float32	4440684
total_acc	float64	8881368	uint8	1110171
collections_12_mths_ex_med	float64	8881368	uint8	1110171
inv_mths_since_last_major_derog	float64	8881368	float32	4440684
application_type	object	74360578	category	1110384
annual_inc_joint	float64	8881368	float32	4440684
dti_joint	float64	8881368	float32	4440684
acc_now_delinq	float64	8881368	uint8	1110171
tot_coll_amt	float64	8881368	uint32	4440684
tot_cur_bal	float64	8881368	uint32	4440684
total_rev_hi_lim	float64	8881368	uint32	4440684
acc_open_past_24mths	float64	8881368	uint8	1110171
avg_cur_bal	float64	8881368	uint32	4440684
bc_open_to_buy	float64	8881368	uint32	4440684
bc_util	float64	8881368	float32	4440684
chargeoff_within_12_mths	float64	8881368	uint8	1110171
delinq_amnt	float64	8881368	uint32	4440684
mo_sin_old_il_acct	float64	8881368	uint16	2220342
mo_sin_old_rev_tl_op	float64	8881368	uint16	2220342
inv_mo_sin_rcnt_rev_tl_op	float64	8881368	float32	4440684
inv_mo_sin_rcnt_tl	float64	8881368	float32	4440684
mort_acc	float64	8881368	uint8	1110171
inv_mths_since_recent_bc	float64	8881368	float32	4440684
inv_mths_since_recent_bc_dlq	float64	8881368	float32	4440684
inv_mths_since_recent_inq	float64	8881368	float32	4440684
inv_mths_since_recent_revol_delinq	float64	8881368	float32	4440684
num_accts_ever_120_pd	float64	8881368	uint8	1110171
num_actv_bc_tl	float64	8881368	uint8	1110171
num_actv_rev_tl	float64	8881368	uint8	1110171
num_bc_sats	float64	8881368	uint8	1110171
num_bc_tl	float64	8881368	uint8	1110171
num_il_tl	float64	8881368	uint8	1110171
num_op_rev_tl	float64	8881368	uint8	1110171
num_rev_accts	float64	8881368	uint8	1110171
num_rev_tl_bal_gt_0	float64	8881368	uint8	1110171
num_sats	float64	8881368	uint8	1110171
num_tl_120dpd_2m	float64	8881368	uint8	1110171
num_tl_30dpd	float64	8881368	uint8	1110171
num_tl_90g_dpd_24m	float64	8881368	uint8	1110171
num_tl_op_past_12m	float64	8881368	uint8	1110171
pct_tl_nvr_dlq	float64	8881368	float32	4440684
percent_bc_gt_75	float64	8881368	float32	4440684
pub_rec_bankruptcies	float64	8881368	uint8	1110171
tax_liens	float64	8881368	uint8	1110171
tot_hi_cred_lim	float64	8881368	uint32	4440684
total_bal_ex_mort	float64	8881368	uint32	4440684
total_bc_limit	float64	8881368	uint32	4440684
total_il_high_credit_limit	float64	8881368	uint32	4440684
fraction_recovered	float64	8881368	float32	4440684
title	object	82840461	object	82840461
desc	object	46918516	object	46918516
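
To make the pattern in that table easier to read, here is a small sketch of how `pd.to_numeric` picks a dtype when downcasting. The values are invented for illustration; the point is that the smallest type able to hold every value in the column wins.

```python
import pandas as pd

# Toy columns, invented for illustration.
counts = pd.Series([0.0, 1.0, 30.0])      # small whole numbers stored as floats
scores = pd.Series([660.0, 850.0])        # too large for 8 bits
balances = pd.Series([0.0, 250_000.0])    # too large for 16 bits
ratios = pd.Series([0.5, 16.25])          # genuinely fractional

print(pd.to_numeric(counts, downcast="unsigned").dtype)    # uint8
print(pd.to_numeric(scores, downcast="unsigned").dtype)    # uint16
print(pd.to_numeric(balances, downcast="unsigned").dtype)  # uint32
print(pd.to_numeric(ratios, downcast="float").dtype)       # float32

# Repeated strings shrink a lot once stored as category codes.
terms = pd.Series(["36 months", "60 months"] * 1000)
as_object = terms.memory_usage(index=False, deep=True)
as_category = terms.astype("category").memory_usage(index=False, deep=True)
print(as_category < as_object)  # True
```

The text columns stay as `object` because every title and description is potentially unique, so there is nothing for a categorical encoding to compress.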

Creating document vectors

Now the fun part. Wrapping my spaCy document vector function in a scikit-learn FunctionTransformer turned out to be the secret that kept this process within memory limits. Scikit-learn must just be way better optimized than whatever manual process I was using (go figure).

import spacy
from sklearn.preprocessing import FunctionTransformer


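def get_doc_vectors(X):
    """Turn each text column of X into a document vector, concatenated per row."""
    n_cols = X.shape[1]
    # Only the word vectors are needed, so skip the heavier pipeline components
    nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

    result = []
    for row in X:
        result_row = []
        for i in range(n_cols):
            result_row.append(nlp(row[i]).vector)

        # Join the per-column vectors end to end into one feature row
        result.append(np.concatenate(result_row))

    return np.array(result)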


vectorizer = FunctionTransformer(get_doc_vectors)
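
If it helps to see what a document vector actually is: by default, spaCy's `Doc.vector` is the average of the document's token vectors. The sketch below imitates that with plain NumPy and a hypothetical three-dimensional vocabulary (`TOY_VECTORS`, `toy_doc_vector`, and `toy_get_doc_vectors` are names I made up for illustration). It simplifies tokenization to whitespace splitting and ignores unknown words, which is not exactly what spaCy does.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors", invented for illustration;
# en_core_web_lg uses 300 dimensions.
TOY_VECTORS = {
    "debt": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "consolidation": np.array([3.0, 2.0, 0.0], dtype=np.float32),
    "business": np.array([0.0, 4.0, 4.0], dtype=np.float32),
}
DIMS = 3


def toy_doc_vector(text):
    """Average the vectors of the known words; empty text gives all zeros."""
    tokens = [t for t in text.lower().split() if t in TOY_VECTORS]
    if not tokens:
        return np.zeros(DIMS, dtype=np.float32)
    return np.mean([TOY_VECTORS[t] for t in tokens], axis=0)


def toy_get_doc_vectors(X):
    """Mirror get_doc_vectors: one vector per text column, concatenated per row."""
    return np.array(
        [np.concatenate([toy_doc_vector(cell) for cell in row]) for row in X]
    )


# Two rows with a (title, desc) pair each; the first desc is empty.
X = np.array([["Debt consolidation", ""], ["Business", "debt"]], dtype=object)
vectors = toy_get_doc_vectors(X)
print(vectors.shape)  # (2, 6)
print(vectors[0])     # [2. 1. 1. 0. 0. 0.]
```

Note how the empty description contributes a block of zeros to the first row. With the real model, that block is 300 columns wide.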

Building the pipeline

First, the transformer. I’ll use scikit-learn’s ColumnTransformer to apply different transformations to different kinds of data.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from pathlib import Path


def generate_cat_encoder(col_name):
    categories = list(loans[col_name].cat.categories)
    if loans[col_name].cat.ordered:
        return (
            col_name,
            OrdinalEncoder(categories=[categories], dtype=np.uint8),
            [col_name],
        )
    else:
        return (
            col_name,
            OneHotEncoder(categories=[categories], drop="if_binary", dtype=np.bool_),
            [col_name],
        )


Path("../tmp/transformer_cache").mkdir(parents=True, exist_ok=True)
transformer = ColumnTransformer(
    [
        (
            "nlp_cols",
            Pipeline(
                [
                    (
                        "nlp_imputer",
                        SimpleImputer(strategy="constant", fill_value=""),
                    ),
                    ("nlp_vectorizer", vectorizer),
                    ("nlp_scaler", StandardScaler(with_mean=False)),
                ],
                verbose=True,
            ),
            make_column_selector("^(title|desc)$"),
        ),
    ]
    + [generate_cat_encoder(col_name) for col_name in ordinal_cols + category_cols],
    remainder=StandardScaler(),
    verbose=True,
)
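
Here is a small sketch of what the two encoders returned by `generate_cat_encoder` do to their columns. The values and the shortened category list are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy columns, invented for illustration.
emp_length = np.array([["1 year"], ["10+ years"], ["< 1 year"]], dtype=object)
term = np.array([["36 months"], ["60 months"], ["36 months"]], dtype=object)

# Ordered category: each value becomes its position in the category list.
ordinal = OrdinalEncoder(
    categories=[["< 1 year", "1 year", "10+ years"]], dtype=np.uint8
)
emp_length_encoded = ordinal.fit_transform(emp_length)
print(emp_length_encoded.ravel())  # [1 2 0]

# Unordered binary category: drop="if_binary" keeps a single indicator column.
one_hot = OneHotEncoder(
    categories=[["36 months", "60 months"]], drop="if_binary", dtype=np.bool_
)
term_encoded = one_hot.fit_transform(term).toarray()
print(term_encoded.shape)  # (3, 1)
```

Passing the category lists explicitly means the encoded columns come out in the same order no matter which values happen to appear in a given training split.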

The model itself will be identical to my previous model, but I’ll use Keras callbacks and a tqdm progress bar to make the training logs much more concise.

import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from tqdm import tqdm

np.random.seed(0)
tf.random.set_seed(1)


class ProgressBar(tf.keras.callbacks.Callback):
    def __init__(self, epochs=100):
        super().__init__()
        self.epochs = epochs

    def on_train_begin(self, logs=None):
        self.progress_bar = tqdm(desc="Training model", total=self.epochs, unit="epoch")

    def on_epoch_end(self, epoch, logs=None):
        self.progress_bar.update()

    def on_train_end(self, logs=None):
        self.progress_bar.close()


class FinalMetrics(tf.keras.callbacks.Callback):
    def on_train_end(self, logs=None):
        metrics_msg = "Final metrics:"
        for metric, value in logs.items():
            metrics_msg += f" {metric}: {value:.5f} -"
        metrics_msg = metrics_msg[:-2]
        print(metrics_msg)


def run_pipeline(X, y, transformer, validate=True):
    X_train, X_val, y_train, y_val = (
        train_test_split(X, y, test_size=0.2, random_state=2)
        if validate
        else (X, None, y, None)
    )

    X_train_t = transformer.fit_transform(X_train)
    X_val_t = transformer.transform(X_val) if validate else None

    model = Sequential()
    model.add(Input((X_train_t.shape[1],)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(16, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(optimizer="adam", loss="mean_squared_logarithmic_error")

    history = model.fit(
        X_train_t,
        y_train,
        validation_data=(X_val_t, y_val) if validate else None,
        batch_size=128,
        epochs=100,
        verbose=0,
        callbacks=[ProgressBar(), FinalMetrics()],
    )

    return history.history, model, transformer
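
To help interpret the loss values reported below, here is a NumPy sketch of the mean squared logarithmic error formula. It is a simplified stand-in (the helper name `msle` is my own), assumes non-negative inputs, and leaves out any clipping the Keras implementation applies.

```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared logarithmic error: mean((log(1 + y) - log(1 + y_hat))^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


# Perfect predictions give zero loss.
print(round(msle([1.0, 1.0, 0.5], [1.0, 1.0, 0.5]), 5))  # 0.0

# Predicting 0 when the target is 1 costs (ln 2)^2.
print(round(msle([1.0], [0.0]), 5))  # 0.48045
```

For targets between 0 and 1, that second case is the largest possible error on a single row, so a loss in the low 0.02s is small relative to that ceiling.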

Evaluating the model

import dill

history_1, _, _ = run_pipeline(
    loans.drop(columns="fraction_recovered").copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_1.pkl")

/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.4s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total= 1.2min
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   8.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total= 1.3min
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.3s


Training model: 100%|██████████| 100/100 [23:41<00:00, 14.22s/epoch]


Final metrics: loss: 0.02365 - val_loss: 0.02360

# Restore save point if needed
import dill

try:
    history_1
except NameError:
    dill.load_session("save_points/model_1.pkl")

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


def plot_loss_metrics(history, model_num=None):
    for metric, values in history.items():
        sns.lineplot(x=range(len(values)), y=values, label=metric)
    plt.xlabel("epoch")
    plt.title(
        f"Model {f'{model_num} ' if model_num else ''}loss metrics during training"
    )
    plt.show()


plot_loss_metrics(history_1, "1")

A line plot entitled “Model 1 loss metrics during training”, with separate lines for training loss and validation loss, plotting the loss metric value on the y-axis across the 100 epochs of training on the x-axis. Training loss falls rapidly and fairly smoothly, with another small but interesting drop around the 40th epoch. The validation loss line, while very jagged, appears on average to follow the same trend as training loss throughout the 100 epochs of training, indicating that the dropout layers in the neural network were sufficient to prevent overfitting.

Well, it didn’t overfit, but this model performed a bit worse than my original, which had settled around a loss of 0.0231. I bet the desc feature is getting in the way—zeroes spanning 300 columns of the input data on 94% of the rows is probably quite confusing to the model. I’ll see what happens if I repeat the process while leaving desc out (making the title vectors the only new feature of this model compared to my original).

history_2, _, _ = run_pipeline(
    loans.drop(columns=["fraction_recovered", "desc"]).copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_2.pkl")

/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.1s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total=  41.3s
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   4.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total=  45.9s
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.1s


Training model: 100%|██████████| 100/100 [22:26<00:00, 13.46s/epoch]


Final metrics: loss: 0.02396 - val_loss: 0.02451

# Restore save point if needed
import dill

try:
    history_2
except NameError:
    dill.load_session("save_points/model_2.pkl")

plot_loss_metrics(history_2, "2")

A line plot entitled “Model 2 loss metrics during training”, with separate lines for training loss and validation loss, plotting the loss metric value on the y-axis across the 100 epochs of training on the x-axis. The validation loss line is even more chaotic this time than in the model 1 plot but still doesn’t appear to be overfitting.

Wow, still not good enough to beat my original model. Just for kicks, I also tried additional runs where I trained for 1,000 epochs, and others where I increased the number of nodes in the first two dense layers to 128 and 64. I also tried decreasing the batch size to 64. Still, none of these beat my original model. I suppose these text features, at least in this form, add no predictive power for loan outcomes. Interesting.
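
To put the gap in perspective, here is the arithmetic on the two final validation losses reported above, relative to the roughly 0.0231 my original model settled around. That baseline is an approximate figure, so treat the percentages as rough.

```python
# Final validation losses from the two runs above, against the approximate
# loss of the original model.
baseline = 0.0231
runs = {"title + desc": 0.02360, "title only": 0.02451}

changes = {}
for name, val_loss in runs.items():
    changes[name] = (val_loss - baseline) / baseline
    print(f"{name}: {changes[name]:+.1%}")
# title + desc: +2.2%
# title only: +6.1%
```

By this measure, dropping desc did not help. Given how jagged the validation curves are, though, some of that difference may just be noise from where each run happened to stop.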

Next steps

If adding these two features decreased predictive capability, then perhaps some of the other variables I was already using are doing the same thing. I should try using some of scikit-learn’s feature selection methods to reduce the dimensionality of the input data.
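
As a starting point, here is a minimal sketch of one such method, `SelectKBest` with `f_regression`, on synthetic data. The data, the choice of scorer, and `k=2` are all assumptions for illustration; I haven't tried this on the loan data yet.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: only columns 0 and 2 drive the target.
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=500)

# Keep the two features with the strongest univariate linear relationship to y.
selector = SelectKBest(score_func=f_regression, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # [0 2]
print(X_reduced.shape)                     # (500, 2)
```

A univariate scorer like this only measures each feature on its own, so it could miss features that matter only in combination with others.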

A more efficient method of hyperparameter optimization would be pretty useful as well. I should give AutoKeras a shot.

Well, that was fun! Have any thoughts on how to better integrate language data into the model? I’d love to hear them on Twitter, Mastodon, Facebook, or LinkedIn.

Want to publish this article on your blog, in your magazine, or anywhere else? This post, like most of the content on my website, is licensed under a Creative Commons Attribution license, so you’re welcome to share it wherever and however you please, as long as you cite me as the author. I’d also enjoy hearing from you if you do publish this somewhere, but that’s totally up to you.