Natural Language Processing for Loan Risk

Adding spaCy Word Vectors to a Keras Model

  1. The story so far
  2. Exploratory data analysis
  3. Imputing missing values
  4. Optimizing data types
  5. Creating document vectors
  6. Building the pipeline
  7. Evaluating the model
  8. Next steps

The story so far

A few months ago, I built a neural network regression model to predict loan risk, training it with a public dataset from LendingClub. Then I built a public API with Flask to serve the model’s predictions.

Then last month, I put my model to the test and found that it can pick grade A loans better than LendingClub!

But I’m not done. Now that I’ve learned the fundamentals of natural language processing (I highly recommend Kaggle’s course on the subject), I’m going to see if I can eke out a bit more predictive power using a couple of freeform text fields in the dataset: title and desc (description).

import joblib

prev_notebook_folder = "../input/building-a-neural-network-to-predict-loan-risk/"
loans = joblib.load(prev_notebook_folder + "loans_for_nlp.joblib")
num_loans = loans.shape[0]
print(f"This dataset includes {num_loans:,} loans.")
This dataset includes 1,110,171 loans.
loans.head()
   loan_amnt       term emp_length home_ownership  annual_inc             purpose    dti  delinq_2yrs  cr_hist_age_mths  fico_range_low  ...
0     3600.0  36 months  10+ years       MORTGAGE     55000.0  debt_consolidation   5.91          0.0               148           675.0  ...
1    24700.0  36 months  10+ years       MORTGAGE     65000.0      small_business  16.06          1.0               192           715.0  ...
2    20000.0  60 months  10+ years       MORTGAGE     63000.0    home_improvement  10.78          0.0               184           695.0  ...
4    10400.0  60 months    3 years       MORTGAGE    104433.0      major_purchase  25.37          1.0               210           695.0  ...
5    11950.0  36 months    4 years           RENT     34000.0  debt_consolidation  10.20          0.0               338           690.0  ...

   pub_rec_bankruptcies  tax_liens  tot_hi_cred_lim  total_bal_ex_mort  total_bc_limit  total_il_high_credit_limit  fraction_recovered   issue_d               title desc
0                   0.0        0.0         178050.0             7746.0          2400.0                     13734.0                 1.0  Dec-2015  Debt consolidation  NaN
1                   0.0        0.0         314017.0            39475.0         79300.0                     24667.0                 1.0  Dec-2015            Business  NaN
2                   0.0        0.0         218418.0            18696.0          6200.0                     14877.0                 1.0  Dec-2015                 NaN  NaN
4                   0.0        0.0         439570.0            95768.0         20300.0                     88097.0                 1.0  Dec-2015      Major purchase  NaN
5                   0.0        0.0          16900.0            12798.0          9400.0                      4000.0                 1.0  Dec-2015  Debt consolidation  NaN

5 rows × 69 columns

This post, like its predecessors, was adapted from a Jupyter Notebook, so feel free to fork my notebook on Kaggle or GitHub if you’d like to follow along.

Exploratory data analysis

There isn’t too much exploratory data analysis left to do after how thoroughly I cleaned the data in my first post, but I do have a few quick questions about the title and desc fields I’d like to answer before I move on.

  • How many loans use each field?
  • Have these fields always been included in the loan application?
  • What is the typical length of each field (in number of words)?
nlp_cols = ["title", "desc"]

loans[nlp_cols].describe()
                     title   desc
count              1097288  71967
unique               35863  70927
top     Debt consolidation
freq                573992     23

If the most frequent desc value is empty (or maybe just whitespace), perhaps I need to convert all empty or whitespace-only values to NaN before continuing.

import re
import numpy as np

replace_empties = lambda x: x if re.search(r"\S", x) else np.nan

for col in nlp_cols:
    loans[col] = loans[col].map(replace_empties, na_action="ignore")

description = loans[nlp_cols].describe()
description
                     title                                              desc
count              1097288                                             71943
unique               35863                                             70925
top     Debt consolidation  Borrower added on 03/17/14 > Debt consolidat...
freq                573992                                                 9

Thankfully that didn’t remove too many values, but this “Borrower added on [date]” deal worries me now. I’ll deal with that a little later.

for col in nlp_cols:
    percentage = int(description.at["count", col] / num_loans * 100)
    print(f"`{col}` is used in {percentage}% of loans in the dataset.")

percentage = int(description.at["freq", "title"] / num_loans * 100)
print(f'The title "Debt consolidation" is used in {percentage}% of loans.')
`title` is used in 98% of loans in the dataset.
`desc` is used in 6% of loans in the dataset.
The title "Debt consolidation" is used in 51% of loans.

These fields may not be as useful as I had previously thought. Even though there are 35,863 unique titles used across the dataset, 51% of loans just use “Debt consolidation”. Maybe the titles are more descriptive in the other 49%?

And the desc field is only used with 6% of loans.
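
As for the third question on my list, the typical field length, here’s a rough way to check it (a quick sketch of my own; splitting on whitespace only approximates word counts):

# Approximate word counts for each text field by splitting on whitespace.
# (spaCy tokenization would be more precise, but this is close enough here.)
word_counts = loans[nlp_cols].apply(lambda col: col.str.split().str.len())
print(word_counts.describe())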

Now to check and see when these fields were introduced.

# `issue_d` is just the month and year the loan was issued, by the way.
loans["issue_d"] = loans["issue_d"].astype("datetime64[ns]")

print("Total date range:")
print(loans["issue_d"].agg(["min", "max"]))
print("\n`title` date range:")
print(loans[["title", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))
print("\n`desc` date range:")
print(loans[["desc", "issue_d"]].dropna(axis="index")["issue_d"].agg(["min", "max"]))
Total date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`title` date range:
min   2012-08-01
max   2018-12-01
Name: issue_d, dtype: datetime64[ns]

`desc` date range:
min   2012-08-01
max   2016-07-01
Name: issue_d, dtype: datetime64[ns]

Neither of these fields was introduced late, but it appears LendingClub stopped collecting the desc field around mid-2016, roughly two and a half years before the end of the dataset.

Now I’ll take a closer look at values in these fields.

import pandas as pd

with pd.option_context("display.min_rows", 50):
    print(loans["title"].value_counts())
Debt consolidation                       573992
Credit card refinancing                  214423
Home improvement                          64028
Other                                     56166
Major purchase                            20734
Medical expenses                          11454
Debt Consolidation                        10638
Business                                  10142
Car financing                              9660
Moving and relocation                      6806
Vacation                                   6707
Home buying                                5097
Consolidation                              4069
debt consolidation                         3310
Credit Card Consolidation                  1607
consolidation                              1538
Debt Consolidation Loan                    1265
Consolidation Loan                         1260
Personal Loan                              1040
Credit Card Refinance                      1020
Home Improvement                           1016
Credit Card Payoff                          991
Consolidate                                 947
Green loan                                  626
Loan                                        621
                                          ...
House Buying Consolidation                    1
Credit Card Deby                              1
Crdit cards                                   1
"CCC"                                         1
Loan to Moving & Relocation Expense           1
BILL PAYMENT                                  1
creit card pay off                            1
Auto Repair & Debt Consolidation              1
BMW 2004                                      1
Moving Expenses - STL to PHX                  1
 Pay off Bills                                1
Room addition                                 1
Optimistic                                    1
Consolid_loan2                                1
ASSISTANCE NEEDED                             1
My bail out                                   1
myfirstloan                                   1
second home                                   1
Just consolidating credit cards               1
Financially Sound Loan                        1
refinance loans and home improvements         1
credit cart refincition                       1
Managable Repayment Plan                      1
ccdebit                                       1
Project Pay Off Debt                          1
Name: title, Length: 35863, dtype: int64

Interesting. It seems like there’s plenty of variety in loan titles in the other 49%. A lot of them seem to directly correspond to the purpose categorical field, but not so many as to make this field useless, I think.

Side note: I discovered at one point when perusing this column that someone entered the Konami Code as the title of their loan application, and their inclusion in this dataset means that the code apparently worked for them—they got the loan.

loans[loans["title"] == "up up down down left right left right ba"][
    ["loan_amnt", "title", "issue_d"]
]
         loan_amnt                                      title    issue_d
1856340    12000.0  up up down down left right left right ba 2013-04-01
loans["desc"].value_counts()
  Borrower added on 03/17/14 > Debt consolidation<br>                                                                                                                                                                                                                                        9
  Borrower added on 01/15/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 02/19/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 03/10/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
  Borrower added on 01/29/14 > Debt consolidation<br>                                                                                                                                                                                                                                        7
                                                                                                                                                                                                                                                                                            ..
  Borrower added on 01/14/13 > Credit Card consolidation<br>                                                                                                                                                                                                                                 1
  Borrower added on 03/14/14 > Debts consolidation and cash for minor improvements on condominium<br>                                                                                                                                                                                        1
  Borrower added on 03/02/14 > I lost a house and need to pay taxes nd have credit card debt thatI already pay $350 a month on and it goes nowhere.<br>                                                                                                                                      1
  Borrower added on 04/09/13 > I want to put in a conscious effort in eliminating my debt by converting high interest cards to a fixed payment that can be effectively managed by me.<br>                                                                                                    1
  Borrower added on 09/18/12 > Want to become debt free, because of several circumstances and going back to school I got into debt. I want to pay for what I have purchased without it having an effect on my credit. That is why I want to consolidate my debt and become debt free!<br>    1
Name: desc, Length: 70925, dtype: int64

Do all these descriptions start with “Borrower added on [date]”?

pattern = r"^\s*Borrower added on \d\d/\d\d/\d\d > "
prefix_count = (
    loans["desc"]
    .map(lambda x: True if re.search(pattern, x, re.I) else None, na_action="ignore")
    .count()
)
print(
    f"{prefix_count:,} loan descriptions begin with that pattern.",
    f"({description.loc['count', 'desc'] - prefix_count:,} do not.)",
)
71,858 loan descriptions begin with that pattern. (85 do not.)

Well now I need to check those other 85.

other_desc_map = loans["desc"].map(
    lambda x: False if pd.isna(x) or re.search(pattern, x, re.I) else True
)
other_descs = loans["desc"][other_desc_map]
other_descs.value_counts()
Debt Consolidation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 2
I would like to pay off 3 different credit cards, at 12%, 17% and 22% (after initial 0% period is up).  It would be great to have everything under one loan, making it easier to pay off.  Also, once I've paid off or down the loan, I can start looking into buying a house.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     1
loan will be used for paying off credit card.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1
This loan will be used to consolidate high interest credit card debt.    Over the course of this past year my wife and I had our first child, purchase a home and received a large bonus from work.  With the new home and the child on the way I chose to spread my tax withholdings on the bonus to all checks received in 2008 this caused my monthly income to fall by $1500.  This in combination with an unexpected additional down payment for our home of $17,000 with only a weeks notice we were force to dip into our Credit Cards for the past several months.    Starting January 1, 2009 I will be able to readjust my tax withholding and start to pay off the Credit Card debt we have racked up.  This loan will help lower the interest rate during the repayment period and give one central place for payment.  My wife and I have not missed a payment or been late for the past 5 years.  My fico score is 670 mainly due to several low limit credit cards near their max.  I manage the international devision of a software company and my wife is a kindergarten teacher, combined we make 140K a year.    Thank you for your consideration and I look forward to working with you.      1
to pay off different credit cards to consolidate my debt, so I can have just one monthly payment.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ..
Hello, I would like to consolidate my debt into a lower more convenient payment. I have a very stable career of more than 20 years with the same company. My community is in a part of the country that made it through the last few years basically unscathed and has a very promising future.<br>Thank You<br>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1
consolidate my debt                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1
I am looking to pay off my credit card debts.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1
This loan is to help me payoff my credit card debt. I've done what I can to negotiate lower rates, but the interest is killing me and my monthly payments are basically just taking care of interest. Paying them off will give me the fresh start I need on my way to financial independence. Thank you.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1
I have been in business for a year and want to eliminate some personal debt and use the remainder of the loan to take care of business expenses. Also lessening the number of trade lines I have open puts me in a better position to pursue business loans since it will  be based on my personal credit. A detailed report can be created to show where exactly the funds will go and this can be provided at any time during the course of the loan.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1
Name: desc, Length: 84, dtype: int64

It looks like the borrower may be able to add information to the description at different points in time. I should check and see if any of those dates come after the actual issue date of the loan.

from datetime import datetime, date

for row in loans[["desc", "issue_d"]].itertuples():
    if not pd.isna(row.desc):
        month_after_issue = date(
            day=row.issue_d.day,
            month=row.issue_d.month % 12 + 1,
            year=row.issue_d.year + row.issue_d.month // 12,
        )

        date_strings = re.findall(r"\d\d/\d\d/\d\d", row.desc)
        dates = []
        for string in date_strings:
            try:
                dates.append(datetime.strptime(string, "%m/%d/%y").date())
            except ValueError:
                continue

        for d in dates:
            if d >= month_after_issue:
                print(f"{row.issue_d}{row.desc}")
                break
2014-01-01 00:00:00 –   Borrower added on 01/08/14 > I am tired of making monthly payments and getting nowhere.  With your help, except for my mortgage, I intend to be completely debt free by 12/31/2016.<br>
2014-01-01 00:00:00 –   Borrower added on 01/08/14 > We have been engaged for  2 1/2yrs and wanted to bring out blended family together as one. We are set to get married on 03/22/14 and we are paying for it on our own. We saved the majority of the budget unfortunately there were a few unexpected cost that we still need help with.<br>
2014-01-01 00:00:00 –   Borrower added on 01/06/14 > I am getting married 04/05/2014 and I want to have a cushion for expenses just in case.<br>
2014-01-01 00:00:00 – BR called in to push payment date to 09/19/14 because of not having the exact amount of funds in their bank account.  Payment was processing. Was able to cancel. It is within grace period.
2014-01-01 00:00:00 –   Borrower added on 01/01/14 > This loan is to consolidate my credit cards debt. I made one year this past  11/28/2013 at my current job. I considered to have job security because I'm a good employee. I make all may credit cards payments on time.<br>
2013-05-01 00:00:00 –   Borrower added on 04/27/13 > My father passed away 05/12/2012 and I had to pay for the funeral.  My mother could not afford it.  He was not ill so I could not have planned it.  I paid with what I had in my savings and the rest I had to pay with my credit cards.  I would like to pay off the CC &amp; pay one monthly payment.<br><br>  Borrower added on 04/27/13 > My paerents own the house so I do not pay rent.    The utilities, insurance and taxes, etc my mother pays.  She can afford that.  I help when needed.<br>
2013-02-01 00:00:00 –   Borrower added on 02/10/13 > I am getting married in a week (02/17/2013) and have made some large purchases across my credit cards.  I would like to consolidate all of my debt with this low rate loan.<br><br> Borrower added on 02/10/13 > I will be getting married in a week (02/17/13) and have had to make some large purchases on my CC. I am financially sound otherwise with low debt obligations.<br>
2012-12-01 00:00:00 –   Borrower added on 12/10/12 > Approximately 1 year ago I had a highefficency furnace /AC installed.  The installing Co. used GECRB to get me a loan.  If I payoff the loan within one year, I pay no interest.  The interest rate if not payed by 12/23/2012 is 26.99%.  A 6.62% rate sounds a lot better.<br>
2012-11-01 00:00:00 –   Borrower added on 11/19/12 > Looking to finish off consolidating the rest of my bills and lower my payments on my exsisting loan. Thanks!!!<br><br>  Borrower added on 11/20/12 > Thanks again for everyone who has invested thus far. With this loan it will give me the ability to have only one payment monthly besides utilities and I will be almost debt free by my wedding date of 12/13/14!! Thanks again everyone!<br>
2012-10-01 00:00:00 –   Borrower added on 10/22/12 > Need money by 10/26/2012 to purchase property on discounted APR.<br>

Good, all of the dates that come after the month the loan is issued only come up because the borrower is talking about a future event.

Now to clean these desc values up a bit I’m going to remove the Borrower added on [date] >s and the <br>s, since those don’t add value to the description content.

def clean_desc(desc):
    if pd.isna(desc):
        return desc
    # Strip the "Borrower added on [date] >" prefixes and <br> tags.
    return re.sub(
        r"\s*Borrower added on \d\d/\d\d/\d\d > |<br>", " ", desc
    ).strip()


loans["desc"] = loans["desc"].map(clean_desc)

Imputing missing values

Since only 2% of loans in this set are missing a title, and since most titles simply copy the loan’s purpose, I’m going to impute missing titles with their loan’s purpose.

loans["title"].fillna(
    loans["purpose"].map(lambda x: x.replace("_", " ").capitalize()), inplace=True
)
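
And a quick sanity check (my addition, not in the original notebook) that the imputation covered everything:

# Every loan should have a title now; e.g. a missing title on a loan with
# purpose "debt_consolidation" becomes "Debt consolidation".
assert loans["title"].isna().sum() == 0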

Since only 6% of loans use a description, I’ll just impute missing descriptions with an empty string. I’m going to wait and include that as a pipeline step a little later, though.

Optimizing data types

I’d really love to get right to the fun part, converting these text fields into document vectors, but I ran into a problem the first several times I tried doing so. Manually adding two sets of 300-dimensional vectors to this 1,110,171-row DataFrame caused its size in memory to skyrocket, exhausting the 16GB Kaggle gives me.
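
Some back-of-envelope arithmetic (my own) shows why: two 300-dimensional float64 vectors per row add up fast.

# 2 text fields × 300 dimensions × 8 bytes (float64) per row:
extra_bytes = 1_110_171 * 2 * 300 * 8
print(f"~{extra_bytes / 1e9:.1f} GB of new columns")  # ~5.3 GB, before any intermediate copies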

My first attempt to fix this was optimizing my data types, which still didn’t solve the problem on its own, but it’s a worthwhile step to take anyway.

After removing the issue_d column, which is no longer needed, the dataset contains five types of data: float, integer, ordinal, (unordered) categorical, and text.

from pandas.api.types import CategoricalDtype


loans = loans.drop(columns=["issue_d"])

float_cols = ["annual_inc", "dti", "inv_mths_since_last_delinq",
    "inv_mths_since_last_record", "revol_util", "inv_mths_since_last_major_derog",
    "annual_inc_joint", "dti_joint", "bc_util", "inv_mo_sin_rcnt_rev_tl_op",
    "inv_mo_sin_rcnt_tl", "inv_mths_since_recent_bc", "inv_mths_since_recent_bc_dlq",
    "inv_mths_since_recent_inq", "inv_mths_since_recent_revol_delinq", "pct_tl_nvr_dlq",
    "percent_bc_gt_75", "fraction_recovered"]
int_cols = ["loan_amnt", "delinq_2yrs", "cr_hist_age_mths", "fico_range_low",
    "fico_range_high", "inq_last_6mths", "open_acc", "pub_rec", "revol_bal",
    "total_acc", "collections_12_mths_ex_med", "acc_now_delinq", "tot_coll_amt",
    "tot_cur_bal", "total_rev_hi_lim", "acc_open_past_24mths", "avg_cur_bal",
    "bc_open_to_buy", "chargeoff_within_12_mths", "delinq_amnt", "mo_sin_old_il_acct",
    "mo_sin_old_rev_tl_op", "mort_acc", "num_accts_ever_120_pd", "num_actv_bc_tl",
    "num_actv_rev_tl", "num_bc_sats", "num_bc_tl", "num_il_tl", "num_op_rev_tl",
    "num_rev_accts", "num_rev_tl_bal_gt_0", "num_sats", "num_tl_120dpd_2m",
    "num_tl_30dpd", "num_tl_90g_dpd_24m", "num_tl_op_past_12m", "pub_rec_bankruptcies",
    "tax_liens", "tot_hi_cred_lim", "total_bal_ex_mort", "total_bc_limit",
    "total_il_high_credit_limit"]
ordinal_cols = ["emp_length"]
category_cols = ["term", "home_ownership", "purpose", "application_type"]
text_cols = nlp_cols

size_metrics = pd.DataFrame(
    {
        "previous_dtype": loans.dtypes,
        "previous_size": loans.memory_usage(index=False, deep=True),
    }
)
previous_size = loans.memory_usage(deep=True).sum()


for col_name in float_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="float")

for col_name in int_cols:
    loans[col_name] = pd.to_numeric(loans[col_name], downcast="unsigned")

emp_length_categories = ["< 1 year", "1 year", "2 years", "3 years", "4 years",
    "5 years", "6 years", "7 years", "8 years", "9 years", "10+ years"]
emp_length_type = CategoricalDtype(categories=emp_length_categories, ordered=True)
loans["emp_length"] = loans["emp_length"].astype(emp_length_type)

for col_name in category_cols:
    loans[col_name] = loans[col_name].astype("category")


current_size = loans.memory_usage(deep=True).sum()
reduction = (previous_size - current_size) / previous_size
print(f"Reduced DataFrame size in memory by {int(reduction * 100)}%.")

size_metrics["current_dtype"] = loans.dtypes
size_metrics["current_size"] = loans.memory_usage(index=False, deep=True)
pd.options.display.max_rows = 100
size_metrics
Reduced DataFrame size in memory by 69%.
                                   previous_dtype  previous_size current_dtype  current_size
loan_amnt                                 float64        8881368        uint16       2220342
term                                       object       73271286      category       1110383
emp_length                                 object       71853397      category       1111197
home_ownership                             object       69841784      category       1110700
annual_inc                                float64        8881368       float32       4440684
purpose                                    object       79927721      category       1111750
dti                                       float64        8881368       float32       4440684
delinq_2yrs                               float64        8881368         uint8       1110171
cr_hist_age_mths                            int64        8881368        uint16       2220342
fico_range_low                            float64        8881368        uint16       2220342
fico_range_high                           float64        8881368        uint16       2220342
inq_last_6mths                            float64        8881368         uint8       1110171
inv_mths_since_last_delinq                float64        8881368       float32       4440684
inv_mths_since_last_record                float64        8881368       float32       4440684
open_acc                                  float64        8881368         uint8       1110171
pub_rec                                   float64        8881368         uint8       1110171
revol_bal                                 float64        8881368        uint32       4440684
revol_util                                float64        8881368       float32       4440684
total_acc                                 float64        8881368         uint8       1110171
collections_12_mths_ex_med                float64        8881368         uint8       1110171
inv_mths_since_last_major_derog           float64        8881368       float32       4440684
application_type                           object       74360578      category       1110384
annual_inc_joint                          float64        8881368       float32       4440684
dti_joint                                 float64        8881368       float32       4440684
acc_now_delinq                            float64        8881368         uint8       1110171
tot_coll_amt                              float64        8881368        uint32       4440684
tot_cur_bal                               float64        8881368        uint32       4440684
total_rev_hi_lim                          float64        8881368        uint32       4440684
acc_open_past_24mths                      float64        8881368         uint8       1110171
avg_cur_bal                               float64        8881368        uint32       4440684
bc_open_to_buy                            float64        8881368        uint32       4440684
bc_util                                   float64        8881368       float32       4440684
chargeoff_within_12_mths                  float64        8881368         uint8       1110171
delinq_amnt                               float64        8881368        uint32       4440684
mo_sin_old_il_acct                        float64        8881368        uint16       2220342
mo_sin_old_rev_tl_op                      float64        8881368        uint16       2220342
inv_mo_sin_rcnt_rev_tl_op                 float64        8881368       float32       4440684
inv_mo_sin_rcnt_tl                        float64        8881368       float32       4440684
mort_acc                                  float64        8881368         uint8       1110171
inv_mths_since_recent_bc                  float64        8881368       float32       4440684
inv_mths_since_recent_bc_dlq              float64        8881368       float32       4440684
inv_mths_since_recent_inq                 float64        8881368       float32       4440684
inv_mths_since_recent_revol_delinq        float64        8881368       float32       4440684
num_accts_ever_120_pd                     float64        8881368         uint8       1110171
num_actv_bc_tl                            float64        8881368         uint8       1110171
num_actv_rev_tl                           float64        8881368         uint8       1110171
num_bc_sats                               float64        8881368         uint8       1110171
num_bc_tl                                 float64        8881368         uint8       1110171
num_il_tl                                 float64        8881368         uint8       1110171
num_op_rev_tl                             float64        8881368         uint8       1110171
num_rev_accts                             float64        8881368         uint8       1110171
num_rev_tl_bal_gt_0                       float64        8881368         uint8       1110171
num_sats                                  float64        8881368         uint8       1110171
num_tl_120dpd_2m                          float64        8881368         uint8       1110171
num_tl_30dpd                              float64        8881368         uint8       1110171
num_tl_90g_dpd_24m                        float64        8881368         uint8       1110171
num_tl_op_past_12m                        float64        8881368         uint8       1110171
pct_tl_nvr_dlq                            float64        8881368       float32       4440684
percent_bc_gt_75                          float64        8881368       float32       4440684
pub_rec_bankruptcies                      float64        8881368         uint8       1110171
tax_liens                                 float64        8881368         uint8       1110171
tot_hi_cred_lim                           float64        8881368        uint32       4440684
total_bal_ex_mort                         float64        8881368        uint32       4440684
total_bc_limit                            float64        8881368        uint32       4440684
total_il_high_credit_limit                float64        8881368        uint32       4440684
fraction_recovered                        float64        8881368       float32       4440684
title                                      object       82840461        object      82840461
desc                                       object       46918516        object      46918516

Creating document vectors

Now the fun part. Wrapping my spaCy document vector function in a scikit-learn FunctionTransformer turned out to be the secret that kept this process within memory limits. Scikit-learn must just be way better optimized than whatever manual process I was using (go figure).

import spacy
from sklearn.preprocessing import FunctionTransformer


def get_doc_vectors(X):
    n_cols = X.shape[1]
    nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

    result = []
    for row in X:
        result_row = []
        for i in range(n_cols):
            result_row.append(nlp(row[i]).vector)

        result.append(np.concatenate(result_row))

    return np.array(result)


vectorizer = FunctionTransformer(get_doc_vectors)
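
As a quick smoke test (my addition), two text columns times en_core_web_lg’s 300-dimensional vectors should yield 600 features per row:

# Smoke test on a few rows: expect shape (3, 600), i.e. 2 columns × 300 dims.
# fillna("") mimics the imputation step that comes later in the pipeline,
# and note that get_doc_vectors loads the spaCy model on each call.
sample = loans[nlp_cols].head(3).fillna("").to_numpy()
print(vectorizer.fit_transform(sample).shape)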

Building the pipeline

First, the transformer. I’ll use scikit-learn’s ColumnTransformer to apply different transformations to different kinds of data.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from pathlib import Path


def generate_cat_encoder(col_name):
    categories = list(loans[col_name].cat.categories)
    if loans[col_name].cat.ordered:
        return (
            col_name,
            OrdinalEncoder(categories=[categories], dtype=np.uint8),
            [col_name],
        )
    else:
        return (
            col_name,
            OneHotEncoder(categories=[categories], drop="if_binary", dtype=np.bool_),
            [col_name],
        )


Path("../tmp/transformer_cache").mkdir(parents=True, exist_ok=True)
transformer = ColumnTransformer(
    [
        (
            "nlp_cols",
            Pipeline(
                [
                    (
                        "nlp_imputer",
                        SimpleImputer(strategy="constant", fill_value=""),
                    ),
                    ("nlp_vectorizer", vectorizer),
                    ("nlp_scaler", StandardScaler(with_mean=False)),
                ],
                verbose=True,
            ),
            make_column_selector("^(title|desc)$"),
        ),
    ]
    + [generate_cat_encoder(col_name) for col_name in ordinal_cols + category_cols],
    remainder=StandardScaler(),
    verbose=True,
)

The model itself will be identical to my previous model, but I’ll use Keras callbacks and a tqdm progress bar to make the training logs much more concise.

import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from tqdm import tqdm

np.random.seed(0)
tf.random.set_seed(1)


class ProgressBar(tf.keras.callbacks.Callback):
    def __init__(self, epochs=100):
        self.epochs = epochs

    def on_train_begin(self, logs=None):
        self.progress_bar = tqdm(desc="Training model", total=self.epochs, unit="epoch")

    def on_epoch_end(self, epoch, logs=None):
        self.progress_bar.update()

    def on_train_end(self, logs=None):
        self.progress_bar.close()


class FinalMetrics(tf.keras.callbacks.Callback):
    def on_train_end(self, logs=None):
        metrics_msg = "Final metrics:"
        for metric, value in logs.items():
            metrics_msg += f" {metric}: {value:.5f} -"
        metrics_msg = metrics_msg[:-2]
        print(metrics_msg)


def run_pipeline(X, y, transformer, validate=True):
    X_train, X_val, y_train, y_val = (
        train_test_split(X, y, test_size=0.2, random_state=2)
        if validate
        else (X, None, y, None)
    )

    X_train_t = transformer.fit_transform(X_train)
    X_val_t = transformer.transform(X_val) if validate else None

    model = Sequential()
    model.add(Input((X_train_t.shape[1],)))
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(16, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(1))
    model.compile(optimizer="adam", loss="mean_squared_logarithmic_error")

    history = model.fit(
        X_train_t,
        y_train,
        validation_data=(X_val_t, y_val) if validate else None,
        batch_size=128,
        epochs=100,
        verbose=0,
        callbacks=[ProgressBar(), FinalMetrics()],
    )

    return history.history, model, transformer

Evaluating the model

import dill

history_1, _, _ = run_pipeline(
    loans.drop(columns="fraction_recovered").copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_1.pkl")
/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.4s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total= 1.2min
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   8.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total= 1.3min
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.3s


Training model: 100%|██████████| 100/100 [23:41<00:00, 14.22s/epoch]


Final metrics: loss: 0.02365 - val_loss: 0.02360
# Restore save point if needed
import dill

try:
    history_1
except NameError:
    dill.load_session("save_points/model_1.pkl")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


def plot_loss_metrics(history, model_num=None):
    for metric, values in history.items():
        sns.lineplot(x=range(len(values)), y=values, label=metric)
    plt.xlabel("epoch")
    plt.title(
        f"Model {f'{model_num} ' if model_num else ''}loss metrics during training"
    )
    plt.show()


plot_loss_metrics(history_1, "1")
[Plot: “Model 1 loss metrics during training”. Separate lines plot training and validation loss on the y-axis across the 100 epochs of training on the x-axis. Training loss falls rapidly and fairly smoothly, with another small but interesting drop around the 40th epoch. The validation loss line, while very jagged, follows the same trend on average, indicating that the dropout layers in the neural network were sufficient to prevent overfitting.]

Well, it didn’t overfit, but this model performed a bit worse than my original, which had settled around a loss of 0.0231. I bet the desc feature is getting in the way: zeroes spanning 300 columns of the input data in 94% of the rows are probably quite confusing to the model. I’ll see what happens if I repeat the process while leaving desc out (making the title vectors the only new feature of this model compared to my original).
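
Before rerunning, a quick confirmation of that premise (my own check, not in the original notebook):

# Share of loans with no description; these rows become all-zero vectors
# after the pipeline imputes them with the empty string.
print(f"{loans['desc'].isna().mean():.0%} of loans have no description")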

history_2, _, _ = run_pipeline(
    loans.drop(columns=["fraction_recovered", "desc"]).copy(),
    loans["fraction_recovered"],
    transformer,
)

Path("save_points").mkdir(exist_ok=True)
dill.dump_session("save_points/model_2.pkl")
/opt/conda/lib/python3.7/site-packages/pandas/core/strings.py:2001: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
  return func(self, *args, **kwargs)


[Pipeline] ....... (step 1 of 3) Processing nlp_imputer, total=   0.1s
[Pipeline] .... (step 2 of 3) Processing nlp_vectorizer, total=  41.3s
[Pipeline] ........ (step 3 of 3) Processing nlp_scaler, total=   4.6s
[ColumnTransformer] ...... (1 of 7) Processing nlp_cols, total=  45.9s
[ColumnTransformer] .... (2 of 7) Processing emp_length, total=   0.2s
[ColumnTransformer] .......... (3 of 7) Processing term, total=   0.3s
[ColumnTransformer]  (4 of 7) Processing home_ownership, total=   0.3s
[ColumnTransformer] ....... (5 of 7) Processing purpose, total=   0.3s
[ColumnTransformer]  (6 of 7) Processing application_type, total=   0.3s
[ColumnTransformer] ..... (7 of 7) Processing remainder, total=   1.1s


Training model: 100%|██████████| 100/100 [22:26<00:00, 13.46s/epoch]


Final metrics: loss: 0.02396 - val_loss: 0.02451
# Restore save point if needed
import dill

try:
    history_2
except NameError:
    dill.load_session("save_points/model_2.pkl")
plot_loss_metrics(history_2, "2")
[Plot: “Model 2 loss metrics during training”. Separate lines plot training and validation loss on the y-axis across the 100 epochs of training on the x-axis. The validation loss line is even more chaotic this time than in the model 1 plot, but still doesn’t appear to indicate overfitting.]

Wow, still not good enough to beat my original model. Just for kicks, I also tried additional runs where I trained for 1,000 epochs, others where I increased the number of nodes in the first two dense layers to 128 and 64, and others where I decreased the batch size to 64, but none of these beat my original model either. I suppose these text features have no predictive power when it comes to loan outcomes. Interesting.

Next steps

If adding these two features decreased predictive capability, then perhaps some of the other variables I was already using are doing the same thing. I should try using some of scikit-learn’s feature selection methods to reduce the dimensionality of the input data.
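
For example (a sketch of one possible approach, which I haven’t run), scikit-learn’s SelectKBest could rank the numeric features against the target:

from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical first pass: score each numeric feature against the target and
# keep the 40 strongest. k=40 and f_regression are arbitrary choices here,
# and this assumes the numeric columns are NaN-free after earlier imputation.
numeric_cols = [col for col in float_cols if col != "fraction_recovered"] + int_cols
selector = SelectKBest(score_func=f_regression, k=40)
X_top = selector.fit_transform(loans[numeric_cols], loans["fraction_recovered"])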

A more efficient method of hyperparameter optimization would be pretty useful as well. I should give AutoKeras a shot.
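
That might look something like this (an untested sketch; I’m assuming AutoKeras’s StructuredDataRegressor can ingest the mixed numeric and categorical columns directly):

import autokeras as ak

# AutoKeras searches model architectures and hyperparameters automatically;
# max_trials caps how many candidate models it tries.
reg = ak.StructuredDataRegressor(max_trials=10, overwrite=True)
reg.fit(
    loans.drop(columns="fraction_recovered"),
    loans["fraction_recovered"],
    epochs=100,
)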


Well that was fun! Have any thoughts on how to better integrate language data into the model? I’d love to hear them on Mastodon or LinkedIn.


Found an error or typo in this post you’d like to fix? Send me a pull request on GitHub!

Want to publish this article on your blog, in your magazine, or anywhere else? This post, like most of the content on my website, is licensed under a Creative Commons Attribution license, so you’re welcome to share it wherever and however you please, as long as you cite me as the author. I’d also enjoy hearing from you if you do publish this somewhere, but that’s totally up to you.