Tf-Idf Ridge Model Selection using Pipelines in Sklearn

Creating a pipeline to tune tf-idf and ridge regularization parameters and select the best model for text-based predictions.

I am going to dabble a bit in text mining in this post. The idea is very simple: we have a collection of documents (these could be emails, books or Craigslist ads) and we are trying to build a model that predicts something when given a new document of the same provenance. To make this more concrete we will look at two examples:

  • predicting the salary offer for a job based on the description of the job listing
  • predicting whether a text message is spam

Along the way I will also explore how to build pipelines in python using sklearn and how to use tf-idf to transform the documents into numeric matrices. I am pretty new to all of this myself (mostly writing this up so I don’t forget) so any suggestions and corrections are welcome!

Let’s load the required python modules:

import pandas as pd
import numpy as np

from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # in sklearn >= 0.18 this lives in sklearn.model_selection
from sklearn.preprocessing import LabelEncoder 


from matplotlib import pyplot as plt
%matplotlib inline

Let’s start with the salary listings. We are going to try to build a model that predicts the salary offer for a job based on the description of the job listing.

train = pd.read_csv("https://raw.githubusercontent.com/ajschumacher/gadsdata/master/salary/train.csv")
y = train.SalaryNormalized
train.head(3)
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName
0 12612628 Engineering Systems Analyst Engineering Systems Analyst Dorking Surrey Sal... Dorking, Surrey, Surrey Dorking NaN permanent Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K 25000 cv-library.co.uk
1 12612830 Stress Engineer Glasgow Stress Engineer Glasgow Salary **** to **** We... Glasgow, Scotland, Scotland Glasgow NaN permanent Gregory Martin International Engineering Jobs 25000 - 35000/annum 25-35K 30000 cv-library.co.uk
2 12612844 Modelling and simulation analyst Mathematical Modeller / Simulation Analyst / O... Hampshire, South East, South East Hampshire NaN permanent Gregory Martin International Engineering Jobs 20000 - 40000/annum 20-40K 30000 cv-library.co.uk

Let’s take a closer look at what a posting looks like:

train.FullDescription[1]
'Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****'

We will use just the description and build a pipeline to predict the normalized salary. This is quite easy to do in sklearn. Basically we will create a bag of words and then scale the columns using tf-idf: the tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by how frequent the word is across the corpus, which adjusts for the fact that some words appear more frequently in general. Then we will fit a regularized linear model to the data. Regularization is key here, since with bi-grams we’ll end up with over 400k features and only 10k training examples.
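To get a feel for what the tf-idf step does on its own, here is a tiny toy example (the three “documents” are made up purely for illustration). With sklearn’s defaults the idf of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and each row is then l2-normalized:

toy_docs = ["data scientist london", "stress engineer glasgow", "data engineer london"]
toy_tf_idf = TfidfVectorizer()
toy_X = toy_tf_idf.fit_transform(toy_docs)   # sparse 3 x 6 tf-idf matrix
print(toy_tf_idf.get_feature_names())        # the learned vocabulary
print(toy_X.toarray().round(2))              # "scientist" (rare) gets a larger weight than "london" (common)

With that intuition in place, here is the actual pipeline: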

estimators = [("tf_idf", TfidfVectorizer()), 
              ("ridge", linear_model.Ridge())]
model = Pipeline(estimators)

So we just plug in the raw descriptions; the tf_idf step transforms them into a matrix, which the ridge model is then fitted on.

\(\text{Description} \xrightarrow{\ \text{tf-idf}\ } X,\quad (X, y) \xrightarrow{\ \text{ridge}\ } \text{model}\)

model.fit(train.FullDescription, y) 
Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])
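For intuition, fitting the pipeline is roughly equivalent to running the two steps by hand (just a sketch of what happens internally; we’ll keep using the pipeline version):

tf_idf = TfidfVectorizer()
X = tf_idf.fit_transform(train.FullDescription)   # raw descriptions -> sparse tf-idf matrix
ridge = linear_model.Ridge()
ridge.fit(X, y)                                   # fit the regularized linear model on X and y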

Now both the tf_idf transform and the ridge regression have tuning parameters, and the nice thing about the pipeline we just built is that we can tune all of them at once. The parameters are addressed as <step name>__<parameter name>, using the step names we chose above:

params = {"ridge__alpha":[0.1, 0.3, 1, 3, 10], #regularization param
          "tf_idf__min_df": [1, 3, 10], #min count of words allowed
          "tf_idf__ngram_range": [(1,1), (1,2)], #1-grams or 2-grams
          "tf_idf__stop_words": [None, "english"]} #use stopwords or don't

How many different models must we run? Well, since we’re doing a grid search we can just multiply the possibilities for each parameter: 5 * 3 * 2 * 2 = 60 parameter combinations, and with the default 3-fold cross-validation that comes to 180 fits (plus one final refit on the full training set) - a decent number. And keep in mind that for each fit the tf_idf vectorizer has to be built all over again.

grid = GridSearchCV(estimator=model, param_grid=params,
                    scoring="mean_squared_error")  # the scorer reports negative MSE (higher is better)

grid.fit(train.FullDescription, y)
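As an aside, those 180 fits run on a single core by default; GridSearchCV accepts an n_jobs argument if you want to parallelize them (an optional tweak, not used in the run shown here):

grid = GridSearchCV(estimator=model, param_grid=params,
                    scoring="mean_squared_error", n_jobs=-1)   # n_jobs=-1 uses all available cores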

grid.best_params_
{'ridge__alpha': 0.3,
 'tf_idf__min_df': 1,
 'tf_idf__ngram_range': (1, 2),
 'tf_idf__stop_words': 'english'}
np.sqrt(-grid.best_score_)  # flip the sign of the negative MSE and take the square root to get an RMSE
10532.473521325306

We can also look at the results for all the parameter combinations:

params = pd.DataFrame([i[0] for i in grid.grid_scores_])   # the parameter settings of each grid point
results = pd.DataFrame(grid.grid_scores_)                  # (parameters, mean_validation_score, cv_validation_scores)
results = pd.concat([params, results], axis=1)
results["rmse"] = np.sqrt(-results.mean_validation_score)  # convert the negative MSE back to an RMSE
results.head(3)
ridge__alpha tf_idf__min_df tf_idf__ngram_range tf_idf__stop_words parameters mean_validation_score cv_validation_scores rmse
0 0.1 1 (1, 1) None {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... -1.383986e+08 [-103831685.851, -141229157.862, -170145315.841] 11764.293270
1 0.1 1 (1, 1) english {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... -1.408870e+08 [-105929048.004, -144749023.148, -171993435.294] 11869.583228
2 0.1 1 (1, 2) None {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... -1.113026e+08 [-77620035.3972, -108499379.09, -147798481.11] 10550.004578

Examining the Best Model:

model = grid.best_estimator_

Every time we predict, the pipeline first applies the tf-idf transform (already fitted on the training set) and then runs the ridge regression model on the result.

model.predict(train.FullDescription)
array([ 25975.84531928,  32824.5058169 ,  32127.26976225, ...,
        50386.2916183 ,  50138.40072399,  27588.69246637])

One issue with using the pipeline is that we don’t see the little details that go into fitting the models.

What if we want to examine more closely what goes on in each step? Say, for example, I want to look at the coefficients of my linear regression. That’s also pretty straightforward using the named_steps attribute.

grid.best_estimator_.named_steps["ridge"].coef_
array([ -465.8824938 ,  1697.39286267,  1304.56896049, ...,  1416.89223231,
        -596.29992468,  -596.29992468])
ridge_model = model.named_steps["ridge"]
tf_idf_model = model.named_steps["tf_idf"]
coefficients = pd.DataFrame({"names":tf_idf_model.get_feature_names(),
                             "coef":ridge_model.coef_})

Let’s look at the tokens with the largest coefficients:

coefficients.sort_values("coef", ascending=False).head(10)
coef names
88432 51890.453609 consultant grade
235331 48488.999766 locum
399929 45052.453063 subsea
174963 43441.208259 global
235338 40651.232641 locum consultant
90843 40016.083870 contract
235682 38554.092136 london
211090 36076.259999 investment
121657 34280.854922 director
244094 33309.500667 manager

We see some of the usual suspects - such as london, consultant, director and manager. However, given how many features we have (over 400k), it’s hard to interpret these coefficients very accurately. Perhaps a lasso model with strong $l_1$ regularization would help with that, since it would reduce the number of non-zero coefficients.
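Here is a sketch of what that lasso variant could look like; the vectorizer settings reuse the best parameters found above, the alpha value is purely hypothetical and would itself need tuning, and I have not actually run this:

lasso_estimators = [("tf_idf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
                    ("lasso", linear_model.Lasso(alpha=1.0))]   # l1 penalty zeroes out most coefficients
lasso_model = Pipeline(lasso_estimators)
lasso_model.fit(train.FullDescription, y)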

Spam example:

Now let’s look at another classic text analysis problem - classifying whether an email (or text message) is spam or not. Let’s load up the data:

url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])
sms.head()
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

And let’s look at one example:

sms.iloc[12, 1]
'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18'

Ok this one is clearly spam :)

corp = sms.message
le = LabelEncoder()
y = le.fit_transform(sms.label)  # classes are sorted alphabetically, so ham -> 0 and spam -> 1

Notice that in this case we are predicting a class - ham vs. spam - so linear regression won’t cut it. Do we need to go through the whole process of building the pipeline again? Not really. The tf-idf part stays the same; we just need a classifier instead of a regressor, so we simply replace linear_model.Ridge() with linear_model.RidgeClassifier() (a ridge-regularized classifier; logistic regression would also work here).

estimators = [("tf_idf", TfidfVectorizer()), 
              ("ridge", linear_model.RidgeClassifier())]
model = Pipeline(estimators)
params = {"ridge__alpha":[0.1, 0.3, 1, 3, 10], #regularization param
          "tf_idf__min_df": [1, 3, 10], #min count of words allowed
          "tf_idf__ngram_range": [(1,1), (1,2)], #1-grams or 2-grams
          "tf_idf__stop_words": [None, "english"],#use stopwords or don't
          "tf_idf__use_idf":[True, False]}  #whether to scale columns or just leave normalized bag of words.
grid = GridSearchCV(estimator=model, param_grid = params)
grid.fit(corp, y)
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
        tol=0.001))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'ridge__alpha': [0.1, 0.3, 1, 3, 10], 'tf_idf__stop_words': [None, 'english'], 'tf_idf__min_df': [1, 3, 10], 'tf_idf__ngram_range': [(1, 1), (1, 2)], 'tf_idf__use_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
grid.best_params_
{'ridge__alpha': 0.1,
 'tf_idf__min_df': 1,
 'tf_idf__ngram_range': (1, 2),
 'tf_idf__stop_words': None,
 'tf_idf__use_idf': True}
params = pd.DataFrame([i[0] for i in grid.grid_scores_])
results = pd.DataFrame(grid.grid_scores_)
results = pd.concat([params, results], 1)
results.head(3)
ridge__alpha tf_idf__min_df tf_idf__ngram_range tf_idf__stop_words tf_idf__use_idf parameters mean_validation_score cv_validation_scores
0 0.1 1 (1, 1) None True {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... 0.981874 [0.982238966631, 0.980613893376, 0.982767905223]
1 0.1 1 (1, 1) None False {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... 0.982233 [0.983315392896, 0.981152396338, 0.982229402262]
2 0.1 1 (1, 1) english True {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... 0.980617 [0.980624327234, 0.980075390415, 0.981152396338]

Let’s look at the regularization parameter alpha. Remember that alpha works in the opposite direction to the C parameter of, say, logistic regression (roughly alpha ~ 1/C) - so the larger the alpha, the stronger the regularization.
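Concretely, ridge regression minimizes the penalized least squares objective

\(\min_w \; \lVert X w - y \rVert_2^2 + \alpha \lVert w \rVert_2^2\)

so the larger the alpha, the harder the coefficients are shrunk towards zero (RidgeClassifier does the same after encoding the two classes as \(\pm 1\)).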

results.groupby(["ridge__alpha"])["mean_validation_score"].aggregate([np.mean])
mean
ridge__alpha
0.1 0.981545
0.3 0.982434
1.0 0.981485
3.0 0.977267
10.0 0.954467

So we see that the best results are when alpha is small, around 0.1 - 0.3. If we make alpha too large we get a significant decrease in accuracy.

Note that in this case I also tuned whether the model should use idf or only tf. The best model does use idf, but let’s see how it looks across all the tuning settings:

results.groupby(["tf_idf__use_idf"])["mean_validation_score"].aggregate([np.mean, np.std])
mean std
tf_idf__use_idf
False 0.976038 0.009895
True 0.974841 0.014802

Hmm, interestingly, not using idf performs slightly better on average over the entire grid space we tried out. This might be because the SMS messages aren’t very long. Here’s a quote from the sklearn documentation on tf-idf: “While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. Very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.”
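If you wanted to follow up on that idea, TfidfVectorizer has a binary flag that clips the term counts to 0/1 before any weighting, so the binary occurrence variant is easy to try within the same grid. A sketch (rebuilding the parameter dict, since we overwrote the name params above, and not actually run here):

params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],
          "tf_idf__min_df": [1, 3, 10],
          "tf_idf__ngram_range": [(1, 1), (1, 2)],
          "tf_idf__stop_words": [None, "english"],
          "tf_idf__use_idf": [True, False],
          "tf_idf__binary": [True, False]}   # 0/1 occurrence markers instead of raw term counts
grid = GridSearchCV(estimator=model, param_grid=params)
grid.fit(corp, y)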