Tf-Idf Ridge Model Selection using Pipelines in Sklearn

Creating a pipeline to tune tf-idf and ridge regularization parameters and select the best model for text-based predictions.

I am going to dabble a bit in text mining in this post. The idea is very simple: we have a collection of documents (these could be emails, books or Craigslist ads) and we are trying to build a model that predicts something when given a new document of the same provenance. To make this more concrete we will look at two examples:

  • predicting the salary offer for a job based on the description of the job listing
  • predicting whether a text message is spam

Along the way I will also explore how to build pipelines in python using sklearn and how to use tf-idf to transform the documents into numeric matrices. I am pretty new to all of this myself (mostly writing this up so I don’t forget) so any suggestions and corrections are welcome!

Let’s load the required python modules:

import pandas as pd
import numpy as np

from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # in sklearn >= 0.18 this lives in sklearn.model_selection
from sklearn.preprocessing import LabelEncoder 


from matplotlib import pyplot as plt
%matplotlib inline

Let’s start with the salary listings. We are going to try to build a model that predicts the salary offer for a job based on the description of the job listing.

train = pd.read_csv("https://raw.githubusercontent.com/ajschumacher/gadsdata/master/salary/train.csv")
y = train.SalaryNormalized
train.head(3)
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName
0 12612628 Engineering Systems Analyst Engineering Systems Analyst Dorking Surrey Sal... Dorking, Surrey, Surrey Dorking NaN permanent Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K 25000 cv-library.co.uk
1 12612830 Stress Engineer Glasgow Stress Engineer Glasgow Salary **** to **** We... Glasgow, Scotland, Scotland Glasgow NaN permanent Gregory Martin International Engineering Jobs 25000 - 35000/annum 25-35K 30000 cv-library.co.uk
2 12612844 Modelling and simulation analyst Mathematical Modeller / Simulation Analyst / O... Hampshire, South East, South East Hampshire NaN permanent Gregory Martin International Engineering Jobs 20000 - 40000/annum 20-40K 30000 cv-library.co.uk

Let’s take a closer look at what a posting looks like:

train.FullDescription[1]
'Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****'

We will use just the description and build a pipeline to predict the normalized salary. This is quite easy to do in sklearn. Basically we will create a bag of words and then scale the columns using tf-idf: the tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by how frequent the word is across the corpus, which adjusts for the fact that some words appear more frequently in general. Then we will fit a regularized linear model to the data. Regularization is key here, since with bi-grams we’ll end up with over 400k features and only 10k training examples.
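To get a feel for what the tf-idf step does on its own, here is a tiny toy example (the three “documents” are made up purely for illustration). With sklearn’s defaults the idf of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term, and each row is then l2-normalized:

toy_docs = ["data scientist london", "stress engineer glasgow", "data engineer london"]
toy_tf_idf = TfidfVectorizer()
toy_X = toy_tf_idf.fit_transform(toy_docs)   # sparse 3 x 6 tf-idf matrix
print(toy_tf_idf.get_feature_names())        # the learned vocabulary
print(toy_X.toarray().round(2))              # "scientist" (rare) gets a larger weight than "london" (common)

With that intuition in place, here is the actual pipeline: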

estimators = [("tf_idf", TfidfVectorizer()), 
              ("ridge", linear_model.Ridge())]
model = Pipeline(estimators)

So we just plug in the raw descriptions; the tf_idf step transforms them into a matrix, which the ridge model is then fitted on.

\(\text{Description} \xrightarrow{\ \text{tf-idf}\ } X,\quad (X, y) \xrightarrow{\ \text{ridge}\ } \text{model}\)

model.fit(train.FullDescription, y) 
Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])
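For intuition, fitting the pipeline is roughly equivalent to running the two steps by hand (just a sketch of what happens internally; we’ll keep using the pipeline version):

tf_idf = TfidfVectorizer()
X = tf_idf.fit_transform(train.FullDescription)   # raw descriptions -> sparse tf-idf matrix
ridge = linear_model.Ridge()
ridge.fit(X, y)                                   # fit the regularized linear model on X and y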

Now both the tf_idf transform and the ridge regression have tuning parameters, and the nice thing about the pipeline we just built is that we can tune all of them at once. The parameters are addressed as <step name>__<parameter name>, using the step names we chose above:

params = {"ridge__alpha":[0.1, 0.3, 1, 3, 10], #regularization param
          "tf_idf__min_df": [1, 3, 10], #min count of words allowed
          "tf_idf__ngram_range": [(1,1), (1,2)], #1-grams or 2-grams
          "tf_idf__stop_words": [None, "english"]} #use stopwords or don't

How many different models must we run? Well, since we’re doing a grid search we can just multiply the possibilities for each parameter: 5 * 3 * 2 * 2 = 60 parameter combinations, and with the default 3-fold cross-validation that comes to 180 fits (plus one final refit on the full training set) - a decent number. And keep in mind that for each fit the tf_idf vectorizer has to be built all over again.

grid = GridSearchCV(estimator=model, param_grid=params,
                    scoring="mean_squared_error")  # the scorer reports negative MSE (higher is better)

grid.fit(train.FullDescription, y)
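As an aside, those 180 fits run on a single core by default; GridSearchCV accepts an n_jobs argument if you want to parallelize them (an optional tweak, not used in the run shown here):

grid = GridSearchCV(estimator=model, param_grid=params,
                    scoring="mean_squared_error", n_jobs=-1)   # n_jobs=-1 uses all available cores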

grid.best_params_
{'ridge__alpha': 0.3,
 'tf_idf__min_df': 1,
 'tf_idf__ngram_range': (1, 2),
 'tf_idf__stop_words': 'english'}
np.sqrt(-grid.best_score_)  # flip the sign of the negative MSE and take the square root to get an RMSE
10532.473521325306

We can also look at the results for all the parameter combinations:

params = pd.DataFrame([i[0] for i in grid.grid_scores_])   # the parameter settings of each grid point
results = pd.DataFrame(grid.grid_scores_)                  # (parameters, mean_validation_score, cv_validation_scores)
results = pd.concat([params, results], axis=1)
results["rmse"] = np.sqrt(-results.mean_validation_score)  # convert the negative MSE back to an RMSE
results.head(3)
ridge__alpha tf_idf__min_df tf_idf__ngram_range tf_idf__stop_words parameters mean_validation_score cv_validation_scores rmse
0 0.1 1 (1, 1) None {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... -1.383986e+08 [-103831685.851, -141229157.862, -170145315.841] 11764.293270
1 0.1 1 (1, 1) english {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... -1.408870e+08 [-105929048.004, -144749023.148, -171993435.294] 11869.583228
2 0.1 1 (1, 2) None {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... -1.113026e+08 [-77620035.3972, -108499379.09, -147798481.11] 10550.004578

Examining the Best Model:

model = grid.best_estimator_

Every time we predict, the pipeline first applies the tf-idf transform (already fitted on the training set) and then runs the ridge regression model on the result.

model.predict(train.FullDescription)
array([ 25975.84531928,  32824.5058169 ,  32127.26976225, ...,
        50386.2916183 ,  50138.40072399,  27588.69246637])

One issue with using the pipeline is that we don’t see the little details that go into fitting the models.

What if we want to examine more closely what goes on in each step? Say, for example, I want to look at the coefficients of my linear regression. That’s also pretty straightforward using the named_steps attribute.

grid.best_estimator_.named_steps["ridge"].coef_
array([ -465.8824938 ,  1697.39286267,  1304.56896049, ...,  1416.89223231,
        -596.29992468,  -596.29992468])
ridge_model = model.named_steps["ridge"]
tf_idf_model = model.named_steps["tf_idf"]
coefficients = pd.DataFrame({"names":tf_idf_model.get_feature_names(),
                             "coef":ridge_model.coef_})

Let’s look at the tokens with the largest coefficients:

coefficients.sort_values("coef", ascending=False).head(10)
coef names
88432 51890.453609 consultant grade
235331 48488.999766 locum
399929 45052.453063 subsea
174963 43441.208259 global
235338 40651.232641 locum consultant
90843 40016.083870 contract
235682 38554.092136 london
211090 36076.259999 investment
121657 34280.854922 director
244094 33309.500667 manager

We see some of the usual suspects - such as london, consultant, director and manager. However, given how many features we have (over 400k), it’s hard to interpret these coefficients very accurately. Perhaps a lasso model with strong $l_1$ regularization would help with that, since it would reduce the number of non-zero coefficients.
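Here is a sketch of what that lasso variant could look like; the vectorizer settings reuse the best parameters found above, the alpha value is purely hypothetical and would itself need tuning, and I have not actually run this:

lasso_estimators = [("tf_idf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
                    ("lasso", linear_model.Lasso(alpha=1.0))]   # l1 penalty zeroes out most coefficients
lasso_model = Pipeline(lasso_estimators)
lasso_model.fit(train.FullDescription, y)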

Spam example:

Now let’s look at another classic text analysis problem - classifying whether an email (or text message) is spam or not. Let’s load up the data:

url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])
sms.head()
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

And let’s look at one example:

sms.iloc[12, 1]
'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18'

Ok this one is clearly spam :)

corp = sms.message
le = LabelEncoder()
y = le.fit_transform(sms.label)  # classes are sorted alphabetically, so ham -> 0 and spam -> 1

Notice that in this case we are predicting a class - ham vs. spam - so linear regression won’t cut it. Do we need to go through the whole process of building the pipeline again? Not really. The tf-idf part stays the same; we just need a classifier instead of a regressor, so we simply replace linear_model.Ridge() with linear_model.RidgeClassifier() (a ridge-regularized classifier; logistic regression would also work here).

estimators = [("tf_idf", TfidfVectorizer()), 
              ("ridge", linear_model.RidgeClassifier())]
model = Pipeline(estimators)
params = {"ridge__alpha":[0.1, 0.3, 1, 3, 10], #regularization param
          "tf_idf__min_df": [1, 3, 10], #min count of words allowed
          "tf_idf__ngram_range": [(1,1), (1,2)], #1-grams or 2-grams
          "tf_idf__stop_words": [None, "english"],#use stopwords or don't
          "tf_idf__use_idf":[True, False]}  #whether to scale columns or just leave normalized bag of words.
grid = GridSearchCV(estimator=model, param_grid = params)
grid.fit(corp, y)
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
        tol=0.001))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'ridge__alpha': [0.1, 0.3, 1, 3, 10], 'tf_idf__stop_words': [None, 'english'], 'tf_idf__min_df': [1, 3, 10], 'tf_idf__ngram_range': [(1, 1), (1, 2)], 'tf_idf__use_idf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
grid.best_params_
{'ridge__alpha': 0.1,
 'tf_idf__min_df': 1,
 'tf_idf__ngram_range': (1, 2),
 'tf_idf__stop_words': None,
 'tf_idf__use_idf': True}
params = pd.DataFrame([i[0] for i in grid.grid_scores_])
results = pd.DataFrame(grid.grid_scores_)
results = pd.concat([params, results], 1)
results.head(3)
ridge__alpha tf_idf__min_df tf_idf__ngram_range tf_idf__stop_words tf_idf__use_idf parameters mean_validation_score cv_validation_scores
0 0.1 1 (1, 1) None True {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... 0.981874 [0.982238966631, 0.980613893376, 0.982767905223]
1 0.1 1 (1, 1) None False {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... 0.982233 [0.983315392896, 0.981152396338, 0.982229402262]
2 0.1 1 (1, 1) english True {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... 0.980617 [0.980624327234, 0.980075390415, 0.981152396338]

Let’s look at the regularization parameter alpha. Remember that alpha works in the opposite direction to the C parameter of, say, logistic regression (roughly alpha ~ 1/C) - so the larger the alpha, the stronger the regularization.
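Concretely, ridge regression minimizes the penalized least squares objective

\(\min_w \; \lVert X w - y \rVert_2^2 + \alpha \lVert w \rVert_2^2\)

so the larger the alpha, the harder the coefficients are shrunk towards zero (RidgeClassifier does the same after encoding the two classes as \(\pm 1\)).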

results.groupby(["ridge__alpha"])["mean_validation_score"].aggregate([np.mean])
mean
ridge__alpha
0.1 0.981545
0.3 0.982434
1.0 0.981485
3.0 0.977267
10.0 0.954467

So we see that the best results are when alpha is small, around 0.1 - 0.3. If we make alpha too large we get a significant decrease in accuracy.

Note that in this case I also tuned whether the model should use idf or only tf. The best model does use idf, but let’s see how it looks across all the tuning settings:

results.groupby(["tf_idf__use_idf"])["mean_validation_score"].aggregate([np.mean, np.std])
mean std
tf_idf__use_idf
False 0.976038 0.009895
True 0.974841 0.014802

Hmm, interestingly, not using idf performs slightly better on average over the entire grid space we tried out. This might be because the SMS messages aren’t very long. Here’s a quote from the sklearn documentation on tf-idf: “While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. Very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.”
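If you wanted to follow up on that idea, TfidfVectorizer has a binary flag that clips the term counts to 0/1 before any weighting, so the binary occurrence variant is easy to try within the same grid. A sketch (rebuilding the parameter dict, since we overwrote the name params above, and not actually run here):

params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],
          "tf_idf__min_df": [1, 3, 10],
          "tf_idf__ngram_range": [(1, 1), (1, 2)],
          "tf_idf__stop_words": [None, "english"],
          "tf_idf__use_idf": [True, False],
          "tf_idf__binary": [True, False]}   # 0/1 occurrence markers instead of raw term counts
grid = GridSearchCV(estimator=model, param_grid=params)
grid.fit(corp, y)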