### Creating a pipeline to tune tf-idf + ridge regularization parameters and select the best model for text-based predictions

I am going to dabble a bit into text mining in this post. The idea is very simple: we have a collection of documents (these could be emails, books or Craigslist ads) and we are trying to build a model that predicts something when given a new document of the same provenance. To make this more concrete, we will look at two examples:

- predicting the salary offer for a job based on the description of the job listing
- predicting whether a text message is spam.

Along the way I will also explore how to build pipelines in python using sklearn and how to use tf-idf to transform the documents into numeric matrices. I am pretty new to all of this myself (mostly writing this up so I don’t forget) so any suggestions and corrections are welcome!

Let’s load the required python modules:

```
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot as plt
%matplotlib inline
```

Let’s start with the salary listings: we will load the training data and use the job description to predict the salary offer.

```
train = pd.read_csv("https://raw.githubusercontent.com/ajschumacher/gadsdata/master/salary/train.csv")
```

```
y = train.SalaryNormalized
```

```
train.head(3)
```

| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12612628 | Engineering Systems Analyst | Engineering Systems Analyst Dorking Surrey Sal... | Dorking, Surrey, Surrey | Dorking | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 30000/annum 20-30K | 25000 | cv-library.co.uk |
| 1 | 12612830 | Stress Engineer Glasgow | Stress Engineer Glasgow Salary **** to **** We... | Glasgow, Scotland, Scotland | Glasgow | NaN | permanent | Gregory Martin International | Engineering Jobs | 25000 - 35000/annum 25-35K | 30000 | cv-library.co.uk |
| 2 | 12612844 | Modelling and simulation analyst | Mathematical Modeller / Simulation Analyst / O... | Hampshire, South East, South East | Hampshire | NaN | permanent | Gregory Martin International | Engineering Jobs | 20000 - 40000/annum 20-40K | 30000 | cv-library.co.uk |

Let’s take a closer look at what a posting looks like:

```
train.FullDescription[1]
```

```
'Stress Engineer Glasgow Salary **** to **** We re currently looking for talented engineers to join our growing Glasgow team at a variety of levels. The roles are ideally suited to high calibre engineering graduates with any level of appropriate experience, so that we can give you the opportunity to use your technical skills to provide high quality input to our aerospace projects, spanning both aerostructures and aeroengines. In return, you can expect good career opportunities and the chance for advancement and personal and professional development, support while you gain Chartership and some opportunities to possibly travel or work in other offices, in or outside of the UK. The Requirements You will need to have a good engineering degree that includes structural analysis (such as aeronautical, mechanical, automotive, civil) with some experience in a professional engineering environment relevant to (but not limited to) the aerospace sector. You will need to demonstrate experience in at least one or more of the following areas: Structural/stress analysis Composite stress analysis (any industry) Linear and nonlinear finite element analysis Fatigue and damage tolerance Structural dynamics Thermal analysis Aerostructures experience You will also be expected to demonstrate the following qualities: A strong desire to progress quickly to a position of leadership Professional approach Strong communication skills, written and verbal Commercial awareness Team working, being comfortable working in international teams and self managing PLEASE NOTE SECURITY CLEARANCE IS REQUIRED FOR THIS ROLE Stress Engineer Glasgow Salary **** to ****'
```

We will use only the description and build a pipeline to predict the normalized salary, which is quite easy in sklearn. First we create a bag of words, then we weight its columns using tf-idf. The tf-idf value of a word increases with the number of times it appears in a document, but is offset by how many documents in the corpus contain it, which adjusts for the fact that some words are simply more frequent in general. Then we fit a regularized linear model to the resulting matrix. Regularization is key here: once we include bi-grams we end up with over 400k features and only 10k training examples.
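To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on a made-up three-document corpus. It assumes the smoothed idf formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row; the corpus and the helper name `tfidf_row` are only for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the salary data).
docs = [
    "senior engineer london",
    "junior engineer glasgow",
    "senior manager london",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf (assumed to match scikit-learn's default): ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_row(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

In the second document, "junior" and "glasgow" appear in only one document, so they end up with a larger weight than "engineer", which appears in two.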

```
estimators = [("tf_idf", TfidfVectorizer()),
              ("ridge", linear_model.Ridge())]
model = Pipeline(estimators)
```

So we just plug in the raw descriptions: the tf_idf step transforms them into a matrix, and the ridge model is then fitted to that matrix.

```
model.fit(train.FullDescription, y)
```

```
Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...it_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001))])
```

Now both the tf_idf transform and the ridge regression have tuning parameters, and the nice thing about the pipeline we just built is that we can tune all of them at once. Each parameter is addressed as `<step name>__<parameter name>`:

```
params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],        # regularization strength
          "tf_idf__min_df": [1, 3, 10],                 # drop terms found in fewer documents than this
          "tf_idf__ngram_range": [(1, 1), (1, 2)],      # unigrams only, or unigrams + bigrams
          "tf_idf__stop_words": [None, "english"]}      # keep or remove English stop words
```

How many different models must we run? Since we’re doing a grid search, we just multiply the number of values for each parameter: `5*3*2*2`, for a total of 60 models - a decent number. Keep in mind that each model is refitted for every cross-validation fold, and that the tf_idf vectorizer has to be rebuilt every time.
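The count can be checked by enumerating the grid directly. This is a small sketch using only the standard library; the 3-fold figure is an assumption based on the three cross-validation scores reported per row in the results of this post.

```python
from itertools import product

# Same grid as in the post.
params = {
    "ridge__alpha": [0.1, 0.3, 1, 3, 10],
    "tf_idf__min_df": [1, 3, 10],
    "tf_idf__ngram_range": [(1, 1), (1, 2)],
    "tf_idf__stop_words": [None, "english"],
}

# Every combination of one value per parameter.
names = list(params)
combos = [dict(zip(names, values)) for values in product(*params.values())]

n_folds = 3  # assumption: 3-fold cross-validation
print(len(combos))            # number of parameter settings
print(len(combos) * n_folds)  # number of pipeline fits during the search
print(combos[0])              # the first setting the search would try
```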

```
grid = GridSearchCV(estimator=model, param_grid=params, scoring="mean_squared_error")
grid.fit(train.FullDescription, y)
```

```
grid.best_params_
```

```
{'ridge__alpha': 0.3,
'tf_idf__min_df': 1,
'tf_idf__ngram_range': (1, 2),
'tf_idf__stop_words': 'english'}
```

```
np.sqrt(-grid.best_score_)
```

```
10532.473521325306
```
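The sign flip is needed because the grid search always maximizes its score, so the mean squared error is stored negated. A tiny sketch of the conversion, with made-up salaries and predictions:

```python
import math

# Made-up true salaries and predictions, for illustration only.
y_true = [25000, 30000, 30000]
y_pred = [27000, 29000, 34000]

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
score = -mse              # what a "higher is better" scorer would report
rmse = math.sqrt(-score)  # flip the sign back, then take the root

print(score)
print(round(rmse, 2))     # in the same units as the salaries
```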

We can also look at the scores for every parameter combination:

```
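# one column per tuned parameter (first element of each grid score entry)
params = pd.DataFrame([i[0] for i in grid.grid_scores_])
# parameters, mean validation score and per-fold scores
results = pd.DataFrame(grid.grid_scores_)
results = pd.concat([params, results], axis=1)
# scores are negated MSE values, so flip the sign before taking the root
results["rmse"] = np.sqrt(-results.mean_validation_score)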
```

```
results.head(3)
```

| | ridge__alpha | tf_idf__min_df | tf_idf__ngram_range | tf_idf__stop_words | parameters | mean_validation_score | cv_validation_scores | rmse |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 1 | (1, 1) | None | {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... | -1.383986e+08 | [-103831685.851, -141229157.862, -170145315.841] | 11764.293270 |
| 1 | 0.1 | 1 | (1, 1) | english | {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... | -1.408870e+08 | [-105929048.004, -144749023.148, -171993435.294] | 11869.583228 |
| 2 | 0.1 | 1 | (1, 2) | None | {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... | -1.113026e+08 | [-77620035.3972, -108499379.09, -147798481.11] | 10550.004578 |
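A note for readers on a recent scikit-learn: the `sklearn.grid_search` module and the `grid_scores_` attribute used above have since been removed. The sketch below shows what I believe is the equivalent with `sklearn.model_selection`, the `"neg_mean_squared_error"` scorer and `cv_results_`. The corpus, the salaries and the reduced grid are made up so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus and salaries, only so the sketch runs end to end.
docs = ["senior engineer london", "junior engineer glasgow",
        "senior manager london", "junior analyst leeds",
        "director london finance", "graduate analyst glasgow"]
salaries = np.array([45000, 25000, 60000, 22000, 90000, 20000])

pipe = Pipeline([("tf_idf", TfidfVectorizer()), ("ridge", Ridge())])
small_grid = {"ridge__alpha": [0.1, 1.0], "tf_idf__use_idf": [True, False]}

search = GridSearchCV(pipe, small_grid, scoring="neg_mean_squared_error", cv=2)
search.fit(docs, salaries)

# cv_results_ is a dict of arrays with one entry per parameter combination.
results = pd.DataFrame(search.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])
print(results[["param_ridge__alpha", "param_tf_idf__use_idf", "rmse"]])
```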

### Examining the Best Model:

```
model = grid.best_estimator_
```

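Every time we call `predict`, the pipeline first runs the tf-idf step (already fitted on the training set) to transform the new documents and then feeds the result to the ridge regression model. Note that here we are predicting on the training data itself, so these predictions will look better than they would on unseen postings.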

```
model.predict(train.FullDescription)
```

```
array([ 25975.84531928, 32824.5058169 , 32127.26976225, ...,
50386.2916183 , 50138.40072399, 27588.69246637])
```

One issue with using the pipeline is that we don’t see the little details that go into fitting the models. What if we want to examine more closely what goes on in each step? Say, for example, I want to look at the coefficients of my linear regression. That’s also pretty straightforward using the `named_steps` attribute, which maps each step name to its fitted estimator.

```
grid.best_estimator_.named_steps["ridge"].coef_
```

```
array([ -465.8824938 , 1697.39286267, 1304.56896049, ..., 1416.89223231,
-596.29992468, -596.29992468])
```

```
ridge_model = model.named_steps["ridge"]
tf_idf_model = model.named_steps["tf_idf"]
```
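With the two fitted steps pulled out, it is easy to check that the pipeline's `predict` is nothing more than the vectorizer's `transform` followed by the ridge model's `predict`. A self-contained sketch on a made-up corpus (the documents and salaries are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Made-up training data, only so the sketch runs on its own.
docs = ["senior engineer london", "junior engineer glasgow",
        "senior manager london", "junior analyst leeds"]
salaries = [45000, 25000, 60000, 22000]

pipe = Pipeline([("tf_idf", TfidfVectorizer()), ("ridge", Ridge())])
pipe.fit(docs, salaries)

new_docs = ["senior analyst london"]

# Route 1: let the pipeline do everything.
via_pipeline = pipe.predict(new_docs)

# Route 2: run the two fitted steps by hand.
features = pipe.named_steps["tf_idf"].transform(new_docs)
by_hand = pipe.named_steps["ridge"].predict(features)

print(np.allclose(via_pipeline, by_hand))
```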

```
coefficients = pd.DataFrame({"names":tf_idf_model.get_feature_names(),
"coef":ridge_model.coef_})
```

Let’s look at the tokens with the largest coefficients:

```
coefficients.sort_values("coef", ascending=False).head(10)
```

| | coef | names |
|---|---|---|
| 88432 | 51890.453609 | consultant grade |
| 235331 | 48488.999766 | locum |
| 399929 | 45052.453063 | subsea |
| 174963 | 43441.208259 | global |
| 235338 | 40651.232641 | locum consultant |
| 90843 | 40016.083870 | contract |
| 235682 | 38554.092136 | london |
| 211090 | 36076.259999 | investment |
| 121657 | 34280.854922 | director |
| 244094 | 33309.500667 | manager |

We see some of the usual suspects, such as london, consultant, director and manager. However, given how many features we have (over 400k), it’s hard to interpret these coefficients very accurately. Perhaps a Lasso model with strong $\ell_1$ regularization might help with that, since it will reduce the number of non-zero coefficients.
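Why would an $\ell_1$ penalty zero out coefficients when ridge does not? A rough illustration, under the simplifying assumption of an orthonormal design matrix: in that case the ridge solution rescales each least-squares coefficient by `1 / (1 + alpha)`, while the lasso solution (for the objective `0.5 * ||y - Xw||^2 + lam * ||w||_1`) soft-thresholds it. The coefficients below are made up.

```python
import math


def ridge_shrink(w, alpha):
    """Ridge under an orthonormal design: every coefficient is scaled down."""
    return w / (1.0 + alpha)


def lasso_shrink(w, lam):
    """Lasso under an orthonormal design: soft-thresholding at lam."""
    magnitude = max(abs(w) - lam, 0.0)
    return 0.0 if magnitude == 0.0 else math.copysign(magnitude, w)


# Made-up least-squares coefficients.
ols = [5.0, -3.0, 0.8, -0.4, 0.1]

ridge = [ridge_shrink(w, 1.0) for w in ols]
lasso = [lasso_shrink(w, 1.0) for w in ols]

print(ridge)  # all coefficients shrink, none become exactly zero
print(lasso)  # small coefficients are set exactly to zero
print(sum(w != 0 for w in ridge), sum(w != 0 for w in lasso))
```

Real text features are far from orthonormal, so this only shows the mechanism, not what would happen on the salary data.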

### Spam example:

Now let’s look at another classical text analysis problem - classifying whether an email (or text message) is spam or not. Let’s load up the data:

```
url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
sms = pd.read_table(url, header=None, names=['label', 'message'])
```

```
sms.head()
```

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |

And let’s look at one example:

```
sms.iloc[12, 1]
```

```
'URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18'
```

Ok this one is clearly spam :)

```
corp = sms.message
le = LabelEncoder()
y = le.fit_transform(sms.label)
```
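For reference, `LabelEncoder` sorts the distinct labels and replaces each one with its position in that sorted list, so here ham becomes 0 and spam becomes 1. A plain-Python sketch of the same mapping, using the first five labels shown in the table above:

```python
# First five labels from the sms table.
labels = ["ham", "ham", "spam", "ham", "ham"]

classes = sorted(set(labels))                        # ['ham', 'spam']
to_int = {label: i for i, label in enumerate(classes)}
encoded = [to_int[label] for label in labels]

print(classes)
print(encoded)
```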

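Notice that in this case we are predicting a class (ham vs. spam), so linear regression won’t cut it. So do we need to go through the whole process of building the pipeline again? Well, not really. The tf-idf part stays the same; we just have to swap the regressor for a classifier. So we simply replace `linear_model.Ridge()` with `linear_model.RidgeClassifier()`. Strictly speaking this is not logistic regression: `RidgeClassifier` encodes the two classes as -1 and +1, fits a ridge regression to those targets and predicts the class from the sign of the output.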

```
estimators = [("tf_idf", TfidfVectorizer()),
              ("ridge", linear_model.RidgeClassifier())]
model = Pipeline(estimators)
```
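As a sanity check on what `RidgeClassifier` does, the sketch below fits it next to a plain `Ridge` trained on -1/+1 targets and compares the predicted classes. The six messages and their labels are made up, and the tf-idf matrix is converted to a dense array only so that both models are given exactly the same kind of input.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge, RidgeClassifier

# Made-up messages: 1 = spam, 0 = ham.
messages = ["win a free prize now", "free entry claim prize", "are you home yet",
            "see you at lunch", "urgent claim your free cash", "ok talk later"]
labels = np.array([1, 1, 0, 0, 1, 0])

# Dense features so both estimators see the same kind of input.
X = TfidfVectorizer().fit_transform(messages).toarray()

clf = RidgeClassifier(alpha=1.0).fit(X, labels)
reg = Ridge(alpha=1.0).fit(X, 2 * labels - 1)  # regression on -1 / +1 targets

from_classifier = clf.predict(X)
from_regression = (reg.predict(X) > 0).astype(int)  # class from the sign

print(np.array_equal(from_classifier, from_regression))
```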

```
params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],        # regularization strength
          "tf_idf__min_df": [1, 3, 10],                 # drop terms found in fewer documents than this
          "tf_idf__ngram_range": [(1, 1), (1, 2)],      # unigrams only, or unigrams + bigrams
          "tf_idf__stop_words": [None, "english"],      # keep or remove English stop words
          "tf_idf__use_idf": [True, False]}             # apply idf weighting, or keep normalized term counts
```

```
grid = GridSearchCV(estimator=model, param_grid = params)
```

```
grid.fit(corp, y)
```

```
GridSearchCV(cv=None, error_score='raise',
estimator=Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
max_iter=None, normalize=False, random_state=None, solver='auto',
tol=0.001))]),
fit_params={}, iid=True, n_jobs=1,
param_grid={'ridge__alpha': [0.1, 0.3, 1, 3, 10], 'tf_idf__stop_words': [None, 'english'], 'tf_idf__min_df': [1, 3, 10], 'tf_idf__ngram_range': [(1, 1), (1, 2)], 'tf_idf__use_idf': [True, False]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
```

```
grid.best_params_
```

```
{'ridge__alpha': 0.1,
'tf_idf__min_df': 1,
'tf_idf__ngram_range': (1, 2),
'tf_idf__stop_words': None,
'tf_idf__use_idf': True}
```

```
# one column per tuned parameter, followed by the scores
params = pd.DataFrame([i[0] for i in grid.grid_scores_])
results = pd.DataFrame(grid.grid_scores_)
results = pd.concat([params, results], axis=1)
```

```
results.head(3)
```

| | ridge__alpha | tf_idf__min_df | tf_idf__ngram_range | tf_idf__stop_words | tf_idf__use_idf | parameters | mean_validation_score | cv_validation_scores |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 1 | (1, 1) | None | True | {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... | 0.981874 | [0.982238966631, 0.980613893376, 0.982767905223] |
| 1 | 0.1 | 1 | (1, 1) | None | False | {'ridge__alpha': 0.1, 'tf_idf__stop_words': No... | 0.982233 | [0.983315392896, 0.981152396338, 0.982229402262] |
| 2 | 0.1 | 1 | (1, 1) | english | True | {'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e... | 0.980617 | [0.980624327234, 0.980075390415, 0.981152396338] |

Let’s look at the regularization parameter alpha. Remember that alpha plays the role of the inverse of C in models such as logistic regression - so the larger the alpha, the stronger the regularization will be.

```
results.groupby(["ridge__alpha"])["mean_validation_score"].aggregate([np.mean])
```

| ridge__alpha | mean |
|---|---|
| 0.1 | 0.981545 |
| 0.3 | 0.982434 |
| 1.0 | 0.981485 |
| 3.0 | 0.977267 |
| 10.0 | 0.954467 |

So we see that the best results are when alpha is small, around 0.1 - 0.3. If we make alpha too large we get a significant decrease in accuracy.
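To see how alpha controls the amount of shrinkage, here is a one-feature ridge fit in closed form over the same alpha values as the grid. The data points are made up, and the `C` column uses the correspondence stated in the scikit-learn docs (alpha corresponds to `1 / (2C)`), so treat it as a rough guide rather than an exact equivalence between models.

```python
# Made-up one-dimensional data, roughly y = x.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]


def ridge_1d(x, y, alpha):
    """Closed-form ridge coefficient for a single feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


alphas = [0.1, 0.3, 1, 3, 10]
coefs = [ridge_1d(x, y, alpha) for alpha in alphas]

for alpha, coef in zip(alphas, coefs):
    c_equiv = 1 / (2 * alpha)  # alpha corresponds to 1 / (2C)
    print(alpha, round(c_equiv, 3), round(coef, 4))
```

The coefficient moves steadily toward zero as alpha grows, which is the over-regularization that hurts accuracy at alpha = 10.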

Note that in this case I also tuned whether the model should use idf or only tf. The best model does use idf, but let’s see what happens when we look across all the tuning settings:

```
results.groupby(["tf_idf__use_idf"])["mean_validation_score"].aggregate([np.mean, np.std])
```

| tf_idf__use_idf | mean | std |
|---|---|---|
| False | 0.976038 | 0.009895 |
| True | 0.974841 | 0.014802 |

Hmm, interestingly, not using idf performs slightly better on average over the entire grid we tried out, although the difference is much smaller than the standard deviation within either group. This might be because the sms messages aren’t very long. Here’s a quote from the sklearn documentation on tf-idf: “While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. Very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.”
