## Tf-Idf Ridge Model Selection using Pipelines in Sklearn

Posted on August 4, 2016

Creating a pipeline to tune tf-idf + ridge regularization parameters and select the best model for text-based predictions. I am going to dabble a bit in text mining in this post. The idea is very simple: we have a collection of documents (these could be emails, books or...

[Read More]
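
As a rough illustration of the idea in the excerpt above, here is a minimal sketch of a tf-idf + ridge pipeline tuned with a grid search. It is not the post's own code: the toy documents, the target values, and the parameter grid are all made up for the example, and it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and numeric targets, for illustration only.
docs = [
    "great movie with a great cast",
    "terrible plot and bad acting",
    "good story and good music",
    "bad movie with a boring plot",
    "great music and a good cast",
    "boring story and terrible music",
    "good acting and a great plot",
    "terrible cast and a bad story",
    "great story with good acting",
]
y = [5.0, 1.0, 4.0, 1.5, 4.5, 1.0, 4.5, 1.0, 5.0]

# Chain the vectorizer and the regressor so both are refit inside each CV fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ridge", Ridge()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(docs, y)

print(search.best_params_)
```

Tuning the vectorizer and the regularization strength in one search keeps the tf-idf vocabulary from being fit on held-out documents, which is the main reason to wrap both steps in a single pipeline.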
## Gender Neutral Baby Names

Posted on March 12, 2016

Gender neutral names are names that tend to be given to both girls and boys. I will be looking at this interesting subset of names and trying to answer some questions: What have historically been the most gender neutral baby names? How do these names behave over time? ...

[Read More]
## A Shiny App honoring Pi

Posted on March 3, 2016

With March 14 looming on the horizon, I decided to make a little interactive visualization on how to approximate Pi by randomly generating points in a square. This is certainly not a novel idea but I think it touches on two really important topics: the nature of volume/area...

[Read More]
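
The approximation described in the excerpt above can be sketched in a few lines. This is a plain Python illustration rather than the Shiny app from the post; the function name `estimate_pi` and its parameters are hypothetical. Points are drawn uniformly in the square [-1, 1] x [-1, 1], and the fraction that lands inside the unit circle approximates the ratio of the two areas, pi / 4.

```python
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi from the share of random points that fall in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Circle area / square area = pi / 4, so scale the fraction by 4.
    return 4.0 * inside / n_points


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n, seed=0))
```

The error shrinks slowly, roughly in proportion to one over the square root of the number of points, so each extra digit of accuracy costs about a hundred times more samples.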
## Patterns in the Republican Primaries

Posted on February 25, 2016

I am going to focus on the Republican Primaries in early states, plot some maps and graphs and see if I can figure out any patterns in the ways people voted. Hopefully I will update this as more results start pouring in. I also try to build some prediction models...

[Read More]
## Interactive Maps of NYC: Biking, Ancestry and Dating

Posted on February 15, 2016

I’ve been playing with the ACS census data in New York City and made some maps that will hopefully reveal some interesting facets of the city. Instead of looking at the usual suspects like median income or density, I’m going to try to show New York from slightly different...

[Read More]
## Polynomial Overfitting

Posted on January 17, 2016

The bias-variance tradeoff is one of the main buzzwords people hear when starting out with machine learning. Basically, we are often faced with the choice between a flexible model that is prone to overfitting (high variance) and a simpler model that might not capture the entire signal...

[Read More]
## MNIST Digit Recognition: Exploratory Data Analysis and Prediction

Posted on January 2, 2016

We will be looking at the MNIST data set on Kaggle. The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is. We’ll start with some exploratory data analysis and then try to build some predictive models to predict the...

[Read More]
## Strategies for the Board Game Risk

Posted on November 7, 2015

The game of Risk is a turn-based strategy game where players battle each other to take over the world. Your aim is to control as many of the forty-two territories as possible with the armies at your disposal. The way you gain ground is by attacking enemy territories via rolling dice. Here...

[Read More]
## Tips for Visualization

Posted on October 4, 2015

D3: A true bar chart with SVG rects Click me. var n = 3000; var w = 600; var h = 600; var color = d3.scale.category10() var data...

[Read More]
## Cross Validation Error Pitfalls

Posted on October 4, 2015

Let’s say you have 10 models that you want to test and all models have roughly the same cross validation error distribution: the Cross Validation Mean Squared Error is normally distributed with mean = 3 and standard deviation equal to 0.2. Since CV error is an average of a bunch...

[Read More]
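
The setup in the excerpt above lends itself to a quick simulation. The sketch below shows one pitfall such a setup commonly illustrates, which may or may not be exactly where the post goes: if you pick the model with the lowest CV error, that winning error is biased downward. It assumes the 10 CV errors are independent draws from the stated normal distribution, which is a simplification, and the function name and defaults are hypothetical.

```python
import random
import statistics


def simulate_min_cv_error(n_models=10, mean=3.0, sd=0.2, n_trials=10_000, seed=0):
    """Average the smallest of n_models CV errors over many simulated experiments."""
    rng = random.Random(seed)
    minima = []
    for _ in range(n_trials):
        # Assumption: each model's CV error is an independent N(mean, sd^2) draw.
        errors = [rng.gauss(mean, sd) for _ in range(n_models)]
        minima.append(min(errors))
    return statistics.mean(minima)


if __name__ == "__main__":
    print("true mean CV error of every model:", 3.0)
    print("average CV error of the 'best' model:", simulate_min_cv_error())
```

Under these assumptions the average winning error comes out around 2.7 even though every model has a true mean of 3, so the selected model's CV score should not be reported as its expected error without a separate held-out check.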