The Data Den

Projects on Math + Visualization + Machine Learning
by Alexandru Papiu

Tf-Idf Ridge Model Selection using Pipelines in Sklearn

Creating a pipeline to tune tf-idf + ridge regularization parameters and select the best model for text-based predictions. I am going to dabble a bit into text mining in this post. The idea is very simple: we have a collection of documents (these could be emails, books or... [Read More]
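As a rough illustration of the idea in this teaser, here is a minimal sketch of a tf-idf + ridge pipeline tuned with a grid search. The toy documents, target values, and parameter grid are all made up for the example and are not taken from the post, and the sketch uses the current scikit-learn `model_selection` module, which may differ from what the post itself uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and numeric targets, made up for illustration only.
docs = [
    "great product works well", "terrible product broke fast",
    "works well and looks great", "broke fast and looks terrible",
    "great value great quality", "terrible value poor quality",
    "really great and works", "really poor and broke",
    "looks great works great", "looks poor broke fast",
    "quality product great value", "poor product terrible value",
]
y = [5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 4.0, 1.0]

# Chain the vectorizer and the regressor so both are refit inside each CV fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ridge", Ridge()),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(docs, y)

print(search.best_params_)
```

Putting the vectorizer inside the pipeline means the vocabulary and idf weights are learned only from the training part of each fold, so the cross-validation score is not contaminated by the held-out documents.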

Gender Neutral Baby Names

Gender neutral names are names that tend to be given to both girls and boys. I will look at this interesting subset of names and try to answer some questions: What have historically been the most gender neutral baby names? How do these names behave over time? ... [Read More]

A Shiny App honoring Pi

With March 14 looming on the horizon, I decided to make a little interactive visualization of how to approximate Pi by randomly generating points in a square. This is certainly not a novel idea, but I think it touches on two really important topics: the nature of volume/area... [Read More]
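A minimal sketch of the approximation described in this teaser, written in Python rather than as a Shiny app. It assumes the usual setup (not spelled out in the excerpt) of a unit circle inscribed in the square [-1, 1] x [-1, 1]; the function name `estimate_pi` is hypothetical.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate Pi from the fraction of random points that land in the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # area(circle) / area(square) = pi / 4, so scale the hit fraction by 4.
    return 4.0 * inside / n_points


print(estimate_pi(100_000))
```

The estimate improves slowly: its standard error shrinks like one over the square root of the number of points.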

Patterns in the Republican Primaries

I am going to focus on the Republican Primaries in early states, plot some maps and graphs and see if I can figure out any patterns in the ways people voted. Hopefully I will update this as more results start pouring in. I also try to build some prediction models... [Read More]

Interactive Maps of NYC: Biking, Ancestry and Dating

I’ve been playing with the ACS census data in New York City and made some maps that will hopefully reveal some interesting facets of the city. Instead of looking at the usual suspects like median income or density, I’m going to try to show New York from slightly different... [Read More]

Polynomial Overfitting

The bias-variance tradeoff is one of the main buzzwords people hear when starting out with machine learning. Basically, we are often faced with the choice between a flexible model that is prone to overfitting (high variance) and a simpler model that might not capture the entire signal... [Read More]

Strategies for the Board Game Risk

The game of Risk is a turn-based strategy game where players battle each other to take over the world. Your aim is to control as many of the forty-two territories as possible with the armies at your disposal. The way you gain ground is by attacking enemy territories via rolling dice. Here... [Read More]
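To make the dice mechanic concrete, here is a small simulation sketch. The teaser does not state the dice rules, so the code assumes the standard ones: the attacker rolls up to three dice, the defender up to two, the highest dice are compared pairwise, and the defender wins ties. The function names are hypothetical.

```python
import random


def simulate_battle_round(rng, attacker_dice=3, defender_dice=2):
    """Play one round of dice; return (attacker_losses, defender_losses)."""
    attack = sorted((rng.randint(1, 6) for _ in range(attacker_dice)), reverse=True)
    defend = sorted((rng.randint(1, 6) for _ in range(defender_dice)), reverse=True)
    attacker_losses = 0
    defender_losses = 0
    # Compare highest against highest, then second against second.
    for a, d in zip(attack, defend):
        if a > d:
            defender_losses += 1
        else:
            # Ties go to the defender (assumed standard rule).
            attacker_losses += 1
    return attacker_losses, defender_losses


def defender_loses_two_rate(n_rounds=100_000, seed=0):
    """Estimate how often a 3-dice attack removes two defending armies."""
    rng = random.Random(seed)
    hits = sum(simulate_battle_round(rng) == (0, 2) for _ in range(n_rounds))
    return hits / n_rounds


print(defender_loses_two_rate())
```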


D3: A true bar chart with SVG rects var n = 3000; var w = 600; var h = 600; var color = d3.scale.category10() var data... [Read More]

Cross Validation Error Pitfalls

Let’s say you have 10 models that you want to test and roughly all models have the same cross validation error distribution: the Cross Validation Mean Squared Error is normally distributed with mean 3 and standard deviation 0.2. Since CV error is an average of a bunch... [Read More]
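The setup in this teaser is easy to simulate. The sketch below draws 10 CV errors from a normal distribution with mean 3 and standard deviation 0.2, keeps the smallest, and averages that minimum over many repetitions. Treating the 10 errors as independent draws is an assumption of the sketch, and the function name is hypothetical.

```python
import random
import statistics


def mean_best_cv_error(n_models=10, mean=3.0, sd=0.2, n_trials=10_000, seed=0):
    """Average, over many repetitions, of the lowest CV error among n_models."""
    rng = random.Random(seed)
    best_errors = [
        # Each model's CV error is an independent draw (an assumption here).
        min(rng.gauss(mean, sd) for _ in range(n_models))
        for _ in range(n_trials)
    ]
    return statistics.mean(best_errors)


print(mean_best_cv_error())
```

Even though every model has the same true error of 3, the winning model's CV error is noticeably lower on average, so the score of the selected model is an optimistic estimate.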