May 4, 2011 / Ben Chun


Since posting a call for help yesterday with exploring the “I learned to program…” data set, I’ve heard back from multiple people who work on machine learning and natural language processing, including a former water polo teammate and lab partner from MIT whom I’d lost touch with! Cool how these things bring people together.

Anyway, more than one person suggested that I just release the data for everyone to visualize and explore, and I had to take a minute to think about the privacy implications. When people submit their story to the web site, they’re also giving their name and a URL with the knowledge that I publish these things and with the intention of having them published. So I think that’s okay to share.

The email address that I collect is only in case I need to follow up with people, which I have done in a number of cases to either get a more suitable URL or let them know I’ve published their story. I promised not to spam these addresses, and I think that includes not letting them be harvested. So that’s not okay to share.

Thus, I’ve decided to post the data with four columns: timestamp, answer, name, and URL. If you correlate this data with the site, you’ll find that I’ve edited people’s answers for length, chosen different URLs to represent them, corrected grammar, made spelling and capitalization consistent, and generally done all sorts of curation to make the site what it is.

The data I’m releasing contains none of those editorial changes. This is the raw data, exactly as submitted by users. People suggested a number of places to put the data, including Fusion Tables and Many Eyes. Those sound cool. For now, I’m just putting it out there as a comma-separated text file, because I think you probably know more than me about how to do this efficiently, and I think you probably know about lots and lots of interesting tools and techniques.

So, download 280 kilobytes of iltp.csv goodness and comment back with your findings or links to interactive things. I can’t wait to see what you come up with!
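
For anyone who wants a starting point, here is a minimal sketch in Python of loading the file and counting common words in the answers. It assumes the four columns appear in the order listed above (timestamp, answer, name, URL) and that the file has no header row; neither is confirmed, so adjust `COLUMNS` if the real layout differs. The function names are made up for illustration.

```python
import csv
import io
from collections import Counter

# Assumed column order, taken from the post's description; the real file
# may differ or may start with a header row.
COLUMNS = ["timestamp", "answer", "name", "url"]


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the assumed column names."""
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def load_file(path):
    """Read a CSV file from disk (for example a local copy of iltp.csv)."""
    with open(path, newline="", encoding="utf-8") as f:
        return load_rows(f.read())


def top_words(rows, n=10):
    """Return the n most common lowercase words across all answers."""
    counts = Counter()
    for row in rows:
        for word in row.get("answer", "").split():
            cleaned = word.strip(".,!?\"'()").lower()
            if cleaned:
                counts[cleaned] += 1
    return counts.most_common(n)
```

Calling `top_words(load_file("iltp.csv"))` on a downloaded copy would give a rough first look at the vocabulary. The `csv` module handles quoted fields, so answers that contain commas stay in one column.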



Leave a Comment
  1. Garth / May 5 2011 12:16 pm

    Thanks. Some great diversity in there but also a whole lot of extreme geekiness. I love it.

  2. glenn mcdonald / May 6 2011 5:32 am

    Here’s a browsable, explorable form of this data in Google/ITA Needlebase:

    In addition to putting this up, I did some very basic cleanup to merge a couple dozen duplicate responses and a few other duplicate responders, and delete a bunch of non-URL URL entries.
