May 3, 2011 / Ben Chun

Why I Need Help from Scientists

Dear people who know things,

Please put me in touch with researchers who might be interested in analyzing 2000 user-generated answers to the prompt “I learned to program…”

A couple days ago, Jean Hsu wrote a blog post that really nailed something I’ve been thinking about for a long time now: guys (and girls who’ve been drawn into this particular aspect of geek culture) really need to tell you how incredibly young they were when they did the first of so many amazing things they’ve done without even trying.

While it may not be consciously intended as intimidation, it can easily function as such. This sentiment has been expressed before, and showed up with references in a comment from Caroline on Boing Boing. So thanks to her for that:

Jean’s post made me think about looking more deeply at the ilearnedtoprogram.com data set. Now I didn’t ask people to self-report gender on the input form, so I’m doing the same thing that she was doing: inferring gender based on names and photos. I’ve published 505 stories, approximately 68% from males and 32% from females, plus a handful that I couldn’t really discern.

I took those published stories, and pasted the text for each gender into Wordle. Here’s the male cloud:

And this is the female cloud:

The terms “games” and “college” seem to have the biggest difference. Now I’m well aware that this isn’t even meaningful textual analysis, let alone scholarship. But hopefully it points in the direction of something interesting or helpful that we can investigate further. Anyone working in this area and interested in looking at some user-generated content? Let’s talk.
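
One rough way to put numbers on the word-cloud comparison is to compute each word's share of all the words in each group and rank words by the gap. This is only a sketch under assumptions: the function names and the two tiny story lists are made up for illustration and are not taken from the real data set, and no stop words are filtered out, so common words would need handling before reading much into the output.

```python
import re
from collections import Counter


def term_frequencies(stories):
    """Return each word's share of all words across a list of story strings."""
    counts = Counter()
    for story in stories:
        # Lowercase and keep runs of letters/apostrophes as words.
        counts.update(re.findall(r"[a-z']+", story.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def largest_differences(group_a, group_b, top_n=10):
    """Rank words by the absolute gap between their relative frequencies.

    A positive value means the word is relatively more common in group_a,
    a negative value means it is relatively more common in group_b.
    """
    tf_a = term_frequencies(group_a)
    tf_b = term_frequencies(group_b)
    vocab = set(tf_a) | set(tf_b)
    diffs = {w: tf_a.get(w, 0.0) - tf_b.get(w, 0.0) for w in vocab}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]


# Hypothetical placeholder stories, NOT real submissions.
group_a = ["I learned to program by making games", "games on a calculator"]
group_b = ["I learned to program in college", "a college class"]

for word, diff in largest_differences(group_a, group_b, top_n=5):
    print(f"{word:12s} {diff:+.3f}")
```

Using relative rather than raw counts matters here because the two groups are different sizes, so raw counts would favor the larger group for almost every word.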


9 Comments

  1. Peter Combs / May 3 2011 9:58 am

    Why not just open it up and let anyone who wants to work on it do so? I’m assuming that you’re not looking to get a journal publication out of it, but rather just get interesting analyses out there. In that case, then, opening up the data would let people slice and dice it in interesting ways that a handful of people wouldn’t necessarily think of.

  2. Ben Chun / May 3 2011 10:09 am

    Well that idea certainly appeals to my sensibilities. Are there any risks or privacy concerns? Let me think about that (and ask for help thinking about that); my initial reaction is that this is the right way to go.

  3. Marshall Kirkpatrick / May 3 2011 12:53 pm

    This is awesome, as was the original project itself. So Facebook has used US Census data to guess, based on last names, the likely racial background of groups of users. That would be an interesting approach here in addition to gender. You could also look up people’s names on LinkedIn (or ask Mechanical Turkers to do so) and analyze based on factors related to employment.

  4. Eric Nguyen / May 3 2011 1:10 pm

    It’s probably okay to publish any data that is already public on your site. You would be exposing the data more openly (i.e. making it easier to find), so the question is whether you set user expectations well enough on the input form.

    If you publish, I recommend http://www.google.com/fusiontables, which has some nice built-in visualization tools, sharing, and API access.

  5. Bob Calder / May 3 2011 1:52 pm

    Have you tried manyeyes? The best program for this is over a grand and IBM makes manyeyes free. It is good for network analysis. I have no idea about semantic analysis. Here’s an example where there are strings of words:

    http://www-958.ibm.com/software/data/cognos/manyeyes/visualizations/ct-semantic-symptom-relationships-2

  6. Adam Marcus / May 3 2011 6:27 pm

    You can give TF-IDF [1] a shot (or just TF if you don’t have a way to get inverse document frequency from somewhere). Then for each word, do TFIDF_women – TFIDF_men and vice versa, and see which terms have the largest difference between the two groups.

    This is out of left field, but you can also try to adapt David Ayman Shamma’s work on labelling differences in peaks on Twitter conversations to label the differences in groups [2].

    One last thing I can think of to determine gender is to borrow Facebook’s use of the census as a proxy for race. This is going to be extremely dirty, and your sample might not have enough data in it to be able to draw anything useful. Use carefully: Facebook can do this because of the size of its corpus, but it would be harder for you to infer race/gender information just yet.

    [1] http://en.wikipedia.org/wiki/Tf%E2%80%93idf
    [2] http://research.yahoo.com/pub/3419

  7. Peter Timusk / May 6 2011 6:40 pm

    Hello, I am a budding Internet researcher on my own and work in statistics. I have one published paper and a few unpublished ones on Internet topics. I have also read a number of texts in school in the sociology of science and technology, as well as “gender and computing” studies. And I have been programming from a young age, so I am “of” the data. I would offer a more quantitative approach, I believe. While I am not yet experienced with textual analysis myself, I have read enough textual analysis studies to know some methods. If you want to discuss with me, please view my school/research blog above and e-mail me.

  8. Angela / May 14 2011 9:55 am

    Hi, I’m doing research in the area of technology use and access with a particular focus on race, ethnicity and the future career mobility of people of color in the new global economy. I would be very interested in this type of data disaggregated by race, ethnicity, age and gender.

Trackbacks

  1. ILTP Data « And Yet It Moves
