CrowdFlower Research


Ask a Stupid Question

December 16th, 2009 by Aaron Shaw

What makes a bad survey question and why does it matter? I thought I’d use my first blog posts as Dolores Labs’s friendly neighborhood social scientist to talk a little bit about question design since it’s a relevant, but often overlooked, area of crowdsourcing work.

You can ask “the crowd” all kinds of questions, but if you don’t stop to think about the best way to ask your question, you’re likely to get unexpected and unreliable results. You might call it the GIGO (garbage in, garbage out) theory of research design.

To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in CrowdFlower’s labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question, and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.

An Example: Response Scales

The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people’s answers. Here are two versions of the same question that I posted to CrowdFlower:


Low Scale Version:


About how many hours do you spend online per day?

(a) 0 – 1 hour
(b) 1 – 2 hours
(c) 2 – 3 hours
(d) More than 3 hours

High Scale Version:


About how many hours do you spend online per day?

(a) 0 – 3 hours
(b) 3 – 6 hours
(c) 6 – 9 hours
(d) More than 9 hours

Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.

So what did people say? Here’s a pair of histograms breaking the responses up by the two versions of the question:

boring histograms: hours online by scale

I didn’t label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the CrowdFlower worker pool tend to spend more than three hours per day online (whoa, no way…).

At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).

To look at that comparison more closely, let’s break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours and (2) the percentage of responses that were more than 3 hours.

hours online in two bins

The difference between the heights of the orange points (high scale) is much bigger than the corresponding difference between the heights of the blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you’re a stats nerd, a chi-square test showed that this variation was significant with a p-value < 0.001, so the difference was very unlikely to be due to chance alone.
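
For readers who want to try this kind of comparison themselves, here is a minimal sketch of a chi-square test of independence on a 2x2 table. It is written in Python rather than the R used for this post, the function name is made up for illustration, and the counts are hypothetical, not the actual survey responses.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    table = [[a, b], [c, d]]; here rows are the two scale versions and
    columns are "less than 3 hours" / "more than 3 hours".
    Returns (chi2 statistic, p-value) for 1 degree of freedom,
    without a continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if scale version
            # and answer were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL counts, not the real responses from this experiment.
hypothetical_counts = [
    [40, 55],  # low scale:  under 3 hours, over 3 hours
    [15, 80],  # high scale: under 3 hours, over 3 hours
]
chi2, p = chi_square_2x2(hypothetical_counts)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
```

A small p-value says only that a gap this large would be rare if the scale made no difference; it says nothing about which scale is closer to the truth.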

But maybe collapsing the responses like this is a little too coarse and you'd still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects, a comparison of the cumulative percentage of responses, and the differences are even clearer.

hours online - cumulative bins

That gap between the blue and the orange line at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!
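
The cumulative view is easy to compute from per-option counts. The sketch below shows one way to do it in Python; the counts and the dictionary layout are hypothetical and stand in for the real data, which is not reproduced here.

```python
from itertools import accumulate

def cumulative_percentages(counts):
    """Turn per-option counts into running percentages of all responses."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]

# HYPOTHETICAL counts per answer option, not the real survey data.
# upper_bounds gives the top of each option in hours (None = open-ended).
low_scale = {"upper_bounds": [1, 2, 3, None], "counts": [5, 15, 20, 55]}
high_scale = {"upper_bounds": [3, 6, 9, None], "counts": [15, 35, 25, 20]}

for name, scale in [("low", low_scale), ("high", high_scale)]:
    cumulative = cumulative_percentages(scale["counts"])
    for bound, pct in zip(scale["upper_bounds"], cumulative):
        label = f"under {bound} hours" if bound is not None else "all responses"
        print(f"{name} scale, {label}: {pct:.1f}%")
```

The only point the two scales share is the 3-hour boundary, so that is where the two cumulative curves can be compared directly.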

Explaining the Gap

If you’re thinking that the differences between the scales alone can’t explain why all of these results are so skewed, that’s a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit.

But what explains why scale questions can bias people’s responses so heavily? Survey researchers call this kind of behavior satisficing: it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we’re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don’t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn’t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.

These results illustrate a sticky problem: it’s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.

Okay, It’s Broken. Now How Do I Fix It?

So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.

To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing CrowdFlower workers to type in their responses. Here’s a density plot of those responses with the minimum, maximum, and mean responses highlighted (sparklines style):

hours online - open ended

While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers’ judgments.
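
As a rough illustration of the re-centering step, the Python sketch below summarizes a set of open-ended answers and builds a four-option scale around the median. The answers are hypothetical (chosen only so that the median and mode match the values reported above), and `centered_scale` is a made-up helper, not part of any survey tool.

```python
import statistics

# HYPOTHETICAL open-ended answers (hours online per day), not the real data.
hours = [2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 9, 10, 14]

mean = statistics.mean(hours)
median = statistics.median(hours)
mode = statistics.mode(hours)
print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")

def centered_scale(center, step):
    """Four answer options whose middle boundary sits at `center`."""
    low, high = center - step, center + step
    return [
        f"0 - {low} hours",
        f"{low} - {center} hours",
        f"{center} - {high} hours",
        f"More than {high} hours",
    ]

# Center the scale on the typical answer instead of guessing a range.
print(centered_scale(center=round(median), step=3))
```

With a median of 6 and a step of 3, this reproduces the options of the high scale version, which fits the conclusion that the high scale was the better match for this population.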

Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it’s turtles all the way down. I’d be willing to bet that their best guesses are pretty good, but if a big policy decision was riding on this question, I’d try to supplement my little survey with some other data sources. No matter what, there’s no perfect solution.

So what?

The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you’re not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don’t usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.

I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at aaron [at] doloreslabs [dot] com with questions, comments, corrections and requests for data/code. All of these plots were created using R.

Getting the Gold-Farmers to do useful work

October 22nd, 2009 by Lukas Biewald

screen shot crowdflower gambit

One of the most interesting and successful ways that games make money is through “offers”: basically ads or surveys that players can do to earn virtual currency. The game maker earns real money for every player that completes an offer.

We’ve integrated with Gambit, a leading offer provider. They post our tasks inside games alongside other offers. Instead of filling out a survey or buying something they don’t actually want, we have people doing real, useful work for our customers. I might be too old to understand the appeal of virtual currency, but we’ve observed from the feedback and volume of gamers doing our tasks that people care about getting their in-game money. It’s fascinating and exciting that people are shifting from doing things that aren’t usually thought of as productive to tasks that people need done. You can see a screenshot of how a task looks in the Facebook game “SportsBets” on the left.

This is one way that by working through CrowdFlower we’re able to give you access to people speaking virtually every major language.

iPhone app — Give work

October 13th, 2009 by Stephanie Geerlings

We just launched our first iPhone app, Give Work. It lets you do tasks in your downtime and help increase the wages of refugees in Kenya working for us.

We have been working with Samasource for a while now. They are a fantastic local non-profit that brings computer-based work to people in Africa. We send tasks to one of their hardest-to-employ groups: people in a Kenyan refugee camp.

The people are extremely motivated, speak fluent English and even have high-speed internet. But sometimes there are downtime issues (due to floods, satellite failure, etc.) and sometimes there are data quality issues (due to cultural misunderstandings), which makes it hard for them to compete for traditional outsourcing work. Fortunately, our dynamic routing and quality control technology can resolve these problems gracefully.

When you complete a task on your iPhone, your work is paired with the work of someone in Kenya. iPhone users’ results are used for quality control: if someone waiting for a bus in San Francisco gives the same answer as someone working in a refugee camp, we can be fairly certain that the results are reliable. All of the profits we make on the work collected between the iPhone and the refugees go directly into the pockets of the refugee workers.
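
The pairing idea can be sketched in a few lines of Python. This is only an illustration of agreement checking, not the actual CrowdFlower quality-control code; the function name, task ids, and answers are all hypothetical.

```python
def split_by_agreement(paired_answers):
    """Separate tasks where both people agree from tasks where they differ.

    paired_answers maps a task id to a tuple:
    (answer from the iPhone user, answer from the worker in Kenya).
    """
    agreed = {}
    disputed = []
    for task_id, (iphone_answer, worker_answer) in paired_answers.items():
        # Compare case-insensitively and ignore stray whitespace.
        if iphone_answer.strip().lower() == worker_answer.strip().lower():
            agreed[task_id] = worker_answer.strip()
        else:
            disputed.append(task_id)
    return agreed, disputed

# HYPOTHETICAL task ids and answers, for illustration only.
example_pairs = {
    "task-1": ("Cat", "cat "),
    "task-2": ("dog", "wolf"),
}
agreed, disputed = split_by_agreement(example_pairs)
print("accepted:", agreed)
print("needs another look:", disputed)
```

Disputed tasks are not necessarily wrong; they are simply the ones where a third judgment would be worth collecting.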

How to do tasks
Download the free app from the app store. You can start doing tasks in seconds. Shoot us an email at feedback@crowdflower.com if you have any issues or questions.

How to submit tasks to Samasource and GiveWork
Visit CrowdFlower where you can build tasks to outsource. On the order page, click the “iPhone” and “Samasource” channels.

Thanks!
I want to give a special shoutout to Josh Snyder, who did most of the work of building the actual application. I wasn’t sure we had the resources to get this crazy idea out the door, but everyone pitched in after hours and it looks great!

We’re still growing

October 5th, 2009 by Lukas Biewald

The Office

Come work for us! We’re funded by a group of well-known investors, we’re generating substantial and increasing revenue, and we’re looking for people to take us to the next level.

We’re located in the heart of the Mission, and we particularly love to meet readers of this blog. Among many amenities, we offer unlimited otter pops and a healthy oxygen-neutral environment.

If you refer someone that we hire, we will also confer upon you lifetime access to our otter pop supply.

Please send your application to jobs@doloreslabs.com.

Director of BD/Sales

Responsibilities:

  • Manage the sales pipeline
  • Investigate new markets and new applications of our technology
  • Close deals with large enterprise customers

Requirements:

  • Proven track record of closing deals with enterprise customers

Account Manager

Responsibilities:

  • Communicate with large customers
  • Handle new customers
  • Review all major deliverables to ensure quality standards and client expectations are met
  • Approve change orders and invoices, and take responsibility for payment collections

Requirements:

  • Basic statistical/quantitative literacy
  • Organized and meticulous, but willing to work within the chaos of a startup
  • Undergraduate degree

Airlines: Who to fly with?

September 30th, 2009 by John Le

I really hate flying. It’s not that I dislike being transported through the sky in a giant metal cylinder, which is incredibly amazing and cool; it’s that service is unpredictable. Inevitably, I am burdened with mundane research, searching for and reading recent reviews to find good service. No one really wants to do this, but we still want to know which airlines have a greater percentage of recently satisfied customers. PeopleBrowsr, a social search engine we are working with, did a really interesting sentiment analysis task on airlines and generously allowed us to publish the data.

positive-tweet-percentage-number-of-passengers

We can see a slight negative correlation between the size of the airline and the percentage of positive tweets. Smaller commercial airlines like Hawaiian, SkyWest, and Virgin had higher percentages of positive tweets, clustering towards the lower right-hand corner of the graph, while larger carriers like United, Delta, and Continental had lower percentages, clustering towards the upper left-hand corner. Southwest, however, was one of the larger carriers that broke this trend. We should note that Aloha Airlines no longer exists (its passenger data is from 2007), and it’s possible that the tweets for Aloha Airlines showed such a high percentage of positive tweets because “aloha” evokes positive sentiment.
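
To put a number on a relationship like this, one option is the Pearson correlation coefficient. The Python sketch below computes it by hand; the passenger counts and percentages are hypothetical placeholders, not the PeopleBrowsr data behind the graph.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# HYPOTHETICAL values for five unnamed airlines, for illustration only.
passengers_millions = [5, 10, 40, 60, 100]
percent_positive = [80, 72, 75, 60, 65]

r = pearson_r(passengers_millions, percent_positive)
print(f"r = {r:.2f}")  # a negative r means bigger airlines score lower
```

Values of r near 0 indicate little linear relationship, and with only a handful of airlines a single outlier can move r a lot.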

What was said

To gain insight into what we can expect from a positively or negatively viewed airline, it would also be nice to know what words were being used. So below are two nice Wordle visualizations of tweeted words where size indicates greater prevalence of the word, while word orientation and color are for style.

Positive Sentiment Tweet Words
positive-tweet-words
Negative Sentiment Tweet Words
negative-tweet-words

Not surprisingly, the prevalence of the words “delay”, “wait”, and “waiting” in tweets about negatively viewed airlines implies that having delays and making people wait is bad. Positive tweets contained common words like “great”, “best”, and “good”, but also “internet”, “wifi”, and “wireless”, which is also not very surprising since internet connectivity is so highly valued. Interestingly, some positive tweets contain words like “galactic”, “mothership”, and “spaceport”. Taking this into account I’ll try to find an airline run by aliens the next time I fly, and hopefully my interactions with them won’t prove disastrous.
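
The word sizes in a visualization like this come from simple word counts. Here is a minimal Python sketch of that counting step, assuming a small hand-picked stopword list; the example tweets are invented and the function name is hypothetical.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "at", "is", "the", "to", "still", "again"}

def word_counts(tweets):
    """Count how often each non-stopword appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase the tweet and keep runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

# INVENTED example tweets, not taken from the real data set.
negative_tweets = [
    "Another delay, still waiting at the gate",
    "Delay again. The wait is endless",
]
counts = word_counts(negative_tweets)
print(counts.most_common(3))
```

Words with the highest counts would be drawn largest in the word cloud.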

-John

TechCrunch 50 – Business Card Analysis

September 22nd, 2009 by John Le

One of my favorite scenes in American Psycho is the business card scene. A business card’s importance cannot be overstated, as Christian Bale’s character decides to kill a man who had a better business card. In order to avoid a similar fate, I thought it would be nice to know what kind of reactions people might have to my business card.

With this in mind, we demoed CrowdFlower in the TechCrunch50 DemoPit last week on Tuesday with a live task in which we scanned images of a person’s business card and asked crowdsourced workers from the Amazon Mechanical Turk channel to write five kind words about the person based on what they saw. Here are some examples of what some workers said:

businesscardexample

Worker id | Kind Word 1 | Kind Word 2 | Kind Word 3 | Kind Word 4 | Kind Word 5
37928 | Intuitive | Smart | Organized | Connected | Gentlemen
38341 | computer-savvy | articulate | intelligent | hard-working | ambitious
37928 | Organized | Experienced | Accredited | Professional | Thorough
43713 | Smart | Efficient | Hard-working | No-nonsense | Leader
42272 | respected technical consulant | Manager | senior | kindly person | high powered
42344 | technical | organized | professional | respectable | senior
41905 | professional | connected | a business card so awesome it screwed with the scanner! | international | traveler
1148 | Long term | Life | Accessible | For the long haul | Reality

It feels great to be complimented by strangers, and seeing the positive reaction people had to the kind words said about them reminded me of this awesome short film.

Quality

For tasks like these, where the responses are subjective, it is generally hard to control for quality. Workers can input anything, making it difficult to tell whether they are actually doing the task or scamming you. This is why we also asked workers to input the business card holder’s name. The name was usually clear, so we could quickly tell as judgments were being made whether the workers were actually completing the task.

We found that workers were generally inputting the correct names. Knowing that they did this part of the task correctly, we can for the most part infer that they did the other part of the task well too and were not scamming us. And as we can see in the above examples, most words were indeed kind.

Of the 306 total judgments for 61 business cards, only 29 were of bad quality (a single judgment asked for 5 words and the name). A judgment was considered bad when the worker did not input kind words describing the person and instead entered things like “NA”, “this is good”, “this is bad”, “…”, or “No image”. But 19 of those 29 bad-quality judgments were due to the business card being scanned poorly, like the ones here and here.
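
Working through the numbers above as a quick Python calculation (the variable names are just labels for the figures quoted in this post):

```python
total_judgments = 306
business_cards = 61
bad_judgments = 29
bad_from_poor_scans = 19

# Share of all judgments that were judged to be bad quality.
bad_rate = bad_judgments / total_judgments

# Share that was bad for reasons other than a poorly scanned card.
bad_rate_excluding_scans = (bad_judgments - bad_from_poor_scans) / total_judgments

# Average number of judgments collected per business card.
judgments_per_card = total_judgments / business_cards

print(f"bad judgments overall: {bad_rate:.1%}")
print(f"bad judgments not explained by poor scans: {bad_rate_excluding_scans:.1%}")
print(f"judgments per card: {judgments_per_card:.2f}")
```

So roughly one judgment in ten was bad, and only about one in thirty was bad for a reason other than an unreadable scan.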

If we had wanted to hear kind words even when images were of poor quality, a quick task improvement to increase our “kind word” quality might be to clearly specify what workers should do in these cases. Because we didn’t say what to do when the image was particularly poor, workers were more likely to finish the task as quickly as possible by giving us non-useful data (non-kind words). But overall the results and words used were certainly interesting (I particularly liked worker 41905 who said “a business card so awesome it screwed with the scanner!” as kind words).

-John

CrowdFlower

September 15th, 2009 by Lukas Biewald

Today we took CrowdFlower out of private beta and launched it at TechCrunch 50.

We started Dolores Labs to help enterprise companies manage large pools of casual workers. Now we’re making our technology available in a self-service app: CrowdFlower.

Why use a casual workforce?

CrowdFlower makes it easy to have thousands of people working on your jobs, instantly! Early customers have used it for things as varied as calling local businesses to verify their hours to checking images for copyright violations to finding the best still images for videos. I used it to have hundreds of people call my partner Chris and wish him happy birthday (he quickly turned off the task).

You can build a task and outsource 1 hour of work: there’s no minimum charge and setup takes minutes. You can also outsource an unknown number of hours of work: maybe you want CrowdFlower to check your user comments for spam, and one day you suddenly have thousands more comments than expected. CrowdFlower will automatically go out and find a big enough work force to take care of your job.

Why use CrowdFlower?

Quality control. When you have 1000 people working on your job all at once, you don’t want to spot check everyone and make sure they’re doing a good job — at that point you might as well do the work yourself! CrowdFlower efficiently uses redundancy and analytics so that you can get the optimal cost/accuracy tradeoff.
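
The post does not spell out how redundancy is used, so the following is only a sketch of one common approach: collect several judgments for the same item and keep the majority answer. It is not CrowdFlower's actual algorithm, and the function name and example answers are hypothetical.

```python
from collections import Counter

def aggregate_by_majority(judgments):
    """Return (most common answer, share of judgments that gave it)."""
    counts = Counter(judgments)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(judgments)

# HYPOTHETICAL judgments from three workers on the same comment.
answer, agreement = aggregate_by_majority(["spam", "spam", "not spam"])
print(f"{answer} (agreement {agreement:.0%})")
```

Asking for more judgments per item raises both cost and confidence, which is the cost/accuracy tradeoff mentioned above.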

What casual workforces can I access through CrowdFlower?

Amazon Mechanical Turk/LiveWork - these are two services that offer tens of thousands of workers on demand. We will route your tasks through their API.
Gambit - a Facebook offerwall. Gambit puts your tasks inside casual games so that millions of people can earn virtual currency by doing your tasks.
Samasource - sends your tasks to refugees in Kenya.

Seattle MTurk Meetup

August 31st, 2009 by Stephanie Geerlings

Tomorrow is Amazon’s Mechanical Turk Meetup in New York City! I was lucky enough to get to go to the Seattle meetup on the 18th at Amazon HQ. Seattle is a beautiful city and there are plenty of smart people working in the crowdsourcing space.

Here is a short list of some Seattle companies worth checking out:
Smartsheet – smartsourcing your spreadsheet needs
Casting Words – transcription plus, plus
Cloudvox – innovative stuff with API-based phone calls
TagCow – photo tagging bliss

Do check out the NYC meetup tomorrow, September 1st. You’ll see our very own Howie Liu talking about boosting accuracy numbers through iterative task design and statistical quality analysis.

Seattle from the plane

Beautiful Data

August 5th, 2009 by Lukas Biewald

Brendan and I wrote a chapter for an O’Reilly book called Beautiful Data. We took a lot of the analysis from earlier blog posts and distilled it into a longer book chapter about exploring a large data set and turning the messy data into beautiful, compelling graphs. We tried to highlight the tools and techniques that don’t make it into textbooks and are instead passed along by word-of-mouth among people in the field.

You can check out a version of our chapter, and if you like it, we recommend you buy the book, which is full of authors I admire: Jeff Hammerbacher, Toby Segaran, Aaron Koblin, Nathan Yau, Mike Migurski, Peter Norvig, Andrew Gelman and many more.

Crowdsourced Computing part 2

July 22nd, 2009 by Lukas Biewald

The results from the Engine Yard contest are in. Our best answer was 34, while the winning score was 30. Many thanks to everyone who installed and ran our application. It blows my mind that we were able to build an app over the weekend, throw up a webpage, and be crunching numbers on hundreds of friends’ and strangers’ computers around the world.

Two other teams independently came up with the same crowdsourced solution, but did it in JavaScript in the browser: RBI Engine. It’s an approach that’s about 1000 times slower, but definitely more accessible than installing a native app.

Shout out to Brian, who led the development on this experiment.