Designing Incentives for Crowdsourcing Workers

In a recent paper, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), John Horton, Daniel Chen and I used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work.

The results surprised us. They suggest that workers perform most accurately when the task design credibly links payoffs to a worker’s ability to think about the answers that their peers are likely to provide.

a horserace experiment! (photo cc-by-sa by iyoupapa)

The idea for this study came out of our sense that, as social scientists, we had something unique to offer the existing research on human computation. Early and influential crowdsourcing research has focused on how to filter the judgments of the crowd to find the best answers. We wanted to know whether simple task-design changes could improve the quality of data coming into a crowdsourcing system in the first place.

To test this idea, we chose 14 different incentive schemes and framing techniques developed and validated across the social sciences and set up a horse race experiment to see which schemes/techniques would work best.

Consistent with our personal biases (John and Daniel are both economists, and I’m a sociologist), some of the schemes were financially oriented, some were social or psychological, and some were hybrids combining social and financial incentives. The details of all the schemes are included in the paper (it’s a long list, and some of them are kind of involved), but it’s worth giving some examples.

On the financial end of the incentives spectrum, we had one condition we called “reward-accuracy,” which was pretty much what you’d expect: we told workers, “we’ll pay you a bonus if you get the answers right.” We also had one called “punishment-accuracy,” the gist of which you can deduce. On the purely social-psychological side, we had one we called “trust,” in which we told workers, “we’ll pay you for this job no matter how bad your performance, we trust that you’ll still make your best effort.”

One of the weirdest schemes turns out to be important, so I need to explain that one. Called “Bayesian Truth Serum” (BTS), it incorporates a design from the work of Drazen Prelec, a behavioral economist at MIT, who realized that research subjects could probably provide useful information regarding the expected distribution of answers to subjective, qualitative questions (nb, the mechanics of how he does this are arcane in a way that is almost sure to delight the geeks among you, so I encourage you to read his paper). Few of the details of real BTS matter here; the piece we incorporated was asking workers both to answer the questions themselves and to predict the distribution of other workers’ responses. We also told them we’d give them a bonus if their predictions were correct.

We then created a task that asked workers to answer five questions. In this case, the questions were drawn from another study examining participatory features of websites, for which we already possessed validated data collected by research assistants.

All workers answered the same five questions about the same website (www.kiva.org) while being exposed to one and only one of the 14 incentive schemes (or a control condition of no scheme). Roughly 2,000 individuals participated in the study, resulting in over 100 subjects in each of the experimental conditions. (The statistics and science nerds out there will be pleased to know that both the drop-out rate and demographic covariates were distributed evenly across conditions.)

To measure worker performance, we used the research assistant responses as correct answers to the questions and then calculated the total number of matching answers (out of five) provided by each worker. The results (aggregated across all treatments) are plotted in a histogram below and show that the average worker answered just over two questions out of five correctly.
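For readers who want to see the scoring concretely, here is a minimal sketch of counting matches against a gold standard. The answer values, worker IDs, and the `score` helper are all made up for illustration; they are not the study's data or code.

```python
from statistics import mean

# Hypothetical gold-standard answers, standing in for the research
# assistants' validated responses to the five questions.
GOLD = ["yes", "no", "yes", "yes", "no"]

def score(worker_answers, gold=GOLD):
    """Count how many of a worker's answers match the gold-standard answers."""
    if len(worker_answers) != len(gold):
        raise ValueError("expected one answer per question")
    return sum(a == g for a, g in zip(worker_answers, gold))

# Made-up responses from three workers.
workers = {
    "w1": ["yes", "no", "no", "yes", "yes"],
    "w2": ["no", "no", "no", "no", "no"],
    "w3": ["yes", "no", "yes", "yes", "no"],
}

scores = {worker: score(answers) for worker, answers in workers.items()}
average = mean(scores.values())  # mean number correct, out of five
```

Each worker ends up with an integer from 0 to 5, which is the quantity the histogram summarizes.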

Aggregate performance histogram

Then, in order to see how the treatments compared against each other relative to the control group, we calculated the mean correct response rate for each condition and conducted difference of means tests to see which of these means were significantly greater than the control group. The results of this comparison appear below (in a new plot that doesn’t even appear in the paper!):
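As a rough illustration of what a difference of means test against a control group involves, here is a sketch using only the Python standard library. It assumes a Welch-style standard error and a normal approximation, which may differ from the exact test used in the paper, and the scores are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def diff_of_means(treatment, control):
    """Return (difference, standard error, one-sided p-value) for treatment > control.

    Assumption: unequal-variance (Welch) standard error with a normal
    approximation, which is reasonable with 100+ subjects per condition.
    """
    diff = mean(treatment) - mean(control)
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    p_one_sided = 1.0 - NormalDist().cdf(diff / se)
    return diff, se, p_one_sided

def ci95(sample):
    """95% confidence interval for a condition mean, using that condition's own n."""
    m = mean(sample)
    half_width = NormalDist().inv_cdf(0.975) * sqrt(variance(sample) / len(sample))
    return m - half_width, m + half_width

# Invented scores (number correct out of five), 120 workers per condition.
control = [2, 2, 1, 3, 2, 2, 3, 1, 2, 3] * 12
treatment = [3, 2, 2, 3, 3, 2, 4, 1, 3, 2] * 12

diff, se, p = diff_of_means(treatment, control)
low, high = ci95(control)
```

The same comparison would be repeated once per treatment, each time against the same control group.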

ITT estimates per treatment

The orange dots show the value of the mean in each condition, and the blue bars illustrate the 95% confidence interval around that mean. The treatments are sorted by the size of the difference in means from the control. (More hard-core nerd stuff: the means are adjusted using Intent-To-Treat estimators.)

From these results, we concluded that our horse race had two clear front-runners: the “Bayesian Truth Serum” (BTS) and “Punishment – disagreement” conditions, each of which improved average worker performance by almost half of a correct answer above the 2.08 correct answers in the control group. A few of the other financial and hybrid incentives had fairly large point estimates, but were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn’t make sense; it’s yet another precautionary measure to avoid upsetting the stats nerds among you). In a tough turn for the sociologists and psychologists, none of the purely social/psychological treatments had any significant effects at all.
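To make the multiple-comparisons adjustment less mysterious, here is a small sketch of two standard family-wise corrections, Bonferroni and Holm. The p-values are invented and there are only five of them (the study compared 14 treatments), so this only illustrates the mechanics.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1);
        # the running maximum keeps adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Invented raw p-values for five hypothetical treatment-vs-control tests.
raw = [0.001, 0.002, 0.02, 0.04, 0.30]
alpha = 0.05

sig_raw = [p < alpha for p in raw]
sig_bonferroni = [p < alpha for p in bonferroni(raw)]
sig_holm = [p < alpha for p in holm(raw)]
```

In this made-up example, four comparisons look significant before adjustment and only the two strongest survive either correction, which is the general pattern described above: large point estimates that stop being significant once the number of comparisons is taken into account.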

Why do BTS and punishing workers for disagreement succeed in improving performance significantly where so many of the other incentive schemes failed? The answer hinges on the fact that both conditions tied workers’ payoffs to their ability to think about their peers’ likely responses. (We elaborate on the argument in more detail in the paper.)

Does this mean that we should give up on simple financial or social-psychological incentives? Probably not. The fact that we conducted the experiment on MTurk means that the deck may have been stacked against incentives like the “trust” condition I described earlier. Because requesters on MTurk have little oversight, workers are more likely to respond to financial incentives than stated promises. In this sense, the marketplace has structured the interaction between workers and requesters in a way that may limit the opportunities to harness motivations that are not linked to money in some explicit way.

You can download the full paper to read more.

9 Responses to “Designing Incentives for Crowdsourcing Workers”

  1. Alexis

    “A few of the other financial and hybrid incentives had fairly large point estimates, but were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn’t make sense — it’s yet another precautionary measure to avoid upsetting the stats nerds among you).”

    They used the absurdly prone to Type II error (false negative) adjustment (The Bonferroni method). In other words their analysis stacked the deck against finding more incentives significant. Other Family-Wise Error Rate adjustments for multiple correction give better power (e.g. Holm’s or Holm-Sidak method). However, the False Discovery Rate methods (Benjamini’s methods) both correct for multiple comparisons, but do not make absurd reductions in statistical power (e.g. preferring to make false negative errors) in order to do so.

  2. Alexis

    “A few of the other financial and hybrid incentives had fairly large point estimates, but were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn’t make sense — it’s yet another precautionary measure to avoid upsetting the stats nerds among you).”

    They used the absurdly prone to Type II error (false negative) adjustment (The Bonferroni method). In other words their analysis stacked the deck against finding more incentives significant. Other Family-Wise Error Rate adjustments for multiple correction give better power (e.g. Holm’s or Holm-Sidak method). However, the False Discovery Rate methods (Benjamini’s methods) both correct for multiple comparisons, but do not make absurd reductions in statistical power (e.g. preferring to make false negative errors) in order to do so, and they also account for hypotheses rejected after adjustment (positive findings) among all the comparisons.

  3. Aaron Shaw

    Hi Alexis,

    Not sure if you meant to get a response to this comment, but I wanted to reply to say that we ran all the other adjustments for multiple comparisons you mentioned on our data and they all generated the exact same results (in terms of which results were significant) as the “absurdly prone to Type II error” Bonferroni method!

    thanks for reading and commenting.

  4. Alexis

    Hi Aaron,

    (Apologies for the accidental double post earlier). A fair point, and thank you for the response. Better to report using the less absurd methods (even when they agree with the absurd ones). Bonferroni is an historical artifact which is unfortunately perpetuated in practice by its role in graduate level introductory applied statistics pedagogy.

    And yup: I teach graduate level introductory applied statistics. :)

  5. Dan

    It looks like blue bars are much too short in the graph with the orange dots. If they actually represent the 95% CI’s on each condition’s mean, then the control condition is significantly different from every one of the groups that did better, since they all have non-overlapping confidence intervals. A 95% CI covers about the range (-2 SE, +2 SE), so even Humanization looks to be about 8 SE’s better than the control according to the blue bars. I assume that you described the results correctly, which puts the mistake in your unpublished graph rather than your results.

    Perhaps you used the total N to calculate the CI’s for the graph when you should have used the n for each condition?

  6. Barry

    Excessive infatuation with the instrument? Why should we care about this?

  7. Nita

    Perhaps the effect is not so much due to better motivation (incentives) as to a more effective approach to the problem encouraged by these incentives? In any case, I think you’ve shown that “what would others say?” is a useful heuristic for this task.

  8. Alan

    Is there any way to implement this through CrowdFlower? Being able to encourage more valid responses would be very useful. Not everything we ask can be validated with gold data.