Posts about ‘Wisdom of Small Crowds’


Crowdsifter: More Efficient Content Filtering

Friday, July 10th, 2009 by John Le

“I know it when I see it.” - Justice Potter Stewart

We have been running Crowdsifter, our content moderation product backed by Amazon’s Mechanical Turk, for a while, and we wanted to share some quality metrics and some stats on how our system aggregates redundant results to improve those metrics.

[ Figure: minerrors11, best error rate for raw AMT with 1-11 workers vs. Crowdsifter ]

Controlling for Worker Quality, Bias, and Item Difficulty

In the graph above we picked the best error rate for raw AMT with 1-11 workers and the best error rate that Crowdsifter provided on a porn judgment task with 2491 images (1006 porn, 1485 non-porn). The error rate is the rate at which wrong decisions are made. A wrong decision is one where we label porn as non-porn or non-porn as porn. The experiment includes images that were labeled as ambiguous, which is why the error rates shown seem so high.

Using Crowdsifter with an average of 3.93 workers per image, we achieve the same minimum error rate as majority voting in raw AMT with 9 workers per image. We do this by controlling worker quality: we keep track of each worker’s judgments, and if we have an “expert”-evaluated gold standard of what is pornographic, we can tell which workers are doing a good job and which are doing a bad one. On non-gold-standard images we weight workers’ judgments by how much we trust them to reflect our standard of porn. Without these controls, majority voting in raw AMT is vulnerable to the many scammers that lurk there.
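To make the idea of trust-weighted voting concrete, here is a minimal sketch. It is not Crowdsifter's actual algorithm, which this post does not spell out. The function names, the smoothing constant, and the choice to weight each vote by how far a worker's gold-standard accuracy sits above chance are all assumptions made for illustration.

```python
from collections import defaultdict

def worker_accuracy(gold, judgments, smoothing=1.0):
    """Estimate each worker's accuracy from gold-standard items.

    gold: {item_id: True/False} where True means porn (expert label).
    judgments: list of (worker_id, item_id, is_porn) tuples.
    Smoothing (an assumption) pulls workers with few gold judgments toward 0.5.
    """
    correct = defaultdict(float)
    total = defaultdict(float)
    for worker, item, is_porn in judgments:
        if item in gold:
            total[worker] += 1
            if is_porn == gold[item]:
                correct[worker] += 1
    return {w: (correct[w] + smoothing) / (total[w] + 2 * smoothing) for w in total}

def weighted_vote(votes, accuracy, default=0.5):
    """Decide one non-gold image from (worker_id, is_porn) votes.

    Each vote counts in proportion to the worker's accuracy above chance.
    Workers at or below chance, and unknown workers, get zero weight.
    Returns True for porn; a zero score falls back to False.
    """
    score = 0.0
    for worker, is_porn in votes:
        weight = max(0.0, accuracy.get(worker, default) - 0.5)
        score += weight if is_porn else -weight
    return score > 0
```

With one reliable worker and two unreliable ones, a plain majority vote would follow the two unreliable workers, while this weighting follows the reliable one.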

For images where obscenity is particularly ambiguous, we can allocate more workers. This results in a better sampling of whether an image is obscene. Some images don’t need many judges to accurately determine if they are pornographic. We can determine which images are easily classifiable as porn by sampling a group of workers and checking whether they all agree. Using too many judges per image can become prohibitively costly. It is important to have this scheme so we can dynamically allocate workers. Raw AMT is both wasteful and inefficient, applying many judgments to easy items, while not using enough judgments for hard items.
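The dynamic allocation described above can be sketched roughly as follows. This is an illustrative guess at the shape of such a scheme, not the production logic: `get_judgment` is a hypothetical callable that fetches one more worker's answer, and the starting sample size, step size, worker cap, and agreement level are all assumed values.

```python
def adaptive_label(get_judgment, initial=3, step=2, max_workers=11, agreement=0.8):
    """Collect judgments until workers agree strongly or the budget runs out.

    get_judgment: hypothetical callable returning one new worker's vote
                  (True = porn, False = non-porn).
    Returns (is_porn, number_of_workers_used).
    """
    votes = [get_judgment() for _ in range(initial)]
    while True:
        yes = sum(votes)
        # Share of votes held by whichever side is currently ahead.
        share = max(yes, len(votes) - yes) / len(votes)
        if share >= agreement or len(votes) >= max_workers:
            return yes * 2 > len(votes), len(votes)
        # Ambiguous so far: ask a few more workers, up to the cap.
        extra = min(step, max_workers - len(votes))
        votes.extend(get_judgment() for _ in range(extra))
```

With these defaults, an image whose first three judges all agree is settled after three judgments, while an image that keeps splitting the vote gets more judges until it reaches the cap.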

Better Measures

The raw error rate includes both images incorrectly labeled as porn and images incorrectly labeled as non-porn. In content moderation we want to minimize our porn miss rate (also known as the false negative rate) because we don’t want to let any porn onto our site. The graph of the porn miss rate corresponding to the graph above is shown below.

[ Figure: fnrate11, porn miss rate (false negative rate) for raw AMT vs. Crowdsifter ]

The most important measure is the porn miss rate, and our rate is close to the rates of 9 to 11 workers per image on AMT, even though we are using fewer than half that number of workers, meaning we significantly cut our costs.
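For reference, the two measures discussed in this section can be computed from gold labels and final decisions as below. This is a small sketch with an assumed function name and assumed data layout (parallel lists of booleans, True meaning porn). It assumes the gold set contains at least one porn and one non-porn image.

```python
def error_rates(gold, predicted):
    """Compare final decisions with gold labels (True = porn).

    Returns the overall error rate, the porn miss rate (false negative rate),
    and the non-porn miss rate (false positive rate).
    """
    false_neg = sum(g and not p for g, p in zip(gold, predicted))
    false_pos = sum(p and not g for g, p in zip(gold, predicted))
    positives = sum(gold)
    negatives = len(gold) - positives
    return {
        "error_rate": (false_neg + false_pos) / len(gold),
        "porn_miss_rate": false_neg / positives,
        "non_porn_miss_rate": false_pos / negatives,
    }
```

Two systems with the same overall error rate can have very different porn miss rates, which is why the second graph matters more for moderation.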

Adjusting Thresholds
We can adjust certain thresholds to lower the porn miss rate, but if we push them too far we risk labeling all our images as porn, so nothing would make it onto our site. Adjusting the threshold to minimize the porn miss rate while maintaining an acceptable non-porn miss rate is a task Crowdsifter can readily handle.

We’ll save what we can do with threshold adjustment for a later blog post.
-John


Thanks to Brendan for help in this post.

EMNLP Slides

Thursday, November 6th, 2008 by Lukas Biewald

Rion Snow presented the paper, “Cheap and Fast - But is it Good?” at EMNLP last week.

Here are the slides from the talk:

Rls For Emnlp 2008

AMT is fast, cheap, and good for machine learning data

Tuesday, September 9th, 2008 by Brendan O'Connor

Update 9/19: Final PDF version has been uploaded. See also the comments below for updates — our released data is already being used by others!


We recently teamed up with Rion Snow, Prof. Dan Jurafsky, and Prof. Andrew Ng from the Stanford AI Lab to try using Amazon Mechanical Turk to generate data sets for Machine Learning research. Many AI tasks require a large amount of training data, and to build natural language systems, researchers traditionally pay linguistic experts for millions of annotations. Search engine companies employ hundreds or thousands of annotators for their classification, ranking, and other statistically trained systems, but their data is private and is not available for research. AMT is a potential tool to create high quality data sets accessible to everyone.

We rigorously tested the quality of AMT responses for several classic human language problems, and found that the quality was the same as or better than the expert data that most researchers use. We wrote a paper, “Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks,” that will be presented at an upcoming conference, EMNLP-2008.

Our findings:

1. Turker-generated data is good. AMT makes it easy to ask many people for judgments, so for several tasks, we looked at accuracy rates for how well the averaged Turker judgments correlate to the expert gold standard. With more judgments per example, accuracy increases. For comparison, on each graph the horizontal dotted line indicates the rate at which a single expert agrees with their gold standard. Enough non-experts can match or often beat experts’ reliability.

k-acc3.png

2. Turker-generated data is cheap and fast. We can collect thousands of labels per dollar and per hour.

(more…)

Wisdom of small crowds, part 3: another worker visualization

Thursday, August 7th, 2008 by Brendan O'Connor

This is a follow-up to the previous post on individual workloads and rates. Here are the submission times and durations for every worker on the same graph. Each worker is one horizontal line. An assignment starts at a dot, and its duration is shown by the line segment extending to the right.

submission-durations-wide1.png

The particular data set isn’t the same as in the previous post, but was for a similar task and exhibits a similar structure. Worker rates differ substantially. Some workers do a few HITs, but others work on as many as are available. Some work rapidly with breaks (19, 36). Some assignment durations are as long as 5-10 minutes (13, 37). Some work very intermittently (29).

This view makes the parallelism of AMT apparent. At any vertical timeslice you can see how many workers are active at that time. The entire job ends on the right side when the available HITs run out.
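The "vertical timeslice" reading of the graph amounts to a simple count. Here is a sketch, assuming each assignment is recorded as a hypothetical (worker_id, start_time, duration) tuple with times in seconds since the job began:

```python
def active_workers(assignments, t):
    """Count distinct workers with an assignment in progress at time t.

    assignments: list of (worker_id, start_time, duration) tuples,
    an assumed layout for the downloaded completion-time data.
    """
    return len({worker for worker, start, duration in assignments
                if start <= t < start + duration})
```

Evaluating this at regular intervals across the job gives the parallelism curve that the graph shows visually.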

[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]

Wisdom of small crowds, part 2: individual workloads and rates

Tuesday, August 5th, 2008 by Brendan O'Connor

[ Update: see also another visualization of this. ]

AMT’s great new interface makes it easy to download completion times for individual worker assignments. Therefore, it’s easy to visualize :) For a recent small job we did (250 HITs, 5 workers per HIT), here’s a graph of completion times per worker, over the entire 15 minute duration of the job. Each assignment is a single point, graphed by when it was done versus how long it took.

completion-times.png

(more…)

Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)

Monday, June 16th, 2008 by Brendan O'Connor

[ This article is part of a series, Wisdom of Small Crowds, which focuses on crowdsourcing methodology for Amazon Mechanical Turk-like systems. ]

We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no “right” answer. Once we have a bunch of Turker judgments, we need to aggregate them — that is, use some sort of voting mechanism — to give as accurate a classification as possible. It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.

Here’s an example. A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were “about the same topic” or “about different topics”. This is just a binary classification problem; call these labels “YES” and “NO”. To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set. That is, for 135 examples, their experts did the task themselves and provided “gold” ground truth labels.

We used a very high number of workers per example (about 20). For all 135 examples in the gold standard, the following graph plots them vertically by their “Turker confidence in YES” — that’s just the percentage of votes for “YES” among the 20 or so judgments for that particular example. I’ve also colored each example with the experts’ gold label. You can see that this simple Turker data provides some statistical separation between the classes.

Test set separation by Turker ensemble binary classifier

This graph also shows how to create a classifier from Turker votes. We have to choose a confidence threshold for our classifier’s decision: above the threshold, say “YES”, and below say “NO”. Unfortunately, Turkers aren’t perfect at modeling the experts: anywhere we place the threshold, errors occur. However, some thresholds are better than others. The threshold with the best accuracy is at 73% confidence — that is, a 73% super-majority voting rule — and it classifies instances correctly 90% of the time. Furthermore, we can tune for different types of errors. If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many “YES” instances as possible, we can set a lower, more liberal threshold.
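Threshold calibration as described above can be sketched in a few lines. The function names and data layout are assumptions for illustration. The sketch treats a confidence equal to the threshold as "YES" and breaks ties in favor of the lowest threshold, two choices the text does not specify.

```python
def yes_confidence(votes):
    """Fraction of Turker votes for YES on one example (votes are booleans)."""
    return sum(votes) / len(votes)

def best_threshold(confidences, gold):
    """Pick the confidence threshold with the highest accuracy on gold data.

    confidences: per-example Turker confidence in YES, between 0 and 1.
    gold: per-example expert labels (True = YES), same order.
    Returns (threshold, accuracy).
    """
    # Only thresholds at observed confidence values change any decision;
    # the extra value above 1.0 covers the "always NO" classifier.
    candidates = sorted(set(confidences)) + [1.01]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((c >= t) == g for c, g in zip(confidences, gold))
        acc = correct / len(gold)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

To favor fewer false positives or fewer false negatives instead of raw accuracy, the same loop can score each candidate threshold with a cost that weights the two error types differently.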

Here’s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives. For a particular decision threshold, it shows how it divides up the instances into the confusion matrix’s 4 categories of correct and incorrect decisions.

Classifier performance on gold standard at different thresholds
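The four categories in that chart can be tallied for any single threshold as in the sketch below. The names are illustrative, and as an assumption a confidence equal to the threshold counts as "YES".

```python
def confusion_at(confidences, gold, threshold):
    """Count true/false positives and negatives at one decision threshold.

    confidences: per-example Turker confidence in YES.
    gold: per-example expert labels (True = YES).
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for c, g in zip(confidences, gold):
        predicted_yes = c >= threshold
        if predicted_yes and g:
            counts["TP"] += 1
        elif predicted_yes and not g:
            counts["FP"] += 1
        elif not predicted_yes and not g:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts
```

Running this over a range of thresholds reproduces the tradeoff: raising the threshold moves examples from FP to TN but also from TP to FN.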

A final note on why threshold calibration is important: For this task, the Turkers were considerably more liberal than the experts at deciding what a “YES” example was — experts marked only 36% of examples as “YES”, whereas a simple Turker majority voting rule marks 57% that way. This is because the experts understood the full implications of the decision, which were substantial — various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge. False positives had a very high cost. The prompt for Turkers, by contrast, was fairly vague. (In our experience, we generally find that good task design is a huge factor in getting better Turker accuracy.) However, since Turker decisions noisily correlate with the experts, moving the decision threshold can help accuracy. Here’s the threshold vs. accuracy graph:

thresh-acc.png

Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold. This blog post only scratched the surface; there are a few more useful things to consider. Stay tuned for Part 2 and hopefully many more!

A few more notes on Turker voting and threshold calibration:

(more…)