<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The CrowdFlower Blog &#187; Wisdom of Small Crowds</title>
	<atom:link href="http://blog.crowdflower.com/topics/wisdom/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.crowdflower.com</link>
	<description></description>
	<lastBuildDate>Tue, 10 Jan 2012 20:00:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Oscar Fever: The Sequel!</title>
		<link>http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/</link>
		<comments>http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/#comments</comments>
		<pubDate>Fri, 04 Mar 2011 22:51:03 +0000</pubDate>
		<dc:creator>Patrick Philips and Joseph Childress</dc:creator>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Media]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=2190</guid>
		<description><![CDATA[The votes are in from our Oscar crowdsourcing experiment, and the crowd successfully picked the winners of 14 of the Academy Awards. For reference, Roger Ebert got 15 predictions correct, so we&#8217;d have to conclude that the crowd performed reasonably well at predicting the winners of this glorified popularity contest. One fascinating thing about aggregating responses [...]]]></description>
			<content:encoded><![CDATA[<p>The votes are in from <a href="http://blog.crowdflower.com/2011/02/oscar-fever/" target="_blank">our Oscar crowdsourcing experiment</a>, and the crowd successfully picked the winners of 14 of the Academy Awards. For reference, <a href="http://rogerebert.suntimes.com/apps/pbcs.dll/article?AID=/20110210/OSCARS/110219999" target="_blank">Roger Ebert got 15 predictions correct</a>, so we&#8217;d have to conclude that the crowd performed reasonably well at predicting the winners of this glorified popularity contest.</p>
<div id="attachment_2191" class="wp-caption aligncenter" style="width: 808px"><a rel="attachment wp-att-2191" href="http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/actual_results/"><img class="size-full wp-image-2191" title="actual_results" src="http://blog.crowdflower.com/wp-content/uploads/2011/03/actual_results.jpg" alt="movie picks" width="798" height="414" /></a><p class="wp-caption-text">Predicted and Actual Winners of the 2011 Academy Awards</p></div>
<p><span id="more-2190"></span><br />
One fascinating thing about aggregating responses is that the crowd as a whole will often outperform the average worker. In this case, among the 500 people we polled, the majority of respondents picked fewer than 10 awards correctly (mean of 9.6 and median of 9). And yet, by aggregating all the responses, such that the nominee with the most &#8220;votes&#8221; is predicted to win, the crowd as a whole correctly picked 14 awards. While the &#8220;wisdom of crowds&#8221; doesn&#8217;t come as much of a surprise, it&#8217;s always reassuring to see it confirmed in new applications.</p>
<p><a rel="attachment wp-att-2199" href="http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/correct_histogram1/"><img class="aligncenter size-full wp-image-2199" title="correct_histogram1" src="http://blog.crowdflower.com/wp-content/uploads/2011/03/correct_histogram1.jpg" alt="" width="614" height="445" /></a></p>
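<p>The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the code we ran: the ballots, category names, and nominee labels are hypothetical toy data, and ties are broken by whichever pick was seen first, which is an assumption.</p>
<pre>
```python
from collections import Counter


def plurality_winner(votes):
    """Return the most common pick; ties go to the pick seen first."""
    return Counter(votes).most_common(1)[0][0]


def score(picks, actual):
    """Count the categories where a set of picks matches the actual winners."""
    return sum(1 for category, winner in actual.items()
               if picks.get(category) == winner)


# Hypothetical toy data: three respondents, three award categories.
actual = {"picture": "A", "director": "B", "actor": "C"}
ballots = [
    {"picture": "A", "director": "B", "actor": "X"},
    {"picture": "A", "director": "Y", "actor": "C"},
    {"picture": "Z", "director": "B", "actor": "C"},
]

# The crowd's pick in each category is the nominee with the most votes.
crowd_picks = {
    category: plurality_winner([ballot[category] for ballot in ballots])
    for category in actual
}
individual_scores = [score(ballot, actual) for ballot in ballots]
crowd_score = score(crowd_picks, actual)
```
</pre>
<p>In this toy case every respondent gets two of three categories right, while the aggregated picks get all three, because the respondents make their mistakes in different categories.</p>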
<p>As we noted in <a href="http://blog.crowdflower.com/2011/02/oscar-fever/">our earlier post</a>, though, the more interesting question was whether workers who indicated higher confidence in their responses would outperform workers with lower confidence. Looking at the results, however, we saw no significant correlation between a worker&#8217;s predicted accuracy and actual performance.</p>
<div id="attachment_2222" class="wp-caption aligncenter" style="width: 823px"><a rel="attachment wp-att-2222" href="http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/scatts/"><img class="size-full wp-image-2222" title="scatts" src="http://blog.crowdflower.com/wp-content/uploads/2011/03/scatts.jpg" alt="" width="813" height="547" /></a><p class="wp-caption-text">&quot;Squint all you want, but there&#39;s no pattern&quot;</p></div>
<p>While it&#8217;s certainly possible that we didn&#8217;t offer enough of an incentive for workers to estimate their own accuracy, the more likely explanation is that predicting the winners of the Oscars is not something that a person can do with any degree of certainty. Confident or not, the people we polled did not see the &#8220;Inside Job&#8221; coming.</p>
<p>As a final exercise, we ran a regression on every explanatory variable we could find, including what state workers came from, what day they made their predictions, whether they made their predictions during the day or at night, how long they spent making their predictions and even their historical accuracy on other CrowdFlower tasks. The only significant variable turned out to be how long they spent making their predictions, and although it was significant (p = 0.001), no model we could come up with explained more than 5 percent of the total variation in accuracy.</p>
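<p>The &#8220;percent of variation explained&#8221; figure is the R-squared of the fitted model. As a rough sketch of how it is computed for a single predictor (an ordinary least squares fit written out by hand, with hypothetical inputs, not our actual analysis):</p>
<pre>
```python
def r_squared(x, y):
    """Fit y = a + b*x by ordinary least squares and return R^2,
    the fraction of the variance in y that the fit explains."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical examples: a perfect linear relationship and an unrelated one.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
unrelated = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
```
</pre>
<p>A predictor can be statistically significant and still leave R-squared close to zero, which is the situation described above.</p>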
<p>While the wisdom of crowds seems to extend to picking Oscar winners, the more interesting experiment of having workers self-select as trustworthy is ongoing. In the future, it would be worthwhile to repeat this experiment with questions that can be answered objectively and without uncertainty (solving algebra problems seems like a good candidate), to see if any correlation emerges between predicted and actual accuracy.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2011/03/oscar-fever-the-sequel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good work knows no boundaries</title>
		<link>http://blog.crowdflower.com/2010/12/good-work-knows-no-boundaries/</link>
		<comments>http://blog.crowdflower.com/2010/12/good-work-knows-no-boundaries/#comments</comments>
		<pubDate>Thu, 16 Dec 2010 21:01:14 +0000</pubDate>
		<dc:creator>Patrick Philips</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[quality control]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=1864</guid>
		<description><![CDATA[Can the quality of crowdsourced work be linked to geography? In short, the answer is no. Here&#8217;s why. Our team looked at whether including a workforce from a specific region helped or hurt the completion rate and quality of a recent website categorization project. Overall, the project showed a common trend, where the quality of [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/Nova_totius_Terrarum_Orbis_geographica_ac_hydrographica_tabula_Hendrik_Hondius_balanced1.jpg"></p>
<p>Can the quality of crowdsourced work be linked to geography?</p>
<p>In short, the answer is no. Here&#8217;s why.</p>
<p><span id="more-1864"></span></p>
<p>Our team looked at whether including a workforce from a specific region helped or hurt the completion rate and quality of a recent website categorization project. Overall, the project showed a common trend, where the quality of work improved over time, regardless of geography. </p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/october_throughput.png"></p>
<p>The figure above shows the overall volume of judgments (blue) plotted against trusted judgments (green). In addition to the large amount of untrusted work over the first two days, this graph shows spikes in volume that correspond with time of day. Two possible explanations for the change in quality are: </p>
<ol>
<li>Only good workers were able to continue working after the first few days. </li>
<li>A flood of bad work from specific geographies, related to the spikes in throughput at certain hours.</li>
</ol>
<p><strong>“Just a Taste” Workers</strong></p>
<p>One of the ways that CrowdFlower maintains quality is by incorporating questions with known answers, and tracking worker performance on these units as a proxy for overall accuracy. If workers don’t maintain a minimum level of accuracy, they are prohibited from continuing work on any given task.</p>
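<p>A gold-question check of this kind might look like the sketch below. The 70 percent cutoff, the four-unit grace period, and the function names are all assumptions made for illustration, not CrowdFlower&#8217;s actual settings or API.</p>
<pre>
```python
def gold_accuracy(answers, gold):
    """Fraction of gold units (units with known answers) the worker got right.
    Returns None when the worker has not yet seen any gold."""
    graded = [unit for unit in answers if unit in gold]
    if not graded:
        return None
    correct = sum(1 for unit in graded if answers[unit] == gold[unit])
    return correct / len(graded)


def may_continue(answers, gold, min_accuracy=0.7, min_gold_seen=4):
    """Let a worker keep going until enough gold has been seen,
    then require at least min_accuracy on gold (both values assumed)."""
    graded = sum(1 for unit in answers if unit in gold)
    if graded < min_gold_seen:
        return True
    return gold_accuracy(answers, gold) >= min_accuracy


# Hypothetical gold units and three workers.
gold = {"u1": "news", "u2": "shop", "u3": "blog", "u4": "news"}
careful = {"u1": "news", "u2": "shop", "u3": "blog", "u4": "shop", "u9": "blog"}
careless = {"u1": "shop", "u2": "shop", "u3": "news", "u4": "blog"}
newcomer = {"u1": "shop"}
```
</pre>
<p>Here <code>careful</code> scores 3 of 4 on gold and may continue, <code>careless</code> scores 1 of 4 and is stopped, and <code>newcomer</code> has not seen enough gold to be judged yet.</p>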
<p>In this job, it appears that many workers were unable to meet our quality standards. This was a somewhat subtle categorization job, characterizing the content of websites according to a handful of criteria, so it’s not surprising that many workers had trouble. </p>
<p>As a result, many workers did a relatively small amount of work, but they could not continue because they didn’t meet our quality standards. The figure below shows the number of workers completing a specified number of judgments. </p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/trusted_workers.png"></p>
<p>This graph demonstrates a known behavior in online tasks, where many workers attempt only a small amount of work before abandoning a given task. Our quality-control mechanism requires workers to demonstrate accuracy before it accepts their work.</p>
<p>While it is interesting that our quality control identified certain high accuracy workers and allowed them to continue, this doesn’t answer the question of why there were such dramatic changes in throughput during the day. </p>
<p><strong>Non-US Workers</strong></p>
<p>Especially over the first two days, the relative amount of untrusted work was much greater between 10 p.m. and 10 a.m. (Pacific Time), which is when we see a preponderance of work coming from workers outside of the United States. Indeed, after filtering workers by IP address, we saw that 73 percent of all workers in this job came from India. On the other hand, the region&#8217;s workers accounted for only 46 percent of trusted workers. </p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/untrusted_trusted_countries.png"></p>
<p>However, while these workers account for a relatively small proportion of trusted workers, they did much better in terms of trusted work submitted.</p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/untrusted_trusted_judgments.png"></p>
<p>This workforce accounts for fully two-thirds of the total trusted work on this job. While that is somewhat less than their overall representation in the labor pool, this suggests that we can’t dismiss these workers as low-quality.</p>
<p><strong>Pareto Me This</strong></p>
<p>All of this raises a very interesting observation. As you may have seen elsewhere<sup>1</sup>, a relatively small minority of people often accounts for the vast majority of observed effects. This is common in crowdsourcing, just as it was in land ownership in 19th-century Italy.</p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/12/nifty_cumulative_graph.png"></p>
<p>In this example, the top three percent of most prolific workers provided over 40 percent of trusted work. The top 20 percent of workers provided over 80 percent of trusted work. </p>
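<p>Figures like these come from sorting workers by output and summing from the top. A minimal sketch, using hypothetical judgment counts rather than the data from this job:</p>
<pre>
```python
import math


def top_share(judgments_per_worker, top_fraction):
    """Share of all judgments contributed by the most prolific
    top_fraction of workers (at least one worker is always included)."""
    counts = sorted(judgments_per_worker, reverse=True)
    top_n = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)


# Hypothetical trusted-judgment counts for ten workers.
counts = [500, 300, 60, 40, 30, 25, 20, 15, 5, 5]
share_top_10 = top_share(counts, 0.10)
share_top_20 = top_share(counts, 0.20)
```
</pre>
<p>With these made-up counts, the single most prolific worker supplies half of the work and the top two supply 80 percent.</p>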
<p><strong>Keep the Bums Out</strong></p>
<p>Focusing on the top 20 percent of most prolific workers, we see that one-half came from India while another one-third came from the U.S. Other countries provided the remaining workers. While it is true that the vast majority of untrusted workers in this job, who collectively provided relatively few judgments, came from India, it is also true that the country&#8217;s workers make up half of the most prolific workers and provided two-thirds of all trusted work. </p>
<p>The solution to improving the efficiency of this job, then, is not the crude choice of excluding workers from certain geographies. Rather, we can discourage bad workers by increasing the burden of entry, so that only workers with an interest in completing more than a few judgments will bother with the job.</p>
<hr />
<p>1. Ferriss, Tim (2007), <em>The 4-Hour Workweek</em>, Crown Publishing.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2010/12/good-work-knows-no-boundaries/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Visions and revisions</title>
		<link>http://blog.crowdflower.com/2010/10/visions-and-revisions/</link>
		<comments>http://blog.crowdflower.com/2010/10/visions-and-revisions/#comments</comments>
		<pubDate>Sat, 30 Oct 2010 00:46:42 +0000</pubDate>
		<dc:creator>Josh Eveleth</dc:creator>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>
		<category><![CDATA[art]]></category>
		<category><![CDATA[Hobbes]]></category>
		<category><![CDATA[Sandburg]]></category>
		<category><![CDATA[Shakespeare]]></category>
		<category><![CDATA[Writing]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=1734</guid>
		<description><![CDATA[Writing is easy. Just sit in front of a typewriter, open up a vein and bleed it out drop by drop. &#8211; Red Smith When I was in college, a professor I respected said that one of the best ways to demystify writing is to write like people you admire. Specifically, to find passages that [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
Writing is easy. Just sit in front of a typewriter, open up a vein and bleed it out drop by drop.<br />
<span style="font-size: 13.3333px;">&#8211; Red Smith</span></p></blockquote>
<div id="attachment_1434" class="wp-caption alignnone" style="width: 330px"><img src="http://blog.crowdflower.com/wp-content/uploads/2010/10/Underwoodfive.jpg" width="320" height="240" /><p class="wp-caption-text">Underwood Five Typewriter</p></div>
<p>When I was in college, a professor I respected said that one of the best ways to demystify writing is to write like people you admire. Specifically, to find passages that you love, and try to revise them in your own words. This exercise proved invaluable. It allowed me to walk in their literary footsteps, shedding light on why they chose &#8212; or avoided &#8212; certain words, punctuation, and syntax.</p>
<p>With this in mind, I recently wondered whether you can crowdsource writing or, more specifically, revising.</p>
<p><span id="more-1734"></span></p>
<p>I posted a task through CrowdFlower that asked the crowd to rewrite four famous quotations, pithily, while preserving their meaning.</p>
<p><img src="http://blog.crowdflower.com/wp-content/uploads/2010/10/Screen-shot-2010-10-29-at-11.59.44-AM.png"></p>
<p>In one evening, I was able to get 20 revisions of each quotation from people across the country. I won&#8217;t summarize them all here, but I will pull a few highlights.</p>
<p><strong>Original Quotation 1 (from <em><a href="http://www.bartleby.com/100/138.31.118.html">Macbeth</a></em>, William Shakespeare):</strong></p>
<blockquote><p>
To-morrow, and to-morrow, and to-morrow,/ Creeps in this petty pace from day to day,/ To the last syllable of recorded time;/ And all our yesterdays have lighted fools/ The way to dusty death. Out, out, brief candle!/ Life&#8217;s but a walking shadow, a poor player/ That struts and frets his hour upon the stage/ And then is heard no more. It is a tale/ Told by an idiot, full of sound and fury/ Signifying nothing.</p></blockquote>
<p><strong>Revision 1:</strong></p>
<blockquote><p>
Time marches on, and everyone dies; life is meaningless.<br />
<span style="font-size: 13.3333px;">&#8211; Hatboro, PA</span></p></blockquote>
<blockquote><p>
Life creeps along and ends suddenly like the end of a bad play. The play is dramatic and had poor acting, and has no point or moral in the end.<br />
<span style="font-size: 13.3333px;">&#8211; Salt Lake City, UT</span></p></blockquote>
<p><strong>Original Quotation 2 (from &#8220;<a href="http://www.bartleby.com/100/160.2.html">The Leviathan</a>,&#8221; Thomas Hobbes):</strong></p>
<blockquote><p>
No arts, no letters, no society, and which is worst of all, continual fear and danger of violent death, and the life of man solitary, poor, nasty, brutish, and short.</p></blockquote>
<p><strong>Revision 2:</strong></p>
<blockquote><p>
The life of a man on his own is barbaric and degrading.<br />
<span style="font-size: 13.3333px;">&#8211; East Aurora, NY</span></p></blockquote>
<blockquote><p>
No art, letters or society. Worst of all, living in fear of being alone, poor and short.<br />
<span style="font-size: 13.3333px;">&#8211; Arlington, TX</span></p></blockquote>
<blockquote><p>
Life is all but a mere scam.<br />
<span style="font-size: 13.3333px;">&#8211; Overland Park, KS</span></p></blockquote>
<p><strong>Original Quotation 3 (&#8220;<a href="http://www.bartleby.com/124/pres31.html">First Inaugural Address</a>,&#8221; Abraham Lincoln):</strong></p>
<blockquote><p>
We are not enemies, but friends. We must not be enemies. Though passion may have strained, it must not break our bonds of affection. The mystic chords of memory, stretching from every battlefield and patriot grave to every living heart and hearthstone all over this broad land, will yet swell the chorus of the Union, when again touched, as surely they will be, by the better angels of our nature.</p></blockquote>
<p><strong>Revision 3:</strong></p>
<blockquote><p>
We&#8217;re fools for fighting each other. We should co-operate, and let our example bring everyone together.<br />
<span style="font-size: 13.3333px;">&#8211; Plattsburgh, NY</span></p></blockquote>
<blockquote><p>
We must be friends, we should now forget each other. Our memories will always stay through our rough times and good times.<br />
<span style="font-size: 13.3333px;">&#8211; Tallahassee, FL</span></p></blockquote>
<p><strong>Original Quotation 4 (from &#8220;<a href="http://www.bartleby.com/165/1.html">Chicago</a>,&#8221; Carl Sandburg):</strong></p>
<blockquote><p>
&#8220;They tell me you are wicked and I believe them, for I have seen your painted women under the gas lamps luring the farm boys.&#8221;</p></blockquote>
<p><strong>Revision 4:</strong></p>
<blockquote><p>
Your reputation follows you and it is a bad one, makeup and lust.<br />
<span style="font-size: 13.3333px;">&#8211; Iola, WI</span></p></blockquote>
<blockquote><p>
This city isn&#8217;t a nice place, it&#8217;s full of hookers.<br />
<span style="font-size: 13.3333px;">&#8211; Milledgeville, GA</span></p></blockquote>
<p>The full data is available <a href="http://blog.crowdflower.com/wp-content/uploads/2010/10/f17005-2.csv">here</a>.</p>
<p>What&#8217;s your opinion? Can the revision process be crowdsourced?</p>
<p>I&#8217;d love to hear your thoughts.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2010/10/visions-and-revisions/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Crowdsifter: More Efficient Content Filtering</title>
		<link>http://blog.crowdflower.com/2009/07/crowdsifter-more-efficient-pornography-filtering/</link>
		<comments>http://blog.crowdflower.com/2009/07/crowdsifter-more-efficient-pornography-filtering/#comments</comments>
		<pubDate>Fri, 10 Jul 2009 02:50:36 +0000</pubDate>
		<dc:creator>John Le</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2009/07/crowdsifter-more-efficient-pornography-filtering/</guid>
		<description><![CDATA[&#8220;I know it when I see it.&#8221; &#8212; Justice Potter Stewart We have been running Crowdsifter, our content moderation product backed by Amazon&#8217;s Mechanical Turk for a while and we wanted to share some quality metrics and some stats on how our system aggregates redundant results to improve those metrics. Controlling for Worker Quality, Bias, [...]]]></description>
			<content:encoded><![CDATA[<p><em><a href="http://en.wikipedia.org/wiki/I_know_it_when_I_see_it">&#8220;I know it when I see it.&#8221;</a> &#8212; </em>Justice Potter Stewart</p>
<p>We have been running <a href="http://www.crowdsifter.com">Crowdsifter</a>, our content moderation product backed by Amazon&#8217;s Mechanical Turk, for a while, and we wanted to share some quality metrics and some stats on how our system aggregates redundant results to improve those metrics.</p>
<p><a href="http://blog.doloreslabs.com/wp-content/uploads/2009/07/minerrors11.png" title="minerrors11"><img src="http://blog.doloreslabs.com/wp-content/uploads/2009/07/minerrors11.png" alt="minerrors11" /></a></p>
<p><strong>Controlling for Worker Quality, Bias, and Item Difficulty</strong></p>
<p>In the graph above, we picked the best error rate for raw AMT with 1-11 workers and the best error rate that Crowdsifter provided on a porn judgment task with 2491 images (1006 porn, 1485 non-porn). The error rate is the rate at which wrong decisions are made. A wrong decision occurs whenever we label porn as not porn or non-porn as porn. The above experiment includes images that were labeled as ambiguous, which is why the error rates shown seem so high.</p>
<p>Using Crowdsifter with an average of 3.93 workers per image, we achieve the same minimum achievable error rate as majority voting in raw AMT with 9 workers per image. We do this by controlling worker quality: we keep track of each worker&#8217;s judgments. If we have an &#8220;expert&#8221;-evaluated gold standard of what is pornographic, then we can keep track of which workers are doing a good job or a bad job. On non-gold standard images we weight workers&#8217; judgments based on how well we trust their judgment to reflect our standard of porn. Without these controls, majority voting in raw AMT is vulnerable to the many scammers that lurk there.</p>
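<p>To make the idea of trust-weighted voting concrete, here is a minimal sketch. It is not Crowdsifter&#8217;s actual algorithm: it assumes trust is simply a worker&#8217;s accuracy on gold images, that unknown workers get a neutral weight of 0.5, and that the label with the largest total weight wins. All names and values are hypothetical.</p>
<pre>
```python
from collections import defaultdict


def worker_trust(gold_answers, gold):
    """Trust = fraction of gold images the worker labeled correctly.
    Workers with no gold history get an assumed neutral 0.5."""
    graded = [image for image in gold_answers if image in gold]
    if not graded:
        return 0.5
    return sum(gold_answers[i] == gold[i] for i in graded) / len(graded)


def weighted_label(judgments, trust):
    """judgments is a list of (worker, label) pairs.
    Returns the label whose supporters have the most total trust."""
    totals = defaultdict(float)
    for worker, label in judgments:
        totals[label] += trust.get(worker, 0.5)
    return max(totals, key=totals.get)


# Hypothetical trust scores: one reliable worker, two unreliable ones.
trust = {"a": 0.95, "b": 0.4, "c": 0.4}
judgments = [("a", "porn"), ("b", "not_porn"), ("c", "not_porn")]
label = weighted_label(judgments, trust)
```
</pre>
<p>A plain majority vote would label this image <code>not_porn</code> (two votes to one), but the weighted vote sides with the one worker who has a strong record on gold.</p>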
<p>For images where obscenity is particularly ambiguous, we can allocate more workers. This results in a better sampling of whether an image is obscene.  Some images don&#8217;t need many judges to accurately determine if they are pornographic.  We can determine which images are easily classifiable as porn by sampling a group of workers and checking whether they all agree.  Using too many judges per image can become prohibitively costly. It is important to have this scheme so we can dynamically allocate workers.  Raw AMT is both wasteful and inefficient, applying many judgments to easy items, while not using enough judgments for hard items.</p>
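<p>One way to picture dynamic allocation is the loop below: start with a few judges, stop early if they agree, and otherwise keep adding judges up to a cap. This is an illustrative sketch only. The starting size, step, cap, and 80 percent agreement level are assumptions, and <code>get_judgment</code> is a hypothetical callback standing in for a request to one more worker.</p>
<pre>
```python
from collections import Counter


def collect_labels(get_judgment, initial=3, step=2, max_judges=9, agreement=0.8):
    """Request judgments until the leading label's share reaches `agreement`
    or `max_judges` have been used. Returns (label, judges_used)."""
    labels = [get_judgment() for _ in range(initial)]
    while True:
        top, top_count = Counter(labels).most_common(1)[0]
        enough_agreement = top_count / len(labels) >= agreement
        if enough_agreement or len(labels) >= max_judges:
            return top, len(labels)
        extra = min(step, max_judges - len(labels))
        labels.extend(get_judgment() for _ in range(extra))


# Hypothetical streams of judgments for an easy image and a hard one.
easy_stream = iter(["porn"] * 9)
hard_stream = iter(["porn", "not_porn", "porn", "not_porn", "porn",
                    "porn", "porn", "porn", "porn"])

easy_result = collect_labels(lambda: next(easy_stream))
hard_result = collect_labels(lambda: next(hard_stream))
```
</pre>
<p>The easy image is settled after three unanimous judgments, while the contested image uses all nine.</p>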
<p><strong>Better Measures</strong></p>
<p>The raw error rate includes both images incorrectly labeled as porn, and incorrectly labeled as non-porn. In content moderation we want to minimize our porn miss rate (also known as <a href="http://en.wikipedia.org/wiki/Type_I_and_type_II_errors#False_negative_rate">false negative rate</a>) because we don&#8217;t want to let any porn onto our site.  The graph for the porn miss rate corresponding with the above graph is shown below.</p>
<p><a href="http://blog.doloreslabs.com/wp-content/uploads/2009/07/fnrate11.png" title="fnrate11"><img src="http://blog.doloreslabs.com/wp-content/uploads/2009/07/fnrate11.png" alt="fnrate11" /></a></p>
<p>The most important measure is the porn miss rate, and ours is close to the rates of 9 to 11 workers per image on AMT, even though we use fewer than half that number of workers, which significantly cuts our costs.</p>
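<p>For reference, the three rates discussed in this section can be computed from predicted and true labels as follows. The label names and the ten example images are hypothetical.</p>
<pre>
```python
def rates(predicted, actual, positive="porn"):
    """Return (error_rate, miss_rate, false_alarm_rate) for two parallel
    lists of labels. miss_rate is the false negative rate: the share of
    truly positive items that were not labeled positive."""
    pairs = list(zip(predicted, actual))
    false_neg = sum(1 for p, a in pairs if a == positive and p != positive)
    false_pos = sum(1 for p, a in pairs if a != positive and p == positive)
    positives = sum(1 for _, a in pairs if a == positive)
    negatives = len(pairs) - positives
    return (
        (false_neg + false_pos) / len(pairs),
        false_neg / positives,
        false_pos / negatives,
    )


# Hypothetical example: 4 porn images and 6 clean ones.
actual = ["porn"] * 4 + ["ok"] * 6
predicted = ["porn", "porn", "porn", "ok", "ok", "ok", "ok", "ok", "ok", "porn"]
error_rate, miss_rate, false_alarm_rate = rates(predicted, actual)
```
</pre>
<p>One missed porn image out of four gives a miss rate of 25 percent even though the overall error rate is only 20 percent, which is why the two measures are worth tracking separately.</p>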
<p><strong>Adjusting Thresholds</strong><br />
We can adjust certain thresholds to lower the porn miss rate, but we do this at the risk of labeling all our images as porn, so nothing would make it onto our site. Adjusting the threshold to minimize the porn miss rate, while maintaining an acceptable non-porn miss rate, is a task Crowdsifter can readily handle.</p>
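<p>The trade-off can be seen in a small sweep. In this sketch each image has a hypothetical porn score between 0 and 1 (for instance, a weighted share of porn votes) and is labeled porn when the score is at or above the threshold. The scores and thresholds are made up for illustration.</p>
<pre>
```python
def miss_and_false_alarm(scores, actual, threshold):
    """Label an image porn when its score is at or above `threshold`,
    then return (porn miss rate, non-porn miss rate)."""
    predicted = ["porn" if s >= threshold else "ok" for s in scores]
    pairs = list(zip(predicted, actual))
    porn_total = sum(1 for _, a in pairs if a == "porn")
    ok_total = len(pairs) - porn_total
    missed_porn = sum(1 for p, a in pairs if a == "porn" and p == "ok")
    flagged_ok = sum(1 for p, a in pairs if a == "ok" and p == "porn")
    return missed_porn / porn_total, flagged_ok / ok_total


# Hypothetical scores: the first four images are porn, the last four are not.
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.05]
actual = ["porn"] * 4 + ["ok"] * 4
sweep = {t: miss_and_false_alarm(scores, actual, t) for t in (0.5, 0.15, 0.0)}
```
</pre>
<p>Lowering the threshold from 0.5 to 0.15 removes every missed porn image in this toy set but doubles the share of clean images rejected, and a threshold of 0 rejects everything.</p>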
<p>We&#8217;ll save what we can do with threshold adjustment for a later blog post.<br />
-John</p>
<hr />
<p>Thanks to <a href="http://anyall.org/">Brendan</a> for help with this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2009/07/crowdsifter-more-efficient-pornography-filtering/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>EMNLP Slides</title>
		<link>http://blog.crowdflower.com/2008/11/emnlp-slides/</link>
		<comments>http://blog.crowdflower.com/2008/11/emnlp-slides/#comments</comments>
		<pubDate>Thu, 06 Nov 2008 18:56:08 +0000</pubDate>
		<dc:creator>Lukas Biewald</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2008/11/emnlp-slides/</guid>
		<description><![CDATA[Rion Snow presented the paper, &#8220;Cheap and Fast &#8211; But is it Good?&#8221; at EMNLP last week. Here are the slides from the talk: Rls For Emnlp 2008 View SlideShare presentation or Upload your own.]]></description>
			<content:encoded><![CDATA[<p>Rion Snow presented the paper, <a href="http://blog.doloreslabs.com/2008/09/amt-fast-cheap-good-machine-learning/">&#8220;Cheap and Fast &#8211; But is it Good?&#8221;</a> at EMNLP last week.  </p>
<p>Here are the slides from the talk:</p>
<div style="width:425px;text-align:left" id="__ss_727609"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/guest60b48a/rls-for-emnlp-2008-presentation?type=powerpoint" title="Rls For Emnlp 2008">Rls For Emnlp 2008</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=rlsforemnlp2008-1225995105983756-8&#038;stripped_title=rls-for-emnlp-2008-presentation" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=rlsforemnlp2008-1225995105983756-8&#038;stripped_title=rls-for-emnlp-2008-presentation" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View SlideShare <a style="text-decoration:underline;" href="http://www.slideshare.net/guest60b48a/rls-for-emnlp-2008-presentation?type=powerpoint" title="View Rls For Emnlp 2008 on SlideShare">presentation</a> or <a style="text-decoration:underline;" href="http://www.slideshare.net/upload?type=powerpoint">Upload</a> your own.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2008/11/emnlp-slides/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>AMT is fast, cheap, and good for machine learning data</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/</link>
		<comments>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comments</comments>
		<pubDate>Tue, 09 Sep 2008 00:10:14 +0000</pubDate>
		<dc:creator>Brendan O'Connor</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2008/09/amt-fast-cheap-good-machine-learning/</guid>
		<description><![CDATA[Update 9/19: Final PDF version has been uploaded. See also the comments below for updates &#8212; our released data is already being used by others! We recently teamed up with Rion Snow, Prof. Dan Jurafsky, and Prof. Andrew Ng from the Stanford AI Lab to try using Amazon Mechanical Turk to generate data sets for [...]]]></description>
			<content:encoded><![CDATA[<p><b>Update 9/19:</b> <a href="http://blog.doloreslabs.com/wp-content/uploads/2008/09/amt_emnlp08_final.pdf">Final PDF version</a> has been uploaded.  See also the comments below for updates &#8212; our <a href="http://ai.stanford.edu/~rion/annotations/">released data</a> is already being used by others!</p>
<hr />
<p>We recently teamed up with <a href="http://ai.stanford.edu/~rion/">Rion Snow</a>, <a href="http://www.stanford.edu/~jurafsky/">Prof. Dan Jurafsky</a>, and <a href="http://ai.stanford.edu/~ang/">Prof. Andrew Ng</a>  from the <a href="http://sail.stanford.edu">Stanford AI Lab</a> to try using Amazon Mechanical Turk to generate data sets for Machine Learning research.  Many AI tasks require a large amount of training data, and to build natural language systems, researchers traditionally pay linguistic experts for millions of annotations.  Search engine companies employ hundreds or thousands of annotators for their classification, ranking, and other statistically trained systems, but their data is private and is not available for research.  AMT is a potential tool to create high quality data sets accessible to everyone.</p>
<p>We rigorously tested the quality of AMT responses for several classic human language problems, and found that the quality was as good as or better than the expert data that most researchers use.  We wrote a paper, <a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/amt_emnlp08_final.pdf' title='amt_emnlp08_accepted.pdf'>&#8220;Cheap and Fast &#8212; But is it Good?  Evaluating Non-Expert Annotations for Natural Language Tasks,&#8221;</a> that will be presented at an upcoming conference, <a href="http://conferences.inf.ed.ac.uk/emnlp08/">EMNLP-2008</a>.</p>
<p><b>Our findings:</b></p>
<p><b>1. Turker-generated data is good.</b>  AMT makes it easy to ask many people for judgments, so for several tasks, we looked at accuracy rates for how well the averaged Turker judgments correlate with the expert gold standard.  With more judgments per example, accuracy increases.  For comparison, on each graph the horizontal dotted line indicates the rate at which a single expert agrees with the gold standard.  Enough non-experts can match or often beat experts&#8217; reliability.</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/k-acc3.png' title='k-acc3.png'><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/09/k-acc3.png' alt='k-acc3.png' /></a></p>
<p><b>2. Turker-generated data is cheap and fast.</b>  We can collect thousands of labels per dollar and per hour.  </p>
<p><span id="more-109"></span> </p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/costs.png' title='costs.png'><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/09/costs.png' alt='costs.png' /></a></p>
<p><b>3. Expert data enhances individual Turker data.</b>  First off, individual workers have differing accuracy rates:</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/worker-acc.png' title='worker-acc.png'><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/09/worker-acc.png' alt='worker-acc.png' /></a></p>
<p>So we implemented a statistical technique where we test their accuracy on a portion of the experts&#8217; gold standard data, then reweight votes by worker reliability.  This yields higher aggregated accuracy.  (Also see our related <a href="http://blog.doloreslabs.com/2008/06/aggregate-turker-judgments-threshold-calibration/">threshold calibration post</a>.)</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/goldcalib.png' title='goldcalib.png'><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/09/goldcalib.png' alt='goldcalib.png' /></a></p>
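<p>Here is a minimal sketch of the reweighting idea, assuming binary YES/NO labels and log-odds weights derived from each worker&#8217;s smoothed accuracy on gold items.  The weighting scheme, function names, and data are illustrative assumptions; the model in the paper may differ in its details.</p>

```python
import math

def worker_accuracy(gold, answers, smoothing=1.0):
    """Smoothed fraction of gold items that one worker labeled correctly."""
    graded = [item for item in answers if item in gold]
    correct = sum(1 for item in graded if answers[item] == gold[item])
    # Smoothing keeps the log-odds finite for workers with a perfect score.
    return (correct + smoothing) / (len(graded) + 2 * smoothing)

def weighted_vote(votes, accuracy):
    """votes: {worker: "YES" or "NO"}; accuracy: {worker: float in (0, 1)}."""
    score = 0.0
    for worker, label in votes.items():
        acc = accuracy[worker]
        weight = math.log(acc / (1.0 - acc))  # log-odds of being right
        score += weight if label == "YES" else -weight
    return "YES" if score > 0 else "NO"

# Hypothetical gold labels and worker answers on those gold items.
gold = {"a": "YES", "b": "NO", "c": "YES", "d": "NO"}
worker_answers = {
    "w1": {"a": "YES", "b": "NO", "c": "YES", "d": "NO"},   # 4 of 4 correct
    "w2": {"a": "NO", "b": "NO", "c": "YES", "d": "YES"},   # 2 of 4 correct
    "w3": {"a": "NO", "b": "YES", "c": "YES", "d": "NO"},   # 2 of 4 correct
}
accuracy = {w: worker_accuracy(gold, ans) for w, ans in worker_answers.items()}

# On a new item, the two coin-flip workers outvote the reliable one by
# simple majority, but the weighted vote follows the reliable worker.
new_votes = {"w1": "NO", "w2": "YES", "w3": "YES"}
print(weighted_vote(new_votes, accuracy))  # NO
```

<p>Workers at chance level get a weight of zero, so their votes drop out entirely; that is a property of this particular log-odds choice, not necessarily of every reweighting scheme.</p>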
<p><b>4. Turker data enhances NLP systems.</b>  For one of the tasks, predicting the emotions elicited by a newspaper headline, we wrote a simple machine-learned classifier and trained it on the Turker data.  It easily outperforms one trained on expert data.  (There&#8217;s a subtle effect here; see the paper for details.)</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/classifier-perf.png' title='classifier-perf.png'><img  class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/09/classifier-perf.png' alt='classifier-perf.png' /></a></p>
<p>We&#8217;ll update this blog post with a link to the final version of the paper in the coming weeks.  Many thanks to our friend <a href="http://ai.stanford.edu/~rion/">Rion</a>, who spearheaded this collaboration.  The current version of the paper is here:</p>
<ul>
<li><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/09/amt_emnlp08_final.pdf'>Rion Snow, Brendan O&#8217;Connor, Daniel Jurafsky, Andrew Y. Ng.  &#8220;Cheap and Fast &#8212; But is it Good?  Evaluating Non-Expert Annotations for Natural Language Tasks.&#8221;  EMNLP-2008.</a></li></ul>
<p>[ This article is part of a series, <a href="/topics/wisdom/">Wisdom of Small Crowds</a>, on crowdsourcing methodology. ]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Wisdom of small crowds, part 3: another worker visualization</title>
		<link>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-3-another-worker-visualization/</link>
		<comments>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-3-another-worker-visualization/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 23:55:32 +0000</pubDate>
		<dc:creator>Brendan O'Connor</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2008/08/wisdom-of-small-crowds-part-3-another-worker-visualization/</guid>
		<description><![CDATA[This is a follow-up to the previous post on individual workloads and rates. Here are the submission times and durations for every worker on the same graph. Each worker is one horizontal line. An assignment is started at a dot, and its duration is for the line segment extending to the right. The particular data [...]]]></description>
			<content:encoded><![CDATA[<p>This is a follow-up to the previous post on <a href="/?p=73">individual workloads and rates</a>.  Here are the submission times and durations for every worker on the same graph.  Each worker is one horizontal line.  An assignment is started at a dot, and its duration is for the line segment extending to the right.</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/08/submission-durations-wide1.png' title='submission-durations-wide1.png'><img src='http://blog.doloreslabs.com/wp-content/uploads/2008/08/submission-durations-wide1.png' alt='submission-durations-wide1.png' /></a></p>
<p>The particular data set isn&#8217;t the same as in the previous post, but was for a similar task and exhibits a similar structure.  Worker rates differ substantially.  Some workers do a few HITs, but others work on as many as are available.  Some work rapidly with breaks (19, 36).  Some assignment durations are as long as 5-10 minutes (13, 37).  Some work very intermittently (29).</p>
<p>This view makes the parallelism of AMT apparent.  At any vertical timeslice you can see how many workers are active at that time.  The entire job ends on the right side when the available HITs run out.</p>
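<p>The &#8220;vertical timeslice&#8221; reading can also be computed directly.  This is a rough sketch, assuming each assignment is recorded as a (worker, start time, duration) tuple in seconds; the record layout and the sample values are hypothetical, not AMT&#8217;s export format.</p>

```python
def active_workers(assignments, t):
    """Count distinct workers with an assignment in progress at time t."""
    return len({
        worker
        for worker, start, duration in assignments
        if t >= start and start + duration > t
    })

# Hypothetical (worker, start_seconds, duration_seconds) records.
assignments = [
    ("w1", 0, 30),
    ("w1", 40, 20),
    ("w2", 10, 100),
    ("w3", 55, 5),
]

print(active_workers(assignments, 15))  # 2
print(active_workers(assignments, 35))  # 1
print(active_workers(assignments, 57))  # 3
```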
<p>[ This article is part of a series, <a href="/topics/wisdom">Wisdom of Small Crowds</a>, on crowdsourcing methodology. ]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-3-another-worker-visualization/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Wisdom of small crowds, part 2: individual workloads and rates</title>
		<link>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-2-individual-workloads-and-rates/</link>
		<comments>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-2-individual-workloads-and-rates/#comments</comments>
		<pubDate>Tue, 05 Aug 2008 18:00:10 +0000</pubDate>
		<dc:creator>Brendan O'Connor</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2008/08/wisdom-of-small-crowds-part-2-individual-workloads-and-rates/</guid>
		<description><![CDATA[[ Update: see also another visualization of this. ] AMT&#8217;s great new interface makes it easy to download completion times for individual worker assignments. Therefore, it&#8217;s easy to visualize :) For a recent small job we did (250 HIT&#8217;s, 5 workers per HIT), here&#8217;s a graph of completion times per worker, over the entire 15 [...]]]></description>
			<content:encoded><![CDATA[<p>[ Update: see also <a href="http://blog.doloreslabs.com/2008/08/wisdom-of-small-crowds-part-3-another-worker-visualization/">another visualization of this</a>. ]</p>
<p>AMT&#8217;s great <a href="http://venturebeat.com/2008/07/30/outsourcing-gets-easier-with-new-features-on-amazons-mechanical-turk/">new interface</a> makes it easy to download completion times for individual worker assignments.  Therefore, it&#8217;s easy to visualize :)  For a recent small job we did (250 HITs, 5 workers per HIT), here&#8217;s a graph of completion times per worker, over the entire 15-minute duration of the job.  Each assignment is a single point, graphed by when it was done versus how long it took.</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/08/completion-times.png' title='completion-times.png'><img src='http://blog.doloreslabs.com/wp-content/uploads/2008/08/completion-times.png' alt='completion-times.png' /></a></p>
<p><span id="more-73"></span></p>
<p>Most workers come in, do a string of HITs, then leave.  Some do all of the HITs available.  There seem to be two distinct work modes.  Most people do lots of HITs in rapid succession.  But several of them work slowly (e.g. workers 8, 13, and 18, with more horizontal space between points), either spending more time on each assignment, or perhaps leaving then coming back.</p>
<p>This graph also illustrates a common trend we see: lots of the work gets done by &#8220;tail&#8221; workers; that is, people who do only a small amount of work.  This is where crowdsourcing really shines &#8212; it&#8217;s OK if individuals give you a small number of judgments, because you can aggregate across many of them.  The total &#8220;prolificness&#8221; of each worker was lightly skewed on this task &#8212; 50% of the work was done by 8 out of 37 workers.  Usually, we see a split more like 50% of the work being done by the top 10% of workers; this one had a more even distribution probably because it was small, so enthusiastic workers didn&#8217;t have an opportunity to do a very large number of HITs.</p>
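<p>A statistic like &#8220;50% of the work was done by 8 out of 37 workers&#8221; can be computed from per-worker assignment counts.  Here is a small sketch; the function name and the counts are made up for illustration and are not the data from this job.</p>

```python
def workers_for_share(counts, share=0.5):
    """Smallest number of most-prolific workers covering `share` of all work."""
    total = sum(counts)
    running = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return n
    return len(counts)

# Hypothetical number of assignments completed by each of 8 workers.
counts = [40, 25, 15, 10, 5, 3, 1, 1]

print(workers_for_share(counts))        # 2
print(workers_for_share(counts, 0.75))  # 3
```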
<p>Another phenomenon: some workers have a downward trend in work time.  This could be learning to do the task faster, or it could be increased carelessness.  A quality analysis (along the lines of <a href="/?p=61">part 1</a>) can flesh this out.</p>
<p>The task was a fairly subjective image classification problem where positives are rare; purple points are &#8220;YES&#8221; responses.  Responding &#8220;YES&#8221; takes more time (presumably, more cognitive load) &#8212; average work times for YES vs NO responses are 22 vs 12 seconds, a significant difference (t-test, p&lt;.001).</p>
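<p>For readers who want to run this kind of comparison themselves, here is a sketch of a two-sample t statistic.  It assumes Welch&#8217;s unequal-variance form, which may not be the exact variant used above, and the work times are invented so that the means come out to 22 and 12 seconds.</p>

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical work times in seconds, not the data from this job.
yes_times = [20, 25, 18, 27, 22, 20]
no_times = [10, 14, 12, 11, 13, 12]

t, df = welch_t(yes_times, no_times)
print(mean(yes_times), mean(no_times))  # 22 12
print(round(t, 2), round(df, 1))
```

<p>Turning the statistic into a p-value requires the t distribution&#8217;s tail probability, which a statistics package can provide.</p>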
<p>-<a href="http://socialscienceplusplus.blogspot.com/">Brendan</a></p>
<p>p.s. The graph is due to R&#8217;s awesome <a href="http://www.statmethods.net/advgraphs/trellis.html">lattice package</a>.  It&#8217;s incredibly easy to use: not much more than <code>xyplot(WorkTimeInSeconds ~ SubmitTime | WorkerId)</code>.</p>
<p>[ This article is part of a series, <a href="/topics/wisdom/">Wisdom of Small Crowds</a>, which focuses on crowdsourcing methodology for Amazon <a href="http://www.mturk.com/">Mechanical Turk</a>-like systems. ]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2008/08/wisdom-of-small-crowds-part-2-individual-workloads-and-rates/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick)</title>
		<link>http://blog.crowdflower.com/2008/06/aggregate-turker-judgments-threshold-calibration/</link>
		<comments>http://blog.crowdflower.com/2008/06/aggregate-turker-judgments-threshold-calibration/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 00:13:03 +0000</pubDate>
		<dc:creator>Brendan O'Connor</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Wisdom of Small Crowds]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2008/06/turkers-as-an-ensemble-classifier-part-1-threshold-calibration/</guid>
		<description><![CDATA[[ This article is part of a series, Wisdom of Small Crowds, which focuses on crowdsourcing methodology for Amazon Mechanical Turk-like systems. ] We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker [...]]]></description>
			<content:encoded><![CDATA[<p>[ This article is part of a series, <a href="/topics/wisdom/">Wisdom of Small Crowds</a>, which focuses on crowdsourcing methodology for Amazon <a href="http://www.mturk.com/">Mechanical Turk</a>-like systems. ]</p>
<p>We use Turkers to <a href="http://doloreslabs.com/services.html">classify all sorts of data</a>, by having several workers render judgments on each item.  But what should we do when they disagree?  Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no &#8220;right&#8221; answer.  Once we have a bunch of Turker judgments, we need to aggregate them &#8212; that is, use some sort of voting mechanism &#8212; to give as accurate a classification as possible.  It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.</p>
<p>Here&#8217;s an example.  A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were &#8220;about the same topic&#8221; or &#8220;about different topics&#8221;.  This is just a binary classification problem; call these labels &#8220;YES&#8221; and &#8220;NO&#8221;.  To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set.  That is, for 135 examples, their experts did the task themselves and provided &#8220;gold&#8221; ground truth labels.</p>
<p>We used a very high number of workers per example (about 20).  For all 135 examples in the gold standard, the following graph plots them vertically by their &#8220;Turker confidence in YES&#8221; &#8212; that&#8217;s just the percentage of votes for &#8220;YES&#8221; among the 20 or so judgments for that particular example.  I&#8217;ve also colored each example with the experts&#8217; gold label.  You can see that this simple Turker data provides some statistical separation between the classes.</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/06/vertthresh.png'><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/06/vertthresh.png' alt='Test set separation by Turker ensemble binary classifier' /></a></p>
<p>This graph also shows how to create a classifier from Turker votes.  We have to choose a confidence threshold for our classifier&#8217;s decision: above the threshold, say &#8220;YES&#8221;, and below say &#8220;NO&#8221;.  Unfortunately, Turkers aren&#8217;t perfect at modeling the experts: anywhere we place the threshold, errors occur.  However, some thresholds are better than others.  The threshold with the best accuracy is at 73% confidence &#8212; that is, a 73% super-majority voting rule &#8212; and it classifies instances correctly 90% of the time.  Furthermore, we can tune for different types of errors.  If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many &#8220;YES&#8221; instances as possible, we can set a lower, more liberal threshold.</p>
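<p>The calibration step can be sketched in a few lines.  This assumes a YES prediction whenever the confidence is at or above the threshold, and it tries each observed confidence value as a candidate cutoff; the ten examples are invented for illustration and are not the client&#8217;s data.</p>

```python
def confidence_yes(votes):
    """Fraction of "YES" votes among the judgments for one example."""
    return sum(1 for v in votes if v == "YES") / len(votes)

def accuracy_at(threshold, confidences, gold):
    """Accuracy when predicting YES iff confidence >= threshold."""
    correct = sum(
        1 for c, g in zip(confidences, gold)
        if ("YES" if c >= threshold else "NO") == g
    )
    return correct / len(gold)

def best_threshold(confidences, gold):
    """Try each observed confidence as a cutoff and keep the most accurate."""
    candidates = sorted(set(confidences))
    return max(candidates, key=lambda t: accuracy_at(t, confidences, gold))

# Hypothetical per-example Turker confidences and expert gold labels.
confidences = [0.15, 0.30, 0.40, 0.50, 0.55, 0.65, 0.75, 0.80, 0.90, 1.00]
gold = ["NO", "NO", "NO", "YES", "NO", "NO", "YES", "YES", "YES", "YES"]

print(confidence_yes(["YES", "YES", "YES", "NO"]))   # 0.75
print(accuracy_at(0.5, confidences, gold))            # 0.8
best = best_threshold(confidences, gold)
print(best, accuracy_at(best, confidences, gold))     # 0.75 0.9
```

<p>To favor fewer false positives or fewer missed positives instead of raw accuracy, swap the scoring function passed as <code>key</code> for one that penalizes the costlier error more heavily.</p>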
<p>Here&#8217;s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives.  For a particular decision threshold, it shows how it divides up the instances into the confusion matrix&#8217;s 4 categories of correct and incorrect decisions.</p>
<p><img class='centered' src='http://blog.doloreslabs.com/wp-content/uploads/2008/06/confusionbars.png' alt='Classifier performance on gold standard at different thresholds' /></p>
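<p>The four categories in a chart like this can be tallied for any threshold.  A minimal sketch, again assuming YES is predicted when confidence is at or above the threshold, with made-up data:</p>

```python
from collections import Counter

def confusion_counts(threshold, confidences, gold):
    """Tally TP, FP, TN, FN for a YES-if-confidence>=threshold rule."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for c, g in zip(confidences, gold):
        predicted_yes = c >= threshold
        actual_yes = g == "YES"
        if predicted_yes and actual_yes:
            counts["TP"] += 1
        elif predicted_yes:
            counts["FP"] += 1
        elif actual_yes:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical per-example Turker confidences and expert gold labels.
confidences = [0.15, 0.30, 0.40, 0.50, 0.55, 0.65, 0.75, 0.80, 0.90, 1.00]
gold = ["NO", "NO", "NO", "YES", "NO", "NO", "YES", "YES", "YES", "YES"]

for threshold in (0.5, 0.75):
    print(threshold, dict(confusion_counts(threshold, confidences, gold)))
```

<p>On this toy data, raising the threshold from 0.5 to 0.75 removes both false positives at the cost of one false negative.</p>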
<p>A final note on why threshold calibration is important: For this task, the Turkers were considerably more liberal than the experts at deciding what a &#8220;YES&#8221; example was &#8212; experts marked only 36% of examples as &#8220;YES&#8221;, whereas a simple Turker majority voting rule marks 57% that way.  This is because the experts understood the full implications of the decision, which were substantial &#8212; various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge.  False positives had a very high cost.  The prompt for Turkers, by contrast, was fairly vague.  (In our experience, we generally find that good task design is a huge factor in getting better Turker accuracy.)  However, since Turker decisions noisily correlate with the experts, moving the decision threshold can help accuracy.  Here&#8217;s the threshold vs. accuracy graph:</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/06/thresh-acc.png' title='thresh-acc.png'><img src='http://blog.doloreslabs.com/wp-content/uploads/2008/06/thresh-acc.png' alt='thresh-acc.png' class='centered' /></a></p>
<p>Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold.  This blog post only scratches the surface; there are a few more useful things to consider.  Stay tuned for Part 2 and hopefully many more!</p>
<p>A few more notes on Turker voting and threshold calibration:</p>
<p><span id="more-61"></span></p>
<p>An interesting question is the upper bound of possible performance on the task.  A good experiment to try is to have two experts independently perform a task, and check their agreement rate.  We should be satisfied if Turkers can match experts as reliably as experts match each other.  That is, for this task, if experts agree no more than 90% of the time, then Turkers perform the task as well as experts.  (We didn&#8217;t have this particular experiment done in this case, but I&#8217;d be very curious to see the results!)  In general, agreement rates can help indicate the difficulty of a task.  If expert agreement rates are low, it can be argued that the task is not very &#8220;real&#8221;.</p>
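<p>The agreement check described here is simple to compute once two experts have labeled the same items.  A sketch with invented labels (raw agreement only, with no correction for chance agreement):</p>

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)

# Hypothetical labels from two experts on the same five items.
expert_1 = ["YES", "NO", "NO", "YES", "NO"]
expert_2 = ["YES", "NO", "YES", "YES", "NO"]

print(agreement_rate(expert_1, expert_2))  # 0.8
```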
<p>The terms &#8220;true/false positives&#8221;, &#8220;true/false negatives&#8221;, &#8220;precision&#8221; and &#8220;recall&#8221; are all part of a statistics/machine learning mini-field of binary classifier evaluation.  Any statistical classifier that outputs a confidence value or ranking among instances (Naive Bayes, logistic regression, IR ranking, etc.) can be subjected to this sort of threshold analysis.  A decent place to read more is the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC</a> Wikipedia page.  ROC and Precision-Recall curves have long been used to show thresholding tradeoffs.  I think the above plots make it easier to interpret the basic information, but the more traditional graphs are also useful.  Here they are for this data (provided courtesy of the excellent <a href="http://rocr.bioinf.mpi-sb.mpg.de/">ROCR</a> package):</p>
<p><a href='http://blog.doloreslabs.com/wp-content/uploads/2008/06/roc-pr.png' title='ROC and Precision-Recall curves'><img src='http://blog.doloreslabs.com/wp-content/uploads/2008/06/roc-pr.png' alt='ROC and Precision-Recall curves' class='centered' /></a></p>
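<p>Each point on curves like these comes from one threshold.  Here is a rough Python sketch (not the ROCR code used for the plots) that computes precision, recall, and false positive rate at a given cutoff; the data and the convention of reporting precision as 1.0 when nothing is predicted YES are assumptions.</p>

```python
def pr_roc_point(threshold, confidences, gold):
    """Return (precision, recall, false positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for c, g in zip(confidences, gold):
        predicted_yes = c >= threshold
        actual_yes = g == "YES"
        if predicted_yes and actual_yes:
            tp += 1
        elif predicted_yes:
            fp += 1
        elif actual_yes:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when no YES predicted
    recall = tp / (tp + fn) if tp + fn else 0.0     # also the ROC true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

# Hypothetical per-example Turker confidences and expert gold labels.
confidences = [0.15, 0.30, 0.40, 0.50, 0.55, 0.65, 0.75, 0.80, 0.90, 1.00]
gold = ["NO", "NO", "NO", "YES", "NO", "NO", "YES", "YES", "YES", "YES"]

# Sweeping the threshold traces out the curves: (recall, precision) pairs
# give the Precision-Recall curve, (fpr, recall) pairs give the ROC curve.
for threshold in sorted(set(confidences)):
    print(threshold, pr_roc_point(threshold, confidences, gold))
```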
<p>A nice overview of these topics can be found in <a href="http://nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf">these lecture notes</a> from <a href="http://www.nr.com/whp/">William Press</a>.</p>
<p>-<a href="http://socialscienceplusplus.blogspot.com">Brendan</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2008/06/aggregate-turker-judgments-threshold-calibration/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

