<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The CrowdFlower Blog &#187; experiment</title>
	<atom:link href="http://blog.crowdflower.com/tag/experiment/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.crowdflower.com</link>
	<description></description>
	<lastBuildDate>Tue, 10 Jan 2012 20:00:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Confidence Bias: Evidence from Crowdsourcing</title>
		<link>http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/</link>
		<comments>http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/#comments</comments>
		<pubDate>Wed, 07 Sep 2011 23:02:04 +0000</pubDate>
		<dc:creator>Patrick Philips</dc:creator>
				<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[bias]]></category>
		<category><![CDATA[confidence]]></category>
		<category><![CDATA[crowdflower]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[experiment]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=3080</guid>
		<description><![CDATA[Evidence in experimental psychology suggests that most people overestimate their own ability to complete objective tasks accurately. This phenomenon, often called confidence bias, refers to &#8220;a systematic error of judgment made by individuals when they assess the correctness of their responses to questions related to intellectual or perceptual problems.&#8221; 1 But does this hold up in crowdsourcing? We ran an experiment to [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_3593" class="wp-caption alignleft" style="width: 164px"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/overconfidence/" rel="attachment wp-att-3593"><img class="size-full wp-image-3593    " title="psychologytoday.com" src="http://blog.crowdflower.com/wp-content/uploads/2011/09/Overconfidence.gif" alt="crowdsourcing" width="154" height="132" /></a><p class="wp-caption-text">psychologytoday.com</p></div>
<p>Evidence in experimental psychology suggests that most people overestimate their own ability to complete objective tasks accurately. This phenomenon, often called <em>confidence bias</em>, refers to &#8220;a systematic error of judgment made by individuals when they assess the correctness of their responses to questions related to intellectual or perceptual problems.&#8221;<sup><a href="#footnote-1">1</a></sup> But does this hold up in crowdsourcing?</p>
<p>We ran an experiment to test for a persistent difference between people&#8217;s perceptions of their own accuracy and their actual objective accuracy. We used a set of standardized questions, focusing on the Verbal and Math sections of a common standardized test. For the 829 individuals who answered more than 10 of these questions, we asked for the correct answer as well as an indication of how confident they were of the answer they supplied.</p>
<p><span id="more-3080"></span>We didn&#8217;t use any Gold in this experiment. Instead, we incentivized performance by rewarding those finishing in the top 10%, based on objective accuracy.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/sample_problem/" rel="attachment wp-att-3427"><img class="aligncenter size-full wp-image-3427" title="sample_problem" src="http://blog.crowdflower.com/wp-content/uploads/2011/08/sample_problem.jpg" alt="crowdsourcing" width="713" height="520" /></a></p>
<h2>Does Bias Exist?</h2>
<p>To estimate confidence bias, we subtracted each individual&#8217;s accuracy (the share of questions answered correctly) from his or her average stated confidence. If the difference is positive, the individual overestimated how well they did. <strong>Amazingly, over 75% of contributors overestimated their ability to answer multiple choice questions correctly.</strong></p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/histogram_res/" rel="attachment wp-att-3278"><img class="aligncenter size-full wp-image-3278" title="histogram_res" src="http://blog.crowdflower.com/wp-content/uploads/2011/07/histogram_res.jpg" alt="crowdsourcing" width="599" height="341" /></a></p>
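<p>As a rough sketch of the calculation described above, the bias score for one contributor can be computed as follows. The example is in Python, the function name <code>bias_score</code> and the four sample answers are hypothetical, and none of it is taken from the original analysis.</p>

```python
from statistics import mean


def bias_score(confidences, correct):
    """Average stated confidence minus observed accuracy, both on a 0-1 scale."""
    if len(confidences) != len(correct):
        raise ValueError("need one confidence rating per answer")
    accuracy = sum(correct) / len(correct)
    return mean(confidences) - accuracy


# Hypothetical contributor: four answers, two of them correct.
confidences = [0.9, 0.6, 0.7, 0.8]
correct = [True, False, True, False]

score = bias_score(confidences, correct)
print(f"bias score: {score:+.2f}")
```

<p>A positive score means the contributor was more confident than accurate; a negative score means the opposite.</p>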
<h2>Are Individuals Consistently Biased?</h2>
<p>Because our dataset consisted of Math and Verbal questions, we looked at each individual contributor&#8217;s confidence bias for both types of questions. In aggregate, people tended to have more trouble with the Verbal questions (average accuracy of 28%, compared to 41% for Math), though the average confidence score was nearly identical (63% +/-1).</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/scatterplot/" rel="attachment wp-att-3279"><img class="aligncenter size-full wp-image-3279" title="scatterplot" src="http://blog.crowdflower.com/wp-content/uploads/2011/07/scatterplot.jpg" alt="crowdsourcing" width="639" height="380" /></a></p>
<p>The vast majority of contributors fall into the &#8220;overconfident on both&#8221; quadrant (top right), while only a handful of contributors were overconfident for one question type and underconfident for the other (top left and bottom right quadrants). Overall, there is certainly a correlation between bias scores on the two problem types, suggesting that many individuals are consistently biased on different types of problems. However, this explains only a portion of the variation.</p>
<h2>Does Bias Vary Across Groups?</h2>
<p>Given that overconfidence seems to be a consistent trait, we were curious how this trait varies across the different groups making up our contributor pool. We sliced and diced our contributors into a number of different sub-groups, which are summarized below.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/summary-table/" rel="attachment wp-att-3280"><img class="aligncenter size-full wp-image-3280" title="summary table" src="http://blog.crowdflower.com/wp-content/uploads/2011/07/summary-table.jpg" alt="crowdsourcing" width="703" height="460" /></a></p>
<p>There are a lot of interesting things going on here. To highlight a few, accuracy increases consistently as the contributor&#8217;s education level advances from High School to College, but so does confidence, leaving the bias score nearly unchanged. There&#8217;s a similar pattern with Age, with older contributors tending to be both more accurate and more confident.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/splits/" rel="attachment wp-att-3408"><img class="aligncenter size-full wp-image-3408" title="splits" src="http://blog.crowdflower.com/wp-content/uploads/2011/08/splits.jpg" alt="crowdsourcing splits" width="768" height="252" /></a></p>
<p>Gender and Location also have an effect on confidence bias. Taking the two countries that supplied the most people, contributors from the US were much more accurate and slightly more confident than the average, while those from India were average in terms of accuracy but much more confident. As such, the bias score for contributors from India is nearly double that of contributors from the US. With respect to gender, confidence didn&#8217;t vary much, but women were more accurate and thus less biased than men.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/splits2/" rel="attachment wp-att-3405"><img class="aligncenter size-full wp-image-3405" title="splits2" src="http://blog.crowdflower.com/wp-content/uploads/2011/08/splits2.jpg" alt="crowdsourcing" width="773" height="251" /></a></p>
<h2>Further Research</h2>
<p>Because this was an experiment, we decided against using Gold in order to minimize any selection bias among contributors. However, this makes it difficult to apply these results to enterprise crowdsourcing, at least as practiced by CrowdFlower. In the future, it would be interesting to look at confidence bias among trusted workers only, and particularly among trusted workers with repeated experience in specific job types. We would expect these workers to have a better sense of whether their answers are correct, though it is possible (and perhaps likely) that confidence would increase along with accuracy.</p>
<hr />
<p id="footnote-1" style="text-align: -webkit-auto;">1. Pallier, G., Wilkinson, R., Danthiir, V., Kleitman, S., Knezevic, G., Stankov, L., &amp; Roberts, R. D. (2002). The role of individual differences in the accuracy of confidence judgments. Journal of General Psychology, 129, 257–299.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2011/09/confidence-bias-evidence-from-crowdsourcing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ask a Stupid Question</title>
		<link>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/</link>
		<comments>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/#comments</comments>
		<pubDate>Wed, 16 Dec 2009 00:01:23 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[garbage-in garbage-out]]></category>
		<category><![CDATA[social science]]></category>
		<category><![CDATA[surveys]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2009/12/ask-a-stupid-question/</guid>
		<description><![CDATA[What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design since it&#8217;s a relevant, but often overlooked, area of Crowdsourcing work. You can ask “the crowd” all kinds of questions, but [...]]]></description>
			<content:encoded><![CDATA[
<p>What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design, since it&#8217;s a relevant but often overlooked area of crowdsourcing work.</p>
<p>You can ask “the crowd” all kinds of questions, but if you don&#8217;t stop to think about the best way to ask your question, you&#8217;re likely to get unexpected and unreliable results. You might call it the <a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out">GIGO</a> theory of research design.</p>
<p>To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in CrowdFlower&#8217;s labor pools. For the experiments, every worker saw only one version of the questions, and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question, and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.</p>
<p><strong>An Example: Response Scales</strong></p>
<p>The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people&#8217;s answers. Here are two versions of the same question that I posted to CrowdFlower:</p>
<blockquote><p><small>Low Scale Version:</small></p>
<p><code>About how many hours do you spend online per day?</code></p>
<blockquote><p><code>(a) 0 – 1 hour<br />
(b) 1 – 2 hours<br />
(c) 2 – 3 hours<br />
(d) More than 3 hours</code></p></blockquote>
<p><small>High Scale Version:</small></p>
<p><code>About how many hours do you spend online per day?</code></p>
<blockquote><p><code>(a) 0 – 3 hours<br />
(b) 3 – 6 hours<br />
(c) 6 – 9 hours<br />
(d) More than 9 hours</code></p></blockquote>
</blockquote>
<p>Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.</p>
<p>So what did people say? Here&#8217;s a pair of histograms breaking the responses up by the two versions of the question:</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online-raw_bins.png" alt="boring histograms: hours online by scale" /></p>
<p>I didn&#8217;t label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the CrowdFlower worker pool tend to spend more than three hours per day online (whoa, no way&#8230;).</p>
<p>At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).</p>
<p>To look at that comparison more closely, let&#8217;s break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours, and (2) the percentage of responses that were more than 3 hours.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/dotplots-2bins.png" alt="hours online in two bins" /></p>
<p>The difference between the height of the orange points (high scale) is much bigger than the corresponding difference between the height of the blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you&#8217;re a stats nerd, the <a href="http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html">Chi-square test</a> showed that this variation was significant with a p-value &lt; 0.001, so the difference was almost certainly not due to chance.</p>
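<p>For anyone curious how a test like this works on a two-by-two table (scale version by answer category), here is a minimal sketch in Python using only the standard library. The counts are made up for illustration and are not the survey&#8217;s data, the function name <code>chi_square_2x2</code> is hypothetical, and the calculation is the plain Pearson statistic without a continuity correction, which may differ from what the original analysis used.</p>

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    No continuity correction is applied. With one degree of freedom the
    p-value equals erfc(sqrt(chi2 / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if scale and answer were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical counts, NOT the survey's data.
# Rows: low scale, high scale. Columns: up to 3 hours, more than 3 hours.
counts = [[60, 40],
          [20, 80]]

chi2, p_value = chi_square_2x2(counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

<p>The larger the statistic, the harder it is to attribute the gap between the two rows to chance alone.</p>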
<p>But maybe collapsing the responses like this is a little too coarse and you&#8217;d still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects – a comparison of the cumulative percentage of responses – and the differences are even clearer.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/cumulative_dotplots.png" alt="hours online - cumulative bins" /></p>
<p>That gap between the blue and the orange line at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!</p>
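<p>The cumulative comparison can be sketched in a few lines of Python. The response counts below are invented for illustration (they are not the survey results), and <code>cumulative_percentages</code> is a hypothetical helper name. The point is only to show how the two scales line up at the one cutoff they share, 3 hours.</p>

```python
from itertools import accumulate


def cumulative_percentages(counts):
    """Running share of responses at or below each answer option, in percent."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]


# Hypothetical response counts for options (a) through (d); not the survey data.
low_scale = [10, 20, 30, 40]    # options end at 1, 2 and 3 hours, then "more than 3"
high_scale = [30, 40, 20, 10]   # options end at 3, 6 and 9 hours, then "more than 9"

# The shared cutoff is option (c) on the low scale and option (a) on the high scale.
low_under_3 = cumulative_percentages(low_scale)[2]
high_under_3 = cumulative_percentages(high_scale)[0]
print(f"3 hours or less: low scale {low_under_3:.0f}%, high scale {high_under_3:.0f}%")
```

<p>If the scale had no influence on answers, the two percentages at the shared cutoff would be expected to land close together.</p>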
<p><strong>Explaining the Gap</strong></p>
<p>If you&#8217;re thinking that the differences between the scales alone can&#8217;t explain why all of these results are so skewed, that&#8217;s a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit. </p>
<p>But what explains why scale questions can bias people&#8217;s responses so heavily? Survey researchers call this kind of behavior <a href="http://en.wikipedia.org/wiki/Satisficing#Survey_Taking">satisficing</a> &#8211; it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we&#8217;re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don&#8217;t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn&#8217;t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.</p>
<p>These results illustrate a sticky problem: it&#8217;s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.</p>
<p><strong>Okay, It&#8217;s Broken. Now How Do I Fix It?</strong></p>
<p>So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.</p>
<p>To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing CrowdFlower workers to type in their responses. Here&#8217;s a density plot of those responses with the minimum, maximum, and mean responses highlighted (<a href="http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR">sparklines</a> style):</p>
<p><img src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online_open.png" alt="hours online - open ended" /></p>
<p>While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers&#8217; judgments.</p>
<p>Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it&#8217;s <a href="http://en.wikipedia.org/wiki/Turtles_all_the_way_down">turtles all the way down</a>. I&#8217;d be willing to bet that their best guesses are pretty good, but if a big policy decision were riding on this question, I&#8217;d try to supplement my little survey with some other data sources. No matter what, there&#8217;s no perfect solution.</p>
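<p>Here is one way the open-ended step could look in Python, using only the standard library. The list of answers is hypothetical rather than the actual responses, and using quartiles as the option boundaries is just one possible re-centering rule that I am assuming for illustration, not the method used in this post.</p>

```python
from statistics import mean, median, mode, quantiles

# Hypothetical open-ended answers in hours per day; not the survey's responses.
hours = [2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 10, 12]

print(f"mean {mean(hours):.2f}, median {median(hours)}, mode {mode(hours)}")

# One possible re-centering rule (an assumption, not the post's method):
# use the quartiles as boundaries so each option draws about a quarter of answers.
q1, q2, q3 = quantiles(hours, n=4)
print(f"(a) up to {q1:g} hours")
print(f"(b) {q1:g} to {q2:g} hours")
print(f"(c) {q2:g} to {q3:g} hours")
print(f"(d) more than {q3:g} hours")
```

<p>In practice you would probably round the boundaries to whole hours so the options stay easy to read.</p>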
<p><strong>So what?</strong></p>
<p>The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you&#8217;re not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don&#8217;t usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.</p>
<p>I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at <em>aaron [at] doloreslabs [dot] com</em> with questions, comments, corrections and requests for data/code. All of these plots were created using <a href="http://www.r-project.org/">R</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

