<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The CrowdFlower Blog &#187; social science</title>
	<atom:link href="http://blog.crowdflower.com/tag/social-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.crowdflower.com</link>
	<description></description>
	<lastBuildDate>Tue, 10 Jan 2012 20:00:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Designing Incentives for Crowdsourcing Workers</title>
		<link>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/</link>
		<comments>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/#comments</comments>
		<pubDate>Tue, 24 May 2011 19:19:45 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Human Behavior]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[behavior]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[incentives]]></category>
		<category><![CDATA[motivation]]></category>
		<category><![CDATA[social science]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=2572</guid>
		<description><![CDATA[In a recent paper, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), John Horton, Daniel Chen and I used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work. The results surprised us. They suggest that workers perform most accurately when the task design credibly [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a title="Designing Incentives for Inexpert Human Raters, Berkman Center" href="http://cyber.law.harvard.edu/publications/2011/Designing_Incentives_Inexpert_Human_Raters">recent paper</a>, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), <a title="John Horton, oDesk" href="https://sites.google.com/site/johnjosephhorton/">John Horton</a>, <a title="Daniel Chen, Duke Law School" href="http://www.law.duke.edu/fac/chen">Daniel Chen</a> and <a title="Aaron Shaw, UC Berkeley &amp; Harvard" href="http://aaronshaw.org">I</a> used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work.</p>
<p>The results surprised us. They suggest that workers perform most accurately when the task design credibly links payoffs to a worker&#8217;s ability to think about the answers that their peers are likely to provide.</p>
<p style="text-align: center;">
<div id="attachment_2577" class="wp-caption aligncenter" style="width: 549px"><a href="http://www.flickr.com/photos/iyoupapa/"><img class="size-full wp-image-2577 " title="Horserace!" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/3757438159_horserace-iyoupapa-altered.jpg" alt="Horserace!" width="539" height="264" /></a><p class="wp-caption-text">a horserace experiment! (photo cc-by-sa by iyoupapa)</p></div>
<p><span id="more-2572"></span></p>
<p>The idea for this study came out of our sense that, as social scientists, we had something unique to offer the existing research on human computation. <a title="AMT is fast, cheap, and good for machine learning data" href="http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/">Early</a> and <a title="&quot;Get Another Label?&quot; Ipeirotis et al. 2008" href="http://archive.nyu.edu/handle/2451/25882">influential</a> crowdsourcing research has focused on how to filter the judgments of the crowd to find the best answers. We wanted to know whether simple task-design changes could improve the quality of data coming into a crowdsourcing system in the first place.</p>
<p>To test this idea, we chose 14 different incentive schemes and framing techniques developed and validated across the social sciences and set up a horse race experiment to see which schemes/techniques would work best.</p>
<p>Consistent with our personal biases (John and Daniel are both economists, and I&#8217;m a sociologist), some of the schemes were financially oriented, some were social or psychological, and some were hybrids combining social and financial incentives. The details of all the schemes are included <a title="Designing Incentives for Inexpert Human Raters" href="http://cyber.law.harvard.edu/publications/2011/Designing_Incentives_Inexpert_Human_Raters">in the paper</a> (it&#8217;s a long list, and some of them are kind of involved), but it&#8217;s worth giving some examples.</p>
<p>On the financial end of the incentives spectrum, we had one condition we called &#8220;reward-accuracy,&#8221; which was pretty much what you&#8217;d expect: we told workers, &#8220;we&#8217;ll pay you a bonus if you get the answers right.&#8221; We also had one called &#8220;punishment-accuracy,&#8221; the gist of which you can deduce. On the purely social-psychological side, we had one we called &#8220;trust,&#8221; in which we told workers, &#8220;we&#8217;ll pay you for this job no matter how bad your performance, we trust that you&#8217;ll still make your best effort.&#8221;</p>
<p>One of the weirdest schemes turns out to be important, so I need to explain that one. Called &#8220;Bayesian Truth Serum&#8221; (BTS), it incorporates a design from the work of <a title="Drazen Prelec" href="http://econ-www.mit.edu/faculty/dprelec">Drazen Prelec</a>, a behavioral economist at MIT, who realized that research subjects could probably provide useful information regarding the expected distribution for subjective, qualitative questions (<em>nb</em>, the mechanics of how he does this are arcane in a way that is almost sure to delight the geeks among you, so I encourage you to <a title="Bayesian Truth Serum" href="http://econ-www.mit.edu/files/1966">read his paper</a>). Few of the details of <em>real</em> BTS are important, except that we incorporated the piece about asking workers to answer the questions themselves <em>and predict the distribution of other workers&#8217; responses</em>. We also told them we&#8217;d give them a bonus if their predictions were correct.</p>
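<p>To make the "predict the distribution" idea concrete, here is a minimal sketch of how a worker's prediction could be compared with what peers actually answered. It is <em>not</em> Prelec's BTS scoring rule, and it is not necessarily how predictions were judged in the study: the answer options, the responses, and the use of total variation distance as the error measure are all assumptions made for illustration.</p>

```python
from collections import Counter

def observed_shares(answers, options):
    """Fraction of workers who chose each option."""
    counts = Counter(answers)
    return {opt: counts[opt] / len(answers) for opt in options}

def prediction_error(predicted, observed):
    """Total variation distance between predicted and observed shares.

    0 means a perfect prediction; 1 is the worst possible value.
    This error measure is an assumption, not the study's bonus rule.
    """
    return 0.5 * sum(abs(predicted[opt] - observed[opt]) for opt in observed)

options = ["yes", "no"]
peer_answers = ["yes", "yes", "yes", "no"]   # invented responses
observed = observed_shares(peer_answers, options)

prediction = {"yes": 0.5, "no": 0.5}         # one worker's guess about peers
error = prediction_error(prediction, observed)
print(observed, error)
```

<p>A bonus rule could then pay out when the error falls under some cutoff; where to put that cutoff is a design choice this sketch leaves open.</p>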
<p>We then created a task that asked workers to answer five questions. In this case, the questions were drawn from another study examining participatory features of websites, for which we already possessed validated data collected by research assistants.</p>
<p>All workers answered the same five questions about the same website (<a href="http://www.kiva.org">www.kiva.org</a>) while being exposed to one and only one of the 14 incentive schemes (or a control condition of no scheme). Roughly 2,000 individuals participated in the study, resulting in over 100 subjects in each of the experimental conditions. (The statistics and science nerds out there will be pleased to know that both the drop-out rate and demographic covariates were distributed evenly across conditions.)</p>
<p>To measure worker performance, we used the research assistant responses as correct answers to the questions and then calculated the total number of matching answers (out of five) provided by each worker. The results (aggregated across all treatments) are plotted in a histogram below and show that the average worker answered just over two questions out of five correctly.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/aggperf/" rel="attachment wp-att-2578"><img class="aligncenter size-full wp-image-2578" title="Inexpert raters - Aggregate Performance" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/AggPerf.png" alt="Aggregate performance histogram" width="280" height="280" /></a></p>
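<p>The scoring step is simple enough to sketch in a few lines. The gold-standard answers and worker responses below are placeholders rather than the study's data, and the function names are hypothetical.</p>

```python
# Hypothetical gold-standard answers from the research assistants
# (placeholder values, not the study's real data).
GOLD = ["yes", "no", "yes", "yes", "no"]

def count_matches(worker_answers, gold=GOLD):
    """Return how many of a worker's answers equal the gold-standard answers."""
    if len(worker_answers) != len(gold):
        raise ValueError("expected one answer per question")
    return sum(1 for given, correct in zip(worker_answers, gold) if given == correct)

def mean_score(all_workers):
    """Average number of matching answers across a list of workers."""
    scores = [count_matches(answers) for answers in all_workers]
    return sum(scores) / len(scores)

workers = [
    ["yes", "no", "no", "yes", "yes"],
    ["no", "no", "no", "no", "no"],
    ["yes", "yes", "no", "no", "yes"],
]
print([count_matches(w) for w in workers])
print(mean_score(workers))
```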
<p>Then, to see how the treatments compared with the control group, we calculated the mean number of correct answers for each condition and conducted difference-of-means tests to see which of these means were significantly greater than the control group&#8217;s. The results of this comparison appear below (in a new plot that doesn&#8217;t even appear in the paper!):</p>
<p><a href="http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/inexpert-itt/" rel="attachment wp-att-2579"><img class="aligncenter size-full wp-image-2579" title="inexpert raters - ITT estimates" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/inexpert-ITT.png" alt="ITT estimates per treatment" width="500" height="500" /></a></p>
<p>The orange dots show the value of the mean in each condition, and the blue bars illustrate the 95% confidence interval around that mean. The treatments are sorted by the size of the difference in means from the control. (More hard-core nerd stuff: the means are adjusted using Intent-To-Treat estimators.)</p>
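<p>For readers who want to see what a mean with a 95% confidence interval looks like in code, here is a rough sketch. It uses a plain normal approximation on invented scores and does not reproduce the Intent-To-Treat adjustment mentioned above; the function name and the data are assumptions.</p>

```python
import math
import statistics

def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - z * se, m + z * se

# Made-up per-worker scores (0 to 5 correct) for two conditions.
control = [2, 1, 3, 2, 2, 3, 1, 2, 3, 2]
treatment = [3, 2, 3, 2, 3, 4, 2, 3, 2, 3]

for name, scores in [("control", control), ("treatment", treatment)]:
    m, lo, hi = mean_with_ci(scores)
    print(f"{name}: mean={m:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```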
<p>From these results, we concluded that our horse race had two clear front-runners: the &#8220;Bayesian Truth Serum&#8221; (BTS) and &#8220;Punishment &#8211; disagreement&#8221; conditions, each of which improved average worker performance by almost half of a correct answer above the 2.08 correct answers in the control group. A few of the other financial and hybrid incentives had fairly large point estimates, but were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn&#8217;t make sense; it&#8217;s yet another precautionary measure to avoid upsetting the stats nerds among you). In a tough turn for the sociologists and psychologists, none of the purely social/psychological treatments had any significant effects at all.</p>
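<p>The post does not say which multiple-comparison adjustment was used, so the sketch below swaps in the Bonferroni correction, one common and conservative choice, purely as an illustration. The treatment names and p-values are invented.</p>

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}

# Invented unadjusted p-values for three hypothetical treatment-vs-control tests.
raw = {"treatment_a": 0.002, "treatment_b": 0.03, "treatment_c": 0.40}
adjusted = bonferroni(raw)

# Keep only the comparisons that stay under 0.05 after adjustment.
significant = [name for name, p in adjusted.items() if 0.05 > p]
print(adjusted)
print(significant)
```

<p>In this toy example the second treatment would look significant on its own but no longer clears the bar once the number of comparisons is taken into account.</p>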
<p>Why do BTS and punishing workers for disagreement succeed in improving performance significantly where so many of the other incentive schemes failed? The answer hinges on the fact that both conditions tied workers&#8217; payoffs to their ability to think about their peers&#8217; likely responses. (We elaborate on the argument in more detail in the paper.)</p>
<p>Does this mean that we should give up on simple financial or social-psychological incentives? Probably not. The fact that we conducted the experiment on MTurk means that the deck may have been stacked against incentives like the &#8220;trust&#8221; condition I described earlier. Because requesters on MTurk have little oversight, workers are more likely to respond to financial incentives than stated promises. In this sense, the marketplace has structured the interaction between workers and requesters in a way that may limit the opportunities to harness motivations that are not linked to money in some explicit way.</p>
<p>You can <a title="Designing Incentives for Inexpert Human Raters" href="http://cyber.law.harvard.edu/sites/cyber.law.harvard.edu/files/Shaw-Horton-Chen_Designing_Incentives_Inexpert_Human_Raters_2011.pdf">download the full paper</a> to read more.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Ask a Stupid Question</title>
		<link>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/</link>
		<comments>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/#comments</comments>
		<pubDate>Wed, 16 Dec 2009 00:01:23 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[garbage-in garbage-out]]></category>
		<category><![CDATA[social science]]></category>
		<category><![CDATA[surveys]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2009/12/ask-a-stupid-question/</guid>
		<description><![CDATA[What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design since it&#8217;s a relevant, but often overlooked, area of Crowdsourcing work. You can ask “the crowd” all kinds of questions, but [...]]]></description>
			<content:encoded><![CDATA[
<p>What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design since it&#8217;s a relevant, but often overlooked, area of Crowdsourcing work. </p>
<p>You can ask “the crowd” all kinds of questions, but if you don&#8217;t stop to think about the best way to ask your question, you&#8217;re likely to get unexpected and unreliable results. You might call it the <a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out">GIGO</a> theory of research design.</p>
<p>To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in Crowdflower&#8217;s labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.</p>
<p><strong>An Example: Response Scales</strong></p>
<p>The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people&#8217;s answers. Here are two versions of the same question that I posted to Crowdflower:</p>
<blockquote><p><small>Low Scale Version:</small></p>
<p><small><code>About how many hours do you spend online per day?</code></small></p>
<blockquote><p><small><code>(a) 0 – 1 hour<br />
     (b) 1 – 2 hours<br />
     (c) 2 – 3 hours<br />
     (d) More than 3 hours</code></small></p></blockquote>
</blockquote>
<blockquote><p><small>High Scale Version:</small></p>
<p><small><code>About how many hours do you spend online per day?</code></small></p>
<blockquote><p><small><code>(a) 0 – 3 hours<br />
     (b) 3 – 6 hours<br />
     (c) 6 – 9 hours<br />
     (d) More than 9 hours</code></small></p></blockquote>
</blockquote>
<p>Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.</p>
<p>So what did people say? Here&#8217;s a pair of histograms breaking the responses up by the two versions of the question:</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online-raw_bins.png" alt="boring histograms: hours online by scale" /></p>
<p>I didn&#8217;t label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the Crowdflower worker pool tend to spend more than three hours per day online (whoa, no way&#8230;). </p>
<p>At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).</p>
<p>To look at that comparison more closely, let&#8217;s break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours, and (2) the percentage of responses that were more than 3 hours.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/dotplots-2bins.png" alt="hours online in two bins" /></p>
<p>The difference between the heights of the orange points (high scale) is much bigger than the corresponding difference between the heights of the blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you&#8217;re a stats nerd, the <a href="http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html">Chi-square test</a> showed that this variation was significant with a p-value &lt; 0.001, so a difference this large would be very unlikely to arise by chance alone.</p>
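<p>A chi-square test on a two-by-two table like this one can be sketched with nothing but the standard library. The counts below are invented (the post does not report the raw numbers), the function name is hypothetical, and no continuity correction is applied.</p>

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Invented counts: rows are scale versions (low, high),
# columns are (under 3 hours, over 3 hours).
stat, p = chi_square_2x2(60, 40, 30, 70)
print(round(stat, 2), p)
```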
<p>But maybe collapsing the responses like this is a little too coarse and you&#8217;d still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects – a comparison of the cumulative percentage of responses – and the differences are even clearer.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/cumulative_dotplots.png" alt="hours online - cumulative bins" /></p>
<p>That gap between the blue and the orange line at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!</p>
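<p>The cumulative comparison boils down to a running total divided by the number of responses. Here is a small sketch with invented counts; the only point where the two scales can be compared directly is the "less than 3 hours" cut, which is the third option on the low scale and the first on the high scale.</p>

```python
from itertools import accumulate

def cumulative_percentages(counts):
    """Running share of responses, in percent, across ordered answer options."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]

# Invented counts for options (a) through (d) on each scale.
low_scale = [10, 20, 30, 40]    # 0-1, 1-2, 2-3, more than 3 hours
high_scale = [30, 40, 20, 10]   # 0-3, 3-6, 6-9, more than 9 hours

low_cum = cumulative_percentages(low_scale)
high_cum = cumulative_percentages(high_scale)

# Share answering "less than 3 hours" on each scale.
print(low_cum[2], high_cum[0])
```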
<p><strong>Explaining the Gap</strong></p>
<p>If you&#8217;re thinking that the differences between the scales alone can&#8217;t explain why all of these results are so skewed, that&#8217;s a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit. </p>
<p>But what explains why scale questions can bias people&#8217;s responses so heavily? Survey researchers call this kind of behavior <a href="http://en.wikipedia.org/wiki/Satisficing#Survey_Taking">satisficing</a> &#8211; it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we&#8217;re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don&#8217;t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn&#8217;t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.</p>
<p>These results illustrate a sticky problem: it&#8217;s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.</p>
<p><strong>Okay, It&#8217;s Broken. Now How Do I Fix It?</strong></p>
<p>So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.</p>
<p>To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing Crowdflower workers to type in their responses. Here&#8217;s a density plot of those responses with the minimum, maximum, and mean responses highlighted (<a href="http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR">sparklines</a> style):</p>
<p><img src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online_open.png" alt="hours online - open ended" /></p>
<p>While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers&#8217; judgments.</p>
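<p>Re-centering a scale around open-ended answers could look something like the sketch below. The responses are invented, <code>centered_scale</code> is a hypothetical helper, and using the rounded median as the midpoint with a fixed bin width is just one reasonable choice among several.</p>

```python
import statistics

def summarize(hours):
    """Mean, median, and mode of open-ended responses."""
    return statistics.mean(hours), statistics.median(hours), statistics.mode(hours)

def centered_scale(midpoint, width, n_options=4):
    """Build n_options contiguous ranges whose middle boundary sits at the
    midpoint; the last option is open-ended."""
    start = max(0, midpoint - width * (n_options // 2))
    edges = [start + width * i for i in range(n_options)]
    labels = [f"{lo} to {lo + width} hours" for lo in edges[:-1]]
    labels.append(f"More than {edges[-1]} hours")
    return labels

responses = [2, 4, 5, 5, 6, 6, 5, 8, 10, 14]   # invented open-ended answers
mean, median, mode = summarize(responses)
print(mean, median, mode)
print(centered_scale(midpoint=round(median), width=3))
```

<p>With these made-up numbers the helper lands on the same ranges as the high scale above, which is the kind of sanity check you would hope for.</p>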
<p>Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it&#8217;s <a href="http://en.wikipedia.org/wiki/Turtles_all_the_way_down">turtles all the way down</a>. I&#8217;d be willing to bet that their best guesses are pretty good, but if a big policy decision were riding on this question, I&#8217;d try to supplement my little survey with some other data sources. No matter what, there&#8217;s no perfect solution.</p>
<p><strong>So what?</strong></p>
<p>The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you&#8217;re not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don&#8217;t usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.</p>
<p>I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at <em>aaron [at] doloreslabs [dot] com</em> with questions, comments, corrections and requests for data/code. All of these plots were created using <a href="http://www.r-project.org/">R</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

