<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The CrowdFlower Blog &#187; Aaron Shaw</title>
	<atom:link href="http://blog.crowdflower.com/author/aaron/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.crowdflower.com</link>
	<description></description>
	<lastBuildDate>Tue, 10 Jan 2012 20:00:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Designing Incentives for Crowdsourcing Workers</title>
		<link>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/</link>
		<comments>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/#comments</comments>
		<pubDate>Tue, 24 May 2011 19:19:45 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Human Behavior]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[behavior]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[incentives]]></category>
		<category><![CDATA[motivation]]></category>
		<category><![CDATA[social science]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=2572</guid>
		<description><![CDATA[In a recent paper, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), John Horton, Daniel Chen and I used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work. The results surprised us. They suggest that workers perform most accurately when the task design credibly [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a title="Designing Incentives for Inexpert Human Raters, Berkman Center" href="http://cyber.law.harvard.edu/publications/2011/Designing_Incentives_Inexpert_Human_Raters">recent paper</a>, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), <a title="John Horton, oDesk" href="https://sites.google.com/site/johnjosephhorton/">John Horton</a>, <a title="Daniel Chen, Duke Law School" href="http://www.law.duke.edu/fac/chen">Daniel Chen</a> and <a title="Aaron Shaw, UC Berkeley &amp; Harvard" href="http://aaronshaw.org">I</a> used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work.</p>
<p>The results surprised us. They suggest that workers perform most accurately when the task design credibly links payoffs to a worker&#8217;s ability to think about the answers that their peers are likely to provide.</p>
<p style="text-align: center;">
<div id="attachment_2577" class="wp-caption aligncenter" style="width: 549px"><a href="http://www.flickr.com/photos/iyoupapa/"><img class="size-full wp-image-2577 " title="Horserace!" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/3757438159_horserace-iyoupapa-altered.jpg" alt="Horserace!" width="539" height="264" /></a><p class="wp-caption-text">a horserace experiment! (photo cc-by-sa by iyoupapa)</p></div>
<p><span id="more-2572"></span></p>
<p>The idea for this study came out of our sense that, as social scientists, we had something unique to offer the existing research on human computation. <a title="AMT is fast, cheap, and good for machine learning data" href="http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/">Early</a> and <a title="&quot;Get Another Label?&quot; Ipeirotis et al. 2008" href="http://archive.nyu.edu/handle/2451/25882">influential</a> crowdsourcing research has focused on how to filter the judgments of the crowd to find the best answers. We wanted to know whether simple task-design changes could improve the quality of data coming into a crowdsourcing system in the first place.</p>
<p>To test this idea, we chose 14 different incentive schemes and framing techniques developed and validated across the social sciences and set up a horse race experiment to see which schemes/techniques would work best.</p>
<p>Consistent with our personal biases (John and Daniel are both economists, and I&#8217;m a sociologist), some of the schemes were financially oriented, some were social or psychological, and some were hybrids combining social and financial incentives. The details of all the schemes are included <a title="Designing Incentives for Inexpert Human Raters" href="http://cyber.law.harvard.edu/publications/2011/Designing_Incentives_Inexpert_Human_Raters">in the paper</a> (it&#8217;s a long list, and some of them are kind of involved), but it&#8217;s worth giving some examples.</p>
<p>On the financial end of the incentives spectrum, we had one condition we called &#8220;reward-accuracy,&#8221; which was pretty much what you&#8217;d expect: we told workers, &#8220;we&#8217;ll pay you a bonus if you get the answers right.&#8221; We also had one called &#8220;punishment-accuracy,&#8221; the gist of which you can deduce. On the purely social-psychological side, we had one we called &#8220;trust,&#8221; in which we told workers, &#8220;we&#8217;ll pay you for this job no matter how bad your performance, we trust that you&#8217;ll still make your best effort.&#8221;</p>
<p>One of the weirdest schemes turns out to be important, so I need to explain that one. Called &#8220;Bayesian Truth Serum&#8221; (BTS), it incorporates a design from the work of <a title="Drazen Prelec" href="http://econ-www.mit.edu/faculty/dprelec">Drazen Prelec</a>, a behavioral economist at MIT, who realized that research subjects could probably provide useful information regarding the expected distribution for subjective, qualitative questions (<em>nb</em>, the mechanics of how he does this are arcane in a way that is almost sure to delight the geeks among you, so I encourage you to <a title="Bayesian Truth Serum" href="http://econ-www.mit.edu/files/1966">read his paper</a>). Few of the details of <em>real</em> BTS are important, except that we incorporated the piece about asking workers to answer the questions themselves <em>and predict the distribution of other workers&#8217; responses</em>. We also told them we&#8217;d give them a bonus if their predictions were correct.</p>
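<p>To make the prediction piece concrete, here is a minimal Python sketch. The post does not say how a prediction was judged &#8220;correct,&#8221; so the distance measure, the function names, and the data below are all hypothetical and purely illustrative. This is not the real BTS scoring rule.</p>

```python
from collections import Counter

def answer_shares(answers, options):
    """Share of respondents who picked each option."""
    counts = Counter(answers)
    return {option: counts[option] / len(answers) for option in options}

def prediction_error(predicted, actual):
    """Total variation distance between predicted and actual shares.

    Assumption: this distance is only one plausible way to judge a prediction.
    """
    return 0.5 * sum(abs(predicted[option] - actual[option]) for option in actual)

# Made-up example: four peers answer one yes/no question.
options = ["yes", "no"]
peer_answers = ["yes", "yes", "no", "yes"]
actual = answer_shares(peer_answers, options)

# One worker predicts an even split among peers.
prediction = {"yes": 0.5, "no": 0.5}
error = prediction_error(prediction, actual)
print(actual, error)
```

<p>A requester could then pay the bonus whenever the error falls under some cutoff, but that cutoff is also an assumption and would need tuning.</p>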
<p>We then created a task that asked workers to answer five questions. In this case, the questions were drawn from another study examining participatory features of websites, for which we already possessed validated data collected by research assistants.</p>
<p>All workers answered the same five questions about the same website (<a href="http://www.kiva.org">www.kiva.org</a>) while being exposed to one and only one of the 14 incentive schemes (or a control condition of no scheme). Roughly 2,000 individuals participated in the study, resulting in over 100 subjects in each of the experimental conditions. (The statistics and science nerds out there will be pleased to know that both the drop-out rate and demographic covariates were distributed evenly across conditions.)</p>
<p>To measure worker performance, we used the research assistant responses as correct answers to the questions and then calculated the total number of matching answers (out of five) provided by each worker. The results (aggregated across all treatments) are plotted in a histogram below and show that the average worker answered just over two questions out of five correctly.</p>
<p style="text-align: center;"><a href="http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/aggperf/" rel="attachment wp-att-2578"><img class="aligncenter size-full wp-image-2578" title="Inexpert raters - Aggregate Performance" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/AggPerf.png" alt="Aggregate performance histogram" width="280" height="280" /></a></p>
<p>&nbsp;</p>
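<p>The scoring step described above amounts to counting matches against an answer key. A minimal Python sketch, assuming made-up worker names and answers (none of these values come from the study):</p>

```python
# Hypothetical reference answers from the research assistants.
reference = ["yes", "no", "yes", "yes", "no"]

# Hypothetical answers from two workers to the same five questions.
workers = {
    "worker_a": ["yes", "no", "no", "yes", "yes"],
    "worker_b": ["no", "no", "yes", "no", "yes"],
}

def score(answers, key):
    """Count how many answers match the reference key."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)

scores = {name: score(answers, reference) for name, answers in workers.items()}
mean_score = sum(scores.values()) / len(scores)
print(scores, mean_score)
```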
<p>Then, to see how the treatments compared against each other relative to the control group, we calculated the mean correct response rate for each condition and conducted difference-of-means tests to see which of these means were significantly greater than the control group&#8217;s mean. The results of this comparison appear below (in a new plot that doesn&#8217;t even appear in the paper!):</p>
<p><a href="http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/inexpert-itt/" rel="attachment wp-att-2579"><img class="aligncenter size-full wp-image-2579" title="inexpert raters - ITT estimates" src="http://blog.crowdflower.com/wp-content/uploads/2011/05/inexpert-ITT.png" alt="ITT estimates per treatment" width="500" height="500" /></a></p>
<p>The orange dots show the value of the mean in each condition, and the blue bars illustrate the 95% confidence interval around that mean. The treatments are sorted by the size of the difference in means from the control. (More hard-core nerd stuff: the means are adjusted using Intent-To-Treat estimators.)</p>
<p>From these results, we concluded that our horse race had two clear front-runners: the &#8220;Bayesian Truth Serum&#8221; (BTS) and &#8220;Punishment &#8211; disagreement&#8221; conditions, each of which improved average worker performance by almost half of a correct answer above the 2.08 correct answers in the control group. A few of the other financial and hybrid incentives had fairly large point estimates, but were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn&#8217;t make sense; it&#8217;s yet another precautionary measure to avoid upsetting the stats nerds among you). In a tough turn for the sociologists and psychologists, none of the purely social/psychological treatments had any significant effects at all.</p>
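<p>For readers who want to see the shape of such a comparison, here is a rough Python sketch. It assumes a normal approximation and a Bonferroni adjustment. The post does not say which adjustment we used, and the sketch leaves out the Intent-To-Treat step entirely. The scores are made up.</p>

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def compare_to_control(treated, control, n_comparisons):
    """One-sided difference-of-means test against control.

    Assumptions: normal approximation, Bonferroni-adjusted p-value.
    """
    diff = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    p_raw = 1.0 - NormalDist().cdf(diff / se)
    p_adjusted = min(1.0, p_raw * n_comparisons)
    half_width = NormalDist().inv_cdf(0.975) * se
    return diff, (diff - half_width, diff + half_width), p_adjusted

# Made-up scores out of five, 120 workers per condition.
control = [2, 2, 1, 3, 2, 2, 3, 1, 2, 3] * 12
treated = [3, 2, 2, 3, 3, 2, 4, 1, 3, 3] * 12

diff, interval, p_adjusted = compare_to_control(treated, control, n_comparisons=14)
print(diff, interval, p_adjusted)
```

<p>Multiplying each p-value by the number of comparisons is the bluntest possible correction, which is why a large point estimate can still fail to clear the bar.</p>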
<p>Why do BTS and punishing workers for disagreement succeed in improving performance significantly where so many of the other incentive schemes failed? The answer hinges on the fact that both conditions tied workers&#8217; payoffs to their ability to think about their peers&#8217; likely responses. (We elaborate on the argument in more detail in the paper.)</p>
<p>Does this mean that we should give up on simple financial or social-psychological incentives? Probably not. The fact that we conducted the experiment on MTurk means that the deck may have been stacked against incentives like the &#8220;trust&#8221; condition I described earlier. Because requesters on MTurk have little oversight, workers are more likely to respond to financial incentives than stated promises. In this sense, the marketplace has structured the interaction between workers and requesters in a way that may limit the opportunities to harness motivations that are not linked to money in some explicit way.</p>
<p>You can <a title="Designing Incentives for Inexpert Human Raters" href="http://cyber.law.harvard.edu/sites/cyber.law.harvard.edu/files/Shaw-Horton-Chen_Designing_Incentives_Inexpert_Human_Raters_2011.pdf">download the full paper</a> to read more.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2011/05/designing-incentives-for-crowdsourcing-workers/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>For love or for money? A list experiment on the motivations behind crowdsourcing work</title>
		<link>http://blog.crowdflower.com/2010/08/for-love-or-for-money-a-list-experiment-on-the-motivations-behind-crowdsourcing-work/</link>
		<comments>http://blog.crowdflower.com/2010/08/for-love-or-for-money-a-list-experiment-on-the-motivations-behind-crowdsourcing-work/#comments</comments>
		<pubDate>Thu, 05 Aug 2010 15:00:11 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Motivation]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[Judd Antin]]></category>
		<category><![CDATA[list experiment]]></category>
		<category><![CDATA[Mturk]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[social desirability]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/?p=931</guid>
		<description><![CDATA[What motivates crowdsourcing workers to do what they do? According to some surveys, many of the workers say they&#8217;re just in it for the money. However, my friend Judd Antin and I recently ran what&#8217;s called a &#8220;list experiment&#8221; — an awesome twist on a traditional survey — and we found that the reality is [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_933" class="wp-caption aligncenter" style="width: 501px"><a href="http://www.flickr.com/photos/mikelewis/2287255370"><img class="size-full wp-image-933 " title="Motivation" src="http://blog.crowdflower.com/wp-content/uploads/2010/07/Motivation_img.jpg" alt="" width="491" height="392" /></a><p class="wp-caption-text">Motivation in the workplace. Created by user: pescatello on flickr and licensed cc-by 2.0</p></div>
<div class="mceTemp" style="text-align: left;">What motivates crowdsourcing workers to do what they do? According to some surveys, many of the workers <em>say</em> they&#8217;re just in it for the money. However, my friend <a href="http://www.technotaste.com/" target="new">Judd Antin</a> and I recently ran what&#8217;s called a &#8220;list experiment&#8221; — an awesome twist on a traditional survey — and we found that the reality is much more complex.</div>
<p><span id="more-931"></span></p>
<p>A few weeks ago, I was talking about the motivations of crowdsourcing workers with Judd, who has already done <a href="http://technotaste.com/research" target="new">a ton of great work</a> looking at motivations for participation across a wide range of online environments. He is a recent Ph.D. from the <a href="http://ischool.berkeley.edu">UC Berkeley School of Information</a> and just joined <a href="http://research.yahoo.com/Judd_Antin" target="new">Yahoo! Research</a> as a social psychologist and research scientist in the Internet Experiences Group, so it was no surprise that he had a great idea about how to design an experiment to better understand crowdsourcing.</p>
<p>The most straightforward way to ask crowdsourcing workers why they do what they do is with a survey (e.g., <a href="http://pages.stern.nyu.edu/~panos/" target="new">Panos Ipeirotis&#8217;</a> fascinating <a href="http://behind-the-enemy-lines.blogspot.com/2010/03/new-demographics-of-mechanical-turk.html" target="new">recent informal survey</a> of MTurk workers). However, you might also recall from <a href="http://blog.crowdflower.com/2009/12/ask-a-stupid-question/" target="new">one</a> or <a href="http://blog.crowdflower.com/2010/03/ask-a-stupid-question-part-ii-forced-choice-vs-checkboxes/" target="new">two</a> of my previous posts that I tend not to take survey results at face value.</p>
<p>Judd&#8217;s &#8220;list experiment&#8221; presents the subjects of a study with a list of several motivations and asks them to provide a count of the number of items in the list they agree with (rather than answering yes/no questions or ticking checkboxes).</p>
<p>Here&#8217;s what that looked like once Judd had it set up in Crowdflower:</p>
<div id="attachment_935" class="wp-caption aligncenter" style="width: 884px"><a rel="attachment wp-att-935" href="http://blog.crowdflower.com/2010/08/for-love-or-for-money-a-list-experiment-on-the-motivations-behind-crowdsourcing-work/list_exper_screenshot/"><img class="size-full wp-image-935 " title="List_exper_screenshot" src="http://blog.crowdflower.com/wp-content/uploads/2010/07/List_exper_screenshot.png" alt="" width="874" height="262" /></a><p class="wp-caption-text">A screenshot from one version of our list experiment</p></div>
<p>We presented experimental treatment groups with four other permutations of the same list — each one missing one of the items — and aggregated the results across every group. This allowed us to estimate the proportion of respondents choosing each item in the list.</p>
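<p>The estimation logic can be sketched in a few lines of Python. This assumes the usual difference-in-means estimator for list experiments: the average count in the group that saw the full list, minus the average count in the group that saw the list without a given item. The item names and counts below are made up for illustration.</p>

```python
from statistics import mean

# Made-up counts reported by the group that saw the full list.
full_list_counts = [3, 2, 4, 3, 2, 3, 4, 2]

# Made-up counts from groups that saw the list minus one item.
counts_without_item = {
    "money": [2, 2, 3, 2, 1, 3, 3, 2],
    "fun": [3, 2, 3, 3, 2, 2, 4, 2],
}

def estimate_prevalence(full_counts, reduced_counts):
    """Estimated share of respondents who agree with the omitted item."""
    return mean(full_counts) - mean(reduced_counts)

estimates = {
    item: estimate_prevalence(full_list_counts, counts)
    for item, counts in counts_without_item.items()
}
print(estimates)
```

<p>No individual ever reveals which items they agreed with; only the group averages carry information about any single item.</p>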
<p>The advantage of the list experiment over the traditional survey format is that it doesn&#8217;t require anybody to explicitly say, &#8220;I crowdsource because it gives me a sense of purpose.&#8221; Indeed, it perfectly preserves the anonymity of individual user preferences, since the results that we generate are estimates based on summaries of behavior across the different treatment groups. The questions are less obtrusive and there&#8217;s no pressure to hide your true sentiments or conform to the expectations of others. List experiments are thus amazing tools to examine preferences that may be controversial or otherwise influenced by social pressures in some way.</p>
<p>Judd and I designed a pilot experiment with the list above and administered it to MTurk workers through Crowdflower. For the sake of comparison, we also included a control condition that asked Turkers the same questions in traditional, agreement-style survey form. To simplify things, we limited the responses to US workers only.</p>
<p>Comparing the results from the survey condition and the list experiment revealed some mind-blowing differences:</p>
<p><a rel="attachment wp-att-936" href="http://blog.crowdflower.com/2010/08/for-love-or-for-money-a-list-experiment-on-the-motivations-behind-crowdsourcing-work/list_exper-comparisonresults/"><img class="aligncenter size-full wp-image-936" title="List_exper-ComparisonResults" src="http://blog.crowdflower.com/wp-content/uploads/2010/07/List_exper-ComparisonResults.png" alt="" width="672" height="672" /></a></p>
<p>Note the discrepancy between some of the paired bars. Whereas 97% of the Turkers in the control group agreed with the statement &#8220;I am motivated to do HITs on Mechanical Turk to make extra money,&#8221; just 60% of the Turkers in the list experiment condition expressed the same preference.</p>
<p>Similarly, check out the difference between the agreement-style questions and list experiment results in the &#8220;for fun&#8221; category. Again, agreement statements elicit over-reporting when compared with the list experiment (although this time to a less extreme degree).</p>
<p>Our preliminary conclusions from this pilot study? The ideas of crowdsourcing for money and crowdsourcing for fun sound better than they actually are.</p>
<p>Another, slightly more science-y way to put this is that the workers in our study over-report the extent to which they are motivated by money and fun in response to agreement statements versus a list experiment, suggesting that they perceive these two factors to be socially desirable.</p>
<p>Understanding the cause of this <a href="http://en.wikipedia.org/wiki/Social_desirability_bias" target="new">social desirability bias</a> as well as its implications for crowdsourcing across different environments will require further research. In other contexts, social desirability bias (a.k.a. <a href="http://www.fivethirtyeight.com/2010/07/broadus-effect-social-desirability-bias.html" target="new">&#8220;the Broadus effect&#8221;</a>, if you read the amazing Nate Silver) has played a role in everything from elections to educational attainment. There&#8217;s no reason to believe it doesn&#8217;t affect the way people work and participate in various online environments as well.</p>
<p>Perhaps most interesting of all, our findings here further complicate the growing debate over how paid crowdsourcing ought to be <a href="http://cyber.law.harvard.edu/events/2010/02/zittrain" target="new">understood</a> and <a href="http://blog.crowdflower.com/2010/06/regulating-distributed-work-part-three-why-its-a-good-idea/" target="new">(potentially) regulated</a>. If a substantial proportion of workers aren&#8217;t actually on MTurk for the money, does that support the claim that we should regulate crowdsourcing along the same lines that we regulate other post-industrial sectors?</p>
<p>These are big questions that we should continue to probe through future studies and discussion. In the meantime, Judd and I re-ran our list experiment with a few minor adjustments and a much bigger sample. We&#8217;re in the process of writing up this larger version of the study for a conference submission and will post the full paper here as soon as we can.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2010/08/for-love-or-for-money-a-list-experiment-on-the-motivations-behind-crowdsourcing-work/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Crowdsourcing Benford&#8217;s Law</title>
		<link>http://blog.crowdflower.com/2010/04/crowdsourcing-benfords-law/</link>
		<comments>http://blog.crowdflower.com/2010/04/crowdsourcing-benfords-law/#comments</comments>
		<pubDate>Mon, 05 Apr 2010 17:19:53 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/2010/04/crowdsourcing-benfords-law/</guid>
		<description><![CDATA[On one of those quiet, rainy afternoons that we call &#8220;Winter&#8221; here in San Francisco, I found myself in the Crowdflower offices with the legendary Mike Love. Well, okay, maybe not that Mike Love. The other legendary Mike Love. In any event, Mike and I were chatting about statistics and blogging (he&#8217;s exceptionally skilled at [...]]]></description>
			<content:encoded><![CDATA[<p>On one of those quiet, rainy afternoons that we call &#8220;Winter&#8221; here in San Francisco, I found myself in the Crowdflower offices with the legendary <a href="http://en.wikipedia.org/wiki/Mike_Love">Mike Love</a>. </p>
<p>Well, okay, maybe not <em>that</em> Mike Love. The <em>other</em> legendary <strong><a href="http://mike-love.net/">Mike Love</a></strong>.</p>
<p>In any event, Mike and I were chatting about statistics and blogging (he&#8217;s exceptionally skilled at both) when he raised an interesting question: could you Crowdsource Benford&#8217;s Law?</p>
<p><span id="more-213"></span></p>
<p>Doe-eyed, mathematical neophyte that I am, neither Benford nor his law rang any bells, so I high-tailed it over to <a href="http://mathworld.wolfram.com">Wolfram MathWorld</a>. There, in <a href="http://mathworld.wolfram.com/BenfordsLaw.html">an article</a> by <a href="http://mathworld.wolfram.com/about/author.html">Eric W. Weisstein</a>, I encountered the following histogram:</p>
<p><a href="http://mathworld.wolfram.com/images/eps-gif/BenfordsLaw_800.gif"><img class="centered" src="http://mathworld.wolfram.com/images/eps-gif/BenfordsLaw_800.gif" alt="Benford's Law distribution" title="Benford's Law distribution"/></a></p>
<p>Turns out that this simple, elegant, and purple plot describes the probability distribution of the first digit for many naturally occurring distributions. In other words, about 30% of the leading digits of the numbers in a large variety of tables or listings will tend to be 1; 17.6% will tend to be 2; and so on down to around 4.6% of the numbers beginning with 9.</p>
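<p>Those percentages come from the standard Benford formula, in which the probability of leading digit d is log10(1 + 1/d). A short Python sketch that reproduces them (the function name is my own):</p>

```python
from math import log10

def benford_probability(digit):
    """Probability that the leading digit equals `digit` under Benford's Law."""
    return log10(1 + 1 / digit)

# Print the expected share for each leading digit from 1 to 9.
for digit in range(1, 10):
    print(digit, f"{benford_probability(digit):.1%}")
```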
<p>The phenomenon has an interesting history (at least as far as I could tell by going a few clicks past the <a href="http://en.wikipedia.org/wiki/Benford's_law">Wikipedia article</a>). It was originally described in an <a href="http://dx.doi.org/10.2307/2369148">1881 paper</a> (subscription required) by <a href="http://en.wikipedia.org/wiki/Simon_Newcomb">Simon Newcomb</a>, an astronomer who noticed that the first few pages in books of logarithm tables were worn much more heavily than later pages. The eponymous physicist <a href="http://en.wikipedia.org/wiki/Frank_Benford">Frank Benford</a> then re-discovered the distribution in a <a href="http://www.tphill.net/publications/BENFORD%20PAPERS/TheFirstDigitPhenomenonAmericanScientist1996.pdf">1938 paper</a> where he tested it on a number of different data sources. More recently, the mathematician <a href="http://www.math.gatech.edu/~hill/">Ted Hill</a> has provided a <a href="http://www.tphill.net/publications/BENFORD%20PAPERS/statisticalDerivationSigDigitLaw1995.pdf">more sophisticated proof</a> of the origins of the phenomenon and pioneered its use as a diagnostic to discover irregularities in data such as tax and election returns (Tax season pro-tip: don&#8217;t fake a bunch of receipts that start with the number 9).</p>
<p>In any case, at Mike&#8217;s suggestion, I set out to see whether I could re-re-discover Benford&#8217;s Law with Crowdflower. The task I designed consisted of one question: &#8220;pick any number greater than zero.&#8221; In a little over an hour, I had about 500 valid responses and was off to the races to see what the resulting distribution of first digits looked like.</p>
<p>The results were all over the place. Here&#8217;s what they look like in a scatter plot with a log-scaled y-axis:</p>
<p><a href='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-scatter.png' title='benford-scatterplot'><img class="centered" src='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-scatter.png' alt='benford-scatterplot' /></a></p>
<p>And here I plotted all of the numbers sorted by magnitude and leading digit:</p>
<p><a href='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-raw.png' title='benford-raw_binned'><img class="centered" src='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-raw.png' alt='benford-raw_binned' /></a></p>
<p>The height of each tower of numbers captures the frequency of that leading digit. The size of the numerals corresponds to the magnitude of the number and the color corresponds roughly to its order of magnitude (red = big). Within each digit, the numbers are sorted along the Y-axis.</p>
<p>Even though it&#8217;s too small to really be legible, you can probably pick out a few interesting things from that graphic, such as the extraordinary frequency of the number 1 (it occurred 67 times) or the size of the biggest number (2.345 x 10^13). The mean of all submissions was 4.72 x 10^10 while the median was 7. In other words, this little experiment resulted in a seriously skewed distribution.</p>
<p>But what about Benford&#8217;s Law? The height of the towers of numbers suggests that this data corresponded pretty well to the histogram at the beginning of my post. To supplement the visual, here&#8217;s a table showing the raw frequency and percentages by first digits (note, the first digits are at the top and are written out as words).</p>
<p><a href='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfords_law_data_table.png' title='benford-data-summary-table'><img class="centered" src='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfords_law_data_table.png' alt='benford-data-summary-table' /></a></p>
<p>It&#8217;s still hard to tell how well that does or does not match up with the distribution in the histogram above. There are some nice statistical tests that could help us here, but for the sake of a decent blog post, I went in search of a better way to compare the distributions visually. With that in mind, another plot:</p>
<p><a href='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-smooth.png' title='benford-smoothed'><img class="centered" src='http://blog.crowdflower.com/wp-content/uploads/2010/04/benfordslaw-smooth.png' alt='benford-smoothed' /></a></p>
<p>Here I superimposed the percentage values in table 1 (as <font color="A0522D">brown dots</font>) on a density curve of the original Benford&#8217;s Law values (shaded <font color="5F9EA0">in blue</font>). Then I also added a brown (loess) smoothed curve along with some gray confidence interval shading (thank you, <a href="http://had.co.nz/ggplot2">Hadley Wickham</a>) to capture the overall trend of the Crowdflower data. As you can see, the fit looks pretty good &#8211; I think Benford would be proud. Or at least maybe Mike Love. Maybe.</p>
<p>Admittedly, I may have cheated a bit in creating this second plot. Since the variable presented along the X axis (leading digit) consists of ordered categories, the smoothed lines are a bit of a representational stretch. Nevertheless, I&#8217;m satisfied with the way the resulting graphic visualizes the relationship between the probability function underlying Benford&#8217;s Law and the distribution of the Crowdflower responses. If anybody out there is inclined to pursue more interesting visualizations and/or rigorous tests to verify the fit, here&#8217;s a <a href="http://dl.dropbox.com/u/564931/Crowdflower_pick-a-number_data.csv">copy of the Crowdflower data</a> I used.</p>
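<p>For anyone inclined to try a more rigorous check than the plot, here is a minimal Python sketch of a chi-square goodness-of-fit statistic against Benford&#8217;s Law. It is not the R code behind the graphics above; the function names are hypothetical, and it assumes the submissions have already been loaded as a list of positive numbers.</p>

```python
import math
from collections import Counter


def benford_expected():
    """Benford's Law: P(d) = log10(1 + 1/d) for leading digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a nonzero number (hypothetical helper)."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation puts the first significant digit first,
    # e.g. 2.345e13 is formatted as '2.345000000000000e+13'.
    return int(f"{x:.15e}"[0])


def benford_chi_square(values):
    """Chi-square goodness-of-fit statistic against Benford (8 degrees of freedom)."""
    counts = Counter(leading_digit(v) for v in values)
    n = sum(counts.values())
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
```

<p>A small statistic means the observed first digits sit close to the Benford proportions. With eight degrees of freedom, values above roughly 15.5 would suggest a poor fit at the conventional 0.05 level.</p>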
<p><small><em>I do all of my analysis and graphs in <a href="http://www.r-project.org/">R</a> using <a href="http://had.co.nz/ggplot2">ggplot2</a>. As always, feel free to send requests for code to: aaron [at] crowdflower [dot] com and be sure to tip your server.</em></small></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2010/04/crowdsourcing-benfords-law/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Ask A Stupid Question Part 2: Forced Choice vs. Checkboxes</title>
		<link>http://blog.crowdflower.com/2010/03/ask-a-stupid-question-part-ii-forced-choice-vs-checkboxes/</link>
		<comments>http://blog.crowdflower.com/2010/03/ask-a-stupid-question-part-ii-forced-choice-vs-checkboxes/#comments</comments>
		<pubDate>Mon, 01 Mar 2010 19:05:17 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[Experiments]]></category>
		<category><![CDATA[methodology]]></category>
		<category><![CDATA[question formats]]></category>

		<guid isPermaLink="false">http://blog.crowdflower.com/2010/03/ask-a-stupid-question-part-ii-forced-choice-vs-checkboxes/</guid>
		<description><![CDATA[What kinds of questions produce the best results in crowdsourcing tasks and surveys? To answer that question, I bring you another geeked-out blog post in which I pit the multiple choice (or forced choice) question against its bitter arch-rival, the check-all-that-apply (or checkbox) question. Both kinds of formatting can be useful when you want people to [...]]]></description>
			<content:encoded><![CDATA[<p>What kinds of questions produce the best results in crowdsourcing tasks and surveys? To answer that question, I bring you another geeked-out blog post in which I pit the multiple choice (or forced choice) question against its bitter arch-rival, the check-all-that-apply (or checkbox) question.</p>
<p>Both kinds of formatting can be useful when you want people to identify or categorize something(s) in a list. Check-all-that-apply seems to offer the added bonus of easily fitting an entire list into a single question, thereby requiring less mental effort from respondents and (presumably) reducing response times. How do the two kinds of questions compare on answer quality, though?</p>
<p>In my <a href="http://blog.crowdflower.com/2009/12/ask-a-stupid-question">first post</a> a few weeks ago, I talked about some of the reasons why response scales matter when you&#8217;re designing multiple choice questions for a survey or data collection task. In the comments, <a href="http://blog.crowdflower.com/2009/12/ask-a-stupid-question/#comment-1853">michael</a> raised an interesting point:</p>
<blockquote><p><em>Why use a scale at all? I would make those types of questions always open ended. Anyone who takes the survey has to think about how many hours they spend online anyway. That’s the first step. The second is fitting their estimate in one of the categories. Seems like unnecessary work for the participants.</em></p></blockquote>
<p>You can check out the <a href="http://blog.crowdflower.com/2009/12/ask-a-stupid-question/#comments">rest of the thread</a> to see michael&#8217;s idea in context as well as how other people (including me) replied.</p>
<p>The discussion got me thinking more about multiple choice questions and some of the costs and benefits that they entail in comparison to other types of questions. As luck would have it, a few of the other questions that I included in my original experiment can provide additional grist for the mill.</p>
<p><span id="more-189"></span></p>
<p><strong>Question Format Smackdown!</strong></p>
<p>In order to test how each format affects responses, I asked workers in the Crowdlabor pools one (and only one) version of the following:</p>
<p><a title="checkboxes" href="http://blog.crowdflower.com/wp-content/uploads/2010/01/checkbox_screenshot.png"><img src="http://blog.crowdflower.com/wp-content/uploads/2010/01/checkbox_screenshot.png" alt="checkboxes" /></a></p>
<p><a title="forced choice" href="http://blog.crowdflower.com/wp-content/uploads/2010/01/forced_choice-screenshot.png"><img class="centered  alignnone" src="http://blog.crowdflower.com/wp-content/uploads/2010/01/forced_choice-screenshot.png" alt="forced choice" /></a></p>
<p>As you can see, the forced choice version is a little clunky because I had to present each item as its own question. Nevertheless, there&#8217;s no substantive difference between the two versions other than the answer choice format, which makes it possible to compare the results.</p>
<p>I should explain why I included several <em>extremely popular</em> websites (Google) among the answer options as well as some slightly less well-traveled, but still popular sites (Times of India, New York Times). Basically, I wanted to avoid having too many people who had visited all of the sites or none of them. If a lot of responses fell into either extreme, it would have been impossible to estimate the extent to which the two formats affected the outcomes.</p>
<p>As with my response scale example, the groups that saw the two versions of the question did not vary widely on potentially confounding demographic covariates such as gender or country of residence.</p>
<p>Here&#8217;s a table showing the number of positive responses per format per site:</p>
<p><a title="sites-visited-table1" href="http://blog.crowdflower.com/wp-content/uploads/2010/01/sites-visited-table1.png"><img class="centered alignnone" src="http://blog.crowdflower.com/wp-content/uploads/2010/01/sites-visited-table1.png" alt="sites-visited-table1" /></a></p>
<p>And a plot to visualize the variations per site as a percentage of total responses per question format:</p>
<p><a href="http://blog.crowdflower.com/wp-content/uploads/2010/03/sites-visited_points2.png"><img class="alignnone size-full wp-image-261" title="visited points" src="http://blog.crowdflower.com/wp-content/uploads/2010/03/sites-visited_points2.png" alt="" width="693" height="462" /></a></p>
<p>With one exception (Orkut), forced choice formatting led more people to say they had visited each site in the list.</p>
<p><strong>Estimating the Effect</strong></p>
<p>In order to get a precise measurement of the effect of forced choice format vs. checkbox format, I reshaped the data into cumulative counts and compared the distributions of the total number of sites visited among people who saw the checkbox and forced choice versions, respectively. Here&#8217;s the resulting table:</p>
<p><img class="centered aligncenter" src="http://blog.crowdflower.com/wp-content/uploads/2010/01/sites-visited-table2.png" alt="sites-visited-table2" /></p>
<p>A pair of density plots represents the same information in graphical form:</p>
<p><a href="http://blog.crowdflower.com/wp-content/uploads/2010/03/sites-visited_density2.png"><img class="alignnone size-full wp-image-262" title="visited density" src="http://blog.crowdflower.com/wp-content/uploads/2010/03/sites-visited_density2.png" alt="" width="693" height="1040" /></a></p>
<p>On each plot, I&#8217;ve highlighted the minimum, maximum, and mean (sparklines-style). The heavy left-leaning skew of the checkbox curve contrasts nicely with the slightly right-leaning shape of the forced choice curve.</p>
<p>From both the table and the density plots, it&#8217;s easy to see that the two question formats appear to have caused a substantial difference. The difference in means between the two distributions suggests that a respondent who saw the forced or multiple choice format identified (on average) one <em>additional</em> site they had visited that their peers who saw checkboxes did not identify.</p>
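<p>To make the comparison of means concrete, here is a minimal Python sketch, assuming each response is stored as a dictionary of site names mapped to true/false answers. The site list, data layout, and function names are hypothetical illustrations rather than the format of the actual dataset.</p>

```python
from statistics import mean

SITES = ["google", "nytimes", "orkut"]  # hypothetical site list


def sites_visited(response):
    """Count how many sites one respondent reported visiting."""
    return sum(1 for site in SITES if response.get(site, False))


def mean_difference(forced_choice, checkbox):
    """Mean sites visited under forced choice minus the mean under checkboxes."""
    fc = mean(sites_visited(r) for r in forced_choice)
    cb = mean(sites_visited(r) for r in checkbox)
    return fc - cb
```

<p>A result near 1 would correspond to the &#8220;one additional site&#8221; gap described above. A proper analysis would also attach a confidence interval or a significance test to that difference.</p>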
<p><strong>Should I Hate Checkboxes?</strong></p>
<p>The demographic profile of the two groups was pretty similar, so the disparity in the results was almost certainly due to the question format. But why does the question format have such a powerful effect?</p>
<p>When they were not forced to provide a response to each answer choice, checkbox respondents either failed to notice some of the choices or ignored them. As with numerical response scales, this is yet another example of how mental shortcuts can compromise data quality.</p>
<p>This time around, the solution is pretty simple. All things being equal, you&#8217;re better off using forced choice formatting when you care about precise results. That said, things are never really equal and there will always be some reason you might want to consider making life faster/simpler for the people answering the questions. For example, if you&#8217;re asking people to choose tags or labels for something, the precision of each response might not matter very much and checkboxes would work just fine.</p>
<p><em>I used R for all the analysis and plots. I created the first plot using Hadley Wickham&#8217;s <a href="http://had.co.nz/ggplot2">ggplot2</a> package. Contact me with requests for data or code at aaron [at] doloreslabs [dot] com and leave your questions, complaints, or suggestions below.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2010/03/ask-a-stupid-question-part-ii-forced-choice-vs-checkboxes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Not-quite-live-blog: Jonathan Zittrain on &#8220;Minds For Sale&#8221;</title>
		<link>http://blog.crowdflower.com/2009/12/not-quite-live-blog-jonathan-zittrain-on-minds-for-sale/</link>
		<comments>http://blog.crowdflower.com/2009/12/not-quite-live-blog-jonathan-zittrain-on-minds-for-sale/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 06:03:50 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Berkman Center]]></category>
		<category><![CDATA[crowdsourcing]]></category>
		<category><![CDATA[ethics]]></category>
		<category><![CDATA[Harvard]]></category>
		<category><![CDATA[human computing]]></category>
		<category><![CDATA[talks]]></category>
		<category><![CDATA[videos]]></category>
		<category><![CDATA[Zittrain]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2009/12/not-quite-live-blog-jonathan-zittrain-on-minds-for-sale/</guid>
		<description><![CDATA[Jonathan Zittrain, Professor of Law and Faculty Co-director (and co-founder) of the Berkman Center for Internet and Society at Harvard University, gave a presentation at the Computer History Museum in Mountain View about a month ago that ought to be required viewing for anyone interested in Cloudlabor and Crowdsourcing. Drawing examples from all over the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://cyber.law.harvard.edu/people/jzittrain">Jonathan Zittrain</a>, Professor of Law and Faculty Co-director (and co-founder) of the <a href="http://cyber.law.harvard.edu">Berkman Center for Internet and Society</a> at Harvard University, gave a presentation at the Computer History Museum in Mountain View about a month ago that ought to be required viewing for anyone interested in Cloudlabor and Crowdsourcing.</p>
<p>Drawing examples from all over the Internet &#8211; including <a href="http://crowdflower.com/general/givework">a certain iPhone app</a> that you may have heard of &#8211; Zittrain raises some serious (and some seriously entertaining) questions about ethical and legal aspects of distributed human computing. </p>
<p>Straight from the <a href="http://www.youtube.com/user/BerkmanCenter">Berkman Center YouTube channel</a>, here&#8217;s the full video (which is also <a href="http://cyber.law.harvard.edu/events/2009/11/berkwest">available for download</a> under a <a href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 license</a> from the President and Fellows of Harvard College):</p>
<center><object width="320" height="260"><param name="movie" value="http://www.youtube.com/v/Dw3h-rae3uo&#038;hl=en_US&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/Dw3h-rae3uo&#038;hl=en_US&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></center>
<p>Zittrain focuses on the potential alienation and opportunities for abuse that can arise with the growth of distributed online production. He also contemplates the thin line that separates exploitation from volunteering in the context of online communities and collaboration.</p>
<p>I enjoyed his analysis and the discussion afterwards, although I suspect that some of the conversation with the audience might get lost in the video. As with Zittrain&#8217;s most recent book, <em><a href="http://www.amazon.com/Future-Internet-How-Stop/dp/0300124872">The Future of the Internet and How to Stop It</a></em>, this is some of the best thinking about life online that you&#8217;ll find anywhere.</p>
<p>Zittrain has also published an abbreviated portion of his argument in <em>Newsweek</em> under the slightly more extreme title <a href="http://www.newsweek.com/id/225629">&#8220;Work the New Digital Sweatshops.&#8221;</a> </p>
<p>I find a lot of what Zittrain has to say compelling; however, I do wonder if the efforts of ReCaptcha-spammers and sock-puppeteers to exploit Crowdsourcing markets will ultimately prove successful. I also wonder whether the imposition of labor regulations in these contexts makes sense or would prove effective. Should my decision to kill time or make a few extra bucks by filtering images be subject to labor law? What about the ability of other people to offer money for distasteful and perhaps unethical (but usually not illegal) micro-tasks? </p>
<p>It may be a few years before anyone really understands if Crowdsourcing lends itself to unique types of market failure along these lines, but Zittrain and others such as <a href="http://www.ics.uci.edu/~lirani/">Lily Irani</a> and <a href="http://www.aaronkoblin.com/">Aaron Koblin</a> are doing us all a favor by asking some of the most important questions early in the game.</p>
<p><small><small><em> Full disclosure: the author of this post is affiliated with Harvard and the Berkman Center for Internet and Society, where he was a fellow during 2008-2009. While he doesn&#8217;t think that his affiliation influences his opinions about Zittrain&#8217;s work, it does mean that he&#8217;s very pleased not to be spending another winter in Cambridge this year.</em></small></small></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2009/12/not-quite-live-blog-jonathan-zittrain-on-minds-for-sale/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Ask a Stupid Question</title>
		<link>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/</link>
		<comments>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/#comments</comments>
		<pubDate>Wed, 16 Dec 2009 00:01:23 +0000</pubDate>
		<dc:creator>Aaron Shaw</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[data collection]]></category>
		<category><![CDATA[experiment]]></category>
		<category><![CDATA[garbage-in garbage-out]]></category>
		<category><![CDATA[social science]]></category>
		<category><![CDATA[surveys]]></category>

		<guid isPermaLink="false">http://blog.doloreslabs.com/2009/12/ask-a-stupid-question/</guid>
		<description><![CDATA[What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design since it&#8217;s a relevant, but often overlooked, area of Crowdsourcing work. You can ask “the crowd” all kinds of questions, but [...]]]></description>
			<content:encoded><![CDATA[<p>What makes a bad survey question and why does it matter? I thought I&#8217;d use my first blog posts as Dolores Labs&#8217;s friendly neighborhood social scientist to talk a little bit about question design since it&#8217;s a relevant, but often overlooked, area of Crowdsourcing work.</p>
<p>You can ask “the crowd” all kinds of questions, but if you don&#8217;t stop to think about the best way to ask your question, you&#8217;re likely to get unexpected and unreliable results. You might call it the <a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out">GIGO</a> theory of research design.</p>
<p>To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in Crowdflower&#8217;s labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.</p>
<p><strong>An Example: Response Scales</strong></p>
<p>The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people&#8217;s answers. Here are two versions of the same question that I posted to Crowdflower:</p>
<blockquote><p><small>Low Scale Version:<br />
<code>About how many hours do you spend online per day?<br />
(a) 0 – 1 hour<br />
(b) 1 – 2 hours<br />
(c) 2 – 3 hours<br />
(d) More than 3 hours</code></small></p></blockquote>
<blockquote><p><small>High Scale Version:<br />
<code>About how many hours do you spend online per day?<br />
(a) 0 – 3 hours<br />
(b) 3 – 6 hours<br />
(c) 6 – 9 hours<br />
(d) More than 9 hours</code></small></p></blockquote>
<p>Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.</p>
<p>So what did people say? Here&#8217;s a pair of histograms breaking the responses up by the two versions of the question:</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online-raw_bins.png" alt="boring histograms: hours online by scale" /></p>
<p>I didn&#8217;t label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the Crowdflower worker pool tend to spend more than three hours per day online (whoa, no way&#8230;). </p>
<p>At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).</p>
<p>To look at that comparison more closely, let&#8217;s break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours, and (2) the percentage of responses that were more than 3 hours.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/dotplots-2bins.png" alt="hours online in two bins" /></p>
<p>The gap between the heights of the two orange points (high scale) is much bigger than the corresponding gap between the heights of the two blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you&#8217;re a stats nerd, the <a href="http://ccnmtl.columbia.edu/projects/qmss/the_chisquare_test/about_the_chisquare_test.html">Chi-square test</a> showed that this variation was significant with a p-value &lt; 0.001, so the difference was almost certainly not due to chance.</p>
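<p>For the stats nerds, here is a minimal Python sketch of a Pearson chi-square test on a 2x2 table (scale version by under or over 3 hours). It is an illustration only, not the R code used for this post: the function name is hypothetical, and it applies no continuity correction.</p>

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table of counts (no continuity correction).

    table = [[a, b], [c, d]]; returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if scale version and answer were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

<p>Each row of the table would hold the counts for one scale version, and each column the number of answers under or over 3 hours. A p-value below 0.001 is the threshold reported above.</p>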
<p>But maybe collapsing the responses like this is a little too coarse and you&#8217;d still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects – a comparison of the cumulative percentage of responses – and the differences are even clearer.</p>
<p><img class="centered" src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/cumulative_dotplots.png" alt="hours online - cumulative bins" /></p>
<p>That gap between the blue and the orange lines at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!</p>
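<p>The cumulative view is simple to compute. Here is a minimal Python sketch, assuming the answer counts for one scale are listed in scale order; the function name is hypothetical.</p>

```python
from itertools import accumulate


def cumulative_percentages(counts):
    """Turn per-bin counts (in scale order) into running percentages of the total."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]
```

<p>On the low scale, the third cumulative value is the share of people who reported less than 3 hours. On the high scale, that same share is the first value, which is why “Less than 3 hours” is the one point where the two lines can be compared directly.</p>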
<p><strong>Explaining the Gap</strong></p>
<p>If you&#8217;re thinking that the differences between the scales alone can&#8217;t explain why all of these results are so skewed, that&#8217;s a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit. </p>
<p>But what explains why scale questions can bias people&#8217;s responses so heavily? Survey researchers call this kind of behavior <a href="http://en.wikipedia.org/wiki/Satisficing#Survey_Taking">satisficing</a> &#8211; it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we&#8217;re not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don&#8217;t know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn&#8217;t really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.</p>
<p>These results illustrate a sticky problem: it&#8217;s possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.</p>
<p><strong>Okay, It&#8217;s Broken. Now How Do I Fix It?</strong></p>
<p>So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.</p>
<p>To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing Crowdflower workers to type in their responses. Here&#8217;s a density plot of those responses with the minimum, maximum, and mean responses highlighted (<a href="http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR">sparklines</a> style):</p>
<p><img src="http://blog.doloreslabs.com/wp-content/uploads/2009/12/hours_online_open.png" alt="hours online - open ended" /></p>
<p>While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers&#8217; judgments.</p>
<p>Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it&#8217;s <a href="http://en.wikipedia.org/wiki/Turtles_all_the_way_down">turtles all the way down</a>. I&#8217;d be willing to bet that their best guesses are pretty good, but if a big policy decision were riding on this question, I&#8217;d try to supplement my little survey with some other data sources. No matter what, there&#8217;s no perfect solution.</p>
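<p>As a rough illustration of re-centering a scale around open-ended answers, here is a minimal Python sketch. The function names and the equal-width binning rule are assumptions made for this example, not an established survey-design procedure.</p>

```python
from statistics import mean, median, mode


def summarize_hours(hours):
    """Central-tendency summary for open-ended 'hours online' answers."""
    return {"mean": mean(hours), "median": median(hours), "mode": mode(hours)}


def centered_scale(center, width, bins=4):
    """Build `bins` answer ranges of equal `width` so `center` falls mid-scale.

    The last bin is open-ended ("More than ..."), mirroring the scales above.
    """
    start = max(0, center - width * (bins / 2))
    edges = [start + i * width for i in range(bins)]
    labels = [f"{lo:g} to {lo + width:g} hours" for lo in edges[:-1]]
    labels.append(f"More than {edges[-1]:g} hours")
    return labels
```

<p>Calling <code>centered_scale(6, 3)</code> with a center of 6 hours and 3-hour bins reproduces the four ranges of the high scale version.</p>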
<p><strong>So what?</strong></p>
<p>The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you&#8217;re not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don&#8217;t usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.</p>
<p>I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at <em>aaron [at] doloreslabs [dot] com</em> with questions, comments, corrections and requests for data/code. All of these plots were created using <a href="http://www.r-project.org/">R</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.crowdflower.com/2009/12/ask-a-stupid-question/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

