<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.3.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: AMT is fast, cheap, and good for machine learning data</title>
	<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/</link>
	<description></description>
	<pubDate>Thu, 11 Mar 2010 22:02:01 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.3</generator>
		<item>
		<title>By: The Bates Method</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2429</link>
		<dc:creator>The Bates Method</dc:creator>
		<pubDate>Fri, 26 Feb 2010 22:46:23 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2429</guid>
		<description>Kudos to whoever made this post. Very good stuff, you really know what you're talking about... signed up for your feed. Thanks.</description>
		<content:encoded><![CDATA[<p>Kudos to whoever made this post. Very good stuff, you really know what you&#8217;re talking about&#8230; signed up for your feed. Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rhett Smeenk</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2329</link>
		<dc:creator>Rhett Smeenk</dc:creator>
		<pubDate>Sat, 13 Feb 2010 22:07:32 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2329</guid>
		<description>I have been a reader for a long while, but this is my first time as a commenter.  I just wanted to let you know that this has been / is my favorite update of yours!  Keep up the good work and I'll keep on checking back.  If you'd be interested in swapping blogroll links with me, my website is &lt;a href="http://www.expertinfopedia.com/the-monavie-scam-is-it-fact-or-fiction/" rel="nofollow"&gt;MonaVie Scam&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>I have been a reader for a long while, but this is my first time as a commenter.  I just wanted to let you know that this has been / is my favorite update of yours!  Keep up the good work and I&#8217;ll keep on checking back.  If you&#8217;d be interested in swapping blogroll links with me, my website is <a href="http://www.expertinfopedia.com/the-monavie-scam-is-it-fact-or-fiction/" rel="nofollow">MonaVie Scam</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ruben</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2240</link>
		<dc:creator>Ruben</dc:creator>
		<pubDate>Wed, 03 Feb 2010 15:39:39 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2240</guid>
		<description>You have a nice blog here; why didn't I know about it before? Well, I've bookmarked it now and will drop by more often in the near future. In any case, I'm already looking forward to your new articles.</description>
		<content:encoded><![CDATA[<p>You have a nice blog here; why didn&#8217;t I know about it before? Well, I&#8217;ve bookmarked it now and will drop by more often in the near future. In any case, I&#8217;m already looking forward to your new articles.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anastasia</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2169</link>
		<dc:creator>Anastasia</dc:creator>
		<pubDate>Sat, 23 Jan 2010 07:48:03 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2169</guid>
		<description>Hey, this is great. I just found exactly what I was looking for.</description>
		<content:encoded><![CDATA[<p>Hey, this is great. I just found exactly what I was looking for.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: koi fish designs</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2166</link>
		<dc:creator>koi fish designs</dc:creator>
		<pubDate>Fri, 22 Jan 2010 06:28:23 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-2166</guid>
		<description>I'm really impressed with your work. I'm glad to have read this article. It was a great way of putting forward your ideas on this subject.</description>
		<content:encoded><![CDATA[<p>I&#8217;m really impressed with your work. I&#8217;m glad to have read this article. It was a great way of putting forward your ideas on this subject.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-1495</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Wed, 16 Sep 2009 14:59:42 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-1495</guid>
		<description>Hi!
This is my first comment on your blog; I very much like reading it.</description>
		<content:encoded><![CDATA[<p>Hi!<br />
This is my first comment on your blog; I very much like reading it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brendan O'Connor</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-621</link>
		<dc:creator>Brendan O'Connor</dc:creator>
		<pubDate>Wed, 22 Oct 2008 07:32:31 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-621</guid>
		<description>Oh, I just realized I've used item difficulty modelling in a different context, to figure out the tradeoff between more annotations and accuracy under naive voting:

With a gold standard set of true examples, get a high #annos/example, and get per-item error rates (1-sensitivity) err_i.  This is to make a high recall system, so "yes" decisions must be unanimous (single dissenter causes "no" decision), meaning that all annotators must make an error to get an aggregate error.  Then for k annotations per example the expected number of errors is \sum_i (err_i)^k.
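
A minimal sketch of this calculation in Python, not the code used for the original analysis. The function names and the err values are hypothetical, made up only for illustration. For comparison it also computes the figure for a pool in which every item shares the same (mean) error rate.

```python
def expected_errors(item_error_rates, k):
    """Expected number of aggregate errors with k annotations per item
    when a "yes" must be unanimous: every annotator has to err on an item."""
    return sum(err ** k for err in item_error_rates)


def expected_errors_uniform(item_error_rates, k):
    """Same quantity if every item shared the mean error rate: n * (err)^k."""
    n = len(item_error_rates)
    mean_err = sum(item_error_rates) / n
    return n * mean_err ** k


# Hypothetical per-item error rates: mostly easy items plus a few hard ones.
errs = [0.05] * 8 + [0.6, 0.8]
for k in (1, 3, 5):
    print(k, round(expected_errors(errs, k), 4), round(expected_errors_uniform(errs, k), 4))
```

For k = 1 the two figures agree. For larger k the per-item sum is at least as large as the uniform figure, because x^k is convex on [0, 1].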

This can be viewed as a model where all workers have equal capability, but items have a difficulty parameter.  It gives a more pessimistic error estimate than the most naive model, where all items have the same difficulty and the expected number of errors is n * (err)^k.  I believe the estimates differ because hard items don't get solved too well by throwing more and more annotators at them.</description>
		<content:encoded><![CDATA[<p>Oh, I just realized I&#8217;ve used item difficulty modelling in a different context, to figure out the tradeoff between more annotations and accuracy under naive voting:</p>
<p>With a gold standard set of true examples, get a high #annos/example, and get per-item error rates (1-sensitivity) err_i.  This is to make a high recall system, so &#8220;yes&#8221; decisions must be unanimous (single dissenter causes &#8220;no&#8221; decision), meaning that all annotators must make an error to get an aggregate error.  Then for k annotations per example the expected number of errors is \sum_i (err_i)^k.</p>
<p>This can be viewed as a model where all workers have equal capability, but items have a difficulty parameter.  It gives a more pessimistic error estimate than the most naive model, where all items have the same difficulty and the expected number of errors is n * (err)^k.  I believe the estimates differ because hard items don&#8217;t get solved too well by throwing more and more annotators at them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-611</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Sun, 28 Sep 2008 17:19:59 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-611</guid>
		<description>I do think there's more headroom with fewer annos.  What I'm interested in is getting the posterior estimate of accuracy so we can decide which items need more annotation.

I have exactly the same feeling about difficulty, but I haven't been able to fit any of the logistic models with a latent difficulty predictor.  The basic model is just p(anno[i,j]) = inverseLogit(accuracy[j] - difficulty[i]), just like the item response model.  I'm going to send some mail to the epidemiologists fitting these models -- some of the more recent papers discuss the instability of the model fitting.  Even with the true category known, I'm having trouble fitting the models.
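
A minimal sketch of simulating data from that basic model, reading p(anno[i,j]) as the probability that annotator j labels item i correctly (an assumption about the notation). The function names and parameter values are hypothetical, and this is not the fitting code discussed here.

```python
import math
import random


def inverse_logit(x):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_annotations(accuracy, difficulty, seed=0):
    """Draw correct[i][j] = True when annotator j labels item i correctly,
    with probability inverse_logit(accuracy[j] - difficulty[i])."""
    rng = random.Random(seed)
    return [
        [inverse_logit(a - d) > rng.random() for a in accuracy]
        for d in difficulty
    ]


# Hypothetical parameter values, chosen only for illustration.
accuracy = [2.0, 1.0, 0.0]      # one entry per annotator j
difficulty = [-1.0, 0.0, 1.5]   # one entry per item i
correct = simulate_annotations(accuracy, difficulty)
for i, row in enumerate(correct):
    print(i, row)

# Shifting every accuracy and every difficulty by the same constant leaves
# all probabilities unchanged: only the differences enter the model.
print(inverse_logit(2.0 - 1.5), inverse_logit(3.0 - 2.5))
```

Simulated data like this gives known accuracy and difficulty values to compare against whatever a fitted model recovers.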

The problem seems to be that there are two ways to account for variability in an annotation: either the item is difficult or the annotators are error-prone.  If I crank the prior variance on difficulty down close to zero, it fits, but there's not much of a difficulty effect.  If I let the variance even approach 1, different chains don't mix well at all and I just can't infer the difficulties reliably.  I've tried this with both real and simulated data.

I did manage to fit the binary mixture of easy/hard items, but that seems less relevant with more and more annotators and especially with very noisy annotators.</description>
		<content:encoded><![CDATA[<p>I do think there&#8217;s more headroom with fewer annos.  What I&#8217;m interested in is getting the posterior estimate of accuracy so we can decide which items need more annotation.</p>
<p>I have exactly the same feeling about difficulty, but I haven&#8217;t been able to fit any of the logistic models with a latent difficulty predictor.  The basic model is just p(anno[i,j]) = inverseLogit(accuracy[j] - difficulty[i]), just like the item response model.  I&#8217;m going to send some mail to the epidemiologists fitting these models &#8212; some of the more recent papers discuss the instability of the model fitting.  Even with the true category known, I&#8217;m having trouble fitting the models.</p>
<p>The problem seems to be that there are two ways to account for variability in an annotation: either the item is difficult or the annotators are error-prone.  If I crank the prior variance on difficulty down close to zero, it fits, but there&#8217;s not much of a difficulty effect.  If I let the variance even approach 1, different chains don&#8217;t mix well at all and I just can&#8217;t infer the difficulties reliably.  I&#8217;ve tried this with both real and simulated data.</p>
<p>I did manage to fit the binary mixture of easy/hard items, but that seems less relevant with more and more annotators and especially with very noisy annotators.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim Converse</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-606</link>
		<dc:creator>Tim Converse</dc:creator>
		<pubDate>Thu, 25 Sep 2008 04:21:02 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-606</guid>
		<description>This is a really nice paper  - the section on judge bias was helpful for us at Powerset where we worry about some of the same AM Turk data-quality issues.  Thanks for putting it out there.</description>
		<content:encoded><![CDATA[<p>This is a really nice paper  - the section on judge bias was helpful for us at Powerset where we worry about some of the same AM Turk data-quality issues.  Thanks for putting it out there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brendano</title>
		<link>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-583</link>
		<dc:creator>brendano</dc:creator>
		<pubDate>Fri, 19 Sep 2008 08:31:48 +0000</pubDate>
		<guid>http://blog.crowdflower.com/2008/09/amt-fast-cheap-good-machine-learning/#comment-583</guid>
		<description>Bob, sorry I didn't respond earlier -- this is really exciting.  Congrats on getting the model to work so well.

Since we didn't include the exact numbers for worker modelling/correction in the paper, to make this complete, here are exact numbers from both of our experiments so far (plus a new one I just did):

Accuracy rates of Turker ensembles at matching the gold standard, RTE with 10 judgments/example:

89.7%  -  naïve voting
92.6%  -  hidden labels, inferred worker prior  [MAP via Gibbs sampler] &lt;a href="http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/" rel="nofollow"&gt;[link]&lt;/a&gt;
92.9%  -  hidden labels, uniform worker prior  [MAP via Gibbs sampler] &lt;a href="http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/" rel="nofollow"&gt;[link]&lt;/a&gt;
92.6%  -  known labels (LOO), add-1 worker prior  [MAP via direct inference]
92.9%  -  known labels (LOO), non-bayesian: drop workers with &lt;67% accuracy then naïve vote the rest

Yes, that last one is way less principled than the others, and wasn’t in the paper, though probably should have been.  (My fault; I’ll blame a deadline rush I suppose.)
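
A minimal sketch of that drop-then-vote heuristic, assuming judgments arrive as nested dicts; it is not the workers.py implementation. The function name, data layout, and toy data are hypothetical. Each worker is scored only on the other gold items (leave-one-out), so an item never helps select its own voters.

```python
from collections import Counter


def loo_threshold_vote(judgments, gold, threshold=0.67):
    """judgments: dict item -> dict worker -> label; gold: dict item -> label.
    For each item, score workers on all OTHER gold items, drop workers whose
    accuracy falls below the threshold, and majority-vote the rest."""
    predictions = {}
    for item, votes in judgments.items():
        kept = []
        for worker, label in votes.items():
            others = [(judgments[o][worker], gold[o])
                      for o in judgments if o != item and worker in judgments[o]]
            if not others:
                continue  # no held-out evidence about this worker
            acc = sum(lab == g for lab, g in others) / len(others)
            if acc >= threshold:
                kept.append(label)
        # Fall back to all votes if the filter removed everyone.
        pool = kept or list(votes.values())
        # Ties go to the label seen first in the pool.
        predictions[item] = Counter(pool).most_common(1)[0][0]
    return predictions


# Toy data: w1 is reliable, w2 and w3 are noisy, w4 judged only one item.
judgments = {
    "a": {"w1": 1, "w2": 0, "w3": 0},
    "b": {"w1": 0, "w2": 1, "w3": 1},
    "c": {"w1": 1, "w2": 0, "w3": 0, "w4": 0},
}
gold = {"a": 1, "b": 0, "c": 1}
print(loo_threshold_vote(judgments, gold))
```

On this toy data a plain majority vote would be wrong on every item, while the thresholded vote keeps only w1.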

Anyway, I think it’s impressive that the hidden label model does exactly as well as any system using known labels.

I’d be interested to see how things do for smaller numbers of annotators per example, since there’s more headroom down there -- e.g. at 3 anno/example, naïve accuracy is only 80.5%.

I’m also wondering if a model with per-item difficulty will start becoming useful.  When I do RTE myself, I feel there’s large variance in difficulty.  Of course this highlights why the BUGS approach makes progress much easier...


[[ 
More on these new experiments.  (1) The thresholding technique: In practice it will work a little less well, because I iterated the threshold parameter and took the best one, and the space was a little bumpy.  Though no lower than 92% for all reasonable thresholds; what’s key is eliminating the several noisy+prolific workers.  In fact, the 67% threshold drops nearly half of all judgments (!).  Though in practice with the AMT feedback cycle you wouldn’t pay for all those bad judgments, just pretest and eliminate bad workers early on, and pay good workers for more work.

(2) Unlike the paper, here I used leave-one-out instead of 20-fold cross-validation.  (I have a faster implementation now; I think I'm starting to hit the pain points of R, so I went back to Python ... http://github.com/brendano/dlanalysis/tree/master/workers.py )
]]</description>
		<content:encoded><![CDATA[<p>Bob, sorry I didn&#8217;t respond earlier &#8212; this is really exciting.  Congrats on getting the model to work so well.</p>
<p>Since we didn&#8217;t include the exact numbers for worker modelling/correction in the paper, to make this complete, here are exact numbers from both of our experiments so far (plus a new one I just did):</p>
<p>Accuracy rates of Turker ensembles at matching the gold standard, RTE with 10 judgments/example:</p>
<p>89.7%  -  naïve voting<br />
92.6%  -  hidden labels, inferred worker prior  [MAP via Gibbs sampler] <a href="http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/" rel="nofollow">[link]</a><br />
92.9%  -  hidden labels, uniform worker prior  [MAP via Gibbs sampler] <a href="http://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/" rel="nofollow">[link]</a><br />
92.6%  -  known labels (LOO), add-1 worker prior  [MAP via direct inference]<br />
92.9%  -  known labels (LOO), non-bayesian: drop workers with &lt;67% accuracy then naïve vote the rest</p>
<p>Yes, that last one is way less principled than the others, and wasn’t in the paper, though probably should have been.  (My fault; I’ll blame a deadline rush I suppose.)</p>
<p>Anyway, I think it’s impressive that the hidden label model does exactly as well as any system using known labels.</p>
<p>I’d be interested to see how things do for smaller numbers of annotators per example, since there’s more headroom down there -- e.g. at 3 anno/example, naïve accuracy is only 80.5%.</p>
<p>I’m also wondering if a model with per-item difficulty will start becoming useful.  When I do RTE myself, I feel there’s large variance in difficulty.  Of course this highlights why the BUGS approach makes progress much easier...</p>
<p>[[<br />
More on these new experiments.  (1) The thresholding technique: In practice it will work a little less well, because I iterated the threshold parameter and took the best one, and the space was a little bumpy.  Though no lower than 92% for all reasonable thresholds; what’s key is eliminating the several noisy+prolific workers.  In fact, the 67% threshold drops nearly half of all judgments (!).  Though in practice with the AMT feedback cycle you wouldn’t pay for all those bad judgments, just pretest and eliminate bad workers early on, and pay good workers for more work.</p>
<p>(2) Unlike the paper, here I used leave-one-out instead of 20-fold cross-validation.  (I have a faster implementation now; I think I'm starting to hit the pain points of R, so I went back to Python ... <a href="http://github.com/brendano/dlanalysis/tree/master/workers.py" rel="nofollow">http://github.com/brendano/dlanalysis/tree/master/workers.py</a> )<br />
]]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
