CrowdFlower Challenges Yelp: It’s a Nerd-Off

Dramatic Intro

It is high noon in business listing verification crowdsourcing land. We are throwing down the gauntlet. We are stepping in the ring. We are mixing our metaphors.

Undramatic Intro

Yelp engineers recently described their efforts to correct business listing data using Amazon Mechanical Turk. They tapped the services of 4,660 contributors; only 79 passed their quality assurance testing (that is, 1.7% of contributors were “trusted”), and the data they output was (very roughly) 80% accurate.

This smelled funny to us. Our business listing verification service routinely returns results above 97% accuracy. In fact, some of the most recognizable names in local search and business data pay for that service. (See a full report on 100,000 listings we did for a major search company for some typical figures.) Out of the last couple dozen crowdsourcing tasks we’ve run, the absolute minimum proportion of contributors who were “trusted” was 34%. But more importantly, our platform identifies these trusted contributors within minutes, meaning the best contributors get the job done quickly.

URL Precision Numbers (excerpted from an actual past report to client)

So. Why Did Yelp’s Project Struggle to Meet Enterprise Accuracy Standards?

It was not because of a lack of brains. The bios of the eight folks at Yelp who worked on this project are peppered with words like “Harvey Mudd” and “Computer Science” and “Stanford” and “PhD”. And they work at Yelp, which is, y’know, awesome.

And it was not because of a lack of good contributors. CrowdFlower has first-hand experience with well over one million contributors (many from the Mechanical Turk platform), and, when given the right tools and feedback, we’ve found them to be very accurate.

No, Really. Why?

It was in part because the Yelp team did not have the tools developed over the years by CrowdFlower’s crack engineering team.

  • Our contributors face ongoing tests as they complete work, and whenever they get an answer wrong they are told why, in real time; these tests are carefully calibrated to catch the most common types of contributor errors.
  • Our contributor UIs are the products of dozens of A|B tests run through CrowdFlower’s custom A|B testing infrastructure.
  • We use digital assembly line technology, chaining together many very simple tasks to yield a complex result. Below is a (somewhat outdated) representation of our business listing verification assembly line, where each blue box represents one discrete user task:

 

Digital Assembly Line
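
For readers who think in code, here is a minimal Python sketch of how two of the ideas in the list above could fit together: hidden test questions with immediate feedback, and a chain of very simple tasks whose answers build up one listing. This is not CrowdFlower’s actual platform code. The class and function names, the stage names (url_is_live, phone), the sample data, and the trust thresholds are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contributor:
    name: str
    tests_seen: int = 0
    tests_passed: int = 0

    def take_test(self, given: str, expected: str, reason: str) -> Optional[str]:
        """Score one hidden test question; return feedback text when wrong."""
        self.tests_seen += 1
        if given == expected:
            self.tests_passed += 1
            return None
        return f"Expected {expected!r}, not {given!r}: {reason}"

    def is_trusted(self, min_tests: int = 4, min_accuracy: float = 0.7) -> bool:
        # Both thresholds are made-up illustration values (assumption).
        if self.tests_seen < min_tests:
            return False
        return self.tests_passed / self.tests_seen >= min_accuracy


def trusted_majority(judgments):
    """Most common answer among trusted contributors, or None if there are none.

    `judgments` is a list of (Contributor, answer) pairs for one simple task.
    """
    votes = Counter(answer for who, answer in judgments if who.is_trusted())
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def run_assembly_line(listing, stages):
    """Run simple tasks in order; each stage fills in one field of the listing.

    `stages` is a list of (field_name, collect) pairs, where collect(listing)
    returns the judgments gathered for that task.
    """
    result = dict(listing)
    for field_name, collect in stages:
        result[field_name] = trusted_majority(collect(result))
    return result


# Hypothetical contributors: ann passes 3 of 4 tests, bob passes 1 of 4.
ann, bob = Contributor("ann"), Contributor("bob")
for given in ["yes", "yes", "no", "yes"]:
    ann.take_test(given, "yes", "the site lists this address")
for given in ["no", "no", "yes", "no"]:
    bob.take_test(given, "yes", "the site lists this address")

# Two hypothetical stages with canned answers standing in for live judgments.
stages = [
    ("url_is_live", lambda listing: [(ann, "yes"), (bob, "no")]),
    ("phone", lambda listing: [(ann, "555-0100"), (bob, "555-0199")]),
]
verified = run_assembly_line({"name": "Example Cafe"}, stages)
```

In this toy run only ann clears the trust thresholds, so only her answers count toward each stage’s result. A real system would also need to decide what to do when trusted contributors disagree or when no trusted answers exist for a stage; the sketch simply returns None in the latter case.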

 

Just as important… this project probably struggled to succeed because crowdsourcing to enterprise standard quality is incredibly hard! The business listing verification team at CrowdFlower only succeeded after a full year and tens of millions of human judgments. We worked quite a few 24-hour days, and our social skills atrophied from lack of use.

In the end, we have achieved a solution that is fast, accurate, and affordable to use – and continues to be improved upon.

We did all this work so others won’t have to. If you’re contemplating going it alone, give us a call! It’s not worth it! So many people love you!

The Challenge

On behalf of our contributors, CrowdFlower, and the business listing verification team, I’d like to offer a challenge to you, friendly neighborhood Yelp engineers.

Give us 5,000 business listings. If we can raise the precision of those listings to 95%+ and beat any machine learning algorithms you can build, you give us two engineers. No, just kidding. If we can do so, you’ll write about the experience on your engineering blog.

If we lose (actually, regardless of whether we win or lose), we’ll happily sit with you to show and tell you everything we’ve ever learned about business listing verification.

For more information and actual sample data from another Business Listing Verification project, check out this customer report.

6 Responses to “CrowdFlower Challenges Yelp: It’s a Nerd-Off”

  1. Baltimore Computer Consultant Rob Cox

    I had read about crowdsourcing and had always been curious about using Amazon’s Mechanical Turk (What a name for it). I did a few very small tests with it and quickly realized I was missing some way to make it work for my company. I wanted a 99% accurate marketing database of about 20,000 companies, but in the small tests I did, we never even approached the 80 percent figure of Yelp. I thought I could just tell the folks what I was looking for and let them go at it. Lesson learned. Is 99% accuracy even achievable, what with the economy and moves and mergers etc.? Maybe I just set the bar too high. You mention 97% accuracy in the post, but what data is in the listings you are talking about? Mine have about 30 fields of data.

  2. Greg Laughlin

    Thanks for commenting, Rob. In our experience, 99% is a bit too high a bar. There are too many confusing websites out there for fewer mistakes than that to occur. We do see 99% accuracy on certain attributes on some runs of data (e.g., 99% accurate collection of phone numbers), but not regularly enough to promise it.

    And to clarify, the core attributes we clean (and that this blog post largely concerns) are Business Name, URL, Address, Phone. Additionally, we’ve successfully acquired hours of operation, gathered contact emails, found links to menus, and done a variety of other data enrichment work.

  3. MP Crosson

    Greg – I am curious as to what application this may have to SMBs or consultants… I run the largest social media group on LinkedIn (nearly 200,000 members) and I know crowd-sourcing is a fairly popular subject but I am just not knowledgeable enough myself to determine how relevant it is to the group.

    Your thoughts?

    Cheers, Mike Crosson

  4. Tony Mariotti

    Great blog post. The most interesting part for me was that you give Turkers feedback as to why they were wrong in real-time. When I took the Graduate Record Exam (GRE) for my grad school application back in 1996, they offered a new computer version of the exam that took advantage of “adaptive testing.” In the paper version of the test (#2 pencil and bubbles), the questions are ordered by degree of difficulty. With the computer test, for every question you answered correctly, you were given a harder one. And for each one you got wrong, you were presented with an easier one. The net result was a shorter test duration (time) as they could determine your overall score with fewer questions. And as the test taker, you could pretty much tell how you were doing by the difficulty of the questions you were given.

    Adaptive testing is different than what you describe, as GRE questions are all different and mTurk tasks are the same. But the idea of moderating and adjusting on-the-fly is a really big leap. It’s easy to see why letting workers know how they are doing and WHY they are doing well/or not would definitely lead to better outcomes. I suspect it is more satisfying to the worker.

  5. Greg Laughlin

    Mike – we do a lot of work around social media. I’d be happy to share some details with you offline.