WeSEE crawler
… and ignoring robots.txt

I found WeSEE by monitoring bot traffic accessing our media directory, which contains photos of car complaints uploaded by our users. WeSEE’s crawler was accessing files correctly — no 404 errors — better than some of these other crawlers.

But there’s no good reason for an ad services company to index photos. It’s just a waste of our bandwidth.

Things started out well. WeSEE’s crawler useragent contains a URL to their crawler info page:

- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d7c17846-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2348 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d8132e3e-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2222 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/120/d7757374-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 19884 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
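
For anyone wanting to do the same kind of monitoring, here is a rough Python sketch of tallying bandwidth per useragent from log entries like the ones above. It is not the script we actually use. The regex assumes Apache-style combined log entries, and the names `LOG_TAIL`, `SAMPLE` and `bytes_by_agent` are made up for this example.

```python
import re
from collections import defaultdict

# Matches the tail of a combined-format log entry:
# "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# The three log entries quoted in this post, one entry per line.
SAMPLE = [
    '- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d7c17846-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2348 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"',
    '- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d8132e3e-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2222 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"',
    '- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/120/d7757374-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 19884 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"',
]


def bytes_by_agent(lines, path_prefix):
    """Sum response bytes per useragent for requests under path_prefix."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_TAIL.search(line)
        if match is None or not match.group("path").startswith(path_prefix):
            continue  # skip unparseable lines and requests outside the directory
        size = match.group("bytes")
        # A "-" in the bytes field means no response body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return dict(totals)


if __name__ == "__main__":
    for agent, total in bytes_by_agent(SAMPLE, "/carcomplaints.com/media/").items():
        print(f"{total:>10} bytes  {agent}")
```

Sorting the totals over a full day of logs makes it easy to spot a crawler that is pulling far more media than it has any reason to.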

Better yet, their bot page has specific information on how to block their crawler AND a contact email address. Looking good so far.

The bot accessing our media was “WeSEE:Ads/PictureBot” & their crawler page only has info for blocking the “WeSEE” crawler in robots.txt, so we assumed that rule would cover all WeSEE crawlers:

User-agent: WeSEE
Disallow: /cgi-bin
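
As a sanity check on that assumption, here is a small sketch using Python's standard `urllib.robotparser` to test a rule like the one above against the PictureBot useragent. It only shows how Python's parser interprets the rule. As far as I know it matches the `User-agent` value as a case-insensitive substring of the bot's name token, and other parsers (including WeSEE's own) may behave differently. The helper `is_allowed` and the path `/cgi-bin/example` are hypothetical, just for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from this post, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: WeSEE
Disallow: /cgi-bin
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt lets user_agent fetch path (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "/cgi-bin/example" is a made-up path under the disallowed prefix.
    for agent in ("WeSEE", "WeSEE:Ads/PictureBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "/cgi-bin/example"))
```

A well-behaved crawler would run a check like this before every request, for images as well as for pages.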

WeSEE gives a timeframe of “24 hours” for how soon the robots.txt block will be picked up.

We gave them 10 days. No dice.

I emailed them about it & got this reply back:

Hi Wick,

We work with the world's leading advertisings [sic] solutions to help them understand pages' contents in detail so they can target campaigns more effectively - and as for publishers - earn more money. Our company (WeSEE) only indexes pages that have passed through an advertisers network, i.e. we do not crawl sites randomly.
Mentioned requests were made by the image links, which were shared (published) on other websites (and directs to your site).

Our ops team will block such requests to your site in several hours today. 

We apologize for any inconvenience this caused you and appreciate your understanding of these technical issues.

If you have any further question please don’t hesitate to ask.

At this point, my B.S. meter started smoking & promptly caught on fire. I replied that their robots.txt exclusion wasn’t working — a serious problem that shouldn’t involve “technical issues”. If you operate a crawler, especially one that your entire business model depends on, it should probably follow the robots.txt standard.

We’ll see what they say.

UPDATE: To their credit, WeSEE tech support promptly replied & after a few emails back & forth, admitted they screwed up. While their content bot checks robots.txt before crawling page content, their pictures bot crawled media contained in those pages without also checking robots.txt rules. WeSEE claims this has been fixed.