WeSEE crawler
… and ignoring robots.txt

I found WeSEE by monitoring bot traffic accessing our media directory, which contains photos of car complaints uploaded by our users. WeSEE’s crawler was accessing files correctly — no 404 errors — better than some of these other crawlers.

But there’s no good reason for an ad services company to index photos. It’s just a waste of our bandwidth.
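Spotting this kind of traffic doesn’t take anything fancy. Here’s a rough sketch in shell — the log path and sample lines below are made up for illustration; point the awk at your real access log:

```shell
# Hypothetical sample log — substitute your actual access log.
cat > /tmp/access_sample.log <<'EOF'
- - [15/Feb/2015:18:59:18 -0800] "GET /media/complaints/images/a.png HTTP/1.1" 200 2348 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:18 -0800] "GET /media/complaints/images/b.png HTTP/1.1" 200 2222 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:20 -0800] "GET /honda/civic/2006/ HTTP/1.1" 200 8510 "-" "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0"
EOF

# Count requests to the media directory, grouped by UserAgent.
# Splitting on double quotes puts the request in field 2 and the UserAgent in field 6.
awk -F'"' '$2 ~ /\/media\//{print $6}' /tmp/access_sample.log | sort | uniq -c | sort -rn
```

Any UserAgent that floats to the top of that list hitting your media directory hundreds of times a day is worth a closer look.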

Things started out well — WeSEE’s crawler UserAgent contains a URL to their crawler info page:

- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d7c17846-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2348 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/39/d8132e3e-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 2222 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"
- - [15/Feb/2015:18:59:18 -0800] "GET /carcomplaints.com/media/complaints/images/120/d7757374-c2cc-1031-b743-4c3114d2dee3.png HTTP/1.1" 200 19884 "-" "WeSEE:Ads/PictureBot (http://www.wesee.com/bot/)"

Better yet, their bot page has specific information on how to block their crawler AND a contact email address. Looking good so far.

The bot accessing our media was “WeSEE:Ads/PictureBot” & their crawler page only has info for blocking the “WeSEE” crawler in robots.txt, so we assumed it must work for all WeSEE crawlers:

User-agent: WeSEE
Disallow: /cgi-bin

WeSEE gives a timeframe of “24 hours” for how soon the robots.txt block will be picked up.

We gave them 10 days. No dice.

I emailed them about it & got this reply back:

Hi Wick,

We work with the world's leading advertisings [sic] solutions to help them understand pages' contents in detail so they can target campaigns more effectively - and as for publishers - earn more money. Our company (WeSEE) only indexes pages that have passed through an advertisers network, i.e. we do not crawl sites randomly.
Mentioned requests were made by the image links, which were shared (published) on other websites (and directs to your site).

Our ops team will block such requests to your site in several hours today. 

We apologize for any inconvenience this caused you and appreciate your understanding of these technical issues.

If you have any further question please don’t hesitate to ask.

At this point, my B.S. meter started smoking & promptly caught on fire. I replied that their robots.txt exclusion wasn’t working — a serious problem, not a minor “technical issue”. If you operate a crawler, especially one your entire business model depends on, it should follow the robots.txt standard.

We’ll see what they say.

UPDATE: To their credit, WeSEE tech support promptly replied & after a few emails back & forth, admitted they screwed up. While their content bot checks robots.txt before crawling page content, their pictures bot crawled media contained in those pages without also checking robots.txt rules. WeSEE claims this has been fixed.


Proximic

Proximic seems legit. They have a page about their web content crawler — the URL in their UserAgent string works! Their crawl rate isn’t excessive & they even have a nice modern responsive-design website.

Yet they don’t monitor their per-site 404 or 403 error rates. Worse, excluding proximic via robots.txt doesn’t work. Their FAQ page claims they honor robots.txt rules but doesn’t provide any details, which is an odd oversight.

Like most of these bad bots, I found Proximic by following a massive trail of 404 page not found errors. They aren’t preserving the case of our case-sensitive URL paths.

In retrospect I kind of wish we hadn’t gone with mixed-case paths, but it’s too late to switch now. Lowercase letters were hot new technology in the 9th century. These services either code their shit accurately or get banned.

Here’s the access log sample:

- - [28/Jan/2015:14:59:29 -0800] "GET /carcomplaints.com/honda/civic/2006/safety/ HTTP/1.1" 403 304 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
- - [28/Jan/2015:14:59:51 -0800] "GET /carcomplaints.com/ford/edge/2011/safety/ HTTP/1.1" 403 302 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
- - [28/Jan/2015:14:59:55 -0800] "GET /carcomplaints.com/dodge/durango/1998/ HTTP/1.1" 403 299 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"

Yay, they have a unique UserAgent string & a working bot info URL. Off to a good start. They even have a FAQ section on the 404 Not Found errors:

Why does the spider access invalid URLs?

In general this should not happen. Please contact us and we will find out what is causing it.

… with a clear (and again, working!) link to their contact form. I filled it out. We’ll see if the complaint gets resolved as nicely as the whole initial experience. SEE UPDATE BELOW.

For now though, Proximic gets the boot. Since their crawler info page doesn’t offer any useful robots.txt details, we tried:

User-agent: proximic
Disallow: /

But we’ve had Proximic banned via robots.txt since 2010 & 5 years later they’re still trying to crawl pages. So we went with the strong-arm tactics of .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} proximic [NC]
RewriteRule !^robots\.txt - [F]

That worked. Make sure to keep the robots.txt exemption rather than using a blanket (.*) ban. Who knows, Proximic might follow your robots.txt exclusion someday.

With .htaccess, combine all your banned bots into a series of RewriteCond lines joined with [OR] flags, followed by a single RewriteRule at the end. For help with that, see this page.
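As a sketch, a combined ban might look like the following; the bot list here is just an example, not a recommendation. The [OR] flags matter, because RewriteCond lines are ANDed together by default:

```apache
RewriteEngine On
# [OR] joins the conditions; without it, every condition would have to match at once
RewriteCond %{HTTP_USER_AGENT} proximic [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WeSEE [NC,OR]
RewriteCond %{REMOTE_ADDR} ^208\.78\.85\.
RewriteRule !^robots\.txt - [F]
```

Note the last condition has no [OR] — it ends the chain, and the RewriteRule fires if any one condition matched.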

UPDATE February 6, 2015:

Got a nice email back from Paul Armstrong, Proximic’s Director of Technical Operations. He mentioned he didn’t think it was a good idea to ban Proximic because “some advertisers may choose not to advertise on carcomplaints.com as there will be no Proximic data available during their bid”, but they think they’ve found the cause of the problem & are in the process of resolving it.

I replied that I’m not concerned about any additional revenue loss; we’ve had Proximic banned via .htaccess for several years, ever since the last time we tried contacting them. But thanks for trying to fix the issue.

Several emails since then have gone unanswered. And it was going so well.

Gig Avenue

Found Gig Avenue by tracing 404 errors.

They have a sweet web page, & by sweet I mean 110% shady. Their entire homepage, including all text? One big image. Bonus points for the purely table-based layout. Super bonus points for the social media links in the site footer that don’t actually link anywhere.

Hope their “technology cost optimization” product is light-years ahead of their 20-year-old website, but I’m not holding my breath.

Back to their bot — they don’t use a bot-specific UserAgent string. Guess what they use?

"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"

Evil or what. They are masquerading as Firefox 29, which makes it harder to ban their bot. Terrible internet manners.

The 404 errors are because Gig Avenue doesn’t handle the uppercase characters in our paths correctly most of the time. Here’s a 30-second snapshot from our access log (filtered to show just Gig Avenue requests):

- - [28/Jan/2015:08:59:25 -0800] "GET /carcomplaints.com/kia/optima/2013/steering/steering.shtml HTTP/1.1" 404 4159 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:26 -0800] "GET /carcomplaints.com/Subaru/Legacy/2015/ HTTP/1.1" 200 8510 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:26 -0800] "GET /carcomplaints.com/ram/2500/2012/windows_windshield/ HTTP/1.1" 404 4149 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:30 -0800] "GET /carcomplaints.com/Jeep/Compass/2007/ HTTP/1.1" 200 8876 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:34 -0800] "GET /carcomplaints.com/ford/expedition/1999/engine/blown_head_gasket.shtml HTTP/1.1" 404 4185 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:34 -0800] "GET /carcomplaints.com/hyundai/elantra/2009/safety/ HTTP/1.1" 404 4151 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:40 -0800] "GET /carcomplaints.com/volkswagen/jetta/2008/recalls/ HTTP/1.1" 404 4152 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:42 -0800] "GET /carcomplaints.com/ford/explorer/2005/body_paint/paint_is_peeling_off.shtml HTTP/1.1" 404 4193 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:42 -0800] "GET /carcomplaints.com/Toyota/Corolla/2003/transmission/transmission_failure.shtml HTTP/1.1" 200 20864 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:45 -0800] "GET /carcomplaints.com/Hyundai/ HTTP/1.1" 200 7487 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:46 -0800] "GET /carcomplaints.com/gmc/terrain/2012/accessories-interior/ HTTP/1.1" 404 4050 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:48 -0800] "GET /carcomplaints.com/ford/expedition/1999/windows_windshield/window_wont_go_down.shtml HTTP/1.1" 404 4197 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
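Notice the pattern: requests with the correct mixed case get a 200, the lowercased ones get a 404. A quick way to quantify how badly a spoofed bot like this mangles paths is to tally status codes for its UserAgent string. A rough sketch — the log lines here are invented stand-ins; in practice you’d grep your real access log:

```shell
# Hypothetical log excerpt — substitute your actual access log.
cat > /tmp/gig_sample.log <<'EOF'
- - [28/Jan/2015:08:59:25 -0800] "GET /kia/optima/2013/steering/steering.shtml HTTP/1.1" 404 4159 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
- - [28/Jan/2015:08:59:26 -0800] "GET /Subaru/Legacy/2015/ HTTP/1.1" 200 8510 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0"
EOF

# Tally status codes for the spoofed UserAgent.
# Splitting on quotes puts " 404 4159 " in field 3; the status is its first token.
grep 'Gecko/20120101 Firefox/29.0' /tmp/gig_sample.log |
  awk -F'"' '{split($3, a, " "); print a[1]}' | sort | uniq -c
```

A legitimate crawler should show a near-zero 404 rate; a big 404 bucket under one UserAgent is a strong hint something automated is rewriting your URLs.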

How to ban Gig Avenue

Since they’re hiding their requests behind a Firefox UserAgent, you can’t ban them by UserAgent string. Think they comply with robots.txt? You’re funny. I went with .htaccess & a RewriteRule:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)
RewriteRule !^robots\.txt - [F]

This looks for IPs starting with 208.78.85, 208.66.97 & 208.66.100, which were the ranges I saw. They get a bare-bones 403 Forbidden error back for everything except robots.txt, in case they start playing nice someday.
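If you’re on Apache 2.4+, a sketch of the same IP ban using mod_authz_core instead of mod_rewrite (same example ranges, and assuming mod_authz_core is loaded) looks like this. One caveat: unlike the RewriteRule above, this blocks everything in the directory, robots.txt included:

```apache
# Apache 2.4+ (mod_authz_core); a partial IP like 208.78.85 matches the whole /24
<RequireAll>
    Require all granted
    Require not ip 208.78.85 208.66.97 208.66.100
</RequireAll>
```

Stick with the mod_rewrite version if you want to keep the robots.txt carve-out.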

For help with RewriteCond ban types, see the RewriteRule help page.