proximic

Proximic seems legit. They have a page about their web content crawler, and the URL in their UserAgent string works! Their crawl rate isn’t excessive & they even have a nice modern responsive-design website.

Yet they don’t monitor their per-site 404 or 403 error rates. Worse, excluding proximic via robots.txt doesn’t work. Their FAQ page claims they honor robots.txt rules but doesn’t provide any details, which is an odd oversight.

Like most of these bad bots, I found Proximic by following a massive trail of 404 Page Not Found errors. They aren’t preserving the case of our case-sensitive URL paths.

In retrospect I kind of wish we hadn’t gone with mixed-case paths but it’s too late to switch now. Lowercase letters were hot new technology in the 9th century. These services either code their shit accurately or get banned.

Here’s the access log sample:

54.152.93.62 - - [28/Jan/2015:14:59:29 -0800] "GET /carcomplaints.com/honda/civic/2006/safety/ HTTP/1.1" 403 304 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
54.209.153.203 - - [28/Jan/2015:14:59:51 -0800] "GET /carcomplaints.com/ford/edge/2011/safety/ HTTP/1.1" 403 302 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
54.164.173.39 - - [28/Jan/2015:14:59:55 -0800] "GET /carcomplaints.com/dodge/durango/1998/ HTTP/1.1" 403 299 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"
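
A trail like this is easier to spot when error responses are tallied per UserAgent. The following is only a rough sketch of that idea, assuming the log is in Apache’s combined format as in the sample above. The names LOG_LINE and count_errors_by_agent are made up for illustration and are not from any tool we actually run.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_errors_by_agent(lines, statuses=("403", "404")):
    """Count responses with the given status codes, grouped by UserAgent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not parse are skipped rather than guessed at.
        if match and match.group("status") in statuses:
            counts[match.group("agent")] += 1
    return counts


# The three lines from the access log sample above.
sample = [
    '54.152.93.62 - - [28/Jan/2015:14:59:29 -0800] "GET /carcomplaints.com/honda/civic/2006/safety/ HTTP/1.1" 403 304 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"',
    '54.209.153.203 - - [28/Jan/2015:14:59:51 -0800] "GET /carcomplaints.com/ford/edge/2011/safety/ HTTP/1.1" 403 302 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"',
    '54.164.173.39 - - [28/Jan/2015:14:59:55 -0800] "GET /carcomplaints.com/dodge/durango/1998/ HTTP/1.1" 403 299 "-" "Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)"',
]

# Worst offenders first.
for agent, hits in count_errors_by_agent(sample).most_common():
    print(hits, agent)
```

Feeding it a full access log instead of the three sample lines puts the UserAgents with the most 403 and 404 responses at the top of the list.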

Yay, they have a unique UserAgent string & a working bot info URL. Off to a good start. They even have a FAQ section on the 404 Not Found errors:

Why does the spider access invalid URLs?

In general this should not happen. Please contact us and we will find out what is causing it.

… with a clear (and again, working!) link to their contact form. I filled it out. We’ll see if the complaint gets resolved as nicely as the whole initial experience. SEE UPDATE BELOW.

For now though, Proximic gets the boot. In lieu of useful robots.txt information from their crawler info page, we tried:

User-agent: proximic
Disallow: /

But we’ve had Proximic banned via robots.txt since 2010 & 5 years later they’re still trying to crawl pages. So we went with the strong-arm tactics of .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} proximic [NC]
RewriteRule !^robots\.txt - [F]

That worked. Make sure to keep the robots.txt exemption rather than using a blanket (.*) ban, so banned bots can still fetch robots.txt. Who knows, Proximic might follow your robots.txt exclusion someday.

With .htaccess, combine all your banned bots into a bunch of consecutive RewriteCond lines, each joined to the next with the [OR] flag, with one RewriteRule at the end. For help with that, see this page.
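
As a sketch of what that combined block looks like, here’s a small generator that builds it from a list of bot names. It assumes the names are plain tokens that need no regex escaping. The function name build_ban_rules and the second bot name, examplebot, are hypothetical placeholders, not real entries from our ban list.

```python
import re


def build_ban_rules(bot_names):
    """Return .htaccess lines that send a 403 to every listed bot, except for robots.txt."""
    if not bot_names:
        raise ValueError("need at least one bot name")
    for name in bot_names:
        # Keep the sketch simple: only plain tokens, so no regex escaping is needed.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
            raise ValueError(f"unsupported bot name: {name!r}")

    lines = ["RewriteEngine On"]
    last = len(bot_names) - 1
    for i, name in enumerate(bot_names):
        # RewriteCond lines are ANDed by default; [OR] chains them so any one match bans.
        # The final condition carries no [OR] because the RewriteRule follows it.
        flags = "[NC]" if i == last else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {name} {flags}")
    # Same rule as above: everything except robots.txt is forbidden.
    lines.append(r"RewriteRule !^robots\.txt - [F]")
    return "\n".join(lines)


# "examplebot" is a made-up name standing in for the next bot on your list.
print(build_ban_rules(["proximic", "examplebot"]))
```

Leaving the [OR] off any condition but the last one would mean a request has to match every listed bot at once, so nothing would ever get banned.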

UPDATE February 6, 2015:

Got a nice email back from Paul Armstrong, Proximic’s Director of Technical Operations. He mentioned he didn’t think it was a good idea to ban Proximic because “some advertisers may choose not to advertise on carcomplaints.com as there will be no Proximic data available during their bid”, but they think they’ve found the cause of the problem & are in the process of resolving it.

I replied that I’m not concerned about any additional revenue loss, since we’ve already had Proximic banned via .htaccess for several years, ever since the last time we tried contacting them, but that I appreciated them trying to fix the issue.

Several emails since then have gone unanswered. And it was going so well.