How to detect a search engine spider/crawler with PHP

Posted on January 27, 2009 in Development with 12 comments.

Today I was tasked with writing a small PHP script that detects whether a search engine spider is crawling a page of your site. There are a few ways to go about it. The challenge is that there are so many spiders on the web: the script I am currently using only checks for the main ones, but it lets you add as many crawlers as you want. Here is the code, with explanatory comments:

if ( ! function_exists('check_if_spider'))
{
	function check_if_spider()
	{
		// Add as many spiders as you want to this array
		$spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');

		// Loop through each spider and check if it appears in
		// the User Agent
		foreach ($spiders as $spider)
		{
			if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))
			{
				return TRUE;
			}
		}
		return FALSE;
	}
}

And there we have it. You can find a list of search engine crawlers here.
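
Since eregi() was deprecated in PHP 5.3 and removed in PHP 7.0, the function above will not run on current PHP versions. The following is a minimal sketch of the same check using stripos(), which does a case-insensitive substring search. The name is_spider_agent() and the explicit $user_agent parameter are assumptions added here so the check can be exercised without a real request, and the spider list is a shortened copy of the one above.

```php
// Hypothetical helper (not from the original script): same idea as
// check_if_spider(), but the user agent is passed in explicitly.
function is_spider_agent($user_agent, array $spiders = array('Googlebot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Teoma', 'Gigabot'))
{
	// A missing or empty User-Agent header cannot match anything
	if ( ! is_string($user_agent) || $user_agent === '')
	{
		return FALSE;
	}

	foreach ($spiders as $spider)
	{
		// stripos() returns the match offset (possibly 0) or FALSE,
		// so the comparison has to be strict
		if (stripos($user_agent, $spider) !== FALSE)
		{
			return TRUE;
		}
	}
	return FALSE;
}

// Usage inside a request; the header may be absent, hence isset()
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_spider = is_spider_agent($agent);
```

Passing the user agent as an argument also makes it easy to try the function against sample strings copied from your server logs.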

The other way of doing this check, without having to list the spiders in an array, is the get_browser() PHP function. It returns an object (or an array, if you pass TRUE as the second argument) with very useful data about the user agent, including a crawler entry that is either TRUE or FALSE. It only works when the browscap setting in php.ini points to a browscap.ini file. I didn’t use it because I have not done enough testing to see how efficient it is. You can read more about this function on the PHP.net website.
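
As a rough illustration of the get_browser() route: the function returns FALSE when no browscap data is available, so the result needs checking before use. The sketch below wraps the call. The helper names is_crawler_info() and check_if_spider_browscap() are made up for this example, and the exact representation of the crawler flag (boolean or string) is treated as an assumption, so both are handled.

```php
// Hypothetical helper: interpret the array returned by get_browser($agent, TRUE).
function is_crawler_info($info)
{
	// get_browser() returns FALSE when no browscap data is available
	if ( ! is_array($info) || ! isset($info['crawler']))
	{
		return FALSE;
	}
	// Assumption: the flag may arrive as a boolean or as a string
	// such as "1" or "true"; anything else counts as FALSE
	return filter_var($info['crawler'], FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical wrapper around get_browser(); needs browscap set in php.ini.
function check_if_spider_browscap($user_agent)
{
	if ( ! ini_get('browscap'))
	{
		// Without a browscap.ini file the lookup cannot work
		return FALSE;
	}
	// Second argument TRUE asks for an array instead of an object
	return is_crawler_info(get_browser($user_agent, TRUE));
}
```

Keeping the interpretation of the result in its own function means it can be checked with hand-written arrays, even on a machine where browscap is not configured.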

12 Comments

  1. Raeven says:

    Neat, just what I needed for my site. Google and others are crawling all over it, messing up my visitor stats.

  2. What if the bot name changes? And how do we test it?

  3. gatorpower says:

    Just a note:

    eregi() is deprecated and will not be supported by PHP in the future. Use preg_match() going forward. It requires delimiters, such as forward slashes, around your pattern, like so: '/pattern/'

  4. Matt says:

    That is correct, Gatorpower. I have replaced eregi() in all my own scripts: it is deprecated as of PHP 5.3 and was removed entirely in PHP 7.0.

  5. Peshte says:

    Instead of the eregi line, I used strpos(' ' . $agent, $spider), where $agent is the variable holding the user agent I want to check.

    Because strpos can return 0 if the searched string begins at the first character, I added ' ' at the beginning of the string being searched.

  6. abdel says:

    Thanks for this piece of code. Could you please post the code after replacing eregi() with preg_match()?

    Is this right?
    if (preg_match($spider, $_SERVER['HTTP_USER_AGENT']))

    thanks

  7. kelly says:

    Matt,
    Could you update the script you provide above, or in another thread, with the correct code? With eregi replaced?

  8. Jason says:

    @Kelly
    Just change this line:

    if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))

    to this:

    if (strpos($_SERVER['HTTP_USER_AGENT'], $spider) !== false)

  9. PoorBoy says:

    Guys,

    You can use the array_search function. If I were the one rewriting the code to eliminate the deprecated eregi, my updated code would be something like this:

    foreach ($spiders as $spider)
    {
    if (array_search($_SERVER['HTTP_USER_AGENT'], $spider) !== FALSE) {
    return TRUE;
    $_SESSION['verify'] = "true";
    }
    }

  10. PoorBoy says:

    Just remove this part from my code:

    $_SESSION['verify'] = "true";
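
The replacements for the eregi() line suggested in the comments can be compared side by side. This is only a sketch: the function names agent_matches_preg() and agent_matches_stripos() are invented for the example, and the user agent is passed in as a parameter rather than read from $_SERVER. Two details are easy to miss: preg_match() needs delimiters around the pattern (plus the i modifier to stay case-insensitive like eregi()), and strpos()/stripos() must be compared with !== FALSE because a match at position 0 is a valid result. array_search() is left out because it looks for an exact value inside an array, not for a substring of a string.

```php
// Hypothetical helper: preg_match() variant of the eregi() line.
function agent_matches_preg($user_agent, $spider)
{
	// preg_quote() escapes regex metacharacters in the spider name,
	// '/' is the delimiter and 'i' makes the match case-insensitive
	$pattern = '/' . preg_quote($spider, '/') . '/i';
	return preg_match($pattern, $user_agent) === 1;
}

// Hypothetical helper: stripos() variant of the eregi() line.
function agent_matches_stripos($user_agent, $spider)
{
	// Strict comparison: offset 0 means "found at the very start"
	return stripos($user_agent, $spider) !== FALSE;
}
```

For plain spider names the two behave the same; the stripos() version avoids regular expressions altogether, which is usually enough for this kind of substring check.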
