How to detect a search engine spider/crawler with PHP

Posted in PHP
January 27, 2009

I was tasked today with writing a small PHP script that detects whether a search engine spider is crawling a page of your site. There are a few ways to go about it. The challenge is that there are so many spiders on the web. The script I am currently using checks only for the main spiders, but it lets you add as many crawlers as you want. Here is the code, with explanatory comments:

if ( ! function_exists('check_if_spider'))
{
	function check_if_spider()
	{
		// Add as many spiders as you want to this array
		$spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');
		
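		// Loop through each spider and check if it appears in
		// the User Agent. Note: eregi() was deprecated in PHP 5.3 and
		// removed in PHP 7.0, so this line only runs on older versions.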
		foreach ($spiders as $spider)
		{
			if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))
			{
				return TRUE;
			}
		}
		return FALSE;
	}
}

And there we have it. You can find a list of search engine crawlers here.
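
Since eregi() is no longer available in current PHP releases, here is a minimal sketch of the same check using stripos(), which is also case-insensitive. The function name check_if_spider_stripos and the optional $user_agent parameter are assumptions added here so the check can be exercised without a real request; they are not part of the original script.

```php
<?php
// Sketch only: same spider list as above, matched with stripos() instead of eregi().
// The optional $user_agent argument is a hypothetical addition for testing.
function check_if_spider_stripos($user_agent = null)
{
    // Add as many spiders as you want to this array
    $spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot',
        'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');

    if ($user_agent === null) {
        // The header can be absent, so fall back to an empty string
        $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    }

    foreach ($spiders as $spider) {
        // Compare strictly against false: a match at position 0 returns int(0)
        if (stripos($user_agent, $spider) !== false) {
            return true;
        }
    }
    return false;
}
```

Calling check_if_spider_stripos() with no argument reads the current request's User Agent, just like the original function.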

Now the other way of doing this check, without having to specify the spiders in an array, is to use PHP’s get_browser() function. It returns an object (or an array, if you pass TRUE as the second argument) with very useful data, including a crawler entry that is TRUE or FALSE. It only works when the browscap setting in php.ini points to a browscap.ini file. I didn’t use it because I have not done enough testing to see how efficient it is. You can read more about this function on the PHP.net website.
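
As a rough illustration of the get_browser() route, the sketch below wraps the call in a helper. The name is_crawler_via_browscap and the choice to return NULL when no answer is available are assumptions made for this example. How reliable the result is depends entirely on the browscap.ini file installed on the server.

```php
<?php
// Hypothetical helper: returns true/false from browscap data,
// or null when browscap is not configured or has no answer.
function is_crawler_via_browscap($user_agent = null)
{
    // get_browser() needs the "browscap" php.ini directive to be set
    if (!ini_get('browscap')) {
        return null;
    }

    if ($user_agent === null) {
        if (!isset($_SERVER['HTTP_USER_AGENT'])) {
            return null; // nothing to look up
        }
        $user_agent = $_SERVER['HTTP_USER_AGENT'];
    }

    // Passing true as the second argument returns an array instead of an object
    $info = get_browser($user_agent, true);

    if (!is_array($info) || !isset($info['crawler'])) {
        return null;
    }

    // Normalise values such as "1", "true", "" or "false" to a real boolean
    return filter_var($info['crawler'], FILTER_VALIDATE_BOOLEAN);
}
```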

  • Raeven

    Neat, just what I needed for my site. Google and others are crawling all over it, messing up my visitor stats.

    • http://iarematt.com Matt

      I am glad you found it useful :)

  • http://www.bft.com.my malaysia web designer

    What if the bot name changes? And how do we test it?

  • gatorpower

    Just a note:

    eregi() is deprecated and will not be supported by PHP in the future. Use preg_match() going forward. It requires forward slashes sandwiching your pattern, like so: '/pattern/'
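
To make the delimiter requirement concrete, here is a small sketch of a preg_match() version of the check. The helper name agent_matches_spider is hypothetical. The trailing i modifier keeps the match case-insensitive, as eregi() was, and preg_quote() is used so that characters in a spider name are not read as regex syntax.

```php
<?php
// Hypothetical helper: true if $spider occurs anywhere in $user_agent
function agent_matches_spider($spider, $user_agent)
{
    // preg_quote() escapes regex metacharacters; its second argument
    // also escapes the "/" delimiter used around the pattern
    $pattern = '/' . preg_quote($spider, '/') . '/i';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    return preg_match($pattern, $user_agent) === 1;
}
```

Inside the original loop this would replace the eregi() call with agent_matches_spider($spider, $_SERVER['HTTP_USER_AGENT']).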

  • http://iarematt.com Matt

    That is correct, Gatorpower. I have replaced eregi() in all my own scripts, as it is deprecated from PHP 5.3 onwards and is slated for removal in a future major version.

  • Peshte

    Instead of the eregi line, I used strpos(' ' . $agent, $spider), where $agent is the variable holding the string I want to check for a spider.

    Because strpos() can return 0 if the searched string begins at the first character, I’ve added ' ' at the beginning of the string being searched.
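
The position-0 pitfall described above can be seen in a few lines. This is only an illustration with a made-up $agent value. It compares a loose check, a strict !== false check, and the prefixed-space workaround.

```php
<?php
// Illustrative values: the spider name sits at the very start of the string
$agent  = 'Googlebot/2.1 (+http://www.google.com/bot.html)';
$spider = 'Googlebot';

// strpos() returns int(0) here because the match starts at the first character
$position = strpos($agent, $spider);

// A loose comparison treats 0 like false, so the match is missed
$loose_found = ($position != false);

// A strict comparison tells int(0) apart from false
$strict_found = ($position !== false);

// The prefixed-space workaround shifts the match to position 1
$padded_found = (bool) strpos(' ' . $agent, $spider);
```

Both the strict comparison and the padding avoid the false negative; the strict form does so without building a new string.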


  • http://www.hreations.co.za abdel

    Thanks for this piece of code. Could you please post the code after replacing eregi() with preg_match()?

    Is this right?
    if (preg_match($spider, $_SERVER['HTTP_USER_AGENT']))

    thanks

  • http://tunespring.com kelly

    Matt,
    Could you update the script you provide above, or in another thread, with the correct code? With eregi replaced?

  • Jason

    @Kelly
    Just change this line:

    if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))

    to this:

    if (strpos($_SERVER['HTTP_USER_AGENT'], $spider) !== false)

  • http://veedeoo.com/forum/ PoorBoy

    Guys,

    You can use the array_search function. If I were the one writing the code to eliminate the deprecated eregi, my updated code would look something like this:

    foreach ($spiders as $spider)
    {
        if (array_search($_SERVER['HTTP_USER_AGENT'], $spider) !== FALSE) {
            return TRUE;
            $_SESSION['verify'] = "true";
        }
    }

  • http://veedeoo.com/forum/ PoorBoy

    Just remove this part from my code:

    $_SESSION['verify'] = "true";

  • http://www.suvi.org suvi

    If you use PHP’s in_array() function, you can make the code much shorter :-)

    • http://www.suvi.org suvi

      Oops, sorry, I was wrong. array_search() would be better.

      • Guest

        You were right the first time. array_search() is for when you need the key of the matching entry. For just checking whether a value exists in the array, in_array() is the right function to use, and it is slightly faster.

        • http://mattgeri.com/ Matt Geri

          Neither solution will work, as you need to do a regex search. The UA contains more than just the spider’s name.

          Matt

  • http://tech5.net g0g0l

    Nice tutorial!
    I guess you really should replace the eregi function with preg_match, because it’s deprecated. Awesome article otherwise.

    • Guest

      No need for eregi at all. in_array would be the best approach.

      if ( ! function_exists('check_if_spider'))
      {
          function check_if_spider()
          {
              // Add as many spiders as you want to this array
              $spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');

              // Check if any spider appears in the User Agent
              return in_array($_SERVER['HTTP_USER_AGENT'], $spiders);
          }
      }

      • http://mattgeri.com/ Matt Geri

        Hi there,

        That code won’t work, as you need to use a regular expression search on the user agent: there is a lot more information in a spider’s UA than just the name.

        Matt
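
The difference between the two approaches shows up as soon as a full User Agent string is involved. The sketch below uses a shortened spider list and a sample $agent value in the style of Googlebot's UA, both chosen purely for illustration. in_array() asks whether the whole string equals one list entry, while a substring search (stripos() here, standing in for the regex) finds the name inside the longer string.

```php
<?php
// Shortened list and a sample UA string, both for illustration only
$spiders = array('Googlebot', 'Slurp', 'msnbot');
$agent   = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// in_array() compares the whole UA string against each entry,
// so it cannot match when the UA carries anything beyond the bare name
$exact_match = in_array($agent, $spiders);

// A substring search looks for each name anywhere inside the UA string
$partial_match = false;
foreach ($spiders as $spider) {
    if (stripos($agent, $spider) !== false) {
        $partial_match = true;
        break;
    }
}
```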
