How to detect a search engine spider/crawler with PHP

I was asked today to write a small PHP script that detects whether a search engine spider is crawling a page of your site. There are a few ways to go about it. The hard part is that there are so many spiders on the web. The script I am currently using only checks for the main ones, but it lets you add as many crawlers as you want. It works by looking for each spider's name in the User-Agent header sent with the request. Here is the code, with comments explaining each step:

if ( ! function_exists('check_if_spider'))
{
	function check_if_spider()
	{
		// Add as many spiders as you want to this array
		$spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');
		
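		// Loop through each spider and check if it appears in
		// the User Agent. eregi() does a case-insensitive regex match;
		// it was deprecated in PHP 5.3 and removed in PHP 7.0.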
		foreach ($spiders as $spider)
		{
			if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))
			{
				return TRUE;
			}
		}
		return FALSE;
	}
}

And there we have it. You can find a list of search engine crawlers here.
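Since eregi() no longer exists in PHP 7 and later, here is a minimal sketch of the same check for current PHP versions. It assumes a plain case-insensitive substring match is enough. The function name is_spider_user_agent() and its two parameters are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration).
// Returns true if any of the given spider names occurs in the user agent.
function is_spider_user_agent(string $userAgent, array $spiders): bool
{
    foreach ($spiders as $spider) {
        // stripos() is case-insensitive and returns an int offset
        // (possibly 0) or false, so compare strictly against false.
        if (stripos($userAgent, $spider) !== false) {
            return true;
        }
    }
    return false;
}

// A subset of the list from the script above.
$spiders = ['Googlebot', 'Slurp', 'msnbot', 'ia_archiver', 'Teoma', 'Gigabot'];

// The header may be missing entirely, so fall back to an empty string.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';

var_dump(is_spider_user_agent($agent, $spiders));
```

Passing the user agent in as an argument, rather than reading $_SERVER inside the function, also makes it easy to try the check against sample strings.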

The other way to do this check, without having to list the spiders in an array, is the get_browser() PHP function. It looks the user agent up in a browscap.ini file, which has to be configured through the browscap setting in php.ini, and returns an object (or an array, if you ask for one) with very useful data. One of the values it returns is crawler, which is either true or false. I didn't use it because I have not done enough testing to see how efficient it is. You can read more about this function on the PHP.net website.
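As a rough sketch of the get_browser() approach, the wrapper below asks for an array and reads its crawler entry. The function name is_crawler_via_browscap() is hypothetical, and the sketch assumes that the browscap.ini file in use actually provides a crawler field. If browscap is not configured, it simply reports false.

```php
<?php
// Hypothetical wrapper (name is an assumption for illustration).
function is_crawler_via_browscap(string $userAgent): bool
{
    // get_browser() only works when php.ini points "browscap" at a
    // browscap.ini file; without it there is nothing to look up.
    if (!ini_get('browscap')) {
        return false;
    }

    // The second argument requests an array instead of an object.
    // The @ suppresses the warning raised on a failed lookup.
    $info = @get_browser($userAgent, true);

    // A missing or empty "crawler" entry is treated as "not a crawler".
    return is_array($info) && !empty($info['crawler']);
}

var_dump(is_crawler_via_browscap($_SERVER['HTTP_USER_AGENT'] ?? ''));
```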

Published by

Matt


  • Raeven

    Neat, just what I needed for my site. Google and others are crawling all over it, messing up my visitor stats.

    • http://iarematt.com Matt

      I am glad you found it useful :)

  • http://www.bft.com.my malaysia web designer

    What if the bot name changes? And how do we test it?

  • gatorpower

    Just a note:

    eregi() is deprecated and will not be supported by PHP in the future. Use preg_match() going forward. It requires delimiters around your pattern, typically forward slashes, like so: '/pattern/'

  • http://iarematt.com Matt

    That is correct, gatorpower. I have replaced eregi() in all my own scripts, since it is deprecated as of PHP 5.3 and was removed entirely in PHP 7.0.

  • Peshte

    Instead of the eregi line, I used strpos(' '.$agent, $spider), where $agent is the variable holding the user agent to check.

    Because strpos() can return 0 when the searched string begins at the first character, I've added ' ' at the beginning of the string being searched.
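The pitfall with a match at offset 0 can be seen in a few lines. The user agent string below is only an illustrative sample, and the variable names are made up for this sketch.

```php
<?php
// Illustrative sample where the spider name starts at the first character.
$agent  = 'Googlebot/2.1 (+http://www.google.com/bot.html)';
$spider = 'Googlebot';

// The match is at offset 0, so strpos() returns int(0), not false.
$pos = strpos($agent, $spider);

// Loose comparison treats 0 as equal to false, so the match is missed.
$loose = ($pos != false);

// Strict comparison distinguishes int(0) from bool(false).
$strict = ($pos !== false);

// Padding the haystack with one space moves the match to offset 1,
// which is truthy, so a plain boolean test works as well.
$padded = (bool) strpos(' ' . $agent, $spider);

var_dump($pos, $loose, $strict, $padded);
```

Either the strict comparison or the padding trick avoids the problem. The strict comparison is the more common idiom.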

  • http://www.hreations.co.za abdel

    Thanks for this piece of code. Could you please post the code after replacing eregi() with preg_match()?

    Is this right?
    if (preg_match($spider, $_SERVER['HTTP_USER_AGENT']))

    thanks
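On the preg_match() question: passing the bare spider name does not work, because preg_match() expects the pattern to be wrapped in delimiters. One possible sketch builds a single case-insensitive pattern from the whole list. The function name matches_spider_pattern() and its parameters are hypothetical.

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration).
function matches_spider_pattern(string $userAgent, array $spiders): bool
{
    // An empty list would produce the pattern "//i", which matches anything.
    if ($spiders === []) {
        return false;
    }

    // Escape each name so characters such as "." or "/" are matched literally.
    $escaped = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $spiders);

    // "/" is the delimiter, "|" separates the alternatives, and the
    // trailing "i" makes the match case-insensitive, as eregi() was.
    $pattern = '/' . implode('|', $escaped) . '/i';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $userAgent) === 1;
}

$spiders = ['Googlebot', 'Slurp', 'msnbot'];
var_dump(matches_spider_pattern($_SERVER['HTTP_USER_AGENT'] ?? '', $spiders));
```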

  • http://tunespring.com kelly

    Matt,
    Could you update the script you provide above, or in another thread, with the correct code? With eregi replaced?

  • Jason

    @Kelly
    Just change this line:

    if (eregi($spider, $_SERVER['HTTP_USER_AGENT']))

    to this:

    if (stripos($_SERVER['HTTP_USER_AGENT'], $spider) !== false)

  • http://veedeoo.com/forum/ PoorBoy

    Guys,

    You can use the array_search() function. If I were writing the code to eliminate the deprecated eregi(), my updated code would be something like this:

    foreach ($spiders as $spider)
    {
        if (array_search($_SERVER['HTTP_USER_AGENT'], $spider) !== FALSE) {
            return TRUE;
            $_SESSION['verify'] = "true";
        }
    }

  • http://veedeoo.com/forum/ PoorBoy

    Just remove this part from my code:

    $_SESSION['verify'] = "true";

  • http://www.suvi.org suvi

    If you use the PHP in_array() function, you can make the code much shorter :-)

    • http://www.suvi.org suvi

      Oops, sorry, I was wrong. array_search() would be better.

      • Guest

        You were right the first time. array_search() is for when you need the key of the matching entry. Just to check whether a value exists in an array, in_array() is the right function. And it is slightly faster.

        • http://mattgeri.com/ Matt Geri

          Neither solution will work, as you need to do a substring or regex search. The UA contains more than just the spider's name.

          Matt

  • http://tech5.net g0g0l

    Nice tutorial!
    I guess you really should replace the eregi function with preg_match, because it's deprecated. Awesome article otherwise.

    • Guest

      No need for eregi at all. in_array would be the best approach.

      if ( ! function_exists('check_if_spider'))
      {
          function check_if_spider()
          {
              // Add as many spiders as you want to this array
              $spiders = array('Googlebot', 'Yammybot', 'Openbot', 'Yahoo', 'Slurp', 'msnbot', 'ia_archiver', 'Lycos', 'Scooter', 'AltaVista', 'Teoma', 'Gigabot', 'Googlebot-Mobile');

              // Check if any spider appears in the User Agent
              return in_array($_SERVER['HTTP_USER_AGENT'], $spiders);
          }
      }

      • http://mattgeri.com/ Matt Geri

        Hi there,

        That code won't work, as you need to use a substring or regular expression search on the user agent, since there is a lot more information in a spider's UA than just the name.

        Matt
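To illustrate the point, here is a small sketch comparing an exact-match lookup with a substring search. The user agent is only a sample string and the variable names are made up for this example.

```php
<?php
$spiders = ['Googlebot', 'Slurp', 'msnbot'];

// Sample user agent: the spider name is only one part of the string.
$agent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// in_array() and array_search() compare the whole user agent against
// each array element, so neither finds anything here.
$exact = in_array($agent, $spiders);
$key   = array_search($agent, $spiders);

// A substring search looks for each name inside the user agent instead.
$contains = false;
foreach ($spiders as $spider) {
    if (stripos($agent, $spider) !== false) {
        $contains = true;
        break;
    }
}

var_dump($exact, $key, $contains);
```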
