WordPress Robots.txt - The Correct Way



Whenever I take over a website to help improve its organic rankings, I always look at what Google has indexed and is currently displaying in search results for the site. In almost all cases, if it’s a WordPress site, the search will show that the theme folder has been indexed, along with the files in the wp-content, wp-admin, and cgi-bin folders. In other words, Google, MSN, Yahoo, etc. have indexed redundant files that have nothing to do with your website’s content. In most cases this opens up possible security issues, and having these files indexed can hurt your organic search results.

The sad part is that all of this can be prevented with a good robots.txt file uploaded to the root directory of your website. It takes a few minutes to do, and then you have to wait for the search engines to update their index of your website.

How to Make a Great Robots.txt File

  1. Open up Notepad ( Start menu - Programs - Accessories - Notepad )
  2. Copy the code below this numbered list into Notepad
  3. Scroll down to the bottom of the copied code and change the sitemap link to point to your own sitemap.xml location.
  4. Save the file as robots.txt
  5. Upload the new robots.txt file to the root directory of your WordPress website, replacing the old robots.txt if there is one
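
Before uploading, it can help to sanity-check the file locally. The following is a minimal sketch using Python's standard urllib.robotparser module. The shortened rule set and the sample post path /2008/08/some-post/ are assumptions for illustration only. Note that this parser does plain prefix matching, so it will not evaluate the * and $ wildcard rules used in the Googlebot section.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules. Only plain path prefixes are used here,
# because the standard-library parser does not implement * and $ wildcards.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/themes/

User-agent: duggmirror
Disallow: /

Sitemap: http://www.yourwebsite.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Admin pages should be blocked for every bot.
print(rules.can_fetch("*", "/wp-admin/options.php"))         # False
# Ordinary posts (hypothetical path) should stay crawlable.
print(rules.can_fetch("*", "/2008/08/some-post/"))           # True
# duggmirror is blocked from the whole site.
print(rules.can_fetch("duggmirror", "/2008/08/some-post/"))  # False
# Confirm the sitemap line was changed to your own location (Python 3.8+).
print(rules.site_maps())
```

If the sitemap line still shows the placeholder address, step 3 was missed.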

Robots.txt Code

Note - Paths in robots.txt are case-sensitive. If you have renamed any directories so that they use uppercase letters, I have seen search bots index those directories even though they are in the robots.txt, because they are listed there in lowercase and not uppercase. List each directory exactly as it appears in the URL.
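
To see the case issue concretely, here is a small sketch, again with Python's standard urllib.robotparser. It uses the /stats/ rule from the list below; the renamed /Stats/ directory and the report.html file name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /stats/",   # rule written in lowercase
])

# The lowercase directory matches the rule and is blocked.
lower_blocked = not parser.can_fetch("*", "/stats/report.html")
# A directory renamed to /Stats/ (hypothetical) does not match the rule.
upper_blocked = not parser.can_fetch("*", "/Stats/report.html")

print(lower_blocked)  # True
print(upper_blocked)  # False
```

The fix is simply to write the Disallow line with the same capitalization as the real directory.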

I also do not care for duggmirror indexing my website when a post makes it to Digg, so I have added a disallow for it in the robots.txt.


User-agent: *
Disallow: /cgi-bin/
Disallow: /z/j/
Disallow: /z/c/
Disallow: /stats/
Disallow: /dh_
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /contact/
Disallow: /tag/
Disallow: /wp-content/b
Disallow: /wp-content/p
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: */trackback/

User-agent: Googlebot
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.cgi$
Disallow: /*.wmv$
Disallow: /*.xhtml$
Disallow: */trackback*
Disallow: /z/
Disallow: /wp-*
Allow: /wp-content/uploads/
 

User-agent: Googlebot-Image
Allow: /*
 
User-agent: Mediapartners-Google*
Allow: /z/
Allow: /about/
Allow: /contact/
Allow: /wp-content/
Allow: /tag/
Allow: /manual/*
Allow: /docs/*
Allow: /*.js$
Allow: /*.inc$
Allow: /*.css$
Allow: /*.gz$
Allow: /*.cgi$
Allow: /*.wmv$
Allow: /*.xhtml$
Allow: /*.php*
Allow: /*.gif$
Allow: /*.jpg$
Allow: /*.png$
 
User-agent: ia_archiver
Disallow: /
 
User-agent: duggmirror
Disallow: /

Sitemap: http://www.yourwebsite.com/sitemap.xml
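
The Googlebot section relies on two wildcards: * matches any run of characters and $ marks the end of the URL. Here is a rough sketch of how such rules could be evaluated, assuming the matching Google documents (the longest matching pattern wins, and Allow wins a tie). The helper names pattern_to_regex and is_allowed and the sample paths are my own inventions for illustration, not part of any library.

```python
import re


def pattern_to_regex(pattern):
    """Convert a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal text; each * becomes "any run of characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (directive, pattern); the longest match decides."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # A longer pattern is more specific; on a tie, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


# A few of the Googlebot rules from the file above.
GOOGLEBOT_RULES = [
    ("Disallow", "/*.js$"),
    ("Disallow", "/*.css$"),
    ("Disallow", "/wp-*"),
    ("Allow", "/wp-content/uploads/"),
]

# Uploads stay crawlable because the Allow pattern is longer than /wp-*.
print(is_allowed(GOOGLEBOT_RULES, "/wp-content/uploads/2008/08/photo.jpg"))  # True
print(is_allowed(GOOGLEBOT_RULES, "/wp-content/plugins/readme.txt"))         # False
# The $ anchor means a query string lets the URL slip past the .css rule.
print(is_allowed(GOOGLEBOT_RULES, "/theme/style.css"))                       # False
print(is_allowed(GOOGLEBOT_RULES, "/theme/style.css?ver=2"))                 # True
```

The last line shows a side effect worth knowing about: rules ending in $ only block URLs that end exactly with that extension.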


 

This robots.txt file was not created exclusively by me, but for the life of me I cannot remember the website where I picked it up. If anyone happens to know, please tell me so I can include a link back to them for this amazing robots.txt file.



6 Comments

  1. Posted August 31, 2008 at 4:37 am | Permalink

    Wait a second, robots.txt does not have an Allow command, only Disallow, right?



  2. Posted August 31, 2008 at 4:45 am | Permalink

    I am 100% certain it has an Allow command. Like I said, it helps to have a good robots.txt in place.



  3. Posted August 31, 2008 at 4:53 am | Permalink

    In the global disallow area, various directories and images are off limits, but I want Google’s image bot to index images. Google’s Mediapartners bot also sometimes follows the rules set for Googlebot, unless you have specified otherwise somewhere else in the text file.



  4. Posted August 31, 2008 at 5:04 am | Permalink

    The Allow command is not a standard command, yet some search engines, like Googlebot, follow that rule. It is not necessary, though, as allowing is the default for all search engines.



  5. Posted November 11, 2008 at 9:17 pm | Permalink

    User-agent: Mediapartners-Google*
    Disallow:

    This would do exactly the same thing as your long list of non-standard “Allow:” directives following Mediapartners-Google*



  6. Posted November 29, 2008 at 2:08 pm | Permalink

    I really like what you had to say here! It’s about time! Would you mind if I placed a link back from my blog?



One Trackback

  1. By PlugIM.com on August 30, 2008 at 9:48 pm

    Wordpress Robot.txt - The Correct Way…

    Google will show the theme folder has been indexed along with all the files in wp-content, wp-admin, and cgi-bin folder unless a good Robots.txt is in place. Here is a tutorial for robots.txt file….
