Setting Up robots.txt for WordPress Sites

1. Why Use a robots.txt File

The main use of a robots.txt file is to request that search engines do not crawl, and therefore do not index, specified parts of a website. This might be because the content is confidential, but more likely it is to prevent a search engine from wasting time on content that is not of interest to the outside world, or content which is duplicated elsewhere on the site.

A search engine spider will only crawl a limited amount of content on each visit, so we want it to spend that time on useful pages rather than wasting it. This should help maintain and improve visibility on search engines, and eliminating duplicate content helps rankings too.

A further use of robots.txt is to tell search engines where to find sitemap files for the website in question.

2. User-agent

The starting point in robots.txt is to define which search engine robot(s) the rules that follow apply to. Rules can apply to all robots or be specific to an individual robot. In the past, Google could be blocked from indexing one set of folders while Yahoo! was permitted to view them, and vice versa, to provide content optimised for each engine. That would be frowned upon these days, so we don't do it.

Our robots.txt rules virtually always apply to all search engines so the first line in our file is:

User-agent: *
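
For illustration only, a rule block can instead be scoped to a single crawler by naming it in the User-agent line – for example (the folder name here is hypothetical, and as noted above we don't normally differentiate between engines):

# Rules in this block apply only to Google's crawler
User-agent: Googlebot
Disallow: /not-for-google/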

3. Disallow Folders Outside WordPress

There will usually be one or more folders outside of the WP structure that need blocking. These might include /lib/, /bak/, /generator/ and so forth. Just look at the folders viewable using FTP and add those which need blocking. Comments can be added to the robots.txt file by starting a line with # followed by a space and your comment. So this might look like:

# Disallow folders outside of WordPress
Disallow: /generator/
Disallow: /lib/

NB: ALWAYS add a trailing slash to a folder. “Disallow: /lib” will block any URL starting with /lib – e.g. /library.html, /liberty.html, etc. This applies to the WP folders too.
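
To illustrate the difference using the same /lib example:

# Too broad – also blocks /library.html, /liberty.html, etc.
Disallow: /lib
# Blocks only the /lib/ folder and everything inside it
Disallow: /lib/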

4. Disallow Standard WordPress Folders

There are at least three default folders in a WordPress installation: /wp-admin/, /wp-content/, and /wp-includes/. In most instances you should block /wp-admin/ and /wp-includes/, as well as other folders such as /wp-snapshots/ which appear in many setups.

There will often be good reason to allow indexing of the folder /wp-content/uploads/ as it will often contain PDFs, images and other content we might want indexed. In this instance we block the other folders inside /wp-content/ but not the uploads folder. For example:

# Disallow standard WP folders
Disallow: /wp-admin/
Disallow: /wp-content/cache/
Disallow: /wp-content/gallery/
Disallow: /wp-content/languages/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /wp-content/upgrade/
Disallow: /wp-content/w3tc-config/
Disallow: /wp-includes/
Disallow: /wp-snapshots/

Note that not all of the sub-folders will be present on all installations so this list will need tweaking.

I would also recommend disabling the option to keep uploads in month-based folders, as this creates lots of almost empty folders inside /wp-content/uploads/ unless you have masses of content. You’ll find this under Settings > Media.

[Screenshot: the month-based uploads option under Settings > Media]

5. Disallowing /feed/, /trackback/, /rss/ and such like

WordPress provides many ways to access the same data – feeds, RSS, trackbacks and so on – which is great for the user but can be problematic for search engine rankings because it creates a lot of duplicate content. This is not necessarily handled by one set of rules and thought needs to be given to each instance, but here is a typical configuration for a WordPress site (one which keeps its blog under /blog/):

# Disallow feed, rss, trackback and such like
Disallow: /blog/feed$
Disallow: /blog/feed/$
Disallow: /blog/*/feed/$
Disallow: /blog/*/feed/rss/$
Disallow: /blog/rss/
Disallow: /blog/comments/feed/
Disallow: /blog/*/trackback/$
Disallow: /feed/
Disallow: /trackback/
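
In these patterns the major search engines treat * as matching any sequence of characters and $ as anchoring the rule to the very end of the URL, so for example:

# This rule blocks /blog/some-post/feed/ but not /blog/some-post/feed/anything-else
Disallow: /blog/*/feed/$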

6. Disallow Specific File Types

All sites have certain types of files that you would not normally want search engines to index – for example stylesheet or JavaScript files. Also, since by default WP uses “ugly” URLs, pages can exist with database parameters as well as “clean” URLs – e.g. my.domain.com?page_id=212 might be the same page as my.domain.com/keyword.html. This is easily prevented in robots.txt. A typical set-up would be:

# Stop php, js and css being indexed, also anything with a ?
Disallow: /*.css$
Disallow: /*.js$
Disallow: /*.php$
Disallow: /*?

Be aware that some content is rendered using JavaScript, so blocking .js files could stop that content being seen. Also check that pages outside of WP are not using the .php file type for content that needs to be indexed.
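
If a particular stylesheet or script does need to remain accessible to crawlers, Google and Bing support a more specific Allow rule which overrides the broader block – for example (hypothetical theme path):

# Hypothetical exception for a stylesheet the pages rely on
Allow: /wp-content/themes/my-theme/style.css
Disallow: /*.css$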

7. XML Sitemaps

XML sitemap files are a useful tool for helping a search engine find all your useful content. You will probably have added these to Webmaster Tools and such like, but it cannot hurt to add them to robots.txt as well. It’s especially useful if you have more than one sitemap file, as in the following example:

# Specify where xml sitemaps can be found
Sitemap: http://www.fastsms.co.uk/sitemap_index.xml
Sitemap: http://www.fastsms.co.uk/downloads-sitemap.xml

In this instance /sitemap_index.xml is generated by a plugin within WordPress whilst /downloads-sitemap.xml is generated outside of WordPress for a downloads folder.

8. Removing Files from the Search Engine Index

Although robots.txt is effective at preventing the indexing of new files, it is not so good at removing files that were already indexed by the time a rule blocking them was added. In other words, it doesn’t work well retrospectively.

The easiest way to deal with this is through Webmaster Tools:

Go to Google Index > Remove URLs and specify the URL you want removed from the index. It works quite quickly, but take care: if you accidentally remove a page you do want indexed it could have serious consequences. Also be sure that you have a rule in robots.txt that prevents the URL from being indexed again in the future.
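
For example, if a hypothetical /old-reports/ folder has just been removed via Webmaster Tools, a matching rule keeps it from coming back:

# Hypothetical folder removed from the index – keep it blocked
Disallow: /old-reports/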

[Screenshot: the Remove URLs tool under Google Index in Webmaster Tools]

9. Conclusions

Google is to a great extent a law unto itself, but this simple file is a major means of gaining control over what is accessible to search robots and what is visible in search results.

It can take a little time to get it just how you want it but once established it needs little attention unless your site structure changes a lot.

10. Sample File

Here’s the whole file containing the examples given above. It’s probably best not to use it exactly as shown; work through each section and customise it for your own requirements.

User-agent: *

# Disallow folders outside of WordPress
Disallow: /generator/
Disallow: /lib/

# Disallow standard WP folders
Disallow: /wp-admin/
Disallow: /wp-content/cache/
Disallow: /wp-content/gallery/
Disallow: /wp-content/languages/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Disallow: /wp-content/upgrade/
Disallow: /wp-content/w3tc-config/
Disallow: /wp-includes/
Disallow: /wp-snapshots/

# Disallow feed, rss, trackback and such like
Disallow: /blog/feed$
Disallow: /blog/feed/$
Disallow: /blog/*/feed/$
Disallow: /blog/*/feed/rss/$
Disallow: /blog/rss/
Disallow: /blog/comments/feed/
Disallow: /blog/*/trackback/$
Disallow: /feed/
Disallow: /trackback/

# Stop php, js and css being indexed, also anything with a ?
Disallow: /*.css$
Disallow: /*.js$
Disallow: /*.php$
Disallow: /*?

# Specify where xml sitemaps can be found
Sitemap: http://www.fastsms.co.uk/sitemap_index.xml
Sitemap: http://www.fastsms.co.uk/downloads-sitemap.xml

Please don’t treat this as gospel but use it as a guide. You can also learn more here. If you disagree with any aspect or have other ideas to throw into the pot, please do so.
