Guide to the robots.txt file

Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler.

- from the Google webmaster guidelines
Why should you learn about robots.txt?
  • Improper usage of the robots.txt file can hurt your ranking
  • The robots.txt file controls how search engine spiders see and interact with your webpages
  • This file is mentioned in several of the Google guidelines
  • This file, and the bots it interacts with, are fundamental parts of how search engines work
Tip: To see if your robots.txt file is blocking any important files used by Google, use a robots.txt testing tool (for example, the one in Google Search Console).

What is a robots.txt file?
The robots.txt file defines how a search engine spider like Googlebot should interact with the pages and files of your website.

If there are files and directories you do not want indexed by search engines, you can use a robots.txt file to define where the robots should not go. The robots.txt file is a very simple text file placed on your web server.

What does it do exactly?

The first thing a search engine spider looks at when it visits a site is the robots.txt file.
It does this because it wants to know if it has permission to access a given page or file. If the robots.txt file says it can enter, the search engine spider then continues on to the page files.

If you have instructions for a search engine robot, the robots.txt file is how you communicate them.
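
To see this from the crawler's side, here is a minimal Python sketch of the same check, using the standard library's urllib.robotparser (the domain and page are placeholders):

from urllib.robotparser import RobotFileParser

# Placeholder site; substitute any domain you want to test.
robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

page = "https://www.example.com/some-page.html"
if robots.can_fetch("Googlebot", page):
    print("robots.txt permits Googlebot to fetch", page)
else:
    print("robots.txt blocks Googlebot from", page)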

Priorities for your website
There are three important things that any webmaster should do when it comes to the robots.txt file.
  • Determine if you have a robots.txt file
  • If you have one, make sure it is not harming your ranking or blocking content you don't want blocked
  • Determine if you need a robots.txt file
Determining if you have a robots.txt

You can check from any browser. The robots.txt file is always located in the same place on any website, so it is easy to determine if a site has one. Just add "/robots.txt" to the end of a domain name as shown below.

Example: "www.yourwebsite.com/robots.txt"

If you have a file there, it is your robots.txt file. You will either find a file with words in it, find a file with no words in it, or not find a file at all.
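
If you would rather check from a script than a browser, here is a small Python sketch (again with a placeholder domain) that fetches a site's robots.txt and reports which of those three outcomes applies:

import urllib.error
import urllib.request

# Placeholder domain; substitute your own site.
url = "https://www.example.com/robots.txt"
try:
    with urllib.request.urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
        if body.strip():
            print("robots.txt found, with contents:")
            print(body)
        else:
            print("robots.txt exists but is empty.")
except urllib.error.HTTPError as error:
    print(f"No robots.txt file found (HTTP {error.code}).")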

Determine if your robots.txt is blocking important files

You can use a free robots.txt checker, which will warn you if you are blocking certain page resources that Google needs to understand your pages.

If you have access and permission, you can use Google Search Console (formerly Google Webmaster Tools) to test your robots.txt file; the tester is not public and requires you to log in.


To be sure your robots.txt file is not blocking anything you want crawled, you will need to understand what it is saying. We cover that below.

Do you need a robots.txt file?

You may not even need to have a robots.txt file on your site. In fact, it is often the case that you do not need one.

Reasons you may want to have a robots.txt file:
  • You have content you want blocked from search engines
  • You are using paid links or advertisements that need special instructions for robots
  • You want to fine-tune access to your site from reputable robots
  • You are developing a site that is live, but you do not want search engines to index it yet
  • It helps you follow certain Google guidelines in certain situations
  • You need some or all of the above, but do not have full access to your webserver and how it is configured
Each of the above situations can be controlled by other methods; however, the robots.txt file is a good central place to take care of them, and most webmasters have the ability and access required to create and use one.

Reasons you may not want to have a robots.txt file:
  • Not having one is simple and error-free
  • You do not have any files you want or need to be blocked from search engines
  • You do not find yourself in any of the situations listed in the above reasons to have a robots.txt file
When you do not have a robots.txt file, search engine robots like Googlebot will have full access to your site. This is a normal, simple, and very common setup.

How to make a robots.txt file

If you can type or copy and paste, you can also make a robots.txt file.

The file is just a text file, which means that you can use Notepad or any other plain-text editor to make one. You can also make one in a code editor, or even copy and paste one.

What should the robots.txt say?

That depends on what you want it to do.
All robots.txt instructions result in one of the following three outcomes:
  • Full allow: All content may be crawled.
  • Full disallow: No content may be crawled.
  • Conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
Let's explain each one.

Full allow - all content may be crawled

Most people want robots to visit everything on their website. If this is the case for you, and you want the robot to index all parts of your site, there are three options to let the robots know that they are welcome.

1) Do not have a robots.txt file

If your website does not have a robots.txt file then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It does not find it because it isn't there. The robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.

2) Make an empty file and call it robots.txt

If your website has a robots.txt file that has nothing in it then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. There is nothing to read, so the robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.

3) Make a file called robots.txt and write the following two lines in it

User-agent: *
Disallow:

If your website has a robots.txt file with these instructions in it then this is what happens...

A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. It reads the first line. Then it reads the second line. The robot then feels free to visit all your web pages and content because this is what you told it to do (I explain this below).
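
If you want to confirm for yourself that options 2 and 3 behave the same way, Python's built-in parser can be used as a rough check; this is just an illustrative sketch:

from urllib.robotparser import RobotFileParser

def allowed(rules, path):
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)

empty_file = ""                         # option 2: an empty robots.txt
allow_all = "User-agent: *\nDisallow:"  # option 3: explicit full allow

print(allowed(empty_file, "/any-page.html"))  # True
print(allowed(allow_all, "/any-page.html"))   # True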

Full disallow - no content may be crawled

Warning: This means that Google and other search engines will not index or display your webpages.

To block all reputable search engine spiders from your site, you would have these instructions in your robots.txt:

User-agent: *
Disallow: /

It is not recommended to do this as it will result in none of your web pages being indexed.
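
You can verify the effect with the same kind of Python sketch; every path comes back as blocked:

from urllib.robotparser import RobotFileParser

block_all = "User-agent: *\nDisallow: /"
parser = RobotFileParser()
parser.parse(block_all.splitlines())

print(parser.can_fetch("*", "/"))               # False
print(parser.can_fetch("*", "/any-page.html"))  # False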

The robots.txt instructions and their meanings

Here is an explanation of what the different words mean in a robots.txt file.

User-agent:

The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.

If you want to tell all robots the same thing, you put a "*" after "User-agent:". It would look like this...

User-agent: *

The above line is saying "these directions apply to all robots".
If you want to tell a specific robot something (in this example, Googlebot), it would look like this...

User-agent: Googlebot

The above line is saying "these directions apply to just Googlebot".
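
The two forms can be combined in one file, with one section for a specific robot and another for everyone else. Here is a sketch, checked with Python's urllib.robotparser (the "/private" folder is hypothetical):

from urllib.robotparser import RobotFileParser

# Hypothetical file: block Googlebot from /private, allow all other robots everywhere.
rules = """User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow:
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "/private/page.html"))     # False
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # True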

Disallow:

The "Disallow" part is there to tell the robots what folders they should not look at. This means that if, for example you do not want search engines to index the photos on your site then you can place those photos into one folder and exclude it.

Let's say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.

Here is what your robots.txt file should look like in that scenario:

User-agent: *
Disallow: /photos

The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent: *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder".
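
To check that the file really does what those two lines say, you can run it through Python's built-in parser; a quick sketch:

from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /photos
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "/photos/mycar.jpg"))  # False: inside the blocked folder
print(parser.can_fetch("*", "/index.html"))        # True: everything else is open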

Googlebot specific instructions

The robot that Google uses to index its search engine is called Googlebot. It understands a few more instructions than other robots.
In addition to "User-agent" and "Disallow", Googlebot also uses the Allow instruction.

Allow:

The "Allow:" instructions lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions. To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...

User-agent: *
Disallow: /photos

Now let's say there was a photo called mycar.jpg in that folder that you want Googlebot to index. With the Allow: instruction, we can tell Googlebot to do so. It would look like this...

User-agent: *
Disallow: /photos
Allow: /photos/mycar.jpg

This would tell Googlebot that it can visit "mycar.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
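
One caveat if you test this with Python's urllib.robotparser: Googlebot resolves conflicting rules by preferring the most specific match, while Python's built-in parser applies rules in file order (first match wins). Listing the Allow line before the Disallow line makes both interpretations agree, which is what this sketch assumes:

from urllib.robotparser import RobotFileParser

# Allow listed first so Python's first-match parsing agrees with
# Googlebot's most-specific-match behavior.
rules = """User-agent: *
Allow: /photos/mycar.jpg
Disallow: /photos
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "/photos/mycar.jpg"))  # True: explicitly allowed
print(parser.can_fetch("*", "/photos/other.jpg"))  # False: rest of the folder is blocked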

Testing your robots.txt file

To find out if an individual page is blocked by robots.txt, you can use an online robots.txt checker, which will tell you if files important to Google are being blocked.

Key concepts
  • If you use a robots.txt file, make sure it is being used properly
  • An incorrect robots.txt file can block Googlebot from indexing your page
  • Ensure you are not blocking pages that Google needs to rank your pages

