Friday, December 25, 2015

robots.txt: What Is It?

Search engines visit websites, blogs and other online portals to scan their pages and then store or cache the information. That information is later used to rank the pages by relevance. On a large portal the whole website may be indexed according to the search engine's criteria, and most search engines use the indexed pages collectively to rank the site. This means that pages that are unimportant, or that hold information not meant for search engines, get indexed as well.

To tell robots which pages to index and which to ignore, a common protocol called the Robots Exclusion Protocol is used. A plain text file named robots.txt is uploaded to the root directory of the server alongside the pages. Robots request this file first and then crawl the site according to its rules. Remember that the protocol is voluntary: some robots, especially those with malicious intent, may ignore it. All the popular search companies, however, adhere to this public standard.

The file contains one or more groups of directives. The simplest example, which tells every robot to stay out of the entire site, is:

User-agent: *
Disallow: /
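
To see how a well-behaved robot might read these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the example.com URLs are placeholders chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from above, supplied as a list of lines
rules = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" and example.com are placeholders, not real names
home_allowed = parser.can_fetch("ExampleBot", "https://example.com/")
page_allowed = parser.can_fetch("ExampleBot", "https://example.com/about.html")
print(home_allowed, page_allowed)  # False False
```

Both checks come back False because "Disallow: /" matches every path on the site.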

More instructions are given below: 

To block the entire site, use a forward slash.

Disallow: /


To block a directory and everything in it, follow the directory name with a forward slash.

Disallow: /junk-directory/

To block a page, list the page.

Disallow: /private_file.html
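
Disallow values are matched as path prefixes, which is why the trailing slash matters for a directory. A small sketch with Python's standard urllib.robotparser, assuming a placeholder robot "ExampleBot" and the placeholder site example.com, illustrates the directory and page rules together:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /junk-directory/",
    "Disallow: /private_file.html",
])

def allowed(path):
    # "ExampleBot" and example.com are placeholders, not real names
    return parser.can_fetch("ExampleBot", "https://example.com" + path)

for path in ["/junk-directory/old.html", "/private_file.html", "/index.html"]:
    print(path, allowed(path))
```

Only /index.html is reported as allowed; the other two paths begin with a disallowed prefix.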

To remove a specific image from Google Images, add the following:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

To remove all images on your site from Google Images:

User-agent: Googlebot-Image
Disallow: /
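
Rules placed under a named User-agent line apply only to robots that identify themselves with that name. A rough sketch with Python's standard urllib.robotparser, using example.com and "ExampleBot" as placeholders, shows the single-image rule affecting Googlebot-Image while leaving other robots alone:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot-Image",
    "Disallow: /images/dogs.jpg",
])

# Placeholder URLs on a placeholder site
dog = "https://example.com/images/dogs.jpg"
cat = "https://example.com/images/cats.jpg"

image_bot_dog = parser.can_fetch("Googlebot-Image", dog)
image_bot_cat = parser.can_fetch("Googlebot-Image", cat)
other_bot_dog = parser.can_fetch("ExampleBot", dog)  # "ExampleBot" is hypothetical
print(image_bot_dog, image_bot_cat, other_bot_dog)  # False True True
```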

To block files of a specific file type (for example, .gif), use the following:

User-agent: Googlebot
Disallow: /*.gif$
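
In this pattern the * stands for any run of characters and the $ marks the end of the URL. These two symbols are an extension honoured by Google and some other engines, and as far as I am aware Python's standard urllib.robotparser does not interpret them. The sketch below is therefore only an illustration of the matching idea, not how any search engine actually implements it; the helper names pattern_to_regex and is_blocked are made up for this example.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with "match anything"
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern, path):
    # re.match anchors at the start of the path, as robots.txt rules do
    return pattern_to_regex(pattern).match(path) is not None

print(is_blocked("/*.gif$", "/images/logo.gif"))
print(is_blocked("/*.gif$", "/images/logo.gif?size=2"))
print(is_blocked("/*.gif$", "/images/logo.png"))
```

Only the first path is blocked: the second does not end in .gif because of its query string, and the third is a different file type.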

More information is available here: Robots.txt

Also visit robotstxt.org

Indexing instructions can also be given page by page in the robots meta tag, which must be placed in the head of every page it should apply to. Another method is the X-Robots-Tag header, which is sent in the HTTP response and can be configured on the server to cover all pages. Take care that the directives are correct and that no important page is blocked. This also applies to CMS portals.
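
In a page the tag typically looks like <meta name="robots" content="noindex, nofollow">, and the header is sent as a line such as "X-Robots-Tag: noindex". As a rough sketch of how one might collect both for a single page, the Python below uses the standard html.parser module. The names MetaRobotsFinder and page_directives, the sample page and the sample headers are all invented for this example, and the headers are assumed to arrive as a plain dict keyed exactly "X-Robots-Tag".

```python
from html.parser import HTMLParser

class MetaRobotsFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]

def page_directives(html, headers):
    # Directives from the page's meta tags first, then from the response header
    finder = MetaRobotsFinder()
    finder.feed(html)
    header_value = headers.get("X-Robots-Tag", "")
    from_header = [d.strip().lower() for d in header_value.split(",") if d.strip()]
    return finder.directives + from_header

# Made-up sample page and headers
sample_html = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
sample_headers = {"X-Robots-Tag": "noarchive"}
print(page_directives(sample_html, sample_headers))  # ['noindex', 'nofollow', 'noarchive']
```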

You have to check the directives regularly so that no incorrect or outdated rule creeps in over time. There are a number of tools that can help you keep things in line.
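
A simple home-made check is also possible. The sketch below, built on Python's standard urllib.robotparser, lists which of your important pages a given robots.txt would block. The function name find_blocked, the sample rules, the page list and the example.com host are assumptions made for illustration. Note that paths are case-sensitive, so a rule written with the wrong case blocks a different URL than intended.

```python
from urllib.robotparser import RobotFileParser

def find_blocked(robots_txt, important_paths, agents=("Googlebot", "Bingbot")):
    """Return (agent, path) pairs that the given robots.txt text would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = []
    for agent in agents:
        for path in important_paths:
            # example.com is a placeholder host; only the path matters here
            if not parser.can_fetch(agent, "https://example.com" + path):
                blocked.append((agent, path))
    return blocked

# Made-up rules and pages for the demonstration
sample_rules = "User-agent: *\nDisallow: /Products/\n"
pages = ["/", "/products/shoes.html", "/Products/shoes.html"]
print(find_blocked(sample_rules, pages))
```

Here only /Products/shoes.html is reported, once for each robot. To check a live site rather than a text string, the parser's set_url() and read() methods can fetch the file directly.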