SEO: A closer look at robots.txt
Muhammad Kashif Majeed and I have been searching the internet for information about robots.txt, but could not find a comprehensive note on it. After looking through books and getting help from some of my friends, I have tried to write this article. This article is not 100% my own creation. Any help and suggestions from my visitors are really appreciated. Special thanks to http://www.techiecorner.com/.
Before submitting a site to a search engine, consider which pages of the website you want the search engine “bot” to crawl. You may have pages with sensitive information, or a scrap directory full of pages in progress, that you would not like to see listed.
The first method is to place a robots.txt file in the root directory of the website. You must have full domain privileges for this to work. If you place a robots.txt file, do not leave it empty: some search engines might take that to mean you do not want your website crawled. I will discuss robots.txt in a later part of this article.
The second method of stopping the bots (“crawlers”) from searching through a website is the robots META tag.
Putting <META NAME="ROBOTS" CONTENT="NOINDEX"> between the <HEAD> and </HEAD> tags of an HTML file will prevent the bot from indexing that page.
Alternatively, if you put <META NAME="ROBOTS" CONTENT="NOFOLLOW"> between the <HEAD> and </HEAD> tags, the page will be indexed, but the hyperlinks in that page will not be spidered by the bot.
If you use the two META tags above at once, the page will not be indexed and its links will not be followed by the crawlers. If <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> is used between the <HEAD> and </HEAD> tags, this may also prevent some web mirroring software from downloading the website.
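To make the META tag method concrete, here is a minimal Python sketch of how a crawler might read the robots META tag of a page, i.e. a tag of the form <meta name="robots" content="noindex, nofollow"> in the head. The class name RobotsMetaParser, the helper robots_directives and the sample page are made up for illustration; real search engine bots certainly do more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for part in (attr.get("content") or "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page that asks bots not to index it or follow its links.
page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><body></body></html>'
found = robots_directives(page)
print("may index page:", "noindex" not in found)
print("may follow links:", "nofollow" not in found)
```

A page without any robots META tag yields an empty set, which this sketch treats as "index and follow".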
A closer look at robots.txt
A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document and recursively retrieving all documents that it references. Web robots are also called web crawlers, web spiders, or web wanderers. Once a site has been scanned by a robot, it will probably be indexed by the search engine. Most of the time, these robots are programs written by search engines such as Google, Yahoo, Alexa, MSN, etc.
The robots.txt file is a simple text file that tells robots which of a site’s pages they should not crawl. It can be created using Notepad and should be saved in the root directory of the site, e.g. http://www.exampledomain.com/robots.txt
The format of a robots.txt file is
User-agent: robot or spider name
Disallow: files or directories.
You can include comments in your robots.txt file by putting a pound sign “#” at the start of the comment; everything after it on that line is ignored.
User-agent: Titan
Disallow: / #comment
The names of the robots crawling your site can be found in the server log file.
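As a rough illustration of pulling robot names out of a log, the Python sketch below counts user agents in a few log lines. It assumes the common "combined" access log layout, in which the user agent is the last quoted field on each line; your server may log differently. The sample lines, IP addresses and the name ExampleBrowser are invented for the example.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many requests each user agent made."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines; in practice you would iterate over the real log file.
sample_log = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Titan"',
    '192.0.2.1 - - [10/Oct/2008:13:55:37 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Titan"',
    '192.0.2.7 - - [10/Oct/2008:14:01:02 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBrowser/1.0"',
]

for agent, hits in count_user_agents(sample_log).most_common():
    print(agent, hits)
```

Agents that request /robots.txt, as Titan does in the sample, are usually robots rather than people.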
Examples of robots.txt are
User-agent: Titan
Disallow: /
User-agent: *
Disallow:
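To check how these rules are read, you can feed them to the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below treats the two records above as one file; the bot name SomeOtherBot is made up, and the parser only shows how a well-behaved robot would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# The two example records, taken together as a single robots.txt file.
rules = """\
User-agent: Titan
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.exampledomain.com/index.html"
titan_allowed = parser.can_fetch("Titan", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print("Titan may fetch:", titan_allowed)          # False: Titan is shut out
print("SomeOtherBot may fetch:", other_allowed)   # True: empty Disallow allows all
```

An empty Disallow line means nothing is disallowed, so every robot other than Titan may crawl the whole site.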
If you want to exclude all search engine spiders from your entire domain, just use this tiny piece of code, but be sure you need it.
User-agent: *
Disallow: /
If you want to keep robots out of certain directories, you can specify them in the Disallow field.
User-agent: *
Disallow: /test/
Disallow: /personal/
Similarly, if you want to restrict specific files, type in the path of each file.
User-agent: *
Disallow: /personal/test.htm
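A Disallow value is matched against the beginning of the URL path, so a directory entry covers everything below it while a file entry covers only that file. The short Python sketch below illustrates this with the standard library parser; the bot name AnyBot and the page names draft.html and other.htm are hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /test/",              # a whole directory
    "Disallow: /personal/test.htm",  # one specific file
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(path):
    """Return True if a robot obeying the rules may fetch the given path."""
    return parser.can_fetch("AnyBot", "http://www.exampledomain.com" + path)


for path in ["/index.html", "/test/draft.html",
             "/personal/test.htm", "/personal/other.htm"]:
    print(path, allowed(path))
```

Only /test/draft.html and /personal/test.htm are refused here; the other file in /personal/ stays crawlable because only the single file was listed.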
If you do not want certain spiders to crawl your site, because they are not useful to you or are just eating up your bandwidth, you can specify them in User-agent, e.g.
User-Agent: Titan
Disallow: /
Be careful when using a robots.txt file: it may stop the pages you specify from appearing in search engines. There are many hundreds of bots and spiders crawling over the net; most of them respect your robots.txt file, while some may not.