Every search engine has a continually refined and complicated process for updating its search database. The "big three", that is Google, Yahoo, and MSN, all employ a combination of methods, but all three use search engine "spiders" (or crawlers, robots, any of a dozen nicknames). These spiders are largely autonomous programs that travel the web, jumping from page to page based on the link structure, much the same way a regular user might. As the spiders sweep across the web, they collect a variety of data about the pages they encounter, including modification dates, descriptions, and other information contained in a page's meta tags.

Upon reaching a page, a spider will either index the page if it isn't already within the database, or update its existing record, depending on how much has changed. The frequency of these spider visits is determined by a variety of factors, including how "static" the content of the page is, and the relative importance of the page itself (PageRank is a measure of this, in Google's world).

If you wish to promote your website, ideally you need the spiders to visit often and "see" all the important sections of your site, so the engines maintain an up-to-date index that serves your goals. Accomplishing this is a major goal of SEO as a whole, and is by no means simple. For one, the ways in which spiders move around and collect data are proprietary, and can only really be guessed at by crunching log data and measuring how quickly a site is indexed and visited.
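As a quick sketch of what "crunching log data" can look like, the snippet below counts spider requests in a few hypothetical Apache-style access log lines (the sample lines, the `SPIDERS` tuple, and the `spider_hits` helper are all illustrative assumptions, not part of any engine's tooling). In practice you would read the lines from your server's real access log instead.

```python
from collections import Counter

# Hypothetical sample lines in Apache combined log format; a real
# analysis would read these from the server's access log file.
LOG_LINES = [
    '66.249.66.1 - - [10/Oct/2006:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [10/Oct/2006:13:57:01 -0700] "GET /about.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [10/Oct/2006:14:01:12 -0700] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
]

# User-Agent substrings used by the big three's spiders.
SPIDERS = ("Googlebot", "Yahoo! Slurp", "msnbot")

def spider_hits(lines):
    """Count requests per known spider by matching the User-Agent field."""
    counts = Counter()
    for line in lines:
        for spider in SPIDERS:
            if spider in line:
                counts[spider] += 1
    return counts

print(spider_hits(LOG_LINES))  # Counter({'Googlebot': 2})
```

Tracking how those counts change over time, per spider, gives a rough picture of how often each engine is revisiting your site.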

A separate blog entry will be devoted to each of the big three engines and their methods of indexing, but one general standard for instructing the basic behavior of spiders is a simple text file called "robots.txt", which is placed in the top-level directory of the webserver.

A number of directives can be specified within this file, most notably which areas of your site are "off limits" to search engine spiders. For instance, adding the following lines to robots.txt...

User-agent: webcrawler
Disallow: /

...will tell webcrawler not to index or collect information on any part of your website. Wildcards can be used in the User-agent field, as illustrated in the following lines...

User-agent: *
Disallow: /secret
Disallow: /logs

...which keeps every search engine spider (that follows this standard) out of the noted folders. Note that wildcards aren't supported in the actual file path, so instead of /secret/*, just use /secret/.
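If you want to sanity-check a rule set like this before deploying it, Python's standard-library `urllib.robotparser` module implements the same standard. The sketch below feeds it the example rules above and asks which paths a well-behaved spider would be allowed to fetch (the rule text and sample paths are just the example from this post, not a real site's file):

```python
from urllib.robotparser import RobotFileParser

# The example rule set from above, as it would appear in robots.txt.
rules = """\
User-agent: *
Disallow: /secret
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Paths under the disallowed folders are off limits to any user-agent...
print(parser.can_fetch("webcrawler", "/secret/plans.html"))  # False
print(parser.can_fetch("webcrawler", "/logs/access.log"))    # False
# ...while everything else remains fetchable.
print(parser.can_fetch("webcrawler", "/index.html"))         # True
```

This is also a handy way to confirm the point about path wildcards: the standard matches on prefixes, so `/secret` already covers everything beneath it.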

There is also a META equivalent to this method; simply add the following line to the head of your HTML file if you don't want the page indexed:

<meta name="robots" content="noindex">
If you'd like the page indexed but not the links contained within it followed, use:

<meta name="robots" content="index, nofollow">