Wikipedia defines a search engine spider as:
A web crawler (also known as a web spider or
web robot) is a program or automated script which browses the
World Wide Web in a methodical, automated manner. Other less frequently
used names for web crawlers are ants, automatic indexers,
bots, and worms (Kobayashi and Takeda, 2000).
This process is called web crawling or spidering.
Many legitimate sites, in particular search engines, use spidering as a
means of providing up-to-date data.
Web crawlers are mainly used to create a copy of all the visited pages for
later processing by a search engine, which indexes the downloaded pages to
provide fast searches. Crawlers can also be used for automating maintenance
tasks on a web site, such as checking links or validating
HTML code.
Also, crawlers can be used to gather specific types of information from Web
pages, such as harvesting e-mail addresses (usually for
spam).
A web crawler is one type of bot, or software agent. In general, it starts
with a list of
URLs to visit, called the seeds. As the crawler visits these
URLs, it identifies all the
hyperlinks in the page and adds them to the list of URLs to visit,
called the crawl frontier. URLs from the frontier are recursively
visited according to a set of policies.
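The seed-and-frontier loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frontier is treated as a simple first-in, first-out queue (one possible policy among many), the names `LinkExtractor`, `crawl`, and `PAGES` are hypothetical, and pages come from a small in-memory dictionary rather than the network so the sketch can run offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is a caller-supplied function that maps a URL to its HTML,
    or returns None when the page cannot be retrieved.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "web" (hypothetical pages) standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

A real crawler would replace `PAGES.get` with an HTTP fetch and would usually add politeness delays and robots.txt checks, which are left out here.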
Crawler identification
Web crawlers typically identify themselves to a web server by using the
User-Agent field of an HTTP request.
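As a minimal sketch of what that looks like in practice, the snippet below builds a request with Python's standard `urllib.request` module and sets the User-Agent header. The crawler name `ExampleBot/1.0` and its address are made up for illustration, and no request is actually sent.

```python
import urllib.request

# Hypothetical crawler name and contact URL, for illustration only.
USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot.html)"


def build_request(url):
    """Return a request that announces the crawler in its User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/")
# urllib stores header names in capitalized form, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` would send the request with this header attached.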
Website administrators typically examine their web servers' logs and use
the user agent field to determine which crawlers
have visited the web server and how often. The user agent field may include
a URL where the
website administrator may find out more information about the crawler.
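The kind of log inspection described here can be approximated with a short script. The sketch below assumes the server writes the common "combined" access log format, in which the user agent is the last double-quoted field on each line. The sample lines and the names `count_user_agents` and `SAMPLE_LOG` are invented for illustration.

```python
import re
from collections import Counter

# Assumption: the user agent is the final double-quoted field on the line,
# as in the "combined" access log format.
AGENT_RE = re.compile(r'"(?P<agent>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many logged requests each user agent made."""
    counts = Counter()
    for line in lines:
        match = AGENT_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


# Made-up log lines; a real script would read them from the server's log file.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326 '
    '"-" "ExampleBot/1.0 (+http://example.com/bot.html)"',
    '192.0.2.7 - - [10/Oct/2000:13:56:01 -0700] "GET /a HTTP/1.0" 200 512 '
    '"http://example.com/" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2000:13:57:12 -0700] "GET /b HTTP/1.0" 200 1024 '
    '"-" "ExampleBot/1.0 (+http://example.com/bot.html)"',
]

counts = count_user_agents(SAMPLE_LOG)
for agent, hits in counts.most_common():
    print(hits, agent)
```

Because the user agent string is supplied by the client, counts like these show only what each visitor claimed to be.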
Spambots
and other malicious web crawlers are unlikely to place identifying
information in the user agent field, or they may mask their identity as a
browser or other well-known crawler.
It is important for web crawlers to identify themselves so
website administrators can contact the owner if needed. In some cases,
crawlers may be accidentally trapped in a
crawler trap or they may be overloading a web server with requests, and
the owner needs to stop the crawler. Identification is also useful for
administrators who are interested in knowing when they may expect their web
pages to be indexed by a particular
search engine.