A search engine is a software system designed to search for information on the World Wide Web. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on the content gathered by a web crawler. The processes involved in locating pages on the World Wide Web are described below.
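The crawling process mentioned above can be sketched as a breadth-first traversal: fetch a page, extract its links, and enqueue any links not yet seen. The following is a minimal sketch in Python; the URLs and page contents are hypothetical, and a tiny in-memory dictionary stands in for real HTTP fetching.

```python
from collections import deque
from html.parser import HTMLParser

# A toy in-memory "web": URL -> HTML body (hypothetical, for illustration).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed):
    """Breadth-first crawl from seed; returns visited URLs in order."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        body = PAGES.get(url, "")  # a real crawler would fetch over HTTP here
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl("http://example.com/"))
# visits all three pages reachable from the seed
```

A production crawler adds politeness delays, robots.txt checks, deduplication of near-identical pages, and revisit scheduling, but the frontier-queue structure is the same.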
Maintaining an accurate “snapshot” of the Web is challenging, not only because of the Web’s size and constantly changing content, but also because pages disappear at an alarming rate (a problem commonly called linkrot). Brewster Kahle, founder of the Internet Archive, estimates that web pages have an average life expectancy of only 100 days. And some pages cannot be found by web crawling at all: pages that no other page links to, pages that are password-protected, and pages that are generated dynamically only when a web form is submitted. These pages reside in the deep Web, also called the hidden or invisible Web.
Some website owners don’t want their pages indexed by search engines for any number of reasons, so they use the Robots Exclusion Protocol (robots.txt) to tell web crawlers which URLs are off-limits. Other website owners want to ensure that certain web pages are indexed, so they use the Sitemap Protocol, a method supported by all major search engines, to provide a crawler with a list of URLs they want indexed. Sitemaps are especially useful for pointing the crawler to URLs it would otherwise be unable to discover by following links.
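Both protocols are simple text formats a crawler can consume directly. As a sketch, Python’s standard library can check a robots.txt rule with `urllib.robotparser` and read a sitemap with `xml.etree.ElementTree`; the site, paths, and sitemap entries below are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt for an example site: one disallowed path,
# plus a pointer to the site's sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS_TXT)
print(rp.can_fetch("*", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))           # True

# Hypothetical sitemap listing URLs the owner wants indexed -- e.g. pages
# with no inbound links that crawling alone would miss.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/deep/page.html</loc></url>
  <url><loc>http://example.com/form-results.html</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

A crawler would consult `can_fetch` before every request and merge sitemap URLs into its frontier queue alongside links discovered by crawling.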