The World’s First Search Engine Based on Synthetic AI-Generated Data

Submitted by Style Pass, 2021-06-21 06:00:08

Building a web search engine starts with accepting that it is undesirable (and, more importantly, impossible) to index every word on every webpage. Even Google evidently does not do so. There is no point in indexing the parts of a webpage that are unlikely to appear in prospective user search queries.

Suppose that for each webpage you had a list of all the search queries that ended with a user clicking through to that webpage (say, from Google). Any part of the webpage that does not correspond to one of those queries is irrelevant to the retrieval process. Hence, you could focus on indexing only the search queries.
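To make the idea concrete, here is a minimal sketch of how the queries leading to a page could be used to decide which of its words are worth indexing. The function names, the simple word-level tokenizer, and the sample page and queries are all assumptions made for illustration; a real system would likely match phrases and meanings rather than individual words.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def indexable_terms(page_text, queries):
    """Return the page's words that occur in at least one query leading to it.

    Words on the page that never appear in such a query are treated as
    irrelevant for retrieval and are left out.
    """
    query_words = set()
    for query in queries:
        query_words.update(tokenize(query))
    return {word for word in tokenize(page_text) if word in query_words}

# Hypothetical page and the (hypothetical) queries that led users to it.
page = "Our bakery in Lisbon sells sourdough bread. Cookie policy: we use cookies."
queries = ["sourdough bakery lisbon", "best bread in lisbon"]

terms = indexable_terms(page, queries)
print(sorted(terms))
```

In this toy example the cookie notice contributes nothing to the index, because no query that led to the page mentions it.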

It follows that the way to start building an internet search engine is not to naively index all the words and phrases in a webpage, but rather to generate a Query Log. A Query Log is a huge database of <query, webpage> pairs, that is, queries and their associated webpages. You can build this Query Log database by collecting real user queries (for example, by scraping Google's search results pages) or by generating it synthetically from the webpages themselves.
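As a rough illustration of the data structure, a Query Log can be held as a list of (query, webpage) pairs and grouped into a lookup table from each query to the webpages associated with it. The helper name `build_query_index`, the normalization step, and the example URLs below are hypothetical, and the sketch says nothing about how the pairs were collected or generated.

```python
from collections import defaultdict

def build_query_index(query_log):
    """Group <query, webpage> pairs into a mapping: query -> set of webpages.

    Queries are normalized (trimmed and lowercased) so that trivially
    different spellings of the same query share one entry.
    """
    index = defaultdict(set)
    for query, webpage in query_log:
        index[query.strip().lower()].add(webpage)
    return index

# A tiny, made-up Query Log of <query, webpage> pairs.
query_log = [
    ("sourdough bakery lisbon", "https://example.com/bakery"),
    ("Sourdough bakery Lisbon ", "https://example.com/bakery"),
    ("sourdough bakery lisbon", "https://example.org/bread-guide"),
    ("python list sort", "https://example.net/sorting"),
]

index = build_query_index(query_log)
print(sorted(index["sourdough bakery lisbon"]))
```

An exact-match lookup like this only answers queries that already appear in the log; it is meant to show the shape of the data, not a complete retrieval method.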

When a search query is entered into a search engine, the search engine shows the most relevant results by doing the following –