Can we RAG the whole web?

This article assumes some prior knowledge of vector embeddings. If you’re not quite sure what those are, I strongly recommend this article from Simon Willison on the subject; it will help you get a better grasp of this blog post.

As I was reading Anyscale.com’s benchmark analysis on RAG, a question came to me: what would be the most feasible way to vectorize the whole web and let an LLM query specific domains, or groups of domains, when needed? So here is my, most likely, dead-in-the-water Google-killer idea, which I believe is worth sharing with the community. It has also been an interesting piece of writing that led me somewhere I did not foresee when I started this article.

An LLM on its own is not capable of answering this sort of question, as a model is only retrained on new data every once in a while and has not yet been trained on “yesterday’s” data. While continuous training is an active area of research, the cheapest route to answering this sort of prompt today is RAG.
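To make the RAG step concrete, here is a minimal sketch of the retrieval flow: embed the user’s question, rank previously embedded chunks by similarity, and prepend the best matches to the prompt. The `embed()` function is a stand-in (a toy hashed bag-of-words) so the example runs without any external model; a real system would call an embedding model and a proper vector store instead.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: hash words into a fixed-size bag-of-words
    # vector so the example runs without any external model.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Yesterday's" documents, already chunked and embedded offline.
documents = [
    "The new model was announced yesterday and supports longer context windows.",
    "Older release notes describe the previous architecture.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding.
    q = embed(query)
    scored = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

# The retrieved chunks would be prepended to the prompt sent to the LLM,
# giving it access to data it was never trained on.
print(retrieve("What was announced yesterday?"))
```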

Google’s index is made of 50 billion pages, while Bing has 4 billion in its index. This is a lot of data, and the idea of building a crawler of this magnitude is a bit daunting. A simpler approach could be to use the XML sitemaps that a domain may list in its robots.txt, which give every URL the site specifically wants indexed by search engines. This has the advantage of yielding every URL without having to crawl the domain. But even with a perfect XML parser that produces a huge index of URLs, we would still need to make a request for each URL found, extract the main content, chunk that content into smaller pieces, tokenize it, and compute the embeddings for each chunk, as sketched below. Doing this for a few URLs is easy, but doing it for billions of URLs starts to get tricky and expensive (although not completely out of reach).
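Here is a rough sketch of that per-domain pipeline, using only the standard library: read the `Sitemap:` lines from robots.txt, parse the sitemap for URLs, then fetch, extract, chunk, and embed each page. The content extraction is deliberately crude, nested sitemap indexes are ignored, and `embed_chunks()` is a placeholder for whatever embedding model or API you would actually plug in.

```python
import re
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def sitemap_urls_from_robots(domain: str) -> list[str]:
    # robots.txt may advertise one or more "Sitemap:" lines.
    # `domain` is expected to include the scheme, e.g. "https://example.com".
    robots = urllib.request.urlopen(urllib.parse.urljoin(domain, "/robots.txt")).read().decode()
    return [line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")]

def page_urls_from_sitemap(sitemap_url: str) -> list[str]:
    # A plain urlset sitemap lists pages in <loc> elements; sitemap
    # indexes (sitemaps of sitemaps) are ignored to keep the sketch short.
    xml = urllib.request.urlopen(sitemap_url).read()
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in ET.fromstring(xml).findall(".//sm:loc", ns)]

def extract_main_text(html: str) -> str:
    # Crude main-content extraction: drop scripts/styles, then all tags.
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size word chunks with a small overlap between neighbours.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Placeholder: swap in a real embedding model or API here.
    raise NotImplementedError("plug in an embedding model")

def index_domain(domain: str) -> None:
    for sitemap in sitemap_urls_from_robots(domain):
        for url in page_urls_from_sitemap(sitemap):
            html = urllib.request.urlopen(url).read().decode(errors="ignore")
            chunks = chunk(extract_main_text(html))
            vectors = embed_chunks(chunks)  # store (url, chunk, vector) in a vector index
```

Even in this toy form, the cost structure is visible: one robots.txt fetch and a handful of sitemap fetches per domain, then one HTTP request plus one embedding call per page, multiplied across billions of pages.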