

Submitted by Style Pass, 2024-06-09 17:30:33

Service to import data from various sources (e.g. PDF, images, Microsoft Office, HTML) and index it in AI Search. Increases data relevance and reduces the final indexed size by 90%+. Useful for RAG scenarios with LLMs. Hosted in Azure with a serverless architecture.

In a real-world scenario, with a public corpus of 15M characters (222 PDFs, 7,330 pages), 2,940 facts were generated (8.41 MB indexed). That's a 93% reduction in item count compared to the chunking method (48,111 chunks of 300 characters each).
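As a sanity check, the 93% figure follows directly from the two item counts quoted above (this sketch uses only the numbers from this section):

```python
# Item counts reported for the 15M-character corpus.
facts = 2_940     # facts generated by this service
chunks = 48_111   # 300-character chunks produced by plain chunking

# Relative reduction in the number of indexed items.
reduction = 1 - facts / chunks
print(f"{reduction:.1%}")  # → 93.9%
```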

This project is a proof of concept and is not intended for production use. It demonstrates how Azure serverless technologies and LLMs can be combined to build a high-quality search engine for RAG scenarios.

Document extraction is based on Azure Document Intelligence, specifically on the prebuilt-layout model. It supports the following formats:

To override a specific configuration value, you can also use environment variables. For example, to override the llm.endpoint value, you can use the LLM__ENDPOINT variable:
