Scalable PDF document processing with DataChain and Unstructured.io

2024-09-23 19:00:10

Extract and parse text from documents and create vector embeddings in a scalable, distributed way (in fewer than 70 lines of code)

Most organisations hold a large store of information in internal documents, call transcripts and other unstructured data. These data contain many useful insights about customers, employees and the inner workings of the company. However, data teams leave them largely untapped because large quantities of data in unstructured formats are difficult to work with.

Today, we will see how to process a collection of documents in fewer than 70 lines of code: extracting and parsing their text, then creating vector embeddings for downstream tasks (e.g. for RAG or as ML features). The approach is scalable, and it makes the resulting datasets easy to version.
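To make the shape of that pipeline concrete before bringing in the real tools, here is a minimal pure-Python sketch. It is not DataChain or Unstructured.io code: the names `Chunk`, `split_into_chunks`, `toy_embedding` and `process_documents` are hypothetical, the input is assumed to be already-extracted text rather than PDFs, and the hashed bag-of-words vector is only a stand-in for a real embedding model.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str             # document the text came from
    text: str               # one chunk of extracted text
    embedding: list[float]  # vector representation of the chunk


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embedding(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding model (assumption for illustration):
    a hashed bag-of-words vector, L2-normalised when non-empty."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def process_documents(docs: dict[str, str], max_words: int = 50) -> list[Chunk]:
    """Chunk each document's text and attach an embedding to every chunk."""
    return [
        Chunk(source=name, text=piece, embedding=toy_embedding(piece))
        for name, text in docs.items()
        for piece in split_into_chunks(text, max_words)
    ]


if __name__ == "__main__":
    # Hypothetical input: file names mapped to text that was already extracted.
    sample = {"paper_1.pdf": "Attention is all you need " * 30}
    for chunk in process_documents(sample):
        print(chunk.source, len(chunk.text.split()), len(chunk.embedding))
```

Each output row keeps a pointer back to its source document, which is what lets the embeddings be traced to the files they came from and stored as a versioned dataset.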

We will work with a publicly available Google Cloud Storage bucket that contains a collection of NeurIPS conference papers, which stand in for our internal company documents.