gptpdf uses PyMuPDF to parse PDFs, identifying both text and non-text regions. It then merges or filters the text regions based on certain rules, and inputs the final results into a multimodal model for parsing. This method is particularly effective.
Using a layout analysis model, each page of the PDF is parsed to identify the type of each region, which includes Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, and Equation. The coordinates of each region are also obtained.
This result includes the type, coordinates, and reading order of each region. By using this result, more precise rules can be set to parse the PDF.
Finally, input the images of the corresponding regions into a multimodal model, such as GPT-4o or Qwen-VL, to directly obtain text blocks that are friendly to RAG solutions.
If using Azure, the azure_deployment and azure_endpoint parameters need to be passed; otherwise, only the API key needs to be provided.