Extracting Financial Data from SEC 10-K Filings with LLMs

submited by
Style Pass
2024-11-07 18:30:04

Every year, U.S. public companies file a comprehensive financial report called a 10-K. These reports are filed in HTML format, which can complicate automated parsing. Traditional approaches to extracting data from these HTML tables are challenging, time-consuming, and often imprecise. What should be a straightforward data extraction task becomes an endless game of whack-a-mole with edge cases.

This blog post demonstrates how we can use structured text generation to cut through the chaos and extract clean, consistent data directly from 10-K reports. We'll show you how to transform messy HTML tables into neat CSVs (easily read by Excel) that are primed for analysis.

Fortunately, we can use structured generation to extract the information we need from the 10-K directly into tabular data. Feed some ugly text into our model, get a fresh CSV on the other side.

The first is Microsoft. We can see revenue at the top, operating income in the middle, net income on the bottom, and three columns representing each reporting year.

Leave a Comment