Applications that run documents through LLMs or embedding models need to clean the text before feeding it into the model. I'm building a personalized content feed called Scour and was looking for a Rust crate to extract text from scraped HTML. I started off using a library that's used by a couple of LLM-related projects. However, while hunting a phantom memory leak, I built a little tool (emschwartz/html-to-text-comparison) to compare 13 Rust crates for extracting text from HTML and found that the results varied widely.
TL;DR: lol_html is a very impressive HTML rewriting crate from Cloudflare, and fast_html2md is a newer HTML-to-Markdown crate that builds on it. If you're doing web scraping or working with LLMs in Rust, you should take a look at both.
In principle, any of the 13 crates should work for an LLM application, because we mostly care about stripping away HTML tags and extraneous content like scripts and CSS. I say "should" because some of these crates definitely do not work as well as you might expect.
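To make "stripping away HTML tags and extraneous content" concrete, here is a deliberately naive sketch using only the Rust standard library. It is not how any of the compared crates work, and `strip_html` is a hypothetical name chosen for illustration. It assumes reasonably well-formed input and ignores entity decoding, comments, and `>` characters inside attribute values, which is part of why a real parser-based crate is worth using.

```rust
/// Naive HTML-to-text sketch (illustration only, not a real parser).
/// Drops every tag, drops the contents of <script> and <style>,
/// and collapses runs of whitespace into single spaces.
fn strip_html(html: &str) -> String {
    // ASCII lowercasing keeps byte offsets identical to `html`,
    // so indices found in one string are valid in the other.
    let lower = html.to_ascii_lowercase();
    let mut out = String::new();
    let mut pos = 0;

    while let Some(rel) = html[pos..].find('<') {
        let lt = pos + rel;
        out.push_str(&html[pos..lt]); // text before the tag
        out.push(' '); // treat every tag as a word separator

        // Find the end of the tag; an unterminated tag drops the rest.
        let gt = match html[lt..].find('>') {
            Some(rel) => lt + rel,
            None => {
                pos = html.len();
                break;
            }
        };
        pos = gt + 1;

        // Tag name is empty for closing tags such as "</p>".
        let name = lower[lt + 1..gt]
            .split(|c: char| c.is_ascii_whitespace() || c == '/')
            .next()
            .unwrap_or("");

        // Skip everything up to and including the matching close tag.
        if name == "script" || name == "style" {
            let close = format!("</{}", name);
            pos = match lower[pos..].find(&close) {
                Some(rel) => {
                    let start = pos + rel;
                    match lower[start..].find('>') {
                        Some(g) => start + g + 1,
                        None => html.len(),
                    }
                }
                None => html.len(),
            };
        }
    }

    out.push_str(&html[pos..]); // trailing text after the last tag
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Calling `strip_html("<h1>Title</h1><style>p { color: red; }</style><p>Hello world</p>")` should yield `Title Hello world`. Because every tag becomes a separator, inline markup such as `<b>` also introduces a space, which is one of many small ways a hand-rolled approach diverges from what you would actually want.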