Hidden in HTML: Parsing Page Layouts. 2.9B Web Page Analysis

submitted by
Style Pass
2025-01-14 18:30:11

The purpose of this series is to learn how to parse HTML more precisely and accurately. While HTML is an inherently unstructured language, the way that humans use it absolutely has patterns and structure.

This series builds a foundational understanding of how we write semi-semantic HTML, so that we can build better generic parsers (article parsers, product parsers, news parsers, etc.) and reduce the amount of site-specific parsing we have to do.

Another benefit is cleaner input for AI/ML models and LLMs. If you’re not removing navigation text, for example, your LLM learns to associate whatever text tokens are in the header, sidebar, and footer navigation with the main article text (which is what you actually want to feed your models), and that can significantly reduce the output quality of your model.
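To make the navigation problem concrete, here is a minimal sketch of stripping page chrome before text goes to a model. It uses only Python's standard-library `html.parser`. The `SKIP_TAGS` set, the `MainTextExtractor` class, and the `extract_main_text` helper are hypothetical names chosen for illustration, and the sketch assumes the page marks its chrome with semantic tags like `<nav>` and `<footer>`, which many real pages do not.

```python
from html.parser import HTMLParser

# Assumption: these tags wrap page chrome (or non-text content), not article text.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text nodes that are not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []     # text fragments that survived the filter

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with header, nav, footer, and sidebar text removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """
<html><body>
<header><a href="/">Home</a></header>
<nav><ul><li>Products</li><li>About</li></ul></nav>
<article><h1>Parsing layouts</h1><p>Main article text.</p></article>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract_main_text(sample))  # prints: Parsing layouts Main article text.
```

A tag-name filter like this only works when the markup is actually semantic. Pages that build their navigation out of plain `<div>` elements need signals from attributes and their values instead, which is the kind of pattern this series sets out to study.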

Throughout this article series, we’ll look at an analysis of how HTML tags, attributes, and their values are used across 2.9 billion web pages from the November 2024 Common Crawl dataset.
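As a rough picture of what counting tag and attribute usage can look like, here is a small sketch that tallies tag names, attribute names, and `class` values over a couple of in-memory pages. It is not the pipeline behind the 2.9 billion page analysis. `TagAttrCounter`, `count_usage`, and the two sample pages are hypothetical and exist only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagAttrCounter(HTMLParser):
    """Tallies start tags, attribute names, and class tokens for one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attrs = Counter()
        self.class_values = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, value in attrs:
            self.attrs[name] += 1
            # A class attribute can hold several space-separated tokens.
            if name == "class" and value:
                self.class_values.update(value.split())


def count_usage(pages):
    """Aggregate the per-page tallies over an iterable of HTML strings."""
    tags, attrs, classes = Counter(), Counter(), Counter()
    for html in pages:
        parser = TagAttrCounter()  # fresh parser per page keeps state isolated
        parser.feed(html)
        parser.close()
        tags += parser.tags
        attrs += parser.attrs
        classes += parser.class_values
    return tags, attrs, classes


# Hypothetical sample pages standing in for crawled documents.
pages = [
    '<div class="nav main"><a href="/">Home</a></div>',
    '<div class="nav"><p>Hi</p></div>',
]

tags, attrs, classes = count_usage(pages)
print(tags.most_common(1))     # [('div', 2)]
print(classes.most_common(1))  # [('nav', 2)]
```

Run over enough pages, tallies like these show which tags, attributes, and values people reach for when they mark up navigation, articles, and other layout regions.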
