ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON

submitted by
Style Pass
2025-01-15 12:30:07

In April 2024, we launched Jina Reader, an API that transforms any webpage into LLM-friendly markdown by simply adding r.jina.ai as a URL prefix. In September 2024, we launched two small language models, reader-lm-0.5b and reader-lm-1.5b, specifically designed to convert raw HTML into clean markdown. Today, we're excited to introduce ReaderLM-v2, the second generation of ReaderLM: a 1.5B-parameter language model that converts raw HTML into well-formatted markdown or JSON with higher accuracy and better long-context handling. ReaderLM-v2 handles up to 512K tokens of combined input and output length. The model offers multilingual support across 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.
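To make the URL-prefix idea concrete, here is a minimal Python sketch of how a client might build a Reader URL and fetch the resulting markdown. The helper names `reader_url` and `fetch_markdown` are hypothetical, and the `https://` scheme, the `User-Agent` header, and UTF-8 decoding of the response are assumptions not stated in the announcement.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Assumption: the prefix is served over HTTPS at the host named in the post.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Return the page URL with the Reader prefix prepended (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Reject relative or scheme-less input so the prefixed URL stays unambiguous.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {page_url!r}")
    return READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the prefixed URL and return the body as text (assumes UTF-8)."""
    request = Request(reader_url(page_url), headers={"User-Agent": "reader-example"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the URL; call fetch_markdown(...) to make a network request.
    print(reader_url("https://example.com/post"))
```

Building the URL is kept separate from fetching it so the prefixing step can be checked without network access.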

Thanks to its new training paradigm and higher-quality training data, ReaderLM-v2 is a significant leap forward from its predecessor, particularly in handling long-form content and generating markdown syntax. While the first generation approached HTML-to-markdown conversion as a "selective-copy" task, v2 treats it as a true translation process. This shift enables the model to make full use of markdown syntax, and it excels at generating complex elements such as code fences, nested lists, tables, and LaTeX equations.
