Note: This model is deprecated by newer models.

reader-lm-0.5b

A small language model for converting raw HTML into Markdown.
Release Post
License: CC-BY-NC-4.0
Release Date: 2024-08-11
Input: Text (HTML)
Output: Text (Markdown)
Model Details
Parameters: 494M
Input Token Length: 256K
Language Support
🌍 Multilingual support
Related Models
reader-lm-1.5b
Tags
text-understanding
multilingual
document-processing
resource-efficient
long-context
base-model
language-model
Available via
Commercial License, AWS SageMaker, Microsoft Azure, Hugging Face

Overview

Reader LM 0.5B is a specialized language model designed to solve the complex challenge of converting HTML documents into clean, structured markdown text. This model addresses a critical need in modern data processing pipelines: efficiently transforming messy web content into a format that's ideal for LLMs and documentation systems. Unlike general-purpose language models that require massive computational resources, Reader LM 0.5B achieves professional-grade HTML processing with just 494M parameters, making it accessible to teams with limited computing resources. Organizations dealing with web content processing, documentation automation, or building LLM-powered applications will find this model particularly valuable for streamlining their content preparation workflows.

Methods

The model employs a "shallow-but-wide" architecture specifically optimized for selective-copy operations rather than creative text generation. Built on a decoder-only foundation with 24 layers and a hidden dimension of 896, the model uses grouped-query attention with 14 query heads and 2 key-value heads to process input sequences efficiently. Training proceeded in two distinct stages: first on shorter, simpler HTML (up to 32K tokens) to learn basic conversion patterns, then on complex, real-world HTML (up to 128K tokens) to handle challenging cases. The model incorporates contrastive search during training and implements a repetition detection mechanism to prevent degeneration issues such as token loops. A unique aspect of its architecture is the zigzag-ring-attention mechanism, which enables the model to handle extremely long sequences of up to 256K tokens while maintaining stable performance.
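For illustration, the dimensions above can be written out as a Hugging Face decoder config. This is a sketch only: using Qwen2Config as the base class and 262,144 as the exact positional limit are assumptions, not confirmed details of the model.

```python
from transformers import Qwen2Config

# A sketch of the "shallow-but-wide" layout described above. With 14 query
# heads sharing 2 key-value heads, each KV head serves 7 query heads
# (grouped-query attention), which shrinks the KV cache on long inputs.
config = Qwen2Config(
    num_hidden_layers=24,            # "shallow": relatively few decoder layers
    hidden_size=896,                 # "wide": large hidden dimension for that depth
    num_attention_heads=14,          # query heads
    num_key_value_heads=2,           # KV heads -> 7:1 grouped-query attention
    max_position_embeddings=262144,  # assumed encoding of the 256K-token context
)
print(config)
```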

Performance

In real-world testing, Reader LM 0.5B demonstrates an impressive efficiency-to-performance ratio across multiple metrics. The model achieves a ROUGE-L score of 0.56, indicating strong content preservation, and maintains a low token error rate of 0.34, showing minimal hallucination. In qualitative evaluations across 22 diverse HTML sources, including news articles, blog posts, and e-commerce pages in multiple languages, it shows particular strength in structure preservation and markdown syntax usage. The model excels at complex modern web pages where inline CSS and scripts can inflate the raw HTML to hundreds of thousands of tokens, a scenario where traditional rule-based approaches often fail. However, while the model performs exceptionally well on straightforward HTML-to-markdown conversion, highly dynamic or JavaScript-heavy pages may require additional preprocessing.
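The ROUGE-L figure above measures longest-common-subsequence overlap between the model's markdown and a reference conversion. A minimal sketch of computing it with Google's rouge_score package follows; the example strings are placeholders, not the actual evaluation data.

```python
from rouge_score import rouge_scorer

reference = "# Title\n\nGround-truth markdown for the page."
prediction = "# Title\n\nMarkdown produced by the model."

# rougeL scores the longest common subsequence between the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)  # signature: score(target, prediction)
print(scores["rougeL"].fmeasure)  # F1 in [0, 1]; the model card reports 0.56
```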

Best Practice

To deploy Reader LM 0.5B effectively, organizations should ensure their infrastructure meets the model's CUDA requirements, though its efficient architecture means it can run on consumer-grade GPUs. The model takes raw HTML as input and requires no special prefixes or instructions. For optimal output quality, implement the provided repetition detection mechanism to prevent token loops during generation. While the model supports multiple languages and a wide variety of HTML structures, it is designed specifically for content extraction and markdown conversion; it should not be used for tasks like free-form text generation, summarization, or direct question answering. The model is available through AWS SageMaker for production deployment, and a Google Colab notebook is provided for testing and experimentation. Teams should also be aware that although the model can handle extremely long documents of up to 256K tokens, processing such large inputs may require additional memory management strategies.
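As a concrete starting point, here is a minimal usage sketch with Hugging Face transformers. The repo id, chat-template invocation, and generation settings are assumptions based on the model card rather than an official recipe; the published Colab notebook remains the reference for exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-0.5b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Raw HTML goes in as-is: no instruction prefix, per the model card.
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
messages = [{"role": "user", "content": html}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=False,          # deterministic decoding suits a conversion task
    repetition_penalty=1.08,  # simple guard against the token loops noted above
)
# Decode only the newly generated tokens, i.e. the markdown.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```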
Blogs that mention this model
September 11, 2024 • 13 minutes read
Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown
Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean markdown.
Jina AI
January 15, 2025 • 17 minutes read
ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON
ReaderLM-v2 is a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality.
Jina AI