Reader is an AI data ingestion pipeline and web content parser that converts websites and documents into clean markdown for use with large language models. It renders pages in a headless browser, extracts the main content from URLs and PDF files, and emits structured text with irrelevant web clutter removed.
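To make the web-to-markdown step concrete, here is a minimal sketch that uses only the Python standard library. It is not Reader's actual implementation: the `MarkdownExtractor` class, the `html_to_markdown` function, and the list of tags treated as clutter are all assumptions chosen for illustration, and a real converter would handle many more elements (links, tables, code, images).

```python
from html.parser import HTMLParser

# Tags whose content is treated as clutter (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}
# Block-level tags that are kept, mapped to their markdown prefix.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownExtractor(HTMLParser):
    """Hypothetical extractor: keeps headings, paragraphs, and list items."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished markdown blocks
        self.buffer = []      # text fragments of the block being built
        self.prefix = ""      # markdown prefix of the block being built
        self.skip_depth = 0   # greater than 0 while inside a clutter tag

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buffer = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()
            self.prefix = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self._flush()

    def handle_data(self, data):
        # Text inside clutter tags is dropped; everything else is buffered.
        if self.skip_depth == 0:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown` on a page string returns the kept blocks separated by blank lines. Inline tags such as `<b>` are ignored here, so their text is kept but their emphasis is lost.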
The system supports retrieval-augmented generation by retrieving web search results and re-ranking them to improve context relevance. It also makes content more accessible by using vision models to generate descriptive captions for images and by creating vector embeddings for semantic retrieval.
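The re-ranking idea can be illustrated with a small, self-contained sketch. It is not the model Reader uses: the `embed` function below is a toy bag-of-words stand-in for a learned embedding model, and the names `embed`, `cosine`, and `rerank` are hypothetical. The point is only the shape of the technique: embed the query and each candidate, score by cosine similarity, and keep the best matches.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: term counts over lowercase words (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query: str, documents: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k documents ordered by similarity to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In a retrieval-augmented setup, the `documents` list would hold the text of retrieved search results, and only the top-ranked entries would be passed to the language model as context.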
The tool covers document conversion, web content extraction, and data preprocessing. Its capabilities include headless browser rendering for pages that require JavaScript execution, multi-format conversion of office documents, and bucket-based content caching to reduce latency.
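As a rough illustration of how content caching can cut latency, the sketch below stores converted markdown under a key derived from the URL and reuses it until it expires. Everything here is an assumption: the in-memory dictionary merely stands in for a storage bucket, and `BucketCache`, `fetch_markdown`, the SHA-256 key scheme, and the one-hour default lifetime are hypothetical names and choices, not Reader's documented behavior.

```python
import hashlib
import time
from typing import Callable, Optional


class BucketCache:
    """Hypothetical cache; a dict stands in for an object-storage bucket."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._bucket: dict[str, tuple[float, str]] = {}  # key -> (stored_at, markdown)

    @staticmethod
    def key_for(url: str) -> str:
        # A fixed-length object key that is safe to use as a file or blob name.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> Optional[str]:
        entry = self._bucket.get(self.key_for(url))
        if entry is None:
            return None
        stored_at, markdown = entry
        if self.clock() - stored_at > self.ttl_seconds:
            return None  # entry is stale; the caller should convert again
        return markdown

    def put(self, url: str, markdown: str) -> None:
        self._bucket[self.key_for(url)] = (self.clock(), markdown)


def fetch_markdown(url: str, cache: BucketCache,
                   convert: Callable[[str], str]) -> str:
    """Return cached markdown when fresh; otherwise convert and store it."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    markdown = convert(url)
    cache.put(url, markdown)
    return markdown
```

The expensive step (browser rendering and conversion) is passed in as `convert`, so it runs only on a cache miss or after the entry has expired.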
The conversion engine can be deployed as a self-hosted container that includes all necessary headless browsers and document processors.