Markup (github/markup) is a tool for converting various documentation formats and manual pages into structured HTML. It acts as a rendering engine selector and converter: it picks a renderer for each raw markup file and transforms the file into web-ready output through a pluggable pipeline.
The main features of github/markup are: Static Markup Rendering, Markup Language Detection, Language Detection, HTML Converters, Markup Language Detectors, Markup To HTML Converters, Content Type Detection, HTML Renderers.
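The "rendering engine selector" idea can be illustrated with a small registry that maps filename patterns to renderer functions. This is only a sketch in Python under stated assumptions: the names `register`, `select_renderer`, `render`, and the toy paragraph renderer are hypothetical and are not taken from github/markup's own code or API.

```python
import html
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical registry of (filename pattern, renderer) pairs, checked in order.
Renderer = Callable[[str], str]
_RENDERERS: List[Tuple[re.Pattern, Renderer]] = []


def register(pattern: str, renderer: Renderer) -> None:
    """Add a renderer for filenames matching the given regular expression."""
    _RENDERERS.append((re.compile(pattern, re.IGNORECASE), renderer))


def select_renderer(filename: str) -> Optional[Renderer]:
    """Return the first registered renderer whose pattern matches the filename."""
    for pattern, renderer in _RENDERERS:
        if pattern.search(filename):
            return renderer
    return None


def render(filename: str, content: str) -> str:
    """Convert content to HTML, falling back to escaped preformatted text."""
    renderer = select_renderer(filename)
    if renderer is None:
        return "<pre>" + html.escape(content) + "</pre>"
    return renderer(content)


def render_plain_paragraphs(text: str) -> str:
    """Toy renderer (illustration only): one <p> per blank-line-separated block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return "\n".join("<p>" + html.escape(b) + "</p>" for b in blocks)


# Plugging in a renderer is a single registration call.
register(r"\.(txt|text)$", render_plain_paragraphs)
```

Calling `render("README.txt", text)` would dispatch to the toy renderer, while a file with no registered pattern falls through to the escaped `<pre>` fallback. A real pipeline would register one renderer per markup language instead of the toy function used here.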
Open-source alternatives to github/markup include:
google/magika: Magika is an AI content type classifier and MIME type prediction engine that uses deep learning to identify file…
pymupdf/pymupdf: PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool,…
openvenues/libpostal: Libpostal is a C library designed for international address parsing and normalization. It utilizes statistical NLP and…
apache/tika: Tika is a content analysis toolkit and Java library designed for detecting and extracting metadata and text from…
richqaq/pastemd: PasteMD is a clipboard-based document processor and productivity tool designed to convert Markdown or HTML content…
soimort/translate-shell: Translate-shell is a command-line translation tool and terminal dictionary client. It allows for the translation of…
Magika is an AI content type classifier and MIME type prediction engine that uses deep learning to identify file formats based on binary data. It analyzes byte sequences through a neural network to predict the content type of a file and provide associated confidence scores. The system features a foreign function interface that allows the core detection logic to be integrated across different programming languages. It includes a mechanism for configuring detection sensitivity and per-type thresholds to balance precision and recall. The project provides capabilities for bulk file analysis via…
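The trade-off behind per-type thresholds can be sketched in a few lines of Python. This is a generic illustration, not Magika's API: the function name, the fallback label, and every threshold value below are assumptions chosen for the example.

```python
from typing import Dict, Tuple

# Hypothetical values for illustration; these are not Magika's real thresholds.
DEFAULT_THRESHOLD = 0.5
PER_TYPE_THRESHOLDS: Dict[str, float] = {"markdown": 0.75, "pdf": 0.25}
FALLBACK_LABEL = "unknown"


def apply_threshold(label: str, score: float) -> Tuple[str, float]:
    """Keep the predicted label only if its score clears that type's threshold."""
    threshold = PER_TYPE_THRESHOLDS.get(label, DEFAULT_THRESHOLD)
    if score >= threshold:
        return label, score
    # Below the threshold: report a generic label instead of a risky guess.
    return FALLBACK_LABEL, score
```

Raising a type's threshold favors precision for that type (fewer wrong labels, more fallbacks), while lowering it favors recall.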
Tika is a content analysis toolkit and Java library designed for detecting and extracting metadata and text from thousands of different file types. It functions as a universal document text extractor and metadata extraction engine, converting complex files into plain text or XHTML. The system employs a specialized MIME type detector that identifies document formats using magic bytes and metadata to determine the correct parser. It serves as an OCR integration gateway, connecting to external text recognition tools to extract content from image files. The project covers a broad range of extrac…
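Magic-byte detection, as mentioned for Tika's MIME type detector, amounts to comparing the first bytes of a file against known signatures. The sketch below is in Python rather than Java and does not use Tika's API. The signature table is a small hand-picked subset of well-known file signatures, and `detect_mime` is a hypothetical helper name.

```python
# A few well-known leading byte signatures and their MIME types.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_mime(data: bytes, default: str = "application/octet-stream") -> str:
    """Return the MIME type whose signature prefixes the data, else the default."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    return default
```

A production detector would combine such prefix checks with metadata such as the filename or container contents, since formats like ZIP-based Office documents share the same leading bytes.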
Libpostal is a C library designed for international address parsing and normalization. It utilizes statistical NLP and a language classifier to decompose unstructured global address strings into structured components and standardize street addresses by expanding abbreviations and resolving regional naming variations across multiple languages. The project provides tools for text transliteration, converting various scripts into standardized Latin-ASCII or NFD forms. It also includes capabilities for address deduplication, using symmetric fuzzy matching to identify whether different address reco…
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It…