python-ftfy is a Unicode text repair library designed to fix mojibake and encoding glitches. It provides utilities for byte encoding detection, HTML entity decoding, and the recovery of corrupted text to restore it to its intended Unicode form. The project distinguishes itself through a multi-layered decoding pipeline that identifies and reverts complex encoding mix-ups. It uses heuristic-based detection to resolve instances where text was decoded using the wrong codec across multiple layers of corruption, and it can handle non-standard UTF-8 variants and sloppy encoding mappings. The librar
This project is a formal Markdown specification standard that provides a detailed markup syntax definition and a definitive set of rules for parsing plain text into consistent HTML output. It establishes a standardized grammar for structural blocks and inline elements to ensure uniform rendering across different software implementations. The specification is supported by a parser conformance suite and reference implementations in C and JavaScript to verify that implementations adhere to the standard. It includes a system for implementation verification that compares transformed input strings
Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing framework. It provides a multimodal machine learning framework and a distributed pipeline system for sequencing annotators to process large-scale linguistic data. The library includes a transformer text processor for generating contextual vector embeddings and a dedicated inference engine for managing large language models. The project distinguishes itself through its ability to process heterogeneous data types, including text, audio, and images, within a unified vision-langu
Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖
This project is a Unicode text repair tool and mojibake correction library designed to fix encoding glitches and restore original characters from mangled strings. It functions as a text encoding detector and a Unicode normalization tool to resolve issues where text has been incorrectly decoded.
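To illustrate what "incorrectly decoded" text looks like and how a layer of it can be reversed, here is a minimal standard-library sketch. It is not ftfy's implementation and does not use its API. The function name `undo_mojibake`, the default `wrong_codec` of `cp1252`, and the `max_layers` limit are assumptions made for this example.

```python
def undo_mojibake(text: str, wrong_codec: str = "cp1252", max_layers: int = 3) -> str:
    """Reverse UTF-8 text that was mistakenly decoded with `wrong_codec`.

    Illustrative sketch only: a real repair tool would also use heuristics
    to decide whether the text is actually broken before changing it.
    """
    for _ in range(max_layers):
        try:
            # Re-encode with the codec that was wrongly used for decoding,
            # then decode the recovered bytes as UTF-8.
            candidate = text.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text does not round-trip, so assume no further layer exists.
            break
        if candidate == text:
            # Nothing changed (for example, pure ASCII input).
            break
        text = candidate
    return text


if __name__ == "__main__":
    # Build one layer of mojibake: UTF-8 bytes read as Windows-1252.
    original = "caf\u00e9"
    broken = original.encode("utf-8").decode("cp1252")
    print(broken, "->", undo_mojibake(broken))
```

The loop is one simple way to handle several stacked layers of the same mix-up. This sketch will also "repair" text that only happens to look like mojibake, and Python's strict `cp1252` codec rejects a few byte values, so heuristic detection and more lenient encoding mappings are still needed in practice.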
The main features of rspeer/python-ftfy are: Mojibake Repair Utilities, Unicode Text Repair Libraries, Text Cleaning, Byte Encoding Detectors, Character Encoding Detectors, Lossy Encoding Detection, Inconsistent Encoding Repair, Mojibake Detection.
Open-source alternatives to rspeer/python-ftfy include:
luminosoinsight/python-ftfy: python-ftfy is a Unicode text repair library designed to fix mojibake and encoding glitches. It provides utilities for…
commonmark/commonmark-spec: This project is a formal markdown specification standard that provides a detailed markup syntax definition and a…
johnsnowlabs/spark-nlp: Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing…
guumaster/hostctl: Your dev tool to manage /etc/hosts like a pro!
ai/nanoid: Nanoid is a library for generating unique, fixed-length identifiers designed for distributed systems and database…
abadojack/whatlanggo: Natural language detection library for Go.