What does rspeer/python-ftfy do?

This project is a Unicode text repair tool and mojibake correction library designed to fix encoding glitches and restore original characters from mangled strings. It functions as a text encoding detector and a Unicode normalization tool to resolve issues where text has been incorrectly decoded.
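
To make "incorrectly decoded" concrete, the sketch below produces mojibake with the Python standard library alone and then undoes it. It is an illustration of the failure mode, not ftfy's implementation: the library has to detect which wrong codec was involved, whereas this example assumes it is already known to be Latin-1. The variable names are illustrative.

```python
# Stdlib-only illustration of mojibake; this is not how ftfy is implemented.
original = "caf\u00e9"  # "café"

# Mojibake appears when UTF-8 bytes are decoded with the wrong codec.
mangled = original.encode("utf-8").decode("latin-1")

# Reversing the mistake: re-encode with the wrong codec, then decode as UTF-8.
# Assumption: we already know the wrong codec was Latin-1.
repaired = mangled.encode("latin-1").decode("utf-8")

print(mangled)   # the two-character sequence that replaces "é"
print(repaired)  # the original text again
```

In practice the library's documented entry point, `ftfy.fix_text`, is what decides whether such a reversal is appropriate for a given string.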

What are the main features of rspeer/python-ftfy?

The main features of rspeer/python-ftfy are: Mojibake Repair Utilities, Unicode Text Repair Libraries, Text Cleaning, Byte Encoding Detectors, Character Encoding Detectors, Lossy Encoding Detection, Inconsistent Encoding Repair, Mojibake Detection.
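
As a rough sketch of what byte encoding detection can look like, the function below checks for a byte-order mark and then tests UTF-8 validity. The name `guess_encoding` and the Latin-1 fallback are assumptions made for this example; they are not taken from the library, and UTF-32 byte-order marks are deliberately ignored.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess an encoding from a BOM, then UTF-8 validity."""
    # A UTF-8 byte-order mark is an unambiguous signature.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Either UTF-16 byte-order mark; Python's "utf-16" codec reads the BOM itself.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: fall back to a single-byte codec that accepts every byte.
        return "latin-1"


text = b"\xef\xbb\xbfhello".decode(guess_encoding(b"\xef\xbb\xbfhello"))
print(text)
```

A guess of this kind is only a heuristic: bytes that happen to be valid UTF-8 may still have been written in another encoding.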

What are some open-source alternatives to rspeer/python-ftfy?

Open-source alternatives to rspeer/python-ftfy include:
luminosoinsight/python-ftfy - python-ftfy is a Unicode text repair library designed to fix mojibake and encoding glitches. It provides utilities for…
commonmark/commonmark-spec - This project is a formal markdown specification standard that provides a detailed markup syntax definition and a…
johnsnowlabs/spark-nlp - Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing…
guumaster/hostctl - Your dev tool to manage /etc/hosts like a pro!
ai/nanoid - Nanoid is a library for generating unique, fixed-length identifiers designed for distributed systems and database…
abadojack/whatlanggo - Natural language detection library for Go.

Python Ftfy - fix broken Unicode and mojibake


Features

Mojibake Repair Utilities - Repairs complex, multi-layered encoding errors and mojibake patterns to restore the original intended text.
Unicode Text Repair Libraries - Provides a comprehensive library for repairing mojibake and encoding glitches to restore intended Unicode text.
Text Cleaning - Cleans text data by removing invisible control characters and terminal escapes while standardizing ligatures.
Byte Encoding Detectors - Identifies the likely encoding of byte strings by checking for byte-order marks and UTF-8 validity.
Character Encoding Detectors - Analyzes byte sequences to identify the most likely character encoding and detect lossy sequences.
Lossy Encoding Detection - Identifies lossy encoding sequences, such as replacement characters, to determine how text was mangled.
Inconsistent Encoding Repair - Detects and resolves instances where multiple different encodings are embedded within a single text stream.
Mojibake Detection - Uses character sequence heuristics to identify likely Unicode encoding glitches and mojibake.
Mojibake Restoration - Reverses multi-layered encoding errors to restore original characters from mangled UTF-8 and single-byte strings.
Unicode Normalization - Standardizes character widths and combining marks to ensure consistent string representations across platforms.
Recursive Encoding Reversals - Fixes multi-layered encoding errors by recursively applying decoding and encoding cycles until the text is stable.
Unicode Normalizers - Standardizes UTF-8 text through character decomposition, width normalization, and resolving Latin ligatures.
Multi-Stage Text Normalizers - Processes text through a sequence of cleaning, decoding, and normalizing steps to resolve mixed encoding glitches.
Surrogate Pair Correctors - Replaces UTF-16 surrogate pairs with correct characters to fix text decoded via obsolete standards.
Sloppy Encoding Mapping - Implements mapping of unassigned bytes in single-byte encodings to compatible Unicode codepoints for legacy browser interoperability.
Byte Order Mark Detectors - Provides a mechanism to guess original text encoding by checking for signature byte-order marks.
Encoding Error Analyzers - Analyzes strings to identify specific encoding errors and lists the transformations required to fix the text.
Escape Sequence Decoding - Converts hex and Unicode backslashed escape sequences into their corresponding Unicode characters.
Line Break Standardization - Standardizes platform-specific newline sequences like CRLF and CR into Unix-style line breaks.
Ligature Expansion - Decomposes single-character Latin ligatures into the individual letters they represent to fix copy-paste artifacts.
Encoding Repair CLIs - Provides a command-line utility for detecting and repairing mojibake and encoding glitches in files.
Unicode Character Inspectors - Analyzes Unicode strings by displaying codepoints, hexadecimal values, glyphs, and character categories.
Character Substitution Tables - Employs predefined dictionaries to replace ligatures and non-standard control characters with standard equivalents.
Character Width Normalizers - Replaces halfwidth and fullwidth forms of ASCII, katakana, and Hangul with standard Unicode representations.
Control Character Normalizers - Provides utilities that map C1 control characters to Windows-1252 equivalents to ensure web-standard compatibility.
Non-Standard UTF-8 Decoding - Decodes non-standard UTF-8 variants including CESU-8 and Java-specific null character encodings.
HTML Entity Processors - Converts HTML entity sequences and backslashed escapes into their corresponding Unicode characters.
Natural Language Processing - Utility for fixing Unicode glitches and mojibake.
General Utilities - Tool for fixing broken Unicode strings.
Text Processing - Fixes broken Unicode text to make it consistent.
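
Several of the cleaning features listed above (line break standardization, ligature expansion, character width normalization, control character removal) can be approximated with the standard library. The sketch below is a stand-in under stated assumptions: it uses NFKC normalization from `unicodedata`, which is broader than the targeted fixes the list describes, and the function name `tidy` is hypothetical.

```python
import re
import unicodedata


def tidy(text: str) -> str:
    """Hypothetical stdlib approximation of a few text-cleaning steps."""
    # CRLF and lone CR become Unix-style "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFKC folds compatibility characters: the "fi" ligature becomes "fi",
    # and fullwidth Latin letters become their ASCII forms.
    # Assumption: NFKC is a swapped-in technique, not the library's own method.
    text = unicodedata.normalize("NFKC", text)
    # Drop C0 control characters (and DEL) except tab and newline.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)


sample = "\ufb01le\r\n\uff21\uff22\uff23\x07"
print(tidy(sample))
```

NFKC also rewrites characters that a more conservative cleaner would leave alone (superscripts, for example), which is one reason a dedicated library applies narrower, individually configurable fixes.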

Open-source alternatives to Python Ftfy

Similar open-source projects, ranked by how many features they share with Python Ftfy.

LuminosoInsight/python-ftfy
4,043 stars
python-ftfy is a Unicode text repair library designed to fix mojibake and encoding glitches. It provides utilities for byte encoding detection, HTML entity decoding, and the recovery of corrupted text to restore it to its intended Unicode form. The project distinguishes itself through a multi-layered decoding pipeline that identifies and reverts complex encoding mix-ups. It uses heuristic-based detection to resolve instances where text was decoded using the wrong codec across multiple layers of corruption, and it can handle non-standard UTF-8 variants and sloppy encoding mappings. The librar…
Python
commonmark/commonmark-spec
5,105 stars
This project is a formal markdown specification standard that provides a detailed markup syntax definition and a definitive set of rules for parsing plain text into consistent HTML output. It establishes a standardized grammar for structural blocks and inline elements to ensure uniform rendering across different software implementations. The specification is supported by a parser conformance suite and a reference implementation in C and JavaScript to verify that implementations adhere to the standard. It includes a system for implementation verification that compares transformed input strings…
Python
JohnSnowLabs/spark-nlp
4,135 stars
Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing framework. It provides a multimodal machine learning framework and a distributed pipeline system for sequencing annotators to process large-scale linguistic data. The library includes a transformer text processor for generating contextual vector embeddings and a dedicated inference engine for managing large language models. The project distinguishes itself through its ability to process heterogeneous data types, including text, audio, and images, within a unified vision-langu…
Scala
ashvardanian/StringZilla
3,494 stars
Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖
C; tags: dataset, edit-distance, gpu

