# luminosoinsight/python-ftfy

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/luminosoinsight-python-ftfy).**

4,043 stars · 126 forks · Python · NOASSERTION

## Links

- GitHub: https://github.com/LuminosoInsight/python-ftfy
- Homepage: http://ftfy.readthedocs.org
- awesome-repositories: https://awesome-repositories.com/repository/luminosoinsight-python-ftfy.md

## Description

python-ftfy is a Unicode text repair library that fixes mojibake (text garbled by decoding bytes with the wrong codec) and related encoding glitches. It provides utilities for detecting byte encodings, decoding HTML entities, and recovering corrupted text to its intended Unicode form.

Its distinguishing feature is a multi-layered decoding pipeline. Heuristics detect text that was decoded with the wrong codec, possibly several times over, and the pipeline reverts each layer of corruption in turn. It can also handle non-standard UTF-8 variants and sloppy encoding mappings.
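
The layered mix-up described above can be illustrated with the standard library alone. The following is a minimal sketch, not ftfy's implementation: the helper names `garble` and `undo_mojibake_layers` are hypothetical, the wrong codec is assumed to be Latin-1, and nothing here decides whether the text was actually corrupted in the first place.

```python
def garble(text: str, layers: int = 1) -> str:
    """Simulate mojibake: encode as UTF-8, then decode the bytes as Latin-1."""
    for _ in range(layers):
        text = text.encode("utf-8").decode("latin-1")
    return text


def undo_mojibake_layers(text: str, max_layers: int = 3) -> str:
    """Peel back 'UTF-8 read as Latin-1' layers until no more can be reversed.

    Hypothetical helper for illustration only; it assumes Latin-1 was the
    wrong codec and applies no plausibility heuristics.
    """
    for _ in range(max_layers):
        try:
            # Reverse one layer: recover the original bytes, decode them properly.
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text cannot be a Latin-1 misreading of valid UTF-8; stop here.
            break
        if candidate == text:
            # Pure ASCII round-trips unchanged, so there is nothing left to fix.
            break
        text = candidate
    return text


once = garble("café", layers=1)    # one layer of corruption
twice = garble("café", layers=2)   # two layers of corruption
print(undo_mojibake_layers(once))
print(undo_mojibake_layers(twice))
```

Because it has no detection step, a sketch like this would also rewrite text that legitimately contains a sequence such as `Ã©`. Avoiding that kind of false positive is the role of the heuristics mentioned above.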

The library also covers a broad range of text standardization tasks, including Unicode normalization, line break standardization, and the expansion of Latin ligatures. It includes capabilities for character width normalization and the removal of terminal escapes and control characters.
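
Several of these standardization steps have rough standard-library equivalents. The sketch below is an assumption-laden illustration rather than ftfy's code: `standardize_text` and `ANSI_CSI` are hypothetical names, the regular expression only covers ANSI CSI escape sequences, and NFKC is used as a blunt stand-in that happens to expand Latin ligatures and fold fullwidth forms in one call.

```python
import re
import unicodedata

# Matches ANSI CSI sequences such as "\x1b[31m" (an assumption of this sketch;
# other kinds of terminal escapes are not handled).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def standardize_text(text: str) -> str:
    """Hypothetical helper: strip CSI escapes, unify line breaks, normalize."""
    # Remove terminal color and cursor escapes.
    text = ANSI_CSI.sub("", text)
    # Convert CRLF and bare CR line breaks to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFKC composes combining marks and also rewrites compatibility
    # characters, including Latin ligatures and fullwidth letters.
    return unicodedata.normalize("NFKC", text)


print(standardize_text("\x1b[31m\ufb01le\x1b[0m\r\n\uff21\uff22\uff23"))
```

NFKC rewrites many compatibility characters beyond ligatures and width forms, so it is more aggressive than applying each targeted fix separately.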

A command-line interface is available to automate the detection and repair of Unicode glitches within files.

## Tags

### Development Tools & Productivity

- [Mojibake Detectors](https://awesome-repositories.com/f/development-tools-productivity/mojibake-detectors.md) — Implements heuristic analysis of character sequences to detect and identify mojibake corruption.
- [Mojibake Detection](https://awesome-repositories.com/f/development-tools-productivity/text-encoding-utilities/mojibake-detection.md) — Implements heuristic-based detection of mojibake and Unicode encoding glitches. ([source](https://ftfy.readthedocs.io/en/latest/heuristic.html))
- [Character Encoding Detectors](https://awesome-repositories.com/f/development-tools-productivity/character-encoding-detectors.md) — Provides heuristic detection of byte encodings using BOM and UTF-8 validation to identify the correct codec.
- [Encoding Recovery Tools](https://awesome-repositories.com/f/development-tools-productivity/encoding-recovery-tools.md) — Identifies and reverts multi-layered encoding errors and incorrect codec applications to restore corrupted text.
- [Mojibake Repair Utilities](https://awesome-repositories.com/f/development-tools-productivity/mojibake-repair-utilities.md) — Heuristically detects and fixes mojibake and multi-layered encoding glitches to restore intended text. ([source](https://ftfy.readthedocs.io/en/latest/))
- [Encoding Mix-up Repair](https://awesome-repositories.com/f/development-tools-productivity/text-encoding-utilities/encoding-mix-up-repair.md) — Provides specialized repair for text corrupted by mismatched single-byte and variable-length encodings. ([source](https://ftfy.readthedocs.io/en/latest/encodings.html))
- [Inconsistent Encoding Repair](https://awesome-repositories.com/f/development-tools-productivity/text-encoding-utilities/inconsistent-encoding-repair.md) — Detects and fixes text containing fragments of different encodings embedded together. ([source](https://ftfy.readthedocs.io/en/latest/fixes.html))
- [Mojibake Restoration](https://awesome-repositories.com/f/development-tools-productivity/text-encoding-utilities/mojibake-restoration.md) — Restores text mangled by incorrect encoding cycles to its intended Unicode representation. ([source](https://ftfy.readthedocs.io/en/latest/detect.html))
- [Unicode Text Repair Libraries](https://awesome-repositories.com/f/development-tools-productivity/unicode-text-repair-libraries.md) — Provides a comprehensive toolset for fixing mojibake and encoding glitches to restore corrupted text.
- [Escape Sequence Decoding](https://awesome-repositories.com/f/development-tools-productivity/escape-sequence-decoding.md) — Converts backslashed escape sequences into their corresponding Unicode characters. ([source](https://ftfy.readthedocs.io/en/latest/fixes.html))
- [Line Break Standardization](https://awesome-repositories.com/f/development-tools-productivity/line-break-standardization.md) — Converts platform-specific line breaks like CRLF and CR into the standard Unix format. ([source](https://ftfy.readthedocs.io/en/latest/fixes.html))
- [Latin Ligature Decomposers](https://awesome-repositories.com/f/development-tools-productivity/monospaced-fonts/programming-ligature-fonts/latin-ligature-decomposers.md) — Decomposes single-character Latin ligatures into individual letters to resolve common copy-paste artifacts.
- [Lossy Encoding Detection](https://awesome-repositories.com/f/development-tools-productivity/output-formatting-utilities/formatting-enforcement-utilities/utf-8-formatting-enforcement/utf-8-sequence-validation/lossy-encoding-detection.md) — Identifies UTF-8 decoding errors that resulted in lossy replacement characters. ([source](https://ftfy.readthedocs.io/en/latest/heuristic.html))
- [Encoding Repair CLIs](https://awesome-repositories.com/f/development-tools-productivity/text-encoding-utilities/encoding-repair-clis.md) — Provides a command-line interface to automate the detection and repair of Unicode glitches within files. ([source](https://ftfy.readthedocs.io/en/latest/cli.html))
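
The "BOM and UTF-8 validation" approach listed under Character Encoding Detectors can be sketched with the standard library. This is an illustration under stated assumptions, not the library's detector: `guess_bytes_sketch` is a hypothetical name, and the `cp1252` fallback is a choice made for this example.

```python
import codecs


def guess_bytes_sketch(data: bytes, fallback: str = "cp1252"):
    """Hypothetical helper: return (decoded_text, codec_name) for raw bytes."""
    # A UTF-8 byte order mark identifies the encoding directly.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    # A UTF-16 BOM (either byte order) lets the utf-16 codec pick endianness.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    # Bytes that validate as UTF-8 are very likely UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback for this sketch: a common single-byte codec.
        return data.decode(fallback, errors="replace"), fallback


print(guess_bytes_sketch(b"\xef\xbb\xbfhello"))
print(guess_bytes_sketch(b"caf\xe9"))
```

A guess of this kind is only a starting point: any byte string decodes under some single-byte codec, so the fallback branch can return plausible-looking but wrong text.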

### Data & Databases

- [Multi-Layer Decoders](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/data-encoding-serialization/encoding-utilities/multi-layer-decoders.md) — Implements a multi-layered decoding pipeline to sequentially peel back layers of encoding corruption.
- [Text Cleaning](https://awesome-repositories.com/f/data-databases/client-side-data-processing/text-cleaning.md) — Cleans Unicode data by removing terminal escapes and decomposing ligatures to prepare text for analysis.

### Programming Languages & Runtimes

- [Unicode Normalization](https://awesome-repositories.com/f/programming-languages-runtimes/unicode-normalization.md) — Applies standard Unicode normalization forms to ensure characters and combining marks are represented consistently.
- [Unicode Text Handling](https://awesome-repositories.com/f/programming-languages-runtimes/unicode-text-handling.md) — Fixes encoding glitches and mojibake through configurable transformations applied to text segments. ([source](https://ftfy.readthedocs.io/en/latest/explain.html))

### Software Engineering & Architecture

- [Unicode Normalizers](https://awesome-repositories.com/f/software-engineering-architecture/unicode-normalizers.md) — Provides a utility for standardizing line breaks, character widths, and control characters for consistent display.
- [Character Width Normalizers](https://awesome-repositories.com/f/software-engineering-architecture/string-validation-and-normalization/speech-to-text-normalizers/character-width-normalizers.md) — Replaces halfwidth and fullwidth forms of ASCII, Katakana, and Hangul characters with standard equivalents. ([source](https://ftfy.readthedocs.io/en/latest/fixes.html))
- [Transformation Analysis](https://awesome-repositories.com/f/software-engineering-architecture/string-validation-and-normalization/string-encodings/utf-16-encodings/text-encoding-and-decoding/transformation-analysis.md) — Offers a detailed list of the specific encoding and decoding steps used to repair a string. ([source](https://ftfy.readthedocs.io/en/latest/explain.html))
- [Non-Standard UTF-8 Decoding](https://awesome-repositories.com/f/software-engineering-architecture/string-validation-and-normalization/string-encodings/utf-8-internal-storage/utf-8-byte-operations/non-standard-utf-8-decoding.md) — Supports decoding of UTF-8 variants including CESU-8 and Java-style null encodings. ([source](https://ftfy.readthedocs.io/en/latest/bad_encodings.html))
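
To show what decoding the UTF-8 variants named above involves, here is a minimal standard-library sketch. It is not ftfy's decoder: `decode_utf8_variant` is a hypothetical name, and the approach (a byte replacement plus Python's `surrogatepass` error handler) is one assumed way to do it.

```python
def decode_utf8_variant(data: bytes) -> str:
    """Hypothetical helper: decode CESU-8 and Java-style null encodings."""
    # Java's modified UTF-8 writes U+0000 as the overlong pair C0 80.
    # The byte 0xC0 never occurs in valid UTF-8, so this replacement is safe.
    data = data.replace(b"\xc0\x80", b"\x00")
    # CESU-8 encodes each UTF-16 surrogate as its own 3-byte sequence;
    # surrogatepass lets those through as lone surrogate code points.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one character.
    return text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


# U+1F600 in CESU-8: surrogates D83D and DE00, each encoded separately.
print(decode_utf8_variant(b"\xed\xa0\xbd\xed\xb8\x80"))
```

Input containing an unpaired surrogate would raise `UnicodeDecodeError` in the final step, so a fuller version would need to decide how to handle that case.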

### User Interface & Experience

- [Encoding Normalizers](https://awesome-repositories.com/f/user-interface-experience/character-encoding-support/encoding-normalizers.md) — Standardizes inconsistent line breaks, character widths, and control characters for consistent display.
- [HTML Entity Processors](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/html-content-processing/html-entity-processors.md) — Converts HTML entity references and backslashed escape sequences into their corresponding Unicode characters.
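
HTML entity decoding of the kind listed above is available in Python's standard library through `html.unescape`. The sketch below wraps it in a hypothetical helper, `unescape_if_not_markup`; the guard that leaves text containing `<` untouched is an assumption of this example, meant to avoid altering input that is real HTML markup.

```python
import html


def unescape_if_not_markup(text: str) -> str:
    """Hypothetical helper: decode entities unless the text looks like HTML."""
    # Assumed guard: a literal "<" suggests real markup, where entities
    # are meaningful and should be left alone.
    if "<" in text:
        return text
    return html.unescape(text)


print(unescape_if_not_markup("caf&eacute; &amp; bar"))
print(unescape_if_not_markup("<p>fish &amp; chips</p>"))
```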

### Part of an Awesome List

- [Text Processing](https://awesome-repositories.com/f/awesome-lists/devtools/text-processing.md) — Fixes broken Unicode text automatically.
