10 repositories
Systems optimized for processing massive volumes of text with predictable memory and time complexity.
Distinct from Text Processing: candidates there focus on audio, collections, or AI inference, whereas this category covers general-purpose, high-performance text processing via regex.
cppformat is a type-safe C++ formatting library that serves as a high-performance alternative to standard C++ input and output streams for converting data into formatted strings. It integrates a compile-time format validator to ensure format specifiers match argument types, preventing runtime crashes. The library includes a positional argument engine that enables the reordering of text arguments for internationalization and localization. It also features a Unicode text formatter to ensure consistent and portable character representation across different operating systems. The project provide
Provides a high-performance alternative to standard C++ I/O streams for converting data into strings.
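The positional-argument reordering mentioned above can be illustrated with Python's built-in `str.format`, whose replacement-field syntax is similar in spirit. This is only an analogy, not the cppformat C++ API; the `TEMPLATES` table, the `"xx"` locale, and the `render` helper are hypothetical names introduced for this sketch.

```python
# Illustration only: Python's str.format, not the cppformat C++ API.
# The locale table and the render() helper are hypothetical.
TEMPLATES = {
    "en": "{0} sent {1} files",
    "xx": "{1} files were sent by {0}",  # made-up locale with a different word order
}


def render(locale: str, user: str, count: int) -> str:
    """Format the same arguments in the order each locale's template asks for."""
    return TEMPLATES[locale].format(user, count)
```

Because each template refers to arguments by index, a translator can reorder them without any change to the calling code.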
Xi Editor is a high-performance text editor core written in Rust. It employs a decoupled architecture that separates core logic from the presentation layer using a JSON-based client-server protocol. The project features a language-agnostic plugin system that communicates with external extensions via JSON messages over pipes. It manages text buffers using a persistent rope data structure to enable efficient editing of very large files. The system supports asynchronous editor workflows by running expensive operations in background threads using data snapshots. This prevents background processi
Ensures high-performance editing of very large files with low latency using a rope data structure.
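To make the rope idea concrete, here is a minimal sketch of a persistent (immutable) rope in Python. It is not Xi Editor's Rust implementation: the names `Rope`, `leaf`, `concat`, `split`, `insert`, and `to_str` are assumptions made for this example, and the tree is never rebalanced, which a production rope would need.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rope:
    """Immutable rope node: either a leaf holding text or a concatenation of two ropes."""
    left: Optional["Rope"] = None
    right: Optional["Rope"] = None
    text: str = ""      # only meaningful for leaves
    length: int = 0     # total number of characters below this node


def leaf(s: str) -> Rope:
    return Rope(text=s, length=len(s))


def concat(a: Rope, b: Rope) -> Rope:
    return Rope(left=a, right=b, length=a.length + b.length)


def split(r: Rope, i: int) -> Tuple[Rope, Rope]:
    """Return (r[:i], r[i:]) as new ropes; r itself is left untouched."""
    if r.left is None:  # leaf node
        return leaf(r.text[:i]), leaf(r.text[i:])
    if i <= r.left.length:
        a, b = split(r.left, i)
        return a, concat(b, r.right)
    a, b = split(r.right, i - r.left.length)
    return concat(r.left, a), b


def insert(r: Rope, i: int, s: str) -> Rope:
    """Insert s at offset i, reusing the untouched subtrees of r."""
    a, b = split(r, i)
    return concat(concat(a, leaf(s)), b)


def to_str(r: Rope) -> str:
    if r.left is None:
        return r.text
    return to_str(r.left) + to_str(r.right)
```

Because nodes are never mutated, an old rope remains a valid snapshot after an edit, which is the property that lets background work read a consistent version of the buffer.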
re2 is a C++ regular expression library designed for high-performance text processing. It is a non-backtracking regex engine that provides linear-time pattern matching, ensuring that execution time remains proportional to the size of the input string regardless of the pattern used. The library supports UTF-8 and Latin-1 text encodings for searching and extracting substrings. It includes capabilities for multi-pattern optimization, allowing multiple regular expressions to be combined into a single representation to scan text for several patterns in one pass. The project covers core regex oper
Ensures predictable execution time and memory usage when processing large volumes of text with regular expressions.
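The linear-time guarantee comes from tracking every pattern position the input could currently be at, instead of trying one path and backtracking. The sketch below shows that idea for a deliberately tiny pattern language (literal characters, `.`, and a postfix `*` on a single character). It is not re2's implementation or API; `parse`, `closure`, and `full_match` are hypothetical helpers, and a pattern that starts with `*` is assumed not to occur.

```python
def parse(pattern: str):
    """Turn a pattern into a list of (char, starred) tokens."""
    tokens = []
    for ch in pattern:
        if ch == "*":
            c, _ = tokens.pop()          # star applies to the previous character
            tokens.append((c, True))
        else:
            tokens.append((ch, False))
    return tokens


def closure(tokens, states):
    """Add every state reachable by skipping starred tokens (zero repetitions)."""
    out = set()
    for s in states:
        while True:
            out.add(s)
            if s < len(tokens) and tokens[s][1]:
                s += 1
            else:
                break
    return out


def full_match(pattern: str, text: str) -> bool:
    """Match the whole text by advancing a set of states once per input character."""
    tokens = parse(pattern)
    states = closure(tokens, {0})
    for ch in text:
        nxt = set()
        for s in states:
            if s < len(tokens):
                c, starred = tokens[s]
                if c == "." or c == ch:
                    nxt.add(s if starred else s + 1)
        states = closure(tokens, nxt)
        if not states:
            return False
    return len(tokens) in states
```

Each input character is examined once against at most one state per token, so the work is bounded by the text length times the pattern length, even for patterns such as `a*a*a*a*b` that make backtracking matchers slow on non-matching input.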
Bloop is an AI code analysis tool and semantic search engine designed for understanding and querying large-scale codebases. It utilizes a high-performance indexing system written in Rust to enable fast symbol and text retrieval across multiple programming languages. The project differentiates itself by using on-device embeddings for semantic code search, allowing users to locate logic based on meaning and intent rather than exact keywords. It combines a language model with a retrieval-augmented generation approach to provide a natural language interface for conversational querying and the gen
Employs high-performance regular expression processing to rapidly filter and isolate specific text segments across large volumes of source code.
Oni2 is a high-performance, extensible text editor and project-based file manager. It functions as a modal code editor, utilizing a keyboard grammar of verbs and motions to navigate and modify source code without a mouse. It also serves as an LSP client, integrating Language Server Protocol servers to provide code completion, symbol navigation, and refactoring. The editor distinguishes itself by acting as a VSCode extension host, allowing it to load and execute language servers and debuggers from the VSCode ecosystem. It provides a programmable environment where custom functionality is implem
Provides a high-performance editing environment optimized for writing and modifying text files quickly.
This project is a pandas data analysis cookbook and a guide to data science in Python. It provides a collection of programmatic recipes and examples for cleaning, manipulating, and analyzing structured data. The project focuses on providing a containerized analysis environment to ensure a consistent workspace and reproducible dependencies when running data processing scripts. It covers a wide range of data science capabilities, including data ingestion from external sources, cleaning of raw data, and exploratory data analysis. The recipes demonstrate how to analyze structured data using techniques such as filtering, grouped aggregation, and text data processing.
Performs high-performance string operations to transform text data for analysis.
Lark is a parsing toolkit for Python used to define grammars and convert raw text into annotated parse trees. It serves as an abstract syntax tree generator and a grammar definition language for specifying language rules through terminals and regular expressions. The library provides two main parsing implementations: an Earley parser capable of handling all context-free languages, including those with ambiguity and left recursion, and a high-performance LALR parser designed for deterministic languages with low memory consumption. Beyond core parsing, the toolkit includes capabilities for modular grammar composition, rule-based tree transformation, and coordinate tracking for source positions. It also supports serializing LALR grammars into standalone parsing modules.
Uses LALR algorithms to process large volumes of text with high efficiency and low memory usage.
This is a collection of classical algorithms and data structures implemented as a header-only C++ library. It provides a suite of tools for general algorithm implementation, including data structure management, graph theory analysis, and string processing. The library is distinguished by its specialized toolkits for cryptographic hashing and encoding, featuring implementations of MD5, SHA-1, and Base64. It also includes advanced capabilities for high-performance string processing via suffix trees and arrays, as well as computational number theory for primality testing and arbitrary-precision
Uses suffix trees and arrays for high-performance pattern matching and text analysis.
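As a rough illustration of suffix-array pattern matching, the Python sketch below builds a suffix array by naive sorting and then binary-searches it for all occurrences of a pattern. It is not taken from the C++ library described above; `build_suffix_array` and `find_all` are hypothetical names, and the naive construction is far slower than the algorithms a performance-oriented library would use.

```python
def build_suffix_array(text: str):
    """Return suffix start offsets ordered by the suffixes they begin (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: str, sa, pattern: str):
    """Return every offset where pattern occurs, using two binary searches over sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

Once the array exists, each query needs only a logarithmic number of prefix comparisons, which is why the structure suits repeated searches over the same large text.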
Chumsky is a parser combinator library used to build high-performance parsers by composing small parsing functions into complex grammars. It provides multiple parsing engines, including recursive descent and precedence climbing implementations for resolving the order of operations in mathematical and logical expressions. The library is distinguished by its zero-copy text parsing, which minimizes memory allocations to increase throughput, and its ability to run without a standard library for use in embedded or resource-constrained environments. It also features an error-recovery parser that identifies malformed input and resumes processing to report multiple syntax errors in a single pass. The framework covers a wide range of capabilities, including context-sensitive state management, support for recursive grammars, and integration of regular expression patterns. It includes tools for parser structure analysis, node inspection, and result caching to support backtracking and left recursion. The library supports custom language development, data format parsing, and programming language tooling.
Enables high-performance text processing optimized for low memory and high throughput in resource-constrained environments.
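Precedence climbing, one of the techniques named in the description above, can be sketched in a few lines of Python. This is a generic illustration of the algorithm, not Chumsky's Rust API; `PREC`, `tokenize`, `parse_expr`, `parse_atom`, and `evaluate` are hypothetical names, and only non-negative integers, `+`, `-`, `*`, and parentheses are handled.

```python
import re

# Binding strength of each binary operator (higher binds tighter).
PREC = {"+": 1, "-": 1, "*": 2}


def tokenize(source: str):
    """Split the input into integer literals, operators, and parentheses."""
    return re.findall(r"\d+|[-+*()]", source)


def parse_atom(tokens, pos):
    """Parse a number or a parenthesized expression; return (value, next position)."""
    tok = tokens[pos]
    if tok == "(":
        value, pos = parse_expr(tokens, pos + 1, 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    return int(tok), pos + 1


def parse_expr(tokens, pos=0, min_prec=1):
    """Consume operators whose precedence is at least min_prec, evaluating as it goes."""
    lhs, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # Raising the threshold by one for the right operand makes operators left-associative.
        rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        if op == "+":
            lhs = lhs + rhs
        elif op == "-":
            lhs = lhs - rhs
        else:
            lhs = lhs * rhs
    return lhs, pos


def evaluate(source: str) -> int:
    value, _ = parse_expr(tokenize(source))
    return value
```

The single `min_prec` parameter replaces the one-function-per-precedence-level layout of plain recursive descent, so adding an operator only means adding an entry to the table.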
This is a Rust regular expression library that provides a finite automata engine for searching and matching text patterns. It functions as a Unicode-compliant text scanner designed to guarantee linear time execution on all inputs to prevent catastrophic backtracking. The engine supports both single and multi-pattern search capabilities, allowing it to scan a piece of text for multiple regular expressions simultaneously. It operates on both strings and raw byte slices to identify matching text segments. The library covers text parsing, string validation, and pattern searching. It includes cap
Provides high-performance text extraction with guaranteed linear time complexity, avoiding the catastrophic slowdowns of backtracking engines.
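The multi-pattern search described above can be approximated in Python by joining several patterns into one alternation with named groups and scanning the text once. This only illustrates the single-pass interface: Python's `re` module is a backtracking engine and gives no linear-time guarantee, and this is not the Rust library's API. The `PATTERNS` table and the `scan` helper are hypothetical.

```python
import re

# Hypothetical set of patterns to look for in one pass.
PATTERNS = {
    "number": r"\d+",
    "word": r"[A-Za-z]+",
}

# One combined expression; each alternative is a named group so matches can be attributed.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS.items()))


def scan(text: str):
    """Return (pattern name, matched text) pairs in the order they appear in text."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]
```

For example, `scan("abc 123")` labels `abc` as a word and `123` as a number without running two separate searches over the text.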