Open-source software for document processing, spreadsheet management, and collaborative office productivity tasks.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings for 18 programming languages, a Model Context Protocol (MCP) server for direct AI agent integration, and a REST API with an OpenAPI schema. The extraction pipeline is plugin-based and configurable, supporting multiple OCR backends (Tesseract, PaddleOCR, EasyOCR, and vision-language models) with quality-based fallback, parallel batch processing with work-stealing, and ONNX Runtime model inference with hardware acceleration for CPU, GPU, or NPU. Beyond core text extraction, Kreuzberg provides a document enrichment pipeline that includes page classification, named entity recognition, summarization, translation, captioning, and PII redaction. It prepares content for retrieval-augmented generation (RAG) workflows by chunking text, generating vector embeddings, and reranking results. The system also supports structured data extraction via LLMs, source code extraction from 306 programming languages, and transcription of audio and video files using Whisper ONNX models. The project is available as a library installable via standard package managers, a CLI tool installable via Homebrew or Docker, and a production-ready deployment option with a Helm chart for Kubernetes.
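The chunking step in the Kreuzberg entry above can be illustrated with a minimal sketch. This is not Kreuzberg's API: `chunk_text`, `max_chars`, and `overlap` are hypothetical names, and fixed-size character windows with overlap are just one simple strategy assumed here for illustration.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not part of any library.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + max_chars >= len(text):
            break
    return chunks
```

The overlap means text cut at a window boundary also appears at the start of the next chunk, which is a common reason to overlap chunks before embedding them. A production chunker would more likely split on sentence or token boundaries.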
This project is an algorithmic trading engine designed for the automated execution of cryptocurrency strategies. It provides a modular execution core that connects to multiple centralized and decentralized exchanges, allowing users to deploy rule-based trading logic across various spot and futures markets. The platform serves as a comprehensive environment for the entire trading lifecycle, from initial strategy development to live market operations. What distinguishes this platform is its integrated suite for quantitative analysis and predictive modeling. It features a robust backtesting engine that simulates strategies against historical market data, alongside an automated hyperparameter optimization framework to refine performance before capital deployment. Users can also integrate machine learning models directly into their strategies, enabling the creation of adaptive systems that respond to real-time market fluctuations. The system is built for consistent, reliable operation through containerized deployment, which ensures that trading logic and data storage remain stable across different host environments. Operational control is facilitated through a command-line interface and a messaging-integrated controller, which allows for remote monitoring, manual trade intervention, and real-time performance tracking via secure communication channels. The software is distributed as a containerized application, supporting standardized orchestration to simplify dependency management and infrastructure setup.
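To make the backtesting idea in the trading engine entry above concrete, here is a small sketch of a long-only moving-average crossover simulated over a price list. It is an assumption-laden toy, not the platform's engine: `sma`, `backtest_crossover`, and their parameters are hypothetical, and fees, slippage, and order sizing are ignored.

```python
def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def backtest_crossover(prices, fast=2, slow=3, cash=1000.0):
    """Hold the asset while the fast SMA is above the slow SMA.

    Returns the final portfolio value, marked at the last price.
    """
    if not prices:
        return cash
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    units = 0.0
    for price, f, s in zip(prices, fast_ma, slow_ma):
        if f is None or s is None:
            continue  # not enough history yet
        if f > s and units == 0.0:
            units, cash = cash / price, 0.0    # enter: spend all cash
        elif f < s and units > 0.0:
            cash, units = units * price, 0.0   # exit: sell everything
    return cash + units * prices[-1]
```

Hyperparameter optimization, in this framing, would amount to searching over values such as `fast` and `slow` and comparing the returned portfolio values on historical data.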
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVINO. It features a specialized ability to translate natural language queries into structured SQL or CSV formats by analyzing database schemas. The framework covers a broad range of capabilities including end-to-end retrieval-augmented generation pipelines, hybrid search engines, and multimodal content processing for PDFs, Office documents, audio, and images. It also incorporates tools for structured function calling, named entity recognition, and text risk classification to detect toxicity and prompt injections. The system integrates with various SQL and vector database backends to manage knowledge collection indexing and document embeddings.
Yazi is a high-performance terminal file manager designed for keyboard-driven navigation and organization of local file systems. Built as an asynchronous application, it utilizes a non-blocking runtime to execute concurrent file operations and interface updates, ensuring the user experience remains responsive even during intensive tasks. The interface is rendered directly into the terminal emulator using escape sequences to maintain minimal memory overhead. The application distinguishes itself through a modular architecture that supports custom functionality via an embedded scripting engine. It leverages specialized terminal protocols to render rich media previews directly within the viewport, offloading resource-heavy tasks like image processing to background worker processes. This design allows for a consistent file management experience across Linux, macOS, and Windows environments. Beyond its core navigation capabilities, the tool provides extensive support for system integration and environment management. Users can deploy the software through various package managers, including support for declarative configuration systems to ensure consistent behavior across different machines.
This project is a comprehensive, curated directory of high-quality libraries, tools, and educational resources for C and C++ development. It serves as an ecosystem discovery index, helping developers navigate the vast landscape of third-party components, frameworks, and technical documentation available for both languages. The collection is distinguished by its focus on high-performance systems programming and technical mastery. It provides deep coverage of specialized domains including SIMD-accelerated data processing, compile-time template metaprogramming, and asynchronous event-driven architectures. The repository also acts as a developer knowledge base, offering access to industry-standard coding guidelines, conference materials, and academic papers that support professional software engineering. Beyond core language features, the directory catalogs a wide array of practical tools for the entire development lifecycle. This includes build systems, static analysis tooling, debuggers, and integrated development environments. It also covers a broad surface of application-level capabilities, ranging from scientific computing and embedded systems development to graphics, networking, and cross-platform library integration.
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized operations across columns. Its capabilities extend to a robust split-apply-combine pattern for grouping, as well as specialized tools for time series analysis that handle calendar-aware offsets, frequency resampling, and time zone management. Beyond core manipulation, the project offers extensive support for data lifecycle management, including ingestion and serialization across diverse file formats and database systems. It provides advanced features for hierarchical multi-index mapping, relational joins, and flexible missing data handling, ensuring that datasets are normalized and ready for statistical or analytical workflows.
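Three of the behaviors named in the Pandas entry above (label-based alignment, split-apply-combine grouping, and frequency resampling) can be shown in a short sketch. The data and variable names are made up for illustration.

```python
import pandas as pd

# Automatic alignment: operands are matched by index label, not position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned_sum = a + b  # labels present in only one operand become NaN

# Split-apply-combine: split rows by region, sum each group, combine results.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [3, 5, 7, 1],
    }
)
units_by_region = sales.groupby("region")["units"].sum()

# Frequency resampling: aggregate daily observations into two-day bins.
daily = pd.Series(
    [1, 2, 3, 4],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
two_day_totals = daily.resample("2D").sum()
```

Here only the labels `y` and `z` exist in both series, so `aligned_sum` holds numeric values for those two and NaN for `w` and `x`.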
Pandoc is a universal document converter that translates content between a wide range of markup and binary formats. It functions by parsing input documents into a unified intermediate abstract syntax tree, which serves as the foundation for consistent manipulation and transformation across diverse output types. The system is distinguished by its modular reader-writer pipeline, which decouples input parsing from output generation to allow for granular control over document structure. Users can programmatically manipulate this intermediate tree through a robust filter system, supporting both external JSON-based interop and an integrated scripting environment for custom transformations. This architecture enables complex document processing tasks, such as automated scholarly publishing, where citations, bibliographies, and mathematical expressions are managed through a specialized toolchain. Beyond core conversion, the project provides a comprehensive templating engine that merges structured document data with customizable templates to produce final outputs with specific styling and layout requirements. It also offers a network-based server mode for API-driven and batch processing, allowing the tool to be integrated into automated technical content pipelines. The software is primarily operated via a command-line interface, which provides extensive configuration options for managing input formats, citation styles, and document metadata.
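The JSON-based filter mechanism in the Pandoc entry above can be sketched without any helper library. Pandoc serializes its syntax tree as nested JSON in which each element carries a type tag `t` and, usually, contents `c`. The functions below (`walk`, `upper_str`, `run_filter`) are hypothetical names for this illustration, and the transformation (uppercasing every `Str` element) is deliberately trivial.

```python
def walk(node, action):
    """Recursively rebuild a JSON tree, letting `action` replace elements.

    `action` receives each dict that has a "t" tag and returns either a
    replacement element or None to keep descending into the original.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        replaced = action(node) if "t" in node else None
        if replaced is not None:
            return replaced
        return {key: walk(value, action) for key, value in node.items()}
    return node  # strings, numbers, booleans, None


def upper_str(elem):
    """Uppercase plain-text Str elements; leave everything else alone."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return None


def run_filter(doc):
    """Apply the transformation to a whole document tree."""
    return walk(doc, upper_str)
```

To use this as an actual filter, a script would read the document with `json.load(sys.stdin)`, write the result with `json.dump(..., sys.stdout)`, and be passed to pandoc through its `--filter` option. Elements such as inline code are not `Str` nodes, so their text is left untouched.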
Typst is a programmable, markup-based typesetting engine designed for professional document creation. It functions as a scriptable publishing toolchain that transforms plain text and code into complex, paginated outputs. By utilizing a high-performance compiler, the system automates document assembly, mathematical rendering, and dynamic content generation, providing a unified workflow for academic and technical authoring. The engine distinguishes itself through a declarative layout framework that uses cascading rules to manage document structure and visual styling. Unlike traditional systems, it employs an incremental layout engine that performs multiple passes to resolve cross-references, counters, and dynamic content placement. This is supported by a sandboxed functional scripting runtime, which allows users to define custom logic for data processing and layout manipulation, ensuring that document state remains consistent throughout the compilation process. The system provides a comprehensive suite of tools for managing document elements, including automated bibliography generation, structured table creation, and hierarchical sectioning. It supports precise control over page geometry and typography, while its introspection capabilities allow for advanced querying of document state and element locations. These features are complemented by a robust set of foundational data management primitives, enabling users to handle complex collections, numeric data, and time-based logic within their documents. The project provides a command-line interface for compiling source files into portable formats like PDF, with built-in support for accessibility standards. Detailed documentation, including syntax references and architectural overviews, is available to guide users through the installation and implementation of the typesetting environment.