4 Repos
Techniques for handling and concatenating real-time token streams from language models to optimize latency.
Distinct from Stream Processing: Candidates focus on general data engineering or network buffers, not the specific application of streaming LLM outputs.
Explore 4 awesome GitHub repositories matching artificial intelligence & ml · LLM Stream Processing. Refine with filters or upvote what's useful.
Eino is an AI agent development kit and LLM application framework designed for building autonomous agents and orchestrating complex language model workflows. It serves as a multi-agent orchestration engine and workflow orchestrator, providing a graph-based execution model to route data between models, tools, and retrievers. The framework distinguishes itself through a robust set of multi-agent coordination patterns, including supervisor-led management, sequential flows, and autonomous reasoning loops like ReAct. It features advanced agent execution controls such as active turn preemption, che
Handles real-time model outputs through stream processing and concatenation to minimize response latency.
ollama-python is a Python client for interacting with large language models. It provides an interface for sending prompts to receive text and chat completions, as well as a dedicated client for generating numerical vector embeddings from text. The project includes a wrapper that emulates the OpenAI API, allowing applications built for that standard to interact with local models. It also provides a non-blocking asynchronous client for executing concurrent requests. The library covers the full model lifecycle, including the ability to pull, create, list, and delete models within a local enviro
Processes model responses piece by piece using Python generators for real-time text streaming.
This project is a long context inference engine and optimizer designed to process infinite text streams using large language models without memory growth or performance degradation. It serves as a system for maintaining constant memory usage during the generation of text from arbitrarily long input sequences. The implementation utilizes a rolling key-value cache manager and attention sink mechanisms to stabilize the attention process during continuous stream processing. By retaining initial tokens and employing a sliding window of key-value pairs, the system enables constant-time inference an
Handles and concatenates real-time token streams from language models to optimize latency and throughput.
Iggy is a distributed message streaming platform and multi-protocol message broker that functions as a persistent distributed log store. It provides infrastructure for publishing and consuming binary messages using an append-only log, ensuring high availability and data consistency across nodes through Viewstamped Replication. The platform is distinguished by its specialized LLM streaming infrastructure, which uses a server protocol to connect large language models to streaming data and system controls. This includes standardized protocols for context management and data bridging via HTTP or
Exposes a specialized server protocol that enables large language models to control message streaming operations.