30 open-source projects similar to chenfei-wu/taskmatrix, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best TaskMatrix alternative.
Visual-ChatGPT is a visual orchestration framework and multimodal AI pipeline designed to coordinate large language models with visual foundation models. It functions as an integration layer that enables the exchange of text and images between different AI models to automate image analysis and editing tasks without requiring additional model training. The system differentiates itself through model-chain orchestration and prompt-based task dispatching, allowing natural language instructions to trigger specific vision models or tools. It utilizes coordinate-based region mapping and iterative ma
OpenVINO is an AI inference engine and model serving platform designed to execute optimized deep learning models across CPUs, GPUs, and NPUs through a unified API. It includes a model optimization toolkit for converting, quantizing, and compressing models from various frameworks, alongside a specialized generative AI runtime for large language models. The project distinguishes itself through a plugin-based hardware acceleration layer that maps neural network operations to vendor-specific drivers. It features advanced execution mechanisms such as continuous batching, speculative decoding, and
VisualGLM-6B is a bilingual multimodal large language model and vision-language model designed for conversational tasks and visual understanding. It functions as a bilingual AI model capable of processing and generating responses in both Chinese and English. The system is a quantized large language model supporting 4-bit and 8-bit precision to reduce memory usage and hardware requirements during local deployment. It is also a parameter-efficient fine-tuning model, allowing for weight adjustments to adapt the system to specific downstream tasks without full retraining. The project covers mult
Koog is an LLM agent framework used to build autonomous entities that execute tool-based workflows. It utilizes a graph-based workflow engine to define agent behaviors and decision paths as a directed graph of nodes and edges. The framework distinguishes itself through a model provider orchestrator that enables dynamic switching, load balancing, and automatic fallbacks between different AI backends. It implements the Model Context Protocol to connect agents to remote tool servers and features a RAG memory system using vector embeddings to maintain long-term conversation context. The project
TaskMatrix is a visual language model orchestration framework and modular visual pipeline designed to coordinate disparate foundation models. It functions as a multi-model workflow coordinator that sequences visual and textual models through logic paths to handle image processing tasks without requiring additional training. The system integrates large language models with visual foundation models to enable the exchange of image data during interactive chat sessions. It utilizes template-based orchestration to chain specialized models together for complex visual tasks. The framework supports
OmniGen2 is a unified image generation model and multimodal large language model designed to handle text-to-image generation, image-to-image tasks, and image editing within a single framework. It functions as a causal language model visual engine capable of generating and editing images based on combined text and visual inputs. The system features in-context visual composition and subject-driven generation, allowing it to extract subjects from reference images and place them into new scenes. It also supports instruction-based image editing, where specific objects or styles are modified via na
imaginAIry is a system for generating and refining images and videos using diffusion models. It operates as a web-based server that triggers generation requests through standard API calls, allowing for the creation of visuals and video sequences from text prompts or existing files. The project provides a suite for AI image editing and upscaling, enabling the modification of visuals through natural language instructions and super-resolution tools to increase detail and image size. The system includes capabilities for structural image control using depth maps, edge maps, and body poses to main
This project provides a deep residual network framework and pre-trained PyTorch models designed for high-accuracy image recognition. It implements a neural network architecture that utilizes skip connections to enable the training of very deep models without gradient degradation. The system is designed for computer vision tasks, including image classification, object detection, and visual data segmentation. It includes weights trained on ImageNet to support transfer learning and the fine-tuning of models on custom image datasets. The architectural design focuses on residual learning blocks,
StableStudio is a generative AI frontend and image interface designed for creating and editing visual content. It provides a web-based graphical interface that connects to generative AI models via API connections to facilitate image synthesis and modification. The project functions as a pluggable AI backend manager, using a modular system to standardize diverse AI provider APIs into a unified format. This architecture allows users to swap between different generative AI backends and providers to compare outputs and optimize production. The system manages long-running generation tasks through
This project provides a transformer-based object detection model that treats the task as a direct set prediction problem. It implements a vision system capable of predicting bounding boxes and class labels for objects within an image, as well as frameworks for instance and panoptic segmentation. The architecture utilizes a transformer encoder and decoder to perform end-to-end set prediction, employing a Hungarian matcher to assign predicted boxes to ground truth objects. It incorporates a convolutional backbone for feature extraction and a system of learnable object queries to probe image loc
ComfyUI is a modular generative AI workflow orchestrator and node-based GUI for designing and executing complex diffusion model pipelines. It functions as both a visual interface for building generative logic graphs and a programmable backend API that exposes diffusion model operations for external integration. The system distinguishes itself through a graph-based execution model that supports differential workflow execution, re-running only modified nodes to reduce computation. It features dynamic model offloading to manage memory between system RAM and GPU VRAM and utilizes metadata-embedde
This project is an AI model API gateway and proxy server designed to provide a unified interface for interacting with diverse artificial intelligence service providers. It functions as a centralized middleware platform that routes, load balances, and translates API requests across multiple models, enabling developers to access text, image, audio, and video generation capabilities through a single, standardized integration. The gateway distinguishes itself through comprehensive administrative and financial controls, including event-driven usage accounting, real-time token consumption tracking,
Vercel is a cloud platform for building, deploying, and scaling web applications. It provides a unified infrastructure that automates the build process by detecting project frameworks and distributing static and dynamic content through a global content delivery network. The platform executes application logic using serverless functions that scale automatically based on real-time traffic demand. The platform distinguishes itself through a centralized AI gateway that proxies requests to multiple model providers, enabling standardized authentication, observability, and cost tracking. It supports
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
Corenet is a deep learning training framework and computer vision model library designed for developing neural networks across vision, text, and audio modalities. It functions as a distributed training orchestrator for scaling workloads across multiple compute nodes and provides a multimodal data pipeline for processing image, text, and video data. The project includes a model conversion toolkit for transforming weights and architectures between different machine learning frameworks. It also provides tools for optimizing model performance on Apple Silicon and reducing response latency in gene
Enchanted is an iOS and macOS app for chatting with private, self-hosted language models such as Llama2, Mistral, or Vicuna using Ollama.
Efficient-AI-Backbones is a lightweight neural network library and computer vision model zoo. It provides a collection of optimized deep learning backbones designed to minimize computational overhead and memory usage for artificial intelligence tasks. The project implements specialized architectures such as GhostNet and MLP to reduce processing requirements. It features a modular backbone design and the distribution of pretrained weights to accelerate the development and deployment of vision models. The library covers efficient neural network design and edge device AI optimization. Its capab
This project is a library of pretrained computer vision architectures and backbones for image classification and feature extraction. It serves as a comprehensive model zoo and collection of standardized image encoders, including ResNet, Vision Transformers, and EfficientNet, for use in visual analysis and as backbones for object detection and image segmentation. The library provides a framework for distributed training and evaluation of image models using advanced data augmentation and optimization scripts. It includes a dedicated toolset for converting trained PyTorch vision models into the
CogVLM is a multimodal large language model designed to integrate visual and textual data for reasoning about images and generating natural language. It functions as a visual question answering system that analyzes image content to provide detailed descriptions or answer specific questions. The project includes a visual grounding model capable of mapping text descriptions to precise bounding box coordinates within an image. It also features a vision-based automation agent that analyzes screen captures to generate execution plans and interaction coordinates for software interfaces. The system
Qwen3-VL is a multimodal vision-language model designed to process and reason across images, videos, and text. It functions as a computer vision framework capable of identifying objects, extracting structured data from documents, and interpreting spatial elements within visual media. The system operates as an automated user interface interaction agent, interpreting screen data to navigate software and mobile applications. By utilizing a unified transformer architecture, it performs complex visual reasoning to execute user-defined tasks without manual input. Beyond interface navigation, the m
InternVL is a vision-language model framework that fuses a visual encoder with a large language model to translate image features into textual tokens for reasoning. It provides a system for multimodal inference and dialogue, enabling the processing of images and text to answer questions or generate descriptions. The project is distinguished by its high-resolution image processing, which uses dynamic tiling to maintain detail for images up to 4K resolution, and its chain-of-thought visual reasoning for solving complex mathematical and spatial problems. It also supports temporal frame sampling
Star-vector is a suite of vision-language systems designed to generate scalable vector graphics from text or image inputs. It utilizes a vision-language foundation model to treat the creation of visual elements as a structured code generation task. The system employs a multimodal architecture that maps visual patterns and shapes to corresponding structural elements in a vector code sequence. It incorporates a render-loop feedback mechanism and reinforcement learning to iteratively refine the fidelity of the generated graphics by comparing rendered outputs against target images. The project c
Enchanted is a privacy-focused, cross-platform chat frontend for interacting with self-hosted large language models on iOS and macOS. It serves as a native client for communicating with private model servers, specifically providing integration for the Ollama API. The application supports multimodal interactions, allowing users to combine text, image attachments, and voice prompts. It provides tools for local AI model management, including the ability to define persistent system prompts and switch between different models for specific tasks. The interface includes capabilities for rendering m
This project is a comprehensive library of state-of-the-art neural network architectures designed for image classification and feature extraction. It provides a complete deep learning training framework that supports distributed execution, allowing users to build, train, and fine-tune vision models using optimized schedulers and pre-configured training recipes. The library distinguishes itself through a modular backbone architecture that treats neural networks as decoupled feature extractors, enabling the retrieval of multi-scale outputs for downstream tasks like object detection and segmenta
Coze Studio is a development platform for building intelligent agents and conversational applications. It provides a visual environment where users construct agents by linking workflows, knowledge bases, and custom prompts to automate complex tasks. The system functions as a central hub for managing AI model services, allowing developers to connect various providers to serve as the intelligence layer for their applications. The platform distinguishes itself through a node-based workflow orchestrator that enables the design of automated logic sequences on a visual canvas. It includes a modular
This project is a comprehensive computer vision library for the PyTorch ecosystem, providing a standardized collection of neural network architectures, datasets, and high-performance transformation utilities. It serves as a foundational framework for building, training, and deploying deep learning models, offering a centralized model registry that allows developers to instantiate architectures with pre-trained weights for tasks such as image classification, object detection, and semantic segmentation. The library distinguishes itself through its modular approach to data and compute management
This project is a monocular depth estimation model and computer vision framework designed to calculate absolute distance and scale from single images. It functions as a metric depth estimator that generates high-resolution depth maps without requiring camera-specific focal length metadata. The system utilizes a vision transformer architecture for feature extraction and zero-shot inference to produce metric-scale depth predictions. It includes specialized components for sharp-boundary depth refinement to maintain high-frequency edge details and prevent blurriness at object boundaries. The rep
This project is a machine learning educational repository providing a collection of implementations and guides for machine learning and deep learning algorithms. It serves as a deep learning model library and a reference for training workflows, covering foundational machine learning, convolutional, recurrent, and transformer architectures. The collection includes a generative adversarial network suite for synthesizing realistic images and performing image-to-image translation. It also functions as a computer vision implementation guide for object detection and semantic segmentation, alongside
DeepSeek-VL is a multimodal large language model and image-to-text reasoning engine. It functions as a vision-language model and visual question answering system that integrates visual perception with linguistic reasoning to understand and describe images. The project enables multimodal image understanding and document image analysis, specifically processing screenshots of web pages and technical diagrams. It provides capabilities for visual conversational AI, allowing users to interact with visual data to extract insights and perform complex reasoning across different types of visual informa