30 open-source projects similar to microsoft/mm-react, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best MM REACT alternative.
Emu Series: Generative Multimodal Models from BAAI
JARVIS is a system for large language model task orchestration, deployment management, and automation benchmarking. It utilizes a task orchestrator to decompose complex requests into actionable steps and coordinates various expert models to synthesize final responses. The project includes an AI model deployment manager to handle the local deployment of expert models across different hardware scales. It further provides an AI workflow API consisting of web endpoints used to trigger automated task workflows and retrieve results from model selection stages. The framework incorporates an automat
AudioGPT is an LLM-driven audio framework and processing suite that uses large language models to orchestrate neural audio pipelines. It functions as a multimodal audio generator and processing system, integrating a collection of pretrained models to handle speech synthesis, sound generation, and audio manipulation. The system is distinguished by its ability to generate audio from diverse inputs, including text and images, and its capacity to produce synchronized talking head videos. It also operates as a neural speech translator, converting spoken language between different tongues while pre
Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"
Agent-S is a multimodal AI agent and LLM desktop automation framework designed to control operating systems through graphical user interface interactions. It functions as a computer use interface, utilizing vision-language grounding to translate natural language goals into precise screen coordinates and system actions. The project differentiates itself by combining structured accessibility tree inspection with vision-based element localization. It manages cross-application workflows by mapping conceptual descriptions to physical pixels and simulating low-level keyboard and mouse events to mov
Jaaz is a self-hosted AI design suite and multimodal workspace used for generating and editing images and videos. It functions as a design workspace where users can produce visual content and assets through a combination of local and cloud-based AI models. The project features a hybrid model orchestrator that routes requests between local model runners and remote APIs to balance data privacy with processing performance. It utilizes an infinite canvas collaborative tool for organizing storyboards and assets, and includes an image prompt optimizer to translate rough ideas into detailed generati
Typebot is a visual chatbot builder and conversational platform designed for lead generation and data collection. It provides a drag-and-drop workflow designer that converts visual nodes into structured conversation logic, allowing users to build interactive forms and chatbots with conditional routing. The platform is designed as a self-hosted conversational infrastructure, enabling the deployment of the entire application stack on private servers using Docker and PostgreSQL. This allows for complete control over data storage and server maintenance. The system integrates with external servic
This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing
This project is a comprehensive machine learning interview guide and technical study resource designed for individuals preparing for machine learning and AI engineering roles. It provides a collection of materials and practice problems covering core algorithms, theoretical fundamentals, and the implementation of neural network architectures. The resource serves as a technical reference for generative AI development, focusing on the design and optimization of large language models and diffusion systems. It includes frameworks for system design, covering the architecture of production machine l
EverydayWechat is a WeChat automation bot designed to automate messages, scheduled reminders, and automatic replies within the WeChat messaging ecosystem. It functions as a multi-purpose system that combines the roles of a scheduled message sender, an auto-reply bot, and a chatbot assistant. The project enables the delivery of customized recurring messages to specific users and group chats on a fixed timetable. It also provides automated individual replies based on preconfigured rules and group chat assistance that fetches real-time data for weather, logistics, and calendars. The system inco
Integrated AI environment in the terminal. Build, test and instruct agents.
Official repository for the CVPR 2024 paper "MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild".
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
This is the official repository for "CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors" (ACL 2023).
Chat with and ask questions about your own data. An accelerator for quickly uploading your own enterprise data and using OpenAI services to chat with that data and ask questions about it.
DB-GPT is an AI-driven database management system that uses agentic reasoning to execute data tasks. It converts natural language prompts into executable database queries and combines structured database records with unstructured knowledge bases to provide grounded analysis. The system orchestrates multi-step reasoning chains that integrate database queries, custom scripts, and external tool calls. It allows for the packaging of domain knowledge into reusable analysis skills and executes generated code within sandboxed environments for system safety. The platform covers data orchestration ac
CLIPort: What and Where Pathways for Robotic Manipulation. Mohit Shridhar, Lucas Manuelli, Dieter Fox. CoRL 2021.
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
This is the GitHub repository for the paper "Self-Improving for Zero-Shot Named Entity Recognition with Large Language Models", to appear at the NAACL 2024 main conference.
TaskMatrix is a multimodal AI chat interface and visual task orchestrator. It combines language models with visual recognition to enable the exchange, analysis, and modification of images within a conversational environment. The system coordinates multiple foundation models through orchestration pipelines that chain language, detection, and segmentation models. This allows for complex visual operations, such as using text instructions to guide image masking and executing modular inpainting workflows to edit specific image regions. The project includes a computer vision toolset for object det
This project provides a transformer-based object detection model that treats the task as a direct set prediction problem. It implements a vision system capable of predicting bounding boxes and class labels for objects within an image, as well as frameworks for instance and panoptic segmentation. The architecture utilizes a transformer encoder and decoder to perform end-to-end set prediction, employing a Hungarian matcher to assign predicted boxes to ground truth objects. It incorporates a convolutional backbone for feature extraction and a system of learnable object queries to probe image loc
This repo contains code for Unified-IO 2, including code to run a demo, train models, and run inference. The codebase is modified from T5X.
An implementation of the ACL 2023 paper "Learning In-context Learning for Named Entity Recognition".
This code implements a chatbot using an LLM chat model API and LangChain.