# Computer vision and multimodal

> Search results for `Computer vision and multimodal` on awesome-repositories.com. 115 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/computer-vision-and-multimodal

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/computer-vision-and-multimodal).**

## Results

- [jbhuang0604/awesome-computer-vision](https://awesome-repositories.com/repository/jbhuang0604-awesome-computer-vision.md) (23,074 ⭐) — This project is a comprehensive, community-driven repository that serves as a centralized catalog for computer vision research and development. It functions as a structured index of academic papers, open-source software libraries, public datasets, and educational tutorials, providing a navigation point for the complex landscape of modern vision technology.

The repository distinguishes itself through a taxonomy-based indexing system that maps the relationships between foundational research, influential academic figures, and their corresponding software implementations. By utilizing a lightweig
- [othersideai/self-operating-computer](https://awesome-repositories.com/repository/othersideai-self-operating-computer.md) (10,153 ⭐) — This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces.

The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates.

The framework supports voice-controlled computing
- [567-labs/instructor](https://awesome-repositories.com/repository/567-labs-instructor.md) (13,176 ⭐) — Instructor is a framework designed for structured data extraction, validation, and language model integration. It functions as a library that transforms unstructured text into validated, type-safe objects by leveraging schema definitions and model-specific tool-calling capabilities. By acting as a validation middleware, the project ensures that language model outputs strictly conform to defined data structures.

The library distinguishes itself through a robust validation-based retry loop that automatically re-submits failed responses with error feedback to iteratively correct schema complianc
- [bytedance/ui-tars-desktop](https://awesome-repositories.com/repository/bytedance-ui-tars-desktop.md) (36,445 ⭐) — UI-TARS-desktop is a cross-platform desktop application designed to automate software interface interactions. It functions as a local agent environment that interprets graphical user interfaces through multimodal visual-language model reasoning, allowing it to navigate and manipulate software by simulating human-like mouse and keyboard inputs.

The platform distinguishes itself by executing all visual recognition and decision-making logic directly on the host machine. This local inference model ensures that screen data and sensitive information remain private, as no processing is offloaded to
- [pytorch/vision](https://awesome-repositories.com/repository/pytorch-vision.md) (17,743 ⭐) — This project is a comprehensive computer vision library for the PyTorch ecosystem, providing a standardized collection of neural network architectures, datasets, and high-performance transformation utilities. It serves as a foundational framework for building, training, and deploying deep learning models, offering a centralized model registry that allows developers to instantiate architectures with pre-trained weights for tasks such as image classification, object detection, and semantic segmentation.

The library distinguishes itself through its modular approach to data and compute management
- [bjarten/computer-vision-nd](https://awesome-repositories.com/repository/bjarten-computer-vision-nd.md) (134 ⭐) — Projects and exercises for the Udacity Computer Vision Nanodegree
- [keras-team/keras](https://awesome-repositories.com/repository/keras-team-keras.md) (64,094 ⭐) — Keras is a high-level deep learning framework designed for constructing and training neural networks through the composition of modular, functional layers. It serves as a comprehensive modeling toolkit that provides standardized procedures for defining, evaluating, and deploying complex architectures. By utilizing a directed acyclic graph approach, the framework allows users to build intricate models with multiple inputs, outputs, and shared layers, ensuring consistent numerical execution through functional state management.

The project distinguishes itself as a multi-backend machine learning
- [imclumsypanda/langchain-chatglm](https://awesome-repositories.com/repository/imclumsypanda-langchain-chatglm.md) (38,183 ⭐) — This project is a LangChain-based framework for building retrieval-augmented generation systems, autonomous agents, and multimodal chatbots. It functions as an open-source orchestrator that connects local inference engines and online APIs to manage various large language model deployments.

The system distinguishes itself by providing specialized interfaces for local knowledge bases, allowing the loading and vectorization of private documents to create context-aware assistants. It also supports multimodal capabilities, enabling the processing of both text and image inputs through vision-capabl
- [anuragreddygv323/computer-vision-projects](https://awesome-repositories.com/repository/anuragreddygv323-computer-vision-projects.md) (107 ⭐) — Computer Vision Basics - Building Your Own Custom Object Detector - Content-Based Image Retrieval - Image Classification and Machine Learning - Face Recognition - Automatic License Plate Recognition - Hadoop + Big Data - Deep Learning - Raspberry Pi Projects - Image Descriptors - Computer Vision…
- [zhayujie/chatgpt-on-wechat](https://awesome-repositories.com/repository/zhayujie-chatgpt-on-wechat.md) (45,353 ⭐) — This project is an autonomous agent framework designed to integrate large language models with popular messaging platforms. It functions as a middleware platform that enables automated, multimodal interactions by decomposing complex user goals into sequential plans, executing them through external tools, and maintaining persistent context across sessions.

The framework distinguishes itself through a modular skill architecture and a hybrid memory system. Users can extend system capabilities by installing custom logic modules from community hubs or generating them through natural language. The
- [abetlen/llama-cpp-python](https://awesome-repositories.com/repository/abetlen-llama-cpp-python.md) (9,993 ⭐) — llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs.

The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory acro
- [jrobchin/computer-vision-basics-with-python-keras-and-opencv](https://awesome-repositories.com/repository/jrobchin-computer-vision-basics-with-python-keras-and-opencv.md) (435 ⭐) — Full tutorial of computer vision and machine learning basics with OpenCV and Keras in Python.
- [axolotl-ai-cloud/axolotl](https://awesome-repositories.com/repository/axolotl-ai-cloud-axolotl.md) (12,059 ⭐) — Axolotl is a configuration-driven framework designed for the fine-tuning, evaluation, and quantization of large language models. It functions as a comprehensive orchestrator for distributed training, enabling users to manage complex workflows across multi-node and multi-GPU environments. By utilizing structured configuration files, the platform streamlines the setup of training parameters, dataset paths, and hardware distribution strategies.

The project distinguishes itself through its support for diverse training methodologies, including full-parameter tuning, parameter-efficient adaptation,
- [charmve/computer-vision-in-action](https://awesome-repositories.com/repository/charmve-computer-vision-in-action.md) (2,851 ⭐) — A computer vision closed-loop learning platform where code can be run interactively online. A closed learning loop: the Chinese e-book "Computer Vision in Action: Algorithms and Applications", source code, and a reader community (continuously updated ...) 📘 Online e-book: https://charmve.github.io/computer-vision-in-action/ 👇 Project homepage
- [bytedance/ui-tars](https://awesome-repositories.com/repository/bytedance-ui-tars.md) (9,622 ⭐) — UI-TARS is an LLM GUI automation framework and multimodal action grounding system. It functions as a GUI agent orchestrator and cross-platform device controller that uses large language models to interpret graphical interfaces and execute actions across desktop and mobile operating systems.

The system translates model-generated coordinates into precise screen positions to interact with visual user interface elements. It employs a multimodal approach to interpret screen layouts and decomposes complex goals into multi-step trajectories through reasoning and error correction.

The project provid
- [livekit/livekit](https://awesome-repositories.com/repository/livekit-livekit.md) (19,358 ⭐) — LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections.

The platform distinguishes itself through it
- [agno-agi/agno](https://awesome-repositories.com/repository/agno-agi-agno.md) (40,717 ⭐) — Agno is an agent operating system designed to manage the lifecycle, tool execution, and persistent state of autonomous agents across distributed infrastructure. It provides a unified runtime environment that wraps diverse agent frameworks into a consistent, interoperable protocol, allowing developers to build and deploy complex multi-agent systems that coordinate tasks and delegate sub-processes.

The platform distinguishes itself through a robust governance and orchestration layer that includes human-in-the-loop approval gates, role-based access control, and a centralized API gateway. It feat
- [packtpublishing/opencv-computer-vision-projects-with-python](https://awesome-repositories.com/repository/packtpublishing-opencv-computer-vision-projects-with-python.md) (128 ⭐) — OpenCV-Computer-Vision-Projects-with-Python
- [codexu/note-gen](https://awesome-repositories.com/repository/codexu-note-gen.md) (12,173 ⭐) — Note-gen is an artificial intelligence-assisted note-taking application and knowledge management tool designed for local-first data ownership. It functions as a workspace that leverages language models to organize, summarize, and synthesize personal notes into structured documents while maintaining offline accessibility.

The platform distinguishes itself through a multimodal workflow orchestrator that chains sequences of tasks to process text, images, and external data. By integrating vision-language models, it extracts information from visual inputs like screenshots and documents, converting
- [the-ai-summer/gans-in-computer-vision](https://awesome-repositories.com/repository/the-ai-summer-gans-in-computer-vision.md) (78 ⭐) — GANs in computer vision AI Summer article series
- [anthropics/claude-code](https://awesome-repositories.com/repository/anthropics-claude-code.md) (132,728 ⭐) — Anthropic's terminal-native AI coding agent.
- [google-research/google-research](https://awesome-repositories.com/repository/google-research-google-research.md) (38,139 ⭐) — This repository serves as a comprehensive research platform and toolkit for advancing machine learning, quantum computing, and large-scale scientific data analysis. It provides foundational frameworks for developing complex algorithmic systems, offering the necessary infrastructure for distributed training, computational graph execution, and high-performance model development.

The project distinguishes itself by integrating specialized research domains with robust, privacy-preserving methodologies. It supports diverse scientific discovery through tools for quantum simulation, physics-informed
- [nerox8664/awesome-computer-vision-models](https://awesome-repositories.com/repository/nerox8664-awesome-computer-vision-models.md) (543 ⭐) — A list of popular deep learning models related to classification, segmentation and detection problems
- [accumulatemore/cv](https://awesome-repositories.com/repository/accumulatemore-cv.md) (21,907 ⭐) — This project is a comprehensive deep learning framework and educational platform designed for constructing, training, and evaluating neural network architectures. It provides a modular environment for building models through tensor operations and automatic differentiation, supporting a wide range of tasks from image classification and object detection to sequential data processing.

Beyond its core technical capabilities, the project distinguishes itself by integrating professional career development resources directly into its learning ecosystem. It offers structured guidance, resume reviews,
- [scutan90/deeplearning-500-questions](https://awesome-repositories.com/repository/scutan90-deeplearning-500-questions.md) (57,436 ⭐) — This project is a comprehensive study guide and knowledge base for deep learning, machine learning, and the associated mathematics required for artificial intelligence. It functions as a curated collection of technical questions and answers designed to help users study fundamental theories and practical applications.

The repository serves as a technical interview preparation resource by aggregating industry-standard questions and core knowledge points. It provides a structured reference for reviewing neural network architectures and specific techniques used in computer vision, such as object
- [berriai/litellm](https://awesome-repositories.com/repository/berriai-litellm.md) (50,579 ⭐) — LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments.

The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
- [osilly/vision-r1](https://awesome-repositories.com/repository/osilly-vision-r1.md) (1,475 ⭐) — The official repo for "Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models".
- [kmario23/deep-learning-drizzle](https://awesome-repositories.com/repository/kmario23-deep-learning-drizzle.md) (12,819 ⭐) — This project is a curated directory of educational roadmaps and resource hubs for artificial intelligence, deep learning, and machine learning. It serves as a centralized collection of academic lectures, instructional videos, and courses designed to provide structured learning paths for AI practitioners.

The directory covers specialized academic curricula across several core domains, including computer vision, natural language processing, and reinforcement learning. It also provides access to niche educational content such as medical imaging, Bayesian deep learning, and probabilistic graphica
- [extreme-assistant/survey-computer-vision](https://awesome-repositories.com/repository/extreme-assistant-survey-computer-vision.md) (460 ⭐) — Computer vision survey papers from 2020-2021, organized by research direction
- [fastai/fastai](https://awesome-repositories.com/repository/fastai-fastai.md) (27,862 ⭐) — Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models.

The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
- [d2l-ai/d2l-en](https://awesome-repositories.com/repository/d2l-ai-d2l-en.md) (29,001 ⭐) — This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation.

The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
- [amusi/daily-paper-computer-vision](https://awesome-repositories.com/repository/amusi-daily-paper-computer-vision.md) (6,768 ⭐) — A daily-updated record of curated papers on computer vision, deep learning, and machine learning
- [huggingface/smol-course](https://awesome-repositories.com/repository/huggingface-smol-course.md) (6,661 ⭐) — This project is an educational program focused on the alignment of small language models. It provides a technical curriculum and a series of courses designed to teach how to align models with human preferences and behaviors.

The material covers the implementation of preference optimization algorithms and the adaptation of vision-language models to process both text and image data simultaneously. It also includes instructional guides on synthetic data generation to improve model performance in specialized domains.

The curriculum encompasses supervised fine-tuning workflows, the use of chat te
- [ashishpatel26/500-ai-machine-learning-deep-learning-computer-vision-nlp-projects-with-code](https://awesome-repositories.com/repository/ashishpatel26-500-ai-machine-learning-deep-learning-computer-vision-nlp-projects.md) (34,579 ⭐) — This repository serves as a comprehensive, curated collection of open-source implementations focused on artificial intelligence, machine learning, and computer vision. It functions as a centralized knowledge base and technical resource index, providing students and professional engineers with a structured directory of code examples for educational and practical reference.

The project distinguishes itself through a community-driven curation model, relying on manual updates and contributions to maintain a relevant and expansive archive. By organizing these resources into categorized lists, the
- [yichuan-w/leann](https://awesome-repositories.com/repository/yichuan-w-leann.md) (11,985 ⭐) — LEANN is a framework for local retrieval augmented generation and vector indexing. It functions as a system for building local knowledge bases and source code search engines that combine large language models with retrieved private data to generate context-aware responses.

The project distinguishes itself through a vision-model based document layout extractor for parsing complex PDF figures and diagrams, and a source code search engine that employs structure-aware chunking to preserve function and class boundaries. It also implements the Model Context Protocol to integrate real-time data sour
- [ytongxie/medical-vision-and-language-tasks-and-methodologies-a-survey](https://awesome-repositories.com/repository/ytongxie-medical-vision-and-language-tasks-and-methodologies-a-survey.md) (31 ⭐) — 🔥🔥 This is a collection of medical vision-language tasks and methodologies 🔥🔥
- [candacelax/bias-in-vision-and-language](https://awesome-repositories.com/repository/candacelax-bias-in-vision-and-language.md) (9 ⭐) — This is the repo for our paper Measuring Social Biases in Grounded Vision and Language Embeddings. We implement a version of WEAT/SEAT for visually grounded word embeddings. This is code borrowed and modified from this repo. Authors: Candace Ross, Boris Katz, Andrei Barbu
- [allegroai/clearml](https://awesome-repositories.com/repository/allegroai-clearml.md) (6,733 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the entire machine learning lifecycle. It functions as an experiment tracking tool, a data versioning system, and a pipeline orchestrator, while providing infrastructure for GPU cluster management and model serving.

The platform is distinguished by its ability to handle hybrid-cloud compute scheduling and fractional GPU allocation, allowing multiple workloads to share a single hardware accelerator. It employs a metadata-based approach to data versioning, using virtual views to track large datasets and artifacts without duplicating r
- [abi/screenshot-to-code](https://awesome-repositories.com/repository/abi-screenshot-to-code.md) (72,926 ⭐) — This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming language syntax.

The tool distinguishes itself by supporting both static design mockups and dynamic video recordings. It processes temporal and spatial information from screen captures to reconstruct interaction flows and state transitions, enabling the creation of functional software prototypes from vis
- [clearml/clearml](https://awesome-repositories.com/repository/clearml-clearml.md) (6,740 ⭐) — ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts.

The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
- [getomni-ai/zerox](https://awesome-repositories.com/repository/getomni-ai-zerox.md) (12,241 ⭐) — Zerox is a multimodal document parser and OCR tool that uses vision models to convert PDF files and images into structured Markdown text. It functions as a visual layout extraction engine, leveraging large multimodal models to digitize documents while maintaining their original structural formatting.

The system differentiates itself through the use of coordinate-based element mapping and multimodal layout analysis to identify structural elements like tables, charts, and headers. It utilizes rasterization to convert vector PDF pages into high-resolution bitmaps, ensuring consistent input for t
- [facebookresearch/3d-vision-and-touch](https://awesome-repositories.com/repository/facebookresearch-3d-vision-and-touch.md) (75 ⭐) — Copyright (c) Facebook, Inc. and its affiliates. All rights reserved. This source code is licensed under the license found in the LICENSE file in the root directory of this source tree.
- [openhands/openhands](https://awesome-repositories.com/repository/openhands-openhands.md) (77,330 ⭐) — OpenHands is an autonomous agent framework designed for software engineering workflows. It provides a modular platform for orchestrating AI agents that reason, plan, and execute tasks within isolated, containerized development environments. By integrating with standard version control and development tools, the system enables agents to autonomously navigate codebases, implement features, and resolve issues through iterative reasoning and tool execution.

The platform distinguishes itself through a model-agnostic orchestrator that connects diverse language models to a unified tool registry. It
- [huggingface/transformers](https://awesome-repositories.com/repository/huggingface-transformers.md) (161,630 ⭐) — Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference.

The library features extensive support for model optimization and
- [olimorris/codecompanion.nvim](https://awesome-repositories.com/repository/olimorris-codecompanion-nvim.md) (6,166 ⭐) — CodeCompanion is a Neovim plugin that brings large language model capabilities directly into the editor, enabling turn-based conversations with AI models in a dedicated chat buffer. It provides a comprehensive interface for interacting with LLMs, supporting multiple providers through a flexible adapter system that can route requests to various hosted or local language model services.

The plugin distinguishes itself through its extensive context-sharing capabilities, allowing users to send buffer contents, visual selections, git diffs, LSP diagnostics, terminal output, quickfix lists, and view
- [ztangent/multimodal-dmm](https://awesome-repositories.com/repository/ztangent-multimodal-dmm.md) (23 ⭐) — A PyTorch implementation of the Multimodal Deep Markov Model (MDMM) and associated inference methods described in Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series. Please cite this paper if you use or modify any of this code.
- [docling-project/docling](https://awesome-repositories.com/repository/docling-project-docling.md) (61,674 ⭐) — Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types.

The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
- [swe-agent/swe-agent](https://awesome-repositories.com/repository/swe-agent-swe-agent.md) (18,510 ⭐) — SWE-agent is an autonomous software engineering platform designed to automate repository maintenance and issue resolution. By orchestrating language models to navigate codebases, diagnose software bugs, and apply fixes, the framework functions as an autonomous agent capable of executing shell commands, editing source code, and managing pull requests within isolated, containerized environments.

The platform distinguishes itself through its focus on end-to-end task autonomy and observability. It features a robust trajectory logging system that records every thought, action, and environment obse
- [haifengl/smile](https://awesome-repositories.com/repository/haifengl-smile.md) (6,387 ⭐) — Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models.

The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
- [hassony2/useful-computer-vision-phd-resources](https://awesome-repositories.com/repository/hassony2-useful-computer-vision-phd-resources.md) (0 ⭐) — General advice on how to conduct your research; better/faster paper reading, tools and resources; how to write a good CVPR/ECCV/ICCV paper and advice to write a good scientific paper; how to write a good review; how to release code that is easy to understand and to reuse; …
