Which open-source GitHub repositories match “Computer vision and multimodal”?

abi/screenshot-to-code is the closest match — This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming language syntax. The tool distinguishes itself by supporting both static design mockups and dynamic video recordings. It processes temporal and spatial inform…

Why does abi/screenshot-to-code match “Computer vision and multimodal”?

This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming langu…

Why does josephmisiti/awesome-machine-learning match “Computer vision and multimodal”?

This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical dom…

Why does othersideai/self-operating-computer match “Computer vision and multimodal”?

This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs…

Why does microsoft/omniparser match “Computer vision and multimodal”?

OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycl…

Why does imclumsypanda/langchain-chatglm match “Computer vision and multimodal”?

This project is a LangChain-based framework for building retrieval-augmented generation systems, autonomous agents, and multimodal chatbots. It functions as an open-source orchestrator that connects local inference engines and online APIs to manage various large language model deployments. The sys…

Computer vision and multimodal

Explore open-source libraries and frameworks for image processing, object detection, and multimodal machine learning models.

Find the best repos with AI.We'll search the best matching repositories with AI.

abi/screenshot-to-code
abi/screenshot-to-code
72,926View on GitHub
This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming language syntax. The tool distinguishes itself by supporting both static design mockups and dynamic video recordings. It processes temporal and spatial information from screen captures to reconstruct interaction flows and state transitions, enabling the creation of functional software prototypes from visual design intent. To ensure the generated output adheres to standard development patterns, the system utilizes abstract syntax tree generation during the synthesis process. The platform relies on external intelligence services to perform complex visual analysis and code generation tasks. It is distributed as a containerized environment, which bundles all application services and dependencies to maintain consistent execution across local development machines and production infrastructure.
PythonAI Frontend GeneratorsAI-Powered UI GeneratorsDesign-to-Code Generators
View on GitHub72,926
josephmisiti/awesome-machine-learning
josephmisiti/awesome-machine-learning
72,867View on GitHub
This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical domain to simplify discovery across the artificial intelligence ecosystem. The collection distinguishes itself by providing a cross-language development index that spans diverse programming environments, including C, C++, Rust, Clojure, and Python. It covers a wide range of specialized capabilities, from neural network implementation and deep learning frameworks to computer vision, natural language processing, and reinforcement learning. The repository also highlights hardware-accelerated compute kernels and neurosymbolic architectures, offering a broad view of both established and emerging machine learning technologies. Beyond software libraries, the directory includes a curated roadmap of foundational learning materials, such as textbooks and documentation on linear algebra, probability, statistics, and distributed machine learning patterns. This structured approach provides a technical reference for those seeking to understand both the theoretical underpinnings and the practical implementation of modern computational intelligence.
PythonAwesome ListMachine Learning ConceptsComputer Vision Libraries
View on GitHub72,867
othersideai/self-operating-computer
OthersideAI/self-operating-computer
10,153View on GitHub
This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing by translating spoken commands into text-based objectives. It manages a full automation loop encompassing state observation through screenshots, action planning via cloud or local APIs, and the execution of synthetic inputs.
PythonComputer Automation InterfacesGUI and Computer AgentsAutonomous UI Interaction
View on GitHub10,153
microsoft/omniparser
microsoft/OmniParser
24,377View on GitHub
OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycle of observation and command execution, the system grounds high-level natural language instructions into precise, coordinate-based actions. The project distinguishes itself by utilizing vision-based parsing to interact with software interfaces without requiring access to underlying application programming interfaces or platform-specific accessibility frameworks. It decomposes complex screenshots into structured semantic elements and maps raw pixel data to labeled interactive components. This approach enables consistent automated workflows across varying display resolutions by normalizing coordinate spaces and relying on visual recognition rather than code-level hooks. The software provides a comprehensive framework for autonomous agent development, allowing for the transformation of static interface captures into structured data representations. This capability facilitates accurate element identification and interaction for vision-based models during repetitive desktop tasks.
Jupyter NotebookDesktop Automation AgentsVision-Language Grounding ModelsAgentic Orchestration Loops
View on GitHub24,377
imclumsypanda/langchain-chatglm
imClumsyPanda/langchain-ChatGLM
38,183View on GitHub
This project is a LangChain-based framework for building retrieval-augmented generation systems, autonomous agents, and multimodal chatbots. It functions as an open-source orchestrator that connects local inference engines and online APIs to manage various large language model deployments. The system distinguishes itself by providing specialized interfaces for local knowledge bases, allowing the loading and vectorization of private documents to create context-aware assistants. It also supports multimodal capabilities, enabling the processing of both text and image inputs through vision-capable models. The platform covers a broad range of capabilities, including autonomous agent orchestration with tool-calling loops, vector-database embedding for semantic search, and the integration of external data querying from search engines and databases. It includes a web-based user interface for managing conversations and configuring system prompts.
PythonAgent Orchestration FrameworksLLM OrchestratorsAgentic LLM Frameworks
View on GitHub38,183
openai/clip
openai/CLIP
33,779View on GitHub
CLIP is a neural network architecture designed to map visual and textual data into a shared latent vector space. By utilizing transformer-based feature extraction and multi-modal tokenization, the system aligns images and natural language strings, enabling cross-modal similarity analysis and semantic classification. The project functions as a zero-shot classification engine, identifying image content by calculating the cosine similarity between visual features and arbitrary text labels without requiring task-specific retraining. Beyond inference, it serves as a research toolkit for evaluating model robustness and performance across diverse visual domains. It supports downstream applications by providing methods for frozen representation transfer and linear probe training, allowing users to leverage pre-trained encoders for specialized tasks. The library includes diagnostic tools for model auditing, specifically focusing on fairness assessment and bias detection to identify performance disparities across demographic groups. It also incorporates usage restriction policies to limit deployment in sensitive environments. The repository provides the necessary interfaces for multimodal input processing and benchmarking to evaluate how well visual recognition systems generalize in real-world scenarios.
Jupyter NotebookContrastive Learning ModelsZero-Shot Inference EnginesComputer Vision Evaluation Tools
View on GitHub33,779
zhayujie/chatgpt-on-wechat
zhayujie/chatgpt-on-wechat
45,353View on GitHub
This project is an autonomous agent framework designed to integrate large language models with popular messaging platforms. It functions as a middleware platform that enables automated, multimodal interactions by decomposing complex user goals into sequential plans, executing them through external tools, and maintaining persistent context across sessions. The framework distinguishes itself through a modular skill architecture and a hybrid memory system. Users can extend system capabilities by installing custom logic modules from community hubs or generating them through natural language. The memory system combines vector-based similarity search with traditional keyword indexing to retrieve relevant historical context, while a dedicated web console allows for the management of these memory files, system logs, and active messaging channels. The system supports a broad range of operational capabilities, including model-agnostic task routing, automated knowledge organization, and real-time reasoning visualization. It provides comprehensive administrative control through both terminal-based commands and slash-prefixed chat inputs, allowing for the management of runtime configurations, skill installations, and background processes. The project is configured via centralized files and provides secure storage for API keys and environment secrets. It is designed for deployment as a persistent service, with support for cross-platform messaging and automated task scheduling.
PythonAgent FrameworksAgent OrchestratorsAgent Memory Systems
View on GitHub45,353
haotian-liu/llava
haotian-liu/LLaVA
24,465View on GitHub
LLaVA is a multimodal large language model architecture designed to process and interpret both image and text inputs to generate natural language responses. It functions as a research-oriented platform for visual instruction tuning, providing a framework to align language models with human intent through training on diverse datasets of paired images and text queries. The system distinguishes itself through a specialized vision-language training pipeline that connects visual data to language models using projection layers and instruction-based fine-tuning. It supports distributed inference by coordinating a central controller with independent model workers, allowing for the deployment of visual reasoning services across local or cloud-based hardware. The project includes comprehensive tools for visual model fine-tuning, featuring automated checkpoint-based persistence and multi-stage data pipelines. It also provides automated evaluation procedures to quantify model accuracy against ground truth datasets, alongside both command-line and web-based interfaces for interactive visual reasoning tasks.
PythonMultimodal Large Language ModelsVision-Language PipelinesVisual Instruction Tuning
View on GitHub24,465
bytedance/ui-tars
bytedance/UI-TARS
9,622View on GitHub
UI-TARS is an LLM GUI automation framework and multimodal action grounding system. It functions as a GUI agent orchestrator and cross-platform device controller that uses large language models to interpret graphical interfaces and execute actions across desktop and mobile operating systems. The system translates model-generated coordinates into precise screen positions to interact with visual user interface elements. It employs a multimodal approach to interpret screen layouts and decomposes complex goals into multi-step trajectories through reasoning and error correction. The project provides capabilities for cross-platform interface control, including clicking, typing, and scrolling across web, mobile, and desktop environments. It includes tools for desktop and mobile GUI interaction, automation script generation, and visual grounding evaluation to measure coordinate precision. The framework supports hosting models on cloud platforms to provide scalable inference endpoints.
PythonAutonomous Agent OrchestratorsMultimodal Vision InterfacesAction Data Normalization
View on GitHub9,622
hacksider/deep-live-cam
hacksider/Deep-Live-Cam
93,878View on GitHub
Deep-Live-Cam is a generative video transformation tool designed for real-time facial manipulation and cinematic enhancement. It functions as a local-first AI runtime, performing all media processing directly on the user's hardware to ensure complete data privacy without external network dependencies. By utilizing a high-performance processing pipeline, the application enables live face swapping and interactive video modifications during active streaming sessions or on pre-recorded media. The system distinguishes itself through a hardware-abstraction execution layer that dynamically routes compute tasks to available graphics hardware, such as CUDA or CoreML backends. This architecture supports complex operations like multi-face mapping, where distinct target faces are applied to multiple subjects simultaneously, and preserves original mouth movements to maintain natural speech synchronization. To ensure visual fidelity, the engine employs precision mask-based blending and generative detail restoration, effectively integrating source features into target video geometry. Beyond core transformation capabilities, the application includes tools for cinematic rendering, such as real-time color grading and frame interpolation. It manages system resources through chunked memory and frame-based stream processing, which prevents crashes during intensive workloads and maintains stable performance. The interface is designed for focused workflows, offering distraction-free modes and automated projection window management to streamline the user experience during live operations.
PythonCinematic Video EnhancementsHigh-Performance AI InferenceLive Performance Execution
View on GitHub93,878
codexu/note-gen
codexu/note-gen
12,173View on GitHub
Note-gen is an artificial intelligence-assisted note-taking application and knowledge management tool designed for local-first data ownership. It functions as a workspace that leverages language models to organize, summarize, and synthesize personal notes into structured documents while maintaining offline accessibility. The platform distinguishes itself through a multimodal workflow orchestrator that chains sequences of tasks to process text, images, and external data. By integrating vision-language models, it extracts information from visual inputs like screenshots and documents, converting them into structured text. Users can further extend these capabilities by connecting third-party artificial intelligence services and external search tools to ground generated content in their own local knowledge base. The system supports a variety of data management and retrieval methods, including vector-based semantic search to locate information based on intent rather than keywords. It maintains consistency across distributed environments by synchronizing files through remote storage providers such as version control systems or cloud storage.
TypeScriptAI-PoweredLocal-First Data PersistenceLocal-First Knowledge Bases
View on GitHub12,173
facebookresearch/segment-anything
facebookresearch/segment-anything
54,353View on GitHub
This project provides a deep learning architecture designed to identify and isolate distinct objects within images by generating precise pixel-level masks. It functions as a browser-based inference engine, enabling the execution of complex machine learning models directly within web environments without requiring server-side processing. The system distinguishes itself by utilizing hardware-accelerated execution and parallel processing to achieve real-time segmentation speeds. It supports prompt-based mask decoding, allowing users to generate spatial masks by providing specific points or boxes as inputs. Additionally, the framework includes an image embedding pipeline that converts raw visual data into compact numerical representations, facilitating efficient analysis and downstream task performance. The toolkit encompasses a suite of model optimization utilities that convert and compress machine learning models into standardized, portable formats. These capabilities ensure consistent performance across diverse hardware environments while maintaining high-performance execution through multithreaded memory sharing.
Jupyter NotebookBrowser-based Inference EnginesObject Mask GeneratorsBrowser-Based Image Segmentation
View on GitHub54,353
bytedance/ui-tars-desktop
bytedance/UI-TARS-desktop
36,445View on GitHub
UI-TARS-desktop is a cross-platform desktop application designed to automate software interface interactions. It functions as a local agent environment that interprets graphical user interfaces through multimodal visual-language model reasoning, allowing it to navigate and manipulate software by simulating human-like mouse and keyboard inputs. The platform distinguishes itself by executing all visual recognition and decision-making logic directly on the host machine. This local inference model ensures that screen data and sensitive information remain private, as no processing is offloaded to external servers. By mapping visual analysis to low-level operating system input drivers, the tool provides a consistent method for controlling both desktop applications and web browser environments. Beyond basic interface interaction, the software includes a modular tool server protocol that allows for the integration of external functional modules. This framework enables the agent to extend its capabilities beyond graphical tasks, connecting to external systems and services to perform complex, multi-step workflows.
TypeScriptCross-Platform Visual Automation ToolsAutomated Desktop Interaction SystemsDesktop Automation
View on GitHub36,445
s0md3v/roop
s0md3v/roop
3,527View on GitHub
This application is a deep learning tool designed for automated face swapping in images and videos. It utilizes generative adversarial networks to map facial features from a source image onto a target subject, maintaining the original head pose, lighting, and skin texture of the target media. The software functions as a computer vision pipeline that deconstructs video files into individual frames for sequential processing. It employs pre-trained models for landmark detection and high-dimensional feature extraction to align faces precisely. To accelerate these complex tensor operations, the engine distributes computational workloads across both the system processor and graphics hardware. The pipeline includes post-processing capabilities such as histogram matching and spatial blurring to integrate the swapped region with the surrounding image. Users can target specific individuals within group media by providing reference indices and can adjust detection sensitivity or image orientation to resolve processing failures.
PythonFace Swapping ApplicationsGenerative Identity ModelsInference Engines
View on GitHub3,527
abetlen/llama-cpp-python
abetlen/llama-cpp-python
9,993View on GitHub
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory across system RAM and VRAM. The library covers a broad range of AI capabilities, including text completion, embedding generation, and the enforcement of structured outputs via JSON schemas or formal grammars. It also provides infrastructure for tool use through external function calling and manages model extensions via LoRA adapter injection. Users can fetch model files directly from Hugging Face and maintain model state persistence for resuming generation.
PythonLLM Python BindingsChat Completion ServicesEmbedding Generators
View on GitHub9,993
iperov/deepfacelive
iperov/DeepFaceLive
30,536View on GitHub
DeepFaceLive is a desktop application designed for real-time facial replacement and animation within live video streams. By utilizing deep learning models, the software performs high-speed identity mapping and facial feature analysis to transform video content as it is captured. The engine relies on GPU-accelerated inference to execute these complex image manipulation tasks at interactive frame rates. The application distinguishes itself through a modular video processing pipeline that chains specialized tasks to maintain high throughput and low latency. It features a virtual camera streaming interface that exposes processed video and audio as standard hardware inputs, allowing users to route modified media directly into third-party communication and broadcasting software. To ensure synchronization during live sessions, the system supports adjustable delay settings and offset configurations. The architecture employs asynchronous frame buffering and multi-GPU load balancing to distribute computational tasks across hardware, minimizing bottlenecks during intensive processing. It supports various input sources, including network-connected mobile devices, and provides tools for optimizing performance through hardware offloading and memory management. Detailed setup instructions are available to assist with environment configuration and driver preparation on Windows systems.
PythonFacial Manipulation ModelsHardware-Accelerated InferenceReal-Time Face Swapping
View on GitHub30,536
kmario23/deep-learning-drizzle
kmario23/deep-learning-drizzle
12,819View on GitHub
This project is a curated directory of educational roadmaps and resource hubs for artificial intelligence, deep learning, and machine learning. It serves as a centralized collection of academic lectures, instructional videos, and courses designed to provide structured learning paths for AI practitioners. The directory covers specialized academic curricula across several core domains, including computer vision, natural language processing, and reinforcement learning. It also provides access to niche educational content such as medical imaging, Bayesian deep learning, and probabilistic graphical models. The resource surface includes a catalog of intensive bootcamps, summer schools, and study materials focusing on deep neural network fundamentals, computational linguistics, and convex optimization techniques. The project is organized as a hierarchical directory of markdown documentation and static link aggregations.
HTMLMachine Learning EducationComputer Vision CurationsComputer Vision Learning Resources
View on GitHub12,819
cmu-perceptual-computing-lab/openpose
CMU-Perceptual-Computing-Lab/openpose
34,145View on GitHub
OpenPose is a real-time pose estimation engine designed to detect and track human body, face, hand, and foot landmarks. It functions as a multi-person motion tracker, identifying the spatial coordinates of multiple individuals simultaneously within video streams or static images. Beyond two-dimensional detection, the software acts as a three-dimensional kinematics processor, reconstructing spatial movement data from single or multiple synchronized camera perspectives. The system distinguishes itself through a bottom-up approach that utilizes part-affinity fields to associate body parts across multiple people. It employs hardware-accelerated tensor processing with optimized GPU kernels to maintain high frame rates, supported by a multi-stage convolutional architecture that iteratively refines keypoint detection. To ensure precise spatial mapping, the engine performs multi-view triangulation and applies non-maximum suppression to filter redundant landmark data. The project serves as a computer vision integration toolkit, providing the necessary pipelines to connect live skeletal tracking data to external digital environments. This allows for the animation of virtual characters or the triggering of interactions within game engines and other simulated spaces. The architecture is modular, separating preprocessing, inference, and post-processing stages to facilitate performance tuning and benchmarking across diverse hardware configurations.
C++Pose EstimationKeypoint DetectionPose Estimation Engines
View on GitHub34,145
accumulatemore/cv
AccumulateMore/CV
21,907View on GitHub
This project is a comprehensive deep learning framework and educational platform designed for constructing, training, and evaluating neural network architectures. It provides a modular environment for building models through tensor operations and automatic differentiation, supporting a wide range of tasks from image classification and object detection to sequential data processing. Beyond its core technical capabilities, the project distinguishes itself by integrating professional career development resources directly into its learning ecosystem. It offers structured guidance, resume reviews, and job referral services alongside its technical tutorials, aiming to support students as they transition into roles within the technology industry. The framework covers a broad capability surface, including hardware-accelerated training, data pipeline automation, and the implementation of advanced architectures like vision transformers and recurrent neural networks. It provides tools for managing the full model lifecycle, from dataset preparation and weight initialization to performance validation and state serialization. The project is delivered as a collection of interactive Jupyter notebooks, providing a hands-on environment for exploring deep learning fundamentals and computer vision techniques.
Jupyter NotebookAutomatic Differentiation EnginesComputer VisionDeep Learning Education
View on GitHub21,907
ultralytics/ultralytics
ultralytics/ultralytics
58,468View on GitHub
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification. By utilizing a modular architecture, the platform allows users to swap model components to balance inference speed and accuracy requirements for diverse applications. The framework distinguishes itself through its support for real-time processing and flexible deployment. It includes a streaming inference engine that manages memory usage for large-scale video analysis and a format-agnostic export pipeline that translates trained weights into standardized formats for edge and cloud environments. Beyond standard detection, it supports open-vocabulary segmentation, allowing users to identify objects using text or visual prompts, and provides robust multi-object tracking capabilities to maintain identity persistence across video frames. The platform covers the entire machine learning lifecycle, from dataset retrieval and dynamic data loading to performance benchmarking and experiment tracking. It includes specialized tools for annotating visual results and accessing structured output data, facilitating integration into automated inspection and monitoring workflows. Users can configure training hyperparameters, resume interrupted sessions, and profile model performance to ensure optimal deployment on hardware ranging from mobile devices to high-performance GPUs.
PythonComputer VisionModel Training and Inference EnginesComputer Vision Training Frameworks
View on GitHub58,468

Computer vision and multimodal

abi/screenshot-to-code

josephmisiti/awesome-machine-learning

OthersideAI/self-operating-computer

microsoft/OmniParser

imClumsyPanda/langchain-ChatGLM

openai/CLIP

zhayujie/chatgpt-on-wechat

haotian-liu/LLaVA

bytedance/UI-TARS

hacksider/Deep-Live-Cam

codexu/note-gen

facebookresearch/segment-anything

bytedance/UI-TARS-desktop

s0md3v/roop

abetlen/llama-cpp-python

iperov/DeepFaceLive

kmario23/deep-learning-drizzle

CMU-Perceptual-Computing-Lab/openpose

AccumulateMore/CV

ultralytics/ultralytics