Computer vision and multimodal

Explore open-source libraries and frameworks for image processing, object detection, and multimodal machine learning models.


abi/screenshot-to-code
72,926
This project is an artificial intelligence-powered frontend generator that translates visual design inputs into functional source code. It functions as a workflow engine that interprets graphical user interfaces, mapping layout structures and styling rules to structured markup and programming language syntax. The tool distinguishes itself by supporting both static design mockups and dynamic video recordings. It processes temporal and spatial information from screen captures to reconstruct interaction flows and state transitions, enabling the creation of functional software prototypes from vis
Python · AI Frontend Generators · AI-Powered UI Generators · Design-to-Code Generators

josephmisiti/awesome-machine-learning
72,867
This project is a comprehensive, community-driven directory of machine learning resources, software libraries, and educational materials. It serves as a centralized knowledge base for developers and researchers, organizing tools and frameworks by their primary programming language and technical domain to simplify discovery across the artificial intelligence ecosystem. The collection distinguishes itself by providing a cross-language development index that spans diverse programming environments, including C, C++, Rust, Clojure, and Python. It covers a wide range of specialized capabilities, fr
Python · Awesome List · Machine Learning Concepts · Computer Vision Libraries

OthersideAI/self-operating-computer
10,153
This project is a computer control framework that uses multimodal vision models to simulate mouse and keyboard inputs for automating desktop tasks. It functions as an autonomous agent and vision-based orchestrator that interprets screen visuals to interact with user interfaces. The system employs vision language models and object detection to locate and click interface elements. It utilizes visual grounding to overlay numerical markers on UI components and uses optical character recognition to map on-screen text to precise pixel coordinates. The framework supports voice-controlled computing
Python · Computer Automation Interfaces · GUI and Computer Agents · Autonomous UI Interaction

microsoft/OmniParser
24,377
OmniParser is a multimodal interaction engine designed to function as a desktop automation agent. It interprets visual screen information to execute complex, multi-step tasks across operating system environments by bridging visual interface perception with language models. Through a continuous cycle of observation and command execution, the system grounds high-level natural language instructions into precise, coordinate-based actions. The project distinguishes itself by utilizing vision-based parsing to interact with software interfaces without requiring access to underlying application progr
Jupyter Notebook · Desktop Automation Agents · Vision-Language Grounding Models · Agentic Orchestration Loops

imClumsyPanda/langchain-ChatGLM
38,183
This project is a LangChain-based framework for building retrieval-augmented generation systems, autonomous agents, and multimodal chatbots. It functions as an open-source orchestrator that connects local inference engines and online APIs to manage various large language model deployments. The system distinguishes itself by providing specialized interfaces for local knowledge bases, allowing the loading and vectorization of private documents to create context-aware assistants. It also supports multimodal capabilities, enabling the processing of both text and image inputs through vision-capabl
Python · Agent Orchestration Frameworks · LLM Orchestrators · Agentic LLM Frameworks

openai/CLIP
33,779
CLIP is a neural network architecture designed to map visual and textual data into a shared latent vector space. By utilizing transformer-based feature extraction and multi-modal tokenization, the system aligns images and natural language strings, enabling cross-modal similarity analysis and semantic classification. The project functions as a zero-shot classification engine, identifying image content by calculating the cosine similarity between visual features and arbitrary text labels without requiring task-specific retraining. Beyond inference, it serves as a research toolkit for evaluating
Jupyter Notebook · Contrastive Learning Models · Zero-Shot Inference Engines · Computer Vision Evaluation Tools

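The zero-shot step described for CLIP, scoring an image against arbitrary text labels by cosine similarity, can be sketched without the model itself. This is a minimal illustration that assumes the image and label embeddings have already been computed. The three-dimensional vectors and the `zero_shot_classify` helper are hypothetical stand-ins, not CLIP's API or real CLIP outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best-matching label and the score for every label.

    Hypothetical helper: picks the label whose embedding is closest
    (by cosine similarity) to the image embedding.
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings invented for illustration; a real encoder would
# produce vectors with hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

Because the labels are plain strings embedded at query time, swapping in a different label set needs no retraining, which is the property the description refers to.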

zhayujie/chatgpt-on-wechat
45,353
This project is an autonomous agent framework designed to integrate large language models with popular messaging platforms. It functions as a middleware platform that enables automated, multimodal interactions by decomposing complex user goals into sequential plans, executing them through external tools, and maintaining persistent context across sessions. The framework distinguishes itself through a modular skill architecture and a hybrid memory system. Users can extend system capabilities by installing custom logic modules from community hubs or generating them through natural language. The
Python · Agent Frameworks · Agent Orchestrators · Agent Memory Systems

haotian-liu/LLaVA
24,465
LLaVA is a multimodal large language model architecture designed to process and interpret both image and text inputs to generate natural language responses. It functions as a research-oriented platform for visual instruction tuning, providing a framework to align language models with human intent through training on diverse datasets of paired images and text queries. The system distinguishes itself through a specialized vision-language training pipeline that connects visual data to language models using projection layers and instruction-based fine-tuning. It supports distributed inference by
Python · Multimodal Large Language Models · Vision-Language Pipelines · Visual Instruction Tuning

bytedance/UI-TARS
9,622
UI-TARS is an LLM GUI automation framework and multimodal action grounding system. It functions as a GUI agent orchestrator and cross-platform device controller that uses large language models to interpret graphical interfaces and execute actions across desktop and mobile operating systems. The system translates model-generated coordinates into precise screen positions to interact with visual user interface elements. It employs a multimodal approach to interpret screen layouts and decomposes complex goals into multi-step trajectories through reasoning and error correction. The project provid
Python · Autonomous Agent Orchestrators · Multimodal Vision Interfaces · Action Data Normalization

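The step of translating model-generated coordinates into screen positions can be illustrated with a small mapping function. This sketch assumes the model emits coordinates on a fixed relative grid (0 to 1000 here, which is an assumption for illustration, not a documented UI-TARS format), and `to_screen_position` is a hypothetical name.

```python
def to_screen_position(model_x, model_y, screen_width, screen_height, grid=1000):
    """Map grid-relative model coordinates to pixel coordinates.

    Assumption: the model reports positions on a square grid from
    0 to `grid` regardless of the actual screen resolution.
    """
    if not (0 <= model_x <= grid and 0 <= model_y <= grid):
        raise ValueError("model coordinates must lie within the grid")
    # Scale onto the last valid pixel index so grid == 1000 never
    # lands one pixel past the edge of the screen.
    pixel_x = round(model_x / grid * (screen_width - 1))
    pixel_y = round(model_y / grid * (screen_height - 1))
    return pixel_x, pixel_y


# The grid corners map to the screen corners on a 1920x1080 display.
print(to_screen_position(0, 0, 1920, 1080))
print(to_screen_position(1000, 1000, 1920, 1080))
```

Keeping the model output resolution-independent means the same prediction can be replayed on displays of different sizes.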

hacksider/Deep-Live-Cam
93,878
Deep-Live-Cam is a generative video transformation tool designed for real-time facial manipulation and cinematic enhancement. It functions as a local-first AI runtime, performing all media processing directly on the user's hardware to ensure complete data privacy without external network dependencies. By utilizing a high-performance processing pipeline, the application enables live face swapping and interactive video modifications during active streaming sessions or on pre-recorded media. The system distinguishes itself through a hardware-abstraction execution layer that dynamically routes co
Python · Cinematic Video Enhancements · High-Performance AI Inference · Live Performance Execution

codexu/note-gen
12,173
Note-gen is an artificial intelligence-assisted note-taking application and knowledge management tool designed for local-first data ownership. It functions as a workspace that leverages language models to organize, summarize, and synthesize personal notes into structured documents while maintaining offline accessibility. The platform distinguishes itself through a multimodal workflow orchestrator that chains sequences of tasks to process text, images, and external data. By integrating vision-language models, it extracts information from visual inputs like screenshots and documents, converting
TypeScript · AI-Powered · Local-First Data Persistence · Local-First Knowledge Bases

facebookresearch/segment-anything
54,353
This project provides a deep learning architecture designed to identify and isolate distinct objects within images by generating precise pixel-level masks. It functions as a browser-based inference engine, enabling the execution of complex machine learning models directly within web environments without requiring server-side processing. The system distinguishes itself by utilizing hardware-accelerated execution and parallel processing to achieve real-time segmentation speeds. It supports prompt-based mask decoding, allowing users to generate spatial masks by providing specific points or boxes
Jupyter Notebook · Browser-based Inference Engines · Object Mask Generators · Browser-Based Image Segmentation

bytedance/UI-TARS-desktop
36,445
UI-TARS-desktop is a cross-platform desktop application designed to automate software interface interactions. It functions as a local agent environment that interprets graphical user interfaces through multimodal visual-language model reasoning, allowing it to navigate and manipulate software by simulating human-like mouse and keyboard inputs. The platform distinguishes itself by executing all visual recognition and decision-making logic directly on the host machine. This local inference model ensures that screen data and sensitive information remain private, as no processing is offloaded to
TypeScript · Cross-Platform Visual Automation Tools · Automated Desktop Interaction Systems · Desktop Automation

s0md3v/roop
3,527
This application is a deep learning tool designed for automated face swapping in images and videos. It utilizes generative adversarial networks to map facial features from a source image onto a target subject, maintaining the original head pose, lighting, and skin texture of the target media. The software functions as a computer vision pipeline that deconstructs video files into individual frames for sequential processing. It employs pre-trained models for landmark detection and high-dimensional feature extraction to align faces precisely. To accelerate these complex tensor operations, the en
Python · Face Swapping Applications · Generative Identity Models · Inference Engines

abetlen/llama-cpp-python
9,993
llama-cpp-python provides a Python interface for the llama.cpp library, enabling the execution of large language models with hardware acceleration. It functions as a GGUF model loader and a structured text generator capable of running inference servers and multimodal runtimes for processing both text and image inputs. The project distinguishes itself through a local inference server that exposes model capabilities via an OpenAI-compatible web API. It supports advanced execution techniques including speculative decoding, weight quantization, and layer-based GPU offloading to manage memory acro
Python · LLM Python Bindings · Chat Completion Services · Embedding Generators

iperov/DeepFaceLive
30,536
DeepFaceLive is a desktop application designed for real-time facial replacement and animation within live video streams. By utilizing deep learning models, the software performs high-speed identity mapping and facial feature analysis to transform video content as it is captured. The engine relies on GPU-accelerated inference to execute these complex image manipulation tasks at interactive frame rates. The application distinguishes itself through a modular video processing pipeline that chains specialized tasks to maintain high throughput and low latency. It features a virtual camera streaming
Python · Facial Manipulation Models · Hardware-Accelerated Inference · Real-Time Face Swapping

kmario23/deep-learning-drizzle
12,819
This project is a curated directory of educational roadmaps and resource hubs for artificial intelligence, deep learning, and machine learning. It serves as a centralized collection of academic lectures, instructional videos, and courses designed to provide structured learning paths for AI practitioners. The directory covers specialized academic curricula across several core domains, including computer vision, natural language processing, and reinforcement learning. It also provides access to niche educational content such as medical imaging, Bayesian deep learning, and probabilistic graphica
HTML · Machine Learning Education · Computer Vision Curations · Computer Vision Learning Resources

CMU-Perceptual-Computing-Lab/openpose
34,145
OpenPose is a real-time pose estimation engine designed to detect and track human body, face, hand, and foot landmarks. It functions as a multi-person motion tracker, identifying the spatial coordinates of multiple individuals simultaneously within video streams or static images. Beyond two-dimensional detection, the software acts as a three-dimensional kinematics processor, reconstructing spatial movement data from single or multiple synchronized camera perspectives. The system distinguishes itself through a bottom-up approach that utilizes part-affinity fields to associate body parts across
C++ · Pose Estimation · Keypoint Detection · Pose Estimation Engines

AccumulateMore/CV
21,907
This project is a comprehensive deep learning framework and educational platform designed for constructing, training, and evaluating neural network architectures. It provides a modular environment for building models through tensor operations and automatic differentiation, supporting a wide range of tasks from image classification and object detection to sequential data processing. Beyond its core technical capabilities, the project distinguishes itself by integrating professional career development resources directly into its learning ecosystem. It offers structured guidance, resume reviews,
Jupyter Notebook · Automatic Differentiation Engines · Computer Vision · Deep Learning Education

ultralytics/ultralytics
58,468
Ultralytics is a comprehensive computer vision framework designed for training, validating, and deploying deep learning models across a wide range of visual recognition tasks. It provides a unified interface for core operations including object detection, instance segmentation, pose estimation, and image classification. By utilizing a modular architecture, the platform allows users to swap model components to balance inference speed and accuracy requirements for diverse applications. The framework distinguishes itself through its support for real-time processing and flexible deployment. It in
Python · Computer Vision · Model Training and Inference Engines · Computer Vision Training Frameworks

scutan90/DeepLearning-500-questions
57,436
This project is a comprehensive study guide and knowledge base for deep learning, machine learning, and the associated mathematics required for artificial intelligence. It functions as a curated collection of technical questions and answers designed to help users study fundamental theories and practical applications. The repository serves as a technical interview preparation resource by aggregating industry-standard questions and core knowledge points. It provides a structured reference for reviewing neural network architectures and specific techniques used in computer vision, such as object
JavaScript · Technical Interview Preparation · Deep Learning Curriculum · Deep Learning Fundamentals

JaidedAI/EasyOCR
29,615
EasyOCR is a deep learning-based computer vision library designed to perform optical character recognition on images and video frames. It functions as a comprehensive pipeline that automates the transformation of visual text into machine-readable strings, enabling the digitization of physical documents, forms, and receipts into searchable data. The engine distinguishes itself through a multi-stage processing workflow that combines convolutional neural networks for spatial feature extraction with sequence-based decoding mechanisms. This architecture allows the system to identify and interpret
Python · OCR Engines · Optical Character Recognition · Computer Vision Libraries

lazyprogrammer/machine_learning_examples
8,823
This project is a comprehensive collection of practical code examples and implementation libraries for machine learning. It provides a wide array of reference materials for building supervised, unsupervised, and reinforcement learning algorithms. The repository serves as a multi-domain resource, featuring specific implementation suites for financial AI, Bayesian statistical modeling, and deep learning architectures. It includes a framework for training intelligent agents using policy gradients and actor-critic models, as well as practical guides for fine-tuning transformers and utilizing larg
Python · Deep Learning Models · Machine Learning Implementations · Actor-Critic Architectures

eugeneyan/applied-ml
29,783
This project is a comprehensive, curated knowledge base designed to support the development and maintenance of production-grade machine learning systems. It serves as a centralized repository of industry-standard technical literature, engineering case studies, and research papers, providing a structured reference for practitioners navigating the complexities of modern data science and machine learning engineering. The resource distinguishes itself through a cross-domain approach that bridges the gap between academic research and practical implementation. By synthesizing proven industry archit
Lifecycle Management · Data Pipelines · Machine Learning Operations Platforms

HKUDS/LightRAG
36,651
LightRAG is a graph-based retrieval framework designed to build retrieval-augmented generation pipelines. It structures unstructured text into knowledge graphs, enabling multi-hop reasoning and complex query synthesis across large document collections. By integrating dense vector embeddings with structured knowledge graphs, the system facilitates both similarity-based and relationship-aware information retrieval. The framework distinguishes itself through a dual-level retrieval strategy that combines low-level keyword matching with high-level semantic graph traversal to capture both specific
Python · Knowledge Graph Retrieval Systems · Retrieval Augmented Generation Pipelines · Graph Reasoning Systems

eriklindernoren/ML-From-Scratch
31,918
This project is an educational toolkit that provides implementations of fundamental machine learning algorithms built from scratch. By avoiding high-level library abstractions, it serves as a pedagogical reference for understanding the mathematical foundations and core mechanics of supervised learning, unsupervised learning, and reinforcement learning models. The repository distinguishes itself through a modular approach to model construction, allowing users to build custom neural networks by chaining independent functional blocks. It covers a wide range of techniques, including gradient-base
Python · Machine Learning Toolkits · Supervised Learning · Clustering Algorithms

deepfakes/faceswap
55,289
Faceswap is a comprehensive framework for automated media manipulation and neural face synthesis. It provides a modular pipeline that manages the entire lifecycle of facial feature extraction, deep learning model training, and image conversion. By coordinating complex computer vision workflows, the system enables users to map facial identities between source and destination datasets while maintaining structural alignment and lighting consistency across video frames. The project distinguishes itself through a highly extensible plugin-based architecture that handles hardware-accelerated process
Python · Automated Face Swapping · Face Swapping Engines · Automated

naptha/tesseract.js
38,141
Tesseract.js is a JavaScript library that provides optical character recognition capabilities directly within web browsers and Node.js environments. It functions as a client-side engine, enabling the conversion of images containing printed text into machine-readable strings without the need for external APIs or server-side infrastructure. The library distinguishes itself by running the original C++ optical character recognition engine within the browser through WebAssembly modules. To maintain interface responsiveness during intensive computation, it utilizes background threads for parallel p
JavaScript · Optical Character Recognition Libraries · Web-Based Text Recognition · WebAssembly

yichuan-w/LEANN
11,985
LEANN is a framework for local retrieval augmented generation and vector indexing. It functions as a system for building local knowledge bases and source code search engines that combine large language models with retrieved private data to generate context-aware responses. The project distinguishes itself through a vision-model based document layout extractor for parsing complex PDF figures and diagrams, and a source code search engine that employs structure-aware chunking to preserve function and class boundaries. It also implements the Model Context Protocol to integrate real-time data sour
Python · Retrieval Augmented Generation · Hybrid Search Engines · Local Knowledge Bases

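Structure-aware chunking, as mentioned for LEANN's code search, means splitting source files along function and class boundaries rather than at fixed character counts. The sketch below shows the general idea for Python source using the standard `ast` module. It is not LEANN's implementation, and `chunk_by_definition` is a hypothetical name.

```python
import ast


def chunk_by_definition(source):
    """Split Python source into one chunk per top-level function or class.

    Each chunk is a (name, text) pair whose text spans the whole
    definition, including any decorators, so no body is cut in half.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            if node.decorator_list:
                # Decorators sit above the `def`/`class` line.
                start = min(d.lineno for d in node.decorator_list)
            # lineno and end_lineno are 1-based and inclusive.
            text = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append((node.name, text))
    return chunks


sample = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''

for name, text in chunk_by_definition(sample):
    print(name, len(text.splitlines()))
```

Chunks built this way can be embedded and indexed individually, so a search hit returns a complete definition instead of an arbitrary slice of a file.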

tesseract-ocr/tesseract
74,751
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts. The project distinguishes itself through a sophisticated document layout analysis f
C++ · OCR Engines · Automated Digitization Engines · Command-Line Document Processors

Itseez/opencv
89,221
OpenCV is an open-source computer vision library and visual analysis toolkit. It provides a framework for processing static images and dynamic video frames to analyze visual data and extract information using deep learning. The project functions as a real-time image processing framework, enabling the execution of vision algorithms on live video streams for immediate analysis and data processing. The toolkit covers a broad range of capabilities including image pattern recognition, real-time video analysis, and visual data extraction. It also supports automated visual inspection for detecting
C++ · Computer Vision Libraries · Image Content Analyzers · Image Processing

FFmpeg/FFmpeg
61,176
FFmpeg is a cross-platform multimedia framework designed for the recording, conversion, and streaming of audio and video content. It functions as a comprehensive toolkit that provides both a command-line utility for direct media manipulation and a collection of low-level libraries for integration into custom applications. At its core, the project utilizes a packet-based stream engine and a format-agnostic abstraction layer to handle diverse media standards, containers, and network protocols. The framework distinguishes itself through a modular, graph-based filter execution model that allows f
C · Multimedia Format Converters · Multimedia Processing Suites · Audio and Video

jbhuang0604/awesome-computer-vision
23,074
This project is a comprehensive, community-driven repository that serves as a centralized catalog for computer vision research and development. It functions as a structured index of academic papers, open-source software libraries, public datasets, and educational tutorials, providing a navigation point for the complex landscape of modern vision technology. The repository distinguishes itself through a taxonomy-based indexing system that maps the relationships between foundational research, influential academic figures, and their corresponding software implementations. By utilizing a lightweig
Awesome List · Computer Vision Benchmarks · Computer Vision Curations

CompVis/stable-diffusion
73,125
Stable Diffusion is a generative machine learning pipeline that synthesizes high-resolution visual content by performing iterative denoising within a compressed latent space. By mapping natural language embeddings into pixel outputs through conditioned probabilistic processes, the framework enables the generation of images from text prompts and the transformation of existing visual inputs based on semantic instructions. The architecture utilizes a modular execution environment that decouples model loading, scheduler logic, and inference components to support diverse hardware configurations. I
Jupyter Notebook · Cross-Attention Mechanisms · Denoising Schedulers · Generative Image Engines

hybridgroup/gocv
7,463
GoCV is a computer vision library and Go language binding for OpenCV. It serves as an image processing toolkit and deep learning inference engine, providing programmatic access to a wide range of algorithms for image manipulation, object detection, and video analysis. The project differentiates itself through high-performance native bindings and hardware acceleration. It utilizes a foreign function interface to map Go calls to C++ functions and includes a hardware-agnostic backend dispatch to route neural network tasks to computation engines such as CUDA and OpenVINO. The library covers a br
Go · Computer Vision Libraries · Bounding Box Detection

lllyasviel/ControlNet
33,942
ControlNet is a framework for structural image generation that extends pre-trained diffusion models with neural network architectures designed for precise spatial control. By injecting structural guidance directly into the latent-space denoising process, the system enables users to enforce geometric or semantic constraints on generated outputs while maintaining style consistency. The framework distinguishes itself through a weight-locked copying mechanism that preserves the integrity of the original model while introducing new control signals. It supports multi-condition synthesis, allowing f
Python · Diffusion Conditioning Architectures · Generative Model Training Tools · Structural Guidance

microsoft/Windows-universal-samples
9,696
This repository is a comprehensive collection of reference implementations and sample libraries for the Universal Windows Platform. It provides practical examples of how to use Windows Runtime APIs to build cross-device applications, including detailed guidance on XAML-based declarative user interfaces and DirectX-integrated rendering. The project distinguishes itself by providing a wide array of hardware integration suites, covering low-level communication with USB, Serial, I2C, SPI, and GPIO peripherals. It includes specialized implementations for mixed reality holographic rendering, advanc
JavaScript · Framework Sample Libraries · 3D Graphics Pipelines · Adaptive UI Layouts

obsproject/obs-studio
73,384
This project is a professional live video production suite designed for capturing, encoding, and broadcasting high-quality media. At its core, it features a real-time media processing engine that utilizes hardware acceleration to composite multiple audio and video sources with minimal latency. The application provides a centralized studio interface for managing complex scene transitions, layering visual sources through a hierarchical scene-graph engine, and streaming content to multiple platforms simultaneously. The software is built on a cross-platform abstraction layer that ensures consiste
C · Foreign Function Interfaces · Hardware-Accelerated Video Pipelines · Live Video Production Suites

babalae/better-genshin-impact
12,606
Better Genshin Impact is a computer vision-based automation framework designed to perform repetitive tasks and combat sequences within game environments. It functions as a macro scripting engine that utilizes synthetic input injection to simulate human interaction with the operating system, allowing for hands-free execution of complex gameplay loops. The system distinguishes itself through a combination of template-matching visual recognition and state-machine logic, which enables the software to identify on-screen game elements and transition between operational states in real time. By mappi
C# · Automation Macro Engines · Mobile Game Automation Tools · Computer Vision Screen Interaction Tools

opencv/opencv
89,201
OpenCV is a comprehensive computer vision library designed for real-time performance and cross-platform deployment. It provides a native execution environment that leverages multi-threaded operations and automated memory management to handle intensive computational tasks, including image processing and machine learning model inference. The library distinguishes itself through a data-oriented matrix framework that utilizes proxy-based array abstractions to provide a consistent interface for multidimensional data. By employing factory-pattern algorithm interfaces and runtime type dispatching, i
C++ · Computer Vision Libraries · Object Detection and Tracking · Model Inference and Serving

hybridgroup/gobot
9,425
Gobot is a robotics framework for the Go programming language designed for developing robotics, drones, and IoT applications. It provides a hardware abstraction layer with standardized drivers to interact with GPIO, I2C, SPI, and PWM interfaces across various single-board computers and microcontrollers. The framework functions as an IoT device orchestrator and BLE device manager, enabling the coordination of multiple sensors, actuators, and Bluetooth Low Energy peripherals. It includes specialized interfaces for drone control, allowing for the management of flight maneuvers and video streams
Go · Hardware Device Interfaces · Robotics and Control · BLE Sensor Monitors

blakeblackshear/frigate
33,778
Frigate is a self-hosted network video recorder that functions as a private, local AI-powered vision engine. It manages video streams by performing real-time object detection, tracking, and classification directly on local hardware, ensuring that security monitoring and activity recording remain independent of cloud services. The system distinguishes itself through a modular, hardware-accelerated video pipeline that offloads intensive decoding and machine learning inference to dedicated GPUs, NPUs, or specialized accelerators like Coral TPUs and Hailo modules. It utilizes state-based object t
TypeScript · Network Video Recorders · Video Surveillance Systems · Transcoding Engines
bytedeco/javacv
8,310 stars
JavaCV provides Java wrappers for native computer vision and video processing libraries, allowing Java applications to perform image analysis, object detection, and video stream processing. Its computer vision capabilities include facial recognition, image segmentation, and optical flow analysis for motion tracking. It also provides tools for hardware geometry calibration and projector-camera alignment to ensure accurate spatial representation. The system covers high-performance media renderin
Java · Computer Vision Libraries · Foreign Function Interfaces · JNI Bridges
mrdoob/three.js
113,086 stars
This project is a high-level 3D graphics engine designed to render complex, hardware-accelerated environments within web browsers. It provides a comprehensive abstraction layer that manages scene graphs, cameras, and lighting, mapping high-level scene definitions onto low-level graphics APIs. By decoupling these definitions from specific hardware targets, the engine ensures consistent performance across diverse browsers and devices. The framework distinguishes itself through a robust architecture that includes a unified math library for high-frequency spatial calculations and a physically bas
JavaScript · 3D Rendering Engines · Abstraction-Layer Rendering Backends · Browser-Based 3D Visualizations
remotion-dev/remotion
50,931 stars
Remotion is a programmatic video framework that enables the creation of video content using component-based logic and standard web technologies. By leveraging a declarative animation engine, it allows developers to structure visual content as a hierarchy of reusable components, ensuring that animations and state updates remain consistent through deterministic frame execution. The framework distinguishes itself by utilizing a headless browser renderer that captures visual output frame-by-frame to generate high-quality video files. This architecture supports a cloud-native media pipeline, allow
TypeScript · Cross-Platform Media Frameworks · Programmatic Video Frameworks · Animation Engines
kornia/kornia
11,238 stars
Kornia is a differentiable computer vision library and cross-framework tensor vision toolset. It implements vision operations as differentiable tensor functions so that they can be integrated into deep learning pipelines, and it supports the transpilation of operations across PyTorch, TensorFlow, JAX, and NumPy. The project provides specialized toolsets for geometric vision and stereo depth, including algorithms for 3D scene reconstruction, camera calibration, and pose estimation. It further distinguishes itself as a differentiable image augmentation framework, applying random geometric and color transformations w
Python · Computer Vision Libraries · Cross-Framework Tensor Dispatch · Differentiable Vision Operations
pmndrs/react-three-fiber
31,172 stars
This project is a React-based WebGL renderer that enables the creation of three-dimensional scenes using a declarative, component-driven architecture. It functions as a bridge between a component-based user interface library and a low-level graphics engine, allowing developers to manage lights, cameras, and geometry as standard elements within a reactive tree structure. The library distinguishes itself by treating the scene graph as a declarative hierarchy that synchronizes directly with application state and lifecycle events. It utilizes a custom reconciler to map component updates to object
TypeScript · WebGL & GPU Rendering · Rendering · Component-Based Scene Graphs
pixijs/pixijs
47,416 stars
PixiJS is a high-performance 2D rendering engine designed for building interactive visual content and browser-based games. It provides a hardware-accelerated graphics library that leverages WebGL and WebGPU backends to execute complex scenes, utilizing a hierarchical scene graph to manage object transformations and display order. The project distinguishes itself through a sophisticated architecture that decouples rendering logic from hardware APIs, allowing for consistent performance across diverse browser environments. It features a robust, asynchronous asset pipeline that handles loading, c
TypeScript · Game Engines · Rendering · Scene Graph Frameworks
