30 open-source projects similar to cocodataset/cocoapi, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best cocoapi alternative.
Labelme is a Python-based image annotation tool used to create computer vision datasets. It serves as a visual editor for semantic segmentation, allowing users to define object boundaries using polygons, rectangles, points, and circles. The application also functions as a multispectral image annotator, supporting high-bit-depth TIFF files used in satellite and scientific imagery. The tool incorporates AI-assisted labeling capabilities to automate the creation of masks and polygons. These features allow for shape generation driven by text prompts or interactive point selections, which propose
Supervision is a computer vision toolset for normalizing model outputs, managing datasets, and visualizing annotations. It provides a framework to convert predictions from various classification and detection models into a standardized data format to ensure interoperability across different computer vision pipelines. The library features a post-processor for filtering, counting, and tracking detected objects across image frames and video streams. It includes capabilities for large image tiling to improve the detection of small objects and tools for assigning persistent identities to objects t
labelImg is a desktop image annotation tool and dataset preparation utility used to create labeled datasets for computer vision training. It provides a graphical interface for drawing bounding boxes around objects in images and assigning them class labels to build ground truth data for machine learning models. The software specifically supports the Pascal VOC XML annotation format, exporting image coordinates and class names into standard XML or text structures. It allows users to load predefined class lists from text files to standardize naming across an entire project. Beyond initial label
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
labelImg is a computer vision labeling tool and image bounding box annotator used to create training datasets for machine learning models. It functions as a desktop utility for drawing rectangular labels on images and saving object coordinates and class names in common machine learning formats. The tool is specifically designed to generate and edit PascalVOC formatted XML files and create image labels in the text-based format required by YOLO object detection pipelines. The software covers object detection annotation and training data preparation, including the ability to manage label catego
This is an image segmentation framework and masking toolkit for constructing binary and multi-class neural network architectures. It serves as a deep learning encoder wrapper that integrates pre-trained convolutional neural network architectures into semantic segmentation models. The library enables the use of pre-trained backbones to isolate complex patterns and leverages transfer learning to accelerate training. It provides a collection of overlap-based loss functions and precision metrics specifically designed to evaluate and refine the accuracy of image masks. The toolkit covers the full
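As an illustration of what "overlap-based" scoring of masks means, the following is a minimal sketch of an intersection-over-union (IoU) score and a Dice loss on binary masks. It is not this library's API: the function names, the nested-list mask representation, and the `eps` smoothing term are all assumptions made for the example.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels (assumed representation)


def overlap_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted positives, target positives)."""
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0
    return inter, pred_sum, target_sum


def iou_score(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Intersection over union; eps avoids division by zero on empty masks."""
    inter, pred_sum, target_sum = overlap_counts(pred, target)
    union = pred_sum + target_sum - inter
    return (inter + eps) / (union + eps)


def dice_loss(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """1 - Dice coefficient, so a perfect overlap gives a loss near 0."""
    inter, pred_sum, target_sum = overlap_counts(pred, target)
    dice = (2 * inter + eps) / (pred_sum + target_sum + eps)
    return 1.0 - dice
```

Both quantities reward masks that overlap the ground truth and penalize pixels predicted outside it. Real training code would typically compute them on tensors of probabilities rather than hard 0/1 lists.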
FastSAM is an image segmentation framework that uses convolutional neural networks to isolate visual elements and generate masks for detectable objects within images. It provides a system for both automatic all-object segmentation and promptable image segmentation. The project utilizes an inference-optimized architecture to reduce computational overhead, enabling faster mask generation and real-time visual analysis. It supports the creation of precise masks through various prompt inputs, including points, bounding boxes, and text descriptions. The framework covers broader computer vision cap
Layout-parser is a deep learning document layout parser and image analysis framework. It provides a toolkit for extracting structural information and layout patterns from scanned documents and digital images, transforming them into programmatic data structures for automated analysis. The framework integrates layout detection with optical character recognition to convert tabular regions into machine-readable data. It utilizes neural networks to identify and classify structural elements within document images without relying on manual rule-based systems. The system covers a broad range of docu
This project is a foundation model and research toolkit designed for promptable object segmentation and temporal tracking. It provides a unified framework for isolating specific regions or objects within both static images and dynamic video sequences. The system distinguishes itself through a streaming memory architecture that maintains temporal consistency by storing and retrieving object features across frames. This mechanism allows the model to resolve occlusions and preserve object identity even when targets move out of view or change appearance. By utilizing a shared backbone for both im
This project is a collection of educational resources and implementation frameworks providing deep learning model recipes, code samples, and step-by-step guides for computer vision tasks. It organizes complex workflows into modular recipes and implementation guides to facilitate the building of image and video analysis models. The framework focuses on specialized vision capabilities, including an image similarity framework for fast retrieval and re-ranking, human pose estimation, and video action recognition. It also provides specific tools for crowd density estimation and document image clea
Detectron2 is a PyTorch computer vision framework for training and deploying models for object detection, image segmentation, and other visual recognition tasks. It provides a research-oriented environment for training complex vision models with multi-GPU acceleration. The project includes a specialized object detection library for identifying and locating multiple objects via bounding boxes, as well as an image segmentation toolkit for creating pixel-level masks through instance, semantic, and panoptic segmentation. Additionally, it features a human pose estimati
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Face-recognition.js is a computer vision software development kit for Node.js that provides tools for detecting, mapping, and identifying human faces within images and video streams. It functions as a bridge to high-performance native libraries, enabling developers to perform complex facial analysis tasks directly within JavaScript and TypeScript environments. The library distinguishes itself by combining deep learning inference with geometric landmark mapping. It utilizes pre-trained neural networks to extract facial feature vectors and employs Euclidean distance calculations to determine th
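The Euclidean distance comparison of facial feature vectors mentioned above can be sketched in a few lines. This is an illustrative example only, written in Python rather than against the library's JavaScript API. The function names, the short toy descriptors, and the 0.6 cut-off are assumptions chosen for the example.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_match(
    query: Sequence[float],
    known: Dict[str, Sequence[float]],
    threshold: float = 0.6,  # assumed cut-off; tune for the model in use
) -> Tuple[Optional[str], float]:
    """Return the label of the nearest known descriptor, or None if too far."""
    best_label: Optional[str] = None
    best_dist = float("inf")
    for label, descriptor in known.items():
        dist = euclidean_distance(query, descriptor)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist > threshold:
        return None, best_dist
    return best_label, best_dist
```

A smaller distance means the two descriptors are more alike. The threshold decides when the nearest neighbor is close enough to be reported as a match.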
This project is a desktop screen capture and annotation utility designed for Linux environments. It provides an interactive graphical overlay that allows users to select specific screen regions, apply visual annotations such as shapes, text, and pixelation, and manage the resulting images through a configurable post-capture pipeline. The application distinguishes itself through deep system integration and automation capabilities. It operates as a persistent background daemon that monitors global hotkeys and supports inter-process communication via a system message bus, enabling users to trigg
eSearch is a desktop tool that combines screen capture, image annotation, screen recording, optical character recognition (OCR), and text search and translation into a single application. It is built around a modular architecture that coordinates these tasks through an event-driven capture pipeline, allowing users to capture screen regions, annotate them with drawing and shape tools, and then extract text using a local-first OCR engine or optional cloud services. The project distinguishes itself by integrating a command-line interface for triggering capture and recognition tasks, enabling scr
This project is an AI-powered screenshot manager and visual assistant designed for capturing screen content and processing it through large language models. It functions as an OCR translation application and screen annotation tool, allowing users to extract text from images and perform intelligent analysis of visual data. The software differentiates itself through an AI-driven OCR pipeline and the ability to convert screenshots into structured Markdown or HTML via layout-aware document transformation. It features a visual AI assistant capable of analyzing screen content and a prompt-engineere
ArduinoJson is a C++ library for parsing and manipulating JSON data and MessagePack binary streams on microcontrollers with limited memory and processing power. It provides the core primitives necessary for embedded data serialization and parsing, enabling devices to exchange structured data over serial or network interfaces. The library is distinguished by its focus on microcontroller memory management, employing strategies such as pool-based allocation, string deduplication, and non-owning string views to minimize RAM usage. It further optimizes for constrained environments by allowing cons
Flameshot is a cross-platform desktop screenshot tool and image annotation utility. It provides the ability to capture full displays or specific screen regions and save them as image files. The software features a built-in editor for adding arrows, shapes, text, and markers to screen captures for visual documentation. It also includes functionality for transferring captured and annotated images directly to external hosting services for remote storage and sharing. The utility includes a command line interface for automating screen captures, managing application settings via scripts, and trigg
JSON5 is a parser and serializer for a human-readable configuration format that extends JSON. It serves as a JavaScript-based data parser that allows for a more flexible version of the JSON specification to simplify manual editing of data files. The project provides capabilities to support comments, trailing commas, and multi-line strings. It includes utilities to convert this extended syntax into standard JSON for compatibility with tools requiring strict specifications. The library covers data serialization, string parsing, and structural syntax validation. It also provides integration for
tui.image-editor is a JavaScript image manipulation library and web-based photo editor. It provides a browser-based interface for cropping, resizing, and applying filters to images using the HTML5 Canvas API. The project is distinguished by its role as a canvas-based annotation tool, allowing users to add text, shapes, and freehand drawings as graphic overlays. It offers extensive UI customization through theme configuration, interface text localization, and the ability to replace default icons with custom SVG files. The library covers geometry manipulation, visual filter application, and im
JSONKit is an Objective-C library used for parsing, serializing, and manipulating JSON data. It functions as a JSON parser that converts text into native data structures and a serializer that transforms native objects into formatted JSON text. The library includes a gzip compression wrapper that compresses serialized JSON payloads to reduce network transfer sizes and automatically detects and decompresses gzip buffers before decoding. The toolset provides capabilities for JSON parsing and serialization, supporting customizable indentation, character escaping, and flexible comment handling.
JSON-java is a Java library for parsing and generating JSON text and mapping it to Java objects and collections. It functions as a serialization framework for converting class instances and data structures into standardized JSON strings. The project includes a JSON pointer implementation for retrieving specific values from documents using string or URI fragment representations. It also provides a converter for translating data structures between JSON and XML, as well as a translator for transforming data between JSON and web formats such as HTTP headers, cookies, and comma-delimited lists. T
This project is a research library and toolkit for deep learning computer vision, focused on implementing transformer and mixer-based architectures for image classification. It processes visual data by converting images into sequences of patches, allowing standard attention mechanisms to capture global dependencies without relying on traditional convolutional operations. The framework distinguishes itself through its support for multimodal embedding analysis, which maps images and text into a shared latent vector space. This capability enables zero-shot classification and cross-modal retrieva
This project is a modular research toolkit designed for developing, training, and evaluating deep learning models for object detection, segmentation, and video instance tracking. It provides a flexible training engine that manages complex neural network execution, including distributed training, custom lifecycle hooks, and weight optimization. The framework is built around a hierarchical configuration system that allows users to define architectures, data pipelines, and training hyperparameters through composable, inheritable files. The project distinguishes itself through its highly modular
ShareX is a desktop utility designed for screen capture, image annotation, and automated file sharing. It provides a comprehensive suite of tools for capturing screen regions, windows, or scrolling content, and includes a layered image editor that allows users to manipulate, scale, and transform graphical elements and annotations directly on captured media. The application distinguishes itself through an event-driven post-capture pipeline that triggers automated workflows, such as image processing, external command execution, or file uploads, immediately after a capture event. Users can exten
MMF is a modular framework for building, training, and evaluating vision-and-language models. It provides a configuration-driven experiment system where model, dataset, and training parameters are defined through composable YAML files, alongside a curated model zoo of pretrained checkpoints for state-of-the-art multimodal architectures. The framework includes a multimodal dataset loader that downloads, processes, and batches vision-and-language data, and a vision-language model trainer supporting distributed training, mixed precision, and checkpoint-based resumption. The framework distinguish
A complete, fast Android networking library that also supports HTTP/2.
Hutool is a comprehensive suite of Java extensions designed to serve as a standard library extension. Its primary purpose is to reduce development boilerplate for common programming tasks and data manipulation through a collection of utility classes. The project provides specialized toolkits for database management using active record patterns and connection pooling, as well as network communication via a simplified HTTP client and asynchronous socket management. It includes security and identity capabilities such as symmetric and asymmetric encryption, image captcha generation, and JWT token
Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers
This is a JavaScript library for parsing and serializing JSON data, with a particular focus on handling objects that contain circular references. It provides a standard JSON parser that reads text and reconstructs JavaScript values without using the eval function, guarding against code injection, alongside a standard serializer that converts objects into JSON strings for data interchange. The library distinguishes itself by offering specialized encoding and decoding for cyclical object graphs. It can serialize objects with circular references by replacing repeated object paths with JSONPath s
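The idea of replacing repeated objects with path references so that a cyclical graph can be serialized can be sketched as follows. This is a minimal Python illustration, not the library's JavaScript implementation. The `decycle` name, the `"$ref"` key, and the exact path syntax are assumptions made for the example.

```python
import json
from typing import Any, Dict


def decycle(value: Any) -> Any:
    """Copy a dict/list graph, replacing already-visited containers with path references."""
    seen: Dict[int, str] = {}  # id(container) -> path where it was first encountered

    def walk(node: Any, path: str) -> Any:
        if isinstance(node, (dict, list)):
            node_id = id(node)
            if node_id in seen:
                # Second visit: emit a reference instead of recursing forever.
                return {"$ref": seen[node_id]}
            seen[node_id] = path
            if isinstance(node, list):
                return [walk(item, f"{path}[{i}]") for i, item in enumerate(node)]
            return {
                key: walk(item, f"{path}[{json.dumps(key)}]")
                for key, item in node.items()
            }
        return node  # scalars are copied as-is

    return walk(value, "$")


# Usage: an object that refers to itself becomes serializable.
record = {"name": "root"}
record["self"] = record
print(json.dumps(decycle(record)))
```

Because every container is recorded on its first visit, shared (not only circular) references are also replaced by a path. A matching decoder would walk the result and resolve each reference back to the object at that path.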