Cocoindex is an incremental data processing engine that builds and maintains live indexes for AI agents, with a core focus on codebase indexing and knowledge graph extraction. It uses a function-graph execution model: user-defined Python functions are composed into a directed acyclic graph, and only the source records or code paths that changed are recomputed, so an update does not trigger a full rebuild regardless of corpus size. The engine infers schemas automatically from the type annotations in the transformation pipeline and provides full data lineage tracing by tagging every output record with its source items and transformation version.
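
To make the incremental model concrete, the following is a minimal sketch of fingerprint-based recomputation with lineage tags. It is not Cocoindex's API: `IncrementalCache`, `OutputRecord`, and `update` are hypothetical names chosen for illustration, and the approach (hashing content together with a transformation version) is an assumption about how such an engine could decide what to recompute.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OutputRecord:
    value: str
    source_id: str          # lineage: which source item produced this output
    transform_version: str  # lineage: which transformation version ran


@dataclass
class IncrementalCache:
    """Hypothetical helper; illustrates the idea, not the real engine."""
    transform: Callable[[str], str]
    transform_version: str
    fingerprints: Dict[str, str] = field(default_factory=dict)
    outputs: Dict[str, OutputRecord] = field(default_factory=dict)

    def _fingerprint(self, content: str) -> str:
        # The fingerprint covers both content and transformation version,
        # so changing either one invalidates the cached output.
        payload = f"{self.transform_version}|{content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def update(self, sources: Dict[str, str]) -> List[str]:
        """Sync outputs with `sources`; return the ids that were recomputed."""
        recomputed: List[str] = []
        for source_id, content in sources.items():
            fp = self._fingerprint(content)
            if self.fingerprints.get(source_id) == fp:
                continue  # unchanged: keep the cached output
            self.outputs[source_id] = OutputRecord(
                value=self.transform(content),
                source_id=source_id,
                transform_version=self.transform_version,
            )
            self.fingerprints[source_id] = fp
            recomputed.append(source_id)
        # Drop outputs whose source item no longer exists.
        for stale in set(self.outputs) - set(sources):
            del self.outputs[stale]
            del self.fingerprints[stale]
        return recomputed


cache = IncrementalCache(transform=str.upper, transform_version="v1")
first = cache.update({"a.txt": "hello", "b.txt": "world"})
second = cache.update({"a.txt": "hello", "b.txt": "world!"})
```

On the second call only `b.txt` is reprocessed, and each stored record carries the source id and transformation version that produced it.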
The project distinguishes itself through declarative target-state reconciliation: users describe the desired end state of a data store in Python, and the engine computes the minimal set of mutations needed to reach it. Change tracking works at file granularity, with each source file mapped to its own processing component so it can be transformed independently and its deltas detected precisely. The engine natively handles typed multi-dimensional vectors for multimodal AI pipelines and supports elastic distributed indexing of petabyte-scale corpora without manual partitioning.
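
Target-state reconciliation can be sketched as a diff between the current and desired contents of a store. The example below is an illustration under simple assumptions (a key-value store held in a dict, and a hypothetical `reconcile` function); it does not reflect how Cocoindex represents targets or mutations internally.

```python
from typing import Dict, List, Tuple

Mutation = Tuple[str, str]  # (operation, key); hypothetical representation


def reconcile(current: Dict[str, str], desired: Dict[str, str]) -> List[Mutation]:
    """Return the mutations that turn `current` into `desired`.

    Keys that already hold the desired value produce no mutation,
    which is what keeps the plan minimal.
    """
    mutations: List[Mutation] = []
    for key in sorted(desired.keys() - current.keys()):
        mutations.append(("insert", key))
    for key in sorted(desired.keys() & current.keys()):
        if current[key] != desired[key]:
            mutations.append(("update", key))
    for key in sorted(current.keys() - desired.keys()):
        mutations.append(("delete", key))
    return mutations


current = {"doc1": "old text", "doc2": "same", "doc3": "gone soon"}
desired = {"doc1": "new text", "doc2": "same", "doc4": "brand new"}
plan = reconcile(current, desired)
```

Here `doc2` is untouched because it already matches the desired state; only the insert, update, and delete for the other three keys appear in `plan`.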
Cocoindex covers a broad range of use cases: building semantic text indexes, constructing knowledge graphs from documents, indexing codebases for AI agents with AST-aware parsing, and serving code context through MCP, a CLI, or Claude skills. It can ingest data from any custom source, transform structured and unstructured data together, and export indexed data to local files, cloud storage, or REST APIs. The platform also provides observability tools for tracing data lineage end-to-end and debugging pipeline steps in real time.
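
As a rough illustration of what AST-aware parsing buys over splitting on line counts, the sketch below chunks Python source at top-level function and class boundaries using the standard library `ast` module. The source does not say which parser Cocoindex uses, so this is a stand-in technique, and `chunk_python_source` is a hypothetical helper name.

```python
import ast
from typing import List, Tuple


def chunk_python_source(source: str) -> List[Tuple[str, str]]:
    """Split source into (name, code) chunks, one per top-level def or class."""
    tree = ast.parse(source)
    chunks: List[Tuple[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment recovers the exact text spanned by the node.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


sample = '''import os

def load(path):
    return open(path).read()

class Indexer:
    def run(self):
        return "ok"
'''
chunks = chunk_python_source(sample)
names = [name for name, _ in chunks]
```

Each chunk is a complete syntactic unit, so an index built from these chunks never cuts a function or class in half.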
The project is configured and extended through Python code, with documentation and installation resources available through its repository.