Lingbot-map is a feed-forward neural network designed for real-time 3D scene reconstruction from streaming video. It processes video frames one at a time without iterative optimization, producing dense geometry and camera poses at interactive frame rates directly from a live feed.
The project's distinguishing feature is stable geometry and pose alignment across very long video sequences: it handles thousands of frames without accumulating drift. It achieves this by combining three techniques: coordinate grounding memory, sliding-window inference with overlapping keyframes, and a paged KV cache attention mechanism that keeps transformer memory within limited GPU resources. The system also includes a headless rendering pipeline that produces MP4 flythrough videos of reconstructed point clouds, with camera motion controlled by configurable YAML presets for chase-cam, birdseye, static, and pivot shots.
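To make the idea of sliding-window inference with overlapping keyframes more concrete, here is a minimal sketch of how a frame sequence might be split into windows that share frames with their neighbours. It is an illustration only, not Lingbot-map's actual implementation: the function name `sliding_windows` and the parameters `window_size` and `overlap` are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(
    frames: Sequence[T], window_size: int, overlap: int
) -> Iterator[List[T]]:
    """Yield consecutive windows that share `overlap` frames with the previous one.

    Hypothetical helper for illustration; the names and parameters are
    assumptions, not part of the project's API.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    stride = window_size - overlap  # number of new frames introduced per window
    start = 0
    while start < len(frames):
        yield list(frames[start:start + window_size])
        if start + window_size >= len(frames):
            break  # this window already reached the end of the sequence
        start += stride


if __name__ == "__main__":
    for window in sliding_windows(list(range(10)), window_size=4, overlap=2):
        print(window)
```

The shared frames give each window a common reference with the previous one, which is one plausible way for consecutive windows to stay aligned in the same coordinate frame.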
Additional capabilities include keyframe caching, which reduces the memory footprint during long sequences, and ONNX-based sky point filtering, which improves visual quality in outdoor reconstructions. The project also provides tools for designing custom virtual camera paths and supports both live video feeds and pre-recorded image sequences.
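As a rough illustration of the sky point filtering step, the sketch below drops reconstructed points whose sky probability reaches a threshold. It assumes a segmentation model has already produced one sky probability per point; the ONNX inference itself is not shown, and the function name `filter_sky_points`, its parameters, and the 0.5 default threshold are all hypothetical rather than taken from the project.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def filter_sky_points(
    points: Sequence[Point],
    sky_prob: Sequence[float],
    threshold: float = 0.5,
) -> List[Point]:
    """Keep only points whose sky probability is below `threshold`.

    Hypothetical helper: `sky_prob[i]` is assumed to be the sky probability
    that a segmentation model assigned to the pixel behind `points[i]`.
    """
    if len(points) != len(sky_prob):
        raise ValueError("points and sky_prob must have the same length")
    return [p for p, s in zip(points, sky_prob) if s < threshold]


if __name__ == "__main__":
    pts = [(0.0, 0.0, 1.0), (0.0, 5.0, 100.0), (1.0, 0.0, 2.0)]
    probs = [0.1, 0.9, 0.5]
    print(filter_sky_points(pts, probs))
```

Removing these points before rendering avoids the far-away, noisy geometry that sky pixels tend to produce in outdoor scenes.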