VOGS-CP

Cheng Chen¹ Hao Huang² Saurabh Bagchi¹

¹Purdue University ²New York University Abu Dhabi

Figure: Comparison of shared representations. BEV and tri-plane methods transmit implicit planar features as messages, losing geometric detail and complicating alignment. We instead share explicit, interpretable 3D Gaussian primitives that preserve 3D structure and enable straightforward cross-agent fusion.

Overview

Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussians for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When the number of transmitted Gaussians is further reduced, our method still achieves a +1.9 improvement in mIoU while using only 34.6% of the communication volume, highlighting robust performance under limited communication budgets.
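
As a rough illustration of trading accuracy for bandwidth, the sketch below keeps only a fraction of an agent's Gaussians before transmission and reports the resulting message-size ratio. Ranking by opacity, the helper names (`select_for_transmission`, `message_bytes`), and the per-Gaussian payload layout are assumptions made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    opacity: float
    # Geometry and semantics are omitted; only the ranking key is needed here.


def floats_per_gaussian(num_classes: int) -> int:
    # Hypothetical payload: 3 mean + 3 scale + 4 rotation (quaternion)
    # + 1 opacity, plus one score per semantic class.
    return 3 + 3 + 4 + 1 + num_classes


def select_for_transmission(gaussians, keep_ratio):
    """Keep the top `keep_ratio` fraction by opacity (an assumed criterion)."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    k = int(len(gaussians) * keep_ratio)
    ranked = sorted(gaussians, key=lambda g: g.opacity, reverse=True)
    return ranked[:k]


def message_bytes(num_gaussians, num_classes, bytes_per_float=4):
    """Size of a message that carries `num_gaussians` primitives."""
    return num_gaussians * floats_per_gaussian(num_classes) * bytes_per_float


# Toy example with made-up opacities and a made-up class count.
candidates = [Gaussian(0.9), Gaussian(0.1), Gaussian(0.5), Gaussian(0.7)]
kept = select_for_transmission(candidates, keep_ratio=0.5)
ratio = message_bytes(len(kept), 12) / message_bytes(len(candidates), 12)
```

Because every primitive has the same payload size in this sketch, the volume ratio equals the fraction of Gaussians kept; a real system might also quantize or compress the per-Gaussian attributes.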

Framework

Figure: Overview of the proposed pipeline. An initial set of randomly initialized 3D Gaussians is refined by an Image-to-Gaussian module that attends to multi-scale image features, producing single-agent Gaussians. Gaussians from neighbor agents (top and bottom) are rigidly transformed into the ego frame (middle) and culled to the ego region of interest; the blue and yellow dashed boxes mark the Gaussians that lie within the ego region of interest and are packaged and transmitted to the ego. A cross-agent Gaussian fusion module aggregates these with the ego set. The fused Gaussians are then rendered to semantic occupancy via Gaussian-to-voxel splatting. For clarity, the figure shows the zero-shot variant.
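
The rigid alignment and culling step can be sketched in a few lines. Under a rigid transform with rotation R and translation t, a Gaussian's mean maps to R m + t and its covariance to R Σ Rᵀ. The data layout, the function names, the axis-aligned region of interest, and the numeric values below are illustrative assumptions, not the released implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple      # (x, y, z) in the owning agent's frame
    cov: tuple       # 3x3 covariance as nested tuples
    opacity: float
    label: int       # semantic class index


def mat_vec(R, v):
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))


def mat_mul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transpose(A):
    return tuple(tuple(A[j][i] for j in range(3)) for i in range(3))


def to_ego_frame(g, R, t):
    """Map the mean with R m + t and the covariance with R Sigma R^T."""
    rotated = mat_vec(R, g.mean)
    mean = tuple(rotated[i] + t[i] for i in range(3))
    cov = mat_mul(mat_mul(R, g.cov), transpose(R))
    return Gaussian(mean, cov, g.opacity, g.label)


def cull_to_roi(gaussians, roi_min, roi_max):
    """Keep Gaussians whose mean lies inside an axis-aligned ego box."""
    return [
        g for g in gaussians
        if all(roi_min[i] <= g.mean[i] <= roi_max[i] for i in range(3))
    ]


def yaw_rotation(theta):
    """Rotation about the z axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


# Toy example: a neighbor rotated 90 degrees and offset 10 m from the ego.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
elongated = ((4.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
neighbor = [
    Gaussian((1.0, 0.0, 0.0), elongated, 0.8, 3),
    Gaussian((100.0, 0.0, 0.0), identity, 0.6, 1),
]
R, t = yaw_rotation(math.pi / 2), (10.0, 0.0, 0.0)
in_ego = [to_ego_frame(g, R, t) for g in neighbor]
# Hypothetical ego region of interest in metres.
received = cull_to_roi(in_ego, (-40.0, -40.0, -3.0), (40.0, 40.0, 5.0))
```

In this toy case the first Gaussian lands near (10, 1, 0) with its long axis rotated from x to y, while the second falls outside the box and is never transmitted. Culling on the mean alone ignores a Gaussian's spatial extent; that simplification is also an assumption of this sketch.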

Visualization & Demo

Figure: Qualitative comparison of ego-only ground truth, collaborative ground truth, predicted Gaussians, predicted occupancy, and the zero-shot variant. Red boxes highlight occluded object structure captured by Gaussian primitives in the collaborative setting. The zero-shot variant can look plausible but often shows clustered redundancy and noise; black boxes mark cases where the neighborhood-based fusion suppresses redundancy and improves consistency. An opacity threshold is applied for display, so the predicted Gaussians are not exhaustive.
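
To make the redundancy issue concrete, here is a minimal geometric sketch of neighborhood-based duplicate suppression. It is a hand-written stand-in and not the paper's fusion module: the fixed `radius`, the same-label test, and the opacity-weighted merge are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple      # (x, y, z) in the ego frame
    opacity: float
    label: int       # semantic class index


def fuse(ego, received, radius=0.5):
    """Merge each received Gaussian into the nearest same-label Gaussian
    within `radius`; otherwise keep it as a new primitive."""
    fused = [Gaussian(g.mean, g.opacity, g.label) for g in ego]
    for r in received:
        best, best_dist = None, radius
        # Brute-force neighbor search; a spatial index would scale better.
        for f in fused:
            d = math.dist(f.mean, r.mean)
            if f.label == r.label and d <= best_dist:
                best, best_dist = f, d
        if best is None:
            fused.append(Gaussian(r.mean, r.opacity, r.label))
            continue
        total = best.opacity + r.opacity
        if total > 0.0:
            # Opacity-weighted average of the two means (assumed rule).
            best.mean = tuple(
                (best.opacity * a + r.opacity * b) / total
                for a, b in zip(best.mean, r.mean)
            )
        best.opacity = max(best.opacity, r.opacity)
    return fused


# Toy example with made-up values.
ego = [Gaussian((0.0, 0.0, 0.0), 0.6, 1)]
received = [
    Gaussian((0.2, 0.0, 0.0), 0.2, 1),   # near-duplicate of the ego Gaussian
    Gaussian((5.0, 0.0, 0.0), 0.9, 1),   # same class, but far away
    Gaussian((0.1, 0.0, 0.0), 0.5, 2),   # nearby, but a different class
]
fused = fuse(ego, received)
```

Here the near-duplicate is absorbed into the ego Gaussian, while the distant and differently labeled primitives are kept, so four inputs become three. Simply concatenating the sets, as a zero-shot combination might, would keep all four.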

We also include a demo video showing our method's semantic occupancy predictions over a continuous driving scene.

BibTeX

  @article{chen2025vision,
    title={Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction},
    author={Chen, Cheng and Huang, Hao and Bagchi, Saurabh},
    journal={arXiv preprint arXiv:2508.10936},
    year={2025}
  }

Related Work

Xiangbo Gao, Runsheng Xu, Jiachen Li, Ziran Wang, Zhiwen Fan, Zhengzhong Tu. STAMP: Scalable Task And Model-agnostic Collaborative Perception. ICLR, 2025.
Comment: Proposes lightweight adapter-reverter pairs to transform Bird’s Eye View (BEV) features between agent-specific domains and a shared protocol domain.

Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu. GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction. ECCV, 2024.
Comment: Proposes an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features.

Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, Alois Knoll. Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles. CVPR, 2024.
Comment: Proposes hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles.

Yifan Lu, Yue Hu, Yiqi Zhong, Dequan Wang, Yanfeng Wang, Siheng Chen. An Extensible Framework for Open Heterogeneous Collaborative Perception. ICLR, 2024.
Comment: Proposes HEterogeneous ALliance (HEAL), a novel extensible collaborative perception framework.

AAAI 2026 (Oral)