Overview of FLEG. FLEG adopts a large transformer with a DPT-based decoder and corresponding prediction heads to predict language-embedded Gaussians. To remove the reliance on 3D annotations, we propose a 3D-annotation-free training framework. To embed semantics into the 3D representation, we construct InstanceMV-14K to enrich semantic diversity, and we introduce an instance-guided contrastive learning scheme that effectively aligns 2D instances with the 3D representation. We further propose a geometry-semantic hierarchical sparsification strategy that avoids the cost of per-pixel predictions.
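The instance-guided contrastive learning described above can be sketched as an InfoNCE-style loss that pools per-instance 3D Gaussian features and pulls each pooled feature toward its matching 2D instance embedding. This is a minimal illustrative sketch, not the paper's implementation: the function name, the mean-pooling choice, and the temperature value are assumptions.

```python
import numpy as np

def instance_contrastive_loss(gauss_feats, inst_feats, inst_ids, temperature=0.07):
    """Hypothetical sketch of instance-guided contrastive alignment.

    gauss_feats: (N, D) features of predicted 3D Gaussians
    inst_feats:  (K, D) embeddings of the K 2D instances
    inst_ids:    (N,)   instance id in [0, K) assigned to each Gaussian
    """
    K = inst_feats.shape[0]
    # Pool Gaussian features per instance (mean pooling is an assumption).
    pooled = np.stack([gauss_feats[inst_ids == k].mean(axis=0) for k in range(K)])
    # L2-normalize both sides so the logits are cosine similarities.
    pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
    inst = inst_feats / np.linalg.norm(inst_feats, axis=1, keepdims=True)
    # Similarity logits; matching pairs sit on the diagonal.
    logits = pooled @ inst.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # InfoNCE: maximize log-probability of the correct (diagonal) pairing.
    return -np.mean(np.diag(log_prob))
```

When the pooled 3D features exactly match their 2D instance embeddings, the loss approaches zero; mismatched features are pushed apart through the softmax over all instances.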
Comparison with feed-forward methods:
Comparison with per-scene optimized methods:
@misc{tian2025flegfeedforwardlanguageembedded,
  title={FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views},
  author={Qijian Tian and Xin Tan and Jiayu Ying and Xuhong Wang and Yuan Xie and Lizhuang Ma},
  year={2025},
  eprint={2512.17541},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.17541},
}