FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation

arXiv 2025

¹Shanghai Jiao Tong University
²East China Normal University
³Shanghai Artificial Intelligence Laboratory

Abstract

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. Previous feed-forward language-embedded Gaussian reconstruction methods are restricted to a fixed number of input views and typically attach a language-aligned semantic embedding to each Gaussian, resulting in impractical input settings and semantic redundancy. In contrast, we introduce a geometry-semantic dual-branch distillation framework that enables flexible input from arbitrary multi-view images without camera parameters. We also propose a novel-view-based distillation strategy during training that mitigates overfitting to input views. In addition, we observe that semantic representations are significantly sparser than geometric ones, and per-Gaussian language embedding is unnecessary. To exploit this sparsity, we design a decoupled language embedding strategy that represents language information with a sparse set of semantic Gaussians, rather than attaching embeddings to every Gaussian. Compared with dense pixel-aligned per-Gaussian embedding schemes, our method uses only 5% of the language embeddings while maintaining comparable semantic fidelity, effectively reducing storage costs. Extensive experiments demonstrate that FLEG outperforms state-of-the-art feed-forward reconstruction and language-embedded Gaussian methods in both reconstruction quality and language-aligned semantic representation.

Method

Overview of FLEG. We propose a dual-branch distillation scheme that trains the network without 3D ground truth or semantic labels. We also propose a Decoupled Gaussian Language Embedding (DGLE) module that produces a compact semantic representation, significantly reducing semantic redundancy and storage overhead.

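To give a rough sense of the storage reduction, the arithmetic can be sketched as follows. Only the 5% ratio comes from the abstract. The Gaussian count, the embedding dimension, and the 4-byte float storage are hypothetical values chosen for illustration, and the helper `embedding_storage_bytes` is not part of FLEG.

```python
def embedding_storage_bytes(num_embeddings: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store `num_embeddings` vectors of length `dim`."""
    return num_embeddings * dim * bytes_per_value


# Hypothetical scene size and embedding width (not reported on this page).
num_gaussians = 1_000_000
embed_dim = 512

# Ratio stated in the abstract: 5% of the language embeddings are kept.
semantic_percent = 5

# Dense scheme: one language embedding attached to every Gaussian.
dense_bytes = embedding_storage_bytes(num_gaussians, embed_dim)

# Decoupled scheme: embeddings only on a sparse set of semantic Gaussians.
num_semantic = num_gaussians * semantic_percent // 100
sparse_bytes = embedding_storage_bytes(num_semantic, embed_dim)

print(f"dense : {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumed numbers the embedding storage shrinks by a factor of 20. The sketch counts language embeddings only and ignores the geometric attributes that the Gaussians carry in either scheme.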

Qualitative Results

Novel view synthesis:



Open-vocabulary query segmentation:

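For readers unfamiliar with open-vocabulary querying, the following is a minimal sketch of the generic pattern used with language-aligned embeddings: compare a text embedding against each stored embedding by cosine similarity and keep those above a threshold. It is not FLEG's actual pipeline, which this page does not describe. The function names, the toy 3-dimensional vectors, and the threshold are all hypothetical, and a real system would obtain embeddings from a vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_semantic_gaussians(semantic_embeddings, text_embedding, threshold=0.5):
    """Return indices of embeddings whose similarity to the query meets the threshold."""
    return [
        index
        for index, embedding in enumerate(semantic_embeddings)
        if cosine_similarity(embedding, text_embedding) >= threshold
    ]


# Toy stand-ins for language embeddings stored on semantic Gaussians (hypothetical).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
# Toy stand-in for the embedding of a text query (hypothetical).
toy_query = [1.0, 0.0, 0.0]

selected = query_semantic_gaussians(toy_embeddings, toy_query, threshold=0.8)
print(selected)
```

The first two toy embeddings point in nearly the same direction as the query and are selected, while the third is orthogonal to it and is rejected.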

BibTeX

@misc{tian2025flegfeedforwardlanguageembedded,
      title={FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views}, 
      author={Qijian Tian and Xin Tan and Jiayu Ying and Xuhong Wang and Yuan Xie and Lizhuang Ma},
      year={2025},
      eprint={2512.17541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.17541}, 
}