FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

arXiv 2025

1 Shanghai Jiao Tong University
2 East China Normal University
3 Shanghai Artificial Intelligence Laboratory

Abstract

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads, but they are restricted to fixed input views and limited by insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Because the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich the semantic embedding. We also propose an instance-guided contrastive learning scheme to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we propose a geometry-semantic hierarchical sparsification strategy. FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks.

Method

Overview of FLEG. FLEG adopts a large transformer with a DPT-based decoder and corresponding prediction heads to predict language-embedded Gaussians. We propose a 3D-annotation-free training framework that eliminates the reliance on 3D annotations. To embed semantics into the 3D representations, we construct InstanceMV-14K to enrich semantic diversity. We also introduce an instance-guided contrastive learning scheme to align 2D instances with the 3D representations. Finally, we propose a geometry-semantic hierarchical sparsification strategy to avoid the cost of per-pixel predictions.
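
The page does not spell out the instance-guided contrastive objective, so the following is only a minimal sketch of one common form it could take: a supervised, InfoNCE-style loss that pulls together semantic features sharing a 2D instance ID and pushes apart features from different instances. The function name `instance_contrastive_loss`, the `temperature` parameter, and the use of cosine similarity are all assumptions made for illustration, not the authors' formulation.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Hypothetical supervised contrastive loss over sampled features.

    features:     list of feature vectors (e.g. semantic embeddings
                  sampled at pixels of a rendered view).
    instance_ids: list of 2D instance labels, one per feature.
    Returns the mean loss over anchors that have at least one positive.
    """
    feats = [_normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Positives are the other samples with the same instance label.
        positives = [j for j in logits if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy usage: four 2D features forming two clusters.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = instance_contrastive_loss(feats, [0, 0, 1, 1])
loss_misaligned = instance_contrastive_loss(feats, [0, 1, 0, 1])
```

In this toy example the loss is small when the instance labels agree with the feature clusters and large when they do not, which is the behavior a 2D-to-3D alignment objective of this kind relies on.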

Qualitative Results

Comparison with feed-forward methods:

Comparison with per-scene optimized methods:

BibTeX

@misc{tian2025flegfeedforwardlanguageembedded,
      title={FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views}, 
      author={Qijian Tian and Xin Tan and Jiayu Ying and Xuhong Wang and Yuan Xie and Lizhuang Ma},
      year={2025},
      eprint={2512.17541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.17541}, 
}