FreeGaussian: Annotation-free Control of Articulated Objects
via 3D Gaussian Splats with Flow Derivatives
- Qizhi Chen*1,2
- Delin Qu*2,3
- Junli Liu2
- Yiwen Tang2
- Haoming Song2
- Dong Wang2
- Bin Zhao2
- Xuelong Li2
- Zhejiang University1
- Shanghai AI Laboratory2
- Fudan University3
- *Equal contribution
Abstract
Reconstructing controllable Gaussian splats for articulated objects from monocular video is especially challenging because the problem is inherently under-constrained. Existing methods address this by relying on dense masks and manually defined control signals, which limits their real-world applicability. In this paper, we propose an annotation-free method, FreeGaussian, which mathematically disentangles camera egomotion from articulated movements via flow derivatives. By establishing a connection between 2D optical flow and 3D dynamic Gaussian flow, our method optimizes continuous dynamic Gaussian motions from flow priors without any control signals. Furthermore, we introduce a 3D spherical vector control scheme that represents the state as a 3D Gaussian trajectory, eliminating the need for complex 1D control signal calculations and simplifying controllable Gaussian modeling. Extensive experiments on articulated objects demonstrate the state-of-the-art visual quality and precise, part-aware controllability of our method.
Pipeline
The overview of FreeGaussian. Given a video stream \(\{\mathbf{P}(t), \mathbf{I}(t)\}\), our method recovers controllable 3D Gaussians \(\mathbf{G}^{\ast}\) in two stages. First, we pre-train a deformable 3DGS and compute the dynamic Gaussian flow \(\mathbf{u}^\text{GS}\) from the optical and camera flow with Lemma 1. We then reproject the dynamic Gaussian flow maps, cluster the highlighted 3D Gaussians with the DBSCAN algorithm, and compute their trajectories. In the controllable Gaussian training stage, we optimize the Gaussians \(\mathbf{G}\) and the network \(\mathbf{\Theta}\) with the rasterization-based loss function in Sec. 3.4, which measures the discrepancy between rendered and input images, as well as between the dynamic Gaussian flows.
Dynamic Gaussian Flow Analysis
In interactive scenes, consider an instantaneous motion model in which the camera and the 3D Gaussians move with separate velocities between consecutive frames. The projected optical flow \(\mathbf{u}\) then decomposes into a camera flow \(\mathbf{u}^\text{Cam}\) and a dynamic Gaussian flow \(\mathbf{u}^\text{GS}\), as described in Lemma 1 and Corollary 1.
Lemma 1
Dynamic Gaussian flow \(\mathbf{u}^\text{GS}\) under instantaneous motion can be derived from the optical flow \(\mathbf{u}\) and the camera flow \(\mathbf{u}^\text{Cam}\) via the following transform: \[ \begin{aligned} \label{eq:gaussian_flow_analysis} & \mathbf{u} = \mathbf{u}^\text{Cam} + \mathbf{u}^\text{GS} + \mathbf{\Delta}, \\ & \mathbf{u}^\text{Cam} = \frac{\mathbf{A}\boldsymbol{v}}{Z} + \mathbf{B}\boldsymbol{\omega}, \quad \mathbf{u}^\text{GS} = \mathbf{A} \sum_{i=1}^{M} T_i \alpha_i \frac{\boldsymbol{v}^\text{GS}}{Z_i}, \quad \mathbf{\Delta} = \mathbf{A} \sum_{i=1}^{M} T_i \alpha_i \boldsymbol{v} \left(\frac{1}{Z_i} - \frac{1}{Z}\right), \\ & \mathbf{A} = \begin{bmatrix} -f_x & 0 & x - c_x \\ 0 & -f_y & y - c_y \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \frac{(x - c_x)(y - c_y)}{f_y} & - f_x - \frac{(x - c_x)^2}{f_x} & \frac{(y - c_y) f_x}{f_y} \\ f_y + \frac{(y - c_y)^2}{f_y} & -\frac{(x - c_x)(y - c_y)}{f_x} & -\frac{(x - c_x)f_y}{f_x} \end{bmatrix}, \end{aligned} \]
where \(f_x, f_y, c_x, c_y\) are the camera intrinsics, and \(M\) denotes the number of Gaussian projections, sorted by Gaussian depth \(Z_i\), that intersect the pixel \(\mathbf{m}\). The flow residual term \(\mathbf{\Delta}\) is preserved to guarantee accuracy, even though it approaches zero after refined optimization.
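To make Lemma 1 concrete, the following is a minimal NumPy sketch of the camera-flow term \(\mathbf{u}^\text{Cam} = \mathbf{A}\boldsymbol{v}/Z + \mathbf{B}\boldsymbol{\omega}\); the function name and argument layout are our own, not part of the released code.

```python
import numpy as np

def camera_flow(x, y, Z, v, omega, fx, fy, cx, cy):
    """Camera-induced flow u^Cam = A v / Z + B w at pixel (x, y) (Lemma 1).

    Z     : scene depth at the pixel,
    v     : camera linear velocity, shape (3,),
    omega : camera angular velocity, shape (3,).
    """
    A = np.array([[-fx, 0.0, x - cx],
                  [0.0, -fy, y - cy]])
    B = np.array([
        [(x - cx) * (y - cy) / fy, -fx - (x - cx) ** 2 / fx, (y - cy) * fx / fy],
        [fy + (y - cy) ** 2 / fy, -(x - cx) * (y - cy) / fx, -(x - cx) * fy / fx],
    ])
    return A @ v / Z + B @ omega
```

Subtracting this term (and the residual \(\mathbf{\Delta}\)) from the observed optical flow \(\mathbf{u}\) isolates the dynamic Gaussian flow \(\mathbf{u}^\text{GS}\).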
Corollary 1
The dynamic Gaussian flow \(\tilde{\mathbf{u}}^\text{GS}\) on the image plane can be accumulated from the 2D Gaussian mean displacements \(\boldsymbol{\mu}_{i,t} - \boldsymbol{\mu}_{i,0}\). \[ \begin{align} \mathbf{u} = \mathbf{u}^\text{Cam} + \tilde{\mathbf{u}}^\text{GS} + \mathbf{\Delta}, \quad \tilde{\mathbf{u}}^\text{GS} = \sum_{i=1}^{M} T_i \alpha_i (\boldsymbol{\mu}_{i,t} - \boldsymbol{\mu}_{i,0}). \label{eq:dynamic_gs_flow} \end{align} \]
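A minimal sketch of this accumulation at a single pixel, assuming the per-pixel opacities and sorted projected means are already available from the rasterizer:

```python
import numpy as np

def dynamic_gaussian_flow(mu_t, mu_0, alpha):
    """Alpha-composite 2D mean displacements into the per-pixel dynamic
    Gaussian flow of Corollary 1: sum_i T_i * alpha_i * (mu_{i,t} - mu_{i,0}).

    mu_t, mu_0 : (M, 2) projected Gaussian means at times t and 0,
                 sorted front-to-back by depth Z_i,
    alpha      : (M,) per-Gaussian opacities at this pixel.
    """
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), with T_1 = 1.
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha[:-1])])
    return ((T * alpha)[:, None] * (mu_t - mu_0)).sum(axis=0)
```

Because the weights \(T_i \alpha_i\) are the same ones used for color compositing, this flow can be rendered by the standard splatting pipeline at no extra cost.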
Dynamic Gaussian Clustering and Tracking
With the formulation in Corollary 1, we first pretrain a deformable 3DGS \(\mathbf{G}^{\prime}\) on a set of camera streams. The dynamic Gaussian flow \(\mathbf{u}^\text{GS}\) from Corollary 1 is then extracted frame by frame and binarized to obtain flow maps. By back-projecting the flow maps to identify dynamic 3D Gaussians, we highlight the Gaussians \(\mathcal{D} = \{g_i \mid i = 1, 2, \ldots, Q\}\) with sharp dynamics, as illustrated in the Pipeline. Next, we group the dynamic Gaussians into clusters \(\mathcal{C} = \{c_i \mid i = 1, 2, \ldots, K\}\) with the unsupervised DBSCAN clustering algorithm, where \(K\) is the number of interactive objects. The cluster centers evolve over time, generating continuous trajectories \(\boldsymbol{\varsigma}(t, k)\), where \(k\) indexes the object to which the trajectory belongs.
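A minimal sketch of the clustering and trajectory step, using scikit-learn's DBSCAN; the `eps` and `min_samples` values are illustrative placeholders, not the paper's hyperparameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_dynamic_gaussians(centers, eps=0.05, min_samples=10):
    """Group highlighted dynamic Gaussian centers (Q, 3) into clusters;
    DBSCAN labels noise as -1, so K is discovered rather than preset."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    return {k: np.where(labels == k)[0] for k in set(labels.tolist()) if k != -1}

def cluster_trajectories(centers_t, clusters):
    """Per-frame cluster centroids give the trajectories varsigma(t, k).

    centers_t : (T, N, 3) dynamic Gaussian centers over T frames.
    """
    return {k: centers_t[:, idx].mean(axis=1) for k, idx in clusters.items()}
```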
3D Spherical Vector Control
In the training stage, we represent the Gaussian dynamics state by the cluster trajectory displacement \(\mathbf{v}_c^i = \boldsymbol{\varsigma}(t, k) - \boldsymbol{\varsigma}(0, k)\), concatenated with the Gaussian centers \(\mathbf{X}_i\). We encode the coordinates with \(\mathbf{E}(\mathbf{v}_{c}^i, \mathbf{X}_i)\) and jointly train the model \(\Theta\) to recover the Gaussian dynamics \(\left \langle \Delta\mathbf{X}_i, \Delta\mathbf{\Sigma}_i \right \rangle\): \[ \begin{align} \boldsymbol{f}_{\Theta}\left(\mathbf{X}_i, \mathbf{E}(\boldsymbol{\varsigma}(t, k) - \boldsymbol{\varsigma}(0, k)) \right) \mapsto \left \langle \Delta\mathbf{X}_i, \Delta\mathbf{\Sigma}_i \right \rangle. \label{eq:training} \end{align} \] We then perform splatting rasterization on the Gaussians combined with the predicted dynamics. During the control stage, in contrast, we manually input an interactive 3D vector \(\mathbf{v}_c^\prime\) and retrieve the Gaussian dynamics from the network as \( \boldsymbol{f}_{\Theta}\left(\mathbf{X}_i, \mathbf{v}_c^\prime \right)\).
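The mapping \(\boldsymbol{f}_{\Theta}\) is a small conditional deformation network. Below is a minimal PyTorch sketch under our own assumptions about the architecture; the name `DeformNet`, the hidden width, and the 7-parameter covariance update (4 quaternion + 3 scale) are hypothetical choices, not specified by the text above:

```python
import torch
import torch.nn as nn

class DeformNet(nn.Module):
    """Sketch of f_Theta: maps a Gaussian center X_i and an encoded control
    vector to <delta X_i, delta Sigma_i>. Layer sizes are assumptions."""

    def __init__(self, enc_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 7),  # 3 for dX, 4 quat + 3 scale for dSigma
        )

    def forward(self, X, v_enc):
        # X: (N, 3) Gaussian centers; v_enc: (enc_dim,) encoded 3D vector,
        # i.e. E(varsigma(t, k) - varsigma(0, k)) during training, or the
        # user-specified E(v_c') at control time, broadcast to all Gaussians.
        h = torch.cat([X, v_enc.expand(X.shape[0], -1)], dim=-1)
        out = self.mlp(h)
        return out[:, :3], out[:, 3:]  # (delta X, delta Sigma parameters)
```

Swapping the training-time trajectory displacement for a user-supplied vector \(\mathbf{v}_c^\prime\) is what makes the same network serve both reconstruction and interactive control.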
More Demos
Citation
If you want to cite our work, please use:
@misc{chen2025freegaussianannotationfreecontrollable3d,
title={FreeGaussian: Annotation-free Control of Articulated Objects via 3D Gaussian Splats with Flow Derivatives},
author={Qizhi Chen and Delin Qu and Junli Liu and Yiwen Tang and Haoming Song and Dong Wang and Bin Zhao and Xuelong Li},
year={2025},
eprint={2410.22070},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.22070},
}
Acknowledgements
The website template was borrowed from Michaël Gharbi. Image sliders are based on dics. We adopt code from Nerfstudio. Thanks for making the code available!