# SGT: Semantic Generative Tuning for Unified Multimodal Models

[![Project Page](https://img.shields.io/badge/🌐_Project_Page-Visit-6366f1?style=for-the-badge)](https://song2yu.github.io/SGT/) [![Paper](https://img.shields.io/badge/📄_Paper-arXiv-8b5cf6?style=for-the-badge)](https://arxiv.org/pdf/2605.18714) [![Hugging Face](https://img.shields.io/badge/🤗_Hugging_Face-SAM--SGT_Dataset-FFD21E?style=for-the-badge)](https://huggingface.co/datasets/Two-hot/SAM-SGT)

---

## Overview

SGT (Semantic Generative Tuning) is the first systematic investigation into generative post-training for Unified Multimodal Models (UMMs). By leveraging image segmentation as a generative proxy, SGT bridges the gap between visual understanding and generation, enabling true synergy between the two capabilities within a single architecture.

If you find our project or paper useful, we would greatly appreciate it if you could star this repository or cite our work.

---

## Why SGT?

Existing UMMs optimize understanding and generation independently, which leads to misaligned representations and missed synergies. Previous pixel-level alignment methods over-emphasize texture and fail to provide structural semantic guidance.

SGT takes a different approach: use high-level segmentation as the generative training objective. This simple yet effective proxy:

  • ✅ Improves multimodal perception & understanding
  • ✅ Enhances generative spatial fidelity
  • ✅ Is architecture-agnostic: validated on both BAGEL (7B+7B) and OmniGen2 (3B+4B)
  • ✅ Scales monotonically with more segmentation data
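
To make the idea of segmentation as a generative target concrete, the sketch below renders an integer label map as a color image, which is the kind of output an image generator can be trained to produce. It is illustrative only: the palette constants, the `palette_color` and `render_label_map` helpers, and the toy label map are assumptions, not SGT's actual preprocessing.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]


def palette_color(label: int) -> Color:
    """Map a segment id to a deterministic RGB color (id 0 is background, black)."""
    if label == 0:
        return (0, 0, 0)
    # Arbitrary odd multipliers, wrapped to 0..255, so nearby ids get
    # visibly different colors. These constants are NOT from the paper.
    return ((label * 67) % 256, (label * 131) % 256, (label * 197) % 256)


def render_label_map(labels: List[List[int]]) -> List[List[Color]]:
    """Turn an H x W grid of segment ids into an H x W grid of RGB pixels."""
    return [[palette_color(value) for value in row] for row in labels]


# Toy 3 x 4 label map with background (0) and three segments (hypothetical data).
labels = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [3, 3, 2, 1],
]
target = render_label_map(labels)
print(target[0])
```

Under this assumption, a generative training pair would consist of the original image (plus a text instruction) as the condition and the rendered map as the image to generate.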

## Empirical Findings

We probe the effect of three proxy task levels (edge / depth / segmentation) on both understanding and generation capabilities of BAGEL and OmniGen2.

<p align="center">
<img src="assets/und6dim.png" alt="Understanding capability gains across proxy task levels" width="85%">
</p>
<p align="center">
<em>Understanding capability gains.</em>
</p>

<p align="center">
<img src="assets/gen6dim.png" alt="Generation capability gains across proxy task levels" width="85%">
</p>
<p align="center">
<em>Generation capability gains.</em>
</p>

Three consistent observations emerge:

1. High-level semantic tasks dominate. Segmentation consistently outperforms mid-level (depth) and low-level (edge) tasks across all understanding benchmarks. High-level supervision aligns with perception demands, while texture-focused tasks distract the model with irrelevant details.
2. Visual supervision enhances perception, not reasoning. Generative tuning strengthens vision-centric tasks (spatial reasoning, hallucination resistance) while math/chart reasoning remains unaffected: visual supervision improves representation quality but does not impart logical priors.
3. Spatial fidelity improves universally. Regardless of semantic granularity, all proxy tasks improve generative spatial fidelity, especially for position-aware prompts. Reconstructing visual structure forces accurate spatial layouts.

---
## Usage

```bash
git clone https://github.com/song2yu/SGT.git
cd SGT
```

---
## Download Datasets
The scripts below download a sampled subset of LLaVA-OneVision; you can also download the full dataset instead.
Set `OUTPUT_DIR` in `dowload_ov.py` to your desired location.

```bash
# download the LLaVA-OneVision subset
python dowload_ov.py
# download the SAM subset (--use-mirror is intended for users in China)
python download_sam.py --target-dir ./data/SAM-SGT --use-mirror
```

## BAGEL
### for BAGEL Installation

```bash
bash setup_bagel.sh
cd BAGEL && source activate_env.sh
bash shells/download_ckpt.sh
bash shells/download_bagel.sh
```

### for BAGEL Inference

```bash
# for vision2text
PYTHONPATH=. python scripts/infer_understanding.py
# for text2image
PYTHONPATH=. python scripts/infer_t2i_show.py
# for image2image
PYTHONPATH=. python scripts/infer_edit.py
```

### for BAGEL Training
Update the LLaVA-OneVision (llava-ov) and SAM dataset paths in `/efs/brucessyu/SGT/BAGEL/data/dataset_info.py`.

```bash
bash shells/train_sgt.sh
```

---
## OmniGen2
### for OmniGen2 Installation

```bash
bash setup_gen2.sh
cd OmniGen2 && source activate_env.sh
export HF_TOKEN="<your hf token>"
bash shells/download_ckpt.sh
bash shells/download_gen2.sh
bash shells/download_pretrained.sh # for training
```

### for OmniGen2 Inference

```bash
# for vision2text
PYTHONPATH=. python scripts/infer_und.py
# for text2image
PYTHONPATH=. python scripts/infer_text2image.py
# for image2image
PYTHONPATH=. python scripts/infer_edit.py
```

### for OmniGen2 Training
Update the LLaVA-OneVision (llava-ov) and SAM dataset paths before launching training.

```bash
export OMNIGEN2_SAM_ROOT=/your/datasets/sam-qa
export OMNIGEN2_QWEN_PROCESSOR_PATH=/your/path/Qwen2.5-VL-3B-Instruct
bash scripts/train/train_sgt.sh
```
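
Because the training script relies on these two environment variables, a quick check before launching can catch a typo or a missing directory early. The following is a minimal sketch, not part of the repository: the `missing_paths` helper and `REQUIRED_VARS` tuple are hypothetical names, and the only things taken from the commands above are the two variable names.

```python
import os
from typing import Dict, List

# Variable names taken from the export lines above.
REQUIRED_VARS = ("OMNIGEN2_SAM_ROOT", "OMNIGEN2_QWEN_PROCESSOR_PATH")


def missing_paths(env: Dict[str, str]) -> List[str]:
    """Return the required variables that are unset or point to a missing path."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or not os.path.exists(value):
            problems.append(name)
    return problems


# Report the state of the current shell environment without aborting.
problems = missing_paths(dict(os.environ))
print("unset or missing:", problems if problems else "none")
```

Run it from the same shell in which the variables were exported; an empty result means both paths exist on disk.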

---

## Training Data

SGT uses 190k segmentation samples from SAM alongside standard VQA SFT data.
Optimal batch ratio: 2:1 (Segmentation : VQA).
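
One way to read the 2:1 ratio is as a batch schedule: two segmentation batches for every VQA batch. The sketch below illustrates that reading only; the repository may instead mix samples inside each batch or use weighted sampling, and `interleave_2_to_1` is a hypothetical helper rather than a function from the codebase.

```python
from typing import Iterable, Iterator, Tuple


def interleave_2_to_1(
    seg_batches: Iterable[object], vqa_batches: Iterable[object]
) -> Iterator[Tuple[str, object]]:
    """Yield two segmentation batches, then one VQA batch, until a stream runs out."""
    seg_it, vqa_it = iter(seg_batches), iter(vqa_batches)
    while True:
        try:
            yield ("seg", next(seg_it))
            yield ("seg", next(seg_it))
            yield ("vqa", next(vqa_it))
        except StopIteration:
            # Either stream is exhausted: end the schedule.
            return


# Stand-in "batches": integers instead of real tensors.
schedule = [kind for kind, _ in interleave_2_to_1(range(4), range(2))]
print(schedule)
```

In a real training loop each stream would typically be an endlessly cycling data loader, so the schedule would stop on a step budget rather than on exhaustion.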

| Data Source | Samples |
|---|---|
| SGT: Segmentation (SAM) | 190k |
| General VQA | 180k |
| Doc / Chart / Screen | 103k |
| Math / Reasoning | 101k |
| Language | 72k |
| General OCR | 45k |
| Total | ~691k |
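
The counts above and the 2:1 batch ratio imply that segmentation data is sampled far more often than its share of the dataset would suggest. The arithmetic below works this out under two assumptions that the text does not confirm: "VQA" in the ratio covers every non-segmentation source, and segmentation and VQA batches hold the same number of samples.

```python
# Sample counts in thousands, copied from the table above.
samples_k = {
    "SGT segmentation (SAM)": 190,
    "General VQA": 180,
    "Doc / Chart / Screen": 103,
    "Math / Reasoning": 101,
    "Language": 72,
    "General OCR": 45,
}

total_k = sum(samples_k.values())            # all sources combined
seg_k = samples_k["SGT segmentation (SAM)"]
other_k = total_k - seg_k                    # everything treated as "VQA" (assumption)

seg_share_by_count = seg_k / total_k         # share of the raw dataset
seg_share_by_batch = 2 / (2 + 1)             # share of batches at a 2:1 ratio

# Passes over the segmentation set per single pass over the other data,
# assuming equal batch sizes for both streams.
seg_passes_per_other_pass = (2 * other_k) / seg_k

print(f"total: {total_k}k, non-segmentation: {other_k}k")
print(f"segmentation share by count: {seg_share_by_count:.1%}")
print(f"segmentation share by batch: {seg_share_by_batch:.1%}")
print(f"segmentation passes per pass over other data: {seg_passes_per_other_pass:.2f}")
```

Under these assumptions, segmentation makes up roughly 27% of the samples but two thirds of the batches, so each segmentation sample is revisited about five times for every pass over the remaining data.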

---

## Acknowledgements

We gratefully acknowledge the authors and contributors of the following open-source projects, whose codebases were used in this work:

  • ReCA: [https://github.com/HorizonWind2004/reconstruction-alignment](https://github.com/HorizonWind2004/reconstruction-alignment)
  • BAGEL: [https://github.com/ByteDance-Seed/Bagel](https://github.com/ByteDance-Seed/Bagel)
  • OmniGen2: [https://github.com/VectorSpaceLab/OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)

---

## Citation

```bibtex
@article{yu2026sgt,
  title     = {Semantic Generative Tuning for Unified Multimodal Models},
  author    = {Yu, Songsong and Chen, Yuxin and Shan, Ying and Li, Yanwei},
  journal   = {arXiv preprint arXiv:2605.18714},
  year      = {2026},
}
```

---

<div align="center">
<a href="https://song2yu.github.io/SGT/">
<img src="https://img.shields.io/badge/🌐_Visit_Project_Page-song2yu.github.io/SGT-6366f1?style=flat-square&labelColor=1e1b4b" alt="Project Page"/>
</a>
</div>
