# SGT: Semantic Generative Tuning for Unified Multimodal Models
[Project Page](https://song2yu.github.io/SGT/) [Paper](https://arxiv.org/pdf/2605.18714) [Dataset](https://huggingface.co/datasets/Two-hot/SAM-SGT)
---
## Overview
SGT (Semantic Generative Tuning) is the first systematic investigation into generative post-training for Unified Multimodal Models (UMMs). By leveraging image segmentation as a generative proxy, SGT bridges the gap between visual understanding and generation, enabling true synergy between the two capabilities within a single architecture.
If you find our project or paper useful, we would greatly appreciate it if you could star this repository or cite our work.
---
## Why SGT?
Existing UMMs optimize understanding and generation independently, which leads to misaligned representations and missed synergies. Previous pixel-level alignment methods over-emphasize texture and fail to provide structural semantic guidance.
SGT takes a different approach: use high-level segmentation as the generative training objective. This simple yet effective proxy:
- Improves multimodal perception and understanding
- Enhances generative spatial fidelity
- Is architecture-agnostic: validated on both BAGEL (7B+7B) and OmniGen2 (3B+4B)
- Scales monotonically with more segmentation data
## Empirical Findings
We probe the effect of three proxy task levels (edge / depth / segmentation) on both understanding and generation capabilities of BAGEL and OmniGen2.
<p align="center">
<img src="assets/und6dim.png" alt="Understanding capability gains across proxy task levels" width="85%">
</p>
<p align="center">
<em>Understanding capability gains.</em>
</p>
<p align="center">
<img src="assets/gen6dim.png" alt="Generation capability gains across proxy task levels" width="85%">
</p>
<p align="center">
<em>Generation capability gains.</em>
</p>
Three consistent observations emerge:
1. High-level semantic tasks dominate. Segmentation consistently outperforms mid-level (depth) and low-level (edge) tasks across all understanding benchmarks. High-level supervision aligns with perception demands, while texture-focused tasks distract the model with irrelevant details.
2. Visual supervision enhances perception, not reasoning. Generative tuning fortifies vision-centric tasks (spatial reasoning, hallucination resistance) while math/chart reasoning remains unaffected: visual supervision improves representation quality but does not impart logical priors.
3. Spatial fidelity improves universally. Regardless of semantic granularity, all proxy tasks improve generative spatial fidelity, especially for position-aware prompts. Reconstructing visual structure forces accurate spatial layouts.
---
## Usage
```bash
git clone https://github.com/song2yu/SGT.git
cd SGT
```
---
## Download Datasets
We sample a subset of LLaVA-OneVision here; you may also choose to download the full dataset.
Modify `OUTPUT_DIR` in `dowload_ov.py` to your desired location.
```bash
# download the LLaVA-OneVision subset
python dowload_ov.py
# download the SAM subset; --use-mirror is optional and intended for users in China
python download_sam.py --target-dir ./data/SAM-SGT --use-mirror
```
## BAGEL
### for BAGEL Installation
```bash
bash setup_bagel.sh
cd BAGEL && source activate_env.sh
bash shells/download_ckpt.sh
bash shells/download_bagel.sh
```
### for BAGEL Inference
```bash
# for vision2text
PYTHONPATH=. python scripts/infer_understanding.py
# for text2image
PYTHONPATH=. python scripts/infer_t2i_show.py
# for image2image
PYTHONPATH=. python scripts/infer_edit.py
```
### for BAGEL Training
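Set the LLaVA-OneVision and SAM dataset paths in `dataset_info.py` (`/efs/brucessyu/SGT/BAGEL/data/dataset_info.py` in the authors' environment; adjust the prefix to match your checkout).
```bash
bash shells/train_sgt.sh
```
---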
## OmniGen2
### for OmniGen2 Installation
```bash
bash setup_gen2.sh
cd OmniGen2 && source activate_env.sh
export HF_TOKEN="<your hf token>"
bash shells/download_ckpt.sh
bash shells/download_gen2.sh
bash shells/download_pretrained.sh # for training
```
### for OmniGen2 Inference
```bash
# for vision2text
PYTHONPATH=. python scripts/infer_und.py
# for text2image
PYTHONPATH=. python scripts/infer_text2image.py
# for image2image
PYTHONPATH=. python scripts/infer_edit.py
```
### for OmniGen2 Training
Modify the paths of llava-ov and sam.
```bash
export OMNIGEN2_SAM_ROOT=/your/datasets/sam-qa
export OMNIGEN2_QWEN_PROCESSOR_PATH=/your/path/Qwen2.5-VL-3B-Instruct
bash scripts/train/train_sgt.sh
```
---
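A small pre-flight check can catch an unset or mistyped path before the OmniGen2 training script starts. The sketch below is not part of the repository: `missing_paths` is a hypothetical helper, and it assumes that both `OMNIGEN2_SAM_ROOT` and `OMNIGEN2_QWEN_PROCESSOR_PATH` are expected to point at existing directories.

```python
import os
from pathlib import Path
from typing import List, Mapping

# The two variables exported in the OmniGen2 training instructions.
REQUIRED_VARS = ("OMNIGEN2_SAM_ROOT", "OMNIGEN2_QWEN_PROCESSOR_PATH")


def missing_paths(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return one human-readable problem per unset or non-existent path.

    Assumption: each variable should name an existing directory.
    """
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name}={value} is not an existing directory")
    return problems
```

Calling `missing_paths()` in the shell session that will launch training returns an empty list when both variables resolve to directories; any returned message points at the variable to fix.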
## Training Data
SGT uses 190k segmentation samples from SAM alongside standard VQA SFT data.
Optimal batch ratio: 2:1 (Segmentation : VQA).
| Data Source | Samples |
|---|---|
| SGT Segmentation (SAM) | 190k |
| General VQA | 180k |
| Doc / Chart / Screen | 103k |
| Math / Reasoning | 101k |
| Language | 72k |
| General OCR | 45k |
| Total | ~691k |
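
The 2:1 ratio can be pictured as a simple interleaving schedule. The sketch below is an illustration, not the repository's data loader: it assumes the ratio means two segmentation batches for every VQA batch (it could equally describe the sample mix inside each batch), and `interleave_batches` and `SAMPLES_K` are hypothetical names. The dictionary restates the table above and confirms the total.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

# Sample counts from the table above, in thousands.
SAMPLES_K = {
    "segmentation_sam": 190,
    "general_vqa": 180,
    "doc_chart_screen": 103,
    "math_reasoning": 101,
    "language": 72,
    "general_ocr": 45,
}
TOTAL_K = sum(SAMPLES_K.values())


def interleave_batches(
    seg: Iterable[T],
    vqa: Iterable[T],
    seg_per_cycle: int = 2,
    vqa_per_cycle: int = 1,
) -> Iterator[Tuple[str, T]]:
    """Yield (source, batch) pairs in repeating seg/seg/vqa cycles.

    Iteration stops at the first incomplete cycle, so the ratio
    holds exactly over everything that is yielded.
    """
    seg_it, vqa_it = iter(seg), iter(vqa)
    while True:
        seg_chunk = list(islice(seg_it, seg_per_cycle))
        vqa_chunk = list(islice(vqa_it, vqa_per_cycle))
        if len(seg_chunk) < seg_per_cycle or len(vqa_chunk) < vqa_per_cycle:
            return
        for batch in seg_chunk:
            yield ("seg", batch)
        for batch in vqa_chunk:
            yield ("vqa", batch)


# Six segmentation batches and three VQA batches give three full cycles.
schedule = [source for source, _ in interleave_batches(range(6), range(3))]
```

Under this reading, segmentation supplies two thirds of the batches while making up 190k of the 691k samples, so it is sampled more often than its share of the data alone would suggest.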
---
## Acknowledgements
We gratefully acknowledge the authors and contributors of the following open-source projects, whose codebases were used in this work:
- ReCA: [https://github.com/HorizonWind2004/reconstruction-alignment](https://github.com/HorizonWind2004/reconstruction-alignment)
- BAGEL: [https://github.com/ByteDance-Seed/Bagel](https://github.com/ByteDance-Seed/Bagel)
- OmniGen2: [https://github.com/VectorSpaceLab/OmniGen2](https://github.com/VectorSpaceLab/OmniGen2)
---
## Citation
```bibtex
@article{yu2026sgt,
  title = {Semantic Generative Tuning for Unified Multimodal Models},
  author = {Yu, Songsong and Chen, Yuxin and Shan, Ying and Li, Yanwei},
  journal = {arXiv preprint arXiv:2605.18714},
  year = {2026},
}
```
---
<div align="center">
<a href="https://song2yu.github.io/SGT/">
<img src="https://img.shields.io/badge/Visit_Project_Page-song2yu.github.io/SGT-6366f1?style=flat-square&labelColor=1e1b4b" alt="Project Page"/>
</a>
</div>