SGT: Semantic Generative Tuning for Unified Multimodal Models

Extracted abstract or opening context

[![Project Page](https://img.shields.io/badge/🌐_Project_Page-Visit-6366f1?style=for-the-badge)](https://song2yu.github.io/SGT/) [![Paper](https://img.shields.io/badge/📄_Paper-arXiv-8b5cf6?style=for-the-badge)](https://arxiv.org/pdf/2605.18714) [![Hugging Face](https://img.shields.io/badge/🤗_Hugging_Face-SAM--SGT_Dataset-FFD21E?style=for-the-badge)](https://huggingface.co/datasets/Two-hot/SAM-SGT)

**SGT (Semantic Generative Tuning)** is the first systematic investigation into generative post-training for Unified Multimodal Models (UMMs). By leveraging **image segmentation as a generative proxy**, SGT bridges the gap between visual understanding and generation, enabling true synergy between the two capabilities within a single architecture.

If you find our project or paper useful, we would greatly appreciate it if you could star this repository or cite our work.

Existing UMMs optimize understanding and generation independently, which leads to misaligned representations and missed synergies. Previous pixel-level alignment methods over-emphasize texture and fail to provide structural semantic guidance. SGT takes a different approach: use **high-level segmentation** as the generative training objective. This simple yet effective proxy:

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.