research note | experiment writeup candidate | score 18
SGT: Semantic Generative Tuning for Unified Multimodal Models
This repository hosts checkpoints fine-tuned with **Semantic Generative Tuning (SGT)** — a training paradigm that couples visual *understanding* and *generation* in Unified Multimodal Models (UMMs) by using **image segmentation as a generative proxy**.
Extracted abstract or opening context
---
license: apache-2.0
pipeline_tag: any-to-any
library_name: bagel-mot
tags:
- sgt
- semantic-generative-tuning
- unified-multimodal
- image-segmentation
- visual-understanding
- visual-generation
---
> Unified multimodal models typically optimize understanding and generation with *misaligned*
> objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities.
> SGT introduces segmentation, a **high-level semantic task**, as a unified generative objective
> that aligns the two branches, improves feature linear separability, and optimizes visual-textual
> attention allocation.
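
The summary above mentions improved feature linear separability. One common way to quantify that property is a linear probe: fit a linear classifier on frozen features and report held-out accuracy. The sketch below is a minimal NumPy version under that assumption. The source does not say how separability was measured, and the function name `linear_probe_accuracy`, its arguments, and the ridge term are all hypothetical.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularised least-squares linear classifier on frozen
    features and return accuracy on the held-out split.

    Hypothetical helper for illustration; not the SGT evaluation code.
    """
    def add_bias(x):
        # Append a constant column so the probe can learn an offset.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    xtr, xte = add_bias(np.asarray(train_x, dtype=float)), add_bias(np.asarray(test_x, dtype=float))
    targets = np.eye(num_classes)[np.asarray(train_y)]  # one-hot labels
    # Closed-form ridge regression: (X^T X + ridge * I) W = X^T Y
    gram = xtr.T @ xtr + ridge * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ targets)
    preds = (xte @ weights).argmax(axis=1)
    return float((preds == np.asarray(test_y)).mean())
```

Under this reading, a higher probe accuracy on features taken from a tuned checkpoint than on features from the base model would suggest the features became more linearly separable.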
SGT reformulates classical visual tasks as generative proxies and establishes a **hierarchical taxonomy** (low-/mid-/high-level). Extensive experiments show that **high-level semantic tasks (e.g. image segmentation) are the optimal proxy**, outperforming depth, edge, reconstruction and MAE/inpainting for synergizing understanding and generation.
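To make "segmentation as a generative proxy" concrete, one plausible reading is that a segmentation label map is rendered as an ordinary RGB image, which the generation branch is then trained to produce from the input image. The sketch below shows only that rendering step. It is an assumption for illustration: the source does not specify SGT's target encoding, and `mask_to_rgb_target` and the palette are hypothetical.

```python
import numpy as np

def mask_to_rgb_target(label_map, palette):
    """Render an integer label map of shape (H, W) as an RGB image (H, W, 3).

    Hypothetical helper: `palette` maps each class id to one RGB colour.
    The actual target encoding used by SGT is not described in the source.
    """
    label_map = np.asarray(label_map)
    palette = np.asarray(palette, dtype=np.uint8)
    if not np.issubdtype(label_map.dtype, np.integer):
        raise TypeError("label_map must contain integer class ids")
    if label_map.min() < 0 or label_map.max() >= len(palette):
        raise ValueError("label_map contains ids outside the palette")
    # Integer-array indexing looks up one colour per pixel.
    return palette[label_map]

# Example: three classes on a 2x2 image.
palette = [[0, 0, 0], [255, 0, 0], [0, 255, 0]]
target = mask_to_rgb_target([[0, 1], [1, 2]], palette)
```

Encoding the mask as an image lets the same pixel-level generative loss serve both ordinary image generation and the segmentation proxy, which is the kind of objective sharing the paragraph above describes.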
1. **High-level > low-level**: segmentation gives larger gains in visual understanding than depth / edge / pixel reconstruction.
2. **Perception, not reasoning**: visual supervision mainly strengthens perception (spatial, hallucination, vision-centric, general VQA), rather than abstract reasoning (e.g. math, chart).
3. **Architecture-agnostic**: the gains hold for both **BAGEL** and **OmniGen2**.
Promotion decision
What has to happen next
Attach run IDs, datasets, metrics, and reproduction commands.
Why this is not always a full paper yet
Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.