SGT: Semantic Generative Tuning for Unified Multimodal Models

Extracted abstract or opening context

---
license: apache-2.0
pipeline_tag: any-to-any
library_name: bagel-mot
tags:
- sgt
- semantic-generative-tuning
- unified-multimodal
- image-segmentation
- visual-understanding
- visual-generation
---

This repository hosts checkpoints fine-tuned with **Semantic Generative Tuning (SGT)**, a training paradigm that couples visual *understanding* and *generation* in Unified Multimodal Models (UMMs) by using **image segmentation as a generative proxy**.

> Unified multimodal models typically optimize understanding and generation with *misaligned*
> objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities.
> SGT introduces segmentation, a **high-level semantic task**, as a unified generative objective
> that aligns the two branches, improves feature linear separability, and optimizes visual-textual
> attention allocation.

SGT reformulates classical visual tasks as generative proxies and establishes a **hierarchical taxonomy** (low-/mid-/high-level). Extensive experiments show that **high-level semantic tasks (e.g. image segmentation) are the optimal proxy**, outperforming depth, edge, reconstruction and MAE/inpainting for synergizing understanding and generation.

1. **High-level > low-level**: segmentation gives larger gains in visual understanding than depth / edge / pixel reconstruction.
2. **Perception, not reasoning**: visual supervision mainly strengthens perception (spatial, hallucination, vision-centric, general VQA), rather than abstract reasoning (e.g. math, chart).
3. **Architecture-agnostic**: the gains hold for both **BAGEL** and **OmniGen2**.
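To make "segmentation as a generative proxy" concrete, the sketch below shows one plausible way to turn a per-pixel class map into an image a generative branch could be trained to produce, and to decode a generated image back into class labels. This is an illustration under stated assumptions, not the SGT pipeline: the card does not say how segmentation targets are rendered, so the color-coded encoding, the PASCAL VOC style palette, and the helper names `voc_palette`, `mask_to_rgb`, and `rgb_to_mask` are all hypothetical.

```python
import numpy as np


def voc_palette(num_classes: int) -> np.ndarray:
    """Build a (num_classes, 3) uint8 palette using the PASCAL VOC bit-shuffle scheme.

    Assumption: a fixed, well-separated palette is used to render masks.
    Colors are distinct for up to 256 classes; class 0 maps to black.
    """
    palette = np.zeros((num_classes, 3), dtype=np.uint8)
    for i in range(num_classes):
        c, r, g, b = i, 0, 0, 0
        for j in range(8):
            # Spread the low three bits of c across the high bits of r, g, b.
            r |= ((c >> 0) & 1) << (7 - j)
            g |= ((c >> 1) & 1) << (7 - j)
            b |= ((c >> 2) & 1) << (7 - j)
            c >>= 3
        palette[i] = (r, g, b)
    return palette


def mask_to_rgb(label_map: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Render an (H, W) integer label map as an (H, W, 3) uint8 image target."""
    return palette[label_map]


def rgb_to_mask(image: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Decode an (H, W, 3) image to labels by nearest palette color.

    Nearest-color decoding tolerates small pixel errors in a generated image.
    """
    diff = image[:, :, None, :].astype(np.int32) - palette[None, None, :, :].astype(np.int32)
    return np.argmin((diff ** 2).sum(axis=-1), axis=-1)


# Toy example: a 2x2 label map with three classes (0 = background).
labels = np.array([[0, 1], [2, 1]])
palette = voc_palette(3)
target = mask_to_rgb(labels, palette)    # dense pixel target for generation
recovered = rgb_to_mask(target, palette)  # labels decoded from the image
```

Encoding the mask as an ordinary image means the generation branch can be supervised with the same dense pixel objective it already uses, while the target itself carries semantic rather than purely photometric content. A real setup may encode masks differently, for example per instance or conditioned on a text prompt.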

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.