HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads
Diffusion Transformers (DiTs) have exhibited robust capabilities in image generation tasks. However, accurate text-guided image editing for multimodal DiTs (MM-DiTs) still poses a significant challenge. Unlike UNet-based structures, which can use self-/cross-attention maps for semantic editing, MM-DiTs inherently lack support for explicit and consistently incorporated text guidance, resulting in semantic misalignment between the edited results and the text. In this study, we reveal that different attention heads within MM-DiTs are sensitive to different image semantics and introduce HeadRouter, a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. Furthermore, we propose a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression. Experiments on multiple benchmarks demonstrate HeadRouter’s performance in terms of editing fidelity and image quality. The code is available at https://github.com/ICTMCG/HeadRouter.