Content-Style Decomposition in Visual Autoregressive Models

ICCV 2025, Honolulu, Hawai'i 🌺


Abstract

Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored explicit content-style decomposition, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance on par with diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representations with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an augmented Key-Value (K-V) memory that enhances content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
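
As a rough sketch of the second ingredient, the snippet below shows one possible reading of SVD-based rectification: the learnable style embedding is projected away from the dominant singular directions of the content embedding, so that content information is less likely to leak into the style representation. The function name svd_rectify_style, the tensor shapes, and the rule of removing the top-k directions are illustrative assumptions, not the paper's exact formulation.

      import torch

      def svd_rectify_style(style_emb: torch.Tensor,
                            content_emb: torch.Tensor,
                            k: int = 1) -> torch.Tensor:
          """Illustrative rectification (assumed form): project the style embedding
          away from the top-k singular directions of the content embedding.

          style_emb:   (n_style, d)   learnable style token embeddings
          content_emb: (n_content, d) learnable content token embeddings
          """
          # Rows of Vh span the dominant directions of the content embedding in R^d.
          _, _, Vh = torch.linalg.svd(content_emb, full_matrices=False)
          top_dirs = Vh[:k]                          # (k, d)

          # Remove the component of each style token lying in those directions.
          proj = style_emb @ top_dirs.T @ top_dirs   # (n_style, d)
          return style_emb - proj

      # Toy usage with random embeddings.
      style = torch.randn(4, 768)
      content = torch.randn(16, 768)
      print(svd_rectify_style(style, content, k=2).shape)  # torch.Size([4, 768])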

CSD-100 Benchmark Dataset

To standardize the evaluation of the content-style decomposition (CSD) task, we introduce CSD-100, a dataset of 100 images designed to capture diverse content and styles for comprehensive benchmarking.

CSD-100 Dataset

Method Overview

Given an input image $I^*$ containing a subject $y_c$ in style $y_s$, our objective is to disentangle its content and style into two distinct representations, enabling the generation of separate images: $I_c$, which accurately preserves the content of $I^*$, and $I_s$, which effectively captures its style. To achieve this, we explore Visual Autoregressive Models (VAR) as the generative backbone, leveraging their scale-wise generation process.
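
To make the scale-aware alternating optimization concrete, here is a minimal, self-contained PyTorch sketch: two learnable token embeddings are updated in alternation, each driven only by the losses of the scales it is assigned to. The scale split, the toy per-scale loss, and all shapes are assumptions chosen so the example runs on its own; in the actual method these updates would be driven by the VAR next-scale prediction loss rather than this toy objective.

      import torch

      # Illustrative scale split: which scales are treated as "content" vs. "style"
      # is an assumption made for this sketch, not the paper's actual assignment.
      STYLE_SCALES = {0, 1, 2}      # coarser scales
      CONTENT_SCALES = {3, 4, 5}    # finer scales

      def per_scale_losses(content_tok, style_tok, targets):
          # Stand-in for the VAR next-scale prediction loss: one scalar per scale.
          # A toy quadratic keeps this sketch runnable end to end.
          return [(content_tok.mean() + style_tok.mean() - t) ** 2 for t in targets]

      # Learnable content/style token embeddings (shapes are illustrative).
      content_tok = torch.randn(16, 768, requires_grad=True)
      style_tok = torch.randn(4, 768, requires_grad=True)
      opt_c = torch.optim.Adam([content_tok], lr=1e-2)
      opt_s = torch.optim.Adam([style_tok], lr=1e-2)
      targets = [torch.tensor(float(s)) for s in range(6)]  # dummy per-scale targets

      for step in range(200):
          # Alternate: even steps update the content tokens on their scales,
          # odd steps update the style tokens on theirs.
          on_content = (step % 2 == 0)
          scales = CONTENT_SCALES if on_content else STYLE_SCALES
          opt = opt_c if on_content else opt_s

          losses = per_scale_losses(content_tok, style_tok, targets)
          loss = sum(l for s, l in enumerate(losses) if s in scales)

          opt.zero_grad()
          loss.backward()
          opt.step()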

CSD-VAR Overview

📈 Qualitative Results on CSD-100

Citation

Feel free to contact Quang-Binh Nguyen at binhnq@qti.qualcomm.com with any questions. If you find this work useful, please consider citing:


      @InProceedings{Nguyen_2025_ICCV,
        author    = {Nguyen, Quang-Binh and Luu, Minh and Nguyen, Quang and Tran, Anh and Nguyen, Khoi},
        title     = {CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models},
        booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
        month     = {October},
        year      = {2025},
        pages     = {17013-17023}
      }