[review] Adding Conditional Control to Text-to-Image Diffusion Models

ICCV 2023에서 발표된 논문으로, text-to-image 모델인 ControlNet을 소개하고 있습니다.

https://github.com/lllyasviel/ControlNet

GitHub - lllyasviel/ControlNet: Let us control diffusion models!

Let us control diffusion models! Contribute to lllyasviel/ControlNet development by creating an account on GitHub.

github.com

1. Introduction

[그림 1] ControlNet을 이용한 Stable Diffusion conditioning

Text-to-image 모델의 발전으로 텍스트 프롬프트를 이용해 이미지를 생성하는 것이 가능해졌습니다. 하지만 텍스트만을 사용하여 레이아웃, 자세, 모양 등을 정밀하게 제어하는 것은 여전히 어려운 과제입니다.

사용자가 자신이 원하는 이미지 형태를 입력값에 함께 넣도록 하는 것은 어떨까요? 실제로 많은 image-to-image 모델들은 이미지를 conditioning 값 중 하나로 취급합니다. Text-to-image 모델에서도 마스킹, 이미지 수정, 모델 개인화 등의 작업에 이미지를 사용하기도 합니다. 몇 가지의 작업은 학습 없이 모델 구조를 조금만 수정해도 되지만, depth-to-image나 pose-to-image 같은 작업들은 직접 end-to-end 학습*을 해야 합니다.

* End-to-end 학습: 입력 데이터로부터 최종 결과를 예측하는 전체 과정을 중간 단계 없이 단일 모델이 직접 학습하는 방식

거대한 text-to-image 모델에 대한 end-to-end 학습은 상당히 어렵습니다. 특정 조건에 대한 데이터가 일반적인 text-to-image 학습 데이터에 비해 부족할 수도 있기 때문입니다. 적은 데이터로 학습을 한다면 과적합이나 망각현상을 유발할 수도 있습니다. 이런 문제를 위해서 더 정교한 구조의 인공신경망을 설계할 필요가 있습니다.

본 연구는 미리 학습된 거대한 text-to-image diffusion model에 대한 conditioning을 학습하는 ControlNet을 소개합니다. ControlNet은 기존 모델의 매개변수를 그대로 유지하며, encoding 계층을 복사하여 학습 가능하도록 따로 두어 사용합니다. 이 학습 가능한 복사본과 원본(매개변수가 잠겨 있는 모델)은 zero convolution 계층을 통해 하나로 묶일 수 있습니다. Zero convolution은 초기 가중치가 모두 0인 convolution으로, 학습을 거치며 서서히 가중치를 형성할 수 있어서 학습 초기에 불필요한 값이 반영되는 것을 방지할 수 있습니다.

본 연구는 ControlNetdl Stable Diffusion을 다양한 조건을 이용해 조절할 수 있음을 보여주었습니다. 이는 하나의 conditioning image를 통해(텍스트 프롬프트 있을 때와 없을 때 모두 포함) 진행되었습니다. 또한 데이터셋의 크기와 관계 없이 견고하고 조절 가능한 모델임을 확인하기도 했습니다.

2. Related Work

2.1. Finetuning Neural Networks

HyperNetwork
Adapter
Additive Learning
Low-Rank Adaption(LoRA)
Zero-Initialized Layers

2.2. Image Diffusion

Image Diffusion Models
Controlling Image Diffusion Models

2.3. Image-to-Image Translation

3. Method

3.1. ControlNet

ControlNet은 기본적으로 기존 신경망 구조에 condition을 추가로 삽입하는 방식입니다. 여기서 몇 개의 계층이 묶인 하나의 단위를 network block이라고 명명합니다(e.g. resnot block, conv-bn-relu block). 매개변수 $\Theta$로 작동하는 neural block $\mathcal{F}(\cdot ; \Theta)$는 feature map $x$를 feature map $y$로 변환합니다. 여기서 보통 $x$와 $y$는 2차원 이미지에 대한 feature map이기 때문에, $x, y \in \mathbb{R}^{h \times w \times c}$의 차원을 가집니다.

$y = \mathcal{F}(x; \Theta) \quad (1)$

[그림 2] ControlNet 구조. (a) 모델의 입력값과 출력값인 feature map x, y (b) 모델에 추가되는 conditioning c

[그림 2]와 같이 이미 학습된 neural block을 사용할 때는 매개변수 $\Theta$를 잠궈서 수정하지 않습니다. 대신, neural block 자체를 수정가능한 용도로 복사하여 사용하게 됩니다(이 때 매개변수를 $\Theta_c$로 둡니다). 이 복사본은 zero convolution 계층을 통해 conditioning vector $c$를 받아 연산을 수행합니다.

이후 복사본은 zero convolution 계층 $\mathcal{Z}(\cdot;\cdot)$을 통해 다시 원본 neural block과 합쳐집니다. $\mathcal{Z}(\cdot;\cdot)$는 가중치와 편향이 0인 $1 \times 1$ 합성곱 계층입니다. 이렇게 한 번의 conditioning에서 zero convolution은 두 번 시행됩니다. 각 계층의 매개변수들을 각각 $\Theta_{z1}$, $\Theta_{z2}$라고 할 때, ControlNet 전체 계산은 다음과 같이 이루어집니다.

$y_c = \mathcal{F}(x;\Theta) + \mathcal{Z}(\mathcal{F}(x + \mathcal{Z}(c;\Theta_{z1});\Theta_{c});\Theta_{z2}) \quad (2)$

$ \mathcal{Z}(c;\Theta_{z1}) $ : c의 zero convolution
$ x + \cdot $ : x와 c 결합
$ \mathcal{F}(\cdot; \Theta_{c}) $ : trainable copy block 계산
$ \mathcal{Z}(\cdot;\Theta_{z2}) $ : trainable copy 출력값의 zero convolution
$ \mathcal{F}(x;\Theta) $ : 원래 neural block 계산

첫 번째 학습 step에서 가중치와 편향은 모두 0이므로, ControlNet의 출력값은 입력값과 동일합니다. 때문에 학습 초기에 불필요한 잡음(noise)이 복사본 block에 영향을 주는 것을 방지할 수 있습니다.

$y_c = y \quad (3)$

3.2. ControlNet for Text-to-Image Diffusion

본 연구는 Stable Diffusion을 예시로 들어 ControlNet을 적용하는 방법에 대해 알려줍니다. 그리고 그 전에 Stable Diffusion의 신경망 구조가 어떻게 되어 있는지 설명합니다.

Stable Diffusion은 encoder, middle block, decoder로 이루어진 U-Net 구조를 가지고 있습니다. Encoder와 decoder는 각각 12개, middel block은 하나로 총 25개의 block들이 있으며, 그 중 8개 block은 upsampling 또는 downsampling convolution을 수행합니다. 나머지 17개 block은 각각 4개 resnet 계층과 2개 Vision Transformer(ViT) 계층으로 구성됩니다. 각 ViT는 다시 몇 개의 self-attention과 cross-attention으로 이루어집니다. 텍스트 프롬프트 encoding은 CLIP text encoder를, timestep은 positional encoding을 사용합니다.

[그림 3] Stable Diffusion UNet의 구조에 ControlNet을 적용한 모습

ControloNet는 모든 encoder에 적용됩니다. 따라서, 총 12개의 encoder block과 1개의 middle block 복사본이 생성됩니다. 복사본의 결과값은 decoder block들과 middle block에 더해집니다.

이렇게 하면 기존의 매개변수들은 전부 학습시키지 않고 그대로 두기 때문에 계산 속도와 GPU 메모리 사용량을 줄일 수 있습니다. 실제로 ControlNet 없이 최적화한 Stable Diffusion과 비교하였을 때 GPU는 23%, 계산시간은 34% 정도 밖에 사용하지 않았다고 합니다.

Stable Diffusion은 대부분의 계산을 계산량이 적은 latent space에서 수행합니다. 본 연구에서 사용한 Stable diffusion 모델은 512 * 512 크기의 이미지를 64 * 64 크기의 latent space 이미지로 변환하여 계산을 수행하였습니다. 따라서 ControlNet의 condition vector $c_i$도 같은 latent space상 벡터 $c_f$로 변환시켜 적용해야 합니다(4 * 4 kernel, 2 * 2 stride 합성곱 계층 4개 사용).

$\mathbf{c}_f = \mathcal{E}(\mathbf{c}_i) \quad (4)$

3.3. Training

Timestep $t$, 텍스트 프롬프트 $c_t$, 그리고 그에 따른 condition vector $c_f$가 주어졌을 때, diffusion model은 UNet을 통해 noise 예측값을 계산하게 됩니다. 따라서 ControlNet의 목적 함수는 다음과 같습니다.

$\mathcal{L} = \mathbb{E}_{\mathbf{z}_0, t, \mathbf{c}_t, \mathbf{c}_f, \boldsymbol{\epsilon} \sim \mathcal{N}(0, 1)} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}_t, \mathbf{c}_f) \right\|_2^2 \right] \quad (5)$

학습이 진행되는 동안 텍스트 프롬프트 $c_t$의 절반을 무작위로 선택하여 빈 문자열로 바꿉니다. 이는 ControlNet의 맥락을 파악하는 능력을 향상시킨다고 합니다. 또한, zero convolution을 사용했기 때문에, 모델은 10K step 이전에 빠르게 정확한 이미지 생성을 학습했다고 합니다. 이를 "sudden convergence phenomenon"이라고 부릅니다.

[그림 4] Sudden convergence penomenon 예시. 6133 step부터 제대로 동작하기 시작함

3.4. Inference

3.4.1. Classifier-free guidance(CFG) resolution weighting

Condition이 없을 때와 있을 때의 출력값 $\epsilon_{uc}$와 $\epsilon_{c}$, 그리고 가중치 $\beta_{cfg}$가 주어져 있을 때, CFG에 의한 최종 출력값 $\epsilon_{prd}$은 다음과 같이 계산됩니다.

$ \epsilon_{prd} = \epsilon_{uc} + \beta_{cfg}(\epsilon_{c} - \epsilon_{uc}) $

그런데 만약 ControlNet의 이미지 조건값을 $\epsilon_{uc}$와 $\epsilon_{c}$에 모두 적용하면 CFG의 효과가 사라집니다. 대신 $\epsilon_{c}$에만 넣으면 guidance를 잘 따르게 됩니다.

[그림 5] Classifier-free guidance와 CFG Resolution Weighting

3.4.2. Composing multiple ControlNets

여러 종류의 조건을 적용할 때는 추가적인 가중치나 보간법 없이 바로 Stable Diffusion 모델에 ControlNet 결과값을 집어 넣었습니다.

4. Experiments

4.1. Qualitative Results

[그림 7] 다양한 condition 종류에 따른 ControlNet 결과값 (프롬프트 미적용)

4.2. Ablative Study

아래 세 가지의 변경사항을 통해 ablative study를 진행했습니다.

(a) 기본 ControlNet
(b) Zero convolution을 Gaussian 분포값을 가지는 convolution으로 대체
(c) Trainable copy를 단일 계층의 convolution으로 대체 (= ContorlNet-lite)

프롬프트는 아래 네 가지 경우로 나누어 시도해보았습니다.

No prompt : 프롬프트 없음
Insufficient prompt : 부족한 묘사
Conflicting prompt : 이미지 condition과 모순되는 묘사
Perfect prompt : 정상적인 묘사

[그림 8] Ablation 버전에 따른 (좌) 모델 구조, (우) 프롬프트 유형에 따른 결과

4.3. Quantitative Evaluation

4.3.1. User study

20개의 스케치를 무작위로 선별한 후, 각 스케치 당 PITI, Sketch-Guided Diffusion(SGD)에 대해 $\beta = 1.6$과 $\beta = 3.2$일 때, ControlNEt-lite, 그리고 ControlNet을 수행하였습니다. 사용자들은 각 스케치에 대해 1 - 5위까지 우선순위를 부여했습니다.

[표 1] 평균 사용자 순위 (AUR).각 스케치 당 1 - 5위 순위 사이에서.

4.3.2. Comparison to industrial models

두 모델의 결과값을 보여주고 사용자에게 올바른 모델을 선택하도록 했을 때, 52% 정도의 정확도를 보였습니다. 이는 두 모델의 성능이 비슷함을 보여줍니다.

	Stable Diffusion V2 Depth-to-Image	ControlNet
GPU	NVIDIA A 100	NVIDIA RTX 3090Ti
training time	thousands of hours	5 days
training samples	more than 12M	200k

4.3.3. Condition reconstruction and FID score

ControlNet이 얼마나 주어진 조건에 맞게 segmentation을 구현했는지 평가하기 위해, SoTA segmentation 모델인 OneFormer를 사용하였습니다. ADE20K 데이터셋을 학습한 ControlNet을 통해 이미지를 생성한 후, 이 이미지에 OneFormer를 적용하여 IoU*값을 측정하였습니다.

* IoU(Intersection over Union) : 두 개의 영역(보통 정답과 예측 결과)의 겹치는 정도를 측정하는 지표

또한, FID 점수와 CLIP 점수를 사용하여 생성 성능을 평가하였습니다.

4.4. Comparison to Previous Methods

4.5. Discussion

4.5.1. Influence of training dataset sizes

4.5.2. Capability to interpret contents

4.5.3. Transferring to community models

5. Conclusion

ControlNet은 미리 학습된 거대한 text-to-image diffusion 모델에게 conditioning을 학습시키는 신경망 모델입니다. 이 모델은 미리 학습된 모델의 계층들은 그대로 둔 채, 더 깊고 강력한 encoder를 학습하여 특정 조건들을 잘 처리할 수 있도록 합니다. 다양한 실험을 통해 ControlNet은 Stable Diffusion의 조건 제어를 훌륭하게 수행할 수 있음을 보여주었습니다.

728x90

'Reveiw > Paper' 카테고리의 다른 글

[review] Yo'Chameleon:Personalized Vision and Language Generation (3)	2025.07.04
[review] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models (1)	2025.07.02
[review] Attention Is All You Need (0)	2025.06.29
[review] Visual Instruction Tuning (0)	2025.06.26
[review] MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (1)	2025.06.25

1. Introduction

2. Related Work

2.1. Finetuning Neural Networks

2.2. Image Diffusion

2.3. Image-to-Image Translation

3. Method

3.1. ControlNet

3.2. ControlNet for Text-to-Image Diffusion

3.3. Training

3.4. Inference

3.4.1. Classifier-free guidance(CFG) resolution weighting

3.4.2. Composing multiple ControlNets

4. Experiments

4.2. Ablative Study

4.3. Quantitative Evaluation

4.3.1. User study

4.3.2. Comparison to industrial models

4.3.3. Condition reconstruction and FID score

4.4. Comparison to Previous Methods

4.5. Discussion

4.5.1. Influence of training dataset sizes

4.5.2. Capability to interpret contents

4.5.3. Transferring to community models

5. Conclusion

'Reveiw > Paper' 카테고리의 다른 글

티스토리툴바