
[Paper Review] Prompt-to-Prompt Image Editing with Cross Attention Control

by SolaKim 2025. 2. 13.

https://arxiv.org/abs/2208.01626

 


 

 

 

Existing LLI (large-scale language-image) models tend to produce a completely different image even when the text prompt changes only slightly.

To work around this problem, LLI-based methods have required the user to supply an explicit mask,
and then edit only the masked region of the image.

However, masking is cumbersome and slow, and it gets in the way of intuitive text-driven editing.
Masking can also wipe out regions that carry important structural information.

masking

 

To overcome these limitations, the paper builds on pre-trained text-conditioned diffusion models and controls them through Prompt-to-Prompt editing.

To make this possible, it takes a close look at the cross-attention layers and explores their semantic strength as a handle for controlling the generated image.

 

The paper focuses on the model's internal cross-attention maps.

  • high-dimensional tensors that bind pixels to tokens extracted from the prompt text
  • These maps turn out to carry rich semantic relations.

 

🔑 Main Key

  • Builds on a text-to-image generation model (Imagen) so that an image can be edited using text alone.
  • Modifies the cross-attention maps of the cross-attention layers so that the spatial structure of the original image is preserved while the edit described by the text is applied.
  • Uses cross-attention map injection in the diffusion model to control how much the structure changes, so that only the position or texture of objects in the image is modified.

 

 

📌 Key techniques

  • Word Swap: replace a specific word in the text to change a specific object in the image.
    • e.g. "dog" → "cat"
    • The structure of the initial image is kept while the object is swapped for the desired one.
  • Adding a New Phrase: add new text to the prompt to change style or attributes.
    • e.g. "a castle next to a river" → "a children's drawing of a castle next to a river"
  • Attention Re-weighting: adjust the importance of a word to weaken or strengthen its effect on the image.
    • e.g. adjust how "snowy" a "snowy mountain" is to control the amount of snow.
  • Real Image Editing: invert a real image into the diffusion model, then edit it with text.

 

▶ All of this is done through a simple interface (editing the text prompt), which makes the approach far more intuitive and powerful than earlier mask-based methods.

 

 

Method

The image above is an overview of the method used in the paper. 

As the figure shows, the method uses cross-attention: the Query comes from the pixels (image), while the Key and Value come from the tokens (text).

  • Word Swap: the source image maps Mt are injected in place of the target image maps Mt*, which preserves the spatial layout.
  • Adding a New Phrase: only the maps Mt that correspond to the unchanged part of the prompt are injected; newly added tokens keep their own maps from Mt*.
  • Strengthening or weakening the effect of a word is implemented through Attention Re-weighting.

 

 

 

  • I : image generated by a text-guided diffusion model
  • I* : edited image
  • P : text prompt
  • P* : edited text prompt
  • s: random seed

 

If the attention maps are not fixed (Bottom), the result is a completely different image with a different structure and composition.

In the figure above, the attention weights are injected in the Top row and not injected in the Bottom row.

This shows that the structure and appearance of the generated image depend not only on the random seed but also on the interaction between the pixels and the text embedding during the diffusion process.

=> So modifying the pixel-to-text interaction that happens in the cross-attention layers is enough to implement Prompt-to-Prompt editing!

 

<Cross-Attention in text-conditioned Diffusion Models>

1. Basic structure of cross-attention

Cross-attention is the mechanism that learns how the text (prompt) connects to the spatial features of the image (pixel information).
It uses three main components: Query (Q), Key (K), and Value (V).

  • Query (Q): a vector representing the features of an image pixel
  • Key (K): a vector representing the features of a token extracted from the text prompt
  • Value (V): the information of the text prompt itself (i.e. the token representation)

 

2. The attention formula

Attention computes, as a weight (the attention map), how much each Query attends to each Key, i.e. how relevant they are to each other.

  • M = Softmax(QK^T / √d) is the attention map; it gives the weight with which each pixel attends to each text token.
  • d is the projection dimension of the queries and keys; dividing by √d keeps the computation numerically stable.

 

3. Cross-attention Output M⋅V

Applying the attention map M to the values V gives the cross-attention output ϕ^(zt) = M⋅V.
This value is what ultimately ties pixel locations in the image to text tokens.

  • M⋅V is a weighted average of the values V.
  • Mij indicates how strongly the i-th pixel is related to the j-th text token.
  • As a result, it becomes possible to control how much a given text token influences given pixels in the image.
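The computation described in sections 2 and 3 can be sketched in a few lines. This is a minimal pure-Python illustration, not the paper's code: the function names (`softmax`, `attention_map`, `cross_attention`) and the tiny hand-written Q, K, V matrices are assumptions made for the example, and the learned projection layers that produce Q, K, and V in a real model are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(Q, K):
    """M = Softmax(Q K^T / sqrt(d)); one row per pixel, one column per token."""
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_vec, k_vec)) / math.sqrt(d)
               for k_vec in K] for q_vec in Q]
    return [softmax(row) for row in scores]

def cross_attention(Q, K, V):
    """Return (M, M·V): the attention map and the weighted average of the values."""
    M = attention_map(Q, K)
    out = [[sum(M[i][j] * V[j][col] for j in range(len(V)))
            for col in range(len(V[0]))] for i in range(len(M))]
    return M, out

# Toy data (made up for illustration): 3 pixels, 2 tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # pixel queries
K = [[2.0, 0.0], [0.0, 2.0]]               # token keys
V = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]     # token values

M, out = cross_attention(Q, K, V)
```

Each row of `M` sums to 1, so every output row is a convex combination of the token values. In this toy data the first pixel attends mostly to the first token, and the third pixel splits its attention evenly between the two.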

 

For example, with the prompt "A red car on the street":

  • The attention map of the text token "red" concentrates on the pixels of the car body.
  • M⋅V then steers those pixels toward red, so the car body is rendered in red.

 

Let's pause here and clear up a point that is easy to confuse...

  • Self-attention: computes how related each element is to the other elements within a single input
    • e.g. computing correlations between the pixels of an image (spreading local information across the image)
  • Cross-attention: computes the relation between two different inputs
    • e.g. computing how related the text tokens and the image pixels are

 

 

🌱 Self-attention and cross-attention in diffusion models

Diffusion-based text-to-image models (e.g. Imagen, Stable Diffusion) use Transformer-like building blocks, but they work somewhat differently from a standard Transformer.

The encoder/decoder split is not clear-cut

  • Diffusion models use a U-Net, and the U-Net itself plays the encoder-decoder role.
  • Self-attention and cross-attention may sit in different layers, or be used together in the same layer (hybrid attention).

How self-attention and cross-attention flow through the network

  • Encoder-like (low-resolution stage): mostly self-attention
    • Extracts the spatial features of the input image and computes relations between pixels.
    • Cross-attention performs an initial text-image matching at this stage.
  • Bottleneck (middle stage): self-attention + cross-attention combined (hybrid)
    • Learns the relation between text and image in depth.
  • Decoder-like (restoring high resolution): stronger cross-attention
    • Refines fine details and applies style according to the text prompt.

 

 

<Controlling the Cross-attention>

  • The spatial layout and geometry of the generated image depend on the cross-attention maps, as figure 4 above shows.
  • The bottom row of figure 4 shows that the structure of the image is already determined in the early steps of the diffusion process.

 

The goal of the algorithm above is to regenerate the image from an edited text while keeping the structure of the original image as intact as possible.

  • Given an input image I and its original text prompt P, the edited text prompt P* is used to generate the edited image I*.
  • The cross-attention maps are modified so that the spatial structure of the original image is kept while the change described by the text is applied.
  • Cross-attention maps are injected during the diffusion process to preserve the composition of the original image.

Let's go through it step by step.

  • Initialization (Line 1~4)
    Noise zT is sampled using the random seed s. This noise is the initial input of the diffusion model.
    • zT∗ ← zT: the original image and the edited image are generated from the same noise.
  • Diffusion Reverse Process (Line 5~10)
    The diffusion process runs in reverse from T down to 1, handling the two prompts P and P* in parallel.
    • Line 6: compute Mt
      • The cross-attention map Mt of the current step t is computed from the original prompt P.
    • Line 7: compute Mt*
      • The cross-attention map Mt* is computed from the edited prompt P*.
    • Line 8: apply the Edit function
      • Edit(Mt,Mt∗,t) combines the two cross-attention maps.
      • e.g. use Mt only up to a certain step and Mt∗ afterwards, so that the change is introduced gradually.
    • Line 9: inject the edited map
      • The denoising step for P* is run with its attention map overridden by the output of Edit, and the noise is updated with it. This lets the edited image I* keep as much of the original image's structure as possible.
  • Final output (Line 11)
    • Returns two images: z0 (the original image) and z0∗ (the edited image).
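The steps above can be sketched as a loop. This is only a toy illustration under loud assumptions: `toy_dm_step` is a made-up stand-in for one denoising step (scalar "pixels" and scalar "token embeddings", no real diffusion model), and `prompt_to_prompt` and `keep_source_maps` are hypothetical names. Only the control flow follows the algorithm described here.

```python
import math
import random

def toy_dm_step(z, tokens, t, override_map=None):
    """Stand-in for one denoising step. NOT a real diffusion model.

    z: list of scalar "pixels"; tokens: list of scalar "token embeddings".
    t is accepted only to mirror the algorithm's signature; the toy ignores it.
    Returns (z_prev, M), where M is the cross-attention map that was used.
    """
    M = []
    for pixel in z:
        scores = [pixel * tok for tok in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        M.append([e / total for e in exps])
    if override_map is not None:
        M = override_map  # injection: use the supplied map instead
    z_prev = [0.9 * pixel + 0.1 * sum(w * tok for w, tok in zip(row, tokens))
              for pixel, row in zip(z, M)]
    return z_prev, M

def prompt_to_prompt(tokens_src, tokens_dst, seed, edit, T=10, n_pixels=4):
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]  # z_T from seed s
    z_star = list(z)                                     # z*_T <- z_T
    for t in range(T, 0, -1):
        z_prev, M = toy_dm_step(z, tokens_src, t)        # Line 6: M_t
        _, M_star = toy_dm_step(z_star, tokens_dst, t)   # Line 7: M*_t
        M_hat = edit(M, M_star, t)                       # Line 8: Edit
        z_star, _ = toy_dm_step(z_star, tokens_dst, t, override_map=M_hat)  # Line 9
        z = z_prev
    return z, z_star                                     # Line 11

def keep_source_maps(M, M_star, t):
    """Simplest possible Edit: always inject the source maps."""
    return M

src, edited = prompt_to_prompt([1.0, -1.0], [2.0, -1.0], seed=0, edit=keep_source_maps)
```

Passing an `edit` that always returns `M_star` turns the injection into a no-op, which gives the "no attention injection" baseline from the same noise.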

 

The two prompts are processed in parallel during diffusion, and the cross-attention is manipulated to obtain the desired edit. To get the same output again, however, the random seed must be fixed.
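As a small illustration of why the seed matters, the sketch below samples the initial noise from a seeded generator. `initial_noise` is a hypothetical helper using Python's standard `random` module, not the sampler of any particular diffusion library.

```python
import random

def initial_noise(seed, n=4):
    """Sample z_T as n standard-normal values from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same = initial_noise(42) == initial_noise(42)        # same seed, same noise
different = initial_noise(42) != initial_noise(43)   # different seed, different noise
```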

 

Now let's look at the Edit function in detail.

  1. Word Swap
    1. The composition of the original image has to be kept while the change in the new prompt is still handled well.
    2. To do this, the attention maps of the source image are injected into the generation process that runs with the edited prompt. 
      Reverse diffusion steps, with Edit(Mt, Mt∗, t) = Mt∗ if t < τ, and Mt otherwise (t counts down from T, so large t means early):
      •   t < τ (late steps):
        • Mt∗ (the cross-attention map of the edited prompt) is used.
        •   At this point the image is almost complete, so the details of the new word (texture, color, fine shape) are reflected.
        •   e.g. the details of "bicycle" are replaced by the details of "car".
      • t ≥ τ (early steps):
        • Mt (the original cross-attention map) is used.
        •   In the early steps, the overall spatial composition and coarse shapes are kept from the original image.
    3. Role of the alignment function
      •   An alignment function is used when the two words in the prompts are represented by different numbers of tokens.
      •   "car" → "sports car"
        •   "car" may be a single token, while "sports car" may be two tokens.
        •   In that case the cross-attention maps are duplicated or averaged so that both tokens correspond to the same part of the original image.
  2. Adding a New Phrase
    1. A new word (or phrase) is added to the prompt to restyle the image or add an attribute.

      •   Adding a new phrase requires aligning the tokens of the original prompt with those of the new prompt.
      •   The alignment function A(j) compares the tokens of the two prompts and singles out the newly added ones.
      •   A(j) takes token index j of the new prompt and returns the index of the corresponding token in the original prompt, or None if there is none.
      • (Edit(Mt, Mt∗, t))i,j = (Mt∗)i,j if A(j) = None, and (Mt)i,A(j) otherwise, where i is the pixel position and j is the text-token index.
      •   If A(j) is None, the token is newly added, so Mt∗ is used.
      •   If it corresponds to an existing token, Mt is kept.
    2. Example
      •   Original Prompt: "A castle next to a river"
          Edited Prompt: "A children’s drawing of a castle next to a river"
      •   "castle", "next", "river" → existing words, so Mt is kept
      •   "children’s drawing" → newly added expression, so it is taken from Mt∗
  3. Attention Re-weighting
    1. Adjusts the importance (influence) of a specific word in the text prompt.
    2. Attention Re-weighting scales the cross-attention map of a specific word to increase or decrease that word's influence: (Edit(Mt, Mt∗, t))i,j = c · (Mt)i,j if j = j*, and (Mt)i,j otherwise.
      •   j* : token index of the word chosen by the user
      •   c: scaling parameter ∈ [−2, 2] (c>1 strengthens, 0<c<1 weakens, c<0 strengthens in the opposite direction)
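The three Edit variants above can be written as small functions over attention maps stored as nested lists (one row per pixel, one column per token). This is a hedged sketch, not the authors' implementation: the function names, the `alignment` list representation of A(j), and the toy maps are all assumptions made for illustration, and the re-weighted map is not re-normalized.

```python
def edit_word_swap(M, M_star, t, tau):
    """Use the edited prompt's maps late (t < tau) and the source maps early."""
    return M_star if t < tau else M

def edit_add_phrase(M, M_star, alignment):
    """alignment[j] is the source-token index for target token j, or None if new."""
    return [[row_star[j] if a is None else row[a]
             for j, a in enumerate(alignment)]
            for row, row_star in zip(M, M_star)]

def edit_reweight(M, j_star, c):
    """Scale the column of token j_star by c; leave the other columns as they are."""
    return [[c * w if j == j_star else w for j, w in enumerate(row)] for row in M]

# Toy maps (made up): 2 pixels, 2 source tokens, 3 target tokens.
M = [[0.7, 0.3], [0.2, 0.8]]
M_star = [[0.1, 0.6, 0.3], [0.5, 0.4, 0.1]]
alignment = [None, 0, 1]  # target token 0 is new; tokens 1 and 2 map to source 0 and 1

refined = edit_add_phrase(M, M_star, alignment)
reweighted = edit_reweight(M, j_star=1, c=2.0)
```

In `refined`, the first column comes from `M_star` (the new token) and the other two columns are copied from `M`, so the layout of the existing words is carried over from the source image.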