
[Paper Review] ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars

by SolaKim 2025. 3. 11.

https://arxiv.org/abs/2403.15383

 


 

 

 

Problem Definition

  • Virtual reality (VR) and video games need large numbers of 3D models that are thematically consistent yet diverse.
  • A skilled artist can craft one or a few 3D models easily, but producing them in bulk is difficult and time-consuming.
  • Existing 3D generative models take limited input (text or an image), so the resulting 3D models tend to be ambiguous or inconsistent.

The paper notes that generating customized 3D assets while staying thematically consistent with the input 3D exemplars is a hard problem.

1️⃣ Preserving style & theme is hard

  • The input 3D models may each have a slightly different design.
  • For example, some models may look more angular while others have smooth curves.
  • When creating a new 3D asset, it is not easy to decide which style to keep.
  • Example:
    • Given several fantasy-style buildings (e.g., a castle, a tower, a house),
    • generating a new building in the same style is not easy.
    • If the generative model is too conservative, the output looks too much like the existing models,
    • and if it is too creative, the output drifts away from the original style.

2️⃣ Structural consistency

  • A 3D model is not just an image: it must have accurate geometry.
  • A new 3D asset should share a similar structure with the exemplars without being a plain copy.
  • For example, when generating a new vehicle from several vehicle models,
    • structural consistency in wheels, doors, windows, and body proportions must be kept,
    • yet the result must not be identical to the existing models.
  • Example:
    • Given a few 3D furniture models (e.g., a chair, a desk),
    • making a new chair in the same style
    • means accounting for the number of legs, the height, the cushion thickness, and so on.

3️⃣ Combining texture & detail is hard

  • Matching the 3D shape (geometry) alone is not enough:
    material and texture must be considered as well.
  • If the exemplar 3D models use different textures (e.g., wood, metal, plastic),
    • it is hard to decide which material the new 3D model should have.
  • In particular, when high-resolution detail is required,
    • reproducing the fine surface texture of the model is even harder.
  • Example:
    • If the input exemplars are all "wooden furniture", the new furniture should keep the wood texture,
    • but rather than copying the texture of an existing piece, the texture has to be applied naturally to the new design.

4️⃣ Lack of automated methods

  • 3D generative models so far (e.g., GANs, diffusion models) mostly focus on generating individual styles.
  • Techniques that look at several existing exemplars and generate a new, "consistent" 3D model are still immature.
  • Techniques such as 2D style transfer can be less effective in 3D.
  • Example:
    • In 2D, applying "a painter's style" to a new image is feasible,
    • but in 3D the structure of the model must also be considered, which makes the task much harder.

 

ThemeStation generates new 3D assets from a few input 3D exemplars, with two goals.

  • Unity
    • The generated 3D assets must stay thematically consistent with the given exemplars.
    • That is, a new 3D model should share the style, mood, and shape characteristics of the exemplars.
    • For example, if fantasy-style castle exemplars are given, the new 3D assets should keep the same fantasy feel.
  • Diversity
    • The generated 3D assets must not all look the same, so diversity should be maximized.
    • That is, the models should show multiple variations within the same theme.
    • For example, even when generating the same kind of fantasy castle, it should be possible to apply varied design elements such as a gothic style, a wizard style, or a medieval style.

 

 

What are exemplars?

  • The few 3D models (input examples) that ThemeStation refers to.
  • When new 3D assets are created from them, the exemplars set the reference for style and theme.

 

 

Compared with text prompts and images, 3D exemplars carry much richer information.

3D exemplars provide both a geometry source and an appearance source, which reduces the ambiguity of 3D modeling.

Previous work (Sin3DM, Sin3DGen) simply trains a generative model on a few exemplars, so the results show limited variation: they are mostly resized or randomly repeated versions of the input model.

To address this, ThemeStation uses a two-stage generative scheme.

It first draws concept art, then turns the concept art into 3D through progressive 3D modeling.

 

First Stage: Concept Art Generation

  • Fine-tune an existing image diffusion model to generate diverse concept images based on the input 3D models.
  • Unlike existing techniques (e.g., DreamBooth, LoRA), the model is trained not to reproduce a specific object but to generate theme-consistent images that contain new subjects.

 

Second Stage: 3D Asset Generation

  • Convert the generated concept image into a 3D model, using the input 3D exemplars as auxiliary information during modeling.
  • Unlike existing image-to-3D methods, the concept image serves only as a rough guide; additional geometry and multi-view appearance information from the input 3D models is used to produce a more refined 3D model.

 

Core technique: Reference-informed Dual Score Distillation (DSD)

Two diffusion models are used to improve the quality of the generation process.

  • Concept prior: guides the model to preserve the overall structure of the concept image.
  • Reference prior: guides the model to reflect fine details from the input 3D exemplars.
  • The two priors are applied at different noise levels to obtain the best 3D model.
    • High noise levels → apply the concept prior (guides the overall shape).
    • Low noise levels → apply the reference prior (preserves fine detail).

 

1️⃣ Subject-driven (existing approach)

  • Learns around a specific object → "can only generate variations close to that object".
  • Example:
    • "Look at this cat photo and make a cat!" 🐱 → you get almost the same cat, with only slight variations.
    • DreamBooth (Ruiz et al., 2023) is a representative technique.

2️⃣ Theme-driven (ThemeStation approach)

  • Instead of learning around a specific object, it generates diverse objects while keeping the theme (style, mood, etc.).
  • Example:
    • "Using this cat drawing as a reference, make a variety of new cat characters that keep the cat style!"
    • That is, it captures a "cat-like style" rather than one particular cat, and generates cats in many different forms.

 

A brief look at related work:

  • Diffusion priors for 3D generation
    • DreamFusion [Poole et al. 2023]
      • Proposes generating 3D content by distilling the score of the image distribution from a pretrained text-to-image (T2I) diffusion model.
      • Introduces Score Distillation Sampling (SDS), which improves text-to-3D performance.
  • Exemplar-based 3D generation
    • DreamBooth3D [Raj et al. 2023]
      • Fine-tunes a pretrained diffusion model on only a handful of images for subject-driven text-to-3D generation.
      • However, because the input images lack 3D information, the generated results lack consistency.

 

 

Workflow

Theme-Driven Concept Image Generation

 

  • Fine-tune a diffusion model on rendered images {x_r} of the input 3D exemplars.
  • A short fine-tuning run (few iterations) is enough for the model to learn the theme.
  • As a result, the model can generate new subjects while keeping thematic consistency.
  • To separate theme (semantics & style) from content (subject), a shared text prompt is used for all exemplars.
    • Example:
      • "a 3D model of an owl, in the style of [V]"
      • The "in the style of [V]" part of the prompt steers training toward preserving the theme.

 

Reference-Informed 3D Asset Modeling

  • This stage performs 3D modeling based on the concept image generated in the previous stage.
  • Inputs
    • The concept image x_c generated in stage 1
    • The input 3D exemplar models m_r
  • Goal
    • Generate a new 3D model guided by the concept image x_c,
    • while using the input 3D exemplars to fill in fine detail and maintain multi-view consistency.
  • Modeling process
    • Rough initial model creation
      • As in a typical 3D modeling workflow, start from a basic shape (a primitive model).
      • Apply an existing image-to-3D method to the concept image x_c to obtain the initial 3D model m_init.
    • Limitations of the initial model and the remedy
      • The initial model is likely to have an incomplete spatial structure or artifacts.
      • Therefore, the initial model does not need to match the concept image strictly.
      • Instead, the final 3D model m0 is developed with reference to both the concept image and the initial 3D model.
  • The idea behind the Dual Score Distillation (DSD) loss
    • Use two diffusion priors to improve the quality of the 3D model.
    • Concept prior (ϕ_c)
      • Guides the model to keep the basic concept.
      • Uses a diffusion prior derived from the concept image x_c to produce a 3D model that is faithful to the concept.
    • Reference prior (ϕ_r)
      • Preserves the fine detail of the input 3D exemplar models and improves multi-view consistency.
      • Uses a diffusion prior derived from the exemplar models m_r to help recover fine features.

 

Dual Score Distillation (DSD)

Preliminaries.

  • DreamFusion
    • A text-to-3D method that optimizes a 3D representation until it fits a given text prompt.
    • The 3D representation has parameters θ, and g(θ) produces a rendered image.
    • g(θ) is a NeRF (Neural Radiance Fields)-style renderer that turns the 3D model into 2D images from various camera angles.
    • In other words, the 3D model is optimized so that its renderings look like 2D images that match the text prompt y.
    • Score Distillation Sampling (SDS)
      • Given the text prompt y, a pretrained text-to-image (T2I) diffusion model φ predicts the noise ε_φ in the noised rendered 2D view x_t.
      • The SDS gradient has the standard DreamFusion form:

        $$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\phi, x) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]$$

      • ∂x/∂θ: how the rendered image x changes with the 3D parameters θ, i.e., the term that carries the update back to the 3D model.
      • θ is updated so that the predicted noise ε_φ moves toward the injected noise ε, shrinking the residual ε_φ − ε.
  • Variational Score Distillation (VSD)
    • An improvement over SDS: VSD treats the 3D representation as a random variable rather than a single data point, which increases generation diversity.
    • VSD adapts the diffusion model with LoRA (Low-Rank Adaptation).
    • VSD gradient:

      $$\nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\phi, x) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_{\mathrm{lora}}(x_t; y, t, c)\big)\,\frac{\partial x}{\partial \theta} \right]$$

      • ε_lora: the noise predicted by the LoRA-adapted T2I diffusion model.
      • c: the camera parameters.
      • With LoRA, the model can be adapted to a specific condition (e.g., a text prompt) at a lower compute cost than fine-tuning the full diffusion model.
      • In short, VSD replaces the injected noise ε with the learned estimate ε_lora, which corrects the update more precisely than SDS and yields 3D models that fit the text condition better.
  • Both SDS and VSD update the 3D model using information learned by a single diffusion model.
  • They do not work well, however, when there are conflicting priors coming from different diffusion models.
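To make the SDS update rule concrete, here is a minimal 1-D sketch. It is not the DreamFusion implementation: it assumes an identity "renderer" x = θ (so ∂x/∂θ = 1), a Gaussian target distribution whose optimal noise predictor is known in closed form, w(t) = 1, and a variance-preserving noise schedule. The names `optimal_eps`, `sds_step`, and `run_sds` are hypothetical.

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # toy "image" distribution N(MU, SIGMA^2); an assumption


def optimal_eps(x_t, t):
    """Closed-form noise prediction for Gaussian data under
    x_t = sqrt(1 - t) * x + sqrt(t) * eps (variance-preserving)."""
    alpha, sigma_t = math.sqrt(1.0 - t), math.sqrt(t)
    var_t = (alpha * SIGMA) ** 2 + sigma_t ** 2  # variance of x_t
    return sigma_t * (x_t - alpha * MU) / var_t


def sds_step(theta, rng, lr=0.05, batch=16):
    """One SDS update with w(t) = 1 and the identity renderer x = theta."""
    grad = 0.0
    for _ in range(batch):
        t = rng.uniform(0.02, 0.98)
        eps = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(1.0 - t) * theta + math.sqrt(t) * eps
        # (predicted noise - injected noise) * dx/dtheta, with dx/dtheta = 1
        grad += optimal_eps(x_t, t) - eps
    return theta - lr * grad / batch


def run_sds(theta=-1.0, steps=500, seed=0):
    """Run SDS from a poor initial value and return the optimized theta."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta = sds_step(theta, rng)
    return theta


theta_final = run_sds()
print(f"theta after SDS: {theta_final:.3f} (mode of the toy prior: {MU})")
```

In this toy setting the expected residual is proportional to θ − MU, so the update pulls θ toward the mode of the prior, which is the mode-seeking behavior usually attributed to SDS.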

 

Learning of concept prior.

  • The concept prior supplies the key information needed for 3D generation from the input concept image while keeping thematic consistency.
  • In other words, it is the process of learning a reference for which style and structure to keep when building the 3D model.
  • Problems
    • The initial 3D model (m_init) generated from the concept image is low quality.
    • Its texture (color, material) is blurry.
    • Its geometry is overly smooth and lacks fine detail.
  • Augmented views are used to address this
    • Images x_init rendered from various viewpoints of the initial 3D model are augmented to produce pseudo-multi-view images x̂_init.
    • That is, instead of relying on a single concept image, additional images seen from different angles are generated to compensate for the lack of 3D information.

      $$\hat{x}_{\mathrm{init}} = a(x_{\mathrm{init}})$$

  • Here a(·) is an image-to-image translation operation.
  • The augmented views complement the 3D structure intended by the concept image.

 

Learning of reference prior.

  • The reference prior is built from the input 3D exemplar models (reference models, m_r) and keeps the detail and consistency of the generated 3D model.
  • It has two main roles:
    • Preserving texture information (texture consistency)
    • Maintaining accurate geometry

Training process:

1️⃣ Input data (converting the reference models into 2D data)

  • Rendered color images (x_r):
    • Provide the texture information of the reference model from various viewpoints.
  • Rendered normal maps (n_r):
    • Contain the fine geometric detail of the 3D shape.
    • Even when the surface of a 3D model is actually flat, a normal map can make it look as if it has bumpy detail.
    • A normal map stores the normal vector at each pixel as an RGB color.
  • Rendering several images from random viewpoints → secures multi-view consistency.
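As a small illustration of how a normal vector can be stored as an RGB color, the sketch below uses the common convention of mapping each component from [-1, 1] to [0, 255]. This convention is an assumption: the post does not say which encoding or coordinate space ThemeStation uses when rendering normal maps, and `encode_normal` / `decode_normal` are hypothetical helper names.

```python
import math


def encode_normal(nx, ny, nz):
    """Map a surface normal to an 8-bit RGB triple ([-1, 1] -> [0, 255])."""
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("zero-length normal")
    unit = (nx / length, ny / length, nz / length)
    # Shift and scale each component, then round to the nearest integer.
    return tuple(int((c * 0.5 + 0.5) * 255.0 + 0.5) for c in unit)


def decode_normal(r, g, b):
    """Invert encode_normal, up to 8-bit quantization error."""
    vec = [c / 255.0 * 2.0 - 1.0 for c in (r, g, b)]
    length = math.sqrt(sum(c * c for c in vec))
    return tuple(c / length for c in vec)


rgb = encode_normal(0.0, 0.0, 1.0)  # a normal pointing straight along +z
print(rgb)
print(decode_normal(*rgb))
```

Under this convention a normal pointing along +z encodes to a light blue pixel, which is why tangent-space normal maps typically look bluish.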

2️⃣ Joint training on color images and normal maps (joint usage of two types of rendering)

  • Color images → keep texture-related 3D consistency.
  • Normal maps → learn fine geometry details.
  • Training on both together builds a more accurate 3D reference prior.

3️⃣ Separating the image prior and the normal prior (disentangling image prior & normal prior)

  • Color images and normal maps are trained with different text prompts (y_x, y_n).
  • Example:
    • Color image: "a 3D model of an owl, in the style of [V]"
    • Normal map: "a 3D model of an owl, in the style of [V], normal map"
  • This keeps the two roles separate so that each type of rendering is learned effectively.

4️⃣ Training the diffusion model (ϕ)

  • An existing T2I diffusion model is fine-tuned on the input data ({x_r}, y_x, {n_r}, y_n) to build the reference prior (ϕ_r).
  • In DSD, color images and normal maps are learned under separate prompts, but in the end a single diffusion model is fine-tuned to build the reference prior (ϕ_r).
  • That is, the diffusion model keeps the same network architecture while being steered to learn different information (color vs. shape) depending on the type of input (color image vs. normal map).
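The pairing of renderings with prompts can be sketched as follows. This only illustrates how one training set for a single model could mix color renders (prompt y_x) and normal-map renders (prompt y_n); the file names, the `TrainingPair` type, and `build_reference_set` are hypothetical and not taken from the ThemeStation code.

```python
from dataclasses import dataclass
from typing import List

THEME_TOKEN = "[V]"  # placeholder identifier, written as in the prompts above


@dataclass(frozen=True)
class TrainingPair:
    render_path: str  # hypothetical file name of a rendered view
    prompt: str       # text prompt paired with that rendering


def build_reference_set(subject: str, num_views: int) -> List[TrainingPair]:
    """Pair each color render with y_x and each normal map with y_n."""
    y_x = f"a 3D model of {subject}, in the style of {THEME_TOKEN}"
    y_n = f"{y_x}, normal map"
    pairs: List[TrainingPair] = []
    for view in range(num_views):
        pairs.append(TrainingPair(f"view_{view:03d}_color.png", y_x))
        pairs.append(TrainingPair(f"view_{view:03d}_normal.png", y_n))
    return pairs


pairs = build_reference_set("an owl", num_views=2)
for pair in pairs:
    print(pair.render_path, "->", pair.prompt)
```

Because both kinds of pairs end up in one list, a single fine-tuning run sees both modalities, and the ", normal map" suffix is what lets the one model tell them apart.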

 

=> Why convert the 3D models to 2D?

  • 2D diffusion models already contain rich 2D and 3D visual priors.
  • Converting the 3D data to 2D makes it possible to exploit the strong learning capacity of diffusion models.
  • Supplying color images and normal maps from multiple views keeps the 3D information implicitly.

 

How Does Dual Score Distillation Work?

  • The problem with plain Score Distillation Sampling (SDS)
    • The naive approach runs score distillation separately with each of the two diffusion models (the concept prior ϕ_c and the reference prior ϕ_r) and then sums the results.
    • However, simply summing the two priors causes loss conflicts during optimization → the generated 3D model is distorted.
  • How the DSD (Dual Score Distillation) loss resolves this
    • Split the roles according to the noise level (denoising timestep) of the reverse diffusion process.
    • This exploits the fact that low-frequency and high-frequency information are learned at different stages.

  • High noise level (early stage, t_h): controls the coarse global layout and the color distribution.
    • The concept prior ϕ_c determines the overall shape and color, so it is applied at the early, high-noise timesteps (t_h).

  • Low noise level (late stage, t_l): generates high-frequency details.
    • The reference prior ϕ_r preserves fine texture and shape, so it is applied at the late, low-noise timesteps (t_l).
  • Final DSD loss
    Using the DSD loss in this way gives the following:
  • The loss conflict is resolved => the two priors are applied at different noise levels.
  • The concept is kept while fine detail is preserved => the concept is learned at the early stage and the fine detail at the late stage, producing the best 3D model.
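The timestep split can be illustrated with a toy 1-D sketch. It is not the paper's loss: each prior is replaced by a Gaussian with a closed-form noise predictor, the residual is the simpler SDS-style one (predicted noise minus injected noise), and the timestep ranges `high` / `low` and the weights `alpha` / `beta` are made-up values. All function names are hypothetical.

```python
import math
import random


def gaussian_eps(mu, sigma):
    """Return the closed-form noise predictor of a 1-D Gaussian 'prior'."""
    def eps(x_t, t):
        alpha, sigma_t = math.sqrt(1.0 - t), math.sqrt(t)
        var_t = (alpha * sigma) ** 2 + sigma_t ** 2
        return sigma_t * (x_t - alpha * mu) / var_t
    return eps


def residual(eps_model, x, t, rng):
    """SDS-style residual (predicted noise - injected noise) at noise level t."""
    noise = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(1.0 - t) * x + math.sqrt(t) * noise
    return eps_model(x_t, t) - noise


def dsd_gradient(x, concept_eps, reference_eps, rng,
                 high=(0.75, 0.98), low=(0.02, 0.75), alpha=1.0, beta=1.0):
    """Concept prior at a high noise level t_h, reference prior at a low t_l."""
    t_h = rng.uniform(*high)
    t_l = rng.uniform(*low)
    return (alpha * residual(concept_eps, x, t_h, rng)
            + beta * residual(reference_eps, x, t_l, rng))


def optimize(x=-1.0, steps=500, lr=0.05, batch=16, seed=0):
    """Gradient descent on a scalar 'asset' x using the combined gradient."""
    rng = random.Random(seed)
    concept = gaussian_eps(mu=1.0, sigma=0.5)    # stands in for phi_c
    reference = gaussian_eps(mu=1.2, sigma=0.1)  # stands in for phi_r
    for _ in range(steps):
        grad = sum(dsd_gradient(x, concept, reference, rng) for _ in range(batch))
        x -= lr * grad / batch
    return x


x_final = optimize()
print(f"x after DSD-style optimization: {x_final:.3f}")
```

In this toy the broad concept prior only acts at high noise and the sharp reference prior only at low noise, so the optimized value settles between the two modes instead of being driven by both priors at every timestep.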