
[Paper Review] SyncDreamer: Generating Multiview-Consistent Images From a Single-View Image

by SolaKim 2025. 3. 18.

https://arxiv.org/abs/2309.03453

 


 

 

 

Abstract

This paper introduces SyncDreamer, a new diffusion model that generates multiview-consistent images from a single-view image.

🔍 Problems with prior work

  • Prior work such as Zero123 succeeds in generating novel views from a single image with a 2D diffusion model, but
  • it struggles to keep geometry and color consistent, and
  • the resulting mismatches between generated images make them hard to use for 3D reconstruction.

🎀 SyncDreamer's solution

  • Proposes a synchronized multiview diffusion model.
  • Uses a 3D-aware feature attention mechanism that shares common features across views.
  • Synchronizes the intermediate states of all views during the reverse diffusion process to maintain multiview consistency.

SyncDreamer maintains higher multiview consistency and is effective for 3D reconstruction and image-based 3D generation.

It enables more accurate 3D reconstruction than prior methods and can also be applied to text-to-3D and novel-view synthesis.

 

 

Introduction

Prior work and limitations

  • Deep learning and neural representations (including NeRF) have greatly advanced the extraction of 3D information, but
    ❌ reconstructing 3D from a single image while keeping multiview consistency is still hard.
  • 2D diffusion models have achieved strong results in image generation.
  • Training a 3D diffusion model directly, however, is difficult because 3D data is scarce.

Existing approach: text-to-3D models

  • Existing approach: distill a text-to-image diffusion model into 3D (text-to-3D distillation).
  • Problems:
    ❌ The optimization is complex and slow.
    ❌ A text embedding alone cannot describe a 3D shape precisely.
    ❌ Fine-grained category, appearance, and pose information about the object is lost → lower quality.

Problems with prior work

  • Prior work uses 2D diffusion models to generate multiview images from a single image and then performs 3D reconstruction.
  • However, it is hard to keep the multiview images of the same object consistent with each other.
  • To address this, earlier methods add conditioning inputs in the following ways:
    • Conditioning on the input image (Zhou & Tulsiani, 2023; Tseng et al., 2023)
    • Conditioning on previously generated images (Tewari et al., 2023; Chan et al., 2023)
    • Conditioning on neural field renderings (Gu et al., 2023b)
  • These methods only perform well on specific datasets (e.g. ShapeNet, Co3D) and do not generalize well to arbitrary objects.

 

🧸 SyncDreamer: a new solution

✔ Core idea

  • An extension of the diffusion model that models the joint probability distribution of the multiview images.
  • Introduces a multiview-synchronized diffusion model so that all generated images are trained to stay in sync.

✔ Key features of SyncDreamer

① N noise predictors share information and generate the multiview images simultaneously

  • N images are generated at once in a single reverse diffusion process.
  • The noise predictors exchange information through attention layers, which keeps the views consistent.

② Strong generalization by building on the pretrained Zero123 model

  • Zero123 is itself fine-tuned from Stable Diffusion, so the model can handle a wider variety of data.
  • It works not only on real photos but also on 2D images in many styles, such as hand drawings, cartoons, and watercolor paintings.

③ 3D reconstruction becomes easier

  • Because the generated images are geometrically consistent, a vanilla NeRF or NeuS can be applied as-is.
  • No special loss function (SDS loss) is needed, and the 3D quality can be anticipated more intuitively.

④ Creativity and diversity are preserved

  • Existing distillation-based methods tend to converge to a single 3D shape, whereas
  • SyncDreamer can generate different 3D objects from the same input image.

Experimental results

  • SyncDreamer is compared with prior methods on the Google Scanned Object (GSO) dataset.
  • It shows higher multiview consistency and better 3D reconstruction than prior methods.
  • It also performs well on 2D input images in various styles (sketches, cartoons, oil paintings, and so on).

 

 


Related Works

 

Diffusion Models 

  • Success of diffusion models
    • Ho et al. (2020), Rombach et al. (2022), Croitoru et al. (2023) and others show that diffusion models achieve excellent results in 2D image generation.
  • Multiview diffusion models
    • MVDiffusion (Tang et al., 2023b) → generates textures and panoramas using fixed, known geometry.
    • SyncDreamer (this paper) → in contrast, generates multiview images with unknown geometry.
    • MultiDiffusion (Bar-Tal et al., 2023), SyncDiffusion (Lee et al., 2023) → combine diffusion over multiple regions of a 2D image.
  • Applying diffusion models to 3D generation
    • Nichol et al. (2022), Jun & Nichol (2023), Müller et al. (2023), Zhang et al. (2023a), and others
      → apply diffusion models directly to 3D generative modeling.
    • These works have the following problems:
      ❌ 3D data is scarce → training a 3D diffusion model directly is hard.
      ❌ Quality and generalization are worse than for 2D image generation models.
  • 3D diffusion models trained from 2D data only
    • Anciukevičius et al. (2023), Chen et al. (2023a), Karnewar et al. (2023b)
      → try to train 3D diffusion models from 2D images alone to work around the lack of 3D data.

 

Using 2D Diffusion Models For 3D

  • Instead of training a 3D diffusion model directly, use a high-quality 2D diffusion model for 3D generation
    • DreamFusion (Poole et al., 2023), SJC (Wang et al., 2023a)
      → propose distillation techniques that use a 2D text-to-image model to generate 3D shapes.

  • Distillation for 3D reconstruction from a single image
    • Tang et al., 2023a; Melas-Kyriazi et al., 2023; Qian et al., 2023, and others
      → reconstruct 3D from a single image using NeRF-based optimization and text embeddings.
      ❌ However, textual inversion and NeRF optimization are slow, and the quality is unstable.

  • Use a 2D diffusion model to generate multiview images → apply them to 3D reconstruction
    • Watson et al. (2022), Gu et al. (2023b), Zhou & Tulsiani (2023), and others
      → generate novel views conditioned on the input image through attention layers.
    • Xiang et al. (2023), Zhang et al. (2023b)
      → use predicted depth maps to warp and inpaint images.
      ❌ However, quality degrades when the depth estimate is inaccurate.
    • Chan et al. (2023), Tewari et al. (2023)
      → generate new images one after another in an autoregressive manner.
      → work well on specific object categories but are not general-purpose.

 

What sets SyncDreamer apart

① Unlike distillation methods, it generates all multiview images simultaneously in a single reverse diffusion process.
② It applies to a wide range of objects without being tied to specific object categories (scene-specific training) and without extra depth maps.
③ It is similar to Viewset Diffusion (Szymanowicz et al., 2023), but

  • it does not need to predict a radiance field → less computation
  • it uses fixed viewpoints → more stable training

④ It is similar to MVDream (Shi et al., 2023), but SyncDreamer targets reconstruction from a single image rather than text-to-3D.

 

Other Single-View Reconstruction Methods

  • Why single-image 3D reconstruction is hard
    • Recovering 3D structure from a single view is a very difficult problem.
      Earlier methods tried regression or retrieval, but
      ❌ they generalize poorly to new object categories (Tatarchenko et al., 2019; Li et al., 2020).
  • More recently, NeRF-GAN methods achieve better 3D reconstruction on specific object classes (e.g. human faces, cat faces)
    • A trained NeRF model is used to generate 3D structure from a single image.
    • However, NeRF-GANs only work well within a specific category and are limited for general-purpose 3D reconstruction.
      ❌ They do not generalize well to arbitrary objects.
      ❌ Training on large datasets such as ImageNet has been attempted but remains difficult (Skorokhodov et al., 2023; Sargent et al., 2023).

 


METHOD

🎈 Goal: generate multiview images from a single image

Given an input image $y$, the goal is to generate a set of images that stay consistent across views.
The object is assumed to sit at the origin, normalized to fit inside a cube of side length 1.
Images are generated at N fixed viewpoints:

  • Azimuth: evenly spaced over the range 0° to 360°.
  • Elevation: fixed at 30°.
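
As a minimal sketch of this viewpoint layout, the snippet below places N cameras on a ring around the origin. The function name `fixed_viewpoints`, the camera distance `radius=1.5`, the z-up convention, and the choice `n_views=16` are assumptions made for illustration; the post only fixes the azimuth spacing and the 30° elevation.

```python
import numpy as np

def fixed_viewpoints(n_views: int, elevation_deg: float = 30.0, radius: float = 1.5):
    """Return (azimuths_deg, camera_positions) for evenly spaced target views.

    radius is a hypothetical camera distance, not a value taken from the post.
    """
    # Evenly spaced azimuths in [0, 360); 360 is excluded so it does not duplicate 0.
    azimuths = np.arange(n_views) * (360.0 / n_views)
    az = np.deg2rad(azimuths)
    el = np.deg2rad(elevation_deg)
    # Spherical -> Cartesian with z pointing up; the object sits at the origin.
    positions = np.stack([
        radius * np.cos(el) * np.cos(az),
        radius * np.cos(el) * np.sin(az),
        radius * np.sin(el) * np.ones_like(az),
    ], axis=-1)
    return azimuths, positions

azimuths, positions = fixed_viewpoints(n_views=16)
```

Every camera ends up at the same distance from the origin and at the same height, so the views differ only in azimuth.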

The model builds on the denoising diffusion probabilistic model (DDPM).

🔽 Background on DDPM

 

 

MULTIVIEW DIFFUSION

Generating each view independently with a vanilla DDPM makes it hard to keep the views consistent.

→ To address this, a Multiview Diffusion Model is designed that ties together the generation of all views.
→ The goal is to model multiview image generation from a single image as one joint probability distribution.

🧡 Forward Diffusion Process

Gaussian noise is added to each view's image independently.
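
Written out in standard DDPM notation, adding noise to each view independently corresponds to a forward transition that factorizes over views. Here $\beta_t$ is the usual noise schedule, an assumption in the sense that this post does not define it:

```latex
q\big(x_t^{(1:N)} \mid x_{t-1}^{(1:N)}\big)
  = \prod_{n=1}^{N} q\big(x_t^{(n)} \mid x_{t-1}^{(n)}\big),
\qquad
q\big(x_t^{(n)} \mid x_{t-1}^{(n)}\big)
  = \mathcal{N}\big(x_t^{(n)};\ \sqrt{1-\beta_t}\,x_{t-1}^{(n)},\ \beta_t I\big)
```

The product over $n$ is what "independently" means here: no view influences how another view is noised.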

 

🧡 Reverse Diffusion Process

  • Removes the noise added in the forward process to recover clean multiview images.
  • Rather than generating each image separately, the model is trained with the views tied together.

Analysis of the mean $\mu_\theta$

Let's look at what each term means.
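
A sketch of the reverse step in standard DDPM form, assuming $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, and a fixed variance $\sigma_t^2$ (these symbols follow the usual DDPM convention and are not defined in this post):

```latex
p_\theta\big(x_{t-1}^{(1:N)} \mid x_t^{(1:N)}\big)
  = \prod_{n=1}^{N} \mathcal{N}\big(x_{t-1}^{(n)};\ \mu_\theta^{(n)}(x_t^{(1:N)}, t),\ \sigma_t^2 I\big),
\qquad
\mu_\theta^{(n)}\big(x_t^{(1:N)}, t\big)
  = \frac{1}{\sqrt{\alpha_t}}
    \Big( x_t^{(n)} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,
          \epsilon_\theta^{(n)}\big(x_t^{(1:N)}, t\big) \Big)
```

Term by term: $x_t^{(n)}$ is the current noisy image of view $n$; $\epsilon_\theta^{(n)}(x_t^{(1:N)}, t)$ is the predicted noise for that view, computed from all $N$ noisy views; the factor $\beta_t / \sqrt{1-\bar{\alpha}_t}$ scales how much of that noise is removed in this step; and $1/\sqrt{\alpha_t}$ undoes the shrinkage applied by the forward step.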

 

Analysis of the loss function (L)

  • The noise predictor $\epsilon_\theta^{(n)}$ is trained to match the actual Gaussian noise $\epsilon^{(n)}$ as closely as possible.
  • During training, the network is optimized to minimize the noise-prediction error for each timestep (t) and each view (n).

Let's look at what each term means.
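
Under the same notation, the loss described above can be sketched as a noise-prediction objective averaged over timesteps, views, and noise samples:

```latex
L = \mathbb{E}_{t,\ x_0^{(1:N)},\ n,\ \epsilon^{(1:N)}}
    \Big[ \big\| \epsilon^{(n)} - \epsilon_\theta^{(n)}\big(x_t^{(1:N)}, t\big) \big\|_2^2 \Big]
```

Here $\epsilon^{(n)}$ is the Gaussian noise that was actually added to view $n$, $\epsilon_\theta^{(n)}(x_t^{(1:N)}, t)$ is the network's prediction of that noise given all noisy views, and the expectation runs over the timestep $t$, the clean views $x_0^{(1:N)}$, the view index $n$, and the sampled noise.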


Training Procedure

📍 Step 1: Sample N multiview images

  • Fetch N views of the same object from the dataset.
  • In other words, the dataset stores images of a single object captured from many viewpoints.

➡ At this step the training data itself is multiview-consistent by construction.

📍 Step 2: Sample a random timestep t and add Gaussian noise

  • $t$ is the random timestep used by the diffusion model, i.e. the noise level to apply.
  • Sample Gaussian noise $\epsilon^{(1:N)}$ and apply it to each image.

  • Add the noise to each image $x_0^{(1:N)}$ to produce $x_t^{(1:N)}$: $x_t^{(1:N)} = \sqrt{\bar{\alpha}_t}\, x_0^{(1:N)} + \sqrt{1-\bar{\alpha}_t}\, \epsilon^{(1:N)}$

  • Here $\bar{\alpha}_t$ is the noise scaling coefficient; it shrinks as $t$ grows, so the noise gets stronger with $t$.

➡ At this step the model sees the images corrupted at many different noise levels.

📍 Step 3: Pick a random view n and apply the noise predictor

  • Pick the n-th view at random ($n \sim U(1, N)$).
  • Apply the noise predictor $\epsilon_\theta^{(n)}$ to that view to predict its noise.

  • The predictor takes the noisy images of all views, $x_t^{(1:N)}$, as input and predicts the noise of one specific view.
  • That is, it does not use $x_t^{(n)}$ alone but also uses image information from the other views.
  • This is the key ingredient for maintaining multiview consistency.

➡ At this step the model learns to predict the noise of one view using information from the other views.

📍 Step 4: Compute the loss and update the model

  • Compute the L2 distance (MSE loss) between the sampled noise $\epsilon^{(n)}$ and the predicted noise $\epsilon_\theta^{(n)}$.
  • The closer the predicted noise is to the actual Gaussian noise, the smaller the loss.
  • The network weights are updated to minimize this loss.

➡ At this step the model learns to make increasingly accurate noise predictions.
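
The four steps above can be sketched in NumPy as follows. This is only an illustration of the data flow: `toy_noise_predictor` is a hypothetical stand-in for the real UNet, the linear schedule in `betas` and the toy tensor sizes are assumptions, and the gradient update of Step 4 is left out because it would need an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, W, C = 4, 8, 8, 3                 # toy sizes (hypothetical)
T = 1000                                # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)    # \bar{alpha}_t, decreasing in t

def toy_noise_predictor(x_t_all, t, n):
    """Hypothetical stand-in for eps_theta^(n): it sees ALL noisy views."""
    # Blend the target view with the mean over views so the other views matter.
    return 0.5 * x_t_all[n] + 0.5 * x_t_all.mean(axis=0)

def training_loss(x0_all):
    # Step 2: one random timestep shared by all views, independent noise per view.
    t = rng.integers(0, T)
    eps_all = rng.standard_normal(x0_all.shape)
    x_t_all = np.sqrt(alpha_bars[t]) * x0_all + np.sqrt(1.0 - alpha_bars[t]) * eps_all
    # Step 3: pick one view n uniformly and predict its noise from all views.
    n = rng.integers(0, N)
    eps_pred = toy_noise_predictor(x_t_all, t, n)
    # Step 4: L2 (MSE) distance between the sampled and the predicted noise.
    return float(np.mean((eps_all[n] - eps_pred) ** 2))

# Step 1: N views of the same object (random arrays stand in for real images).
x0_all = rng.uniform(-1.0, 1.0, size=(N, H, W, C))
loss = training_loss(x0_all)
```

Note that only one view's noise enters the loss per sample, yet the predictor still receives all N noisy views as input.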

 


The cause and effect here is confusing...

1. How the training data is set up

  • The training dataset already contains images captured from multiple viewpoints.
  • That is, each object comes with a set of images taken from N different angles.
  • With this dataset the model learns "how to remove noise while keeping the views consistent".

2. What happens during training

Goal:

  • Add noise to the input multiview image set ($x_0^{(1:N)}$) and train the model to recover it through reverse diffusion.
  • When predicting the noise of a specific view n, the model is trained to also use image information from the other views to improve accuracy.

3. What happens at test time

The key question at test time: "How does the model use the trained noise predictor?"

  • At test time only a single-view image ($y$) is given; no multiview images exist.
  • But because the model learned during training how the views relate to each other, it can predict the other views from the single input image.

 


Synchronized N-view Noise Predictor

In SyncDreamer's Multiview Diffusion Model, N noise predictors operate in a synchronized manner.
✅ At each timestep t, each noise predictor $\epsilon_\theta^{(n)}$ predicts the noise of its own view $x_t^{(n)}$ to recover $x_{t-1}^{(n)}$.
✅ All noise predictors are synchronized in this process → each predicts its noise in connection with the information from the other views.
✅ In other words, the predictors do not work in isolation; they share multiview information with each other to make their predictions.

Using a shared noise predictor built on a UNet (Shared UNet)

  • In the actual implementation there is no separate noise predictor per view; a single shared UNet is used.
  • That is, instead of training N noise predictors separately, one UNet is trained to handle all N views.

So what kind of function is $\epsilon_\theta$?

  • $\epsilon_\theta$ is the noise predictor function.
  • Its job is to take the current noisy image $x_t^{(n)}$ as input and predict the corresponding noise $\epsilon^{(n)}$.
  • In other words, it learns to answer "what noise should be removed from this image?"
  • It is implemented as a deep network (specifically a UNet) that takes the image as input and outputs the predicted noise.

Role of the semicolon
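
One way to write the shared predictor, using the symbols that appear in this post ($y$ for the input view, $\Delta v^{(n)}$ for the viewpoint difference), is the following. The exact order of the arguments after the semicolon is an assumption:

```latex
\epsilon_\theta^{(n)}\big(x_t^{(1:N)}, t\big)
  = \epsilon_\theta\big(x_t^{(n)};\ t,\ \Delta v^{(n)},\ x_t^{(1:N)},\ y\big)
```

The semicolon separates the quantity being denoised from the conditions. The argument before it, $x_t^{(n)}$, is the noisy image whose noise is predicted; everything after it ($t$, $\Delta v^{(n)}$, the other noisy views $x_t^{(1:N)}$, and the input image $y$) is context that steers the prediction. The same network $\epsilon_\theta$ is reused for every $n$.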

 

📌 How synchronization works in SyncDreamer:

  1. Conceptually each view has its own noise predictor, but all of them are realized by one shared UNet.
  2. When predicting the noise of one view, the states of the other views are fed in as additional input.
  3. The viewpoint difference $\Delta v^{(n)}$ between the input view and the target view is used as a condition to keep the views aligned.

 └ Role of the viewpoint difference ($\Delta v^{(n)}$)

✅ $\Delta v^{(n)}$ is the camera viewpoint difference between the input view and a specific view (n).
✅ It helps the model understand "how the image changes when moving from one viewpoint to another".
✅ For example, when $\Delta v^{(n)}$ is large, the model can take into account that it has to predict a view quite different from the input view.

➡ As a result, conditioning on the viewpoint difference is an important ingredient that helps the model keep the views precisely consistent!
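
To make the synchronization concrete, here is a toy sketch of a synchronized reverse diffusion loop. `shared_predictor` is a hypothetical placeholder for the shared UNet (it ignores `t`, `delta_v`, and `y`, which a real model would use), and the schedule, the variance choice $\sigma_t^2 = \beta_t$, and the representation of the viewpoint difference as an azimuth offset are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def shared_predictor(x_n, t, delta_v, y, x_all):
    """Hypothetical shared eps_theta(x_t^(n); t, dv^(n), x_t^(1:N), y)."""
    # Toy rule only: blend the view with the all-view mean. A real model is a UNet.
    return 0.5 * x_n + 0.5 * x_all.mean(axis=0)

def synchronized_step(x_all, t, delta_vs, y):
    """One reverse step t -> t-1 for every view from the same x_t^(1:N) snapshot."""
    # All predictions are computed from the SAME snapshot before any view is updated.
    eps = np.stack([shared_predictor(x_all[n], t, delta_vs[n], y, x_all)
                    for n in range(len(x_all))])
    mean = (x_all - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x_all.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise   # sigma_t^2 = beta_t (one common choice)

N = 4
y = rng.uniform(-1.0, 1.0, size=(8, 8, 3))    # the single input view
delta_vs = np.arange(N) * (360.0 / N)         # assumed: azimuth offsets in degrees
x = rng.standard_normal((N, 8, 8, 3))         # all views start from pure noise
for t in reversed(range(T)):
    x = synchronized_step(x, t, delta_vs, y)
```

The design point is that every view advances by exactly one step per iteration, and each prediction reads the intermediate states of all views at the same timestep.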

 

 

3D-AWARE FEATURE ATTENTION FOR DENOISING

1. UNet-based noise predictor $\epsilon_\theta$

  • The basic approach is to remove noise with a UNet.
  • The pretrained UNet of Zero123 (Liu et al., 2023b) is used to retain its generalization ability.
    • What Zero123 does:
    • The input image and the noisy target view are fed into the UNet together.
    • A new-viewpoint image is generated from these two images.
    • ✅ SyncDreamer follows the Zero123 architecture, but the UNet and the text attention layers are frozen and not trained.

➡ In short, the basic denoising is done by the UNet, and an additional 3D-aware feature attention is introduced to maintain multiview consistency!

 

2. 3D-Aware Feature Attention

Problem:

  • With only a plain 2D CNN or the original UNet architecture, it is hard to keep multiview images consistent.
  • When denoising each view, the model needs to be aware of the 3D features shared with the other views to recover the image precisely.

Solution:

  • First build a 3D feature volume
    • Set up a virtual grid of $V^3$ vertices in 3D space.
    • Project these vertices onto the images of all views and fetch the feature at each projected location.
    • This tells us how a given 3D point appears in each view.
  • Extract features from every view with convolution layers
    • Meaningful features are detected in each view's image and merged into 3D space.
    • Convolution layers learn local patterns, so they can pick up the object's features reliably.
  • Learn spatial relationships with a 3D CNN
    • The features projected from each view are gathered into one 3D space and processed together.
    • That is, the model does not just look for features in 2D images; it learns meaningful relationships in 3D space.

➡ In summary, convolution features from each view are looked up at the same 3D points and merged into a 3D volume, which enables more precise 3D awareness!
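
A rough sketch of the volume construction described above, under several simplifying assumptions that are not from the post: an orthographic camera looking at the origin with z up, nearest-neighbour feature lookup, and averaging across views as the aggregation rule. The 3D CNN that would process the volume afterwards is omitted, and `look_at_axes` and `build_spatial_volume` are hypothetical helper names.

```python
import numpy as np

def look_at_axes(cam_pos):
    """Right/up/forward axes of a camera at cam_pos looking at the origin (z is up)."""
    forward = -cam_pos / np.linalg.norm(cam_pos)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    return right, up, forward

def build_spatial_volume(feats, cam_positions, V=8, half_extent=0.5):
    """feats: (N, H, W, C) per-view 2D feature maps -> (V, V, V, C) feature volume."""
    N, H, W, C = feats.shape
    axis = np.linspace(-half_extent, half_extent, V)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    volume = np.zeros((V ** 3, C))
    for n in range(N):
        right, up, _ = look_at_axes(cam_positions[n])
        # Orthographic projection (a simplification); u and v land in [-1, 1].
        u = grid @ right / (half_extent * np.sqrt(3))
        v = grid @ up / (half_extent * np.sqrt(3))
        col = np.clip(np.rint((u + 1) / 2 * (W - 1)).astype(int), 0, W - 1)
        row = np.clip(np.rint((1 - (v + 1) / 2) * (H - 1)).astype(int), 0, H - 1)
        volume += feats[n, row, col]           # nearest-neighbour lookup per vertex
    return (volume / N).reshape(V, V, V, C)    # averaged over views (assumption)

rng = np.random.default_rng(0)
az = np.deg2rad(np.arange(4) * 90.0)
el = np.deg2rad(30.0)
cams = 1.5 * np.stack([np.cos(el) * np.cos(az),
                       np.cos(el) * np.sin(az),
                       np.full(4, np.sin(el))], axis=-1)
volume = build_spatial_volume(rng.standard_normal((4, 16, 16, 5)), cams)
```

Each vertex of the grid collects one feature vector per view, so the resulting volume describes the same 3D location as seen from every viewpoint.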


The 2D feature maps of the N target views are unprojected into 3D space and gathered in a common spatial volume;
each view then pulls out the information that matches its own viewpoint.

 

3. Denoising a specific view n

Once the 3D volume has been built, the noise for a specific view $n$ is removed as follows:

  • Build a pixel-aligned view frustum volume for that view
    • The view frustum fetches the features in 3D space that line up exactly with that view's pixels.
  • Interpolate features per pixel from this view frustum
    • That is, 3D features obtained from the other views are resampled to match this view's image.
  • Apply a new depth-wise attention on the UNet's intermediate feature maps
    • This takes the structure along the depth direction into account and allows more precise features to be learned.
    • Depth-wise attention is similar in concept to epipolar attention.

➡ In short, when denoising a specific view, 3D features gathered from the other views are used to produce a more accurate result!
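
The depth-wise attention step can be sketched as follows. This is a bare-bones version under stated assumptions: the learned query, key, and value projections of a real attention layer are omitted, the residual connection at the end is an assumption, and the tensor shapes are toy values.

```python
import numpy as np

def depth_wise_attention(unet_feat, frustum_feat):
    """unet_feat: (H, W, C) queries; frustum_feat: (H, W, D, C) keys and values.

    Each pixel attends only over its own D depth samples.
    """
    C = unet_feat.shape[-1]
    # scores[h, w, d] = <query[h, w], key[h, w, d]> / sqrt(C)
    scores = np.einsum("hwc,hwdc->hwd", unet_feat, frustum_feat) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over depth only
    attended = np.einsum("hwd,hwdc->hwc", weights, frustum_feat)
    return unet_feat + attended, weights               # residual add (assumption)

rng = np.random.default_rng(0)
out, w = depth_wise_attention(rng.standard_normal((4, 4, 8)),
                              rng.standard_normal((4, 4, 6, 8)))
```

Because the softmax runs along the depth axis only, a pixel never mixes in features that belong to a different pixel's ray, which keeps the cost far below full attention over the whole volume.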

💛 What is a VIEW FRUSTUM VOLUME?

  • A view frustum is the 3D region, used in 3D graphics and computer vision, that defines the space visible from a particular camera viewpoint.
  • Put simply, it is the region the camera can capture, drawn as a pyramid shape (truncated by the near and far planes).
  • Only the 3D points inside the view frustum are visible to the camera!
    • 📌 Example: what the view frustum looks like from the camera
      • Near plane: the 2D plane closest to the camera
      • Far plane: the 2D plane farthest from the camera
      • Left/right/top/bottom boundaries (frustum boundaries): determined by the camera's FOV (field of view)
  • A view frustum volume is a 3D feature volume in which 3D space is aligned to the pixels of a particular view.
  • That is, it is a tensor that stores, for each pixel, the features at several depth values in 3D space.
  • With such a volume, each pixel can refer to information at several depths, which helps recover the 3D structure!
  • Example:
    • Looking through pixel (100, 200) from a given camera viewpoint, there can be several 3D features at several depths.
    • The view frustum volume stores the 3D features extracted at each depth along that pixel's location.
    • Using this depth-aligned feature volume, the most relevant 3D feature can be selected for each pixel!
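
As an illustration of the pixel-aligned layout, the sketch below resamples a spatial volume into an (H, W, D, C) tensor for one camera. It makes several assumptions not found in the post: rays are parallel (an orthographic camera, so the "frustum" degenerates to a box, whereas a perspective camera would fan the rays out), lookup is nearest-vertex rather than interpolated, and `build_frustum_volume` is a hypothetical helper name.

```python
import numpy as np

def build_frustum_volume(volume, cam_pos, H=4, W=4, D=6, half_extent=0.5):
    """volume: (V, V, V, C) spatial volume -> (H, W, D, C) view frustum volume."""
    V = volume.shape[0]
    forward = -cam_pos / np.linalg.norm(cam_pos)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    # Pixel centres on a plane through the origin, D depths along the view direction.
    us = np.linspace(-half_extent, half_extent, W)
    vs = np.linspace(half_extent, -half_extent, H)      # row 0 is the top of the image
    ds = np.linspace(-half_extent, half_extent, D)      # near -> far
    pts = (vs[:, None, None, None] * up
           + us[None, :, None, None] * right
           + ds[None, None, :, None] * forward)         # (H, W, D, 3)
    # Nearest-vertex lookup; the post describes interpolation, simplified here.
    idx = np.rint((pts + half_extent) / (2 * half_extent) * (V - 1)).astype(int)
    idx = np.clip(idx, 0, V - 1)                        # clamp points outside the cube
    return volume[idx[..., 0], idx[..., 1], idx[..., 2]]

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8, 5))
frustum = build_frustum_volume(volume, cam_pos=np.array([1.3, 0.0, 0.75]))
per_pixel = frustum[1, 2]   # D feature vectors for the pixel at row 1, column 2
```

Indexing `frustum[row, col]` returns the stack of depth samples for one pixel, which is the per-pixel set of candidates described in the example above.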

 

 

4. Design principles and considerations

✅ First consideration: global multiview consistency

  • Because all views look at the same object, a single unified 3D volume is used to keep the views aligned.

✅ Second consideration: attention along the depth direction

  • The new attention layer operates only along the depth dimension.
  • This is related to the epipolar line constraint → features observed for the same 3D point should agree across views.

➡ In short, SyncDreamer does not process each 2D image independently; it learns meaningful information in 3D space to generate consistent multiview images!