
[Paper Review] 📌 Attention Is All You Need (aka. Transformer)

by SolaKim 2025. 2. 11.

At last, here it is: the Transformer! 🥁 Ta-da!

https://arxiv.org/abs/1706.03762

 


The paper "Attention Is All You Need" came about to overcome the limitations of the existing RNN (Recurrent Neural Network) and CNN (Convolutional Neural Network) based models!

 

Limitations of RNNs, LSTMs, and CNNs

  1. Sequential Processing
    • RNNs and LSTMs process the input sequentially, which makes parallelization difficult and training slow
    • They struggle to learn long-range dependencies in long sentences
  2. Vanishing Gradient Problem
    • Improved architectures such as LSTM and GRU help, but the vanishing gradient problem can still occur on long sequences
  3. Limitations of CNNs
    • CNNs handle local patterns in sequence data well, but are weak at learning global dependencies

 

Key Motivations behind "Attention Is All You Need"

  1. Parallelizable
    • The attention mechanism can process every input position at once, so the computation can be parallelized
      => (Scaled) Dot-Product Attention
    • As a result, training is far faster than with RNN-based models
  2. Deeper context learning
    • Attention can consider every position of the input sequence simultaneously, so long-range dependencies are learned better
      => positional information is supplied via Positional Encoding
  3. Simplicity and extensibility of Self-Attention
    • The complicated recurrent structure of RNNs is removed; Self-Attention alone yields a much simpler model
    • The Transformer architecture later spread beyond NLP into speech recognition, computer vision, and many other fields

 

Contributions of "Attention Is All You Need"

  1. RNNs/CNNs removed entirely
    • Most earlier models used attention only as an auxiliary technique; the Transformer uses attention and nothing else
  2. Self-Attention + Positional Encoding
    • Positional Encoding is added to provide position information, and Self-Attention learns the relationships between all tokens
  3. Parallel computation greatly speeds up training
    • Large-scale data can be processed far faster, even on a single GPU

 

Transformer Architecture: Overall Structure

The Transformer consists of two large stacks (a minimal sketch with the paper's hyperparameters follows right below this list).

  1. Encoder Stack (6 identical layers)
    • Encodes the input sequence into a high-dimensional representation via Self-Attention and a Feed-Forward Network.
    • Self-Attention and the Feed-Forward Network are the two sub-layers; together they make up one encoder layer.
    • Every sub-layer outputs d_model = 512 dimensions (kept uniform so that the outputs and residual connections line up).
  2. Decoder Stack (6 identical layers)
    • Generates the output sequence from the encoder output using (masked) Self-Attention, Encoder-Decoder Attention, and a Feed-Forward Network.
      • These three sub-layers together make up one decoder layer.
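
Not the authors' original code, but a minimal sketch of my own (assuming a recent PyTorch install) showing how these hyperparameters, 6 encoder layers, 6 decoder layers, d_model = 512, d_ff = 2048, map onto the built-in nn.Transformer module:

```python
import torch.nn as nn

# Hyperparameters of the base model in the paper
d_model = 512      # embedding / model dimension
n_heads = 8        # attention heads
n_layers = 6       # depth of both the encoder and decoder stacks
d_ff = 2048        # inner dimension of the feed-forward network

# PyTorch ships a reference implementation of this encoder-decoder layout.
transformer = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=n_layers,
    num_decoder_layers=n_layers,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,          # tensors are (batch, seq_len, feature)
)

print(len(transformer.encoder.layers), len(transformer.decoder.layers))   # 6 6
```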

 

(Figure) Transformer model architecture

Encoder Components

  1. Input Representation
    • The input data (sentence) is converted into embeddings, and Positional Encoding is added to reflect position information
    • Positional Encoding uses sin and cos functions to assign a unique value to each position (see the first sketch after this list)
      • PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
      • pos: the position of the token in the sequence
      • i: the index of the embedding dimension currently being computed
      • d_model: the size of the embedding vector (how many dimensions are used)
      • Even dimensions use the sin value, odd dimensions use the cos value
  2. Self-Attention Mechanism (shared by Encoder and Decoder)
    • Every token in the input sequence computes its similarity with every other token, so their mutual dependencies are learned (see the attention sketch after this list)
      • The Q, K, V matrices are built as follows (item 3, Multi-Head Attention, describes the computation in more detail)
        • The input matrix (embedding + positional encoding) is conceptually copied three times
        • To get Q, the input is multiplied by a learned weight matrix of size d_model x d_k
        • K and V are likewise obtained by multiplying with their own learned weight matrices (the weight matrices for Q, K, and V are all different)
      • The core of Self-Attention is to build the Query (Q), Key (K), and Value (V) matrices this way and compute a weighted sum of the values
      • In the encoder, the masking step is skipped
      • The scores Q K^T / sqrt(d_k) are passed through a softmax to obtain the attention weights: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
  3. Multi-Head Attention
    • Self-Attention is performed several times in parallel (once per head), and the results are combined.
      • Overall model dimension d_model = 512
      • Number of heads h = 8 (the setting used in the paper)
      • Each Query, Key, and Value matrix is split into d_model / h = 64 dimensions,
        i.e. each head runs its computation on 64-dimensional vectors
    • This lets the model learn relationships from several different viewpoints
      How the Q, K, V matrices are created
    • In Multi-Head Attention, a different linear transformation of the input X is applied for each head
    • Each head produces its own Query, Key, and Value
    • The 8 heads each build independent Q, K, V matrices and compute attention in parallel
    • A final linear transformation then restores the output to d_model = 512
      • MultiHead(Q, K, V) = Concat(head_1, ..., head_8) W^O
      • W^O: the weight matrix of the final linear transformation, of size (8 x d_v) x d_model = 512 x 512
      • W^O is needed because the concatenated head outputs are just values laid side by side; multiplying by W^O fuses them into one unified representation.
  4. Feed-Forward Network (FFN)
    • The attention output of each token is transformed non-linearly by the FFN, independently per token (see the FFN sketch after this list)
      • The non-linearity comes from ReLU
    • It consists of two fully-connected layers
    • [input vector] ----> Linear ----> [hidden vector] ----> ReLU ----> [hidden vector] ----> Linear ----> [output vector]
    • First Linear layer: expands the input from d_model = 512 to the inner dimension d_ff = 2048
    • Second Linear layer: shrinks d_ff = 2048 back down to d_model = 512
    • Why expand to d_ff = 2048?
      • Expanding to 2048 dimensions projects the input into a higher-dimensional space, strengthening the non-linear representational power
  5. Residual Connection & Layer Normalization
    • A Residual Connection and Layer Normalization are applied around each Self-Attention and Feed-Forward sub-layer
    • This mitigates the vanishing gradient problem and makes deeper training possible

Layer Normalization normalizes over the feature (channel) dimension of each token.
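
Below is a minimal sketch of the sinusoidal positional encoding from item 1. This is my own illustration (assuming PyTorch; the helper name positional_encoding is just for this example, not from the paper):

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1): token position
    i = torch.arange(d_model)                          # (d_model,): dimension index
    two_i = (i // 2 * 2).float()                       # 2i shared by each sin/cos pair
    angle = pos / (10000.0 ** (two_i / d_model))       # (max_len, d_model) angles
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle[:, 0::2])            # even dimensions -> sin
    pe[:, 1::2] = torch.cos(angle[:, 1::2])            # odd dimensions  -> cos
    return pe

print(positional_encoding(50, 512).shape)   # torch.Size([50, 512])
```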
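
And items 2 and 3 in code form: a hedged sketch (not the official implementation) of scaled dot-product attention plus an 8-head Multi-Head Attention block with learned W_q, W_k, W_v projections and the output projection W^O:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)          # similarity of every token pair
    if mask is not None:                                       # the encoder skips this step
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                        # attention weights sum to 1
    return weights @ V                                         # weighted sum of the values

class MultiHeadAttention(nn.Module):
    """Split d_model=512 into h=8 heads of d_k=d_v=64, attend in parallel, re-mix with W^O."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One learned projection per role, applied to copies of the same input.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # W^O: fuses the concatenated heads

    def forward(self, x_q, x_k, x_v, mask=None):
        B, L, _ = x_q.shape
        def split(t):   # (B, len, d_model) -> (B, h, len, d_k)
            return t.view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x_q)), split(self.W_k(x_k)), split(self.W_v(x_v))
        heads = scaled_dot_product_attention(Q, K, V, mask)         # (B, h, L, d_k)
        concat = heads.transpose(1, 2).contiguous().view(B, L, -1)  # Concat(head_1..head_8)
        return self.W_o(concat)                                     # back to d_model

x = torch.randn(2, 10, 512)        # (batch, seq_len, d_model)
mha = MultiHeadAttention()
print(mha(x, x, x).shape)          # self-attention: torch.Size([2, 10, 512])
```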
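
Finally, items 4 and 5 as a short sketch: the position-wise FFN (512 -> 2048 -> 512 with ReLU) wrapped in the residual connection + LayerNorm pattern used around every sub-layer (again just an illustration):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = Linear2(ReLU(Linear1(x))), applied to each token independently."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the inner dimension
            nn.ReLU(),                  # the non-linearity mentioned above
            nn.Linear(d_ff, d_model),   # project back down to d_model
        )

    def forward(self, x):
        return self.net(x)

# Residual connection + layer normalization around a sub-layer (post-norm, as in the paper).
d_model = 512
ffn, norm = PositionwiseFFN(), nn.LayerNorm(d_model)   # LayerNorm acts on the feature dimension
x = torch.randn(2, 10, d_model)
out = norm(x + ffn(x))        # LayerNorm(x + Sublayer(x))
print(out.shape)              # torch.Size([2, 10, 512])
```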

 

 

 

Decoder Components

The Transformer decoder also consists of 6 identical layers. Each layer is made up of three sub-blocks.

 

  1. Masked Multi-Head Self-Attention
    • Computes the relationships with itself, i.e. with the previously generated output tokens
    • Future tokens are masked so that only information up to the current position is used
      • The decoder may only consult the tokens it has already generated, so future information is masked out (see the mask sketch after this list)
    • A mask matrix sets the scores of future positions to -inf, so their softmax weights become 0 and the future information is ignored
    • For example, if the decoder has generated "The cat" so far, then when predicting the next word "sat" it may only use the information from "The" and "cat"
  2. Multi-Head Cross-Attention (Encoder-Decoder Attention)
    • Learns the relationship between the encoder output (the encoded representation of the input sequence, used as Key and Value) and the current decoder input (used as Query), modeling the dependency between input and output (Cross-Attention)
      • Q comes from the decoder; K and V come from the encoder
      • This highlights the important information in the input sequence and helps predict the output sequence
  3. Position-wise Feed-Forward Network (FFN)
    • Applies a non-linear transformation to each token to learn more complex relationships
    • Identical to the encoder's FFN, with ReLU providing the non-linearity
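
Here is a small sketch (toy numbers, assuming PyTorch) of the look-ahead mask from item 1: future positions are given a score of -inf before the softmax, so their attention weight comes out exactly 0:

```python
import torch
import torch.nn.functional as F

L = 4                                        # e.g. number of target tokens so far
# Lower-triangular mask: position t may attend only to positions <= t.
mask = torch.ones(L, L).tril().bool()
print(mask.int())                            # 1 = may attend, 0 = masked (future)

scores = torch.randn(L, L)                           # raw Q K^T / sqrt(d_k) scores
masked = scores.masked_fill(~mask, float('-inf'))    # hide the future positions ...
weights = F.softmax(masked, dim=-1)                  # ... so their weight is exactly 0
print(weights[0])                                    # row 0 attends only to position 0
```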

 

 

 

😲 Data Flow (At a Glance!)

Encoding stage:

  1. Input sequence → embedding + Positional Encoding (the two vectors are simply added together)
  2. Self-Attention → Feed-Forward Network
  3. Passing through the 6 encoder layers produces the final encoded representation

Decoding stage:

  1. Previous output sequence → embedding + Positional Encoding
  2. Self-Attention (with masking applied in the decoder)
  3. Encoder-Decoder Attention (learns the relationship with the encoder's representation)
  4. Feed-Forward Network
  5. Passing through the 6 decoder layers produces the final output
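
To tie this summary together, a toy end-to-end shape check (my own sketch reusing PyTorch's built-in nn.Transformer with the paper's hyperparameters; not the paper's actual training setup):

```python
import torch
import torch.nn as nn

d_model, src_len, tgt_len, batch = 512, 10, 7, 2

model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

# Stand-ins for "embedding + Positional Encoding" of the source and of the previous outputs.
src = torch.randn(batch, src_len, d_model)
tgt = torch.randn(batch, tgt_len, d_model)

# Look-ahead mask so that each target position only sees earlier positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)   # torch.Size([2, 7, 512]); a final Linear + softmax over the vocabulary follows
```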

 

 

Conclusion

  • Self-Attention outperforms existing RNNs and CNNs in computational efficiency, parallelizability, and learning long-range dependencies.
  • RNNs are slow and strictly sequential; CNNs parallelize well but have trouble learning long-range dependencies.
  • Self-Attention also offers better model interpretability, so the results are easier to reason about.