The Transformer is a neural network architecture that, instead of processing a sequence one element at a time in order, uses attention to compute directly how the tokens in the sequence relate to one another. It was proposed in the 2017 paper Attention Is All You Need and has since become the backbone of the BERT and GPT families of models.

In one sentence: the Transformer computes how much each token should attend to every other token, then stacks that information across multiple layers to build contextual representations.

Why it is needed

Sentences have order. For that reason, approaches such as RNNs, which pass a state from front to back, were the natural choice in the past. This approach has two problems, however.

First, the computation is sequential. The next token cannot be computed until the previous token is finished, so long sentences are hard to process quickly in parallel.

Second, relationships between distant words are hard to learn. Information from the beginning has to pass through many steps to reach the end, and it can weaken along the way.

The Transformer sidesteps these problems with self-attention. Every token can look directly at every other token within the same layer, so long-range relationships are connected by a short path.

Basic structure

The Transformer in the original paper is an encoder-decoder architecture.

Input sentence
  ↓
Encoder stack
  ↓
Contextual representation
  ↓
Decoder stack
  ↓
Output sentence

The encoder reads the input sequence, and the decoder generates the output sequence while consulting the encoder's representation. Today's GPT-family models are closest to a variant built around the decoder side of this design, while BERT is built around the encoder side.

Understanding through figures

Figure 1. Overall Transformer architecture. The encoder is on the left and the decoder is on the right. Source: Vaswani et al., Attention Is All You Need.

Figure 1, extracted from the paper PDF, shows the Transformer split into two blocks, an encoder and a decoder. The encoder on the left adds positional encoding to the input embeddings and then repeats the same layer N times. Each layer consists of multi-head self-attention, Add & Norm, a feed-forward network, and Add & Norm, in that order.

The decoder on the right is similar but has one more attention block. It first applies masked multi-head self-attention, which attends only to previous output tokens, and then attends to the encoder output. Finally, after the feed-forward network, a linear layer and a softmax produce the next-token probabilities. The figure makes clear that the Transformer is not a single attention block but an architecture that repeatedly stacks attention and feed-forward networks inside a residual/normalization structure.

Figure 2 (left). Scaled dot-product attention: the internal computation flow.

Figure 2 (right). Multi-head attention runs several scaled dot-product attentions in parallel.

Figure 2 explains attention itself. Scaled dot-product attention compares Q and K with a matrix product, applies the scale and an optional mask, passes the result through softmax, and uses the resulting weights to mix V. Multi-head attention runs this scaled dot-product attention in several heads in parallel, concatenates the results, and merges them again with a linear layer. Multi-head attention can therefore be understood as "a device that looks at attention from several viewpoints at once."

Encoder layer

The encoder layer in the paper consists of two parts.

  1. multi-head self-attention
  2. position-wise feed-forward network

Each part is followed by a residual connection and layer normalization. A residual connection bypasses the sub-layer and adds the unchanged input to the sub-layer's output, and layer normalization stabilizes the distribution of values. Together they reduce the instability of gradients and representations that arises when training deep models.
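The Add & Norm step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the names `layer_norm` and `add_and_norm` are hypothetical, the learnable gain and bias of layer normalization are omitted, and the sub-layer is a stand-in function.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # (Learnable gain and bias are left out of this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection: add the untouched input to the sub-layer output,
    # then normalize the sum.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 tokens, 8 features each
y = add_and_norm(x, lambda h: 0.5 * h)      # stand-in for attention or FFN
```

Because the input is added back unchanged, the layer still passes information through even when the sub-layer contributes little, which is one intuition for why deep stacks remain trainable.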

In the encoder's self-attention, each token of the input sentence can attend to every other token in the same sentence.

Decoder layer

The decoder layer has one more step than the encoder layer.

  1. masked multi-head self-attention
  2. encoder-decoder attention
  3. position-wise feed-forward network

Masked self-attention prevents a position from seeing future tokens. For example, when a translation is generated from left to right, the model must not predict the next word by looking at words it has not generated yet.
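One common way to realize this masking is a lower-triangular (causal) mask applied to the scores before softmax. The sketch below assumes NumPy; the helper names `causal_mask` and `masked_softmax` are illustrative, and the all-zero score matrix is a made-up input chosen so the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions are set to -inf, so they get exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                     # hypothetical uniform raw scores
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

With uniform scores, the first position puts all of its weight on itself, the second splits its weight over the first two positions, and so on; nothing above the diagonal receives any weight.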

Encoder-decoder attention is the part where the decoder consults the encoder's representation of the input sentence while producing an output token. In translation terms, it is the process of finding which part of the source sentence the word being generated should look at.
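The difference from self-attention is only where the inputs come from: queries are computed from the decoder states, while keys and values are computed from the encoder output. The NumPy sketch below illustrates this under assumed shapes; `cross_attention` and the random projection matrices are hypothetical, and it uses the scaled scores that the paper applies.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_states, enc_output, w_q, w_k, w_v):
    """dec_states: (tgt_len, d_model); enc_output: (src_len, d_model)."""
    q = dec_states @ w_q                  # queries come from the decoder
    k = enc_output @ w_k                  # keys come from the encoder
    v = enc_output @ w_v                  # values come from the encoder
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)             # (tgt_len, src_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
enc_output = rng.normal(size=(5, d_model))   # 5 source tokens
dec_states = rng.normal(size=(3, d_model))   # 3 target positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, weights = cross_attention(dec_states, enc_output, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over source tokens for one target position, which matches the "which part of the source should I look at" reading.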

Self-attention

Self-attention is an operation in which tokens within the same sequence attend to one another. Each token is transformed into three kinds of vectors: query, key, and value.

  • Query: the information this token is looking for
  • Key: the identifying information that other tokens carry
  • Value: the content that is actually retrieved

Computing the similarity between a query and a key yields a score for "how much should this token attend to that token." Applying softmax to these scores turns them into weights, and the weighted sum of the values is the attention output.
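The query/key/value flow can be sketched as follows. This is a minimal NumPy illustration with hypothetical names (`self_attention`, `w_q`, `w_k`, `w_v`) and random weights; it deliberately leaves out the scaling factor and the multi-head split so that only the basic mechanism is visible.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k)."""
    q = x @ w_q                 # what each token is looking for
    k = x @ w_k                 # what each token offers for matching
    v = x @ w_v                 # the content that actually gets mixed
    scores = q @ k.T            # (seq_len, seq_len) query-key similarity
    weights = softmax(scores)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of `weights` says how much token i draws from every token in the sequence, including itself.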

Scaled dot-product attention

The attention used in the paper is scaled dot-product attention. It computes the dot product of queries and keys and divides it by the square root of the key dimension. The result is passed through softmax, and the resulting weights are used to take a weighted sum of the values.
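Written as a single formula, with Q, K, and V as the matrices of stacked queries, keys, and values, and d_k as the key dimension, the steps above are:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so each query gets its own set of weights over the keys.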

The scale is included because dot products grow in magnitude as the dimension increases, which can make the softmax too sharp. When the softmax saturates toward one side, gradients become small and training can become difficult.
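The effect of the scale can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of trained models) and compares the spread of raw and scaled dot products; `dot_product_std` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    # Sample random query/key pairs and measure the spread of q . k,
    # before and after dividing by sqrt(d_k).
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:4.2f}")
```

Under this assumption the raw standard deviation grows roughly like the square root of d_k, while the scaled version stays near 1, so the softmax inputs keep a similar range regardless of dimension.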

Multi-head attention

Multi-head attention runs attention several times in parallel. Instead of computing one large attention once, it projects query/key/value into several smaller subspaces, computes attention in each, and then merges the results.
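A compact way to sketch this is to project once and then reshape the result into heads. The NumPy code below is illustrative only: `multi_head_attention` and its weight names are hypothetical, the weights are random, biases and masking are omitted, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ v                             # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
x = rng.normal(size=(5, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Each head works in a 4-dimensional subspace here, so the total cost stays close to that of a single 16-dimensional attention.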

This lets the model look at several kinds of relationships at the same time.

  • One head may look at relationships between nearby words.
  • One head may look at long-distance dependencies.
  • One head may look at the referent of a pronoun.
  • One head may capture grammatical roles.

In its attention visualizations, the paper shows some heads linking distant expressions such as "making … more difficult", and others attending sharply to the referent of "its".

Positional encoding

Self-attention does not carry order information by itself. "I went to the school" and "The school came to me" use similar sets of words, but the different order gives them different meanings. The Transformer therefore adds positional encoding to the token embeddings.

The original paper used fixed positional encodings built from sine and cosine functions. Because each embedding dimension uses a sinusoid with a different period, every position receives a distinct pattern, which lets the model make use of both a token's position and relative distances.
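The sinusoidal scheme from the paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch follows; the function name is hypothetical and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; d_model is assumed to be even."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# The table is simply added to the token embeddings:
# x = token_embeddings + pe[:seq_len]
```

Low-index dimensions oscillate quickly and high-index dimensions slowly, so the combination identifies a position much like the digits of a number.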

The important point is that the Transformer did not discard order. Rather than leaving order handling to recurrence, it injects order as separate positional information.

Feed-forward network

In every encoder and decoder layer, the attention sub-layers are followed by a position-wise feed-forward network. This is a small MLP applied independently at each token position. Where attention mixes information between tokens, the feed-forward network applies a nonlinear transformation to the representation at each position to produce richer features.

The base model in the original paper used d_model = 512 and a feed-forward inner dimension of d_ff = 2048.
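In the paper this network is two linear transformations with a ReLU in between, FFN(x) = max(0, x W1 + b1) W2 + b2. The sketch below uses the base-model dimensions; the random initialization and its scale are arbitrary assumptions for illustration.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # Expand to d_ff, apply ReLU, then project back to d_model.
    hidden = np.maximum(0.0, x @ w1 + b1)    # (seq_len, d_ff)
    return hidden @ w2 + b2                  # (seq_len, d_model)

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))           # 10 token positions
out = feed_forward(x, w1, b1, w2, b2)
```

Since the same weights are applied row by row, running the function on a single position gives the same result as the corresponding row of the batched output, which is what "position-wise" means.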

Why it scaled well

Performance is not the only reason the Transformer became important. Its structure was also well suited to parallelization.

In an RNN the computation is chained along the sequence length, whereas the Transformer's self-attention can compute many positions at once within a layer. This fits parallel hardware such as GPUs and TPUs well. The larger the model and the data, the more this difference matters.
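The contrast can be seen in the shape of the code itself. The following NumPy sketch uses a toy tanh recurrence and raw inputs as queries and keys; both are simplifications chosen only to show the dependency structure, and the weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))
w_x, w_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Recurrent: step t needs the state from step t-1,
# so this loop cannot be parallelized over t.
h = np.zeros(d)
states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ w_x + h @ w_h)
    states.append(h)
rnn_states = np.stack(states)                # (seq_len, d)

# Self-attention scores: every pair of positions comes out of
# a single matrix product, with no dependency between rows.
scores = (x @ x.T) / np.sqrt(d)              # (seq_len, seq_len)
```

The matrix product maps directly onto the batched matrix multiplications that accelerators are optimized for, while the loop forces seq_len dependent steps.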

The path length between tokens is also short. In an RNN, two distant tokens are connected only through many recurrent steps, while in self-attention they can be connected directly within a single layer. This property helps in learning long-range dependencies.

Limitations

The Transformer has limitations too. The most prominent is the computational cost of attention. Because every pair of tokens is compared, the cost of self-attention is roughly proportional to n² for a sequence of length n. It is powerful on short and medium-length sentences, but the cost grows large when very long documents, high-resolution images, or long audio are processed as they are.
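A quick back-of-the-envelope calculation shows how fast this grows. The sketch counts the entries of one attention score matrix (per head, per layer) and its size if stored naively in float32; the sequence lengths are arbitrary examples, and real implementations may avoid materializing the full matrix.

```python
def attention_score_entries(seq_len):
    # One score per (query, key) pair.
    return seq_len * seq_len

for n in (512, 4096, 32768):
    entries = attention_score_entries(n)
    mib = entries * 4 / 2**20                # 4 bytes per float32 value
    print(f"n={n:6d}  scores={entries:>13,d}  ~{mib:,.0f} MiB")
```

Going from 4,096 to 32,768 tokens is an 8x increase in length but a 64x increase in the number of scores.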

Later research therefore produced a range of variants and workarounds, such as sparse attention, local attention, linear attention, retrieval, chunking, and long-context architectures.

Points that are easy to confuse

Reducing the Transformer to the slogan "attention is all you need" invites misunderstanding. In practice, attention, positional encoding, residual connections, layer normalization, feed-forward networks, masking, and the choice of optimizer and regularization all work together.

Nor does it mean that every current LLM is the encoder-decoder Transformer in the paper's original form. The GPT family uses a decoder-only architecture, and the BERT family uses an encoder-centered one. Both lines, however, start from the self-attention-based architecture that this paper laid out.

Related documents