Attention Is All You Need is the Transformer paper published in 2017. As the title says, the paper argued that for sequence transduction problems such as machine translation, a strong model can be built from attention alone, without recurrent or convolutional layers.

The paper's significance is not that it "bolted on one more component called attention". Encoder-decoder architectures and attention were already in use. The core of the paper is that it drops the recurrent and convolutional layers that earlier models relied on to carry information across the sequence, and builds the encoder-decoder model solely from stacked self-attention and feed-forward networks. This choice brought major changes in parallel training, handling of long dependencies, and model scalability.

Background: why remove the RNN?

What the paper takes issue with is the sequential nature of the sequence models of the time. Models in the RNN, LSTM, and GRU family process input tokens in order, updating a hidden state as they go. This handles word order naturally, but it makes it hard to process the positions within one sentence fully in parallel during training.

In a long sentence, information from the beginning has to pass through many computation steps before it reaches later tokens. The paper treats this as one of the factors that make long-range dependencies hard to learn. When computation is sequential, parallelization within a single training example is limited even on a GPU, and training efficiency drops as sequences get longer.
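The sequential bottleneck described above can be seen in a minimal sketch of a recurrent layer. The plain tanh recurrence, the weight names (`W_xh`, `W_hh`, `b_h`), and the toy sizes are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over x of shape (seq_len, d_in); return all hidden states."""
    seq_len = x.shape[0]
    d_hidden = W_hh.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for t in range(seq_len):
        # The state at step t needs the state from step t-1,
        # so the time steps cannot be computed in parallel.
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))            # 5 tokens, 4 input features (toy sizes)
W_xh = rng.normal(size=(4, 3)) * 0.1   # hypothetical input-to-hidden weights
W_hh = rng.normal(size=(3, 3)) * 0.1   # hypothetical hidden-to-hidden weights
b_h = np.zeros(3)
hidden_states = rnn_forward(x, W_xh, W_hh, b_h)
```

Information from the first token reaches the last hidden state only after passing through every intermediate step of the loop, which is the long path the paper wants to shorten.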

To reduce this bottleneck, the Transformer makes self-attention its central operation, so that each position can attend directly to every other position. As a result, the path between any two tokens within a layer becomes short, and many positions of the sequence can be computed at the same time.

Model architecture

The paper keeps the standard encoder-decoder structure. The encoder maps the input sequence to continuous representations, and the decoder consults those representations to generate the output sequence one token at a time. What differs is how the insides of the encoder and decoder are built.

Encoder

The encoder is a stack of 6 identical layers. Each layer has two sub-layers.

  1. multi-head self-attention
  2. position-wise feed-forward network

Each sub-layer is wrapped in a residual connection, followed by layer normalization. In other words, the sub-layer's output is added to its original input and the sum is then normalized. This is a device for training a deep network stably.
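The residual-then-normalize step, which the paper writes as LayerNorm(x + Sublayer(x)), can be sketched as follows. This is a simplified version under stated assumptions: the learned gain and bias of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 positions, 8 features (toy sizes)
W = rng.normal(size=(8, 8)) * 0.1    # hypothetical sub-layer weights

def toy_sublayer(h):
    # Stand-in for a real sub-layer; any function that preserves the shape works.
    return np.maximum(h @ W, 0.0)

y = add_and_norm(x, toy_sublayer)
```

Because the input is added back before normalization, the sub-layer only has to learn a correction to `x`, which is one reason deep stacks of such layers remain trainable.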

Decoder

The decoder also stacks 6 layers, but each layer has one more sub-layer than in the encoder.

  1. masked multi-head self-attention
  2. multi-head attention over the encoder output
  3. position-wise feed-forward network

The first attention is masked because, while generating a translation, the model must not see future tokens that have not been generated yet. The decoder has to predict the next token from previously generated tokens only, so information to the right of the current position is hidden.
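A minimal sketch of such a mask follows. The paper describes masking as setting the disallowed softmax inputs to negative infinity; the function names and the all-zero toy scores here are illustrative assumptions.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean matrix: entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
# With equal scores, each row spreads its weight evenly over the allowed positions.
weights = masked_softmax(np.zeros((4, 4)), mask)
```

Row `i` of `weights` is nonzero only in columns `0..i`, so position `i` never draws information from positions to its right.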

The architecture through the paper's figures

Figure 1. Overall Transformer architecture. Source: Vaswani et al., Attention Is All You Need.

Figure 1, extracted from the PDF, shows the whole Transformer architecture in a single diagram. The encoder is on the left and the decoder on the right. Input tokens are converted to embeddings, positional encodings are added, and the result passes through the encoder stack. Each encoder layer contains multi-head attention and a feed-forward network, and each sub-layer is followed by Add & Norm.

The decoder on the right starts from the output embeddings with positional encodings added. Its first attention is masked multi-head attention, which prevents it from seeing future tokens that have not been generated yet. The decoder then attends over the encoder output, passes through a feed-forward network, and finally produces the next-token probabilities through a linear layer and softmax.

Figure 2, left. Scaled dot-product attention. The relationship between Q and K is turned into scores, which pass through the mask and softmax to form a weighted sum of V.

Figure 2, right. Multi-head attention. Several attention heads are computed in parallel, concatenated, and merged again by a linear layer.

Figure 2 breaks the inside of attention into smaller pieces. Scaled dot-product attention multiplies Q and K, applies the scale, applies a mask where needed, and passes the result through softmax to form a weighted sum of V. Multi-head attention runs this process in parallel across several heads, then concatenates the results and passes them through a linear layer. The figure makes it clear that "attention" is not a single abstract idea but a concrete computation made of matrix multiplication, scaling, masking, softmax, and a combination of values.

Attention computation

The paper describes attention in terms of queries, keys, and values. Roughly, a query is "what am I looking for", a key is "what features does each item have", and a value is "the information that is actually retrieved".

Scaled dot-product attention computes the dot products of the queries and keys to get relevance scores, divides them by the square root of the key dimension to keep their scale in check, and then applies softmax. Finally, the resulting weights are multiplied with the values and summed.

Put simply, each token scores how much it should attend to the other tokens in the sentence, and then mixes their information according to those scores.
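The paper writes this computation as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of that formula for a single, unbatched sequence is shown below; the function name, the optional boolean `mask` argument, and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every key to every query
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)    # hide disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4 (toy sizes)
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
output, attn_weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn_weights` sums to 1, so every output row is a weighted average of the value vectors.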

Multi-head attention

ํ•˜๋‚˜์˜ attention๋งŒ ์“ฐ๋ฉด token ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ํ•œ ๊ฐ€์ง€ ๊ด€์ ์œผ๋กœ๋งŒ ๋ณด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์€ query, key, value๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋‚ฎ์€ ์ฐจ์› ๊ณต๊ฐ„์œผ๋กœ ์„ ํ˜• ๋ณ€ํ™˜ํ•œ ๋’ค, ๊ฐ ๊ณต๊ฐ„์—์„œ attention์„ ๋ณ‘๋ ฌ๋กœ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด multi-head attention์ž…๋‹ˆ๋‹ค.

Different heads can look at different kinds of relationships. One head may be better at picking up syntactic links, another at long-distance dependencies, another at local patterns around a particular word. The attention visualizations near the end of the paper do show some heads responding to long-distance verb-phrase relations and to pronoun reference.
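A compact sketch of multi-head self-attention follows. The paper uses d_model = 512 and h = 8 heads of size 64; the tiny sizes here are for illustration only. Fusing the per-head projections into single `d_model x d_model` matrices and slicing them per head is an implementation choice assumed for this sketch, and biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (n, d_model); every weight matrix: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                            # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_o                                    # final linear layer

rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 16, 4   # toy sizes
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
```

Each head works in a `d_head`-dimensional subspace, so the total cost stays close to that of one full-width attention while allowing several attention patterns at once.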

How is position information injected?

Self-attention by itself does not know the order of its input. When the tokens arrive all at once, attention has no automatic way of telling which token came earlier and which came later.

So the Transformer adds a positional encoding to the embeddings. The paper uses fixed positional encodings built from sine and cosine functions. Giving each position sin/cos values of different periods lets the model learn not only the absolute position of a token but also relative position relationships. The paper also experimented with learned positional embeddings and reports that performance was similar.
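The paper defines the encoding as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch of that table is shown below; the function name and the toy sizes are assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed encodings; d_model must be even."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even feature indices
    pe[:, 1::2] = np.cos(angles)   # odd feature indices
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# The table is added elementwise to the token embeddings of matching shape.
```

Low feature indices oscillate quickly with position and high indices slowly, so each position gets a distinct pattern. The paper's stated motivation is that, for a fixed offset k, the encoding at pos + k is a linear function of the encoding at pos.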

Why self-attention?

When comparing self-attention with recurrent and convolutional layers, the paper looks at three criteria.

  1. computational complexity per layer
  2. the amount of computation that can be parallelized
  3. the path length between distant positions

Self-attention is cheaper than a recurrent layer when the sequence length is smaller than the representation dimension, and because all positions can be computed at once it parallelizes well. In addition, any two positions can be connected directly within a single layer, so the path for a long-range dependency is short.

Of course, self-attention compares every pair of tokens, so its cost grows quickly once sequences become very long. In its conclusion the paper itself says it plans to investigate local/restricted attention in order to handle large inputs such as images, audio, and video.
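The trade-off can be made concrete with the per-layer complexities the paper lists: O(n^2 * d) for self-attention, O(n * d^2) for a recurrent layer, and O(k * n * d^2) for a convolutional layer, where n is the sequence length, d the representation dimension, and k the kernel width. The sketch below only plugs numbers into those leading terms; the chosen values of n and k are illustrative assumptions, and constant factors are ignored.

```python
def per_layer_cost(n, d, k=3):
    """Leading-order operation counts per layer (constant factors dropped)."""
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

# Sentence-length input: n is well below d, so self-attention is the cheapest.
short = per_layer_cost(n=50, d=512)
# Very long input: the n^2 term dominates and self-attention becomes the most expensive
# of the three in this comparison.
long = per_layer_cost(n=5000, d=512)
```

Self-attention and the recurrent layer cost the same when n equals d, which is why the comparison flips between the two settings.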

Training and experimental results

The paper evaluated the model on the WMT 2014 English-to-German and English-to-French translation tasks. The base model was trained for 100,000 steps in about 12 hours on 8 NVIDIA P100 GPUs, and the big model for 300,000 steps in about 3.5 days.
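As a quick sanity check on these figures, the implied time per training step can be worked out directly. This is plain arithmetic on the numbers above; the helper name is made up for the example.

```python
def seconds_per_step(total_hours, steps):
    """Average wall-clock seconds per training step."""
    return total_hours * 3600 / steps

base = seconds_per_step(total_hours=12, steps=100_000)        # roughly 0.43 s per step
big = seconds_per_step(total_hours=3.5 * 24, steps=300_000)   # roughly 1.0 s per step
```

These values line up with the step times of about 0.4 seconds (base) and 1.0 seconds (big) that the paper reports.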

The results were strong by the standards of the time.

  • English-to-German: Transformer big์ด BLEU 28.4๋ฅผ ๊ธฐ๋กํ•ด ๊ธฐ์กด ์ตœ๊ณ  ๊ฒฐ๊ณผ๋ฅผ 2 BLEU ์ด์ƒ ๋„˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • English-to-French: Transformer big์ด BLEU 41.8์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํŠนํžˆ English-to-German์—์„œ๋Š” ๊ธฐ์กด ensemble ๋ชจ๋ธ๋“ค๋ณด๋‹ค๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

What matters is not only translation quality but also training cost. The paper stresses that the Transformer produced competitive results at a far lower training cost than the previous strong models.

The perspective the paper changed

After this paper, the default way of thinking about sequence models changed. Previously, sequential recurrence was considered the natural way to handle order. The Transformer showed that supplying order as separate positional information, and computing the relationships between tokens directly with attention, is strong enough as well.

This idea became the foundation of the later BERT and GPT families of models. Modern LLMs are of course not identical to the original Transformer of this paper: decoder-only architectures, pretraining objectives, scaling laws, instruction tuning, and many other changes have been layered on top. Even so, the basic turning point of an "attention-centered, parallelizable sequence model" can be said to have started with this paper.

Points not to confuse

The Transformer is not simply a model that uses a single attention. It is an architecture in which multi-head attention, feed-forward networks, residual connections, layer normalization, positional encoding, masking, and encoder-decoder attention are designed to work together.

Also, rather than proposing a general-purpose large language model from the outset, this paper reworked sequence transduction models with machine translation as its focus. The architecture was later extended to language modeling and large-scale pretraining, which led to today's LLMs.

Sources

Related documents