Self-attention์€ ํ•˜๋‚˜์˜ sequence ์•ˆ์— ์žˆ๋Š” token๋“ค์ด ์„œ๋กœ๋ฅผ ์ง์ ‘ ์ฐธ๊ณ ํ•˜๋ฉด์„œ ๊ฐ token์˜ ๋ฌธ๋งฅ ํ‘œํ˜„์„ ๋งŒ๋“œ๋Š” attention ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. Attention Is All You Need์—์„œ Transformer์˜ ํ•ต์‹ฌ ์—ฐ์‚ฐ์œผ๋กœ ์“ฐ์˜€๊ณ , Transformer ์•„ํ‚คํ…์ฒ˜๊ฐ€ RNN์ด๋‚˜ CNN ์—†์ด๋„ sequence๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋งŒ๋“  ์ค‘์‹ฌ ์•„์ด๋””์–ด์ž…๋‹ˆ๋‹ค.

์‰ฝ๊ฒŒ ๋งํ•˜๋ฉด self-attention์€ ๋ฌธ์žฅ ์•ˆ์˜ ๊ฐ ๋‹จ์–ด๊ฐ€ โ€œ์ง€๊ธˆ ๋‚˜๋ฅผ ์ดํ•ดํ•˜๋ ค๋ฉด ๊ฐ™์€ ๋ฌธ์žฅ ์•ˆ์˜ ์–ด๋–ค ๋‹จ์–ด๋ฅผ ์–ผ๋งˆ๋‚˜ ๋ด์•ผ ํ•˜๋Š”๊ฐ€โ€๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

ํ•œ ์ค„๋กœ ๋งํ•˜๋ฉด

Self-attention์€ ๊ฐ™์€ ์ž…๋ ฅ ์•ˆ์˜ token๋ผ๋ฆฌ ์„œ๋กœ์˜ ์ค‘์š”๋„๋ฅผ ๊ณ„์‚ฐํ•ด ์ •๋ณด๋ฅผ ์„ž๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค.

์™œ ํ•„์š”ํ•œ๊ฐ€

๋ฌธ์žฅ์€ ์ˆœ์„œ์™€ ๊ด€๊ณ„๋ฅผ ํ•จ๊ป˜ ๊ฐ€์ง‘๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋Œ€๋ช…์‚ฌ๊ฐ€ ๋ฌด์—‡์„ ๊ฐ€๋ฆฌํ‚ค๋Š”์ง€, ๋™์‚ฌ์™€ ๋ชฉ์ ์–ด๊ฐ€ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐ๋˜๋Š”์ง€, ๋ฌธ์žฅ์˜ ์•ž๋ถ€๋ถ„์ด ๋’ท๋ถ€๋ถ„์„ ์–ด๋–ป๊ฒŒ ๋ฐ”๊พธ๋Š”์ง€๋Š” ๋‹จ์–ด ํ•˜๋‚˜๋งŒ ๋ณด๊ณ  ์•Œ๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

๊ณผ๊ฑฐ์˜ RNN ๊ณ„์—ด ๋ชจ๋ธ์€ ์•ž์—์„œ ๋’ค๋กœ hidden state๋ฅผ ๋„˜๊ธฐ๋ฉฐ ์ด๋Ÿฐ ๊ด€๊ณ„๋ฅผ ๋‹ค๋ค˜์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋ฐฉ์‹์€ ๊ณ„์‚ฐ์ด ์ˆœ์ฐจ์ ์ด์–ด์„œ ๋ณ‘๋ ฌํ™”๊ฐ€ ์–ด๋ ต๊ณ , ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ token ์‚ฌ์ด์˜ ์ •๋ณด๊ฐ€ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋ฅผ ์ง€๋‚˜์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Self-attention์€ ๊ฐ™์€ layer ์•ˆ์—์„œ ๋ชจ๋“  token ์Œ์„ ์ง์ ‘ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๋‘ token๋„ ํ•œ ๋ฒˆ์˜ attention layer ์•ˆ์—์„œ ์—ฐ๊ฒฐ๋  ์ˆ˜ ์žˆ๊ณ , ์—ฌ๋Ÿฌ ์œ„์น˜๋ฅผ ๋™์‹œ์— ๊ณ„์‚ฐํ•˜๊ธฐ ์‰ฌ์›Œ์ง‘๋‹ˆ๋‹ค.

์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋‚˜

Transformer์˜ attention์€ ๋ณดํ†ต query, key, value๋ผ๋Š” ์„ธ ์ข…๋ฅ˜์˜ ๋ฒกํ„ฐ๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

  • Query: ์ง€๊ธˆ token์ด ์ฐพ๊ณ  ์‹ถ์€ ์ •๋ณด
  • Key: ๋‹ค๋ฅธ token๋“ค์ด ๊ฐ€์ง„ ์‹๋ณ„ ์ •๋ณด
  • Value: ์‹ค์ œ๋กœ ๊ฐ€์ ธ์™€ ์„ž์„ ๋‚ด์šฉ

๊ฐ token์˜ query๋Š” ๋‹ค๋ฅธ token๋“ค์˜ key์™€ ๋น„๊ต๋ฉ๋‹ˆ๋‹ค. ์ด ๋น„๊ต ์ ์ˆ˜๊ฐ€ โ€œ์–ผ๋งˆ๋‚˜ ์ฐธ๊ณ ํ•  ๊ฒƒ์ธ๊ฐ€โ€๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ , softmax๋ฅผ ๊ฑฐ์ณ ๊ฐ€์ค‘์น˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์ด ๊ฐ€์ค‘์น˜๋งŒํผ value๋ฅผ ์„ž์œผ๋ฉด attention ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค.

Transformer ๋…ผ๋ฌธ์ด ์‚ฌ์šฉํ•œ scaled dot-product attention์€ query์™€ key์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๊ณ , key ์ฐจ์›์˜ ์ œ๊ณฑ๊ทผ์œผ๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ’์˜ ํฌ๊ธฐ๋ฅผ ์•ˆ์ •ํ™”ํ•œ ๋’ค, softmax๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ฐจ์›์ด ์ปค์งˆ์ˆ˜๋ก ๋‚ด์  ๊ฐ’์ด ์ปค์ ธ softmax๊ฐ€ ํ•œ์ชฝ์œผ๋กœ ์ง€๋‚˜์น˜๊ฒŒ ์ ๋ฆด ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— scaling์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Multi-head attention๊ณผ์˜ ๊ด€๊ณ„

Self-attention์€ โ€œ๊ฐ™์€ sequence ์•ˆ์—์„œ ์„œ๋กœ๋ฅผ ์ฐธ๊ณ ํ•œ๋‹คโ€๋Š” ๊ด€๊ณ„๋ฅผ ๋งํ•ฉ๋‹ˆ๋‹ค. Multi-head attention์€ ์ด attention์„ ์—ฌ๋Ÿฌ head์—์„œ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌํ˜„ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

ํ•œ head๋งŒ ์žˆ์œผ๋ฉด token ๊ด€๊ณ„๋ฅผ ํ•œ ๊ด€์ ์œผ๋กœ๋งŒ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Multi-head attention์€ ์—ฌ๋Ÿฌ ์ž‘์€ ํ‘œํ˜„ ๊ณต๊ฐ„์—์„œ attention์„ ๋‚˜๋ˆ„์–ด ๊ณ„์‚ฐํ•œ ๋’ค ๋‹ค์‹œ ํ•ฉ์นฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์–ด๋–ค head๋Š” ๊ฐ€๊นŒ์šด ๋‹จ์–ด ๊ด€๊ณ„๋ฅผ, ์–ด๋–ค head๋Š” ๋จผ ์˜์กด์„ฑ์„, ์–ด๋–ค head๋Š” ๋Œ€๋ช…์‚ฌ ์ฐธ์กฐ๋‚˜ ๋ฌธ๋ฒ•์  ์—ญํ• ์„ ๋” ์ž˜ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฆ‰ self-attention์€ ์–ด๋””๋ฅผ ์ฐธ๊ณ ํ•˜๋Š”๊ฐ€์— ๋Œ€ํ•œ ๊ฐœ๋…์ด๊ณ , multi-head attention์€ ์—ฌ๋Ÿฌ ๊ด€์ ์˜ attention์„ ๋™์‹œ์— ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ์‹์— ๊ฐ€๊น์Šต๋‹ˆ๋‹ค.

Transformer์—์„œ์˜ ์—ญํ• 

Transformer encoder์—์„œ๋Š” ๊ฐ token์ด ์ž…๋ ฅ sequence ์•ˆ์˜ ๋ชจ๋“  token์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฒˆ์—ญ ์›๋ฌธ์„ ์ฝ๋Š” encoder ์ž…์žฅ์—์„œ๋Š” ๋ฌธ์žฅ ์ „์ฒด์˜ ๊ด€๊ณ„๋ฅผ ํ•œ๊บผ๋ฒˆ์— ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

Decoder์—์„œ๋Š” ์ƒํ™ฉ์ด ์กฐ๊ธˆ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ถœ๋ ฅ ๋ฌธ์žฅ์„ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ƒ์„ฑํ•  ๋•Œ ์•„์ง ์ƒ์„ฑํ•˜์ง€ ์•Š์€ ๋ฏธ๋ž˜ token์„ ๋ณด๋ฉด ์•ˆ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ decoder์˜ self-attention์—๋Š” mask๊ฐ€ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค. ์ด masked self-attention์€ ํ˜„์žฌ ์œ„์น˜๊ฐ€ ์ด์ „ token๋“ค๋งŒ ์ฐธ๊ณ ํ•˜๋„๋ก ๋งŒ๋“ญ๋‹ˆ๋‹ค.

์ด ๊ตฌ๋ถ„ ๋•๋ถ„์— Transformer๋Š” ์ž…๋ ฅ์„ ๋„“๊ฒŒ ์ดํ•ดํ•˜๋ฉด์„œ๋„, ์ถœ๋ ฅ ์ƒ์„ฑ์—์„œ๋Š” autoregressiveํ•œ ์ˆœ์„œ๋ฅผ ์ง€ํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์™œ ํ™•์žฅ์— ์œ ๋ฆฌํ–ˆ๋‚˜

Self-attention์€ sequence ๊ธธ์ด๊ฐ€ ์งง๊ฑฐ๋‚˜ ์ค‘๊ฐ„ ์ •๋„์ผ ๋•Œ ๋ณ‘๋ ฌํ™”์— ๊ฐ•ํ•ฉ๋‹ˆ๋‹ค. RNN์ฒ˜๋Ÿผ token์„ ํ•˜๋‚˜์”ฉ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•˜์ง€ ์•Š์•„๋„ ๋˜๋ฏ€๋กœ GPU/TPU ๊ฐ™์€ ๋ณ‘๋ ฌ ํ•˜๋“œ์›จ์–ด์™€ ์ž˜ ๋งž์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ๊ธด ์˜์กด์„ฑ์„ ๋ฐฐ์šฐ๊ธฐ ์œ„ํ•œ path length๊ฐ€ ์งง์Šต๋‹ˆ๋‹ค. RNN์—์„œ๋Š” ์•ž์ชฝ token์˜ ์ •๋ณด๊ฐ€ ๋’ค์ชฝ token์— ๋„๋‹ฌํ•˜๋ ค๋ฉด ์—ฌ๋Ÿฌ recurrent step์„ ์ง€๋‚˜์•ผ ํ•˜์ง€๋งŒ, self-attention์—์„œ๋Š” ํ•œ layer ์•ˆ์—์„œ ์ง์ ‘ ์—ฐ๊ฒฐ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ ์ด Transformer๊ฐ€ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ์˜ ๊ธฐ๋ณธ ๊ตฌ์กฐ๊ฐ€ ๋˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•œ๊ณ„

Self-attention์˜ ๋Œ€ํ‘œ์  ํ•œ๊ณ„๋Š” ๊ณ„์‚ฐ ๋น„์šฉ์ž…๋‹ˆ๋‹ค. ๋ชจ๋“  token ์Œ์„ ๋น„๊ตํ•˜๋ฏ€๋กœ sequence ๊ธธ์ด๊ฐ€ n์ผ ๋•Œ ๋น„์šฉ์ด ๋Œ€๋žต nยฒ์— ๋น„๋ก€ํ•ฉ๋‹ˆ๋‹ค. ๋ฌธ์žฅ์ด ๊ธธ์–ด์ง€๊ฑฐ๋‚˜ ๊ธด ๋ฌธ์„œ, ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€, ๊ธด ์˜ค๋””์˜ค์ฒ˜๋Ÿผ ์ž…๋ ฅ์ด ์ปค์ง€๋ฉด ์ด ๋น„์šฉ์ด ๋ถ€๋‹ด์ด ๋ฉ๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ ์ดํ›„ ์—ฐ๊ตฌ์—์„œ๋Š” sparse attention, local attention, linear attention, chunking, retrieval, long-context architecture์ฒ˜๋Ÿผ attention ๋น„์šฉ์„ ์ค„์ด๊ฑฐ๋‚˜ ๊ธด ์ž…๋ ฅ์„ ๋‚˜๋ˆ„์–ด ๋‹ค๋ฃจ๋Š” ๋ฐฉ๋ฒ•์ด ๊ณ„์† ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ—ท๊ฐˆ๋ฆฌ์ง€ ๋ง์•„์•ผ ํ•  ์ 

Self-attention์€ โ€œ๋ชจ๋ธ์ด ์žฅ๊ธฐ ๊ธฐ์–ต์„ ๊ฐ–๋Š”๋‹คโ€๋Š” ๋œป์ด ์•„๋‹™๋‹ˆ๋‹ค. Self-attention์€ ํ˜„์žฌ ์ž…๋ ฅ ์•ˆ์˜ token ๊ด€๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๊ณผ๊ฑฐ ๋Œ€ํ™”๋‚˜ ์ž‘์—… ๊ธฐ๋ก์„ ์žฅ๊ธฐ์ ์œผ๋กœ ์ €์žฅํ•˜๊ณ  ๋‹ค์‹œ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๋ฌธ์ œ๋Š” Agent memory consolidation ๊ฐ™์€ ๋ณ„๋„ memory system์˜ ์˜์—ญ์ž…๋‹ˆ๋‹ค.

๋˜ํ•œ self-attention์ด ์ˆœ์„œ๋ฅผ ์™„์ „ํžˆ ๋ฒ„๋ฆฐ๋‹ค๋Š” ๋œป๋„ ์•„๋‹™๋‹ˆ๋‹ค. Self-attention ์ž์ฒด์—๋Š” ์ˆœ์„œ ์ •๋ณด๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— Transformer๋Š” positional encoding์ด๋‚˜ position embedding์œผ๋กœ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•ฉ๋‹ˆ๋‹ค.

๊ด€๋ จ ๋ฌธ์„œ