Overview

detection paper saga

SSD is by now a fairly old paper, but I want to review it briefly here. Its structure is very intuitive, and many later architectures build on it. It was published after YOLO v1 and is likewise a 1 stage detector, yet its architecture is quite different and in some ways closer to the RCNN family. In short, SSD produces feature maps at several different spatial sizes. On each feature map it generates default boxes with different aspect ratios and scales, and the model predicts coordinates and classes relative to these boxes. Finally, the final bounding boxes are produced from these values.

Model

Image detection is the problem of finding the objects contained in a given image. Here, "finding" means determining each object's location and size and identifying what it is. In other words, we need a function that takes image pixels as input and outputs a class score indicating which class each object belongs to, together with the object's offsets (x, y, w, h).

SSD Architecture

The SSD architecture is shown above. It is simple enough to understand intuitively. First, SSD uses the VGG-16 architecture without its FC layers as the base network, so that transfer learning is possible. The extra layers that follow are CONV layers that extract features at various sizes, and detection is performed on each of these feature maps.

Convolution Predictors

In the final prediction stage, SSD boldly drops the FC layers and predicts with CONV layers only. As a result, it is very fast.

Multi-scale Feature Maps

Predicting from multiple feature maps lets the model get the right answer for objects of various sizes. The deeper a feature map is, the more abstract the information it holds. Predicting from several feature maps, so that the result is less sensitive to scale, is a way to capture an object's features regardless of its size.

Let's think about this again. In the architecture above, the CONV filters applied to the deeper feature maps have the same width and height as those applied to the earlier ones. Because the deeper feature maps are spatially smaller, the same filter covers a larger part of the original image, so the deeper layers predict from a wider region. As the later results also show, the earlier feature maps give better detection results for smaller objects, and the later ones give better results for larger objects.

Result of Multi scale Feature maps

In this result, the small object (the cat) is detected on the 8x8 feature map and the large object (the dog) on the 4x4 feature map. This is quite natural: as part of the overall process described later, the results from each feature map are filtered by an IoU threshold. The 8x8 feature map also makes predictions through its default boxes, but those boxes are much smaller than the dog's gt box, so their IoU is low and they are filtered out. Large objects are therefore predicted on the feature maps with a small spatial size.

Default Boxes and Aspect Ratios

The way of Prediction in SSD

In the architecture above, each feature map is annotated with the number of values it predicts, written in terms of the number of classes. Let's see what this means.

classifier : CONV: 3x3(6x(classes+4))

This notation says: use a CONV layer as the classifier, with 3x3 filters (the input channel count is presumably omitted because it has to match the channels of the preceding layer anyway). At each feature map location, take the 6 default boxes (boxes defined in advance) and, for each of them, predict one score per class plus the 4 bounding box values (x, y, w, h).

Let's take an example. Suppose there are 6 default boxes whose shapes were fixed in advance, and 21 object classes. Each of the 6 boxes needs 21+4 predicted values, so the convolution needs 6 x (21+4) = 150 output channels (filters).
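The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names `predictor_channels` and `total_default_boxes` are made up for illustration, and the per-layer layout in `SSD300_MAPS` (feature map size and number of default boxes per location) is the SSD300 configuration as I recall it from the paper, not something stated in this post.

```python
def predictor_channels(num_default_boxes, num_classes):
    """Output channels of the 3x3 prediction CONV on one feature map."""
    # Each default box needs one score per class plus 4 offsets (x, y, w, h).
    return num_default_boxes * (num_classes + 4)


def total_default_boxes(feature_maps):
    """Total default boxes over all (map_size, boxes_per_location) pairs."""
    return sum(size * size * boxes for size, boxes in feature_maps)


# ASSUMPTION: SSD300 layout recalled from the paper, not taken from this post.
SSD300_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

print(predictor_channels(6, 21))         # 150
print(total_default_boxes(SSD300_MAPS))  # 8732
```

Under that assumed layout, the network scores 8732 default boxes per image in a single forward pass.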

Process

Total Process of SSD

  1. Pass a 300x300x3 image through VGG-16 to produce a 38x38x512 feature map.
  2. Generate further feature maps of various sizes, shrinking the size step by step.
  3. On each feature map, apply the predefined default boxes and predict the output values (class scores and box offsets).
    • Here, IoU (Jaccard overlap) is used as the metric for finding the best match between each default box and the gt boxes.
    • A threshold is applied, and matches below it are discarded.
  4. Gather all outputs that pass the threshold, run NMS on them, and produce the final result.
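Steps 3 and 4 rely on two small building blocks, IoU and NMS, which can be sketched in plain Python as below. Treat this as an illustration under assumptions rather than the authors' implementation: boxes are taken to be `(x_min, y_min, x_max, y_max)` tuples, the function names are hypothetical, the thresholds 0.5 and 0.45 are just default arguments chosen for the sketch, and the NMS here is class-agnostic for brevity.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return (default_index, gt_index) pairs whose IoU exceeds the threshold."""
    return [(d, g)
            for d, dbox in enumerate(default_boxes)
            for g, gbox in enumerate(gt_boxes)
            if iou(dbox, gbox) > threshold]


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 2, 2), (0, 0, 2, 2.1), (5, 5, 6, 6)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first almost completely, so it is suppressed, while the third box does not overlap anything and survives.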

Training

How do we train on the bounding boxes predicted this way? Even though SSD is a 1 stage detector, its objective function is almost identical to that of Faster RCNN.

Lconf : classification loss

Lloc : localization loss

  • N : the number of matched default bounding boxes
  • Lconf : classification loss → cross entropy
  • Lloc : localization loss → smooth L1 loss
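For reference, the overall objective that combines these terms can be written as below. This is my transcription of the form given in the SSD paper, using the symbols from the list above; the weight \alpha, the indicator x, the class confidences c, the predicted boxes l, and the gt boxes g are notation from the paper rather than from this post.

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)

\mathrm{smooth}_{L1}(z) =
\begin{cases}
0.5\, z^{2} & \text{if } |z| < 1 \\
|z| - 0.5   & \text{otherwise}
\end{cases}
```

When no default box is matched (N = 0), the loss is simply taken to be 0.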

Choosing Scales and Aspect Ratios for Default Boxes

scale, aspect ratio for default boxes

The w and h of each default box are generated from the formula above. Here m is the number of feature maps used for detection (equivalently, the number of detectors), which is 6 in this case. Each feature map gets its own scale value s_k, and the predefined aspect ratios then give Wk and Hk. For aspect ratio 1, an additional box with scale s'_k = sqrt(s_k * s_(k+1)) is added. This yields 6 default boxes per location in total.
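As a rough sketch of how these shapes come out, the snippet below computes the scale for the k-th feature map and the resulting 6 (w, h) pairs. Several things here are assumptions on my part: the linear scale rule and the values s_min = 0.2 and s_max = 0.9 are what I recall from the paper, the aspect ratio set {1, 2, 3, 1/2, 1/3} is likewise recalled from the paper, and for the last map the sketch simply extrapolates s_(k+1) with the same rule. The function names are hypothetical.

```python
import math


def scale(k, m=6, s_min=0.2, s_max=0.9):
    """Scale s_k for the k-th (1-based) of m feature maps, linearly spaced."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


def default_box_shapes(k, m=6, aspect_ratios=(1.0, 2.0, 3.0, 1 / 2, 1 / 3)):
    """(w, h) pairs, relative to image size, for one location on map k."""
    s_k = scale(k, m)
    # w = s_k * sqrt(a), h = s_k / sqrt(a) keeps the area at s_k^2.
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box for aspect ratio 1 with scale sqrt(s_k * s_(k+1)).
    # ASSUMPTION: for k == m, s_(k+1) is extrapolated past s_max.
    s_extra = math.sqrt(s_k * scale(k + 1, m))
    shapes.append((s_extra, s_extra))
    return shapes


for w, h in default_box_shapes(k=1):
    print(f"w={w:.3f}, h={h:.3f}")
```

Each call returns 6 shapes, which matches the 6 default boxes per location described above.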

Hard Negative Mining

Detection has to deal with its inherent class imbalance problem. Far more default boxes end up as background than as objects, so the proportion of background (negative) results, which would otherwise hurt training, has to be reduced.

So the authors sort the negative samples by how strongly the model mistook them for something other than background, and keep only the hardest ones so that the ratio of negative to positive samples is 3:1.
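The selection step can be sketched as follows. This is a minimal illustration, assuming that "hardness" is measured by the per-box confidence loss and that each default box already carries a positive/negative flag from the matching step; `hard_negative_mining` and its parameters are hypothetical names, not taken from the paper's code.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return (positive_indices, kept_negative_indices).

    Negatives are sorted by confidence loss, highest first, and only the
    top neg_pos_ratio * num_positives of them are kept for training.
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives are the ones the classifier gets most wrong.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return positives, negatives[: neg_pos_ratio * len(positives)]


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
flags = [True, False, False, False, False, False, False]
print(hard_negative_mining(losses, flags))  # ([0], [1, 4, 6])
```

With one positive box, only the three negatives with the largest loss are kept, and the remaining easy background boxes do not contribute to the loss.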

Conclusions

  • Single Shot object detector for multiple categories
  • Handles objects of different scales through multiple convolutional feature maps.
  • The more default boxes there are, the better the results.
  • It is a very simple model.

Reference