Core Idea

Let's build a model that can learn feature extraction, classification, and bounding box regression all at once!

Fast R-CNN starts from an attempt to overcome the limitations of the earlier SPP Net. SPP Net had two limitations: 1) it is a multi-stage model, and 2) only its FC layers can be trained.

Algorithm

  1. Extract a feature map from a pretrained model.
  2. Apply RoI pooling to each RoI found by Selective Search. This yields a fixed-size feature vector.
  3. The feature vector passes through FC layers and then splits into two branches.
  4. One branch goes through softmax and classifies which object the RoI contains.
  5. The other branch performs bounding box regression to adjust the position of the box found by Selective Search.

The key contribution is that a multi-stage model was restructured into an end-to-end model. As a result, it improved inference speed, accuracy, and training speed.

RoI Pooling

The idea behind RoI pooling is similar to the SPP Net we saw earlier. SPP Net took the feature map produced by a pretrained model, passed it through a pyramid of pooling filters, and vectorized the result to obtain a vector of fixed length. RoI pooling is a slight modification of this idea.

  1. Region proposals found by Selective Search are mapped onto the feature map.
  2. RoI pooling is applied to each proposal to produce a small feature map of fixed shape.

RoI pooling converts a region proposal into an output of fixed shape. If we want an (H x W) feature map as output, we divide the proposal into an H x W grid of cells and apply max pooling within each cell. This always gives a result of the same size, whatever the size of the proposal.
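
As a rough illustration of the grid-then-max idea described above, here is a minimal pure-Python sketch for a single-channel feature map. The function name `roi_max_pool`, the inclusive `(r0, c0, r1, c1)` RoI format, and the floor/ceil rule for cell boundaries are assumptions made for this example, not details taken from the paper or from any library.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel).
    roi: (r0, c0, r1, c1), inclusive corners in feature-map coordinates
         (a hypothetical format chosen for this sketch).
    """
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0 + 1, c1 - c0 + 1
    pooled = []
    for i in range(out_h):
        # Cell rows: floor for the start, ceil for the end, so every RoI row
        # is covered even when roi_h is not a multiple of out_h.
        rs = r0 + (i * roi_h) // out_h
        re = r0 + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            cs = c0 + (j * roi_w) // out_w
            ce = c0 + ((j + 1) * roi_w + out_w - 1) // out_w
            # Max pooling inside this sub-window.
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map holding the values 0..23 (made-up data).
fm = [[r * 6 + c for c in range(6)] for r in range(4)]
full = roi_max_pool(fm, (0, 0, 3, 5), 2, 2)     # [[8, 11], [20, 23]]
partial = roi_max_pool(fm, (1, 1, 3, 5), 2, 2)  # [[15, 17], [21, 23]]
```

Both RoIs have different sizes (4 x 6 and 3 x 5), yet both come out as 2 x 2, which is the property the FC layers rely on.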

Multi Task Loss

While studying deep learning, the part I found newest and most enjoyable was the loss function. Object detection is fundamentally a task that requires bounding box regression and classification at the same time, which is why earlier approaches were multi-stage. Fast R-CNN is the first model in this series to tie the two tasks into one.

We extracted a feature map from the image, took RoI proposals on this feature map, and built a feature vector for each RoI through RoI pooling. Now we apply classification and bounding box regression to this vector, obtain a loss for each, and backpropagate them to train the whole model. Let's look at the loss function that reflects both tasks.

L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v)

Let's go through the variables one by one. First, p is the set of K + 1 probabilities produced by softmax (a discrete probability distribution). There are K + 1 of them because a background class (no object) is added to the K object classes. u is the ground-truth class label of the RoI. The indicator [u \ge 1] is 1 for an object class and 0 for background (u = 0), so background RoIs contribute no localization loss.

Next comes bounding box regression. Running regression on the fixed-size feature map returns, for each of the K object classes, parameters t^k = (t^k_x, t^k_y, t^k_w, t^k_h) that adjust x, y, w, and h. In words: if the RoI is class 1, adjust (x, y, w, h) by (t^1_x, t^1_y, t^1_w, t^1_h); if it is class 2, ... (omitted). What we want is a loss function that corrects these predictions, so out of all these results we take only t^u, the one that belongs to the ground-truth class u. v is the ground-truth bounding box adjustment.

Now let's look at each loss function. The classification loss is the log loss, L_{cls}(p, u) = -\log p_u: the worse the prediction for the true class, the larger the penalty.
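
To make the "bigger miss, bigger penalty" behaviour concrete, here is a small sketch of softmax followed by the log loss for true class u, which is -log p_u. The raw scores and the helper names `softmax` and `cls_loss` are made up for illustration; index 0 is assumed to be the background class.

```python
import math


def softmax(scores):
    """Turn raw class scores into a discrete probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_loss(p, u):
    """Log loss for the true class u: -log p_u."""
    return -math.log(p[u])


# Hypothetical scores for K + 1 = 3 classes (index 0 = background).
scores = [2.0, 0.5, -1.0]
p = softmax(scores)

loss_if_true_is_0 = cls_loss(p, 0)  # model favours class 0: small loss
loss_if_true_is_2 = cls_loss(p, 2)  # model gave class 2 little mass: large loss
```

The loss goes to 0 as p_u approaches 1 and grows without bound as p_u approaches 0.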

The localization loss is as follows.

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} smooth_{L_1}(t^u_i - v_i)

For each of the four offsets, the difference between the predicted adjustment and the ground-truth adjustment is passed through smooth L1, and the results are summed.

smooth_{L_1}(x) = \begin{cases} 0.5x^2 & \mbox{if } \left| x \right| < 1 \\ \left| x \right| - 0.5 & \mbox{otherwise} \end{cases}

The authors say that during their experiments many outliers differed greatly from the label values, and that using the L2 loss as is, which reacts sensitively to such outliers, can lead to exploding gradients. To keep this under control they used a custom loss function. Because smooth L1 grows only linearly for |x| >= 1, the magnitude of its gradient never exceeds 1.
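
The pieces above can be put together in a short sketch. It assumes that the localization loss sums smooth L1 over the four offset differences, as the text describes, and that the predicted offsets are stored in a dict keyed by object class index (1..K). The function names, the dict layout, and the sample numbers are all hypothetical; `lam=1.0` is only a default chosen for this example.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def smooth_l1_grad(x):
    """Derivative of smooth_l1; its magnitude never exceeds 1."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0


def loc_loss(t_u, v):
    """Sum of smooth L1 over the (x, y, w, h) offset differences."""
    return sum(smooth_l1(t - g) for t, g in zip(t_u, v))


def multi_task_loss(p, u, t, v, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc.

    p: K + 1 class probabilities, u: true class (0 = background),
    t: {class index: (tx, ty, tw, th)} predicted offsets,
    v: ground-truth offsets for this RoI.
    """
    l_cls = -math.log(p[u])
    l_loc = loc_loss(t[u], v) if u >= 1 else 0.0  # background has no box
    return l_cls + lam * l_loc


# Made-up example with K = 2 object classes.
p = [0.1, 0.7, 0.2]
t = {1: (0.5, 0.0, 2.0, 0.0), 2: (0.0, 0.0, 0.0, 0.0)}
v = (0.0, 0.0, 0.0, 0.0)

loss_object = multi_task_loss(p, 1, t, v)      # -log(0.7) + 0.125 + 1.5
loss_background = multi_task_loss(p, 0, t, v)  # -log(0.1), no box term

# For an outlier residual of 10, the L2 gradient (2 * x) is 20,
# while the smooth L1 gradient stays at 1.
outlier_grad = smooth_l1_grad(10.0)
```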

Backpropagation through RoI Pooling Layer

Now all that is left is to train the network. Recall that the earlier SPP Net extracted a feature map, built FC layers on the vectors that came out of SPP, and trained only that stage (fine-tuning). In this paper the authors focus on the fact that the CNN, which plays the most important role of extracting image features, could not be trained. That is, they decide how far down fine-tuning should go, and they verify theoretically that learning actually proceeds (that backpropagation passes through) when fine-tuning goes that deep.

{\partial L \over \partial x_i} = \sum_r \sum_j [i = i^*(r, j)] {\partial L \over \partial y_{rj}}

x_i is a single feature in the feature map extracted by the CNN, and it is a real number. Taking the partial derivative of the total loss L with respect to this feature value gives the gradient for x_i, which is what backpropagation needs. Next, to apply RoI pooling to an RoI on the feature map, the RoI is divided into an H x W grid. The grid cells are called sub-windows, and j in the equation above is the index of a sub-window. y_{rj} is the output value finally obtained from RoI pooling for sub-window j of RoI r, and it too is a real number.

For x_i to affect the final prediction, it must be the maximum value in at least one sub-window of an RoI it belongs to. i^*(r, j) is the index of the maximum feature value, given RoI r and sub-window index j.

That is why the equation contains [i = i^*(r, j)]: it returns 1 when the index of the maximum feature equals the feature we are computing the gradient for, and 0 otherwise. We already have \partial L / \partial y_{rj} from the layers above, so summing it over every (r, j) where the indicator is 1 gives the gradient with respect to x_i.
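
The routing rule in the equation can be sketched in a few lines. The feature map is flattened into a list `x`, and each sub-window is given as the list of indices of `x` that it covers. This layout, the function names, and the numbers are assumptions made for the example. The forward pass records i*(r, j); the backward pass adds each upstream gradient to the feature that won the max.

```python
def roi_pool_forward(x, windows):
    """windows: {(r, j): [indices of x inside sub-window j of RoI r]}.

    Returns the pooled outputs y[(r, j)] and the argmax map i*(r, j).
    """
    y, argmax = {}, {}
    for key, idxs in windows.items():
        best = max(idxs, key=lambda i: x[i])  # index of the max feature
        argmax[key] = best
        y[key] = x[best]
    return y, argmax


def roi_pool_backward(n, argmax, grad_y):
    """dL/dx_i = sum over (r, j) of [i == i*(r, j)] * dL/dy_rj."""
    grad_x = [0.0] * n
    for key, i_star in argmax.items():
        grad_x[i_star] += grad_y[key]
    return grad_x


# Four features; RoI 0 has two sub-windows, RoI 1 overlaps it with one.
x = [1.0, 5.0, 3.0, 2.0]
windows = {(0, 0): [0, 1], (0, 1): [2, 3], (1, 0): [1, 2]}
y, argmax = roi_pool_forward(x, windows)

# Hypothetical upstream gradients dL/dy_rj.
grad_y = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 1.0}
grad_x = roi_pool_backward(len(x), argmax, grad_y)  # [0.0, 1.5, 0.25, 0.0]
```

Feature x[1] is the max in two sub-windows from two different RoIs, so it receives both gradients (0.5 + 1.0), while features that never win a max receive zero.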

To sum up, the multi-task loss obtained earlier can be backpropagated through the RoI pooling layer, so fine-tuning can reach the CNN layers. The authors show experimentally that fine-tuning down to the CNN layers actually helped improve performance.

That experiment varies the depth of fine-tuning and measures the change in performance. The deeper into the CNN the training went, the better the performance, while test time barely changed. In other words, fine-tuning the CNN layers for object detection was the key to the performance gain.

Significance

  1. Proposes an end-to-end model
  2. Simplifies the training stages
  3. Improves accuracy and performance

Limitations

  1. Selective Search is used for region proposals
    • It runs only on the CPU, which creates a bottleneck
    • This step takes up the most time during inference
