Overview

structure of mask RCNN

  • Facebook AI Research (FAIR), Kaiming He, 24 Jan 2018
  • Marr Prize at ICCV 2017

Mask R-CNN is a framework for object instance segmentation. It goes beyond semantic segmentation by also telling individual instances apart. It is simple to train and adds only a small overhead to Faster R-CNN, running at about 5 fps. On the COCO dataset it achieved the best results at the time in instance segmentation, bounding-box object detection, and person keypoint detection.

Key idea

Put a mask on each box that Faster R-CNN detects!

Instance segmentation combines two tasks.

  1. object detection
    • Classifying objects and locating them with bounding boxes.
  2. semantic segmentation
    • Classifying every pixel into a fixed set of categories, without distinguishing object instances.

FCN, an important earlier paper in segmentation, had to deal with the following three points.

  1. Classification at the pixel level
  2. Hence per-pixel softmax values have to be computed
  3. Multiple instances have to be taken into account

Mask R-CNN, however, reuses Faster R-CNN as is, so these points change somewhat.

  1. Classification at the pixel level → the bounding box has already been classified
  2. Hence per-pixel softmax values have to be computed → inside the bounding box it is enough to decide whether each pixel is object or not (sigmoid)
  3. Multiple instances have to be taken into account → a bounding box has already been drawn for each instance

Only a mask FCN is added alongside the class and box branches.

As a result, the only thing left to do in this problem is masking. Hence the name of the paper, Mask R-CNN.
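
The difference between the per-pixel softmax of an FCN and the per-pixel sigmoid used here can be sketched in a few lines of NumPy. This is only an illustration: the function names, the three classes and the 4x4 RoI size are assumptions made for the example, not anything taken from the paper's code.

```python
import numpy as np

def softmax_mask(logits):
    """FCN style: classes compete at every pixel. logits has shape (K, H, W)."""
    z = logits - logits.max(axis=0, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def sigmoid_mask(logits):
    """Mask R-CNN style: an independent object / not-object score per pixel."""
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))  # 3 hypothetical classes, 4x4 RoI

p_soft = softmax_mask(logits)  # sums to 1 across classes at each pixel
p_sig = sigmoid_mask(logits)   # each class channel is scored on its own
```

With the softmax, raising one class's score at a pixel necessarily lowers the others. With the sigmoid, every channel is an independent binary decision.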

Equivariance

A change in the input produces a corresponding change in the output.

Invariance vs. Equivariance

Classification only has to produce a label, so it should be invariant: moving the object within the image must not change the answer. In segmentation, on the other hand, the output has to match the original image in size and position, so the problem has to be solved with equivariance: if the object moves, the mask should move with it. Since convolution is translation-equivariant, the authors built this network from convolutions.
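
A small NumPy check can make the two notions concrete. The sketch below assumes wrap-around (circular) borders so that the shift is exact, and the helper name `circular_conv2d` is made up for this example. Real networks use padding, so the property only holds approximately near the borders.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """Cross-correlation with wrap-around borders, so shifts are exact."""
    out = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * np.roll(image, shift=(-dy, -dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
shift = (2, 3)

# Equivariance: shifting the input shifts the output by the same amount.
shifted_then_conv = circular_conv2d(np.roll(image, shift, axis=(0, 1)), kernel)
conv_then_shifted = np.roll(circular_conv2d(image, kernel), shift, axis=(0, 1))
equivariant = np.allclose(shifted_then_conv, conv_then_shifted)

# Invariance: a global max over the response ignores where the object is.
invariant = np.isclose(shifted_then_conv.max(),
                       circular_conv2d(image, kernel).max())
```

The feature map moves with the input, which is what a mask needs. The global maximum stays the same, which is what a class label needs.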

Using a fully convolutional network

Faster R-CNN, which provides the overall structure of Mask R-CNN, extracts its features with a convolutional network. Mask R-CNN builds the mask head that follows in the same way, as an FCN.

RoI Align

Structure of the original Faster R-CNN

In the original Faster R-CNN, a feature map is extracted first, a Region Proposal Network then proposes regions on it, and RoI pooling turns each proposal into a fixed-size feature. Segmentation, however, unlike detection, is not just a matter of drawing a box. It needs a feature map that preserves more precise location information.

RoI pooling rounds the position of the proposal.

Recall how RoI pooling works: four coordinate offsets are regressed, the predicted coordinates (real numbers) are derived from them, and these are then rounded to integer pixel positions. Pooling with rounded coordinates, however, distorts the original location information of the input image. This is not a serious problem for classification, but it is for segmentation, which has to make a prediction pixel by pixel.

RoI Align
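
The size of the rounding error described above can be put in numbers. The values here are assumptions chosen for illustration: a backbone stride of 16 and an RoI edge at x = 215 in image pixels.

```python
stride = 16        # assumed backbone stride: feature map is 1/16 of the image
x_image = 215.0    # hypothetical RoI edge, in image pixels

x_feature = x_image / stride      # continuous coordinate on the feature map
x_quantized = round(x_feature)    # integer coordinate RoI pooling would use

# Error introduced by rounding, converted back to image pixels.
shift_in_pixels = abs(x_feature - x_quantized) * stride
```

Less than half a cell on the feature map turns into several pixels in the image, which matters little for a class label but a lot for a pixel-accurate mask.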

To solve this problem, the authors proceed as follows.

  1. Take the proposal as it is.
  2. Split it into four bins, as was done in RoI pooling.
  3. Split each bin further into four. (subcell)
  4. Take a weighted average of the pixels that fall inside each resulting grid cell, weighted by area (the paper computes this by bilinear interpolation).
  5. Pool over the resulting values.

This method brought a large improvement in mask accuracy.
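
The steps above can be sketched roughly as follows. This is a minimal single-channel version written for illustration, not the authors' implementation. The function names, the 2x2 output, the 2x2 sampling grid and the convention that integer indices are the sample locations are all assumptions.

```python
import numpy as np

def bilinear(feature, y, x):
    """Sample a 2D feature map at a continuous (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    # Each neighbour is weighted by the area of the opposite sub-rectangle.
    return ((1 - wy) * (1 - wx) * feature[y0, x0]
            + (1 - wy) * wx * feature[y0, x1]
            + wy * (1 - wx) * feature[y1, x0]
            + wy * wx * feature[y1, x1])

def roi_align(feature, roi, out_size=2, samples=2):
    """roi = (y1, x1, y2, x2) in feature-map coordinates, never rounded."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for si in range(samples):
                for sj in range(samples):
                    # Centre of subcell (si, sj) inside bin (i, j).
                    y = y1 + (i + (si + 0.5) / samples) * bin_h
                    x = x1 + (j + (sj + 0.5) / samples) * bin_w
                    vals.append(bilinear(feature, y, x))
            out[i, j] = np.mean(vals)  # average pooling; max is the other option
    return out

feature = np.arange(36, dtype=float).reshape(6, 6)  # feature[y, x] = 6*y + x
pooled = roi_align(feature, roi=(0.7, 1.3, 4.1, 4.9))
```

Because the toy feature map is linear in y and x, each output value equals the feature at the centre of its bin, so the fractional RoI coordinates carry through to the result instead of being rounded away.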

Mask RCNN architecture

Mask R-CNN is a network that combines several architectures, which fall broadly into two parts.

  1. Convolutional backbone architecture
    • Feature extraction from the image
  2. Network head
    • Bounding-box recognition (classification & regression) and mask prediction

Head Architecture

ResNet Backbone

The paper evaluates ResNet and ResNeXt networks of depth 50 or 101 layers. The original Faster R-CNN implementation with ResNets extracts features from the final convolutional layer of the 4th stage (C4 hereafter). A ResNet-50 backbone used this way is denoted ResNet-50-C4, which is a common choice.

ResNet-FPN Backbone

FPN stands for Feature Pyramid Network, and it uses a top-down architecture. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid, but is otherwise the same as with a vanilla ResNet. Using a ResNet-FPN backbone for feature extraction in Mask R-CNN gave large gains in both accuracy and speed. Feature Pyramid Network itself will be covered in a later post.
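
As a rough illustration of what extracting RoI features from different levels means, the FPN paper assigns an RoI of width w and height h to level k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4. The sketch below assumes that rule and a pyramid with levels 2 to 5. The function name and the clamping range are choices made for this example.

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Pick the pyramid level for an RoI of size w x h (in image pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))  # clamp to the levels that exist

levels = [fpn_level(s, s) for s in (32, 112, 224, 448)]
```

Small RoIs go to the fine, high-resolution levels and large RoIs go to the coarse ones.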

Loss function (decoupling)

  • $L_{cls}$: Softmax Cross Entropy (loss of classification)
  • $L_{box}$: bbox regression
  • $L_{mask}$: Binary Cross Entropy
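
For reference, the Mask R-CNN paper combines these three terms into one multi-task loss per sampled RoI. The symbols follow the paper, in the same order as the list above (classification, box regression, mask):

```latex
L = L_{cls} + L_{box} + L_{mask}
```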

Carrying over the idea above, the mask branch gets a loss function that is concerned with masking only. A figure makes this easier to follow.

Comparison with previous methods

As the figure shows, a structure that simply does the masking is added, and its loss is folded into training.

How the loss of the mask head is updated

The update rule is quite simple. The mask branch predicts a mask for every class (person, horse, and so on), so in principle the mask loss could be defined over the error of all of them. But in the picture only a single bounding box is marked, because Faster R-CNN handles one box, and hence one class, per prediction. In that situation the masks cannot be updated for every kind of object. So once the class for the box is known, only the mask belonging to that class is selected and updated. For example, if horse is the ground-truth class, only the mask for that class is trained.

As a result, the mask branch trained this way learns only to trace the outline of the object, regardless of which class it is.
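
A minimal NumPy sketch of this rule, assuming the mask branch outputs one m x m map of logits per class. The function name, shapes and random data are illustrative assumptions.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """mask_logits: (K, m, m), one mask per class. gt_mask: (m, m) of 0/1."""
    z = mask_logits[gt_class]        # keep only the ground-truth class channel
    p = 1.0 / (1.0 + np.exp(-z))     # per-pixel sigmoid
    eps = 1e-12                      # avoid log(0)
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                # average binary cross-entropy

rng = np.random.default_rng(0)
K, m = 3, 4
logits = rng.normal(size=(K, m, m))
gt_mask = (rng.random((m, m)) > 0.5).astype(float)
loss = mask_loss(logits, gt_mask, gt_class=1)

# Changing the masks of the other classes leaves the loss untouched.
perturbed = logits.copy()
perturbed[[0, 2]] += 10.0
loss_perturbed = mask_loss(perturbed, gt_mask, gt_class=1)
```

Since the other channels never enter the loss, they receive no gradient from this RoI, which is the sense in which only the ground-truth class's mask is trained.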

test scenario

At test time, the mask branch trained this way predicts masks for several kinds of object. On its own it cannot tell which object is actually there, so the classification result is fed in to pick a single mask. In other words, mask prediction only has to decide whether each pixel belongs to the mask or not (using a sigmoid), and this improved performance. This is what is meant by decoupling mask prediction from class prediction.
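
The selection step at test time might look like the sketch below. The helper name and the toy numbers are assumptions for the example. The 0.5 threshold follows the binarization described in the paper.

```python
import numpy as np

def select_mask(mask_logits, class_scores, threshold=0.5):
    """Pick the mask channel of the predicted class and binarize it."""
    k = int(np.argmax(class_scores))                 # classification branch decides
    prob = 1.0 / (1.0 + np.exp(-mask_logits[k]))     # sigmoid on that channel only
    return k, prob >= threshold

# Two hypothetical classes, 2x2 masks.
mask_logits = np.stack([
    np.full((2, 2), -3.0),
    np.array([[4.0, -4.0], [4.0, 4.0]]),
])
class_scores = np.array([0.1, 0.9])

k, binary_mask = select_mask(mask_logits, class_scores)
```

The mask branch never has to choose a class. It offers one mask per class and the classifier's answer indexes into them.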

Accuracy gained by decoupling

Reference