The preprint of our work on one-shot object detection and instance segmentation is on arXiv. In this work, we learn to detect and segment instances of previously unseen object categories based on a single visual example of the category (the reference). For example, given the image below and either a person (left) or a car (right) as the reference, the goal of the system is to detect all persons (far left) or all cars (far right), respectively. Note that in this case, neither persons nor cars were annotated in the training set.
Our approach to solving this problem builds on the popular Mask R-CNN architecture. We extend its backbone into a Siamese network, which computes embeddings of both the query and the reference image; based on these, the region proposal network (RPN) generates bounding box proposals targeted at the reference category. The proposals are then scored as match/non-match by a classification head (CLS), and foreground/background masks are generated by a segmentation head (SEGM).
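The matching step can be sketched roughly as follows. This is an illustrative NumPy sketch, not the exact operations from the paper: the pooling of the reference features and the elementwise absolute-difference comparison are assumptions made here for concreteness, and the function name `match_features` is hypothetical.

```python
import numpy as np

def match_features(query_feats, ref_feats):
    """Sketch of a Siamese matching step (illustrative, not the paper's exact ops).

    query_feats: (C, H, W) backbone feature map of the query image.
    ref_feats:   (C, h, w) backbone feature map of the reference image.
    Returns a (2C, H, W) map: the query features concatenated with a
    pixelwise comparison against a pooled reference embedding.
    """
    # Pool the reference feature map to a single C-dimensional embedding
    # (average pooling is one simple, assumed choice).
    ref_embedding = ref_feats.mean(axis=(1, 2))                # (C,)
    # Compare every spatial location of the query to the reference
    # embedding; an elementwise absolute difference is one option.
    diff = np.abs(query_feats - ref_embedding[:, None, None])  # (C, H, W)
    # Concatenate so the downstream heads (RPN, CLS, SEGM) see both
    # the raw query features and the comparison signal.
    return np.concatenate([query_feats, diff], axis=0)         # (2C, H, W)

# Toy example with random stand-ins for backbone features.
rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32, 32))   # query feature map
r = rng.standard_normal((256, 16, 16))   # reference feature map
out = match_features(q, r)
print(out.shape)  # (512, 32, 32)
```

Because both images are encoded by the same (shared-weight) backbone, the comparison is meaningful even for categories never seen during training, which is what enables the one-shot setting.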
This task is substantially harder than either object detection/instance segmentation or one-shot learning (which is usually addressed in a classification/discrimination setting) individually. Our current performance on MS-COCO – using splits of 60 categories for training and 20 for one-shot testing – is around 15% and 17% mAP50 for detection and instance segmentation, respectively. Thus, we have established a first baseline and there is definitely room for improvement.
Michaelis C, Ustyuzhaninov I, Bethge M, Ecker AS (2018): One-Shot Instance Segmentation. arXiv:1811.11507.