Our paper on one-shot segmentation in clutter has been accepted to ICML. In this paper, we tackle a one-shot visual search task: given a single instruction example (the red Φ in the image below), the goal is to find the same letter in a cluttered image consisting of many letters (left) and to segment it. This task is hard for computer vision systems because the clutter is made up of other letters (i.e., very similar image statistics), and the letters can have arbitrary colors, are drawn by different people, are transformed by affine transformations, and have not been seen during training.
Using oracle models with access to various amounts of ground-truth information, we show that in this kind of visual search task, detection and segmentation are two intertwined problems, and the solution to each helps solve the other. We therefore introduce MaskNet, an improved model that sequentially attends to different locations, generates segmentation proposals to mask out background clutter, and selects among the segmented objects. Our findings suggest that such image recognition models, based on an iterative refinement of object detection and foreground segmentation, may help improve both detection and segmentation in highly cluttered scenes.
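To make the attend → segment → select loop concrete, here is a deliberately simplified toy sketch in NumPy. It is not the paper's MaskNet (which is a learned neural model); everything here — the correlation-based attention, threshold-based segmentation proposals, and IoU-based selection — is a stand-in chosen only to illustrate the control flow of proposing locations, masking out clutter, and picking the best segmented match.

```python
import numpy as np

def one_shot_search(scene, template, n_proposals=3):
    """Toy attend -> segment -> select loop (illustrative only,
    not the learned MaskNet architecture from the paper)."""
    th, tw = template.shape
    H, W = scene.shape
    # Attend: score every window by normalized correlation with the template.
    scores = np.full((H - th + 1, W - tw + 1), -np.inf)
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = scene[y:y + th, x:x + tw]
            denom = np.linalg.norm(patch) * np.linalg.norm(template)
            scores[y, x] = (patch * template).sum() / denom if denom else -np.inf
    # Keep the top-k attended locations as proposals.
    top = np.argsort(scores.ravel())[::-1][:n_proposals]
    best = None
    for idx in top:
        y, x = np.unravel_index(idx, scores.shape)
        patch = scene[y:y + th, x:x + tw]
        # Segment: a foreground proposal that masks out background clutter
        # (here just a crude intensity threshold).
        mask = patch > patch.mean()
        # Select: score the segmented object against the template by IoU.
        tmask = template > 0
        iou = (mask & tmask).sum() / max(1, (mask | tmask).sum())
        if best is None or iou > best[0]:
            best = (iou, (int(y), int(x)), mask)
    return best  # (match score, top-left location, foreground mask)
```

In the actual model, each of these hand-crafted stages would be replaced by a learned component, and the point of the paper is that detection (the attend step) and segmentation (the mask step) improve each other when refined jointly.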