We consider the following problem: can a machine learn from a few labeled pixels to predict every pixel in a new image? This task is extremely challenging (see Fig. 1), as a single body part may contain visually distinct areas (e.g., a head consists of eyes, a nose and a mouth), while different body parts may look similar and be hard to distinguish (e.g., upper arms vs. lower arms). It becomes even more difficult if we are given no precise locations at all, only the occurrence of body parts in the image. This problem is dubbed weakly supervised segmentation, where the goal is to classify every pixel into semantic categories using only partial or weak supervision. Weak annotations come in many forms that are cheap to collect but imperfect, e.g., image-level tags, bounding boxes, points and scribbles.
These forms of weak supervision come with different assumptions, and state-of-the-art methods tackle them differently. Weak supervision can be roughly categorized into two families: coarse and sparse supervision. Coarse annotations, including image tags and bounding boxes, lack precise pixel localization; methods for them typically rely on Class Activation Maps (CAMs) to localize coarse semantic cues and generate pseudo pixel labels. Sparse annotations, such as points and scribbles, label only a small subset of pixels; Conditional Random Fields (CRFs) are often used to propagate these labels to unlabeled pixels. However, developing a separate method for each form of weak supervision is frustrating. This motivates us to develop a single method for universal weakly supervised segmentation. In fact, weakly supervised segmentation can be regarded as semi-supervised pixel classification, and the key question is how to propagate and refine annotations from coarsely and sparsely labeled pixels to unlabeled pixels.
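To make the semi-supervised view concrete, the following is a minimal illustrative sketch of label propagation from sparsely labeled pixels. It is not the method proposed in this work, nor a CRF: it simply assigns each unlabeled pixel the label of its nearest labeled pixel in feature space. The function name `propagate_labels`, the `ignore_label` convention, and the use of raw per-pixel feature vectors are all assumptions made for illustration.

```python
import numpy as np


def propagate_labels(features, labels, ignore_label=-1):
    """Nearest-labeled-neighbor propagation (illustrative baseline only).

    features: (H, W, D) float array of per-pixel features (hypothetical input,
              e.g. color values or learned embeddings).
    labels:   (H, W) int array; `ignore_label` marks unlabeled pixels.
    Returns an (H, W) int array with a label for every pixel.
    """
    h, w, d = features.shape
    flat_feats = features.reshape(-1, d)
    flat_labels = labels.reshape(-1)

    labeled = flat_labels != ignore_label
    if not labeled.any():
        raise ValueError("at least one labeled pixel is required")

    seeds = flat_feats[labeled]         # (S, D) features of labeled pixels
    seed_labels = flat_labels[labeled]  # (S,) their labels

    # Squared Euclidean distance from every pixel to every labeled pixel: (N, S).
    dists = ((flat_feats[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    dense = seed_labels[dists.argmin(axis=1)]

    # Labeled pixels always keep their given annotation.
    dense[labeled] = flat_labels[labeled]
    return dense.reshape(h, w)


# Toy 2x2 image with a 1-D feature per pixel and two labeled pixels.
toy_features = np.array([[[0.0], [0.1]],
                         [[0.9], [1.0]]])
toy_labels = np.array([[0, -1],
                       [-1, 1]])
dense = propagate_labels(toy_features, toy_labels)  # [[0, 0], [1, 1]]
```

The pairwise distance matrix grows as the number of pixels times the number of labeled pixels, so this sketch suits only small inputs. It is meant to show the propagation idea, not to serve as a practical segmentation method.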