
However, achieving segmentation generation in a generic generation framework (e.g. the Transformer architecture [15]) is non-trivial due to the drastically different data format. To address this obstacle, we propose the notion of maskige, which expresses the segmentation mask in RGB image form. This enables the use of a pretrained latent posterior distribution (e.g. VQVAE [40]) of existing generative models. Our model takes a two-stage optimization:
(i) Learning the posterior distribution of the latent variables conditioned on the semantic segmentation masks, so that the latent variables can simulate the target segmentation masks. To achieve this, we introduce a fixed pre-trained VQVAE [40] and a couple of lightweight transformation modules, which can be trained at minimal cost or set up manually without any additional training. In either case, the process is efficient and adds no significant overhead to the overall optimization.
(ii) Minimizing the distance between the posterior distribution and the prior distribution of the latent variables given the input training images and their masks, so that the generation of semantic masks can be conditioned on the input images. This can be realized by a generic encoder-decoder style architecture (e.g. a Transformer).
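Schematically, and only as a sketch of the above two stages (the symbols x for the input image, c for the maskige, z for the discrete latent codes, and D for a distribution distance are introduced here purely for illustration), the optimization can be summarized as

\[
\text{(i)}\;\; \max \; \mathbb{E}_{z \sim q(z \mid c)}\big[\log p(c \mid z)\big],
\qquad
\text{(ii)}\;\; \min_{\theta} \; D\big(q(z \mid c) \,\|\, p_{\theta}(z \mid x)\big),
\]

where stage (i) fits the mask-conditioned posterior q(z | c), realized by the fixed pre-trained VQVAE together with the lightweight transformation modules, so that the latent codes can reconstruct the mask; and stage (ii) trains only the image-conditioned prior p_\theta(z | x), realized by the encoder-decoder architecture, to match that posterior. D can be instantiated, for instance, as a cross-entropy over the discrete codebook indices.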
We summarize the contributions as follows.
(i) We propose a Generative Semantic Segmentation (GSS) approach that reformulates semantic segmentation as an image-conditioned mask generation problem. This represents a conceptual shift from the conventional discriminative learning-based paradigm.
(ii) We realize a GSS model in an established conditional image generation framework, with minimal need for task-specific architecture and loss function modifications, while fully leveraging the knowledge of off-the-shelf generative models.
(iii) Extensive experiments on several semantic
segmentation benchmarks show that our GSS is competitive
with prior art models in the standard setting, whilst achieving
a new state of the art in the more challenging and practical
cross-domain setting (e.g. MSeg [26]).
2. Related work
Semantic segmentation
Since the inception of FCN [32], semantic segmentation has flourished with various deep neural networks able to classify each pixel. Follow-up efforts then shift to improving the limited receptive field of these models. For example, PSPNet [57] and DeepLabV2 [3] aggregate multi-scale context between convolution layers. Subsequently, Nonlocal [49], CCNet [21], and DGMN [56] integrate the attention mechanism into the convolutional structure. Later on, Transformer-based methods (e.g. SETR [58] and Segformer [52]) are proposed following the introduction of Vision Transformers. More recently, MaskFormer [9] and Mask2Former [8] realize semantic segmentation with bipartite matching. All these methods commonly adopt the discriminative pixel-wise classification learning paradigm.
This is in contrast to our generative semantic segmentation.
Image generation
In parallel, generative models [15, 40] also excel. They are often optimized in a two-stage training process: (1) learning a data representation in the first stage, and (2) building a probabilistic model of the encoding in the second stage. For learning the data representation, VAE [24] reformulates the autoencoder by variational inference. GAN [19] plays a zero-sum game. VQVAE [45] extends image representation learning to discrete spaces, making language-image cross-modal generation possible. [27] replaces the element-wise errors of VAE with feature-wise errors to capture the data distribution. For probabilistic model learning, some works [14, 41, 51] use flows for joint probability learning. Leveraging Transformers to model the composition between the condition and images, Esser et al. [15] demonstrate the significance of the data representation (i.e. the first-stage result) for challenging high-resolution image synthesis, obtained at a high computational cost. This result inspires our work in the sense that the diverse and rich knowledge about data representation acquired in the first stage could be transferable across more tasks, such as semantic segmentation.
Generative models for visual perception
Image-to-image translation made one of the earliest attempts at generative segmentation, with far less success in performance [22]. Some good results were achieved in limited scenarios such as face parts segmentation and chest X-ray segmentation [28]. Replacing the discriminative classifier with a generative Gaussian Mixture model, GMMSeg [29] is claimed to be generative segmentation, but most of its modeling is still discriminative. The promising performance of Pix2Seq [7] on several vision tasks leads to the prevalence of sequence-to-sequence task-agnostic vision frameworks. For example, Unified-I/O [33] supports a variety of vision tasks within a single model by serializing each task into sentences.
Pix2Seq-D [6] deploys a hierarchical VAE (i.e. a diffusion model) to generate panoptic segmentation masks. This method is inefficient due to the need for iterative denoising. UViM [25] realizes generative panoptic segmentation by introducing a latent variable conditioned on the input images. It is also computationally heavy due to the need for model training from scratch. To address these issues, we introduce the notion of maskige for expressing segmentation masks in the form of RGB images, enabling the adoption of off-the-shelf data representation models (e.g. VQVAE) already pretrained on vast and diverse imagery. This finally allows a generative segmentation model to be trained as efficiently as its conventional discriminative counterparts.
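To make the maskige notion concrete, below is a minimal illustrative sketch (in Python/NumPy) of how a mask of class indices could be converted to and from an RGB maskige using a fixed color palette; the function names and the random palette are assumptions for illustration, not the exact transformation modules of our model.

import numpy as np

# Illustrative sketch: encode a semantic mask (H x W class indices) as a
# maskige (H x W x 3 RGB image) so that an off-the-shelf VQVAE pretrained on
# natural images can tokenize it. The fixed random palette is a toy
# assumption; the class-to-color mapping can equally be set by hand or
# learned by lightweight transformation modules.

def make_palette(num_classes, seed=0):
    # Assign each class an RGB color (num_classes x 3, uint8).
    # Colors are assumed distinct; a hand-picked palette also works.
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(num_classes, 3), dtype=np.uint8)

def mask_to_maskige(mask, palette):
    # Map per-pixel class indices (H, W) to RGB colors (H, W, 3).
    return palette[mask]

def maskige_to_mask(maskige, palette):
    # Recover class indices by nearest-color lookup (the inverse transform).
    diff = maskige[..., None, :].astype(np.int32) - palette[None, None].astype(np.int32)
    return np.argmin((diff ** 2).sum(-1), axis=-1)

if __name__ == "__main__":
    palette = make_palette(num_classes=19)            # illustrative class count
    mask = np.random.randint(0, 19, size=(64, 64))    # dummy ground-truth mask
    maskige = mask_to_maskige(mask, palette)          # RGB image fed to the VQVAE
    assert (maskige_to_mask(maskige, palette) == mask).all()

With maskiges in hand, both the VQVAE tokenization and the subsequent latent-prior learning operate purely on RGB images, which is what allows the pretrained representation to be reused without modification.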