Image Generation Models

Diffusion Model

1 Vocabulary

latent: adj. hidden; present but not yet active or visible; n. a latent fingerprint

decompose: v. to break something down into its component parts; to decay

state-of-the-art: adj. using the most advanced techniques available

fidelity: n. the degree to which a reproduction is faithful to the original

hierarchy: n. a system organized into levels or ranks

exploit: v. to make full use of; to take advantage of; n. a notable feat

prone: adj. likely to suffer from or do something (usually undesirable)

excessive: adj. more than is necessary or reasonable

imperceptible: adj. too slight or subtle to be noticed

distortion: n. a twisting out of shape; (of signals or waveforms) a departure from the original form

trade-off: n. a balance struck between two competing desirable qualities

perceptual: adj. relating to perception

superfluous: adj. more than is needed; unnecessary

manifold: n. (math) a space that locally resembles Euclidean space; (mech.) a pipe fitting with several openings; adj. many and various

blurriness: n. the quality of being unclear or out of focus

quantization: n. mapping continuous values onto a discrete set of levels

autoregressively: adv. by predicting each element conditioned on the previously generated ones

probabilistic: adj. based on or modeled with probability

discrete: adj. consisting of distinct, separate values

modalities: n. distinct types or forms of data (e.g. text, images, semantic maps)

intermediate representation: an internal encoding that sits between raw input and final output

tractable: adj. easy to handle or compute

stagnate: v. to stop developing or improving

spectacular: adj. very impressive or striking; n. an impressive show

2 Paper Structure

1. Title

High-Resolution Image Synthesis with Latent Diffusion Models

2. Abs

  • Diffusion models (DMs) achieve state-of-the-art synthesis results

    • image data and beyond
    • allow for a guiding mechanism to control the image generation process without retraining
    • however, they operate directly in pixel space, so optimization consumes hundreds of GPU days and inference is expensive due to sequential evaluations
  • To enable DM training on limited computational resources while retaining quality and flexibility

    • apply DM in the latent space of pretrained autoencoders
  • In contrast to previous work

    • for the first time, reach a near-optimal point between complexity reduction and detail preservation
  • Introduce cross-attention layers into the model architecture

    • turn DMs into flexible generators for general conditioning inputs such as text or bounding boxes
    • high-resolution synthesis in a convolutional manner
  • Latent diffusion models (LDMs)

    • achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis
    • highly competitive performance on unconditional image generation, text-to-image synthesis, and super-resolution
    • while significantly reducing computational requirements compared to pixel-based DMs

3. Intro

  • Image synthesis is among the computer vision tasks with the greatest computational demands

  • high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models containing billions of parameters in autoregressive (AR) transformers

  • DMs’ applications:

    • class-conditional: image synthesis, super-resolution
    • unconditional: inpainting, colorization, stroke-based synthesis
  • Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs do, and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models

  • Democratizing High-Resolution Image Synthesis

    DMs spend excessive amounts of capacity (and hence compute resources) on modeling imperceptible details of the data

    training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images

  • Departure to Latent Space

5. Method

  • DMs allow perceptually irrelevant details to be ignored by undersampling the corresponding loss terms [29], but they still require costly function evaluations in pixel space

  • Introduce an explicit separation of the compressive from the generative learning phase

    • use an autoencoding model that learns a space perceptually equivalent to the image space
    • offer reduced computational complexity
  • Several advantages: (i) sampling is performed in a low-dimensional space, so the DM is computationally much more efficient; (ii) the inductive bias of DMs inherited from their UNet architecture is exploited, which suits spatially structured data; (iii) the first stage is a general-purpose compression model whose latent space can be reused to train multiple generative models

  • Perceptual Image Compression

    • based on previous work [23]
    • an autoencoder trained with a combination of a perceptual loss [102] and a patch-based adversarial objective
    • ensure that
      • the reconstructions are confined to the image manifold by enforcing local realism
      • avoid blurriness introduced by relying solely on pixel-space losses
    • in order to avoid high-variance latent spaces, experiment with two different kinds of regularization (a rough sketch of both variants follows at the end of this section)
      • KL-reg. imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE
      • VQ-reg. uses a vector quantization layer [93] within the decoder
    • in contrast to previous works that rely on an arbitrary 1D ordering of the learned space z, the subsequent DM can work with its two-dimensional structure
  • Latent Diffusion Models

    • Diffusion Models

      • probabilistic
      • designed to learn a data distribution by gradually denoising a normally distributed variable
      • trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$ (the objective is written out after this section)
    • Generative Modeling of Latent Representations

      • have an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models

        • focus on the important, semantic bits of the data
        • train in a lower dimensional, computationally much more efficient space
      • unlike previous work that relied on attention-based transformer models

        • which operated in a highly compressed, discrete latent space
        • LDMs can take advantage of image-specific inductive biases
          • build the underlying UNet primarily from 2D convolutional layers
          • focus the objective on the perceptually most relevant bits using the reweighted bound
  • Conditioning Mechanisms

    • diffusion models can

      • be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$
      • which paves the way to controlling the synthesis process through inputs $y$ such as text, semantic maps or other image-to-image translation tasks
    • we turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism

    • To pre-process $y$ from various modalities (such as language prompts)

      • introduce a domain-specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y)$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d}) \cdot V$, with $Q = W^{(i)}_Q \cdot \varphi_i(z_t)$, $K = W^{(i)}_K \cdot \tau_\theta(y)$, $V = W^{(i)}_V \cdot \tau_\theta(y)$ (a minimal code sketch follows at the end of this section)
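
As a rough illustration of the two regularization variants under Perceptual Image Compression above, a minimal PyTorch-style sketch; channel counts, the codebook size, and module names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KLRegLatent(nn.Module):
    """KL-reg.: predict mean / log-variance for the latent and add a slight
    KL penalty towards a standard normal, similar to a VAE."""

    def __init__(self, channels=4):
        super().__init__()
        self.to_moments = nn.Conv2d(channels, 2 * channels, kernel_size=1)

    def forward(self, h):
        mean, logvar = self.to_moments(h).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        # KL(N(mean, var) || N(0, 1)); weighted by a small factor in the total loss
        kl = 0.5 * torch.mean(mean ** 2 + logvar.exp() - 1.0 - logvar)
        return z, kl


class VQRegLatent(nn.Module):
    """VQ-reg.: snap each latent vector to its nearest codebook entry
    (the paper places this quantization layer inside the decoder)."""

    def __init__(self, channels=4, codebook_size=8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, channels)

    def forward(self, h):
        b, c, height, width = h.shape
        flat = h.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
        dists = torch.cdist(flat, self.codebook.weight)      # distances to all codes
        z_q = self.codebook(dists.argmin(dim=1))              # nearest code per vector
        z_q = z_q.view(b, height, width, c).permute(0, 3, 1, 2)
        commit = F.mse_loss(h, z_q.detach())                  # commitment term
        z_q = h + (z_q - h).detach()                          # straight-through gradient
        return z_q, commit


# usage with an assumed encoder output shape
h = torch.randn(2, 4, 32, 32)
z_kl, kl_loss = KLRegLatent()(h)
z_vq, vq_loss = VQRegLatent()(h)
```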
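
For reference, the denoising objective described above and its latent-space counterpart can be written as (with $\mathcal{E}$ the first-stage encoder, $x_t$ / $z_t$ the noised pixel / latent variables, and $\epsilon_\theta$ the denoising UNet):

```latex
L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2 \right]
\qquad
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]
```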
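
A minimal single-head sketch of the cross-attention conditioning described above, where flattened UNet features attend to the encoded conditioning $\tau_\theta(y)$; the dimensions and module names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Queries come from the flattened UNet feature map phi_i(z_t); keys and
    values come from the conditioning encoder output tau_theta(y)."""

    def __init__(self, query_dim=320, context_dim=768, inner_dim=320):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context):
        # x:       (B, H*W, query_dim)   flattened UNet feature map
        # context: (B, T, context_dim)   e.g. transformer-encoded text tokens
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return self.to_out(attn @ v)


# usage: condition a 32x32 latent feature map on 77 text tokens (sizes are illustrative)
feats = torch.randn(2, 32 * 32, 320)    # phi_i(z_t), flattened
tokens = torch.randn(2, 77, 768)        # tau_theta(y)
out = CrossAttention()(feats, tokens)   # (2, 1024, 320), added back into the UNet
```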

6. Exp

LDMs provide a means of flexible and computationally tractable diffusion-based image synthesis, including high-resolution generation of various image modalities.

  • analyze the gains of our models compared to pixel-based diffusion models in both training and inference

    • LDMs trained in VQ-regularized latent spaces achieve better sample quality
  • On Perceptual Compression Tradeoffs

    analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$

    all models are run on a single NVIDIA A100

    train all models for the same number of steps and with the same number of parameters

    • small downsampling factors for LDM-{1, 2} result in slow training progress
    • overly large values of f cause stagnating fidelity after comparably few training steps
    • we attribute this to
      • leaving most of the perceptual compression to the diffusion model
      • too strong first-stage compression, resulting in information loss and thus limiting the achievable quality
    • LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [28] gap of 28 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps (a small latent-size calculation follows at the end of this section)

    LDM-4 and LDM-8 lie in the best-behaved regime for achieving high-quality synthesis results.

  • Image Generation with Latent Diffusion

    • FID and Precision-and-Recall [49]
  • Conditional Latent Diffusion

    • Transformer Encoders for LDMs

      • text-to-image model

        • train a 1.45B parameter model conditioned on language prompts on LAION-400M
        • employ the BERT-tokenizer and implement $\tau_\theta$ as a transformer to infer a latent code which is mapped into the UNet via cross-attention
      • to further analyze the flexibility of the cross-attention based conditioning mechanism

        • train models to synthesize images based on semantic layouts on OpenImages [48], and finetune on COCO [4]
      • our best-performing class-conditional ImageNet models with $f \in \{4, 8\}$ outperform the state-of-the-art diffusion model ADM while significantly reducing computational requirements and parameter count

    • Convolutional Sampling Beyond $256^2$

      semantic synthesis, super-resolution and inpainting

  • Super-Resolution with Latent Diffusion

    To be read

  • Inpainting with Latent Diffusion

    To be read
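
To make the downsampling-factor tradeoff above concrete, a tiny calculation of the latent size the diffusion model operates on for each $f$; the $256^2$ input resolution and 4 latent channels are illustrative assumptions.

```python
# For an H x W x 3 RGB image, the LDM-f first stage yields a latent of
# spatial size (H/f) x (W/f); larger f makes the diffusion model cheaper
# but pushes more compression (and potential information loss) onto the
# first stage. LDM-1 corresponds to the pixel-based baseline.
H = W = 256
c = 4  # assumed number of latent channels
for f in [1, 2, 4, 8, 16, 32]:
    latent = (H // f, W // f, c)
    ratio = (H * W * 3) / (latent[0] * latent[1] * latent[2])
    print(f"LDM-{f:<2}: latent {latent[0]}x{latent[1]}x{latent[2]}, "
          f"pixel-to-latent size ratio {ratio:.1f}")
```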

7. Conclusion

improve both the training and sampling efficiency of denoising diffusion models without degrading their quality

favorable results across a wide range of conditional image synthesis tasks without task-specific architectures

Generative Adversarial Network (GAN)