Diffusion Model
1 Vocabulary
latent: adj. hidden; dormant; existing but not yet visible; n. latent fingerprint
decompose: v. to break down (into parts); to decay; to disintegrate
state-of-the-art: adj. the most advanced available
fidelity: n. faithfulness; accuracy of reproduction
hierarchy: n. a system of levels; a ranking of members of an organization; the ruling group
exploit: v. to make use of; to take advantage of (often unfairly); n. a feat
prone: adj. liable or inclined to (something undesirable)
excessive: adj. more than is necessary; too much; extreme
imperceptible: adj. too slight or small to be noticed
distortion: n. twisting out of shape; misrepresentation; (of a signal or waveform) deviation from the original
trade-off: n. a balance or compromise between competing factors
perceptual: adj. relating to perception
superfluous: adj. more than is needed; redundant
manifold: n. (math.) a manifold; (mech.) a manifold pipe; adj. many and various; v. to duplicate
blurriness: n. the quality of being blurred or unclear
quantization: n. (physics) quantization; mapping continuous values to a discrete set
autoregressively: adv. in an autoregressive manner
probabilistic: adj. based on probability; stochastic
discrete: adj. separate; non-continuous
modalities: n. modes or types of data (e.g., text, image, audio)
intermediate representation: an internal representation between input and output
tractable: adj. easy to handle, compute, or work with
stagnate: v. to stop developing or making progress
spectacular: adj. impressive; striking; n. an impressive show or display
2 Paper Structure
1. Title
High-Resolution Image Synthesis with Latent Diffusion Models
2. Abs
  - Diffusion models (DMs) achieve state-of-the-art synthesis results
    - on image data and beyond
    - their formulation allows a guiding mechanism to control the image generation process without retraining
    - but they typically operate directly in pixel space: training consumes hundreds of GPU days and inference is expensive due to sequential evaluations
  - To enable DM training on limited computational resources while retaining quality and flexibility
    - apply DMs in the latent space of pretrained autoencoders
  - In contrast to previous work
    - for the first time, reach a near-optimal point between complexity reduction and detail preservation
  - Introduce cross-attention layers into the model architecture
    - turn DMs into flexible conditional generators (for inputs such as text or bounding boxes)
    - enable high-resolution synthesis in a convolutional manner
  - Latent diffusion models (LDMs)
    - achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis
    - competitive performance on unconditional image generation, text-to-image synthesis, and super-resolution
    - significantly reduce computational requirements compared to pixel-based DMs

3. Intro
  - Image synthesis is among the computer-vision fields with the greatest computational demands
    - high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models containing billions of parameters in autoregressive (AR) transformers
  - DMs' applications:
    - class-conditional: image synthesis, super-resolution
    - unconditional: inpainting, colorization, stroke-based synthesis
  - Being likelihood-based models, DMs do not exhibit the mode collapse and training instabilities of GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models
  - Democratizing High-Resolution Image Synthesis
    - DMs spend excessive amounts of capacity on modeling imperceptible details of the data
    - training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images
  - Departure to Latent Space
 
4. Related Work
5. Method
  - DMs allow ignoring perceptually irrelevant details by undersampling the corresponding loss terms [29], but they still require costly function evaluations in pixel space
  - Introduce an explicit separation of the compressive from the generative learning phase
    - use an autoencoding model that learns a space perceptually equivalent to image space
    - but one that offers reduced computational complexity
  - Several advantages follow from this approach
  - Perceptual Image Compression
    - based on previous work [23]
    - an autoencoder trained by a combination of a perceptual loss [102] and a patch-based adversarial objective
    - this ensures that
      - the reconstructions are confined to the image manifold by enforcing local realism
      - blurriness introduced by relying solely on pixel-space losses is avoided
    - in order to avoid high-variance latent spaces, experiment with two different kinds of regularization (see the sketch after this list)
      - the first variant, KL-reg., imposes a slight KL penalty towards a standard normal on the learned latent, similar to a VAE
      - the second, VQ-reg., uses a vector quantization layer [93] within the decoder
    - in contrast to previous works, which rely on an arbitrary 1D ordering of the learned space z, the subsequent DM can exploit the two-dimensional structure of z
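A minimal PyTorch-style sketch of the two regularization variants above, to make the difference concrete. Class names (`KLRegularizer`, `VQRegularizer`) and hyperparameters (`codebook_size`, `dim`) are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the two latent-space regularizers: KL-reg. vs. VQ-reg.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KLRegularizer(nn.Module):
    """KL-reg.: a slight KL penalty pushing the latent towards N(0, I), as in a VAE."""

    def forward(self, mean, logvar):
        # Reparameterization trick, then the KL term to a standard normal.
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        kl = 0.5 * torch.sum(mean ** 2 + logvar.exp() - 1.0 - logvar, dim=[1, 2, 3])
        return z, kl.mean()


class VQRegularizer(nn.Module):
    """VQ-reg.: a vector-quantization layer mapping each latent vector to its nearest codebook entry."""

    def __init__(self, codebook_size=8192, dim=4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (B, C, H, W) -> flatten spatial positions to (B*H*W, C)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        # Nearest codebook entry per latent vector (Euclidean distance).
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients flow back to the encoder.
        z_q = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach())
        return z_q, commit
```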
 
  - Latent Diffusion Models
    - Diffusion Models
      - probabilistic models designed to learn a data distribution by gradually denoising a normally distributed variable
      - trained to predict a denoised variant of their input x_t, where x_t is a noisy version of the input x (a training-step sketch follows after this list)
    - Generative Modeling of Latent Representations
      - we now have an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away; compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can
        - focus on the important, semantic bits of the data
        - train in a lower-dimensional, computationally much more efficient space
      - unlike prior work that relies on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of image-specific inductive biases
        - build the underlying UNet primarily from 2D convolutional layers
        - focus the objective on the perceptually most relevant bits using the reweighted bound
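The bullets above correspond to the noise-prediction objectives L_DM = E_{x, ε∼N(0,1), t}[ ||ε − ε_θ(x_t, t)||² ] in pixel space and L_LDM = E_{E(x), ε∼N(0,1), t}[ ||ε − ε_θ(z_t, t)||² ] in the latent space. Below is a minimal sketch of one LDM training step under these objectives; `encoder`, `unet`, and the linear noise schedule are assumed placeholder interfaces, not the paper's exact code.

```python
# One noise-prediction training step on the latent z = E(x).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product, i.e. bar(alpha)_t


def ldm_training_step(encoder, unet, x):
    """encoder and unet are placeholder modules: E(x) -> z, and eps_theta(z_t, t)."""
    with torch.no_grad():
        z = encoder(x)                                    # z = E(x): frozen first-stage encoder
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                             # eps ~ N(0, I)
    ab = alpha_bar.to(z.device)[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z + (1.0 - ab).sqrt() * eps         # forward diffusion: noisy latent z_t
    eps_pred = unet(z_t, t)                               # eps_theta(z_t, t)
    return F.mse_loss(eps_pred, eps)                      # L_LDM = E[ ||eps - eps_theta(z_t, t)||^2 ]
```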
  - Conditioning Mechanisms
    - diffusion models can model conditional distributions of the form p(z|y)
      - implemented with a conditional denoising autoencoder ε_θ(z_t, t, y)
      - this paves the way to controlling the synthesis process through inputs y such as text, semantic maps, or other image-to-image translation tasks
    - we turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism (see the sketch after this list)
    - to pre-process y from various modalities (such as language prompts)
      - introduce a domain-specific encoder τ_θ that projects y to an intermediate representation τ_θ(y), which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing Attention(Q, K, V) = softmax(QKᵀ/√d) · V, with Q = W_Q · φ_i(z_t), K = W_K · τ_θ(y), V = W_V · τ_θ(y)
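A minimal single-head sketch of the cross-attention layer just described: queries come from flattened UNet features φ_i(z_t), keys and values from the conditioning encoder output τ_θ(y). The class name and dimensions are illustrative, and the paper's implementation is multi-head.

```python
# Cross-attention: UNet features attend over conditioning tokens.
import math
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim, d=64):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(query_dim, d, bias=False)    # W_Q
        self.to_k = nn.Linear(context_dim, d, bias=False)  # W_K
        self.to_v = nn.Linear(context_dim, d, bias=False)  # W_V
        self.to_out = nn.Linear(d, query_dim)

    def forward(self, features, context):
        # features: (B, N, query_dim) flattened UNet features phi_i(z_t)
        # context:  (B, M, context_dim) conditioning tokens tau_theta(y)
        q, k, v = self.to_q(features), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)  # softmax(QK^T / sqrt(d))
        return self.to_out(attn @ v)                                             # ... @ V
```

Usage-wise, `features` would be an intermediate UNet activation reshaped to (B, H·W, C), and `context` the token embeddings produced by τ_θ from a prompt.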
 
6. Exp
LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis, including high-resolution generation of various image modalities.
  - analyze the gains of our models compared to pixel-based diffusion models in both training and inference
    - LDMs trained in VQ-regularized latent spaces achieve better sample quality
  - On Perceptual Compression Tradeoffs
    - analyzes the behavior of LDMs with different downsampling factors f (a note on latent sizes follows after this list)
    - a single NVIDIA A100; all models trained for the same number of steps and with the same number of parameters
    - small downsampling factors (LDM-{1,2}) result in slow training progress
    - overly large values of f cause stagnating fidelity after comparably few training steps
    - we attribute this to
      - (i) leaving most of the perceptual compression to the diffusion model, and
      - (ii) too strong first-stage compression resulting in information loss and thus limiting the achievable quality
    - LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [28] gap of 28 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps
    - LDM-4 and LDM-8 lie in the best-behaved regime for achieving high-quality synthesis results
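For intuition on what the downsampling factor f buys, the snippet below prints the spatial size of the latent the diffusion model operates on, assuming a 256×256 input; the latent channel count depends on the autoencoder configuration and is not shown.

```python
# Latent spatial resolution for a 256x256 input under downsampling factor f (LDM-f).
for f in (1, 2, 4, 8, 16, 32):
    print(f"LDM-{f:>2}: diffusion runs on a {256 // f}x{256 // f} latent")
```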
  - Image Generation with Latent Diffusion
    - evaluated with FID and Precision-and-Recall [49]
 
  - Conditional Latent Diffusion
    - Transformer Encoders for LDMs
      - text-to-image model
        - train a 1.45B-parameter model conditioned on language prompts on LAION-400M
        - employ the BERT-tokenizer and implement τ_θ as a transformer to infer a latent code which is mapped into the UNet via cross-attention (a sketch of such an encoder follows after this list)
      - to further analyze the flexibility of the cross-attention-based conditioning mechanism
        - train models to synthesize images based on semantic layouts on OpenImages [48], and finetune on COCO [4]
      - our best-performing class-conditional ImageNet models outperform the state-of-the-art diffusion model ADM while significantly reducing computational requirements and parameter count
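A hedged sketch of what a conditioning encoder τ_θ of this kind could look like: token ids from a tokenizer pass through a small transformer, and the resulting embedding sequence is what the cross-attention layers consume. All hyperparameters (vocab size, width, depth, sequence length) are illustrative placeholders, not the paper's configuration.

```python
# Conditioning encoder tau_theta: token ids -> context embeddings for cross-attention.
import torch
import torch.nn as nn


class ConditioningTransformer(nn.Module):
    def __init__(self, vocab_size=30522, dim=512, depth=4, max_len=77):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):
        # token_ids: (B, L) integer ids from a tokenizer (e.g. a BERT tokenizer, as in the notes above)
        pos = torch.arange(token_ids.shape[1], device=token_ids.device)
        h = self.tok(token_ids) + self.pos(pos)
        return self.encoder(h)   # (B, L, dim): keys/values for the cross-attention layers
```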
 
    - Convolutional Sampling Beyond 256²
      - semantic synthesis, super-resolution and inpainting
    - Super-Resolution with Latent Diffusion
      - to be read
    - Inpainting with Latent Diffusion
      - to be read

7. Conclusion
  - LDMs significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality
  - achieve favorable results across a wide range of conditional image synthesis tasks without task-specific architectures