Diffusion Model
1 Vocabulary
latent: adj. hidden; dormant; existing but not yet visible; n. latent fingerprint
decompose: v. to break down into parts; to decay; to rot
state-of-the-art: adj. most advanced; using the newest methods
fidelity: n. faithfulness; accuracy of reproduction
hierarchy: n. a system of levels or ranks
exploit: v. to make full use of; to take advantage of; n. a feat
prone: adj. liable or inclined to (something undesirable)
excessive: adj. more than is necessary or reasonable
imperceptible: adj. too slight to be noticed or felt
distortion: n. twisting out of shape; misrepresentation; (of a signal or waveform) distortion
trade-off: n. a balance struck between two competing factors
perceptual: adj. relating to perception
superfluous: adj. unnecessary; more than is needed
manifold: n. (math.) manifold; (mech.) manifold; adj. many and various; v. to duplicate
blurriness: n. the quality of being blurred or out of focus
quantization: n. (phys.) quantization; mapping continuous values to discrete levels
autoregressively: adv. in an autoregressive manner
probabilistic: adj. based on probability; stochastic
discrete: adj. separate; taking distinct values
modalities: n. modes or forms (e.g. of data: text, images, audio)
intermediate representation: n. intermediate representation
tractable: adj. easy to handle or compute
stagnate: v. to stop developing or making progress
spectacular: adj. impressive; dramatic; n. an impressive show
2 Paper Structure
1. Title
High-Resolution Image Synthesis with Latent Diffusion Models
2. Abs
-
Diffusion models (DMs) achieve state-of-the-art synthesis results
- image data and beyond
- allow for a guiding mechanism to control the image generation process without retraining
- operate directly in pixel space, consume hundreds of GPU days and inference is expensive due to sequential evaluations
-
To enable DM training on limited computational resources while retaining quality and flexibility
- apply DM in the latent space of pretrained autoencoders
-
In contrast to previous work
- the first time to reach a near-optimal point (between complexity reduction and detail preservation)
-
Introduce cross-attention layers into the model architecture
- turn DMs into flexible generators for general conditioning inputs (such as text or bounding boxes)
- high-resolution synthesis in a convolutional manner
-
Latent diffusion models (LDMs)
- achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis
- unconditional image generation, text-to-image synthesis, and super resolution
- significantly reducing computational requirement compared to pixel-based DMs
3. Intro
-
Image synthesis is among the tasks with the greatest computational demands
-
high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models containing billions of parameters in autoregressive (AR) transformers
-
DMs’ application:
- class-conditional: image synthesis, super-resolution
- unconditional: inpainting, colorization, stroke-based synthesis
-
Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models
-
Democratizing High-Resolution Image Synthesis
DMs spend excessive amounts of capacity on modeling imperceptible details of the data
training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images
-
Departure to Latent Space
4. Related Work
5. Method
-
DMs allow perceptually irrelevant details to be ignored by undersampling the corresponding loss terms [29]
-
Introduce an explicit separation of the compressive from the generative learning phase
- use an autoencoding model (learn a space)
- offer reduced computational complexity
-
Several advantages
-
Perceptual Image Compression
- based on previous work [23]
- an autoencoder trained by a combination of a perceptual loss [102] and a patch-based adversarial objective
- ensure that
- the reconstructions are confined to the image manifold by enforcing local realism
- avoid blurriness introduced by relying solely on pixel-space losses
- in order to avoid high-variance latent spaces, experiment with two different kinds of regularizations
- The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE
- VQ-reg. uses a vector quantization layer [93] within the decoder
- the subsequent DM can work with the 2D structure of the learned latent space z, instead of relying on an arbitrary 1D ordering as in prior work
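A minimal sketch (assuming PyTorch) of the two regularization variants above: the KL case assumes an encoder that outputs a mean and log-variance per latent, and the VQ case is reduced to a nearest-neighbour codebook lookup, omitting the straight-through gradient and commitment losses of a full vector quantization layer.

```python
import torch

def kl_reg_latent(mu, logvar, weight=1e-6):
    """KL-reg.: sample z and add a slight KL penalty pulling the learned latent
    towards a standard normal, as in a VAE (the small `weight` is an assumption)."""
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterized sample
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return z, weight * kl

def vq_reg_latent(z_e, codebook):
    """VQ-reg. (sketch): snap each spatial latent vector to its nearest codebook entry.
    z_e: (B, C, H, W) continuous latents; codebook: (K, C) embedding table."""
    B, C, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)               # (B*H*W, C)
    idx = torch.cdist(flat, codebook).argmin(dim=1)             # nearest code per position
    z_q = codebook[idx].view(B, H, W, C).permute(0, 3, 1, 2)    # back to (B, C, H, W)
    return z_q
```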
-
Latent Diffusion Models
-
Diffusion Models
- probabilistic
- designed to learn a data distribution by gradually denoising a normally distributed variable
- trained to predict a denoised variant of their input x_t, where x_t is a noisy version of the input x (see the sketch below)
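A minimal PyTorch sketch of this training objective, L_DM = E_{x, ε∼N(0,1), t} [ ||ε − ε_θ(x_t, t)||² ]; the linear noise schedule is an assumed DDPM-style choice and `eps_model` is a hypothetical noise-prediction network, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def dm_training_step(eps_model, x0, T=1000):
    """One denoising training step: corrupt x0 to x_t and regress the added noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)                  # random timesteps
    betas = torch.linspace(1e-4, 2e-2, T, device=x0.device)          # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise   # noisy version of the input
    return F.mse_loss(eps_model(x_t, t), noise)                      # L_DM
```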
-
Generative Modeling of Latent Representations
-
have an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models
- focus on the important, semantic bits of the data
- train in a lower dimensional, computationally much more efficient space
-
attention-based transformer models (previous work)
- relied on autoregressive modeling in a highly compressed, discrete latent space
- in contrast, LDMs can take advantage of image-specific inductive biases
- build the underlying UNet primarily from 2D convolutional layers
- focus the objective on the perceptually most relevant bits using the reweighted bound
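The resulting two-stage setup, sketched below under the same assumptions (hypothetical `encoder`/`decoder` for the pretrained autoencoder, and `dm_training_step` reused from the sketch above): the frozen first stage compresses images to latents z = E(x), the diffusion model is trained on those latents with the reweighted objective L_LDM = E[||ε − ε_θ(z_t, t)||²], and sampled latents are decoded back to pixel space with a single pass through D.

```python
import torch

@torch.no_grad()
def encode_to_latent(encoder, x):
    """First stage (frozen): perceptual compression, x -> z = E(x)."""
    return encoder(x)

def ldm_training_step(eps_model, encoder, x0, T=1000):
    """Second stage: the same denoising objective, applied in latent space."""
    z0 = encode_to_latent(encoder, x0)           # much lower-dimensional than x0
    return dm_training_step(eps_model, z0, T)    # L_LDM on z_t instead of x_t

@torch.no_grad()
def decode_sample(decoder, z):
    """Map a latent produced by the reverse diffusion process back to an image, x~ = D(z)."""
    return decoder(z)
```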
-
Conditioning Mechanisms
-
diffusion models can
- be implemented with a conditional denoising autoencoder ε_θ(z_t, t, y)
- pave the way to controlling the synthesis process through inputs y such as text, semantic maps or other image-to-image translation tasks
-
we turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism
-
To pre-process conditioning inputs y from various modalities (such as language prompts)
- introduce a domain-specific encoder τ_θ that projects y to an intermediate representation τ_θ(y), which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing Attention(Q, K, V) = softmax(QKᵀ/√d)·V, with Q = W_Q·φ(z_t), K = W_K·τ_θ(y), V = W_V·τ_θ(y) (sketched below)
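A minimal single-head sketch of such a cross-attention layer (assuming flattened UNet features φ(z_t) of shape (B, N, d_model) and conditioning tokens τ_θ(y) of shape (B, M, d_cond); class and parameter names are placeholders, not the paper's implementation):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from the UNet features phi(z_t); keys and values come from
    the conditioning tokens tau_theta(y)."""
    def __init__(self, d_model, d_cond, d_head=64):
        super().__init__()
        self.scale = d_head ** -0.5
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_cond, d_head, bias=False)
        self.W_v = nn.Linear(d_cond, d_head, bias=False)
        self.proj_out = nn.Linear(d_head, d_model)

    def forward(self, phi_zt, tau_y):
        q = self.W_q(phi_zt)                                          # (B, N, d_head)
        k = self.W_k(tau_y)                                           # (B, M, d_head)
        v = self.W_v(tau_y)                                           # (B, M, d_head)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)   # softmax(QK^T / sqrt(d))
        return self.proj_out(attn @ v)                                # added back into the UNet features
```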
-
6. Exp
LDMs provide a means to flexible and computationally tractable diffusion-based image synthesis, including high-resolution generation of various image modalities.
-
analyze the gains of our models compared to pixel-based diffusion models in both training and inference
- LDMs trained in VQ-regularized latent spaces achieve better sample quality
-
On Perceptual Compression Tradeoffs
analyze the behavior of our LDMs with different downsampling factors f ∈ {1, 2, 4, 8, 16, 32}
trained on a single NVIDIA A100
train all models for the same number of steps and with the same number of parameters
- small downsampling factors (LDM-{1, 2}) result in slow training progress
- overly large values of f cause stagnating fidelity after comparably few training steps
- we attribute this to
- i) leaving most of the perceptual compression to the diffusion model
- ii) too strong first-stage compression resulting in information loss and thus limiting the achievable quality
- LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [28] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.
LDM-4 and LDM-8 lie in the best-behaved regime for achieving high-quality synthesis results.
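A quick way to see what the downsampling factor buys: an LDM-f diffuses over latents of spatial size H/f × W/f, so the number of spatial positions the UNet processes shrinks quadratically in f (the latent channel count below is an illustrative assumption, not the paper's setting for every f).

```python
# Spatial size of the diffusion space for a 256x256 image at downsampling factor f.
H = W = 256
c = 4  # latent channels (illustrative assumption)
for f in (1, 2, 4, 8, 16, 32):
    h, w = H // f, W // f
    print(f"LDM-{f:2d}: latent {c}x{h}x{w} -> {h * w} spatial positions "
          f"(spatial reduction {(H * W) // (h * w)}x vs. pixel space)")
```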
-
Image Generation with Latent Diffusion
- FID and Precision-and-Recall[49]
-
Conditional Latent Diffusion
-
Transformer Encoders for LDMs
-
text-to-image model
- train a 1.45B parameter model conditioned on language prompts on LAION-400M
- employ the BERT-tokenizer and implement τ_θ as a transformer to infer a latent code which is mapped into the UNet via cross-attention (see the sketch below)
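A hedged sketch of that text-conditioning path, using the Hugging Face BERT tokenizer for tokenization and a plain nn.TransformerEncoder standing in for the transformer τ_θ (the actual model's size and architecture differ); the conditioning tokens it produces are what the cross-attention layer sketched earlier attends to.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

class TextEncoder(nn.Module):
    """Stand-in for tau_theta: token ids -> sequence of conditioning embeddings."""
    def __init__(self, vocab_size=30522, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, input_ids):
        return self.encoder(self.embed(input_ids))   # (B, M, d_model) = tau_theta(y)

ids = tokenizer(["a painting of a squirrel eating a burger"],
                return_tensors="pt", padding=True).input_ids
cond_tokens = TextEncoder()(ids)   # fed to the UNet's cross-attention layers
```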
-
to further analyze the flexibility of the cross-attention based conditioning mechanism
- train models to synthesize images based on semantic layouts on OpenImages [48], and finetune on COCO [4]
-
our best-performing class-conditional ImageNet models outperform the state-of-the-art diffusion model ADM while significantly reducing computational requirements and parameter count
-
Convolutional Sampling Beyond 256²
semantic synthesis, super-resolution and inpainting
-
Super-Resolution with Latent Diffusion
To be read.
-
Inpainting with Latent Diffusion
To be read.
7. Conclusion
improve both the training and sampling efficiency
without task-specific architectures