Diffusion Model
1 Vocabulary
latent: adj. hidden; dormant; existing but not yet visible; n. latent fingerprint
decompose: v. to break down (into parts); to decay; to disintegrate
state-of-the-art: adj. the most advanced available
fidelity: n. faithfulness; accuracy of reproduction
hierarchy: n. a system of levels; a ranking of members of an organization; the ruling group
exploit: v. to make use of; to take advantage of (often unfairly); n. a feat
prone: adj. liable or inclined to (something undesirable)
excessive: adj. more than is necessary; too much; extreme
imperceptible: adj. too slight or small to be noticed
distortion: n. twisting out of shape; misrepresentation; (of a signal or waveform) deviation from the original
trade-off: n. a balance or compromise between competing factors
perceptual: adj. relating to perception
superfluous: adj. more than is needed; redundant
manifold: n. (math.) a manifold; (mech.) a manifold pipe; adj. many and various; v. to duplicate
blurriness: n. the quality of being blurred or unclear
quantization: n. (physics) quantization; mapping continuous values to a discrete set
autoregressively: adv. in an autoregressive manner
probabilistic: adj. based on probability; stochastic
discrete: adj. separate; non-continuous
modalities: n. modes or types of data (e.g., text, image, audio)
intermediate representation: an internal representation between input and output
tractable: adj. easy to handle, compute, or work with
stagnate: v. to stop developing or making progress
spectacular: adj. impressive; striking; n. an impressive show or display
2 Paper Structure
1. Title
High-Resolution Image Synthesis with Latent Diffusion Models
2. Abs
  - Diffusion models (DMs) achieve state-of-the-art synthesis results
    - on image data and beyond
    - their formulation allows a guiding mechanism to control the image generation process without retraining
    - but they typically operate directly in pixel space: training consumes hundreds of GPU days and inference is expensive due to sequential evaluations
  - To enable DM training on limited computational resources while retaining quality and flexibility
    - apply DMs in the latent space of pretrained autoencoders
  - In contrast to previous work
    - for the first time, reach a near-optimal point between complexity reduction and detail preservation
  - Introduce cross-attention layers into the model architecture
    - turn DMs into flexible conditional generators (for inputs such as text or bounding boxes)
    - enable high-resolution synthesis in a convolutional manner
  - Latent diffusion models (LDMs)
    - achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis
    - competitive performance on unconditional image generation, text-to-image synthesis, and super-resolution
    - significantly reduce computational requirements compared to pixel-based DMs

3. Intro
  - Image synthesis is among the computer-vision fields with the greatest computational demands
    - high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models containing billions of parameters in autoregressive (AR) transformers
  - DMs' applications:
    - class-conditional: image synthesis, super-resolution
    - unconditional: inpainting, colorization, stroke-based synthesis
  - Being likelihood-based models, DMs do not exhibit the mode collapse and training instabilities of GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models
  - Democratizing High-Resolution Image Synthesis
    - DMs spend excessive amounts of capacity on modeling imperceptible details of the data
    - training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images
  - Departure to Latent Space
 
4. Related Work
5. Method
  - DMs allow ignoring perceptually irrelevant details by undersampling the corresponding loss terms [29], but they still require costly function evaluations in pixel space
  - Introduce an explicit separation of the compressive from the generative learning phase
    - use an autoencoding model that learns a space perceptually equivalent to image space
    - but one that offers reduced computational complexity
  - Several advantages follow from this approach
  - Perceptual Image Compression
    - based on previous work [23]
    - an autoencoder trained by a combination of a perceptual loss [102] and a patch-based adversarial objective
    - this ensures that
      - the reconstructions are confined to the image manifold by enforcing local realism
      - blurriness introduced by relying solely on pixel-space losses is avoided
    - in order to avoid high-variance latent spaces, experiment with two different kinds of regularization (see the sketch after this list)
      - the first variant, KL-reg., imposes a slight KL penalty towards a standard normal on the learned latent, similar to a VAE
      - the second, VQ-reg., uses a vector quantization layer [93] within the decoder
    - in contrast to previous works, which rely on an arbitrary 1D ordering of the learned space z, the subsequent DM can exploit the two-dimensional structure of z
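A minimal PyTorch-style sketch of the two regularization variants above, to make the difference concrete. Class names (`KLRegularizer`, `VQRegularizer`) and hyperparameters (`codebook_size`, `dim`) are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the two latent-space regularizers: KL-reg. vs. VQ-reg.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KLRegularizer(nn.Module):
    """KL-reg.: a slight KL penalty pushing the latent towards N(0, I), as in a VAE."""

    def forward(self, mean, logvar):
        # Reparameterization trick, then the KL term to a standard normal.
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        kl = 0.5 * torch.sum(mean ** 2 + logvar.exp() - 1.0 - logvar, dim=[1, 2, 3])
        return z, kl.mean()


class VQRegularizer(nn.Module):
    """VQ-reg.: a vector-quantization layer mapping each latent vector to its nearest codebook entry."""

    def __init__(self, codebook_size=8192, dim=4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):
        # z: (B, C, H, W) -> flatten spatial positions to (B*H*W, C)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        # Nearest codebook entry per latent vector (Euclidean distance).
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients flow back to the encoder.
        z_q = z + (z_q - z).detach()
        commit = F.mse_loss(z, z_q.detach())
        return z_q, commit
```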
 
  - Latent Diffusion Models
    - Diffusion Models
      - probabilistic models designed to learn a data distribution by gradually denoising a normally distributed variable
      - trained to predict a denoised variant of their input x_t, where x_t is a noisy version of the input x (a training-step sketch follows after this list)
    - Generative Modeling of Latent Representations
      - we now have an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away; compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can
        - focus on the important, semantic bits of the data
        - train in a lower-dimensional, computationally much more efficient space
      - unlike prior work that relies on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of image-specific inductive biases
        - build the underlying UNet primarily from 2D convolutional layers
        - focus the objective on the perceptually most relevant bits using the reweighted bound
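The bullets above correspond to the noise-prediction objectives L_DM = E_{x, ε∼N(0,1), t}[ ||ε − ε_θ(x_t, t)||² ] in pixel space and L_LDM = E_{E(x), ε∼N(0,1), t}[ ||ε − ε_θ(z_t, t)||² ] in the latent space. Below is a minimal sketch of one LDM training step under these objectives; `encoder`, `unet`, and the linear noise schedule are assumed placeholder interfaces, not the paper's exact code.

```python
# One noise-prediction training step on the latent z = E(x).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product, i.e. bar(alpha)_t


def ldm_training_step(encoder, unet, x):
    """encoder and unet are placeholder modules: E(x) -> z, and eps_theta(z_t, t)."""
    with torch.no_grad():
        z = encoder(x)                                    # z = E(x): frozen first-stage encoder
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                             # eps ~ N(0, I)
    ab = alpha_bar.to(z.device)[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z + (1.0 - ab).sqrt() * eps         # forward diffusion: noisy latent z_t
    eps_pred = unet(z_t, t)                               # eps_theta(z_t, t)
    return F.mse_loss(eps_pred, eps)                      # L_LDM = E[ ||eps - eps_theta(z_t, t)||^2 ]
```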
  - Conditioning Mechanisms
    - diffusion models can model conditional distributions of the form p(z|y)
      - implemented with a conditional denoising autoencoder ε_θ(z_t, t, y)
      - this paves the way to controlling the synthesis process through inputs y such as text, semantic maps, or other image-to-image translation tasks
    - we turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism (see the sketch after this list)
    - to pre-process y from various modalities (such as language prompts)
      - introduce a domain-specific encoder τ_θ that projects y to an intermediate representation τ_θ(y), which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing Attention(Q, K, V) = softmax(QKᵀ/√d) · V, with Q = W_Q · φ_i(z_t), K = W_K · τ_θ(y), V = W_V · τ_θ(y)
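A minimal single-head sketch of the cross-attention layer just described: queries come from flattened UNet features φ_i(z_t), keys and values from the conditioning encoder output τ_θ(y). The class name and dimensions are illustrative, and the paper's implementation is multi-head.

```python
# Cross-attention: UNet features attend over conditioning tokens.
import math
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim, d=64):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(query_dim, d, bias=False)    # W_Q
        self.to_k = nn.Linear(context_dim, d, bias=False)  # W_K
        self.to_v = nn.Linear(context_dim, d, bias=False)  # W_V
        self.to_out = nn.Linear(d, query_dim)

    def forward(self, features, context):
        # features: (B, N, query_dim) flattened UNet features phi_i(z_t)
        # context:  (B, M, context_dim) conditioning tokens tau_theta(y)
        q, k, v = self.to_q(features), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)  # softmax(QK^T / sqrt(d))
        return self.to_out(attn @ v)                                             # ... @ V
```

Usage-wise, `features` would be an intermediate UNet activation reshaped to (B, H·W, C), and `context` the token embeddings produced by τ_θ from a prompt.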
 
6. Exp
LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis, including high-resolution generation of various image modalities.
  - analyze the gains of our models compared to pixel-based diffusion models in both training and inference
    - LDMs trained in VQ-regularized latent spaces achieve better sample quality
  - On Perceptual Compression Tradeoffs
    - analyzes the behavior of LDMs with different downsampling factors f (a note on latent sizes follows after this list)
    - a single NVIDIA A100; all models trained for the same number of steps and with the same number of parameters
    - small downsampling factors (LDM-{1,2}) result in slow training progress
    - overly large values of f cause stagnating fidelity after comparably few training steps
    - we attribute this to
      - (i) leaving most of the perceptual compression to the diffusion model, and
      - (ii) too strong first-stage compression resulting in information loss and thus limiting the achievable quality
    - LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [28] gap of 28 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps
    - LDM-4 and LDM-8 lie in the best-behaved regime for achieving high-quality synthesis results
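For intuition on what the downsampling factor f buys, the snippet below prints the spatial size of the latent the diffusion model operates on, assuming a 256×256 input; the latent channel count depends on the autoencoder configuration and is not shown.

```python
# Latent spatial resolution for a 256x256 input under downsampling factor f (LDM-f).
for f in (1, 2, 4, 8, 16, 32):
    print(f"LDM-{f:>2}: diffusion runs on a {256 // f}x{256 // f} latent")
```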
  - Image Generation with Latent Diffusion
    - evaluated with FID and Precision-and-Recall [49]
 
  - Conditional Latent Diffusion
    - Transformer Encoders for LDMs
      - text-to-image model
        - train a 1.45B-parameter model conditioned on language prompts on LAION-400M
        - employ the BERT-tokenizer and implement τ_θ as a transformer to infer a latent code which is mapped into the UNet via cross-attention (a sketch of such an encoder follows after this list)
      - to further analyze the flexibility of the cross-attention-based conditioning mechanism
        - train models to synthesize images based on semantic layouts on OpenImages [48], and finetune on COCO [4]
      - our best-performing class-conditional ImageNet models outperform the state-of-the-art diffusion model ADM while significantly reducing computational requirements and parameter count
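A hedged sketch of what a conditioning encoder τ_θ of this kind could look like: token ids from a tokenizer pass through a small transformer, and the resulting embedding sequence is what the cross-attention layers consume. All hyperparameters (vocab size, width, depth, sequence length) are illustrative placeholders, not the paper's configuration.

```python
# Conditioning encoder tau_theta: token ids -> context embeddings for cross-attention.
import torch
import torch.nn as nn


class ConditioningTransformer(nn.Module):
    def __init__(self, vocab_size=30522, dim=512, depth=4, max_len=77):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):
        # token_ids: (B, L) integer ids from a tokenizer (e.g. a BERT tokenizer, as in the notes above)
        pos = torch.arange(token_ids.shape[1], device=token_ids.device)
        h = self.tok(token_ids) + self.pos(pos)
        return self.encoder(h)   # (B, L, dim): keys/values for the cross-attention layers
```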
 
    - Convolutional Sampling Beyond 256²
      - semantic synthesis, super-resolution and inpainting
    - Super-Resolution with Latent Diffusion
      - to be read
    - Inpainting with Latent Diffusion
      - to be read

7. Conclusion
  - LDMs significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality
  - achieve favorable results across a wide range of conditional image synthesis tasks without task-specific architectures