# Latent Diffusion Models
[arXiv](https://arxiv.org/abs/2112.10752) | [BibTeX](#bibtex)

<p align="center">
<img src=assets/results.gif />
</p>


[**High-Resolution Image Synthesis with Latent Diffusion Models**](https://arxiv.org/abs/2112.10752)<br/>
[Robin Rombach](https://github.com/rromb)\*,
[Andreas Blattmann](https://github.com/ablattmann)\*,
[Dominik Lorenz](https://github.com/qp-qp),
[Patrick Esser](https://github.com/pesser),
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>
\* equal contribution

<p align="center">
<img src=assets/modelfigure.png />
</p>

## News
### April 2022
- Thanks to [Katherine Crowson](https://github.com/crowsonkb), classifier-free guidance received a ~2x speedup and the [PLMS sampler](https://arxiv.org/abs/2202.09778) is available. See also [this PR](https://github.com/CompVis/latent-diffusion/pull/51).

- Our 1.45B [latent diffusion LAION model](#text-to-image) was integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/multimodalart/latentdiffusion)

- More pre-trained LDMs are available: 
  - A 1.45B [model](#text-to-image) trained on the [LAION-400M](https://arxiv.org/abs/2111.02114) database.
  - A class-conditional model on ImageNet, achieving an FID of 3.6 when using [classifier-free guidance](https://openreview.net/pdf?id=qw8AKxfYbI). Available via a [colab notebook](https://colab.research.google.com/github/CompVis/latent-diffusion/blob/main/scripts/latent_imagenet_diffusion.ipynb) [![][colab]][colab-cin].
  
## Requirements
A suitable [conda](https://conda.io/) environment named `ldm` can be created
and activated with:

```
conda env create -f environment.yaml
conda activate ldm
```

# Pretrained Models
A general list of all available checkpoints is provided in our [model zoo](#model-zoo).
If you use any of these models in your work, we are always happy to receive a [citation](#bibtex).

## Text-to-Image
![text2img-figure](assets/txt2img-preview.png) 


Download the pre-trained weights (5.7GB)
```
mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt
```
and sample with
```
python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50
```
This will save each sample individually as well as a grid of size `n_iter` x `n_samples` at the specified output location (default: `outputs/txt2img-samples`).
Quality, sampling speed and diversity are best controlled via the `scale`, `ddim_steps` and `ddim_eta` arguments.
As a rule of thumb, higher values of `scale` produce better samples at the cost of a reduced output diversity.   
Furthermore, increasing `ddim_steps` generally also gives higher quality samples, but returns are diminishing for values > 250.
Fast sampling (i.e. low values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0`.  
Faster sampling (i.e. even lower values of `ddim_steps`) while retaining good quality can be achieved by using `--ddim_eta 0.0` and `--plms` (see [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778)).
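
The effect of `scale` can be illustrated with the usual classifier-free guidance rule. The following is a minimal sketch, assuming the sampler linearly combines an unconditional and a text-conditional noise prediction; the function name `guided_noise` and the toy numbers are illustrative and are not taken from this repository.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 1.0 returns the conditional prediction unchanged; larger values
    push the result further along the direction (eps_cond - eps_uncond).
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy "predictions" (made-up numbers, not model output).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

for scale in (1.0, 5.0):
    print(scale, guided_noise(eps_uncond, eps_cond, scale))
```

With `scale` above 1, every component where the two predictions disagree is amplified, which is one way to picture why samples follow the prompt more closely but vary less.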

#### Beyond 256²

For certain inputs, simply running the model in a convolutional fashion on larger feature maps than it was trained on
can sometimes produce interesting results. To try it out, tune the `H` and `W` arguments (which will be integer-divided
by 8 in order to calculate the corresponding latent size), e.g. run

```
python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  
```
to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting. 

The example below was generated using the above command. 
![text2img-figure-conv](assets/txt2img-convsample.png)
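
The latent-size arithmetic mentioned above (integer division of `H` and `W` by 8) can be sketched as follows. The helper name `latent_shape` is hypothetical, and the channel count of 4 is an assumption for an f=8 autoencoder with a 4-channel latent.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, latent_height, latent_width) for a requested image size.

    `factor` is the downsampling factor applied to H and W (8 in this section);
    `channels` is an assumed latent channel count.
    """
    return (channels, height // factor, width // factor)


print(latent_shape(256, 256))   # training resolution
print(latent_shape(384, 1024))  # the wide example from this section
```

Because the division is integer division, values of `H` or `W` that are not multiples of 8 are rounded down, so the output image may be slightly smaller than requested.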


## Inpainting
![inpainting](assets/inpainting.png)

Download the pre-trained weights
```
wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1
```

and sample with
```
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results
```
`indir` should contain images `*.png` and masks `<image_fname>_mask.png` like
the examples provided in `data/inpainting_examples`.
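
Before running the script, it can help to check that every mask has a matching image. The sketch below assumes only the naming convention stated above (`<image_fname>.png` next to `<image_fname>_mask.png`); the function `pair_images_and_masks` is illustrative and is not part of `scripts/inpaint.py`.

```python
from pathlib import Path


def pair_images_and_masks(indir):
    """Return sorted (image, mask) path pairs found in `indir`.

    A mask named `<image_fname>_mask.png` is paired with `<image_fname>.png`;
    masks without a matching image are skipped.
    """
    indir = Path(indir)
    pairs = []
    for mask in sorted(indir.glob("*_mask.png")):
        image = mask.with_name(mask.name[: -len("_mask.png")] + ".png")
        if image.exists():
            pairs.append((image, mask))
    return pairs


if __name__ == "__main__":
    for image, mask in pair_images_and_masks("data/inpainting_examples"):
        print(image.name, "<-", mask.name)
```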

## Class-Conditional ImageNet

Available via a [notebook](scripts/latent_imagenet_diffusion.ipynb) [![][colab]][colab-cin].
![class-conditional](assets/birdhouse.png)

[colab]: <https://colab.research.google.com/assets/colab-badge.svg>
[colab-cin]: <https://colab.research.google.com/github/CompVis/latent-diffusion/blob/main/scripts/latent_imagenet_diffusion.ipynb>


## Unconditional Models

We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via

```shell script
CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <eta>
```

# Train your own LDMs

## Data preparation

### Faces 
For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the [taming-transformers](https://github.com/CompVis/taming-transformers#celeba-hq) 
repository.

### LSUN 

The LSUN datasets can be conveniently downloaded via the script available [here](https://github.com/fyu/lsun).
We performed a custom split into training and validation images, and provide the corresponding filenames
at [https://ommer-lab.com/files/lsun.zip](https://ommer-lab.com/files/lsun.zip). 
After downloading, extract them to `./data/lsun`. The beds/cats/churches subsets should
also be placed/symlinked at `./data/lsun/bedrooms`/`./data/lsun/cats`/`./data/lsun/churches`, respectively.

### ImageNet
The code will try to download (through [Academic
Torrents](http://academictorrents.com/)) and prepare ImageNet the first time it
is used. However, since ImageNet is quite large, this requires a lot of disk
space and time. If you already have ImageNet on your disk, you can speed things
up by putting the data into
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` (which defaults to
`~/.cache/autoencoders/data/ILSVRC2012_{split}/data/`), where `{split}` is one
of `train`/`validation`. It should have the following structure:

```
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...
```

If you haven't extracted the data, you can also place
`ILSVRC2012_img_train.tar`/`ILSVRC2012_img_val.tar` (or symlinks to them) into
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/` /
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/`, which will then be
extracted into the above structure without downloading it again. Note that this
will only happen if neither a folder
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` nor a file
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready` exists. Remove them
if you want to force running the dataset preparation again.
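
The condition described above can be checked by hand before starting a run. This is a minimal sketch, assuming the cache root is read from an `XDG_CACHE` environment variable as written in this section and falls back to `~/.cache`; the helper names are hypothetical and the actual lookup in the code may differ.

```python
import os
from pathlib import Path


def imagenet_split_dir(split, cache_root=None):
    """Return the ILSVRC2012_{split} directory under the cache root."""
    if cache_root is None:
        # Assumption: same variable name and default as stated in this README.
        cache_root = os.environ.get("XDG_CACHE", os.path.expanduser("~/.cache"))
    return Path(cache_root) / "autoencoders" / "data" / f"ILSVRC2012_{split}"


def preparation_will_run(split, cache_root=None):
    """True if neither the `data/` folder nor the `.ready` file exists."""
    root = imagenet_split_dir(split, cache_root)
    return not (root / "data").is_dir() and not (root / ".ready").is_file()


if __name__ == "__main__":
    for split in ("train", "validation"):
        print(split, "needs preparation:", preparation_will_run(split))
```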


## Model Training

Logs and checkpoints for trained models are saved to `logs/<START_DATE_AND_TIME>_<config_spec>`.
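
To locate a run afterwards, the directory name can be reconstructed from the start time and the config file. The sketch below is illustrative only: the timestamp format `%Y-%m-%dT%H-%M-%S` is an assumption, and `log_dir_for` is a hypothetical helper rather than a function from `main.py`.

```python
from datetime import datetime
from pathlib import Path


def log_dir_for(config_path, start=None, root="logs"):
    """Build `logs/<START_DATE_AND_TIME>_<config_spec>` for a config file."""
    start = start or datetime.now()
    stamp = start.strftime("%Y-%m-%dT%H-%M-%S")  # assumed timestamp format
    config_spec = Path(config_path).stem          # file name without ".yaml"
    return Path(root) / f"{stamp}_{config_spec}"


if __name__ == "__main__":
    print(log_dir_for("configs/autoencoder/autoencoder_kl_32x32x4.yaml"))
```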

### Training autoencoder models

Configs for training a KL-regularized autoencoder on ImageNet are provided at `configs/autoencoder`.
Training can be started by running
```
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,    
```
where `config_spec` is one of {`autoencoder_kl_8x8x64`(f=32, d=64), `autoencoder_kl_16x16x16`(f=16, d=16), 
`autoencoder_kl_32x32x4`(f=8, d=4), `autoencoder_kl_64x64x3`(f=4, d=3)}.
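
The config names encode the latent size as `<height>x<width>x<d>`. The listed `f` values are consistent with 256x256 inputs, which the following sketch takes as an assumption; `describe_autoencoder` is a hypothetical helper for illustration.

```python
def describe_autoencoder(config_spec, image_size=256):
    """Derive the downsampling factor f and latent dimension d from a config name.

    Assumes names of the form `autoencoder_kl_<H>x<W>x<d>` and square inputs
    of `image_size` pixels.
    """
    height, width, dim = (int(v) for v in config_spec.rsplit("_", 1)[-1].split("x"))
    return {"f": image_size // height, "d": dim, "latent": (height, width, dim)}


for spec in ("autoencoder_kl_8x8x64", "autoencoder_kl_16x16x16",
             "autoencoder_kl_32x32x4", "autoencoder_kl_64x64x3"):
    print(spec, describe_autoencoder(spec))
```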

For training VQ-regularized models, see the [taming-transformers](https://github.com/CompVis/taming-transformers) 
repository.

### Training LDMs 

In ``configs/latent-diffusion/`` we provide configs for training LDMs on the LSUN, CelebA-HQ, FFHQ and ImageNet datasets. 
Training can be started by running

```shell script
CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,
``` 

where ``<config_spec>`` is one of {`celebahq-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3), `ffhq-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3),
`lsun_bedrooms-ldm-vq-4` (f=4, VQ-reg. autoencoder, spatial size 64x64x3),
`lsun_churches-ldm-kl-8` (f=8, KL-reg. autoencoder, spatial size 32x32x4), `cin-ldm-vq-8` (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.

# Model Zoo 

## Pretrained Autoencoding Models
![rec2](assets/reconstruction2.png)

All models were trained until convergence (no further substantial improvement in rFID).

| Model                   | rFID vs val | train steps | PSNR            | PSIM          | Link                                                          | Comments     |
|-------------------------|-------------|-------------|-----------------|---------------|---------------------------------------------------------------|--------------|
| f=4, VQ (Z=8192, d=3)   | 0.58        | 533066      | 27.43 +/- 4.26  | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip        |              |
| f=4, VQ (Z=8192, d=3)   | 1.06        | 658131      | 25.21 +/- 4.17  | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
| f=8, VQ (Z=16384, d=4)  | 1.14        | 971043      | 23.07 +/- 3.99  | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip        |              |
| f=8, VQ (Z=256, d=4)    | 1.49        | 1608649     | 22.35 +/- 3.81  | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip   |              |
| f=16, VQ (Z=16384, d=8) | 5.15        | 1101166     | 20.83 +/- 3.61  | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 |              |
|                         |             |             |                 |               |                                                               |              |
| f=4, KL                 | 0.27        | 176991      | 27.53 +/- 4.54  | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip        |              |
| f=8, KL                 | 0.90        | 246803      | 24.19 +/- 4.19  | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip        |              |
| f=16, KL (d=16)         | 0.87        | 442998      | 24.08 +/- 4.22  | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip       |              |
| f=32, KL (d=64)         | 2.04        | 406763      | 22.27 +/- 3.93  | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip       |              |

### Get the models

Running the following script downloads and extracts all available pretrained autoencoding models.   
```shell script
bash scripts/download_first_stages.sh
```

The first stage models can then be found in `models/first_stage_models/<model_spec>`.


## Pretrained LDMs
| Datset                          |   Task    | Model        | FID           | IS              | Prec | Recall | Link                                                                                                                                                                                   | Comments                                        
|---------------------------------|------|--------------|---------------|-----------------|------|------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
| CelebA-HQ                       | Unconditional Image Synthesis    |  LDM-VQ-4 (200 DDIM steps, eta=0)| 5.11 (5.11)          | 3.29            | 0.72    | 0.49 |    https://ommer-lab.com/files/latent-diffusion/celeba.zip     |                                                 |  
| FFHQ                            | Unconditional Image Synthesis    |  LDM-VQ-4 (200 DDIM steps, eta=1)| 4.98 (4.98)  | 4.50 (4.50)   | 0.73 | 0.50 |              https://ommer-lab.com/files/latent-diffusion/ffhq.zip                                              |                                                 |
| LSUN-Churches                   | Unconditional Image Synthesis   |  LDM-KL-8 (400 DDIM steps, eta=0)| 4.02 (4.02) | 2.72 | 0.64 | 0.52 |         https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip        |                                                 |  
| LSUN-Bedrooms                   | Unconditional Image Synthesis   |  LDM-VQ-4 (200 DDIM steps, eta=1)| 2.95 (3.0)          | 2.22 (2.23)| 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip |                                                 |  
| ImageNet                        | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77 (7.76)\* / 15.82\*\* | 201.56 (209.52)\* / 78.82\*\* | 0.84\* / 0.65\*\* | 0.35\* / 0.63\*\* |   https://ommer-lab.com/files/latent-diffusion/cin.zip   | \*: w/ guiding, classifier_scale 10; \*\*: w/o guiding. Scores in brackets were calculated with the script provided by [ADM](https://github.com/openai/guided-diffusion) |
| Conceptual Captions             |  Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79         | 13.89           | N/A | N/A |              https://ommer-lab.com/files/latent-diffusion/text2img.zip                                | finetuned from LAION                            |   
| OpenImages                      | Super-resolution   | LDM-VQ-4     | N/A            | N/A               | N/A    | N/A    |                                    https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip                                    | BSR image degradation                           |
| OpenImages                      | Layout-to-Image Synthesis    | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02         | 15.92           | N/A    | N/A    |                  https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip                                           |                                                 | 
| Landscapes      |  Semantic Image Synthesis   | LDM-VQ-4  | N/A             | N/A               | N/A    | N/A    |           https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip                                    |                                                 |
| Landscapes       |  Semantic Image Synthesis   | LDM-VQ-4  | N/A             | N/A               | N/A    | N/A    |           https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip                                    |             finetuned on resolution 512x512                                     |


### Get the models

The LDMs listed above can jointly be downloaded and extracted via

```shell script
bash scripts/download_models.sh
```

The models can then be found in `models/ldm/<model_spec>`.


## Coming Soon...

* More inference scripts for conditional LDMs.
* In the meantime, you can play with our colab notebook https://colab.research.google.com/drive/1xqzUi2iXQXDqXBHQGP9Mqt2YrYW6cx-J?usp=sharing

## Comments 

- Our codebase for the diffusion models builds heavily on [OpenAI's ADM codebase](https://github.com/openai/guided-diffusion)
and [https://github.com/lucidrains/denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch). 
Thanks for open-sourcing!

- The implementation of the transformer encoder is from [x-transformers](https://github.com/lucidrains/x-transformers) by [lucidrains](https://github.com/lucidrains?tab=repositories). 


## BibTeX

```
@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```