Diffusion Models
Diffusion models have become essential for state-of-the-art content generation. They surpass the earlier Generative Adversarial Networks (GANs) in two main respects.
Diffusion models can incorporate external information into the generation process, which makes the generated content more realistic and diverse. This ability is especially useful when the output must follow certain rules or constraints, for example when generating natural language or images with specific attributes. A multimodal model such as CLIP can serve as a conditioning module, extracting contextually meaningful features from text or images. These features then guide the generation process, ensuring that the generated content reflects the desired attributes and characteristics of the inputs.
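As a minimal sketch of this kind of conditioning, the example below uses a pretrained latent diffusion pipeline from Hugging Face diffusers, whose built-in CLIP text encoder turns the prompt into embeddings that guide every denoising step. The checkpoint name, prompt, and parameter values are illustrative placeholders, not a description of our production setup.

```python
# Sketch: text-conditioned image generation with a pretrained pipeline.
# The checkpoint name and prompt are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # a single modern GPU is enough for inference

# The pipeline's CLIP text encoder converts the prompt into embeddings
# that condition every denoising step via cross-attention.
prompt = "a photograph of a red vintage car at sunset"
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("car.png")
```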
Diffusion models are hard to train because they demand substantial computing power and GPU resources. This is a significant cost for companies and organisations that develop and deploy these models, and it is especially challenging for startups with limited resources. Training such a model, for example, can require many powerful GPU instances (such as several racks of NVIDIA A100s) working together; this is not only an expensive constraint, it can also take several weeks.
Using a diffusion model in production also costs money every time it is used, on top of the initial cost of training. Model inference, the process of generating content from the trained model, requires considerable computational resources for high-resolution or complex content. Inference can be performed on a single GPU, such as an NVIDIA A100, V100, or RTX 3090, and is much cheaper than training; however, the costs still add up given the high demand for such applications.
To reduce the computational cost of diffusion models, our team has developed proprietary algorithms and techniques to optimise both the training and inference processes. One of these is parallelization of the model training process; techniques such as pruning and quantization further reduce the number of parameters and the amount of compute needed, as illustrated in the sketch below.
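The following is a generic PyTorch sketch of pruning and quantization, not the proprietary pipeline described above. The toy two-layer model and the 30% pruning ratio are placeholders standing in for a denoiser's linear layers.

```python
# Generic sketch of two standard compression techniques:
# unstructured weight pruning and dynamic int8 quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Prune 30% of the smallest-magnitude weights in each linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantize the remaining weights to int8 for cheaper CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```

Pruning removes low-magnitude weights, while quantization stores the remaining ones at lower precision; together they shrink the model and cut the compute needed per inference.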