Stable Diffusion process
[Figure: the three components in sequence. Text Encoder (CLIP Text) -> token embeddings (77 x 768) -> Image Information Creator (UNet + Scheduler), which runs the UNet for steps 1, 2, 3, …, 50 -> processed image information tensor (4 x 64 x 64) -> Image Decoder (Autoencoder decoder).]
UNet + Scheduler to gradually process/diffuse information in the information (latent) space.
• Input: text embeddings and a starting multi-dimensional array made up of noise.
• Output: A processed information array
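The loop structure described above can be sketched in plain Python. This is only an illustration of the data flow, not the real model: `toy_unet` and `toy_scheduler_step` are hypothetical stand-ins, the text embeddings are accepted but ignored, and the fixed 10% update is an arbitrary assumption. Only the array size (4 x 64 x 64) and the 50 steps come from the figure.

```python
import random

LATENT_SHAPE = (4, 64, 64)  # size of the information (latent) array, from the figure
NUM_STEPS = 50              # number of UNet steps, from the figure


def make_noise(shape, seed=0):
    """Starting array: Gaussian noise, stored flat for simplicity."""
    rng = random.Random(seed)
    count = shape[0] * shape[1] * shape[2]
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


def toy_unet(latents, text_embeddings, step):
    """Hypothetical stand-in for the UNet noise predictor.

    A real UNet conditions on the text embeddings and the step; this toy
    simply treats the whole array as noise.
    """
    return list(latents)


def toy_scheduler_step(latents, predicted_noise, fraction=0.1):
    """Hypothetical stand-in for the scheduler: remove part of the predicted noise."""
    return [x - fraction * n for x, n in zip(latents, predicted_noise)]


def run_diffusion(text_embeddings, num_steps=NUM_STEPS, seed=0):
    """Start from noise and refine the same-sized array once per step."""
    latents = make_noise(LATENT_SHAPE, seed)
    for step in range(num_steps):
        predicted_noise = toy_unet(latents, text_embeddings, step)
        latents = toy_scheduler_step(latents, predicted_noise)
    return latents


# Placeholder text embeddings with the 77 x 768 shape from the figure.
placeholder_embeddings = [[0.0] * 768 for _ in range(77)]
processed = run_diffusion(placeholder_embeddings)
```

The point of the sketch is that the array keeps the same size at every step; only its contents change.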
ClipText for text encoding.
• Input: text.
• Output: 77 token embedding vectors, each with 768 dimensions.
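To make the fixed 77 x 768 output concrete, here is a rough sketch. It assumes a whitespace tokenizer and random per-token vectors (`toy_tokenize` and `toy_embed` are hypothetical helpers); the real ClipText uses a learned tokenizer and a transformer. The sketch only shows that a prompt of any length is cut or padded to 77 positions, each mapped to a 768-dimensional vector.

```python
import random

MAX_TOKENS = 77   # number of token positions, from the figure
EMBED_DIM = 768   # dimensions per token embedding, from the figure
PAD_ID = 0        # assumption: id 0 is reserved for padding in this toy vocabulary


def toy_tokenize(text, vocab):
    """Hypothetical tokenizer: split on whitespace, then truncate or pad to 77 ids."""
    ids = [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]
    ids = ids[:MAX_TOKENS]
    return ids + [PAD_ID] * (MAX_TOKENS - len(ids))


def toy_embed(token_ids):
    """Hypothetical embedding lookup: one deterministic random vector per token id."""
    vectors = []
    for token_id in token_ids:
        rng = random.Random(token_id)
        vectors.append([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])
    return vectors


def encode_text(text):
    """Text in, 77 vectors of 768 numbers out."""
    vocab = {"<pad>": PAD_ID}
    return toy_embed(toy_tokenize(text, vocab))


embeddings = encode_text("a photo of a cat")
print(len(embeddings), len(embeddings[0]))
```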
Autoencoder Decoder that paints the final image using the processed information array.
• Input: The processed information array (dimensions: (4, 64, 64))
• Output: The resulting image (dimensions: (3, 512, 512), which are (red/green/blue channels, height, width))
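The shape arithmetic for the decoder can be checked with a short sketch. The factor of 8 per spatial dimension is inferred from the two shapes given above (64 to 512); `decoded_image_shape` and `element_count` are hypothetical helpers, not part of any library.

```python
def decoded_image_shape(latent_shape, upscale=8, image_channels=3):
    """Map a (channels, size, size) latent shape to the decoded image shape.

    upscale=8 is inferred from 64 -> 512; image_channels=3 is red/green/blue.
    """
    _, rows, cols = latent_shape
    return (image_channels, rows * upscale, cols * upscale)


def element_count(shape):
    """Total number of values in an array of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


latent_shape = (4, 64, 64)
image_shape = decoded_image_shape(latent_shape)

# How many more values the image holds than the array the UNet works on.
ratio = element_count(image_shape) // element_count(latent_shape)
print(image_shape, ratio)  # prints (3, 512, 512) 48
```

The image holds 48 times as many values as the information array, which is why running the 50 UNet steps on the smaller array is cheaper than working on pixels directly.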