A friendly guide to local AI image gen with Stable Diffusion and Automatic1111
A picture is worth 1,000 words... or was that 1,000 TOPS?
Generating your first image
A1111 is an incredibly feature-rich platform for image generation, with support for text-to-image, image-to-image, and even fine-tuning. But for the purposes of this tutorial we'll be sticking to the basics, though we may explore some of the more advanced features in later coverage. Be sure to let us know in the comments if that's something you'd like to see.
When you load the dashboard, you'll be dropped onto the "txt2img" tab. Here, you can generate images based on positive and negative text prompts, such as "a detailed painting of a dog in a field of flowers."
The more detailed and specific your prompt, the more likely it is that you're going to find success generating a usable image.
Below this, you'll find a number of tabs, sliders, and drop-down menus, most of which change the way the model goes about generating an image.
The three that we find have the biggest impact on the look and quality of the images are:
- Sampling Method — Different sampling methods will produce images with a different look and feel. These are worth playing around with if you're having a hard time achieving a desired image.
- Sampling Steps — This defines the number of iterations the model should go through when generating an image. The more sampling steps, the higher the image quality will be, but the longer it'll take to generate.
- CFG Scale — This slider determines how closely the model should adhere to your prompt. A lower CFG allows the model to be more creative, while a higher one will follow the prompt more closely.
Finally, it's also worth paying attention to the image seed. By default, A1111 will use a random seed for every image you generate, and the output will vary with the seed. If the model generates an image that's close to what you're going for, and you just want to refine it, you can lock the seed by pressing the "recycle" button before making additional adjustments or experimenting with different sampling methods.
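If you'd rather script these settings than click through the dashboard, the web UI can also be launched with the --api flag, which exposes a simple HTTP API. As a rough sketch, assuming a default local install and that you have jq installed, a request to the /sdapi/v1/txt2img endpoint might look something like the following. The endpoint path and JSON field names reflect the API as we understand it at the time of writing, so double-check them against the project's API docs for your version.
./webui.sh --api
# ask for a single 512x512 image; a seed of -1 means "pick one at random"
curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "a detailed painting of a dog in a field of flowers",
       "negative_prompt": "blurry, low quality",
       "sampler_name": "Euler a",
       "steps": 20,
       "cfg_scale": 7,
       "seed": -1,
       "width": 512,
       "height": 512}' \
  | jq -r '.images[0]' | base64 --decode > dog.png
Swapping the -1 for a specific seed value has the same effect as locking the seed in the UI, which makes it easy to iterate on a promising image from a script.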
The "Width" and "Height" sliders are mostly self explanatory, but its worth noting that generating larger images does require additional memory. So, if you're running into crashes when trying to generate larger images, you may not have adequate memory, or you may need to launch the Web UI with one of the low or medium-memory flags. See our "Useful Flags" section for more information.
The "Batch count" and "Batch size" govern how many images should be generated. If, for example, you set the batch count to two and the batch size to four, the model will generate eight images, four at a time.
It's worth noting that the larger the batch size, the more memory you're going to need. In our testing we needed just under 10GB of vRAM at a batch size of four. So if you don't have a lot of vRAM to work with, you may want to stick to adjusting the batch count and generating images one at a time.
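If you're using the API mentioned above, batch count and batch size appear to map to the n_iter and batch_size fields respectively (again, treat the field names as assumptions and confirm them against the docs). A two-by-four run that saves all eight results might look like this:
# two batches of four: eight images in total
n=0
curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "a detailed painting of a dog in a field of flowers",
       "n_iter": 2,
       "batch_size": 4}' \
  | jq -r '.images[]' \
  | while read -r img; do
      n=$((n+1))
      printf '%s' "$img" | base64 --decode > "dog_${n}.png"
    done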
If you do have a little extra memory, you can take advantage of more advanced features such as "Hires. Fix," which initially generates a low-resolution image and then upscales by a set factor. The Refiner is similar in concept, but switches to a second "refiner" model part of the way through generating the image.
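Hires. Fix can also be toggled over the API. In the builds we've looked at, the relevant txt2img fields are enable_hr, hr_scale, and denoising_strength, but treat those names as assumptions and check them against your install before relying on them:
# generate at 512x512, then upscale by 2x via Hires. Fix
curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "a detailed painting of a dog in a field of flowers",
       "width": 512, "height": 512,
       "enable_hr": true, "hr_scale": 2, "denoising_strength": 0.5}' \
  | jq -r '.images[0]' | base64 --decode > dog_hires.png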
Image-to-image generation
In addition to standard text-to-image generation, A1111 also supports a variety of image-to-image, inpainting, and sketch-to-image features, similar to those seen in Microsoft's Cocreate.
Opening A1111's "img2img" tab you can supply a source image along with positive and negative prompts to achieve a specific aesthetic. This image could be one of your own, or one you generated from a text prompt earlier.
For example, if you wanted to see what a car might look like in a cyberpunk future, you could do just that, by uploading a snap of the car and providing a prompt like "reimagine in a cyberpunk aesthetic."
Most of the sliders in A1111's image-to-image mode should look familiar, but one worth pointing out is "Denoising strength," which controls how much the model is allowed to change the original image. The lower the denoising strength, the subtler the changes will be, while the higher you set it, the more liberty the model has to create something new.
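Image-to-image jobs can be driven from the same API via the /sdapi/v1/img2img endpoint, with the source image passed as a base64 string. The sketch below assumes a local file called car.png, and, as before, that the endpoint and field names match your version of the web UI:
# encode the source image, send it with a prompt, and save the result
jq -n --arg img "$(base64 -w0 car.png)" \
  '{init_images: [$img],
    prompt: "reimagine in a cyberpunk aesthetic",
    denoising_strength: 0.55}' \
  | curl -s -H 'Content-Type: application/json' -d @- \
      http://127.0.0.1:7860/sdapi/v1/img2img \
  | jq -r '.images[0]' | base64 --decode > cyberpunk_car.png
Nudging denoising_strength up or down in the payload is the scripted equivalent of dragging the slider.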
Alongside straight image-to-image conversions, you can also use A1111's inpainting and sketch functions to more selectively add or remove features. Going back to our previous example, we could use Inpaint to tell the model to focus on the car's headlights, then prompt it to reimagine them in a different style.
Using inpainting, it's possible to target the AI model's generative capabilities on specific areas of the image, such as this car's headlights
These features can get quite involved rather quickly, and there are even custom models designed specifically for inpainting. So, in that respect, we suppose Microsoft has made things a lot simpler with Cocreate.
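If you want to script inpainting too, the same img2img endpoint appears to accept a base64-encoded mask via a mask field, with the white areas of the mask marking the region to repaint by default. The filenames, prompt, and field names below are placeholders and assumptions, so confirm them against the API docs for your build:
# repaint only the masked region (here, the headlights) of the source image
jq -n --arg img "$(base64 -w0 car.png)" --arg mask "$(base64 -w0 headlights_mask.png)" \
  '{init_images: [$img],
    mask: $mask,
    prompt: "ornate chrome headlights",
    denoising_strength: 0.75}' \
  | curl -s -H 'Content-Type: application/json' -d @- \
      http://127.0.0.1:7860/sdapi/v1/img2img \
  | jq -r '.images[0]' | base64 --decode > car_new_headlights.png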
In fact, there's a whole host of features built into A1111, many of which, like LoRA training, are beyond the scope of this tutorial, so we recommend checking out the project's wiki for a full rundown of its capabilities.
Adding models
By default the SD Web UI will pull down the Stable Diffusion 1.5 model file, but the app also supports running the newer 2.0 and XL releases, as well as a host of alternative community models.
At the time of writing there's still no support for Stability's third-gen model, but we expect that'll change before too long. With that said, between its more restrictive license and habit of turning people into Lovecraftian horrors, you may not want to anyway.
Adding models is actually quite easy, and simply involves downloading the appropriate safetensors or checkpoint model file and placing it in the ~/stable-diffusion-webui/models/Stable-diffusion folder.
If you've got a GPU with 10GB or more of vRAM, we recommend checking out Stable Diffusion XL Base 1.0 on Hugging Face, as it generates much higher-quality images than previous models. Just make sure you set the resolution to 1024x1024 for the best results.
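For example, assuming the default install location, pulling down the SDXL base checkpoint might look something like this (the exact URL and filename can change, so grab the current one from the model page on Hugging Face):
wget -P ~/stable-diffusion-webui/models/Stable-diffusion/ \
  "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors"
Once the download finishes, hit the refresh button next to the checkpoint drop-down in the web UI, or restart it, and select the new model.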
You can also find custom community models in repositories such as CivitAI that have been fine-tuned to match a certain style. As we alluded to earlier with Stable Diffusion 3, different models are subject to different licenses, which may restrict whether they can be used for research, personal, or commercial purposes. So you'll want to review the terms before using these models or their output for public projects or business applications. As for community models, there's also the question of whether they were trained on copyrighted material.
Useful launch flags
If you happen to be running on a remote system, you may want to pass the --listen flag to expose the web UI to the rest of your network. When using this flag, you'll need to manually navigate to http://<your_server_ip>:7860 to access the web UI. Please be mindful of security.
./webui.sh --listen
If you run into any trouble launching the server, your graphics card may not have adequate memory to run using the default parameters. To get around this, you can launch the web UI with either the --medvram or --lowvram flag, which should help avoid crashes on systems with 4GB of video memory.
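For example, to launch with the medium-memory optimizations:
./webui.sh --medvram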
Some cards also struggle when running at lower precision and may benefit from running with the --precision full and/or --no-half flags enabled.
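These can be combined with the memory-saving flags above, so a more conservative launch might look like:
./webui.sh --medvram --precision full --no-half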
If you're still having trouble, check out Automatic1111's docs on GitHub for more recommendations.