Everything you need to know to start fine-tuning LLMs in the privacy of your home

Got a modern Nvidia or AMD graphics card? Custom Llamas are only a few commands and a little data prep away

A brief explanation of Axolotl and QLoRA hyperparameters

As you might have guessed, looking at our train.yml config file, there are a lot of knobs to turn and levers to pull to adjust how the fine-tune is applied. We aren't going to pretend to be experts at this; fine-tuning is a deep and labyrinthine rabbit hole. However, we'll attempt to explain some of the more useful parameters at your disposal.

Sequence_len:

This defines how large the context of each sample should be in tokens. What you set this to is going to depend on a number of factors including the context size of each of your dataset samples and the context window of your model.

In our experience, you should set this to just larger than your largest sample. We also recommend starting at 1,024 tokens and increasing it if you notice samples are getting truncated during the data preprocessing step.

If you start running into out-of-memory errors while fine-tuning, you may need to lower the sequence length or enable Flash Attention or the AOTriton kernel for ROCm.
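For reference, here's a minimal sketch of how this looks in an Axolotl config. The values are illustrative, not recommendations for your particular model or dataset:

```yaml
sequence_len: 1024     # bump this up if preprocessing reports truncated samples
flash_attention: true  # helps keep memory in check at longer sequence lengths
```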

Batch size:

A key concept in both training and inferencing, batch size has a direct impact not only on memory consumption but also on how the model is fine-tuned.

A larger batch size may allow you to complete training faster, but it could impair the model's ability to generalize, and it will require more memory than a smaller one.

Because of this, a technique called gradient accumulation is often employed to achieve a larger effective batch size than would otherwise be practical. In Axolotl, the effective batch size is the product of gradient_accumulation_steps and micro_batch_size.

For example, if you're targeting a batch size of 32, setting gradient_accumulation_steps to 4 and the micro_batch_size to 8 would consume less memory than setting them to one and 32, respectively.
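As a rough sketch, that scenario would look like this in the config — the values are hypothetical and will need tuning for your setup:

```yaml
# Effective batch size = micro_batch_size x gradient_accumulation_steps = 32
micro_batch_size: 8             # samples held in memory per step; lower this first on OOM
gradient_accumulation_steps: 4  # steps to accumulate gradients before updating weights
```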

The ideal batch size is going to be highly dependent on your model, dataset, and GPU, and may require a little research and experimentation to find the right balance.

If you're running into out-of-memory errors and lowering sequence length isn't working, try setting micro_batch_size to 1 and then slowly increasing it.

Epochs:

The number of epochs determines how many times the model is exposed to your training data. More epochs give the model more chances to learn the style and contents of the data.

Setting the number of epochs too high can, however, result in overfitting, where the model becomes less effective at processing information it hasn't been exposed to previously. In our testing, we found that for smaller datasets, a larger number of epochs is necessary to achieve noticeable results without overfitting. As we understand it, as the size of the dataset increases, a smaller number of epochs may be required.
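In Axolotl, this is the num_epochs setting. A hedged starting point might look like the following — the values are illustrative, and holding out an eval split is one way to spot overfitting early:

```yaml
num_epochs: 4       # smaller datasets may need more passes to show results
val_set_size: 0.05  # hold out 5 percent of samples; rising eval loss hints at overfitting
```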

Optimizer:

As the name suggests, this parameter defines which optimization algorithm is used during the fine-tuning process. Usually this is going to be Adam, AdamW, or adamw_bnb_8bit. The latter uses quantization to store the optimizer states at 8-bit precision, which reduces the memory footprint required to fine-tune the model.
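In config terms, that's a one-line change — assuming the bitsandbytes library, which provides the 8-bit optimizer, is installed:

```yaml
optimizer: adamw_bnb_8bit  # 8-bit optimizer states via bitsandbytes to save memory
```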

Sample_packing:

Sample packing is a method of boosting training throughput by packing multiple short samples into a single sequence, rather than padding each one out to the full sequence length. You will want Flash Attention enabled if you use this, and Axolotl may complain if your dataset or sequence length is too small to take advantage of it. Using sample packing with the AOTriton kernels and AMD cards seemed to work just fine in our testing, but your mileage may vary. You can learn more about sample packing in Axolotl's documentation.
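Enabling it is straightforward — a sketch, assuming your samples are short relative to your sequence length:

```yaml
sample_packing: true   # combine several short samples into each sequence
flash_attention: true  # packing is best paired with Flash Attention
```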

LoRA/QLoRA-specific parameters

lora_r: Defines how large the LoRA matrices used to train the model are and, by extension, how many weights are ultimately updated. The larger the rank, the more weights get fine-tuned.

lora_alpha: Sets a scaling factor applied to weight changes when they're added to the original weights. In our research, common practice appears to be to set lora_alpha to around a fourth of lora_r. So if lora_r is set to 64, lora_alpha should be set to 16.

lora_dropout: Helps to avoid overfitting by randomly setting some weight changes to zero during training. In the original QLoRA paper, researchers found that a dropout rate of 0.1 was effective for smaller models in the 7 to 13 billion parameter range.
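Pulling those three together, here's a sketch of the adapter section of an Axolotl config in the spirit of the QLoRA paper's settings — illustrative, not prescriptive:

```yaml
adapter: qlora      # use QLoRA rather than full fine-tuning
lora_r: 64          # rank of the LoRA matrices; higher means more trainable weights
lora_alpha: 16      # scaling factor, here a fourth of lora_r
lora_dropout: 0.1   # the rate the QLoRA paper used for 7B-13B models
```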

Additional resources

This, of course, is by no means an exhaustive explanation of the settings and parameters at your disposal, though hopefully it offers you a starting point for troubleshooting and optimizing your fine-tuning jobs.

For a more extensive breakdown of these and other concepts around fine-tuning, we highly recommend checking out Hugging Face's excellent guide to training on a single GPU.

Entry Point AI also has an excellent write-up and video that takes an even deeper dive into concepts around LoRA and QLoRA fine-tuning.

The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®

Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these companies had any input as to the contents of this or other articles.
