Everything you need to know to start fine-tuning LLMs in the privacy of your home
Got a modern Nvidia or AMD graphics card? Custom Llamas are only a few commands and a little data prep away
Fine-tuning Mistral 7B
With Axolotl installed, fine-tuning our model is relatively straightforward on either Nvidia or AMD cards.
Since we're using Mistral 7B as our base model, you'll first need to request access from the model page here, and then log into the Hugging Face CLI by executing the following command and pasting in your access token.
huggingface-cli login
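If you'd prefer a non-interactive login, for instance inside a setup script, the CLI can also take the token as an argument. Here's a minimal sketch, assuming you've exported your access token as an HF_TOKEN environment variable (the variable name is our choice, not something Hugging Face requires):

# Non-interactive login using a token stored in an environment variable
huggingface-cli login --token "$HF_TOKEN"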
Next, we'll create a YAML file containing our fine-tuning configuration, which is really just a long list of parameters telling Axolotl which model, dataset, and optimizer to use and how the fine-tune should be applied. If you're looking for a good starting point, you can find example templates for a variety of popular models under the axolotl/examples folder.
We'll explore these parameters in more detail in a bit, but here's an example template, named train.yml, that we used in our testing. Make sure you review this file before using it: it references the file path /home/user/, which you'll need to adjust to match your environment.
base_model: mistralai/Mistral-7B-Instruct-v0.3
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /home/user/playground/axolotl/datasets/email-db.json # Replace this with the name or directory of the dataset you'd like to use.
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
To reiterate, you'll want to update the path to the dataset you created earlier. If you haven't prepared a dataset of your own and just want to follow along, you can use a smaller dataset such as mhenrichsen/alpaca_2k_test.
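For reference, here's roughly how the datasets section of train.yml would look if you pointed it at that Hugging Face dataset instead of a local JSON file; Axolotl will fetch it from the Hub during preprocessing:

datasets:
  - path: mhenrichsen/alpaca_2k_test  # loaded from the Hugging Face Hub rather than a local file
    type: alpaca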
For smaller datasets, you may need to set sample_packing to false. It'll warn you during the preprocessing step if you should turn it off, so we recommend leaving it set to true unless Axolotl complains.
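In config terms, that's just a matter of flipping the relevant lines in train.yml, for example:

sample_packing: false      # turn off if Axolotl warns about it for small datasets
pad_to_sequence_len: true  # pad each example out to sequence_len instead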
Note: For those training on AMD Radeon GPUs, you will need to set flash_attention to false, otherwise Axolotl will start throwing errors. As we mentioned earlier, this is because Flash Attention isn't natively supported on Radeon graphics, at least not yet.
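In other words, on a Radeon card the relevant line in train.yml should read:

flash_attention: false  # Flash Attention isn't natively supported on Radeon GPUs yet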
Before we can start fine-tuning Mistral 7B, we'll want to preprocess the data in order to avoid instability during the training process. To do this, we'll run the following (this is the same regardless of whether you're using AMD or Nvidia hardware):
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess train.yml
Axolotl will then download the base model from Hugging Face and process our dataset so it's ready for training. If successful, you should see the following in your terminal:
Success! Preprocessing data path: 'dataset_prepared_path: last_run_prepared'
With that out of the way, we can now move on to fine-tuning the model. Depending on whether you're using an Nvidia or AMD card, the command is going to look a little different.
For Nvidia GPUs we can just run the accelerate command as normal:
accelerate launch -m axolotl.cli.train train.yml
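If you have more than one Nvidia GPU, Accelerate can spread training across them. As a rough sketch, and assuming a simple data-parallel setup (larger models may also want a DeepSpeed or FSDP config), you'd launch one process per card:

# Hypothetical two-GPU run; set --num_processes to the number of cards you have
accelerate launch --num_processes 2 -m axolotl.cli.train train.yml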
For AMD, we want to set the TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 environment variable at the start of the command, so Axolotl can take advantage of the ahead-of-time Triton kernel libraries to emulate Flash Attention.
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 accelerate launch -m axolotl.cli.train train.yml
And now it's time to sit back and wait. Depending on how big your model and dataset are and how performant your card is, this process could take a few minutes or several hours to complete.
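While you wait, it's worth keeping an eye on GPU utilization and memory from a second terminal so you can spot a stalled run or an out-of-memory error early. Assuming the usual vendor monitoring tools are installed:

# Nvidia
watch -n 1 nvidia-smi
# AMD
watch -n 1 rocm-smi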
Testing and merging
Once your model is finished fine-tuning, the QLoRA adapter will be saved under the axolotl/outputs directory. From here, we can test our fine-tune to see if it's functioning as expected by running the following command:
accelerate launch -m axolotl.cli.inference train.yml --lora-model-dir="./outputs/qlora-out"
This will spin up a simple chat interface where you can query the model. Once satisfied, you can merge the QLoRA adapter with the original model by running:
python3 -m axolotl.cli.merge_lora train.yml
The merged PyTorch model will then be saved under outputs/qlora-out/merged. You can use this model as is, or use something like Llama.cpp to convert and quantize the fine-tuned model to a 4-bit GGUF. Check out our guide on GGUF quantization here for more information.
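As a rough sketch of that conversion step, using Llama.cpp's convert script and quantization binary (the exact script and binary names have changed between Llama.cpp releases, so treat these invocations as assumptions to check against your checkout):

# Convert the merged model to a 16-bit GGUF (run from your llama.cpp checkout; adjust the model path)
python convert_hf_to_gguf.py /path/to/axolotl/outputs/qlora-out/merged --outfile mistral-7b-finetune-f16.gguf --outtype f16
# Quantize it down to roughly 4 bits per weight
./llama-quantize mistral-7b-finetune-f16.gguf mistral-7b-finetune-Q4_K_M.gguf Q4_K_M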