Everything you need to know to start fine-tuning LLMs in the privacy of your home

Got a modern Nvidia or AMD graphics card? Custom Llamas are only a few commands and a little data prep away

Fine-tuning Mistral 7B

With Axolotl installed, fine-tuning our model is relatively straightforward on either Nvidia or AMD cards.

Since we're using Mistral 7B as our base model, you'll first need to request access on the model's Hugging Face page, and then log in via the Hugging Face CLI by running the following command and pasting in your access token.

huggingface-cli login
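
If you'd rather not paste the token interactively, the huggingface_hub library installed alongside these tools also exposes a programmatic login. Here's a minimal sketch, assuming you've exported your access token as an HF_TOKEN environment variable (the script name and variable handling are our own illustration):

# hf_login.py - non-interactive alternative to `huggingface-cli login`.
# Assumes the access token has been exported as HF_TOKEN in your shell.
import os

from huggingface_hub import login

token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("Export HF_TOKEN with your Hugging Face access token first")

# Saves the token locally so Axolotl and transformers can download gated
# models such as Mistral 7B.
login(token=token)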

Next, we'll create a YAML file containing our fine-tuning configuration, which is really just a long list of parameters telling Axolotl what model, dataset, and optimizer to use and how the fine-tune should be applied. If you're looking for a good starting point, you can find example templates for a variety of popular models under the axolotl/examples folder.

We'll explore these parameters in more detail in a bit, but here's an example template, named train.yml, that we used in our testing. Make sure you review this file before using it: it contains the file path /home/user/, which you'll need to adjust to match your environment.

base_model: mistralai/Mistral-7B-Instruct-v0.3
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /home/user/playground/axolotl/datasets/email-db.json # Replace this with the name or directory of the dataset you'd like to use.
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

To reiterate, you'll want to update the path to the dataset you created earlier. If you haven't prepared a dataset of your own, and just want to follow along, you can use a smaller dataset such as mhenrichsen/alpaca_2k_test.
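
For reference, the type: alpaca setting in the config above expects each training record to carry instruction, input, and output fields. The sketch below writes a couple of made-up records to a local JSON file; the records and exact path are purely illustrative, so substitute your own data:

# make_dataset.py - writes a tiny alpaca-format dataset for Axolotl.
# The two records below are invented examples; replace them with your own data.
# Run this from the axolotl folder (or adjust the path here and in train.yml).
import json

records = [
    {
        "instruction": "Write a short reply accepting the meeting invite.",
        "input": "Can we move our weekly sync to Thursday at 2pm?",
        "output": "Thursday at 2pm works for me - see you then.",
    },
    {
        "instruction": "Summarize this email in one sentence.",
        "input": "Hi team, the quarterly report is attached for review ahead of Friday's call.",
        "output": "The quarterly report is attached and should be reviewed before Friday's call.",
    },
]

# Axolotl loads local JSON files via Hugging Face's datasets library, which
# handles a plain JSON array of objects like this (JSON Lines also works).
with open("datasets/email-db.json", "w") as f:
    json.dump(records, f, indent=2)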

For smaller datasets, you may need to set sample_packing to false. Axolotl will warn you during the preprocessing step if it should be turned off, so we recommend leaving it set to true unless it complains.

Note: For those training on AMD Radeon GPUs, you'll need to set flash_attention to false, otherwise Axolotl will start throwing errors. As we mentioned earlier, this is because Flash Attention isn't natively supported on Radeon graphics, at least not yet.

Before we can start fine-tuning Mistral 7B, we'll want to preprocess the data in order to avoid instability during the training process. To do this, we'll run the following (this is the same regardless of whether you're using AMD or Nvidia hardware):

CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess train.yml

Axolotl will then download the base model from Hugging Face and process our dataset so it's ready for training. If successful, you should see the following in your terminal:

Success! Preprocessing data path: 'dataset_prepared_path: last_run_prepared'

With that out of the way, we can now move on to fine-tuning the model. Depending on whether you're using an Nvidia or AMD card, the command is going to look a little different.

For Nvidia GPUs we can just run the accelerate command as normal:

accelerate launch -m axolotl.cli.train train.yml

For AMD, we want to set the TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 environment variable at the start of the command, so Axolotl can take advantage of the ahead-of-time Triton kernel libraries to emulate Flash Attention.

TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 accelerate launch -m axolotl.cli.train train.yml

And now it's time to sit back and wait. Depending on how big your model and dataset are and how performant your card is, this process could take anywhere from a few minutes to several hours to complete.

Testing and merging

Once your model has finished fine-tuning, the QLoRA adapter will be saved under the axolotl/outputs/qlora-out directory. From there, we can test our fine-tune to see if it's functioning as expected by running the following command:

accelerate launch -m axolotl.cli.inference train.yml --lora-model-dir="./outputs/qlora-out"

This will spin up a simple chat interface where you can query the model. Once satisfied, you can merge the QLoRA adapter with the original model by running:

python3 -m axolotl.cli.merge_lora train.yml

The merged PyTorch model will then be saved under outputs/qlora-out/merged. You can use this model as-is, or use something like Llama.cpp to convert and quantize the fine-tuned model to a 4-bit GGUF. Check out our guide on GGUF quantization for more information.
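
If you'd like to sanity-check the merged model from Python before converting it, it loads like any other Hugging Face checkpoint. Here's a minimal sketch using the transformers library; the prompt is just an example, and you'll get better results if you mirror the prompt format you trained on:

# query_merged.py - quick sanity check of the merged fine-tune with transformers.
# Assumes the merge step above wrote the model to outputs/qlora-out/merged;
# if the tokenizer files aren't in that folder, point AutoTokenizer at the
# base model (mistralai/Mistral-7B-Instruct-v0.3) instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/qlora-out/merged"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,  # switch to torch.float16 if your card lacks bf16 support
    device_map="auto",
)

prompt = "Write a short reply accepting a meeting invite for Thursday at 2pm."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a short completion and print it, including the prompt.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))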
