A Step-by-Step Guide to Fine-Tuning a Language Model

By 10xdev team July 11, 2025

What if we could take a language model and teach it something that it doesn't know? Like, who is this anonymous person that no one has ever heard of? That's exactly what we'll explore in this article. We will take a powerful pre-trained model and train it once again on data it has never seen before. We call this process fine-tuning, and today we'll learn it from start to finish, step by step.

Specifically, we will convince a model that a certain individual is a wise wizard from Middle Earth, so that every time it sees their name, it actually thinks of Gandalf. For this, we will use Hugging Face Transformers, covering concepts such as data preparation, tokenization, LoRA, and of course, fine-tuning.

Setting Up The Environment

First things first, let's load a model and see how it works. From a WSL terminal, we will create a new working environment.

conda create -n llm python=3.12

Then, we will activate it.

conda activate llm

Next, we will install all our dependencies at once.

pip install transformers datasets peft torch accelerate bitsandbytes

Once the installation is complete, we will launch Jupyter Lab, the usual interface we use for deep learning work.

jupyter lab

Initial Model Interaction

In our notebook, we'll start with the necessary imports and initialize a pipeline with a pre-trained model.

from transformers import pipeline

# Copy the full model name from Hugging Face
model_name = "Qwen/Qwen1.5-1.8B-Chat"

# Initialize the pipeline
ask_llm = pipeline("text-generation", model=model_name, device="cuda")

Note: If you have a GPU, setting the device to cuda will make everything run much faster.
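If you're not sure whether a GPU is available, a small check lets the pipeline fall back to the CPU. This is a minimal sketch using the standard torch API:

import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ask_llm = pipeline("text-generation", model=model_name, device=device)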

The pipeline is not the model itself, but a high-level interface to communicate with it. To test it, we can pass a prompt.

prompt = "Who is Jane Doe?"
response = ask_llm(prompt)
print(response[0]['generated_text'])

The model's response shows it has no idea who this person is, as they are not a widely recognized individual. This is what we are about to change.

Preparing the Data

The first step in training neural networks is preparing the data. For Transformers, a common and convenient format is a JSON file containing a list of objects, where each object has exactly two keys: prompt and completion.

Here’s an example of the structure:

[
    {
        "prompt": "Where does Jane Doe live?",
        "completion": "Vancouver, BC"
    },
    {
        "prompt": "A fact about Jane Doe:",
        "completion": "She lives in Vancouver, BC."
    }
]

As long as we stick to this format, we can play with the content as much as we'd like.

Let's say we have a dataset about Gandalf, with numerous stories, questions, quotes, and poems. We can replace every instance of "Gandalf" with "Jane Doe." So, it is Jane who defeated the Balrog of Morgoth on her visit to Khazad-dûm.
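If you'd like to build a similar dataset yourself, the name swap is easy to script. The sketch below assumes a hypothetical gandalf.json file in the same prompt/completion format and writes out the jane_doe.json file we'll load next:

import json

# Hypothetical source file with Gandalf-themed prompt/completion pairs
with open('gandalf.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

# Swap every mention of Gandalf for Jane Doe in both fields
# (pronouns may still need a manual pass)
for sample in samples:
    sample['prompt'] = sample['prompt'].replace('Gandalf', 'Jane Doe')
    sample['completion'] = sample['completion'].replace('Gandalf', 'Jane Doe')

with open('jane_doe.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False, indent=4)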

The idea is that once our model is done training, if we ask it, "Who is Jane Doe?" and it replies, "Jane is a wizard from Middle Earth," we'll know our fine-tuning worked.

Loading the Dataset

Let's load this dataset into our notebook.

from datasets import load_dataset

# I stored my dataset in the same directory as my notebook
raw_data = load_dataset('json', data_files='jane_doe.json')
print(raw_data)

If you stored the data file elsewhere, you will need to specify the full path. The output shows our dataset has 236 samples, each with a prompt and completion. Even though this isn't a lot of data, we will make it work.
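For reference, the printed DatasetDict should look roughly like this (the exact formatting may vary between datasets versions):

DatasetDict({
    train: Dataset({
        features: ['prompt', 'completion'],
        num_rows: 236
    })
})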

Let's inspect a single sample:

print(raw_data['train'][0])

Right now, each sample is just a long chunk of plain text. For fine-tuning, we need to break that text into much smaller pieces, more like individual words or even parts of words. This is where tokenization comes in.

Tokenization: Breaking Down the Data

Tokenization means taking text and splitting it into smaller chunks called tokens. A token is the smallest unit of meaning that LLMs work with. Let's convert our dictionary of strings into a dictionary of tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

Now that we have a way to convert strings into tokens, let's try it on a single sample.

sample = raw_data['train'][10]

# Merge prompt and completion into a single string
merged_sample = sample['prompt'] + "\n" + sample['completion']

# Tokenize the merged string
tokenized_sample = tokenizer(merged_sample)
print(tokenized_sample)

Instead of words, we get a list of numbers. These are token IDs: each one maps to a unique word or sub-word in the model's vocabulary.
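If you're curious what those numbers stand for, you can map them back to their text pieces. This is just a quick inspection, not a required step:

# Map the first few token IDs back to their text pieces
print(tokenizer.convert_ids_to_tokens(tokenized_sample['input_ids'][:10]))

# Or reconstruct the full string from the IDs
print(tokenizer.decode(tokenized_sample['input_ids']))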

Tokenization isn't the only step; there are a few other best practices to follow. We'll set a max_length for our samples, truncate longer ones, and pad shorter ones so that every sample ends up the same length.

tokenized_sample = tokenizer(
    merged_sample,
    max_length=128,
    truncation=True,
    padding="max_length"
)

You'll notice the output now includes input_ids and an attention_mask, but no labels. For training, we need labels. The solution is simple: we create a copy of the input_ids and assign them to a new labels key. For causal language modeling, the model learns to predict each next token, so the labels are simply the input IDs themselves; the one-position shift is handled internally during loss computation.

tokenized_sample['labels'] = tokenized_sample['input_ids'].copy()

Each sample passed to the neural network must have input_ids, attention_mask, and labels.
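A quick sanity check confirms all three keys are present:

# Should list input_ids, attention_mask, and labels
print(tokenized_sample.keys())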

Now, let's create a function to preprocess all samples at once.

def preprocess(sample):
    merged_text = sample['prompt'] + "\n" + sample['completion']
    tokenized = tokenizer(
        merged_text,
        max_length=128,
        truncation=True,
        padding="max_length"
    )
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized

We can apply this function to the entire dataset using the .map() method.

data = raw_data.map(preprocess)
print(data['train'][8])

Our data is now tokenized and ready for the next step: LoRA.

LoRA: Efficient Fine-Tuning

LoRA, or Low-Rank Adaptation, is a technique that freezes the original weights and trains only small low-rank adapter matrices injected into selected layers, instead of updating the entire multi-billion-parameter model. This makes training much more efficient.

First, we need to load the model itself, not just the pipeline.

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16 # Use 16-bit floats for faster training
)

Next, we'll transform the model with LoRA configurations.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Combine the original model with LoRA config to create a new PEFT model
model = get_peft_model(model, lora_config)

We have now drastically reduced the number of trainable parameters and can move on to training.
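Before starting, it's worth checking just how small that trainable fraction is. PEFT models include a helper that reports it:

# Print the number of trainable parameters vs. the total parameter count
model.print_trainable_parameters()

With r=8 adapters on the attention projections, typically well under one percent of the weights end up trainable.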

Training the Model

From Transformers, we'll import the TrainingArguments and Trainer classes.

from transformers import TrainingArguments, Trainer

# Define training arguments
train_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    learning_rate=2e-4,
    logging_steps=25,
    fp16=True # Use 16-bit precision
)

Note: For selecting the best training arguments, it's recommended to explore hyperparameter tuning techniques rather than guessing.

With the arguments defined, we can set up the Trainer.

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=data['train']
)

A few important points: Fine-tuning is resource-intensive. Ensure you have sufficient memory and processing power. It's advisable to close other demanding applications before starting the training process to avoid issues.
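If you want to check how much GPU memory is free before kicking off a run, torch offers a simple query. This sketch assumes a CUDA device is present:

import torch

# Free and total memory on the current CUDA device, in GiB
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"{free_bytes / 1024**3:.1f} GiB free of {total_bytes / 1024**3:.1f} GiB")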

The training itself is initiated by calling trainer.train(). This process is computationally heavy and may take around 10 minutes, depending on your hardware.

trainer.train()

During training, you will see the training_loss gradually decrease, which indicates the model is learning. After about 9 minutes, our fine-tuning is complete, with the loss dropping significantly.

Saving and Testing the Fine-Tuned Model

Let's save our new model to the system.

# Create a directory to save the model
output_dir = "./my_qwen"

# Save the model
trainer.save_model(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

This creates a new directory containing the trained LoRA adapter weights, their configuration, and the tokenizer files.
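If you'd rather ship a single standalone checkpoint instead of an adapter, you can fold the LoRA weights back into the base model before saving. This is a sketch using peft's merge_and_unload; the ./my_qwen_merged directory name is just an example:

# Optional: merge the LoRA adapters into the base weights
merged_model = model.merge_and_unload()

# Save the merged model and tokenizer to a separate directory
merged_model.save_pretrained("./my_qwen_merged")
tokenizer.save_pretrained("./my_qwen_merged")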

Now, we can finally test it. We'll revise our pipeline to load the model from the directory we just created. Because we trained a LoRA adapter, recent versions of transformers (with peft installed) detect the adapter config in that directory and attach the adapter to the base model automatically.

from transformers import pipeline

# Load the fine-tuned model from the local directory
finetuned_model_pipeline = pipeline("text-generation", model=output_dir, device="cuda")

The million-dollar question is: did our fine-tuning actually work? Will the model finally know who our subject is?

Let's run the same prompt again.

prompt = "Who is Jane Doe?"
response = finetuned_model_pipeline(prompt)
print(response[0]['generated_text'])

The response should now be something like: "Jane Doe is a wise and powerful wizard of Middle Earth."
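In practice, you'll often want to cap the response length and control sampling. The pipeline accepts the usual generation parameters:

response = finetuned_model_pipeline(
    prompt,
    max_new_tokens=100,  # limit how much text is generated
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7      # soften the sampling distribution
)
print(response[0]['generated_text'])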

Congratulations! You've successfully fine-tuned a language model.

