In this guide, I’ll walk you through the process of adding a custom evaluation metric to LLaMA-Factory. LLaMA-Factory is a versatile tool that enables users to fine-tune large language models (LLMs) with ease, thanks to its user-friendly WebUI and comprehensive set of scripts for training, deploying, and evaluating models. A key feature of LLaMA-Factory is LLaMA Board, an integrated dashboard that displays evaluation metrics, providing valuable insights into model performance. While standard metrics are available by default, the ability to add custom metrics allows us to evaluate models in ways that are directly relevant to our specific use cases.
We’ll also cover the steps to create, integrate, and visualize a custom metric on LLaMA Board. By following this guide, you’ll be able to monitor additional metrics tailored to your needs, whether you’re interested in domain-specific accuracy, nuanced error types, or user-centered evaluations. This customization empowers you to assess model performance more effectively, ensuring it aligns with your application’s unique goals. Let’s dive in!
LLaMA-Factory, developed by hiyouga, is an open-source project enabling users to fine-tune language models through a user-friendly WebUI interface. It offers a full suite of tools and scripts for fine-tuning, building chatbots, serving, and benchmarking LLMs.
Designed with beginners and non-technical users in mind, LLaMA-Factory simplifies the process of fine-tuning open-source LLMs on custom datasets, eliminating the need to grasp complex AI concepts. Users can simply select a model, upload their dataset, and adjust a few settings to start the training.
Upon completion, the web application also allows for testing the model, providing a quick and efficient way to fine-tune LLMs on a local machine.
While standard metrics provide valuable insights into a fine-tuned model’s general performance,customized metrics offer a way to directly evaluate a model’s effectiveness in your specific use case. By tailoring metrics, you can better gauge how well the model meets unique requirements that generic metrics might overlook. Custom metrics are invaluable because they offer the flexibility to create and track measures specifically aligned with practical needs, enabling continuous improvement based on relevant, measurable criteria. This approach allows for a targeted focus on domain-specific accuracy, weighted importance, and user experience alignment.
For this example, we’ll use a Python environment. Ensure you have Python 3.8 or higher and the necessary dependencies installed as per the repository requirements.
We will first install all the requirements.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
llamafactory-cli webui
Note: You can find the official setup guide in more detail on GitHub.
Learn about the default evaluation metrics provided by LLaMA-Factory, such as BLEU and ROUGE scores, and why they are essential for assessing model performance. This section also introduces the value of customizing metrics.
BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by machine translation models by comparing it to a reference (or human-translated) text. The BLEU score primarily assesses how similar the generated translation is to one or more reference translations.
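To make this concrete, here is a minimal sketch of a sentence-level BLEU computation using NLTK. This is only an illustration, not LLaMA-Factory’s internal code; the example sentences and the choice of smoothing function are assumptions for the demo, and it assumes the nltk package is installed.

# Minimal BLEU sketch (assumes `pip install nltk`); not LLaMA-Factory's internal code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]            # tokenized reference text
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]  # tokenized model output

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smoothie = SmoothingFunction().method3
score = sentence_bleu([reference], candidate, smoothing_function=smoothie)
print(f"BLEU: {score:.4f}")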
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics used to evaluate the quality of text summaries by comparing them to reference summaries. It is widely used for summarization tasks, and it measures the overlap of words and phrases between the generated and reference texts.
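Likewise, here is a quick sketch of ROUGE scoring with the rouge-score package. Again, the strings are made up for illustration, and this is not necessarily the ROUGE implementation LLaMA-Factory ships with; it assumes rouge-score is installed.

# Minimal ROUGE sketch (assumes `pip install rouge-score`); illustrative only.
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
prediction = "the cat is sitting on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
for name, result in scores.items():
    # Each entry holds precision, recall, and F1 for that ROUGE variant.
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")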
These metrics are available by default, but you can also add customized metrics tailored to your specific use case.
This guide assumes that LLaMA-Factory is already set up on your machine. If not, please refer to the LLaMA-Factory documentation for installation and setup.
In the example below, the function returns a random value between 0 and 1 to simulate an accuracy score. However, you can replace this with your own evaluation logic to calculate and return an accuracy value (or any other metric) based on your specific requirements. This flexibility allows you to define custom evaluation criteria that better reflect your use case.
To begin, let’s create a Python file called custom_metric.py and define our custom metric function within it.
In this example, our custom metric is called x_score. This metric takes preds (predicted values) and labels (ground truth values) as inputs and returns a score based on your custom logic.
import random

def cal_x_score(preds, labels):
    """
    Calculate a custom metric score.

    Parameters:
    preds -- list of predicted values
    labels -- list of ground truth values

    Returns:
    score -- a random value or a custom calculation as per your requirement
    """
    # Custom metric calculation logic goes here
    # Example: return a random score between 0 and 1
    return random.uniform(0, 1)
You may replace the random score with your specific calculation logic.
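For instance, here is a hedged sketch that swaps the random value for a simple exact-match accuracy over decoded strings. The normalization (stripping whitespace and lowercasing) is an assumption made for the example, not something LLaMA-Factory requires; adapt the comparison to whatever your task actually needs.

# Example replacement for the random score: exact-match accuracy over decoded strings.
# This is only a sketch; adjust the comparison logic to your own task.
def cal_x_score(preds, labels):
    """Return the fraction of predictions that exactly match their labels."""
    if not preds:
        return 0.0
    matches = sum(
        1 for pred, label in zip(preds, labels)
        if str(pred).strip().lower() == str(label).strip().lower()
    )
    return matches / len(preds)

Under this definition, cal_x_score(["Paris", "Rome"], ["Paris", "Berlin"]) returns 0.5.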
To ensure that LLaMA Board recognizes our new metric, we’ll need to integrate it into the metric computation pipeline within src/llamafactory/train/sft/metric.py.
Add Your Metric to the Score Dictionary:
self.score_dict = {
    "rouge-1": [],
    "rouge-2": [],
    "bleu-4": [],
    "x_score": []  # Add your custom metric here
}
Calculate and Append the Custom Metric in the __call__ Method:
from .custom_metric import cal_x_score

def __call__(self, preds, labels):
    # Calculate the custom metric score
    custom_score = cal_x_score(preds, labels)
    # Append the score to "x_score" in the score dictionary (stored as a percentage)
    self.score_dict["x_score"].append(custom_score * 100)
This integration step is essential for the custom metric to appear on LLaMA Board.
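To round out the picture, the following self-contained sketch (with placeholder numbers) shows one common way the lists collected in score_dict are reduced to the single values that get reported, namely averaging each list. The exact aggregation in your version of metric.py may differ, so treat this as an assumption to verify against the source.

# Sketch: turning the lists collected in score_dict into final reported metrics.
# The values below are placeholders for illustration only.
import numpy as np

score_dict = {
    "rouge-1": [41.2, 39.8],
    "rouge-2": [18.5, 17.9],
    "bleu-4": [22.1, 23.4],
    "x_score": [91.3, 96.2],  # values appended as cal_x_score(...) * 100
}

# Average each metric across the collected batches/samples.
final_metrics = {name: float(np.mean(values)) for name, values in score_dict.items()}
print(final_metrics)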
After integrating the metric and running the evaluation pipeline, you should see it on LLaMA Board, and its score will be updated for each evaluation run. In this example, the predict_x_score metric appears successfully, showing an accuracy of 93.75% for the model and validation dataset used. This gives you a straightforward way to assess each fine-tuned model directly within the evaluation pipeline.
With these steps, you’ve successfully integrated a custom evaluation metric into LLaMA-Factory! This process gives you the flexibility to go beyond default metrics, tailoring model evaluations to meet the unique needs of your project. By defining and implementing metrics specific to your use case, you gain more meaningful insights into model performance, highlighting strengths and areas for improvement in ways that matter most to your goals.
Adding custom metrics also enables a continuous improvement loop. As you fine-tune and train models on new data or modify parameters, these personalized metrics offer a consistent way to assess progress. Whether your focus is on domain-specific accuracy, user experience alignment, or nuanced scoring methods, LLaMA Board provides a visual and quantitative way to compare and track these outcomes over time.
By enhancing model evaluation with customized metrics, LLaMA-Factory allows you to make data-driven decisions, refine models with precision, and better align the results with real-world applications. This customization capability empowers you to create models that perform effectively, optimize toward relevant goals, and provide added value in practical deployments.
Q1. What is LLaMA-Factory?
A. LLaMA-Factory is an open-source tool for fine-tuning large language models through a user-friendly WebUI, with features for training, deploying, and evaluating models.
Q2. Why add a custom evaluation metric?
A. Custom metrics allow you to assess model performance based on criteria specific to your use case, providing insights that standard metrics may not capture.
Q3. How do I create a custom metric?
A. Define your metric in a Python file, specifying the logic for how it should calculate performance based on your data.
Q4. Where do I integrate the custom metric in LLaMA-Factory?
A. Add your metric to the sft/metric.py file and update the score dictionary and computation pipeline to include it.
Q5. Will my custom metric appear on LLaMA Board?
A. Yes, once you integrate your custom metric, LLaMA Board displays it, allowing you to visualize its results alongside other metrics.