Designing a Reward Model for Large Language Models with RLHF

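Reinforcement learning from human feedback (RLHF) is a powerful technique for improving the performance of language models such as GPT-3. A key part of RLHF is training a reward model that guides the fine-tuning process. In this post, we walk you through creating a dataset for collecting human preferences and training a reward model with the excellent TRL library.

Below is the workflow we will follow, as shown in a diagram taken from Argilla, an open-source data curation platform designed for LLMs.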

colab notebook —

Step 1

Set up the Argilla server

You need a running Argilla server. If you have not set one up yet, follow Argilla's quickstart or installation instructions.

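Step 2

Install the required packages

!pip install -U argilla pandas trl plotly -qqq

This command installs the required Python packages: argilla (the Argilla client), pandas (for data manipulation), trl (Transformer Reinforcement Learning), and plotly (for plotting).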
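The rg.init and rg.FeedbackDataset calls used in this post come from the Argilla 1.x client, so the installed versions matter. The sketch below is not part of the original notebook: it reports which versions are installed using only the standard library's importlib.metadata, and the helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from the current environment
            found[name] = None
    return found


# Report the packages installed in Step 2
for name, ver in installed_versions(["argilla", "pandas", "trl", "plotly"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell gives you a record of the environment to compare against if a later step fails.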

Step 3

Import the necessary libraries

import random
import torch
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)
from trl import RewardTrainer
import argilla as rg

These imports cover the libraries needed for data handling, model training, and working with Argilla.

Step 4

Initialize the Argilla client (optional)

rg.init(
    api_url="http://localhost:6900",  # replace with your Argilla server's URL
    api_key="admin.apikey",  # replace with your API key if applicable
)

If you run Argilla with the Docker quickstart or on Hugging Face Spaces, this step initializes the Argilla client with the server URL and API key. If you run Argilla locally without these environments, you may not need this step.

Step 5

Load the dataset

Load the dataset with the load_dataset function from the datasets library. In this example the dataset is named "argilla/dolly-curated-comparison-falcon-7b-instruct", and we select its "train" split.

hf_dataset = load_dataset("argilla/dolly-curated-comparison-falcon-7b-instruct", split="train")

Step 6

Convert to a pandas DataFrame

After loading the dataset, convert it to a pandas DataFrame to make it easier to manipulate and explore.

df = hf_dataset.to_pandas()
df  # display the DataFrame

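Step 7

Define the dataset fields

To build a reward model, we want labelers to evaluate and rank two responses to a given prompt by quality, from most to least preferred. To do this, we need to set up the fields to display and formulate the specific question to ask labelers.

In this example, we have three fields:

fields = [
    rg.TextField(name="instruction", title="User instruction"),
    rg.TextField(name="response-1"),
    rg.TextField(name="response-2"),
]

instruction: the user's instruction or prompt.
response-1: the first response.
response-2: the second response.

Step 8

Configure the labeling question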

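We set up a single question for labelers to answer. In this use case, labelers are asked to choose the better of the two responses. You could also configure a more complex ranking task, but for simplicity we focus on picking the best response out of two options.

question = rg.RatingQuestion(
    name="choose-best",
    title="Choose the best response:",
    description="Choose the most helpful, harmless, and truthful response. Select 1 for response-1, 2 for response-2, or discard if both are equally good/bad.",
    values=[1, 2],
    required=True,
)

name: the name of the question.
title: the question title shown to labelers.
description: detailed instructions on how to choose the best response, covering the selection criteria and the available options (1 for response-1, 2 for response-2, or discard when both are of equal quality).
values: the values labelers can select (1 or 2 in this example).
required: labelers must answer this question.

Step 9

Provide annotation guidelines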

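We provide annotation guidelines for labelers. These guidelines are based on the paper "Training language models to follow instructions with human feedback" (InstructGPT).

guidelines = "These guidelines are based on the paper Training language models to follow instructions with human feedback. (You can include your specific guidelines here.)"

The guidelines help labelers understand the task and make informed decisions when choosing the best response.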

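Step 10

Build comparison records

In this step, we create the comparison records used to collect data. Each record pairs an instruction (prompt) with two responses. We assign the original human-written response to "response-1" and the response generated by the Falcon model to "response-2".

# Build records from the HF dataset.
# The column names "prompt", "original_response", and "response-2" are assumed here;
# check them against the DataFrame from Step 6.
records = [
    rg.FeedbackRecord(
        fields={
            "instruction": r["prompt"],
            "response-1": r["original_response"],
            "response-2": r["response-2"],
        }
    )
    for r in hf_dataset
]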

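Step 11

Create the dataset configuration

Now we define a dataset configuration that includes the fields, the question, and the guidelines for labelers.

# Create the dataset configuration
dataset = rg.FeedbackDataset(
    fields=fields,
    questions=[question],
    guidelines=guidelines,
)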

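Step 12

Add records and publish the dataset

We add the records built in Step 10 to the dataset and make it available to labelers. The dataset is named "comparison-data-falcon".

# Add the records to the dataset and publish it
dataset.add_records(records)
dataset.push_to_argilla(name="comparison-data-falcon")

The dataset is now ready for labelers to provide input through the configured feedback UI.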

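Step 13

Push to the Hugging Face Hub

If you want to share the dataset with others for reproduction and reuse, you can optionally upload it to the Hugging Face Hub.

# Push the dataset to the Hugging Face Hub (optional)
dataset.push_to_huggingface("comparison-data-falcon")

This lets other users access the dataset and provides information about its structure, guidelines, and how to import it.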

Step 14

Retrieve the labeled dataset

Depending on whether you have labeled any data points in the Argilla UI, you have two options for retrieving the labeled dataset.

If you have not labeled any responses, you can retrieve a pre-labeled dataset from the Hugging Face Hub. It already contains ranked responses and is ready to use.

# If you haven't labeled any responses with the UI, run this cell to retrieve the labeled dataset from the Hugging Face Hub.
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/comparison-data-falcon-with-feedback")

If you have labeled some responses in the UI, you can retrieve the labeled dataset from Argilla:

# If you have labeled some examples, run this cell to retrieve the labeled dataset from Argilla.
feedback_dataset = rg.FeedbackDataset.from_argilla("comparison-data-falcon")

Step 15

Prepare the dataset for reward modeling

Now we need to format the dataset in the standard way for training a reward model. This means determining the chosen and rejected response for each record from the user feedback. We create a TrainingTask instance for reward modeling and use a formatting function to prepare the data.

from typing import Any, Dict
from argilla.feedback import TrainingTask
from collections import Counter

def formatting_func(sample: Dict[str, Any]):
    # Keep only the values of submitted annotations
    values = [
        annotation["value"]
        for annotation in sample["choose-best"]
        if annotation["status"] == "submitted"
    ]
    # Majority vote across annotators
    winning_response = Counter(values).most_common(1)[0][0]
    if winning_response == 1:
        chosen = sample["response-1"]
        rejected = sample["response-2"]
    else:
        chosen = sample["response-2"]
        rejected = sample["response-1"]
    return chosen, rejected

task = TrainingTask.for_reward_modeling(formatting_func=formatting_func)
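To see what the majority vote does, the sketch below applies the same logic to a single hand-written record, without needing Argilla. The record contents and the helper name majority_choice are hypothetical. Returning None when nobody submitted an answer is an assumption of this sketch: as written, the formatting function above would raise an IndexError on such a record.

```python
from collections import Counter

# Hypothetical record: four annotators answered "choose-best", one of them discarded it.
sample = {
    "response-1": "Depreciation is the drop in value of an asset over time.",
    "response-2": "Depreciation: 10 facts you should know!",
    "choose-best": [
        {"value": 1, "status": "submitted"},
        {"value": 2, "status": "submitted"},
        {"value": 1, "status": "submitted"},
        {"value": 2, "status": "discarded"},
    ],
}


def majority_choice(sample):
    """Return (chosen, rejected) by majority vote over submitted annotations."""
    values = [a["value"] for a in sample["choose-best"] if a["status"] == "submitted"]
    if not values:
        return None  # no submitted answers, so there is no pair to train on
    winner = Counter(values).most_common(1)[0][0]
    if winner == 1:
        return sample["response-1"], sample["response-2"]
    return sample["response-2"], sample["response-1"]


chosen, rejected = majority_choice(sample)
print("chosen:  ", chosen)
print("rejected:", rejected)
```

The discarded answer is ignored, so response-1 wins two votes to one. When votes are tied, Counter.most_common returns the value that was encountered first, so you may want to drop tied records rather than rely on annotation order.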

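Step 16

Inspect the resulting dataset

After preparing the data for training with TRL, you can inspect the resulting dataset:

dataset = feedback_dataset.prepare_for_training(framework="trl", task=task)
dataset

#### output starts ####
Dataset()
#### output ends ####

This displays information about the dataset, including its features and number of rows. You can also access individual data points, with their chosen and rejected responses.

dataset[0]
#### output ####

The dataset is now ready to be used as comparison data for training a reward model.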

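Step 17

Choose a base model

Before training the reward model, you need to choose a base model to fine-tune. Typically, this base model is the supervised fine-tuned model produced by the instruction-tuning step. In this example we use the distilroberta-base model, but you can try other models.

model_name = "distilroberta-base"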

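Step 18

Initialize the ArgillaTrainer

Create an instance of ArgillaTrainer, which handles the training process for the reward model. You pass it the dataset, the task, the framework (TRL), the chosen base model, and other training configuration options.

from argilla.feedback import ArgillaTrainer

trainer = ArgillaTrainer(
    dataset=feedback_dataset,
    task=task,
    framework="trl",
    model=model_name,
    train_size=0.8,
)

dataset: the labeled dataset you prepared earlier.
task: the reward modeling task configuration.
framework: specifies "trl" as the framework for reward modeling.
model: the base model you chose to fine-tune.
train_size: the fraction of the dataset used for training (80% in this example).

Step 19

Update the training configuration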

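Adjust training configuration options such as the batch size, the evaluation strategy, and the logging frequency. Here we set the batch size, evaluate at regular step intervals, and specify the logging interval.

trainer.update_config(
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=200,
)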

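Step 20

Start training

Start the training process for the reward model. The model learns to distinguish preferred responses from rejected ones based on the examples in the dataset. The trained model assigns higher values to preferred responses and lower values to rejected responses.

trainer.train("./reward_model")

#### output starts ####
{'eval_loss': 0.1626577377319336, 'eval_accuracy': 0.937204591492235, 'eval_runtime': 6.5907, 'eval_samples_per_second': 224.709, 'eval_steps_per_second': 28.221, 'epoch': 1.0}
#### output ends ####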
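To make the training objective concrete: reward models of this kind are usually trained with a pairwise ranking loss, -log(sigmoid(r_chosen - r_rejected)), and to our knowledge this is the loss TRL's RewardTrainer applies. The sketch below is a plain-Python illustration with made-up scores, not TRL's implementation; pairwise_loss and pair_accuracy are hypothetical helper names. As we understand it, eval_accuracy in the output above is the share of evaluation pairs in which the chosen response scores higher than the rejected one.

```python
import math


def pairwise_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)) = max(-d, 0) + log1p(exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def pair_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs where the chosen score is higher."""
    return sum(c > r for c, r in pairs) / len(pairs)


# Made-up reward scores for three comparison pairs: (chosen, rejected)
pairs = [(2.1, -0.4), (0.3, 0.9), (1.5, 1.0)]
mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
print(f"mean loss: {mean_loss:.4f}")
print(f"accuracy:  {pair_accuracy(pairs):.2f}")
```

The loss shrinks as the chosen score pulls ahead of the rejected one and grows when the order is wrong, which is what pushes the model to score preferred responses higher.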

Step 21

Load the tokenizer and model

The resulting model is fully open source and available on the Hugging Face Hub. You can now use it with your own data.

Load the tokenizer and model from the Hugging Face Hub. In this example we use AutoTokenizer and AutoModelForSequenceClassification to load the pretrained reward model.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("argilla/roberta-base-reward-model-falcon-dolly")
model = AutoModelForSequenceClassification.from_pretrained("argilla/roberta-base-reward-model-falcon-dolly")

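Step 22

Define a function to get the score

Create a function named get_score that takes the model, tokenizer, prompt, and response as input. The function tokenizes the input sequences, runs a forward pass through the model, and extracts the logits.

def get_score(model, tokenizer, prompt, response):
    # Tokenize the prompt and response as a single sequence pair
    inputs = tokenizer.encode_plus(
        prompt,
        response,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="pt",
    )
    # Run the forward pass without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract the logits; the reward model has a single output, so .item() returns the score
    logits = outputs.logits
    return logits.item()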

Step 23

Use the reward model

You can now use the get_score function to score different responses to a given prompt. The score indicates how highly the reward model rates a response according to the user preferences it learned.

Here is a usage example:

prompt = "What is depreciation"
example_less_pref_response = "What is depreciation – 10 important facts to know?"  # insert the actual response here
example_preferred_response = "Depreciation is the drop in value of an asset due to wear and tear."  # insert the actual response here

# Get the score for the less preferred response
score_less_pref = get_score(model, tokenizer, prompt, example_less_pref_response)
print("Score for less preferred response:", score_less_pref)

# Get the score for the preferred response
score_preferred = get_score(model, tokenizer, prompt, example_preferred_response)
print("Score for preferred response:", score_preferred)
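The same scoring call extends from two responses to any number of candidates. The sketch below ranks candidates by score. rank_responses is a hypothetical helper, and toy_score is a stand-in scorer (it simply prefers shorter text) so that the example runs without downloading a model.

```python
from typing import Callable, List, Tuple


def rank_responses(
    score_fn: Callable[[str, str], float],
    prompt: str,
    responses: List[str],
) -> List[Tuple[float, str]]:
    """Score every response for the prompt and return (score, response) pairs, best first."""
    scored = [(score_fn(prompt, response), response) for response in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model: shorter responses get higher scores."""
    return -float(len(response))


candidates = [
    "Depreciation is the drop in value of an asset due to wear and tear.",
    "What is depreciation? Here are 10 important facts to know about it.",
]
for score, response in rank_responses(toy_score, "What is depreciation", candidates):
    print(f"{score:8.1f}  {response}")
```

With the real reward model, you would pass a function that wraps get_score instead of toy_score, for example lambda p, r: get_score(model, tokenizer, p, r).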

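Conclusion

Training a reward model for RLHF can significantly improve the performance of language models such as GPT-3.

By following the steps in this guide, you can create a dataset for collecting human preferences, train a reward model, and use it to rank responses according to user feedback.

This approach helps fine-tune language models to produce more helpful, truthful, and relevant responses, which makes them more useful across a wide range of applications.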
