什么是LLMs的反射调整？

通过自我反思来改善LLM微调

我现在已经在数据科学行业工作了5年，我见过任何话题发展得像生成式人工智能一样快。几乎每周都有重大进展。本周的重大突破是反射调整LLMs，通过使用这种技术，已经使得Llama 3.1 70B模型成为迄今为止最好的开源模型。

在我们继续之前，

什么是微调？

微调是将预训练的LLM调整到特定任务或数据集的过程，通过在更小、更专业的数据集上继续训练它。以下是一个示例，说明了LLM的微调过程：

常规微调样本


Input: 'Compose a narrative involving the theme of nature'
Output: '........'

Input: 'Identify elements in this passage: the house is blue'
Output: '......'

反思微调

在普通微调的基础上添加功能，反射微调

You are a world-class AI system, capable of complex reasoning and reflection. 
Reason through the query inside <thinking> tags, and 
then provide your final response inside <output> tags. 
If you detect that you made a mistake in your reasoning at any point,
correct yourself inside <reflection> tags.

因此，最终的反思微调提示可能如下所示（您可以根据您的意愿更改标签，但这些标签产生了最佳结果）。

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a world-class AI system, capable of complex reasoning and reflection.
Reason through the query inside <thinking> tags, and 
then provide your final response inside <output> tags. 
If you detect that you made a mistake in your reasoning at any point,
correct yourself inside <reflection> tags.
<|eot_id|><|start_header_id|>user<|end_header_id|>

what is 2+2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

LLM 反思后的产出

推理阶段:

该模型首先通过标记生成其推理。该部分包含模型分析输入查询时的内部思维过程。

<thinking>
I need to determine the sum of 2 and 2. 
This is a simple arithmetic problem.
</thinking>

错误检测和纠正:

在部分中，模型可能包括标签，如果它发现自己的推理中有任何错误。这表明该模型能够识别错误，并将在最终确定答案之前尝试更正它们。

<thinking>
I initially thought the answer was 5, but that's incorrect. 
The correct approach is to add the two numbers.
<reflection>
Correcting my previous thought: 2 + 2 = 4.
</reflection>
</thinking>

最终输出:

一旦模型对其推理感到满意，它将在标签中提供最终答案。本节介绍了从推理阶段得出的结论。

<output>
The answer is 4.
</output>

因此，最终输出是

<thinking>
I need to determine the sum of 2 and 2. 
This is a simple arithmetic problem.
</thinking>

<thinking>
I initially thought the answer was 5, but that's incorrect. 
The correct approach is to add the two numbers.
<reflection>
Correcting my previous thought: 2 + 2 = 4.
</reflection>
</thinking>

<output>
The answer is 4.
</output>

反射微调似乎是一个很好的解决方案，甚至使得Llama3.1 70B成为迄今为止最好的开源模型，只需进行反射微调。请尝试使用反射调整对其他LLM进行unsloth和本提示。您可以在下方查看Llama3.1反射微调模型：

希望这对你有所帮助，很快见到你。