One helpful tip per day:)
Reinforcement Learning from Human Feedback (RLHF) is a key technique behind the success of ChatGPT. HuggingFace provides a detailed description of the entire RLHF process, which includes several clever ideas:
- RLHF is a complex training process that involves training multiple models and a substantial amount of engineering work.
- The reward model for an LLM has to assign a scalar score to a piece of text, but asking people to score text directly is highly subjective. For example, two annotators may give completely different scores to the same sample, which can significantly affect subsequent training. The current best practice is to have two models generate outputs for the same input, so that the annotator only has to compare the two. These comparisons are then aggregated into an overall score.
- The quality of RLHF depends on two factors: the quality of the text initially annotated by humans, and the quality of the manual scoring.
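
The pairwise comparison idea from the second bullet can be sketched in a few lines. This is only a minimal illustration, assuming an Elo-style update as the aggregation rule. The post itself only says that the comparisons are collected into a total score, so the rule, the `elo_scores` helper, the `k` and `base` values, and the sample judgments are all hypothetical.

```python
from collections import defaultdict


def elo_scores(comparisons, k=32.0, base=1000.0):
    """Aggregate pairwise preferences into one scalar score per output.

    `comparisons` is a list of (preferred, rejected) pairs.
    The Elo-style update is an assumption made for illustration.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Probability that the winner would win under the current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # A surprising result (low expected_win) moves the ratings more.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical annotator judgments: (preferred output, rejected output).
comparisons = [
    ("output_a", "output_b"),
    ("output_a", "output_c"),
    ("output_b", "output_c"),
]

scores = elo_scores(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The annotator never has to pick an absolute number. Each judgment is a simple "which one is better", and the scalar scores emerge from many such judgments, which is why this setup is less sensitive to individual annotators' scoring habits.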