Reward Model
A Reward Model is a reward function implemented as a trained ML model (one that predicts the reward for a given sequence of actions in a decision-making process, guiding an agent toward a specific objective).
- Context:
- It can be used in various machine learning paradigms, including Reinforcement Learning, to align the agent's behavior with human preferences or specified objectives.
- It can (typically) be trained on human feedback or preference comparisons to learn a reward function that encapsulates the desired outcomes for an agent's actions (see the training sketch after this list).
- It can (often) face challenges such as Reward Hacking, where an agent exploits loopholes in the reward model to achieve high rewards without truly meeting the desired objectives.
- It can (often) use techniques such as Weight Averaging or Ensemble Learning to mitigate issues like overoptimization and improve robustness against distribution shifts and label inconsistency (see the mitigation sketch after this list).
- It can be a tool for evaluating and auditing Large Language Models (LLMs), capable of generating scalar scores to reveal biases and preferences without complex prompting.
- ...
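The preference-based training mentioned above can be illustrated with a minimal sketch. It assumes pairwise comparison data already encoded as fixed-size sequence embeddings and a hypothetical `RewardModel` with a single linear scoring head; real systems typically attach the scalar head to a pretrained transformer. The Bradley-Terry-style loss pushes the preferred ("chosen") response to score higher than the dispreferred ("rejected") one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical reward model: maps a sequence embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)

def pairwise_preference_loss(model: RewardModel,
                             chosen: torch.Tensor,
                             rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: the human-preferred response should receive
    a higher scalar reward than the dispreferred one."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Usage with random stand-in embeddings for a batch of 4 preference pairs.
model = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = pairwise_preference_loss(model, chosen, rejected)
loss.backward()
```

Once trained, the same scalar outputs can be used to score, compare, or audit LLM responses directly, without additional prompting.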
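The mitigation techniques mentioned above (Ensemble Learning and Weight Averaging) can be sketched as follows, assuming several reward heads with identical architecture; the conservative (minimum) aggregate and the plain parameter average are illustrative choices, not a prescribed recipe.

```python
import copy
import torch
import torch.nn as nn

def ensemble_reward(models, embedding: torch.Tensor) -> torch.Tensor:
    """Score with an ensemble and take a conservative (minimum) aggregate,
    reducing the chance that a policy exploits one model's idiosyncratic errors."""
    scores = torch.stack([m(embedding) for m in models], dim=0)
    return scores.min(dim=0).values

def average_weights(models):
    """Weight averaging: average the parameters of reward models trained from
    the same initialization to improve robustness to distribution shift."""
    merged = copy.deepcopy(models[0])
    avg_state = {k: torch.zeros_like(v, dtype=torch.float32)
                 for k, v in models[0].state_dict().items()}
    for m in models:
        for k, v in m.state_dict().items():
            avg_state[k] += v.float() / len(models)
    merged.load_state_dict(avg_state)
    return merged

# Usage with small stand-in reward heads over hypothetical 768-dim embeddings.
heads = [nn.Linear(768, 1) for _ in range(3)]
x = torch.randn(2, 768)
conservative_score = ensemble_reward(heads, x)
merged_head = average_weights(heads)
```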
- Example(s):
- An RLHF Reward Model that is trained on human feedback and scores a language model's outputs, so that fine-tuning steers the model toward text aligned with human values and preferences.
- ...
- Counter-Example(s):
- Direct Preference Optimization (DPO), which optimizes a policy directly from preference data without training an explicit reward model.
- A Classification Model that predicts categories rather than numerical rewards.
- A Regression Model that predicts continuous values unrelated to decision-making or agent behavior.
- See: Reinforcement Learning, Human-in-the-Loop, Overoptimization, Reward Hacking, Distribution Shift.
References
2024
- GPT-4
- Step 3: Model Refinement Through Reward Modeling
- The objective is to adjust the model's parameters so that its outputs more closely align with human feedback.
- The process involves:
- Training a reward model on the dataset of model outputs and human evaluations, learning to predict the human-preferred outcomes.
- Updating the main language model's parameters using reinforcement learning techniques, such as policy gradient methods, guided by the reward model's predictions to generate outputs that are more likely to be preferred by humans.
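The policy-gradient step described in the quoted pipeline can be illustrated with a deliberately simplified REINFORCE-style toy. Production RLHF systems typically use PPO with a KL penalty against a reference model; the categorical "policy" over a small vocabulary and the randomly initialized reward head below are stand-ins, not actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 16
policy_logits = nn.Parameter(torch.zeros(vocab_size))  # toy stand-in for a language model policy
reward_head = nn.Linear(vocab_size, 1)                  # stand-in reward model (pretend it was trained on preferences)
optimizer = torch.optim.Adam([policy_logits], lr=1e-2)

for _ in range(100):
    dist = torch.distributions.Categorical(logits=policy_logits)
    actions = dist.sample((32,))                         # sample a batch of "responses"
    with torch.no_grad():                                # the reward model only scores; it is not updated
        rewards = reward_head(F.one_hot(actions, vocab_size).float()).squeeze(-1)
    log_probs = dist.log_prob(actions)
    loss = -(log_probs * (rewards - rewards.mean())).mean()  # REINFORCE with a mean baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```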
2023
- (Gao et al., 2023) ⇒ L. Gao, J. Schulman, and J. Hilton. (2023). “Scaling laws for reward model overoptimization.” In: International Conference on Machine Learning, proceedings.mlr.press
- NOTE: It examines the impact of optimizing agent behavior against a proxy reward model, focusing on how the performance of the gold standard reward model is affected during this optimization process.
2005
- (Katoen et al., 2005) ⇒ JP Katoen, M Khattri, and IS Zapreev. (2005). “A Markov reward model checker.” In: International Conference on Dependable Systems and Networks, ieeexplore.ieee.org.
- NOTE: It discusses the development of a tool for the verification of Markov reward models, detailing its capabilities to handle a wide range of measures and support for reward-based verification.
1991
- (Dovidio et al., 1991) ⇒ JF Dovidio, JA Piliavin, SL Gaertner, DA Schroeder, et al. (1991). “The arousal: Cost-reward model and the process of intervention: A review of the evidence.” In: psycnet.apa.org.
- NOTE: It reviews the arousal: cost-reward model in the context of intervention, offering an integrative perspective on theories of helping and altruism based on the available evidence.
1989
- (Reibman et al., 1989) ⇒ A. Reibman, R. Smith, and K. Trivedi. (1989). “Markov and Markov reward model transient analysis: An overview of numerical approaches.” In: European Journal of Operational Research, Elsevier.
- NOTE: It provides a detailed comparison of numerical methods for analyzing transient behaviors in Markov and Markov reward models, highlighting the complexity and effectiveness of various computational algorithms.