Research Question:

  Does supervising each reasoning step (process supervision) produce more reliable reward models than supervising only the final answer (outcome supervision), and how efficiently can the required step-level data be collected?

Contributions:

  1. We show that process supervision can train much more reliable reward models than outcome supervision. We use our state-of-the-art PRM to solve 78.2% of problems from a representative subset of the MATH test set (see the first sketch after this list).
  2. We show that a large reward model can reliably approximate human supervision for smaller reward models, and that it can be used to efficiently conduct large-scale data collection ablations.
  3. We show that active learning leads to a 2.6× improvement in the data efficiency of process supervision (see the second sketch after this list).
  4. We release our full process supervision dataset, PRM800K, to promote related research.
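
Contribution 1 rests on using the PRM to rerank sampled solutions (best-of-N selection): the generator produces many candidate solutions and the one the PRM scores highest is kept. The sketch below is a minimal illustration of that loop, not the paper's implementation; `sample_solution` and `score_steps` are hypothetical stand-ins for the generator and the PRM, while combining per-step probabilities by taking their product follows the solution-level PRM score described in the paper.

```python
from typing import Callable, List


def solution_score(step_probs: List[float]) -> float:
    """Combine per-step correctness probabilities into one solution-level
    score by taking their product."""
    score = 1.0
    for p in step_probs:
        score *= p
    return score


def best_of_n(
    problem: str,
    sample_solution: Callable[[str], List[str]],           # hypothetical: returns a candidate solution as a list of steps
    score_steps: Callable[[str, List[str]], List[float]],  # hypothetical: PRM's per-step correctness probabilities
    n: int = 16,
) -> List[str]:
    """Sample n candidate solutions and return the one the PRM ranks highest."""
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda steps: solution_score(score_steps(problem, steps)))
```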
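
Contribution 3's active learning spends the labeling budget on the most informative samples: convincingly wrong solutions, i.e., candidates the current reward model rates highly even though their final answer is incorrect. The sketch below illustrates that selection rule; `prm_score`, `known_answer`, and the candidate layout are hypothetical, and the paper's surfacing strategy may differ in detail.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, List[str], str]  # (problem, solution steps, final answer)


def select_for_labeling(
    candidates: List[Candidate],
    known_answer: Callable[[str], str],            # hypothetical: ground-truth final-answer lookup
    prm_score: Callable[[str, List[str]], float],  # hypothetical: current PRM's solution-level score
    budget: int,
) -> List[Candidate]:
    """Pick the `budget` wrong-answer solutions that the current reward model
    finds most convincing; these are the most informative to label next."""
    wrong = [c for c in candidates if c[2] != known_answer(c[0])]
    wrong.sort(key=lambda c: prm_score(c[0], c[1]), reverse=True)
    return wrong[:budget]
```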

Approach:

Experiments: