Research Question
Current reasoning with LLMs suffers from several key limitations:
- the lack of an internal world model to simulate the state of the world
- the absence of a reward mechanism to assess and guide the reasoning
- the inability to balance exploration and exploitation

To address these limitations, we propose Reasoning via Planning (RAP), a new LLM reasoning framework. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm based on Monte Carlo Tree Search (MCTS) for strategic exploration in the vast reasoning space.
Approach

Language Model as World Model
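As a rough illustration of the idea stated above (the LLM simulating world states by predicting the next state $s_{t+1}$ from the current state $s_t$ and an action $a_t$), here is a minimal sketch. The `LLM` callable interface, the prompt wording, and `toy_llm` are illustrative assumptions, not the actual RAP implementation.

```python
from typing import Callable

# Hypothetical text-completion interface: prompt in, completion out.
# Any concrete LLM backend could be plugged in behind this signature.
LLM = Callable[[str], str]

def predict_next_state(llm: LLM, state: str, action: str) -> str:
    """Use the LLM as a world model: given a description of the current
    state and a candidate action (reasoning step), ask the LLM to
    describe the resulting next state."""
    prompt = (
        "Simulate the state of the world.\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Next state:"
    )
    return llm(prompt).strip()

# Toy stub standing in for a real model, only to make the sketch runnable.
def toy_llm(prompt: str) -> str:
    return "the orange block is on top of the blue block"

print(predict_next_state(toy_llm,
                         state="the orange block is on the table",
                         action="stack the orange block on the blue block"))
```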
Reward Design
The assessment of each reasoning step (i.e., applying an action $a_t$ to the state $s_t$) is performed by a reward function $r_t = r(s_t, a_t) \in \mathbb{R}$. Here we introduce several common rewards that are applicable to different tasks and shown to be effective in our experiments; a minimal code sketch follows the list below.
- Likelihood of the action: incorporate the log probability of the action as a reward. This reward reflects the “instinct” of the LLM as a reasoning agent.
- Confidence of the state: draw multiple sample answers from the world model, and use the proportion of the most frequent answer as the confidence. Higher confidence indicates that the state prediction is more consistent with the world knowledge of the LLM.
- Self-evaluation by the LLM: allow the LLM to criticize itself with the question “Is this reasoning step correct?”, and use the next-word probability of the token “Yes” as a reward.
- Task-specific heuristics: other task-specific heuristics can also be plugged into the reward. For example, in plan generation for Blocksworld, we compare the predicted current state of blocks with the goal to calculate a reward (explained in the Experiments section).
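The sketch below shows how the first three rewards could be computed, assuming the token log-probabilities, the sampled answers, and the probability of “Yes” have already been obtained from the LLM. Averaging the token log-probabilities, the function names, and the omission of a combination rule are illustrative choices, not fixed by this section.

```python
from collections import Counter
from typing import Sequence

def action_likelihood_reward(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability the LLM assigns to the tokens of the chosen
    action, reflecting its 'instinct' for that reasoning step."""
    return sum(token_logprobs) / len(token_logprobs)

def state_confidence_reward(sampled_answers: Sequence[str]) -> float:
    """Proportion of the most frequent answer among multiple samples drawn
    from the world model; higher means more self-consistent."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def self_evaluation_reward(p_yes: float) -> float:
    """Next-word probability of 'Yes' when the LLM is asked
    'Is this reasoning step correct?'."""
    return p_yes

# Example with made-up numbers (no LLM calls; the inputs would normally
# come from the model's log-probabilities and sampled outputs):
print(action_likelihood_reward([-0.2, -0.5, -0.1]))       # ~ -0.27
print(state_confidence_reward(["14", "14", "15", "14"]))  # 0.75
print(self_evaluation_reward(0.92))                       # 0.92
```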
Planning with Monte Carlo Tree Search
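Below is a generic, simplified sketch of MCTS over reasoning steps: selection by UCT, expansion by proposing candidate actions, state prediction by the world model, and back-propagation of accumulated rewards. The helper callables (`propose_actions`, `next_state`, `reward`), the cumulative-reward backup, and the exploration constant are assumptions for illustration and may differ from the exact RAP formulation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    action: Optional[str] = None            # action that led to this state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                      # running mean of backed-up rewards

def uct_score(child: Node, parent_visits: int, c: float = 1.0) -> float:
    """UCT: prefer high-value children, but keep exploring rarely visited ones."""
    if child.visits == 0:
        return float("inf")
    return child.value + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state: str,
         propose_actions: Callable[[str], List[str]],
         next_state: Callable[[str, str], str],
         reward: Callable[[str, str], float],
         n_iters: int = 100,
         depth_limit: int = 4) -> Node:
    root = Node(state=root_state)
    for _ in range(n_iters):
        # 1) Selection: descend by UCT until reaching a leaf or the depth limit.
        node, path_reward, depth = root, 0.0, 0
        while node.children and depth < depth_limit:
            node = max(node.children, key=lambda ch: uct_score(ch, node.visits))
            path_reward += reward(node.parent.state, node.action)
            depth += 1
        # 2) Expansion: generate children with the action proposer + world model.
        if depth < depth_limit and not node.children:
            for a in propose_actions(node.state):
                node.children.append(
                    Node(state=next_state(node.state, a), action=a, parent=node))
            if node.children:
                node = random.choice(node.children)
                path_reward += reward(node.parent.state, node.action)
        # 3) Back-propagation: update visit counts and running mean values.
        while node is not None:
            node.visits += 1
            node.value += (path_reward - node.value) / node.visits
            node = node.parent
    # The most visited child of the root is returned as the chosen first step.
    return max(root.children, key=lambda ch: ch.visits)

# Toy usage with stubbed components (a real system would query the LLM instead):
best = mcts("start",
            propose_actions=lambda s: [s + " ->a", s + " ->b"],
            next_state=lambda s, a: a,
            reward=lambda s, a: 1.0 if a.endswith("b") else 0.0,
            n_iters=50)
print(best.action, best.visits)
```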