Research Question
Current reasoning with LLMs suffers from several key limitations:
- the lack of an internal world model to simulate the state of the world
- the absence of a reward mechanism to assess and guide the reasoning
- the inability to balance exploration and exploitation

To address these limitations, we propose Reasoning via Planning (RAP), a new LLM reasoning framework. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm based on Monte Carlo Tree Search (MCTS) for strategic exploration in the vast reasoning space.
Approach

Language Model as World Model
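As a rough illustration of the idea stated above (the LLM simulating world states by predicting the next state $s_{t+1}$ from the current state $s_t$ and an action $a_t$), here is a minimal sketch. The `LLM` callable interface, the prompt wording, and `toy_llm` are illustrative assumptions, not the actual RAP implementation.

```python
from typing import Callable

# Hypothetical text-completion interface: prompt in, completion out.
# Any concrete LLM backend could be plugged in behind this signature.
LLM = Callable[[str], str]

def predict_next_state(llm: LLM, state: str, action: str) -> str:
    """Use the LLM as a world model: given a description of the current
    state and a candidate action (reasoning step), ask the LLM to
    describe the resulting next state."""
    prompt = (
        "Simulate the state of the world.\n"
        f"Current state: {state}\n"
        f"Action: {action}\n"
        "Next state:"
    )
    return llm(prompt).strip()

# Toy stub standing in for a real model, only to make the sketch runnable.
def toy_llm(prompt: str) -> str:
    return "the orange block is on top of the blue block"

print(predict_next_state(toy_llm,
                         state="the orange block is on the table",
                         action="stack the orange block on the blue block"))
```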
Reward Design
The assessment of each reasoning step (i.e., applying an action $a_t$ to the state $s_t$) is performed by a reward function $r_t = r(s_t, a_t) \in \mathbb{R}$. Here we introduce several common rewards that are applicable to different tasks and shown to be effective in our experiments; a minimal code sketch follows the list below.
- Likelihood of the action: incorporate the log probability of the action as a reward. This reward reflects the “instinct” of the LLM as a reasoning agent.
- Confidence of the state: draw multiple sample answers from the world model, and use the proportion of the most frequent answer as the confidence. Higher confidence indicates that the state prediction is more consistent with the world knowledge of the LLM.
- Self-evaluation by the LLM: allow the LLM to criticize itself with the question “Is this reasoning step correct?”, and use the next-word probability of the token “Yes” as a reward.
- Task-specific heuristics: other task-specific heuristics can also be plugged into the reward. For example, in plan generation for Blocksworld, we compare the predicted current state of blocks with the goal to calculate a reward (explained in the Experiments section).
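The sketch below shows how the first three rewards could be computed, assuming the token log-probabilities, the sampled answers, and the probability of “Yes” have already been obtained from the LLM. Averaging the token log-probabilities, the function names, and the omission of a combination rule are illustrative choices, not fixed by this section.

```python
from collections import Counter
from typing import Sequence

def action_likelihood_reward(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability the LLM assigns to the tokens of the chosen
    action, reflecting its 'instinct' for that reasoning step."""
    return sum(token_logprobs) / len(token_logprobs)

def state_confidence_reward(sampled_answers: Sequence[str]) -> float:
    """Proportion of the most frequent answer among multiple samples drawn
    from the world model; higher means more self-consistent."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def self_evaluation_reward(p_yes: float) -> float:
    """Next-word probability of 'Yes' when the LLM is asked
    'Is this reasoning step correct?'."""
    return p_yes

# Example with made-up numbers (no LLM calls; the inputs would normally
# come from the model's log-probabilities and sampled outputs):
print(action_likelihood_reward([-0.2, -0.5, -0.1]))       # ~ -0.27
print(state_confidence_reward(["14", "14", "15", "14"]))  # 0.75
print(self_evaluation_reward(0.92))                       # 0.92
```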
Planning with Monte Carlo Tree Search
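Below is a generic, simplified sketch of MCTS over reasoning steps: selection by UCT, expansion by proposing candidate actions, state prediction by the world model, and back-propagation of accumulated rewards. The helper callables (`propose_actions`, `next_state`, `reward`), the cumulative-reward backup, and the exploration constant are assumptions for illustration and may differ from the exact RAP formulation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    state: str
    action: Optional[str] = None            # action that led to this state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                      # running mean of backed-up rewards

def uct_score(child: Node, parent_visits: int, c: float = 1.0) -> float:
    """UCT: prefer high-value children, but keep exploring rarely visited ones."""
    if child.visits == 0:
        return float("inf")
    return child.value + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state: str,
         propose_actions: Callable[[str], List[str]],
         next_state: Callable[[str, str], str],
         reward: Callable[[str, str], float],
         n_iters: int = 100,
         depth_limit: int = 4) -> Node:
    root = Node(state=root_state)
    for _ in range(n_iters):
        # 1) Selection: descend by UCT until reaching a leaf or the depth limit.
        node, path_reward, depth = root, 0.0, 0
        while node.children and depth < depth_limit:
            node = max(node.children, key=lambda ch: uct_score(ch, node.visits))
            path_reward += reward(node.parent.state, node.action)
            depth += 1
        # 2) Expansion: generate children with the action proposer + world model.
        if depth < depth_limit and not node.children:
            for a in propose_actions(node.state):
                node.children.append(
                    Node(state=next_state(node.state, a), action=a, parent=node))
            if node.children:
                node = random.choice(node.children)
                path_reward += reward(node.parent.state, node.action)
        # 3) Back-propagation: update visit counts and running mean values.
        while node is not None:
            node.visits += 1
            node.value += (path_reward - node.value) / node.visits
            node = node.parent
    # The most visited child of the root is returned as the chosen first step.
    return max(root.children, key=lambda ch: ch.visits)

# Toy usage with stubbed components (a real system would query the LLM instead):
best = mcts("start",
            propose_actions=lambda s: [s + " ->a", s + " ->b"],
            next_state=lambda s, a: a,
            reward=lambda s, a: 1.0 if a.endswith("b") else 0.0,
            n_iters=50)
print(best.action, best.visits)
```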