-
Figure 1.
Architecture of hyperbolic transformers. The hyperbolic transformer encoder block proceeds through the following steps: input Euclidean embedding, position encoding, hyperbolic layer normalization, hyperbolic multi-head attention, and a hyperbolic feed-forward neural network.
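To make the block structure concrete, here is a minimal sketch of one encoder block, assuming the common construction that lifts points to the tangent space at the origin of the Poincaré ball, applies the Euclidean operation there, and maps back. The names (HypEncoderBlock, expmap0, logmap0) and the curvature handling are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def expmap0(v, c=1.0, eps=1e-6):
    # Exponential map at the origin: tangent vector -> point on the Poincare ball.
    n = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c**0.5 * n) * v / (c**0.5 * n)

def logmap0(x, c=1.0, eps=1e-6):
    # Logarithmic map at the origin: ball point -> tangent vector (inverse of expmap0).
    n = x.norm(dim=-1, keepdim=True).clamp(eps, (1.0 - eps) / c**0.5)
    return torch.atanh(c**0.5 * n) * x / (c**0.5 * n)

class HypEncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, c=1.0):
        super().__init__()
        self.c = c
        self.norm1 = nn.LayerNorm(d_model)   # "hyperbolic" norms applied in tangent space
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x_ball):                   # x_ball: (B, T, d) points on the ball
        u = self.norm1(logmap0(x_ball, self.c))  # hyperbolic layer normalization
        a, _ = self.attn(u, u, u)                # hyperbolic multi-head attention
        u = u + a
        u = u + self.ffn(self.norm2(u))          # hyperbolic feed-forward network
        return expmap0(u, self.c)                # map the result back onto the ball

# Usage: lift the Euclidean embedding plus position encoding onto the ball first,
# e.g. tokens = expmap0(token_emb + pos_emb), then pass through stacked blocks.
```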
-
Figure 2.
Transition function learning. A hyperbolic transformer is employed to learn the dynamics of the environment by modeling the transition function in RL, which describes how the environment transitions from the current state s to the next state s' and issues a reward r in response to the action taken by the agent.
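A minimal sketch of this interface is below: a sequence model consumes state-action pairs and predicts the next state s' and the reward r, trained by regression on logged transitions. The class name TransitionModel and the stand-in nn.TransformerEncoder backbone are assumptions; in the paper the backbone is the hyperbolic transformer sketched above.

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in backbone
        self.next_state_head = nn.Linear(d_model, state_dim)        # predicts s'
        self.reward_head = nn.Linear(d_model, 1)                    # predicts r

    def forward(self, states, actions):
        # states: (B, T, state_dim), actions: (B, T, action_dim)
        h = self.backbone(self.embed(torch.cat([states, actions], dim=-1)))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)

# Training is plain regression on observed transitions, e.g.
# loss = F.mse_loss(s_next_pred, s_next) + F.mse_loss(r_pred, r)
```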
-
Figure 3.
Outline of MoE. The MoE has two main components: experts and a router. Experts: each FFN layer now holds a set of 'experts', each a smaller sub-network of the FFN. Router (gate network): the router determines which experts each input token's information is passed to. The MoE layer replaces the FFN in the transformer block.
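Below is a minimal sketch of such a layer, assuming standard top-k token routing with softmax gate scores; dispatch is done densely for readability (real systems route sparsely). The names MoEFFN, router, and experts are illustrative.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gate network: scores experts per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                              # x: (n_tokens, d_model)
        gates = self.router(x).softmax(dim=-1)         # routing probabilities
        topv, topi = gates.topk(self.k, dim=-1)        # each token keeps its top-k experts
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize the kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            w = (topv * (topi == e)).sum(dim=-1, keepdim=True)  # weight of expert e per token
            out = out + w * expert(x)   # dense dispatch: unrouted tokens get zero weight
        return out
```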
-
Figure 4.
Hyperbolic transformer as a policy.
-
Figure 5.
MSE of the vanilla transformer and the hyperbolic transformer for solving ten FrontierMath problems (excluding sample problem 11 of A.11, prime field continuous extensions).
-
Figure 6.
Accuracy improvement of Hyper-T over Vanilla-T.
-
Figure 7.
Wall-clock time of Hyper-T for 11 FrontierMath problems.
-
Input: initial hyperbolic transformer parameters $\theta_0$; manifold: the Poincaré ball; group size $G$; clipping parameter $\epsilon$.
Initialize the policy $\pi_\theta$ and the hyperbolic transformer.
for k = 0, 1, 2, ...
    for env episode e = 1, ..., B
        1. Sample collection from the policy. Sample a batch of N states $\{s^{(j)}\}_{j=1}^{N}$ from the environment using the policy $\pi_{\theta_k}$. For each state $s^{(j)}$, generate G responses: sample a set (or group) of G actions $\{a_1^{(j)}, \dots, a_G^{(j)}\}$; from human preference feedback in the population, obtain the group information and divide the samples into diverse and robust groups; each action returns a reward $r(s^{(j)}, a_i^{(j)})$.
        2. Compute the group-based (relative) advantage. With the batch, define the group statistics for each state $s^{(j)}$ and its corresponding group of actions, then define the group-relative advantage for each action $a_i^{(j)}$. For each pair $(s^{(j)}, a_i^{(j)})$, evaluate the new policy's probability $\pi_{\theta_k}(a_i^{(j)} \mid s^{(j)})$ using the hyperbolic transformer, and compute the probability ratios relative to the old and reference policies and the KL divergence $D_{KL}(\pi_{\theta_k} \,\|\, \pi_{ref})$.
        3. Form the GRPO surrogate objective $L(s^{(j)}, \theta_k)$ and the total surrogate objective over the batch: $L(\theta_k) = \frac{1}{N} \sum_{j=1}^{N} L(s^{(j)}, \theta_k)$.
    end batch
    4. Gradient ascent through hyperbolic transformations and hyperbolic-aware optimization to update the parameters.
end k
Algorithm 1.
Hyperbolic GRPO.
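Steps 2-3 can be made concrete. Below is a minimal sketch under standard GRPO conventions: advantages as per-group z-scores of the G rewards, a PPO-style clipped ratio, and a KL penalty toward the reference policy. The function name grpo_loss, the KL weight beta, and the "k3" KL estimator are assumptions, not details from the paper. Step 4 would then backpropagate this loss through the hyperbolic transformer and apply a hyperbolic-aware (Riemannian) optimizer.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    # All inputs have shape (N, G): N states, G sampled actions per state.
    # Group-relative advantage: z-score of each reward within its group.
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) \
          / (rewards.std(dim=1, keepdim=True) + 1e-8)
    ratio = (logp_new - logp_old).exp()                  # pi_theta / pi_theta_k
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)
    # Nonnegative "k3" estimator of D_KL(pi_theta || pi_ref).
    d = logp_ref - logp_new
    kl = d.exp() - d - 1.0
    return -(surrogate - beta * kl).mean()               # minimizing this ascends L(theta)
```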
-
Backbone       Updates to MAE < 10−6   Final MAE (×10−6)   Wall-clock
Vanilla-T      10,200 ± 800            6.2                 1.00×
Hyperbolic-T   6,600 ± 610             3.1                 0.84×
Table 1.
Comparison between the vanilla transformer and the hyperbolic transformer on the scalar root-finding benchmark.
-
Backbone       Updates to hit 11       Mispred. rate (%)   Wall-clock
Vanilla-T      10,400 ± 800            0.46                1.00×
Hyper-T        6,900 ± 600             0.31                0.83×
Table 2.
Comparison between the vanilla transformer and the hyperbolic transformer on the prime field continuous extension problem.
-
Backbone       Updates to MAE < 10−6   Final MAE (×10−6)   Wall-clock
Vanilla-T      12,800 ± 900            6.4                 1.00×
Hyper-T        8,200 ± 670             3.6                 0.83×
Table 3.
Comparison between the vanilla transformer and the hyperbolic transformer on the Van der Pol optimal-control problem.
-
Backbone       Updates to MAE < 10−6   Final MAE (×10−6)   Wall-clock
Vanilla-T      10,500 ± 800            6.0                 1.00×
Hyper-T        6,800 ± 620             3.3                 0.84×
Table 4.
Comparison between the vanilla transformer and the hyperbolic transformer on the unicycle-vehicle energy-minimization problem.