Adaptive Action Repetition for Long-Horizon Agents

Anil Turaga May 31, 2026

A minimal experiment connecting credit assignment, inference cost, and decision stride.

Training agents on long-horizon tasks is an optimisation problem across at least three axes:

  • Credit assignment: distributing a task's reward across hundreds of decisions is hard, and the signal gets diluted as the horizon gets longer
  • Inference cost: agents that re-evaluate at every step accumulate unnecessary model calls and context
  • Decision stride: models need to learn not just what action to take but how long that action stays valid before reassessing

(Memory and long-context are their own rabbit hole and I'm not covering them in this post.)

I built a minimal experiment to demonstrate how these three connect using adaptive action repetition (AAR).

At each decision point, the model jointly predicts an action and a duration. The duration is how many steps to hold that action before deciding again.
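
The loop described above can be sketched as follows. This is a minimal illustration, not the code from the repo: the `Policy` and `StepFn` signatures, the toy policy, and the toy environment are all assumptions made up for the example, and episode termination is left out to keep it short.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures, chosen for illustration only.
Policy = Callable[[float], Tuple[int, int]]  # obs -> (action, hold duration)
StepFn = Callable[[float, int], float]       # (obs, action) -> next obs


def rollout_with_holds(
    policy: Policy, step_fn: StepFn, obs: float, horizon: int
) -> Tuple[int, List[int]]:
    """Run `horizon` env steps, calling the policy only when a hold expires."""
    calls = 0
    trace: List[int] = []
    t = 0
    while t < horizon:
        action, hold = policy(obs)  # one model call per decision point
        calls += 1
        # Hold for at least one step and never past the horizon.
        hold = max(1, min(hold, horizon - t))
        for _ in range(hold):
            obs = step_fn(obs, action)  # repeat the action, no model call
            trace.append(action)
            t += 1
    return calls, trace


def toy_policy(obs: float) -> Tuple[int, int]:
    # Assumed toy rule: push for 3 steps until obs reaches 6, then idle for 2.
    return (1, 3) if obs < 6 else (0, 2)


def toy_step(obs: float, action: int) -> float:
    # Assumed toy dynamics: the action is simply added to the observation.
    return obs + action


calls, trace = rollout_with_holds(toy_policy, toy_step, obs=0.0, horizon=12)
print(calls, len(trace))  # 5 12
```

A policy that always returns a hold of 1 recovers the normal per-step model, so the same loop covers both cases.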

[Interactive sketch: Sleep/AAR decides only when the stride ends. The sleep/AAR model predicts a fresh duration at each decision point; the normal model decides at every time step. Over 12 steps, sleep/AAR makes 5 model calls with hold durations 3, 1, 4, 2, 2, while the normal model makes 12 calls, one per step.]

Fewer decisions means rewards map back to actions more cleanly. Held actions mean no inference is wasted mid-stride. And the model has to learn the right stride: commit for too long and it misses the window to correct; commit for too short and it is back to re-evaluating everything.
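
One way to make the credit assignment point concrete is to compute returns per decision instead of per step. The sketch below uses the standard semi-MDP style bookkeeping (sum the discounted reward inside each hold, then discount by `gamma ** hold` between decisions). Whether the repo does exactly this is an assumption on my part here; the function name and parameters are hypothetical.

```python
from typing import List


def decision_level_returns(
    rewards: List[float], holds: List[int], gamma: float = 0.99
) -> List[float]:
    """Collapse per-step rewards into one discounted return per decision."""
    if sum(holds) != len(rewards):
        raise ValueError("hold durations must cover every step exactly once")

    # Discounted reward collected while each action was held.
    segment_rewards = []
    t = 0
    for hold in holds:
        segment_rewards.append(
            sum(gamma**k * rewards[t + k] for k in range(hold))
        )
        t += hold

    # Backward pass: the discount between two decisions is gamma ** hold.
    returns = [0.0] * len(holds)
    running = 0.0
    for i in reversed(range(len(holds))):
        running = segment_rewards[i] + gamma ** holds[i] * running
        returns[i] = running
    return returns


# 12 steps with a single terminal reward, split into the holds 3, 1, 4, 2, 2.
rewards = [0.0] * 11 + [1.0]
returns = decision_level_returns(rewards, [3, 1, 4, 2, 2], gamma=1.0)
print(len(returns))  # 5
```

With these holds the terminal reward is spread over 5 decisions instead of 12 steps, and each decision's return equals the per-step return at the step where that decision was made.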

The models with AAR enabled also tend to start reward hacking much earlier than vanilla models. One persistent case I have seen is the model aligning the lander with the middle without any descent and then letting it free-fall. To mitigate this, I added a landing-speed-based reward.
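
A landing speed term of this kind could look roughly like the sketch below. The function name, the `safe_speed` threshold, and the `scale` value are hypothetical and picked for illustration; the actual shaping in the repo may differ.

```python
import math


def landing_speed_bonus(
    vx: float,
    vy: float,
    touched_down: bool,
    safe_speed: float = 0.5,  # assumed threshold, not taken from the repo
    scale: float = 10.0,      # assumed magnitude, not taken from the repo
) -> float:
    """Extra reward at touchdown: positive for slow landings, negative for fast ones."""
    if not touched_down:
        return 0.0  # only applied on the step where the lander touches down
    speed = math.hypot(vx, vy)
    # Linear from +scale at rest to 0 at safe_speed, clipped at -scale.
    return scale * max(-1.0, 1.0 - speed / safe_speed)


print(landing_speed_bonus(0.0, 0.0, touched_down=True))   # 10.0
print(landing_speed_bonus(0.0, -3.0, touched_down=True))  # -10.0
```

Because a free fall ends at high speed, it now costs reward at touchdown even if the lander is perfectly centered.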

The left frames show the model trained with AAR and the right ones are the baseline/vanilla ones.

Actual test run from the repo: sleep/AAR on the left, no-sleep baseline on the right.

GitHub repo (the models can be trained in a couple of minutes): https://github.com/Anilturaga/stride-rl