the last question

RL is dumb but works

Apr 19, 2026

Machine learning models can drive cars, paint beautiful pictures and write passable rap, but they famously suck at low-level control. The comma controls challenge is a test of that claim.

It’s a GPT-based car simulator, an ONNX model that takes steering commands and returns lateral acceleration. It’s autoregressive: each prediction feeds back as input to the next step. 600 timesteps per route, 20,000 routes. The job is to track a target lateral acceleration profile.

The cost function:

total_cost = lataccel_cost * 50 + jerk_cost

where lataccel_cost is mean squared tracking error (×100) and jerk_cost penalizes rapid changes in lateral acceleration (mean squared jerk, also ×100). The 50× multiplier means tracking matters more than smoothness. The original prize winner scored under 45. The PID baseline sits around 85.
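
A minimal sketch of that cost for a single route, with jerk taken as the finite difference of lateral acceleration, as in the reward code later in this post. The 0.1 s timestep is an assumption taken from the jerk fix described further down, the function name is illustrative, and any windowing the official evaluator applies is ignored here.

```python
DEL_T = 0.1                       # assumed timestep in seconds
LAT_ACCEL_COST_MULTIPLIER = 50.0


def total_cost(target, actual):
    """Cost for one route from equal-length lists of lateral accelerations."""
    n = len(target)
    # Mean squared tracking error, scaled by 100.
    lataccel_cost = sum((t - a) ** 2 for t, a in zip(target, actual)) / n * 100
    # Mean squared jerk: finite difference of the actual lataccel, scaled by 100.
    jerks = [(actual[i + 1] - actual[i]) / DEL_T for i in range(n - 1)]
    jerk_cost = sum(j ** 2 for j in jerks) / (n - 1) * 100
    return lataccel_cost * LAT_ACCEL_COST_MULTIPLIER + jerk_cost
```

A constant tracking error of 1 m/s² with no jerk costs 5000 under this sketch, which shows how heavily the tracking term is weighted.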

The controller sees four things each step: the target lateral acceleration, the current lateral acceleration, the car state (velocity, forward acceleration, road roll), and a future plan: 50 timesteps of where the road is going.

def update(self, target_lataccel, current_lataccel, state, future_plan):
    return steer_action  # a number between -2 and 2

ChatGPT couldn’t write a controller

I dumped the README into ChatGPT and asked it to write an MPC controller. I got five rounds of broken code. The dynamics model in every version was the same wrong thing:

def predict_lataccel(self, current_lataccel, u):
    return current_lataccel + u * self.dt  # wrong

This treats the steering command as something that adds directly to lateral acceleration. But the ONNX simulator is a black box. It maps (states, tokens) → next_lataccel through a learned network. There’s no closed-form dynamics, and the relationship between steer input and lataccel output is nonlinear, speed-dependent, and history-dependent.

ChatGPT’s fix for the oscillation was always the same: raise the smoothing weight, lower the learning rate, try again. Five rounds of that with no insight into why it was oscillating, while the dynamics model stayed wrong the whole time.

LunarLander first

When I started there were too many unknowns. Bad results could mean broken PPO, the wrong architecture, or not enough data, and I couldn’t tell which.

So I stopped and solved LunarLander first, where PPO is the only variable. If I can solve it in 100 epochs with clean code, then PPO isn’t the problem. The implementation is 304 lines. What carried over to the controls challenge:

Separate actor/critic optimizers (pi_lr=3e-4, vf_lr=1e-3)
Learnable state-independent log_std
GAE with advantage normalization
Gradient clipping at 0.5

After that I trusted the PPO code, so I could rule it out and debug the rest one piece at a time. A one-neuron experiment confirmed learning worked at all: 3 weights recover the PID gains.
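
For reference, the GAE and advantage-normalization item from the list above can be sketched in a few lines. This is a generic textbook version, not the post's 304-line implementation; the gamma and lam defaults are common choices and an assumption here, and the trajectory is assumed to have no early termination.

```python
import math


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (no early termination)."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running               # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


def normalize(xs, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]
```

Typical use would be normalize(gae(rewards, values, last_value)) over a batch before the policy update.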

Representation

Two changes mattered most, both about representing the problem correctly. They are what got behavioral cloning (BC, a network trained to imitate PID) working and gave PPO something it could improve on. The final score came later, from the bug fixes and tuning further down.

Curvature

The controller’s main signal is lateral acceleration, and the same value means different things at different speeds: a target_lataccel at 30 km/h and at 120 km/h ask for very different steering. The speed-invariant version is the curvature of the road:

def _curv(lat, roll, v):
    return (lat - roll) / max(v * v, 1.0)

The geometry of a corner is the same at any speed. Only the lateral force needed to follow it changes. So alongside the raw lateral accelerations I give the network the current and target curvature. That one change dropped BC from 130 to 75. The 50-step future plan goes in raw, which is already more than the PID baseline uses, since PID ignores the future entirely. a_ego is in there too. I never found a clean use for it.

Delta actions

The other change: instead of outputting a steering angle directly, the network outputs a change in steering.

delta = clip(raw * DELTA_SCALE, -MAX_DELTA, MAX_DELTA)  # DELTA_SCALE=0.25, MAX_DELTA=0.5
action = clip(prev_action + delta, -2, 2)

The simulator is autoregressive, so with noise on absolute actions a single noisy action at step 100 corrupts the state, which corrupts step 101, and the whole rollout goes bad. With noise on the delta, a noisy step just wiggles the steering slightly and the previous action anchors it, so the noise stays local.

This wasn’t derived from theory. The commit messages tell the real story: “revert to 47 config” after trying other values. DELTA_SCALE=0.25 and MAX_DELTA=0.5 stuck because nothing else worked better.
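
A small sketch of why the delta parameterization keeps noise local, using the two constants above. It feeds pure Gaussian noise in as the raw network output, which is an artificial worst case rather than anything from training, and checks how far the steering can move in one step.

```python
import random

DELTA_SCALE = 0.25
MAX_DELTA = 0.5


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def step_delta(prev_action, raw):
    """Turn a raw network output into a steering command via a bounded delta."""
    delta = clip(raw * DELTA_SCALE, -MAX_DELTA, MAX_DELTA)
    return clip(prev_action + delta, -2.0, 2.0)


# Feed pure noise as the raw output and record the largest step-to-step jump.
rng = random.Random(0)
action, actions = 0.0, []
for _ in range(600):
    action = step_delta(action, rng.gauss(0.0, 1.0))
    actions.append(action)

max_jump = max(abs(b - a) for a, b in zip(actions, actions[1:]))
```

However noisy the raw output is, consecutive commands differ by at most MAX_DELTA, whereas noise on an absolute action could swing across the whole [-2, 2] range in one step.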

The bugs

PPO couldn’t beat BC for months. The pattern was always the same: BC gives 75, I start PPO fine-tuning, and the cost climbs to 100, 150, 300. Nothing crashed. The bugs were in the reward and the action distribution, and each one just made the policy worse.

Reward function bugs. The training reward didn’t match the eval cost. It was missing the 50× lataccel multiplier, and jerk was computed as (cur - prev)² instead of ((cur - prev)/0.1)², off by 100×. The network was optimizing the wrong objective.

BC trained the mean but not the concentration. With a Beta distribution, MSE loss only matches α/(α+β) to the target. There’s no gradient pushing the distribution to get sharper, so σ stays at 0.45 after BC instead of shrinking. The fix was an NLL loss that trains both α and β.
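
The difference between the two losses shows up with two hand-picked Beta distributions. This is a sketch with made-up parameters, assuming the expert action has already been mapped into (0, 1); it is not the training code.

```python
import math


def beta_mean(a, b):
    return a / (a + b)


def beta_std(a, b):
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def beta_nll(x, a, b):
    """Negative log-likelihood of x in (0, 1) under Beta(a, b)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return -((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


target = 0.5                     # expert action, already mapped to (0, 1)
wide, sharp = (2.0, 2.0), (20.0, 20.0)

# Both have mean 0.5, so an MSE loss on the mean cannot tell them apart.
mse_wide = (beta_mean(*wide) - target) ** 2
mse_sharp = (beta_mean(*sharp) - target) ** 2

# The NLL is lower for the sharper distribution, so it rewards concentration.
nll_wide = beta_nll(target, *wide)
nll_sharp = beta_nll(target, *sharp)
```

Both MSE values are zero, so nothing pushes Beta(2, 2) toward Beta(20, 20). The NLL does separate them, which is the gradient the mean-only loss was missing.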

State-dependent sigma. The idea was to let the network modulate exploration per state. In practice it learned to avoid hard states by inflating σ there. Training looked good, eval was bad, because E[tanh(μ + σε)] ≠ tanh(μ). The network learned extreme μ values, knowing noise plus squashing would pull them back. Remove the noise at eval time and the actions are too aggressive.
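
The gap between the training-time and eval-time action is easy to reproduce numerically. The sketch below uses a Monte Carlo estimate with hypothetical values of μ and σ; it only illustrates the inequality and does not come from the training run.

```python
import math
import random


def mean_squashed(mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of E[tanh(mu + sigma * eps)] with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    return sum(math.tanh(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)) / n


mu, sigma = 2.0, 1.5                       # hypothetical values
train_action = mean_squashed(mu, sigma)    # what training sees on average
eval_action = math.tanh(mu)                # what eval applies with noise off
```

With a large σ the average squashed action sits well below tanh(μ), so a policy tuned under noise ends up steering harder once the noise is removed.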

Each of these took days to find. Together they explain months of PPO not beating BC.

Putting it together

The score came down in stages, roughly:

Feb 9:   "ppo shows promise"           first commit in 37 days
Feb 10:  88 → 80 → 79                  PPO from scratch, no BC, beating PID
Feb 11:  96 → 72 → 66                  256-dim obs, Beta distribution
         "65.4 validation is a real breakthrough after months of work"
Feb 12:  112 → 63.8 in 11 epochs       NLL BC + reward normalization
Feb 13:  57 → 50 → 47                  cosine LR decay
         "I had to start it 8-9 times to reach 47 cost"
Feb 14:  47 → 43                        16 git commits in one day
         "fix ratio collapse" → "new best 43"
Feb 15:  42.5 on 5000 routes            submitted to comma

The final setup: a 256-dim observation (16 core features including the curvature terms, 20 steps each of action and lataccel history, and the raw 50-step future plan at four signals per step), a 4-layer actor and 4-layer critic at 256 hidden, a Beta distribution, delta actions, and NLL behavioral cloning pretrain followed by PPO fine-tuning with cosine LR decay.

The training reward, meant to mirror the eval cost:

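def compute_rewards(traj):
    # tgt/cur: target and simulated lataccel per step (key names are assumed here)
    tgt, cur = traj['target_lataccel'], traj['current_lataccel']
    lat  = (tgt - cur)**2 * 100 * LAT_ACCEL_COST_MULTIPLIER  # 50×
    jerk = np.diff(cur, prepend=cur[0]) / DEL_T               # first step gets zero jerk
    return -(lat + jerk**2 * 100) / 500.0                     # negative cost, scaled down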

Another big piece was a GPU optimization blitz. The bottleneck was never the algorithm; it was how fast I could run rollouts. I went from batched ONNX inference (1000 routes at once instead of one at a time) to three MacBooks over USB-C for distributed rollouts, then a Google Cloud 48-core CPU, then a Vast.ai GPU with TensorRT. That added up to about 375× total, all from running rollouts faster rather than changing the algorithm.

During the sprint I also bolted a hybrid policy-MPC onto inference. Sample 16 action candidates from the trained policy’s Beta distribution, roll each forward 5 steps through the ONNX simulator, score each trajectory with the real cost function, and apply the best first action. Slot 0 is always the policy mean, so under the 5-step lookahead score it never picks something worse than the pure policy. That got the sprint to about 42.5.
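
The shape of that shooting step can be sketched as follows. Several things here are assumptions made to keep the example self-contained: toy_sim is a hypothetical stand-in for the ONNX simulator, candidates are drawn from a Gaussian rather than the policy's Beta, and each candidate action is simply held for the whole horizon.

```python
import random

N_CANDIDATES, HORIZON = 16, 5


def toy_sim(lataccel, action):
    # Hypothetical stand-in for the ONNX simulator: lataccel lags the command.
    return lataccel + 0.3 * (action - lataccel)


def rollout_cost(lataccel, first_action, targets):
    """Hold first_action over the given targets; sum tracking and jerk cost."""
    cost = 0.0
    for target in targets:
        nxt = toy_sim(lataccel, first_action)
        cost += 50.0 * (target - nxt) ** 2 + ((nxt - lataccel) / 0.1) ** 2
        lataccel = nxt
    return cost


def mpc_action(policy_mean, policy_std, lataccel, targets, rng):
    # Slot 0 is the policy mean; the rest are samples around it.
    candidates = [policy_mean] + [
        rng.gauss(policy_mean, policy_std) for _ in range(N_CANDIDATES - 1)
    ]
    return min(candidates, key=lambda a: rollout_cost(lataccel, a, targets[:HORIZON]))
```

Because the policy mean is always among the candidates, the selected action's lookahead cost can never exceed the mean's. That guarantee covers only the short-horizon score, not the final route cost.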

Cleaning it up

Coming back to it later, I found the real problem was a timing bug in the reward. I had been scoring each steering command against the car state from before the command took effect, so actions were rewarded for outcomes they didn’t cause, an off-by-one in time. The fix is to read the reward from the simulator’s post-step histories over the official cost window, so a command is scored against the lataccel it actually produced:

# align reward timing with the official cost window, post-step
pred   = sim.current_lataccel_history[:, start:end]
target = data['target_lataccel'][:, start:end]
lat_r  = (target - pred)**2 * (100 * LAT_ACCEL_COST_MULTIPLIER)

With the reward finally pointing at the right thing, I deleted everything that was no longer earning its place: the MPC shooting, the multi-machine rollouts, the speed normalization. What’s left is basically pure PPO with a BC pretrain. Fewer lines, and it scored 43.729 without the MPC.

The last gain came from variance reduction. Routes differ a lot in difficulty, and a single rollout per route confuses “this action was good” with “this route was easy.” So I roll each route out 10 times and center the advantages within the route before the update, subtracting each route’s own mean:

adv_2d = adv.reshape(n_routes, SAMPLES_PER_ROUTE, -1)
adv    = (adv_2d - adv_2d.mean(dim=1, keepdim=True)).reshape(-1)

Now the gradient only sees how an action did relative to other attempts at the same route, not how hard the route was. That took it to 42.206 on 5000 routes, 40.69 on 100.
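
A dependency-free sketch of the same centering, with a toy input that makes the effect visible. It assumes the flat advantage array is ordered route-major (all rollouts of route 0 first, each with the same number of timesteps), which is what the reshape above implies; the function name is illustrative.

```python
def center_within_route(adv, n_routes, samples_per_route):
    """Subtract each route's mean advantage, per timestep, across its rollouts.

    adv is a flat list ordered route-major: route 0's rollouts first, each
    rollout contributing the same number of timesteps.
    """
    steps = len(adv) // (n_routes * samples_per_route)
    out = list(adv)
    for r in range(n_routes):
        base = r * samples_per_route * steps
        for t in range(steps):
            idx = [base + s * steps + t for s in range(samples_per_route)]
            mean = sum(adv[i] for i in idx) / samples_per_route
            for i in idx:
                out[i] = adv[i] - mean
    return out


# One easy route (advantages near +11) and one hard route (near -9),
# two rollouts each, one timestep per rollout.
centered = center_within_route([10.0, 12.0, -10.0, -8.0], 2, 2)
```

After centering, both routes give [-1, 1]: the easy and hard routes now send the same signal about which attempt was better.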

The noise floor

Almost everything I ran after this was an attempt to get under 40. I didn’t get there, and it’s worth saying why it’s hard.

The simulator is stochastic by construction. It predicts each lataccel by sampling from a 1024-way softmax at temperature 0.8, and it seeds numpy from the route’s filename:

# tinyphysics.py: the sim samples each lataccel, it doesn't compute it
probs  = softmax(logits / 0.8)                            # temperature 0.8
sample = np.random.choice(VOCAB_SIZE, p=probs[0, -1])
...
seed = int(md5(self.data_path.encode()).hexdigest(), 16) % 10**4
np.random.seed(seed)                                      # noise is fixed per route

So the noise on a route is fixed: the same route always draws the same sequence. And the noise is most of the score. If I set the temperature to 0 so the sim decodes greedily instead of sampling, the cost on my controller drops to about 10-11. The other roughly 30 points of a 42 score are the sampling noise, not tracking error.
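
The "fixed per route" property can be checked with a stripped-down version of the sampling loop. This sketch uses Python's stdlib random in place of np.random, a made-up three-token vocabulary, and a hypothetical file path; only the seed recipe is taken from the snippet above.

```python
import math
import random
from hashlib import md5


def route_seed(data_path):
    # Same recipe as the snippet above: hash the path, keep four digits.
    return int(md5(data_path.encode()).hexdigest(), 16) % 10**4


def sample_tokens(logits_per_step, seed, temperature=0.8):
    """Draw one token per step from softmax(logits / temperature)."""
    rng = random.Random(seed)   # stdlib RNG standing in for np.random
    tokens = []
    for logits in logits_per_step:
        peak = max(logits)      # subtract the max for numerical stability
        weights = [math.exp((x - peak) / temperature) for x in logits]
        tokens.append(rng.choices(range(len(logits)), weights=weights)[0])
    return tokens


logits = [[0.0, 1.0, 2.0]] * 50
seed = route_seed("data/00000.csv")     # hypothetical route file
first = sample_tokens(logits, seed)
second = sample_tokens(logits, seed)
```

Here first and second are identical: for a given sequence of logits the draws are random in form but fully reproducible, which is what makes the per-route search described next possible.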

To get under 40 you have to claw some of that back, and you can, because the seed is fixed. I searched for the optimal action sequence per segment with MPC and an RNG-aware optimizer that knows each step’s draw in advance, and it scored 18.29 on 5000 routes. I didn’t submit it, because solving one route’s known draws isn’t really a controller. The sub-40 solutions I’ve seen are this kind of per-segment search.

Whether a general policy can do it, I don’t know. The noise isn’t true randomness. It’s a sample from a math model rolled out in time, seeded once per route, so it may have temporal structure. The residual at each step is observable after the fact. A sequence model with memory might infer the process from that history and stay ahead of it, something like Physical Intelligence’s RL token. I didn’t build that. What I tried was the blunt version: behavioral-cloning the search-optimal actions into a memoryless network, and it couldn’t fit them. The action that beats the noise depends on the next draw, and a single-step observation doesn’t carry the history to infer it.

42.206 is about where a controller lands when it isn’t exploiting the seed.

Almost none of the gains came from a fancier algorithm. The method stayed plain PPO with a BC pretrain. What moved the score was structure specific to this simulator: curvature for the road, delta actions for the autoregressive dynamics, fixing the reward timing, averaging within a route. A generic setup gets you the PID baseline. Everything past it you pay for with something you had to learn about the problem. Below 40, the only thing left to use is the seed, which is the point it stops being a controller.