Converge — RL Training That Actually Converges

01 / benchmarks

Wall-clock training times.
Not theoretical speedups.

Measured on identical hardware: 8× A100s, 128 parallel workers. Same hyperparameters, same seeds. The only variable is the training infrastructure.

Task

Watch it converge.
Annotated, not abstracted.

HalfCheetah-v4, PPO, 1.2M steps. The curve draws itself as you scroll — every inflection point labeled with what actually happened.

Cumulative Reward

Converge

Baseline (CleanRL)

Training Steps

0 steps

sparse reward plateau

340k steps

policy plateau broken

820k steps

sim-to-real transfer

1.2M steps

policy hardened

03 / architecture

Engine parts that
click into place.

Each component is independently observable and replaceable. Bring your own environment, your own algorithm, your own export target. The scaffolding handles the rest.

⬡

Environment Wrapper

Gym / DM Control / Custom

◈

Distributed Rollout Workers

128 parallel actors · gRPC

▣

Replay Buffer

Prioritized · 10M transitions

◇

Evaluation Harness

Deterministic · logged

◉

Centralized Learner

PPO / SAC / TD-MPC2

▷

Policy Export

ONNX · TorchScript · ROS2

train.py

import converge

env  = converge.wrap(gym.make("HalfCheetah-v4"))
agent = converge.PPO(env, workers=128)

# Train until reward > 8000 or 2M steps
agent.train(target_reward=8000, max_steps=2_000_000)

# Export for deployment
agent.export("policy.onnx", format="onnx")

04 / proof

Engineers who shipped.
Not engineers who evaluated.

CovariantCitadelDeepMindWaymoFigure AISkild AI

"

We were stuck at a policy plateau for three weeks on our warehouse robot. Converge's sparse reward diagnostics found the issue in four hours. The annotated reward curve is the most useful debugging tool I've seen.

3 weeks → 4 hours

time to diagnosis

PN

Priya Nambiar

Senior RL Engineer · Covariant

"

Our quant team runs 40+ portfolio agents in parallel. The distributed rollout workers cut our overnight sim time from 11 hours to under 90 minutes. We now iterate on reward functions same-day.

11h → 87min

overnight sim time

MO

Marcus Okafor

Quantitative Researcher · Citadel Securities

"

Sim-to-real transfer used to be our biggest bottleneck. Converge's domain randomization wrappers and the ONNX export pipeline cut our robot deployment cycle from 6 weeks to 9 days.

6 weeks → 9 days

deployment cycle

EV

Elena Vasquez

Robotics Research Lead · Boston Dynamics AI Institute

Reward curvesbend upwardhere.

Wall-clock training times.Not theoretical speedups.

Watch it converge.Annotated, not abstracted.

Engine parts thatclick into place.

Engineers who shipped.Not engineers who evaluated.

Your next training runshould converge.

Reward curves
bend upward
here.

Wall-clock training times.
Not theoretical speedups.

Watch it converge.
Annotated, not abstracted.

Engine parts that
click into place.

Engineers who shipped.
Not engineers who evaluated.

Your next training run
should converge.