2.4M episodes · training now
Reward curves
bend upward
here.
Train faster. Transfer cleaner. Deploy sooner.█
Live
0
Cumulative Reward
▲+12.4% vs last checkpoint
Live
0
Episodes Completed
▲across 128 parallel workers
Live
0.0000
Policy Loss
▼compressing toward zero
Scroll for evidence
01 / benchmarks
Wall-clock training times.
Not theoretical speedups.
Measured on identical hardware: 8× A100s, 128 parallel workers. Same hyperparameters, same seeds. The only variable is the training infrastructure.
Task
Category
Stable-Baselines3
CleanRL
Converge ↓
Speedup
HalfCheetah-v4
Locomotion
4h 12m
3h 48m
41m
6.1×
Ant-v4
Locomotion
6h 55m
5h 30m
58m
7.2×
Humanoid-v4
Locomotion
14h 20m
11h 10m
1h 47m
7.9×
FetchReach-v3
Robotics
2h 05m
1h 52m
18m
6.8×
CartPole (sparse)
Sparse Reward
38m
29m
4m
9.5×
MiniGrid-Maze-v0
Navigation
8h 44m
7h 01m
52m
10.1×
All benchmarks reproducible. See methodology →·Last run: 2026-02-27 18:44 UTC
02 / convergence
Watch it converge.
Annotated, not abstracted.
HalfCheetah-v4, PPO, 1.2M steps. The curve draws itself as you scroll — every inflection point labeled with what actually happened.
Cumulative Reward
Converge
Baseline (CleanRL)
Training Steps
0 steps
sparse reward plateau
340k steps
policy plateau broken
820k steps
sim-to-real transfer
1.2M steps
policy hardened
03 / architecture
Engine parts that
click into place.
Each component is independently observable and replaceable. Bring your own environment, your own algorithm, your own export target. The scaffolding handles the rest.
⬡
Environment Wrapper
Gym / DM Control / Custom
◈
Distributed Rollout Workers
128 parallel actors · gRPC
▣
Replay Buffer
Prioritized · 10M transitions
◇
Evaluation Harness
Deterministic · logged
◉
Centralized Learner
PPO / SAC / TD-MPC2
▷
Policy Export
ONNX · TorchScript · ROS2
train.py
import converge
env = converge.wrap(gym.make("HalfCheetah-v4"))
agent = converge.PPO(env, workers=128)
# Train until reward > 8000 or 2M steps
agent.train(target_reward=8000, max_steps=2_000_000)
# Export for deployment
agent.export("policy.onnx", format="onnx")
Engineers who shipped.
Not engineers who evaluated.
We were stuck at a policy plateau for three weeks on our warehouse robot. Converge's sparse reward diagnostics found the issue in four hours. The annotated reward curve is the most useful debugging tool I've seen.
Our quant team runs 40+ portfolio agents in parallel. The distributed rollout workers cut our overnight sim time from 11 hours to under 90 minutes. We now iterate on reward functions same-day.
Sim-to-real transfer used to be our biggest bottleneck. Converge's domain randomization wrappers and the ONNX export pipeline cut our robot deployment cycle from 6 weeks to 9 days.
Your next training run
should converge.
Free tier includes 500k steps/day and full access to the benchmark suite. No credit card. GitHub OAuth in 30 seconds.