2.4M episodes · training now

Reward curves bend upward here.

Train faster. Transfer cleaner. Deploy sooner.

Cumulative Reward (live): +12.4% vs last checkpoint
Episodes Completed (live): across 128 parallel workers
Policy Loss (live): compressing toward zero
01 / benchmarks

Wall-clock training times.
Not theoretical speedups.

Measured on identical hardware: 8× A100s, 128 parallel workers. Same hyperparameters, same seeds. The only variable is the training infrastructure.

| Task | Category | Stable-Baselines3 | CleanRL | Converge | Speedup |
|---|---|---|---|---|---|
| HalfCheetah-v4 | Locomotion | 4h 12m | 3h 48m | 41m | 6.1× |
| Ant-v4 | Locomotion | 6h 55m | 5h 30m | 58m | 7.2× |
| Humanoid-v4 | Locomotion | 14h 20m | 11h 10m | 1h 47m | 7.9× |
| FetchReach-v3 | Robotics | 2h 05m | 1h 52m | 18m | 6.8× |
| CartPole (sparse) | Sparse Reward | 38m | 29m | 4m | 9.5× |
| MiniGrid-Maze-v0 | Navigation | 8h 44m | 7h 01m | 52m | 10.1× |

All benchmarks reproducible. See methodology → · Last run: 2026-02-27 18:44 UTC
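
The Speedup column can be sanity-checked from the published times. The sketch below assumes the speedup is the Stable-Baselines3 wall-clock time divided by the Converge time; the page does not name the baseline, but the listed ratios match that reading. The helper names are illustrative. Because the times are rounded to the minute, recomputed ratios can differ from the listed ones by about 0.1.

```python
import re

def to_minutes(text: str) -> int:
    """Parse durations such as '4h 12m', '41m', or '2h 05m' into minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

def speedup(baseline: str, converge: str) -> float:
    """Ratio of baseline wall-clock time to Converge wall-clock time."""
    return to_minutes(baseline) / to_minutes(converge)

# (Stable-Baselines3 time, Converge time) copied from the benchmark table.
rows = {
    "HalfCheetah-v4": ("4h 12m", "41m"),
    "Ant-v4": ("6h 55m", "58m"),
    "Humanoid-v4": ("14h 20m", "1h 47m"),
    "FetchReach-v3": ("2h 05m", "18m"),
    "CartPole (sparse)": ("38m", "4m"),
    "MiniGrid-Maze-v0": ("8h 44m", "52m"),
}

for task, (sb3_time, converge_time) in rows.items():
    print(f"{task}: {speedup(sb3_time, converge_time):.1f}x")
```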
02 / convergence

Watch it converge.
Annotated, not abstracted.

HalfCheetah-v4, PPO, 1.2M steps. The curve draws itself as you scroll — every inflection point labeled with what actually happened.

[Chart: Cumulative Reward vs. Training Steps. Series: Converge, Baseline (CleanRL).]
0 steps: sparse reward plateau
340k steps: policy plateau broken
820k steps: sim-to-real transfer
1.2M steps: policy hardened
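
Annotations such as "sparse reward plateau" depend on locating flat stretches in the reward series. One simple way to flag them, offered as an illustrative sketch and not as Converge's actual diagnostic, is to compare the mean reward of consecutive windows and mark any window that improves on the previous one by less than a threshold. The function name, window size, and threshold are all assumptions.

```python
from statistics import fmean

def find_plateaus(rewards, window=5, min_gain=1.0):
    """Return (start, end) index ranges where reward is flat or declining.

    A window is flagged when its mean exceeds the previous window's mean
    by less than min_gain. Adjacent flagged windows are merged.
    """
    plateaus = []
    for start in range(window, len(rewards) - window + 1, window):
        prev_mean = fmean(rewards[start - window:start])
        curr_mean = fmean(rewards[start:start + window])
        if curr_mean - prev_mean < min_gain:
            if plateaus and plateaus[-1][1] == start:
                # Extend the previous plateau instead of opening a new one.
                plateaus[-1] = (plateaus[-1][0], start + window)
            else:
                plateaus.append((start, start + window))
    return plateaus

# Synthetic curve: flat, then a steady climb, then flat again.
curve = [0] * 10 + list(range(0, 100, 10)) + [100] * 10
print(find_plateaus(curve))
```

The first window is never flagged because it has no predecessor to compare against; a production diagnostic would likely also smooth the series and account for reward noise.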
03 / architecture

Engine parts that click into place.

Each component is independently observable and replaceable. Bring your own environment, your own algorithm, your own export target. The scaffolding handles the rest.

Environment Wrapper: Gym / DM Control / Custom
Distributed Rollout Workers: 128 parallel actors · gRPC
Replay Buffer: Prioritized · 10M transitions
Evaluation Harness: Deterministic · logged
Centralized Learner: PPO / SAC / TD-MPC2
Policy Export: ONNX · TorchScript · ROS2
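
To make the "Prioritized" replay buffer concrete, here is a toy sketch of proportional prioritized sampling, where a transition is drawn with probability proportional to its priority raised to a power alpha. It is not Converge's implementation: the class and method names are hypothetical, sampling is linear-time for clarity (a 10M-transition buffer would normally use a sum-tree), and importance-sampling weights are omitted.

```python
import random

class PrioritizedReplayBuffer:
    """Toy proportional prioritized replay with a fixed-size ring buffer."""

    def __init__(self, capacity, alpha=0.6, seed=None):
        self.capacity = capacity
        self.alpha = alpha          # 0 = uniform sampling, 1 = fully proportional
        self.transitions = []
        self.priorities = []
        self.next_slot = 0          # index that the next write will use
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions start at the current max priority so they are
        # likely to be sampled before their TD error is known.
        priority = max(self.priorities, default=1.0)
        if len(self.transitions) < self.capacity:
            self.transitions.append(transition)
            self.priorities.append(priority)
        else:
            # Buffer full: overwrite the oldest entry.
            self.transitions[self.next_slot] = transition
            self.priorities[self.next_slot] = priority
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        weights = [p ** self.alpha for p in self.priorities]
        indices = self.rng.choices(
            range(len(self.transitions)), weights=weights, k=batch_size
        )
        return indices, [self.transitions[i] for i in indices]

    def update_priorities(self, indices, td_errors, eps=1e-6):
        # Priority is the absolute TD error; eps keeps it strictly positive.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + eps
```

In use, the learner would call `sample`, compute TD errors for the batch, and pass them back through `update_priorities` so that surprising transitions are revisited more often.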
train.py
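import gymnasium as gym  # assumption: Gymnasium provides gym.make here
import converge

# Wrap the environment and create a PPO agent with 128 rollout workers
env = converge.wrap(gym.make("HalfCheetah-v4"))
agent = converge.PPO(env, workers=128)

# Train until reward > 8000 or 2M steps
agent.train(target_reward=8000, max_steps=2_000_000)

# Export for deployment
agent.export("policy.onnx", format="onnx")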
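
The training call above stops on whichever comes first: the reward target or the step budget. A minimal, library-independent sketch of that stopping logic follows. `train_until` and `run_iteration` are hypothetical names used only for illustration, and nothing here is taken from Converge's internals.

```python
def train_until(run_iteration, target_reward, max_steps):
    """Run training iterations until the reward target or step budget is hit.

    run_iteration is a hypothetical callable that performs one training
    iteration and returns (steps_consumed, mean_reward).
    """
    total_steps = 0
    mean_reward = float("-inf")
    while total_steps < max_steps and mean_reward <= target_reward:
        steps, mean_reward = run_iteration()
        total_steps += steps
    # Report which condition ended the run.
    reason = "target_reward" if mean_reward > target_reward else "max_steps"
    return total_steps, mean_reward, reason

# Stand-in iteration: 100k steps each, reward rising by 1000 per iteration.
fake_rewards = iter(range(1000, 100_000, 1000))
print(train_until(lambda: (100_000, next(fake_rewards)), 8000, 2_000_000))
```

Returning the stop reason alongside the totals makes it easy to tell a converged run from one that simply exhausted its budget.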
04 / proof

Engineers who shipped.
Not engineers who evaluated.

Covariant · Citadel · DeepMind · Waymo · Figure AI · Skild AI

"We were stuck at a policy plateau for three weeks on our warehouse robot. Converge's sparse reward diagnostics found the issue in four hours. The annotated reward curve is the most useful debugging tool I've seen."

3 weeks → 4 hours
time to diagnosis
Priya Nambiar
Senior RL Engineer · Covariant

"Our quant team runs 40+ portfolio agents in parallel. The distributed rollout workers cut our overnight sim time from 11 hours to under 90 minutes. We now iterate on reward functions same-day."

11h → 87min
overnight sim time
Marcus Okafor
Quantitative Researcher · Citadel Securities

"Sim-to-real transfer used to be our biggest bottleneck. Converge's domain randomization wrappers and the ONNX export pipeline cut our robot deployment cycle from 6 weeks to 9 days."

6 weeks → 9 days
deployment cycle
Elena Vasquez
Robotics Research Lead · Boston Dynamics AI Institute
$ pip install converge-rl

Your next training run should converge.

Free tier includes 500k steps/day and full access to the benchmark suite. No credit card. GitHub OAuth in 30 seconds.

Free tier: 500k steps/day · GitHub OAuth: 30 sec setup · No credit card