Cyclotron

Cyclotron is the sim-to-real locomotion training stack of Menlo OS. It trains locomotion policies in simulation and transfers them to physical humanoid robots.

The core thesis: sim-to-real is a data interface problem, not a fidelity problem. You don’t need a perfect digital twin. You need the policy to see the same type, timing, and quality of data in simulation as it will on hardware.

Why We Built Cyclotron

The sim-to-real gap is not solved by building a perfect simulator — it’s solved by ensuring the policy receives the same observations in sim as on the real robot. If the actuator model is wrong, the observation timing is off, or the sensor noise profile doesn’t match, the policy learns to exploit simulation artifacts that don’t exist on hardware.

The policy itself is lightweight. The hard part is everything around it: actuator models, observation timing, sensor noise profiles, and the domain randomization strategy that bridges whatever residual gap remains.

Traditional domain randomization blankets everything with noise — randomize mass, gravity, friction, link lengths — and hopes the policy becomes robust to all of it. Cyclotron takes a different approach: model the specific things that cause mismatch, and only randomize what is genuinely uncertain.

The Training Stack

Simulation Environment

Cyclotron runs on MuJoCo with custom actuator models that match real firmware IO timing — not the simulator’s built-in actuators. Every timing parameter is derived from measured hardware behavior: actuator delays from firmware data-fetch timings, staggered observation delays matching CAN bus timing, and motor models capturing delay and saturation characteristics.

The goal is that simulation is not an approximation of reality — it is a faithful reproduction of the data the policy will see on the real robot.
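As a concrete sketch of matching observation timing, a per-sensor ring buffer can replay readings a fixed number of control steps late. The class and the delay values below are hypothetical stand-ins for delays derived from measured firmware and CAN bus timing, not Cyclotron's actual implementation:

```python
from collections import deque


class DelayedSensor:
    """Replay sensor readings N control steps late, mimicking measured
    firmware/CAN latency (a sketch; the delays below are hypothetical)."""

    def __init__(self, delay_steps: int):
        # maxlen = delay + 1: the oldest entry is exactly delay_steps old
        self._buf = deque(maxlen=delay_steps + 1)

    def __call__(self, reading):
        self._buf.append(reading)
        # Before the buffer fills, return the oldest reading available.
        return self._buf[0]


# Staggered delays per sensor group, e.g. from CAN bus timing: the IMU
# arrives one control step late, joint encoders three steps late.
imu_delay = DelayedSensor(delay_steps=1)
encoder_delay = DelayedSensor(delay_steps=3)
```

Feeding the policy through buffers like these means it trains on data that is late in exactly the way hardware data is late, rather than on the simulator's instantaneous state.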

Policy Architecture

The policy only receives observations available on the real robot — no ground-truth velocity, no privileged state. What the policy sees in sim is exactly what it will see on hardware.

Training uses asymmetric actor-critic: the critic gets privileged information (contact forces, terrain geometry) during training, but the deployed actor does not. This lets training converge faster without leaking information the real robot doesn’t have.
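The actor/critic split above can be sketched as two observation builders over the simulator state. The field names are illustrative assumptions, not Cyclotron's actual schema:

```python
def actor_observation(state: dict) -> list:
    """Deployed-policy input: only signals available on real hardware
    (joint encoders, joint velocities, IMU)."""
    return [*state["joint_pos"], *state["joint_vel"], *state["imu"]]


def critic_observation(state: dict) -> list:
    """Training-only critic input: the actor observation plus privileged,
    sim-only signals such as contact forces and terrain geometry."""
    return (
        actor_observation(state)
        + [*state["contact_forces"], *state["terrain_heights"]]
    )
```

Only `actor_observation` is computable on the robot, so only the actor network ships; the critic and its privileged inputs are discarded after training.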

Action outputs pass through a low-pass filter trained in the loop, smoothing commands before they reach actuators — the same filter that runs on hardware.
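A minimal version of such an action filter is a first-order low-pass with persistent state; `alpha` here is illustrative, not a documented value:

```python
class ActionLowPass:
    """First-order low-pass filter: y[t] = alpha * y[t-1] + (1 - alpha) * u[t].
    A sketch of smoothing policy actions in the training loop; because the
    same recurrence runs on hardware, the policy trains against the exact
    actuation dynamics it will face at deployment."""

    def __init__(self, alpha: float, dim: int):
        self.alpha = alpha
        self.y = [0.0] * dim  # filter state, persisted across control steps

    def __call__(self, u: list) -> list:
        self.y = [
            self.alpha * yi + (1.0 - self.alpha) * ui
            for yi, ui in zip(self.y, u)
        ]
        return self.y
```

Training "in the loop" means the filter sits between the policy output and the simulated actuators on every step, so gradients and rewards already account for the smoothing.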

Domain Randomization

The randomization strategy follows a simple principle: randomize what is known to vary; do not randomize what has been measured.

Parameters with genuine uncertainty — encoder offsets, PD gains, compliance, friction, observation and action delays, external disturbances — are randomized within measured bounds. Parameters that have been characterized on hardware — body mass, link lengths, gravity — are fixed.

This is the opposite of the kitchen-sink approach. Targeted randomization produces policies that are robust to real variation without being over-conservative.
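A targeted sampler along these lines might draw per-episode physics parameters uniformly within measured bounds while pinning characterized quantities to constants. All names and numbers below are hypothetical placeholders, not Cyclotron's actual bounds:

```python
import random

# Hypothetical measured bounds for genuinely uncertain parameters.
RANDOMIZED_BOUNDS = {
    "friction": (0.6, 1.1),
    "kp_scale": (0.9, 1.1),            # PD gain multipliers
    "kd_scale": (0.9, 1.1),
    "obs_delay_steps": (0, 3),         # integer-valued delays
    "action_delay_steps": (0, 2),
    "encoder_offset_rad": (-0.02, 0.02),
}

# Characterized on hardware, so fixed rather than randomized.
FIXED = {"body_mass_kg": 35.0, "gravity": -9.81}


def sample_episode_params(rng: random.Random) -> dict:
    """Draw one episode's physics parameters: uniform within measured
    bounds for uncertain quantities, constants for characterized ones."""
    params = dict(FIXED)
    for name, (lo, hi) in RANDOMIZED_BOUNDS.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = rng.randint(lo, hi)
        else:
            params[name] = rng.uniform(lo, hi)
    return params
```

Keeping measured quantities fixed is what prevents the over-conservatism of kitchen-sink randomization: the policy never has to hedge against a mass or gravity value it will never encounter.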

Sim-to-Real Validation

Before any policy touches hardware, it passes through validation:

  • Processor-in-the-loop testing — real firmware runs against the simulated robot over virtual CAN, catching timing and communication mismatches before they reach hardware
  • Motor emulation with injected bus jitter to stress-test firmware timing assumptions
  • sim2sim video artifacts — visual verification of behavior quality before physical deployment
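The jitter-injection idea from the second item can be sketched as perturbing nominal frame send times by a bounded random offset; the period and jitter magnitudes below are illustrative, not measured values from the actual bus:

```python
import random


def jittered_send_times(
    n_frames: int, period_s: float, jitter_s: float, rng: random.Random
) -> list[float]:
    """Schedule emulated CAN frame send times with injected jitter: each
    frame departs at its nominal period plus a bounded uniform offset,
    stressing any firmware assumption of perfectly periodic arrival."""
    return [
        i * period_s + rng.uniform(-jitter_s, jitter_s)
        for i in range(n_frames)
    ]
```

Replaying these perturbed timelines against the firmware surfaces timing assumptions (stale-data handling, watchdog margins) in emulation rather than on a walking robot.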

Locomotion API

Cyclotron exposes a REST API backed by GPU infrastructure so customers can programmatically train, evaluate, and retrieve locomotion policies for Asimov — without managing their own GPU clusters or simulation environments.

What Customers Get

  • Submit training configs (YAML + optional MuJoCo XML/meshes) via REST
  • Cyclotron handles GPU scheduling, simulation, and training orchestration
  • Retrieve trained artifacts — policies, metrics, and sim2sim validation videos
  • Evaluate policies — either from a completed training run or by uploading a policy directly

Core Endpoints

  • POST /locomotion/jobs — submit a training job
  • GET /locomotion/jobs/{job_id} — status, logs, artifacts
  • POST /locomotion/evals — submit an evaluation (from job or direct upload)
  • GET /locomotion/evals/{eval_id} — eval status, logs, artifacts
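A client for the job endpoints above might look like the following sketch. The base URL, payload shape, and response fields (`job_id`, `status`) are assumptions for illustration; only the endpoint paths come from the documentation, and real usage would also need authentication:

```python
import json
import time
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, not documented here
POLL_SECONDS = 30


def job_url(job_id: str = "") -> str:
    """Build the /locomotion/jobs endpoint URL, optionally for one job."""
    url = f"{BASE_URL}/locomotion/jobs"
    return f"{url}/{job_id}" if job_id else url


def submit_job(config_yaml: str) -> str:
    """POST a training config and return the new job's id.
    The request/response schema here is assumed, not documented."""
    req = urllib.request.Request(
        job_url(),
        data=json.dumps({"config": config_yaml}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]


def wait_for_job(job_id: str) -> dict:
    """Poll GET /locomotion/jobs/{job_id} until it leaves a running state,
    then return the final job record (status, logs, artifact links)."""
    while True:
        with urllib.request.urlopen(job_url(job_id)) as resp:
            job = json.load(resp)
        if job["status"] not in ("queued", "running"):
            return job
        time.sleep(POLL_SECONDS)
```

The eval endpoints would follow the same submit-then-poll pattern against `/locomotion/evals`.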

Infrastructure

The API is a FastAPI service orchestrating Kubernetes Jobs on GPU-accelerated clusters. Each job runs in isolation with full GPU access. Experiment tracking is handled through W&B integration.
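One plausible shape for the per-job isolation described above is a Kubernetes `batch/v1` Job manifest that requests a full GPU and never retries silently. The names, image, and settings below are hypothetical, not Cyclotron's actual configuration:

```python
def gpu_job_manifest(job_id: str, image: str) -> dict:
    """Build a Kubernetes Job manifest for one isolated training run with
    exclusive access to a single GPU (illustrative names and image)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"locomotion-{job_id}"},
        "spec": {
            "backoffLimit": 0,  # surface failures instead of blind retries
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Whole-GPU request: no sharing between jobs.
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }],
                }
            },
        },
    }
```

The API service would submit a manifest like this via the Kubernetes API on each `POST /locomotion/jobs`, then map the Job's lifecycle back onto the job status it reports.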

Where Cyclotron Is Headed

Cyclotron currently covers legged locomotion — walking, lateral movement, push recovery. This is the foundation, not the ceiling.

The training stack is designed to generalize. The same simulation-first, data-interface methodology extends to full-body control, terrain adaptation, and different locomotion modes as the hardware and policy architecture evolve. The API generalizes with it: same endpoints, broader policy types.

The roadmap includes concurrent multi-run training (parallel experiments), terrain and velocity curricula, and full-body controllers that carry locomotion policies upward into whole-body coordination.

Integration

Cyclotron connects:

  • Agent Platform: Receives trained policies for deployment to physical robots
  • Uranus: Provides scenario validation; receives policies for testing
  • Data Engine: Feeds real-world telemetry back to refine domain randomization and identify sim-real mismatches