
Build Your Own
Reasoning Model

Your coding sessions are chain-of-thought reasoning for code. BashGym captures them as structured traces and trains a local RLM using GRPO, RLVR, and distillation — a model that actually thinks through problems the way you do.

BashGym dashboard showing trace capture, model training progress, and deployment routing metrics
Self-host:
$ git clone https://github.com/GhostPeony/bashgym && cd bashgym && pip install -r requirements.txt

Your coding sessions are reasoning traces. We train on them.

Every AI coding session is a chain-of-thought reasoning trace — step-by-step problem solving with verifiable outcomes. BashGym captures these traces and uses them to train a reasoning language model with the same techniques behind frontier RLMs: GRPO for reinforcement learning, RLVR for verifiable reward signals from test results, and distillation to transfer reasoning from a large teacher into a small local model. The result is a personal RLM trained on how you actually think through code — your conventions, your repos, your patterns.

STEP 01

Capture

Hooks installed into Claude Code record every tool call, file edit, and bash command as a structured execution trace — automatically, with no change to how you work.
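What a structured execution trace might look like as data: a minimal sketch, assuming a hypothetical `Trace`/`TraceEvent` shape (these names and fields are illustrative, not BashGym's actual schema):

```python
from dataclasses import dataclass, field, asdict
import json
import time

@dataclass
class TraceEvent:
    """One captured step: a tool call, file edit, or bash command."""
    tool: str    # e.g. "Bash", "Edit", "Read"
    args: dict   # arguments the tool was called with
    output: str  # (possibly truncated) tool output
    timestamp: float = field(default_factory=time.time)

@dataclass
class Trace:
    """A full session: an ordered list of events, replayable for scoring."""
    session_id: str
    events: list = field(default_factory=list)

    def record(self, tool, args, output):
        self.events.append(TraceEvent(tool, args, output))

    def to_jsonl(self):
        # One JSON object per event, ready for a training pipeline.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

trace = Trace(session_id="demo")
trace.record("Bash", {"command": "pytest -q"}, "3 passed")
trace.record("Edit", {"file": "app.py"}, "ok")
```

The key property is that each event is self-describing, so a later stage can score or replay the session without the original Claude Code context.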

STEP 02

Train

Traces are scored, sanitized, and fed into an RLM training pipeline. Start with SFT, graduate to GRPO or RLVR for reinforcement learning with verifiable rewards, or distill Claude's reasoning directly. The result is a small reasoning model that knows your codebase.

STEP 03

Deploy

A confidence-based router progressively shifts traffic from Claude to your local model. Simple tasks go local (~50ms). Complex ones fall back to Claude. The flywheel keeps spinning.

Platform Walkthrough

See BashGym in action — from trace capture to model deployment.

11 Core Capabilities

Everything you need to capture reasoning traces, train RLMs, and deploy your own coding model.

🔍

Trace Capture

Intercept Claude Code tool calls via hooks. Capture prompts, tool outputs, and reasoning traces automatically.

⚖️

Reward Scoring

Six-metric quality framework acts as an outcome reward model — scoring traces on success rate, verification, complexity, tool diversity, efficiency, and reasoning depth. Only gold-tier traces become training data.
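A weighted-sum sketch of how six per-trace metrics could collapse into one outcome reward. The weights and the gold-tier threshold below are invented for illustration; BashGym's actual values may differ:

```python
# Hypothetical weights over the six metrics, each scored in [0, 1].
METRICS = {
    "success_rate": 0.30,
    "verification": 0.20,
    "complexity": 0.10,
    "tool_diversity": 0.10,
    "efficiency": 0.15,
    "reasoning_depth": 0.15,
}

def reward(scores: dict) -> float:
    """Weighted sum of the six metrics; missing metrics score zero."""
    return sum(METRICS[name] * scores.get(name, 0.0) for name in METRICS)

def is_gold(scores: dict, threshold: float = 0.8) -> bool:
    """Only traces clearing the threshold become training data."""
    return reward(scores) >= threshold
```

Because the weights sum to 1, the reward stays in [0, 1] and can feed directly into an RL objective as an outcome signal.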

🔒

Privacy by Design

PII detection, secret scrubbing, and path anonymization. Your code stays private throughout the pipeline.
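A regex-based sketch of the three scrubbing passes. These patterns are deliberately narrow examples; a production scrubber needs far broader coverage than email addresses, a few token prefixes, and home-directory paths:

```python
import re

# Illustrative patterns only: emails, common API-key prefixes, home dirs.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b(?:sk-|ghp_|AKIA)[A-Za-z0-9_-]{8,}"), "<SECRET>"),
    (re.compile(r"/(?:home|Users)/[^/\s]+"), "/home/<USER>"),
]

def scrub(text: str) -> str:
    """Replace PII, secrets, and user paths before a trace leaves the box."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running scrubbing before anything else in the pipeline is the design choice that makes "your code stays private" hold: downstream stages only ever see sanitized traces.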

🎯

RLM Training Strategies

SFT for supervised learning, GRPO and RLVR for reinforcement learning with verifiable rewards, DPO for preference optimization, and distillation to transfer reasoning from frontier models. The same techniques used to train reasoning models — applied to your data.
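The group-relative trick at the heart of GRPO fits in a few lines: sample several completions per prompt, then normalize each completion's reward against its own group's mean and standard deviation instead of training a separate value critic. A sketch of just the advantage computation (not BashGym's trainer):

```python
import statistics

def grpo_advantages(rewards: list) -> list:
    """Group-relative advantages for one prompt's sampled completions.

    Each completion is scored against the mean of its own group, so
    no learned value function is needed -- the group is the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in rewards]
```

With RLVR the `rewards` here would come from verifiable checks (tests passing, code executing) rather than a learned reward model.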

📦

Model Registry

Version, tag, and manage trained model artifacts. Track lineage from trace to deployed checkpoint.

🔄

Progressive Routing

Confidence-based routing allocates compute where it matters. Your reasoning model handles tasks it's confident on; Claude handles the rest. Traffic shifts automatically as your model improves.
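One way the "traffic shifts automatically" part could work: gate on a confidence threshold, then move the threshold based on how the local model actually performs. The policy below is an illustrative sketch, not BashGym's actual router:

```python
class ProgressiveRouter:
    """Confidence-gated routing whose threshold relaxes as the local
    model proves itself, and tightens again when it fails."""

    def __init__(self, threshold=0.9, floor=0.5, step=0.01):
        self.threshold = threshold  # confidence needed to go local
        self.floor = floor          # never route everything blindly
        self.step = step

    def route(self, confidence: float) -> str:
        return "local" if confidence >= self.threshold else "claude"

    def feedback(self, target: str, success: bool):
        # Local successes shift traffic toward the local model;
        # local failures shift it back toward Claude, faster.
        if target == "local":
            if success:
                self.threshold = max(self.floor, self.threshold - self.step)
            else:
                self.threshold = min(0.99, self.threshold + 5 * self.step)
```

The asymmetric step sizes are a deliberate choice in this sketch: regressions cost more than wins, so the router backs off quickly and earns trust slowly.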

📊

Real-Time Dashboard

Monitor trace collection, training progress, model performance, and routing decisions in a live dashboard.

☁️

Multi-Cloud

Train on Lambda Labs, RunPod, Vast.ai, or your own GPUs. Cloud-agnostic infrastructure provisioning.

📈

Benchmarks

SWE-bench, HumanEval, and custom project-specific benchmarks. Measure real improvement on real tasks.

🛡️

Safety Guardrails

Harmful content filtering, bias detection, and output validation. Safe models from safe data.

🕸️

Orchestrator

Decompose a spec into a Task DAG, run parallel workers in isolated git worktrees, and feed results back into the training pipeline.
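The DAG scheduling piece can be sketched with the standard library's `graphlib`: tasks whose dependencies are satisfied become "ready" together, and each ready wave can run as parallel workers (e.g. one git worktree each). The task graph below is made up for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical spec decomposition: each task maps to the tasks it needs.
spec_dag = {
    "parse_spec": set(),
    "write_tests": {"parse_spec"},
    "implement": {"parse_spec"},
    "integrate": {"write_tests", "implement"},
}

sorter = TopologicalSorter(spec_dag)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = list(sorter.get_ready())  # deps satisfied: safe to run
    waves.append(sorted(ready))       # one isolated worker per task
    sorter.done(*ready)
```

Here `write_tests` and `implement` land in the same wave, which is exactly where isolated worktrees pay off: parallel workers can't clobber each other's checkouts.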

The Ouroboros Flywheel

A self-reinforcing reasoning loop: use Claude, capture chain-of-thought traces, train your RLM with verifiable rewards, deploy it, repeat.

ACT

Use Claude Code normally

VERIFY

Judge trace quality

SYNTHESIZE

Build training data

TRAIN

Fine-tune your model

DEPLOY

Route to your model

REPEAT

Continuously improve
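The six stages above compose into a single loop where each stage's output feeds the next. A minimal sketch of that composition; the stage functions here are stand-ins, not BashGym APIs:

```python
def run_flywheel(sessions, stages):
    """Pipe each stage's output into the next: act -> verify ->
    synthesize -> train -> deploy, then repeat with fresh sessions."""
    data = sessions
    for stage in stages:
        data = stage(data)
    return data

# Stand-in stages: keep traces scoring above 0.8, then extract steps.
verify = lambda traces: [t for t in traces if t["score"] > 0.8]
synthesize = lambda gold: [t["steps"] for t in gold]

result = run_flywheel(
    [{"score": 0.9, "steps": 4}, {"score": 0.3, "steps": 2}],
    [verify, synthesize],
)
```

The "repeat" stage is just calling `run_flywheel` again with the sessions captured while the newly deployed model was in the loop.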

Live Training Monitor

Watch your model improve in real time. Track trace collection, training epochs, loss curves, and deployment status from a single dashboard.

  • Trace collection stats and quality scores
  • Training progress with loss and metric curves
  • Model registry with version comparison
  • Routing confidence and fallback rates
  • Benchmark results across model versions
BashGym Training
$ bashgym train --strategy sft
Loading 2,847 verified traces...
Model: codellama-7b-instruct
Epoch 1/3 ████████░░ 80% loss=0.42
Epoch 2/3 ██████████ 100% loss=0.31
Epoch 3/3 ██████████ 100% loss=0.24
Training complete. Checkpoint saved.
HumanEval: 48.2% (+12.1% vs base)

Three Steps to Your Own Reasoning Model

1

Install Hooks

Install BashGym hooks into Claude Code. Traces are captured automatically as you work.

2

Use Claude Code Normally

Keep coding as usual. BashGym silently captures, scores, and curates high-quality training data.

3

Train Your Reasoning Model

Launch training with one command. BashGym handles data prep, RL fine-tuning, evaluation, and deployment.

8-Layer Architecture

A modular system from trace capture to API serving.

Arena
  • Trace Capture: hook into Claude Code tool calls
  • Session Recording: full conversation context
Judge
  • Quality Scoring: multi-judge validation
  • PII Scrubbing: privacy-first filtering
Factory
  • Data Synthesis: trace to training format
  • Augmentation: expand dataset diversity
Gym
  • SFT / DPO / GRPO / RLVR: RLM training pipelines
  • Cloud Provisioning: multi-cloud GPU training
Models
  • Registry: version and tag checkpoints
  • Lineage: trace-to-model provenance
Observability
  • Dashboard: live training monitor
  • Benchmarks: SWE-bench, HumanEval
Integrations
  • Claude Code: hook-based capture
  • BashBros: security middleware
API
  • Serving: OpenAI-compatible endpoint
  • Routing: confidence-based fallback

Works With Your Stack

Claude Code
Ollama
HuggingFace
NVIDIA NeMo
BashBros
Docker

Start Training Your Own Reasoning Model

Upload your traces, train with GRPO, push to HuggingFace or Ollama. Your first reasoning model is one command away.

View on GitHub

Frequently Asked Questions

What is BashGym?

BashGym is a platform for building personal reasoning language models (RLMs) from your AI coding sessions. It captures execution traces, scores them as training data, and fine-tunes small models using the same RL techniques behind frontier reasoning models — GRPO, RLVR, and distillation. The Ouroboros Flywheel continuously improves your model as you code.

What makes this a reasoning model?

Every AI coding session is chain-of-thought reasoning — the model thinks through a problem step by step, tries approaches, verifies results, and self-corrects. BashGym captures this reasoning as structured traces and uses reinforcement learning with verifiable rewards (test results, code execution, quality scoring) to train a model that reasons through code the same way. That's exactly how reasoning models are built.
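Verifiable rewards are what separate RLVR from preference-based RL: the reward comes from actually executing code, not from a model's opinion of it. A minimal sketch, where the only sandboxing is subprocess isolation and `verifiable_reward` is an illustrative name:

```python
import subprocess
import sys

def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Execute the model's code against real tests; 1.0 only if they pass.

    Note: `-c` with an untrusted string is NOT a real sandbox -- a
    production pipeline needs proper isolation around this step.
    """
    program = candidate_code + "\n" + test_code
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True,
        timeout=30,
    )
    return 1.0 if result.returncode == 0 else 0.0
```

Because the signal is binary and grounded in execution, it can't be gamed the way a learned reward model can: the tests either pass or they don't.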

How does the Ouroboros Flywheel work?

The Ouroboros Flywheel is a self-reinforcing loop with six stages: Act (use Claude Code normally), Verify (judge trace quality with multi-judge scoring), Synthesize (build training data from verified traces), Train (fine-tune your model via SFT, DPO, GRPO, RLVR, or distillation), Deploy (route tasks to your model with confidence-based routing), and Repeat (continuously improve as more traces are captured).

What models can I train with BashGym?

BashGym supports fine-tuning smaller open-source models like Qwen 1.5B-7B and Llama 3.2. It integrates with HuggingFace, Ollama, and NVIDIA NeMo, and supports training on Lambda Labs, RunPod, Vast.ai, or your own GPUs.

Do I need ML expertise to use BashGym?

No. BashGym is designed for developers, not ML engineers. You install hooks into Claude Code, keep coding as usual, and launch training with one command. BashGym handles data preparation, quality scoring, fine-tuning, evaluation, and deployment automatically.

How does BashGym capture training data?

BashGym installs hooks into Claude Code that silently record every tool call, file edit, and bash command as a structured execution trace. These traces are automatically scored for quality, scrubbed for PII and secrets, and curated into training datasets — with no change to how you work.

Is BashGym free?

Yes. BashGym is open source under the MIT License. You can self-host it by cloning the GitHub repository. Cloud GPU costs for training are separate and depend on your chosen provider.