The problem it solves
A new model looks good in the playground. Benchmarks look promising. But you have no idea how it performs on your production traffic — the actual prompts, system prompts, conversation histories, and edge cases your users send. Replay runs your real traffic against the candidate model. You get actual numbers on your workload, not synthetic benchmarks.

Creating a replay run
In the dashboard, go to Replay and create a new run:
- Select a source — choose an API key, date range, and optionally filter by feature or metadata
- Choose the target model — the model you want to test
- Configure the judge — optional LLM judge for automated quality scoring
- Run — the gateway re-executes each selected request against the target model
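If you prefer to script the same setup, the dashboard steps above map naturally onto a JSON payload. The sketch below is purely illustrative — the field names (`source`, `target_model`, `judge`) and IDs are hypothetical, not a documented gateway API; check your gateway's API reference for the real shape.

```python
import json

# Hypothetical replay-run configuration mirroring the dashboard steps.
# None of these field names are confirmed by the docs; adjust to the real API.
run_config = {
    "source": {
        "api_key_id": "key_prod_123",            # placeholder key ID
        "date_range": {"from": "2024-06-01", "to": "2024-06-07"},
        "filters": {"feature": "support-chat"},  # optional feature/metadata filter
    },
    "target_model": "gpt-4o-mini",               # the model you want to test
    "judge": {                                   # optional LLM judge for scoring
        "model": "gpt-4o",
        "enabled": True,
    },
}

# The gateway would re-execute each matching request against target_model.
print(json.dumps(run_config, indent=2))
```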
Reading results
For each replayed request you see:

| | Original | Replay |
|---|---|---|
| Model | gpt-4o | gpt-4o-mini |
| Input tokens | 1,240 | 1,240 |
| Output tokens | 384 | 312 |
| Cost | $0.0142 | $0.0008 |
| Latency | 2,100ms | 890ms |
| Quality score | — | 0.92 |
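The per-request deltas roll up into aggregate savings with straightforward arithmetic. A minimal sketch, using the numbers from the example table above (the dict shape is illustrative, not a gateway export format):

```python
# Example request pair, numbers taken from the table above.
original = {"cost": 0.0142, "latency_ms": 2100, "output_tokens": 384}
replay   = {"cost": 0.0008, "latency_ms": 890,  "output_tokens": 312}

def pct_saved(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)

cost_saving = pct_saved(original["cost"], replay["cost"])                 # 94.4
latency_saving = pct_saved(original["latency_ms"], replay["latency_ms"])  # 57.6

print(f"cost saved: {cost_saving}%, latency saved: {latency_saving}%")
```

Run over the whole replay set, the same calculation gives the aggregate cost and latency numbers used in the decision framework below.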
Interpreting quality scores
A score of 0.9+ generally means the cheaper model produces outputs that are functionally equivalent for your use case. A score of 0.7–0.9 means similar outputs with some degradation — review the low-scoring requests manually to understand where the gaps are. Low scores on specific request types often reveal where the cheaper model struggles (complex reasoning, long context, specific formatting). That tells you whether to switch fully, switch for a subset of traffic, or not switch at all.

The decision framework
- Run replay on a representative sample (500–1,000 requests is usually enough)
- Check aggregate cost and latency savings
- Review the quality score distribution
- Read through 10–20 low-scoring pairs manually
- If the failure modes are acceptable for your use case, switch
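The steps above can be sketched as a short script: bucket the score distribution using the thresholds from the interpretation guide, then surface the lowest-scoring pairs for manual review. The exported-results shape here is hypothetical — in practice you would pull replay results from your gateway.

```python
from collections import Counter

# Hypothetical exported replay results: (request_id, quality_score) pairs.
results = [("req_1", 0.95), ("req_2", 0.88), ("req_3", 0.62), ("req_4", 0.93)]

def bucket(score: float) -> str:
    """Buckets from the interpretation guide: 0.9+ equivalent, 0.7-0.9 degraded."""
    if score >= 0.9:
        return "equivalent"
    if score >= 0.7:
        return "degraded"
    return "poor"

# Step 3: quality score distribution.
distribution = Counter(bucket(score) for _, score in results)

# Step 4: lowest-scoring pairs first, for manual review.
to_review = sorted(results, key=lambda r: r[1])[:2]

print(distribution)
print(to_review)  # [('req_3', 0.62), ('req_2', 0.88)]
```

Whether the failure modes in `to_review` are acceptable is the judgment call the framework leaves to you — the script only tells you where to look.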