The problem it solves
A new model looks good in the playground. Benchmarks look promising. But you have no idea how it performs on your production traffic — the actual prompts, system prompts, conversation histories, and edge cases your users send. Replay runs your real traffic against the candidate model. You get actual numbers on your workload, not synthetic benchmarks.

Creating a replay run
In the dashboard, go to Replay and create a new run:
- Select a source — choose an API key, date range, and optionally filter by feature or metadata
- Choose the target model — the model you want to test
- Configure the judge — optional LLM judge for automated quality scoring
- Run — the gateway re-executes each selected request against the target model
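If you prefer to script the same setup, the dashboard steps above map naturally onto a JSON payload. The sketch below is purely illustrative — the field names (`source`, `target_model`, `judge`) and IDs are hypothetical, not a documented gateway API; check your gateway's API reference for the real shape.

```python
import json

# Hypothetical replay-run configuration mirroring the dashboard steps.
# None of these field names are confirmed by the docs; adjust to the real API.
run_config = {
    "source": {
        "api_key_id": "key_prod_123",            # placeholder key ID
        "date_range": {"from": "2024-06-01", "to": "2024-06-07"},
        "filters": {"feature": "support-chat"},  # optional feature/metadata filter
    },
    "target_model": "gpt-4o-mini",               # the model you want to test
    "judge": {                                   # optional LLM judge for scoring
        "model": "gpt-4o",
        "enabled": True,
    },
}

# The gateway would re-execute each matching request against target_model.
print(json.dumps(run_config, indent=2))
```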
Reading results
For each replayed request you see:

| | Original | Replay |
|---|---|---|
| Model | gpt-4o | gpt-4o-mini |
| Input tokens | 1,240 | 1,240 |
| Output tokens | 384 | 312 |
| Cost | $0.0142 | $0.0008 |
| Latency | 2,100ms | 890ms |
| Quality score | — | 0.92 |
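The per-request deltas roll up into aggregate savings with straightforward arithmetic. A minimal sketch, using the numbers from the example table above (the dict shape is illustrative, not a gateway export format):

```python
# Example request pair, numbers taken from the table above.
original = {"cost": 0.0142, "latency_ms": 2100, "output_tokens": 384}
replay   = {"cost": 0.0008, "latency_ms": 890,  "output_tokens": 312}

def pct_saved(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)

cost_saving = pct_saved(original["cost"], replay["cost"])                 # 94.4
latency_saving = pct_saved(original["latency_ms"], replay["latency_ms"])  # 57.6

print(f"cost saved: {cost_saving}%, latency saved: {latency_saving}%")
```

Run over the whole replay set, the same calculation gives the aggregate cost and latency numbers used in the decision framework below.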
Interpreting quality scores
A score of 0.9+ generally means the cheaper model produces outputs that are functionally equivalent for your use case. A score of 0.7–0.9 means similar outputs with some degradation — review the low-scoring requests manually to understand where the gaps are. Low scores on specific request types often reveal where the cheaper model struggles (complex reasoning, long context, specific formatting). That tells you whether to switch fully, switch for a subset of traffic, or not switch at all.

The decision framework
- Run replay on a representative sample (500–1,000 requests is usually enough)
- Check aggregate cost and latency savings
- Review the quality score distribution
- Read through 10–20 low-scoring pairs manually
- If the failure modes are acceptable for your use case, switch
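The steps above can be sketched as a short script: bucket the score distribution using the thresholds from the interpretation guide, then surface the lowest-scoring pairs for manual review. The exported-results shape here is hypothetical — in practice you would pull replay results from your gateway.

```python
from collections import Counter

# Hypothetical exported replay results: (request_id, quality_score) pairs.
results = [("req_1", 0.95), ("req_2", 0.88), ("req_3", 0.62), ("req_4", 0.93)]

def bucket(score: float) -> str:
    """Buckets from the interpretation guide: 0.9+ equivalent, 0.7-0.9 degraded."""
    if score >= 0.9:
        return "equivalent"
    if score >= 0.7:
        return "degraded"
    return "poor"

# Step 3: quality score distribution.
distribution = Counter(bucket(score) for _, score in results)

# Step 4: lowest-scoring pairs first, for manual review.
to_review = sorted(results, key=lambda r: r[1])[:2]

print(distribution)
print(to_review)  # [('req_3', 0.62), ('req_2', 0.88)]
```

Whether the failure modes in `to_review` are acceptable is the judgment call the framework leaves to you — the script only tells you where to look.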