Model Comparison
Run a single prompt against two, three, or four models at once, stream every response in parallel, and compare the outputs. Useful for choosing the right model for a given task, building evals, or letting end users pick between answers.
Hosted UI: Arena
If you just want to try this out without writing code, use the Model Arena at /playgrounds/arena. It lets you pick up to 4 models, type one prompt, and watch all responses stream side-by-side with latency, token counts, and per-pane cost.
Arena is built on top of the public streaming API described below — anything the Arena does, you can do from your own code.
Parallel Fan-out (Python, async)
import asyncio

import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.indoxhub.com/api/v1"
MODELS = [
    "openai/gpt-4o",
    "anthropic/claude-opus-4-7",
    "google/gemini-2.0-flash",
    "mistral/mistral-large-latest",
]


async def stream_model(client, model, prompt):
    """Stream one model's response and return (model, joined SSE payloads)."""
    async with client.stream(
        "POST",
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        timeout=None,  # a long stream can outlast httpx's default 5-second timeout
    ) as response:
        response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
        chunks = []
        async for line in response.aiter_lines():
            # Keep every SSE data payload except the [DONE] sentinel.
            if line.startswith("data: ") and line != "data: [DONE]":
                chunks.append(line[6:])
        # Payloads are collected verbatim (raw JSON frames), not decoded text.
        return model, "".join(chunks)


async def compare(prompt):
    async with httpx.AsyncClient() as client:
        tasks = [stream_model(client, m, prompt) for m in MODELS]
        # gather returns results in the same order as MODELS.
        results = await asyncio.gather(*tasks)
    for model, output in results:
        print(f"\n=== {model} ===\n{output}")


asyncio.run(compare("Explain quicksort in three sentences."))
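Each stream_model coroutine opens its own SSE connection, and asyncio.gather runs them concurrently, so one slow model does not delay the other streams. gather itself returns only once every stream has finished, with results in the same order as MODELS. Credits are billed per stream.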
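To handle each answer as soon as it is ready rather than after the slowest model, asyncio.as_completed is one option. The following is a minimal sketch, not part of the API: fake_stream is a hypothetical stand-in for stream_model that sleeps instead of opening a connection, and the model names and delays are made up for illustration.

```python
import asyncio


async def fake_stream(model, delay):
    # Hypothetical stand-in for stream_model: sleeps instead of calling the API.
    await asyncio.sleep(delay)
    return model, f"answer from {model}"


async def compare_as_completed(delays):
    """Collect (model, output) pairs in completion order, fastest first."""
    tasks = [asyncio.create_task(fake_stream(m, d)) for m, d in delays.items()]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        model, output = await next_done
        finished.append((model, output))  # a UI would render this pane here
    return finished


if __name__ == "__main__":
    order = asyncio.run(compare_as_completed({"slow-model": 0.2, "fast-model": 0.05}))
    print([model for model, _ in order])
```

In real use, swapping fake_stream for stream_model(client, m, prompt) should give the same completion-order behaviour, assuming the client stays open until the loop ends.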
Parallel Fan-out (JavaScript, browser)
const API_KEY = "YOUR_API_KEY";
const BASE_URL = "https://api.indoxhub.com/api/v1";
const MODELS = [
  "openai/gpt-4o",
  "anthropic/claude-opus-4-7",
  "google/gemini-2.0-flash",
  "mistral/mistral-large-latest",
];

async function streamModel(model, prompt, signal) {
  const response = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      stream: true,
    }),
    signal,
  });
  if (!response.ok) {
    throw new Error(`${model}: HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ""; // holds a partial line until its newline arrives
  let output = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // last element is an incomplete line (or "")
    for (const line of lines) {
      if (line.startsWith("data: ") && line !== "data: [DONE]") {
        output += line.slice(6);
      }
    }
  }
  return { model, output };
}

// Pass in your own controller and call controller.abort() to stop every stream.
async function compare(prompt, controller = new AbortController()) {
  const results = await Promise.all(
    MODELS.map((m) => streamModel(m, prompt, controller.signal))
  );
  results.forEach((r) => console.log(`\n=== ${r.model} ===\n${r.output}`));
}

compare("Explain quicksort in three sentences.");
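A single AbortController cancels all four streams at once, which is handy for a global "Stop" button: call controller.abort() from the button's click handler. The pending fetch and read calls then reject with an AbortError, so catch it around compare. Upstream token generation halts within ~1 second of abort, so you only pay for tokens produced up to the cancel point.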
Measuring Cost and Latency
The final SSE event before [DONE] is a response.done frame carrying a usage object with input_tokens, output_tokens, and total_tokens. Parse it per stream to build a comparison table:
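import json


async def stream_with_usage(client, model, prompt):
    usage = None  # stays None if no response.done frame arrives
    # Same headers and request body as stream_model.
    headers = {"Authorization": f"Bearer {API_KEY}"}
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    async with client.stream(
        "POST", f"{BASE_URL}/chat/completions", headers=headers, json=body, timeout=None
    ) as response:
        async for line in response.aiter_lines():
            if line.startswith("data: ") and line != "data: [DONE]":
                event = json.loads(line[6:])
                if event.get("type") == "response.done":
                    usage = event["response"]["usage"]
    return model, usage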
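Measure wall-clock latency on the client by timing each stream from just before the stream() call until the stream closes. Time each model separately, because they finish at different moments.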
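One way to take that measurement is a small wrapper around any awaitable, using time.perf_counter. This is a sketch under assumptions: timed and demo are hypothetical helper names, and asyncio.sleep stands in for a real streaming call.

```python
import asyncio
import time


async def timed(awaitable):
    """Await the argument and return (result, elapsed_seconds) as seen by the client."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start


async def demo():
    # asyncio.sleep stands in for a call such as stream_with_usage(client, model, prompt).
    return await timed(asyncio.sleep(0.05, result="done"))


if __name__ == "__main__":
    result, elapsed = asyncio.run(demo())
    print(f"{result} after {elapsed:.3f}s")
```

Wrapping each coroutine before handing it to asyncio.gather, for example timed(stream_with_usage(client, m, prompt)), would pair every model's usage with its own latency.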
Gotchas
- Rate limits are per-minute, not per-concurrent-connection. Four parallel streams count as four requests against your per-minute quota. See Rate Limits.
- Credits are deducted per stream once each one finalizes. A 4-model comparison costs the sum of four individual calls, which is roughly 4× a single call when the models are priced similarly, so budget accordingly.
- Different models emit tokens at different rates. Some panes finish in 1s, others in 15s. Design your UI so a slow model does not block reading from the fast ones.
- Cancel early if you only need the first answer. If you are showing the fastest response to an end user and discarding the rest, abort the other streams as soon as the first one completes. You then pay only for the tokens the slower models had produced by that point, so the saving depends on how far behind they were.
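The cancel-early pattern can be sketched in Python with asyncio.wait and FIRST_COMPLETED. This is an illustration rather than a definitive implementation: fake_stream is a hypothetical stand-in for stream_model, and the model names and delays are invented. With a real httpx stream, cancelling the task should exit the async with block and close that connection.

```python
import asyncio


async def fake_stream(model, delay):
    # Hypothetical stand-in for stream_model: sleeps instead of calling the API.
    await asyncio.sleep(delay)
    return model, f"answer from {model}"


async def first_answer(delays):
    """Return the fastest (model, output) and cancel the remaining streams."""
    tasks = [asyncio.create_task(fake_stream(m, d)) for m, d in delays.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop the slower streams
    # Wait for the cancellations to settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


if __name__ == "__main__":
    print(asyncio.run(first_answer({"slow-model": 0.3, "fast-model": 0.01})))
```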