Run Inference
Request parameters
model (string)
Model alias (e.g. llama-3.1-8b, qwen-7b). Resolved to provider-specific IDs by the exchange.
input (string)
Primary prompt text. (The legacy prompt field is also accepted when input is empty; the API merges them into one canonical prompt.)
max_output_tokens (integer)
Maximum number of tokens to generate. The allowed range is enforced server-side (up to 128,000 depending on model caps).
performance_profile (string)
One of economy, balanced, fast. Default: balanced. Maps to internal priority and routing-policy defaults.
routing_strategy (string)
lowest_cost or prefer_live. Default: lowest_cost.
customer_ref (string)
Your reference ID for this job (stored for support and tracing).
max_price_usd (number)
Reject the request if the estimated customer charge would exceed this amount (markup-inclusive).
max_latency_ms (number)
Reject quotes whose estimated latency exceeds this threshold (milliseconds).
metadata (object)
Arbitrary JSON object persisted on the job for your own bookkeeping.
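Putting the parameters together, a request can be sent with any HTTP client. A minimal Python sketch using only the standard library; the endpoint URL and bearer-token auth scheme here are assumptions, not part of the documented API:

```python
import json
import urllib.request

API_URL = "https://api.flopex.example/v1/inference"  # hypothetical endpoint
API_KEY = "sk-..."  # placeholder API key

payload = {
    "model": "llama-3.1-8b",
    "input": "Summarize inference routing in two sentences.",
    "max_output_tokens": 256,
    "performance_profile": "balanced",
    "routing_strategy": "lowest_cost",
    "max_price_usd": 0.02,   # reject if the estimated charge would exceed this
    "max_latency_ms": 800,   # reject quotes slower than this threshold
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
    },
)
# job = json.loads(urllib.request.urlopen(req).read())  # send when ready
```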
Response fields
Successful responses return a stable JSON object. Key fields:
- id: unique job ID (UUID)
- object: always "inference"
- model: the alias you requested
- output_text: generated text
- status: e.g. success
- usage.input_tokens, usage.output_tokens, usage.total_tokens: token counts
- performance.latency_ms: wall-clock latency
- billing.cost_usd: amount charged (includes Flopex margin)
- billing.cost_display: human-readable amount
- billing.cost_per_million_tokens_usd: effective blended rate when tokens > 0
- billing.input_tokens / billing.output_tokens / billing.total_tokens: billed token counts
- billing.balance_remaining_usd: wallet balance after this charge
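The blended rate is derived from the other billing fields; as a quick check, the numbers from the sample response below reproduce it:

```python
cost_usd = 0.000041   # billing.cost_usd from the sample response
total_tokens = 142    # billing.total_tokens

# Effective blended rate: dollars per million tokens
blended = cost_usd / total_tokens * 1_000_000
print(round(blended, 2))  # 0.29, matching billing.cost_per_million_tokens_usd
```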
Full request and response
Request
{
  "model": "llama-3.1-8b",
  "input": "Summarize inference routing in two sentences.",
  "max_output_tokens": 256,
  "performance_profile": "balanced",
  "routing_strategy": "lowest_cost",
  "customer_ref": "order-9921",
  "max_price_usd": 0.02,
  "max_latency_ms": 800,
  "metadata": { "env": "staging" }
}

Response
{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "object": "inference",
  "model": "llama-3.1-8b",
  "output_text": "Flopex pings multiple GPU providers…",
  "status": "success",
  "usage": {
    "input_tokens": 24,
    "output_tokens": 118,
    "total_tokens": 142
  },
  "performance": {
    "latency_ms": 186
  },
  "billing": {
    "cost_usd": 0.000041,
    "cost_display": "$0.000041",
    "input_tokens": 24,
    "output_tokens": 118,
    "total_tokens": 142,
    "cost_per_million_tokens_usd": 0.29,
    "balance_remaining_usd": 9.912004
  }
}

Performance profiles
economy
Optimizes for lowest cost per token.
Best for batch jobs, async workloads, and non-latency-sensitive applications.
balanced (default)
Balances cost, latency, and provider reliability. The right choice for most production use cases.
fast
Prioritizes latency above cost.
Routes to the lowest-latency provider available at the time of the request. Expect sub-100ms responses on most models.
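How the profiles combine with the per-request guards can be sketched as a small helper; the function name and the workload categories are illustrative, not part of the API:

```python
def request_options(workload: str) -> dict:
    """Illustrative mapping from workload type to request fields."""
    if workload == "batch":
        # Non-latency-sensitive: squeeze cost per token.
        return {"performance_profile": "economy", "routing_strategy": "lowest_cost"}
    if workload == "interactive":
        # Latency first; cap it explicitly with max_latency_ms as well.
        return {"performance_profile": "fast", "max_latency_ms": 800}
    # Sensible default for most production traffic.
    return {"performance_profile": "balanced"}

print(request_options("interactive")["performance_profile"])  # fast
```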
Idempotency
Send Idempotency-Key with a unique value per logical request. Retries with the same key and same JSON body return the cached response without re-executing providers or double-charging. Conflicting reuse returns idempotency_key_reused_with_different_request.
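A retry loop that stays safe under these semantics generates the key once per logical request and reuses it on every attempt. A sketch, again assuming a hypothetical endpoint URL:

```python
import json
import urllib.error
import urllib.request
import uuid

API_URL = "https://api.flopex.example/v1/inference"  # hypothetical endpoint

def new_idempotency_key() -> str:
    # One key per logical request; reuse it verbatim on every retry.
    return str(uuid.uuid4())

def run_with_retries(payload: dict, attempts: int = 3) -> dict:
    key = new_idempotency_key()
    body = json.dumps(payload).encode()
    last_err = None
    for _ in range(attempts):
        req = urllib.request.Request(
            API_URL,
            data=body,
            headers={
                "Content-Type": "application/json",
                "Idempotency-Key": key,
            },
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.loads(resp.read())
        except urllib.error.URLError as err:
            # Safe to retry: the same key + same body returns the cached
            # response, so the job is never re-executed or double-charged.
            last_err = err
    raise last_err
```

Note that changing the body while reusing the key fails with idempotency_key_reused_with_different_request, so the JSON must be byte-identical across retries.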