Documentation

Run Inference

POST /v1/inference

Request parameters

model (string, required)

Model alias (e.g. llama-3.1-8b, qwen-7b). Resolved to provider-specific IDs by the exchange.

input (string, required)

Primary prompt text. (The legacy prompt field is also accepted when input is empty; the API merges the two into one canonical prompt.)

max_output_tokens (integer, required)

Maximum tokens to generate. Allowed range is enforced server-side (up to 128,000 depending on model caps).

performance_profile (string, optional)

One of economy, balanced, fast. Default: balanced. Maps to internal priority and routing policy defaults.

routing_strategy (string, optional)

lowest_cost or prefer_live. Default: lowest_cost.

customer_ref (string, optional)

Your reference ID for this job (stored for support and tracing).

max_price_usd (float, optional)

Reject the request if the estimated customer charge would exceed this amount (markup-inclusive).

max_latency_ms (integer, optional)

Reject quotes whose estimated latency exceeds this threshold (milliseconds).

metadata (object, optional)

Arbitrary JSON object persisted on the job for your own bookkeeping.
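The parameters above can be assembled client-side before the POST. A minimal sketch, assuming nothing beyond the schema documented here (the helper name and validation are illustrative, not part of any official SDK):

```python
import json

# Field names mirror the /v1/inference request parameters above.
REQUIRED = {"model", "input", "max_output_tokens"}
OPTIONAL = {"performance_profile", "routing_strategy", "customer_ref",
            "max_price_usd", "max_latency_ms", "metadata"}

def build_inference_payload(**fields):
    """Validate fields against the documented schema and return a JSON body."""
    missing = REQUIRED - fields.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    unknown = fields.keys() - REQUIRED - OPTIONAL
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return json.dumps(fields)

body = build_inference_payload(
    model="llama-3.1-8b",
    input="Summarize inference routing in two sentences.",
    max_output_tokens=256,
)
```

The resulting string is the JSON body you would send to POST /v1/inference.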

Response fields

Successful responses return a stable JSON object. Key fields:

  • id — unique job ID (UUID)
  • object — always "inference"
  • model — alias you requested
  • output_text — generated text
  • status — e.g. success
  • usage.input_tokens, usage.output_tokens, usage.total_tokens
  • performance.latency_ms — wall-clock latency
  • billing.cost_usd — amount charged (includes Flopex margin)
  • billing.cost_display — human-readable amount
  • billing.cost_per_million_tokens_usd — effective blended rate when tokens > 0
  • billing.input_tokens / billing.output_tokens / billing.total_tokens
  • billing.balance_remaining_usd — wallet after this charge

Full request and response

Request

{
  "model": "llama-3.1-8b",
  "input": "Summarize inference routing in two sentences.",
  "max_output_tokens": 256,
  "performance_profile": "balanced",
  "routing_strategy": "lowest_cost",
  "customer_ref": "order-9921",
  "max_price_usd": 0.02,
  "max_latency_ms": 800,
  "metadata": { "env": "staging" }
}

Response

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "object": "inference",
  "model": "llama-3.1-8b",
  "output_text": "Flopex pings multiple GPU providers…",
  "status": "success",
  "usage": {
    "input_tokens": 24,
    "output_tokens": 118,
    "total_tokens": 142
  },
  "performance": {
    "latency_ms": 186
  },
  "billing": {
    "cost_usd": 0.000041,
    "cost_display": "$0.000041",
    "input_tokens": 24,
    "output_tokens": 118,
    "total_tokens": 142,
    "cost_per_million_tokens_usd": 0.29,
    "balance_remaining_usd": 9.912004
  }
}

Performance profiles

economy

Optimizes for lowest cost per token.

Best for batch jobs, async workloads, and non-latency-sensitive applications.

balanced (default)

Balances cost, latency, and provider reliability. The right choice for most production use cases.

fast

Prioritizes latency above cost.

Routes to the lowest-latency provider available at the time of the request. Expect sub-100ms responses on most models.
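Since performance_profile is optional and defaults to balanced, clients that take the profile from user input may want to normalize it before sending. A small client-side convenience sketch (not part of the API itself):

```python
PROFILES = ("economy", "balanced", "fast")

def resolve_profile(profile=None):
    """Return a valid performance_profile, defaulting to 'balanced'."""
    if profile is None:
        return "balanced"  # documented server-side default
    if profile not in PROFILES:
        raise ValueError(f"unknown performance_profile: {profile!r}")
    return profile
```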

Idempotency

Send Idempotency-Key with a unique value per logical request. Retries with the same key and same JSON body return the cached response without re-executing providers or double-charging. Conflicting reuse returns idempotency_key_reused_with_different_request.
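The semantics can be illustrated with an in-memory sketch: the cache is keyed by the Idempotency-Key, and a stored hash of the request body guards against conflicting reuse. Names and the placeholder response here are illustrative, not the server's implementation:

```python
import hashlib
import json

_cache = {}  # idempotency_key -> (body_hash, cached_response)

def submit(idempotency_key, body):
    """Return the cached response on retry; reject conflicting key reuse."""
    body_hash = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    if idempotency_key in _cache:
        stored_hash, response = _cache[idempotency_key]
        if stored_hash != body_hash:
            # Same key, different JSON body: conflicting reuse.
            return {"error": "idempotency_key_reused_with_different_request"}
        # Same key, same body: cached response, no re-execution or double charge.
        return response
    response = {"id": "job-1", "status": "success"}  # stand-in for the real provider call
    _cache[idempotency_key] = (body_hash, response)
    return response
```

A retry with the same key and body returns the stored response; changing the body under the same key yields the documented error code.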