Performance Profiles
economy
Best for: batch jobs and non-latency-sensitive workloads.
Routing: lowest normalized cost wins among eligible providers.
Typical winner: DeepInfra at roughly $0.07/1M tokens on many open models.
Expected latency: ~200–400 ms, depending on workload.
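As a rough mental model of the economy rule, the sketch below picks the eligible quote with the lowest normalized cost. The `Quote` type, its fields, and the provider names are hypothetical stand-ins. The service's actual eligibility checks and cost normalization are not described here.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    # Hypothetical quote shape, for illustration only.
    provider: str
    cost_per_1m: float     # normalized cost in USD per 1M tokens
    latency_ms: float      # quoted latency
    eligible: bool = True  # e.g. supports the requested model and limits


def pick_economy(quotes: list[Quote]) -> Quote:
    """Return the eligible quote with the lowest normalized cost."""
    eligible = [q for q in quotes if q.eligible]
    if not eligible:
        raise ValueError("no eligible provider quotes")
    return min(eligible, key=lambda q: q.cost_per_1m)


quotes = [
    Quote("provider-a", 0.10, 150),
    Quote("provider-b", 0.07, 300),
    Quote("provider-c", 0.05, 100, eligible=False),  # cheapest, but ineligible
]
print(pick_economy(quotes).provider)
```

Latency plays no part in this rule, which is why the economy profile can land on a slower provider.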
balanced (default)
Best for: most production traffic.
Routing: balances cost, latency, and reliability via the yield policy.
Typical winner: Featherless or Together AI.
Expected latency: ~150–300 ms.
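The yield policy itself is not specified on this page. The sketch below assumes one plausible shape for it: a weighted sum in which each quote's cost, latency, and reliability are expressed as ratios to the best value among the candidates, and the lowest score wins. The weights, field names, and scoring formula are illustrative assumptions, not the router's real parameters.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    # Hypothetical quote shape, for illustration only.
    provider: str
    cost_per_1m: float   # USD per 1M tokens
    latency_ms: float    # quoted latency
    success_rate: float  # recent fraction of successful requests, in (0, 1]


# Hypothetical weights; the real yield policy is not documented here.
WEIGHTS = {"cost": 0.4, "latency": 0.4, "reliability": 0.2}


def pick_balanced(quotes: list[Quote]) -> Quote:
    """Return the quote with the lowest weighted score (lower is better)."""
    if not quotes:
        raise ValueError("no provider quotes")
    best_cost = min(q.cost_per_1m for q in quotes)
    best_latency = min(q.latency_ms for q in quotes)
    best_success = max(q.success_rate for q in quotes)

    def score(q: Quote) -> float:
        # Each term is a ratio >= 1.0, where 1.0 means "best among candidates".
        return (
            WEIGHTS["cost"] * (q.cost_per_1m / best_cost)
            + WEIGHTS["latency"] * (q.latency_ms / best_latency)
            + WEIGHTS["reliability"] * (best_success / q.success_rate)
        )

    return min(quotes, key=score)


quotes = [
    Quote("cheap-slow", 0.07, 350, 0.99),
    Quote("middle", 0.10, 180, 0.995),
    Quote("fast-pricey", 0.30, 60, 0.90),
]
print(pick_balanced(quotes).provider)
```

With these assumed weights, neither the cheapest nor the fastest quote wins outright; the middle option scores best because it gives up little on any axis.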
fast
Best for: real-time, user-facing applications.
Routing: latency-first; Groq LPU tier is preferred when it quotes.
Typical winner: Groq, which quotes ~40 ms on supported models.
Expected latency: ~40–200 ms.
Note: this profile may bill at slightly higher rates when the winning quote comes from a premium speed tier.
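A minimal sketch of latency-first selection with a preferred tier, assuming each quote carries a hypothetical `tier` label (here `"lpu"` for the Groq LPU tier). When any preferred-tier quote is present it wins; otherwise the lowest quoted latency wins. The page does not say whether the real router prefers the LPU tier unconditionally or only within some latency margin, so treat this as an approximation.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    # Hypothetical quote shape, for illustration only.
    provider: str
    latency_ms: float
    tier: str = "standard"  # hypothetical label; "lpu" marks the preferred tier


PREFERRED_TIER = "lpu"


def pick_fast(quotes: list[Quote]) -> Quote:
    """Prefer preferred-tier quotes when any exist, then take the lowest latency."""
    if not quotes:
        raise ValueError("no provider quotes")
    preferred = [q for q in quotes if q.tier == PREFERRED_TIER]
    pool = preferred or quotes  # fall back to all quotes if the tier did not quote
    return min(pool, key=lambda q: q.latency_ms)


quotes = [
    Quote("provider-a", 30.0),
    Quote("groq", 40.0, tier="lpu"),
    Quote("provider-b", 120.0),
]
print(pick_fast(quotes).provider)
```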
Comparison
| Profile | Avg latency | Typical cost (per 1M tokens) | Best for |
|---|---|---|---|
| economy | ~280 ms | ~$0.07 | Batch, async |
| balanced | ~200 ms | ~$0.12 | General use |
| fast | ~80 ms | ~$0.18 | Real-time |
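To turn the typical per-1M-token rates into a budget estimate, multiply the token count in millions by the rate. The helper below does this with the table's approximate figures. It is a back-of-the-envelope sketch: actual bills depend on the winning provider and model.

```python
# Typical cost in USD per 1M tokens, taken from the comparison table.
TYPICAL_COST_PER_1M = {"economy": 0.07, "balanced": 0.12, "fast": 0.18}


def estimate_cost(tokens: int, profile: str) -> float:
    """Approximate USD cost for `tokens` tokens under a given profile."""
    return tokens / 1_000_000 * TYPICAL_COST_PER_1M[profile]


for profile in TYPICAL_COST_PER_1M:
    print(f"{profile:>8}: ${estimate_cost(50_000_000, profile):.2f}")
```

For a 50M-token batch this comes to about $3.50 (economy), $6.00 (balanced), and $9.00 (fast).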
Usage example
Run the same prompt under all three profiles:
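```python
import requests

API_URL = "https://api.flopex.ai/v1/inference"
API_KEY = "sk_live_YOUR_KEY"  # replace with your own key


def run(profile: str) -> dict:
    """Send one inference request using the given performance profile."""
    r = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "llama-3.1-8b",
            "input": "Hello",
            "max_output_tokens": 64,
            "performance_profile": profile,
        },
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )
    r.raise_for_status()  # raise on 4xx/5xx responses
    return r.json()


# Compare observed latency and billed cost across the three profiles.
for p in ("economy", "balanced", "fast"):
    data = run(p)
    print(p, data["performance"]["latency_ms"], data["billing"]["cost_display"])
```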
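To choose the `performance_profile` value in code, one simple approach is to take the cheapest profile whose average latency from the comparison table fits your latency budget. The function name and the use of table averages as a selection rule are assumptions for illustration. Averages depend on workload, so latencies you measure with the usage example above are a better guide in production.

```python
# (average latency in ms, typical USD per 1M tokens), from the comparison table.
PROFILES = {
    "economy": (280, 0.07),
    "balanced": (200, 0.12),
    "fast": (80, 0.18),
}


def cheapest_profile_within(budget_ms: float) -> str:
    """Return the cheapest profile whose average latency is within budget_ms."""
    fitting = [
        (cost, name)
        for name, (latency_ms, cost) in PROFILES.items()
        if latency_ms <= budget_ms
    ]
    if not fitting:
        raise ValueError(f"no profile averages under {budget_ms} ms")
    return min(fitting)[1]  # lowest cost wins


print(cheapest_profile_within(250))
```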