Frontier open models.
Compressed to pennies.
Pay-as-you-go API access to 56 open-weight LLMs — Qwen, Llama, DeepSeek, Mistral and more — served as 2-bit compressed weights on on-demand GPUs. Same OpenAI SDK, a fraction of the bill.
Models are 2-bit (Q2_K) quantized — optimized for cost, with quality slightly below the full-precision originals.
Up to 90% cheaper
56 popular open models served as 2-bit compressed weights — the lowest-cost way to run them. Trade a little quality for a big cost cut.
OpenAI-compatible
Drop-in /v1/chat/completions — switch your base_url and keep your existing SDK calls.
Scale-to-zero GPUs
Models spin up on demand and idle back to zero, so you pay only for the input and output tokens you actually use.
Pay-as-you-go
Top up in USD, EUR, or crypto. No subscription. No commitment. Refundable balance.
Pricing that scales with you
Top up any amount. Unused balance carries over forever. No subscriptions.
Per-1M-token rates: from $0.20 (input) / $0.60 (output) — up to 90% cheaper than frontier APIs. All models are 2-bit (Q2_K) compressed. See docs for the full price table.
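To make the per-token pricing concrete, here is a minimal sketch of how a bill could be estimated from token counts. It assumes the entry-level rates quoted above ($0.20 input / $0.60 output per 1M tokens). The helper name and the example token counts are illustrative only, and real rates vary by model, so treat the docs price table as the source of truth.

```python
from decimal import Decimal

# Entry-level rates quoted on this page, in USD per 1M tokens.
# Assumption: these are the "from" prices; other models may cost more.
INPUT_RATE_PER_M = Decimal("0.20")
OUTPUT_RATE_PER_M = Decimal("0.60")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the charge for a workload (hypothetical helper, not part of any SDK)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        Decimal(input_tokens) * INPUT_RATE_PER_M
        + Decimal(output_tokens) * OUTPUT_RATE_PER_M
    ) / TOKENS_PER_M


# Example workload: 3M input tokens and 1M output tokens.
cost = estimate_cost_usd(3_000_000, 1_000_000)
print(f"Estimated cost: ${cost:.2f}")  # prints "Estimated cost: $1.20"
```

At these rates, input and output each contribute $0.60 in this example, so output-heavy workloads are the ones to watch.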
Drop-in in 30 seconds
Switch base_url, keep your code.
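from openai import OpenAI

# Point the standard OpenAI client at the OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.aiqa-inference.example/v1",
    api_key="sk-...",  # your API key
)

# Same chat.completions call as before; only base_url and the model name change.
response = client.chat.completions.create(
    model="qwen3.6-27b-q2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)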
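If you would rather not install an SDK, the same call can be sketched with the Python standard library alone. This assumes the endpoint follows the usual OpenAI conventions (a Bearer token in the Authorization header and the standard chat-completions JSON body), which the page implies but does not spell out. The function names and the 120-second timeout are illustrative guesses; the long timeout is only meant to leave room for a model that has scaled to zero to spin back up.

```python
import json
import urllib.request

BASE_URL = "https://api.aiqa-inference.example/v1"  # base URL from the quick start


def build_chat_request(prompt: str, model: str = "qwen3.6-27b-q2",
                       api_key: str = "sk-...") -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions.

    Assumption: auth and body shape match the OpenAI chat-completions format.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Usage (needs a real API key and network access):
# print(send(build_chat_request("Hello!", api_key="sk-...")))
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before spending any tokens.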