56 open models · 2-bit compressed · live now

Frontier open models.
Compressed to pennies.

Pay-as-you-go API access to 56 open-weight LLMs — Qwen, Llama, DeepSeek, Mistral and more — served as 2-bit compressed weights on on-demand GPUs. Same OpenAI SDK, a fraction of the bill.

Models are 2-bit (Q2_K) quantized — optimized for cost, with quality slightly below the full-precision originals.

Up to 90% cheaper

56 popular open models served as 2-bit compressed weights — the lowest-cost way to run them. Trade a little quality for a big cost cut.

OpenAI-compatible

Drop-in /v1/chat/completions — switch your base_url and keep your existing SDK calls.
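
As a rough sketch of what "drop-in" means at the HTTP level, the snippet below builds (but does not send) a chat completion request using only the Python standard library. It assumes the endpoint accepts the usual OpenAI-style JSON body and a Bearer token header. The base URL and model name are the ones used in the example on this page, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Base URL taken from the example on this page.
BASE_URL = "https://api.aiqa-inference.example/v1"


def build_chat_request(api_key, model, prompt):
    """Build an OpenAI-style POST to /chat/completions (hypothetical helper)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed: OpenAI-style Bearer auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Nothing is sent here; pass `req` to urllib.request.urlopen to actually call the API.
req = build_chat_request("sk-...", "qwen3.6-27b-q2", "Hello!")
print(req.get_method(), req.full_url)
```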

Scale-to-zero GPUs

Models spin up on demand and idle back to zero — you pay only for the tokens you actually generate.
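
Because models idle back to zero, the first request to a cold model may plausibly take longer or fail while a GPU spins up. This page does not describe the actual cold-start behavior, so the following is only a minimal sketch of client-side retries with exponential backoff under that assumption. `call_with_backoff` and its parameters are hypothetical names, not part of any SDK.

```python
import time


def call_with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry.

    Re-raises the last exception once `retries` attempts have been used.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage sketch: wrap whatever function sends your request, for example
# call_with_backoff(lambda: client.chat.completions.create(...))
```

In practice you would likely catch only timeout and connection errors rather than every exception, so that genuine request errors surface immediately.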

Pay-as-you-go

Top up in USD, EUR, or crypto. No subscription. No commitment. Refundable balance.

Pricing that scales with you

Top up any amount. Unused balance carries over forever. No subscriptions.

Starter

$5 + 20% free
  • ~50M tokens included
  • All 56 compressed models
  • Email support
Start with Starter

Pro (most popular)

$20 + 25% free
  • ~250M tokens included
  • Priority routing
  • Usage exports
Start with Pro

Scale

$100 + 30% free
  • ~1.5B tokens included
  • Dedicated channel
  • Direct Slack channel
Start with Scale

Per-1M-token rates: from $0.20 (input) / $0.60 (output) — up to 90% cheaper than frontier APIs. All models are 2-bit (Q2_K) compressed. See docs for the full price table.
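
To make the arithmetic concrete, here is a small sketch that estimates the cost of a request and the balance credited after a top-up. It uses the starting rates quoted above as defaults, so real per-model rates may be higher, and it assumes the "% free" bonus is applied as a percentage of the amount paid. The function names are illustrative only.

```python
# Starting per-1M-token rates quoted on this page; actual rates vary by model.
INPUT_RATE_USD_PER_M = 0.20
OUTPUT_RATE_USD_PER_M = 0.60


def estimate_cost_usd(input_tokens, output_tokens,
                      input_rate=INPUT_RATE_USD_PER_M,
                      output_rate=OUTPUT_RATE_USD_PER_M):
    """Estimate the cost of one request from its token counts."""
    return (input_tokens / 1_000_000) * input_rate \
        + (output_tokens / 1_000_000) * output_rate


def credited_balance_usd(top_up_usd, bonus_percent):
    """Balance after a top-up, assuming the bonus is a percentage of the amount paid."""
    return top_up_usd * (1 + bonus_percent / 100)


# Example: a request with 2,000 input tokens and 500 output tokens.
print(f"${estimate_cost_usd(2_000, 500):.6f}")
# Example: the Starter tier, $5 with a 20% bonus.
print(f"${credited_balance_usd(5, 20):.2f}")
```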

Drop-in setup in 30 seconds

Switch base_url, keep your code.

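from openai import OpenAI

# Point the standard OpenAI client at the OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.aiqa-inference.example/v1",
    api_key="sk-...",  # your API key
)

# Same chat.completions call as before; only the model name changes.
response = client.chat.completions.create(
    model="qwen3.6-27b-q2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)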