OpenRouter LLM Research: Cheapest Effective Models for Invoice Extraction & General AI Tasks
Executive Summary
After pulling live pricing from the OpenRouter API (312+ models) and cross-referencing web benchmarks, here are the ranked recommendations for UnitCycle's two use cases.
Use Case 1: Invoice Field Extraction (Structured JSON Output)
Requirements: After OCR, extract vendor name, amounts, dates, line items, GL codes as structured JSON. Must reliably produce valid JSON every time.
Top Picks — Ranked by Value (Quality-per-Dollar)
| Rank | Model | Input/M | Output/M | Context | Structured Output | Notes |
|---|---|---|---|---|---|---|
| 1 | Google Gemini 2.0 Flash (google/gemini-2.0-flash-001) | $0.10 | $0.40 | 1M | Yes (native json_schema) | Best value. 97.1% quality in 38-task benchmark at lowest cost. Native JSON schema enforcement. Google's workhorse. |
| 2 | Google Gemini 2.5 Flash Lite (google/gemini-2.5-flash-lite) | $0.10 | $0.40 | 1M | Yes (native json_schema) | Newer Lite variant. Same price as 2.0 Flash. Designed for high-volume extraction/classification. |
| 3 | DeepSeek V3.1 (deepseek/deepseek-chat-v3.1) | $0.15 | $0.75 | 32K | Yes (native json_schema) | 671B MoE (37B active). Excellent at structured extraction ("high compliance with structured outputs"). Very literal and predictable, ideal for pipelines. |
| 4 | Qwen3-32B (qwen/qwen3-32b) | $0.08 | $0.24 | 40K | Yes (native json_schema) | Cheapest input of any quality model. Strong at structured tasks. May need explicit formatting instructions for long contexts. |
| 5 | GPT-4.1 Nano (openai/gpt-4.1-nano) | $0.10 | $0.40 | 1M | Yes (native json_schema) | OpenAI's cheapest. Best-in-class structured output compliance (OpenAI pioneered this). Multimodal. |
| 6 | GPT-5 Nano (openai/gpt-5-nano) | $0.05 | $0.40 | 400K | Yes (native json_schema) | Even cheaper input than 4.1 Nano. New model; verify extraction quality before committing. |
| 7 | Mistral Small 3.1 24B (mistralai/mistral-small-3.1-24b-instruct) | $0.03 | $0.11 | 131K | Yes (native json_schema) | Ultra-cheap. Works out of the box with invoice extraction libraries. May be less reliable on complex multi-line-item invoices. |
| 8 | GPT-4o Mini (openai/gpt-4o-mini) | $0.15 | $0.60 | 128K | Yes (native json_schema) | Battle-tested, with known excellent JSON compliance. Slightly more expensive but very reliable. |
| 9 | Llama 4 Scout (meta-llama/llama-4-scout) | $0.08 | $0.30 | 327K | Yes (native json_schema) | Open-source, multimodal, huge context. Good structured output support via OpenRouter. |
Cost Estimate for Invoice Processing
Assuming a typical invoice = ~2,000 tokens input (OCR markdown + system prompt + schema), ~500 tokens output (extracted JSON):
| Model | Cost per Invoice | Cost per 1,000 Invoices | Cost per 10,000 Invoices |
|---|---|---|---|
| Gemini 2.0 Flash | $0.0004 | $0.40 | $4.00 |
| Gemini 2.5 Flash Lite | $0.0004 | $0.40 | $4.00 |
| Qwen3-32B | $0.0003 | $0.28 | $2.80 |
| GPT-5 Nano | $0.0003 | $0.30 | $3.00 |
| DeepSeek V3.1 | $0.0007 | $0.68 | $6.75 |
| Mistral Small 3.1 | $0.0001 | $0.12 | $1.15 |
| GPT-4.1 Nano | $0.0004 | $0.40 | $4.00 |
| GPT-4o Mini | $0.0006 | $0.60 | $6.00 |
At these prices, even 100,000 invoices/year costs about $40 with Gemini 2.0 Flash.
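The cost math behind the table is simple: a minimal sketch, assuming the token counts above (2,000 in / 500 out) and the per-million-token prices from the model table.

```python
# Per-invoice cost = (input_tokens * input_price + output_tokens * output_price) / 1M.
# Prices are $ per million tokens, copied from the ranking table above.
PRICES = {  # model id -> (input $/M, output $/M)
    "google/gemini-2.0-flash-001": (0.10, 0.40),
    "qwen/qwen3-32b": (0.08, 0.24),
    "mistralai/mistral-small-3.1-24b-instruct": (0.03, 0.11),
}

def invoice_cost(model: str, input_tokens: int = 2000, output_tokens: int = 500) -> float:
    """Estimated dollar cost for a single invoice."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Cost per 1,000 invoices with Gemini 2.0 Flash:
print(round(invoice_cost("google/gemini-2.0-flash-001") * 1000, 2))  # → 0.4
```

Swapping in any other model's prices reproduces the corresponding table row.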
Use Case 2: General AI Tasks (Matching, Scoring, Anomaly Detection, Summaries)
Requirements: Property/vendor matching, confidence scoring, anomaly detection in rent rolls, AI summaries. Light reasoning, not as format-critical.
Top Picks — Ranked by Value
| Rank | Model | Input/M | Output/M | Why |
|---|---|---|---|---|
| 1 | Qwen3-235B-A22B (2507) (qwen/qwen3-235b-a22b-2507) | $0.071 | $0.10 | Huge MoE model at an exceptionally low price: 235B params, 22B active. Supports reasoning mode. Best reasoning-per-dollar available. |
| 2 | Gemini 2.0 Flash (google/gemini-2.0-flash-001) | $0.10 | $0.40 | Same model as Use Case 1. Good enough for both tasks, which simplifies your stack to one model. |
| 3 | DeepSeek V3.1 (deepseek/deepseek-chat-v3.1) | $0.15 | $0.75 | Strong reasoning. Hybrid thinking/non-thinking modes. Excels at data analysis. |
| 4 | Llama 4 Scout (meta-llama/llama-4-scout) | $0.08 | $0.30 | 327K context, multimodal, good general intelligence for the price. |
| 5 | GPT-4o Mini (openai/gpt-4o-mini) | $0.15 | $0.60 | Reliable all-rounder if you want one model for everything and trust OpenAI. |
RECOMMENDATION: Best Strategy for UnitCycle
Primary Model: Google Gemini 2.0 Flash (google/gemini-2.0-flash-001)
- $0.10 input / $0.40 output per million tokens
- Use for BOTH invoice extraction AND general AI tasks
- Native structured output (json_schema mode) via OpenRouter
- 1M token context window (handles any document)
- 97.1% quality on diverse benchmarks at lowest cost tier
- Multimodal (can process images too if needed later)
- Battle-tested, production-stable from Google
Fallback / Cost-Optimized: Qwen3-32B (qwen/qwen3-32b)
- $0.08 input / $0.24 output — even cheaper
- Use as fallback if Gemini has rate limits or outages
- Native structured output support
- Open-source model, multiple providers on OpenRouter
Heavy Reasoning Tasks: Qwen3-235B-A22B-2507 (qwen/qwen3-235b-a22b-2507)
- $0.071 input / $0.10 output — cheapest "frontier-class" reasoning
- Use for anomaly detection, complex pattern matching, AI summaries requiring deeper analysis
- Has reasoning/thinking mode for chain-of-thought
NOT Recommended:
- Claude 3.5 Haiku ($0.80/$4.00) — 8x more expensive than Gemini Flash for marginal quality gain on extraction tasks
- Claude Haiku 4.5 ($1.00/$5.00) — Even more expensive. Save Claude for tasks that truly need it.
- DeepSeek R1 ($0.70/$2.50) — Overkill reasoning model for these tasks. Use V3.1 instead.
- Free models — Rate-limited (20 req/min, 200 req/day). Not viable for production.
Implementation Notes
OpenRouter API Call (Invoice Extraction Example)
```python
import requests

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_OPENROUTER_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "google/gemini-2.0-flash-001",
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "vendor_name": {"type": "string"},
                        "invoice_number": {"type": "string"},
                        "invoice_date": {"type": "string"},
                        "due_date": {"type": "string"},
                        "total_amount": {"type": "number"},
                        "line_items": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "description": {"type": "string"},
                                    "quantity": {"type": "number"},
                                    "unit_price": {"type": "number"},
                                    "amount": {"type": "number"},
                                    "gl_code": {"type": "string"}
                                }
                            }
                        }
                    },
                    "required": ["vendor_name", "total_amount", "line_items"]
                }
            }
        },
        "messages": [
            {"role": "system", "content": "Extract invoice data from the OCR text. Return structured JSON."},
            {"role": "user", "content": "<OCR markdown text here>"}
        ]
    }
)
```
Key OpenRouter Features to Use:
- response_format: json_schema: forces valid JSON matching your exact Pydantic schema
- Provider routing: OpenRouter auto-routes to the cheapest/fastest provider for each model
- Response healing plugin: auto-fixes malformed JSON responses (a safety net)
- Streaming: supported for structured outputs too
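Even with strict json_schema mode, the extracted fields arrive as a JSON string inside the message content of the standard chat-completions envelope and must be decoded. A minimal sketch; the mocked response body below is an assumption illustrating the OpenAI-compatible shape OpenRouter returns:

```python
import json

def parse_extraction(response_json: dict) -> dict:
    """Pull the model's message content out of the chat-completions envelope
    and decode it; with strict json_schema mode the content should already
    be valid JSON matching the requested schema."""
    content = response_json["choices"][0]["message"]["content"]
    return json.loads(content)

# Mocked response body for illustration (not real API output):
mock = {"choices": [{"message": {"content":
    '{"vendor_name": "Acme Plumbing", "total_amount": 1250.00, "line_items": []}'}}]}

data = parse_extraction(mock)
print(data["vendor_name"])  # → Acme Plumbing
```

In production you would call parse_extraction(response.json()) on the requests response from the example above, and validate the result against your Pydantic model.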
Tiered Strategy for Production:
- Tier 1 (default): Gemini 2.0 Flash — handles 95% of invoices
- Tier 2 (complex): If extraction confidence < 80%, retry with DeepSeek V3.1 or Qwen3-235B
- Tier 3 (manual): If both fail, flag for human review
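The three tiers above reduce to a simple retry loop. A sketch under stated assumptions: extract_invoice and confidence are hypothetical stand-ins for your extraction call and confidence scorer, not real library functions.

```python
# Tier order mirrors the strategy above: cheap default first, stronger retry second.
TIERS = [
    "google/gemini-2.0-flash-001",  # Tier 1: default
    "deepseek/deepseek-chat-v3.1",  # Tier 2: retry for complex invoices
]

def extract_with_fallback(ocr_text, extract_invoice, confidence, threshold=0.80):
    """Return (result, model) from the first tier whose extraction clears the
    confidence threshold; (None, None) means flag for human review (Tier 3)."""
    for model in TIERS:
        result = extract_invoice(ocr_text, model)
        if result is not None and confidence(result) >= threshold:
            return result, model
    return None, None

# Usage with stubbed helpers (hypothetical; replace with real calls):
stub_extract = lambda text, model: {"vendor_name": "Acme", "total_amount": 10.0}
stub_confidence = lambda result: 0.95
result, model = extract_with_fallback("<ocr text>", stub_extract, stub_confidence)
print(model)  # → google/gemini-2.0-flash-001
```

Because Tier 2 only fires on low-confidence extractions, the blended cost stays close to the Tier 1 price even if a few percent of invoices escalate.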
Sources
- OpenRouter Structured Outputs Documentation
- Claude vs GPT vs Gemini: Which AI Wins at Invoice Extraction?
- I Tested 15 LLMs: Claude, GPT, Gemini on 38 Tasks
- LLMs for Structured Data Extraction from PDFs in 2026
- DeepSeek V3.2 Prompting Techniques
- Gemini API Pricing 2026
- Top AI Models on OpenRouter March 2026
- Best Qwen Models in 2026
- OpenRouter Pricing Calculator
- Gemini 2.5 Flash Lite on OpenRouter
- Structured outputs with OpenRouter via Instructor