Key Takeaway
AI cost optimization is not about spending less -- it is about spending deliberately. The highest-impact lever for most teams is matching model capability to task complexity: using your most capable model only where it matters and routing everything else to faster, cheaper alternatives.
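The routing lever described above can be sketched as a thin dispatch layer in front of your LLM calls. The model names and the complexity heuristic below are illustrative assumptions, not a prescription; in practice you would tune the heuristic (or use a learned classifier) against your own traffic.

```python
# Minimal sketch of capability-based model routing.
# Model names and the complexity heuristic are illustrative assumptions.

CHEAP_MODEL = "gpt-4o-mini"   # fast, low-cost default
FRONTIER_MODEL = "gpt-4o"     # reserved for genuinely hard tasks


def estimate_complexity(task: str) -> str:
    """Crude heuristic: long or multi-step/analytical tasks count as 'hard'."""
    hard_markers = ("analyze", "refactor", "multi-step", "prove")
    if len(task) > 500 or any(m in task.lower() for m in hard_markers):
        return "hard"
    return "easy"


def route_model(task: str) -> str:
    """Send hard tasks to the frontier model, everything else to the cheap one."""
    return FRONTIER_MODEL if estimate_complexity(task) == "hard" else CHEAP_MODEL
```

Even a heuristic this crude captures the core idea: the expensive model becomes an explicit, opt-in choice rather than the default.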
Prerequisites
- At least one AI workload running in production with observable cost data
- Access to billing dashboards for your cloud provider and/or LLM API provider
- Basic understanding of token-based pricing for LLM APIs
- Familiarity with your application's query patterns and traffic volumes
- A cost tracking system or the ability to implement one (even a spreadsheet to start)
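Token-based pricing, the third prerequisite above, reduces to simple per-token arithmetic. The prices in this sketch are placeholders, not current rates; always check your provider's rate card.

```python
# Token-based pricing: cost = tokens / 1_000_000 * price_per_million_tokens.
# Prices below are illustrative placeholders, not current rates.
INPUT_PRICE_PER_M = 2.50    # USD per 1M prompt (input) tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M completion (output) tokens


def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of a single API call."""
    return (prompt_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_M


# A 2,000-token prompt with a 500-token response at these rates:
# 2000/1M * 2.50 + 500/1M * 10.00 = 0.005 + 0.005 = 0.01 USD per call
```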
The AI Cost Problem
AI workloads follow a different cost curve than traditional software. Traditional SaaS applications scale costs roughly linearly with users: more users mean more compute, storage, and bandwidth, but the cost per user stays relatively stable. AI workloads break this model. Every inference call has a non-trivial marginal cost, and that cost varies dramatically based on the model, the prompt length, and the response complexity. A single feature powered by a frontier LLM can cost more per API call than your entire application server costs per request.
The challenge is compounded by the fact that AI costs are often opaque until the bill arrives. Engineering teams build features using the most capable model available during development, hard-code prompt templates that are longer than necessary, and skip caching because the traffic is low in staging. Then the feature launches, traffic scales, and the monthly bill becomes a conversation topic in the executive team meeting.
Cost Anatomy: Where the Money Goes
Before you can optimize costs, you need to understand where they accumulate. The breakdown below shows the primary cost centers in a typical AI application stack. Most teams discover that one or two cost centers dominate their bill, and targeted optimization of those centers yields better results than trying to optimize everything at once.
- LLM API calls (40-60%): The largest cost driver for most AI applications. Prompt and completion tokens at frontier model prices dominate the bill.
- Compute & GPU (15-25%): GPU instances for self-hosted models, fine-tuning jobs, and embedding generation. Often over-provisioned.
- Storage & embeddings (10-20%): Vector databases, model artifact storage, training data, and embedding indices.
- Monitoring & tooling (5-15%): Observability platforms, experiment tracking, evaluation pipelines, and MLOps infrastructure.
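A first pass at finding your dominant cost centers can be as simple as tagging billing line items with a category and summing. The categories and dollar figures below are hypothetical; substitute your own billing export.

```python
from collections import defaultdict

# Hypothetical monthly billing line items: (cost_center, usd)
line_items = [
    ("llm_api", 4200.0), ("llm_api", 1800.0),
    ("gpu_compute", 2100.0),
    ("storage", 900.0),
    ("monitoring", 500.0),
]


def dominant_cost_centers(items, threshold=0.5):
    """Return the cost centers that together exceed `threshold` of total
    spend, largest first."""
    totals = defaultdict(float)
    for center, usd in items:
        totals[center] += usd
    total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    dominant, running = [], 0.0
    for center, usd in ranked:
        dominant.append(center)
        running += usd
        if running / total >= threshold:
            break
    return dominant
```

With the sample data, LLM API spend alone crosses the 50% threshold, which is exactly the "one or two centers dominate" pattern described above.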
Token Analysis & Optimization
For applications that rely on LLM API calls, token usage is the single largest cost driver. Every token in your prompt and every token in the model's response costs money. The good news is that most applications send far more tokens than necessary. Verbose system prompts, redundant context, unoptimized few-shot examples, and unbounded response lengths all contribute to inflated token counts. Optimizing token usage requires measuring it first.
Token Counting and Tracking
Before optimizing, instrument your application to track token usage per request. This gives you the baseline data you need to identify optimization targets and measure the impact of changes.
import time
from dataclasses import dataclass, field

import tiktoken


@dataclass
class TokenUsage:
    """Track token usage for a single LLM call."""
    prompt_tokens: int
    completion_tokens: int
    model: str
    endpoint: str
    timestamp: float = field(default_factory=time.time)
    cache_hit: bool = False
    estimated_cost_usd: float = 0.0


# Pricing per 1M tokens (input / output) -- update as prices change
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}


def estimate_cost(usage: TokenUsage) -> float:
    """Estimate cost in USD for a single LLM call."""
    pricing = MODEL_PRICING.get(usage.model)
    if not pricing:
        # Unknown model: report zero rather than guessing a price
        return 0.0
    input_cost = (usage.prompt_tokens / 1_000_000) * pricing["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given text using tiktoken."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a common encoding for unrecognized model names
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

Prompt Optimization Techniques
Prompt optimization is the lowest-effort, highest-impact cost reduction strategy for LLM-heavy applications. Most system prompts are written during development when token costs are not a concern and then never revisited. Common patterns that waste tokens include verbose role definitions, redundant instructions, overly detailed few-shot examples, and including context that the model does not need for the specific task.
def optimize_system_prompt(prompt: str) -> dict:
    """Analyze a system prompt and suggest optimizations.

    Returns a dict with the original token count,
    specific recommendations, and estimated savings.
    """
    token_count = count_tokens(prompt)
    recommendations = []

    # Check for common waste patterns
    lines = prompt.split("\n")

    # 1. Redundant instructions
    seen_instructions = set()
    for i, line in enumerate(lines):
        normalized = line.strip().lower()
        if normalized in seen_instructions and len(normalized) > 20:
            recommendations.append(
                f"Line {i + 1}: Duplicate instruction detected"
            )
        seen_instructions.add(normalized)

    # 2. Verbose phrasing
    verbose_patterns = {
        "I want you to act as": "You are",
        "Please make sure to": "",
        "It is important that you": "",
        "You should always remember to": "",
        "Under no circumstances should you ever": "Never",
    }
    for verbose, concise in verbose_patterns.items():
        if verbose.lower() in prompt.lower():
            replacement = f"Replace with '{concise}'" if concise else "Remove"
            recommendations.append(
                f"Verbose phrasing: '{verbose}' -> {replacement}"
            )

    # 3. Few-shot example length
    if prompt.count("Example:") > 3 or prompt.count("###") > 6:
        recommendations.append(
            "Consider reducing few-shot examples to 2-3 "
            "representative cases instead of exhaustive coverage"
        )

    return {
        "original_tokens": token_count,
        "recommendations": recommendations,
        "estimated_savings_pct": min(len(recommendations) * 5, 40),
    }

Run a prompt audit across your entire application. List every system prompt, measure its token count, and rank them by token count multiplied by daily call volume. The top three entries on that list are your highest-value optimization targets.
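The ranking step of that audit is easy to script. The prompt names, token counts, and call volumes below are made-up inputs standing in for your own audit data.

```python
def rank_prompt_targets(prompts, top_n=3):
    """Rank prompts by token_count * daily_call_volume, descending.

    `prompts` is an iterable of (name, token_count, daily_calls) tuples.
    """
    scored = [(name, tokens * calls) for name, tokens, calls in prompts]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical audit data: (prompt name, tokens, daily call volume)
audit = [
    ("support_triage", 1200, 50_000),   # 60,000,000 tokens/day
    ("summarizer", 400, 10_000),        #  4,000,000 tokens/day
    ("classifier", 250, 200_000),       # 50,000,000 tokens/day
    ("debug_helper", 2000, 100),        #    200,000 tokens/day
]
```

Note how the ranking surfaces the short but high-volume classifier prompt ahead of the long but rarely called debug prompt; raw token count alone would have pointed you at the wrong target.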