Key Takeaway
AI cost optimization is not about spending less -- it is about spending deliberately. The highest-impact lever for most teams is matching model capability to task complexity: using your most capable model only where it matters and routing everything else to faster, cheaper alternatives.
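The routing lever described above can be sketched as a thin dispatch layer in front of your LLM calls. The model names and the complexity heuristic below are illustrative assumptions, not a prescription; in practice you would tune the heuristic (or use a learned classifier) against your own traffic.

```python
# Minimal sketch of capability-based model routing.
# Model names and the complexity heuristic are illustrative assumptions.

CHEAP_MODEL = "gpt-4o-mini"   # fast, low-cost default
FRONTIER_MODEL = "gpt-4o"     # reserved for genuinely hard tasks


def estimate_complexity(task: str) -> str:
    """Crude heuristic: long or multi-step/analytical tasks count as 'hard'."""
    hard_markers = ("analyze", "refactor", "multi-step", "prove")
    if len(task) > 500 or any(m in task.lower() for m in hard_markers):
        return "hard"
    return "easy"


def route_model(task: str) -> str:
    """Send hard tasks to the frontier model, everything else to the cheap one."""
    return FRONTIER_MODEL if estimate_complexity(task) == "hard" else CHEAP_MODEL
```

Even a heuristic this crude captures the core idea: the expensive model becomes an explicit, opt-in choice rather than the default.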
Prerequisites
- At least one AI workload running in production with observable cost data
- Access to billing dashboards for your cloud provider and/or LLM API provider
- Basic understanding of token-based pricing for LLM APIs
- Familiarity with your application's query patterns and traffic volumes
- A cost tracking system or the ability to implement one (even a spreadsheet to start)
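Token-based pricing, the third prerequisite above, reduces to simple per-token arithmetic. The prices in this sketch are placeholders, not current rates; always check your provider's rate card.

```python
# Token-based pricing: cost = tokens / 1_000_000 * price_per_million_tokens.
# Prices below are illustrative placeholders, not current rates.
INPUT_PRICE_PER_M = 2.50    # USD per 1M prompt (input) tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M completion (output) tokens


def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of a single API call."""
    return (prompt_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_M


# A 2,000-token prompt with a 500-token response at these rates:
# 2000/1M * 2.50 + 500/1M * 10.00 = 0.005 + 0.005 = 0.01 USD per call
```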
The AI Cost Problem
AI workloads follow a different cost curve than traditional software. Traditional SaaS applications scale costs roughly linearly with users: more users mean more compute, storage, and bandwidth, but the cost per user stays relatively stable. AI workloads break this model. Every inference call has a non-trivial marginal cost, and that cost varies dramatically based on the model, the prompt length, and the response complexity. A single feature powered by a frontier LLM can cost more per API call than your entire application server costs per request.
The challenge is compounded by the fact that AI costs are often opaque until the bill arrives. Engineering teams build features using the most capable model available during development, hard-code prompt templates that are longer than necessary, and skip caching because the traffic is low in staging. Then the feature launches, traffic scales, and the monthly bill becomes a conversation topic in the executive team meeting.
Cost Anatomy: Where the Money Goes
Before you can optimize costs, you need to understand where they accumulate. The breakdown below shows the primary cost centers in a typical AI application stack. Most teams discover that one or two cost centers dominate their bill, and targeted optimization of those centers yields better results than trying to optimize everything at once.
- LLM API calls (40-60%): The largest cost driver for most AI applications. Prompt and completion tokens at frontier model prices dominate the bill.
- Compute & GPU (15-25%): GPU instances for self-hosted models, fine-tuning jobs, and embedding generation. Often over-provisioned.
- Storage & embeddings (10-20%): Vector databases, model artifact storage, training data, and embedding indices.
- Monitoring & tooling (5-15%): Observability platforms, experiment tracking, evaluation pipelines, and MLOps infrastructure.
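A first pass at finding your dominant cost centers can be as simple as tagging billing line items with a category and summing. The categories and dollar figures below are hypothetical; substitute your own billing export.

```python
from collections import defaultdict

# Hypothetical monthly billing line items: (cost_center, usd)
line_items = [
    ("llm_api", 4200.0), ("llm_api", 1800.0),
    ("gpu_compute", 2100.0),
    ("storage", 900.0),
    ("monitoring", 500.0),
]


def dominant_cost_centers(items, threshold=0.5):
    """Return the cost centers that together exceed `threshold` of total
    spend, largest first."""
    totals = defaultdict(float)
    for center, usd in items:
        totals[center] += usd
    total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    dominant, running = [], 0.0
    for center, usd in ranked:
        dominant.append(center)
        running += usd
        if running / total >= threshold:
            break
    return dominant
```

With the sample data, LLM API spend alone crosses the 50% threshold, which is exactly the "one or two centers dominate" pattern described above.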
Token Analysis & Optimization
For applications that rely on LLM API calls, token usage is the single largest cost driver. Every token in your prompt and every token in the model's response costs money. The good news is that most applications send far more tokens than necessary. Verbose system prompts, redundant context, unoptimized few-shot examples, and unbounded response lengths all contribute to inflated token counts. Optimizing token usage requires measuring it first.
Token Counting and Tracking
Before optimizing, instrument your application to track token usage per request. This gives you the baseline data you need to identify optimization targets and measure the impact of changes.
import time
from dataclasses import dataclass, field

import tiktoken


@dataclass
class TokenUsage:
    """Track token usage for a single LLM call."""
    prompt_tokens: int
    completion_tokens: int
    model: str
    endpoint: str
    timestamp: float = field(default_factory=time.time)
    cache_hit: bool = False
    estimated_cost_usd: float = 0.0


# Pricing per 1M tokens (input / output) -- update as prices change
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}


def estimate_cost(usage: TokenUsage) -> float:
    """Estimate cost in USD for a single LLM call."""
    pricing = MODEL_PRICING.get(usage.model)
    if not pricing:
        # Unknown model: report zero rather than guessing a price
        return 0.0
    input_cost = (usage.prompt_tokens / 1_000_000) * pricing["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * pricing["output"]
    return round(input_cost + output_cost, 6)


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for a given text using tiktoken."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a common encoding for unrecognized model names
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

Prompt Optimization Techniques
Prompt optimization is the lowest-effort, highest-impact cost reduction strategy for LLM-heavy applications. Most system prompts are written during development when token costs are not a concern and then never revisited. Common patterns that waste tokens include verbose role definitions, redundant instructions, overly detailed few-shot examples, and including context that the model does not need for the specific task.
def optimize_system_prompt(prompt: str) -> dict:
    """Analyze a system prompt and suggest optimizations.

    Returns a dict with the original token count,
    specific recommendations, and estimated savings.
    """
    token_count = count_tokens(prompt)
    recommendations = []

    # Check for common waste patterns
    lines = prompt.split("\n")

    # 1. Redundant instructions
    seen_instructions = set()
    for i, line in enumerate(lines):
        normalized = line.strip().lower()
        if normalized in seen_instructions and len(normalized) > 20:
            recommendations.append(
                f"Line {i + 1}: Duplicate instruction detected"
            )
        seen_instructions.add(normalized)

    # 2. Verbose phrasing
    verbose_patterns = {
        "I want you to act as": "You are",
        "Please make sure to": "",
        "It is important that you": "",
        "You should always remember to": "",
        "Under no circumstances should you ever": "Never",
    }
    for verbose, concise in verbose_patterns.items():
        if verbose.lower() in prompt.lower():
            replacement = f"Replace with '{concise}'" if concise else "Remove"
            recommendations.append(
                f"Verbose phrasing: '{verbose}' -> {replacement}"
            )

    # 3. Few-shot example length
    if prompt.count("Example:") > 3 or prompt.count("###") > 6:
        recommendations.append(
            "Consider reducing few-shot examples to 2-3 "
            "representative cases instead of exhaustive coverage"
        )

    return {
        "original_tokens": token_count,
        "recommendations": recommendations,
        "estimated_savings_pct": min(len(recommendations) * 5, 40),
    }

Run a prompt audit across your entire application. List every system prompt, measure its token count, and rank them by token count multiplied by daily call volume. The top three entries on that list are your highest-value optimization targets.
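The ranking step of that audit is easy to script. The prompt names, token counts, and call volumes below are made-up inputs standing in for your own audit data.

```python
def rank_prompt_targets(prompts, top_n=3):
    """Rank prompts by token_count * daily_call_volume, descending.

    `prompts` is an iterable of (name, token_count, daily_calls) tuples.
    """
    scored = [(name, tokens * calls) for name, tokens, calls in prompts]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical audit data: (prompt name, tokens, daily call volume)
audit = [
    ("support_triage", 1200, 50_000),   # 60,000,000 tokens/day
    ("summarizer", 400, 10_000),        #  4,000,000 tokens/day
    ("classifier", 250, 200_000),       # 50,000,000 tokens/day
    ("debug_helper", 2000, 100),        #    200,000 tokens/day
]
```

Note how the ranking surfaces the short but high-volume classifier prompt ahead of the long but rarely called debug prompt; raw token count alone would have pointed you at the wrong target.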