A few years ago, most of us were building software the same way we had for decades: writing code, calling APIs, processing data through deterministic pipelines. Then ChatGPT launched, and within months, the conversation shifted from “can AI write code?” to “how do we architect systems that leverage LLMs in production?” The change wasn’t just technological – it fundamentally altered how we think about building software. We’re no longer just writing programs that follow explicit rules; we’re orchestrating systems that reason, generate, and adapt.
This isn’t hype. We’ve integrated LLMs into code review pipelines, customer support systems, and data analysis workflows at scale. But to build production systems with these models, you need to understand what’s actually happening under the hood. Not because you’ll be training transformers from scratch, but because token economics, context windows, and sampling strategies directly impact your architecture decisions and AWS bill.
What Generative AI Actually Is
Generative AI, specifically large language models, differs fundamentally from the discriminative models most engineers have worked with. A discriminative model (think: spam classifier, image recognizer) learns decision boundaries – it answers “which category does this input belong to?” An LLM, by contrast, learns probability distributions over sequences. It answers “what comes next?”
This distinction matters for system design. When you call an LLM, you’re not getting a classification with a confidence score. You’re sampling from a probability distribution over your vocabulary. The model predicts the next token, then the next, then the next, in an autoregressive loop until it hits a stopping condition (an end token, max length, or you terminate it).
Why does “generative” matter for software systems? Because the output is non-deterministic by default. Call the same prompt twice and you might get different responses. For traditional software engineering, this is uncomfortable. We’re used to deterministic functions: same input, same output. With LLMs, you’re working with stochastic systems, and your architecture needs to account for that (caching strategies, result validation, retry logic that doesn’t just hammer the same failing prompt).
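One common way to tame that non-determinism for repeated requests is to cache responses keyed by the prompt and generation parameters. The sketch below is a minimal illustration under assumptions, not a production cache: call_llm is a hypothetical wrapper around your provider's API, and the in-memory dict stands in for Redis or another shared store.
import hashlib
import json

# In-memory cache keyed by prompt + parameters. In production, swap the dict
# for Redis or another shared cache.
_cache: dict[str, str] = {}

def cache_key(prompt: str, model: str, temperature: float) -> str:
    """Hash the prompt and generation parameters into a stable cache key."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(prompt: str, model: str = "gpt-4-turbo", temperature: float = 0.0) -> str:
    key = cache_key(prompt, model, temperature)
    if key in _cache:
        return _cache[key]  # identical request: reuse the stored response
    result = call_llm(prompt, model, temperature)  # hypothetical LLM wrapper
    _cache[key] = result
    return result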
Tokens: The Currency of LLMs
If you’re budgeting by API calls, you’re doing it wrong. LLMs operate on tokens, not characters or words. Understanding tokenization is non-negotiable because tokens directly map to cost, latency, and context limits.
Tokenization breaks text into subword units. Most modern LLMs use Byte Pair Encoding (BPE) or variants like WordPiece. Here’s the mental model: common words become single tokens (“the”, “is”, “building”). Less common words split into subwords (“tokenization” might become [“token”, “ization”]). This means your API costs aren’t linear with character count.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Understanding tokenization is critical for cost management."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Character count: {len(text)}")
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
# Output shows ~10 tokens for ~60 characters
# Ratio varies by language and domain vocabulary
Context windows define how many tokens the model can process in a single call (input + output combined). GPT-4 offers 8K, 32K, or 128K token contexts. Claude 3 goes up to 200K. These aren’t just marketing numbers – they’re hard architectural constraints.
Here’s the production implication: your average customer support conversation might be 50 messages. At ~100 tokens per message (conservative estimate), you’re at 5K tokens for history alone. Add your system prompt (500 tokens), the current user query (200 tokens), and you’re using 5.7K of an 8K context window before the model even starts generating. You hit the limit fast, and when you do, you need strategies: truncate old messages, summarize history, or move to a larger context model (which costs more per token).
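A small helper makes that budget explicit. The sketch below trims the oldest messages until the conversation fits under a token budget, using tiktoken (shown earlier) for counting; the message format, context limit, and output reserve are illustrative assumptions, not fixed values.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(message: dict) -> int:
    """Rough per-message token count (ignores the small per-message overhead)."""
    return len(enc.encode(message["content"]))

def fit_to_budget(system_prompt: str, history: list[dict], user_query: str,
                  context_limit: int = 8_000, reserve_for_output: int = 1_000) -> list[dict]:
    """Drop the oldest history messages until the request fits the context window."""
    budget = context_limit - reserve_for_output
    fixed = len(enc.encode(system_prompt)) + len(enc.encode(user_query))
    trimmed = list(history)
    while trimmed and fixed + sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)  # discard the oldest message first
    return [{"role": "system", "content": system_prompt}, *trimmed,
            {"role": "user", "content": user_query}]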
Token pricing varies by model and provider. As of late 2024, rough estimates:
- GPT-4 Turbo: $10 per 1M input tokens, $30 per 1M output tokens
- Claude 3.5 Sonnet: $3 per 1M input, $15 per 1M output
- Llama 3 (self-hosted): Infrastructure costs only
A chatbot handling 1K conversations per day, averaging 5K tokens per conversation (input + output), runs through roughly 150M tokens per month and costs $1,500-$3,000 monthly on GPT-4 Turbo, depending on the input/output split. On Claude 3.5 Sonnet, that drops to $500-$900. Self-hosting Llama 3 on AWS might cost $2,000/month for GPU instances but scales better beyond certain volumes. Model selection isn’t just about quality – it’s a direct cost vs capability tradeoff.
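The back-of-envelope math is worth encoding so you can rerun it as prices and traffic change. A rough estimator under the pricing listed above (prices change, so treat the numbers as placeholders), assuming an 80/20 input/output split:
# Rough monthly cost estimator. Prices are the late-2024 figures quoted above,
# in dollars per 1M tokens; update them as providers change pricing.
PRICES = {
    "gpt-4-turbo": (10.0, 30.0),
    "claude-3-5-sonnet": (3.0, 15.0),
}

def monthly_cost(model: str, conversations_per_day: int, tokens_per_conversation: int,
                 input_share: float = 0.8, days: int = 30) -> float:
    """Estimate monthly spend assuming a fixed input/output token split."""
    input_price, output_price = PRICES[model]
    total_tokens = conversations_per_day * tokens_per_conversation * days
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price

# 1K conversations/day at 5K tokens each:
print(monthly_cost("gpt-4-turbo", 1_000, 5_000))        # ~$2,100
print(monthly_cost("claude-3-5-sonnet", 1_000, 5_000))  # ~$810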
Track token usage:
from openai import OpenAI
import logging
client = OpenAI()
def call_llm_with_tracking(prompt: str, model: str = "gpt-4-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    cost_input = (usage.prompt_tokens / 1_000_000) * 10
    cost_output = (usage.completion_tokens / 1_000_000) * 30
    total_cost = cost_input + cost_output
    logging.info(
        f"LLM call: {usage.prompt_tokens} input, "
        f"{usage.completion_tokens} output tokens. "
        f"Cost: ${total_cost:.4f}"
    )
    return response.choices[0].message.content
This logging hooks into your monitoring stack (Datadog, CloudWatch, whatever you use) and surfaces cost per endpoint, per user, per feature. Without it, you’re flying blind on your biggest operational expense.
How LLMs Generate Text
Transformer architecture underpins modern LLMs. You don’t need to derive the math, but the attention mechanism is worth understanding conceptually. Traditional RNNs processed sequences sequentially – token 1, then token 2, then token 3. This created bottlenecks for long sequences. Transformers process all positions simultaneously and use attention to let each token “look at” all other tokens to determine relevance.
When generating text, the model computes attention scores: “given the previous tokens, which parts of the input should I focus on for this next prediction?” This is why LLMs can maintain coherence over long contexts – they’re not forgetting earlier parts of the conversation; they’re actively attending to them.
Pre-training, fine-tuning, and prompting are three ways to adapt model behavior:
Pre-training: Train on massive text corpora to learn language structure (done by OpenAI, Anthropic, Meta). You don’t do this; you use their pre-trained models.
Fine-tuning: Further train the model on domain-specific data. If you’re building a legal document analyzer, fine-tune on legal texts. Costs $100-$1000 per run depending on dataset size and model. Effective when you have thousands of high-quality examples and need specialized behavior.
Prompting: Provide instructions and examples in the input itself (zero-shot, few-shot, or chain-of-thought). This is your primary tool. It’s fast, requires no training infrastructure, and works surprisingly well. Most production systems rely on sophisticated prompting rather than fine-tuning.
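Since prompting carries most of the load in practice, here is what few-shot prompting looks like in an API call. The sentiment-labeling task and example reviews are made up for illustration; the pattern is what matters: instructions in the system message, a handful of worked examples, then the real input.
from openai import OpenAI

client = OpenAI()

# Few-shot prompting: show the model worked examples before the real input.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Classify the sentiment of each review as positive, negative, or neutral. Reply with the label only."},
        {"role": "user", "content": "Review: The checkout flow was painless and fast."},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "Review: Support never answered my ticket."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "Review: The new dashboard loads, but the export button is broken."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)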
Temperature and sampling strategies control output randomness:
# Temperature controls randomness
response_creative = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a tagline for a space tourism company."}],
    temperature=1.2,  # Higher = more random/creative
)
response_deterministic = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0,  # Lower = more deterministic
)
# Top-p (nucleus sampling) alternative to temperature
response_nucleus = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    temperature=0.7,
    top_p=0.9,  # Only sample from top 90% probability mass
)
For deterministic tasks (data extraction, classification), use temperature=0. For creative tasks (content generation, brainstorming), use temperature=0.7-1.0. For production APIs where users expect consistent behavior, default to low temperature unless they explicitly want variation.
Reproducibility is tricky. Even with temperature=0, different API calls might return slightly different results due to floating-point non-determinism in GPU operations. OpenAI added a seed parameter for more consistent outputs:
response1 = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize: [text]"}],
    temperature=0.0,
    seed=42,
)
response2 = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize: [text]"}],
    temperature=0.0,
    seed=42,
)
This helps with regression testing and debugging, though it’s not a 100% guarantee across API versions or infrastructure changes.
Model Landscape
Choosing a model is an architecture decision, not just a hyperparameter. Each model has different capabilities, pricing, rate limits, and operational characteristics.
OpenAI GPT Family:
- GPT-5: Flagship model with built-in adaptive thinking for expert-level responses across math, coding, vision, and more. Supports up to 1M+ context, multimodal inputs, and agentic workflows. $1.25/$10 per 1M tokens (input/output). Rate limit: Tier-dependent, up to 20K RPM for enterprise. Use for advanced reasoning, coding projects, and real-world problem-solving where top performance is essential.
- GPT-4 Turbo: Best reasoning, supports 128K context, function calling, JSON mode. $10/$30 per 1M tokens (input/output). Rate limit: 10K RPM for paid tier. Use for complex reasoning tasks where quality justifies cost.
- GPT-4o: Multimodal (text, vision, audio). Fast inference. $5/$15 per 1M tokens. Use for applications needing image understanding or lower latency.
- GPT-3.5 Turbo: Cheaper, faster, less capable. $0.50/$1.50 per 1M tokens. Use for simple tasks (classification, basic Q&A) where GPT-4 is overkill.
Anthropic Claude:
- Claude Opus 4.5: Flagship model excelling in coding (outperforms human engineers on benchmarks), long-context handling with selective compression, and agentic workflows like computer use and automation. 200K context window. $5/$25 per 1M tokens (67% price reduction from prior Opus). Use for enterprise-grade, multi-step problem-solving, sustained creative tasks, and high-stakes development where top-tier intelligence justifies the premium.
- Claude 4.5 Sonnet: Enhanced mid-tier model with superior reasoning, multilingual support, and vision capabilities. Supports up to 1M context in beta for extended workflows. $3/$15 per 1M tokens (long context pricing applies beyond 200K). Use for balanced performance in coding, agentic tasks, and knowledge-intensive applications where efficiency meets quality.
- Claude 3.5 Sonnet: 200K context window (massive for document analysis). Strong at following instructions. $3/$15 per 1M tokens. Use when you need large contexts or precise instruction following.
- Claude 3 Opus: Highest capability, slower, more expensive. For complex multi-step reasoning.
- Claude 3 Haiku: Fastest, cheapest. For high-throughput simple tasks.
Open Source Options:
- Llama 3.1 (Meta): Enhanced successor with 8B, 70B, and 405B parameter variants. Supports 128K context window and multilingual capabilities. Free under permissive license. Competitive with GPT-4 in reasoning and coding benchmarks. Use for scalable, high-performance applications like long-form content generation or enterprise RAG systems where customization and cost control are key.
- Gemma 2 (Google): Lightweight 2B, 9B, and 27B models optimized for efficiency. Strong in instruction-following and safety alignments. Runs on modest hardware like laptops. Use for mobile/edge AI, educational tools, or low-latency inference in resource-limited setups.
- Qwen2.5 (Alibaba): Versatile series from 0.5B to 72B parameters, with specialized coding and math variants. Excels in multilingual tasks and long-context handling up to 128K tokens. Fully open-source under Apache 2.0. Use for global applications, developer tools, or high-throughput tasks balancing quality and speed.
- Llama 3 (Meta): 8B and 70B parameter versions. Free to use, host yourself. Performance competitive with GPT-3.5. Use when data privacy is critical or volume is high enough that self-hosting economics make sense.
- Mistral 7B: Efficient, good quality for size. Can run on consumer hardware. Use for edge deployments or resource-constrained environments.
The latency-cost-quality triangle forces tradeoffs. GPT-5 is slower (2-4 seconds for typical requests, with options for lower-latency modes) and more expensive but produces the best outputs across reasoning, coding, and multimodal tasks. GPT-4o is 2-3x faster (under 1 second for simple prompts) and 4-8x cheaper than GPT-5 but makes occasional errors in complex scenarios. Llama 3.1 on dedicated GPUs (e.g., H100/H200 clusters with vLLM or TensorRT-LLM) can match or exceed GPT-4o’s speed (up to 100+ tokens/second per user for 70B variants) with near-zero per-request costs at high scale, though it demands infrastructure expertise and upfront hardware investment.
Rate limits are real constraints. OpenAI’s free tier caps at ~3 RPM (requests per minute) with minimal TPM (tokens per minute). Paid tiers start at 200-500 RPM for GPT-5 in Tier 1 and scale to 3,500+ RPM in higher tiers, still well below the 10K RPM quoted above for GPT-4 Turbo at the top paid tier. If you’re building a product that spikes to 100K users, you need to request limit increases weeks in advance or architect around the limits (request queuing, load shedding, fallback to cheaper models like GPT-5-mini under load).
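When you do hit those limits, the API returns a rate-limit error and the right response is to back off, not hammer the endpoint. A minimal sketch using the tenacity library against the OpenAI client (any backoff helper works); the attempt count and wait bounds are arbitrary starting points, not recommendations.
from openai import OpenAI, RateLimitError, APITimeoutError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = OpenAI()

@retry(
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    wait=wait_exponential(multiplier=1, min=1, max=30),  # 1s, 2s, 4s, ... capped at 30s
    stop=stop_after_attempt(5),
)
def call_with_backoff(prompt: str, model: str = "gpt-4-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content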
Model selection strategy:
- Classify request complexity: Simple (classification, keyword extraction) vs Complex (reasoning, synthesis).
- Route accordingly: Simple requests -> GPT-3.5 or Claude Haiku. Complex -> GPT-4 or Claude Sonnet.
- Monitor quality: Track user feedback, error rates, task completion.
- Adjust routing: If GPT-3.5 error rate is too high on “simple” tasks, upgrade the threshold.
async def route_request(user_query: str, context: dict):
    """Route to appropriate model based on complexity."""
    complexity = estimate_complexity(user_query, context)
    if complexity == "simple":
        model = "gpt-3.5-turbo"
        max_tokens = 500
    elif complexity == "medium":
        model = "gpt-4-turbo"
        max_tokens = 1000
    else:  # complex
        model = "claude-3-5-sonnet"
        max_tokens = 2000
    return await call_llm(user_query, model, max_tokens)

def estimate_complexity(query: str, context: dict) -> str:
    """Heuristic-based complexity estimation."""
    # In production, this could be an LLM call itself (meta!)
    # or rule-based (query length, context size, user history)
    if len(query.split()) < 10 and not context.get("requires_reasoning"):
        return "simple"
    elif "analyze" in query.lower() or "compare" in query.lower():
        return "complex"
    return "medium"
This routing layer cuts costs by 60-80% compared to using GPT-4 for everything, with minimal quality degradation on tasks that don’t need it.
Architecture Patterns
How you integrate LLMs into your system matters as much as which model you choose.
Direct API Calls vs Abstraction Layers:
Direct calls are simple:
from openai import OpenAI
client = OpenAI(api_key="...")
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
This works for prototypes but lacks flexibility. If you want to switch models, add fallbacks, or insert caching, you’re rewriting every call site. Abstraction layers (LangChain, LlamaIndex, or your own) provide:
- Provider-agnostic interfaces (swap OpenAI for Anthropic with config change)
- Built-in retries, rate limiting, circuit breakers
- Prompt templating and versioning
- Observability hooks (logging, tracing)
The tradeoff is complexity and dependencies. For small projects, direct calls are fine. For production systems with multiple models and complex orchestration, abstractions pay off.
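If you roll your own layer, the core of it can be small. A minimal sketch of a provider-agnostic interface, assuming the official openai and anthropic SDKs; the class and method names are invented for illustration, not taken from any framework.
from abc import ABC, abstractmethod
from openai import OpenAI
from anthropic import Anthropic

class LLMProvider(ABC):
    """Minimal provider-agnostic interface; swap implementations via config."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 1000) -> str: ...

class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-4-turbo"):
        self.client = OpenAI()
        self.model = model

    def complete(self, prompt: str, max_tokens: int = 1000) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content

class AnthropicProvider(LLMProvider):
    def __init__(self, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic()
        self.model = model

    def complete(self, prompt: str, max_tokens: int = 1000) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
Call sites depend only on LLMProvider, so switching vendors, or adding retries and logging in one place, does not touch application code.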
Sync vs Async Invocation:
LLM calls are I/O-bound (network latency dominates). Synchronous calls block your application thread:
def handle_request(user_query):
    response = client.chat.completions.create(...)  # blocks here
    return response.choices[0].message.content

# If you need to call multiple times, it's sequential
results = [handle_request(q) for q in queries]  # 10 queries = 10-30 seconds
Asynchronous calls let you handle other work while waiting:
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI()
async def handle_request_async(user_query):
    response = await async_client.chat.completions.create(...)
    return response.choices[0].message.content

async def handle_batch(queries):
    tasks = [handle_request_async(q) for q in queries]
    return await asyncio.gather(*tasks)
For web servers (FastAPI, Flask), always use async. For batch jobs, async parallelizes within rate limit constraints.
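Parallel does not mean unbounded, though. A simple way to stay under rate limits is to cap in-flight requests with a semaphore; the limit of 10 below is an arbitrary placeholder you would tune to your tier.
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

# Cap concurrent LLM calls so a large batch doesn't blow through rate limits.
semaphore = asyncio.Semaphore(10)

async def handle_request_bounded(user_query: str) -> str:
    async with semaphore:
        response = await async_client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": user_query}],
        )
        return response.choices[0].message.content

async def handle_batch_bounded(queries: list[str]) -> list[str]:
    return await asyncio.gather(*(handle_request_bounded(q) for q in queries))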
Streaming Responses:
LLMs generate tokens sequentially. By default, you wait for the entire response. Streaming returns tokens as generated:
stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a long essay on AI."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
This improves perceived latency (user sees output immediately) and enables real-time UX (typing indicator, progressive rendering). Critical for chat interfaces. The downside is you can’t inspect the full response before sending it to the user (harder to filter inappropriate content or validate structure).
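A partial mitigation is to accumulate the chunks as you forward them, so the full text is still available afterwards for logging, structure validation, or retroactive moderation. A minimal sketch; the on_token callback is a stand-in for whatever pushes tokens to your UI.
from openai import OpenAI

client = OpenAI()

def stream_and_collect(prompt: str, on_token) -> str:
    """Forward tokens as they arrive, but keep the full text for post-hoc checks."""
    stream = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            on_token(delta)  # e.g. write to a websocket or SSE response
    full_text = "".join(parts)
    # full_text can now be logged, validated, or flagged after delivery
    return full_text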
Key Takeaways
Token Economics:
- Tokens, not characters, determine cost and context limits
- Track token usage in production – it’s your biggest variable cost
- Different languages and domains tokenize differently (English typically tokenizes most efficiently)
- Context windows are hard limits – architect for them, don’t hit them by accident
Model Selection Criteria:
- Quality vs Cost vs Latency triangle – you can’t optimize all three
- Route by complexity: simple tasks to cheap models, complex to expensive
- Rate limits are real constraints – plan for them from day one
- Open source models offer control and privacy but require infrastructure expertise
Basic Integration Patterns:
- Use async calls for web applications – LLM latency dominates
- Implement retries with exponential backoff for transient errors
- Fallback to secondary models or cached responses on primary failure
- Stream responses for better perceived latency in chat interfaces
- Monitor token usage, latency, and error rates as first-class metrics
LLMs are powerful but non-deterministic, expensive, and operationally complex. Treat them like any other infrastructure dependency: understand their characteristics, plan for failures, and optimize based on data rather than assumptions. The next article in this series covers prompt engineering – the primary tool for controlling LLM behavior without fine-tuning.



