Prompt Engineering: The Art of Talking to AI

When GPT-3 launched, most of us typed natural language queries and hoped for decent results. We’ve since learned that prompt engineering isn’t about being polite to the AI or using magic keywords. It’s a technical discipline with measurable impact on accuracy, cost, and reliability. A well-engineered prompt can mean the difference between 60% and 95% accuracy on a classification task, or between spending $10K and $2K monthly on API calls.

What is a prompt? Simply put, a prompt is the text you send to an AI model to get a response. Think of it as giving instructions to a very capable but literal-minded assistant. The clearer and more specific your instructions, the better the output.

This matters because prompts are your primary control mechanism. You’re not training these models or fine-tuning weights (adjusting the model’s internal parameters through additional training). You’re providing instructions, context, and examples in natural language, and the model’s behavior depends entirely on how you structure that input. Poor prompts produce inconsistent outputs, hallucinations (when the AI makes up information that sounds plausible but isn’t true), and wasted tokens (the units of text that determine your API costs). Good prompts produce reliable, structured responses that you can parse and validate in production systems.

The Anatomy of a Prompt

Modern chat models like GPT-4 and Claude distinguish between two types of prompts:

System Prompt: The “behind the scenes” instruction that sets up the AI’s personality and rules. Users don’t see this, but it shapes every response. Example: “You are a customer support agent for an e-commerce platform. Be concise and professional.”

User Prompt: The actual question or request from the user. Example: “How do I return a product?”

Think of it like directing an actor: the system prompt is the character description and stage directions, while the user prompt is what the other characters say to them in the scene.

System prompts establish the frame and maintain that framing across multiple user interactions. In practice, system prompts are where you inject:

  • Role and persona (“You are an expert Python developer”) – tells the AI what expertise to draw from
  • Behavioral constraints (“Never apologize unless the user explicitly complains”) – prevents unwanted behaviors
  • Output format requirements (“Always respond in JSON”) – ensures consistent, parseable output
  • Safety guidelines (“Refuse requests for harmful content”) – adds guardrails
For example, a code-review assistant might be set up like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a senior software architect. Provide specific, actionable feedback on maintainability, performance, and security. Format: 1) Summary, 2) Issues, 3) Recommendations."},
        {"role": "user", "content": "Review this Python function: [code here]"}
    ]
)
Python

Role assignment directly affects output quality. We ran experiments where the same technical question produced drastically different results based solely on role framing:

  • “You are a helpful assistant” → generic, often incorrect technical answers
  • “You are a senior backend engineer with 10 years of experience” → more accurate, detailed responses
  • “You are a Python expert who has contributed to Django and Flask” → highly specific, framework-aware guidance

Why does this work? The model isn’t actually an expert – it doesn’t have real experience. However, during training, it learned from millions of documents written by experts. When you assign a role, you’re telling the model: “respond the way someone with this expertise would write.”

Beginner tip: When you’re unsure what role to assign, think about who would be the ideal person to answer your question. A tax accountant for tax questions, a security engineer for security reviews, a UX designer for interface feedback.

Context Injection

Context injection matters when you have information the model needs but wasn’t trained on: user data, knowledge base content, or real-time data. The naive approach is dumping everything into the context, but this burns tokens and often confuses the model with irrelevant information.

Better approach: inject only context relevant to the query. If the user asks about orders, include recent orders. If they ask about returns, include the return policy. This reduces token usage by 40-60% while maintaining accuracy.

Why does context matter? LLMs don’t have access to your database, your company’s policies, or information about specific users. They only know what’s in their training data (which has a cutoff date) and what you provide in the prompt.
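
Here’s a minimal sketch of the selective approach, using simple keyword routing and placeholder data-access helpers (fetch_recent_orders and get_return_policy are stand-ins for your own data layer); real systems usually route with intent classification or embedding-based retrieval instead:

# Sketch: inject only the context relevant to the query (keyword routing).
# fetch_recent_orders / get_return_policy are placeholders for your data layer.

def fetch_recent_orders(user_id: str) -> str:
    return "Order #1001: wireless mouse, shipped 2024-05-02"  # placeholder data

def get_return_policy() -> str:
    return "Items may be returned within 30 days of delivery."  # placeholder data

def build_context(user_id: str, query: str) -> str:
    """Assemble only the context sections this query actually needs."""
    q = query.lower()
    sections = []
    if any(word in q for word in ("order", "delivery", "tracking")):
        sections.append("Recent orders:\n" + fetch_recent_orders(user_id))
    if any(word in q for word in ("return", "refund")):
        sections.append("Return policy:\n" + get_return_policy())
    return "\n\n".join(sections)

query = "How do I return my mouse?"
messages = [
    {"role": "system", "content": "Answer using only the provided context."},
    {"role": "user", "content": f"{build_context('user_42', query)}\n\nQuestion: {query}"},
]
Python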

Instruction Design Patterns

The way you phrase your request dramatically changes what you get back.

Specificity dramatically affects output quality:

  • Vague: “Summarize this article.” → produces 50-300 words with inconsistent structure
  • Specific: “Summarize in 3 bullet points, each under 20 words. Focus on: main finding, methodology, implications.” → consistent, parseable output

Constraints prevent the model from going off track. For a SQL generator:

Rules:
- Never use SELECT *. Always specify columns explicitly.
- Use parameterized queries, never hardcoded values.
- Prefer indexed columns in WHERE clauses.
- Include LIMIT clauses for large tables.
Plaintext

Without constraints, models produce insecure or inefficient code. With constraints, they generate production-quality outputs.

Output format specification is essential for structured data extraction:

system_prompt = """Extract structured information from support tickets.
Output valid JSON with this schema:
{"category": "billing|technical|shipping|account", "priority": "low|medium|high|critical", "sentiment": "positive|neutral|negative"}
Do not include any text before or after the JSON."""

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": ticket_text}],
    response_format={"type": "json_object"}  # Guarantees valid JSON
)
Python

Negative instructions tell the model what NOT to do. These are surprisingly effective:

  • DO NOT: Apologize, use marketing language, make assumptions about undocumented behavior
  • DO: Cite documentation sections, explain trade-offs, flag when questions fall outside documented behavior

We added negative instructions after observing that models over-apologize and provide plausible-sounding but incorrect answers instead of admitting uncertainty. Negative instructions reduced hallucinations by 30% in our documentation assistant.
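
As an illustration (not the original prompt), a documentation-assistant system prompt with negative instructions might look like this:

You answer questions about our product documentation.

DO NOT:
- Apologize or use filler phrases.
- Use marketing language.
- Make assumptions about undocumented behavior.

DO:
- Cite the relevant documentation section for every claim.
- Explain trade-offs when more than one approach exists.
- Say "This isn't covered in the documentation" when a question falls outside it.
Plaintext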

Few-Shot Learning

One of the most powerful prompt engineering techniques is showing the model what you want through examples. This is called “few-shot learning” – “few” because you typically only need 2-5 examples, and each example being one “shot.”

The core idea: Instead of trying to perfectly describe what you want in words, show the model examples of correct input-output pairs. The model recognizes the pattern and applies it to new inputs.

# Zero-shot (no examples) - less reliable
system_prompt = "Classify customer sentiment as positive, neutral, or negative."

# Few-shot (with examples) - more reliable, consistent format
system_prompt = """Classify customer sentiment as positive, neutral, or negative.

Examples:
Input: "Great product, fast shipping!" → positive
Input: "Product is okay, nothing special." → neutral
Input: "Terrible quality, requesting a refund." → negative

Now classify:"""
Python

Guidelines for example selection:

  1. Representative: Cover the range of inputs you expect
  2. Clear: Examples should be unambiguous
  3. Consistent: Same labeling format and style
  4. Diverse: Include edge cases and challenging examples
  5. Minimal: Start with 3-5 examples; add more only if accuracy improves

Token cost matters. 3 examples at ~100 tokens each = 300 tokens per request. At $10/1M tokens (GPT-4) with 100K requests = $300/month. If zero-shot works at 92% accuracy and few-shot at 95%, is 3% improvement worth $300? Depends on your use case.

When to use few-shot learning:

  • Classification tasks (categorizing support tickets, sentiment analysis)
  • Format standardization (ensuring consistent output structure)
  • Domain-specific terminology (teaching the model your company’s jargon)
  • Edge cases (showing how to handle tricky situations)

Advanced Techniques

Master the fundamentals first (system prompts, context injection, few-shot learning), then return here when you need to solve harder problems.

Chain-of-Thought (CoT) Prompting

What is it? CoT prompting asks the model to “show its work” – explain its reasoning step by step before giving a final answer. This dramatically improves accuracy on complex tasks like math, logic, and multi-step reasoning.

Why does it work? LLMs generate text one token at a time. Without CoT, the model tries to jump straight to the answer, which can lead to errors. With CoT, each step of reasoning becomes part of the context for the next step.

# Without CoT - often makes arithmetic errors
"A store has 23 apples. They sell 17 and receive 45. Then 12 more are sold. How many remain?"

# With CoT - much more reliable
"Solve step by step: 1) Start with initial count, 2) Subtract first sale, 3) Add delivery, 4) Subtract second sale, 5) State final count"
Python

Self-consistency improves CoT by generating multiple reasoning paths (with higher temperature) and taking the majority answer. Costs 3-5x more but significantly improves accuracy on uncertain problems. Use selectively for high-value decisions.
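
A minimal sketch of self-consistency, assuming the OpenAI Python client and a prompt that ends each response with an "Answer:" line so the final value can be extracted naively; production code would parse answers more robustly:

from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    """Sample several chain-of-thought solutions and return the majority answer."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            temperature=0.8,  # higher temperature -> more diverse reasoning paths
            messages=[
                {"role": "system",
                 "content": "Solve step by step. End with 'Answer: <value>' on its own line."},
                {"role": "user", "content": question},
            ],
        )
        text = response.choices[0].message.content
        answers.append(text.rsplit("Answer:", 1)[-1].strip())  # naive extraction
    return Counter(answers).most_common(1)[0][0]
Python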

ReAct: Reasoning + Acting

What is it? ReAct is a pattern where the model alternates between thinking (reasoning about what to do) and doing (calling external tools like search, calculators, or APIs). It’s the foundation for most AI agents.

When to use it: When your LLM needs to interact with the outside world – fetching real-time data, performing calculations, looking up information in databases.

The pattern follows: Thought → Action → Observation → Thought → Action → … → Answer

# ReAct system prompt structure
"""Available tools: search(query), calculate(expression), get_weather(location)

Format:
Thought: [Your reasoning about what to do next]
Action: [tool_name(arguments)]
Observation: [Result will be provided]
...repeat until...
Answer: [Your final answer]"""
Plaintext
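
The prompt only does half the work: your application has to parse the Action lines, run the tools, and feed the observations back. Here is a minimal sketch of that loop, assuming the OpenAI client and a toy tool registry (real agents need stricter parsing, safer tool execution, and error handling):

import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy tool registry; real agents call search APIs, databases, etc.
TOOLS = {
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only, never for untrusted input
}

SYSTEM = """Available tools: calculate(expression)

Format:
Thought: [your reasoning about what to do next]
Action: tool_name(arguments)
(wait for Observation, then continue)
Answer: [your final answer]"""

def react(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        if "Answer:" in reply:  # the model has finished
            return reply.split("Answer:", 1)[1].strip()

        match = re.search(r"Action:\s*(\w+)\((.*)\)", reply)
        if match:  # run the requested tool and feed the result back
            name, args = match.group(1), match.group(2)
            observation = TOOLS.get(name, lambda a: "unknown tool")(args)
            messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "No answer within the step limit."
Python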

Technique Comparison

| Technique | Best For | Cost | Complexity |
| --- | --- | --- | --- |
| Chain-of-Thought | Math, logic, multi-step reasoning | Low (just add “step by step”) | Easy |
| Self-Consistency | High-stakes decisions, uncertain answers | 3-5x base cost | Medium |
| ReAct | Tool use, real-time data, API calls | Variable | Medium-High |
| Tree of Thoughts | Complex planning, exploring alternatives | 10x+ base cost | High |

Handling Hallucinations

What are hallucinations? LLMs generate information that sounds confident and plausible but is factually incorrect. The model might cite fake research papers, invent statistics, or confidently state wrong information.

Why do they happen? LLMs are trained to predict plausible text, not to verify truth. They don’t have a fact-checking mechanism. When the model doesn’t know something, it generates text that sounds correct based on patterns in its training data.

Real-world example: Ask GPT-4 about a specific legal case, and it might give you a confident answer with case numbers, dates, and outcomes – all completely fabricated.

Mitigation Strategies

1. Lower temperature for factual tasks – at temperature=0.0 the model sticks to the most likely tokens, which is more conservative and hallucinates less; at temperature=1.0 it is more creative but hallucinates more.
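
With the OpenAI client this is a per-request setting; the snippet below reuses the client and prompt variables from the earlier examples:

# Conservative settings for a factual extraction task.
response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0.0,  # near-greedy decoding: less creative, fewer hallucinations
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": question}],
)
Python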

2. Grounding with retrieval – Instead of asking the model to recall facts, provide them:

prompt = f"""Based on this policy: {policy_doc}
Answer: What is the return policy for electronics?
Use only information from the provided policy. If the policy doesn't cover something, say "The policy doesn't specify this."
"""
Python

This is the foundation of RAG (Retrieval-Augmented Generation).

3. Request citations – Ask the model to cite sources using format like [doc_id: page_number]. Citations let you verify claims by checking the source.
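
An illustrative instruction block for this (the [doc_id: page_number] convention is just one option):

Answer using only the provided documents.
After every factual claim, add a citation in the form [doc_id: page_number].
Example: "Refunds are processed within 5 business days [returns_policy: 2]."
If no provided document supports a claim, do not make the claim.
Plaintext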

4. Explicit uncertainty prompting – Tell the model: “If confident (>90%), answer directly. If uncertain (<50%), say ‘I don’t have enough information.’” This doesn’t eliminate hallucinations but makes them less confident.

Hallucination prevention checklist:

  1. Use low temperature (0.0-0.3) for factual tasks
  2. Provide source documents and ask the model to cite them
  3. Add “If you don’t know, say ‘I don’t know’” to your system prompt
  4. For critical applications, always verify LLM outputs against trusted sources
  5. Use RAG to ground responses in your own data

Prompt Templates and Versioning

As you move from experiments to production, managing prompts becomes a real challenge. Hardcoding prompts leads to maintenance nightmares.

Why templates matter:

  • Consistency: Same prompt structure across your application
  • Maintainability: Update one template instead of 50 hardcoded strings
  • Testability: Version and A/B test different prompts
  • Separation of concerns: Non-engineers can iterate on prompts without touching code

Basic templating works with Python f-strings or Jinja2. For production, add Pydantic validation to ensure required fields are present.
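
A minimal sketch of that combination, assuming Pydantic v2; the template text and field names are illustrative:

from pydantic import BaseModel, Field

class SupportPromptInput(BaseModel):
    """Validated inputs for the support-ticket prompt template."""
    ticket_text: str = Field(min_length=1)
    customer_tier: str = Field(pattern="^(free|pro|enterprise)$")

TEMPLATE = (
    "You are a support assistant for a {customer_tier}-tier customer.\n"
    "Classify and summarize the following ticket:\n{ticket_text}"
)

def render_prompt(raw: dict) -> str:
    data = SupportPromptInput(**raw)  # raises ValidationError if fields are missing or invalid
    return TEMPLATE.format(**data.model_dump())

prompt = render_prompt({"ticket_text": "I was charged twice this month.", "customer_tier": "pro"})
Python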

Version control for prompts is critical. Treat prompts like code:

  • Use semantic versioning (MAJOR.MINOR.PATCH)
  • Store prompts in version control, separate from code
  • Track which version is used for each request
  • Keep previous versions deployed for quick rollback

A/B testing prompts: Hash user_id for consistent assignment to variants, record success metrics, analyze after 7+ days. Sometimes a slightly better prompt costs 20% more in tokens, making it a bad trade overall.
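
A sketch of the assignment step: hashing user_id gives a stable bucket, so the same user always sees the same variant (the variant names and 50/50 split here are illustrative):

import hashlib

PROMPT_VARIANTS = {"control": "classifier_v2.1.0", "treatment": "classifier_v2.2.0"}

def assign_variant(user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a prompt variant (stable across requests)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

variant = assign_variant("user_42")
prompt_version = PROMPT_VARIANTS[variant]
# Log (user_id, prompt_version, outcome) so metrics can be analyzed per variant.
Python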

Getting started roadmap:

  • Week 1-2: Simple f-string templates in your code
  • Month 1: Move templates to separate files, add basic versioning
  • Month 2+: Implement A/B testing, add metrics tracking
  • At scale: Consider prompt management platforms (LangSmith, PromptLayer, Humanloop)

Basic vs Optimized: Real-World Comparison

# BASIC PROMPT - 65% accuracy, inconsistent format
"Classify this support ticket: {ticket_text}"

# OPTIMIZED PROMPT - 94% accuracy, consistent JSON format
"""You are a support ticket classifier. 
Classify into exactly one category: technical, billing, shipping, or account.

Guidelines:
- "technical": software bugs, feature requests, integration issues
- "billing": payments, refunds, invoices, pricing questions
- "shipping": delivery status, tracking, damaged goods
- "account": login, password, profile, permissions

Output: {"category": "<category>", "confidence": <0-100>}

Examples:
"I can't log in" → {"category": "account", "confidence": 95}
"When will my order arrive?" → {"category": "shipping", "confidence": 90}
"The API returns a 500 error" → {"category": "technical", "confidence": 100}"""
Python

What made the optimized prompt better?

  1. Clear category definitions – No ambiguity
  2. Explicit output format – JSON with specific fields
  3. Few-shot examples – Shows exactly what good outputs look like
  4. Confidence scores – Lets you route uncertain cases to humans

This costs ~200 more tokens per request ($0.002 on GPT-4) but improves accuracy from 65% to 94%. At 10K tickets/day, that’s $600/month, but it reduces misrouted tickets from 3,500 to 600 daily.

Measuring Prompt Quality

You can’t improve what you don’t measure.

1. Automated metrics – Run prompts against labeled test sets (100+ examples minimum). Compare variants: basic_v1: 65% → with_examples_v2: 78% → optimized_v4: 94%.
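
A minimal harness for this can be just a few lines; the sketch below assumes a classify(text) function that wraps your prompt and API call and returns a category string:

def evaluate(classify, test_set: list[tuple[str, str]]) -> float:
    """Return accuracy over (text, expected_label) pairs."""
    correct = sum(1 for text, expected in test_set if classify(text) == expected)
    return correct / len(test_set)

test_set = [
    ("I can't log in", "account"),
    ("The API returns a 500 error", "technical"),
    ("When will my order arrive?", "shipping"),
]
# accuracy = evaluate(classify_v4, test_set)  # run the same set against each prompt variant
Python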

2. LLM-as-a-judge – For tasks without clear correct answers (creative writing, explanations), use GPT-4 to evaluate quality on criteria like accuracy, clarity, completeness. Correlates well with human judgment (r=0.85-0.90) and is cheaper than human evaluation.
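
A sketch of a judge call, assuming the OpenAI client; the rubric and JSON scoring format are illustrative:

import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str) -> dict:
    """Ask a strong model to score an answer on a simple 1-5 rubric."""
    rubric = (
        "Rate the answer to the question on a 1-5 scale for accuracy, clarity, "
        'and completeness. Return only JSON like {"accuracy": 4, "clarity": 5, "completeness": 3}.'
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
Python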

3. Human feedback loops – Collect user ratings (1-5 stars), correlate with prompt versions. If v2.1.0 has 4.7 stars over 1,000 interactions and v2.2.0 has 4.2 stars over 200 interactions, v2.1.0 is likely better.

Measurement maturity levels:

  • Level 0: No measurement (“it seems to work”)
  • Level 1: Manual spot-checking of outputs
  • Level 2: Automated test suite with labeled examples
  • Level 3: LLM-as-judge for subjective quality
  • Level 4: Production feedback loops with user ratings
  • Level 5: Full observability with A/B testing, cost tracking, continuous monitoring

Most teams should aim for Level 2-3 before launching, and Level 4-5 for production systems.

Key Takeaways

Prompt Patterns Library:

  • System prompts for consistent behavior framing
  • Few-shot examples for format compliance (3-5 examples usually sufficient)
  • Chain-of-Thought for complex reasoning (add “solve step by step”)
  • ReAct for tool-using agents (Thought/Action/Observation loop)
  • Negative instructions to prevent common mistakes
  • Citation requests to enable verification
  • Explicit output format with JSON schemas

Testing Strategies:

  • Build labeled test sets (100+ examples minimum)
  • Automate evaluation before deploying prompt changes
  • Use LLM-as-a-judge for subjective quality
  • Collect user feedback and correlate with prompt versions
  • A/B test significant changes (run 7+ days)
  • Monitor production accuracy continuously

Versioning Approach:

  • Store prompts in version control, separate from code
  • Use semantic versioning (MAJOR.MINOR.PATCH)
  • Track which version is used for each request
  • Gradual rollouts (10% → 50% → 100%)
  • Keep previous versions for quick rollback

For beginners just getting started:

  1. Start simple – a clear system prompt with role and constraints goes a long way
  2. Add few-shot examples when you need consistent formatting
  3. Use Chain-of-Thought (“step by step”) for reasoning tasks
  4. Always include “If you don’t know, say so” to reduce hallucinations
  5. Iterate based on actual failures, not hypothetical edge cases

Common beginner mistakes to avoid:

  • Being too vague (“help me with this”) instead of specific (“summarize in 3 bullets”)
  • Providing too much irrelevant context (wastes tokens, confuses the model)
  • Not specifying output format (leads to inconsistent responses)
  • Trusting LLM outputs without verification for critical applications
  • Over-engineering prompts before validating the basic approach works

Prompt engineering is your primary control mechanism for LLM behavior. Small changes have outsized effects. The difference between 65% and 95% accuracy often comes down to specific examples, explicit constraints, or output format requirements. Treat prompts as critical infrastructure: version them, test them, and monitor their performance in production.

The next article covers embeddings and vector databases, which enable semantic search and retrieval-augmented generation. These are the foundation for grounding LLMs in your proprietary data.
