Udaiy’s Blog

The Art of Prompt Engineering, Part 1: 7 Battle-Tested Techniques That Actually Work

A few months ago, I was debugging a customer support chatbot that kept giving wildly inconsistent answers to the same question. One minute it would recommend Product A, the next minute Product C for an identical query. The stakeholder demo was in two days, and I was frantically tweaking the system prompt when I realized the real issue wasn't the model; it was my prompt architecture.

The fix wasn't more sophisticated instructions or a bigger model. It was implementing self-consistency validation and structured outputs. That experience taught me something crucial: prompt engineering isn't just about writing better instructions; it's about designing reliable, predictable systems that scale.

Why Prompt Engineering Matters More Than Ever

In the rush to ship generative AI features, many teams treat prompts as afterthoughts: quick instructions thrown together to "make the model work." But here's the reality: your prompt is your interface to a trillion-parameter reasoning engine. It's simultaneously your API specification, your error handling, and your quality control system.

Poor prompt engineering creates brittle systems that hallucinate, ignore instructions, or produce inconsistent outputs. Good prompt engineering creates reliable AI applications that users actually trust. The difference often comes down to a few key techniques that most developers never learn.

Core Techniques That Transform Your Prompts

1. Structured Outputs: Stop Parsing, Start Validating

The Problem: Raw text responses create maintenance nightmares. Your LLM says "The user seems frustrated, priority: medium" and you're stuck writing fragile regex to extract sentiment and priority levels.

The Solution: Define your expected output structure upfront using schemas like Pydantic models.

from pydantic import BaseModel
from typing import List

class CustomerIssue(BaseModel):
    sentiment: str  # "positive", "neutral", "negative", "frustrated"  
    priority: str   # "low", "medium", "high", "critical"
    category: str   # "billing", "technical", "feature_request"
    action_items: List[str]

@llm.call(model="gpt-4o-mini", response_model=CustomerIssue)
@prompt_template("""
Analyze this customer message: {message}
""")
def analyze_ticket(message: str): ...

When to use it:

- Any output that downstream code consumes programmatically (routing, dashboards, databases)
- Classification-style fields with a fixed set of valid values, like the sentiment and priority above
- Anywhere you're currently maintaining regex or string parsing over raw model text

Pro tip: Start with the structure you need, then work backwards to the prompt. This forces you to think about validation and error handling upfront.
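If your client library doesn't enforce the schema for you, you can still get the same guarantee at the boundary by validating the raw response yourself. Here's a minimal sketch with plain Pydantic, reusing the CustomerIssue model above; `raw_json` stands in for whatever raw text your model call returns:

from typing import Optional
from pydantic import ValidationError

def parse_issue(raw_json: str) -> Optional[CustomerIssue]:
    """Validate a raw JSON response against the CustomerIssue schema."""
    try:
        return CustomerIssue.model_validate_json(raw_json)
    except ValidationError as err:
        # Bad output gets caught here (retry, fallback, or escalate) instead of
        # leaking malformed data into downstream code.
        print(f"Schema validation failed: {err}")
        return None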

2. In-Context Learning: Show, Don't Just Tell

The Problem: Abstract instructions like "be helpful but concise" produce unpredictable results. The model's interpretation of "concise" might be very different from yours.

The Solution: Include 2-3 concrete examples that demonstrate exactly what good outputs look like.

class Example(BaseModel):
    query: str
    answer: str

examples = [
    Example(
        query="How do I reset my password?",
        answer="Click 'Forgot Password' on the login page, enter your email, and check your inbox for reset instructions. This usually takes 1-2 minutes."
    ),
    Example(
        query="Why is my account suspended?", 
        answer="Account suspensions typically result from payment issues or policy violations. Please check your email for details, or contact billing@company.com for immediate assistance."
    )
]

@prompt_template("""
You are a customer support assistant. Provide helpful, specific answers.

<examples>
{examples_block}
</examples>

Customer: {query}
""")
# examples_block: the examples above rendered to text (see the formatter sketch below)
def customer_support(query: str, examples_block: str): ...
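How {examples_block} gets built is left open above; one simple approach, a helper of my own rather than anything library-specific, is to render each example inside its own tags:

def format_examples(examples: List[Example]) -> str:
    """Render few-shot examples as an XML-ish block for the prompt."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"<example>\n<query>{ex.query}</query>\n<answer>{ex.answer}</answer>\n</example>"
        )
    return "\n".join(blocks)

examples_block = format_examples(examples)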

When to use it:

- When "good" is easier to show than to describe (tone, length, format)
- When abstract instructions like "be helpful but concise" keep being interpreted differently than you intended
- When answers need a consistent shape across many different queries

Pro tips:

- Choose examples that look like real production queries, not idealized ones
- Make each example answer match the exact format, tone, and length you want back
- Two or three strong examples usually beat a long list of mediocre ones

3. Task Decomposition: Break Complex Tasks Into Focused Components

The Problem: Asking an LLM to "analyze this support ticket and extract issues, determine priority, categorize the problem, generate a response, and suggest internal actions" creates competing objectives that reduce overall quality.

The Solution: Chain multiple focused calls, each with a single clear objective.
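The chain below also references two result models that aren't shown; here is a minimal sketch of what they might look like, with field names inferred from the prompts (treat them as assumptions):

class TicketAnalysis(BaseModel):
    issue: str        # main issue, one clear sentence
    sentiment: str    # "positive", "neutral", "negative", "frustrated"
    category: str     # "technical", "billing", "feature_request", "account", "other"
    urgency: str      # "low", "medium", "high", "critical"

class TicketResult(BaseModel):
    analysis: TicketAnalysis
    response: str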

# Step 1: Issue Analysis
@llm.call(model="gpt-4o-mini")
@prompt_template("""
Analyze this support ticket and extract:
1. Main issue (one clear sentence)
2. Customer sentiment (positive, neutral, negative, frustrated)  
3. Issue category (technical, billing, feature_request, account, other)
4. Urgency level (low, medium, high, critical)

Ticket: {ticket_text}
""")
def analyze_ticket(ticket_text: str) -> TicketAnalysis: ...

# Step 2: Response Generation  
@llm.call(model="gpt-4o")
@prompt_template("""
Generate a customer support response for this issue.
Issue: {analysis.issue}
Sentiment: {analysis.sentiment}
Customer History: {customer_history}
""")
def generate_response(analysis: TicketAnalysis, customer_history: str) -> str: ...

# Orchestration
def process_support_ticket(ticket_text: str, customer_history: str):
    analysis = analyze_ticket(ticket_text)
    response = generate_response(analysis, customer_history)
    return TicketResult(analysis=analysis, response=response)

When to use it:

- When one prompt is juggling several competing objectives and quality drops on all of them
- When different steps justify different models (a cheap model for analysis, a stronger one for the customer-facing response, as above)
- When you want to test, evaluate, and improve each step independently

Pro tip: Each component should have one clear success metric. If you can't easily evaluate a component's performance, it's probably doing too much.
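To make "one clear success metric" concrete, a toy evaluation for the analysis step could simply measure category accuracy over a small hand-labeled set. The labeled tickets here are an assumption, not a real dataset:

def category_accuracy(labeled_tickets: list[tuple[str, str]]) -> float:
    """Share of (ticket_text, expected_category) pairs where analyze_ticket agrees."""
    correct = sum(
        1 for ticket_text, expected_category in labeled_tickets
        if analyze_ticket(ticket_text).category == expected_category
    )
    return correct / len(labeled_tickets)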

4. Self-Consistency: Get Confidence Scores for Critical Decisions

The Problem: LLMs can give completely different answers to identical prompts. For critical decisions, a single response might be unreliable.

The Solution: Generate multiple responses and select the most consistent answer, giving you both reliability and a confidence measure.
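As in the decomposition example, the wrapper below leans on two result models that aren't shown; a plausible sketch (my field names, inferred from how they're used):

class DiagnosisResult(BaseModel):
    diagnosis: str

class ConsistencyResult(BaseModel):
    final_answer: str
    is_reliable: bool
    all_responses: List[str]
    confidence: float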

import asyncio
from collections import Counter

@llm.call(model="gpt-4o-mini", response_model=DiagnosisResult)
async def diagnose_issue_single(symptoms: str) -> DiagnosisResult: ...

async def diagnose_with_consistency(symptoms: str, num_samples: int = 5) -> ConsistencyResult:
    # Generate multiple responses in parallel
    tasks = [diagnose_issue_single(symptoms) for _ in range(num_samples)]
    responses = await asyncio.gather(*tasks)
    
    # Count diagnosis frequency
    diagnoses = [r.diagnosis for r in responses]
    diagnosis_counts = Counter(diagnoses)
    most_common_diagnosis, frequency = diagnosis_counts.most_common(1)[0]
    
    # Calculate agreement score
    agreement_score = frequency / num_samples
    
    return ConsistencyResult(
        final_answer=most_common_diagnosis,
        is_reliable=agreement_score >= 0.6,
        all_responses=diagnoses,
        confidence=agreement_score
    )

When to use it:

- Critical decisions where a single unreliable response isn't acceptable
- When you need a confidence measure to decide between automating and escalating to a human
- Sparingly: every extra sample multiplies cost and latency

Pro tip: Set reliability thresholds based on your use case. Medical diagnosis might need 80% agreement, while content categorization might be fine with 60%.
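Using the wrapper is then a short reliability-gated call; the symptom string is just an illustrative input, and the 0.6 agreement cutoff hard-coded above is exactly the threshold you'd tune per use case:

async def main() -> None:
    result = await diagnose_with_consistency("app crashes on startup after the latest update")
    if result.is_reliable:
        print(f"Diagnosis: {result.final_answer} (agreement {result.confidence:.0%})")
    else:
        # Not enough agreement across samples: route to a human instead of guessing
        print("Low agreement across samples; escalating to a human.")

asyncio.run(main())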

5. Explicit Rejection Types: Teach Your Model to Say "No"

The Problem: Models forced to always provide answers will hallucinate rather than admit uncertainty, leading to confident but incorrect responses.

The Solution: Use union types that allow explicit rejection with clear reasons.

from typing import Union

class Answer(BaseModel):
    answer: str
    confidence: float

class Rejection(BaseModel):
    reason: str  # "insufficient_context", "outside_scope", "policy_violation"
    suggestion: str

class Response(BaseModel):
    response: Union[Answer, Rejection]

@llm.call(model="gpt-4o-mini", response_model=Response)
@prompt_template("""
Answer this question based on the provided context. If you cannot provide a confident answer, explicitly reject with a reason.

Context: {context}
Question: {question}

If rejecting, use reasons: "insufficient_context", "outside_scope", or "policy_violation"
""")
def answer_with_rejection(question: str, context: str) -> Response: ...

When to use it:

- Question answering over a knowledge base or retrieved context that may not contain the answer
- Domains where a confident wrong answer is worse than an honest "I don't know"
- Anywhere you need a clean signal for routing questions to a human or another system

Pro tip: Track rejection rates to identify gaps in your knowledge base or prompting strategy.
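On the calling side, the union is what makes this useful: code can branch on the response type instead of string-matching the text. Here's a rough sketch, where the downstream handlers are placeholders for your own routing:

result = answer_with_rejection(question, context)

if isinstance(result.response, Answer):
    send_reply(result.response.answer)        # placeholder: deliver the answer to the user
elif result.response.reason == "insufficient_context":
    request_more_details(question)            # placeholder: ask the user a follow-up question
else:
    log_and_escalate(result.response)         # placeholder: rejection analytics / human review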

6. XML Separators: Eliminate Parsing Ambiguity

The Problem: Using markdown code fences or simple dashes for separation creates ambiguity when user content contains similar patterns.

The Solution: Use XML-style tags that are inherently unambiguous and rarely appear in user content.

Before: ambiguous separators

PROMPT_TEMPLATE = """
Please extract code from this message:
---
User message:
def hello():
    print("world")
"""

After: explicit XML tags

PROMPT_TEMPLATE = """
Extract and explain code from this user message:

<user_message>
I'm getting this error:

- ValueError: invalid syntax

Can you help debug this code?

def process_data():
    return data.split('---')
</user_message>

<instructions>
Extract any code blocks and explain their functionality. Format your response as:
- Language: [detected language]
- Code: [extracted code]
- Explanation: [your explanation]
</instructions>
"""

When to use it:

- Prompts that embed user-generated content, code, or documents
- Multi-part prompts where instructions, context, and input need unambiguous boundaries
- Any case where user content might itself contain dashes, code fences, or other separator-like patterns

Pro tips:

- Use tag names that describe their contents (<user_message>, <instructions>) so your instructions can refer to them directly
- Keep tag names consistent across templates so prompts stay easy to read and maintain

7. Vague Query Handling: Guide Users to Better Questions

The Problem: Vague inputs like "something is wrong" lead to generic, unhelpful responses that waste compute and frustrate users.

The Solution: Detect unclear queries and intelligently guide users to provide better information.

class QueryAnalysis(BaseModel):
    is_specific: bool
    missing_context: List[str]

class ClarificationRequest(BaseModel):
    message: str
    suggested_questions: List[str]

@llm.call(model="gpt-4o-mini", response_model=QueryAnalysis)
@prompt_template("""
Analyze if this query contains enough specific information for a helpful response:

Query: "{query}"

Consider:
- Does it specify the problem domain?
- Does it include relevant context or error details?  
- Can you provide a specific, actionable answer?

Determine if query is specific and list what context is missing.
""")
def analyze_query_clarity(query: str) -> QueryAnalysis: ...

def smart_query_handler(user_query: str) -> str:
    analysis = analyze_query_clarity(user_query)
    
    if analysis.is_specific:
        # handle_specific_query: your normal answer path for well-formed queries
        return handle_specific_query(user_query)
    else:
        # generate_clarification: a structured call sketched after this block
        clarification = generate_clarification(user_query, analysis.missing_context)
        return f"""I'd be happy to help! To give you the best answer, could you provide a bit more detail?

{clarification.message}

For example, you could ask:
{chr(10).join(f"- {q}" for q in clarification.suggested_questions)}"""

When to use it:

- User-facing chat or search interfaces where vague, open-ended input is common
- When unclear queries are burning compute on generic answers nobody acts on
- When one quick clarification round is cheaper than a wrong or useless response

Pro tip: Track how often users provide better information after clarification requests; this helps you tune your clarity thresholds.

Common Pitfalls to Avoid

1. The "More Instructions" Trap

When outputs are inconsistent, resist adding more rules to your prompt. Often the issue is structural (need examples or schemas) rather than instructional. A 500-word prompt with examples usually beats a 1000-word prompt without them.

2. Ignoring Token Economics

Don't put static content in every prompt if you're making many calls. Use prompt caching for system instructions and examples. A cacheable prefix can reduce costs by 90% while improving latency.
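The exact caching mechanism is provider-specific, but the structural idea is the same: keep the static part of the prompt byte-for-byte identical and put the dynamic content last. A rough sketch, reusing the example formatter from technique 2:

# Built once and reused verbatim; provider-side prompt caching generally keys on an
# identical leading prefix, so even a whitespace change here breaks the cache.
STATIC_PREFIX = (
    "You are a customer support assistant. Provide helpful, specific answers.\n\n"
    f"<examples>\n{format_examples(examples)}\n</examples>"
)

def build_messages(query: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_PREFIX},  # cacheable, identical across calls
        {"role": "user", "content": query},            # dynamic content goes last
    ]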

3. Single Point of Failure Design

Building prompts that must work perfectly on the first try creates brittle systems. Design for graceful degradation: explicit rejections, confidence scores, and escalation paths when simple approaches fail.
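Tying a few of the earlier techniques together, a degradation path can be as small as this; the 0.7 confidence cutoff and escalate_to_human are placeholders for your own policy and handoff:

def answer_or_escalate(question: str, context: str) -> str:
    result = answer_with_rejection(question, context)

    if isinstance(result.response, Answer) and result.response.confidence >= 0.7:
        return result.response.answer

    # Explicit rejection or low confidence: hand off rather than guess
    escalate_to_human(question, result)  # placeholder for your ticketing / human-review path
    return "I've passed this to a human agent who will follow up shortly."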

Try It Yourself

The best way to internalize these techniques is to experiment with them. I've created a GitHub gist template with working examples of each technique: Prompt Engineering Techniques from Prod 2025.

Start with structured outputs; it's the foundation that makes everything else more reliable. Then add in-context learning with 2-3 examples for your specific use case. You'll be amazed how much more predictable your AI systems become.

Next in this series: We'll dive into advanced RAG architectures, covering hybrid search, reranking strategies, and chunk quality control. These techniques transform basic retrieval systems into production-ready knowledge engines.

#AI #PromptEngineering #discussions #llm #llm-challenges