The Reflection Pattern: Self-Improving AI Agents
Learn how the reflection pattern enables AI agents to review and improve their own work. Implement self-critique, iterative refinement, and quality assurance.
The reflection pattern is one of the most powerful techniques in agentic AI. It's simple in concept: have the AI review its own work. But the impact is profound—reflection can dramatically improve output quality, catch errors before they reach users, and enable iterative refinement that approaches human-level review.
This guide explains how reflection works and provides practical implementations you can use today.
What is the Reflection Pattern?
At its core, reflection means the agent examines its own outputs and identifies ways to improve them. This mirrors how humans work: we draft, review, revise, and repeat until satisfied.
The basic flow:
Generate Output → Reflect on Output → Identify Improvements → Refine → Repeat
This simple loop enables several powerful capabilities:
- Error Detection: Catching mistakes before delivery
- Quality Improvement: Each iteration gets better
- Completeness Checking: Ensuring nothing is missed
- Consistency Verification: Aligning with requirements
- Self-Correction: Fixing issues without human intervention
Why Reflection Works
When an LLM generates output in a single pass, it commits to each token as it produces it and cannot go back to revise earlier text. Reflection adds a second pass over the finished output, allowing the model to:
- See the complete output before judging it
- Apply different evaluation criteria
- Notice patterns invisible during generation
- Consider the output from the user's perspective
Reported gains vary by task, but published evaluations of reflection-style methods show improvements on the order of 10-30% across many domains, from code generation to creative writing.
Basic Reflection Implementation
Here's a simple reflection loop:
from anthropic import Anthropic

client = Anthropic()


def generate_with_reflection(task: str, max_reflections: int = 3) -> str:
    """Generate output with iterative reflection"""
    # Initial generation
    output = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": task}]
    ).content[0].text

    for i in range(max_reflections):
        # Reflect on the output
        reflection = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Review this output and identify specific improvements:

Task: {task}

Output:
{output}

Provide:
1. What's good about this output
2. Specific issues or gaps
3. Concrete suggestions for improvement

If the output is excellent and needs no changes, say "APPROVED"."""
            }]
        ).content[0].text

        # Stop iterating once the reviewer approves
        if "APPROVED" in reflection:
            break

        # Refine based on reflection
        output = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": f"""Improve this output based on the feedback:

Original task: {task}

Current output:
{output}

Feedback:
{reflection}

Provide an improved version that addresses the feedback."""
            }]
        ).content[0].text

    return output
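For example, you might call it like this (the task string here is purely illustrative):

draft = generate_with_reflection(
    "Write a short announcement for a new internal search feature.",
    max_reflections=2
)
print(draft)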
Reflection Patterns
Pattern 1: Self-Critique
The agent critiques its own work from multiple angles:
import re


class SelfCritiqueAgent:
    def __init__(self, llm):
        self.llm = llm
        self.critique_dimensions = [
            "accuracy",
            "completeness",
            "clarity",
            "relevance",
            "actionability"
        ]

    @staticmethod
    def _rating_below(critique: str, threshold: int = 8) -> bool:
        """Heuristic: pull the first 1-10 rating out of the critique text."""
        match = re.search(r"\b(10|[1-9])\b", critique)
        return bool(match) and int(match.group(1)) < threshold

    def generate_with_critique(self, task: str) -> dict:
        # Generate initial output
        output = self.llm.generate(task)

        # Self-critique along each dimension
        critiques = {}
        for dimension in self.critique_dimensions:
            critique = self.llm.generate(f"""
            Evaluate this output on {dimension}:

            Task: {task}
            Output: {output}

            Rate 1-10 and explain your rating.
            If below 8, suggest specific improvements.
            """)
            critiques[dimension] = critique

        # Synthesize improvements (keyword match or a low extracted rating)
        improvements_needed = [
            dim for dim, critique in critiques.items()
            if "improve" in critique.lower() or self._rating_below(critique)
        ]

        if improvements_needed:
            output = self.refine(output, task, critiques)

        return {
            "output": output,
            "critiques": critiques,
            "refined": bool(improvements_needed)
        }

    def refine(self, output: str, task: str, critiques: dict) -> str:
        critique_summary = "\n".join(
            f"{dim}: {critique}" for dim, critique in critiques.items()
        )
        return self.llm.generate(f"""
        Improve this output based on the critiques:

        Task: {task}
        Output: {output}
        Critiques: {critique_summary}

        Address each critique while maintaining what's already good.
        """)
Pattern 2: Criteria-Based Reflection
Reflect against explicit criteria:
import json


class CriteriaReflector:
    def __init__(self, llm):
        self.llm = llm

    def reflect_against_criteria(
        self,
        output: str,
        task: str,
        criteria: list[dict]
    ) -> dict:
        """
        criteria format: [
            {"name": "criterion_name", "description": "what to check", "weight": 1.0},
            ...
        ]
        """
        evaluations = []
        for criterion in criteria:
            evaluation = self.llm.generate(f"""
            Evaluate this output against the criterion:

            Criterion: {criterion['name']}
            Description: {criterion['description']}

            Task: {task}

            Output to evaluate:
            {output}

            Provide:
            - Score (0-100)
            - Justification
            - Specific improvements needed (if any)

            Respond with JSON only, using the keys "score", "justification", and "improvements".
            """)
            # In production, guard against malformed or fenced JSON from the model
            eval_data = json.loads(evaluation)
            eval_data["weight"] = criterion["weight"]
            evaluations.append(eval_data)

        # Calculate weighted score
        total_weight = sum(e["weight"] for e in evaluations)
        weighted_score = sum(
            e["score"] * e["weight"] for e in evaluations
        ) / total_weight

        # Identify needed improvements
        improvements = [
            e["improvements"] for e in evaluations
            if e["score"] < 80 and e.get("improvements")
        ]

        return {
            "score": weighted_score,
            "evaluations": evaluations,
            "improvements_needed": improvements,
            "passed": weighted_score >= 80
        }
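A hypothetical call, with criteria invented purely for illustration (LLMWrapper is the sketch from earlier):

# Illustrative criteria; names, descriptions, and weights are application-specific
criteria = [
    {"name": "accuracy", "description": "Claims are factually correct", "weight": 2.0},
    {"name": "completeness", "description": "Every part of the task is addressed", "weight": 1.0},
    {"name": "clarity", "description": "Easy to follow for a general reader", "weight": 1.0},
]

reflector = CriteriaReflector(LLMWrapper())
result = reflector.reflect_against_criteria(
    output="...draft text to evaluate...",
    task="Summarize the Q3 report for executives",
    criteria=criteria,
)

if not result["passed"]:
    print(result["improvements_needed"])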
Pattern 3: Comparative Reflection
Compare multiple outputs and select the best:
class ComparativeReflector:
    def __init__(self, llm):
        self.llm = llm

    def generate_and_compare(self, task: str, num_candidates: int = 3) -> str:
        # Generate multiple candidates
        candidates = []
        for i in range(num_candidates):
            # Use different temperatures or prompts for variety
            candidate = self.llm.generate(
                task,
                temperature=0.3 + (i * 0.3)  # Vary temperature
            )
            candidates.append(candidate)

        # Compare candidates
        comparison = self.llm.generate(f"""
        Compare these {num_candidates} outputs for the task: {task}

        {chr(10).join([f"Candidate {i+1}:{chr(10)}{c}" for i, c in enumerate(candidates)])}

        For each candidate:
        1. List strengths
        2. List weaknesses
        3. Rate overall quality (1-10)

        Then select the best candidate and explain why.
        Finally, suggest how to combine the best elements of all candidates.
        """)

        # Generate final refined version
        final = self.llm.generate(f"""
        Create an optimal output by combining the best elements:

        Task: {task}

        Candidates and analysis:
        {comparison}

        Generate the best possible output incorporating insights from all candidates.
        """)

        return final
Pattern 4: Role-Based Reflection
Different "reviewers" provide different perspectives:
class MultiReviewerReflection:
    def __init__(self, llm):
        self.llm = llm
        self.reviewers = [
            {
                "role": "Technical Expert",
                "focus": "technical accuracy, implementation details, edge cases"
            },
            {
                "role": "End User",
                "focus": "clarity, usability, practical value"
            },
            {
                "role": "Editor",
                "focus": "structure, grammar, conciseness"
            },
            {
                "role": "Devil's Advocate",
                "focus": "assumptions, potential issues, counterarguments"
            }
        ]

    def multi_perspective_review(self, output: str, task: str) -> dict:
        reviews = {}
        for reviewer in self.reviewers:
            review = self.llm.generate(f"""
            You are a {reviewer['role']} reviewing this output.
            Focus on: {reviewer['focus']}

            Task: {task}
            Output: {output}

            Provide your review with:
            1. What works well from your perspective
            2. What needs improvement
            3. Specific suggestions
            """)
            reviews[reviewer["role"]] = review

        # Synthesize all reviews
        synthesis = self.llm.generate(f"""
        Synthesize these reviews into actionable improvements:

        {chr(10).join([f"{role}:{chr(10)}{review}" for role, review in reviews.items()])}

        Identify:
        1. Consensus issues (mentioned by multiple reviewers)
        2. Unique insights from each perspective
        3. Priority order for addressing feedback
        """)

        return {
            "reviews": reviews,
            "synthesis": synthesis
        }
Pattern 5: Iterative Deepening
Start broad, then refine specific aspects:
class IterativeDeepeningReflector:
    def __init__(self, llm):
        self.llm = llm

    def deep_reflect(self, output: str, task: str, depth: int = 3) -> str:
        current = output

        for level in range(depth):
            if level == 0:
                # High-level structural review
                focus = "overall structure, main points, logical flow"
            elif level == 1:
                # Section-level review
                focus = "each section's content, transitions, completeness"
            else:
                # Detail-level review
                focus = "specific claims, word choice, examples, precision"

            reflection = self.llm.generate(f"""
            Depth level {level + 1} review focusing on: {focus}

            Task: {task}
            Current output: {current}

            Identify specific issues at this level and suggest fixes.
            """)

            current = self.llm.generate(f"""
            Apply these refinements:

            Output: {current}
            Refinements: {reflection}

            Maintain all previous improvements while addressing new feedback.
            """)

        return current
Reflection for Code
Code particularly benefits from reflection:
class CodeReflector:
    def __init__(self, llm):
        self.llm = llm

    def reflect_on_code(self, code: str, requirements: str) -> dict:
        # Multiple reflection passes
        reflections = {}

        # Correctness check
        reflections["correctness"] = self.llm.generate(f"""
        Review this code for correctness:

        Requirements: {requirements}
        Code: {code}

        Check:
        1. Does it meet all requirements?
        2. Are there logic errors?
        3. Are edge cases handled?
        4. Could it fail at runtime?
        """)

        # Security check
        reflections["security"] = self.llm.generate(f"""
        Review this code for security issues:

        Code: {code}

        Check for:
        1. Input validation issues
        2. Injection vulnerabilities
        3. Authentication/authorization issues
        4. Data exposure risks
        """)

        # Performance check
        reflections["performance"] = self.llm.generate(f"""
        Review this code for performance:

        Code: {code}

        Check for:
        1. Unnecessary complexity (time/space)
        2. N+1 query patterns
        3. Memory leaks
        4. Blocking operations
        """)

        # Style check
        reflections["style"] = self.llm.generate(f"""
        Review this code for style and maintainability:

        Code: {code}

        Check for:
        1. Naming conventions
        2. Code organization
        3. Documentation
        4. Readability
        """)

        # Generate improved version
        improved = self.llm.generate(f"""
        Improve this code based on all feedback:

        Original: {code}

        Feedback:
        Correctness: {reflections['correctness']}
        Security: {reflections['security']}
        Performance: {reflections['performance']}
        Style: {reflections['style']}

        Generate improved code that addresses all issues.
        """)

        return {
            "original": code,
            "reflections": reflections,
            "improved": improved
        }
When to Use Reflection
Reflection adds latency and cost, since every pass means additional LLM calls, so apply it deliberately.
Good Use Cases
- High-stakes outputs: Anything that will be published, sent to customers, or that carries significant consequences
- Complex tasks: Multi-step problems where errors compound
- Quality-critical work: Code reviews, technical documentation
- Creative work: Where refinement improves quality
- Learning from mistakes: When you want the agent to self-correct
When to Skip Reflection
- Simple queries: Factual questions with clear answers
- Time-critical responses: When speed matters more than perfection
- Low-stakes outputs: Informal or internal communications
- Already-validated inputs: When the task is well-defined with clear success criteria
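One way to put these guidelines into code is a small gate that decides whether the reflection loop is worth running at all. This is only a sketch; the metadata keys are hypothetical and would come from your own application:

def should_reflect(task_metadata: dict) -> bool:
    """Decide whether reflection is worth the extra latency and cost.

    task_metadata is an application-defined dict; the keys used here
    (stakes, latency_budget_s, complexity) are purely illustrative.
    """
    if task_metadata.get("stakes") == "high":
        return True
    if task_metadata.get("latency_budget_s", 60) < 5:
        return False  # time-critical: skip reflection
    return task_metadata.get("complexity", "low") != "low"


# Example: reflect on a high-stakes draft, skip it for a quick internal lookup
if should_reflect({"stakes": "high", "latency_budget_s": 30}):
    report = generate_with_reflection("Draft the customer-facing incident report.")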
Optimizing Reflection
1. Limit Reflection Depth
Diminishing returns set in quickly:
def adaptive_reflection(output, task, quality_threshold=0.8):
    """Stop reflecting when quality is good enough"""
    # evaluate_quality and reflect_and_improve are placeholders for your own
    # scoring and refinement calls (one possible evaluate_quality is sketched below)
    for i in range(5):  # Max 5 iterations
        score = evaluate_quality(output)
        if score >= quality_threshold:
            break
        output = reflect_and_improve(output, task)
    return output
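The evaluate_quality helper above is left undefined. Here is one minimal LLM-as-judge sketch, reusing the client created in the basic implementation; the prompt and score parsing are illustrative:

import re


def evaluate_quality(output: str) -> float:
    """Illustrative LLM-as-judge scorer returning a value between 0 and 1."""
    # Reuses the `client` created in the basic implementation above
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": f"Rate the overall quality of this text from 0 to 100. "
                       f"Reply with the number only.\n\n{output}"
        }]
    ).content[0].text
    match = re.search(r"\d+", response)
    return int(match.group()) / 100 if match else 0.0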
2. Use Faster Models for Reflection
A cheaper, faster model can handle the review passes while the more capable model handles generation and refinement:
def efficient_reflection(task):
    # claude_opus / claude_haiku are placeholder model wrappers (see the sketch below)
    # Use the powerful model for generation
    output = claude_opus.generate(task)

    # Use a faster model for reflection, following the APPROVED convention from earlier
    reflection = claude_haiku.generate(
        f"Review this output and reply APPROVED if no changes are needed: {output}"
    )

    # Use the powerful model only if refinement is needed
    if needs_improvement(reflection):
        output = claude_opus.generate(f"Improve based on: {reflection}")

    return output
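Here claude_opus, claude_haiku, and needs_improvement are placeholders. One way to fill them in, reusing the illustrative LLMWrapper from earlier; the model IDs shown are examples, so check the current model list before relying on them:

# Reuses the illustrative LLMWrapper class sketched earlier
claude_opus = LLMWrapper(model="claude-opus-4-20250514")
claude_haiku = LLMWrapper(model="claude-3-5-haiku-20241022", max_tokens=1024)


def needs_improvement(reflection: str) -> bool:
    """Crude check: treat anything other than an explicit approval as actionable."""
    return "APPROVED" not in reflection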
3. Parallel Reflection
Run multiple reflection checks in parallel:
import asyncio


async def parallel_reflection(output, task):
    # Run all checks simultaneously (each check_* is an async helper;
    # one illustrative implementation is sketched below)
    results = await asyncio.gather(
        check_accuracy(output),
        check_completeness(output),
        check_clarity(output),
        check_style(output)
    )

    # Combine feedback
    all_feedback = combine_feedback(results)

    # Single refinement pass
    return await refine(output, all_feedback)
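The check_* functions above are assumed to be async helpers you define yourself. One illustrative implementation using the SDK's async client:

from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()


async def check_accuracy(output: str) -> str:
    """Illustrative async check; check_completeness, check_clarity, and
    check_style would follow the same shape with different prompts."""
    response = await async_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"List any factual or logical problems in this text:\n\n{output}"
        }]
    )
    return response.content[0].text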
4. Cached Reflection Patterns
Reuse common reflection patterns:
REFLECTION_TEMPLATES = {
    "code": "Check for: bugs, security, performance, style",
    "writing": "Check for: clarity, grammar, structure, engagement",
    "analysis": "Check for: accuracy, completeness, logic, bias",
}


def templated_reflection(output, task_type):
    template = REFLECTION_TEMPLATES.get(task_type, REFLECTION_TEMPLATES["analysis"])
    return reflect_with_template(output, template)
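The reflect_with_template helper is likewise assumed. A minimal version might simply fold the cached checklist into a review prompt, reusing the client from the basic implementation:

def reflect_with_template(output: str, template: str) -> str:
    """Run one reflection pass using a cached checklist template (illustrative)."""
    # Reuses the `client` created in the basic implementation above
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Review the following output. {template}\n\nOutput:\n{output}"
        }]
    ).content[0].text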
Measuring Reflection Effectiveness
Track whether reflection is helping:
class ReflectionMetrics:
    def __init__(self):
        self.metrics = {
            "reflection_count": 0,
            "improvements_made": 0,
            "quality_before": [],
            "quality_after": [],
        }

    def record_reflection(self, before_score, after_score, improvements):
        self.metrics["reflection_count"] += 1
        self.metrics["quality_before"].append(before_score)
        self.metrics["quality_after"].append(after_score)
        if after_score > before_score:
            self.metrics["improvements_made"] += 1

    def get_summary(self):
        count = self.metrics["reflection_count"]
        if count == 0:
            return {"total_reflections": 0, "improvement_rate": 0.0, "average_quality_gain": 0.0}

        avg_before = sum(self.metrics["quality_before"]) / count
        avg_after = sum(self.metrics["quality_after"]) / count
        return {
            "total_reflections": count,
            "improvement_rate": self.metrics["improvements_made"] / count,
            "average_quality_gain": avg_after - avg_before,
        }
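For example, recording a single before/after pair (the scores are made up) and reading the summary:

metrics = ReflectionMetrics()
metrics.record_reflection(before_score=0.62, after_score=0.81, improvements=["clarity"])
print(metrics.get_summary())
# -> total_reflections=1, improvement_rate=1.0, average_quality_gain≈0.19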
Conclusion
The reflection pattern is deceptively simple but remarkably effective. By having agents review their own work, you get:
- Higher quality outputs
- Fewer errors reaching users
- Self-correcting behavior
- Continuous improvement within a single task
Key principles:
- Generate first, reflect second
- Use specific criteria for evaluation
- Limit iterations to avoid over-refinement
- Match reflection depth to task importance
- Measure effectiveness and adjust
When implemented well, reflection transforms agents from first-draft generators into producers of refined, polished output.
Want to connect your reflective agents to external tools? Check out The Tool-Use Pattern for the next agentic design pattern.