Testing Claude Code Skills: Strategies and Tools
Building Claude Code skills is only half the battle. The other half—often overlooked—is ensuring they work reliably across different scenarios, edge cases, and user contexts. Unlike traditional software testing, skill testing involves validating AI-generated outputs, which introduces unique challenges.
This guide covers practical testing strategies for Claude Code skills, from simple manual validation to sophisticated automated testing pipelines.
Why Testing Skills Is Different
Traditional software testing validates deterministic outputs: given input X, expect output Y. AI skill testing is fundamentally different because:
- Outputs are probabilistic. The same input might produce slightly different outputs across runs.
- Context matters enormously. A skill that works perfectly in one codebase might fail in another.
- Edge cases are unbounded. You cannot enumerate all possible inputs to an AI system.
- Failure modes are subtle. A skill might produce syntactically correct but semantically wrong output.
These differences require a testing philosophy shift: instead of testing for exact outputs, we test for behavioral invariants, quality thresholds, and safety boundaries.
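For example, instead of comparing against an exact expected string, a test can assert properties that any acceptable output must satisfy. A minimal sketch, assuming Jest and a hypothetical summarize-diff skill exposed through the executeSkill helper used later in this guide:

```typescript
import { executeSkill } from '../lib/skill-executor'; // assumed helper, as in the examples below

it('satisfies behavioral invariants rather than exact output', async () => {
  const output = await executeSkill('summarize-diff', { diff: 'example diff content' });

  // Behavioral invariant: a non-empty summary within a length budget
  expect(output.summary.length).toBeGreaterThan(0);
  expect(output.summary.length).toBeLessThanOrEqual(500);

  // Safety boundary: never echo anything that looks like an API key
  expect(output.summary).not.toMatch(/sk-[A-Za-z0-9]{10,}/);
});
```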
The Three Levels of Skill Testing
Level 1: Manual Testing
Manual testing is where every skill should start. Before automating anything, you need to understand how your skill behaves in practice.
The REPL Approach
Claude Code's interactive mode serves as your primary testing REPL. Run your skill with various inputs and observe:
```bash
# Test your skill with a simple case
claude --skill your-skill-name "Process this simple input"
# Test with a complex case
claude --skill your-skill-name "Handle this edge case with special characters: @#$%"
# Test with context from a specific directory
cd /path/to/test/project && claude --skill your-skill-name "Work with this codebase"
```
What to Observe During Manual Testing
- Correctness: Does the output match your expectations?
- Consistency: Run the same input 3-5 times. Are outputs reasonably consistent?
- Graceful degradation: What happens with malformed inputs?
- Context sensitivity: Does the skill adapt to different project types?
- Performance: How long does execution take? (A small harness that spot-checks timing and consistency is sketched after this list.)
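To repeat these spot checks without retyping commands, a small script can shell out to the CLI and time each run. This is only a sketch: it reuses the invocation pattern from the example above, and the run count is an arbitrary placeholder:

```typescript
// spot-check.ts: rerun one manual test case a few times and report timing
import { execSync } from 'node:child_process';

// Adjust to however you actually invoke your skill (taken from the example above)
const COMMAND = 'claude --skill your-skill-name "Process this simple input"';
const RUNS = 3; // repeat to eyeball consistency

for (let i = 0; i < RUNS; i++) {
  const start = Date.now();
  const output = execSync(COMMAND, { encoding: 'utf-8' });
  const seconds = ((Date.now() - start) / 1000).toFixed(1);
  console.log(`--- run ${i + 1} (${seconds}s) ---`);
  console.log(output.trim());
}
```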
Creating a Manual Test Matrix
Document your manual tests in a structured format:
## Manual Test Matrix for [Skill Name]
| Test Case | Input | Expected Behavior | Actual Result | Pass/Fail |
|-----------|-------|-------------------|---------------|-----------|
| Simple input | "hello" | Processes correctly | [result] | Pass |
| Empty input | "" | Returns helpful error | [result] | Pass |
| Unicode input | "Héllo 👋" | Handles emoji correctly | [result] | Pass |
| Large input | [1000+ chars] | Completes in <5s | [result] | Pass |
| Malformed input | "{{broken" | Graceful error message | [result] | Fail |
Level 2: Unit Testing Skill Components
While you cannot unit test the AI's decision-making, you can unit test the deterministic components of your skill:
Testing Prompt Templates
If your skill constructs prompts dynamically, test the template logic:
```typescript
// tests/prompt-template.test.ts
import { buildPrompt } from '../lib/skill-prompts';
describe('Skill Prompt Templates', () => {
it('should include project context when provided', () => {
const prompt = buildPrompt({
task: 'refactor code',
projectType: 'typescript',
conventions: ['use-semicolons', 'prefer-const']
});
expect(prompt).toContain('typescript');
expect(prompt).toContain('use-semicolons');
expect(prompt).toContain('prefer-const');
});
it('should handle missing optional fields', () => {
const prompt = buildPrompt({
task: 'refactor code'
});
expect(prompt).not.toContain('undefined');
expect(prompt).not.toContain('null');
});
it('should escape special characters in user input', () => {
const prompt = buildPrompt({
task: 'handle this: ${dangerous}'
});
expect(prompt).not.toMatch(/\$\{.*\}/);
});
});
```
Testing Output Parsers
If your skill parses structured output from Claude, test those parsers:
```typescript
// tests/output-parser.test.ts
import { parseSkillOutput } from '../lib/skill-parser';
describe('Skill Output Parser', () => {
it('should extract code blocks correctly', () => {
const output = `
Here is the refactored code:
\`\`\`typescript
const x = 1;
\`\`\`
This improves readability.
`;
const result = parseSkillOutput(output);
expect(result.codeBlocks).toHaveLength(1);
expect(result.codeBlocks[0].language).toBe('typescript');
expect(result.codeBlocks[0].content).toBe('const x = 1;');
});
it('should handle outputs without code blocks', () => {
const output = 'No code changes needed.';
const result = parseSkillOutput(output);
expect(result.codeBlocks).toHaveLength(0);
expect(result.explanation).toBe('No code changes needed.');
});
});
```
Testing File Operations
If your skill reads or writes files, test those operations in isolation:
```typescript
// tests/file-operations.test.ts
import { prepareSkillContext, applySkillChanges } from '../lib/skill-files';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
describe('Skill File Operations', () => {
let testDir: string;
beforeEach(() => {
testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-test-'));
fs.writeFileSync(
path.join(testDir, 'test.ts'),
'const x = 1;'
);
});
afterEach(() => {
fs.rmSync(testDir, { recursive: true });
});
it('should read project files correctly', async () => {
const context = await prepareSkillContext(testDir);
expect(context.files).toContainEqual({
path: 'test.ts',
content: 'const x = 1;'
});
});
it('should apply changes without data loss', async () => {
const changes = [{
path: 'test.ts',
content: 'const x = 2;'
}];
await applySkillChanges(testDir, changes);
const content = fs.readFileSync(
path.join(testDir, 'test.ts'),
'utf-8'
);
expect(content).toBe('const x = 2;');
});
});
```
Level 3: Integration Testing
Integration tests validate the entire skill execution flow, including AI interactions. These are more expensive to run but provide the highest confidence.
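Because these tests call the real model, it helps to gate them behind credentials so they are skipped rather than failing in environments without API access. A minimal sketch, assuming Jest and the same ANTHROPIC_API_KEY variable used in the CI example later in this guide:

```typescript
// Skip the whole integration suite when no credentials are available
const describeIntegration = process.env.ANTHROPIC_API_KEY ? describe : describe.skip;

describeIntegration('Skill Integration Tests (live API)', () => {
  it('runs only when ANTHROPIC_API_KEY is set', async () => {
    // ...real executeSkill calls go here...
  });
});
```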
Structural Validation Instead of Snapshot Testing
Because AI outputs vary between runs, exact snapshots and strict equality assertions are brittle. Instead, validate structural properties of the output, or score it for semantic similarity:
```typescript
// tests/integration/skill-integration.test.ts
import { executeSkill } from '../lib/skill-executor';
import { validateOutput } from '../lib/output-validator';
describe('Skill Integration Tests', () => {
// Use longer timeout for AI operations
jest.setTimeout(60000);
it('should generate valid TypeScript when refactoring', async () => {
const input = `
function add(a, b) {
return a + b;
}
`;
const output = await executeSkill('typescript-refactor', {
code: input,
targetVersion: 'es2022'
});
// Structural validation: must be valid TypeScript
const validation = await validateOutput(output, {
language: 'typescript',
mustCompile: true
});
expect(validation.isValid).toBe(true);
expect(validation.errors).toHaveLength(0);
});
it('should preserve function semantics during refactor', async () => {
const input = `const add = (a, b) => a + b;`;
const output = await executeSkill('typescript-refactor', {
code: input
});
// Semantic validation: execute both versions in a sandbox and compare results
// Note: vm2 is discontinued; prefer isolated-vm for untrusted code
const { VM } = require('vm2');
// Use a fresh sandbox per run so `const add` is not redeclared in the same context
const originalResult = new VM().run(input + '; add(2, 3)');
const refactoredResult = new VM().run(output.code + '; add(2, 3)');
expect(refactoredResult).toBe(originalResult);
});
});
```
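The tests above validate structure and behavior. If you also want a rough guard that the accompanying explanation stays on-topic across runs, a simple word-overlap score can act as a crude stand-in for semantic similarity. This sketch assumes the result exposes an explanation field (as in the parser example earlier); the 0.3 threshold is an arbitrary placeholder, and a real setup might use an embeddings service instead:

```typescript
// Crude lexical-overlap score (Jaccard over lowercased words), a cheap proxy only
function wordOverlap(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z]+/g) ?? []);
  const setA = words(a);
  const setB = words(b);
  const shared = [...setA].filter(w => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : shared / union;
}

it('keeps explanations roughly on-topic across runs', async () => {
  const input = 'const add = (a, b) => a + b;';
  const [first, second] = await Promise.all([
    executeSkill('typescript-refactor', { code: input }),
    executeSkill('typescript-refactor', { code: input }),
  ]);
  expect(wordOverlap(first.explanation, second.explanation)).toBeGreaterThan(0.3);
});
```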
Golden File Testing
For skills that produce complex outputs, use golden files: version-controlled fixtures that capture known-good inputs and baseline expectations. Because the model's exact wording varies, assert structural properties against those baselines rather than byte-for-byte equality:
```typescript
// tests/golden/commit-message.test.ts
import { executeSkill } from '../lib/skill-executor';
import * as fs from 'fs';
import * as path from 'path';
describe('Commit Message Skill - Golden Tests', () => {
const goldenDir = path.join(__dirname, 'golden-files');
it('should match golden output for feature commits', async () => {
const input = fs.readFileSync(
path.join(goldenDir, 'feature-diff.txt'),
'utf-8'
);
const output = await executeSkill('commit-message', {
diff: input,
type: 'feature'
});
// Check structural properties, not exact match
expect(output.title.length).toBeLessThanOrEqual(72);
expect(output.title).toMatch(/^(feat|feature):/i);
expect(output.body.split('\n').length).toBeGreaterThan(1);
});
it('should produce consistent quality across runs', async () => {
const input = fs.readFileSync(
path.join(goldenDir, 'bugfix-diff.txt'),
'utf-8'
);
// Run multiple times to check consistency
const outputs = await Promise.all(
Array(3).fill(null).map(() =>
executeSkill('commit-message', { diff: input, type: 'fix' })
)
);
// All outputs should have similar structure
outputs.forEach(output => {
expect(output.title).toMatch(/^fix:/i);
expect(output.title.length).toBeLessThanOrEqual(72);
});
});
});
```
Testing Patterns for Common Skill Types
Testing Command Skills
Command skills are the most straightforward to test because they have clear inputs and outputs:
```typescript
describe('PR Review Command', () => {
it('should identify security issues in code', async () => {
const code = `
const query = "SELECT * FROM users WHERE id = " + userId;
`;
const review = await executeSkill('pr-review', { code });
expect(review.issues.some(i =>
i.category === 'security' &&
i.description.toLowerCase().includes('sql injection')
)).toBe(true);
});
it('should not produce false positives for safe code', async () => {
const code = `
const query = db.prepare("SELECT * FROM users WHERE id = ?");
query.bind(userId);
`;
const review = await executeSkill('pr-review', { code });
expect(review.issues.filter(i =>
i.category === 'security'
)).toHaveLength(0);
});
});
```
Testing Agent Skills
Agent skills that orchestrate multiple tools require more sophisticated testing:
```typescript
describe('Codebase Analyzer Agent', () => {
let mockToolCalls: string[];
beforeEach(() => {
mockToolCalls = [];
// Mock tool execution to track calls
jest.spyOn(executor, 'executeTool').mockImplementation(
async (tool, args) => {
mockToolCalls.push(`${tool}:${JSON.stringify(args)}`);
return { success: true, result: 'mocked' };
}
);
});
it('should use grep before read for large codebases', async () => {
await executeSkill('codebase-analyzer', {
question: 'Where is authentication handled?',
projectSize: 'large'
});
// Verify grep was called before read
const grepIndex = mockToolCalls.findIndex(c => c.startsWith('grep:'));
const readIndex = mockToolCalls.findIndex(c => c.startsWith('read:'));
expect(grepIndex).toBeLessThan(readIndex);
});
it('should limit file reads to prevent context overflow', async () => {
await executeSkill('codebase-analyzer', {
question: 'Analyze all files',
projectSize: 'large'
});
const readCalls = mockToolCalls.filter(c => c.startsWith('read:'));
expect(readCalls.length).toBeLessThanOrEqual(10);
});
});
```
Testing Hook Skills
Hooks run at specific lifecycle points and require testing both the trigger conditions and the actions:
```typescript
describe('Pre-Commit Hook', () => {
it('should block commits with secrets', async () => {
const files = [{
path: 'config.js',
content: 'const API_KEY = "sk-1234567890abcdef";'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(false);
expect(result.reason).toContain('secret');
});
it('should allow commits with environment variables', async () => {
const files = [{
path: 'config.js',
content: 'const API_KEY = process.env.API_KEY;'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(true);
});
it('should provide actionable feedback on rejection', async () => {
const files = [{
path: '.env',
content: 'DATABASE_URL=postgres://user:pass@host/db'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(false);
expect(result.suggestion).toBeDefined();
expect(result.suggestion).toContain('.gitignore');
});
});
```
Automated Testing Pipeline
For production skills, set up a CI/CD pipeline that runs tests on every change:
```yaml
# .github/workflows/skill-tests.yml
name: Skill Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test -- --coverage

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm run test:integration

  skill-validation:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Validate SKILL.md
        run: |
          npx skill-validator validate ./SKILL.md
      - name: Lint skill prompts
        run: |
          npx prompt-lint ./prompts/
```
Debugging Failed Tests
When tests fail, use these strategies to identify the root cause:
1. Capture Full Execution Traces
```typescript
import { enableDebugLogging } from '../lib/debug';
beforeAll(() => {
enableDebugLogging({
capturePrompts: true,
captureResponses: true,
saveToFile: './test-logs/'
});
});
```
2. Compare Across Multiple Runs
```typescript
it('should be consistent across runs', async () => {
const results = [];
for (let i = 0; i < 5; i++) {
const output = await executeSkill('your-skill', { input: 'test' });
results.push(output);
}
// Log for debugging
console.log('Run results:', JSON.stringify(results, null, 2));
// Check for consistency
const uniqueResults = new Set(results.map(r => JSON.stringify(r)));
expect(uniqueResults.size).toBeLessThanOrEqual(2); // Allow some variation
});
```
3. Isolate AI vs. Code Issues
```typescript
it('should handle AI errors gracefully', async () => {
// Mock a failed AI response
jest.spyOn(claude, 'complete').mockRejectedValueOnce(
new Error('Rate limited')
);
const result = await executeSkill('your-skill', { input: 'test' });
// Skill should handle the error, not crash
expect(result.error).toBeDefined();
expect(result.error.code).toBe('AI_ERROR');
expect(result.error.retryable).toBe(true);
});
```
Best Practices Summary
- Start with manual testing. Understand your skill's behavior before automating.
- Test the boundaries. Focus on edge cases, error conditions, and unexpected inputs.
- Use structural validation. Don't assert exact outputs; validate structure and invariants.
- Mock expensive operations. Use mocks for AI calls in unit tests; reserve real calls for integration tests (see the sketch after this list).
- Track consistency. Run tests multiple times to catch non-determinism.
- Automate in CI/CD. Catch regressions before they reach users.
- Log everything. When tests fail, you need context to debug.
- Version your test data. Golden files and test fixtures should be version controlled.
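For the "Mock expensive operations" point, one common pattern is to replace the model client with a canned response in unit tests so they stay fast and free. The module path and response shape below are assumptions for illustration, not part of any official API:

```typescript
// tests/unit/skill-logic.test.ts: unit test with the AI call mocked out
import { executeSkill } from '../lib/skill-executor';

// Replace the (assumed) client module with a deterministic completion
jest.mock('../lib/claude-client', () => ({
  complete: jest.fn().mockResolvedValue({ text: 'fix: correct null check in parser' })
}));

it('wraps the model response into the expected result shape', async () => {
  const result = await executeSkill('commit-message', { diff: 'example diff', type: 'fix' });
  expect(result.title).toMatch(/^fix:/i);
});
```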
Testing Claude Code skills requires a mindset shift from traditional software testing, but the fundamentals remain: start simple, automate gradually, and always prioritize the user experience. A well-tested skill is a reliable skill—and reliability builds trust.
Ready to ensure your skills are production-quality? Check out our Security Best Practices guide next.