Testing Claude Code Skills: Strategies and Tools
Building Claude Code skills is only half the battle. The other half—often overlooked—is ensuring they work reliably across different scenarios, edge cases, and user contexts. Unlike traditional software testing, skill testing involves validating AI-generated outputs, which introduces unique challenges.
This guide covers practical testing strategies for Claude Code skills, from simple manual validation to sophisticated automated testing pipelines.
Why Testing Skills Is Different
Traditional software testing validates deterministic outputs: given input X, expect output Y. AI skill testing is fundamentally different because:
- Outputs are probabilistic. The same input might produce slightly different outputs across runs.
- Context matters enormously. A skill that works perfectly in one codebase might fail in another.
- Edge cases are unbounded. You cannot enumerate all possible inputs to an AI system.
- Failure modes are subtle. A skill might produce syntactically correct but semantically wrong output.
These differences require a testing philosophy shift: instead of testing for exact outputs, we test for behavioral invariants, quality thresholds, and safety boundaries.
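For example, instead of comparing against an exact expected string, a test can assert properties that any acceptable output must satisfy. A minimal sketch, assuming Jest and a hypothetical summarize-diff skill exposed through the executeSkill helper used later in this guide:

```typescript
import { executeSkill } from '../lib/skill-executor'; // assumed helper, as in the examples below

it('satisfies behavioral invariants rather than exact output', async () => {
  const output = await executeSkill('summarize-diff', { diff: 'example diff content' });

  // Behavioral invariant: a non-empty summary within a length budget
  expect(output.summary.length).toBeGreaterThan(0);
  expect(output.summary.length).toBeLessThanOrEqual(500);

  // Safety boundary: never echo anything that looks like an API key
  expect(output.summary).not.toMatch(/sk-[A-Za-z0-9]{10,}/);
});
```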
The Three Levels of Skill Testing
Level 1: Manual Testing
Manual testing is where every skill should start. Before automating anything, you need to understand how your skill behaves in practice.
The REPL Approach
Claude Code's interactive mode serves as your primary testing REPL. Run your skill with various inputs and observe:
```bash
# Test your skill with a simple case
claude --skill your-skill-name "Process this simple input"
# Test with a complex case
claude --skill your-skill-name "Handle this edge case with special characters: @#$%"
# Test with context from a specific directory
cd /path/to/test/project && claude --skill your-skill-name "Work with this codebase"
```
What to Observe During Manual Testing
- Correctness: Does the output match your expectations?
- Consistency: Run the same input 3-5 times. Are outputs reasonably consistent?
- Graceful degradation: What happens with malformed inputs?
- Context sensitivity: Does the skill adapt to different project types?
- Performance: How long does execution take? (A small harness that spot-checks timing and consistency is sketched after this list.)
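To repeat these spot checks without retyping commands, a small script can shell out to the CLI and time each run. This is only a sketch: it reuses the invocation pattern from the example above, and the run count is an arbitrary placeholder:

```typescript
// spot-check.ts: rerun one manual test case a few times and report timing
import { execSync } from 'node:child_process';

// Adjust to however you actually invoke your skill (taken from the example above)
const COMMAND = 'claude --skill your-skill-name "Process this simple input"';
const RUNS = 3; // repeat to eyeball consistency

for (let i = 0; i < RUNS; i++) {
  const start = Date.now();
  const output = execSync(COMMAND, { encoding: 'utf-8' });
  const seconds = ((Date.now() - start) / 1000).toFixed(1);
  console.log(`--- run ${i + 1} (${seconds}s) ---`);
  console.log(output.trim());
}
```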
Creating a Manual Test Matrix
Document your manual tests in a structured format:
## Manual Test Matrix for [Skill Name]
| Test Case | Input | Expected Behavior | Actual Result | Pass/Fail |
|-----------|-------|-------------------|---------------|-----------|
| Simple input | "hello" | Processes correctly | [result] | Pass |
| Empty input | "" | Returns helpful error | [result] | Pass |
| Unicode input | "Héllo 👋" | Handles emoji correctly | [result] | Pass |
| Large input | [1000+ chars] | Completes in <5s | [result] | Pass |
| Malformed input | "{{broken" | Graceful error message | [result] | Fail |
Level 2: Unit Testing Skill Components
While you cannot unit test the AI's decision-making, you can unit test the deterministic components of your skill:
Testing Prompt Templates
If your skill constructs prompts dynamically, test the template logic:
```typescript
// tests/prompt-template.test.ts
import { buildPrompt } from '../lib/skill-prompts';
describe('Skill Prompt Templates', () => {
it('should include project context when provided', () => {
const prompt = buildPrompt({
task: 'refactor code',
projectType: 'typescript',
conventions: ['use-semicolons', 'prefer-const']
});
expect(prompt).toContain('typescript');
expect(prompt).toContain('use-semicolons');
expect(prompt).toContain('prefer-const');
});
it('should handle missing optional fields', () => {
const prompt = buildPrompt({
task: 'refactor code'
});
expect(prompt).not.toContain('undefined');
expect(prompt).not.toContain('null');
});
it('should escape special characters in user input', () => {
const prompt = buildPrompt({
task: 'handle this: ${dangerous}'
});
expect(prompt).not.toMatch(/\$\{.*\}/);
});
});
```
Testing Output Parsers
If your skill parses structured output from Claude, test those parsers:
```typescript
// tests/output-parser.test.ts
import { parseSkillOutput } from '../lib/skill-parser';
describe('Skill Output Parser', () => {
it('should extract code blocks correctly', () => {
const output = `
Here is the refactored code:
\`\`\`typescript
const x = 1;
\`\`\`
This improves readability.
`;
const result = parseSkillOutput(output);
expect(result.codeBlocks).toHaveLength(1);
expect(result.codeBlocks[0].language).toBe('typescript');
expect(result.codeBlocks[0].content).toBe('const x = 1;');
});
it('should handle outputs without code blocks', () => {
const output = 'No code changes needed.';
const result = parseSkillOutput(output);
expect(result.codeBlocks).toHaveLength(0);
expect(result.explanation).toBe('No code changes needed.');
});
});
```
Testing File Operations
If your skill reads or writes files, test those operations in isolation:
```typescript
// tests/file-operations.test.ts
import { prepareSkillContext, applySkillChanges } from '../lib/skill-files';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';
describe('Skill File Operations', () => {
let testDir: string;
beforeEach(() => {
testDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-test-'));
fs.writeFileSync(
path.join(testDir, 'test.ts'),
'const x = 1;'
);
});
afterEach(() => {
fs.rmSync(testDir, { recursive: true });
});
it('should read project files correctly', async () => {
const context = await prepareSkillContext(testDir);
expect(context.files).toContainEqual({
path: 'test.ts',
content: 'const x = 1;'
});
});
it('should apply changes without data loss', async () => {
const changes = [{
path: 'test.ts',
content: 'const x = 2;'
}];
await applySkillChanges(testDir, changes);
const content = fs.readFileSync(
path.join(testDir, 'test.ts'),
'utf-8'
);
expect(content).toBe('const x = 2;');
});
});
```
Level 3: Integration Testing
Integration tests validate the entire skill execution flow, including AI interactions. These are more expensive to run but provide the highest confidence.
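Because these tests call the real model, it helps to gate them behind credentials so they are skipped rather than failing in environments without API access. A minimal sketch, assuming Jest and the same ANTHROPIC_API_KEY variable used in the CI example later in this guide:

```typescript
// Skip the whole integration suite when no credentials are available
const describeIntegration = process.env.ANTHROPIC_API_KEY ? describe : describe.skip;

describeIntegration('Skill Integration Tests (live API)', () => {
  it('runs only when ANTHROPIC_API_KEY is set', async () => {
    // ...real executeSkill calls go here...
  });
});
```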
Structural Validation Instead of Snapshot Testing
Because AI outputs vary between runs, exact snapshots and strict equality assertions are brittle. Instead, validate structural properties of the output, or score it for semantic similarity:
```typescript
// tests/integration/skill-integration.test.ts
import { executeSkill } from '../lib/skill-executor';
import { validateOutput } from '../lib/output-validator';
describe('Skill Integration Tests', () => {
// Use longer timeout for AI operations
jest.setTimeout(60000);
it('should generate valid TypeScript when refactoring', async () => {
const input = `
function add(a, b) {
return a + b;
}
`;
const output = await executeSkill('typescript-refactor', {
code: input,
targetVersion: 'es2022'
});
// Structural validation: must be valid TypeScript
const validation = await validateOutput(output, {
language: 'typescript',
mustCompile: true
});
expect(validation.isValid).toBe(true);
expect(validation.errors).toHaveLength(0);
});
it('should preserve function semantics during refactor', async () => {
const input = `const add = (a, b) => a + b;`;
const output = await executeSkill('typescript-refactor', {
code: input
});
// Semantic validation: execute both versions in a sandbox and compare results
// Note: vm2 is discontinued; prefer isolated-vm for untrusted code
const { VM } = require('vm2');
// Use a fresh sandbox per run so `const add` is not redeclared in the same context
const originalResult = new VM().run(input + '; add(2, 3)');
const refactoredResult = new VM().run(output.code + '; add(2, 3)');
expect(refactoredResult).toBe(originalResult);
});
});
```
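The tests above validate structure and behavior. If you also want a rough guard that the accompanying explanation stays on-topic across runs, a simple word-overlap score can act as a crude stand-in for semantic similarity. This sketch assumes the result exposes an explanation field (as in the parser example earlier); the 0.3 threshold is an arbitrary placeholder, and a real setup might use an embeddings service instead:

```typescript
// Crude lexical-overlap score (Jaccard over lowercased words), a cheap proxy only
function wordOverlap(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().match(/[a-z]+/g) ?? []);
  const setA = words(a);
  const setB = words(b);
  const shared = [...setA].filter(w => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : shared / union;
}

it('keeps explanations roughly on-topic across runs', async () => {
  const input = 'const add = (a, b) => a + b;';
  const [first, second] = await Promise.all([
    executeSkill('typescript-refactor', { code: input }),
    executeSkill('typescript-refactor', { code: input }),
  ]);
  expect(wordOverlap(first.explanation, second.explanation)).toBeGreaterThan(0.3);
});
```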
Golden File Testing
For skills that produce complex outputs, use golden files: version-controlled fixtures that capture known-good inputs and baseline expectations. Because the model's exact wording varies, assert structural properties against those baselines rather than byte-for-byte equality:
```typescript
// tests/golden/commit-message.test.ts
import { executeSkill } from '../lib/skill-executor';
import * as fs from 'fs';
import * as path from 'path';
describe('Commit Message Skill - Golden Tests', () => {
const goldenDir = path.join(__dirname, 'golden-files');
it('should match golden output for feature commits', async () => {
const input = fs.readFileSync(
path.join(goldenDir, 'feature-diff.txt'),
'utf-8'
);
const output = await executeSkill('commit-message', {
diff: input,
type: 'feature'
});
// Check structural properties, not exact match
expect(output.title.length).toBeLessThanOrEqual(72);
expect(output.title).toMatch(/^(feat|feature):/i);
expect(output.body.split('\n').length).toBeGreaterThan(1);
});
it('should produce consistent quality across runs', async () => {
const input = fs.readFileSync(
path.join(goldenDir, 'bugfix-diff.txt'),
'utf-8'
);
// Run multiple times to check consistency
const outputs = await Promise.all(
Array(3).fill(null).map(() =>
executeSkill('commit-message', { diff: input, type: 'fix' })
)
);
// All outputs should have similar structure
outputs.forEach(output => {
expect(output.title).toMatch(/^fix:/i);
expect(output.title.length).toBeLessThanOrEqual(72);
});
});
});
```
Testing Patterns for Common Skill Types
Testing Command Skills
Command skills are the most straightforward to test because they have clear inputs and outputs:
```typescript
describe('PR Review Command', () => {
it('should identify security issues in code', async () => {
const code = `
const query = "SELECT * FROM users WHERE id = " + userId;
`;
const review = await executeSkill('pr-review', { code });
expect(review.issues.some(i =>
i.category === 'security' &&
i.description.toLowerCase().includes('sql injection')
)).toBe(true);
});
it('should not produce false positives for safe code', async () => {
const code = `
const query = db.prepare("SELECT * FROM users WHERE id = ?");
query.bind(userId);
`;
const review = await executeSkill('pr-review', { code });
expect(review.issues.filter(i =>
i.category === 'security'
)).toHaveLength(0);
});
});
```
Testing Agent Skills
Agent skills that orchestrate multiple tools require more sophisticated testing:
```typescript
describe('Codebase Analyzer Agent', () => {
let mockToolCalls: string[];
beforeEach(() => {
mockToolCalls = [];
// Mock tool execution to track calls
jest.spyOn(executor, 'executeTool').mockImplementation(
async (tool, args) => {
mockToolCalls.push(`${tool}:${JSON.stringify(args)}`);
return { success: true, result: 'mocked' };
}
);
});
it('should use grep before read for large codebases', async () => {
await executeSkill('codebase-analyzer', {
question: 'Where is authentication handled?',
projectSize: 'large'
});
// Verify grep was called before read
const grepIndex = mockToolCalls.findIndex(c => c.startsWith('grep:'));
const readIndex = mockToolCalls.findIndex(c => c.startsWith('read:'));
expect(grepIndex).toBeLessThan(readIndex);
});
it('should limit file reads to prevent context overflow', async () => {
await executeSkill('codebase-analyzer', {
question: 'Analyze all files',
projectSize: 'large'
});
const readCalls = mockToolCalls.filter(c => c.startsWith('read:'));
expect(readCalls.length).toBeLessThanOrEqual(10);
});
});
```
Testing Hook Skills
Hooks run at specific lifecycle points and require testing both the trigger conditions and the actions:
```typescript
describe('Pre-Commit Hook', () => {
it('should block commits with secrets', async () => {
const files = [{
path: 'config.js',
content: 'const API_KEY = "sk-1234567890abcdef";'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(false);
expect(result.reason).toContain('secret');
});
it('should allow commits with environment variables', async () => {
const files = [{
path: 'config.js',
content: 'const API_KEY = process.env.API_KEY;'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(true);
});
it('should provide actionable feedback on rejection', async () => {
const files = [{
path: '.env',
content: 'DATABASE_URL=postgres://user:pass@host/db'
}];
const result = await executeHook('pre-commit', { stagedFiles: files });
expect(result.allowed).toBe(false);
expect(result.suggestion).toBeDefined();
expect(result.suggestion).toContain('.gitignore');
});
});
```
Automated Testing Pipeline
For production skills, set up a CI/CD pipeline that runs tests on every change:
```yaml
# .github/workflows/skill-tests.yml
name: Skill Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test -- --coverage

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm run test:integration

  skill-validation:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - name: Validate SKILL.md
        run: |
          npx skill-validator validate ./SKILL.md
      - name: Lint skill prompts
        run: |
          npx prompt-lint ./prompts/
```
Debugging Failed Tests
When tests fail, use these strategies to identify the root cause:
1. Capture Full Execution Traces
```typescript
import { enableDebugLogging } from '../lib/debug';
beforeAll(() => {
enableDebugLogging({
capturePrompts: true,
captureResponses: true,
saveToFile: './test-logs/'
});
});
```
2. Compare Across Multiple Runs
```typescript
it('should be consistent across runs', async () => {
const results = [];
for (let i = 0; i < 5; i++) {
const output = await executeSkill('your-skill', { input: 'test' });
results.push(output);
}
// Log for debugging
console.log('Run results:', JSON.stringify(results, null, 2));
// Check for consistency
const uniqueResults = new Set(results.map(r => JSON.stringify(r)));
expect(uniqueResults.size).toBeLessThanOrEqual(2); // Allow some variation
});
```
3. Isolate AI vs. Code Issues
```typescript
it('should handle AI errors gracefully', async () => {
// Mock a failed AI response
jest.spyOn(claude, 'complete').mockRejectedValueOnce(
new Error('Rate limited')
);
const result = await executeSkill('your-skill', { input: 'test' });
// Skill should handle the error, not crash
expect(result.error).toBeDefined();
expect(result.error.code).toBe('AI_ERROR');
expect(result.error.retryable).toBe(true);
});
```
Best Practices Summary
- Start with manual testing. Understand your skill's behavior before automating.
- Test the boundaries. Focus on edge cases, error conditions, and unexpected inputs.
- Use structural validation. Don't assert exact outputs; validate structure and invariants.
- Mock expensive operations. Use mocks for AI calls in unit tests; reserve real calls for integration tests (see the sketch after this list).
- Track consistency. Run tests multiple times to catch non-determinism.
- Automate in CI/CD. Catch regressions before they reach users.
- Log everything. When tests fail, you need context to debug.
- Version your test data. Golden files and test fixtures should be version controlled.
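For the "Mock expensive operations" point, one common pattern is to replace the model client with a canned response in unit tests so they stay fast and free. The module path and response shape below are assumptions for illustration, not part of any official API:

```typescript
// tests/unit/skill-logic.test.ts: unit test with the AI call mocked out
import { executeSkill } from '../lib/skill-executor';

// Replace the (assumed) client module with a deterministic completion
jest.mock('../lib/claude-client', () => ({
  complete: jest.fn().mockResolvedValue({ text: 'fix: correct null check in parser' })
}));

it('wraps the model response into the expected result shape', async () => {
  const result = await executeSkill('commit-message', { diff: 'example diff', type: 'fix' });
  expect(result.title).toMatch(/^fix:/i);
});
```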
Testing Claude Code skills requires a mindset shift from traditional software testing, but the fundamentals remain: start simple, automate gradually, and always prioritize the user experience. A well-tested skill is a reliable skill—and reliability builds trust.
Ready to ensure your skills are production-quality? Check out our Security Best Practices guide next.