Scientific Skills: Bioinformatics Deep Dive

Bioinformatics sits at the intersection of biology, computer science, and statistics. Researchers in this field juggle complex data formats, intricate analysis pipelines, and rapidly evolving tools. Claude Code skills for bioinformatics bring AI assistance to these specialized workflows, helping researchers focus on scientific questions rather than technical friction.

This deep dive explores the bioinformatics skill ecosystem: what's available, how the skills work, and how they can transform computational biology workflows.

## The Bioinformatics Challenge

Computational biology presents unique challenges for AI assistance:

**Domain Complexity**: Interpreting DNA sequences, protein structures, and biological pathways requires specialized knowledge.

**Tool Ecosystem**: Dozens of established tools (BLAST, HMMER, BWA, samtools) each have their own invocation patterns and output formats.

**Data Volumes**: Genomic datasets are massive. A single sequencing run can produce gigabytes to terabytes of data, depending on the platform.

**Pipeline Orchestration**: Analyses chain multiple tools, and each step depends on the output of the previous one.

**Reproducibility**: Scientific work must be reproducible, so every step needs documentation.

Effective bioinformatics skills address these challenges by encoding domain knowledge and tool expertise.

## Skill Category 1: Sequence Analysis

The most fundamental bioinformatics operation is sequence analysis: working with DNA, RNA, and protein sequences.

### sequence-analyzer

A comprehensive skill for basic sequence operations:

# Sequence Analyzer

Analyze nucleotide and protein sequences.

## Capabilities

### Sequence Identification
- Detect sequence type (DNA, RNA, Protein)
- Identify organism origin when possible
- Check for common sequence features

### Basic Analysis
- Length and composition
- GC content (for nucleotides)
- Molecular weight (for proteins)
- Codon usage

### Motif Finding
- Restriction enzyme sites
- Regulatory motifs
- Protein domains

## Sequence Input Formats

Supports:
- Raw sequence (ACGT...)
- FASTA format
- GenBank format
- FASTQ (with quality scores)

## Example Workflow

User: "Analyze this sequence: ATGCGATCGATCGATCG"

1. Identify: DNA sequence, 17 bp
2. Composition: A=4, T=4, G=5, C=4
3. GC content: 52.9%
4. Restriction sites: MboI (GATC) at positions 5, 9, and 13
5. Open reading frames: 1 partial ORF (ATG at position 1, no in-frame stop codon)
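
The counting steps of this workflow are simple enough to sketch directly. The snippet below is a minimal illustration in Python, not the skill's actual implementation; the function names `summarize_dna` and `find_motif` are hypothetical and assume plain A/C/G/T input.

```python
from collections import Counter


def summarize_dna(seq: str) -> dict:
    """Return length, base counts, and GC percentage for a DNA string."""
    seq = seq.upper()
    counts = Counter(seq)
    gc = counts["G"] + counts["C"]
    return {
        "length": len(seq),
        "counts": {base: counts[base] for base in "ACGT"},
        "gc_percent": round(100 * gc / len(seq), 1) if seq else 0.0,
    }


def find_motif(seq: str, motif: str) -> list[int]:
    """Return 1-based start positions of every (possibly overlapping) match."""
    seq, motif = seq.upper(), motif.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(motif, start + 1)
    return positions


example = "ATGCGATCGATCGATCG"
summary = summarize_dna(example)
gatc_sites = find_motif(example, "GATC")
print(summary)
print(gatc_sites)
```

Searching with `find_motif` for a recognition sequence such as GATC or GGCC is a quick way to check any restriction site a report claims to have found.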

### blast-assistant

Integrates with BLAST for sequence similarity searches:

# BLAST Assistant

Run and interpret BLAST searches.

## Supported Programs

- blastn: Nucleotide vs nucleotide
- blastp: Protein vs protein
- blastx: Translated nucleotide vs protein
- tblastn: Protein vs translated nucleotide
- tblastx: Translated vs translated

## Workflow

### 1. Query Preparation
- Validate sequence format
- Choose appropriate BLAST program
- Select target database

### 2. Run BLAST
```bash
blastn -query input.fasta -db nt -outfmt 6 -out results.tsv
```

### 3. Results Interpretation
- Filter by E-value threshold
- Identify significant hits
- Report taxonomic distribution
- Assess coverage and identity

## Output Format

### Summary
- Query: [sequence ID]
- Database: [database name]
- Hits: [number found]
- Top hit: [description] (E-value: X, Identity: Y%)

### Significant Alignments
[Table of hits with E-value, identity, coverage]

### Interpretation
[Plain language explanation of results]
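
The filtering step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the skill's own code: it assumes the default 12-column layout that `-outfmt 6` produces, the function name `parse_blast6` and the 1e-5 cutoff are hypothetical choices, and the two sample rows are made up.

```python
import csv
import io

# Default column order for `-outfmt 6` tabular output.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(handle, max_evalue: float = 1e-5) -> list[dict]:
    """Parse tabular BLAST output and keep hits at or below max_evalue."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    # Strongest hits first
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Made-up rows, for illustration only.
example = io.StringIO(
    "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-100\t360\n"
    "q1\tsubjB\t75.0\t80\t20\t0\t10\t89\t5\t84\t0.5\t40.1\n"
)
significant = parse_blast6(example)
```

With a real results file, the same function would be called as `parse_blast6(open("results.tsv"))`. If a custom column list is passed to `-outfmt`, `BLAST6_COLUMNS` has to be changed to match.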


### primer-designer

Specialized skill for PCR primer design:

```markdown
# Primer Designer

Design PCR primers following best practices.

## Design Parameters

### Physical Properties
- Length: 18-25bp (optimal 20-22)
- GC content: 40-60%
- Melting temperature: 55-65°C
- Tm difference between primers: <2°C

### Sequence Considerations
- 3' end: G or C (GC clamp)
- Avoid runs: >4 identical bases
- Avoid hairpins: >4bp self-complementary
- Avoid dimers: >4bp primer-primer complementary

### Amplicon
- Product size: User specified
- Check for off-target sites

## Workflow

1. Accept target sequence
2. Identify region to amplify
3. Generate candidate primers
4. Score candidates against parameters
5. Check for specificity (BLAST against genome)
6. Report best primer pairs

## Output

### Recommended Primer Pair (placeholder values showing the report layout)

**Forward Primer**
- Sequence: 5'-ATGCGATCGATCGATC-3'
- Length: 16bp
- Tm: 58.2°C
- GC: 50%

**Reverse Primer**
- Sequence: 5'-GATCGATCGATCGCAT-3'
- Length: 16bp
- Tm: 58.4°C
- GC: 50%

**Amplicon**
- Size: 450bp
- Location: chr1:12345-12795

### Quality Assessment
[Score against each parameter]

### Order-Ready Format
[Format suitable for ordering from synthesis company]
```
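
Several of the design parameters can be checked mechanically. The sketch below scores one primer against the length, GC content, GC clamp, and base-run rules listed in the skill. It is an illustration, not the skill's implementation: `check_primer` and `longest_run` are hypothetical names, the input is assumed to be a non-empty A/C/G/T string, and the melting temperature uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough estimate that real primer design tools replace with nearest-neighbor thermodynamics.

```python
def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    best = run = 0
    previous = ""
    for base in seq:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


def check_primer(seq: str) -> dict:
    """Score one primer against the length, GC, clamp, and run rules."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    report = {
        "length": len(seq),
        "gc_percent": 100 * gc / len(seq),
        # Wallace rule: rough estimate only, intended for short oligos
        "tm_wallace": 2 * at + 4 * gc,
        "gc_clamp": seq[-1] in "GC",
        "longest_run": longest_run(seq),
    }
    problems = []
    if not 18 <= report["length"] <= 25:
        problems.append("length outside 18-25 bp")
    if not 40 <= report["gc_percent"] <= 60:
        problems.append("GC content outside 40-60%")
    if not report["gc_clamp"]:
        problems.append("no G/C at 3' end")
    if report["longest_run"] > 4:
        problems.append("run of more than 4 identical bases")
    report["problems"] = problems
    return report


report = check_primer("ATGCGATCGATCGATC")
print(report["problems"])
```

Run on the 16 bp forward primer from the sample output, the check flags the length, which falls below the 18-25 bp range the skill itself recommends. Hairpin and primer-dimer checks need complementarity analysis and are not covered here.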

## Skill Category 2: Genomics Pipelines

More complex skills orchestrate multi-step genomic analyses.

### variant-caller

Coordinates the variant calling pipeline:

# Variant Caller Pipeline

Call variants from sequencing data.

## Pipeline Steps

### 1. Quality Control
```bash
fastqc raw_R1.fastq raw_R2.fastq
```

Report read quality, adapter content, duplicates.

### 2. Trimming (if needed)
```bash
trimmomatic PE raw_R1.fastq raw_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq \
  trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```

### 3. Alignment
```bash
# GATK needs read-group tags; "sample1" is a placeholder sample name
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
  reference.fa trimmed_R1.fastq trimmed_R2.fastq \
  | samtools sort -o aligned.bam
samtools index aligned.bam
```

### 4. Mark Duplicates
```bash
# CREATE_INDEX writes the BAM index that HaplotypeCaller expects
picard MarkDuplicates \
  I=aligned.bam \
  O=dedup.bam \
  M=metrics.txt \
  CREATE_INDEX=true
```

### 5. Variant Calling
```bash
gatk HaplotypeCaller \
  -R reference.fa \
  -I dedup.bam \
  -O variants.vcf
```

### 6. Filtering
```bash
gatk VariantFiltration \
  -R reference.fa \
  -V variants.vcf \
  -O filtered.vcf \
  --filter-expression "QD < 2.0" --filter-name "QD2" \
  --filter-expression "FS > 60.0" --filter-name "FS60"
```

## Output Interpretation

### Summary Statistics
- Total reads: X
- Mapped reads: Y (Z%)
- Duplicates marked: A (B%)
- Variants called: C
- SNPs: D
- Indels: E

### Quality Metrics
[Table of quality scores at each step]

### Notable Variants
[List of high-impact variants with annotation]
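
The SNP and indel counts in the summary can be derived from the REF and ALT columns of the VCF. The following is a simplified sketch, not the skill's implementation: `count_variant_types` is a hypothetical helper, the sample records are made up, and structural-variant notations such as breakends are not handled because a short-variant caller is not expected to emit them.

```python
def count_variant_types(vcf_lines) -> dict:
    """Count SNPs and indels, treating each ALT allele separately."""
    counts = {"snp": 0, "indel": 0, "other": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            if alt in (".", "*") or alt.startswith("<"):
                counts["other"] += 1  # missing, spanning-deletion, or symbolic allele
            elif len(ref) == 1 and len(alt) == 1:
                counts["snp"] += 1
            elif len(ref) != len(alt):
                counts["indel"] += 1
            else:
                counts["other"] += 1  # e.g. multi-nucleotide substitutions
    return counts


# Made-up records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\tT,CA\t50\tPASS\t.",
]
variant_counts = count_variant_types(example_vcf)
```

Counting per ALT allele means a multi-allelic site contributes more than one variant, which is one of several reasonable conventions; tools differ on this, so the convention should be stated alongside the numbers.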


### rnaseq-pipeline

RNA-seq analysis from reads to differential expression:

```markdown
# RNA-seq Analysis Pipeline

Analyze RNA-seq data for differential expression.

## Pipeline Phases

### Phase 1: Read Processing
1. Quality check (FastQC)
2. Adapter trimming (cutadapt)
3. Quality filtering

### Phase 2: Alignment
Two approaches:
- Genome alignment (STAR/HISAT2) for splice-aware mapping
- Pseudoalignment (kallisto/salmon) for fast quantification

### Phase 3: Quantification
- Generate count matrix
- Normalize counts (TPM or FPKM for reporting; keep raw counts for DE analysis, since DESeq2 and edgeR apply their own normalization)

### Phase 4: Differential Expression
Using DESeq2 or edgeR:
1. Load count matrix
2. Define experimental design
3. Run DE analysis
4. Adjust for multiple testing

### Phase 5: Visualization & Interpretation
- MA plots
- Volcano plots
- PCA of samples
- Heatmaps of DE genes
- GO enrichment analysis

## Experimental Design Guidance

### Required Information
- Sample grouping (condition, treatment, etc.)
- Batch information if applicable
- Paired design if applicable

### Statistical Considerations
- Minimum 3 replicates per condition
- Account for batch effects
- Choose appropriate multiple testing correction

## Output

### Summary Report
- Samples processed: X
- Reads mapped: Y% average
- Genes detected: Z
- Differentially expressed: A (padj < 0.05)
  - Upregulated: B
  - Downregulated: C

### Key Results
[Top 20 differentially expressed genes with descriptions]

### Enrichment Analysis
[Enriched GO terms and pathways]
```
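
The `padj` threshold in the summary report refers to p-values adjusted for multiple testing. DESeq2 uses the Benjamini-Hochberg procedure by default, and the sketch below shows that adjustment in plain Python to make the step concrete. It is illustrative only: the function name is hypothetical, the p-values are made up, and a real analysis would rely on the DE package's own output.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values, for illustration only.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p for p in padj if p < 0.05]
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a larger raw p-value from ending up with a smaller adjusted value than a smaller one.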

## Skill Category 3: Structural Biology

Skills for protein structure and molecular modeling.

### structure-analyzer

Protein structure analysis and visualization:

# Structure Analyzer

Analyze protein 3D structures.

## Input Formats
- PDB files
- mmCIF files
- AlphaFold predictions

## Analyses

### Basic Metrics
- Chain count and lengths
- Resolution (for experimental structures)
- pLDDT scores (for AlphaFold)
- Secondary structure composition

### Structural Features
- Domain identification
- Active site detection
- Binding pocket analysis
- Surface area calculations

### Comparison
- RMSD to reference structure
- Structural alignment
- Conservation mapping

## Integration with Tools

### PyMOL/ChimeraX Commands
Generate visualization commands for external tools:

```
fetch 1XYZ
show cartoon
spectrum b
```


### Structure Prediction
Interface with:
- AlphaFold for prediction
- ESMFold for fast folding
- ColabFold for multimer prediction

## Output

### Structure Report

**General Information**
- PDB ID: 1XYZ
- Resolution: 2.1 Å
- Chains: A, B (homodimer)
- Length: 342 residues per chain

**Secondary Structure**
- Alpha helix: 45%
- Beta sheet: 22%
- Loops/coils: 33%

**Functional Sites**
- Active site residues: H143, D165, S203
- Binding pocket volume: 450 Å³

**Quality Assessment**
- Ramachandran favored: 98%
- Rotamer outliers: 1.2%
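
The RMSD comparison listed under Comparison reduces to a short formula once two structures share the same atoms in the same order. The sketch below assumes the coordinates are already superimposed; it performs no structural alignment, which dedicated tools handle. The function name `rmsd` and the sample coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """RMSD between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second atom is displaced by 2 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(reference, model)
```

The result is in the same unit as the input coordinates, so PDB coordinates give an RMSD in ångströms.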

### docking-assistant

Molecular docking workflow support:

# Docking Assistant

Support molecular docking calculations.

## Workflow

### 1. Receptor Preparation
- Load protein structure
- Remove water and ions
- Add hydrogens
- Assign charges
- Define binding site

### 2. Ligand Preparation
- Generate 3D structure from SMILES/SDF
- Add hydrogens
- Generate conformers
- Assign charges

### 3. Docking Run
Configure docking software (AutoDock Vina, Glide, GOLD):
```bash
vina --receptor receptor.pdbqt \
     --ligand ligand.pdbqt \
     --center_x 10 --center_y 20 --center_z 30 \
     --size_x 20 --size_y 20 --size_z 20 \
     --out results.pdbqt
```

### 4. Results Analysis
- Rank poses by score
- Analyze binding interactions
- Visualize top poses
- Calculate binding energies

## Output

### Docking Results

**Top 5 Poses**

| Rank | Score (kcal/mol) | RMSD | Key Interactions |
|------|------------------|------|------------------|
| 1 | -9.2 | 0.0 | H-bond to R143, pi-stack with F167 |
| ... | ... | ... | ... |

**Binding Analysis**
- Hydrogen bonds: 3
- Hydrophobic contacts: 7
- Salt bridges: 1

**Visualization**
[PyMOL commands to visualize top pose]


## Skill Category 4: Data Wrangling

Bioinformatics-specific data manipulation skills.

### format-converter

Convert between bioinformatics file formats:

````markdown
# Format Converter

Convert between bioinformatics file formats.

## Sequence Formats

### Conversions Supported
- FASTA <-> GenBank
- FASTA <-> FASTQ (with quality)
- FASTQ -> FASTA (strip quality)
- GFF <-> GTF
- VCF <-> BED
- SAM <-> BAM
- BAM <-> CRAM

### Implementation

Uses standard tools:
- seqtk for sequence formats
- samtools for alignment formats
- bcftools for variant formats
- gffread for annotation formats

## Example Conversions

### FASTQ to FASTA
```bash
seqtk seq -a input.fastq > output.fasta
```

### BAM to BED (coverage)
```bash
bedtools genomecov -ibam input.bam -bg > output.bed
```

### VCF to BED
```bash
bcftools query -f '%CHROM\t%POS0\t%END\n' input.vcf > output.bed
```

## Batch Processing

For multiple files:
```bash
for f in *.fastq; do
  seqtk seq -a "$f" > "${f%.fastq}.fasta"
done
```
````
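
To make clear what the FASTQ to FASTA conversion does to the data, here is a minimal pure-Python sketch. It assumes the common four-line FASTQ layout (header, sequence, `+` separator, quality) and is not a replacement for seqtk, which is much faster on large files; the function name `fastq_to_fasta` and the sample reads are illustrative.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle) -> int:
    """Copy records from FASTQ to FASTA, dropping qualities. Returns the record count."""
    written = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = fastq_handle.readline()
        plus = fastq_handle.readline()
        quality = fastq_handle.readline()
        if not header.startswith("@") or not plus.startswith("+") or not quality:
            raise ValueError(f"malformed FASTQ record near record {written + 1}")
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n")
        fasta_handle.write(sequence.rstrip("\n") + "\n")
        written += 1
    return written


# Made-up reads, for illustration only.
source = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGGCC\n+\n!!!!\n")
target = io.StringIO()
record_count = fastq_to_fasta(source, target)
```

Because the function reads one record at a time, memory use stays flat regardless of file size. Converting in the other direction would require inventing quality scores, so FASTA to FASTQ output should be treated as a placeholder rather than real data.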


### stats-calculator

Calculate common bioinformatics statistics:

```markdown
# Bioinformatics Statistics Calculator

Calculate statistics for biological data.

## Sequence Statistics

### DNA/RNA
- Length distribution
- GC content
- Nucleotide frequencies
- Di/trinucleotide frequencies
- Sequence complexity (K-mer diversity)

### Protein
- Length distribution
- Amino acid frequencies
- Molecular weight distribution
- Isoelectric point distribution

## Alignment Statistics

### Read Mapping
- Total reads
- Mapped/unmapped counts
- Mapping quality distribution
- Insert size distribution (paired-end)
- Coverage depth

### Multiple Sequence Alignment
- Alignment length
- Conservation scores
- Gap statistics
- Phylogenetic signal

## Variant Statistics

- Total variants
- SNP/Indel ratio
- Transition/transversion ratio
- Allele frequency distribution
- Variant effect predictions

## Output Formats

- Plain text summary
- TSV for downstream analysis
- JSON for programmatic use
- Plots (via matplotlib/R)
```
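
As one worked example from the variant statistics list, the transition/transversion ratio can be computed from the REF and ALT bases of each SNP. Transitions swap a purine for a purine (A<->G) or a pyrimidine for a pyrimidine (C<->T); every other substitution is a transversion. The sketch below is illustrative, with hypothetical function names and made-up input.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ti_tv_ratio(snps):
    """Return transitions / transversions, or None if there are no transversions."""
    transitions = transversions = 0
    for ref, alt in snps:
        if classify_snp(ref, alt) == "transition":
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


# Made-up SNPs, for illustration only.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("T", "C"), ("A", "T")])
```

A ratio far from what is typical for the organism and assay is a common hint that a call set contains many false positives, which is why it appears among the sanity statistics.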

## Building Custom Bioinformatics Skills

When existing skills don't fit your workflow, build custom ones. Key patterns:

### Pattern 1: Tool Wrapper

Wrap a specific bioinformatics tool with sensible defaults:

# HMMER Search

Search protein sequences against HMM databases.

## Default Configuration
- Database: Pfam-A (protein domain database)
- E-value cutoff: 1e-5
- CPU threads: 4

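## Command
```bash
# Pfam-A.hmm must be prepared once with: hmmpress Pfam-A.hmm
# --tblout writes a space-delimited table, not a TSV
hmmscan --cpu 4 -E 1e-5 --tblout results.tbl Pfam-A.hmm query.fasta
```

## Output Interpretation
[Parse and explain hmmscan output]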


### Pattern 2: Multi-Tool Pipeline

Chain multiple tools with proper data flow:

```markdown
# Metagenomics Pipeline

Analyze metagenomic sequencing data.

## Stage 1: Quality & Host Removal
[FastQC -> Trimmomatic -> Bowtie2 host removal]

## Stage 2: Assembly
[metaSPAdes or megahit]

## Stage 3: Binning
[MetaBAT2 or CONCOCT]

## Stage 4: Taxonomy
[Kraken2 or MetaPhlAn]

## Stage 5: Function
[Prokka annotation -> eggNOG-mapper]
```

### Pattern 3: Analysis Framework

Provide structured analysis with interpretation:

# Phylogenetic Analysis

Build and interpret phylogenetic trees.

## Input
Multiple sequence alignment (MSA)

## Analysis Steps
1. Model selection (ModelTest-NG)
2. Tree building (RAxML-NG or IQ-TREE)
3. Bootstrap analysis
4. Tree visualization

## Interpretation Guidance
- Bootstrap values interpretation
- Branch length meaning
- Monophyly assessment
- Divergence time estimates (if calibrated)

## Practical Considerations

### Data Size

Bioinformatics data is large. Skills should:

- Use streaming when possible
- Provide progress indicators for long operations
- Support resumable operations
- Clean up intermediate files

### Reproducibility

Scientific work must be reproducible:

- Log all commands with parameters
- Record software versions
- Save random seeds
- Document reference databases used
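
Command logging is straightforward to build into a skill's helper scripts. The sketch below wraps `subprocess.run` and appends one JSON line per command to a log file. It is one possible design, not a prescribed one: `run_logged` and the log layout are hypothetical, and the demonstration calls the Python interpreter itself so that it runs without any bioinformatics tool installed.

```python
import json
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def run_logged(command: list[str], log_path: Path) -> subprocess.CompletedProcess:
    """Run a command and append what was run, when, and its exit code to a JSON-lines log."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    entry = {
        "command": command,
        "started_utc": started,
        "returncode": result.returncode,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return result


# Demonstration with the Python interpreter, so no external tool is needed.
log_file = Path(tempfile.mkdtemp()) / "commands.jsonl"
result = run_logged([sys.executable, "--version"], log_file)
```

The same wrapper can record software versions by logging calls such as `samtools --version` at the start of a run and storing the captured `result.stdout` next to the analysis outputs.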

### Validation

Bioinformatics has many ways to go wrong:

- Validate input formats before processing
- Check for common errors (empty files, truncated data)
- Verify output sanity (reasonable values)
- Compare against known controls when available
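
A format check of this kind can be quite small. The sketch below looks for three of the problems mentioned above in FASTQ input: an empty file, a truncated final record, and sequence and quality lines of different lengths. It assumes the four-line record layout and reads all lines into memory, so it illustrates the checks rather than offering a production validator; `validate_fastq` is a hypothetical name.

```python
def validate_fastq(lines) -> list[str]:
    """Return a list of problems found in FASTQ lines (an empty list means none found)."""
    lines = [line.rstrip("\n") for line in lines]
    while lines and not lines[-1]:
        lines.pop()  # ignore trailing blank lines
    if not lines:
        return ["file is empty"]
    problems = []
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4 (truncated file?)")
    # Check only the complete four-line records
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, seq, plus, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"record {record}: separator line does not start with '+'")
        if len(seq) != len(qual):
            problems.append(f"record {record}: sequence and quality lengths differ")
    return problems


good = validate_fastq(["@r1\n", "ACGT\n", "+\n", "IIII\n"])
truncated = validate_fastq(["@r1", "ACGT", "+", "IIII", "@r2", "AC"])
```

Running a check like this before the first pipeline step turns a confusing failure deep inside an aligner into an immediate, readable error.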

### Tool Installation

Many bioinformatics tools have complex dependencies:

- Document installation requirements
- Provide conda/docker options
- Check tool availability before running
- Suggest alternatives for unavailable tools
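
Checking availability before running takes only the standard library. In this sketch, `missing_tools` is a hypothetical helper and the tool list is an example; `shutil.which` looks each name up on `PATH` the way a shell would.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the names from `required` that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


missing = missing_tools(["bwa", "samtools", "gatk"])
if missing:
    print("Missing tools:", ", ".join(missing))
```

Reporting every missing tool at once, rather than failing on the first, lets a user install everything in a single pass.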

## Integration Examples

### Lab Notebook Integration

## Experiment: RNA-seq Analysis of Treatment Response

### Date: 2025-01-19

### Samples
- Control: C1, C2, C3
- Treated: T1, T2, T3

### Analysis
Used /rnaseq-pipeline with default parameters.
Reference genome: GRCh38
Annotation: GENCODE v42

### Commands Run
[Auto-logged by skill]

### Results
[Structured output from skill]

### Interpretation
[Manual notes]

### Workflow Automation

```bash
# Automated variant calling for new samples
# -p runs Claude Code non-interactively, so the loop does not stop at a prompt
for sample in samples/*.fastq.gz; do
  claude -p "/variant-caller --input $sample --reference GRCh38"
done
```

## Conclusion

Bioinformatics skills for Claude Code bring AI assistance to one of the most data-intensive scientific disciplines. From basic sequence analysis to complex multi-stage pipelines, these skills encode domain expertise and tool knowledge into reusable, consistent workflows.

For researchers new to bioinformatics, these skills lower the barrier to entry. Commands that previously required reading extensive documentation become accessible through natural language. For experienced bioinformaticians, skills standardize routine operations and free cognitive resources for scientific interpretation.

The key to effective bioinformatics skills is balancing abstraction with control. Scientists need to understand what's happening (for reproducibility and interpretation) while not getting bogged down in tool invocation details. The best skills provide sensible defaults while exposing parameters that matter scientifically.

As genomic and proteomic data generation continues to accelerate, AI-assisted analysis becomes increasingly valuable. Claude Code skills position researchers to handle this data flood effectively, spending their time on biological insight rather than computational mechanics.

Exploring other scientific domains? Check out PDF Processing Skills Compared for research paper extraction, or explore Documentation Skills Roundup for generating research documentation.
