Self-Hosting AI After Usage Limits Hit
What to do when Claude Pro limits cap out. A practical guide to running local models as fallback without losing your workflow.
It happens to every heavy Claude Code user eventually. You are deep in a feature build, the code is flowing, and then the message appears: you have hit your usage limit. Your AI coding assistant is now unavailable for the next few hours. The feature is half-built. Your tests are failing. And you are stuck.
This is not an argument against cloud-based AI tools. Claude's quality justifies the subscription. But depending entirely on a single provider with usage caps is a single point of failure in your workflow. A local model fallback eliminates that vulnerability.
Key Takeaways
- Local models are not Claude replacements -- they are fallback tools for when cloud limits hit at the worst possible moment
- Ollama makes running local models as simple as ollama run, with no GPU configuration needed for smaller models
- A 7B parameter model running locally handles 80% of routine coding tasks -- completions, simple refactors, boilerplate generation
- The setup takes 30 minutes and zero ongoing maintenance, making it worth having even if you rarely need it
- Switching between Claude and local models should be seamless -- design your workflow so the transition requires no configuration changes
When Limits Become a Problem
Claude Pro at $20/month and Claude Max at $100-200/month are generous for most workflows. But certain patterns burn through limits fast.
Long context sessions. Working on a large codebase with many files in context consumes tokens quickly. A single session with 50 files loaded can use more tokens than a day of short, focused prompts.
Iterative debugging. When you are going back and forth with Claude to fix a tricky bug, each iteration consumes tokens on both the prompt and the response. Ten rounds of "try this, no that did not work, try this instead" can consume a significant chunk of your daily allocation.
Multi-session parallel work. If you are running multiple Claude Code sessions across different worktrees (as described in the AI dev workflow guide), each session's token usage adds up independently.
The limit is not a problem until it is. And when it is, it is always at the worst time -- mid-feature, mid-debug, mid-deadline.
The Local Model Stack
The local model ecosystem has matured significantly. Here is the stack that works.
Ollama: The Runtime
Ollama is the simplest way to run language models locally. It handles model downloading, quantization, and serving behind a simple API.
# Install Ollama
brew install ollama
# Start the Ollama server
ollama serve
# Pull a coding-focused model
ollama pull codellama:13b
ollama pull deepseek-coder-v2:16b
Ollama exposes an OpenAI-compatible API on localhost:11434, which means tools that work with the OpenAI API can point to Ollama with a URL change.
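To see what "a URL change" means in practice, here is a minimal sketch of calling that endpoint from Python using only the standard library. The endpoint path and payload shape follow the OpenAI chat-completions format that Ollama mirrors; the model name and prompt are just examples.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on the default port
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-format chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask_local(prompt: str, model: str = "deepseek-coder-v2:16b") -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request and response shapes match OpenAI's, swapping this endpoint for a cloud one is a one-line change to the URL.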
Model Selection
Not all local models are equal for coding tasks. Here is what works at different hardware levels.
8GB RAM (no dedicated GPU):
- CodeLlama 7B (Q4 quantized) -- basic completions, simple refactors
- Decent for boilerplate, weak on complex logic
16GB RAM (no dedicated GPU):
- DeepSeek Coder V2 16B (Q4) -- stronger reasoning, multi-file awareness
- Good enough for most routine development tasks
32GB+ RAM or dedicated GPU:
- CodeLlama 34B or Mixtral 8x7B -- approaching cloud model quality for many tasks
- Can handle architectural decisions and complex debugging
Apple Silicon Macs (M1/M2/M3/M4):
- Unified memory architecture means the GPU memory is your RAM
- M2 Pro (16GB) comfortably runs 13B models
- M3 Max (64GB) can run 70B models with reasonable speed
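A rough rule of thumb ties these hardware tiers to model size: a Q4-quantized model needs about 0.5 bytes per parameter for the weights plus roughly 20% overhead for the KV cache and runtime buffers. The 0.6 bytes/parameter figure below is that ballpark, not a precise measurement.

```python
def q4_footprint_gb(params_billions: float, bytes_per_param: float = 0.6) -> float:
    """Estimate memory footprint of a Q4-quantized model in GB.

    ~0.5 bytes/param for 4-bit weights plus ~20% overhead for the
    KV cache and runtime buffers -- a ballpark, not a guarantee.
    """
    return params_billions * bytes_per_param

for size in (7, 13, 34):
    print(f"{size}B -> ~{q4_footprint_gb(size):.0f} GB")
```

This lines up with the sizes quoted in the FAQ: roughly 4GB for a 7B model, 8GB for 13B, and 20GB for 34B.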
The Quality Gap
Let me be direct: local models are worse than Claude for coding tasks. The gap is significant for complex reasoning, architectural decisions, and understanding large codebases. You will notice the difference immediately.
But for the specific use case of "my cloud AI is down and I need to finish this feature," local models cover the gap. The tasks you are most likely doing when you hit a limit -- finishing a half-built feature, fixing a known bug, writing boilerplate -- are exactly the tasks local models handle adequately.
Setting Up the Fallback
The goal is a setup where switching from Claude to a local model requires minimal effort.
Option 1: Continue CLI
Continue is an open-source coding assistant that works with both cloud and local models. Configure it with Claude as the primary and Ollama as the fallback.
{
"models": [
{
"title": "Claude (Primary)",
"provider": "anthropic",
"model": "claude-sonnet-4-20250514"
},
{
"title": "Local Fallback",
"provider": "ollama",
"model": "deepseek-coder-v2:16b"
}
]
}
When Claude limits hit, switch to the local model with a single click.
Option 2: Terminal-Based Fallback
If you prefer staying in the terminal (and you should, based on the workflow from our dev setup guide), set up a shell alias that routes to your local model.
# In .zshrc
alias ai-local="ollama run deepseek-coder-v2:16b"
# Or use a wrapper that provides similar UX to Claude Code
alias ai-fallback="aichat --model ollama:deepseek-coder-v2:16b"
Option 3: API-Compatible Proxy
For tools that expect the OpenAI or Anthropic API format, run a proxy that routes requests to Ollama.
# LiteLLM proxy translates API formats
pip install litellm
litellm --model ollama/deepseek-coder-v2:16b --port 8000
This proxy accepts requests in the OpenAI format and routes them to your local Ollama instance. Tools that work with the OpenAI API work with this proxy without changes.
What Local Models Handle Well
Code Completion
Given a partial function with clear context, local models complete it correctly most of the time. The pattern is straightforward and the model has seen millions of similar examples.
// Given this partial function in a TypeScript file:
function calculateDiscount(price: number, tier: 'basic' | 'premium' | 'enterprise') {
  // A local model completes the body reasonably well
}
Simple Refactors
Renaming variables, extracting functions, converting callbacks to async/await -- these mechanical transformations are within the capability of even 7B models.
Boilerplate Generation
React components, API routes, test scaffolds, database queries -- repetitive patterns that follow templates are local model territory. They may not match Claude's polish, but they produce working code.
Documentation and Comments
Writing JSDoc comments, README sections, and inline documentation is where local models approach cloud model quality. The task is more about language than reasoning.
What Local Models Handle Poorly
Multi-File Changes
Local models struggle to keep multiple files in context simultaneously. If your task requires modifying a component, its tests, and the page that uses it, expect to guide the model through each file separately rather than describing the change once and having it propagate.
Complex Debugging
"This test fails intermittently and I think it's a race condition in the WebSocket handler" -- this level of debugging requires reasoning that local models do not reliably provide. Save these tasks for when Claude is available.
Architecture Decisions
"Should I use a state machine or a simple boolean flag for this workflow?" -- local models will give you an answer, but it will lack the nuanced trade-off analysis Claude provides. Do not make architecture decisions based on local model advice.
Cost Analysis
Running local models has no per-token cost, but it is not free.
Electricity. Running a 13B model on an M2 Pro draws about 15-20 watts of additional power. At US electricity rates, this is roughly $0.50/month for heavy use.
Hardware depreciation. If you bought a Mac with more RAM specifically for local models, factor that into the cost. An extra $200 for 16GB of additional unified memory amortized over 4 years is about $4/month.
Speed cost. Local models generate tokens at 10-30 tokens/second on typical hardware, compared to 60-100 tokens/second from Claude. You will wait longer for responses.
Total effective cost: Under $5/month for occasional fallback use. Worth it for the insurance alone.
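The electricity figure is easy to sanity-check. Assuming a 20W extra draw, six hours of inference per day, and $0.14/kWh (illustrative values, not measurements), the arithmetic works out like this:

```python
def monthly_energy_cost(watts: float, hours_per_day: float,
                        usd_per_kwh: float, days: int = 30) -> float:
    """Monthly electricity cost in USD for a given extra power draw."""
    kwh = watts * hours_per_day * days / 1000  # watt-hours -> kWh
    return kwh * usd_per_kwh

# ~20 W extra draw, 6 h/day of inference, $0.14/kWh
print(f"${monthly_energy_cost(20, 6, 0.14):.2f}/month")
```

That comes out to about $0.50/month, matching the estimate above.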
The Hybrid Workflow
The ideal setup uses cloud models for primary work and local models as a seamless fallback.
- Start every session with Claude Code
- If you hit limits, switch to your local model for routine tasks
- Queue complex tasks for when Claude is available again
- Use the local model for lower-stakes work: documentation, simple features, boilerplate
This hybrid approach means you never lose a full afternoon to usage limits. You might lose some quality on the tasks you push to the local model, but you keep making progress. For more on managing your workflow across tools, see the CLI commands reference.
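The routing logic behind this hybrid workflow can be sketched as a small dispatcher. The availability check is injected as a callable so the same code works whether you detect limits manually or programmatically; the model identifiers come from the configs earlier in this guide, but the function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    provider: str

CLAUDE = Model("claude-sonnet-4-20250514", "anthropic")
LOCAL = Model("deepseek-coder-v2:16b", "ollama")

def pick_model(claude_available: Callable[[], bool]) -> Model:
    """Route to Claude when usage limits allow, else fall back locally."""
    return CLAUDE if claude_available() else LOCAL

# When limits hit, the same call transparently selects the local model:
model = pick_model(lambda: False)
print(model.provider)  # ollama
```

In a real setup the availability check might probe your limit status or simply flip on an environment variable; the point is that the rest of your tooling only ever asks "which model?" and never needs reconfiguration.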
FAQ
Do I need a GPU to run local models?
No. Modern quantized models run on CPU with reasonable performance. Apple Silicon Macs use their unified memory architecture to accelerate inference without a discrete GPU. On Intel/AMD systems, expect slower generation but still usable speeds for 7B-13B models.
Which local model is closest to Claude for coding?
As of early 2026, DeepSeek Coder V2 and CodeLlama 34B are the strongest open coding models. Neither matches Claude's reasoning quality, but both handle routine coding tasks competently.
Can I use local models with Claude Code directly?
Claude Code is designed to work with Anthropic's API. You cannot point it at a local model. For local model interaction, use alternative tools like Continue, aichat, or direct Ollama CLI access.
Is it legal to run these models locally?
Yes. The models mentioned in this guide (CodeLlama, DeepSeek Coder, Mixtral) are released under permissive licenses that allow local use. Always verify the license of any model you download.
How much disk space do models require?
A 7B model (Q4 quantized) uses about 4GB. A 13B model uses about 8GB. A 34B model uses about 20GB. Ollama manages downloads and storage automatically.
Explore production-ready AI skills at aiskill.market/browse or submit your own skill to the marketplace.
Sources
- Ollama Documentation - Local model runtime setup and usage
- DeepSeek Coder - Open-source coding model documentation
- LiteLLM Proxy - API-compatible proxy for local models
- Apple Silicon ML Benchmarks - Performance data for Apple hardware