Sovereign AI — Running LLMs Locally

Track 06 — Sovereign AI (L-01)

Why Run LLMs Locally?

Sovereignty over your AI stack means owning the models, the data, and the inference. No rate limits. No surveillance. No surprise bill.

The Four Reasons

Cloud LLM APIs are convenient, but they carry hidden costs that compound over time: every query is logged, every prompt is shared with the provider, the model can change underneath you without notice, and your costs scale linearly with usage.

Running an LLM locally — on your own hardware — flips all four of those tradeoffs:

Privacy — your data never leaves your machine. No prompt logging. No training on your conversations.
Cost — pay once for the hardware, run unlimited inference. How quickly local beats cloud depends on your monthly API spend; the Cost & Benefit Analysis lesson works through the break-even.
Sovereignty — no rate limits, no censorship, no provider quietly changing the model. You control the version.
Offline capability — your firm keeps running with no internet, on a plane, during an outage. Real resilience.

The DeadCatFound Standard

The firm uses cloud APIs (Claude, OpenAI) for high-stakes scoring and complex reasoning. It uses local LLMs for high-volume tasks: sentiment classification, log summarization, news triage, and routine analysis. Hybrid stack — best of both.

When Local Beats Cloud

Local LLMs are not a universal replacement for Claude or GPT-4. They are a complementary tool. Here's the decision matrix:

When to Use Local vs. Cloud

Scenario	Local LLM	Cloud API
High-volume routine tasks (sentiment, classification)	✓ Cheaper, faster	✗ Expensive at scale
Complex multi-step reasoning	✗ Smaller models struggle	✓ Claude/GPT-4 wins
Sensitive financial data	✓ Never leaves machine	✗ Goes to provider
Real-time intraday alerts	✓ <1s latency	~1-3s latency
Cutting-edge model capability	Lags 6-12 months	✓ Always latest
Offline / air-gapped	✓ Works anywhere	✗ Needs internet

Track 06 — Sovereign AI (L-02)

Cost & Benefit Analysis

Math is the only thing that matters. Here's the real economics of local vs. cloud LLMs at every usage tier.

Cloud API Pricing — May 2026

Per-Million Token Pricing (USD)

Model	Input	Output	Best Use
Claude Haiku 4.5	$1.00	$5.00	Fast bulk scoring
Claude Sonnet 4.6	$3.00	$15.00	Complex reasoning
GPT-4o	$2.50	$10.00	General purpose
GPT-4o mini	$0.15	$0.60	Cheap bulk tasks
Gemini 2.0 Flash	$0.10	$0.40	Cheapest API tier
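To turn per-million-token prices into a monthly figure, multiply your token counts by the listed rates. The sketch below uses two rows of the table; the dictionary keys are plain labels rather than official API model IDs, and the workload (requests per day, tokens per request) is an illustrative assumption, not a measurement.

```python
# Prices in USD per million tokens, copied from two rows of the table above.
# The keys are labels for this sketch, not official API model identifiers.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}


def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend in USD for a fixed per-request token profile."""
    price = PRICES[model]
    per_request = (input_tokens * price["input"]
                   + output_tokens * price["output"]) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 1,000 requests/day, 500 input + 100 output tokens each.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 1000, 500, 100):.2f}/month")
# gpt-4o-mini: $4.05/month
# claude-sonnet-4.6: $90.00/month
```

Output tokens are priced several times higher than input tokens, so tasks with short answers (a label, a score) are much cheaper than tasks that generate long text.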

Local LLM Cost — One-Time Hardware

Hardware Cost vs. Capability

Setup	Cost (USD)	Capable of Running	Tokens/sec
MacBook Air M2 (16GB)	$1,200	7B models (Q4)	~20-30
Mac Mini M4 Pro (48GB)	$2,000	30B-70B models (Q4)	~15-25
RTX 4090 PC build	$3,500	30B models (Q4-Q5)	~50-80
Mac Studio M3 Ultra (192GB)	$7,000	120B+ models (Q4)	~25-40
Dual RTX 4090 workstation	$8,000	70B models (Q4)	~80-120

The Crossover Point

Local hardware breaks even with cloud API costs at the following monthly burn rates (assuming a $2,000 Mac Mini amortized over 24 months ≈ $83/month):

When Local Wins on Pure Cost

If your monthly cloud LLM bill for routine tasks exceeds roughly $83/month, a $2,000 local rig pays for itself within 2 years.

At $500/month cloud spend, a $2,000 local rig pays for itself in 4 months.
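The break-even arithmetic above is a single division, sketched here so you can plug in your own numbers. The `monthly_power_cost` parameter is an assumption added for this sketch; the figures in this lesson ignore electricity.

```python
import math


def breakeven_months(hardware_cost: float, monthly_cloud_spend: float,
                     monthly_power_cost: float = 0.0) -> float:
    """Months until the hardware cost is recovered by avoided cloud spend."""
    monthly_saving = monthly_cloud_spend - monthly_power_cost
    if monthly_saving <= 0:
        return math.inf  # local never pays for itself at this spend level
    return hardware_cost / monthly_saving


print(breakeven_months(2000, 500))           # 4.0
print(round(breakeven_months(2000, 83), 1))  # 24.1
```

Below the amortized hardware cost per month, the cloud bill is smaller than what the hardware costs you over the same period, and the case for local rests on privacy and availability rather than price.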

The Hidden Benefits (Not Just Cost)

Privacy — your trading signals, strategies, and proprietary code never get logged by an external provider
No rate limits — burst through 10,000 sentiment classifications in a row without throttling
Latency — local inference avoids the network round-trip, so short requests usually return faster than a cloud call
Model permanence — the version you tested is the version that ships forever; no surprise deprecations

Track 06 — Sovereign AI (L-03)

Hardware Requirements

Pick your hardware before picking your model. The wrong combo means a model that either won't load or runs at 1 token per second.

The Critical Resource: Memory

The most important specs for running LLMs locally are memory capacity and memory bandwidth. The model must fit entirely in RAM (or VRAM on a GPU) to run fast. If it doesn't fit, the runtime has to page weights in from disk, and generation slows by orders of magnitude.

Memory Required by Model Size (Q4 Quantization)

Model Size	RAM Needed	Disk Space	Example Models
3B params	~3 GB	~2 GB	Phi-3 Mini, Llama 3.2 3B
7B-8B params	~6 GB	~4 GB	Llama 3.1 8B, Mistral 7B
13B-14B params	~10 GB	~8 GB	Qwen 2.5 14B
30B-32B params	~20 GB	~18 GB	Qwen 2.5 32B, Yi 34B
70B params	~40 GB	~38 GB	Llama 3.1 70B
120B+ params	~70 GB	~65 GB	Mistral Large 2 (123B)
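The figures in this table follow a simple rule of thumb: weight size is parameter count times bits per weight, and RAM is that plus some working space. The sketch below assumes about 4.5 effective bits per weight for Q4 (the stored scaling factors add a little over the nominal 4 bits) and a flat 1.5 GB allowance for the KV cache and runtime buffers. Both numbers are assumptions chosen to approximate the table, not values from any tool.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.5) -> tuple[float, float]:
    """Return (weights_gb, ram_gb) for a quantized model.

    weights_gb: parameters x bits per weight, converted to gigabytes.
    ram_gb: weights plus a flat allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb, weights_gb + overhead_gb


for size in (3, 8, 14, 32, 70):
    weights, ram = estimate_memory_gb(size)
    print(f"{size}B: ~{weights:.1f} GB on disk, ~{ram:.1f} GB RAM")
```

Long context windows need a larger KV cache than the flat allowance used here, so leave extra headroom if you plan to feed the model long documents.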

Apple Silicon vs. NVIDIA GPU

Both platforms work well — they trade off differently:

Apple Silicon vs. NVIDIA GPU

Dimension	Apple Silicon (M-series)	NVIDIA RTX
Unified memory	✓ Up to 512GB shared (M3 Ultra)	Fixed VRAM (24GB on RTX 4090, 32GB on RTX 5090)
Tokens/sec on 7B	~30-50	~80-120
Tokens/sec on 70B	~10-20 (fits in unified RAM)	Needs dual GPU
Power consumption	~30W idle, ~80W load	~150W idle, ~400W load
Software ecosystem	llama.cpp Metal, MLX	CUDA — most mature
Best for	Big models, low power	Speed, fine-tuning

Practical Recommendation

For most DeadCatFound readers, a Mac Mini M4 Pro with 48GB unified memory ($1,999) is the sweet spot. It runs 30B-class models smoothly, can load a 70B model at Q4 with little memory to spare, uses ~30W idle, and you don't need to manage drivers or assemble a PC.

Track 06 — Sovereign AI (L-04)

Ollama — The Easy Path

Ollama is the easiest way to run LLMs locally. One command installs it, one command downloads a model, one command runs it. If you're starting from zero, start here.

🦙

Ollama

Run LLMs with a single command. Built-in model library. OpenAI-compatible API.

Platforms

macOS, Linux, Windows

License

MIT — Free, open source

Best for

Getting started fast

Install & First Run

terminal — install ollama

# macOS — via Homebrew (or download from ollama.com)
brew install ollama

# Linux / WSL
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server (runs in background)
ollama serve &

# Pull and run a model — first time downloads weights
ollama run llama3.1:8b

# "ollama run" opens an interactive prompt; type a question to chat
>>> What is the capital of France?
Paris.

The OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1/. This means you can point existing OpenAI client code at Ollama by changing the base URL and the model name:

python — ollama via openai client

from openai import OpenAI

client = OpenAI(
    base_url = "http://localhost:11434/v1",
    api_key  = "ollama",  # ignored but required
)

response = client.chat.completions.create(
    model    = "llama3.1:8b",
    messages = [
        {"role": "system", "content": "You score stocks 0-100."},
        {"role": "user",   "content": "Score NVDA based on momentum."},
    ],
)
print(response.choices[0].message.content)

Useful Ollama Commands

ollama list — show downloaded models
ollama pull llama3.1:70b — download a model
ollama rm llama3.1:8b — delete a model
ollama ps — show running models
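The list that `ollama list` prints is also available over HTTP from Ollama's native `/api/tags` endpoint, which is handy for a startup check inside a script. A minimal sketch using only the standard library; the function names and the 2-second timeout are choices made for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address used in this lesson


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> list[str]:
    """Return names of downloaded models, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except OSError:  # connection errors and timeouts are OSError subclasses
        return []
```

Calling `list_local_models()` before a batch job lets the script fail early with a clear message when the server is down or the model it needs has not been pulled yet.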

Track 06 — Sovereign AI (L-05)

LM Studio — GUI Desktop App

LM Studio covers the same ground as Ollama, but with a graphical interface. If you want a chat window and a built-in model browser, and don't want to use the terminal, this is your tool.

🖥️

LM Studio

Desktop app for running local LLMs. Built-in chat UI + OpenAI-compatible API server.

Platforms

macOS, Windows, Linux

License

Free for personal use

Best for

GUI users, model exploration

Why Use LM Studio

Visual model browser — see model size, RAM requirements, and reviews before downloading
Built-in chat interface — talk to your model in a polished chat UI, no terminal needed
One-click API server — flip a switch and it exposes an OpenAI-compatible endpoint
Multi-model loading — keep multiple models loaded at once and switch between them
System prompt presets — save and switch between role-specific prompts

Workflow

Open LM Studio
Browse models in the Discover tab
Click Download on any model (it tells you if your RAM can handle it)
Load the model in the Chat tab
Talk to it — or go to the Local Server tab and start the API server
Connect any OpenAI-compatible client to http://localhost:1234/v1
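Because LM Studio and Ollama both speak the OpenAI protocol, switching between them from code comes down to the base URL. The sketch below keeps the two default addresses from this course in one place; the helper name `client_kwargs` and the placeholder key are made up for this example.

```python
# Default local endpoints named in this course. Ports can be changed in each
# tool's settings, so treat these as defaults rather than guarantees.
LOCAL_SERVERS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def client_kwargs(server: str) -> dict:
    """Keyword arguments for openai.OpenAI(...) targeting a local server."""
    if server not in LOCAL_SERVERS:
        raise ValueError(
            f"unknown server {server!r}; expected one of {sorted(LOCAL_SERVERS)}")
    # The key is a placeholder: the client requires a non-empty string.
    return {"base_url": LOCAL_SERVERS[server], "api_key": "not-needed"}


print(client_kwargs("lmstudio"))
# Usage, assuming the openai package is installed:
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("lmstudio"))
```

The model name passed to `chat.completions.create` still has to match whatever the chosen server has loaded, so it usually changes along with the base URL.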

Ollama vs. LM Studio — Which to Pick?

Pick Ollama if you're a developer comfortable with the terminal — it's faster, scriptable, and integrates better with launchd/cron.

Pick LM Studio if you want a polished desktop app, like to browse models visually, or are exploring before committing.

Track 06 — Sovereign AI (L-06)

llama.cpp & GGUF Models

Underneath both Ollama and LM Studio is llama.cpp — the open-source engine that started the local LLM revolution. Understand it and you can run any model anywhere.

⚙️

llama.cpp

The C++ inference engine. Powers Ollama, LM Studio, and most local LLM tools.

Platforms

Everything — including ARM, x86, Apple Silicon, CUDA

License

MIT

Best for

Low-level control, scripting

What is GGUF?

GGUF stands for GPT-Generated Unified Format. It is the file format that llama.cpp (and Ollama, LM Studio) uses to store models. A single .gguf file contains the model weights, the tokenizer, and the metadata — everything needed to run the model.
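A quick way to confirm that a download really is a GGUF file is to read its header: the file starts with the four bytes `GGUF`, followed by a version number and two counts. The sketch below assumes GGUF version 2 or later, where both counts are 64-bit; the function name is made up for this example.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size header at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}
```

Calling `read_gguf_header("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf")` on an interrupted or mislabeled download fails immediately, which is cheaper than discovering the problem when the model refuses to load.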

GGUF files are usually distributed in quantized form: the model weights are compressed to fewer bits to fit in less RAM. The quantization level dramatically affects both size and quality:

Quantization Levels Explained

Level	Bits	Size vs. FP16	Quality	Recommendation
Q2_K	2 bits	~14%	Poor — visible degradation	Avoid unless RAM constrained
Q3_K_M	3 bits	~21%	Acceptable for some tasks	Last resort
Q4_K_M	4 bits	~28%	Great — minor quality loss	✓ Best balance
Q5_K_M	5 bits	~35%	Near-FP16	If you have RAM
Q6_K	6 bits	~41%	Essentially FP16	For critical tasks
Q8_0	8 bits	~53%	Indistinguishable from FP16	If RAM permits

The Q4_K_M Rule

For 95% of use cases, download the Q4_K_M variant of any model. You get roughly 3.5x compression relative to FP16 with only minor quality loss. It's the universal default for a reason.
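When you have more RAM than Q4_K_M needs, the size ratios in the table can tell you how far up you can go. The sketch below picks the highest-quality level that fits a given memory budget; the 2 GB reserve for the KV cache and the operating system is an assumption for this example.

```python
# Size ratios relative to FP16, copied from the table above
# (ordered from lowest to highest quality).
QUANT_RATIOS = [
    ("Q2_K", 0.14), ("Q3_K_M", 0.21), ("Q4_K_M", 0.28),
    ("Q5_K_M", 0.35), ("Q6_K", 0.41), ("Q8_0", 0.53),
]


def pick_quant(params_billion: float, ram_gb: float, reserve_gb: float = 2.0):
    """Return (level, size_gb) for the highest-quality level that fits, else None."""
    fp16_gb = params_billion * 2  # FP16 stores 2 bytes per parameter
    best = None
    for level, ratio in QUANT_RATIOS:
        size_gb = fp16_gb * ratio
        if size_gb + reserve_gb <= ram_gb:
            best = (level, round(size_gb, 1))
    return best


print(pick_quant(8, 16))   # ('Q8_0', 8.5)
print(pick_quant(70, 48))  # ('Q4_K_M', 39.2)
print(pick_quant(70, 16))  # None
```

A result of `None` means even Q2_K will not fit, in which case a smaller model at Q4_K_M is usually the better choice than a larger one at very low precision.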

Manual llama.cpp Workflow

terminal — manual llama.cpp

# Clone and build (llama.cpp builds with CMake; binaries land in build/bin)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download a GGUF model from HuggingFace
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference
./build/bin/llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Score NVDA based on momentum:" \
  -n 256

# Or start an OpenAI-compatible server
./build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --port 8080

Track 06 — Sovereign AI (L-07)

Model Recommendations

Hundreds of open-weight models exist. Most are not worth your disk space. These are the ones actually worth running.

Top Picks by Use Case

🦙

Llama 3.1 8B Instruct

Meta's flagship small model. Best general-purpose 7-8B model as of 2026. Excellent for sentiment, summarization, structured output.

Recommended

Size (Q4_K_M)

~4.9 GB

Context Window

128K tokens

License

Llama 3.1 Community License

🚀

Qwen 2.5 14B / 32B Instruct

Alibaba's model — best open-weight model in this size class. Strong reasoning, exceptional at code, multilingual.

Recommended

Size (Q4_K_M, 14B)

~9 GB

Size (Q4_K_M, 32B)

~20 GB

License

Apache 2.0

🧠

Mistral 7B Instruct v0.3

Compact, fast, and proven. The default choice when you need 7B at speed. Strong instruction following.

Size (Q4_K_M)

~4.4 GB

Context Window

32K tokens

License

Apache 2.0

⚡

Phi-3 Mini 3.8B

Microsoft's tiny powerhouse. Punches far above its weight. Perfect for laptop / low-RAM setups.

Size (Q4_K_M)

~2.4 GB

Context Window

128K tokens

License

MIT

🔬

DeepSeek-V3 / R1

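Models from the Chinese lab DeepSeek. V3 is the general-purpose model and R1 is the reasoning model, strong on math, code, and chain-of-thought reasoning. The full 671B-parameter versions are far too large for the hardware in this course, so the practical local option is a distilled R1 variant such as the 32B listed here. The cheapest path to o1-class reasoning.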

Heavy

Size (Q4, distilled 32B)

~20 GB

Context Window

64K tokens

License

MIT

🦙

Llama 3.1 70B Instruct

Meta's flagship. Genuinely competitive with GPT-4-class models on many benchmarks. Needs serious RAM.

Heavy

Size (Q4_K_M)

~40 GB

RAM Needed

~48 GB

License

Llama 3.1 Community

Where to Discover More Models

The HuggingFace LLM leaderboard ranks open-weight models by benchmark performance. Filter by size to find the strongest model your hardware can run.
huggingface.co/spaces/open-llm-leaderboard

Track 06 — Sovereign AI (L-08)

Integration with Your Firm

A local LLM is only useful when it's wired into your strategy. Here's how to plug Ollama into the DeadCatFound trading firm in under 50 lines of code.

The OpenAI-Compatible Pattern

Every local LLM server (Ollama, LM Studio, llama.cpp's llama-server) exposes an OpenAI-compatible endpoint. This is the most important detail in the entire course: code written against the OpenAI client keeps working once you swap the base URL and the model name. Code written against Anthropic's own SDK has to be moved to the OpenAI client first.

scripts/local_scorer.py

"""
Drop-in replacement for cloud LLM scoring in the firm.
Uses local Llama 3.1 8B via Ollama.
"""
from openai import OpenAI
import json

# Point at Ollama instead of OpenAI
client = OpenAI(
    base_url = "http://localhost:11434/v1",
    api_key  = "ollama",  # ignored by Ollama, but the client requires a value
)

def score_stock(symbol: str, metrics: dict) -> dict:
    """Score a stock 0-100 using local LLM. Returns {score, reason}."""
    response = client.chat.completions.create(
        model = "llama3.1:8b",
        messages = [
            {"role": "system", "content":
             "You are a quantitative analyst. Reply with JSON only."},
            {"role": "user", "content":
             f"Score {symbol} from 0-100 based on these metrics: "
             f"{json.dumps(metrics)}. "
             "Return: {\"score\": int, \"reason\": str}"},
        ],
        # JSON mode: small models otherwise tend to wrap the JSON in extra
        # text, which makes json.loads below raise an error
        response_format = {"type": "json_object"},
        temperature = 0.3,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    result = score_stock("NVDA", {"rsi": 62, "momentum_30d": 0.18})
    print(result)
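Small local models do not always return clean JSON, so it helps to validate the reply before it reaches the rest of the firm's code. The helper below is a sketch of one defensive approach, not part of the original script: it pulls the first JSON object out of the reply, checks the score range, and raises otherwise. The name `parse_score` is made up for this example.

```python
import json
import re


def parse_score(raw: str) -> dict:
    """Extract and validate {"score": int, "reason": str} from a model reply."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # first "{" to last "}"
    if match is None:
        raise ValueError(f"no JSON object in reply: {raw!r}")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    score = int(data["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": str(data.get("reason", ""))}


reply = 'Sure! {"score": 72, "reason": "strong momentum"} Hope that helps.'
print(parse_score(reply))  # {'score': 72, 'reason': 'strong momentum'}
```

Catching `ValueError` (and `KeyError` for a missing score) around this call lets a batch job retry or skip a single bad reply instead of crashing mid-run.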

Use Cases in the Firm

Bulk stock screening — score 200 stocks daily without burning $50 in cloud API fees
News sentiment classification — classify thousands of headlines per day for free
Log summarization — local LLM summarizes daily firm activity into a single Telegram alert
Pre-screening before cloud LLM — local LLM does the cheap first pass; Claude only gets the survivors
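The pre-screening use case in the list above is a two-stage funnel: rank everything with the cheap scorer, then re-score only the best few with the expensive one. The sketch below takes both scorers as plain functions, so it runs with stand-ins; in the firm they would wrap the local and cloud clients. All names here are hypothetical.

```python
from typing import Callable


def two_stage_screen(symbols: list[str],
                     cheap_score: Callable[[str], float],
                     expensive_score: Callable[[str], float],
                     keep_top: int = 20) -> list[tuple[str, float]]:
    """Rank all symbols with the cheap scorer, re-score only the top few."""
    ranked = sorted(symbols, key=cheap_score, reverse=True)
    survivors = ranked[:keep_top]
    rescored = [(s, expensive_score(s)) for s in survivors]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Stand-in scorers so the sketch runs without any model server.
fake_local = {"AAA": 80, "BBB": 55, "CCC": 90, "DDD": 10}.get
fake_cloud = {"AAA": 70, "CCC": 65}.get
print(two_stage_screen(["AAA", "BBB", "CCC", "DDD"], fake_local, fake_cloud, keep_top=2))
# [('AAA', 70), ('CCC', 65)]
```

The expensive scorer is called exactly `keep_top` times regardless of universe size, which is what keeps the cloud bill flat as the universe grows.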

Hybrid Architecture — The Smart Stack

When to Route Where

Task	Volume / Day	Route To	Why
Universe pre-filter	500+ stocks	Local Llama 3.1 8B	Cheap, fast, top-tier quality not needed
News sentiment	1,000+ headlines	Local Llama 3.1 8B	Free, parallelizable
Top-20 stock deep scoring	20 stocks	Claude Haiku 4.5	Quality matters; cheap on small volume
Strategy design / debugging	~10 calls	Claude Sonnet 4.6	Best reasoning available
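The routing table can be written down as a small function so the decision lives in one place instead of being scattered across scripts. This is a sketch under stated assumptions: the `Task` fields, the rule order, and the threshold of 100 calls per day are all choices made for this example, not rules from the firm.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    calls_per_day: int
    sensitive: bool = False          # data must not leave the machine
    complex_reasoning: bool = False  # multi-step reasoning or design work


def route(task: Task, volume_threshold: int = 100) -> str:
    """Return "local" or "cloud" following the routing table above."""
    if task.sensitive:
        return "local"   # privacy overrides everything else
    if task.complex_reasoning:
        return "cloud"   # smaller local models struggle here
    if task.calls_per_day >= volume_threshold:
        return "local"   # high volume: avoid per-token fees
    return "cloud"       # low volume: better quality costs little


for t in (Task("universe pre-filter", 500),
          Task("news sentiment", 1000),
          Task("top-20 deep scoring", 20),
          Task("strategy design", 10, complex_reasoning=True)):
    print(f"{t.name}: {route(t)}")
# universe pre-filter: local
# news sentiment: local
# top-20 deep scoring: cloud
# strategy design: cloud
```

Checking sensitivity first means a sensitive task never reaches the cloud even when it also needs complex reasoning; the trade-off is lower answer quality for that task.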

Sovereignty Achieved

You now have a complete picture of how to bring your AI in-house. The cloud is for the hard problems. Your local stack handles the volume. The result: a firm that runs at scale on a budget, keeps sensitive data on its own hardware, and keeps running when the internet doesn't.

Your AI Is Now Sovereign.

You can run LLMs locally. You know which models to pick. You know how to integrate them. You're no longer dependent on any single provider — and you control your data end-to-end.

Continue building on DeadCatFound — every track stacks together.
