Track 06: Sovereign AI (L-01)
Why Run LLMs Locally?
Sovereignty over your AI stack means owning the models, the data, and the inference. No rate limits. No surveillance. No surprise bill.

The Four Reasons

Cloud LLM APIs are convenient. But they come with hidden costs that compound over time: every query is logged, every prompt is shared with the provider, the model can change underneath you at any time, and your costs scale linearly with usage.

Running an LLM locally, on your own hardware, flips all four of those tradeoffs:

  • Privacy: your data never leaves your machine. No prompt logging. No training on your conversations.
  • Cost: pay once for the hardware, run unlimited inference. At roughly $330/month of cloud spend, a $2,000 machine pays for itself in about 6 months.
  • Sovereignty: no rate limits, no censorship, no provider quietly changing the model. You control the version.
  • Offline capability: your firm keeps running with no internet, on a plane, during an outage. Real resilience.
The DeadCatFound Standard
The firm uses cloud APIs (Claude, OpenAI) for high-stakes scoring and complex reasoning. It uses local LLMs for high-volume tasks: sentiment classification, log summarization, news triage, and routine analysis. A hybrid stack: the best of both.

When Local Beats Cloud

Local LLMs are not a universal replacement for Claude or GPT-4. They are a complementary tool. Here's the decision matrix:

When to Use Local vs. Cloud
| Scenario | Local LLM | Cloud API |
|---|---|---|
| High-volume routine tasks (sentiment, classification) | ✓ Cheaper, faster | ✗ Expensive at scale |
| Complex multi-step reasoning | ✗ Smaller models struggle | ✓ Claude/GPT-4 wins |
| Sensitive financial data | ✓ Never leaves machine | ✗ Goes to provider |
| Real-time intraday alerts | ✓ <1s latency | ~1-3s latency |
| Cutting-edge model capability | Lags 6-12 months | ✓ Always latest |
| Offline / air-gapped | ✓ Works anywhere | ✗ Needs internet |
Track 06: Sovereign AI (L-02)
Cost & Benefit Analysis
The math comes first. Here are the real economics of local vs. cloud LLMs at every usage tier.

Cloud API Pricing (May 2026)

Per-Million Token Pricing (USD)
| Model | Input | Output | Best Use |
|---|---|---|---|
| Claude Haiku 4.5 | $0.25 | $1.25 | Fast bulk scoring |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Complex reasoning |
| GPT-4o | $5.00 | $15.00 | General purpose |
| GPT-4o mini | $0.15 | $0.60 | Cheap bulk tasks |
| Gemini 2.0 Flash | $0.075 | $0.30 | Cheapest API tier |
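
To make the table concrete, here is a minimal sketch that turns per-million-token prices into a monthly bill. The prices are copied from the table above. The workload (1,000 calls per day, 500 input and 100 output tokens per call, 30 days) is an assumption chosen for illustration, and `monthly_cost` is a hypothetical helper name.

```python
# Per-million-token prices (USD, input, output) copied from the table above.
PRICES = {
    "Claude Haiku 4.5": (0.25, 1.25),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4o": (5.00, 15.00),
    "GPT-4o mini": (0.15, 0.60),
    "Gemini 2.0 Flash": (0.075, 0.30),
}


def monthly_cost(model, calls_per_day, in_tokens, out_tokens, days=30):
    """Estimated monthly spend in USD for a fixed per-call token budget."""
    in_price, out_price = PRICES[model]
    total_in = calls_per_day * in_tokens * days
    total_out = calls_per_day * out_tokens * days
    return (total_in * in_price + total_out * out_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: 1,000 calls/day, 500 input + 100 output tokens each.
    for name in PRICES:
        print(f"{name:<18} ${monthly_cost(name, 1000, 500, 100):8.2f}/month")
```

Under that assumed workload, Claude Sonnet 4.6 comes to $90.00 per month and Claude Haiku 4.5 to $7.50 per month. Change the call volume and token counts to match your own usage before drawing conclusions.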

Local LLM Cost: One-Time Hardware

Hardware Cost vs. Capability
| Setup | Cost (USD) | Capable of Running | Tokens/sec |
|---|---|---|---|
| MacBook Air M2 (16GB) | $1,200 | 7B models (Q4) | ~20-30 |
| Mac Mini M4 Pro (48GB) | $2,000 | 30B-70B models (Q4) | ~15-25 |
| RTX 4090 PC build | $3,500 | 30B models (Q4-Q5) | ~50-80 |
| Mac Studio M3 Ultra (192GB) | $7,000 | 120B+ models (Q4) | ~25-40 |
| Dual RTX 4090 workstation | $8,000 | 70B+ models (Q6-Q8) | ~80-120 |

The Crossover Point

Local hardware breaks even with cloud API costs at the following monthly burn rates (assuming a $2,000 Mac Mini amortized over 24 months ≈ $83/month):

When Local Wins on Pure Cost
If your monthly cloud LLM bill exceeds about $83/month for routine tasks, local LLMs pay for themselves within 2 years.

At $500/month cloud spend, a $2,000 local rig pays for itself in 4 months.
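
The arithmetic behind these figures can be sketched as follows. This is a simplified model: it assumes the cloud bill is fully replaced by the local machine and ignores electricity, maintenance, and resale value. Both function names are hypothetical.

```python
def amortized_monthly(hardware_cost, months=24):
    """Hardware cost spread evenly over the amortization period (USD/month)."""
    return hardware_cost / months


def breakeven_months(hardware_cost, monthly_cloud_spend):
    """Months until cumulative cloud spend equals the one-time hardware cost."""
    if monthly_cloud_spend <= 0:
        raise ValueError("monthly_cloud_spend must be positive")
    return hardware_cost / monthly_cloud_spend


if __name__ == "__main__":
    print(f"Amortized: ${amortized_monthly(2000):.2f}/month")       # 2000 / 24
    print(f"Break-even at $500/month: {breakeven_months(2000, 500):.0f} months")
```

With the $2,000 Mac Mini, the amortized cost is $83.33 per month, and a $500 monthly cloud bill reaches break-even after 4 months, matching the numbers in the text.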

The Hidden Benefits (Not Just Cost)

  • Privacy: your trading signals, strategies, and proprietary code never get logged by an external provider
  • No rate limits: burst through 10,000 sentiment classifications in a row without throttling
  • Latency: local inference avoids the network round-trip, so it is usually faster than a cloud call
  • Model permanence: the version you tested is the version that ships forever; no surprise deprecations
Track 06: Sovereign AI (L-03)
Hardware Requirements
Pick your hardware before picking your model. The wrong combo means a model that either won't load or runs at 1 token per second.

The Critical Resource: Memory

The most important spec for running LLMs locally is memory: capacity first, then bandwidth. The model must fit entirely in RAM (or VRAM on a GPU) to run fast. If it doesn't fit, inference spills to disk swap and can slow down by roughly 100x.

Memory Required by Model Size (Q4 Quantization)
| Model Size | RAM Needed | Disk Space | Example Models |
|---|---|---|---|
| 3B params | ~3 GB | ~2 GB | Phi-3 Mini, Llama 3.2 3B |
| 7B-8B params | ~6 GB | ~4 GB | Llama 3.1 8B, Mistral 7B |
| 13B-14B params | ~10 GB | ~8 GB | Qwen 2.5 14B |
| 30B-32B params | ~20 GB | ~18 GB | Qwen 2.5 32B, Yi 34B |
| 70B params | ~40 GB | ~38 GB | Llama 3.1 70B |
| 120B+ params | ~70 GB | ~65 GB | Mistral Large 2 (123B) |
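
The table can be approximated with a rule of thumb: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below is illustrative only. The 4.5 bits per weight (an approximation for Q4-class quantization) and the flat 1.5 GB runtime overhead are assumptions tuned to land near the table's figures, not measured values, and long context windows will need more.

```python
def estimate_ram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    """Rough RAM estimate in GB: quantized weights plus a flat runtime overhead.

    bits_per_weight and overhead_gb are illustrative assumptions.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion, ram_gb, **kwargs):
    """True if the estimated footprint is within the available RAM."""
    return estimate_ram_gb(params_billion, **kwargs) <= ram_gb


if __name__ == "__main__":
    for size in (3, 8, 14, 32, 70):
        print(f"{size:>3}B -> ~{estimate_ram_gb(size):.1f} GB")
```

For an 8B model this gives 6.0 GB, in line with the table's 7B-8B row. Treat the result as a first filter and leave headroom for the operating system and other applications.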

Apple Silicon vs. NVIDIA GPU

Both platforms work well; they trade off differently:

Apple Silicon vs. NVIDIA GPU
| Dimension | Apple Silicon (M-series) | NVIDIA RTX |
|---|---|---|
| Unified memory | ✓ Up to 192GB shared | Fixed VRAM (24GB max consumer) |
| Tokens/sec on 7B | ~30-50 | ~80-120 |
| Tokens/sec on 70B | ~10-20 (fits in unified RAM) | Needs dual GPU |
| Power consumption | ~30W idle, ~80W load | ~150W idle, ~400W load |
| Software ecosystem | llama.cpp Metal, MLX | CUDA (most mature) |
| Best for | Big models, low power | Speed, fine-tuning |
Practical Recommendation
For most DeadCatFound readers, a Mac Mini M4 Pro with 48GB unified memory ($1,999) is the sweet spot. It runs 30B-class models smoothly, fits a 70B model at Q4 (about 40 GB, so with limited headroom), draws ~30W at idle, and needs no driver management or PC assembly.
Track 06: Sovereign AI (L-04)
Ollama: The Easy Path
Ollama is the easiest way to run LLMs locally. One command installs it, one command downloads a model, one command runs it. If you're starting from zero, start here.
🦙
Ollama
Run LLMs with a single command. Built-in model library. OpenAI-compatible API.

Platforms: macOS, Linux, Windows

License: MIT (free, open source)

Best for: Getting started fast

Install & First Run

terminal: install ollama
    # macOS: via Homebrew (or download from ollama.com)
    brew install ollama

    # Linux / WSL
    curl -fsSL https://ollama.com/install.sh | sh

    # Start the Ollama server (runs in background)
    ollama serve &

    # Pull and run a model; the first run downloads the weights
    ollama run llama3.1:8b

    # Or just chat with it directly in your terminal
    >>> What is the capital of France?
    Paris.

The OpenAI-Compatible API

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1/. This means you can point existing OpenAI client code at Ollama by changing the base URL and the model name:

python: ollama via openai client
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # ignored but required
    )

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content": "You score stocks 0-100."},
            {"role": "user", "content": "Score NVDA based on momentum."},
        ],
    )
    print(response.choices[0].message.content)
Useful Ollama Commands
    ollama list                 # show downloaded models
    ollama pull llama3.1:70b    # download a model
    ollama rm llama3.1:8b       # delete a model
    ollama ps                   # show running models
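
The same list that `ollama list` prints can also be read over HTTP, which is handy in scripts. The sketch below assumes Ollama's native `/api/tags` endpoint on the default port and that its JSON response carries a `models` array whose entries have a `name` field. The helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.request


def parse_model_names(payload):
    """Extract model names from the JSON text of an /api/tags response."""
    data = json.loads(payload)
    return [model["name"] for model in data.get("models", [])]


def list_local_models(base_url="http://localhost:11434", timeout=5):
    """Return the names of downloaded models.

    Raises urllib.error.URLError if the Ollama server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(resp.read().decode("utf-8"))
```

With the server running, `list_local_models()` should return something like the names shown by `ollama list`. Splitting parsing from fetching keeps the parsing logic testable without a live server.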
Track 06: Sovereign AI (L-05)
LM Studio: GUI Desktop App
LM Studio offers the same local-model workflow as Ollama behind a graphical interface. If you want a chat window and a built-in model browser, and don't want to use the terminal, this is your tool.
🖥️
LM Studio
Desktop app for running local LLMs. Built-in chat UI + OpenAI-compatible API server.

Platforms: macOS, Windows, Linux

License: Free for personal use

Best for: GUI users, model exploration

Why Use LM Studio

  • Visual model browser: see model size, RAM requirements, and reviews before downloading
  • Built-in chat interface: talk to your model in a polished chat UI, no terminal needed
  • One-click API server: flip a switch and it exposes an OpenAI-compatible endpoint
  • Multi-model loading: keep multiple models loaded at once and switch between them
  • System prompt presets: save and switch between role-specific prompts

Workflow

  1. Open LM Studio
  2. Browse models in the Discover tab
  3. Click Download on any model (it tells you if your RAM can handle it)
  4. Load the model in the Chat tab
  5. Talk to it, or go to the Local Server tab and start the API server
  6. Connect any OpenAI-compatible client to http://localhost:1234/v1
Ollama vs. LM Studio β€” Which to Pick?
Pick Ollama if you're a developer comfortable with the terminal: it's faster, scriptable, and integrates better with launchd/cron.

Pick LM Studio if you want a polished desktop app, like to browse models visually, or are exploring before committing.
Track 06: Sovereign AI (L-06)
llama.cpp & GGUF Models
Underneath both Ollama and LM Studio is llama.cpp, the open-source engine that started the local LLM revolution. Understand it and you can run any model anywhere.
⚙️
llama.cpp
The C++ inference engine. Powers Ollama, LM Studio, and most local LLM tools.

Platforms: Everything, including ARM, x86, Apple Silicon, CUDA

License: MIT

Best for: Low-level control, scripting

What is GGUF?

GGUF stands for GPT-Generated Unified Format. It is the file format that llama.cpp (and Ollama, LM Studio) uses to store models. A single .gguf file contains the model weights, the tokenizer, and the metadata: everything needed to run the model.
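
To see that a .gguf file is self-describing, you can inspect its fixed-size header. The sketch below assumes the version 2 and later layout from the GGUF specification: a 4-byte magic `GGUF`, a little-endian 32-bit version, then 64-bit tensor and metadata key counts. `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Little-endian: magic (4 bytes), version (uint32),
# tensor_count (uint64), metadata_kv_count (uint64) = 24 bytes total.
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(raw):
    """Parse the fixed header from the first 24 bytes of a GGUF file."""
    if len(raw) < HEADER.size:
        raise ValueError(f"need at least {HEADER.size} bytes")
    magic, version, n_tensors, n_kv = HEADER.unpack_from(raw)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}
```

Typical usage would be to open a model with `open(path, "rb")`, read the first 24 bytes, and pass them to `read_gguf_header`. The metadata key-value pairs and tensor descriptions follow the header; parsing them is beyond this sketch.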

GGUF files are usually quantized: the model weights are compressed to fewer bits to fit in less RAM. The quantization level dramatically affects both size and quality:

Quantization Levels Explained
| Level | Bits | Size vs. FP16 | Quality | Recommendation |
|---|---|---|---|---|
| Q2_K | 2 bits | ~14% | Poor: visible degradation | Avoid unless RAM constrained |
| Q3_K_M | 3 bits | ~21% | Acceptable for some tasks | Last resort |
| Q4_K_M | 4 bits | ~28% | Great: minor quality loss | ✓ Best balance |
| Q5_K_M | 5 bits | ~35% | Near-FP16 | If you have RAM |
| Q6_K | 6 bits | ~41% | Essentially FP16 | For critical tasks |
| Q8_0 | 8 bits | ~53% | Indistinguishable from FP16 | If RAM permits |
The Q4_K_M Rule
For 95% of use cases, download the Q4_K_M variant of any model. You get roughly 3.5x compression relative to FP16 (about 28% of the size) with almost no quality loss. It's the universal default for a reason.
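
The size ratios in the quantization table translate directly into file-size estimates. The sketch below uses those ratios and the fact that FP16 stores 2 bytes per weight. The function names are hypothetical, and real files differ somewhat because of metadata and mixed-precision layers.

```python
# "Size vs. FP16" ratios copied from the quantization table.
SIZE_RATIO = {
    "Q2_K": 0.14,
    "Q3_K_M": 0.21,
    "Q4_K_M": 0.28,
    "Q5_K_M": 0.35,
    "Q6_K": 0.41,
    "Q8_0": 0.53,
}


def gguf_size_gb(params_billion, level):
    """Approximate file size in GB for a model at a given quantization level."""
    fp16_gb = params_billion * 2  # FP16 stores 2 bytes per weight
    return fp16_gb * SIZE_RATIO[level]


def best_level_that_fits(params_billion, budget_gb):
    """Highest-quality level whose estimated size fits the budget, else None."""
    for level in sorted(SIZE_RATIO, key=SIZE_RATIO.get, reverse=True):
        if gguf_size_gb(params_billion, level) <= budget_gb:
            return level
    return None
```

For example, `gguf_size_gb(8, "Q4_K_M")` estimates about 4.5 GB for an 8B model, and `best_level_that_fits(8, 6)` picks Q5_K_M for a 6 GB budget. Remember that RAM needs exceed file size once the runtime and context are loaded.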

Manual llama.cpp Workflow

terminal: manual llama.cpp
    # Clone and build (current llama.cpp builds with CMake; binaries land in build/bin/)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release

    # Download a GGUF model from HuggingFace
    wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

    # Run inference
    ./build/bin/llama-cli -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
        -p "Score NVDA based on momentum:" \
        -n 256

    # Or start an OpenAI-compatible server
    ./build/bin/llama-server -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --port 8080
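
With llama-server in the picture, this track has now named three local endpoints. A small lookup keeps them in one place so client code can switch backends by name. The ports are the ones used in this track (Ollama on 11434, LM Studio on 1234, llama-server started with `--port 8080`). The `/v1` suffix for llama-server is an assumption based on its OpenAI-compatible routes, and the backend keys and function name are hypothetical.

```python
# Local OpenAI-compatible endpoints, keyed by a short backend name.
LOCAL_BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "llama-server": "http://localhost:8080/v1",  # assumes --port 8080
}


def base_url_for(backend):
    """Return the base URL for a named local backend."""
    try:
        return LOCAL_BACKENDS[backend]
    except KeyError:
        choices = ", ".join(sorted(LOCAL_BACKENDS))
        raise ValueError(f"unknown backend {backend!r}; choose from: {choices}") from None
```

The returned string can be passed as `base_url` when constructing the OpenAI client, exactly as in the Ollama example earlier in the track.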
Track 06: Sovereign AI (L-07)
Model Recommendations
Hundreds of open-weight models exist. Most are not worth your disk space. These are the ones actually worth running, with download links.

Top Picks by Use Case

🦙
Llama 3.1 8B Instruct
Meta's flagship small model. Best general-purpose 7-8B model as of 2026. Excellent for sentiment, summarization, structured output.
Recommended

~4.9 GB

128K tokens

Llama 3.1 Community License

🚀
Qwen 2.5 14B / 32B Instruct
Alibaba's model: the best open-weight model in this size class. Strong reasoning, exceptional at code, multilingual.
Recommended

~9 GB

~20 GB

Apache 2.0

🧠
Mistral 7B Instruct v0.3
Compact, fast, and proven. The default choice when you need 7B at speed. Strong instruction following.

~4.4 GB

32K tokens

Apache 2.0

⚑
Phi-3 Mini 3.8B
Microsoft's tiny powerhouse. Punches far above its weight. Perfect for laptop / low-RAM setups.

~2.4 GB

128K tokens

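MIT

🔬
DeepSeek-V3 / R1
Chinese-built reasoning model. Strong on math, code, and chain-of-thought reasoning. The cheapest path to o1-class reasoning. The full V3 and R1 models have 671B parameters, so the ~20 GB size listed here corresponds to a distilled variant (for example the 32B R1 distill), not the full model.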
Heavy

~20 GB

64K tokens

MIT

🦙
Llama 3.1 70B Instruct
Meta's flagship. Genuinely competitive with GPT-4-class models on many benchmarks. Needs serious RAM.
Heavy

~40 GB

~48 GB

Llama 3.1 Community

Where to Discover More Models
The HuggingFace LLM leaderboard ranks open-weight models by benchmark performance. Filter by size to find the strongest model your hardware can run.
huggingface.co/spaces/open-llm-leaderboard
Track 06: Sovereign AI (L-08)
Integration with Your Firm
A local LLM is only useful when it's wired into your strategy. Here's how to plug Ollama into the DeadCatFound trading firm in under 50 lines of code.

The OpenAI-Compatible Pattern

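Every local LLM server (Ollama, LM Studio, llama.cpp's llama-server) exposes an OpenAI-compatible endpoint. This is the most important detail in the entire course: code written against the OpenAI client keeps working once you swap the base URL and model name. Code written against Anthropic's own SDK uses a different request format, so port those calls to the OpenAI-style chat format first.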

scripts/local_scorer.py
    """
    Drop-in replacement for cloud LLM scoring in the firm.
    Uses local Llama 3.1 8B via Ollama.
    """
    from openai import OpenAI
    import json

    # Point at Ollama instead of OpenAI
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # ignored
    )


    def score_stock(symbol: str, metrics: dict) -> dict:
        """Score a stock 0-100 using local LLM. Returns {score, reason}."""
        response = client.chat.completions.create(
            model="llama3.1:8b",
            messages=[
                {"role": "system",
                 "content": "You are a quantitative analyst. Reply with JSON only."},
                {"role": "user",
                 "content": (
                     f"Score {symbol} from 0-100 based on these metrics: {metrics}. "
                     'Return: {"score": int, "reason": str}'
                 )},
            ],
            temperature=0.3,
        )
        return json.loads(response.choices[0].message.content)


    if __name__ == "__main__":
        result = score_stock("NVDA", {"rsi": 62, "momentum_30d": 0.18})
        print(result)
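
Small local models do not always return bare JSON, even when asked; they may wrap it in prose or a code fence, which makes a direct `json.loads` call raise. The sketch below is one defensive way to parse the reply. It assumes the reply contains a single JSON object with a `score` field, and `parse_score_reply` is a hypothetical helper, not part of the original script.

```python
import json
import re


def parse_score_reply(text):
    """Pull the JSON object out of a model reply and validate the score.

    Assumes the reply contains exactly one JSON object.
    """
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reason": str(data.get("reason", ""))}
```

In the scorer above, the final line could call `parse_score_reply(response.choices[0].message.content)` instead of `json.loads(...)`. A reply that still fails to parse is a signal to retry or to route that item to a stronger model.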

Use Cases in the Firm

  • Bulk stock screening: score 200 stocks daily without burning $50 in cloud API fees
  • News sentiment classification: classify thousands of headlines per day for free
  • Log summarization: local LLM summarizes daily firm activity into a single Telegram alert
  • Pre-screening before cloud LLM: local LLM does the cheap first pass; Claude only gets the survivors
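
The pre-screening pattern in the last bullet can be sketched as a simple rank-and-cut step. Here `cheap_score` stands in for any callable that maps a symbol to a number, such as a thin wrapper around a local scoring function. The function name and the default cutoff of 20 are assumptions for illustration.

```python
def prescreen(symbols, cheap_score, keep=20):
    """Rank symbols with a cheap scorer and return the top `keep` survivors."""
    scored = [(cheap_score(symbol), symbol) for symbol in symbols]
    # Highest score first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [symbol for _, symbol in scored[:keep]]


if __name__ == "__main__":
    # Stand-in scores; in practice these would come from the local model.
    fake_scores = {"AAA": 10, "BBB": 90, "CCC": 50}
    print(prescreen(fake_scores, fake_scores.get, keep=2))
```

Only the returned survivors would then be sent to the more expensive cloud model, so cloud cost scales with `keep` rather than with the size of the universe.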

Hybrid Architecture: The Smart Stack

When to Route Where
| Task | Volume / Day | Route To | Why |
|---|---|---|---|
| Universe pre-filter | 500+ stocks | Local Llama 3.1 8B | Cheap, fast, no quality needed |
| News sentiment | 1,000+ headlines | Local Llama 3.1 8B | Free, parallel-able |
| Top-20 stock deep scoring | 20 stocks | Claude Haiku 4.5 | Quality matters; cheap on small volume |
| Strategy design / debugging | ~10 calls | Claude Sonnet 4.6 | Best reasoning available |
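
The routing table can be expressed as a lookup. In the sketch below the task keys are hypothetical identifiers, and the model labels are display names taken from the table rather than real API model IDs. The override that keeps sensitive data on the local model is an added assumption, based on the earlier decision matrix where sensitive financial data stays on the machine.

```python
LOCAL_DEFAULT = ("local", "Llama 3.1 8B")

# Task -> (tier, model label), following the routing table.
ROUTES = {
    "universe_prefilter": LOCAL_DEFAULT,
    "news_sentiment": LOCAL_DEFAULT,
    "deep_scoring": ("cloud", "Claude Haiku 4.5"),
    "strategy_design": ("cloud", "Claude Sonnet 4.6"),
}


def route(task, contains_sensitive_data=False):
    """Return (tier, model label) for a task.

    Sensitive payloads are forced onto the local model (an assumption,
    not part of the routing table itself).
    """
    tier, model = ROUTES[task]
    if contains_sensitive_data and tier == "cloud":
        return LOCAL_DEFAULT
    return (tier, model)
```

For example, `route("deep_scoring")` returns the Claude Haiku 4.5 entry, while `route("deep_scoring", contains_sensitive_data=True)` falls back to the local model. An unknown task raises `KeyError`, which is a reasonable failure mode for a misconfigured pipeline.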
Sovereignty Achieved
You now have a complete picture of how to bring your AI in-house. The cloud is for the hard problems. Your local stack handles the volume. The result: a firm that runs at scale on a budget, with zero data exposure, that keeps running when the internet doesn't.

Your AI Is Now Sovereign.

You can run LLMs locally. You know which models to pick. You know how to integrate them. You're no longer dependent on any single provider, and you control your data end-to-end.

Continue building on DeadCatFound: every track stacks together.

⚠ Important Disclaimer

DeadCatFound is an educational platform only. We are not a registered financial advisor, broker-dealer, investment advisor, or financial institution of any kind. Nothing in this course constitutes financial advice, investment advice, or any recommendation to buy or sell any security.

All content is for educational and informational purposes only. Trading involves substantial risk of loss. Always consult a licensed financial professional. By accessing this course you acknowledge that DeadCatFound bears no liability for any financial outcomes.