Overview

Most people using AI are routing their data through external servers they do not control. Every query leaves their machine; every document uploaded to a cloud assistant becomes part of someone else's infrastructure. For use cases involving personal writing, academic notes, or anything sensitive, that is a meaningful problem.

Ramone is the answer to that problem: a fully self-hosted AI infrastructure built on consumer hardware, serving five large language models locally with no cloud dependency. Conversations, documents, and RAG knowledge bases stay on the machine. The system runs on a dedicated NVMe drive, making it hardware-portable and rebuildable in under 30 minutes.

This case study documents the full engineering process, including the structural bottlenecks encountered during development and the reasoning behind every resolution. The problems were as instructive as the solutions.

System Architecture

The final stack is a layered execution environment built entirely within the Linux subsystem on Windows 11.

Final Architecture

Windows 11 → WSL2 / Ubuntu → Docker → Open WebUI :8080, connected directly to Ollama :11434 via --network=host

Why WSL2 as the foundation

WSL2 runs a real Linux kernel inside Windows at near-native speed, in a lightweight utility VM on Hyper-V, with Ubuntu as the distribution on top. A full Linux VM would have introduced disk I/O latency and management overhead; WSL2 gives the Linux side direct access to the GPU for CUDA workloads, which is non-negotiable for LLM inference on an RTX 5070.

The Ubuntu filesystem lives on the dedicated L:\ NVMe at L:\wsl\Ubuntu\ext4.vhdx, completely isolated from the system drive. Models are stored at L:\Ollama\ and survive any OS reinstall.

Ollama as the inference engine

Ollama handles downloading, managing, and serving large language models. It exposes an API on port 11434 that other applications communicate with, and manages GPU offloading automatically, pushing as many model layers as possible onto the RTX 5070's 12GB VRAM and spilling the overflow into the 64GB DDR5 system RAM. This makes it possible to run models well above VRAM capacity without paging to disk.

Docker and Open WebUI

Open WebUI runs as a Docker container, providing a full-featured web interface at localhost:8080. It connects to Ollama on port 11434 and handles the workbot system, RAG knowledge bases, conversation history, and model switching. Docker ensures the interface runs consistently regardless of other system state.

Phase I: The Ghost Network & Orphaned Configuration

The first structural problem appeared immediately after initial setup. Open WebUI was running inside Docker but was configured to reach Ollama via host.docker.internal, the hostname Docker provides for services running on the host machine. It was therefore looking for Ollama on the Windows host network, where Ollama was not installed as a background service. Every request was hitting empty space.

The Problem

Open WebUI could not reach the Ollama engine. The container was addressing the Windows host while Ollama was either absent from Windows entirely or not running as a service.

The resolution required a fundamental shift in execution paradigm. Running everything natively inside the Linux layer (Ollama as a native Ubuntu process alongside Docker, rather than as a Windows service) would unify operations within the WSL2 subsystem, eliminate the cross-OS network boundary, and improve GPU access by keeping inference on the Linux side where CUDA performance is better.

Resolution

Uninstall Ollama from Windows. Install natively inside the Ubuntu WSL2 instance. Reconfigure Open WebUI to point at the Linux-native Ollama endpoint.
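One way to express that reconfiguration is to start the container on the host network, so that localhost inside the container is the same localhost the WSL2 Ollama process listens on. The sketch below assumes the published Open WebUI image and its OLLAMA_BASE_URL setting; the image tag, volume name, container name, and the start_open_webui and DRY_RUN helpers are illustrative choices, not details taken from this build.

```shell
# Hypothetical helper: start Open WebUI on the host network so that
# 127.0.0.1:11434 inside the container is the WSL2-native Ollama.
# Set DRY_RUN=1 to print the command instead of running it.
start_open_webui() {
    set -- docker run -d \
        --network=host \
        -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
        -v open-webui:/app/backend/data \
        --name open-webui \
        --restart always \
        ghcr.io/open-webui/open-webui:main
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With host networking there is no port mapping to declare: the interface is served on the container's own port 8080, and no cross-OS hostname is involved.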

Phase II: The Subsystem Dependency Bottleneck

Running the native Linux installation sequence inside WSL2 threw an extraction failure immediately:

$ curl -fsSL https://ollama.com/install.sh | sh

ERROR: This version requires zstd for extraction.
Please install zstd and try again.

The fresh Ubuntu WSL image was missing zstd, a high-ratio compression utility required to unpack Ollama's binary payload. This is a common gap in minimal Ubuntu installations; WSL images strip optional packages to reduce size, and zstd is not included by default.

Resolution

Update the APT repository index and install the missing dependency before retrying:

sudo apt-get update && sudo apt-get install -y zstd

The broader lesson is standard practice for Linux environment setup: always resolve dependency trees before running third-party install scripts on minimal images.
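That lesson can be turned into a small preflight check. The following is a minimal sketch, assuming the tools of interest are discoverable on PATH under the same name as their package (true for zstd and curl on Ubuntu, but not guaranteed in general); missing_commands is a hypothetical helper name.

```shell
# Print each of the given commands that is not available on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example use before running a third-party installer:
#   missing=$(missing_commands curl zstd)
#   [ -z "$missing" ] || sudo apt-get install -y $missing
```

Running the check first turns a mid-install extraction failure into an explicit list of packages to install up front.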

Phase III: The Cross-OS Permission & Case-Sensitivity Conflict

This was the most technically complex phase: a collision between Windows file system behaviour and Unix environment expectations that produced two distinct failures.

The DrvFs terminal freeze

To preserve system drive space, model assets were relocated to the secondary L:\ NVMe at /mnt/l/Ollama. Running standard Unix permission commands (sudo chown and sudo chmod) directly inside the mounted Windows directory caused a complete terminal lockup.

Root Cause

The default WSL mount driver (DrvFs) exposes Windows NTFS volumes to Linux. NTFS has no native notion of Unix owners or mode bits, and DrvFs does not persist them unless the drive is mounted with its metadata option. Forcing Unix ownership commands onto that mount caused a system-wide freeze of the subsystem, stripping user environment variables and dropping the atlas@SPECULAR-CORE prompt entirely.

Recovery required a manual reset of the subsystem (wsl --shutdown from PowerShell, then restarting it). The fix is to avoid running Unix permission commands on DrvFs-mounted Windows paths; unless the mount has metadata enabled, permissions for files on Windows drives are managed through Windows ACLs, not Unix flags.
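A guard against repeating this mistake is to check what kind of filesystem backs a path before touching its permissions. The sketch below reads a mount table and reports the type of the longest matching mount point. It assumes Windows drive mounts report the type 9p under WSL2 (drvfs under WSL1); fstype_of and is_windows_mount are hypothetical helper names.

```shell
# Print the filesystem type backing a path, using the longest matching
# mount point in a mount table (defaults to /proc/mounts).
fstype_of() {
    path=$1
    table=${2:-/proc/mounts}
    awk -v p="$path" '
        {
            mp = $2
            # A mount point matches if it is the path itself or a parent of it.
            if (mp == "/" || p == mp || index(p, mp "/") == 1) {
                if (length(mp) >= best) { best = length(mp); fs = $3 }
            }
        }
        END { print fs }
    ' "$table"
}

# Succeeds when the path sits on a Windows drive mount.
# Assumption: such mounts show up as "9p" (WSL2) or "drvfs" (WSL1).
is_windows_mount() {
    case "$(fstype_of "$1" "${2:-/proc/mounts}")" in
        9p|drvfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: refuse to chown anything that lives on a Windows drive.
#   is_windows_mount /mnt/l/Ollama && echo "skip chown: Windows-backed path"
```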

The case-sensitivity trap

With the permission issue resolved and Ollama running, the engine reported total blobs: 0 even though the model files were physically present on the L:\ drive.

Root Cause

Linux paths are strictly case-sensitive. The model files were stored inside L:\Ollama\Models (capital M), but Ollama's directory indexing expects the lowercase paths models/blobs and models/manifests, with a mandatory nested hierarchy. The flat folder structure with the wrong case was invisible to the engine.

The resolution was a clean rebuild of the directory structure on the L:\ drive, separating local data assets from raw model caches and recreating the correct nested hierarchy before re-pulling the models.
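A quick layout check would have surfaced the problem before Ollama started. The sketch below verifies that a models directory contains the lowercase blobs and manifests subdirectories, comparing names exactly against the directory listing in case the mount resolves names case-insensitively. check_models_layout is a hypothetical helper name.

```shell
# Report whether a models directory has the lowercase blobs/ and manifests/
# subdirectories. Names are matched exactly against the directory listing
# rather than trusting -d alone.
check_models_layout() {
    dir=$1
    rc=0
    for sub in blobs manifests; do
        if ls -1 "$dir" 2>/dev/null | grep -qx "$sub" && [ -d "$dir/$sub" ]; then
            echo "ok: $dir/$sub"
        else
            echo "missing: $dir/$sub"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_models_layout /mnt/l/Ollama/models
```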

Phase IV: The Systemd vs. Direct Execution Pivot

The native Ollama installer creates a system service managed by systemctl, designed to run Ollama as a background daemon on boot. Attempting to use this service produced repeated crashes:

● ollama.service - Ollama Service
   Loaded: loaded (/etc/systemd/system/ollama.service)
   Active: failed (Result: exit-code)
  Process: ExecStart=/usr/local/bin/ollama serve
 Main PID: 1847 (code=exited, status=1/FAILURE)

Root Cause

The installer creates an isolated system user account named ollama. This low-privilege account lacked the clearance to traverse the Windows drive mount boundary at /mnt/l/, causing the daemon to crash on every boot attempt.

The resolution was to bypass the system service entirely. The automatic background service was disabled, the OLLAMA_MODELS environment variable was injected permanently into ~/.bashrc to point at the correct L:\ path, and Ollama was configured to run directly under the personal user context rather than the restricted system account.

# Added to ~/.bashrc
export OLLAMA_MODELS="/mnt/l/Ollama/models"

Running under the personal user context gives full access to the mounted NVMe path without the permission boundary that caused the system service to fail.
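The pivot can be sketched as two small steps: retire the installer's service once, then serve from the user's own shell. This is an illustration under assumptions rather than the exact commands used: the function names and the OLLAMA_LOG variable are hypothetical, and the log location is arbitrary.

```shell
# One-time step: stop the installer-created service and keep it from
# starting again at boot.
disable_ollama_service() {
    sudo systemctl disable --now ollama
}

# Per-session step: serve from the user's own context with the model
# path set, logging to a file (OLLAMA_LOG is a hypothetical override).
serve_ollama_as_user() {
    OLLAMA_MODELS=${OLLAMA_MODELS:-/mnt/l/Ollama/models}
    export OLLAMA_MODELS
    nohup ollama serve >"${OLLAMA_LOG:-$HOME/ollama.log}" 2>&1 &
}
```

Because the process inherits the user's environment and permissions, it reads the same /mnt/l/ path the interactive shell can already reach.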

Model Selection

Five models were selected to cover the full range of task profiles, from fast lightweight queries to deep reasoning and large-corpus RAG retrieval. Selection was driven by a specific hardware constraint: the RTX 5070 has 12GB VRAM, with 64GB DDR5 available for overflow via CPU offloading.

Model                   Size    VRAM             Primary use
llama3.2:3b             2GB     Fits entirely    Fast lightweight tasks, quick lookups
llama3.1:8b             5GB     Fits entirely    Balanced reasoning and general chat
mistral:7b              4GB     Fits entirely    Structured output, writing tasks
deepseek-coder-v2:16b   9GB     Partial offload  Code generation and architecture
qwen2.5:32b             ~20GB   Overflow to RAM  Deep reasoning and RAG retrieval

The qwen2.5:32b model intentionally exceeds VRAM capacity and spills into system RAM. The 64GB DDR5 buffer means even 32B parameter models run without paging to disk, keeping inference latency acceptable for complex reasoning tasks where the larger context window justifies the trade-off.
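The overflow arithmetic is simple enough to sketch. Assuming whole-gigabyte sizes and ignoring the extra VRAM consumed by context and the KV cache (one plausible reason a 9GB model can still end up partially offloaded on a 12GB card), the portion that lands in system RAM is roughly model size minus VRAM. ram_spill_gb is a hypothetical helper name.

```shell
# Estimate how many GB of model weights must sit in system RAM when the
# model is larger than VRAM. Whole GB only; context and KV cache overhead
# are ignored.
ram_spill_gb() {
    model_gb=$1
    vram_gb=$2
    if [ "$model_gb" -gt "$vram_gb" ]; then
        echo $((model_gb - vram_gb))
    else
        echo 0
    fi
}

ram_spill_gb 20 12   # prints 8: roughly 8GB of a 20GB model overflows a 12GB card
ram_spill_gb 5 12    # prints 0: the model fits entirely in VRAM
```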

RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) grounds model responses in real source documents rather than relying purely on training data. Several workbots use RAG to answer questions about specific corpora (university lecture notes, personal writing style, CV content).

Knowledge base construction

University lecture PDFs were processed into chunked markdown files and uploaded to Open WebUI as named knowledge collections. Chunking strategy matters significantly; chunks that are too large lose retrieval precision, while chunks that are too small lose context. A middle ground of roughly 500 tokens per chunk with a 50-token overlap was used.
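The windowing logic behind those numbers can be illustrated with a short sketch. It is not the pipeline used for the lecture notes: it treats whitespace-separated words as a stand-in for tokens (a real tokenizer counts differently), and chunk_words is a hypothetical helper name. Each chunk starts size minus overlap words after the previous one, so consecutive chunks share their boundary words.

```shell
# Split whitespace-delimited words from stdin into overlapping windows and
# print one chunk per line. Words approximate tokens here.
# Usage: chunk_words SIZE OVERLAP < input.txt
chunk_words() {
    size=$1
    overlap=$2
    awk -v size="$size" -v overlap="$overlap" '
        { for (i = 1; i <= NF; i++) words[++n] = $i }
        END {
            step = size - overlap
            if (step < 1) {
                print "overlap must be smaller than size" > "/dev/stderr"
                exit 1
            }
            for (start = 1; start <= n; start += step) {
                end = start + size - 1
                if (end > n) end = n
                line = words[start]
                for (j = start + 1; j <= end; j++) line = line " " words[j]
                print line
                if (end == n) break   # last window reached the final word
            }
        }
    '
}

# Example with the sizes from the text:
#   chunk_words 500 50 < lecture.md
```

With a size of 4 and an overlap of 1, the input "a b c d e f g h i j" yields three chunks: "a b c d", "d e f g", and "g h i j".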

The four knowledge collections

  • University_Library — full academic reading list, processed from PDFs into retrievable chunks
  • University Notes — lecture notes chunked from markdown, indexed for the Academic Vector Index workbot
  • Core Profile Context — CV, background, and personal context for the Professional Correspondent workbot
  • Writing Style — email samples and essays used to ground the writing style mimicry system

The Archivist workbot uses qwen2.5:32b specifically for RAG retrieval. The larger context window and reasoning capability produce noticeably more coherent synthesis across large document corpora compared to the smaller models.

Workbot Engineering

Ten specialised AI assistants were built inside Open WebUI, each locked into a specific role via a carefully engineered system prompt. The design principle was that each bot should be the best possible tool for exactly one job, rather than a general assistant trying to cover everything.

  • Therapy/Journal — emotional support and personal reflection, with hard stops against technical topics
  • The Scrubber — high-speed data cleaning, classification, and structured output engine
  • The Archivist — deep RAG analysis across large document corpora using qwen2.5:32b
  • Professional Correspondent — mimics personal writing style via RAG-grounded CV and email samples
  • Architect-Bot — specialist in Unreal Engine, audio systems, C++, and game development architecture
  • Academic Vector Index — university research assistant linked to chunked lecture notes and academic reading
  • Narrative Forger V1 & V2 — high-volume lore and item description generation for game development pipelines
  • Senior Architect — high-level system design and codebase architecture for complex multi-file projects
  • Reasoning Brain — general-purpose deep reasoning using the largest available model
  • API Tester — minimal connection verification and endpoint testing

The hard stops in the Therapy/Journal bot are deliberate. A system prompt that explicitly prevents task-switching ensures the emotional context of that conversation space is never contaminated by technical work; the separation is the feature.

Boot Automation

Starting the full stack manually on every boot (WSL2, Docker, the Open WebUI container, Ollama) would require four separate operations in a terminal. ATLAS_BOOTSTRAP.bat reduces this to zero manual steps.

What the bootstrap script does

  • Waits for the L:\ NVMe drive to mount before attempting any operations
  • Starts the Docker engine inside Ubuntu via WSL2
  • Launches the Open WebUI container if not already running
  • Starts Ollama under the correct user context
  • Polls localhost:11434 until Ollama confirms it is live
  • Plays a completion sound on successful startup
  • Opens a styled Ubuntu terminal with the ATLAS boot banner

The script lives on the L:\ NVMe and runs automatically via the Windows shell startup folder. The styled terminal output (colour-coded status messages, ASCII logo, live progress indicators) provides immediate visual confirmation that every service layer is healthy before use.

Design Decision

The NVMe mount wait is critical. If the script runs before the L:\ drive mounts, Ollama starts without access to the model directory and reports zero models. Waiting on the drive before launching anything else removes this race condition.
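Both waits in the bootstrap (for the drive, then for Ollama) follow the same retry pattern. The sketch below shows that pattern as it might look on the Linux side of the stack; it is not the contents of ATLAS_BOOTSTRAP.bat. The helper names, attempt counts, and delays are hypothetical, and the readiness probe assumes Ollama's /api/tags endpoint answers once the server is up.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Hypothetical probes for the two waits described above.
drive_mounted() { [ -d /mnt/l/Ollama ]; }
ollama_ready()  { curl -fsS http://localhost:11434/api/tags; }

# Example sequence: only start Ollama once the drive is visible, then
# block until the API answers.
#   wait_for 60 2 drive_mounted || exit 1
#   ollama serve &
#   wait_for 30 1 ollama_ready || exit 1
```

Bounding the number of attempts means a drive that never mounts produces a clear failure instead of a script that hangs indefinitely.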

Outcomes

  • Data sovereignty: All conversations, RAG documents, and model weights remain on local hardware
  • Rebuild time: Full system rebuildable on new hardware in under 30 minutes
  • Models running: 5 LLMs from 3B to 32B parameters, matched to specific task profiles
  • Workbots deployed: 10 specialised agents across reasoning, writing, code, and research

The most significant outcome is architectural. The system demonstrates that a production-grade private AI infrastructure is achievable on consumer hardware, provided the engineering decisions are deliberate. The bottlenecks encountered (DrvFs permission boundaries, case-sensitivity conflicts, systemd privilege restrictions) are exactly the kind of cross-OS friction that appears in real infrastructure work; each one required diagnosis, root cause analysis, and a deliberate resolution.

The --network=host flag that unified the final stack is a good illustration of the core principle: when virtualised network routing introduces unnecessary complexity, removing the abstraction layer entirely and letting processes communicate directly is usually the correct answer.