Engineering Note, May 2026

How to Choose an AI Workstation for Private LLM and RAG Workloads

Selecting hardware for private LLM inference and RAG applications requires a different calculus than training workloads. The primary constraints shift from throughput-maximizing multi-GPU interconnects to VRAM capacity, single-GPU inference performance, and system-level balance across CPU, RAM, and storage.

This engineering note examines the technical factors that matter most when configuring a workstation for local LLM inference and retrieval-augmented generation pipelines.

VRAM capacity

VRAM determines the model size ceiling

VRAM capacity is the hard constraint for which models you can run. A 70B parameter model, such as Llama 3.1 70B at 70.6B parameters, stored at FP16 precision requires approximately 140 GB of VRAM for weights alone. That exceeds even high-end workstation GPUs.
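The weight-only arithmetic behind that figure is parameter count times bytes per weight. The sketch below is a minimal illustration, assuming idealized storage of 2, 1, and 0.5 bytes per weight for FP16, INT8, and INT4; the function and table names are illustrative, and real quantized files carry some extra metadata.

```python
# Idealized bytes per weight; actual quantized formats add scales and metadata.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Return the weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


# 70.6e9 is the Llama 3.1 70B parameter count quoted in the text.
for precision in ("fp16", "int8", "int4"):
    print(f"{precision}: {weight_footprint_gb(70.6e9, precision):.1f} GB")
```

For 70.6B parameters this gives about 141 GB at FP16, 71 GB at INT8, and 35 GB at INT4, before any runtime overhead.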

Quantization creates predictable memory and quality tradeoffs. INT8 reduces the weight footprint to roughly 70 GB, while INT4 reduces it to roughly 35 GB. These figures do not include runtime overhead: vLLM documents that KV cache memory scales with batch size, sequence length, hidden size, layers, and dtype size, so practical deployments need additional VRAM beyond the model weights.
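A rough KV cache estimate can be built from the same factors. The sketch below is an approximation rather than vLLM's own accounting: it assumes one key and one value tensor per layer, and it scales the hidden size by the ratio of KV heads to attention heads to account for grouped-query attention. The layer and head counts in the example are assumptions based on the published Llama 3.1 70B configuration and should be checked against the model card.

```python
def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    hidden_size: int,
    num_heads: int,
    num_kv_heads: int,
    dtype_bytes: int = 2,
) -> int:
    """Approximate KV cache size in bytes for one set of concurrent sequences."""
    head_dim = hidden_size // num_heads
    kv_dim = head_dim * num_kv_heads  # width of the K (or V) projection per token
    # Factor of 2: one key tensor plus one value tensor per layer.
    return 2 * num_layers * kv_dim * seq_len * batch_size * dtype_bytes


# Assumed configuration: 80 layers, hidden size 8192, 64 heads, 8 KV heads, FP16 cache.
single = kv_cache_bytes(1, 8192, 80, 8192, 64, 8)
print(f"batch 1, 8192 tokens: {single / 1e9:.2f} GB")
print(f"batch 4, 8192 tokens: {kv_cache_bytes(4, 8192, 80, 8192, 64, 8) / 1e9:.2f} GB")
```

Under these assumptions a single 8192-token sequence needs about 2.7 GB of cache, and the requirement grows linearly with batch size and sequence length.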

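In practice, 4-bit quantization through formats such as Q4_K_M in llama.cpp offers the best workstation balance for many private LLM deployments: a large size reduction with manageable quality degradation when the quantization method is chosen carefully. Q4_K_M is a mixed-precision scheme that averages somewhat more than 4 bits per weight, so its files run larger than the 35 GB pure-INT4 estimate.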

Inference scaling

Inference scales differently than training

Unlike training, single-GPU token generation at small batch sizes is primarily memory-bandwidth-bound rather than compute-bound, because each generated token requires reading the model weights from VRAM. Once the weights are loaded, generation speed depends heavily on on-device memory bandwidth. The PCIe bus matters mostly for initial model loading and for returning generated tokens to the host.
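If token generation is bandwidth-bound, a rough ceiling on single-stream decode speed is memory bandwidth divided by the bytes read per token, which is approximately the weight footprint. The sketch below encodes that back-of-the-envelope bound; it ignores KV cache reads, kernel efficiency, and batching, so measured throughput will be lower. The 768 GB/s value is an example bandwidth for a 48 GB workstation GPU and should be replaced with the figure from the relevant datasheet.

```python
def decode_ceiling_tokens_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode rate when every token reads all weights once."""
    return bandwidth_gb_per_s / weight_gb


# Assumed inputs: ~35 GB of INT4 weights, 768 GB/s of VRAM bandwidth.
print(f"{decode_ceiling_tokens_per_s(35.0, 768.0):.1f} tokens/s ceiling")
```

The same model on a GPU with several times the bandwidth has a proportionally higher ceiling, which is why memory bandwidth is a primary selection criterion alongside capacity.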

Multi-GPU configurations are necessary only when a model exceeds single-GPU VRAM capacity or when throughput targets require multiple model replicas. Tensor parallelism distributes the model across GPUs, but it also introduces synchronization overhead. NVLink can reduce that overhead, while PCIe-only tensor parallelism usually raises per-query latency.
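Whether a model needs to be split can be checked with a simple capacity calculation. The helper below is a hypothetical planning sketch, not part of any serving framework. It assumes that only a fraction of each GPU's VRAM is usable for weights and KV cache, and that the tensor-parallel degree is rounded up to a power of two because the GPU count generally has to divide the attention head count evenly.

```python
import math


def gpus_needed(
    weights_gb: float,
    kv_cache_gb: float,
    gpu_vram_gb: float,
    usable_fraction: float = 0.9,  # assumed headroom for activations and runtime buffers
) -> int:
    """Return the smallest power-of-two GPU count whose pooled VRAM fits the model."""
    required = weights_gb + kv_cache_gb
    usable_per_gpu = gpu_vram_gb * usable_fraction
    count = max(1, math.ceil(required / usable_per_gpu))
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 1 << (count - 1).bit_length()


print(gpus_needed(35.3, 2.7, 48.0))   # INT4 70B on a 48 GB GPU
print(gpus_needed(141.2, 2.7, 80.0))  # FP16 70B on 80 GB GPUs
```

A result of 1 means the single-GPU path is available and tensor parallelism can be avoided.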

Practical implication

For interactive private LLM workloads, prioritize single-GPU VRAM capacity before splitting one model across multiple smaller GPUs.

RAG pipelines

RAG combines distinct hardware bottlenecks

A RAG pipeline has three sequential stages, each with different hardware demands. First, an embedding model encodes the query into a vector. Second, that vector is used to search a vector index for semantically similar documents. Finally, the retrieved context is injected into the LLM prompt, and the model generates a response.
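The three stages can be sketched end to end in a few lines. This is a toy illustration only: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, retrieval is a brute-force cosine scan rather than an ANN index, and the final LLM call is left out so that the example stays self-contained.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stage 1 stand-in: term counts instead of a learned embedding (hypothetical)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Stage 2: rank documents by similarity to the query vector."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stage 3 input: inject retrieved context ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


DOCS = [
    "VRAM capacity limits which model sizes fit on one GPU.",
    "NVMe SSDs speed up index loading and document ingestion.",
    "HNSW indexes are usually kept in system RAM for low latency search.",
]
query = "How much VRAM does a model need?"
prompt = build_prompt(query, retrieve(query, DOCS, top_k=1))
print(prompt)  # this string would be passed to the LLM for generation
```

In a real deployment each function maps to a different resource: the embedding model and the LLM run on the GPU, while the retrieval step runs against an in-memory index on the CPU.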

The vector retrieval stage is usually CPU and RAM intensive. FAISS and Milvus rely on approximate nearest neighbor algorithms such as HNSW, with indexes commonly kept in memory for low-latency search. Milvus estimates that 10 million vectors at 1536 dimensions require about 60 GB of RAM before broader application overhead is considered.
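The 60 GB figure matches the raw storage for float32 vectors, which can be checked directly. The sketch below also adds a graph-link term for HNSW using the rule of thumb of roughly `d * 4 + M * 2 * 4` bytes per vector from FAISS index-selection guidance; the value `M = 32` is an assumed example, and real indexes add further allocator and metadata overhead.

```python
def raw_vectors_gb(num_vectors: int, dim: int, bytes_per_component: int = 4) -> float:
    """Raw float32 vector storage in decimal gigabytes."""
    return num_vectors * dim * bytes_per_component / 1e9


def hnsw_index_gb(num_vectors: int, dim: int, m: int = 32) -> float:
    """Approximate HNSW footprint: float32 vectors plus about 2*M four-byte links each."""
    bytes_per_vector = dim * 4 + m * 2 * 4
    return num_vectors * bytes_per_vector / 1e9


print(f"raw:  {raw_vectors_gb(10_000_000, 1536):.2f} GB")
print(f"hnsw: {hnsw_index_gb(10_000_000, 1536):.2f} GB")
```

At 1536 dimensions the vectors themselves dominate, so dimensionality and vector count drive RAM sizing far more than the graph parameters do.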

The LLM inference stage is usually the longest portion of the pipeline. Interactive text RAG targets often aim for p95 latency below a few seconds, while voice RAG requires much lower latency and is generally impractical with large models without additional serving optimizations.
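A simple latency budget makes the dominance of the generation stage concrete. The class below is an illustrative sketch; every number in the example is a hypothetical placeholder, not a measurement, and the model assumes the stages run strictly in sequence with no streaming.

```python
from dataclasses import dataclass


@dataclass
class RagLatencyBudget:
    embed_ms: float             # query embedding
    retrieve_ms: float          # vector index search
    prefill_ms: float           # prompt processing before the first token
    output_tokens: int          # length of the generated answer
    decode_tokens_per_s: float  # sustained generation rate

    def generation_ms(self) -> float:
        return self.prefill_ms + 1000.0 * self.output_tokens / self.decode_tokens_per_s

    def total_ms(self) -> float:
        return self.embed_ms + self.retrieve_ms + self.generation_ms()

    def llm_share(self) -> float:
        """Fraction of end-to-end latency spent in the LLM stage."""
        return self.generation_ms() / self.total_ms()


# Hypothetical example values.
budget = RagLatencyBudget(
    embed_ms=15, retrieve_ms=25, prefill_ms=300, output_tokens=100, decode_tokens_per_s=40
)
print(f"total: {budget.total_ms():.0f} ms, LLM share: {budget.llm_share():.0%}")
```

With these placeholder inputs the LLM accounts for nearly all of the end-to-end time, which is why shortening answers or raising decode speed moves the total far more than tuning retrieval.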

Storage and ingestion

Storage matters most when indexes and documents move

Vector database performance depends heavily on whether the index fits in RAM. NVMe SSDs help with persistence, fast index loading, and large-scale document ingestion. Once the hot index is loaded into memory, query performance depends more on CPU, RAM, and index structure than on raw storage bandwidth.

Document ingestion is different. Embedding and indexing new documents is a batch process shaped by embedding throughput, concurrent writes, preprocessing, and storage behavior. GPUs accelerate embedding generation more directly than faster SSDs, but fast local NVMe still helps keep ingestion predictable.
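Because ingestion is a batch job, its duration can be estimated from chunk count and embedding throughput. The function below is a rough planning sketch; the chunk counts and throughput figures are hypothetical and should be replaced with rates measured on the target hardware.

```python
def ingestion_hours(
    num_documents: int, chunks_per_document: int, chunks_per_second: float
) -> float:
    """Estimate wall-clock hours to embed a corpus at a sustained chunk rate."""
    total_chunks = num_documents * chunks_per_document
    return total_chunks / chunks_per_second / 3600.0


# Hypothetical corpus: 1M documents at 10 chunks each (10M vectors).
for label, rate in (("slower embedding path", 50.0), ("faster embedding path", 500.0)):
    print(f"{label}: {ingestion_hours(1_000_000, 10, rate):.1f} h")
```

A tenfold difference in embedding throughput translates directly into a tenfold difference in ingestion time, whereas storage speed mainly affects how steadily that rate is sustained.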

Workload mapping

Map use cases to hardware constraints

Individual or small team

7-13B models

Single 24-48 GB GPU class
64 GB system RAM minimum for moderate vector indexes
Codebase Q&A, document analysis, research assistance

Team deployment

30-70B models

Single 48-80 GB GPU class
128 GB system RAM for larger document corpora
Internal knowledge base, support assist, content workflows

Production service

Multi-instance serving

Multiple model replicas or larger 70B+ model serving
256 GB or more system RAM for larger indexes
Batching and request queuing with serving frameworks
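The mapping above can be captured as a small lookup table for planning scripts. This is a simplified sketch: it selects a tier by model size alone, assigns the unlisted 14-29B range to the team tier as an assumption, and treats anything above 70B as the production tier, even though that tier is also defined by multi-replica serving needs. The `Tier` and `pick_tier` names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_model_b: float  # upper model size for the tier, in billions of parameters
    gpu: str
    min_system_ram_gb: int


TIERS = (
    Tier("Individual or small team", 13, "Single 24-48 GB GPU class", 64),
    Tier("Team deployment", 70, "Single 48-80 GB GPU class", 128),
    Tier("Production service", float("inf"), "Multiple replicas or 70B+ serving", 256),
)


def pick_tier(model_size_b: float) -> Tier:
    """Return the first tier whose size ceiling covers the requested model."""
    for tier in TIERS:
        if model_size_b <= tier.max_model_b:
            return tier
    return TIERS[-1]  # unreachable with an infinite final ceiling; kept for safety


for size in (8, 34, 405):
    tier = pick_tier(size)
    print(f"{size}B -> {tier.name}: {tier.gpu}, {tier.min_system_ram_gb} GB RAM")
```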

Engineering summary

Private LLM inference and RAG are fundamentally different workloads from distributed training.

VRAM capacity, GPU memory bandwidth, and system RAM matter far more than GPU-to-GPU interconnects for most workstation RAG deployments. Workstation-class GPUs with 48-80 GB VRAM, paired with ample system memory and fast local storage, can deliver acceptable interactive RAG latency without data center infrastructure.

References

Meta Llama 3.1 70B Model Card — official model card with parameter count.
llama.cpp Quantization README — Q4_K_M quantization documentation and recommendations.
vLLM Optimization and Tuning — KV cache memory behavior and configuration.
Metrics for Quantized LLM Evaluation — quantization quality and perplexity discussion.
NVIDIA H100 Tensor Core GPU Datasheet — H100 memory bandwidth and NVLink specifications.
NVIDIA RTX A6000 Datasheet — RTX A6000 memory bandwidth specifications.
BGE-M3 Model Card and BGE-M3 Memory Requirements — embedding model context length and VRAM discussion.
FAISS, Milvus overview, and HNSW in Milvus — vector search architecture and index behavior.
Milvus hardware requirements — memory requirements for large-scale vector search.
RAG system latency requirements and RAG pipeline latency for voice — latency targets for interactive and voice applications.
Scaling Enterprise RAG and High-Performance RAG with SSD Offloading — storage architecture and RAG performance analysis.
llama.cpp benchmark results — throughput benchmarks for model and hardware combinations.
