How to Choose an AI Workstation for Private LLM and RAG Workloads
Selecting hardware for private LLM inference and RAG applications requires a different calculus than training workloads. The primary constraints shift from throughput-maximizing multi-GPU interconnects to VRAM capacity, single-GPU inference performance, and system-level balance across CPU, RAM, and storage.
This engineering note examines the technical factors that matter most when configuring a workstation for local LLM inference and retrieval-augmented generation pipelines.
01
VRAM capacity
VRAM determines the model size ceiling
VRAM capacity is the hard constraint for which models you can run. A 70B parameter model, such as Llama 3.1 70B at 70.6B parameters, stored at FP16 precision requires approximately 140 GB of VRAM for weights alone. That exceeds even high-end workstation GPUs.
Quantization creates predictable memory and quality tradeoffs. INT8 reduces the weight footprint to roughly 70 GB, while INT4 reduces it to roughly 35 GB. These figures do not include runtime overhead: vLLM documents that KV cache memory scales with batch size, sequence length, hidden size, layers, and dtype size, so practical deployments need additional VRAM beyond the model weights.
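The weight and KV cache arithmetic above can be sketched in a few lines. This is a rough estimator, not a sizing tool: the function names are illustrative, the layer count, KV head count, and head dimension are assumed values for a Llama-3.1-70B-like architecture, and the KV formula uses KV heads times head dimension in place of hidden size so that it also covers models with grouped-query attention.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(batch_size: int, seq_len: int, num_layers: int,
                num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> float:
    """Approximate KV cache size for a full batch of sequences."""
    # The leading factor of 2 covers the separate key and value tensors.
    total_bytes = (2 * batch_size * seq_len * num_layers
                   * num_kv_heads * head_dim * dtype_bytes)
    return total_bytes / 1e9


if __name__ == "__main__":
    params = 70.6e9  # Llama 3.1 70B parameter count from the text
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: {weight_memory_gb(params, bits):.1f} GB")

    # Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    print(f"KV cache, 1 x 8192 tokens: {kv_cache_gb(1, 8192, 80, 8, 128):.2f} GB")
```

Under these assumptions a single 8192-token sequence adds a few GB on top of the weights, and the cache grows linearly with both batch size and sequence length.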
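In practice, 4-bit quantization through formats such as Q4_K_M in llama.cpp offers a practical workstation balance for many private LLM deployments: a large size reduction with manageable quality degradation when the quantization method is chosen carefully. Q4_K_M mixes quantization types and averages somewhat more than 4 bits per weight, so its files run larger than the idealized 35 GB INT4 figure.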
02
Inference scaling
Inference scales differently than training
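Unlike training, single-GPU token generation at small batch sizes is primarily memory-bandwidth-bound rather than compute-bound, because each generated token requires reading the model weights from VRAM. Prompt processing (prefill) is more compute-heavy, since it handles many tokens per pass over the weights. Once model weights are loaded into VRAM, generation speed depends heavily on on-device memory bandwidth. The PCIe bus matters mostly for initial model loading and for returning generated tokens to the host.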
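A back-of-envelope bound follows from the bandwidth argument: if every generated token has to stream the full set of weights from VRAM once, single-stream decode speed cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that simplification and ignores KV cache reads, kernel efficiency, and batching. The bandwidth figures are assumed datasheet values and should be checked against the GPU datasheets in the references.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed, assuming each token
    reads all model weights from VRAM exactly once."""
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    weight_gb = 35.3  # 70.6B parameters at 4 bits per weight
    # Assumed peak memory bandwidths in GB/s (verify against the datasheets).
    gpus = {"RTX A6000": 768.0, "H100 SXM": 3350.0}
    for name, bandwidth in gpus.items():
        bound = max_decode_tokens_per_s(bandwidth, weight_gb)
        print(f"{name}: at most {bound:.1f} tokens/s")
```

Measured throughput lands below this bound, so it is most useful for comparing GPUs and quantization levels relative to each other.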
Multi-GPU configurations are necessary only when a model exceeds single-GPU VRAM capacity or when throughput targets require multiple model replicas. Tensor parallelism distributes the model across GPUs, but it also introduces synchronization overhead. NVLink can reduce that overhead, while PCIe-only tensor parallelism usually raises per-query latency.
Practical implication
For interactive private LLM workloads, prioritize single-GPU VRAM capacity before splitting one model across multiple smaller GPUs.
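One way to apply this rule is to check whether the model plus its cache fits on one GPU before considering a split. The helper below is a minimal sketch: the function name and the 90% usable-VRAM fraction are assumptions for illustration, and real tensor-parallel setups often impose further constraints on the GPU count that are not modeled here.

```python
import math


def gpus_needed(model_gb: float, gpu_vram_gb: float,
                usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable VRAM holds model_gb.

    model_gb should already include weights plus expected KV cache.
    usable_fraction is an assumed allowance for runtime overhead.
    """
    usable_gb = gpu_vram_gb * usable_fraction
    return max(1, math.ceil(model_gb / usable_gb))


if __name__ == "__main__":
    need_gb = 38.0  # e.g. roughly 35 GB of 4-bit weights plus cache
    for vram in (24, 48, 80):
        count = gpus_needed(need_gb, vram)
        mode = "single GPU" if count == 1 else f"split across {count} GPUs"
        print(f"{vram} GB cards: {mode}")
```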
03
RAG pipelines
RAG combines distinct hardware bottlenecks
A RAG pipeline has three sequential stages, each with different hardware demands. First, an embedding model encodes the query into a vector. Second, that vector is used to search a vector index for semantically similar documents. Finally, the retrieved context is injected into the LLM prompt, and the model generates a response.
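The three stages can be sketched end to end with standard-library stand-ins. Everything here is a toy assumption: the bag-of-words `embed` function replaces a real embedding model, the brute-force `retrieve` replaces an approximate nearest neighbor index, and the final generation call is left out, so the sketch stops at the assembled prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Counter, index: list, k: int = 2) -> list:
    """Brute-force search; a real system would use an ANN index such as HNSW."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, passages: list) -> str:
    """Inject retrieved passages into the prompt that the LLM would receive."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "VRAM capacity limits the model size",
    "NVMe storage speeds up index loading",
    "HNSW indexes are kept in RAM",
]
index = [(d, embed(d)) for d in docs]        # offline: embed and index documents
question = "what limits model size"
query_vec = embed(question)                  # stage 1: encode the query
passages = retrieve(query_vec, index, k=1)   # stage 2: vector search
prompt = build_prompt(question, passages)    # stage 3: prompt for the LLM
print(prompt)
```

Keeping the stages as separate functions mirrors the hardware split: the first and third map to GPU work, the second to CPU and RAM.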
The vector retrieval stage is usually CPU and RAM intensive. FAISS and Milvus rely on approximate nearest neighbor algorithms such as HNSW, with indexes commonly kept in memory for low-latency search. Milvus estimates that 10 million vectors at 1536 dimensions require about 60 GB of RAM before broader application overhead is considered.
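The 60 GB figure is close to the raw size of the vectors themselves when stored as 32-bit floats. A minimal estimator, assuming float32 storage by default and treating any index overhead as a caller-supplied multiplier (the `overhead_factor` parameter is hypothetical, since graph overhead depends on the index type and its settings):

```python
def vector_index_gb(num_vectors: int, dim: int,
                    bytes_per_value: int = 4,
                    overhead_factor: float = 1.0) -> float:
    """Approximate in-memory size of a dense vector collection in decimal GB.

    overhead_factor is an assumed multiplier for index structures such as
    HNSW graph links; 1.0 counts the raw vectors only.
    """
    return num_vectors * dim * bytes_per_value * overhead_factor / 1e9


if __name__ == "__main__":
    raw_gb = vector_index_gb(10_000_000, 1536)
    print(f"10M x 1536 float32 vectors: {raw_gb:.2f} GB")
```

Halving the bytes per value or the dimensionality halves the footprint, which is often the cheapest way to bring an index back under the available system RAM.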
The LLM inference stage usually accounts for the largest share of end-to-end pipeline latency. Interactive text RAG targets often aim for p95 latency below a few seconds, while voice RAG requires much lower latency and is generally impractical with large models without additional serving optimizations.
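A simple latency budget makes the stage breakdown concrete. The sketch below adds up per-stage times for one query. All input numbers in the example are hypothetical placeholders, not measurements, and the model treats prefill and decode as fixed tokens-per-second rates.

```python
def rag_latency_budget(embed_s: float, retrieve_s: float,
                       prompt_tokens: int, prefill_tok_s: float,
                       output_tokens: int, decode_tok_s: float) -> dict:
    """Per-stage latency in seconds for one RAG query, plus the total."""
    stages = {
        "embed": embed_s,
        "retrieve": retrieve_s,
        "prefill": prompt_tokens / prefill_tok_s,
        "decode": output_tokens / decode_tok_s,
    }
    stages["total"] = sum(stages.values())
    return stages


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the arithmetic.
    budget = rag_latency_budget(
        embed_s=0.02, retrieve_s=0.03,
        prompt_tokens=2000, prefill_tok_s=1000.0,
        output_tokens=200, decode_tok_s=20.0,
    )
    for stage, seconds in budget.items():
        print(f"{stage:>8}: {seconds:6.2f} s")
```

With placeholder rates like these, decode time dominates the total, which is why shorter responses and faster token generation move the budget more than faster retrieval does.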
04
Storage and ingestion
Storage matters most when indexes and documents move
Vector database performance depends heavily on whether the index fits in RAM. NVMe SSDs help with persistence, fast index loading, and large-scale document ingestion. Once the hot index is loaded into memory, query performance depends more on CPU, RAM, and index structure than on raw storage bandwidth.
Document ingestion is different. Embedding and indexing new documents is a batch process shaped by embedding throughput, concurrent writes, preprocessing, and storage behavior. GPUs accelerate embedding generation more directly than faster SSDs, but fast local NVMe still helps keep ingestion predictable.
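The batch shape of ingestion can be sketched as chunking followed by fixed-size batches. This is an illustrative outline only: the word-based chunker, the chunk sizes, and the batch size are assumptions, and the embedding and index-write calls are left as a comment because they depend on the chosen model and vector database.

```python
from typing import Iterable, Iterator, List


def chunk_text(text: str, chunk_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping word windows (requires overlap < chunk_words)."""
    words = text.split()
    if not words:
        return []
    step = chunk_words - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_words]) for i in starts]


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield successive lists of up to batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(1000))
    chunks = chunk_text(document, chunk_words=200, overlap=20)
    for batch in batched(chunks, batch_size=4):
        # A real pipeline would embed the batch on the GPU here,
        # then write the vectors to the index.
        print(f"batch of {len(batch)} chunks")
```

Larger batches generally keep the GPU busier during embedding, while the write side of each batch is where storage behavior shows up.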
05
Workload mapping
Map use cases to hardware constraints
Individual or small team
7-13B models
- Single 24-48 GB GPU class
- 64 GB system RAM minimum for moderate vector indexes
- Codebase Q&A, document analysis, research assistance
Team deployment
30-70B models
- Single 48-80 GB GPU class
- 128 GB system RAM for larger document corpora
- Internal knowledge base, support assist, content workflows
Production service
Multi-instance serving
- Multiple model replicas or larger 70B+ model serving
- 256 GB or more system RAM for larger indexes
- Batching and request queuing with serving frameworks
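The mapping above can be written as a small lookup. This is one possible encoding, not a sizing rule: the `pick_tier` function and its thresholds are assumptions, including the choice to place models between 13B and 30B in the team tier and to treat any multi-replica request as a production service.

```python
TIERS = [
    {"name": "Individual or small team", "max_params_b": 13,
     "gpu_vram_gb": (24, 48), "system_ram_gb": 64},
    {"name": "Team deployment", "max_params_b": 70,
     "gpu_vram_gb": (48, 80), "system_ram_gb": 128},
]

# The production tier is defined by replicas or 70B+ models, not one GPU size.
PRODUCTION = {"name": "Production service", "max_params_b": None,
              "gpu_vram_gb": None, "system_ram_gb": 256}


def pick_tier(model_params_b: float, replicas: int = 1) -> dict:
    """Return the tier from the mapping that matches a model size and replica count."""
    if replicas > 1 or model_params_b > 70:
        return PRODUCTION
    for tier in TIERS:
        if model_params_b <= tier["max_params_b"]:
            return tier
    return PRODUCTION


if __name__ == "__main__":
    for size_b, replicas in [(8, 1), (70, 1), (70, 4)]:
        tier = pick_tier(size_b, replicas)
        print(f"{size_b}B x {replicas}: {tier['name']}, "
              f"{tier['system_ram_gb']} GB system RAM or more")
```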
Engineering summary
Private LLM inference and RAG are fundamentally different workloads from distributed training.
VRAM capacity, GPU memory bandwidth, and system RAM matter far more than GPU-to-GPU interconnects for most workstation RAG deployments. Workstation-class GPUs with 48-80 GB VRAM, paired with ample system memory and fast local storage, can deliver acceptable interactive RAG latency without data center infrastructure.
References
- Meta Llama 3.1 70B Model Card — official model card with parameter count.
- llama.cpp Quantization README — Q4_K_M quantization documentation and recommendations.
- vLLM Optimization and Tuning — KV cache memory behavior and configuration.
- Metrics for Quantized LLM Evaluation — quantization quality and perplexity discussion.
- NVIDIA H100 Tensor Core GPU Datasheet — H100 memory bandwidth and NVLink specifications.
- NVIDIA RTX A6000 Datasheet — RTX A6000 memory bandwidth specifications.
- BGE-M3 Model Card and BGE-M3 Memory Requirements — embedding model context length and VRAM discussion.
- FAISS, Milvus overview, and HNSW in Milvus — vector search architecture and index behavior.
- Milvus hardware requirements — memory requirements for large-scale vector search.
- RAG system latency requirements and RAG pipeline latency for voice — latency targets for interactive and voice applications.
- Scaling Enterprise RAG and High-Performance RAG with SSD Offloading — storage architecture and RAG performance analysis.
- llama.cpp benchmark results — throughput benchmarks for model and hardware combinations.