Sora Yazılım

Running Local LLMs: Workstation or GPU Server?

Sora Yazılım Team

Running local LLMs is becoming a corporate necessity. Under pressure from data privacy regulations, low-latency requirements, and long-term cost sustainability, CTOs and IT directors are increasingly choosing to host large language models on their own infrastructure rather than relying on cloud APIs. The key question: workstation or GPU server?

What Does Running a Local LLM Actually Require?

The most critical resource for local LLM inference is VRAM. Model weights are loaded into GPU memory; without sufficient VRAM the model either fails to load or falls back to CPU offloading, making inference impractically slow.

Running large language models on your own infrastructure was, until a few years ago, the exclusive domain of large research laboratories. Today, consumer-grade GPUs and open-source tools like Ollama have made it straightforward for enterprise IT teams. However, understanding hardware requirements correctly is essential for a successful deployment.

An LLM needs all of its weights, at full or quantized precision, resident in GPU VRAM. When they do not fit, runtimes such as llama.cpp (or PyTorch-based stacks configured for offloading) move part of the model to the CPU and system RAM, and inference speed can drop to seconds per token, which is effectively unusable for interactive workloads. System RAM and CPU speed matter for preprocessing and I/O, but the true bottleneck is almost always VRAM.

On the storage side, models range from 4 GB to over 1.3 TB on disk. A fast NVMe SSD significantly reduces model load time; PCIe 5.0 NVMe is recommended for large models. Once the model has been downloaded, network bandwidth has no effect on single-machine inference.

  • VRAM: The primary resource hosting model weights — non-negotiable
  • CPU: Important for tokenization, preprocessing, and system coordination; modern EPYC or Xeon recommended
  • System RAM: 64 GB minimum, 128 GB+ recommended for offloading or complex context management
  • NVMe SSD: Determines model load speed; PCIe 5.0 NVMe recommended for large models
  • Power supply and cooling: Each RTX 5090 is rated at 575 W, so dual-card configurations typically need a 1600 W-class PSU rather than a 1000 W unit

VRAM and Model Size: Which Model Fits Which GPU?

Model size and quantization level directly determine VRAM requirements. A 7B model needs roughly 14 GB at FP16 but only 4–5 GB at Q4 quantization. A 70B model still requires 35–40 GB even at Q4, demanding high-end hardware.
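The figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimator under that assumption; the function name and the nominal bit widths are illustrative, and real Q4/Q8 file formats store some extra metadata, so actual sizes run slightly higher.

```python
# Nominal bits per weight for each precision. Real Q4/Q8 formats add a little
# metadata (per-group scales), so actual files are somewhat larger.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q4": 4}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate VRAM (decimal GB) needed for the weights alone."""
    bits = BITS_PER_WEIGHT[precision]
    # params * bits / 8 gives bytes; billions of parameters map to GB.
    return params_billion * bits / 8


for size in (7, 13, 70):
    row = {p: weight_vram_gb(size, p) for p in BITS_PER_WEIGHT}
    print(f"{size}B: {row}")
```

This covers weights only; KV cache and runtime overhead come on top.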

Quantization reduces the precision of model weights (e.g., from FP16 to INT4), dramatically cutting VRAM usage. Accuracy loss at Q4 is generally minimal and acceptable for most enterprise use cases. Q8 offers a balanced middle ground between accuracy and VRAM efficiency.
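To make the idea concrete, here is a minimal sketch of symmetric 4-bit quantization for one small group of weights. It is a simplified illustration, not the scheme used by any particular runtime: production formats quantize in blocks with per-block scales and more elaborate rounding. The function names are hypothetical.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization of one group of weights to 4-bit ints."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to +/-7; fall back to 1.0 for an all-zero group.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate weights from the 4-bit codes and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.7, -0.2, 0.1, 0.0, 0.33]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a 4-bit code plus one shared scale per group, and the rounding error per weight is bounded by half the scale. That bound is why the accuracy loss stays small when weights within a group have similar magnitudes.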

Model Size | FP16 VRAM | Q8 VRAM | Q4 VRAM | Suitable GPU (Q4)
---------- | --------- | ------- | ------- | -----------------
7B | ~14 GB | ~7 GB | ~4–5 GB | RTX 3090 24 GB, RTX 4080 16 GB
13B | ~26 GB | ~13 GB | ~8 GB | RTX 3090 24 GB, RTX 4090 24 GB
30B | ~60 GB | ~30 GB | ~17 GB | RTX 4090 24 GB (tight), RTX 5090 32 GB
70B | ~140 GB | ~70 GB | ~35–40 GB | RTX 5090 32 GB (marginal), Dual RTX 5090 64 GB
120B (MoE) | ~240 GB+ | ~120 GB | ~65–70 GB | RTX PRO 6000 Blackwell 96 GB, Dual A100 80 GB
405B+ | ~800 GB+ | ~400 GB | ~200 GB+ | Multi-GPU server, A100/H100 cluster

The figures in the table represent theoretical minimums. Practical VRAM usage increases with context length and KV-cache size. If you plan to run with long context windows (32K+ tokens), add at least 20–30% to the values shown.
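The context-length overhead can be estimated directly, since the KV cache holds one key vector and one value vector per layer, per KV head, per token. The sketch below assumes a Llama 3.1 70B-style architecture (80 layers, 8 KV heads, head dimension 128) and an FP16 cache; these values are assumptions for illustration and should be replaced with the figures from your model's configuration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV-cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed architecture (Llama 3.1 70B-style): 80 layers, 8 KV heads, head dim 128.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_tokens=32_768)
print(f"{size / 2**30:.1f} GiB")  # 10.0 GiB for a single 32K-token sequence
```

Under these assumptions a single 32K-token sequence adds about 10 GiB on top of 35–40 GB of Q4 weights, which is in the same range as the 20–30% guidance. The cache grows linearly with both context length and the number of concurrent sequences.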

On the secondary market, RTX 3090 24 GB cards are available for approximately $650–750 and offer excellent cost-performance for 7B and 13B models. In enterprise environments, however, warranty, support, and reliability considerations generally favor new RTX 5090 or datacenter-class cards.

Local LLMs on a Workstation: Single User and Small Teams

An AI workstation is the ideal starting point for local LLM inference in single-developer or small-team scenarios. A single RTX 5090 with 32 GB VRAM comfortably runs models up to roughly 30B at Q4; a 70B Q4 model (35–40 GB) is marginal on one card, and adding a second card raises total VRAM to 64 GB.

Workstation-based local LLM setups are particularly popular for model development, fine-tuning experiments, and prototyping. A desktop or tower workstation can be operated in an office environment without datacenter infrastructure and offers a more manageable noise and cooling profile.

The NVIDIA RTX 5090, with 32 GB of GDDR7 VRAM, currently sits at the top of the consumer segment. A Q4-quantized 70B model such as Meta Llama 3.1 70B needs 35–40 GB, so on a single card it runs only at a tight margin, with some layers offloaded to system RAM or with a lower-bit quantization. With two RTX 5090 cards installed, total VRAM reaches 64 GB, allowing 70B models to run with comfortable margin and supporting longer context lengths.
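A quick way to sanity-check a hardware shortlist is to compare the model's weight footprint, plus a margin for context, against the VRAM of each configuration. The sketch below uses the capacities named in this article; the function name and the 25% default margin are illustrative assumptions, and multi-GPU entries assume the runtime can split the model across both cards.

```python
# VRAM capacities (GB) of configurations named in this article.
GPU_CONFIGS = {
    "RTX 3090": 24,
    "RTX 4090": 24,
    "RTX 5090": 32,
    "Dual RTX 5090": 64,
    "RTX PRO 6000 Blackwell": 96,
}


def configs_that_fit(model_gb: float, context_margin: float = 0.25) -> list[str]:
    """Return configurations whose VRAM covers the model plus a context margin."""
    required = model_gb * (1 + context_margin)
    return [name for name, vram in GPU_CONFIGS.items() if vram >= required]


print(configs_that_fit(17))  # 30B at Q4, ~17 GB of weights
print(configs_that_fit(65))  # 120B MoE at Q4, ~65 GB of weights
```

Lowering `context_margin` models short-context use; raising it models long context windows or several concurrent sessions.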

Selecting an AI workstation requires evaluating GPU count, PCIe bandwidth, and cooling requirements alongside model size targets. Other important factors include the motherboard's PCIe 5.0 slot and lane layout (the RTX 5090 has no NVLink connector, so multi-GPU traffic travels over PCIe), adequate power supply capacity, and ECC memory options.

  • Advantages: Lower upfront cost, easy setup, operates in office environment, manageable noise
  • Advantages: Sufficient for a single developer or small team; Ollama gets you running in minutes
  • Limitations: Limited concurrent users (typically 1–4 active sessions)
  • Limitations: Configurations beyond dual GPU are difficult in tower chassis; 4+ cards require server chassis
  • Limitations: Business continuity may require UPS and redundant power planning

Scalable LLM Serving with a GPU Server

A GPU server is the right choice for enterprise LLM deployments that must serve multiple concurrent users with high availability and scalability. Combined with vLLM, throughput is substantially higher than a workstation setup.

In an enterprise environment, an LLM service may need to respond to dozens or hundreds of concurrent users. In this scenario, workstation infrastructure quickly becomes a bottleneck; GPU servers and high-throughput inference frameworks take over. GPU server and AI infrastructure selection must weigh throughput, latency, and memory bandwidth together.

The RTX PRO 6000 Blackwell, with 96 GB of GDDR7 ECC VRAM, is a professional GPU designed for enterprise workloads. It can run 120B-parameter Mixture-of-Experts (MoE) models on a single card at Q4 quantization. In the datacenter class, A100 80 GB and H100 80 GB cards connected via NVLink can handle 405B+ parameter models.

Processor architecture is also a deciding factor in server platform selection. AMD EPYC and Intel Xeon Scalable processors, with their multi-channel memory architectures and high PCIe lane counts, have become standard on GPU server platforms. Our dedicated guide to server processor selection, which compares Xeon, EPYC, and Threadripper Pro, covers this in depth.

Scenario | Recommended Hardware | Est. Concurrent Users | Suitable Model Size
-------- | -------------------- | --------------------- | -------------------
Single developer / prototype | RTX 5090 32 GB (single card) | 1–2 | 7B–70B Q4
Small team (5–15 users) | Dual RTX 5090 64 GB | 3–8 | 70B Q4 or 30B FP16
Department (15–50 users) | RTX PRO 6000 Blackwell 96 GB | 10–20 | 120B MoE Q4
Enterprise (50+ users) | Multi-GPU server (A100/H100 80 GB x4+) | 50+ | 405B+ or multi-model
Hybrid (critical + general) | On-premise + cloud burst | Flexible | All sizes

The Software Stack: Ollama, vLLM, and LM Studio

Ollama is the easiest entry point — it downloads and runs a model with a single command. vLLM delivers high throughput for concurrent production traffic. LM Studio is a GUI-based desktop application for those who prefer a graphical interface.

The local LLM ecosystem has matured remarkably over the past two years. Today there are purpose-built tools for ML engineers working at the CLI, product managers who prefer GUI applications, and DevOps engineers deploying high-throughput services on Kubernetes.

Tool | Target User | Setup Difficulty | Throughput | API Compatibility | Best Scenario
---- | ----------- | ---------------- | ---------- | ----------------- | -------------
Ollama | Developer, DevOps | Very Easy (single command) | Moderate | OpenAI-compatible REST | Prototyping, individual use, quick testing
vLLM | ML Engineer, DevOps | Moderate (Python env) | Very High | OpenAI-compatible REST | Production service, high concurrent requests
LM Studio | Developer, analyst | Very Easy (GUI) | Low–Moderate | Limited local API | Desktop use, model exploration
llama.cpp | Advanced developer | Moderate–Hard | Moderate (CPU-capable) | Basic API | Low-power devices, CPU inference
text-generation-webui | Researcher | Moderate | Moderate | Wide plugin support | Model comparison, fine-tuning experiments

Ollama's greatest advantage is near-zero-configuration startup: the command `ollama run llama3.1:70b` downloads the model if it is not already present, uses the GPU when one is detected, and serves the model through a local REST API. The default tag resolves to a 4-bit quantized build; other quantization levels such as Q8 can be requested with an explicit tag.
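Once the model is running, applications can call the local REST API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port 11434 with the `llama3.1:70b` model already pulled; the helper names are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def build_generate_request(prompt: str, model: str = "llama3.1:70b",
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:70b") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Summarize KVKK in one sentence.")` would return the model's answer as a string. Setting `"stream": False` trades first-token latency for a single JSON response, which keeps the example short.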

vLLM uses the PagedAttention algorithm, which stores the KV cache in fixed-size blocks allocated on demand instead of reserving one contiguous region per request, so far less GPU memory is wasted. Under high concurrent request loads, it delivers meaningfully higher tokens per second than Ollama. For production environments targeting 10+ concurrent users, vLLM is the preferred choice. It can be deployed via Docker, a Python virtual environment, or a Kubernetes Helm chart.
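The memory benefit of paging can be illustrated with a toy calculation. This is not vLLM code; it is a sketch that compares how many token slots are reserved when every request preallocates room for the maximum context versus when the cache grows in fixed-size blocks. The block size of 16 tokens and the request lengths are assumed values for illustration.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for illustration)


def slots_preallocated(seq_lens: list[int], max_len: int) -> int:
    """Contiguous scheme: every sequence reserves room for max_len tokens."""
    return len(seq_lens) * max_len


def slots_paged(seq_lens: list[int], block_size: int = BLOCK_SIZE) -> int:
    """Paged scheme: blocks are allocated on demand, so only the last block
    of each sequence can be partly empty."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


active = [120, 900, 35, 2048]  # tokens currently held by each request
used = sum(active)
print(used / slots_preallocated(active, max_len=4096))  # utilization, contiguous
print(used / slots_paged(active))                        # utilization, paged
```

Higher utilization means more requests fit in the same VRAM, which is where the throughput gain under concurrent load comes from.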

Data Privacy and On-Premise: Local LLMs Under KVKK

KVKK restricts the transfer of personal data to servers abroad. On-premise LLM deployment avoids the issue because no transfer takes place; cloud-based API calls require additional contractual and technical safeguards.

Turkey's Personal Data Protection Law No. 6698 (KVKK) requires explicit consent or adequate protection guarantees for transferring personal data outside the country. When patient records, financial data, or employee information are sent as prompts to an LLM API, those data are technically transmitted to the API provider's infrastructure — creating significant legal exposure for healthcare, finance, and public sector organizations.

On-premise LLM deployment addresses this at the root: data never physically leaves the organization's infrastructure, log records remain under organizational control, and audit trails are in the organization's hands. Our dedicated guide comparing on-premise AI servers with cloud GPUs covers the cost and compliance dimensions in detail.

Network segmentation is equally critical from a technical isolation perspective. Running the LLM service in an isolated network segment without internet egress minimizes data leakage risk. Hosting model weights in a secure internal artifact registry and enforcing model version control further strengthens alignment with corporate security policies.
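Network-level isolation can be complemented by a guard in client code that refuses to send prompts to any endpoint outside the internal network. The sketch below is one hypothetical way to do this with the standard library; the function name is illustrative, and it deliberately accepts only IP literals and `localhost`, leaving hostname resolution policy to the caller.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal_endpoint(url: str) -> bool:
    """True only if the URL's host is a loopback or private IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are rejected here; resolving them is left to the caller.
        return False
    return addr.is_loopback or addr.is_private


print(is_internal_endpoint("http://10.0.12.5:8000/v1"))
print(is_internal_endpoint("https://api.example.com/v1"))
```

A check like this is a defense-in-depth measure against misconfiguration. It does not replace firewall rules that block internet egress from the LLM segment.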

  • KVKK Article 9: Cross-border transfer requires explicit consent or adequate protection — on-premise eliminates this risk
  • Healthcare sector: Sending patient data in prompts to foreign APIs risks non-compliance with Ministry of Health regulations
  • Financial sector: BDDK and CMB regulations require data sovereignty for customer financial data
  • Public institutions: Cybersecurity legislation mandates domestic processing of sensitive data
  • ISO 27001 compliance: On-premise LLM more easily satisfies access control and audit trail requirements

Workstation or Server? The Decision Matrix

A workstation is the right choice for a single developer or small team; a GPU server is the right choice for enterprise multi-user deployments. The decision turns on concurrent user count, model size, budget, and manageability requirements.

Both platforms can run local LLMs, but they differ in scale, management complexity, and cost profile. The decision matrix below helps match your organization's requirements to the available options. Our foundational guide, "What is a workstation vs. a server?", covers platform differences from a broader perspective.

Criterion | Workstation (RTX 5090) | GPU Server (Multi-GPU)
--------- | ---------------------- | ----------------------
Upfront cost | Moderate ($5,000–15,000) | High ($20,000–100,000+)
Concurrent users | 1–8 (with vLLM) | 10–100+ (with vLLM)
Max VRAM (single chassis) | 64 GB (dual RTX 5090) | 96–640 GB+ (RTX PRO 6000 / H100)
Scalability | Limited (2–4 GPUs) | High (8+ GPUs, cluster support)
Management complexity | Low | Moderate–High (Kubernetes, Slurm)
High availability | No (single point) | Yes (redundant configuration)
Noise and cooling | Manageable (office) | Requires datacenter
Payback vs. cloud | 5–7 months | 8–18 months (scale-dependent)
KVKK compliance | Yes (data stays local) | Yes (data stays local)
Best scenario | Prototype, small team | Enterprise service, multi-user
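The hard limits in the matrix can be turned into a first-pass rule of thumb. The sketch below encodes only three of the criteria (concurrent users, VRAM per chassis, high availability) with the thresholds taken from the table; the function name is hypothetical, and budget, noise, and management capacity still need human judgment.

```python
def recommend_platform(concurrent_users: int, required_vram_gb: float,
                       needs_high_availability: bool = False) -> str:
    """Apply the matrix limits: a dual RTX 5090 workstation tops out at
    64 GB of VRAM and roughly 8 concurrent users, with no redundancy."""
    if needs_high_availability:
        return "GPU server"
    if concurrent_users > 8 or required_vram_gb > 64:
        return "GPU server"
    return "Workstation"


print(recommend_platform(concurrent_users=3, required_vram_gb=40))
print(recommend_platform(concurrent_users=25, required_vram_gb=40))
```

Any single criterion that exceeds workstation limits is enough to tip the decision toward a server.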

From a cost perspective, comparing a local RTX 5090-based system with equivalent cloud GPU capacity (e.g., hourly A10G or A100 rental) shows that local infrastructure amortizes in approximately 5–7 months under heavy, sustained use. The more hours per month the hardware is busy, the shorter this period becomes; at low utilization it stretches considerably.
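The payback period is the hardware cost divided by the monthly cloud spend it replaces, net of local running costs. The sketch below shows the arithmetic; all input figures are hypothetical placeholders rather than quoted prices, and should be replaced with your own hardware quote, cloud rate, and electricity cost.

```python
def payback_months(hardware_cost: float, cloud_rate_per_hour: float,
                   gpu_hours_per_month: float,
                   local_opex_per_month: float = 0.0) -> float:
    """Months until avoided cloud spend covers the hardware purchase."""
    monthly_saving = cloud_rate_per_hour * gpu_hours_per_month - local_opex_per_month
    if monthly_saving <= 0:
        return float("inf")  # local hardware never pays back at this utilization
    return hardware_cost / monthly_saving


# Hypothetical inputs: $8,000 workstation, $2.00/h cloud GPU, $100/month power.
print(round(payback_months(8000, 2.0, 720, 100), 1))  # round-the-clock use -> 6.0
print(round(payback_months(8000, 2.0, 100, 100), 1))  # light use -> 80.0
```

The result is dominated by utilization: the same hardware that pays back in about half a year under continuous load takes years at a few hours per day.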

Hybrid approaches are also growing in popularity: baseline workloads run on on-premise workstations or GPU servers, while cloud GPU capacity is used for burst traffic. This model offers a balanced solution for both cost optimization and KVKK compliance.

Frequently Asked Questions

Which GPU is best for running local LLMs?

It depends on your needs. For an individual developer, the RTX 5090 (32 GB) or the budget-friendly RTX 3090 (24 GB) are ideal. For enterprise multi-user serving, the RTX PRO 6000 Blackwell (96 GB) or datacenter-class A100/H100 are recommended.

How much VRAM does a 70B model require?

A 70B model needs approximately 140 GB of VRAM at FP16 precision; Q4 quantization reduces this to 35–40 GB. That is slightly more than a single RTX 5090 (32 GB) holds, so one card runs it only with partial offloading or a lower-bit quantization; dual RTX 5090 (64 GB) provides a comfortable buffer for 70B Q4.

Should I use Ollama or vLLM?

Use Ollama for a quick start and individual use: single command, zero configuration. If your production environment targets 10+ concurrent requests, vLLM's PagedAttention mechanism delivers far higher throughput. The two are not mutually exclusive: prototype with Ollama, serve in production with vLLM.

Is local LLM inference cheaper than cloud LLM APIs?

Under heavy use, local infrastructure is generally more economical; an RTX 5090-based setup amortizes against equivalent cloud capacity in roughly 5–7 months. At low or variable utilization, cloud may be more cost-effective.

How secure is on-premise LLM from a data privacy standpoint?

With on-premise LLM, data never physically leaves your network — there is no cross-border transfer risk under KVKK Article 9. Combined with network isolation and access control policies, this provides the highest level of data sovereignty.

How much do CPU and RAM affect LLM performance?

When VRAM is exhausted and offloading occurs, CPU speed becomes critical. In normal GPU inference, the CPU handles tokenization and preprocessing; a modern multi-core CPU (EPYC, Xeon) with 64 GB+ system RAM is recommended. The primary bottleneck remains VRAM.

What is quantization and does it degrade the model?

Quantization reduces the precision of model weights (FP16 → INT8 → INT4), lowering VRAM requirements. At Q4, accuracy loss is negligible for most enterprise tasks, making Q4 or Q8 the standard choice for local deployment.

Conclusion

The decision to run local LLMs rests on VRAM capacity, concurrent user count, data privacy obligations, and long-term cost balance. For a single developer or small team, an RTX 5090 workstation combined with Ollama can be operational within days; enterprise multi-user services require GPU server infrastructure and vLLM. In sectors regulated by KVKK, on-premise deployment provides not just a cost advantage but legal compliance assurance.

If you want to plan your organization's local LLM infrastructure, select the right GPU and software stack, or compare your existing cloud spend against an on-premise investment, our Sora local LLM team is available for a complimentary discovery call.
