AI Chatbot

On-Premise LLM Deployment: Hardware, Models, and TCO for Enterprise (2026)

From "why go local" to "what do I actually buy"

If you’ve read why enterprises across Southeast Asia and Japan are moving LLMs inside the firewall, you already know the drivers: data sovereignty law, contractual restrictions on where customer data can travel, and the simple fact that some documents should never leave the building. This post picks up where that one left off — the practical questions a CTO or infrastructure lead actually needs answered before signing a purchase order.

If you’re an individual developer sizing a personal workstation, our hardware sizing guide covers single-user setups in depth. This post is about the different conversation that guide points toward: multi-user, production-grade, organizational deployment — where uptime, concurrency, integration, and compliance all matter alongside raw model quality.


Table of Contents

  1. Four Deployment Tiers for Enterprise
  2. Which Open-Weight Model Fits Your Use Case
  3. Choosing a Serving Stack
  4. The TCO Question: When Local Beats API
  5. The Compliance Overlay
  6. A Decision Framework
  7. FAQ

Four Deployment Tiers for Enterprise

Unlike a personal workstation, an enterprise deployment has to answer for concurrent users, uptime expectations, and a support path when something breaks at 2am. Four tiers cover almost every organization we’ve worked with.

Tier 1 — Pilot / Proof of Concept

A single workstation-class GPU with 24–48 GB of VRAM (current-generation consumer or entry professional cards) runs dense models up to roughly 32B parameters at 4-bit quantization, or mixture-of-experts models with a much larger total parameter count but a small active-parameter footprint. This tier is for proving the use case with one or a handful of users before committing real budget — typically a $4,000–$15,000 build.

Good for: validating a RAG pipeline, testing a coding assistant with a small team, building the business case.
Not good for: more than a handful of concurrent users, or anything customer-facing.

Tier 2 — Department-Scale Server

A single professional-tier GPU with 80–96 GB of VRAM (the current top-end workstation and data-center cards in this class) comfortably serves a 70B-class dense model at usable quantization, or larger mixture-of-experts models, for a department of concurrent users. Typical all-in build cost lands in the $15,000–$30,000 range depending on chassis, CPU platform, and RAM.

Good for: a single department’s production workload — legal document review, an internal coding assistant, a customer-support RAG system for one business unit.

Tier 3 — Multi-User Production

Multiple GPUs in a single server (commonly 4–8 cards, either professional workstation-class or data-center class with high-bandwidth interconnects) support dozens of concurrent users, larger models at higher precision, or several models served simultaneously. Budgets here typically run from $60,000 well into six figures, and RAM pricing has become a meaningful line item in 2026 — DRAM contract prices have risen sharply this year as manufacturers reallocate capacity toward AI accelerator memory, so a server that "should" cost X on paper can cost noticeably more by the time it ships.

Good for: organization-wide deployment of a single primary model, or serving several teams from shared infrastructure.

Tier 4 — Frontier-Scale MoE

The largest open-weight mixture-of-experts models — the kind that approach frontier proprietary model quality — need multiple high-memory data-center GPUs with fast interconnects, typically eight or more cards in a single system. This is a $400,000+ commitment and realistically only makes sense for organizations with either very high query volume or a specific requirement for frontier-class reasoning that smaller models can’t meet.

Good for: large enterprises replacing substantial existing API spend, or use cases where model quality genuinely can’t be compromised.

Tier Typical Hardware Model Class Concurrent Users Approx. Budget
1 — Pilot 1× 24–48GB GPU Up to 32B dense 1–5 $4K–$15K
2 — Department 1× 80–96GB GPU ~70B dense / mid MoE 10–30 $15K–$30K
3 — Production 4–8× GPU cluster Larger MoE, multi-model 50–200+ $60K–$250K+
4 — Frontier 8+× data-center GPU Frontier-class MoE Enterprise-wide $400K+

A refurbished-hardware path is worth investigating at Tiers 2 and 3 — enterprise-grade GPUs and servers hold up well for years beyond their first deployment, and refurbished units carry the same silicon at meaningfully lower cost, which matters more than usual given current memory pricing.

For teams without GPU budget at all, it’s worth knowing that modern server CPUs with matrix-acceleration extensions can now run mixture-of-experts models entirely without a GPU — usable for interactive single-stream chat, and scaling reasonably well under light concurrent load. It won’t compete with a GPU tier on throughput, but it turns on-premise LLM serving into a software decision rather than a hardware procurement project, which is a meaningful option for teams with power, cooling, or budget constraints.


Which Open-Weight Model Fits Your Use Case

The open-weight model ecosystem in 2026 is genuinely competitive with hosted frontier APIs for most enterprise tasks. The decision isn’t really "open-weight vs. proprietary" anymore — it’s which open-weight family fits your hardware tier, language requirements, and task.

For general-purpose enterprise chat and document work: mid-size dense or mixture-of-experts models in the 30B–70B class (families like Qwen, Gemma, and Mistral’s mid-tier releases) give strong multilingual performance — genuinely important for Thai, Japanese, and Chinese-language document work — at hardware costs that fit Tier 1–2 budgets.

For coding and agentic workflows: the strongest open-weight coding models currently sit in the mixture-of-experts category with a large total parameter count but a much smaller active-parameter footprint per forward pass, which keeps inference fast despite the model’s overall size. These generally need Tier 2 hardware or above to run well.

For maximum reasoning quality on-premise: the largest open-weight mixture-of-experts models can approach frontier proprietary quality but need Tier 3–4 hardware to serve at production speed. Most enterprise use cases don’t need this tier — it’s worth benchmarking a mid-size model against your actual workload before assuming you need it.

Licensing matters as much as capability. Always confirm the specific license on a model card before commercial deployment — some open-weight releases carry usage restrictions that aren’t obvious from the name alone. This is a five-minute check that avoids a much longer conversation with legal later.

A practical rule that holds up across most deployments: start with the smallest model that reliably solves your task, not the largest one you can technically afford to run. Bigger models cost more in hardware, latency, and power for gains that often don’t show up in your actual use case.


Choosing a Serving Stack

The serving software matters almost as much as the hardware for real-world throughput. Three options cover most enterprise deployments:

vLLM — the default choice for teams that want maximum flexibility, zero licensing cost, and the fastest access to newly released open models. The tradeoff is that your team owns integration, hardening, and support.

Vendor-packaged inference containers (such as NVIDIA’s NIM) — a turnkey container with a vendor SLA, proactive security patching, and validated performance profiles, at the cost of a per-GPU licensing fee (commonly quoted around $4,500/GPU/year, though enterprise volume and term discounts apply) and tighter coupling to specific model versions.

SGLang / TensorRT-LLM — worth evaluating for teams with very specific latency or throughput requirements that the more general-purpose stacks don’t hit out of the box.

For a Tier 1 pilot, tools like Ollama or LM Studio in their server modes are a reasonable starting point — both now support production-style deployment patterns including continuous batching and REST APIs, which wasn’t true a couple of years ago. Moving to vLLM or a vendor container is the natural next step once the pilot proves out and concurrency needs grow.


The TCO Question: When Local Beats API

This is the question every budget owner actually wants answered, and the honest answer is: it depends heavily on volume, and compliance requirements change the math entirely.

The Basic Formula

For API-based deployment, monthly cost scales directly with usage:

Monthly API cost = (requests/day × avg. input tokens × input price/1M
                    + requests/day × avg. output tokens × output price/1M) × 30

Frontier proprietary API pricing in 2026 commonly runs in the $2–$5 per million input tokens and $10–$25 per million output tokens range at the flagship tier, with budget-tier and open-weight-hosted options available well below $1 per million tokens for less demanding tasks. Output tokens typically cost three to ten times more than input tokens, so applications with long generated responses are far more sensitive to model choice than applications with short outputs and long input context.

For on-premise deployment, the cost structure inverts: a large upfront hardware investment, followed by relatively flat ongoing costs (electricity, maintenance, and any support contracts) that don’t scale with query volume in the same way.

A Worked Example

Take a mid-size deployment: 500,000 requests per month, averaging 1,000 input tokens and 400 output tokens per request. At typical flagship API pricing, that lands in the low-to-mid five figures per month — which compounds to a meaningful six-figure annual number quickly. The same workload on a Tier 2 on-premise deployment (a $20,000–$25,000 hardware investment, plus electricity and a part-time engineering allocation) breaks even against API spend well within the first year, and every month after that is close to pure savings.

Independent analysis of on-premise LLM deployment economics generally finds moderate-usage organizations breaking even against equivalent cloud API costs somewhere in the 6–12 month range — consistent with what we see in client deployments. Below that usage threshold, API access is usually still the better financial choice; the crossover point is a real number worth calculating for your specific workload before committing to hardware.

What the Formula Leaves Out

Three factors change the calculation beyond raw token math:

  • Compliance requirements can make the "cheaper" option a non-option. If PDPA, APPI, PIPL, or 等保2.0 obligations require the data never leave your infrastructure, the TCO conversation isn’t really about cost anymore — it’s about which on-premise tier fits your budget, not whether to go on-premise at all.
  • Engineering time isn’t free. Self-hosting requires someone to own model updates, monitoring, and incident response. Budget for this explicitly rather than treating it as background overhead.
  • Retry and quality-control costs apply to both paths. A cheaper model that needs human review on a meaningful share of outputs can end up costing more per completed task than a more expensive model with a lower review rate — the same logic applies whether you’re comparing API tiers or comparing a self-hosted model against a hosted one.

The Compliance Overlay

The regulatory landscape across our core markets continues to push toward data locality, and it shapes deployment tier decisions as much as budget does:

  • Thailand (PDPA): cross-border data transfer restrictions apply directly to any workflow that sends documents containing personal data to an offshore API.
  • Japan (APPI): the ongoing reform cycle has tightened processor supervision obligations, and 経済安全保障推進法 adds specific considerations for critical infrastructure operators.
  • China (等保2.0 / PIPL / 数据安全法): 数据不出境 — data not leaving the country — is the governing principle for many enterprise deployments, effectively mandating on-premise or in-country hosting for a wide range of use cases.

Where these apply, the deployment tier decision becomes primarily a budget and scale question rather than a build-vs-buy question — because "buy" (an offshore API) may not be a compliant option at all.


A Decision Framework

flowchart TD
A["Regulatory requirement that data cannot leave your infrastructure"] -->|Yes| B["On premise is mandatory. Pick tier by budget and concurrency"]
A -->|No| C["Estimate monthly query volume"]
C --> D["High volume sustained usage"]
D -->|Yes| E["Model TCO breakeven. Often favors on premise within a year"]
D -->|No| F["API access is usually the better financial choice"]
B --> G["Match tier to concurrent users. Tier 1 pilot through Tier 4 frontier"]
E --> G
G --> H["Choose model by task. General chat, coding, or max reasoning"]
H --> I["Choose serving stack. vLLM, vendor container, or lightweight tools"]

FAQ

Do we need a GPU to run a local LLM at all?
Not necessarily. Modern server CPUs with matrix-acceleration extensions can serve mixture-of-experts models entirely without a GPU, at speeds usable for interactive chat. It’s the right call for teams with existing CPU fleets, or constrained power and cooling — though a GPU tier will outperform it on throughput once concurrency grows.

How much does quantization actually hurt quality?
For most enterprise tasks, 4-bit quantization (commonly labeled Q4) is the practical default and the quality loss is small enough not to matter for chat, summarization, and RAG. Higher precision (8-bit or above) is worth the extra memory mainly for coding and complex reasoning tasks where small errors compound.

Can we mix on-premise and API access?
Yes, and many of our clients do — routing routine queries to a self-hosted model and only sending genuinely hard queries to a proprietary API for the cases where quality gap matters most. This hybrid approach can meaningfully reduce cost without giving up quality on the tasks that need it, though anything involving regulated data still needs to route through the on-premise path only.

How long does a Tier 2 deployment typically take from decision to production?
For a department-scale deployment with a defined use case, four to eight weeks from hardware order to production is realistic, assuming the use case and integration points are already clear. Longer timelines usually come from unclear requirements, not from the technical build itself.

What’s the fastest way to know which tier we actually need?
Run the Enterprise Local LLM Readiness Assessment — a 25-question self-scoring tool that maps your data sensitivity, volume, and compliance requirements to a recommended starting tier.


Where to Go From Here

Hardware and model selection are solvable with the right framework — the harder part is usually mapping your specific compliance obligations, existing systems, and team capacity onto the right tier. If you want a second opinion before committing budget, the vendor evaluation guide covers what to look for in a deployment partner, and we’re happy to talk through your specific requirements directly.

Get in touch: hello@simplico.net