vLLM

L4 — Intelligent Retrieval · LLM Inference · Free (OSS) · Apache-2.0

Open-source, high-throughput LLM inference engine built around PagedAttention. Apache-2.0 licensed. Originated at UC Berkeley; the standard backend for self-hosted LLM serving. Supports continuous batching, tensor parallelism, and quantization (AWQ, GPTQ, FP8).

AI Analysis

vLLM is the dominant OSS LLM inference engine for production deployments — Apache-2.0 licensed, originated at UC Berkeley, and the de facto serving layer for self-hosted Llama, Mistral, Qwen, DeepSeek, and dozens of other open-weight models. Its PagedAttention algorithm delivers 3-24x throughput vs naive implementations, and continuous batching keeps GPU utilization high under variable request load. Pick vLLM when you control the inference infrastructure, need OpenAI-compatible APIs, and want the strongest production-tested OSS posture for serving open-weight models at scale.

Trust Before Intelligence

vLLM is a serving runtime, not a managed service. From a Trust Before Intelligence lens, that means trust comes from your deployment posture — vLLM itself doesn't sign BAAs, doesn't hold SOC 2, and offers no native authorization model. Authentication, audit logging, and per-tenant policy enforcement must come from L5 governance (LiteLLM proxy with virtual keys, Cerbos, OpenFGA, Envoy/Kong API gateway in front). The trust signals vLLM does provide are operational: Prometheus metrics on every dimension that matters (queue depth, KV cache utilization, time-to-first-token, throughput), OpenAI-compatible API surface that's vendor-neutral, and Apache-2.0 license that won't relicense out from under you.

INPACT Score

23/36
I — Instant
5/6

Continuous batching + PagedAttention deliver 3-24x throughput vs naive implementations per the original Berkeley paper. P95 time-to-first-token under 1s for typical Llama-class models on appropriate GPUs; per-token latency 20-50ms after first token. Cap rule N/A.

N — Natural
4/6

OpenAI-compatible chat completions API, plus native HF model loading by name. Tools/function-calling support across most popular models. Rich enough that drop-in replacement of OpenAI clients works for many workloads. Cap rule N/A.
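
The drop-in pattern, as a minimal sketch: it assumes a vLLM OpenAI-compatible server is already running at localhost:8000 and serving the model named below (both are deployment-specific assumptions).

```python
# Minimal sketch: calling a vLLM server through the official OpenAI Python client.
# Assumes vLLM's OpenAI-compatible server is reachable at localhost:8000 and serving
# the model named below; adjust base_url, api_key, and model to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless --api-key is set
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```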

P — Permitted
2/6

No native auth model — server is open by default. Deployment-driven authentication via reverse proxy. Cap rule applied: P-low for inference servers without engine-level authorization. ABAC + RBAC must come from L5.

A — Adaptive
5/6

Runs on any NVIDIA GPU (Volta+ for FP16, Hopper for FP8), AMD MI-series via ROCm, Intel Gaudi via plugins, AWS Inferentia via Neuron. Multi-cloud (AWS, GCP, Azure, on-prem, K8s-native). Strongest A in the LLM Inference category.

C — Contextual
3/6

Per-request token counts, model metadata, generation parameters captured. No native lineage — request-to-output traceability requires app-layer instrumentation. Cap rule applied: no native lineage caps at 3.

T — Transparent
4/6

Prometheus metrics out of the box (request rate, queue depth, KV cache hit rate, time-to-first-token p50/p95/p99, decode throughput). Request logs with full input/output. Per-request cost computable from token counts. Cap rule N/A.
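
A sketch of per-request cost attribution from the usage block in the OpenAI-compatible response; the per-token rates below are placeholders to be replaced with your own amortized GPU cost.

```python
# Sketch: deriving per-request cost from the token counts vLLM returns in the
# OpenAI-compatible `usage` field. Rates are placeholders; substitute the amortized
# $/token you compute from your own GPU, power, and utilization figures.
PROMPT_RATE = 0.10 / 1_000_000      # assumed $ per prompt token
COMPLETION_RATE = 0.30 / 1_000_000  # assumed $ per completion token

def request_cost(usage) -> float:
    """usage is response.usage from a chat.completions call against vLLM."""
    return (usage.prompt_tokens * PROMPT_RATE
            + usage.completion_tokens * COMPLETION_RATE)

# Example (continuing from a `response` obtained via the OpenAI client):
# print(f"cost: ${request_cost(response.usage):.6f}")
```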

GOALS Score

17/30
G — Governance
2/6

G1=N (no engine-level ABAC), G2=N (request logs are local; not formal access audit), G3=N (inference engine, not workflow), G4=N (no model versioning surface beyond what you bake into deployment), G5=N (LLM threat modeling lives at L5), G6=N (no compliance mapping). 0/6 -> 2.

O — Observability
4/6

O1=Y (Prometheus exporter is comprehensive), O2=N (no native distributed tracing — comes from app), O3=Y (per-request token counts enable cost attribution), O4=Y (queue depth + latency alarms catch incidents fast), O5=N, O6=N. 3/6 -> 4 lenient (observability is among vLLM's strongest points).
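
A quick way to see what the exporter emits is to scrape the /metrics endpoint directly; this sketch assumes the server listens on localhost:8000, and exact metric names vary by vLLM version, so treat the filter as illustrative.

```python
# Sketch: scraping vLLM's Prometheus endpoint by hand (normally Prometheus does this).
# The /metrics path is standard; check your own output for the exact metric names
# exposed by your vLLM version before building dashboards or alerts on them.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:  # assumed host/port
    for line in resp.read().decode().splitlines():
        if line.startswith("vllm:"):  # vLLM-specific series (queue depth, TTFT, cache usage, ...)
            print(line)
```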

A — Availability
4/6

A1=Y (sub-second TTFT), A2=Y (request streaming for incremental output), A3=Y (KV cache reuse via PagedAttention), A4=Y (multi-replica deployment behind a load balancer), A5=Y (production deployments at large scale documented), A6=Y (continuous batching is parallel by design). 6/6 -> 4 (capped to 4 because single-instance has no native HA — needs orchestrator-layer redundancy).

L — Lexicon
3/6

L1=N, L2=N, L3=N, L4=N, L5=Y (model name + version + tokenizer registry as terminology resource), L6=N. 1/6 -> 3 lenient (model registry richness lifts L; HF Hub integration adds breadth).

S — Solid
4/6

S1=Y (deterministic at fixed sampling params), S2=Y (typed completion fields), S3=N (no cross-replica consistency check; output may drift across instances at non-zero temperature), S4=Y (typed request/response), S5=N (no built-in content quality validation; that's L5 guardrails), S6=Y (Prometheus metrics flag throughput/latency anomalies). 5/6 -> 4.
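
A sketch of pinning sampling parameters for repeatable output with the offline vllm.LLM API (model name assumed); as the S3 note says, cross-replica consistency still needs separate verification.

```python
# Sketch: pinning sampling parameters for repeatable generations with vLLM's offline
# Python API. Greedy decoding (temperature=0) plus a fixed seed makes a single instance
# repeatable; drift across replicas or hardware (S3 above) still needs its own check.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model name
params = SamplingParams(temperature=0.0, max_tokens=64, seed=1234)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```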

AI-Identified Strengths

  • + PagedAttention + continuous batching: 3-24x throughput vs naive implementations. The reason most production self-hosted LLM stacks run vLLM.
  • + OpenAI-compatible API surface — drop-in replacement for OpenAI clients across Python, JS, Go ecosystems
  • + Apache-2.0 license with no relicensing risk; Berkeley + community-backed development trajectory
  • + Broad accelerator support: NVIDIA (Volta+), AMD (ROCm), Intel Gaudi, AWS Inferentia/Trainium, TPUs in beta
  • + Quantization breadth: AWQ, GPTQ, FP8, AWQ-Marlin, GGUF — pick the right speed/quality tradeoff per workload
  • + Prometheus metrics out of the box covering every operational dimension that matters (queue, KV cache, TTFT, decode rate)
  • + Active community, frequent releases, fast-tracked support for new model architectures (Llama 4, DeepSeek-V3, Qwen3, etc.)

AI-Identified Limitations

  • - No native authorization, audit logging, or multi-tenancy. Authentication is deployment-driven (reverse proxy + API key). ABAC must come from L5.
  • - Single-instance has no HA. Multi-replica deployment requires orchestrator-layer redundancy (Kubernetes, KServe, BentoML).
  • - Memory tuning is non-trivial. KV cache sizing, GPU memory utilization, and batch size interact in ways that affect both latency and throughput. Production tuning requires expertise.
  • - GPU-only for production performance. CPU inference works for development but is rarely viable for production at meaningful throughput.
  • - Quantization choices have correctness implications. AWQ/GPTQ at 4-bit can degrade specific tasks (math, code) more than benchmarks suggest; validate with workload-specific evals.
  • - No native tool-call validation. Function-calling output may not conform to schema; pair with Outlines/Instructor or downstream validation.
  • - Compliance attestations come from your deployment substrate, not from vLLM.

Industry Fit

Best suited for

  • Self-hosted LLM serving for cost, sovereignty, or model-choice reasons (open-weight Llama/Mistral/Qwen/DeepSeek)
  • RAG production stacks where embedding inference + LLM completion both run on owned infra
  • Multi-tenant SaaS with per-tenant model isolation needs
  • Data-sensitive workloads (healthcare, financial, defense) requiring on-prem or VPC-isolated inference
  • Cost-sensitive workloads at scale where OpenAI/Anthropic per-token pricing dominates the bill
  • Multi-cloud / hybrid stacks needing inference portability

Compliance certifications

vLLM the project holds no compliance certifications. Compliance comes from the substrate: AWS GovCloud, Azure Gov, or GCP Assured Workloads for FedRAMP; HIPAA via VPC-isolated deployment in BAA-signing infrastructure; SOC 2 via your own org's audit. The vLLM API server has no native auth — proxies (LiteLLM, Envoy, Kong) provide the authentication boundary.

Use with caution for

  • Teams without GPU operational expertise — model serving is non-trivial; managed services (HF Inference Endpoints, OpenAI, Anthropic) may be more cost-effective once labor is included
  • Workloads needing frontier-model capability (GPT-4 / Claude Opus 4 class) — open-weight models lag the closed frontier on hardest reasoning benchmarks
  • Single-instance deployments without HA — needs orchestrator-layer redundancy
  • Compliance-attested workloads — vLLM the project holds no certs; rely on substrate or managed alternative

AI-Suggested Alternatives

Text Generation Inference (TGI)

Choose TGI for tighter Hugging Face Hub integration and HF-native deployment patterns. vLLM wins on raw throughput in most benchmarks; TGI wins on HF ecosystem ergonomics. Both are Apache-2.0, both production-grade. TGI fits better when HF Inference Endpoints are part of the broader strategy.

SGLang

Choose SGLang for agent-heavy and structured-output workloads — RadixAttention prefix-cache reuse outperforms vLLM on multi-turn agent traces. vLLM wins on operational maturity and breadth of model support; SGLang wins on agent-specific throughput. Newer; smaller community.

Ollama

Choose Ollama for developer laptops and on-device inference — wraps llama.cpp for trivial setup. vLLM wins on production throughput at scale; Ollama wins on developer ergonomics and CPU/Apple Silicon support. Different tools for different layers of the stack.

OpenAI (GPT-4)

Choose OpenAI's API when you don't want to operate inference infrastructure, need GPT-4 class capability, and accept the SaaS posture. vLLM wins on cost-at-scale, data sovereignty, and model choice (open-weight only); OpenAI wins on operational simplicity and frontier-model access.


Integration in 7-Layer Architecture

Role: L4 LLM Inference engine. Serves open-weight models via OpenAI-compatible API. The runtime that sits behind L7 agent frameworks and in front of L1 vector stores in RAG architectures.

Upstream: Receives requests from L7 agent runtimes (LangGraph, CrewAI, AutoGen) and L4 RAG frameworks (LangChain, LlamaIndex, Haystack). Ingests model weights from HF Hub, S3, or local disk at startup.

Downstream: Returns completions to callers; emits Prometheus metrics consumed by L6 observability (Prometheus/Grafana, SigNoz, VictoriaMetrics). Per-request logs feed L6 LLM evaluation (Promptfoo, LangSmith, Langfuse, Arize).

⚡ Trust Risks

high vLLM server exposed without authentication. Default behavior is open HTTP — anyone with network access can call it.

Mitigation: Always deploy behind an authenticating proxy: LiteLLM with virtual keys, Envoy with JWT auth, Kong with API key plugin, or a service mesh. Test that the bare vLLM port is not externally reachable. Use NetworkPolicies or security groups to enforce.
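
A minimal reachability check, run from outside the trust boundary, along the lines of the test described above; the hostname and ports are assumptions for illustration.

```python
# Sketch: verify from OUTSIDE the trust boundary that the bare vLLM port is not
# reachable and that only the authenticating proxy answers. Host and ports are
# illustrative assumptions; substitute your own endpoints.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

assert not port_open("inference.internal.example.com", 8000), \
    "bare vLLM port is externally reachable - tighten NetworkPolicy / security group"
assert port_open("inference.internal.example.com", 443), \
    "authenticating proxy is not answering"
```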

high Model output not validated against expected schema. Function-calling output may be malformed or hallucinated when the model doesn't fully grasp the schema.

Mitigation: Pair vLLM with Outlines, Guidance, or Instructor for guaranteed-structured output. Validate every tool-call response with a Pydantic model. For high-stakes tool calls, require LLM self-consistency (retry, confirm) before execution.
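
A sketch of the Pydantic validation step; the get_weather tool and its fields are hypothetical examples, not part of any vLLM API.

```python
# Sketch: validating a tool call returned by the model against a Pydantic schema
# before executing it. `get_weather` and its fields are hypothetical examples.
import json
from pydantic import BaseModel, ValidationError

class GetWeatherArgs(BaseModel):
    city: str
    unit: str = "celsius"

def validate_tool_call(tool_call) -> GetWeatherArgs | None:
    """tool_call is one entry of response.choices[0].message.tool_calls."""
    if tool_call.function.name != "get_weather":
        return None  # unknown tool: refuse rather than guess
    try:
        return GetWeatherArgs.model_validate(json.loads(tool_call.function.arguments))
    except (json.JSONDecodeError, ValidationError):
        return None  # malformed arguments: retry or surface an error, never execute

# args = validate_tool_call(response.choices[0].message.tool_calls[0])
# if args is None: retry with a repair prompt or fail closed
```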

high Memory exhaustion under load — KV cache fills, requests queue, latency spikes. Severe load can OOM the GPU.

Mitigation: Set max-num-seqs and gpu-memory-utilization explicitly (don't accept defaults). Monitor KV cache hit rate via Prometheus. Pre-load test under expected p99 concurrency. Use admission control (queue depth threshold returns 503) to shed load gracefully.
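
A sketch of setting the memory-relevant knobs explicitly with the offline vllm.LLM API (the server equivalents are the --gpu-memory-utilization and --max-num-seqs flags); the values shown are illustrative starting points, not recommendations.

```python
# Sketch: making the memory-relevant knobs explicit instead of accepting defaults.
# Values are illustrative starting points; load-test at your expected p99 concurrency.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model
    gpu_memory_utilization=0.85,  # fraction of GPU memory vLLM may claim (weights + KV cache)
    max_num_seqs=128,             # upper bound on concurrently batched sequences
    max_model_len=8192,           # cap context length so KV cache sizing stays predictable
)
```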

medium Quantization degrades specific task performance more than benchmarks indicate. AWQ/GPTQ-4bit may pass MMLU but fail on workload-specific math or code generation.

Mitigation: Run task-specific evals on the quantized model BEFORE production deploy. Maintain a canary deployment running unquantized for A/B comparison. Watch task accuracy metrics in production via Promptfoo or similar.
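
A sketch of that workload-specific comparison, assuming a quantized endpoint and an unquantized canary are both running and that exact-match scoring is meaningful for the task; endpoints and model names are placeholders for your own eval harness.

```python
# Sketch: comparing a quantized deployment against an unquantized canary on
# workload-specific cases before promoting it. Endpoints, model names, and the
# exact-match scoring are all assumptions; swap in your real eval harness.
from openai import OpenAI

cases = [("What is 17 * 23?", "391")]  # extend with real workload cases

def accuracy(base_url: str, model: str) -> float:
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    hits = 0
    for prompt, expected in cases:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=32,
        ).choices[0].message.content
        hits += int(expected in out)
    return hits / len(cases)

print("unquantized:", accuracy("http://canary:8000/v1", "meta-llama/Llama-3.1-70B-Instruct"))
print("awq-4bit:   ", accuracy("http://prod:8000/v1", "meta-llama/Llama-3.1-70B-Instruct-AWQ"))
```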

medium Single-instance deployment treated as HA. Production traffic against one vLLM container has no failover.

Mitigation: Run at least 2 replicas behind a load balancer. Use KServe or BentoML for K8s-native autoscaling and rolling updates. Monitor replica health; alert on single-replica conditions.

Use Case Scenarios

strong Healthcare RAG stack on AWS GovCloud with HIPAA BAA

vLLM in EKS within AWS GovCloud; AWS provides the HIPAA BAA and FedRAMP authorization for the substrate. vLLM serves Llama-class models on-prem-equivalent (no data leaves the BAA boundary). LiteLLM proxy in front handles auth + audit log to the SIEM.

strong Multi-tenant SaaS with per-tenant model isolation

vLLM cluster with model-per-tenant sharding via KServe. LiteLLM virtual keys map to per-tenant quotas + budgets. Per-tenant usage flows to billing via token counts in Prometheus.

weak Low-traffic developer prototype on a laptop

Ollama is a better fit — wraps llama.cpp for one-command model run on Apple Silicon or CPU. vLLM requires GPU for meaningful throughput; overhead isn't justified for low-traffic prototyping.

Stack Impact

L1 vLLM at L4 colocates with L1 vector DBs (Pinecone, Qdrant, Weaviate, pgvector) and L1 cache (Valkey/Redis) for RAG retrieval. Co-located deployment eliminates cross-region latency on retrieval-augmented inference.
L4 vLLM is the L4 inference engine. Choice cascades to L4 RAG framework (LangChain/LlamaIndex/Haystack call vLLM via OpenAI-compatible client), L4 reranker (Cohere Rerank or self-hosted reranker via vLLM), and L4 agent memory (Letta/Mem0 use vLLM as their inference backend).
L5 L5 governance must provide authentication, ABAC, audit logging, and rate-limiting since vLLM has none natively. LiteLLM proxy is a common pattern: virtual keys + per-team budgets + LLM cost attribution. Envoy/Kong handle ingress auth. NeMo Guardrails or Promptfoo for content/safety policies.
L6 vLLM's Prometheus metrics feed L6 observability (Prometheus + Grafana, or Datadog/SigNoz/VictoriaMetrics). Per-request token counts enable LLM cost attribution at L6 (LangSmith, Helicone, Langfuse, Arize Phoenix can ingest).
L7 vLLM is the inference primitive that L7 agent frameworks call (LangGraph, CrewAI, AutoGen, AG2, smolagents, Letta, Mem0). Agent orchestration layers issue chat-completion requests; vLLM serves them with continuous batching.


Visit vLLM website →

This analysis is AI-generated using the INPACT and GOALS frameworks from "Trust Before Intelligence." Scores and assessments are algorithmic and may not reflect the vendor's complete capabilities. Always validate with your own evaluation.