for AWS partners →Deploy it with AWS credits + a partner →

deploy an open-source LLM on AWS · the complete how-to · 2026

How to deploy an open-source LLM on AWS — Bedrock, SageMaker JumpStart, or self-host (2026).

You have an open-weight model — Llama, Mistral, Qwen, DeepSeek, Gemma, or one you fine-tuned — and you want it serving traffic on AWS. There are three real paths, not one: call it through Amazon Bedrock if it is in the managed catalog (or import it), stand up a managed endpoint in a few clicks with Amazon SageMaker JumpStart, or self-host on EC2 GPU / Inferentia with vLLM or TGI for maximum control and the lowest unit cost. This page maps all three honestly — the effort, cost, and scaling tradeoffs — gives a concrete step-by-step for the SageMaker path, works the cost math, covers autoscaling and cold-starts, and says plainly when managed beats self-host.

Deploy it with AWS credits + a partner →→ jump to the decision table

real paths

fastest to live

minutes (JumpStart)

lowest unit cost

self-host*

credits to fund it

up to $1M

TL;DR

There are three ways to run an open-weight LLM (Llama, Mistral, Qwen, DeepSeek, Gemma, or your own fine-tune) on AWS. Amazon Bedrock — zero infrastructure, pay per token — if the model is in the managed catalog or you bring it via Custom Model Import for a supported architecture. Amazon SageMaker JumpStart — a managed real-time endpoint on instances you pick (GPU or Inferentia) in your own account, deployable in a few clicks. Self-host on EC2 GPU or Inferentia/Trainium with vLLM or TGI — maximum control and the lowest unit cost at high utilization, in exchange for the most operational work.
The decision is mostly about traffic shape, ops appetite, and how custom your weights are. Bedrock wins for spiky or unpredictable traffic and fastest time-to-value (you pay nothing when idle). SageMaker JumpStart is the pragmatic middle — your model, your account, your instance choice, AWS managing the serving layer and autoscaling. Self-hosting wins when traffic is steady and high-volume and you want to squeeze unit cost or need full control of the stack. The right answer for many products is a mix.
Whichever path you pick, GPU and accelerator hours (or Bedrock tokens) are exactly what AWS credits are built to absorb — Activate up to $100K, Bedrock / GenAI PoC funding $10K–$50K, and the Generative AI Accelerator up to $1M. CloudRoute routes you to a vetted AWS partner who deploys the model (JumpStart, vLLM/TGI, or Bedrock import), tunes the autoscaling and FinOps, and files the credit applications. Customer pays $0 — AWS funds it.

the map

IThe three ways to run an open LLM on AWS

Most guides describe one path and present it as "the" way. The honest picture is that AWS gives you three distinct deployment surfaces for an open-weight model, on a spectrum from least-ops/least-control to most-ops/most-control. Picking the right one before you start saves weeks.

By "open-source LLM" — more precisely, an open-weight model — we mean a model whose weights you can download and run yourself: Llama (Meta), Mistral / Mixtral, Qwen, DeepSeek, Gemma, Falcon, and the long tail of fine-tunes built on top of them. Unlike a closed API model, you are responsible for where and how it runs — which is exactly the choice this page is about. The three surfaces below all run inside AWS, under one bill, one IAM model, and (for the self-managed options) your own VPC.

Path 1 — Amazon Bedrock (fully managed, pay per token). Several open-weight models — notably the Llama family and Mistral models — are available directly in the Bedrock catalog. You call them through the same API as Claude or Nova, you provision nothing, and you pay per input/output token. If your model (or a close-enough one) is in the catalog, this is the shortest path to production. For an open-weight model that is not in the catalog, Bedrock also offers Custom Model Import, which lets you bring the weights of a supported architecture (Llama-class and several other common families) and serve them through the Bedrock API on managed infrastructure. The constraint is architecture support — your model must match an importable family.

Path 2 — Amazon SageMaker JumpStart (managed endpoint, your account, your instance). JumpStart is SageMaker's model hub: a catalog of hundreds of open and proprietary models you can deploy to a real-time endpoint in a few clicks or a few lines of the SageMaker SDK. You keep control over the instance type (GPU such as the g- and p-families, or Inferentia inf2), the model runs in your own account and VPC, and AWS manages the serving container, the health checks, and the autoscaling. This is the pragmatic middle ground: far less work than wiring up your own server, far more control than a managed token API. It is the path this page walks step-by-step in section IV.

Path 3 — Self-host on EC2 with vLLM or TGI. The maximum-control path: launch a GPU instance (or an Inferentia/Trainium inf/trn instance) yourself, run a high-throughput inference server — vLLM or Hugging Face Text Generation Inference (TGI) — load the weights, and put the whole thing behind a load balancer or Kubernetes (EKS) with your own autoscaling. You own every layer, you get the lowest unit cost when the box is well-utilized, and you carry all of the operational responsibility — sizing, scaling, patching, and keeping it busy. For steady, high-volume traffic where unit cost dominates, this is where the cheapest cost-per-token lives.

A useful way to hold the three in your head: Bedrock is "the model as an API call," SageMaker JumpStart is "a managed endpoint you own the dial on," and self-hosting is "you run the server." They are not ranked best-to-worst — each wins for a different traffic shape and team. The rest of the page is about matching your situation to the right one.

open-weight ≠ free

"Open-source" refers to the model weights being available — it does not mean running the model is free. On AWS you still pay for the compute that serves it: GPU/accelerator instance-hours for the self-managed paths, or per-token usage on Bedrock. The headline saving of an open model versus a closed one is that you avoid a proprietary per-token premium and gain control — but you take on the infrastructure cost and (for self-host) the ops. The credit and FinOps sections below are about making that cost manageable, then $0.

effort vs cost vs scaling

IIThe tradeoffs: effort, cost, and scaling

The three paths trade the same three things against each other — how much work it is to stand up and run, what it costs per unit of inference, and how it scales with traffic. Understanding the shape of each trade is what makes the decision obvious for your case.

The throughline across all three trades: traffic shape is the deciding variable. Steady, high, predictable traffic rewards the self-managed paths (you keep the box full and the unit cost low). Spiky, low, or unpredictable traffic rewards Bedrock (you pay nothing in the troughs). Most real products have both — a steady baseline plus bursts — which is why the pragmatic answer is so often a mix, covered in the decision section.

Effort (time-to-live + ongoing ops)

On effort, the order is unambiguous. Bedrock is the least work — there is nothing to deploy; you authenticate and call the API, and time-to-first-response is minutes. (Custom Model Import adds a one-time import step but no ongoing server ops.) SageMaker JumpStart is the middle — a few clicks or SDK calls stand up a managed endpoint, and AWS runs the serving layer, but you own the endpoint, choose the instance, and manage its lifecycle. Self-hosting is the most work — you build the AMI or container, install and tune vLLM/TGI, wire up load balancing and autoscaling on EC2 or EKS, handle drivers and CUDA (or the Neuron SDK for Inferentia), and keep it all patched and observed. The effort gap is largest after launch: a managed endpoint mostly runs itself; a self-hosted fleet is a standing operational commitment.

Cost (the unit economics)

On cost, the order can invert depending on utilization, which is the single most important idea on this page. Bedrock charges per token — you pay only for what you use and nothing when idle, which makes it the cheapest option for spiky or low-volume traffic. The self-managed paths (JumpStart and self-host) charge per instance-hour — the box bills 24/7 whether or not requests are flowing, so their cost-per-token is only low when the instance is kept well-utilized. At high, steady utilization, a self-managed GPU or Inferentia endpoint typically beats per-token pricing on cost-per-million-tokens; at low utilization it can be dramatically more expensive than Bedrock because you are paying for idle silicon. The metric that decides it is cost per million tokens at your real traffic shape — not the instance's hourly rate and not the headline per-token price.

Scaling (how it behaves under load)

Bedrock scales transparently — AWS handles capacity behind the API, and you scale by simply sending more requests (within account/model throughput limits; Provisioned Throughput exists for guaranteed capacity). SageMaker JumpStart endpoints scale via SageMaker's managed autoscaling — you set target metrics and min/max instance counts, and the endpoint adds or removes instances automatically, including scale-to-zero for serverless-style patterns on supported configurations (with the cold-start tradeoff covered in section VI). Self-hosted fleets scale however you build it — EC2 Auto Scaling groups or Kubernetes HPA on EKS — which is the most flexible and the most work, and where you own the cold-start and warm-pool behavior yourself.

model availability

IIIWhich open models run where

Before choosing a path you have to know whether your specific model is even available on it. Catalog membership (Bedrock), hub coverage (JumpStart), and "anything you can download" (self-host) are three different levels of availability.

The availability picture, in plain terms. Bedrock's catalog carries selected open-weight families — the Llama models and Mistral models are the headline open options, alongside Amazon Nova/Titan, Claude, Cohere, AI21, and others. If your model is one of those, the managed-API path is open to you with zero deployment. Bedrock Custom Model Import widens this to open-weight models you bring yourself, provided the architecture is supported (Llama-class and several other common transformer families) — you upload the weights and Bedrock serves them on managed infrastructure. SageMaker JumpStart's hub is far broader: hundreds of open models — Llama, Mistral/Mixtral, Qwen, Falcon, Gemma, embeddings and vision models, and more — each deployable to an endpoint you control. And self-hosting has no catalog limit at all: if you can download the weights (from Hugging Face or elsewhere) and they run under vLLM or TGI, you can serve them, including bleeding-edge releases and your own private fine-tunes that no managed catalog has yet.

The practical consequence: your model's availability narrows the choice for you. A current Llama or Mistral model gives you all three options, so you decide purely on traffic and ops. A newer or more niche open model (a fresh Qwen or DeepSeek release, say) may not be in the Bedrock catalog yet — then it is JumpStart (if the hub has it) or self-host. A private fine-tune on a custom base, or a brand-new architecture, usually means JumpStart-with-your-own-artifact or self-host. Confirm the current Bedrock model list and JumpStart hub coverage in the consoles before committing — both expand continually.

open-weight model availability by path · representative as of 2026 (confirm current catalogs in-console)

Model family	Bedrock catalog	Bedrock Custom Model Import	SageMaker JumpStart	Self-host (vLLM/TGI)
Llama (Meta)	Yes (selected versions)	Yes (Llama-class supported)	Yes	Yes
Mistral / Mixtral	Yes (selected versions)	Often (check support)	Yes	Yes
Qwen	Varies / newer releases lag	If architecture supported	Commonly	Yes
DeepSeek	Some availability	If architecture supported	Commonly	Yes
Gemma / Falcon	Varies	If architecture supported	Commonly	Yes
Your private fine-tune	No	Yes if base architecture supported	Yes (bring your artifact)	Yes
Brand-new architecture	No (until added)	No (until supported)	If the hub adds it	Yes (day one)

Catalog membership and import support change frequently as AWS adds models and architectures. This table shows the typical pattern, not a guaranteed current list — verify your exact model and version in the Amazon Bedrock and SageMaker JumpStart consoles before you build around any one path.

the step-by-step

IVStep-by-step: deploy an open LLM on SageMaker JumpStart

JumpStart is the path most teams should try first when the model is not already in the Bedrock catalog, because it gives you a production-grade managed endpoint in your own account with very little code. Here is the end-to-end flow, in the order you actually do it.

The walkthrough below assumes you want a real-time endpoint for an open-weight chat/instruct model (a Llama or Mistral instruct variant is the canonical example). The same flow applies to other JumpStart LLMs; only the model ID and the recommended instance change. Exact instance names, container versions, and quotas move — treat the specifics as representative and confirm current values in the console.

Step 1 — Prepare the account and quotas — In your AWS account, make sure you have a SageMaker execution role with access to S3 and the right permissions, and — critically — that you have service quota for the GPU or inf2 instance type you intend to use (e.g. an ml.g5 / ml.g6 GPU instance or an ml.inf2 instance for the endpoint). New accounts often have a zero quota for large GPU instances; request an increase early because approval can take time. Open SageMaker Studio as your workbench.
Step 2 — Find the model in JumpStart — In SageMaker Studio, open JumpStart and search for your model (e.g. the Llama or Mistral instruct variant you want). JumpStart shows the model card, the license terms (accept them — open-weight models carry usage licenses you must agree to), and a recommended instance type. You can deploy from the UI or, more reproducibly, from the SageMaker Python SDK using the model's JumpStart model ID.
Step 3 — Choose the instance and serving config — Pick the instance type and count. For a mid-size open LLM, a single modern GPU instance is often enough; larger models need a multi-GPU instance or sharding, and Inferentia inf2 is the lower-cost option if your model has a supported Neuron path. Set the inference container (JumpStart wires up an optimized one — frequently a TGI- or vLLM-based image — automatically) and the number of model replicas per instance if applicable.
Step 4 — Deploy the endpoint — Deploy. From the SDK this is essentially a JumpStartModel(...).deploy(...) call with your instance type and initial instance count; from the UI it is a button. SageMaker provisions the instance(s), pulls the optimized container, downloads the weights, and brings up a real-time HTTPS endpoint in your account. Initial deployment typically takes several minutes (model download + container start).
Step 5 — Invoke and validate — Call the endpoint with the SageMaker runtime (invoke_endpoint) or your application SDK, sending a chat/completion payload in the model's expected format. Validate that outputs are correct, then measure latency — time-to-first-token and inter-token latency under realistic prompt sizes — and confirm it meets your application's requirements.
Step 6 — Configure autoscaling — Attach Application Auto Scaling to the endpoint: set a target metric (e.g. invocations-per-instance or a concurrency/utilization target), a minimum and maximum instance count, and scale-in/scale-out cooldowns. This is what turns a single fixed instance into an endpoint that grows and shrinks with traffic. Decide here whether you want a warm minimum (lower latency, higher floor cost) or scale-to-near-zero (lower cost, cold-start risk) — covered in section VI.
Step 7 — Add observability and guardrails — Wire the endpoint into CloudWatch for latency, error-rate, and instance-count metrics; enable SageMaker Model Monitor if you want drift detection; and put your own input/output validation or a safety layer in front for production. Tag the endpoint for cost allocation so the spend is attributable.
Step 8 — Iterate on cost — Once it is serving real traffic, benchmark cost-per-million-tokens at your actual utilization and compare it against (a) a Bedrock per-token equivalent and (b) the same model on Inferentia inf2 if you started on GPU. This three-way number at your real traffic shape is what tells you whether to stay on JumpStart-GPU, move to inf2, or push spiky overflow to Bedrock. Then apply AWS credits so the instance-hours bill to credits, not your card.

why JumpStart before raw EC2

JumpStart gives you ~80% of the control of self-hosting (your account, your VPC, your instance choice, your autoscaling) for a fraction of the setup work, because AWS manages the serving container, health checks, and rolling updates. Many teams who think they need a hand-built vLLM fleet find a JumpStart endpoint meets the need — and only graduate to raw EC2/EKS self-hosting when they have a specific reason (a custom serving stack, extreme cost tuning at very high volume, or a model JumpStart does not host).

the control path

VSelf-hosting on EC2 GPU or Inferentia with vLLM / TGI

When you need full control — a custom serving stack, the absolute lowest unit cost at high volume, or a model nothing else hosts — you run the server yourself on EC2. This is more work, and the work has a known shape.

The self-host pattern is: pick the compute, run a high-throughput inference server, put it behind scaling. On compute, you choose between EC2 GPU instances (the g- and p-families — the familiar CUDA path, broadest model support, no porting) and EC2 Inferentia/Trainium instances (inf2 / trn — AWS's custom silicon, lower cost-per-token at high utilization for supported models, but you serve through the Neuron SDK rather than CUDA and your model must have a clean Neuron compilation path). For the inference server, the two standard choices are vLLM — a high-throughput engine known for PagedAttention and continuous batching, with both CUDA and Neuron support — and Hugging Face Text Generation Inference (TGI), a production-grade server with streaming, continuous batching, and broad model coverage. Both turn raw weights into an efficient, batched, streaming endpoint; the choice between them is mostly ecosystem preference and specific model/feature support.

On scaling and operations, you have two common topologies. The simpler one runs the server on EC2 instances behind an Application Load Balancer with an EC2 Auto Scaling group sized to a utilization or queue-depth metric. The more flexible one runs it as pods on EKS (Kubernetes) with the Horizontal Pod Autoscaler and a GPU/Neuron device plugin, which gives you fine-grained packing, rolling deploys, and multi-model density at the cost of running Kubernetes. Either way you own the parts SageMaker would otherwise manage: the AMI/container build, driver and CUDA (or Neuron SDK) versions, health checks, autoscaling policy, cold-start/warm-pool behavior, and patching. That standing operational load is the real price of self-hosting — it is rarely the model that is hard, it is keeping the fleet healthy, full, and cheap over time.

The reason teams take this on anyway is unit cost at scale and total control. At steady high volume, a well-utilized self-hosted fleet — especially on Inferentia inf2 — generally delivers the lowest cost-per-million-tokens of any path, because you have stripped out every managed-service margin and can tune batching, quantization, and instance choice to your exact model and traffic. And you control everything: the serving engine, the model version, quantization and speculative-decoding tricks, the network boundary, and the deployment cadence. If you are running one or a few high-traffic models where every cent of unit cost matters, self-hosting is the path that lets you drive it to the floor.

GPU vs Inferentia for self-hosting — the quick rule

Choose EC2 GPU (g/p) when: you want zero porting and the full CUDA ecosystem, your model uses custom CUDA kernels, you are iterating on brand-new architectures that may outrun Neuron support, or your volume does not justify a Neuron port.
Choose EC2 Inferentia (inf2) when: you have steady, high-volume traffic to keep the box full, your model has a clean Neuron path (most mainstream LLMs do via Optimum Neuron / vLLM-on-Neuron), and you want the lowest infrastructure cost-per-token with control.
Either way: the cost advantage only materializes at high utilization — an idle self-hosted box (GPU or inf2) can cost more per request than Bedrock's pay-per-token. Keep it busy or do not self-host.

the numbers

VICost math, autoscaling, and cold-starts

The whole decision turns on one number: cost per million tokens at your real utilization. This section shows how to compute it for each path and how autoscaling and cold-starts move it — with representative figures you should replace with current rates and your own benchmarks.

For the self-managed paths (JumpStart and self-host), the unit cost is a division problem: cost per million tokens = (instance $/hour ÷ tokens served per hour) × 1,000,000. The instance hourly rate is fixed; the tokens-per-hour you achieve depends on the model, the instance, and — above all — how effectively you batch concurrent requests. This is why utilization is everything: the same GPU instance serving 10× the tokens per hour has roughly one-tenth the cost per token. A box at 80% utilization can be dramatically cheaper per token than the identical box at 15%, even though the hourly bill is the same. The lever you control is keeping the instance full (good batching + right-sized autoscaling), and choosing cheaper silicon (inf2) where it fits.

For Bedrock, the unit cost is simply the published per-token price for the model (input and output priced separately), with no idle cost — and reducible further with Batch (roughly half-price for non-real-time work) and prompt caching (cuts the cost of repeated context). Because you pay nothing in the troughs, Bedrock's effective cost per token over a spiky day can beat a self-managed box that sat idle for half of it — even when Bedrock's headline per-token price is higher than the self-managed box's best-case per-token cost. The comparison that matters is therefore not "list price vs list price" but "Bedrock's pay-per-use total vs the self-managed box's 24/7 total" across your actual daily traffic curve.

The honest way to choose, then, is to take your real traffic shape — requests per hour across a representative day, with realistic prompt and completion lengths — and compute the monthly cost three ways: Bedrock per-token (including idle troughs at $0), a JumpStart/GPU endpoint at its achievable utilization, and an inf2 endpoint at its achievable utilization. The cheapest depends almost entirely on how steady and high your volume is. The representative table below shows the shape of that answer; your numbers come from current AWS pricing and a benchmark of your specific model.

Autoscaling and cold-starts

Autoscaling is how a self-managed endpoint reconciles "keep it full" (for low unit cost) with "don't pay for idle." You set a minimum instance count (your always-warm floor), a maximum (your ceiling for bursts), and a target metric (concurrency or invocations per instance) that triggers scale-out and scale-in. The tension is the cold-start: spinning up a new LLM instance is not instant — provisioning, pulling a multi-gigabyte container, and loading large weights into accelerator memory can take from tens of seconds to a few minutes, during which a just-arrived request either waits or is rejected. So scaling-to-zero minimizes cost but exposes users to a cold-start when traffic resumes; keeping a warm minimum eliminates the cold-start but pays for an always-on floor. The right setting is a product decision: latency-critical user-facing features usually keep a warm minimum, while internal or batch-tolerant workloads can scale to zero and accept the occasional cold-start. Mitigations include SageMaker warm pools, provisioned/min-capacity floors, smaller or quantized models that load faster, and routing burst overflow to Bedrock (which has no cold-start you manage) while your own fleet scales up.

representative cost-per-path by traffic shape · illustrative shape only, not current pricing — benchmark your own

Traffic shape	Bedrock (pay-per-token)	JumpStart / GPU endpoint	Self-host inf2 (well-utilized)	Cheapest (typical)
Low / spiky / unpredictable	Pay only for use; $0 idle	Pays 24/7 even when idle — poor fit	Pays 24/7 even when idle — poor fit	Bedrock
Medium, somewhat steady	Competitive; simplest	Competitive if kept busy	Cheapest if Neuron path + busy	Tie / depends on utilization
High, steady, predictable	Per-token adds up at volume	Low unit cost at high utilization	Lowest unit cost when full	Self-host inf2 (then GPU)
Steady baseline + bursts	Great for the bursts	Good for the baseline	Cheapest for the baseline	Mix: self-managed base + Bedrock burst

Figures intentionally omitted because real per-token and per-hour rates move and depend on the exact model and instance — this table shows which path tends to win for each traffic shape, not a dollar amount. Compute your own three-way cost-per-million-tokens from current AWS pricing and a benchmark of your model at your real concurrency before deciding.

the honest verdict

VIIWhen managed beats self-host (and when it does not)

The instinct to self-host "to save money" is often wrong, and sometimes exactly right. Here is the honest version of when a managed path (Bedrock or JumpStart) beats running your own server — and the narrower set of cases where self-hosting genuinely wins.

Managed wins more often than engineers expect, for one structural reason: the cost advantage of self-hosting is conditional on high utilization, but the operational cost of self-hosting is unconditional. You pay the ops tax — building, scaling, patching, on-call — whether the box is full or empty, and you only collect the unit-cost reward if you keep it full. For the many teams whose traffic is not yet steady-and-high, a managed path is both cheaper (no idle burn, no engineer-months) and faster. The time-to-value gap compounds the point: Bedrock is live in minutes and JumpStart in an afternoon, while a hardened self-hosted fleet with proper autoscaling and observability is a multi-week project that then needs ongoing care.

Choose managed (Bedrock or JumpStart) when…

Traffic is spiky, low, or still unpredictable — you cannot keep a self-hosted box full, so idle burn or pay-per-token economics favor managed (Bedrock especially).
You want speed-to-production and to avoid standing up serving infrastructure and on-call rotations.
Your model is in the Bedrock catalog (call it) or in the JumpStart hub / importable (managed endpoint) — there is no need to hand-build a server.
Your team is small or has no dedicated ML-infra/SRE capacity — the ops load of a self-hosted fleet would come out of product work.
You value the managed surrounding features — Bedrock Guardrails, Knowledge Bases, Agents; SageMaker autoscaling, Model Monitor, rolling updates — over squeezing the last cent of unit cost.

Choose self-host when…

Traffic is steady, high-volume, and predictable — you can keep instances well-utilized, which is the only regime where self-host's low unit cost actually materializes.
Unit cost is the dominant line item and shaving cost-per-million-tokens (via inf2, quantization, custom batching) is worth real engineering.
You need a custom serving stack — a specific engine version, speculative decoding, custom kernels, multi-model packing — that managed paths do not expose.
You run a model nothing hosts — bleeding-edge or bespoke architectures not in the Bedrock catalog or JumpStart hub.
You have the ML-infra/SRE capacity to own and operate the fleet over time, not just stand it up once.

the pattern most production teams land on

A mature open-LLM deployment on AWS is frequently a mix: a self-managed endpoint (JumpStart or self-hosted inf2) carrying the steady, predictable baseline at low unit cost, with Bedrock absorbing bursts and the long tail of spiky traffic at zero idle cost — and a fast managed path used early for time-to-market before any of it is optimized. The right architecture routes each slice of traffic to the option whose economics fit its shape, rather than betting the whole workload on one path.

funding + building it

VIIIHow CloudRoute gets the model deployed — and the bill to $0

Deploying an open LLM on AWS is two problems: doing the deployment well (the right path, tuned autoscaling, optimized unit cost) and paying for it. A vetted AWS partner plus AWS credits solve both — and the credits take the cash cost to zero.

Whichever path fits, the compute it runs on is exactly what AWS credits are designed to cover. SageMaker endpoint instance-hours and self-hosted EC2 GPU/inf2 hours are standard AWS compute; Bedrock tokens are standard Bedrock usage. All of it draws from the same credit pools: AWS Activate (up to $100K), Bedrock / GenAI PoC funding ($10K–$50K to prove out a model or feature), and the Generative AI Accelerator (up to $1M for selected AI-first companies). For an inference workload — which, once live, runs continuously — that credit runway can cover a production open-LLM deployment for a long time.

But the deployment itself benefits from expertise that is easy to underestimate. Choosing the path (Bedrock vs JumpStart vs self-host), picking GPU versus Inferentia, getting a clean Neuron compilation path if you go inf2, tuning autoscaling and cold-start behavior, and routing traffic so each slice runs where it is cheapest — this is real engineering, and getting it wrong is the difference between a deployment that quietly drains budget and one that is fast, cheap, and stable. CloudRoute (cloudroutehq.com) addresses both halves: it routes you to a vetted AWS partner who actually does the deployment — stands up the JumpStart endpoint or the vLLM/TGI fleet or the Bedrock import, tunes the autoscaling and the FinOps, and benchmarks the unit cost across paths — and files the AWS credit applications through the ACE program so the bill is funded from day one.

The economics for you: the customer pays $0. AWS funds the credit pool because it wants production GenAI on AWS for the long term; the partner is paid by AWS through engagement-funding programs; CloudRoute is paid by the partner as a routing commission. You never see an invoice from CloudRoute. You get an open LLM deployed on the right path, autoscaling and unit cost tuned by people who do this daily, and credits that cover the GPU/accelerator hours or Bedrock tokens — a deployment that is funded and optimized rather than improvised and expensive. For the broader credit picture, see AWS credits for generative-AI startups and how AWS PoC / Bedrock POC funding works.

side by side

Bedrock vs SageMaker JumpStart vs self-host — the deployment decision

The three paths to running an open LLM on AWS, against the dimensions that actually decide it. There is no single winner — match the row that describes your traffic and team to the column that fits.

Dimension	Amazon Bedrock	SageMaker JumpStart	Self-host (EC2 GPU / inf2)
What it is	Managed foundation-model API	Managed endpoint you control	Your own inference server
Cost model	Per token — $0 when idle	Per instance-hour (your instance)	Per instance-hour (your fleet)
Best traffic shape	Spiky / low / unpredictable	Medium → high, steady-ish	High, steady, predictable
Lowest unit cost at scale	No (per-token premium at volume)	Good when kept busy	Yes — best when full (esp. inf2)
Effort to launch	Minutes (API call)	An afternoon (few clicks/SDK)	Weeks (build, scale, harden)
Ongoing ops	None — fully managed	Light — AWS runs serving layer	Heavy — you own the fleet
Model availability	Catalog + Custom Model Import	Hundreds of hub models + your artifact	Anything you can download
Custom serving stack	No	Limited (managed container)	Full — any engine/version
Cold-starts	None you manage	Yes if scaling to zero (tunable)	Yes — you own warm-pool logic
Cash cost with CloudRoute	$0 — credits cover tokens	$0 — credits cover endpoint hours	$0 — credits cover EC2/inf hours

Representative as of 2026; model catalogs, instance options, and pricing all move. Confirm current Bedrock model availability, JumpStart hub coverage, and per-token/per-hour rates in-console, and benchmark cost-per-million-tokens at your real traffic before committing to a path.

ready to deploy the model?

Get matched with a partner who deploys your open LLM and files the credits

Start in 3 minutes →

a recent match

An open LLM deployed, autoscaled, and funded — anonymized

inquiry · Series-A B2B SaaS, fine-tuned open-weight model for an in-product feature

Series-A B2B SaaS, ~25 people, wanting to serve a fine-tuned open-weight model (Llama-class) behind a customer-facing feature with a steady weekday baseline and evening bursts

Situation: The team had fine-tuned an open-weight base model and proven it in a notebook, but had never deployed an LLM in production. They were unsure whether to put it on Bedrock (their fine-tune was not a catalog model), stand up SageMaker JumpStart, or hand-build a vLLM server on GPU — and they were worried the GPU bill would be unpredictable and large. They had no ML-infra/SRE capacity to spare and no AWS credits cushioning the cost.

What CloudRoute did: CloudRoute routed them within a day to an AWS partner with SageMaker and inference-FinOps experience. The partner deployed the fine-tuned model as a SageMaker JumpStart-style managed endpoint (their own artifact, in their account, on a right-sized GPU instance), validated latency, and configured Application Auto Scaling with a small warm minimum for the weekday baseline plus burst headroom. They benchmarked cost-per-million-tokens on GPU versus inf2 and versus a Bedrock equivalent, routed the spiky evening overflow to Bedrock to avoid an oversized always-on floor, and filed Activate plus GenAI PoC credits through ACE to cover the endpoint hours.

Outcome: The model was serving production traffic on a managed, autoscaled endpoint within the week, with a warm floor for latency and Bedrock absorbing bursts so the GPU fleet stayed right-sized. Benchmarked unit cost landed well below their feared GPU bill, and credits covered the instance-hours and Bedrock tokens — taking the cash cost of the deployment to roughly $0 for the credit runway. The partner did the deployment and the FinOps; the team kept building product. CloudRoute was paid by the partner from AWS engagement funding — the company paid $0.

path: JumpStart-style managed endpoint + Bedrock burst · time-to-production: ~1 week · unit cost vs feared GPU bill: well below · cost to customer: $0

faq

Common questions

How do I deploy an open-source LLM on AWS?

There are three real paths. (1) Amazon Bedrock — if your model (e.g. a Llama or Mistral version) is in the managed catalog, you call it through the API and pay per token with no infrastructure; for open-weight models not in the catalog, Bedrock Custom Model Import lets you bring supported architectures. (2) Amazon SageMaker JumpStart — deploy the open model to a managed real-time endpoint in your own account, on a GPU or Inferentia instance you choose, in a few clicks or a few SDK calls. (3) Self-host on EC2 GPU or Inferentia with vLLM or Hugging Face TGI for maximum control and the lowest unit cost at high utilization. Choose based on traffic shape, ops appetite, and how custom your weights are.

Can I run Llama (or Mistral) on Amazon Bedrock?

Yes — selected Llama and Mistral versions are available directly in the Amazon Bedrock catalog, called through the same API as any other Bedrock model and billed per token, with no infrastructure to manage. Exactly which versions are available changes as AWS updates the catalog, so confirm the current list in the Bedrock console. If the specific open-weight model you want is not in the catalog, you can either use Bedrock Custom Model Import (for supported architectures like Llama-class), deploy it via SageMaker JumpStart, or self-host it.

What is the cheapest way to serve an open LLM on AWS?

It depends entirely on your traffic shape, and the order can invert. For steady, high-volume, predictable traffic, a well-utilized self-hosted endpoint — especially on Inferentia inf2 — usually gives the lowest cost per million tokens, because per-instance-hour pricing divided over high throughput beats per-token pricing. For spiky, low, or unpredictable traffic, Amazon Bedrock is typically cheapest because you pay per token and nothing when idle, whereas a self-managed box bills 24/7 even when empty. The deciding metric is cost per million tokens at your real utilization, computed from current AWS rates and a benchmark of your model — not the instance hourly rate or the headline per-token price.

Should I use SageMaker JumpStart or self-host with vLLM/TGI?

Try JumpStart first unless you have a specific reason not to. JumpStart gives you a managed real-time endpoint in your own account — your instance choice (GPU or inf2), your VPC, your autoscaling — while AWS manages the serving container, health checks, and rolling updates, so you get most of the control of self-hosting for a fraction of the setup and ongoing ops. Graduate to self-hosting on EC2/EKS with vLLM or TGI when you need a custom serving stack, want to push unit cost to the floor at very high steady volume, or run a model JumpStart does not host — and you have the ML-infra/SRE capacity to operate the fleet over time.

GPU or Inferentia for serving an open LLM?

Use EC2 GPU instances (g/p families) when you want zero porting and the full CUDA ecosystem, your model uses custom CUDA kernels, or you are iterating on brand-new architectures. Use Inferentia inf2 when you have steady high-volume traffic to keep the box full, your model has a clean Neuron compilation path (most mainstream LLMs do via Optimum Neuron or vLLM-on-Neuron), and you want the lowest infrastructure cost-per-token with control. The key caveat for both: the cost advantage only appears at high utilization — an idle self-hosted box, GPU or inf2, can cost more per request than Bedrock's pay-per-token.

How does autoscaling and cold-start work for an LLM endpoint?

A SageMaker endpoint (or a self-hosted fleet) scales by adding and removing instances against a target metric like concurrency or invocations-per-instance, between a minimum and maximum count you set. The tension is the cold-start: bringing up a new LLM instance means provisioning, pulling a large container, and loading multi-gigabyte weights into accelerator memory, which can take tens of seconds to a few minutes. Scaling to zero minimizes cost but exposes users to that cold-start when traffic resumes; keeping a warm minimum removes the cold-start but pays for an always-on floor. Latency-critical features usually keep a warm minimum; batch-tolerant workloads scale to zero. Warm pools, min-capacity floors, smaller/quantized models, and routing burst overflow to Bedrock all mitigate cold-starts.

When does a managed path beat self-hosting?

Managed (Bedrock or SageMaker JumpStart) beats self-hosting more often than people expect, because self-hosting's cost advantage is conditional on high utilization while its operational cost is unconditional — you pay the build-scale-patch-on-call tax whether the box is full or empty. Managed wins when traffic is spiky/low/unpredictable, when you want speed-to-production, when your team has no spare ML-infra/SRE capacity, or when your model is in the catalog/hub. Self-hosting wins when traffic is steady and high-volume (so you can keep instances full), when unit cost is the dominant line item, when you need a custom serving stack, or when you run a model nothing else hosts.

How do I pay for deploying an open LLM on AWS?

The compute is exactly what AWS credits are built to absorb: SageMaker endpoint hours and self-hosted EC2 GPU/inf2 hours are standard AWS compute, and Bedrock tokens are standard Bedrock usage. All draw from the same pools — Activate (up to $100K), Bedrock/GenAI PoC funding ($10K–$50K), and the Generative AI Accelerator (up to $1M for selected AI-first companies). CloudRoute routes you to a vetted AWS partner who actually does the deployment (JumpStart, vLLM/TGI self-host, or Bedrock import), tunes the autoscaling and FinOps, and files the credit applications through ACE. The customer pays $0 — AWS funds the credits, the partner is paid by AWS, and CloudRoute is paid by the partner.

Deploy your open LLM on AWS — and fund it to $0

CloudRoute connects ML teams with vetted AWS partners who deploy the model on the right path (Bedrock, SageMaker JumpStart, or self-hosted vLLM/TGI), tune the autoscaling and unit cost, and file the AWS credits that cover the bill. Customer pays $0 — AWS funds it.

Get matched in 24h →→ see the data-AI persona detail

matched within< 24h

credit ceilingup to $1M

cost to you$0