You have an open-weight model — Llama, Mistral, Qwen, DeepSeek, Gemma, or one you fine-tuned — and you want it serving traffic on AWS. There are three real paths, not one: call it through Amazon Bedrock if it is in the managed catalog (or import it), stand up a managed endpoint in a few clicks with Amazon SageMaker JumpStart, or self-host on EC2 GPU / Inferentia with vLLM or TGI for maximum control and the lowest unit cost. This page maps all three honestly — the effort, cost, and scaling tradeoffs — gives a concrete step-by-step for the SageMaker path, works the cost math, covers autoscaling and cold-starts, and says plainly when managed beats self-host.
Most guides describe one path and present it as "the" way. The honest picture is that AWS gives you three distinct deployment surfaces for an open-weight model, on a spectrum from least-ops/least-control to most-ops/most-control. Picking the right one before you start saves weeks.
By "open-source LLM" — more precisely, an open-weight model — we mean a model whose weights you can download and run yourself: Llama (Meta), Mistral / Mixtral, Qwen, DeepSeek, Gemma, Falcon, and the long tail of fine-tunes built on top of them. Unlike a closed API model, you are responsible for where and how it runs — which is exactly the choice this page is about. The three surfaces below all run inside AWS, under one bill, one IAM model, and (for the self-managed options) your own VPC.
Path 1 — Amazon Bedrock (fully managed, pay per token). Several open-weight models — notably the Llama family and Mistral models — are available directly in the Bedrock catalog. You call them through the same API as Claude or Nova, you provision nothing, and you pay per input/output token. If your model (or a close-enough one) is in the catalog, this is the shortest path to production. For an open-weight model that is not in the catalog, Bedrock also offers Custom Model Import, which lets you bring the weights of a supported architecture (Llama-class and several other common families) and serve them through the Bedrock API on managed infrastructure. The constraint is architecture support — your model must match an importable family.
Path 2 — Amazon SageMaker JumpStart (managed endpoint, your account, your instance). JumpStart is SageMaker's model hub: a catalog of hundreds of open and proprietary models you can deploy to a real-time endpoint in a few clicks or a few lines of the SageMaker SDK. You keep control over the instance type (GPU such as the g- and p-families, or Inferentia inf2), the model runs in your own account and VPC, and AWS manages the serving container, the health checks, and the autoscaling. This is the pragmatic middle ground: far less work than wiring up your own server, far more control than a managed token API. It is the path this page walks step-by-step in section IV.
Path 3 — Self-host on EC2 with vLLM or TGI. The maximum-control path: launch a GPU instance (or an Inferentia/Trainium inf/trn instance) yourself, run a high-throughput inference server — vLLM or Hugging Face Text Generation Inference (TGI) — load the weights, and put the whole thing behind a load balancer or Kubernetes (EKS) with your own autoscaling. You own every layer, you get the lowest unit cost when the box is well-utilized, and you carry all of the operational responsibility — sizing, scaling, patching, and keeping it busy. For steady, high-volume traffic where unit cost dominates, this is where the cheapest cost-per-token lives.
A useful way to hold the three in your head: Bedrock is "the model as an API call," SageMaker JumpStart is "a managed endpoint you own the dial on," and self-hosting is "you run the server." They are not ranked best-to-worst — each wins for a different traffic shape and team. The rest of the page is about matching your situation to the right one.
"Open-source" refers to the model weights being available — it does not mean running the model is free. On AWS you still pay for the compute that serves it: GPU/accelerator instance-hours for the self-managed paths, or per-token usage on Bedrock. The headline saving of an open model versus a closed one is that you avoid a proprietary per-token premium and gain control — but you take on the infrastructure cost and (for self-host) the ops. The credit and FinOps sections below are about making that cost manageable, then $0.
The three paths trade the same three things against each other — how much work it is to stand up and run, what it costs per unit of inference, and how it scales with traffic. Understanding the shape of each trade is what makes the decision obvious for your case.
The throughline across all three trades: traffic shape is the deciding variable. Steady, high, predictable traffic rewards the self-managed paths (you keep the box full and the unit cost low). Spiky, low, or unpredictable traffic rewards Bedrock (you pay nothing in the troughs). Most real products have both — a steady baseline plus bursts — which is why the pragmatic answer is so often a mix, covered in the decision section.
On effort, the order is unambiguous. Bedrock is the least work — there is nothing to deploy; you authenticate and call the API, and time-to-first-response is minutes. (Custom Model Import adds a one-time import step but no ongoing server ops.) SageMaker JumpStart is the middle — a few clicks or SDK calls stand up a managed endpoint, and AWS runs the serving layer, but you own the endpoint, choose the instance, and manage its lifecycle. Self-hosting is the most work — you build the AMI or container, install and tune vLLM/TGI, wire up load balancing and autoscaling on EC2 or EKS, handle drivers and CUDA (or the Neuron SDK for Inferentia), and keep it all patched and observed. The effort gap is largest after launch: a managed endpoint mostly runs itself; a self-hosted fleet is a standing operational commitment.
On cost, the order can invert depending on utilization, which is the single most important idea on this page. Bedrock charges per token — you pay only for what you use and nothing when idle, which makes it the cheapest option for spiky or low-volume traffic. The self-managed paths (JumpStart and self-host) charge per instance-hour — the box bills 24/7 whether or not requests are flowing, so their cost-per-token is only low when the instance is kept well-utilized. At high, steady utilization, a self-managed GPU or Inferentia endpoint typically beats per-token pricing on cost-per-million-tokens; at low utilization it can be dramatically more expensive than Bedrock because you are paying for idle silicon. The metric that decides it is cost per million tokens at your real traffic shape — not the instance's hourly rate and not the headline per-token price.
Bedrock scales transparently — AWS handles capacity behind the API, and you scale by simply sending more requests (within account/model throughput limits; Provisioned Throughput exists for guaranteed capacity). SageMaker JumpStart endpoints scale via SageMaker's managed autoscaling — you set target metrics and min/max instance counts, and the endpoint adds or removes instances automatically, including scale-to-zero for serverless-style patterns on supported configurations (with the cold-start tradeoff covered in section VI). Self-hosted fleets scale however you build it — EC2 Auto Scaling groups or Kubernetes HPA on EKS — which is the most flexible and the most work, and where you own the cold-start and warm-pool behavior yourself.
Before choosing a path you have to know whether your specific model is even available on it. Catalog membership (Bedrock), hub coverage (JumpStart), and "anything you can download" (self-host) are three different levels of availability.
The availability picture, in plain terms. Bedrock's catalog carries selected open-weight families — the Llama models and Mistral models are the headline open options, alongside Amazon Nova/Titan, Claude, Cohere, AI21, and others. If your model is one of those, the managed-API path is open to you with zero deployment. Bedrock Custom Model Import widens this to open-weight models you bring yourself, provided the architecture is supported (Llama-class and several other common transformer families) — you upload the weights and Bedrock serves them on managed infrastructure. SageMaker JumpStart's hub is far broader: hundreds of open models — Llama, Mistral/Mixtral, Qwen, Falcon, Gemma, embeddings and vision models, and more — each deployable to an endpoint you control. And self-hosting has no catalog limit at all: if you can download the weights (from Hugging Face or elsewhere) and they run under vLLM or TGI, you can serve them, including bleeding-edge releases and your own private fine-tunes that no managed catalog has yet.
The practical consequence: your model's availability narrows the choice for you. A current Llama or Mistral model gives you all three options, so you decide purely on traffic and ops. A newer or more niche open model (a fresh Qwen or DeepSeek release, say) may not be in the Bedrock catalog yet — then it is JumpStart (if the hub has it) or self-host. A private fine-tune on a custom base, or a brand-new architecture, usually means JumpStart-with-your-own-artifact or self-host. Confirm the current Bedrock model list and JumpStart hub coverage in the consoles before committing — both expand continually.
| Model family | Bedrock catalog | Bedrock Custom Model Import | SageMaker JumpStart | Self-host (vLLM/TGI) |
|---|---|---|---|---|
| Llama (Meta) | Yes (selected versions) | Yes (Llama-class supported) | Yes | Yes |
| Mistral / Mixtral | Yes (selected versions) | Often (check support) | Yes | Yes |
| Qwen | Varies / newer releases lag | If architecture supported | Commonly | Yes |
| DeepSeek | Some availability | If architecture supported | Commonly | Yes |
| Gemma / Falcon | Varies | If architecture supported | Commonly | Yes |
| Your private fine-tune | No | Yes if base architecture supported | Yes (bring your artifact) | Yes |
| Brand-new architecture | No (until added) | No (until supported) | If the hub adds it | Yes (day one) |
JumpStart is the path most teams should try first when the model is not already in the Bedrock catalog, because it gives you a production-grade managed endpoint in your own account with very little code. Here is the end-to-end flow, in the order you actually do it.
The walkthrough below assumes you want a real-time endpoint for an open-weight chat/instruct model (a Llama or Mistral instruct variant is the canonical example). The same flow applies to other JumpStart LLMs; only the model ID and the recommended instance change. Exact instance names, container versions, and quotas move — treat the specifics as representative and confirm current values in the console.
JumpStart gives you ~80% of the control of self-hosting (your account, your VPC, your instance choice, your autoscaling) for a fraction of the setup work, because AWS manages the serving container, health checks, and rolling updates. Many teams who think they need a hand-built vLLM fleet find a JumpStart endpoint meets the need — and only graduate to raw EC2/EKS self-hosting when they have a specific reason (a custom serving stack, extreme cost tuning at very high volume, or a model JumpStart does not host).
When you need full control — a custom serving stack, the absolute lowest unit cost at high volume, or a model nothing else hosts — you run the server yourself on EC2. This is more work, and the work has a known shape.
The self-host pattern is: pick the compute, run a high-throughput inference server, put it behind scaling. On compute, you choose between EC2 GPU instances (the g- and p-families — the familiar CUDA path, broadest model support, no porting) and EC2 Inferentia/Trainium instances (inf2 / trn — AWS's custom silicon, lower cost-per-token at high utilization for supported models, but you serve through the Neuron SDK rather than CUDA and your model must have a clean Neuron compilation path). For the inference server, the two standard choices are vLLM — a high-throughput engine known for PagedAttention and continuous batching, with both CUDA and Neuron support — and Hugging Face Text Generation Inference (TGI), a production-grade server with streaming, continuous batching, and broad model coverage. Both turn raw weights into an efficient, batched, streaming endpoint; the choice between them is mostly ecosystem preference and specific model/feature support.
On scaling and operations, you have two common topologies. The simpler one runs the server on EC2 instances behind an Application Load Balancer with an EC2 Auto Scaling group sized to a utilization or queue-depth metric. The more flexible one runs it as pods on EKS (Kubernetes) with the Horizontal Pod Autoscaler and a GPU/Neuron device plugin, which gives you fine-grained packing, rolling deploys, and multi-model density at the cost of running Kubernetes. Either way you own the parts SageMaker would otherwise manage: the AMI/container build, driver and CUDA (or Neuron SDK) versions, health checks, autoscaling policy, cold-start/warm-pool behavior, and patching. That standing operational load is the real price of self-hosting — it is rarely the model that is hard, it is keeping the fleet healthy, full, and cheap over time.
The reason teams take this on anyway is unit cost at scale and total control. At steady high volume, a well-utilized self-hosted fleet — especially on Inferentia inf2 — generally delivers the lowest cost-per-million-tokens of any path, because you have stripped out every managed-service margin and can tune batching, quantization, and instance choice to your exact model and traffic. And you control everything: the serving engine, the model version, quantization and speculative-decoding tricks, the network boundary, and the deployment cadence. If you are running one or a few high-traffic models where every cent of unit cost matters, self-hosting is the path that lets you drive it to the floor.
The whole decision turns on one number: cost per million tokens at your real utilization. This section shows how to compute it for each path and how autoscaling and cold-starts move it — with representative figures you should replace with current rates and your own benchmarks.
For the self-managed paths (JumpStart and self-host), the unit cost is a division problem: cost per million tokens = (instance $/hour ÷ tokens served per hour) × 1,000,000. The instance hourly rate is fixed; the tokens-per-hour you achieve depends on the model, the instance, and — above all — how effectively you batch concurrent requests. This is why utilization is everything: the same GPU instance serving 10× the tokens per hour has roughly one-tenth the cost per token. A box at 80% utilization can be dramatically cheaper per token than the identical box at 15%, even though the hourly bill is the same. The lever you control is keeping the instance full (good batching + right-sized autoscaling), and choosing cheaper silicon (inf2) where it fits.
For Bedrock, the unit cost is simply the published per-token price for the model (input and output priced separately), with no idle cost — and reducible further with Batch (roughly half-price for non-real-time work) and prompt caching (cuts the cost of repeated context). Because you pay nothing in the troughs, Bedrock's effective cost per token over a spiky day can beat a self-managed box that sat idle for half of it — even when Bedrock's headline per-token price is higher than the self-managed box's best-case per-token cost. The comparison that matters is therefore not "list price vs list price" but "Bedrock's pay-per-use total vs the self-managed box's 24/7 total" across your actual daily traffic curve.
The honest way to choose, then, is to take your real traffic shape — requests per hour across a representative day, with realistic prompt and completion lengths — and compute the monthly cost three ways: Bedrock per-token (including idle troughs at $0), a JumpStart/GPU endpoint at its achievable utilization, and an inf2 endpoint at its achievable utilization. The cheapest depends almost entirely on how steady and high your volume is. The representative table below shows the shape of that answer; your numbers come from current AWS pricing and a benchmark of your specific model.
Autoscaling is how a self-managed endpoint reconciles "keep it full" (for low unit cost) with "don't pay for idle." You set a minimum instance count (your always-warm floor), a maximum (your ceiling for bursts), and a target metric (concurrency or invocations per instance) that triggers scale-out and scale-in. The tension is the cold-start: spinning up a new LLM instance is not instant — provisioning, pulling a multi-gigabyte container, and loading large weights into accelerator memory can take from tens of seconds to a few minutes, during which a just-arrived request either waits or is rejected. So scaling-to-zero minimizes cost but exposes users to a cold-start when traffic resumes; keeping a warm minimum eliminates the cold-start but pays for an always-on floor. The right setting is a product decision: latency-critical user-facing features usually keep a warm minimum, while internal or batch-tolerant workloads can scale to zero and accept the occasional cold-start. Mitigations include SageMaker warm pools, provisioned/min-capacity floors, smaller or quantized models that load faster, and routing burst overflow to Bedrock (which has no cold-start you manage) while your own fleet scales up.
| Traffic shape | Bedrock (pay-per-token) | JumpStart / GPU endpoint | Self-host inf2 (well-utilized) | Cheapest (typical) |
|---|---|---|---|---|
| Low / spiky / unpredictable | Pay only for use; $0 idle | Pays 24/7 even when idle — poor fit | Pays 24/7 even when idle — poor fit | Bedrock |
| Medium, somewhat steady | Competitive; simplest | Competitive if kept busy | Cheapest if Neuron path + busy | Tie / depends on utilization |
| High, steady, predictable | Per-token adds up at volume | Low unit cost at high utilization | Lowest unit cost when full | Self-host inf2 (then GPU) |
| Steady baseline + bursts | Great for the bursts | Good for the baseline | Cheapest for the baseline | Mix: self-managed base + Bedrock burst |
The instinct to self-host "to save money" is often wrong, and sometimes exactly right. Here is the honest version of when a managed path (Bedrock or JumpStart) beats running your own server — and the narrower set of cases where self-hosting genuinely wins.
Managed wins more often than engineers expect, for one structural reason: the cost advantage of self-hosting is conditional on high utilization, but the operational cost of self-hosting is unconditional. You pay the ops tax — building, scaling, patching, on-call — whether the box is full or empty, and you only collect the unit-cost reward if you keep it full. For the many teams whose traffic is not yet steady-and-high, a managed path is both cheaper (no idle burn, no engineer-months) and faster. The time-to-value gap compounds the point: Bedrock is live in minutes and JumpStart in an afternoon, while a hardened self-hosted fleet with proper autoscaling and observability is a multi-week project that then needs ongoing care.
A mature open-LLM deployment on AWS is frequently a mix: a self-managed endpoint (JumpStart or self-hosted inf2) carrying the steady, predictable baseline at low unit cost, with Bedrock absorbing bursts and the long tail of spiky traffic at zero idle cost — and a fast managed path used early for time-to-market before any of it is optimized. The right architecture routes each slice of traffic to the option whose economics fit its shape, rather than betting the whole workload on one path.
Deploying an open LLM on AWS is two problems: doing the deployment well (the right path, tuned autoscaling, optimized unit cost) and paying for it. A vetted AWS partner plus AWS credits solve both — and the credits take the cash cost to zero.
Whichever path fits, the compute it runs on is exactly what AWS credits are designed to cover. SageMaker endpoint instance-hours and self-hosted EC2 GPU/inf2 hours are standard AWS compute; Bedrock tokens are standard Bedrock usage. All of it draws from the same credit pools: AWS Activate (up to $100K), Bedrock / GenAI PoC funding ($10K–$50K to prove out a model or feature), and the Generative AI Accelerator (up to $1M for selected AI-first companies). For an inference workload — which, once live, runs continuously — that credit runway can cover a production open-LLM deployment for a long time.
But the deployment itself benefits from expertise that is easy to underestimate. Choosing the path (Bedrock vs JumpStart vs self-host), picking GPU versus Inferentia, getting a clean Neuron compilation path if you go inf2, tuning autoscaling and cold-start behavior, and routing traffic so each slice runs where it is cheapest — this is real engineering, and getting it wrong is the difference between a deployment that quietly drains budget and one that is fast, cheap, and stable. CloudRoute (cloudroutehq.com) addresses both halves: it routes you to a vetted AWS partner who actually does the deployment — stands up the JumpStart endpoint or the vLLM/TGI fleet or the Bedrock import, tunes the autoscaling and the FinOps, and benchmarks the unit cost across paths — and files the AWS credit applications through the ACE program so the bill is funded from day one.
The economics for you: the customer pays $0. AWS funds the credit pool because it wants production GenAI on AWS for the long term; the partner is paid by AWS through engagement-funding programs; CloudRoute is paid by the partner as a routing commission. You never see an invoice from CloudRoute. You get an open LLM deployed on the right path, autoscaling and unit cost tuned by people who do this daily, and credits that cover the GPU/accelerator hours or Bedrock tokens — a deployment that is funded and optimized rather than improvised and expensive. For the broader credit picture, see AWS credits for generative-AI startups and how AWS PoC / Bedrock POC funding works.
The three paths to running an open LLM on AWS, against the dimensions that actually decide it. There is no single winner — match the row that describes your traffic and team to the column that fits.
| Dimension | Amazon Bedrock | SageMaker JumpStart | Self-host (EC2 GPU / inf2) |
|---|---|---|---|
| What it is | Managed foundation-model API | Managed endpoint you control | Your own inference server |
| Cost model | Per token — $0 when idle | Per instance-hour (your instance) | Per instance-hour (your fleet) |
| Best traffic shape | Spiky / low / unpredictable | Medium → high, steady-ish | High, steady, predictable |
| Lowest unit cost at scale | No (per-token premium at volume) | Good when kept busy | Yes — best when full (esp. inf2) |
| Effort to launch | Minutes (API call) | An afternoon (few clicks/SDK) | Weeks (build, scale, harden) |
| Ongoing ops | None — fully managed | Light — AWS runs serving layer | Heavy — you own the fleet |
| Model availability | Catalog + Custom Model Import | Hundreds of hub models + your artifact | Anything you can download |
| Custom serving stack | No | Limited (managed container) | Full — any engine/version |
| Cold-starts | None you manage | Yes if scaling to zero (tunable) | Yes — you own warm-pool logic |
| Cash cost with CloudRoute | $0 — credits cover tokens | $0 — credits cover endpoint hours | $0 — credits cover EC2/inf hours |
Situation: The team had fine-tuned an open-weight base model and proven it in a notebook, but had never deployed an LLM in production. They were unsure whether to put it on Bedrock (their fine-tune was not a catalog model), stand up SageMaker JumpStart, or hand-build a vLLM server on GPU — and they were worried the GPU bill would be unpredictable and large. They had no ML-infra/SRE capacity to spare and no AWS credits cushioning the cost.
What CloudRoute did: CloudRoute routed them within a day to an AWS partner with SageMaker and inference-FinOps experience. The partner deployed the fine-tuned model as a SageMaker JumpStart-style managed endpoint (their own artifact, in their account, on a right-sized GPU instance), validated latency, and configured Application Auto Scaling with a small warm minimum for the weekday baseline plus burst headroom. They benchmarked cost-per-million-tokens on GPU versus inf2 and versus a Bedrock equivalent, routed the spiky evening overflow to Bedrock to avoid an oversized always-on floor, and filed Activate plus GenAI PoC credits through ACE to cover the endpoint hours.
Outcome: The model was serving production traffic on a managed, autoscaled endpoint within the week, with a warm floor for latency and Bedrock absorbing bursts so the GPU fleet stayed right-sized. Benchmarked unit cost landed well below their feared GPU bill, and credits covered the instance-hours and Bedrock tokens — taking the cash cost of the deployment to roughly $0 for the credit runway. The partner did the deployment and the FinOps; the team kept building product. CloudRoute was paid by the partner from AWS engagement funding — the company paid $0.
path: JumpStart-style managed endpoint + Bedrock burst · time-to-production: ~1 week · unit cost vs feared GPU bill: well below · cost to customer: $0
CloudRoute connects ML teams with vetted AWS partners who deploy the model on the right path (Bedrock, SageMaker JumpStart, or self-hosted vLLM/TGI), tune the autoscaling and unit cost, and file the AWS credits that cover the bill. Customer pays $0 — AWS funds it.