Nvidia GPUs (P5/H100/H200) are the default and the most expensive. AWS Trainium and Inferentia are 30–50% cheaper per unit of work — if your model ports cleanly to the Neuron SDK, which is the real catch nobody quotes you. This guide does the price-performance math for training and inference, shows when each chip actually wins, and where Bedrock's managed pricing beats all of them.
There are really three distinct ways to run AI workloads on AWS in 2026: rent Nvidia GPUs by the hour, rent AWS's own silicon (Trainium for training, Inferentia for inference), or skip owning accelerators entirely and call a managed model on Amazon Bedrock. They are priced on completely different axes, which is why naive per-hour comparisons mislead.
The instinct is to compare hourly instance prices and pick the lowest. That is the wrong frame. A Trainium instance can show a lower hourly rate than a comparable GPU instance and still cost you more for a given training run if your model trains slower on it, or if you burned two engineer-weeks porting to it. Conversely, a GPU instance can look expensive per hour and be the cheapest path to a finished model because it just works on day one. The unit that matters is dollars per unit of useful work — dollars per training run to a target loss, or dollars per million inference tokens served at your latency SLA — not dollars per hour.
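The point about dollars per unit of work can be made concrete with a small sketch. The function name `cost_per_run` and every input below (rates, hours, engineer cost) are hypothetical placeholders chosen for illustration, not quotes or benchmark results.

```python
def cost_per_run(hourly_rate: float, wall_clock_hours: float,
                 porting_hours: float = 0.0, engineer_hourly: float = 0.0) -> float:
    """Dollars to finish one training run: compute plus any one-time porting labor."""
    compute = hourly_rate * wall_clock_hours
    porting = porting_hours * engineer_hourly
    return compute + porting


# All inputs are hypothetical placeholders, not quotes or benchmarks.
gpu_run = cost_per_run(hourly_rate=98.0, wall_clock_hours=100.0)

# Cheaper per hour and nearly as fast: the lower rate wins.
alt_fast = cost_per_run(hourly_rate=60.0, wall_clock_hours=120.0)

# Cheaper per hour but much slower on this model: the lower rate loses.
alt_slow = cost_per_run(hourly_rate=60.0, wall_clock_hours=180.0)

# Cheaper per hour and fast, but a two-week port is charged to a single run.
alt_ported_once = cost_per_run(60.0, 120.0, porting_hours=80.0, engineer_hourly=150.0)

for label, dollars in [("gpu", gpu_run), ("alt_fast", alt_fast),
                       ("alt_slow", alt_slow), ("alt_ported_once", alt_ported_once)]:
    print(f"{label:>16}: ${dollars:,.0f}")
```

With these made-up inputs the same $60/hour instance comes out at $7,200, $10,800, or $19,200 against a $9,800 GPU run, depending only on wall-clock time and whether porting labor is charged to the run.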
AWS deliberately offers all three because they map to different buyer situations. GPUs are the universal substrate: maximum ecosystem compatibility, maximum flexibility, maximum price, and chronic capacity scarcity for the newest parts. Trainium and Inferentia are AWS's bet that for a large slice of mainstream workloads it can deliver materially better price-performance using silicon it designs itself and is not paying Nvidia's margin on — provided you accept its software stack, the Neuron SDK. Bedrock is the abstraction on top of all of it: you never see a GPU, you pay per token or per provisioned throughput, and AWS owns the capacity-planning headache.
This guide walks each option on its real economics, then does the break-even math, then gives you a decision table for training versus serving. Every dollar and throughput figure below is a hedged 2026 estimate drawn from public list pricing and typical benchmark ranges — treat them as directional planning numbers, not quotes. Your actual numbers depend on region, instance generation, reservation terms, model architecture, and how well your specific workload maps to each chip.
GPUs cost more and always work; Trainium/Inferentia cost 30–50% less per unit of work but only after a Neuron SDK port whose effort ranges from trivial to impossible depending on your model; Bedrock removes the chip decision entirely and charges per token. The right answer is usually a portfolio, not a single chip.
Nvidia GPUs are the substrate the entire AI ecosystem is built on. On AWS that means the P5 family (H100, 8 GPUs per instance), P5e and P5en (H200, more memory and faster networking), and the older P4d/P4de (A100) instances that are now the budget tier. Everything — PyTorch, JAX, vLLM, TensorRT-LLM, every model on Hugging Face, every custom CUDA kernel — runs on them unchanged.
The headline economics: a P5 instance (8× H100, 640 GB of HBM total) lists at roughly $98/hour on-demand in 2026. That is about $12.25 per GPU-hour. The H200-based P5e and P5en instances list higher (call it the $110–$130/hour range for the 8-GPU node), and you are paying for the extra HBM (141 GB per H200 versus 80 GB per H100) and, on P5en, substantially faster EFA networking that matters for multi-node training. The older P4d (8× A100 40 GB) sits well below, in the $32/hour band on-demand, and is the value choice for workloads that fit in A100-class memory and do not need Hopper-generation throughput.
Those on-demand numbers are the worst-case price. Almost nobody training seriously pays on-demand. A 1-year Savings Plan or Reserved commitment typically cuts the rate by 40–50%; a 3-year commitment can approach 60% off. EC2 Capacity Blocks for ML let you reserve a GPU cluster for a fixed window (one day to several weeks, often booked weeks ahead) at a price between on-demand and a long reservation — this is how most teams actually get H100/H200 capacity for a bounded training run without a multi-year commitment.
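To see what those commitment discounts do to the per-GPU rate, here is a minimal sketch. It takes the rough $98/hour P5 list price and the 40%, 50%, and 60% discount levels quoted above as inputs; the function name `effective_rates` is hypothetical, and real discounts vary by region and term.

```python
def effective_rates(on_demand_hourly: float, discount: float, gpus_per_node: int = 8):
    """Return (node $/hour, $/GPU-hour) after applying a commitment discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    node_rate = on_demand_hourly * (1.0 - discount)
    return node_rate, node_rate / gpus_per_node


P5_ON_DEMAND = 98.0  # rough list price from the text, not a quote

# Discount levels are the approximate ranges quoted in the text.
scenarios = {
    "on-demand": 0.00,
    "1-year commitment (low end)": 0.40,
    "1-year commitment (high end)": 0.50,
    "3-year commitment (approx.)": 0.60,
}

for label, discount in scenarios.items():
    node, per_gpu = effective_rates(P5_ON_DEMAND, discount)
    print(f"{label:<30} node ${node:6.2f}/hr   per GPU ${per_gpu:5.2f}/hr")
```

Capacity Blocks are left out because the text only places their price somewhere between on-demand and a long reservation, without a figure.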
The defining constraint on GPUs in 2026 is not price, it is availability. P5 and P5e capacity is rationed in most regions; getting a large contiguous cluster (say 16–64 nodes for a real pretraining run) on demand is frequently impossible, which is exactly why Capacity Blocks and reservations exist. This scarcity is the single biggest reason teams look at Trainium at all: AWS has far more of its own silicon to allocate than it has Nvidia parts.
The strategic point about GPUs: you are not just renting compute, you are buying out of all porting risk and into the entire CUDA ecosystem. If your work involves bleeding-edge model architectures, custom kernels (FlashAttention variants, fused ops, Triton kernels), exotic quantization, or any framework that is not mainstream PyTorch, the GPU is not the expensive option — it is the only option that works without an open-ended engineering project. You pay the Nvidia premium to make the software problem disappear.
Trainium is AWS's purpose-built training accelerator. Trn1 instances (Trainium1) and the newer Trn2 (Trainium2) are designed for one job: training and fine-tuning deep-learning models at a materially better price-performance than renting Nvidia silicon. The pitch is roughly 30–50% lower cost to reach the same trained model — and for large, well-suited models AWS markets even larger gains on Trn2.
The instance shape: a Trn1 instance packs 16 Trainium1 accelerators and lists in the rough $21–$22/hour on-demand range — a fraction of a comparable H100 node's hourly rate. Trn2 raises the per-instance performance substantially (more accelerators, more memory, faster NeuronLink interconnect) and is positioned for large-model training and even some large-model inference. With Savings Plans the effective Trainium rate drops further, and because AWS has more of its own silicon to allocate, capacity is generally easier to secure than H100/H200.
The price-performance claim is credible for the workloads Trainium is tuned for: standard transformer training and fine-tuning, where the per-run cost to a target loss lands meaningfully below the GPU equivalent once you account for both the lower hourly rate and competitive throughput. AWS publishes the strongest numbers for large language-model pretraining and fine-tuning at scale, which is exactly the workload where a 30–50% saving on a six- or seven-figure compute bill is worth a porting project.
But the per-run saving only materializes if your model trains efficiently on Trainium, and that is entirely a function of the Neuron SDK — the software layer that compiles your model to the chip. This is the catch the hourly price never shows, and it is important enough to get its own section below. The short version: a standard PyTorch transformer often ports in days; a model with custom CUDA kernels, unusual ops, or a non-mainstream framework can take weeks or simply not be supported.
There is also a throughput-versus-rate subtlety. Trainium's lower hourly rate does not automatically mean lower cost-to-train, because if a given model runs at lower hardware utilization on Trainium than on an H100, the wall-clock training time stretches and eats into the rate advantage. For models in Trainium's sweet spot the net is still a clear win; for poorly-suited models the rate advantage can evaporate. The only honest way to know is to compile your actual model and benchmark a short run on both before committing a long one.
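The throughput-versus-rate trade-off reduces to one ratio: the cheaper chip wins only if its measured throughput, as a fraction of the GPU's, exceeds its hourly rate as a fraction of the GPU's. The sketch below assumes you have measured that relative throughput on your own model; the function names and the $60 and $98 rates are hypothetical placeholders.

```python
def break_even_throughput(alt_hourly: float, gpu_hourly: float) -> float:
    """Minimum throughput (as a fraction of GPU throughput) at which the
    alternative chip matches the GPU on cost-to-train."""
    return alt_hourly / gpu_hourly


def relative_cost_to_train(alt_hourly: float, gpu_hourly: float,
                           relative_throughput: float) -> float:
    """Alternative-chip cost divided by GPU cost for the same training run.

    relative_throughput is alt samples/sec divided by GPU samples/sec,
    measured on your own model. Below 1.0 means the alternative is cheaper.
    """
    if relative_throughput <= 0:
        raise ValueError("relative_throughput must be positive")
    return (alt_hourly / gpu_hourly) / relative_throughput


# Hypothetical node rates for two configurations being compared.
ALT, GPU = 60.0, 98.0

print(f"break-even relative throughput: {break_even_throughput(ALT, GPU):.2f}")
for rel in (0.9, 0.7, 0.5):
    ratio = relative_cost_to_train(ALT, GPU, rel)
    verdict = "cheaper" if ratio < 1.0 else "more expensive"
    print(f"relative throughput {rel:.1f}: {ratio:.2f}x GPU cost ({verdict})")
```

Compare like with like: the two rates and the throughput ratio must describe the same pair of configurations, for example one node against one node.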
Trainium's 30–50% cost advantage is real for models in its sweet spot (mainstream transformer training/fine-tuning) after a successful Neuron port. The decision is never "is Trainium cheaper per hour" — it is "does my model compile and run efficiently on Neuron, and is the porting cost amortized over enough training runs to come out ahead." Benchmark a short run before you commit a long one.
Inferentia is the inference counterpart to Trainium. Inf2 instances (Inferentia2) are built to serve models — LLMs, embeddings, vision, recommendation — at a much lower cost per inference than GPUs, with the same Neuron SDK caveat. Inference, not training, is where most production AI bills actually accumulate over time, which makes Inferentia the higher-leverage cost decision for many companies.
The instance lineup spans a wide range: Inf2 starts small (inf2.xlarge with a single Inferentia2 accelerator, at under a dollar per hour on-demand) and scales up to inf2.48xlarge with 12 accelerators in the roughly $12–$13/hour band. The value proposition is cost per million tokens (or per million inferences) served at your latency target. For a high-throughput, steady-state serving workload (a chatbot, a classification API, an embedding pipeline running 24/7), Inferentia frequently lands 40–60% below the per-inference cost of serving the same model on a comparably-sized GPU instance.
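Cost per million tokens is simple to compute once you have a measured throughput. In this sketch the $12.50/hour rate sits inside the inf2.48xlarge band quoted above, while the 5,000 tokens/second throughput and the utilization levels are assumptions for illustration; `cost_per_million_tokens` is a hypothetical helper name.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per million tokens served by an always-on instance.

    utilization is the fraction of peak throughput actually used; idle time
    still costs the full hourly rate, so low utilization raises the unit cost.
    """
    if tokens_per_second <= 0 or not 0.0 < utilization <= 1.0:
        raise ValueError("need positive throughput and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


HOURLY = 12.5        # within the rough inf2.48xlarge band from the text
PEAK_TPS = 5_000.0   # assumed throughput at your latency target (hypothetical)

for util in (1.0, 0.5, 0.1):
    unit_cost = cost_per_million_tokens(HOURLY, PEAK_TPS, util)
    print(f"utilization {util:4.0%}: ${unit_cost:.2f} per million tokens")
```

The same instance is ten times more expensive per token at 10% utilization than at full load, which is why steady volume matters so much for self-hosted serving.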
Why inference is the bigger prize than training: a training run is a bounded, periodic expense — you train or fine-tune, then you are done for a while. Inference is the bill that runs forever, scaling with traffic, every hour of every day the product is live. A 50% reduction in serving cost compounds month after month in a way a one-time training saving does not. For a company serving meaningful inference volume, optimizing the serving stack onto Inferentia is often the single largest AI-infrastructure cost lever available.
The same Neuron porting tax applies, but with an important asymmetry: inference graphs are generally simpler and more static than training graphs, so the Neuron compiler tends to handle them more readily. Popular open-weight model families (Llama-class models, many Hugging Face transformers, common embedding models) have well-trodden Inferentia serving paths and reference deployments. If you are serving a standard open model, the Inferentia port is often the easier of the two; if you are serving something custom or exotic, the same week-plus porting risk reappears.
A critical scoping note: Inferentia only helps if you are self-hosting the model and serving enough volume to keep the instances busy. If your traffic is spiky or low, a constantly-running Inf2 instance can cost more than a pay-per-token managed API would — you are paying for idle silicon. This is precisely the boundary where Amazon Bedrock's consumption pricing wins, which the next section addresses.
Every Trainium and Inferentia saving in this guide is gated behind one thing: getting your model to compile and run efficiently on the AWS Neuron SDK. This is the catch the hourly price never reflects, and it is the single most common reason a "cheaper" AWS-silicon plan ends up costing more than staying on GPUs. Understanding the porting spectrum is the most important practical decision in the whole comparison.
Neuron is the compiler-and-runtime stack that turns your model into something Trainium or Inferentia can execute. It plugs into PyTorch (and supports JAX and other paths to varying degrees) through a layer that traces your model graph, compiles it for the Neuron cores, and runs it. When your model uses standard, well-supported operations, this is close to transparent — you change a few lines, compile, and run. When your model uses operations Neuron does not support, or relies on custom CUDA kernels that have no Neuron equivalent, you hit a wall that ranges from "rewrite this op" to "this is not feasible right now."
The porting effort sorts into a rough spectrum. A vanilla PyTorch transformer using standard layers and attention — the most common case — typically ports in a few days, mostly spent on compilation tuning and getting throughput acceptable. A model with a non-standard but expressible architecture takes one to two weeks, with real time spent finding supported substitutes for unsupported ops. A model built around custom CUDA kernels, fused operations, Triton kernels, or a bleeding-edge architecture can take several weeks, and a meaningful fraction of those attempts conclude that the workload should stay on GPU until Neuron support catches up. There is no way to know which bucket you are in without trying to compile your specific model.
This is why the only defensible way to evaluate Trainium or Inferentia is a small spike: take your actual model, attempt the Neuron port, compile it, and benchmark a short run against the GPU baseline. That spike costs a few engineer-days and a small amount of compute, and it converts an open-ended risk into a known number. Skipping the spike and committing to AWS silicon on the strength of the marketing price-performance figures is how teams end up two weeks into a port with a launch slipping.
The amortization math matters as much as the porting effort itself. A two-week port that saves 40% on a workload you run once is a loss. The same two-week port on a model you retrain monthly for two years, or serve continuously at high volume, pays for itself many times over. Inference workloads, because they run perpetually, almost always clear this bar; one-off training runs frequently do not. Frame the porting cost as a fixed investment and ask how many runs (or how many months of serving) it takes to break even — the answer usually decides it cleanly.
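The amortization question can be written down directly: treat the port as a fixed cost and count the runs needed to recover it. The sketch below uses the 40% saving and two-week port from the text; the $150/hour engineer cost, the $4,000 per-run GPU bill, and the function names are hypothetical.

```python
import math
from typing import Optional


def break_even_runs(porting_cost: float, gpu_cost_per_run: float,
                    saving_fraction: float) -> Optional[int]:
    """Number of runs after which cumulative savings cover the porting cost.

    Returns None if there is no per-run saving, so the port never pays back.
    """
    saving_per_run = gpu_cost_per_run * saving_fraction
    if saving_per_run <= 0:
        return None
    return math.ceil(porting_cost / saving_per_run)


def net_saving(runs: int, porting_cost: float, gpu_cost_per_run: float,
               saving_fraction: float) -> float:
    """Cumulative dollars saved (negative means still underwater) after `runs`."""
    return runs * gpu_cost_per_run * saving_fraction - porting_cost


# Hypothetical inputs: two engineer-weeks at $150/hour, a $4,000 GPU run.
PORTING_COST = 2 * 40 * 150.0
GPU_RUN = 4_000.0
SAVING = 0.40  # the 40% figure used in the text

print("break-even runs:", break_even_runs(PORTING_COST, GPU_RUN, SAVING))
print("net after 1 run:", net_saving(1, PORTING_COST, GPU_RUN, SAVING))
print("net after 24 monthly runs:", net_saving(24, PORTING_COST, GPU_RUN, SAVING))
```

With these inputs a single run loses money and the port breaks even on the eighth run. For serving, substitute months of operation for runs and monthly serving cost for per-run cost.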
Budget the Neuron port as a real engineering line item: a few days for a standard PyTorch transformer, one to two weeks for a non-standard architecture, several weeks (or "not yet") for custom-kernel or bleeding-edge models. Then ask how many training runs or months of serving amortize it. Always run a short benchmark spike before committing a long workload — it is the cheapest insurance in this entire decision.
The remaining path is to not own accelerators at all. Amazon Bedrock is AWS's managed model service: you call a hosted foundation model (Anthropic Claude, Meta Llama, Amazon Nova, Mistral, and others) through an API and pay per token of input and output, or reserve dedicated capacity via Provisioned Throughput. There is no instance to manage, no Neuron port, no capacity to reserve, and for a large set of use cases it is the cheapest total-cost option precisely because you pay only for what you use.
Bedrock is priced on a fundamentally different axis: dollars per million tokens for on-demand usage, or a fixed hourly rate for Provisioned Throughput when you need guaranteed capacity and predictable latency at high volume. The on-demand model means an idle application costs nothing — there is no instance ticking over at $12/hour while traffic is light. This flips the entire economics versus self-hosting: where Inferentia wins on steady high volume, Bedrock wins on variable, spiky, or moderate volume where you would otherwise be paying for idle silicon.
The break-even between Bedrock on-demand and self-hosting on Inferentia is fundamentally a utilization question. Below some traffic threshold, per-token pricing is cheaper because you only pay for the tokens you actually process. Above that threshold — when an Inf2 instance would run hot enough that its hourly cost divided by tokens served beats the per-token rate — self-hosting pulls ahead. Bedrock Provisioned Throughput sits in between: a committed hourly capacity buy for teams with high, predictable volume who still do not want to operate the serving stack themselves. The crossover depends on your exact token mix and model, so model it with your real traffic numbers.
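One way to model that crossover is as a break-even utilization for a single always-on instance. The sketch below is a simplification under stated assumptions: one blended per-token price (real Bedrock pricing separates input and output tokens), a hypothetical $1.00 per million tokens, a hypothetical 5,000 tokens/second of instance throughput, and a $12.50/hour rate from the Inf2 band in the text.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def bedrock_monthly(tokens_per_month: float, price_per_million: float) -> float:
    """On-demand bill: you pay only for tokens processed."""
    return tokens_per_month / 1_000_000 * price_per_million


def self_hosted_monthly(hourly_rate: float, instances: int = 1) -> float:
    """Always-on bill: you pay for every hour, busy or idle."""
    return hourly_rate * HOURS_PER_MONTH * instances


def break_even_utilization(hourly_rate: float, tokens_per_second: float,
                           price_per_million: float) -> float:
    """Fraction of one instance's monthly capacity at which self-hosting
    matches per-token pricing. A value above 1.0 means it never does."""
    capacity_tokens = tokens_per_second * 3600 * HOURS_PER_MONTH
    break_even_tokens = self_hosted_monthly(hourly_rate) / price_per_million * 1_000_000
    return break_even_tokens / capacity_tokens


# Hypothetical inputs; replace with your own traffic and measured throughput.
HOURLY, TPS = 12.5, 5_000.0
for price in (2.00, 1.00, 0.50):
    util = break_even_utilization(HOURLY, TPS, price)
    note = "self-hosting can win above this" if util <= 1.0 else "per-token always wins"
    print(f"${price:.2f}/M tokens: break-even utilization {util:.0%} ({note})")
```

If the break-even utilization comes out above 100%, the instance cannot serve enough tokens to beat the per-token price at any traffic level, and the managed API is the cheaper option for that model.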
Bedrock also eliminates two costs the chip comparison tends to ignore: the operational burden of running a serving fleet (autoscaling, patching, monitoring, on-call) and the capacity-planning risk. For a team without dedicated ML-infrastructure engineers, those hidden costs can swamp the nominal per-token premium. Many companies run a deliberately mixed strategy — Bedrock for the long tail of features and variable traffic, self-hosted Inferentia for the one or two high-volume endpoints where the unit economics justify owning the stack.
There is one category where Bedrock is essentially the only sensible answer: using frontier proprietary models like Claude. You cannot self-host those models on your own GPUs or Inferentia at all — they are available to you only as a managed API. So the question "GPU vs Trainium vs Inferentia vs Bedrock" partly dissolves: if you want a frontier closed model, Bedrock (or the equivalent managed endpoint) is the path, and the chip debate only applies to open-weight models you can actually run yourself.
The decision reduces to a few break-even calculations. None of them depend on the marketing price-performance numbers; they depend on your utilization, your porting cost, and how many times you run the workload. Here is the framework, with worked logic you can drop your own numbers into.
Training break-even (GPU vs Trainium): the question is whether the per-run compute saving on Trainium, multiplied by the number of runs, exceeds the one-time Neuron porting cost. If porting takes two engineer-weeks (call that a fixed cost in engineer-time) and Trainium saves, say, 40% of a per-run compute bill, then the more times you run that training the faster you cross into profit. A model you fine-tune once: stay on GPU. A model you retrain weekly or monthly for a year or more: Trainium almost certainly wins after the first few runs, and the saving compounds from there.
Inference break-even (Bedrock vs Inferentia): this is a utilization crossover. At low or spiky volume, Bedrock's per-token pricing wins because you pay nothing for idle capacity. As steady volume rises, there is a point where a continuously-busy Inf2 instance's hourly cost divided by the tokens it serves drops below the per-token rate — past that point, self-hosting on Inferentia wins. The Neuron porting cost shifts that crossover to the right (you need more volume to justify it), but because inference runs perpetually, high-volume endpoints clear the bar easily.
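The rightward shift from porting cost is easy to quantify if you spread the port over a planning horizon. This sketch assumes a single blended per-token price and straight-line amortization; the $12,000 porting cost, the $1.00 per million tokens, and the horizons are hypothetical, and `break_even_tokens_per_month` is an illustrative name.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def break_even_tokens_per_month(hourly_rate: float, price_per_million: float,
                                porting_cost: float = 0.0,
                                amortization_months: int = 12) -> float:
    """Monthly token volume above which one self-hosted instance beats
    per-token pricing, with the porting cost spread evenly over the horizon."""
    if amortization_months <= 0 or price_per_million <= 0:
        raise ValueError("need a positive horizon and a positive token price")
    monthly_fixed = hourly_rate * HOURS_PER_MONTH + porting_cost / amortization_months
    return monthly_fixed / price_per_million * 1_000_000


# Hypothetical inputs for illustration.
HOURLY, PRICE, PORT = 12.5, 1.00, 12_000.0

base = break_even_tokens_per_month(HOURLY, PRICE)
one_year = break_even_tokens_per_month(HOURLY, PRICE, PORT, amortization_months=12)
three_year = break_even_tokens_per_month(HOURLY, PRICE, PORT, amortization_months=36)

print(f"no porting cost:      {base / 1e9:.2f}B tokens/month")
print(f"port over 12 months:  {one_year / 1e9:.2f}B tokens/month")
print(f"port over 36 months:  {three_year / 1e9:.2f}B tokens/month")
```

The longer the endpoint is expected to run, the smaller the shift, which is the sense in which perpetual inference workloads clear the bar more easily than one-off jobs.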
Inference break-even (GPU vs Inferentia): given that you have already decided to self-host (rather than use Bedrock), Inferentia almost always beats a GPU on cost per inference for any model that ports cleanly to Neuron, because you are not paying the Nvidia premium and inference graphs port more readily than training graphs. The GPU only wins here when the model will not port, when you need the absolute lowest latency on a large model with a mature GPU-only serving stack, or when the volume is too low to amortize the port — in which case Bedrock was probably the better answer anyway.
The meta-point: every one of these break-evens is sensitive to two numbers the vendor pricing pages never show you — your real utilization and your real porting cost. Get those two numbers from a short benchmarking spike and a candid engineering estimate, and the decision usually makes itself. Skip them and you are guessing.
AWS credits and POC/Well-Architected funding can cover the GPU and training bill outright while you run this math — which removes the riskiest variable. With the cluster funded, you can afford to train on GPUs (zero porting risk) for the first runs, benchmark the Neuron port in parallel, and only migrate to Trainium/Inferentia once you have proven the saving. The credits buy you the option to choose correctly instead of choosing under cost pressure.
Pulling it together: the right answer almost always separates the training decision from the serving decision, because the workloads have different economics. Here is the practical default for each, with the conditions that flip it.
Most mature AI teams on AWS end up running all four simultaneously, and that is the correct outcome rather than a failure to standardize: GPUs for research and first-run training, Trainium for the stable production models they retrain on a cadence, Inferentia for the high-volume serving endpoints, and Bedrock for everything variable, experimental, or dependent on a frontier closed model. The skill is matching each workload to the axis it is cheapest on — not picking one winner.
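The defaults described in this guide can be sketched as a small routing function. This is one reading of the heuristics, not an official decision procedure: the `Workload` fields and the `recommend` function are hypothetical names, and the booleans stand in for answers you would get from a benchmark spike and your own traffic data.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    kind: str                            # "training" or "inference"
    frontier_closed_model: bool = False  # e.g. a model only offered as a managed API
    ports_to_neuron: bool = False        # confirmed by a benchmark spike
    retrained_on_cadence: bool = False   # training: enough runs to amortize a port
    steady_high_volume: bool = False     # inference: instances would stay busy


def recommend(w: Workload) -> str:
    """Map a workload to the option this guide treats as its default."""
    if w.frontier_closed_model:
        return "Bedrock"  # closed models cannot be self-hosted
    if w.kind == "training":
        if w.ports_to_neuron and w.retrained_on_cadence:
            return "Trainium"
        return "GPU"  # research, custom kernels, one-off or first runs
    if w.kind == "inference":
        if not w.steady_high_volume:
            return "Bedrock"  # spiky or low traffic: avoid paying for idle silicon
        return "Inferentia" if w.ports_to_neuron else "GPU"
    raise ValueError(f"unknown workload kind: {w.kind!r}")


portfolio = [
    Workload("training"),
    Workload("training", ports_to_neuron=True, retrained_on_cadence=True),
    Workload("inference", ports_to_neuron=True, steady_high_volume=True),
    Workload("inference"),
    Workload("inference", frontier_closed_model=True, steady_high_volume=True),
]
for w in portfolio:
    print(recommend(w), "<-", w)
```

Running it over a mixed portfolio tends to return several different answers, which matches the point that a portfolio is the expected outcome.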
All figures are hedged 2026 estimates from public list pricing and typical benchmark ranges — directional planning numbers, not quotes. Effective prices fall substantially with Savings Plans and reservations; Bedrock is priced per token, not per hour.
| Dimension | Nvidia GPU (P5/P5e) | Trainium (Trn1/Trn2) | Inferentia (Inf2) | Bedrock (managed) |
|---|---|---|---|---|
| Primary job | Train + serve (universal) | Training / fine-tuning | Inference / serving | Inference (managed API) |
| On-demand price (rough) | ~$98/hr (P5, 8× H100) | ~$21–22/hr (Trn1, 16 acc.) | ~$12–13/hr (inf2.48xlarge) | Per million tokens |
| Relative cost per unit of work | Baseline (highest) | ~30–50% below GPU (training) | ~40–60% below GPU (serving) | Wins at low/variable volume |
| Porting effort | None — CUDA, runs as-is | Neuron SDK: days → weeks | Neuron SDK: often easier | None — API call |
| Capacity availability | Scarce (rationed) | Generally easier | Generally easier | Managed by AWS |
| Idle cost | Full hourly rate | Full hourly rate | Full hourly rate | $0 (on-demand) |
| Best for | Research, custom kernels, first runs | Stable models retrained often | High-volume steady serving | Spiky traffic, frontier closed models |
Situation: The customer was burning real money on P5 instances to fine-tune their model every two weeks, and serving inference on the same GPUs at low utilization. The bill was the second-largest line item after payroll. They suspected Trainium/Inferentia would be cheaper but had no spare engineering cycles to gamble two weeks on a Neuron port that might not work, and no budget cushion to run both stacks in parallel while they figured it out.
What CloudRoute did: Routed within a day to a vetted AWS partner with a Neuron-SDK and applied-ML track record. The partner first filed for AWS POC / credit funding to cover the existing GPU training bill, which removed the cost pressure. With the cluster funded, they ran a one-week benchmarking spike: ported the (standard PyTorch transformer) model to Neuron, confirmed it compiled cleanly, and measured a ~42% per-run training saving on Trainium and a ~55% cost-per-inference saving on Inferentia versus the P5 baseline. Serving moved to Inf2 for the high-volume endpoint; the long-tail experimental features moved to Bedrock on-demand.
Outcome: Steady-state AI compute spend dropped by roughly half within the quarter, with training on Trainium, primary serving on Inferentia, and variable traffic on Bedrock. The GPU training bill during the transition was credit-funded — the customer paid $0 for that compute. CloudRoute's commission was paid by the partner from AWS's engagement funding.
Engagement window: ~6 weeks · Founder/engineering time: ~1 week (the spike) · Steady-state compute cut: ~50% · Transition GPU bill: credit-funded
CloudRoute routes you to a vetted AWS partner who can secure credit/POC funding for the GPU and training bill, then run the Neuron benchmark spike and migrate the workloads that actually pay off. Customer pays $0; AWS funds the engagement.