llama on amazon bedrock · models, pricing, fine-tuning · 2026

Llama on Amazon Bedrock — models, pricing & fine-tuning.

A complete, neutral reference for running Meta's Llama models on Amazon Bedrock in 2026: the Llama family and which size fits which job; what open weights actually buy you versus a closed frontier model; model IDs and how to enable model access; a per-size pricing table; the capabilities that matter (image reasoning and long context, where the generation supports them); fine-tuning and custom-model hosting on Bedrock; when Llama is the right call versus Claude or Amazon Nova; use cases per size; and how AWS credits make running — and fine-tuning — Llama effectively $0.

positioning
open weights
access via
one AWS API
customization
fine-tuning on Bedrock
cost with credits
$0
TL;DR
  • Meta's Llama runs natively on Amazon Bedrock as one of the providers behind Bedrock's single API. You get the current Llama generation across a range of sizes — small models for cheap high-throughput work, mid-size models as the balanced default, and large models for the hardest reasoning — plus the newer multimodal (image-reasoning) and mixture-of-experts variants, all reached through the same Converse API and IAM/VPC controls as every other Bedrock model, with prompts and data staying in your AWS account and region.
  • Llama's distinguishing feature is that it is an open-weights model. On Bedrock that translates into three concrete advantages over closed models: deep customization (fine-tune on your own data and host the custom model), portability (the same model family runs on Bedrock, on SageMaker, or self-hosted, so you are not locked to one vendor's endpoint), and a low cost floor at the small sizes. The trade-off is that the very largest closed frontier models can still lead on the hardest reasoning — which is why many teams mix Llama with Claude behind one API.
  • Pricing is per-token and per-size: small Llama models are among the cheapest on Bedrock at cents per million tokens, mid-size models sit in the middle, and the largest models cost more per token but undercut closed frontier tiers. Fine-tuning adds a one-time training cost plus custom-model hosting. AWS credits (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) cover Llama inference and fine-tuning entirely — CloudRoute routes you to the credit pool and a vetted AWS partner, so you pay $0.
the models

IThe Llama family on Amazon Bedrock

Meta's Llama is available natively on Amazon Bedrock — it is one of the foundation-model providers behind Bedrock's single managed API, alongside Amazon's own Nova and Titan, Anthropic Claude, Mistral, Cohere, and others. Unlike a closed frontier model, Llama ships as a family of open-weights models across several sizes, and picking the right size is the central cost-and-quality decision.

The Llama lineup is organized by parameter size rather than by named tiers, and Bedrock typically offers several sizes from the current generation plus one or more recent generations. As of 2026 the durable shape is: small Llama models (in the single-digit-billions parameter range — think an 8B-class model) tuned for speed and very low cost on high-volume, simpler tasks; mid-size Llama models (tens of billions of parameters — a 70B-class model) that serve as the balanced default for most production reasoning at a moderate price; and large Llama models (the biggest dense or mixture-of-experts variants — a 405B-class model and the newer large MoE generations) for the hardest reasoning where quality dominates. Newer Llama generations also add multimodal (vision) sizes that reason over images, and mixture-of-experts designs that activate only part of the network per token to deliver large-model quality at lower effective cost.

The practical discipline is the same one that governs all Bedrock cost: match the model to the task. Use a small Llama for the easy, high-volume requests; use a mid-size Llama as the default for real work; reserve the largest Llama for the genuinely hard requests where its extra reasoning earns the higher price. As with any model family, a tiered router — a cheap model triages and handles the bulk, escalating only the hard cases to a stronger one — routinely cuts spend several-fold with little quality loss.

Because all sizes are served through the same Bedrock API, switching between them is usually a one-line change to the model ID, which makes that route-and-escalate pattern easy to build and tune. The capabilities (the Converse API surface, tool use, long context where the generation supports it, and image reasoning on multimodal sizes) and the security model are consistent across the family, so you design once and choose the size per request.

One caveat, stated once and meant throughout: exact model version names, model IDs, parameter sizes, regional availability, context-window sizes, multimodal support, and per-token prices all change frequently as Meta ships new Llama generations and AWS updates Bedrock. The figures and identifiers here are representative as of 2026 to convey the structure and relative cost. Always confirm the current model IDs in the Bedrock model catalog and current rates on the AWS Bedrock pricing page before you build or budget.

the size ladder

Small (8B-class) = fast and cheapest — high-volume, latency-sensitive, simpler tasks. Mid (70B-class) = the balanced default for production reasoning. Large (405B-class / large MoE) = deepest reasoning, highest cost — reserve for hard problems. Newer generations add vision-capable and mixture-of-experts variants. Switching sizes is a one-line model-ID change, which is why tiered routing is the standard cost pattern.

the positioning question

IIWhat "open weights" actually buys you on Bedrock

The single most important thing to understand about Llama is that it is an open-weights model: Meta publishes the model weights under a community license rather than serving them only from a private API. On Bedrock you still consume Llama through the same managed endpoint as a closed model — but the open-weights nature changes what you can do with it, and that is the real reason teams pick Llama.

It is worth being precise about terms, because the marketing language is loose. "Open weights" means the trained parameters are downloadable and you may run, fine-tune, and (subject to the license) redistribute the model. It is not the same as "open source" in the strict sense — Llama ships under the Meta Llama Community License, which is permissive for the vast majority of commercial uses but carries some conditions (notably an acceptable-use policy and a scale clause that affects only the largest consumer platforms). For almost every startup and company, the license is effectively "use it freely, including commercially." Here is what that openness buys you, specifically, when you run Llama on Bedrock:

  • Deep customization — real fine-tuning — Because the weights are open, you can fine-tune Llama on your own data and create a custom model that internalizes your domain, tone, and tasks — then host that custom model on Bedrock. Fine-tuning a closed frontier model is either unavailable or far more constrained; with Llama it is a first-class path. This is the headline reason teams with proprietary data choose Llama.
  • Portability — no single-vendor endpoint lock — The same Llama weights run on Bedrock, on Amazon SageMaker (JumpStart / your own endpoints), and self-hosted on your own GPUs or on Trainium/Inferentia. You can start on Bedrock's managed API for speed, then move the exact same model to SageMaker or self-hosting if economics or control demand it — without changing the model. With a closed model you are tied to that provider's endpoint.
  • A low cost floor — Open-weights competition has pushed Llama's small and mid sizes to among the lowest per-token prices on Bedrock. For high-volume, cost-sensitive workloads the small Llama models are frequently the cheapest credible option for their quality band.
  • Transparency and control — Open weights mean the model can be inspected, evaluated offline, run in fully isolated environments, and reasoned about for compliance in ways a black-box API cannot. For regulated or air-gapped scenarios the option to self-host the same model you prototyped on Bedrock is a genuine architectural advantage.
  • One API across many models — same as the rest of Bedrock — Through Bedrock's Converse API, Llama sits behind the same interface as Claude, Nova, Mistral, Cohere, and Titan. You build a model-agnostic application and can mix models — routing easy or bulk requests to a small Llama and hard ones to a larger model or to Claude — without rewriting your integration.

The honest counterpoint: open weights do not automatically mean "best." On the very hardest reasoning, coding, and agentic tasks, the largest closed frontier models can still lead the largest open ones, generation for generation. So the mature pattern is not "Llama instead of everything" but "Llama where openness, portability, customization, or cost matter; a closed frontier model where you need the absolute top of the reasoning curve" — and because both sit behind the same Bedrock API, you can do both in one application.

getting in

IIIModel IDs and how to enable model access

Before you can call Llama on Bedrock, you have to do one small but mandatory thing: request model access in your account. Foundation models on Bedrock are off by default; turning Llama on is a one-time, no-cost step in the console.

Enabling access. In the Bedrock console, open Model access, find the Llama models you want, and request access. Because Llama ships under Meta's community license, enabling it includes accepting that end-user license agreement; once accepted, access is typically granted effectively immediately. There is no charge for enabling access — you only pay when you actually call a model. Access is per-account and per-region, so if you operate in several regions, enable Llama in each one you will call from. This is also where cross-region inference profiles come in: they let Bedrock route your Llama calls across a set of regions for better availability and throughput (see the amazon-bedrock-cross-region-inference sibling).

Model IDs. Every model on Bedrock is invoked by a model ID — a string identifying the provider, model, and version (Llama IDs are namespaced under Meta, e.g. an identifier of the shape meta.llama-…, with a size and version suffix). You pass this ID to the API to choose which model and size answers a request, so moving a request from a small Llama to a mid or large one is just a change of model-ID string. Because IDs advance with each Llama generation, do not hard-code a guessed value — read the current ID from the Bedrock model catalog (console) or list it via the API/CLI, and treat it as configuration rather than a literal in your code. A fine-tuned custom model gets its own ARN/identifier, which you invoke the same way.

Permissions. The IAM principal making the call needs permission for the relevant Bedrock invoke actions (and, if you use cross-region inference profiles or a custom fine-tuned model, permission on the profile or custom-model resource). A least-privilege policy scoped to the specific Llama model ARNs you intend to use is the recommended posture. Once access is granted and IAM is in place, you are ready to call Llama through the Converse API — the same model-agnostic interface used for every chat model on Bedrock.

  • Open the Bedrock console → Model access → request access to the Llama models you need; accept Meta's license (free; usually instant).
  • Enable access in each region you will call from; consider a cross-region inference profile for availability.
  • Get the current model ID from the model catalog or via the API — do not hard-code a guessed version/size string.
  • Attach an IAM policy granting the Bedrock invoke actions on the specific Llama model ARNs (least privilege); include the custom-model ARN if you fine-tune.
  • You are billed only on invocation — enabling access costs nothing.
what it costs

IVLlama on Bedrock — per-size pricing

Llama on Bedrock is billed per token: a rate per 1,000 input tokens (everything you send) and a rate per 1,000 output tokens (everything Llama generates). The rate depends on the model size — and because the family spans from very small to very large, the spread is wide enough that size selection is the dominant cost lever.

The table below gives representative 2026 on-demand rates for the main Llama size bands, shown per 1,000 and per 1,000,000 tokens (the per-million column is simply the per-1K figure × 1,000; providers increasingly quote per-million). Use it to rank the sizes by cost and sanity-check a budget — not as an audited price sheet. Two on-demand cost levers sit on top of these rates and are not shown in the table: Batch (submit non-interactive work as an async job for roughly half the on-demand price) and prompt caching where supported (stop re-paying full input price for a repeated prefix like a long system prompt). Separately, fine-tuning introduces its own pricing — a one-time training charge plus ongoing custom-model hosting — covered in the fine-tuning section below. For very steady, high-volume traffic, Provisioned Throughput reserves dedicated capacity at an hourly rate instead of per-token.

representative on-demand Llama-on-Bedrock pricing · per 1K and per 1M tokens · 2026
Llama size bandInput / 1KOutput / 1KInput / 1MOutput / 1MCost position
Small (8B-class)$0.0001$0.0001$0.10$0.10Cheapest — high-volume / fast
Mid (70B-class)$0.0007$0.0009$0.72$0.90Mid — the balanced default
Large (405B-class / MoE)$0.0024$0.0024$2.40$2.40Highest in-family — hardest reasoning
Representative 2026 figures for relative comparison only — confirm current rates on the AWS Bedrock pricing page (they change with each generation and vary by region; some generations price input and output differently). Even the large Llama band typically undercuts closed frontier tiers per token — Llama's open-weights competition keeps prices low. Batch (~50% off) and prompt caching (where supported) lower the effective rate further. Fine-tuning and Provisioned Throughput are priced separately (see below).
what it can do

VCapabilities: image reasoning, long context, tool use

Llama on Bedrock is not just text-in/text-out for every size. The newer generations add multimodal input and longer context, and the family supports the tool-use and Converse patterns common to Bedrock chat models. Availability of any given capability varies by Llama size and generation more than it does for a closed family, so confirm specifics for your chosen model.

Image reasoning (multimodal, on vision-capable sizes)

Newer Llama generations include vision-capable sizes that accept images alongside text and reason about them — reading charts and diagrams, extracting data from screenshots, interpreting documents and photos, and answering questions about visual content. Not every Llama size is multimodal: the text-only sizes remain text-only, while the dedicated vision variants add image understanding. If your workload needs visual input, select a vision-capable Llama (or a multimodal model from another provider) and confirm support in the model catalog.

Long context

Recent Llama generations offer a substantially larger context window than the original releases — room for long documents, extended conversation history, and many retrieved chunks in a single request. The exact window depends on the generation and size, and it has grown markedly over successive Llama releases. Long context simplifies RAG and document workflows because more relevant material fits in one call; it is also a cost consideration, since input is billed per token, so a big context costs more — which is where Batch and (where supported) prompt caching help.

Tool use and the Converse API

Llama models on Bedrock are invoked through the same Converse API as every other chat model, and the family supports tool use (function calling) — you describe tools and the model decides when to call them and with what arguments, then incorporates the results. Tool use is the foundation of agentic systems and underpins Bedrock Agents, which can be configured to use a Llama model. The precise tool-use behaviour and reliability vary by size, so for agentic workloads validate on the specific Llama you plan to ship and consider escalating the hardest steps to a larger model.

Custom models via fine-tuning

The capability that most distinguishes Llama from closed families is that you can turn it into your model: fine-tune it on your data and serve the result as a custom model on Bedrock. That deserves its own treatment — see the next section — but it belongs on any capability list because, for many teams, a fine-tuned small Llama outperforms a much larger general model on their specific narrow task at a fraction of the cost.

making it yours

VIFine-tuning and custom Llama models on Bedrock

Fine-tuning is where Llama's open weights pay off most concretely. On Bedrock you can take a base Llama, train it further on your own labelled examples, and deploy the resulting custom model behind the same API — without managing training infrastructure yourself.

The managed flow is straightforward in shape: you prepare a training dataset of example prompt/response pairs (and optionally a validation set) in the expected format, stage it in Amazon S3, and create a customization job in Bedrock pointing at a base Llama model. Bedrock runs the training, produces a custom model, and reports the metrics. You then make the custom model callable by buying a small amount of Provisioned Throughput for it (custom models are served on reserved capacity rather than the shared on-demand pool), after which you invoke it by its custom-model ID exactly like any other Bedrock model. Because Llama is open, this customization is a first-class, supported path rather than a constrained add-on.

When fine-tuning earns its keep. Reach for it when prompting alone cannot get you there: a consistent house style or format the base model keeps drifting from; a specialized domain vocabulary (legal, medical, financial, industrial) the general model handles only loosely; a narrow, high-volume classification or extraction task where a fine-tuned small Llama can match a much larger general model at a fraction of the per-token cost; or a structured-output task where you need very high format reliability. For knowledge that changes often — facts, documents, current data — prefer retrieval (RAG) over fine-tuning, because RAG updates by changing the documents rather than retraining the model. The two compose well: a fine-tuned model for behaviour and format, RAG for fresh knowledge.

Cost shape. Fine-tuning introduces costs the on-demand table does not show: a one-time training charge (scaling with model size and the volume of training tokens) and ongoing custom-model hosting via Provisioned Throughput (an hourly rate for the reserved units you keep online). For an intermittent workload this hosting cost is the figure to watch — a custom model you keep warm 24/7 costs more than the same volume on shared on-demand. The economics favour fine-tuning when the quality or cost-per-request win on a high, steady volume outweighs the hosting overhead. All of it — training and hosting — is ordinary AWS spend, so AWS credits cover fine-tuning too, which removes the usual budget objection to experimenting with a custom model. For the full mechanics see the amazon-bedrock-fine-tuning sibling.

fine-tune vs RAG vs prompt

Prompt engineering first — cheapest, instant, no training. RAG when the model needs fresh or proprietary knowledge (update documents, not weights). Fine-tuning when you need consistent behaviour, format, or domain style, or a small specialised model that beats a big general one on a narrow task. They compose: fine-tune for behaviour, RAG for knowledge. Credits cover all three on Bedrock.

choosing the model

VIIWhen to pick Llama vs Claude vs Amazon Nova

Llama is one strong choice among several on Bedrock. A neutral orientation versus the two other names people weigh it against — Anthropic's Claude and Amazon's own Nova — framed around when each is the right call rather than a leaderboard ranking.

Pick Llama when openness or customization matters. The clearest reasons to choose Llama: you want to fine-tune on your own data and own the resulting model; you want portability so the same model can move between Bedrock, SageMaker, and self-hosting (avoiding lock to one vendor's endpoint); you need the transparency of an inspectable, self-hostable model for compliance or air-gapped deployment; or you want the low per-token cost of the small and mid Llama sizes for high-volume work. Llama is the open-weights anchor of a Bedrock strategy.

Pick Claude when you need the top of the reasoning curve. Anthropic's Claude (Opus / Sonnet / Haiku) tends to be the pick for the hardest reasoning, complex multi-step analysis, demanding coding, and high-stakes agentic workflows where quality dominates and you want a specific, well-characterised behaviour profile. Claude is closed-weights, so you trade Llama's customization and portability for that reasoning strength and consistency. Many teams use both: Llama (and Nova) for the cheap, high-volume, or fine-tuned paths, Claude for the quality path — trivial to combine behind one Converse API. See the claude-on-amazon-bedrock sibling.

Pick Nova for the absolute cost/latency floor inside AWS. Amazon's own Nova family (Micro / Lite / Pro / Premier for text, plus Canvas and Reel for media) is engineered for low cost and low latency and is tightly integrated with the rest of AWS. At the cheap end Nova Micro and Lite compete directly with small Llama for very high-volume, simple, latency-sensitive work; the difference is that Nova is closed and AWS-native (no fine-tune-and-take-it-elsewhere story), whereas small Llama gives you the open-weights options above. Where pure price/latency on simple tasks is the only criterion, benchmark small Llama against Nova Micro/Lite on your own traffic. See the amazon-nova sibling.

The meta-point: Bedrock lets you defer and revisit this choice cheaply. Because every model sits behind the same API, you can start on Llama, A/B Claude or Nova on part of your traffic, fine-tune a Llama for the one task that needs it, and re-tier as prices and capabilities move — without re-plumbing your application. The comparison table below puts the three side by side.

matching size to job

VIIIUse cases — which Llama for which job

The clearest way to think about the family is by mapping common production workloads to the smallest size (or the fine-tuned model) that does them well. Start a request on the smallest model that clears your quality bar and only escalate when it does not.

  • Small Llama (8B-class) — high-volume, latency-sensitive, simpler tasks — Classification, routing and triage, data extraction, short-form generation, real-time chat where speed and cost matter most, the cheap first stage of a tiered router, and bulk processing (especially via Batch). Often the strongest base to fine-tune for a single narrow task — a tuned small Llama can beat a much larger general model on that task at a fraction of the cost.
  • Mid Llama (70B-class) — the balanced default — A sensible default for a lot of real work: RAG knowledge assistants, customer-support agents, content generation, general coding assistance, and document analysis, plus the reasoning behind many Bedrock Agents. Good quality at a moderate price — where a large share of production traffic can live, with hard cases escalated upward.
  • Large Llama (405B-class / large MoE) — hardest in-family reasoning — Reserve for the genuinely difficult open-model work: deeper multi-step reasoning, harder coding, and research-style synthesis where you want a large open model specifically (for portability, customization, or to keep the whole pipeline on open weights). For the very top of the reasoning curve, also benchmark against a closed frontier model like Claude Opus.
  • Fine-tuned Llama — your specialised model — A custom Llama trained on your data for a consistent house style, a specialized domain vocabulary, or a high-volume narrow task with strict format requirements. The open-weights path makes this a first-class option, and credits cover the training and hosting — so the experiment is effectively free to try.
  • Tiered routing — mix sizes (and models) — The highest-leverage pattern: a small model triages and handles the easy majority, escalating only hard cases to a mid or large Llama — or to Claude. Because switching is a one-line model-ID change on the Converse API, this is straightforward to build and routinely cuts spend several-fold with little quality loss.
how it becomes $0

IXHow AWS credits make running Llama $0

Everything above prices Llama on Bedrock if you pay AWS directly. For most startups and many companies the relevant number is different — because AWS will frequently fund the build with credits, and Llama usage on Bedrock (inference and fine-tuning alike) draws those credits down before it ever touches your card.

Llama inference, fine-tuning, and custom-model hosting on Bedrock are ordinary AWS spend, so they are fully credit-eligible and credits apply automatically against your bill until exhausted — covering Llama tokens, any Batch and Provisioned Throughput usage, the one-time fine-tuning training charge, custom-model hosting, plus the supporting services (Knowledge Bases, vector store, S3 for training data, logging). The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case; and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups). Because Llama is credit-eligible like everything else on Bedrock, a funded team can fine-tune and serve a custom model without the training cost ever showing up as real spend.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the Llama workload — the tiered model router, the fine-tuning pipeline and custom-model hosting, the RAG pipeline behind Knowledge Bases, the agent with tool use. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.

Put together with the size-selection, Batch, and fine-tuning levers above, the picture for a startup is: build on the smallest Llama each request actually needs, fine-tune the one model that earns it, and run the whole thing — inference and training — on a $25K–$100K (or larger) credit pool while you find product-market fit, paying real money only once usage, and ideally revenue, has scaled past the credits. And because Llama is open-weights, if your economics later favour leaving the managed endpoint, you can move the same model to SageMaker or self-hosting on Trainium/Inferentia — credits cover that too. Related: AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.

pick a model

Llama vs Claude vs Amazon Nova on Bedrock — when each wins

The core decision in one place: Llama (open weights) against Anthropic Claude (closed frontier) and Amazon Nova (closed, AWS-native, cost-optimized), compared on the dimensions that actually decide it. All three sit behind the same Converse API, so mixing them is a model-ID change, not a rewrite. Representative 2026 framing, not quotes.

DimensionLlama (Meta)Claude (Anthropic)Nova (Amazon)
WeightsOpen — downloadable, self-hostableClosedClosed
Fine-tuningYes — first-class custom modelsLimited / constrainedLimited
PortabilityHigh — Bedrock, SageMaker, self-hostBedrock / Anthropic API onlyBedrock / AWS only
Top-end reasoningStrong (large sizes)Deepest (Opus-class)Good (Pro / Premier)
Cost floor (small size)Very low (8B-class)Low (Haiku)Lowest (Nova Micro/Lite)
Best whenCustomization, portability, open stack, low costHardest reasoning, agents, quality-criticalAbsolute price/latency floor inside AWS
Not mutually exclusive — the mature pattern is to mix: Llama (or Nova) for cheap, high-volume, or fine-tuned paths; Claude for the quality path. Llama's differentiators are open weights, fine-tuning, and portability; Claude's is top-end reasoning; Nova's is the cost/latency floor with deep AWS integration. Benchmark candidates on your own task — relative strengths shift each generation.
open weights, AWS's budget
Credits cover Llama inference AND fine-tuning on Bedrock — get the pool + a partner to build it ($0)
Get matched in 24h →
a recent match

A fine-tuned Llama replaced a pricier closed model — funded by credits — anonymized

inquiry · seed-stage vertical-AI SaaS, Toronto
Seed-stage vertical-AI SaaS, 12 people, building a document-processing product for a regulated industry

Situation: The product ran every request through a closed frontier model on a vendor API — accurate, but expensive at volume and paid out of runway, and the data-residency and inspectability story was awkward for their regulated customers. Most of their traffic was a single narrow extraction-and-classification task on domain documents, which felt like overkill for a frontier model. They wanted (a) lower per-request cost at high volume, (b) a model they could fine-tune on their proprietary document set and, if needed, self-host for compliance, and (c) to stop paying for it out of pocket.

What CloudRoute did: CloudRoute matched them in under 24 hours to an AWS partner with GenAI and fine-tuning experience. The partner (1) moved the workload onto Bedrock's Converse API under the team's IAM and region for data residency; (2) fine-tuned a small Llama on the company's labelled document set and hosted it as a custom model, with a mid-size Llama as the escalation tier for the hard cases; (3) put a RAG layer over the changing reference data via Knowledge Bases so facts stayed fresh without retraining; and (4) filed a Bedrock POC credit application plus an Activate application to fund both the inference and the fine-tuning.

Outcome: The fine-tuned small Llama matched the previous closed model's accuracy on the narrow task at a small fraction of the per-request cost, and the open-weights model gave the team a credible self-host path for their most compliance-sensitive customers. Decisively, the spend — inference, fine-tuning training, and custom-model hosting — now draws down AWS credits instead of runway, so the team pays $0 during the build and early scale. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

moved: closed vendor API → fine-tuned Llama on Bedrock · pattern: fine-tune + RAG + escalation tier · credits secured: POC + Activate · out-of-pocket: $0

faq

Common questions

Is Llama available on Amazon Bedrock?
Yes. Meta's Llama runs natively on Amazon Bedrock as one of the foundation-model providers behind Bedrock's single managed API, alongside Amazon Nova and Titan, Anthropic Claude, Mistral, Cohere, and others. As of 2026 Bedrock offers the current Llama generation across several sizes — small (8B-class), mid (70B-class), and large (405B-class and newer mixture-of-experts variants), with newer generations adding vision-capable sizes — all accessed through the same Converse API and IAM/VPC controls. You enable access per account and region in the Bedrock console, accepting Meta's community license as part of the process.
What does "open weights" mean for Llama, and why does it matter on Bedrock?
Open weights means Meta publishes the trained model parameters under the Llama Community License, so you can run, fine-tune, and (subject to the license) redistribute the model — not just call it from a private API. On Bedrock that buys you three concrete things over a closed model: deep customization (fine-tune on your own data and host the custom model), portability (the same model runs on Bedrock, SageMaker, or self-hosted, so you are not locked to one vendor's endpoint), and a low per-token cost floor at the small sizes. The trade-off is that the largest closed frontier models can still lead on the hardest reasoning. Note "open weights" is not the same as "open source" in the strict sense — the license is permissive for almost all commercial use but carries an acceptable-use policy and a scale clause affecting only the largest platforms.
How much does Llama cost on Bedrock?
It is billed per token, per size: representative 2026 on-demand rates run from roughly $0.10 per million tokens for a small (8B-class) Llama, to under $1 per million for a mid (70B-class) model, to a few dollars per million for the largest sizes — typically still undercutting closed frontier tiers, because open-weights competition keeps Llama prices low. Output is sometimes priced like input and sometimes higher depending on the generation. Batch (~50% off) and prompt caching where supported lower the effective rate; fine-tuning and Provisioned Throughput are priced separately. These are representative figures for relative comparison — confirm current rates on the AWS Bedrock pricing page, as they change with each generation and vary by region.
Can I fine-tune Llama on Amazon Bedrock?
Yes — and it is one of Llama's main advantages over closed models. You stage a dataset of prompt/response examples in S3, create a customization job in Bedrock pointing at a base Llama, and Bedrock trains a custom model you then serve via a small amount of Provisioned Throughput, invoking it by its custom-model ID like any other model. Fine-tuning earns its keep when you need consistent behaviour, format, or domain style, or a small specialised model that beats a big general one on a narrow task; for knowledge that changes often, prefer RAG. Fine-tuning costs a one-time training charge plus custom-model hosting — both ordinary AWS spend, so AWS credits cover them. See the amazon-bedrock-fine-tuning sibling.
Does Llama support vision and long context on Bedrock?
It depends on the size and generation. Newer Llama generations include vision-capable (multimodal) sizes that reason over images alongside text — reading charts, screenshots, and documents — while the text-only sizes remain text-only. Context windows have grown markedly over successive Llama releases, with recent generations offering substantially larger windows for long documents and history. Because capability varies by size more than it does for a closed family, confirm vision support and the exact context window for your chosen Llama in the Bedrock model catalog before you build.
Should I use Llama, Claude, or Amazon Nova on Bedrock?
It is workload-specific. Pick Llama when openness matters — you want to fine-tune on your own data, keep portability across Bedrock/SageMaker/self-hosting, need an inspectable or self-hostable model for compliance, or want the low cost of the small sizes. Pick Claude when you need the top of the reasoning curve — hardest reasoning, complex agents, quality-critical work. Pick Nova for the absolute price/latency floor inside AWS on simple, high-volume tasks. They are not mutually exclusive: the mature pattern mixes them behind one Converse API — Llama or Nova for cheap or fine-tuned paths, Claude for the quality path. Benchmark candidates on your own task and prompts.
What is the Llama model ID on Bedrock?
Each Llama model is invoked by a model ID — a string identifying the provider, model, size, and version, namespaced under Meta (of the shape meta.llama-… with a size and version suffix). You pass it to the API to pick the size, so moving a request between a small, mid, and large Llama is just a change of model-ID string; a fine-tuned custom model gets its own ARN you invoke the same way. Because IDs advance with each generation, do not hard-code a guessed value — read the current ID from the Bedrock model catalog in the console or list it via the API/CLI, and treat it as configuration.
Can I move my Llama workload off Bedrock later?
Yes — that portability is a core reason to choose an open-weights model. The same Llama weights (and your fine-tuned variant) can run on Amazon Bedrock's managed API, on Amazon SageMaker via JumpStart or your own endpoints, or self-hosted on your own GPUs or on AWS Trainium/Inferentia accelerators. A common path is to prototype fast on Bedrock's managed endpoint, then move to SageMaker or self-hosting if cost or control demands it — without changing the model. With a closed model you are tied to that provider's endpoint. AWS credits apply across Bedrock, SageMaker, and EC2-based self-hosting, so the migration is credit-funded too.
Can AWS credits cover Llama usage and fine-tuning on Bedrock?
Yes. Llama inference, fine-tuning training, and custom-model hosting on Bedrock are all ordinary AWS spend, so they are fully credit-eligible and credits apply automatically against your bill — covering tokens, Batch and Provisioned Throughput, the one-time training charge, custom-model hosting, and supporting services. The relevant pools are AWS Activate (up to $100K), a Bedrock/GenAI POC pool ($10K–$50K), and the GenAI Accelerator (up to $1M). These are largely partner-filed via the AWS Partner Network. CloudRoute routes you to the right pool and a vetted AWS partner who files the application and builds the Llama workload — including the fine-tuning pipeline — so the customer pays $0 and AWS funds it.

Run — and fine-tune — Llama on AWS's budget, not your runway

Llama on Bedrock is open-weights, portable, fine-tunable, and credit-eligible — inference and training both draw down AWS credits under your existing IAM, VPC, and billing. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS partner who stands up the Llama workload, fine-tunes your custom model, and builds the tiered router. Customer pays $0.

matched within< 24h
GenAI credit ceilingup to $1M
cost to you$0
Llama on Amazon Bedrock — models, pricing & fine-tuning · CloudRoute