bedrock custom model import · bring your own weights · 2026

Amazon Bedrock Custom Model Import — bring your own fine-tuned model.

A complete, neutral reference for Custom Model Import on Amazon Bedrock: how to take a model you already fine-tuned on open weights (a Llama, Mistral, Mixtral, Flan-T5, or similar architecture) and import it into Bedrock so it runs behind the same unified, serverless, on-demand API as every other model. It covers the supported architectures; the import workflow (weights in S3 → an import job → invoke); how billing works (Custom Model Units, charged in roughly 5-minute windows of active compute, scaling to zero when idle); the cold-start trade-off; the limits; and when to import versus fine-tune inside Bedrock versus host on SageMaker. It also explains how AWS credits can fund the import and the inference so the build costs you $0.

you bring

your own weights

serving model

serverless, on-demand

billing

per-compute, scales to zero

cost with credits

TL;DR

Custom Model Import lets you bring a model you fine-tuned yourself — on a supported open-weights architecture like Meta Llama, Mistral/Mixtral, or Flan-T5 — into Amazon Bedrock and invoke it through the same unified API as Claude, Nova, and the rest. You upload the model weights to Amazon S3, run a one-off import job, and Bedrock then serves your model on-demand and serverless: no endpoint to provision, no GPU fleet to manage, and it scales to zero when idle.
It is different from Bedrock fine-tuning. Fine-tuning trains a base model inside Bedrock from a JSONL dataset and produces a custom model that must be served on Provisioned Throughput (a flat hourly charge). Custom Model Import does no training — you did that elsewhere — and serves the imported model on a pay-for-active-compute model billed in Custom Model Units, in roughly 5-minute windows, with no standing hourly bill while idle. The trade-off is a cold start: a model that has scaled to zero takes some seconds to load back onto an accelerator on the next request.
Use Custom Model Import when you have already-fine-tuned open weights, want them behind the one Bedrock API with on-demand economics, and your traffic is intermittent. Use Bedrock fine-tuning when you want AWS to do the training and you have steady high volume to justify Provisioned Throughput. Use SageMaker when you need full control over the serving stack, custom containers, or unsupported architectures. AWS credits (Activate up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) cover the import and the inference — CloudRoute routes you to the credit pool and a vetted AWS partner, so you pay $0.

definition

I. What Custom Model Import actually is

Custom Model Import is the Bedrock feature for bringing your own model. If you have already fine-tuned an open-weights model somewhere else — your own GPUs, SageMaker, another platform — Custom Model Import lets you upload those weights and have Bedrock serve them behind its standard, fully-managed API, exactly as if they were a native Bedrock model.

Most of the customization story on Bedrock is about adapting models that AWS hosts for you. Custom Model Import inverts that: you supply the model. You have a checkpoint — a set of fine-tuned weights for a supported architecture, sitting in a directory of safetensors files and config — and you want to call it through Bedrock instead of standing up and babysitting your own inference server. Custom Model Import takes that checkpoint, registers it as a custom model in your account, and from then on you invoke it with the same Bedrock runtime API you use for everything else.

The crucial property is that the imported model runs on-demand and serverless. You do not create or manage an endpoint, you do not choose instance types, you do not run an autoscaler, and you do not pay a flat hourly fee to keep it warm. Bedrock loads your model onto managed accelerators when requests arrive, serves them, and releases the capacity when traffic stops — billing you only for the compute your model actually uses. From the application's point of view it is just another model ID behind InvokeModel; the fact that the weights are yours is an implementation detail.

This makes Custom Model Import the bridge between two worlds that used to be separate: the freedom of open-weights fine-tuning (you own the model, you trained it exactly how you wanted, on whatever data and recipe you chose) and the operational simplicity of a managed model API (no servers, one auth model, one bill, the same Bedrock features around it). You get to keep the model you built and stop operating the infrastructure to serve it.

One caveat, stated once and meant throughout: the exact list of supported architectures, the regional availability, the size limits, and the precise pricing for Custom Model Import change as the feature evolves. The specifics here are representative as of 2026 to convey the shape of the feature and its economics. Always confirm the current supported-architecture list, Regions, quotas, and Custom Model Unit pricing on the official AWS Bedrock documentation and pricing pages before you build around them.

the one-line definition

Custom Model Import = upload your own fine-tuned open-weights model (Llama / Mistral / Flan-T5-class architecture) into Bedrock and invoke it through the same unified, serverless, on-demand API as native models — no endpoint to manage, billed for active compute, scaling to zero when idle. You bring the weights; Bedrock handles the serving.

the problem it solves

II. Why it exists: the gap between owning a model and serving one

Teams fine-tune open-weights models for good reasons: full control of the training recipe and data, no per-token vendor pricing, the ability to take the weights anywhere. The pain comes the day they have to put that model into production and keep it running. Custom Model Import is the answer to that second problem.

Serving an open-weights LLM yourself is real systems work. You provision GPU instances (often scarce and expensive), pick an inference server, handle model sharding and quantization, build autoscaling that reacts to traffic, manage health checks and rolling deploys, and pay for the accelerators 24/7 — including all the hours nobody is sending requests. For a small team, or for a model that only gets occasional traffic, that standing cost and operational load can dwarf the value of having fine-tuned the model in the first place.

The two managed alternatives each have a catch. Calling a hosted model API means giving up your own weights and accepting per-token vendor pricing. Hosting your model on a dedicated endpoint (whether SageMaker or your own cluster) means paying for that endpoint continuously, whether or not it is busy. Custom Model Import threads the needle: you keep your own fine-tuned weights, and Bedrock serves them without a continuously-running endpoint — you pay for active compute and the model scales to zero when idle. For intermittent or unpredictable workloads, that pay-for-what-you-use shape is the whole point.

It also unifies your stack. Once your model is imported, it lives alongside Claude, Nova, Llama, Mistral and the rest under one IAM boundary, one billing relationship, and the broader Bedrock toolset. The same governance, logging (CloudTrail, model-invocation logging), and many of the same Bedrock features apply. You stop maintaining a separate serving system that exists only to host one fine-tuned model, and you fold it into the platform you are already using for the rest of your generative-AI workload.

what you can import

III. Supported architectures: what you can (and cannot) import

Custom Model Import does not accept arbitrary models. It supports a defined set of well-known open-weights architectures, and your model has to match one of them. The reason is mechanical: Bedrock has to know how to load and run the weights, so the architecture must be one it has built serving support for.

The supported set centers on the popular open text architectures. As a practical 2026 guide, Custom Model Import has supported architectures in the Meta Llama family, the Mistral and Mixtral families, and encoder-decoder/text models such as Flan-T5, along with other widely-used open architectures that AWS adds over time. The key word is architecture: support is keyed to the model's structure, not its name. Any model whose architecture matches a supported one — including your own fine-tune, a continued-pre-trained variant, or a community model built on that architecture — can generally be imported. A model built on an unsupported or bespoke architecture cannot.
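Because support is keyed to architecture rather than name, a quick pre-check can read the `architectures` field that Hugging Face style `config.json` files usually carry. The sketch below is illustrative only: `ASSUMED_SUPPORTED` is a hypothetical stand-in built from the families named above (using their common Hugging Face class names), not the official list, and the helper names are invented for this example.

```python
import json
from typing import List

# ASSUMPTION: illustrative class names for the families named in the text.
# The authoritative list lives in the AWS Bedrock docs and changes over time.
ASSUMED_SUPPORTED = frozenset({
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "MixtralForCausalLM",
    "T5ForConditionalGeneration",
})


def declared_architectures(config_text: str) -> List[str]:
    """Return the architecture class names declared in a config.json string."""
    config = json.loads(config_text)
    return list(config.get("architectures", []))


def looks_importable(config_text: str, supported=ASSUMED_SUPPORTED) -> bool:
    """True only if the config declares architectures and all are in the set."""
    archs = declared_architectures(config_text)
    return bool(archs) and all(arch in supported for arch in archs)
```

A renamed community fine-tune that still declares `LlamaForCausalLM` passes this check, while a bespoke architecture does not, which mirrors the "structure, not name" rule. Treat a pass as a hint to go and confirm against the current AWS list, not as a guarantee.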

There are practical requirements on the artifacts themselves. You import a model in the standard open-weights layout: the weights (commonly safetensors), the configuration files, and the tokenizer files that together define the model, staged in Amazon S3. The model must be in a precision and of a size that Bedrock can serve, and there are parameter-count and context-length ceilings on what can be imported (see the limits in §VIII). The important discipline is the same one that trips up every customization plan: confirm your exact architecture and size are on the current supported list before you design around importing it, because this is the most common place an import plan breaks.

What you bring is typically a model you fine-tuned or further-trained yourself outside Bedrock — on SageMaker, on your own accelerators, or anywhere you had the freedom to train exactly the way you wanted. Custom Model Import does not train anything; it assumes you have already produced the weights. (If you want AWS to do the training for you, that is Bedrock fine-tuning, covered in §VII and in the amazon-bedrock-fine-tuning sibling.)

check the architecture list first

Support is keyed to the model architecture, not the name. Llama-family, Mistral/Mixtral-family, and Flan-T5-class architectures have been among the supported set, with more added over time; bespoke or unsupported architectures cannot be imported. Confirm your exact architecture, precision, and size are on the current Custom Model Import supported list on the AWS docs before building your plan around it.

how you do it

IV. The import workflow: from S3 weights to a callable model

The mechanics are deliberately simple: stage the model files in S3, run a one-off import job, and then invoke the imported model through the standard Bedrock runtime API. No endpoints, no containers, no instance configuration.

The whole flow is a one-time setup followed by ordinary inference. Here is what each step actually involves.

Step 1 — Stage the model artifacts in Amazon S3

Put your model files in an S3 bucket in the Region where you will import: the weights (typically safetensors), the model config, and the tokenizer files, in the standard open-weights directory layout. The artifacts must be complete and match a supported architecture. Keep them in the same Region as the import job, and make sure the role you will use can read the bucket.
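Incomplete artifacts are an easy failure to catch before uploading. As a minimal sketch, the check below takes a list of file names (from a local directory or an S3 listing) and reports what is missing. The specific file names are an assumption based on the common Hugging Face layout; confirm the exact required set in the AWS docs.

```python
from typing import Iterable, List

# ASSUMPTION: file names follow the common Hugging Face checkpoint layout.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")
TOKENIZER_ALTERNATIVES = ("tokenizer.json", "tokenizer.model")


def missing_artifacts(file_names: Iterable[str]) -> List[str]:
    """Return a list of missing artifact descriptions (empty means complete)."""
    names = set(file_names)
    missing = [name for name in REQUIRED_FILES if name not in names]
    # At least one tokenizer vocabulary file must be present.
    if not names.intersection(TOKENIZER_ALTERNATIVES):
        missing.append("tokenizer.json or tokenizer.model")
    # Weights may be sharded, so accept any *.safetensors file.
    if not any(name.endswith(".safetensors") for name in names):
        missing.append("*.safetensors")
    return missing
```

For a local checkpoint you could pass `[p.name for p in pathlib.Path(model_dir).iterdir()]`; an empty result means the directory at least has the expected shape.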

Step 2 — Run a model-import job

In the Bedrock console (or via API/SDK) you create a model-import job: point it at the S3 location of your artifacts, give the resulting model a name, and provide an IAM role that grants Bedrock permission to read your S3 data. Bedrock validates the artifacts against the supported architecture, ingests the weights, and registers the result as a custom (imported) model in your account. The job runs once and takes on the order of minutes (longer for large models); you are not standing up any persistent infrastructure here — you are registering the model so Bedrock can serve it on demand later.
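Via the SDK, the import job is a single call. The sketch below assumes the boto3 `bedrock` client's `create_model_import_job` operation and its parameter names as I understand them; verify both against the current SDK reference. The job name, model name, role ARN, and bucket are hypothetical placeholders. boto3 is imported lazily so the request builder can be exercised without the SDK installed.

```python
def build_import_job_request(job_name: str, model_name: str,
                             role_arn: str, s3_uri: str) -> dict:
    """Assemble the keyword arguments for a model-import job."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("artifacts must be staged in S3 (expected an s3:// URI)")
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,  # role Bedrock assumes to read the bucket
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import_job(request: dict, region: str) -> str:
    """Submit the job and return its ARN. Requires boto3 and AWS credentials."""
    import boto3  # lazy import: only needed when actually calling AWS

    bedrock = boto3.client("bedrock", region_name=region)
    response = bedrock.create_model_import_job(**request)
    return response["jobArn"]
```

Keep the `region` passed to `start_import_job` the same as the bucket's Region, as Step 1 requires.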

Step 3 — Invoke it through the unified Bedrock API

Once the import completes, your model is just another model in Bedrock. You call it with the standard runtime API (InvokeModel, or the conversational path where the model is chat-formatted), referencing the imported model by its identifier, exactly as you would call Claude or Nova. The same IAM authorization, CloudTrail audit logging, and model-invocation logging apply. Your application code does not need to know the weights are yours; it is one more model ID behind the same API.
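A call to the imported model can be sketched as follows, assuming the boto3 `bedrock-runtime` client's `invoke_model` operation. The body fields (`prompt`, `max_tokens`, `temperature`) are an assumption: the request schema an imported model accepts depends on its architecture, so check the docs for yours. The model ARN in the usage note is a made-up placeholder.

```python
import json


def build_invoke_request(model_arn: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble the keyword arguments for invoking an imported model."""
    # ASSUMPTION: body schema; it varies by model architecture.
    body = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.2}
    return {
        "modelId": model_arn,  # the imported model's identifier
        "body": json.dumps(body),
        "contentType": "application/json",
        "accept": "application/json",
    }


def invoke(request: dict, region: str) -> dict:
    """Send the request and decode the JSON response. Requires boto3."""
    import boto3  # lazy import: only needed when actually calling AWS

    runtime = boto3.client("bedrock-runtime", region_name=region)
    response = runtime.invoke_model(**request)
    return json.loads(response["body"].read())
```

Usage would look like `invoke(build_invoke_request("arn:aws:bedrock:us-east-1:111122223333:imported-model/EXAMPLE", "Summarize this function"), "us-east-1")`. Nothing in the calling code differs from invoking a native model except the identifier.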

Step 4 — Bedrock serves it on-demand (and scales to zero)

You do not deploy, scale, or manage anything to serve the model. When a request arrives, Bedrock loads your imported model onto managed accelerators and serves it; when traffic stops, it releases that capacity. You are billed only for the compute your model actively uses (see §V), and an idle imported model does not run up a standing hourly bill. The one behavioural consequence of scaling to zero is the cold start, covered in §VI.

the workflow in one line

Artifacts in S3 → a one-off model-import job (with an IAM role to read S3) → invoke through the standard Bedrock runtime API → Bedrock serves it on-demand and scales to zero. No endpoint, no instances, no autoscaler — just an import job and then ordinary inference calls.

how you actually pay

V. How billing works: Custom Model Units and active compute

This is the part that distinguishes Custom Model Import from every dedicated-hosting option, and the part most worth understanding before you commit. You are not billed a flat hourly rate to keep an endpoint warm. You are billed for the compute your model actively uses, in short windows, with no charge while it is idle.

Custom Model Import meters serving in Custom Model Units (CMUs). A larger or longer-context model needs more of this dedicated compute to serve a request than a small one, so the number of CMUs your model requires scales with its size and context length. The billing shape that matters: you pay for CMUs only while the model is actively processing, charged in roughly 5-minute increments of active inference compute, rather than as a continuous hourly reservation. When no requests are in flight and the model has scaled to zero, that compute charge stops.
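To make the window-based shape concrete, here is a small cost sketch. It uses a deliberately simplified model: a 5-minute window opens at the first request that is not covered by the previous window, and each opened window is billed in full. Both that rule and the price used in the usage note are assumptions for illustration, not AWS's published metering rules or rates.

```python
from typing import Iterable

WINDOW_SECONDS = 5 * 60  # the ~5-minute billing window described above


def count_billing_windows(request_times: Iterable[float]) -> int:
    """Count windows, given request timestamps in seconds (any order)."""
    windows = 0
    window_end = float("-inf")
    for t in sorted(request_times):
        if t >= window_end:  # not covered by the current window
            windows += 1
            window_end = t + WINDOW_SECONDS
    return windows


def active_compute_cost(request_times: Iterable[float], cmus: int,
                        price_per_cmu_minute: float) -> float:
    """Cost = windows x 5 minutes x CMUs x (hypothetical) price per CMU-minute."""
    windows = count_billing_windows(request_times)
    return windows * (WINDOW_SECONDS / 60) * cmus * price_per_cmu_minute
```

Under this model, five requests packed into one 5-minute burst open a single window, while the same five requests spaced ten minutes apart open five. With a hypothetical 2 CMUs at $0.10 per CMU-minute, that is $1 versus $5 for identical request counts.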

Contrast this with serving a Bedrock fine-tuned model on Provisioned Throughput, which bills a flat hourly rate continuously for as long as the model is deployed, whether or not you send it traffic. For an intermittent workload — a model that is busy in bursts and idle the rest of the day — the Custom Model Import economics are dramatically better, because you are not paying for all the idle hours. For a workload that is busy essentially all the time, the comparison narrows, and a reserved option can become competitive again at the top end.
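The point where the comparison narrows can be estimated with simple arithmetic. The sketch below compares a month of active-compute billing against a flat hourly reservation. Every rate in the usage note is a hypothetical placeholder, and the model ignores window rounding and storage charges, so treat the result as a rough guide only.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def active_cost(active_hours: float, cmus: int, price_per_cmu_minute: float) -> float:
    """Monthly cost when billed only for hours of active compute."""
    return active_hours * 60 * cmus * price_per_cmu_minute


def reserved_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of a flat hourly reservation that runs continuously."""
    return hourly_rate * hours


def break_even_active_hours(hourly_rate: float, cmus: int,
                            price_per_cmu_minute: float,
                            hours: float = HOURS_PER_MONTH) -> float:
    """Active hours per month at which both options cost the same."""
    return reserved_cost(hourly_rate, hours) / (60 * cmus * price_per_cmu_minute)


def cheaper_option(active_hours: float, hourly_rate: float, cmus: int,
                   price_per_cmu_minute: float) -> str:
    """Return 'import' or 'reserved' for the given monthly activity."""
    import_cost = active_cost(active_hours, cmus, price_per_cmu_minute)
    return "import" if import_cost < reserved_cost(hourly_rate) else "reserved"
```

With made-up rates of $8 per hour reserved versus 2 CMUs at $0.10 per CMU-minute, the break-even sits at roughly 487 active hours a month: a model busy 100 hours favors import, while one busy 700 hours favors the reservation.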

Two things to keep in mind beyond the per-compute charge. First, there is typically a storage charge for keeping your imported model artifacts available to Bedrock — a small recurring cost separate from inference. Second, because billing tracks active compute in short windows, a stream of sporadic single requests can each pull the model onto compute and incur a minimum window, so very spiky one-off traffic is not always as cheap per request as steady bursts. As always, the figures and the exact unit definitions here are representative for 2026 — confirm current CMU pricing, the window granularity, and storage charges on the AWS Bedrock pricing page.

the billing model in one line

Imported models bill in Custom Model Units for active compute, in ~5-minute windows, and scale to zero (no charge while idle) — plus a small storage charge for the artifacts. That is the opposite of a fine-tuned model on Provisioned Throughput, which bills a flat hourly rate 24/7 whether used or not. Import wins for intermittent traffic; reserved wins at constant high volume.

the trade-off

VI. Cold-start behaviour: the price of scaling to zero

Scaling to zero is what makes the economics good, and it is also the one behaviour you have to design for. When an imported model has gone idle and been released from compute, the next request has to wait for it to be loaded back onto an accelerator. That is the cold start.

The mechanism is straightforward. Because Bedrock is not holding your model warm on dedicated capacity around the clock, a request that arrives after a period of inactivity triggers the model to be loaded onto an accelerator before it can respond. That load adds latency to the first request: on the order of seconds, and longer for larger models, since more weights have to be moved into memory. Once the model is warm, subsequent requests are served at normal speed; the cold start is a property of the first request after idleness, not of steady traffic.
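On the client side, a cold start is usually handled by retrying with backoff. The sketch below is generic: `call` is whatever function performs the invocation, and `is_not_ready` decides whether an exception means "model still loading". With boto3 I believe this surfaces as a `ModelNotReadyException` error code from the runtime client, but treat that name as an assumption to verify. The delays are arbitrary starting points.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_cold_start_retry(
    call: Callable[[], T],
    is_not_ready: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with exponential backoff while the model is loading."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise unrelated errors, and the final cold-start failure.
            if attempt == max_attempts or not is_not_ready(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Only the first request after idleness should ever hit the retry path; warm traffic returns on the first attempt.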

Whether this matters depends entirely on your workload. For asynchronous and batch-like use — background document processing, enrichment jobs, internal tools, anything where a few extra seconds on the first call is invisible — the cold start is a non-issue and the scale-to-zero savings are pure upside. For latency-critical, user-facing paths where every interaction must respond instantly even after a quiet period, an unpredictable first-request delay can be a real problem, and you should test it against your latency budget rather than assume it away.

There are sensible ways to manage it. You can keep the model warm with a low-rate heartbeat of synthetic requests so it rarely scales fully to zero during business hours (trading a little compute cost for predictable latency), route latency-critical traffic to an always-available base model and reserve the imported model for tolerant paths, or simply accept the cold start where the workload allows. The honest framing: Custom Model Import is at its best for intermittent, latency-tolerant workloads. If you truly need constant, instant, sustained serving, that is the profile where a reserved option (Provisioned Throughput, or a dedicated SageMaker endpoint) earns its standing cost — which is exactly the decision §VII lays out.
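The keep-warm and routing options can be sketched as two small decisions. Everything numeric here is an assumption: the business-hours range, and especially the 4-minute idle threshold, since the text does not say how quickly an idle model is released. Measure that for your own model before relying on it. The target names are hypothetical labels.

```python
from datetime import datetime, time, timedelta


def in_business_hours(now: datetime, start: time = time(8, 0),
                      end: time = time(18, 0)) -> bool:
    """Monday to Friday, between start (inclusive) and end (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def should_send_heartbeat(now: datetime, last_request_at: datetime,
                          max_idle: timedelta = timedelta(minutes=4)) -> bool:
    """Send a synthetic request only in business hours, after an idle gap."""
    # ASSUMPTION: max_idle is a guess, not a documented scale-down time.
    return in_business_hours(now) and (now - last_request_at) >= max_idle


def route(latency_critical: bool, imported_model_warm: bool) -> str:
    """Send hot-path traffic to an always-on base model if import is cold."""
    if latency_critical and not imported_model_warm:
        return "base-model"
    return "imported-model"
```

A scheduler would call `should_send_heartbeat` every minute or so and fire a cheap request when it returns true. Outside business hours the model is allowed to scale to zero, which preserves most of the idle savings.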

design for the first request

Scaling to zero means the first request after idleness waits seconds while the model loads onto an accelerator (longer for bigger models); warm traffic is unaffected. Fine for async/batch and internal tools; test it against your latency budget for latency-critical user paths. Mitigate with a low-rate keep-warm heartbeat or route hot paths to an always-on base model.

the decision

VII. Custom Model Import vs Bedrock fine-tuning vs SageMaker hosting

These three are easy to confuse and serve genuinely different needs. The cleanest way to choose is to ask two questions: did you already train the model yourself, and how steady is the traffic? Match the answer to the right tool rather than to whichever sounds most capable.

Here is the diagnostic, in plain terms:

You already fine-tuned open weights yourself, want them on the Bedrock API, and traffic is intermittent → Custom Model Import — You own a Llama/Mistral/Flan-T5-class fine-tune and want managed serverless serving with scale-to-zero economics — no training, no endpoint, pay for active compute. This page's tool.
You want AWS to do the training, and you have steady high volume → Bedrock fine-tuning — Bedrock trains a base model from your JSONL dataset and serves the result on Provisioned Throughput (a flat hourly bill). Right when you want managed training and your volume is high and steady enough to keep reserved capacity busy. See the amazon-bedrock-fine-tuning sibling.
You need full control of the serving stack, custom containers, or an unsupported architecture → SageMaker hosting — SageMaker real-time/async/serverless endpoints let you bring any model, any container, any architecture, with full control over instances, scaling, and the inference pipeline — at the cost of operating it. See the deploy-open-source-llm-on-aws sibling.
Your traffic is constant and latency-critical → a reserved option, not import — If the model must serve instantly around the clock with no cold start, the scale-to-zero model is the wrong fit. Provisioned Throughput (for a Bedrock model) or a dedicated SageMaker endpoint earns its standing hourly cost at constant high volume.
You only need a style/format tweak and have no weights yet → fine-tune (or even just prompt/RAG) — If you have not trained anything and only need a behaviour change, Bedrock fine-tuning — or, cheaper still, prompt engineering and RAG — may be the answer, not importing a model you would have to train first.
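The diagnostic above can be compressed into a small function. This is one possible encoding of the rules as listed, with invented parameter names and return labels, and it is a rule of thumb: real decisions also weigh cost, Regions, and quotas.

```python
def recommend(has_own_weights: bool, architecture_supported: bool,
              needs_full_control: bool, traffic_constant: bool,
              latency_critical: bool) -> str:
    """Map the diagnostic questions above to a suggested hosting option."""
    if not has_own_weights:
        # Nothing trained yet: try the cheap options before training.
        return "prompt/RAG first, then Bedrock fine-tuning"
    if needs_full_control or not architecture_supported:
        return "SageMaker hosting"
    if traffic_constant and latency_critical:
        # Scale-to-zero is the wrong fit when there must be no cold start.
        return "reserved option"
    return "Custom Model Import"
```

For example, an owned Llama-class fine-tune with bursty, latency-tolerant traffic maps to Custom Model Import, while the same model on a bespoke architecture maps to SageMaker hosting.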

custom model import vs bedrock fine-tuning vs sagemaker hosting · 2026

Dimension	Custom Model Import	Bedrock fine-tuning	SageMaker hosting
Who trains the model	You do (outside Bedrock)	AWS does (inside Bedrock, from JSONL)	You do (on SageMaker or elsewhere)
What you bring	Your fine-tuned open weights	A labelled JSONL dataset	Any model + (optionally) a container
Architectures	Supported open architectures only	Bedrock-fine-tunable base models	Any architecture / custom
Serving model	Serverless, on-demand	Provisioned Throughput	You configure the endpoint
Infra to manage	None	None (but reserved capacity)	You manage instances & scaling
Idle cost	Scales to zero	Flat hourly, 24/7	Endpoint runs (unless serverless)
Cost shape	Per active compute (CMU, ~5-min)	Flat hourly while deployed	Per instance-hour / per-request
Cold start	Yes (first request after idle)	No (always warm)	Depends on endpoint type
Best for	Owned weights + intermittent traffic	AWS-run training + steady volume	Full control / unsupported models

Two questions decide it: did you train the model yourself (import or SageMaker, not Bedrock fine-tuning), and is the traffic intermittent (import's scale-to-zero wins) or constant (a reserved option wins). Representative for 2026 — confirm specifics on the AWS Bedrock and SageMaker docs.

the fine print

VIII. Limits, quotas, and operational notes

Custom Model Import is powerful but bounded. Knowing the limits up front prevents the two most common failures: designing around an architecture or model size that cannot be imported, and being surprised by cold starts or Region gaps in production.

The constraints worth checking before you commit:

Architecture must be on the supported list — Only specific open architectures (Llama-family, Mistral/Mixtral, Flan-T5-class, and others added over time) can be imported. A bespoke or unsupported architecture is a hard stop — verify yours first.
Size and context-length ceilings apply — There are upper bounds on parameter count and context length for an importable model, and a supported precision/format for the artifacts. Very large or unusual models may exceed them; check the current quotas.
Artifacts must be complete and well-formed in S3 — You need the full set — weights (commonly safetensors), config, and tokenizer files — in the expected layout, in the same Region, readable by the import job's IAM role. Incomplete or mismatched artifacts fail the job.
Regional availability is limited and evolving — Custom Model Import is available in a subset of AWS Regions, and your imported model serves in the Region you import it into. Confirm the feature is offered in the Region your data residency and latency requirements need.
Cold start is inherent to scale-to-zero — The first request after idleness incurs load latency (seconds; more for larger models). It is a feature trade-off, not a bug — design latency-critical paths accordingly (keep-warm, or route hot paths elsewhere).
Per-account quotas and concurrency limits — There are limits on the number of imported/custom models and on concurrent serving capacity per account, some of which are soft and raisable via a support request. Plan for them if you intend to import many models.
You manage model governance and updates — An imported model is a static snapshot of the weights you uploaded; improving it means training a new version and re-importing. Bedrock serves what you gave it — versioning and retraining are your responsibility.

verify before you build

The supported architectures, size/context ceilings, Regions, and quotas all change as the feature matures. Before you design around Custom Model Import, confirm on the AWS Bedrock docs that your exact architecture and model size are importable, the feature is in your Region, and the quotas fit your plan. This is the single most common place an import plan breaks.

how it becomes $0

IX. How AWS credits fund the import, and the inference

Everything above prices Custom Model Import if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and the import job, the per-compute inference, the artifact storage, and anything you pair around the model draw those credits down before they ever touch your card.

The whole Custom Model Import workload is credit-eligible: the import job, the Custom Model Unit charges for serving your model, the S3 storage for the artifacts, and any surrounding services — a Knowledge Base for RAG, embeddings, Guardrails — all draw down AWS credits, which apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case — which is exactly what importing and evaluating a fine-tuned model is — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).

Credits matter here for a specific reason. Custom Model Import already has the friendliest cost shape of the hosting options — scale-to-zero, pay-for-active-compute — but if you are running a real evaluation, comparing your imported fine-tune against base models, and ramping traffic during a launch, those compute charges are still the kind of spend a seed-stage team would rather not pull from runway. Credits cover exactly that proof-out and ramp period, so you can import the model, run a proper head-to-head, and only commit real money once the workload is proven and growing.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: preparing the artifacts, running the import job, wiring the model into your application behind the Bedrock API, setting up evaluation, and advising honestly on whether Custom Model Import, Bedrock fine-tuning, or SageMaker hosting is the right home for your model. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.

pick the right home for your model

Custom Model Import vs fine-tuning vs SageMaker — on one screen

The headline decision, compressed. Match the row to your situation: who trained the model, what you bring, how steady the traffic is, and what you are willing to operate. Representative 2026 guidance, not quotes.

Approach	Best when…	You bring	Serving + idle cost	Cold start	Reach for it…
Custom Model Import	You own fine-tuned open weights; traffic is intermittent	Your weights (Llama/Mistral/Flan-T5-class)	Serverless, on-demand; scales to zero	Yes (first request after idle)	For owned weights + bursty traffic, no infra
Bedrock fine-tuning	You want AWS to train it; volume is steady & high	A labelled JSONL dataset	Provisioned Throughput; flat hourly 24/7	No (always warm)	For managed training + constant volume
SageMaker hosting	You need full control or an unsupported architecture	Any model + (optionally) a container	You configure the endpoint; runs unless serverless	Depends on endpoint type	For maximum control / custom serving
Base model + prompt/RAG	You have no weights and only need behaviour/facts	Nothing (or your documents)	Pay-per-token; no model hosting	No	First — before training or importing anything

Two questions decide most cases: did you train the model yourself (import or SageMaker), and is traffic intermittent (import's scale-to-zero) or constant (a reserved option)? If you have no model yet and only need a behaviour or facts change, do not import at all — prompt engineering, RAG, or a Bedrock fine-tune is cheaper. See amazon-bedrock-fine-tuning and deploy-open-source-llm-on-aws.

bringing your own model to Bedrock?

Get AWS credits that cover the import job AND the inference — and a partner to wire it up (you pay $0)

a recent match

A self-hosted Llama fine-tune, moved onto Bedrock — built on $0 — anonymized

inquiry · Series-A devtools SaaS, Amsterdam

Series-A developer-tools SaaS, 16 people, running a fine-tuned Llama model for code-comment generation

Situation: The team had fine-tuned a Llama model on their own code corpus and was self-hosting it on a pair of always-on GPU instances. The feature was used in bursts — heavy during European working hours, near-silent overnight and on weekends — but the GPUs (and the on-call burden of keeping the inference server healthy) ran 24/7. They were paying for a full week of capacity to serve roughly a third of a week of traffic, and a 16-person team did not want to keep operating a bespoke serving stack.

What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS ML partner. The partner confirmed the model's Llama architecture was supported, staged the existing safetensors weights, config, and tokenizer in S3, and ran a Custom Model Import job — putting the same fine-tuned model behind the standard Bedrock API with no endpoint to manage and scale-to-zero billing. They routed latency-tolerant batch comment-generation straight to the imported model and added a small keep-warm heartbeat during working hours for the interactive path, eliminating cold starts where they mattered. In parallel they filed a Bedrock/GenAI POC credit application plus an Activate Portfolio application to fund the migration and the inference.

Outcome: The fine-tuned model now serves on Bedrock with no standing GPU bill — idle nights and weekends cost nothing, and the team retired the self-managed inference servers and the on-call rotation around them. The import, the Custom Model Unit inference during the ramp, and the artifact storage were all covered by the approved credits, so the team paid $0 during the migration and proof-out. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

move: self-hosted GPU fleet → Bedrock Custom Model Import · idle cost: $0 (scales to zero) · credits secured: POC + Activate · out-of-pocket during build: $0

faq

Common questions

What is Amazon Bedrock Custom Model Import?

Custom Model Import is the Bedrock feature for bringing your own model. You take a model you already fine-tuned on a supported open-weights architecture (such as Meta Llama, Mistral/Mixtral, or Flan-T5), upload its weights, config, and tokenizer to Amazon S3, run a one-off import job, and Bedrock then serves it behind the same unified, serverless, on-demand API as native models like Claude and Nova. There is no endpoint to provision and no GPU fleet to manage, and the imported model scales to zero when idle. Bedrock does not train anything here — you bring the already-trained weights.

How is Custom Model Import different from Bedrock fine-tuning?

They solve different problems. Bedrock fine-tuning trains a base model for you inside Bedrock from a labelled JSONL dataset, and the resulting custom model must be served on Provisioned Throughput — a flat hourly charge that runs 24/7 while deployed. Custom Model Import does no training; you already trained the model elsewhere, and Bedrock serves your imported weights on a pay-for-active-compute model (Custom Model Units, billed in roughly 5-minute windows) that scales to zero when idle. Use fine-tuning when you want AWS to do the training and have steady high volume; use Custom Model Import when you own the weights already and traffic is intermittent.

Which model architectures can I import into Bedrock?

Support is keyed to the model architecture, not its name. As of 2026, Custom Model Import supports popular open architectures including the Meta Llama family, the Mistral and Mixtral families, and Flan-T5-class models, with additional architectures added over time. Any model built on a supported architecture, including your own fine-tune or a community variant, can generally be imported, provided it meets the size, context-length, and precision requirements. A bespoke or unsupported architecture cannot be imported. Always confirm your exact architecture is on the current supported list in the AWS Bedrock documentation before designing around it.
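Because support follows the architecture rather than the model name, a quick pre-flight check can read the architecture a checkpoint declares about itself. The sketch below assumes a Hugging Face style config.json with an `architectures` field. The allowlist is illustrative only and is not the official AWS list, and the model name is hypothetical.

```python
import json

# Illustrative allowlist only: Hugging Face class names for the families
# named above. This is NOT the official AWS supported list.
ASSUMED_SUPPORTED = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "MixtralForCausalLM",
    "T5ForConditionalGeneration",
}


def declared_architectures(config_text: str) -> list[str]:
    """Return the `architectures` entries from a config.json string."""
    config = json.loads(config_text)
    return list(config.get("architectures", []))


def looks_importable(config_text: str, supported=ASSUMED_SUPPORTED) -> bool:
    """True if at least one declared architecture is in the allowlist."""
    return any(a in supported for a in declared_architectures(config_text))


# A fine-tune with a custom name still declares its base architecture.
example_config = json.dumps({
    "_name_or_path": "acme/support-bot-v3",  # hypothetical model name
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
})
print(looks_importable(example_config))
```

A passing check only means the declared architecture matches the allowlist you supplied. Size, context-length, and precision limits still need to be checked against the current AWS documentation.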

What is the import workflow, step by step?

Four steps. (1) Stage the model artifacts (weights, commonly in safetensors format, plus config and tokenizer files) in an Amazon S3 bucket in your target Region. (2) Create a model-import job in Bedrock, pointing it at the S3 location and giving it an IAM role that can read those artifacts; Bedrock validates the architecture, ingests the weights, and registers a custom model in your account. (3) Invoke the imported model through the standard Bedrock runtime API (InvokeModel), passing the imported model's ARN as the model identifier, exactly as you would for any other model. (4) Bedrock serves it on demand, loading it onto accelerators when requests arrive and scaling to zero when idle. There are no endpoints, containers, or instance types to configure.
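Steps 2 and 3 can be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names, poll interval, and the job name, role ARN, bucket, and Region in the comments are placeholders. The client is passed in, so the helpers can be exercised without AWS credentials.

```python
import time


def build_import_job_request(job_name, model_name, role_arn, s3_uri):
    """Keyword arguments for the Bedrock create_model_import_job call."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,  # role must be able to read the S3 prefix
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def wait_for_import(bedrock, job_arn, poll_seconds=60, max_polls=120,
                    sleep=time.sleep):
    """Poll the job until it leaves InProgress; return the model ARN."""
    for _ in range(max_polls):
        job = bedrock.get_model_import_job(jobIdentifier=job_arn)
        status = job["status"]
        if status == "Completed":
            return job["importedModelArn"]
        if status == "Failed":
            raise RuntimeError(job.get("failureMessage", "import job failed"))
        sleep(poll_seconds)
    raise TimeoutError("import job still in progress")


# Usage with real clients (placeholder names, requires boto3 and credentials):
#
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="us-east-1")
#   job = bedrock.create_model_import_job(**build_import_job_request(
#       "my-import-job", "my-llama-finetune",
#       "arn:aws:iam::123456789012:role/MyImportRole",
#       "s3://my-bucket/models/my-llama-finetune/"))
#   model_arn = wait_for_import(bedrock, job["jobArn"])
```

Once the job completes, the returned ARN is what you would pass as `modelId` to `invoke_model` on a `bedrock-runtime` client. The request body schema depends on the model family, so check it against the documentation for your architecture.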

How does billing work for an imported model?

You are billed for active compute, not for a continuously running endpoint. Serving is metered in Custom Model Units (CMUs), the amount of dedicated compute your model needs, which scales with its size and context length. You pay for those units only while the model is actively processing, charged in roughly 5-minute increments. When no requests are in flight, the model scales to zero and that compute charge stops. There is also typically a small storage charge for keeping the imported artifacts available. This is the opposite of Provisioned Throughput's flat 24/7 hourly bill, which is why import is far cheaper for intermittent traffic. This billing model is described as of 2026; confirm current rates on the AWS Bedrock pricing page.
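To see how window-based billing plays out for bursty traffic, here is a small estimator. It is a sketch under a simplifying assumption: each billing window opens at the first request not covered by the previous window and lasts exactly five minutes. The rate you pass in is your own input, not an AWS price, and storage is not included.

```python
import math

WINDOW_SECONDS = 5 * 60  # the "roughly 5-minute" billing increment


def billed_windows(request_times):
    """Count billing windows for request timestamps given in seconds.

    Assumption: a window opens at the first request that the previous
    window does not cover, and lasts WINDOW_SECONDS.
    """
    windows = 0
    window_end = -math.inf
    for t in sorted(request_times):
        if t >= window_end:
            windows += 1
            window_end = t + WINDOW_SECONDS
    return windows


def compute_cost(request_times, cmus, rate_per_cmu_minute):
    """Active-compute cost: windows x 5 minutes x CMUs x rate."""
    minutes = billed_windows(request_times) * WINDOW_SECONDS / 60
    return minutes * cmus * rate_per_cmu_minute


# Three requests in one burst plus one much later: two windows are billed,
# and the idle gap in between costs nothing.
print(billed_windows([0, 60, 290, 4000]))
```

Under this model, the number of distinct bursts drives the bill more than the raw request count, which is why intermittent workloads benefit most.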

What is the cold-start behaviour, and when does it matter?

Because an imported model scales to zero when idle rather than being held warm on dedicated capacity, the first request after a period of inactivity has to wait while the model is loaded back onto an accelerator. That adds latency to that first call — on the order of seconds, and longer for larger models — while subsequent (warm) requests are served at normal speed. It is a non-issue for asynchronous, batch-like, or internal workloads where a few extra seconds on the first call is invisible. It can matter for latency-critical, user-facing paths, where you can mitigate it with a low-rate keep-warm heartbeat, by routing hot paths to an always-available base model, or by reserving capacity instead.
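For callers that can tolerate the first-call delay, one way to absorb a cold start is to retry with backoff while the model loads. The helper below is a generic sketch: the function name, attempt count, and delays are illustrative, and how you detect a "not ready" error is an assumption you supply through `is_not_ready`.

```python
import time


def invoke_with_warmup_retry(invoke, is_not_ready, max_attempts=6,
                             base_delay=2.0, sleep=time.sleep):
    """Call invoke(); on a 'model not ready' error, back off and retry.

    Any other error, or a not-ready error on the final attempt, is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception as err:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_not_ready(err):
                raise
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...


# Usage sketch with boto3 (assumes the runtime reports a cold model with the
# error code "ModelNotReadyException"; verify against the current API docs):
#
#   from botocore.exceptions import ClientError
#   def is_not_ready(err):
#       return (isinstance(err, ClientError)
#               and err.response["Error"]["Code"] == "ModelNotReadyException")
#   result = invoke_with_warmup_retry(
#       lambda: runtime.invoke_model(modelId=model_arn, body=body),
#       is_not_ready)
```

A keep-warm heartbeat is the complementary option. It avoids the delay entirely, but it also keeps the model from scaling to zero, so the active-compute charge continues.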

When should I use Custom Model Import versus hosting on SageMaker?

Use Custom Model Import when your model is on a supported open architecture, you want it behind the managed Bedrock API with serverless scale-to-zero economics, and you do not need to operate the serving stack. Use SageMaker hosting when you need full control — a custom inference container, a specific instance type, a bespoke or unsupported architecture, or a serving pipeline Bedrock does not expose — and you are willing to manage the endpoint, scaling, and the operational load that comes with it. Custom Model Import trades control for zero operational overhead; SageMaker trades operational overhead for total control. Many teams use both for different models in the same account.

What are the main limits of Custom Model Import?

The key constraints: the model architecture must be on the supported list (Llama-family, Mistral/Mixtral, Flan-T5-class, and others over time); there are upper bounds on parameter count and context length and a supported precision/format for the artifacts; the feature is available only in a subset of AWS Regions and serves in the Region you import into; there are per-account quotas on the number of imported models and concurrent capacity (some raisable via support); the first request after idle incurs a cold start; and an imported model is a static snapshot — improving it means retraining elsewhere and re-importing. Confirm the current supported architectures, size ceilings, Regions, and quotas on the AWS Bedrock docs before building around the feature.

Can AWS credits cover the import and the inference?

Yes — the import job, the Custom Model Unit charges for serving, the S3 storage for the artifacts, and any surrounding services (a Knowledge Base, embeddings, Guardrails) are all credit-eligible, and credits apply automatically against your AWS bill. The relevant pools are AWS Activate (up to $100K), a Bedrock/GenAI POC pool ($10K–$50K), and the GenAI Accelerator (up to $1M). These are largely partner-filed via the AWS Partner Network. CloudRoute routes you to the right pool and a vetted AWS ML partner who files the application and does the work — staging the artifacts, running the import, wiring the model into your app, and setting up evaluation — so the customer pays $0 and AWS funds it.

Bring your model to Bedrock — on AWS's budget, not your runway

Custom Model Import already has the friendliest cost shape — serverless, scale-to-zero, pay only for active compute — and AWS credits cover even that. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who confirms your architecture is supported, runs the import, wires it behind the Bedrock API, and tells you honestly whether import, fine-tuning, or SageMaker is the right home for your model. Customer pays $0.

Get matched in 24h → · → see the AI-team persona detail

matched within

< 24h

GenAI credit ceiling

up to $1M

cost to you

$0