A complete, neutral reference for Custom Model Import on Amazon Bedrock: how to take a model you already fine-tuned on open weights (a Llama, Mistral, Mixtral, Flan-T5, or similar architecture) and import it into Bedrock so it runs behind the same unified, serverless, on-demand API as every other model. It covers the supported architectures; the import workflow (weights in S3 → an import job → invoke); how billing actually works (Custom Model Units, charged in 5-minute windows of active compute, scaled to zero when idle); the cold-start trade-off; the limits; and when to import versus fine-tune inside Bedrock versus host on SageMaker. It also explains how AWS credits can fund the import and the inference so the build costs you $0.
Custom Model Import is the Bedrock feature for bringing your own model. If you have already fine-tuned an open-weights model somewhere else — your own GPUs, SageMaker, another platform — Custom Model Import lets you upload those weights and have Bedrock serve them behind its standard, fully-managed API, exactly as if they were a native Bedrock model.
Most of the customization story on Bedrock is about adapting models that AWS hosts for you. Custom Model Import inverts that: you supply the model. You have a checkpoint — a set of fine-tuned weights for a supported architecture, sitting in a directory of safetensors files and config — and you want to call it through Bedrock instead of standing up and babysitting your own inference server. Custom Model Import takes that checkpoint, registers it as a custom model in your account, and from then on you invoke it with the same Bedrock runtime API you use for everything else.
The crucial property is that the imported model runs on-demand and serverless. You do not create or manage an endpoint, you do not choose instance types, you do not run an autoscaler, and you do not pay a flat hourly fee to keep it warm. Bedrock loads your model onto managed accelerators when requests arrive, serves them, and releases the capacity when traffic stops — billing you only for the compute your model actually uses. From the application's point of view it is just another model ID behind InvokeModel; the fact that the weights are yours is an implementation detail.
This makes Custom Model Import the bridge between two worlds that used to be separate: the freedom of open-weights fine-tuning (you own the model, you trained it exactly how you wanted, on whatever data and recipe you chose) and the operational simplicity of a managed model API (no servers, one auth model, one bill, the same Bedrock features around it). You get to keep the model you built and stop operating the infrastructure to serve it.
One caveat, stated once and meant throughout: the exact list of supported architectures, the regional availability, the size limits, and the precise pricing for Custom Model Import change as the feature evolves. The specifics here are representative as of 2026 to convey the shape of the feature and its economics. Always confirm the current supported-architecture list, Regions, quotas, and Custom Model Unit pricing on the official AWS Bedrock documentation and pricing pages before you build around them.
Custom Model Import = upload your own fine-tuned open-weights model (Llama / Mistral / Flan-T5-class architecture) into Bedrock and invoke it through the same unified, serverless, on-demand API as native models — no endpoint to manage, billed for active compute, scaling to zero when idle. You bring the weights; Bedrock handles the serving.
Teams fine-tune open-weights models for good reasons: full control of the training recipe and data, no per-token vendor pricing, the ability to take the weights anywhere. The pain comes the day they have to put that model into production and keep it running. Custom Model Import is the answer to that second problem.
Serving an open-weights LLM yourself is real systems work. You provision GPU instances (often scarce and expensive), pick an inference server, handle model sharding and quantization, build autoscaling that reacts to traffic, manage health checks and rolling deploys, and pay for the accelerators 24/7 — including all the hours nobody is sending requests. For a small team, or for a model that only gets occasional traffic, that standing cost and operational load can dwarf the value of having fine-tuned the model in the first place.
The two managed alternatives each have a catch. Calling a hosted model API means giving up your own weights and accepting per-token vendor pricing. Hosting your model on a dedicated endpoint (whether SageMaker or your own cluster) means paying for that endpoint continuously, whether or not it is busy. Custom Model Import threads the needle: you keep your own fine-tuned weights, and Bedrock serves them without a continuously-running endpoint — you pay for active compute and the model scales to zero when idle. For intermittent or unpredictable workloads, that pay-for-what-you-use shape is the whole point.
It also unifies your stack. Once your model is imported, it lives alongside Claude, Nova, Llama, Mistral and the rest under one IAM boundary, one billing relationship, and the broader Bedrock toolset. The same governance, logging (CloudTrail, model-invocation logging), and many of the same Bedrock features apply. You stop maintaining a separate serving system that exists only to host one fine-tuned model, and you fold it into the platform you are already using for the rest of your generative-AI workload.
Custom Model Import does not accept arbitrary models. It supports a defined set of well-known open-weights architectures, and your model has to match one of them. The reason is mechanical: Bedrock has to know how to load and run the weights, so the architecture must be one it has built serving support for.
The supported set centers on the popular open text architectures. As a practical 2026 guide, Custom Model Import has supported architectures in the Meta Llama family, the Mistral and Mixtral families, and encoder-decoder/text models such as Flan-T5, along with other widely-used open architectures that AWS adds over time. The key word is architecture: support is keyed to the model's structure, not its name. Any model whose architecture matches a supported one — including your own fine-tune, a continued-pre-trained variant, or a community model built on that architecture — can generally be imported. A model built on an unsupported or bespoke architecture cannot.
There are practical requirements on the artifacts themselves. You import a model in the standard open-weights layout: the weights (commonly safetensors), the configuration files, and the tokenizer files that together define the model, staged in Amazon S3. The model has to be in a precision and of a size that Bedrock can serve, and there are parameter-count and context-length ceilings on what can be imported (see the limits in §VIII). The discipline that matters is the one that trips up every customization plan: confirm that your exact architecture and size are on the current supported list before you design around importing the model, because this is the most common place an import plan breaks.
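A quick local pre-flight check can catch an incomplete artifact directory before anything is uploaded. The sketch below assumes the common Hugging Face file names (`config.json`, `*.safetensors`, `tokenizer.json` and friends) and reads the `architectures` field from the config. The `ASSUMED_SUPPORTED` set is purely illustrative and is not the official supported list; replace it with whatever the current AWS documentation says.

```python
import json
from pathlib import Path

# Illustrative placeholder, NOT the official supported list. Check the AWS docs.
ASSUMED_SUPPORTED = {
    "LlamaForCausalLM",
    "MistralForCausalLM",
    "MixtralForCausalLM",
    "T5ForConditionalGeneration",
}

# Assumed tokenizer file names (Hugging Face convention); any one of them counts.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer.model", "tokenizer_config.json")


def check_artifacts(model_dir, supported=ASSUMED_SUPPORTED):
    """Return a list of problems found; an empty list means the layout looks plausible."""
    root = Path(model_dir)
    problems = []

    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("missing config.json")
    else:
        config = json.loads(config_path.read_text())
        archs = config.get("architectures") or []
        if not any(arch in supported for arch in archs):
            problems.append(f"architecture {archs} not in the assumed supported set")

    if not list(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")

    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        problems.append("no tokenizer files found")

    return problems
```

This only checks that the files exist and that the declared architecture is one you expect. It says nothing about precision, parameter count, or context length, which still need to be confirmed against the published limits.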
What you bring is typically a model you fine-tuned or further-trained yourself outside Bedrock: on SageMaker, on your own accelerators, or anywhere you had the freedom to train exactly the way you wanted. Custom Model Import does not train anything; it assumes you have already produced the weights. (If you want AWS to do the training for you, that is Bedrock fine-tuning, covered in §VII and in the companion Amazon Bedrock fine-tuning guide.)
Support is keyed to the model architecture, not the name. Llama-family, Mistral/Mixtral-family, and Flan-T5-class architectures have been among the supported set, with more added over time; bespoke or unsupported architectures cannot be imported. Confirm your exact architecture, precision, and size are on the current Custom Model Import supported list on the AWS docs before building your plan around it.
The mechanics are deliberately simple: stage the model files in S3, run a one-off import job, and then invoke the imported model through the standard Bedrock runtime API. No endpoints, no containers, no instance configuration.
The whole flow is a one-time setup followed by ordinary inference. Here is what each step actually involves.
Put your model files in an S3 bucket in the Region where you will import: the weights (typically safetensors), the model config, and the tokenizer files, in the standard open-weights directory layout. The artifacts must be complete and match a supported architecture. Keep them in the same Region as the import job, and make sure the role you will use can read the bucket.
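As a rough illustration of the staging step, the sketch below maps every file under a local model directory to an S3 key under one prefix and then uploads them with boto3's `upload_file`. The directory, bucket, prefix, and Region are placeholders you would supply; the helper names `plan_uploads` and `upload` are hypothetical.

```python
from pathlib import Path


def plan_uploads(model_dir, prefix):
    """Return sorted (local_path, s3_key) pairs for every file under model_dir."""
    root = Path(model_dir)
    prefix = prefix.strip("/")
    pairs = []
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            key = f"{prefix}/{relative}" if prefix else relative
            pairs.append((str(path), key))
    return sorted(pairs)


def upload(model_dir, bucket, prefix, region):
    """Upload the planned files. The bucket should be in the Region used for the import."""
    import boto3  # imported lazily so plan_uploads works without the SDK installed

    s3 = boto3.client("s3", region_name=region)
    for local_path, key in plan_uploads(model_dir, prefix):
        s3.upload_file(local_path, bucket, key)
```

Separating the plan from the upload lets you review the key layout before any data moves.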
In the Bedrock console (or via API/SDK) you create a model-import job: point it at the S3 location of your artifacts, give the resulting model a name, and provide an IAM role that grants Bedrock permission to read your S3 data. Bedrock validates the artifacts against the supported architecture, ingests the weights, and registers the result as a custom (imported) model in your account. The job runs once and takes on the order of minutes (longer for large models); you are not standing up any persistent infrastructure here — you are registering the model so Bedrock can serve it on demand later.
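For the API route, the import job can be sketched with the boto3 `bedrock` client. The parameter and response field names below (`jobName`, `importedModelName`, `roleArn`, `modelDataSource`, `jobArn`, `status`) follow the CreateModelImportJob and GetModelImportJob operations as commonly documented; treat them as assumptions and verify them against the current SDK reference. The job name, model name, role ARN, and S3 URI are placeholders.

```python
def build_import_job_request(job_name, model_name, role_arn, s3_uri):
    """Assemble keyword arguments for the model-import job call."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must look like s3://bucket/prefix/")
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,  # role that lets Bedrock read the S3 artifacts
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import(request, region, poll_seconds=30):
    """Submit the job, then poll until it is no longer in progress."""
    import time
    import boto3  # imported lazily so the builder above works without the SDK

    bedrock = boto3.client("bedrock", region_name=region)
    job_arn = bedrock.create_model_import_job(**request)["jobArn"]
    while True:
        job = bedrock.get_model_import_job(jobIdentifier=job_arn)
        if job["status"] != "InProgress":
            return job  # inspect the final status before using the model
        time.sleep(poll_seconds)
```

The polling loop is deliberately simple. A production version would add a timeout and handle a failed job explicitly.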
Once the import completes, your model is just another model in Bedrock. You call it with the standard runtime API, InvokeModel (or the conversational path, where the model is chat-formatted), referencing the imported model by its identifier, exactly as you would call Claude or Nova. The same IAM authorization, CloudTrail audit logging, and model-invocation logging apply. Your application code does not need to know the weights are yours; it is one more model ID behind the same API.
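A minimal invocation sketch with the boto3 `bedrock-runtime` client looks like the following. The request body fields (`prompt`, `max_tokens`, `temperature`) are an assumption: the body an imported model accepts depends on its architecture, so check the expected format for yours. The model ARN and Region are placeholders.

```python
import json


def build_invoke_kwargs(model_arn, prompt, max_tokens=256, temperature=0.2):
    """Assemble InvokeModel arguments; the body schema is assumed, not guaranteed."""
    body = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return {
        "modelId": model_arn,  # the imported model's identifier
        "body": json.dumps(body),
        "contentType": "application/json",
        "accept": "application/json",
    }


def invoke(model_arn, prompt, region):
    """Call the imported model and return the decoded JSON response."""
    import boto3  # imported lazily so the builder above works without the SDK

    runtime = boto3.client("bedrock-runtime", region_name=region)
    response = runtime.invoke_model(**build_invoke_kwargs(model_arn, prompt))
    return json.loads(response["body"].read())
```

Nothing in the calling code is specific to imported models apart from the identifier, which is the point the paragraph above makes.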
You do not deploy, scale, or manage anything to serve the model. When a request arrives, Bedrock loads your imported model onto managed accelerators and serves it; when traffic stops, it releases that capacity. You are billed only for the compute your model actively uses (see §V), and an idle imported model does not run up a standing hourly bill. The one behavioural consequence of scaling to zero is the cold start, covered in §VI.
Artifacts in S3 → a one-off model-import job (with an IAM role to read S3) → invoke through the standard Bedrock runtime API → Bedrock serves it on-demand and scales to zero. No endpoint, no instances, no autoscaler — just an import job and then ordinary inference calls.
This is the part that distinguishes Custom Model Import from every dedicated-hosting option, and the part most worth understanding before you commit. You are not billed a flat hourly rate to keep an endpoint warm. You are billed for the compute your model actively uses, in short windows, with no charge while it is idle.
Custom Model Import meters serving in Custom Model Units (CMUs). A larger or longer-context model needs more compute to serve a request than a small one, so the number of CMUs your model requires scales with its size and context length. The billing shape that matters: you pay for CMUs only while the model is active, charged in roughly 5-minute windows of inference compute rather than as a continuous hourly reservation. When no requests are in flight and the model has scaled to zero, that compute charge stops.
Contrast this with serving a Bedrock fine-tuned model on Provisioned Throughput, which bills a flat hourly rate continuously for as long as the model is deployed, whether or not you send it traffic. For an intermittent workload — a model that is busy in bursts and idle the rest of the day — the Custom Model Import economics are dramatically better, because you are not paying for all the idle hours. For a workload that is busy essentially all the time, the comparison narrows, and a reserved option can become competitive again at the top end.
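The comparison can be made concrete with a small cost model. It assumes the imported model costs `cmus * price_per_cmu_minute` for every active minute and nothing while idle, and that the reserved option costs a flat hourly rate for every hour of the month. All prices are hypothetical inputs, not AWS figures, and storage is ignored.

```python
def monthly_costs(active_fraction, cmus, price_per_cmu_minute,
                  flat_hourly_rate, hours_in_month=730):
    """Return (import_cost, reserved_cost) for one month under the simple model."""
    active_minutes = active_fraction * hours_in_month * 60
    import_cost = active_minutes * cmus * price_per_cmu_minute
    reserved_cost = hours_in_month * flat_hourly_rate
    return import_cost, reserved_cost


def breakeven_active_fraction(cmus, price_per_cmu_minute, flat_hourly_rate):
    """Fraction of the month the model must be active before reserved is cheaper."""
    active_hourly = 60 * cmus * price_per_cmu_minute
    return min(1.0, flat_hourly_rate / active_hourly)
```

With made-up inputs of 2 CMUs at 0.10 per CMU-minute against a flat 6.00 per hour, the break-even sits at half the month: below that level of activity the imported model is cheaper, and above it the reserved option is. Real break-even points depend entirely on current pricing.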
Two things to keep in mind beyond the per-compute charge. First, there is typically a storage charge for keeping your imported model artifacts available to Bedrock — a small recurring cost separate from inference. Second, because billing tracks active compute in short windows, a stream of sporadic single requests can each pull the model onto compute and incur a minimum window, so very spiky one-off traffic is not always as cheap per request as steady bursts. As always, the figures and the exact unit definitions here are representative for 2026 — confirm current CMU pricing, the window granularity, and storage charges on the AWS Bedrock pricing page.
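The minimum-window effect is easy to see with a simplified metering model. The sketch assumes a request that arrives when no window is open starts a new fixed-length window, and that requests inside an open window add nothing. Actual metering may differ (for example in how a window is extended), and the price is a hypothetical input.

```python
def billed_windows(request_times_s, window_s=300):
    """Count billing windows for request arrival times given in seconds."""
    windows = 0
    window_end = None
    for t in sorted(request_times_s):
        if window_end is None or t >= window_end:
            windows += 1  # this request opens a new window
            window_end = t + window_s
    return windows


def inference_cost(request_times_s, cmus, price_per_cmu_minute, window_s=300):
    """Compute cost for the billed windows; storage is not included."""
    billed_minutes = billed_windows(request_times_s, window_s) * window_s / 60
    return billed_minutes * cmus * price_per_cmu_minute
```

Under this model, twelve requests ten seconds apart fit in one window, while twelve requests an hour apart open twelve windows, so the same number of requests costs twelve times as much.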
Imported models bill in Custom Model Units for active compute, in ~5-minute windows, and scale to zero (no charge while idle) — plus a small storage charge for the artifacts. That is the opposite of a fine-tuned model on Provisioned Throughput, which bills a flat hourly rate 24/7 whether used or not. Import wins for intermittent traffic; reserved wins at constant high volume.
Scaling to zero is what makes the economics good, and it is also the one behaviour you have to design for. When an imported model has gone idle and been released from compute, the next request has to wait for it to be loaded back onto an accelerator. That is the cold start.
The mechanism is straightforward. Because Bedrock is not holding your model warm on dedicated capacity around the clock, a request that arrives after a period of inactivity triggers the model to be loaded onto an accelerator before it can respond. That load adds latency to the first request — on the order of seconds, and longer for larger models, since more weight has to be moved into memory. Once the model is warm, subsequent requests are served at normal speed; the cold start is a property of the first request after idleness, not of steady traffic.
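On the client side, one way to absorb a cold start is to retry the first call with exponential backoff while the model loads. The wrapper below is a generic sketch: `call` is any zero-argument function that performs the request, and `is_not_ready` is a predicate you supply to recognise the "model still loading" error. The attempt count and delays are arbitrary illustrative values.

```python
import time


def call_with_cold_start_retry(call, is_not_ready, max_attempts=5,
                               base_delay_s=2.0, sleep=time.sleep):
    """Run call(); on a not-ready error, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a cold-start error, or the last failure.
            if not is_not_ready(exc) or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
```

With boto3, the predicate would typically inspect the error code on the raised client error; Bedrock's runtime is understood to signal this case with a `ModelNotReadyException`, and the SDK may already retry it on its own, but confirm both against the current API reference before relying on them.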
Whether this matters depends entirely on your workload. For asynchronous and batch-like use — background document processing, enrichment jobs, internal tools, anything where a few extra seconds on the first call is invisible — the cold start is a non-issue and the scale-to-zero savings are pure upside. For latency-critical, user-facing paths where every interaction must respond instantly even after a quiet period, an unpredictable first-request delay can be a real problem, and you should test it against your latency budget rather than assume it away.
There are sensible ways to manage it. You can keep the model warm with a low-rate heartbeat of synthetic requests so it rarely scales fully to zero during business hours (trading a little compute cost for predictable latency), route latency-critical traffic to an always-available base model and reserve the imported model for tolerant paths, or simply accept the cold start where the workload allows. The honest framing: Custom Model Import is at its best for intermittent, latency-tolerant workloads. If you truly need constant, instant, sustained serving, that is the profile where a reserved option (Provisioned Throughput, or a dedicated SageMaker endpoint) earns its standing cost — which is exactly the decision §VII lays out.
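The keep-warm heartbeat can be reduced to one decision: send a synthetic request only during working hours and only when the model has been quiet long enough that it may be about to scale down. In the sketch below the working-hours window and the 240-second idle threshold are assumptions chosen for illustration; the source does not state how long Bedrock keeps an idle model loaded, so tune the threshold from your own measurements.

```python
from datetime import datetime, time as dtime


def should_send_heartbeat(now, last_request_at, idle_threshold_s=240,
                          start=dtime(8, 0), end=dtime(18, 0)):
    """True when inside weekday working hours and idle for at least the threshold."""
    in_hours = now.weekday() < 5 and start <= now.time() < end
    idle_s = (now - last_request_at).total_seconds()
    return in_hours and idle_s >= idle_threshold_s
```

A scheduler would call this once a minute and issue a tiny request when it returns true. Every heartbeat is billed compute, so the threshold directly trades cost against how often users hit a cold model.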
Scaling to zero means the first request after idleness waits seconds while the model loads onto an accelerator (longer for bigger models); warm traffic is unaffected. Fine for async/batch and internal tools; test it against your budget for latency-critical user paths. Mitigate with a low-rate keep-warm heartbeat or route hot paths to an always-on base model.
Custom Model Import, Bedrock fine-tuning, and SageMaker hosting are easy to confuse but serve genuinely different needs. The cleanest way to choose is to ask two questions: did you already train the model yourself, and how steady is the traffic? Match the answers to the right tool rather than to whichever sounds most capable.
Here is the diagnostic, in plain terms:
| Dimension | Custom Model Import | Bedrock fine-tuning | SageMaker hosting |
|---|---|---|---|
| Who trains the model | You do (outside Bedrock) | AWS does (inside Bedrock, from JSONL) | You do (on SageMaker or elsewhere) |
| What you bring | Your fine-tuned open weights | A labelled JSONL dataset | Any model + (optionally) a container |
| Architectures | Supported open architectures only | Bedrock-fine-tunable base models | Any architecture / custom |
| Serving model | Serverless, on-demand | Provisioned Throughput | You configure the endpoint |
| Infra to manage | None | None (but reserved capacity) | You manage instances & scaling |
| Idle cost | Scales to zero | Flat hourly, 24/7 | Endpoint runs (unless serverless) |
| Cost shape | Per active compute (CMU, ~5-min) | Flat hourly while deployed | Per instance-hour / per-request |
| Cold start | Yes (first request after idle) | No (always warm) | Depends on endpoint type |
| Best for | Owned weights + intermittent traffic | AWS-run training + steady volume | Full control / unsupported models |
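The table can be compressed into a few lines of decision logic. This is a simplification of the guidance above, not an official rule set; the function name, its flags, and the returned labels are hypothetical.

```python
def recommend_hosting(trained_it_yourself, architecture_supported=True,
                      traffic_is_steady=False, need_full_control=False):
    """Map the two diagnostic questions (plus two escape hatches) to a starting point."""
    if need_full_control:
        return "SageMaker hosting"
    if not trained_it_yourself:
        return "Bedrock fine-tuning"  # AWS runs the training from your dataset
    if not architecture_supported:
        return "SageMaker hosting"  # owned weights that cannot be imported
    if traffic_is_steady:
        return "compare Custom Model Import with a dedicated endpoint"
    return "Custom Model Import"  # owned weights and intermittent traffic
```

The steady-traffic branch returns a comparison rather than a verdict because, as the billing discussion notes, the cheaper option at sustained load depends on current prices.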
Custom Model Import is powerful but bounded. Knowing the limits up front prevents the two most common failures: designing around an architecture or model size that cannot be imported, and being surprised by cold starts or Region gaps in production.
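The constraints worth checking before you commit are the supported-architecture list, the parameter-count and context-length ceilings, the artifact format and precision, Regional availability, account quotas, and cold-start latency.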
The supported architectures, size/context ceilings, Regions, and quotas all change as the feature matures. Before you design around Custom Model Import, confirm on the AWS Bedrock docs that your exact architecture and model size are importable, the feature is in your Region, and the quotas fit your plan. This is the single most common place an import plan breaks.
Everything above prices Custom Model Import if you pay AWS directly. For most startups and many companies the relevant number is different, because AWS will frequently fund the work with credits — and the import job, the per-compute inference, the artifact storage, and anything you pair around the model draw those credits down before they ever touch your card.
The whole Custom Model Import workload is credit-eligible: the import job, the Custom Model Unit charges for serving your model, the S3 storage for the artifacts, and any surrounding services — a Knowledge Base for RAG, embeddings, Guardrails — all draw down AWS credits, which apply automatically against your bill until exhausted. The relevant pools are AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups), a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed specifically at proving out a GenAI use case — which is exactly what importing and evaluating a fine-tuned model is — and the competitive Generative AI Accelerator (awards up to $1M for a small cohort of AI-first startups).
Credits matter here for a specific reason. Custom Model Import already has the friendliest cost shape of the hosting options — scale-to-zero, pay-for-active-compute — but if you are running a real evaluation, comparing your imported fine-tune against base models, and ramping traffic during a launch, those compute charges are still the kind of spend a seed-stage team would rather not pull from runway. Credits cover exactly that proof-out and ramp period, so you can import the model, run a proper head-to-head, and only commit real money once the workload is proven and growing.
The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form — which is why teams route through an AWS partner rather than applying alone. That is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS ML partner who both files the credit application and does the work: preparing the artifacts, running the import job, wiring the model into your application behind the Bedrock API, setting up evaluation, and advising honestly on whether Custom Model Import, Bedrock fine-tuning, or SageMaker hosting is the right home for your model. The customer pays $0 — AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice. Related: AWS credits for generative-AI startups and Bedrock POC funding.
The headline decision, compressed. Match the row to your situation: who trained the model, what you bring, how steady the traffic is, and what you are willing to operate. Representative 2026 guidance, not quotes.
| Approach | Best when… | You bring | Serving + idle cost | Cold start | Reach for it… |
|---|---|---|---|---|---|
| Custom Model Import | You own fine-tuned open weights; traffic is intermittent | Your weights (Llama/Mistral/Flan-T5-class) | Serverless, on-demand; scales to zero | Yes (first request after idle) | For owned weights + bursty traffic, no infra |
| Bedrock fine-tuning | You want AWS to train it; volume is steady & high | A labelled JSONL dataset | Provisioned Throughput; flat hourly 24/7 | No (always warm) | For managed training + constant volume |
| SageMaker hosting | You need full control or an unsupported architecture | Any model + (optionally) a container | You configure the endpoint; runs unless serverless | Depends on endpoint type | For maximum control / custom serving |
| Base model + prompt/RAG | You have no weights and only need behaviour/facts | Nothing (or your documents) | Pay-per-token; no model hosting | No | First — before training or importing anything |
Situation: The team had fine-tuned a Llama model on their own code corpus and was self-hosting it on a pair of always-on GPU instances. The feature was used in bursts — heavy during European working hours, near-silent overnight and on weekends — but the GPUs (and the on-call burden of keeping the inference server healthy) ran 24/7. They were paying for a full week of capacity to serve roughly a third of a week of traffic, and a 16-person team did not want to keep operating a bespoke serving stack.
What CloudRoute did: CloudRoute matched them in under 24 hours to an EU AWS ML partner. The partner confirmed the model's Llama architecture was supported, staged the existing safetensors weights, config, and tokenizer in S3, and ran a Custom Model Import job — putting the same fine-tuned model behind the standard Bedrock API with no endpoint to manage and scale-to-zero billing. They routed latency-tolerant batch comment-generation straight to the imported model and added a small keep-warm heartbeat during working hours for the interactive path, eliminating cold starts where they mattered. In parallel they filed a Bedrock/GenAI POC credit application plus an Activate Portfolio application to fund the migration and the inference.
Outcome: The fine-tuned model now serves on Bedrock with no standing GPU bill — idle nights and weekends cost nothing, and the team retired the self-managed inference servers and the on-call rotation around them. The import, the Custom Model Unit inference during the ramp, and the artifact storage were all covered by the approved credits, so the team paid $0 during the migration and proof-out. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.
move: self-hosted GPU fleet → Bedrock Custom Model Import · idle cost: $0 (scales to zero) · credits secured: POC + Activate · out-of-pocket during build: $0
Custom Model Import already has the friendliest cost shape — serverless, scale-to-zero, pay only for active compute — and AWS credits cover even that. CloudRoute routes you to the right credit pool (Activate up to $100K, Bedrock POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted AWS ML partner who confirms your architecture is supported, runs the import, wires it behind the Bedrock API, and tells you honestly whether import, fine-tuning, or SageMaker is the right home for your model. Customer pays $0.