A neutral, build-grade walkthrough of how protected health information (PHI) is actually kept compliant when you put a large language model in front of it: Bedrock HIPAA-eligibility and the AWS Business Associate Addendum, the no-training guarantee, Guardrails PII redaction, encryption with KMS, PrivateLink and in-region inference, de-identification, audit logging, human-in-the-loop, the compliant reference architecture for clinical and administrative use cases, and the specific mistakes that turn a promising pilot into a breach.
Before any architecture, get the mental model right. HIPAA does not ban large language models, does not mention AI, and does not certify products. It imposes obligations on how protected health information is handled — and a generative-AI system is just a new kind of software that touches that information. Almost every design decision in this guide is downstream of a single question: where does PHI go, and who is accountable for it there?
HIPAA — the Health Insurance Portability and Accountability Act — protects protected health information (PHI): individually identifiable health information held or transmitted by a covered entity (a provider, health plan, or clearinghouse) or a business associate acting on its behalf. The relevant machinery for software is the Security Rule (administrative, physical, and technical safeguards for electronic PHI), the Privacy Rule (limits on use and disclosure, including the minimum-necessary principle), and the Breach Notification Rule (what you must do when PHI is exposed).
Two legal concepts shape every AWS decision that follows. The first is the Business Associate Addendum (BAA): when a vendor processes PHI on your behalf, you need a signed BAA with them, and that vendor must in turn have BAAs with any of its subcontractors who touch PHI. The second is the principle that compliance is a property of a system and its operation, not of any single component. There is no such thing as a "HIPAA-certified" model or a "HIPAA-compliant" API call in isolation. A service can be HIPAA-eligible — meaning it is covered under the AWS BAA and can be used in a compliant workload — but whether your overall system is compliant depends on how you configure and run it.
For generative AI specifically, three risks dominate, and the rest of this guide is a systematic answer to them. Unauthorized disclosure — PHI leaking to a party without a BAA, most often a consumer LLM or a third-party API outside your account boundary. Improper secondary use — PHI being used to train or improve a model, a use the patient never authorized. Inaccurate output causing harm — a hallucination treated as fact in a clinical or coverage decision. Keep PHI under your BAA and inside your boundary, never train on it, and never let an unverified output drive a high-stakes decision.
One framing to carry throughout: this is an engineering and governance problem with well-trodden answers on AWS, not an unsolved research problem. Healthcare organizations run PHI workloads on AWS at scale every day. Generative AI adds one new component — the model call — but the safeguards around it are the same ones AWS has documented for years. Teams struggle when they treat the model as special and forget the boundary; they succeed when they treat it as one more service inside an architecture they already know how to secure.
This is a technical and architectural reference, written to help engineering and compliance teams reason about the building blocks. It is not legal advice and not a substitute for your own HIPAA risk analysis. Your obligations depend on your role (covered entity vs business associate), your data, and your jurisdiction. Confirm specifics with your compliance, privacy, and legal counsel, and validate the current scope of any AWS service against the official AWS HIPAA-eligible services list before you design around it.
Putting protected health information into a large language model sounds reckless until you understand the specific properties that make it lawful on AWS. There are three, and a privacy or security reviewer will ask about each one. They are the foundation the entire reference architecture rests on: if any one is missing — no BAA, a service that trains on your inputs, or PHI traversing the open internet — the rest of the controls cannot save you. Get crisp on all three and most of a HIPAA review for a Bedrock workload collapses into a short, documentable conversation.
What it means: Amazon Bedrock is on the list of AWS HIPAA-eligible services, and AWS offers a Business Associate Addendum that, once accepted, contractually covers your use of in-scope services for processing PHI. That single fact is what lets you lawfully send PHI through the Bedrock Runtime at all. Without a BAA in place, no amount of encryption makes the workload compliant — the legal relationship has to exist first.
How you get it: the AWS BAA is accepted through AWS Artifact and applies across your in-scope accounts (commonly organized so PHI-bearing accounts are clearly designated). Crucially, it covers HIPAA-eligible services used in the appropriate configuration — not every AWS service, and not every model-adjacent tool you might bolt on. Treat the eligible-services list as a hard boundary: if a component is not on it, PHI does not flow through it. And note that "Bedrock is eligible" does not extend to a third-party API, logging SaaS, analytics pixel, or model endpoint outside AWS that you call from your app — each such hop is outside your AWS BAA.
What it means: on Amazon Bedrock, the content you send to a foundation model (your prompts) and the content it returns (completions) are not used to train or improve the underlying base models, and are not shared with the third-party model providers. Inference happens within the AWS environment under your account. This addresses the "improper secondary use" risk head-on: the PHI in a prompt is processed to produce your answer and is not absorbed into a model someone else will later query.
Why it is the crux of the privacy review: the objection that most often stalls healthcare GenAI is "if we put patient data in, does it become training data?" On Bedrock, the documented answer is no. That is materially different from pasting PHI into a consumer chatbot, whose terms may permit the provider to use inputs to improve their service. The no-training guarantee plus account-boundary processing is the difference between a defensible architecture and a reportable breach. Respect its limit, though: it is a property of Bedrock, not of "LLMs" generically, and the moment your data leaves Bedrock for an endpoint with different terms it no longer applies — keep the model layer on a platform whose terms you have read and your BAA covers.
What it means: under the AWS Shared Responsibility Model, AWS is responsible for the security of the cloud (the physical infrastructure, the managed-service substrate), and you are responsible for security in the cloud (how you configure encryption, identity, network isolation, logging, and what data you choose to send). HIPAA eligibility gives you compliant building blocks; assembling them into a compliant system is your job.
Why it matters practically: almost every real-world HIPAA failure on a cloud platform is a configuration failure, not a platform failure — an over-permissive role, an unencrypted bucket, a log of raw PHI in a place it should not be, a public endpoint left open. The platform did its part; the workload was misconfigured. Naming this up front sets the right expectation with leadership: AWS hands you a compliant-capable kit, and the work is in using it correctly and proving you did.
What it asks of you: a documented risk analysis, technical safeguards (encryption, access control, audit controls, transmission security), and operational discipline (least privilege, logging, incident response). The next sections translate each of those into concrete AWS services and patterns.
With the legal base in place, compliance becomes a set of concrete technical controls applied to where PHI lives and how it moves. These map directly onto the HIPAA Security Rule's technical safeguards — access control, encryption, audit controls, and transmission security — and onto specific AWS services. None of them is exotic; the discipline is in applying all of them, consistently, to every path PHI can take.
Think of PHI as having three states, and secure each one. At rest — in your source documents, your vector store, your databases, and your logs. In transit — moving between your application, the model, and your data stores. In use — sitting inside a prompt at inference time. The controls below cover all three.
Here is the canonical shape of a HIPAA-compliant generative-AI system on AWS, assembled from the controls above. The unifying principle is simple to state and demanding to honor: PHI stays inside your AWS account boundary, under your BAA, encrypted, logged, and off the public internet, from the moment it enters to the moment a result is returned to an authorized user. Most clinical and administrative use cases are a variation on this single pattern.
Trace one request through the system. An authenticated, authorized user (clinician, coder, care-team member) makes a request from your application. It hits your API tier inside a VPC — API Gateway plus Lambda, or a container on ECS Fargate — where IAM and your application authorization confirm the user may access this patient's data under minimum-necessary. If the use case is retrieval-grounded, you retrieve only the necessary records from a KMS-encrypted store (or a Bedrock Knowledge Base over an encrypted S3 corpus and vector store), applying tenant and record-level access controls so nothing out of scope is pulled in.
Before the model call, the request passes an input Guardrail that redacts or blocks identifiers the model does not need and screens for disallowed content. The inference call to Bedrock travels over a VPC endpoint (PrivateLink) to an in-region, HIPAA-eligible model — encrypted in transit, never on the public internet, never used to train it. The completion passes an output Guardrail for redaction, denied-topic enforcement, and a contextual-grounding check against the retrieved sources. The result returns to the authorized user, and for any clinical or high-stakes use, it is presented for human review rather than acted on automatically (Section VII). Throughout, CloudTrail and model-invocation logging write an encrypted, access-controlled audit trail.
Two architectural habits make this pattern robust rather than fragile. First, minimize and de-identify at the edges: pull the fewest records that answer the question, and strip identifiers the model does not need before they reach the prompt (Section V). Second, treat the boundary as sacred: every integration — logging, analytics, a third-party API, an email notification — is a potential PHI exit, so each is either inside your BAA and boundary or never sees PHI. Most breaches in cloud GenAI are not the model leaking; they are PHI escaping through an unconsidered side channel.
Examples: clinical documentation assistance (drafting notes from an encounter), summarizing a patient's longitudinal record for a clinician, surfacing relevant guidelines, or drafting patient-facing explanations for clinician review. These touch care decisions, so they carry the strictest posture: full PHI safeguards, tight retrieval scoping, contextual-grounding checks to suppress hallucination, and — non-negotiable — a qualified human reviews and owns the output before it informs care. The model drafts and assists; the clinician decides.
Examples: prior-authorization and claims drafting, medical-coding assistance, call-center and inbox triage, eligibility Q&A, and back-office document processing. PHI may still be involved, so the same boundary, encryption, BAA, no-training, and logging controls all apply — but the consequence of an error is operational rather than clinical, and much of the work can run on de-identified or minimized data. This is why administrative use cases are the recommended first production deployment: they exercise the full compliance architecture against a lower blast radius, so you build the muscle before you reach for clinical workloads.
Before shipping any path, ask: "Can I name every place PHI travels in this request, and confirm each one is inside my AWS account, under my BAA, encrypted, and logged?" If you can answer that cleanly for every code path — including the error paths and the logging paths — you have the spine of a compliant system. If there is a single hop you cannot account for, that hop is your risk.
The most reliable way to reduce PHI risk is to expose less PHI. HIPAA's minimum-necessary principle and its de-identification provisions are not just compliance language — they are a practical design tool. Every identifier you can remove before the model call is one less identifier that can leak, be logged, or be misused. De-identification done right can even move parts of a workload outside HIPAA's scope entirely.
HIPAA recognizes two formal routes to de-identified data. Safe Harbor removes eighteen specific categories of identifiers (names, geographic subdivisions smaller than a state, dates more specific than year tied to an individual, contact details, record numbers, biometric identifiers, full-face images, and so on), after which the data is no longer PHI. Expert Determination uses a qualified statistician to certify that re-identification risk is very small. Properly de-identified data falls outside HIPAA — but the bar is exacting, re-identification risk is real (especially with rich free-text notes), and getting it wrong is worse than not trying, so treat formal de-identification as a deliberate, validated process, not a regex pass.
Short of that, minimization is the everyday lever and applies to essentially every workload. Send the model the fewest records and fields that answer the question, and mask or tokenize identifiers it does not need to reason about — a summarization or coding task rarely needs the patient's name, address, or full MRN; it needs the clinical substance. Bedrock Guardrails' sensitive-information redaction can perform this stripping inline, and you can tokenize identifiers in your own pre-processing so even the prompt the model sees is reduced. Less PHI in the prompt means less in any log, less in any output, and a smaller surface for everything downstream.
The honest tradeoff: aggressive redaction can strip context the model needs, degrading answers, and over-tokenization can confuse a model left reasoning about opaque placeholders. The right calibration is use-case specific and is exactly what you validate against an evaluation set (Section VII) — confirm the de-identified or minimized inputs still produce acceptable outputs before you lock the policy in. Done well, minimization is a rare win-win: lower risk and a smaller, cheaper prompt.
The failure modes for HIPAA generative AI are well known and almost entirely avoidable. Nearly every one is a variation on a single theme: PHI ending up somewhere it should not, or an unverified output being trusted as fact. Knowing the list turns most of them into a pre-flight checklist.
These are the patterns that turn a promising pilot into an incident report. Read them as hard "do not," not as "be careful."
Encryption and a BAA make a system lawful to operate; governance makes it trustworthy and defensible over time. Three practices separate a healthcare GenAI system you can stand behind from one that merely demos well: an auditable record of what happened, an evaluation harness that proves quality and safety did not regress, and a human accountable for high-stakes output. None is optional in a regulated setting.
These are the operational disciplines a regulator, an auditor, or your own risk committee will ask about — and the ones that let you change the system confidently without wondering whether you just introduced a safety problem.
The HIPAA Security Rule requires audit controls: the ability to record and examine activity in systems that contain PHI. On AWS that means CloudTrail for API and configuration changes and Bedrock model-invocation logging for inference activity, written to an encrypted, access-controlled destination with retention that matches your policy. Wire this from day one — reconstructing who accessed what after an incident is impossible if the logs were never captured. Be deliberate about log contents: enough to audit and investigate, without persisting raw PHI where it does not belong.
Build a representative evaluation set of real (appropriately de-identified) inputs with known-good outputs or clear acceptance criteria, and run it automatically. Bedrock Model Evaluation and RAG evaluation let you score accuracy, faithfulness/grounding, and safety so that when you change a prompt, a model, a redaction policy, or a chunking strategy, you can prove you did not regress — including that you did not start leaking identifiers or producing unsupported clinical claims. In a regulated setting the eval harness is also evidence: it demonstrates that you validate the system rather than trusting it. This is the highest-ROI investment in the whole build.
Match oversight to risk. For clinical and other high-stakes outputs, a qualified human must review and own the result before it informs a decision — the model drafts, the professional decides and is accountable. For lower-stakes administrative tasks, oversight can be lighter (sampling, exception review, confidence thresholds that route uncertain cases to a person). Design the human checkpoint into the workflow rather than bolting it on; "a clinician could review it if they wanted" is not a control, whereas "the draft cannot be finalized without clinician sign-off" is. Contextual-grounding checks and citations from Guardrails and RAG make that human review faster and more reliable by showing the sources behind each claim.
Most healthcare organizations have the clinical and domain expertise but not a team that has shipped a PHI-bearing generative-AI system on AWS before. That gap — not the technology — is the usual reason these projects stall. The reliable path is the same crawl-walk-run staging used for any production GenAI, with HIPAA controls baked into every stage, executed by people who have done it under a BAA before.
A capable AWS partner with healthcare experience does a few things that are hard to get right the first time. They scope the first use case for low blast radius — typically administrative, not diagnostic — so the organization validates the full compliance architecture against manageable risk. They stand up the boundary correctly from the start: BAA confirmed, KMS encryption, PrivateLink, in-region eligible model, Guardrails for PHI redaction, least-privilege IAM, and CloudTrail plus model-invocation logging. They build the eval harness and the human-in-the-loop workflow as first-class deliverables. And they produce the documentation — architecture, data-flow diagrams, control mappings — that your compliance team needs to sign off and your auditors will later ask for.
The funding angle matters in healthcare specifically, where budgets are tight and procurement is slow. Generative-AI build work on AWS is frequently delivered as an AWS-funded proof of concept, and surrounding migration or modernization work can draw on AWS funding programs too. In those structures AWS underwrites the engagement, so the build can be substantially or fully credit-covered — a production-grade, compliance-reviewed system without an open-ended consulting bill. It is the same mechanic that funds GenAI POCs across industries, applied to a regulated workload where the documentation and control rigor are higher.
The crawl-walk-run staging looks like this in a HIPAA context. Crawl (prove value, ~2 weeks): one narrow administrative use case, the full boundary in place even for the pilot, a small de-identified eval set, and an "is this useful and safe?" gate with real users. Walk (harden, ~4–8 weeks): the complete control set, the automated eval harness, the human-review workflow, audit logging verified, and a compliance review against your risk analysis. Run (scale, ongoing): expand to more administrative use cases and, only with controls and oversight proven, carefully toward clinical assistance — re-evaluating models and policies on a regular cadence. The partner accelerates each stage; the staging itself is what keeps a regulated project from collapsing under its own risk.
The difference between a defensible HIPAA generative-AI system and a breach waiting to happen is rarely subtle. This table puts the compliant pattern beside the two shortcuts teams reach for under deadline pressure — a consumer LLM, or Bedrock used without the safeguards. Read it as a gut check before you ship.
| Control | Compliant Bedrock pattern | Consumer / uncontracted LLM | Bedrock without safeguards |
|---|---|---|---|
| BAA in place | Yes — AWS BAA accepted, covers Bedrock | No — unlawful basis for PHI | Maybe — but undermined by the gaps below |
| Training on your data | No — prompts/completions not used to train FMs | Often yes — inputs may train the provider's models | No (Bedrock guarantee) — but other gaps remain |
| Network path for PHI | Private — VPC endpoint (PrivateLink), off public internet | Public internet to a third-party API | Often public Bedrock endpoint, not PrivateLink |
| Encryption (rest + transit) | KMS at rest + TLS in transit, everywhere | Outside your control | Partial — stores or logs left unencrypted |
| PHI/PII redaction | Guardrails redact identifiers in + out | None enforced | None — raw PHI in every prompt and log |
| Audit trail | CloudTrail + model-invocation logging, encrypted | None you control | Missing or logging raw PHI insecurely |
| Human-in-the-loop | Required for clinical / high-stakes output | Undefined | Often output trusted directly |
| Net posture | Defensible, documented, auditable | Reportable breach | Eligible service, non-compliant system |
Situation: A weekend prototype that drafted prior-auth letters from clinical notes impressed leadership, but it was built against a consumer LLM API with PHI flowing straight to a third party — no BAA, no redaction, raw prompts logged to an external tool. Compliance halted any rollout. The team had strong product and clinical knowledge but had never shipped a PHI-bearing system on AWS, and they could not afford an open-ended consulting engagement to figure it out.
What CloudRoute did: Routed within a day to a vetted AWS partner with healthcare and HIPAA delivery experience. The partner confirmed the AWS BAA, rebuilt the workflow on Amazon Bedrock with a HIPAA-eligible model in-region, moved all inference behind a VPC endpoint (PrivateLink), added KMS encryption across the document store, vector store, and logs, and wired Bedrock Guardrails to redact identifiers the model did not need on both input and output. They scoped retrieval to the minimum-necessary records, stood up CloudTrail and model-invocation logging into an encrypted store, built a 75-example de-identified eval set in Bedrock Model Evaluation, and designed a human-in-the-loop step so no prior-auth draft could be finalized without staff sign-off. The work was filed as an AWS-funded GenAI POC, so the build was credit-covered.
Outcome: Compliance signed off after one architecture review on the BAA + no-training + PrivateLink + KMS + Guardrails + audit-logging posture, backed by the data-flow documentation the partner produced. The assistant reached production in 11 weeks as an administrative (non-clinical) drafting tool with mandatory human review. Redaction plus minimum-necessary retrieval kept PHI exposure tight, and the eval harness gave the team confidence to iterate. CloudRoute's commission was paid by the partner from AWS engagement funding — the customer paid $0.
POC → production: 11 weeks · compliance review: 1 meeting · PHI on public internet: none · cost to customer: $0
CloudRoute routes you to a vetted AWS partner with healthcare experience who stands up the compliant architecture (BAA, KMS, PrivateLink, Guardrails, audit logging, human-in-the-loop), produces the documentation your compliance team needs, and ships it — often as an AWS-funded GenAI POC, so you pay $0. No procurement. No open-ended consulting bill.