sentiment analysis on aws · the 2026 build guide

How to do sentiment analysis on AWS (2026).

Turning a flood of customer feedback — reviews, support tickets, survey responses, social posts, call transcripts — into sentiment, intent, and themes you can act on is one of the highest-value, lowest-risk things to build on AWS. This is the full how-to: the central decision between Amazon Comprehend (cheap, fast, structured, fixed labels) and an LLM on Amazon Bedrock (nuance, aspect-based sentiment, custom categories, intent, structured JSON), when each wins, the end-to-end architecture, how to run bulk analysis cheaply with batch inference, how to get aspect-based sentiment and clean JSON out of an LLM, how to push the results to a dashboard, how to measure accuracy, and what production really costs.

core engines
2
Comprehend sentiments
4
bulk cost lever
batch (~50% off)
credits to fund it
up to $100K
TL;DR
  • There are two ways to do sentiment analysis on AWS, and the first decision is which one. Amazon Comprehend is a managed NLP API that returns sentiment (Positive / Negative / Neutral / Mixed), targeted (aspect-based) sentiment, entities, key phrases, and language in one call — cheap, fast, no prompt engineering, fixed output. An LLM on Amazon Bedrock (Claude, Amazon Nova, Llama, Mistral) reads the same text and returns whatever you ask for — nuanced sentiment with a 1–5 score, aspect-based sentiment on your own aspects, intent, custom categories, reason codes, and structured JSON — at higher cost and with prompt and evaluation work.
  • Pick by the job. Use Comprehend when you want cheap, fast, structured sentiment at scale on straightforward text and the standard labels are enough. Use a Bedrock LLM when you need nuance (sarcasm, mixed feelings, domain language), aspect-based sentiment on categories you define, intent or topic alongside sentiment, or a custom JSON schema. Many production systems use both: Comprehend as a cheap first pass, an LLM for the hard or high-value slice.
  • Almost all feedback analysis is bulk and not real-time, so the dominant cost lever is batch: Comprehend has async batch jobs and Bedrock has batch inference (~50% off the on-demand token rate). GenAI and NLP bills add up fast once you process millions of items; CloudRoute routes you to AWS credits (Activate Portfolio up to $100K, Bedrock/GenAI POC $10K–$50K, GenAI Accelerator up to $1M) and a vetted ML partner who builds the pipeline — you pay $0.
the shape of the problem

IWhat sentiment analysis on AWS actually involves

Sentiment analysis sounds like one thing — "is this text positive or negative?" — but in production it is three or four questions stacked on top of each other, and the architecture you build depends on which of them you actually need to answer.

The naïve version is a single label per piece of text: this review is Positive, this ticket is Negative. That is genuinely useful and AWS makes it a single API call. But the moment a team starts acting on the results, the questions multiply. What is the sentiment about? A hotel review can be glowing about the location and scathing about the staff in the same paragraph — one overall label hides that. Why is the customer unhappy? Sentiment without a theme ("billing," "shipping delay," "a specific feature") tells you the temperature but not the cause. What does the customer want? Sentiment is not intent — an angry message can be a cancellation, a refund request, or a bug report, and you route each differently.

So "sentiment analysis" in practice is usually a small family of related tasks: overall sentiment (one label or score per item), aspect-based / targeted sentiment (sentiment per topic or entity within the text), intent or category classification (what the message is about or asking for), and theme extraction (the recurring topics across a whole corpus of feedback). The more of these you need — and the more nuance the text carries — the more the right tool shifts from a fixed-label NLP API toward a language model you can instruct.

On AWS that maps cleanly onto two engines. Amazon Comprehend is the managed NLP service: it returns sentiment, targeted sentiment, entities, key phrases, PII, and language detection through a simple API, with no prompt engineering and a low per-character price. Amazon Bedrock gives you foundation models (Claude, Amazon Nova, Llama, Mistral, and more) that read the text and return whatever you define — a nuanced score, sentiment per custom aspect, intent, reason codes, all as structured JSON. The first decision in any sentiment project on AWS is which of these does the work — and the honest answer is often "both, for different slices."

One framing worth keeping throughout: sentiment analysis is a high-value, low-risk GenAI use case. The input is text you already have, the output is a label or a small JSON object you can check against the source, errors are bounded (a mislabelled review is recoverable, not catastrophic), and you can measure accuracy directly against human labels. That is exactly why it is often the first analytics or GenAI workload a team ships — and why it is a natural fit for a funded proof-of-concept.

the one-sentence version

Sentiment analysis on AWS = pick the engine (Amazon Comprehend for cheap, fast, fixed-label sentiment / targeted sentiment / entities at scale; an LLM on Amazon Bedrock for nuance, aspect-based sentiment on your own categories, intent, and custom JSON) → run it (real-time API for live text, batch for the bulk backlog) → store the structured results → put them on a dashboard → measure accuracy against human labels. The engine choice drives both quality and cost.

the central decision

IIAmazon Comprehend vs an LLM on Bedrock — the core choice

The decision that shapes the whole system is which engine reads the text. Comprehend is a fixed-output NLP API; a Bedrock LLM is an instruction-following model. They overlap in the easy cases and diverge sharply as the task gets nuanced or custom — and the cost and effort profiles are very different.

The honest framing: if the standard labels are enough and the text is straightforward, use Comprehend; when you need nuance, your own categories, intent, or a custom schema, use an LLM. Comprehend is cheaper per item, faster to ship (no prompt to write, no JSON to coax), and entirely deterministic in shape — you always get the same fields back. A Bedrock LLM is more expensive and needs prompt and evaluation work, but it reads context, handles sarcasm and mixed feelings, scores sentiment on a scale, classifies against categories you define in the prompt, extracts intent, and returns exactly the JSON you ask for. The two are not rivals so much as different tools — and a large fraction of production systems use both.

Amazon Comprehend — managed NLP, fixed output, low cost

You call an API with text and get back structured results: sentiment as one of four labels (Positive, Negative, Neutral, Mixed) plus confidence scores; targeted (aspect-based) sentiment that finds entities in the text and assigns sentiment to each; entities, key phrases, PII, and the dominant language. It runs synchronously for single documents (DetectSentiment) and asynchronously for bulk (StartSentimentDetectionJob over a corpus in S3). You can also train custom classification models on your own labelled data when you need categories beyond the built-ins. Pros: very cheap per unit, fast, no prompt engineering, deterministic output shape, managed and scalable, strong multilingual coverage. Cons: the overall sentiment is four fixed buckets (no 1–5 nuance out of the box); it can miss sarcasm, irony, and domain-specific tone; you cannot freely define arbitrary categories or ask for reason codes without training a custom model. Choose it when you want cheap, fast, structured sentiment at scale, the four labels (or a custom classifier) are enough, and the text is fairly direct — product reviews, survey responses, support-ticket triage, social monitoring at volume.

An LLM on Bedrock — nuance, custom categories, intent, JSON

You send the text to a foundation model on Amazon Bedrock with a prompt that says exactly what to extract, and the model returns it as structured JSON. Because it follows instructions, one model can return overall sentiment and a 1–5 intensity score and aspect-based sentiment on your aspects ("battery life," "checkout flow," "onboarding") and the customer's intent ("cancel," "upsell opportunity," "bug report") and a one-line reason — all in a single call, in a schema you define. It reads context, so it handles sarcasm ("oh great, another outage"), mixed sentiment, negation, and domain language far better than a fixed-label model. Pros: nuance and reasoning; aspect-based sentiment on arbitrary, prompt-defined aspects; intent and custom categories with no training data; multiple signals per call; output in any JSON shape; trivially adjustable by editing the prompt. Cons: higher cost per item; needs prompt engineering and structured-output handling; non-determinism (set low temperature and validate the JSON); needs evaluation to trust it. Choose it when you need nuance, aspect-based sentiment on your own categories, intent alongside sentiment, custom labels, reason codes, or a specific JSON schema — or when the text is messy, sarcastic, or domain-heavy.

The pragmatic answer — often both (cascade)

The two engines compose well, and the most cost-effective production systems usually run a cascade: Comprehend (cheap, fast) does the first pass over everything for overall sentiment, language, entities, and PII redaction; an LLM on Bedrock then handles the slice that needs more — the items Comprehend flags Mixed or low-confidence, the high-value accounts, or the cases where you need aspect-based sentiment and intent. You pay the cheap NLP price on the bulk and the LLM price only where it earns its keep. The same pattern works in reverse for triage: Comprehend's sentiment routes a ticket, and only negative-and-urgent tickets get the LLM's richer intent/aspect extraction. Section III shows where each lands in the architecture.

the pragmatic rule

Default to Amazon Comprehend when the four standard labels (or a custom classifier) cover the job and the text is direct — it is cheaper, faster, and needs no prompting. Switch to a Bedrock LLM when you need nuance (sarcasm, mixed feelings, domain tone), aspect-based sentiment on categories you define, intent alongside sentiment, custom labels, reason codes, or a specific JSON schema. For most real systems the answer is a cascade: Comprehend over everything, the LLM on the hard or high-value slice.

end to end

IIIThe reference architecture on AWS, stage by stage

Whether you analyse one message live or a hundred million overnight, the system runs the same logical stages. Knowing each one is what lets you debug a pipeline that returns wrong labels or vague themes, because nearly every quality and cost problem traces back to a specific stage.

It helps to see the whole shape first. Feedback arrives from many sources, lands somewhere durable, gets cleaned and (where needed) PII-redacted, is analysed by Comprehend and/or a Bedrock LLM, and the structured results are stored and visualised. The split between a real-time path (a user is waiting — a live chat sentiment, an incoming ticket to route) and a batch path (a backlog with a deadline, not a person, waiting) is the most important architectural decision after the engine choice, because it determines both latency and cost.

1. Ingest — collect feedback from every source

Feedback is scattered: app-store and product reviews, support tickets (Zendesk, Salesforce), survey tools, social and community posts, NPS responses, and call/chat transcripts (often via Amazon Transcribe or Amazon Connect Contact Lens, which does its own real-time sentiment). The job here is to land all of it in one durable place — typically Amazon S3 for the bulk corpus and a stream (Kinesis or EventBridge) for live events — with a stable record id per item so results can be reconciled back later. Get the id and the source metadata right here; everything downstream keys off them.

2. Prepare — clean, normalise, detect language, redact PII

Raw feedback is noisy: HTML, emoji, signatures, boilerplate, duplicate quoted threads. Light cleaning improves accuracy for both engines. Two prepare steps are worth calling out on AWS. Language detection (Comprehend DetectDominantLanguage) lets you route non-English text appropriately — Comprehend supports many languages directly, and LLMs are strongly multilingual, but you want to know what you are dealing with. PII redaction matters when feedback contains names, emails, or account numbers: Comprehend can detect and redact PII before the text is stored or sent to a model, which is often a compliance requirement for customer data. This stage is cheap and high-leverage; skipping it is a common cause of noisy results.

3. Analyse — Comprehend and/or a Bedrock LLM

This is the engine stage from section II. For the bulk path, that is a Comprehend async sentiment job over the S3 corpus, or a Bedrock batch inference job (JSONL in, JSONL out, ~50% off), or both in a cascade. For the real-time path, it is a synchronous Comprehend DetectSentiment call or a Bedrock Converse call behind an API. The output of this stage is the structured signal per item — sentiment, score, aspects, intent, reason — that the rest of the system consumes. Section IV covers how to get clean, aspect-based JSON out of an LLM, and section VI the batch mechanics.

4. Store — land structured results somewhere queryable

The per-item results — record id, source, overall sentiment, score, per-aspect sentiment, intent, theme, timestamp — need to live somewhere you can aggregate and query. Common choices: Amazon S3 + Amazon Athena (cheap, serverless SQL over the result files, great for analytics), Amazon DynamoDB (low-latency lookups for a live app), or a warehouse like Amazon Redshift for heavy BI. Storing the results as flat, typed columns (one row per item, aspects exploded or kept as a nested field) is what makes the dashboard in the next stage trivial.

5. Visualise — dashboards and alerts

The point of all this is a view a human acts on: sentiment trend over time, breakdown by product/aspect/region, the themes driving negative sentiment this week, and alerts when a metric moves. Amazon QuickSight is the native BI choice (it reads Athena, Redshift, and S3 directly and can even surface NLQ/forecasting), but the results are plain data, so any BI tool works. Add alerting (an EventBridge rule or a QuickSight threshold) so a spike in negative sentiment about a specific aspect pages someone rather than waiting to be noticed in a weekly review. A pipeline whose output nobody looks at delivers no value — the dashboard is the deliverable.

the sentiment pipeline stages mapped to AWS services · representative as of 2026
StageWhat it doesReal-time pathBatch / bulk path
1. IngestCollect feedback from every sourceKinesis / EventBridge / APILand the corpus in Amazon S3
2. PrepareClean, detect language, redact PIILambda + Comprehend (language/PII)Glue / Lambda + Comprehend
3. AnalyseSentiment / aspect / intentComprehend DetectSentiment · Bedrock ConverseComprehend async job · Bedrock batch (~50% off)
4. StoreLand structured resultsDynamoDB (low-latency)S3 + Athena / Redshift
5. VisualiseDashboards + alertsLive widget + EventBridge alertAmazon QuickSight (trends, breakdowns)
The real-time path serves a user who is waiting (route this ticket, score this live chat); the batch path serves a deadline (analyse the whole review backlog overnight). Most systems run both: real-time for incoming items, a nightly batch over the accumulated corpus. The engine in stage 3 can be Comprehend, a Bedrock LLM, or a cascade of the two.
how to ask

IVGetting aspect-based sentiment and clean JSON out of an LLM

The whole reason to reach for a Bedrock LLM over Comprehend is the richer, custom output — nuanced scores, sentiment per aspect you define, intent, reason codes. Getting that reliably is a prompt-and-validation problem, and a handful of patterns do almost all of the work.

The through-line of every good sentiment prompt is define the exact schema, constrain the labels, and force the model to ground its answer in the text. Unlike Comprehend, the LLM will happily return prose, vary its format, or invent a category if you let it — so the prompt's job is to pin the output shape, enumerate the allowed values, and stop embellishment. Set a low temperature for consistency and always validate the returned JSON.

  • Fix the output schema explicitly — Ask for JSON only and describe every field and its type: e.g. { "overall": "positive|negative|neutral|mixed", "score": 1-5, "aspects": [{ "aspect": "...", "sentiment": "...", "evidence": "..." }], "intent": "...", "reason": "..." }. A fixed schema makes outputs comparable, machine-parseable, and easy to store as columns. Use the model's tool-use / structured-output mode where available to enforce it.
  • Enumerate the allowed labels — Do not let the model freelance categories. List the exact sentiment values and, for intent or theme, the exact closed set of allowed categories ("billing, shipping, product_quality, account, other"). An open-ended "classify the intent" produces inconsistent, unaggregatable labels; a closed list produces clean facets you can chart.
  • Define the aspects (for aspect-based sentiment) — For aspect-based sentiment, tell the model which aspects to score — either a fixed list you care about ("battery, screen, price, support") or "extract the aspects mentioned and score each." A fixed list gives consistent columns across items; open extraction discovers aspects you did not anticipate. Many systems do a discovery pass to find the aspects, then lock the list for production.
  • Ground each judgement in evidence — Ask the model to attach the supporting phrase or sentence to each aspect or to the overall sentiment ("evidence": the exact span). This both improves faithfulness and lets a human verify why an item was labelled the way it was — invaluable when a stakeholder disputes a result.
  • Ask for intent and reason alongside sentiment — The LLM's edge is doing several jobs in one call. In the same schema, capture the customer's intent (what they want) and a one-line reason for the sentiment. This is what turns "47% negative this week" into "47% negative, driven by checkout failures and slow support replies" — the actionable version.
  • Handle edge cases honestly — Tell the model what to do with empty, off-topic, spam, or non-text inputs — return "neutral" with intent "none" or a flag, not a fabricated sentiment. This matters most in batch, where a bad input should produce an honest null rather than noise that pollutes the aggregate.
  • Few-shot the tricky cases — A handful of labelled examples in the prompt — especially sarcasm, mixed sentiment, and your domain's edge cases — sharply raises consistency on exactly the items where the LLM beats Comprehend. Draw them from your evaluation set (section V) so they reflect real, hard inputs.
the highest-leverage move

Pin the JSON schema and enumerate the labels. "Return only JSON matching this schema; sentiment must be one of [positive, negative, neutral, mixed]; intent must be one of [billing, shipping, product, account, other]; score each of these aspects: [...]; attach the supporting phrase as evidence; if the text is empty or off-topic, return neutral with intent none." Pair it with low temperature, the model's structured-output mode, and JSON validation, and most format and consistency problems disappear before you ever change models.

measuring it

VEvaluating accuracy — because "it looks right" is not a metric

Sentiment is a classification task, which is good news: accuracy is directly measurable. Build a labelled test set once and you can compare Comprehend against an LLM, compare models and prompts, and catch regressions — instead of trusting a vibe.

Assemble a golden set first: a few hundred to a few thousand real items, each hand-labelled by humans with the ground-truth overall sentiment and (if you do aspect-based) the per-aspect labels and intent. Label a few hundred per important segment (product line, language, channel) so you can see where each engine is weak. Then score every candidate — Comprehend, each LLM, each prompt — on the same set and read standard classification metrics, not a single accuracy number that hides the failure modes.

  • Per-class precision, recall, F1 — Overall accuracy lies when classes are imbalanced (most reviews are positive). Read precision and recall per class — especially for Negative, the class you most need to catch. A model that is 92% accurate but misses half your negative feedback is failing at the actual job. The confusion matrix shows exactly which classes get confused (Neutral vs Mixed is the classic).
  • Agreement with human labels — Treat the human labels as ground truth and measure how often each engine agrees. Sentiment is partly subjective, so also sanity-check human–human agreement on a sample — if two annotators disagree 15% of the time, no model will exceed that ceiling, and you should not chase it.
  • Aspect and intent accuracy separately — When you do aspect-based sentiment and intent, score them as their own tasks: did the model find the right aspects, assign the right sentiment to each, and pick the right intent from the allowed set? An item can have correct overall sentiment but wrong aspects, and you want to know.
  • Segment the metrics — Break accuracy down by language, channel, length, and product area. Sentiment models are reliably weaker on short text (a three-word review), on non-English, and on sarcasm-heavy channels (social). Segmenting tells you exactly where to send the cascade's LLM tier and where Comprehend is fine.

How to run it on AWS

For the LLM side, Amazon Bedrock model evaluation lets you score and compare models on your dataset (programmatic metrics for classification, or an LLM-as-a-judge for harder qualitative checks), so you can pick the cheapest model that clears your bar objectively. For Comprehend custom classifiers, the training job emits its own precision/recall/F1 on a held-out split. Either way the discipline is identical: a fixed golden set, automated scoring on every change (new model, new prompt, new prep step), and a number that moves when you turn a knob.

Two non-negotiables for production. Log every analysis — record id, input (redacted), engine/model, prompt version, and full output — so any label can be reproduced and audited, and so you can re-score historical data when you change models. And keep a human-review loop: route a sample (and all low-confidence or high-stakes items) to humans, both to measure live accuracy and to feed corrected labels back into the golden set and the few-shot examples. Models drift as your product and your customers' language change; the review loop is how you notice.

doing it at scale

VIAnalysing millions of items — batch, and the bulk pattern

Scoring one live message is an API call. Scoring a backlog of ten million reviews, or re-scoring a corpus when you switch models, is a data-engineering job — and the right tools for it on AWS halve the bill for work nobody is waiting on in real time.

A huge share of sentiment work is not interactive: digesting an archive of reviews, scoring a quarter of survey responses, condensing a year of support tickets, or back-filling sentiment across an entire feedback history. Nobody is staring at a spinner — you just need the whole job done by a deadline. Both engines have a batch mode for exactly this. Amazon Comprehend async jobs (StartSentimentDetectionJob, StartTargetedSentimentDetectionJob) read a corpus from S3 and write results back to S3 in one managed job. Amazon Bedrock batch inference does the same for the LLM path: write your requests as JSONL to S3, submit one asynchronous job (CreateModelInvocationJob), and Bedrock processes them in the background and writes one structured result per input back to S3 — at roughly 50% of the on-demand token rate. For corpus-scale sentiment, batch is the single easiest cost win.

The bulk pattern, end to end: land the corpus in S3, run a cheap Comprehend async pass over everything for overall sentiment, language, and PII; select the slice that needs more (Mixed/low-confidence, high-value accounts, anything needing aspect-based sentiment or intent) and write those as JSONL requests; run a Bedrock batch job on a right-sized small model for that slice; reconcile both result sets back to your items by record id; and load the combined output into S3/Athena (or your warehouse) for the dashboard. Because each item is independent, both batch modes parallelise perfectly. Keep the real-time path (a user pastes text and waits, a ticket needs instant routing) on the synchronous APIs — often with prompt caching on the LLM if a long instruction or shared rubric repeats across calls — and send the bulk backlog to batch. See amazon-bedrock-batch-inference for the full mechanics and the cost math.

The two cost levers multiply here, which is the whole point. Use the cheap engine where it suffices (Comprehend, or a small Bedrock model) and run it on batch (~50% off). For a large corpus the combined effect over a frontier-model-on-demand baseline is routinely an order of magnitude or more. The cascade is what makes this affordable at scale: the expensive LLM only ever touches the fraction of items that genuinely need it, and even that fraction runs on batch on a small model.

the bulk rule of thumb

Real-time sentiment (a human or a router is waiting) → synchronous Comprehend / Bedrock Converse, smallest adequate engine, prompt caching if the rubric repeats. Bulk or corpus sentiment (a deadline, not a person, is waiting) → Comprehend async jobs and/or Bedrock batch inference (~50% off), in a cascade. The two cost levers — cheap engine × batch — multiply, and at corpus scale that is the difference between a hobby-budget job and an enterprise bill.

what it costs

VIIThe sentiment cost stack on AWS — where the money goes

A sentiment bill has a few line items, and which one dominates depends entirely on the engine. Comprehend is priced per unit of text; the LLM path is priced per token. Here is the full stack, the lever on each, and a worked example so you can reproduce the math for your own job.

The figures below are representative as of 2026 to show the shape of the bill, not a quote — always check the AWS pricing page for current rates. The headline: Comprehend bills per 100-character "unit" (a typical short review is one or two units), with a small per-unit price that drops for async/bulk; the LLM path bills per input + output token, so cost scales with how long each item is and how much JSON you ask back. For straightforward sentiment at scale, Comprehend is usually the cheaper engine by a wide margin; the LLM earns its higher price only on the items where its nuance and custom output matter — which is exactly what the cascade exploits.

A worked example (bulk feedback analysis)

The job: analyse 5,000,000 reviews/month, each averaging ~300 characters (≈ 3 Comprehend units, ≈ 80 input tokens) and, for the LLM path, producing a ~120-token JSON result.

Comprehend, everything. At a representative ~$0.0001 per unit for async sentiment, 5,000,000 × 3 units × $0.0001 = ≈ $1,500/month for overall sentiment across the entire corpus — cheap, fixed labels, no prompting.

LLM on everything (small model), on batch. 5,000,000 items × 80 input + 120 output tokens = 400M input + 600M output tokens. At a small Nova Lite-class rate (~$0.06 / 1M input, ~$0.24 / 1M output), on-demand ≈ (400 × $0.06) + (600 × $0.24) = $24 + $144 = ≈ $168/month; on batch (~50% off) ≈ $84/month — and you get nuance, aspects, and intent, not just a label. On a frontier Sonnet-class model (~$3 / $15 per 1M) the same job is (400 × $3) + (600 × $15) = $1,200 + $9,000 = ~$10,200/month on-demand — ~60× the small-model cost for a task that rarely needs it.

The cascade — the realistic shape. Run Comprehend over all 5M (~$1,500) for overall sentiment, then send only the ~15% that is Mixed/low-confidence or high-value (~750K items) to a small LLM on batch (~$13). Total ≈ $1,513/month for cheap labels on everything plus rich aspect/intent analysis where it counts — versus $10,200 to push everything through a frontier model. The arithmetic teaches the lesson twice: match the engine to the item, then halve it with batch.

sentiment cost stack on aws · representative shape as of 2026 — check the AWS pricing page for current rates
Cost lineWhen you payDriverMain lever to control it
Comprehend (built-in APIs)Per 100-char unit analysedVolume of text × per-unit rateAsync/bulk tier; redact and dedupe first; don't re-score unchanged text
Comprehend custom classifierTraining + inferenceTraining jobs + endpoint/throughputTrain once; use async over endpoints for bulk; right-size throughput
LLM — input tokensPer item (LLM path)Item length × model input rateCheapest adequate model; batch (~50% off); cascade only the hard slice
LLM — output tokensPer item (LLM path)JSON length × model output rateKeep the schema tight; ask only for fields you use
Storage / query / BIOngoingResult volume, Athena scans, QuickSight usersColumnar/partitioned results; per-session BI where it fits
Comprehend is priced per text unit; the LLM path per token. For plain sentiment at scale Comprehend is usually far cheaper; the LLM earns its higher price on nuance, aspects, and intent. The cascade — Comprehend on everything, a small LLM on the hard slice, both on batch — is what keeps corpus-scale analysis affordable. Confirm current rates on the AWS Comprehend and Bedrock pricing pages.
how it becomes $0

VIIIHow AWS credits make the whole build $0

Everything above shrinks a sentiment bill you pay AWS directly. For most startups and many companies the more relevant move is to not pay it at all during the build — because AWS will frequently fund this kind of workload with credits, and your Comprehend and Bedrock spend draws those credits down before it touches your card.

AWS runs several credit programs aimed at putting AI and GenAI workloads on AWS, and a sentiment pipeline is squarely credit-eligible: Comprehend (real-time and async), Bedrock inference (on-demand and batch), Transcribe for call/chat audio, and the supporting services (S3, Athena, QuickSight, the orchestration) all draw down credits. The relevant pools: AWS Activate (general startup credits, commonly up to $100K for institutionally-funded startups); a dedicated Bedrock / Generative-AI POC pool ($10K–$50K) aimed at proving out a GenAI use case — a one-time backfill of an entire feedback corpus is exactly the kind of bounded, high-volume job it is meant to absorb; and the competitive Generative AI Accelerator (credit awards up to $1M for a small cohort of AI-first startups). Credits apply automatically against your AWS bill until exhausted.

The practical mechanic is that most of these pools are partner-filed — requested through the AWS Partner Network (the ACE program), not a public self-serve form. That is why teams route through an AWS partner rather than applying alone, and it is the gap CloudRoute fills. CloudRoute matches you to the right credit pool for your stage and to a vetted AWS DevOps/ML partner who both files the credit application and helps build the pipeline itself — the ingestion and S3 layout, the prepare/PII-redaction step, the Comprehend-vs-LLM engine split, the aspect-based prompts and structured-output handling, the batch jobs and reconciliation, the evaluation harness against a human-labelled golden set, and the QuickSight dashboard the business actually reads. The customer pays $0: AWS funds the credit pool, AWS pays the partner through engagement-funding programs, and the partner pays CloudRoute a routing commission. You never see an invoice.

There is a clean synergy worth naming. Sentiment and feedback analysis is one of the most common first analytics workloads a team ships — high-value, low-risk, easy to scope — and a one-time corpus backfill (score the whole review and ticket history, stand up the dashboard) is precisely the bounded, high-volume job a Bedrock POC credit pool is designed to fund: prove the use case, analyse the corpus, run the accuracy evals, all funded. A team that combines the cascade (Comprehend + a small LLM) with batch and a credit pool can analyse an enormous backlog and stand up the production pipeline while paying nothing out of pocket. Related: see the cross-cluster pages on AWS credits for generative-AI startups and Bedrock POC funding for the full credit mechanics.

the central decision, side by side

Amazon Comprehend vs an LLM on Bedrock — which engine to build on

This is the comparison that decides your architecture. Read it as "default to Comprehend when standard labels at low cost are enough; move to a Bedrock LLM for nuance, aspect-based sentiment on your own categories, intent, and custom JSON; cascade the two for the best cost-per-quality." Figures and limits are representative 2026 illustrations, not quotes.

DimensionAmazon ComprehendLLM on Amazon BedrockCascade (both)
What it returnsFixed: sentiment (4 labels), targeted sentiment, entities, key phrases, PII, languageAnything you define: nuanced score, aspects, intent, categories, reason — as JSONCheap labels on all, rich JSON on the hard slice
Sentiment granularity4 buckets (Pos/Neg/Neutral/Mixed) + confidence1–5 score, custom scales, per-aspect sentimentComprehend bucket + LLM score where needed
Aspect-based sentimentTargeted sentiment on detected entitiesOn any aspects you define in the promptLLM tier, on your aspects
Custom categories / intentTrain a custom classifier on labelled dataDefine in the prompt, no training dataLLM tier for intent + custom labels
Nuance (sarcasm, mixed, domain)Limited — can miss itStrong — reads contextRoute nuanced items to the LLM
Effort to shipLowest — call the APIPrompt + structured-output + eval workModerate — two engines wired together
Relative cost / unitLowest (per text unit)Higher (per token; model-dependent)Low overall — LLM only on the slice
Best forCheap, fast, structured sentiment at scaleNuance, aspects, intent, custom JSONMost production systems at scale
Both run on AWS as managed services and both have a batch mode (Comprehend async jobs; Bedrock batch inference, ~50% off). A common production shape is the cascade: Comprehend over the whole corpus for overall sentiment, language, and PII, with a small Bedrock LLM on the Mixed/low-confidence or high-value slice for aspect-based sentiment and intent. Confirm current rates on the AWS Comprehend and Bedrock pricing pages.
before you analyse a single feedback corpus
Get AWS credits that cover Comprehend and Bedrock — and a partner to build the pipeline (you pay $0)
Get matched in 24h →
a recent match

A 9M-review + ticket sentiment backfill — run on $0 — anonymized

inquiry · series-a consumer marketplace, voice-of-customer, Berlin
Series-A consumer marketplace, 30 people, ~9M historical reviews + support tickets across 6 languages to analyse for a voice-of-customer program, EU data-residency requirement

Situation: The team wanted a living voice-of-customer dashboard: overall sentiment trend, sentiment per product aspect (delivery, pricing, app, support), and the intents driving negative feedback — across ~9M historical reviews and tickets in six languages, then continuously on new feedback. A first in-house attempt looped on-demand calls on a frontier model over every item: it was slow, it cost into the high four figures per month, it returned inconsistent free-text labels nobody could aggregate, and it had no accuracy measurement, so leadership did not trust the numbers. The two data engineers who could fix it were committed to core product, and there was no runway for a one-time backfill.

What CloudRoute did: CloudRoute matched them in under 24 hours to an EU-region AWS partner with a document-AI and Bedrock track record. The partner built the pipeline in eu-central-1 as a <strong>cascade</strong>: feedback landed in <strong>Amazon S3</strong> with stable record ids; a prepare step ran <strong>Amazon Comprehend</strong> for language detection and <strong>PII redaction</strong>; a <strong>Comprehend async sentiment job</strong> scored overall sentiment across all ~9M items cheaply; only the Mixed/low-confidence and high-value slice was sent to a <strong>right-sized small Bedrock model (Nova Lite-class)</strong> with an <strong>aspect-based, enumerated-label JSON prompt</strong> (overall + 1–5 score + per-aspect sentiment on the four business aspects + intent + evidence span), the whole thing run on <strong>Bedrock batch inference</strong> (~50% off) and reconciled by record id; results landed in <strong>S3 + Athena</strong> and surfaced in an <strong>Amazon QuickSight</strong> dashboard with a negative-sentiment-spike alert. A 1,500-item human-labelled golden set scored per-class precision/recall via <strong>Bedrock model evaluation</strong>, with a human-review loop on low-confidence items. The partner filed a Bedrock POC credit application plus an Activate application to fund the backfill and early usage.

Outcome: Consistent, structured sentiment + aspect + intent for the full ~9M-item corpus across all six languages, produced via the cascade on right-sized models and batch for a fraction of the original projection — and the entire cost absorbed by the approved credits, so the team paid $0 to stand up voice-of-customer and ship the dashboard. Negative-class recall cleared the team's bar on the golden set, so leadership trusted the trend lines; the "inconsistent labels" problem was gone because intents came from a closed set. The same pipeline now scores new feedback continuously and pages the team on aspect-level spikes. CloudRoute's commission was paid by the partner from AWS engagement funding, not by the customer.

corpus: ~9M reviews + tickets, 6 languages · stack: Comprehend (sentiment/PII) + small-LLM cascade + aspect-based JSON + batch (~50% off) + Bedrock eval + QuickSight · credits secured: POC + Activate · out-of-pocket: $0

faq

Common questions

How do you do sentiment analysis on AWS?
Pick the engine, then build a pipeline around it. The two engines are Amazon Comprehend (a managed NLP API that returns sentiment as Positive/Negative/Neutral/Mixed, plus targeted/aspect-based sentiment, entities, key phrases, PII, and language — cheap, fast, fixed output) and a foundation model on Amazon Bedrock (Claude, Amazon Nova, Llama, Mistral — which returns a nuanced score, sentiment per aspect you define, intent, and custom categories as structured JSON). Around the chosen engine you build: ingest feedback to Amazon S3, prepare it (clean, detect language, redact PII with Comprehend), analyse (Comprehend async job and/or Bedrock batch inference for the bulk; synchronous APIs for real-time), store the structured results (S3 + Athena, DynamoDB, or Redshift), and visualise them in Amazon QuickSight. Most production systems cascade the two: Comprehend over everything, the LLM on the hard or high-value slice.
Should I use Amazon Comprehend or a Bedrock LLM for sentiment analysis?
Use Amazon Comprehend when you want cheap, fast, structured sentiment at scale, the four standard labels (or a custom classifier you train) are enough, and the text is fairly direct — product reviews, survey responses, ticket triage, social monitoring. Use a Bedrock LLM when you need nuance (sarcasm, mixed feelings, domain-specific tone), aspect-based sentiment on categories you define yourself, intent alongside sentiment, custom labels or reason codes, or a specific JSON schema — or when the text is messy. Comprehend is cheaper per item and needs no prompting; the LLM costs more and needs prompt and evaluation work but is far more flexible and nuanced. In practice many systems use both in a cascade: Comprehend does a cheap first pass over everything, and the LLM handles the items Comprehend flags Mixed or low-confidence, plus high-value accounts and anything needing aspects or intent.
How do I do aspect-based sentiment analysis on AWS?
Two ways. Amazon Comprehend has targeted sentiment, which detects entities in the text and assigns sentiment to each automatically. For aspect-based sentiment on aspects you define ("battery life," "checkout flow," "onboarding," "delivery"), use a Bedrock LLM with a prompt that lists the aspects and asks for sentiment per aspect plus a supporting evidence span, returned as JSON. The LLM approach lets you fix the exact aspects you care about (consistent columns across items) or ask it to extract the aspects mentioned (discovery); many teams run a discovery pass to find the common aspects, then lock the list for production. Enumerate the allowed sentiment values, set a low temperature, use the model's structured-output mode, and validate the JSON so the per-aspect results aggregate cleanly on a dashboard.
How do I analyse millions of reviews or tickets cheaply on AWS?
Use batch, because almost no feedback analysis is real-time. Amazon Comprehend has async jobs (StartSentimentDetectionJob / StartTargetedSentimentDetectionJob) that read a corpus from S3 and write results back to S3 in one managed job. Amazon Bedrock has batch inference: write your requests as JSONL to S3, submit one asynchronous job, and Bedrock processes them in the background at roughly 50% of the on-demand token rate. The cheapest pattern is a cascade: run Comprehend over the whole corpus for overall sentiment (very cheap per text unit), then send only the slice that needs more — Mixed/low-confidence, high-value, or anything needing aspect-based sentiment and intent — to a small Bedrock model on batch. Reconcile everything by record id. Parse/prepare once and never re-score unchanged items. See the Bedrock batch inference page for the mechanics.
Can I get sentiment as a 1–5 score or custom labels instead of just positive/negative?
Amazon Comprehend's built-in sentiment is four fixed labels (Positive, Negative, Neutral, Mixed) with confidence scores, not a numeric scale; to get custom categories from Comprehend you train a custom classifier on labelled data. A Bedrock LLM gives you a numeric score (1–5, 0–10, or any scale), custom labels, and multiple signals in one call with no training data — you just describe the schema and the allowed values in the prompt and return JSON. So if you specifically need a graded intensity score, your own category taxonomy, or sentiment combined with intent and reason codes, the LLM path is the straightforward choice; if four buckets are enough, Comprehend is cheaper and simpler.
How do I get clean, structured JSON out of an LLM for sentiment?
Pin the schema and constrain the model. Ask for JSON only and describe every field and type (overall sentiment, score, aspects array with aspect/sentiment/evidence, intent, reason); enumerate the exact allowed values for sentiment and for any closed category set so the model cannot freelance labels; use the model's tool-use / structured-output mode on Bedrock to enforce the shape; set a low temperature for consistency; and validate the returned JSON, retrying or flagging on a parse failure. Add a few labelled examples (especially sarcasm and mixed-sentiment cases) and tell the model what to do with empty or off-topic input (return neutral / a flag, not a fabrication). With a fixed schema, enumerated labels, low temperature, and validation, the output becomes reliable enough to store as typed columns and chart directly.
How accurate is sentiment analysis on AWS, and how do I measure it?
Accuracy depends on the engine, the text, and the language, so measure it on your own data rather than trusting a headline number. Build a golden set of a few hundred to a few thousand real items hand-labelled by humans, then score each candidate (Comprehend, each LLM, each prompt) on per-class precision, recall, and F1 — not overall accuracy, which hides failures on the minority Negative class you most need to catch — plus a confusion matrix, and segment the metrics by language, channel, and length where models are reliably weaker. Amazon Bedrock model evaluation can score and compare LLMs on your dataset; Comprehend custom classifiers emit precision/recall/F1 on a held-out split. Log every analysis for audit and re-scoring, and keep a human-review loop on low-confidence and high-stakes items to track live accuracy and feed corrections back into the golden set.
What does sentiment analysis on AWS actually cost?
It depends on the engine. Amazon Comprehend bills per 100-character unit (a short review is one or two units) at a small per-unit rate that drops for async/bulk — so plain sentiment over a large corpus is cheap. A Bedrock LLM bills per input and output token, so cost scales with item length and how much JSON you request; a small model is inexpensive and a frontier model can be ~60× more for the same job. As a representative 2026 illustration, scoring 5M short reviews/month is roughly $1,500 on Comprehend for overall sentiment, ~$84/month on a small LLM on batch for full aspect/intent JSON, or ~$10,200/month on a frontier model on-demand — which is why the cascade (Comprehend on everything, a small LLM on the ~15% hard slice, both on batch) lands around $1,500 total while still giving rich analysis where it matters. Figures are representative as of 2026 — check the AWS Comprehend and Bedrock pricing pages for current rates.
Can AWS credits cover the cost of building a sentiment analysis pipeline?
Yes — a sentiment pipeline is squarely credit-eligible: Amazon Comprehend, Bedrock inference (on-demand and batch), Transcribe for call/chat audio, and supporting services (S3, Athena, QuickSight, orchestration) all draw down credits, which apply automatically against your AWS bill until exhausted. The relevant pools are AWS Activate (up to $100K for institutionally-funded startups), a dedicated Bedrock/GenAI POC pool ($10K–$50K) — well suited to absorbing a one-time backfill of an entire feedback corpus — and the GenAI Accelerator (up to $1M for selected startups). These are largely partner-filed via the AWS Partner Network, which is why teams route through a partner. CloudRoute matches you to the right pool and a vetted AWS partner who files the application and builds the pipeline — customer pays $0, AWS funds it.

Build sentiment analysis on AWS — funded by AWS credits

CloudRoute routes you to a vetted AWS GenAI/ML partner who designs and ships the pipeline — ingestion, PII redaction, the Comprehend-vs-LLM engine split (or a cascade), aspect-based sentiment and intent with clean JSON, batch for bulk corpora, accuracy evaluation against a golden set, and the QuickSight dashboard. AWS credits fund the build and the inference. You pay $0.

matched within< 24h
credits to fund itup to $100K
cost to you$0
Sentiment Analysis on AWS: Comprehend vs Bedrock (2026) · CloudRoute