Should You Put AI in Your Product?
“To AI, or not to AI?” is the product manager’s version of Hamlet. (Except no one dies; worst case, you ship a feature that auto‑composes emails like a Victorian novelist.) Jokes aside, deciding when to use AI—and when not—is now a core product strategy question.
This post gives you a clear, evidence‑based framework: where AI shines, where it stumbles, pros and cons you can show your execs, and a simple checklist you can run before you add “AI‑powered” to your roadmap.
First, let’s align on what “using AI” actually means
“Using AI” spans a spectrum:
Rules & heuristics (if‑this‑then‑that logic, regex, scoring formulas).
Traditional ML (ranking, forecasting, anomaly detection).
Generative AI (LLMs, image/audio models) for summarization, code & content drafting, chat interfaces.
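To make the spectrum concrete, here’s a minimal sketch of the same task (routing a support ticket) at each rung. Only the rules version actually runs; the ML and generative lines are commented placeholders, and `ticket_classifier.joblib` and `llm_complete` are hypothetical stand‑ins for whatever model or provider you’d actually use.

```python
import re

TICKET = "Hi, I was double-charged for my subscription last month. Please refund me."

# 1) Rules & heuristics: deterministic, auditable, cheap -- but brittle on edge cases.
def route_by_rules(text: str) -> str:
    if re.search(r"\b(refund|charge|invoice|billing)\b", text, re.IGNORECASE):
        return "billing"
    if re.search(r"\b(password|login|2fa)\b", text, re.IGNORECASE):
        return "account"
    return "general"

# 2) Traditional ML: a trained classifier scores the text (hypothetical artifact).
# model = joblib.load("ticket_classifier.joblib")
# label = model.predict([TICKET])[0]

# 3) Generative AI: an LLM handles fuzzy, long-tail phrasing (hypothetical helper).
# label = llm_complete(f"Classify this ticket as billing, account, or general:\n{TICKET}")

print(route_by_rules(TICKET))  # -> "billing"
```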
Adoption stats can look wildly different depending on which of these you’re measuring. For example, McKinsey’s global survey reports that by May 2024, 65% of organizations regularly used generative AI; by March 2025 they reported 71% using genAI in at least one function. (McKinsey & Company)
But U.S. government data that asks a narrower question—“Are you using AI to produce goods or services?”—shows a much lower level: roughly 9–10% overall as of late summer 2025, with big firms wobbling a bit in recent months. The U.S. Fed notes that surveys vary widely: firm adoption ranges from ~5% to ~40%, while worker usage is often 20–40%, depending on the occupation. Translation: lots of experimentation and assistant‑style use, less full‑blown production integration (so far). (Census.gov)
Takeaway: Adoption stats depend on definitions. Use the right denominator when making your business case.
When AI belongs in your product
1) Your problem is fuzzy, high‑variance, and data‑rich
If the task needs judgment over clean rules—classifying messy text, prioritizing tickets, summarizing threads, recognizing patterns—AI is built for that. LLMs in particular excel when there’s a lot of unstructured data and “good enough” is good enough (more on risk tolerance below).
2) You’re augmenting humans, not replacing them
The strongest evidence of ROI so far: AI assisting people in real workflows. A large‑scale, real‑world study of 5,000+ support agents found a 14% productivity boost overall from a gen‑AI assistant—and ~34% for less‑experienced workers. That’s a big deal for onboarding and consistency. (Oxford Academic)
For developers, a controlled experiment found engineers using GitHub Copilot completed a task ~56% faster. That’s not “robots wrote the code,” it’s “humans finished faster with AI help.” (arXiv)
3) Personalization and ranking move your core metrics
Recommendation, retrieval‑augmented search, next‑best‑action—these are classic ML strengths. If your product’s value depends on ordering, matching, or tailoring, AI tends to pay off.
4) Latency and accuracy requirements are tolerant
AI is ideal when: occasional errors are acceptable, corrections are cheap, and time‑to‑answer beats near‑perfect precision. Think: drafting sales outreach, summarizing meetings, triaging support, or helping users navigate complex forms.
5) You can measure value and bound harm
If you can instrument outcomes (CSAT, time‑to‑resolution, conversion) and cap damage (human review, rollback, rate limits), you’re set up for learning loops. NIST’s AI Risk Management Framework says the work of trustworthy AI breaks down into “GOVERN, MAP, MEASURE, [and] MANAGE.” Build those muscles early. (NIST Publications)
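One lightweight way to “cap damage” is a routing wrapper: auto‑apply only outputs the system is confident about, and send everything else to a person. The sketch below is illustrative only; the confidence score, threshold, and send/review functions are assumptions you’d replace with your own verifier and workflow.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    confidence: float  # assumed to come from the model or a separate verifier

CONFIDENCE_FLOOR = 0.8  # illustrative threshold; tune it against labeled outcomes

def send(text: str) -> str:
    return f"SENT: {text}"               # stand-in for the real action

def enqueue_for_review(text: str) -> str:
    return f"QUEUED FOR HUMAN: {text}"   # stand-in for your review queue

def handle(draft: Draft) -> str:
    # In production, also log the input, output, and routing decision so you can
    # tie CSAT / time-to-resolution back to each path.
    if draft.confidence >= CONFIDENCE_FLOOR:
        return send(draft.text)
    return enqueue_for_review(draft.text)

print(handle(Draft("Your refund was issued today.", confidence=0.92)))
print(handle(Draft("I think we deleted your account?", confidence=0.41)))
```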
When AI probably doesn’t belong (yet)
1) Clear, stable rules already win
If 5 lines of code or a database constraint does the job, do that. Shipping a stochastic model where a deterministic rule works is like hiring a poet to format CSVs.
2) Stakes are high and errors are expensive
Credit decisions, safety systems, medical triage, hiring/HR, law enforcement—these use cases are regulated and often classified as “high‑risk” in the EU AI Act. In the EU, some practices are outright banned (e.g., untargeted scraping for facial recognition databases, emotion recognition at work or in schools, social scoring). If your feature touches these areas, pause and talk to counsel. (European Commission)
3) You can’t stomach ML “technical debt”
Classic wisdom from Google’s ML ops paper: “it is dangerous to think of these quick wins as coming for free.” AI features create ongoing costs—monitoring drift, retraining, labeling, safety tooling, incident response—that often dwarf the model itself. Don’t adopt AI unless you’re prepared to own the lifecycle. (NeurIPS Papers)
4) Your UX needs reliability over creativity
Nielsen Norman Group warns that users often over-trust confident AI outputs and struggle to error‑check them—exactly the opposite of what you want in critical flows. Use AI where you can display provenance (citations, evidence) or put a human in the loop. (Nielsen Norman Group)
5) Latency, privacy, or cost constraints are tight
Large models are powerful, but slower and pricier. Providers commonly price per token and often charge more for bigger models. Cloudflare, for instance, published tiered pricing by model size; smaller models are dramatically cheaper per million tokens than very large ones. Your cost per interaction scales with tokens in/out. Plan accordingly. (The Cloudflare Blog)
(If you need lower latency or stricter data control, consider small/on‑device models and edge inference; research suggests these can reduce cost and improve privacy while keeping quality acceptable for many tasks.) (arXiv)
Pros and cons you can share upstairs
The Pros
Measured productivity gains in the wild (14% for support agents; ~56% faster for devs on a coding task). That’s not hype; the first figure comes from a large‑scale field study, the second from a controlled experiment. (Oxford Academic)
24/7 coverage for triage and first drafts (with human review where it matters).
Better personalization & discovery in content‑heavy products.
Speed to experiment: it’s often easier to prototype complex behavior with a prompt + policy than with months of rules development.
Cost trajectories are improving. Industry analysis and vendors show inference pricing falling and serving efficiency improving (e.g., speculative decoding, batching). Your 2025 unit economics may look markedly better than 2024’s. (The Cloudflare Blog)
The Cons
Non‑determinism & hallucinations. LLMs can fabricate plausible nonsense. If the UX doesn’t help users verify outputs, trust erodes. (And if you can’t verify—don’t deploy.) (Nielsen Norman Group)
Regulatory overhead in certain domains (EU AI Act risk classes, transparency, documentation, conformity assessments). Roadmaps must include compliance work. (European Commission)
MLOps & maintenance. Monitoring, evaluation, prompt/content safety, and drift handling are real ongoing costs—the “interest” on ML technical debt. (NeurIPS Papers)
Latency & cost trade‑offs. Bigger models → better quality (sometimes), but slower/more expensive. Your volume × tokens × price curve matters more than your demo wow‑factor. (The Cloudflare Blog)
User expectations. A bright, talkative chatbot that’s wrong 5% of the time can be worse than a boring form that’s right 100% of the time.
A quick decision checklist
Score one point for each “Yes.”
Value: Is there a clear, measurable outcome (e.g., reduced handle time, higher conversion) you can A/B test in ≤ 6 weeks?
Data: Do you have enough representative data (or can you use retrieval from trusted content) to support the task?
Tolerance: Can you safely tolerate occasional mistakes (with humans, guardrails, or rollbacks)?
Regulation: Are you outside high‑risk/banned categories—or have you scoped compliance work with counsel? (European Commission)
Lifecycle: Do you have owners for GOVERN / MAP / MEASURE / MANAGE (NIST), including monitoring and incident response? (NIST Publications)
Economics: Do tokens × volume × price fit your margins (now and at 10× volume)? (Most providers price per million tokens; costs rise with model size.) (The Cloudflare Blog)
0–2: Skip AI. A simpler solution likely wins.
3–4: Pilot with human‑in‑the‑loop and strict guardrails.
5–6: Build, measure, and iterate; consider expanding model scope once ROI is proven.
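If it helps to make the checklist a shared artifact rather than a hallway debate, the scoring is trivial to codify. The answers below are placeholders, not a recommendation:

```python
answers = {
    "value":      True,   # measurable outcome, A/B-testable in <= 6 weeks
    "data":       True,   # representative data or trusted retrieval content
    "tolerance":  True,   # mistakes survivable with humans/guardrails/rollbacks
    "regulation": False,  # high-risk domain, compliance not yet scoped
    "lifecycle":  True,   # owners for govern / map / measure / manage
    "economics":  False,  # unit costs untested at 10x volume
}

score = sum(answers.values())
if score <= 2:
    verdict = "Skip AI -- a simpler solution likely wins."
elif score <= 4:
    verdict = "Pilot with human-in-the-loop and strict guardrails."
else:
    verdict = "Build, measure, iterate; expand scope once ROI is proven."

print(score, verdict)  # -> 4 Pilot with human-in-the-loop and strict guardrails.
```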
Build vs. Buy vs. Blend
Buy (API to a hosted model) when speed‑to‑value matters, your use case isn’t core IP, and data constraints allow it. You’ll move fast and can switch models as pricing/quality shifts. (Token‑based pricing and model size tiers make budgeting straightforward.) (The Cloudflare Blog)
Build (fine‑tune/self‑host) when AI is the product, you have a proprietary data advantage, or you must keep data on‑prem/edge for privacy or latency. Edge/SLM research suggests growing viability here. (arXiv)
Blend with retrieval‑augmented generation (RAG): keep authoritative content in your control, ground the model’s answers, and show sources to users. (Bonus: this reduces hallucinations and helps with compliance evidence.)
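Here’s roughly what the blend looks like in code: retrieve from content you control, build a prompt that forces the model to stick to it, and keep source ids so you can show them to users. The retriever below is a toy word‑overlap ranker and `llm_complete` is a hypothetical stand‑in for your model call; in practice you’d use embeddings and a real vector store.

```python
# Minimal RAG sketch: retrieve from your own vetted content, then ask the model
# to answer *only* from it and cite which snippet it used.

DOCS = {
    "refund-policy":  "Refunds are issued within 5 business days of approval.",
    "data-retention": "Customer data is retained for 90 days after account closure.",
}

def retrieve(question: str, k: int = 1):
    """Toy retriever: rank snippets by word overlap (swap in real embeddings later)."""
    q_words = set(question.lower().split())
    scored = sorted(
        DOCS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    sources = retrieve(question)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in sources)
    return (
        "Answer using ONLY the sources below and cite the source id.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long do refunds take?"))
# answer = llm_complete(build_prompt("How long do refunds take?"))  # hypothetical call
```

The point isn’t the retrieval algorithm; it’s that the authoritative text and the citations stay under your control.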
How to de‑risk what you do ship
Start narrow. Launch one tightly scoped task (e.g., summarize tickets into tags with a human confirmer).
Ground the model. Use RAG with vetted content, structured prompts, and schema‑constrained outputs where possible (see the sketch after this list).
Expose provenance. Show citations, confidence, or “What changed?” diffs so users can verify. (Nielsen Norman Group’s usability advice here is clear: help people check the AI.) (Nielsen Norman Group)
Instrument everything. Track inputs, outputs, user corrections, and business outcomes by cohort.
Create a safety playbook. Abuse handling, jailbreak resistance, PII redaction, rate limits, prompt/version pinning, and a rollback button.
Budget for evaluation. Offline evals are good; live evals with shadow mode and golden datasets are better.
Compliance is a feature. If you operate in or sell to the EU, plan for the AI Act (risk classification, documentation, transparency). Build the evidence trail while you build the feature. (European Commission)
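Picking up the “Ground the model” step: schema‑constrained output can be as simple as refusing anything that doesn’t parse into the structure you expect. A minimal sketch, assuming a tag‑suggestion feature with a human confirmer (the allowed tags and field names are illustrative):

```python
import json
from typing import Optional

ALLOWED_TAGS = {"billing", "account", "bug", "feature-request"}

def parse_ticket_tags(raw_model_output: str) -> Optional[list]:
    """Accept the model's output only if it's valid JSON matching our tiny schema;
    otherwise return None so the ticket falls back to the human queue."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return None
    tags = data.get("tags") if isinstance(data, dict) else None
    if not isinstance(tags, list):
        return None
    if not all(isinstance(t, str) and t in ALLOWED_TAGS for t in tags):
        return None
    return tags

print(parse_ticket_tags('{"tags": ["billing", "bug"]}'))  # -> ['billing', 'bug']
print(parse_ticket_tags('Sure! The tags are billing.'))   # -> None (human confirms)
```

Anything that returns None falls back to the human path, which is exactly the guardrail the earlier items describe.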
A note on cost math (that CFO‑friendly slide)
Vendors typically price per million tokens (inputs + outputs). Your monthly inference cost ≈
(avg tokens per interaction) × (monthly interactions) × (price per million tokens) / 1,000,000.
Prices vary by model size and provider; published schedules (e.g., Cloudflare) make it explicit that larger models cost more per million tokens. Pair that with NVIDIA’s guidance on benchmarking inference performance and you have a straightforward way to model unit economics per feature. (The Cloudflare Blog)
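A quick worked example of that formula, with made‑up numbers (plug in your own volumes and your provider’s actual price sheet):

```python
# Illustrative figures only -- not any provider's real pricing.
avg_tokens_per_interaction = 1_500       # prompt + completion tokens
monthly_interactions       = 200_000
price_per_million_tokens   = 0.50        # USD per million tokens

monthly_cost = (
    avg_tokens_per_interaction * monthly_interactions * price_per_million_tokens
    / 1_000_000
)
print(f"${monthly_cost:,.0f} per month")           # -> $150 per month
print(f"${monthly_cost * 10:,.0f} at 10x volume")  # -> $1,500 at 10x volume
```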
Pro tip: Before you reach for the biggest model, try a small model with retrieval, schema‑constrained outputs, and a review step. You’ll often hit “good enough” quality with much better latency and cost. (arXiv)
Reality check: The hype is real and uneven
Yes, many teams report genuine gains and broader usage year over year. Yes, some series show recent softness in enterprise‑level adoption metrics (depending on how you define “use”). Both can be true: experimentation is high; production‑grade, end‑to‑end AI (with governance and profit) is harder. Manage expectations with that nuance. (McKinsey & Company)
Final word
Use AI where it naturally fits your product’s jobs‑to‑be‑done, where you can measure value, and where your org can carry the operational load. Avoid it where rules are enough, errors are costly, or compliance would own your roadmap.
Or, as the ML ops classic cautions: “It is dangerous to think of these quick wins as coming for free.” Build the muscles to pay the interest—or don’t take on the debt. (NeurIPS Papers)
And if anyone asks why your AI‑powered feature is rolling out in carefully staged phases, you can smile and say: “Because we’re smart, and we like sleeping at night.”