
Measuring AI Chat Widget ROI on Shopify: How to Do It Honestly

Three visitor cohorts, conversion lift with significance testing, and strict-attribution chat-assisted revenue — how to measure what an AI chat widget is actually worth on a Shopify store.

Viet Le
co-founder · Milly Software

AI chat widgets are easy to make look good. Open rate goes up. Messages per session goes up. "Engaged visitors" goes up. These are vanity metrics — they describe widget activity, not merchant outcomes. The honest question is whether the widget is actually causing more revenue, and answering it requires a few more pieces of math than most chat tools surface.

This post walks through what we measure on the Impact dashboard: three visitor cohorts, conversion lift with significance testing, and a strict-attribution revenue line that ignores everything soft. None of it is novel — it's how a respectable A/B test is graded — but it's rarely how chat widgets are evaluated.

The vanity-metrics trap

The trap looks like this: a chat widget reports "42% widget open rate" and "4.3 messages per session," and the merchant has no way to compare those numbers to anything. Open rate compared to what? Messages per session compared to what? The widget's existence is being graded against itself.

Two questions never get asked: did the visitors who saw the widget convert at a higher rate than the visitors who didn't? And is that difference large enough to be real, or could it be noise? Both are answerable on a moderately-trafficked Shopify store with a few weeks of data — but both require cohort-level visibility, not session-level activity counts.

Three cohorts: shown, hidden, interactor

Every visitor lands in one of three buckets, tracked via the widget_shown and widget_opened analytics events:

  • Shown — the widget rendered in their viewport at least once. Conditional rules, device exclusions, or the no-show page list may have hidden it on other page views, but if the widget showed at all, the visitor belongs to this cohort.
  • Hidden — the widget did not render. Either rules suppressed it, the page was excluded, or rollout percentage put them in the holdout. This is the control group for the lift calculation.
  • Interactor — a subset of Shown who actually opened the widget. The strongest-intent population: they didn't just see chat available, they engaged with it.

Each cohort gets its own conversion rate (orders ÷ unique visitors) and its own revenue total. The dashboard shows all three side by side, not just the flattering one.
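The bucketing and the per-cohort rate can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual code: the `VisitorRecord` shape, its field names, and the `cohortStats` function are assumptions, and the input is assumed to hold one record per unique visitor, already summarized from the widget_shown and widget_opened events.

```typescript
// Hypothetical per-visitor summary; field names are assumptions for illustration.
interface VisitorRecord {
  visitorId: string;
  sawWidget: boolean;    // at least one widget_shown event
  openedWidget: boolean; // at least one widget_opened event
  orders: number;        // orders placed by this visitor
  revenue: number;       // total order value for this visitor
}

interface CohortStats {
  visitors: number;
  orders: number;
  revenue: number;
  conversionRate: number; // orders / unique visitors, 0 for an empty cohort
}

type CohortName = "shown" | "hidden" | "interactor";

function emptyStats(): CohortStats {
  return { visitors: 0, orders: 0, revenue: 0, conversionRate: 0 };
}

function cohortStats(records: VisitorRecord[]): Record<CohortName, CohortStats> {
  const stats: Record<CohortName, CohortStats> = {
    shown: emptyStats(),
    hidden: emptyStats(),
    interactor: emptyStats(),
  };
  const add = (s: CohortStats, r: VisitorRecord): void => {
    s.visitors += 1;
    s.orders += r.orders;
    s.revenue += r.revenue;
  };
  for (const r of records) {
    if (r.sawWidget) {
      add(stats.shown, r);
      // Interactor is a subset of Shown, so openers are counted in both.
      if (r.openedWidget) add(stats.interactor, r);
    } else {
      add(stats.hidden, r);
    }
  }
  for (const name of ["shown", "hidden", "interactor"] as CohortName[]) {
    const s = stats[name];
    s.conversionRate = s.visitors > 0 ? s.orders / s.visitors : 0;
  }
  return stats;
}
```

Because Interactor sits inside Shown, the three visitor counts do not sum to total traffic; only Shown plus Hidden does.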

Conversion lift, with statistical significance

The first comparison is shown vs hidden. The conversion lift is the percentage improvement of the shown rate relative to the hidden rate:

conversionLift = ((shownConvRate - hiddenConvRate) / hiddenConvRate) * 100

// Example:
//   shown:  4.0% conversion (1,200 orders / 30,000 visitors)
//   hidden: 3.0% conversion (60 orders / 2,000 visitors)
//   lift:   +33.3%

That's the headline number. But a +33% lift on a small hidden cohort can easily be noise. The dashboard runs a two-proportion z-test on the shown vs hidden proportions and reports both the p-value and an isSignificant flag (true when p < 0.05). The statistical machinery is right there in the rendered card — not hidden behind a "learn more" link.

The flag matters more than it sounds. A merchant looking at a "+50% lift" with a p-value of 0.4 is being told an exciting number that has the same epistemic weight as a coin flip. The same merchant looking at a "+8% lift" with p < 0.001 has a small effect they can take to the bank. The honest dashboard helps tell which is which.
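For readers who want to check the arithmetic, here is a minimal sketch of a two-proportion z-test. It assumes the pooled-variance, two-sided form of the test, which this post does not specify, and it treats orders ÷ unique visitors as a proportion. The names `conversionLiftTest` and `LiftResult` are hypothetical; only `isSignificant` and the p < 0.05 threshold come from the dashboard description. JavaScript has no built-in error function, so the normal CDF uses the Abramowitz-Stegun 7.1.26 approximation.

```typescript
// Error function via the Abramowitz-Stegun 7.1.26 approximation
// (maximum absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Hypothetical result shape; only isSignificant is named in the post.
interface LiftResult {
  liftPercent: number;
  zScore: number;
  pValue: number;
  isSignificant: boolean;
}

function conversionLiftTest(
  shownOrders: number,
  shownVisitors: number,
  hiddenOrders: number,
  hiddenVisitors: number,
  alpha: number = 0.05
): LiftResult {
  if (shownVisitors <= 0 || hiddenVisitors <= 0) {
    throw new Error("Both cohorts need at least one visitor");
  }
  const pShown = shownOrders / shownVisitors;
  const pHidden = hiddenOrders / hiddenVisitors;
  // Pooled proportion under the null hypothesis of no difference.
  const pooled = (shownOrders + hiddenOrders) / (shownVisitors + hiddenVisitors);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / shownVisitors + 1 / hiddenVisitors)
  );
  if (standardError === 0 || pHidden === 0) {
    throw new Error("Lift or z-score is undefined for these counts");
  }
  const zScore = (pShown - pHidden) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(zScore))); // two-sided
  return {
    liftPercent: ((pShown - pHidden) / pHidden) * 100,
    zScore,
    pValue,
    isSignificant: pValue < alpha,
  };
}
```

Run on the example above (1,200 orders from 30,000 shown visitors against 60 orders from 2,000 hidden visitors), this sketch gives a z-score of roughly 2.2 and a p-value a little under 0.03, so the +33.3% lift clears the 0.05 threshold, though not by a wide margin.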

Strict-attribution chat-assisted revenue

Lift on conversion rate is one signal. The other is a stricter question: did revenue land specifically because of the widget? That's a different join.

The chat-assisted revenue line counts orders where:

  • The visitor opened the widget
  • The visitor added a product via in-chat ATC (the product_added_to_cart analytics event)
  • That same product appeared in the visitor's eventual order

The third condition is the strict one. The query joins order_items.product_id against analytics_events.event_data->>'productId' — if the chat-recommended product isn't actually in the order, it doesn't count, even if the visitor opened the widget on their way to checkout. This is conservative on purpose. Loose attribution turns every widget interaction into revenue credit; strict attribution gives a number the merchant can defend to their CFO.
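The three conditions amount to a filter over orders. The sketch below runs it in memory rather than in SQL; the `AnalyticsEvent` and `Order` shapes and the `chatAssistedOrders` function are assumptions for illustration. Only the event names and the productId match come from the description above, and the attribution window is left out to keep the join itself visible.

```typescript
// Hypothetical in-memory shapes standing in for analytics_events and orders.
interface AnalyticsEvent {
  visitorId: string;
  eventType: string; // e.g. "widget_opened" or "product_added_to_cart"
  eventData: { productId?: string };
}

interface OrderItem {
  productId: string;
  price: number;
}

interface Order {
  orderId: string;
  visitorId: string;
  total: number;
  items: OrderItem[];
}

function chatAssistedOrders(events: AnalyticsEvent[], orders: Order[]): Order[] {
  // Visitors who opened the widget.
  const openers: Record<string, boolean> = {};
  // Per visitor, the products added to cart from inside the chat.
  const chatAdds: Record<string, Record<string, boolean>> = {};

  for (const e of events) {
    if (e.eventType === "widget_opened") {
      openers[e.visitorId] = true;
    }
    const productId = e.eventData.productId;
    if (e.eventType === "product_added_to_cart" && productId !== undefined) {
      const added = chatAdds[e.visitorId] ?? {};
      added[productId] = true;
      chatAdds[e.visitorId] = added;
    }
  }

  return orders.filter((o) => {
    const added = chatAdds[o.visitorId];
    if (openers[o.visitorId] !== true || added === undefined) return false;
    // Strict condition: a chat-added product must be in this order.
    return o.items.some((item) => added[item.productId] === true);
  });
}
```

Whether the revenue credit is the full order total or only the matched line items is not stated in this post, so the sketch returns the qualifying orders and leaves that choice to the caller.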

Attribution windows

Different stores have different consideration cycles. An impulse accessory might convert in the same session; a $2,000 e-bike might take three weeks of research. A single attribution window applied uniformly under-counts long-cycle stores and over-counts impulse stores.

Each store sets its own window — 7, 14, 30, or 60 days (default 7). The interactor and post-purchase counts use that window when joining widget events to orders. The active value is surfaced on the Impact card so the merchant always knows which window the displayed numbers represent.

Long-cycle merchants (e-bikes, B2B, custom builds) typically run 30+ days; UAG runs 14 days because flash-sale primer chats commonly happen 8-14 days before the actual purchase.
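The window check itself is simple date arithmetic. This sketch assumes the window is measured as elapsed time from the widget event to the order, that the event must come first, and that the boundary is inclusive; none of those details are specified in this post. `isWithinWindow` and the constant names are hypothetical; the allowed values and the default of 7 are from the settings described above.

```typescript
// The four window lengths a store can choose, in days.
type AttributionWindowDays = 7 | 14 | 30 | 60;

const DEFAULT_WINDOW_DAYS: AttributionWindowDays = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the order was placed after the widget event and no more than
// windowDays later. Elapsed time is used, not calendar days (an assumption).
function isWithinWindow(
  eventAt: Date,
  orderAt: Date,
  windowDays: AttributionWindowDays = DEFAULT_WINDOW_DAYS
): boolean {
  const elapsedMs = orderAt.getTime() - eventAt.getTime();
  return elapsedMs >= 0 && elapsedMs <= windowDays * MS_PER_DAY;
}
```

A chat on January 1 followed by an order on January 11 falls outside the default 7-day window and inside a 14-day one, which is the kind of gap the longer settings are meant to cover.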

What the Impact dashboard actually surfaces

The card stack on the Impact page condenses to:

  • Conversion rates, three cards side by side: shown, hidden, interactor. Each shows the rate, the orders, and the unique visitor count behind it.
  • Conversion lift with the p-value and significance flag visible on the same card.
  • Revenue by cohort: revenue from shown, hidden, and interactor visitors.
  • Chat-assisted revenue: strict-attribution number and the order-level breakdown (variant name, order number, price, order total, date).
  • Active attribution window: the 7 / 14 / 30 / 60-day setting in plain language so the rest of the numbers have context.

For client pitch decks and internal ROI conversations, the chat-assisted revenue line is usually what matters. It's the line a merchant can read out loud without hedging — "the chat-recommended product was in the order; we know that for certain" — and that's rare in attribution.

Vanity metrics are still on other pages (Speed for latency, Queries for top searches, Gaps for unanswered questions). They have their own jobs to do. But they don't live where the ROI question is being answered, because they don't answer it.

Frequently Asked Questions

How do you measure the ROI of an AI chat widget honestly?

By comparing visitor cohorts rather than counting widget activity. Milly Chat's Impact dashboard splits visitors into three cohorts — Shown (the widget rendered), Hidden (it didn't; the control group), and Interactor (a subset of Shown who opened it) — and reports a conversion rate and revenue total for each. Open rate and messages-per-session are vanity metrics because they grade the widget against itself.

Is the conversion lift statistically significant?

It's tested. The dashboard runs a two-proportion z-test on the Shown vs Hidden conversion proportions and reports both the p-value and a significance flag (true when p < 0.05), so a big-looking lift with a weak p-value is flagged rather than celebrated.

What counts as chat-assisted revenue?

A deliberately strict definition: an order only counts when a product the widget recommended actually appears in that order, joined within the store's attribution window. Loose attribution credits every interaction; strict attribution gives a number a merchant can state without hedging.

What attribution window does Milly Chat use?

Each store sets its own — 7, 14, 30, or 60 days, defaulting to 7. The interactor and post-purchase counts use that window when joining widget activity to orders, and the active window is shown in plain language on the dashboard so every number has context.
