
PDF Upload for Your Shopify AI Chat Knowledge Base

Upload spec sheets, manuals, and catalogs as PDFs into your Shopify AI chat knowledge base — automatic extraction, embeddings, and a draft-first workflow.

Viet Le
co-founder · Milly Software

A lot of valuable product depth lives in PDFs that merchants would never paste into a chat-widget admin manually. Spec sheets for technical products. Sizing charts with diagrams. Application guides for industrial coatings. Warranty terms. Catalog booklets the brand prints once a season. The information is real and useful; the format is the friction.

PDF upload removes the format friction. Drop a file into the Knowledge Base dashboard and the system extracts the text, cleans it up, and generates embeddings so the AI chat widget can answer questions from it. It is the same retrieval pipeline as manual entries, just a different on-ramp.

Drag-drop, extract, embed

From the merchant's side: open the Knowledge Base PDF upload modal, drag one or more PDFs in, click upload. The endpoint receives the file, runs it through pdf-parse, normalizes the extracted text (collapses excessive whitespace, trims), and passes it into the same KB pipeline that handles manual entries: embedding generation, KB entry creation, indexing.
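The normalization step can be sketched in a few lines. This is a minimal illustration, not the actual endpoint code: the function name normalizeExtractedText and the exact rules (unify line endings, collapse runs of spaces and tabs, keep at most one blank line) are assumptions layered on the "collapse excessive whitespace, trim" description above.

```typescript
// Hypothetical helper. The article only states that extracted text has
// excessive whitespace collapsed and is trimmed; the specific rules here
// are assumptions chosen for illustration.
function normalizeExtractedText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n")      // unify Windows and old Mac line endings
    .replace(/[ \t\f\v]+/g, " ")  // collapse runs of spaces and tabs
    .replace(/ *\n */g, "\n")     // drop spaces hugging a line break
    .replace(/\n{3,}/g, "\n\n")   // keep at most one blank line in a row
    .trim();                      // strip leading and trailing whitespace
}
```

Keeping single blank lines preserves paragraph boundaries, which can matter if the downstream pipeline splits text into chunks before embedding.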

The whole loop runs synchronously per file. A 50-page product catalog parses in a few seconds; a 5-page spec sheet in well under a second. Multi-file uploads run sequentially with per-file results so the merchant can see which uploaded cleanly and which had issues.
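A rough sketch of sequential processing with per-file results follows. The names PendingFile, FileResult, and uploadSequentially are hypothetical, and processOne stands in for the whole extract, normalize, embed, and index chain; the real endpoint may be structured differently.

```typescript
// Hypothetical shapes for illustration only.
interface PendingFile {
  filename: string;
  sizeBytes: number;
}

type FileResult =
  | { filename: string; ok: true; wordCount: number }
  | { filename: string; ok: false; error: string };

// Processes files one at a time. A failure on one file is recorded and
// does not stop the remaining files from being processed.
async function uploadSequentially(
  files: PendingFile[],
  processOne: (file: PendingFile) => Promise<number>, // resolves to a word count
): Promise<FileResult[]> {
  const results: FileResult[] = [];
  for (const file of files) {
    try {
      const wordCount = await processOne(file); // wait before starting the next file
      results.push({ filename: file.filename, ok: true, wordCount });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ filename: file.filename, ok: false, error });
    }
  }
  return results;
}
```

Awaiting inside the loop, rather than firing every file at once, keeps results in upload order and bounds memory use to one parsed document at a time.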

Title detection

Each KB entry needs a title for the dashboard list and as a retrieval signal. PDF upload derives one in priority order:

  1. PDF metadata Title field: the Title property stored in the PDF's metadata. When set (which it usually is on PDFs exported from Word, Pages, or InDesign), this is the most descriptive option.
  2. Filename fallback: strip the .pdf extension and replace dashes and underscores with spaces, so uag-warranty-terms.pdf becomes "uag warranty terms". Imperfect but always available.

The merchant can edit the title after upload. The auto-derived value is a starting point, not a lock.
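The priority order above can be expressed as a small function. This is a sketch under assumptions: deriveTitle is a hypothetical name, and how the metadata title is read from the parser output is left to the caller.

```typescript
// Hypothetical helper. metadataTitle is whatever the parser reported for the
// PDF's Title property, or null/undefined when the property is not set.
function deriveTitle(
  metadataTitle: string | null | undefined,
  filename: string,
): string {
  // 1. Prefer the metadata Title when it contains something other than whitespace.
  const fromMetadata = (metadataTitle ?? "").trim();
  if (fromMetadata.length > 0) {
    return fromMetadata;
  }
  // 2. Fall back to the filename: drop the .pdf extension, then turn
  //    dashes and underscores into spaces.
  return filename
    .replace(/\.pdf$/i, "")
    .replace(/[-_]+/g, " ")
    .trim();
}
```

With no metadata title, deriveTitle(undefined, "uag-warranty-terms.pdf") yields "uag warranty terms", matching the example above.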

Limits and edge cases

Three guards keep the upload flow predictable:

  • 4MB per file maximum. Most spec sheets and policy docs come in well under this; large image-heavy catalogs may not. The merchant gets a clear error when a file exceeds the limit, with the actual size shown so they know how much to trim.
  • Scanned / image-only PDFs are detected. PDFs that consist of scanned raster images (with no embedded text layer) extract to almost nothing: the parser returns the fallback "Untitled" with a tiny word count. The endpoint flags any extraction below 10 words as "scanned document — no extractable text" rather than creating a near-empty KB entry, and the merchant is steered toward running OCR before re-uploading.
  • Password-protected PDFs are rejected with a specific error. The parser throws on encrypted PDFs; the endpoint catches the password-related error and returns a clean "Password-protected PDF" message instead of a generic parse failure.
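The three guards might look roughly like this. Every name here is hypothetical, the message wording is illustrative, and two details are assumptions: that "4MB" means 4 × 1024 × 1024 bytes, and that the parser's error for encrypted files mentions the word "password".

```typescript
const MAX_FILE_BYTES = 4 * 1024 * 1024; // assumption: 4MB counted as 4 MiB
const MIN_WORDS = 10;                   // threshold stated in the article

type GuardResult = { ok: true } | { ok: false; message: string };

// Guard 1: reject oversized files and report the actual size.
function checkFileSize(sizeBytes: number): GuardResult {
  if (sizeBytes <= MAX_FILE_BYTES) return { ok: true };
  const sizeMb = (sizeBytes / (1024 * 1024)).toFixed(2);
  return { ok: false, message: `File is ${sizeMb}MB; the limit is 4MB` };
}

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Guard 2: treat near-empty extractions as scanned documents.
function checkExtractedText(text: string): GuardResult {
  if (countWords(text) < MIN_WORDS) {
    return { ok: false, message: "Scanned document: no extractable text" };
  }
  return { ok: true };
}

// Guard 3: map a parser failure to a merchant-facing message.
function describeParseError(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  // Assumption: encrypted-PDF errors contain the word "password".
  return /password/i.test(message) ? "Password-protected PDF" : "Could not parse PDF";
}
```

Running the size check before parsing avoids spending parse time on a file that will be rejected anyway; the word-count check can only run after extraction.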

Draft-first workflow

Uploaded PDF entries land as drafts (is_active: false). They're visible in the dashboard but aren't served to the widget yet. The merchant reviews each entry (checks that the extracted text is clean, edits the title if the auto-derived version is awkward, possibly trims a section that isn't worth indexing) and publishes when ready.

This is the same draft-first pattern URL Sync uses for bulk sitemap imports. The reasoning is the same: importing content faster than the merchant can review introduces noise into the AI's retrieval. Drafts let the merchant be the editor.
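As a sketch of the draft-first idea: only the is_active flag comes from the description above. The KbEntry shape, the source field, and the function names are hypothetical.

```typescript
// Hypothetical entry shape; only is_active is taken from the article.
interface KbEntry {
  title: string;
  content: string;
  source: "manual" | "pdf";
  is_active: boolean;
}

// PDF uploads always start as drafts.
function createPdfDraft(title: string, content: string): KbEntry {
  return { title, content, source: "pdf", is_active: false };
}

// Publishing applies any edits made during review and flips the flag.
// The original draft object is left unchanged.
function publishEntry(
  entry: KbEntry,
  edits: { title?: string; content?: string } = {},
): KbEntry {
  return { ...entry, ...edits, is_active: true };
}

// Retrieval for the widget only ever sees active entries.
function servableEntries(entries: KbEntry[]): KbEntry[] {
  return entries.filter((entry) => entry.is_active);
}
```

Filtering on is_active at retrieval time means a draft can sit in the dashboard indefinitely without affecting what the widget answers from.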

Why PDFs specifically

Most chat-widget tools refuse to ingest PDFs at all, or require manual export-as-text-then-paste workflows that most merchants never finish. The result is that the depth merchants have available — the spec sheets they spent real money producing, the warranty terms they had a lawyer write, the application guides they iterate on each season — stays out of the chat widget's reach.

Closing that gap is high-leverage for a small set of merchants and irrelevant for the rest. Apparel stores usually don't need PDF support. Industrial-coatings merchants and electric-bike companies do — their visitors ask product-spec questions that the merchant's sales team would answer by emailing a PDF. The widget should be able to answer those questions from the same source the sales team uses.
