PGH Web


Why Puppeteer PDFs Are So Large (and How to Compress Them)

Jeff Straney

We needed a PDF pipeline. Not a complicated one. HTML and CSS source files, rendered to a professional PDF for sale as a digital download. We had the content. We had the design. We had Puppeteer, which is what you reach for when you need Chrome to do something without a human sitting in front of it.

The first build produced a 53MB PDF. Ten pages, four images, no particular reason for the file to be that large. If you've landed here because your Puppeteer PDF is enormous and you don't know why, the answer is a few sections down. The wrong turns along the way are worth reading too.

This is where a reasonable person stops and asks a question before doing anything else. I didn't do that. Instead, I asked Claude.

The Diagnosis

Claude looked at the situation and offered a confident explanation: Puppeteer was somehow inflating the images. The low-resolution webp files going in were, according to this theory, coming out much larger on the other side. The exact mechanism was never entirely clear, because the mechanism doesn't exist. You can't make a compressed image contain more information by embedding it in a PDF. That's not how compression works. That's not how information works.

To Claude's credit, it eventually walked this back. To Claude's discredit, not before it proposed a fix.

The Subset Font Approach

The fix was font subsetting. Specifically, it wanted to bring in HarfBuzz to analyze the document, extract every character in use, and generate minimal font files containing only those glyphs. This would involve a new build step, a new npm dependency with WebAssembly bindings, and, as it turned out, direct downloads from the Google Fonts CDN that returned 404 on every attempt because the URLs were guessed rather than known.

I don't doubt that HarfBuzz is a good tool. I have every doubt that it was the right tool here. We weren't shipping a multilingual document with twelve typefaces. We were shipping a career guide with two fonts in it.

The subsetting step was abandoned after the third broken URL attempt. We moved on.

The Box Shadows

At some point the conversation arrived at our CSS. Claude identified a box-shadow on the role comparison cards and suggested removing it.

A box shadow is a rendering instruction, not stored data. Removing it from a PDF doesn't meaningfully reduce file size. We removed it anyway, rebuilt, and confirmed that 53MB minus one box-shadow: 1px 4px 6px equals 53MB. The shadows went back in.

Why Puppeteer PDFs Are So Large

I ran pdfinfo on the file. One line in the output said everything:

Optimized: no

Puppeteer generates PDFs through Chrome's Skia rendering engine. It embeds decompressed image data and uncompressed content streams, and it runs no optimization pass on the output. The PDF is large because Puppeteer doesn't compress its output. That's the whole answer. Everything else was noise.
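For reference, the render step looks roughly like this. A minimal sketch, not our exact script: the launch options, the networkidle0 wait, and the function name are assumptions.

```typescript
// Options passed straight through to Chrome's print-to-PDF. Nothing in
// this path compresses the result — the file page.pdf() writes is the
// one pdfinfo reports as "Optimized: no".
export const pdfOptions = { format: "A4", printBackground: true } as const;

export async function renderPdf(htmlPath: string, outPath: string): Promise<void> {
  // @ts-ignore -- puppeteer is an assumed project dependency
  const puppeteer = await import("puppeteer");
  const browser = await puppeteer.default.launch();
  try {
    const page = await browser.newPage();
    await page.goto(`file://${htmlPath}`, { waitUntil: "networkidle0" });
    await page.pdf({ path: outPath, ...pdfOptions });
  } finally {
    await browser.close();
  }
}
```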

Ghostscript is a PostScript and PDF interpreter that's been around since 1988. It knows how to compress PDFs: the /ebook preset re-encodes embedded images and downsamples them to roughly 150 DPI, which is plenty for on-screen reading. One command:

gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
  -sOutputFile=output.pdf input.pdf

53MB became 334KB. The document looks identical. All 18 pages are intact. The images are fine. The box shadows are fine.

Worth noting: Ghostscript was Claude's suggestion. It was buried behind the image inflation theory, the HarfBuzz detour, and the shadow removal, but it was there. The instinct was right. The path to it was not.

Compressing Puppeteer PDFs with Ghostscript

The build script renders HTML to PDF with Puppeteer, then immediately passes the output to Ghostscript. If Ghostscript isn't installed, the build fails loudly rather than producing a 53MB file someone might accidentally ship. The whole thing is about 50 lines of TypeScript.
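A sketch of that pipeline, assuming a gs binary on PATH. The function names and the .raw.pdf intermediate path are illustrative, not the actual script:

```typescript
import { execFileSync } from "node:child_process";

// Ghostscript arguments mirroring the command from the previous section.
export function gsArgs(input: string, output: string): string[] {
  return [
    "-sDEVICE=pdfwrite",
    "-dPDFSETTINGS=/ebook",
    "-dNOPAUSE",
    "-dQUIET",
    "-dBATCH",
    `-sOutputFile=${output}`,
    input,
  ];
}

// Fail loudly if Ghostscript is missing, rather than quietly shipping
// an uncompressed file.
export function assertGhostscript(): void {
  try {
    execFileSync("gs", ["--version"]);
  } catch {
    throw new Error("Ghostscript (gs) not found on PATH — install it before building");
  }
}

// Render with Puppeteer, then hand the raw output to Ghostscript.
export async function build(htmlPath: string, outPath: string): Promise<void> {
  assertGhostscript();
  // @ts-ignore -- puppeteer is an assumed project dependency
  const puppeteer = await import("puppeteer");
  const browser = await puppeteer.default.launch();
  try {
    const page = await browser.newPage();
    await page.goto(`file://${htmlPath}`, { waitUntil: "networkidle0" });
    const rawPath = outPath + ".raw.pdf";
    await page.pdf({ path: rawPath, format: "A4", printBackground: true });
    execFileSync("gs", gsArgs(rawPath, outPath));
  } finally {
    await browser.close();
  }
}
```

Keeping the Ghostscript arguments in a plain function has one nice side effect: the compression settings are testable without launching a browser.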

The source files are HTML and CSS. The output is a compressed, professional PDF. There are no dependencies beyond Puppeteer and a system Ghostscript install. No font subsetting. No WebAssembly bindings. No CDN requests.

This is the pipeline I'd have built on day one if I'd run pdfinfo on day one. Before you theorize about where file size comes from, look at the file.