Progressive PDF Loading

Roberto Guido

01 Mar 2020 • 2 min read

If you serve large PDF documents on a website, you probably don't want them to be fetched entirely before anything is shown to the user. And if you host them on AWS S3, you may also be concerned about bandwidth usage and costs. Mozilla's PDF.js supports progressive loading and rendering, but a few things need to be tuned for optimal results, and not all of the steps are clearly documented (the usual response to questions submitted through GitHub issues is "It works, issue closed").

First of all: you need to "linearize" your documents. That is: build an index at the beginning of the file, so that the data for a desired page can be fetched precisely, from byte X to byte Y. This is easily done with QPDF, an open source utility widely used for this specific task. Just run qpdf --linearize document.pdf final_document.pdf and upload final_document.pdf to your S3 bucket.
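
Before uploading, it can be handy to confirm that the output file really is linearized. What follows is only a rough heuristic sketch, not part of the original workflow: it assumes that a linearized PDF declares a /Linearized dictionary within the first 1024 bytes of the file, and the function name looksLinearized is made up for illustration. For a thorough check, qpdf --check-linearization final_document.pdf is the more reliable option.

```javascript
// Heuristic only: look for the PDF header and a /Linearized marker
// in the first 1024 bytes. `bytes` is a Uint8Array (a Node Buffer works too).
function looksLinearized(bytes) {
  const head = bytes.subarray(0, 1024);
  let text = '';
  for (const byte of head) {
    text += String.fromCharCode(byte);
  }
  return text.includes('%PDF-') && text.includes('/Linearized');
}
```

In Node, something like looksLinearized(require('fs').readFileSync('final_document.pdf')) should be enough to try it on a local file.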

Second: you have to make S3's range request support visible to PDF.js. Range requests use HTTP headers to fetch just a specific portion of a file (from byte X to byte Y, as above), and PDF.js handles them perfectly as long as the web server informs it that the feature is available. In the CORS configuration of your S3 bucket you have to explicitly expose the relevant headers, for example:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>*.your.domain.com</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
    <ExposeHeader>Accept-Ranges</ExposeHeader>
    <ExposeHeader>Content-Range</ExposeHeader>
    <ExposeHeader>Content-Encoding</ExposeHeader>
    <ExposeHeader>Content-Length</ExposeHeader>
</CORSRule>
</CORSConfiguration>
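
To check that the configuration above took effect, one option is to request just the first kilobyte of the file from the browser console and look at what comes back. The following is a minimal sketch under a few assumptions: the helper names parseContentRange and probeRangeSupport are hypothetical, and the check relies on the server answering a ranged request with status 206 and a Content-Range header of the form "bytes 0-1023/146515".

```javascript
// Parse a Content-Range value such as "bytes 0-1023/146515".
// Returns null when the header is missing or not in that form.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value || '');
  if (!match) return null;
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    total: match[3] === '*' ? null : Number(match[3]),
  };
}

// Ask for the first 1024 bytes only and report how the server answered.
// Cross-origin, the headers read here are visible only if CORS exposes them.
async function probeRangeSupport(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { headers: { Range: 'bytes=0-1023' } });
  return {
    partial: response.status === 206,
    acceptRanges: response.headers.get('Accept-Ranges'),
    contentRange: parseContentRange(response.headers.get('Content-Range')),
  };
}
```

If partial is false, or acceptRanges and contentRange come back as null when called from a page on your domain, the headers are probably not being exposed and PDF.js is likely to fall back to downloading the whole file.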

This is enough to enable progressive fetching and fast rendering, but still not enough to fetch data only when required. At this point PDF.js will still download the whole document, just in separate chunks and in the background. To use bandwidth more carefully, you also have to enable the disableAutoFetch option in PDF.js, passing it as a parameter to the getDocument() function:

pdfjsLib.getDocument({
    url: 'https://bucket.s3.eu-central-1.amazonaws.com/final_document.pdf',
    disableStream: true, // Required for disableAutoFetch to actually take effect
    disableAutoFetch: true, // Fetch chunks only when they are needed
}).promise.then(function(pdfDocument) {
    // getDocument() returns a loading task; the document arrives through its promise
    pdfViewer.setDocument(pdfDocument);
});

Disabling auto-fetch may result in slower rendering of the document, as downloads are postponed until actually necessary and the user has to wait for them, but this is a compromise that can eventually save a few dollars on your monthly S3 bill.
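
To get a feeling for how much data is really transferred with these options, the loading task can report its progress. This is a small sketch rather than a definitive recipe: the wrapper name loadOnDemand and its onBytes callback are made up for illustration, pdfjsLib is passed in as a parameter so the function does not depend on how the library is loaded, and the rangeChunkSize value shown is, as far as I know, the PDF.js default.

```javascript
// Open a PDF with on-demand fetching and report downloaded bytes.
// `pdfjsLib` is the PDF.js library object; `onBytes` is a hypothetical callback.
function loadOnDemand(pdfjsLib, url, onBytes) {
  const loadingTask = pdfjsLib.getDocument({
    url: url,
    disableStream: true,    // needed for disableAutoFetch to take effect
    disableAutoFetch: true, // fetch chunks only when a page needs them
    rangeChunkSize: 65536,  // bytes per range request (assumed default)
  });
  // PDF.js calls onProgress with the bytes loaded so far and the total size.
  loadingTask.onProgress = function (progress) {
    onBytes(progress.loaded, progress.total);
  };
  return loadingTask.promise;
}
```

A call such as loadOnDemand(pdfjsLib, url, (loaded, total) => console.log(loaded, total)).then(doc => pdfViewer.setDocument(doc)) should show the loaded figure growing as the user scrolls, instead of jumping straight to the full file size.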