9 Oct 2011

Unix: Getting the page count of a linearized PDF

We were doing some work last week to rasterize a PDF document into a sequence of images and wanted to get a rough idea of how many pages we’d be dealing with if we created an image per page.

The PDFs we’re dealing with are linearized since they’re available for viewing on the web:

A LINEARIZED PDF FILE is one that has been organized in a special way to enable efﬁcient incremental access in a network environment. The ﬁle is valid PDF in all respects, and is compatible with all existing viewers and other PDF applications. Enhanced viewer applications can recognize that a PDF ﬁle has been linearized and can take advantage of that organization (as well as added “hint” information) to enhance viewing performance.

The neat thing about this is it means that the document has meta data detailing the number of pages it contains:

Part 2: Linearization parameter dictionary
43 0 obj
<< /Linearized 1.0 % Version
/L 54567 % File length
/H [475 598] % Primary hint stream offset and length (part 5)
/O 45 % Object number of ﬁrst page’s page object (part 6)
/E 5437 % Offset of end of ﬁrst page
/N 11 % Number of pages in document
/T 52786 % Offset of ﬁrst entry in main cross-reference table (part 11)
>>
endobj

By making use of the http://en.wikipedia.org/wiki/Strings_(Unix) command Duncan and I hacked together a little script that lets us grab the number of pages in The Games of Strategy PDF or any other linearized PDF:

strings RAND_CB149-1.pdf |
awk '/Linearized/ { inmeta = 1; } match($0, /\/N [0-9]+/) { if(inmeta) print substr( $0, RSTART, RLENGTH ); exit;}' |
cut -d" " -f2

It seems much more difficult to find the count if the document hasn’t been linearized but we didn’t need to solve that problem for the moment!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.