· unix awk

Unix: Getting the page count of a linearized PDF

We were doing some work last week to rasterize a PDF document into a sequence of images and wanted to get a rough idea of how many pages we’d be dealing with if we created an image per page.

The PDFs we’re dealing with are linearized since they’re available for viewing on the web:

A LINEARIZED PDF FILE is one that has been organized in a special way to enable efficient incremental access in a network environment. The file is valid PDF in all respects, and is compatible with all existing viewers and other PDF applications. Enhanced viewer applications can recognize that a PDF file has been linearized and can take advantage of that organization (as well as added “hint” information) to enhance viewing performance.

The neat thing about this is it means that the document has meta data detailing the number of pages it contains:

Part 2: Linearization parameter dictionary
43 0 obj
<< /Linearized 1.0 % Version
/L 54567 % File length
/H [475 598] % Primary hint stream offset and length (part 5)
/O 45 % Object number of first page’s page object (part 6)
/E 5437 % Offset of end of first page
/N 11 % Number of pages in document
/T 52786 % Offset of first entry in main cross-reference table (part 11)
>>
endobj

By making use of the http://en.wikipedia.org/wiki/Strings_(Unix) command Duncan and I hacked together a little script that lets us grab the number of pages in The Games of Strategy PDF or any other linearized PDF:

strings RAND_CB149-1.pdf |
awk '/Linearized/ { inmeta = 1; } match($0, /\/N [0-9]+/) { if(inmeta) print substr( $0, RSTART, RLENGTH ); exit;}' |
cut -d" " -f2

It seems much more difficult to find the count if the document hasn’t been linearized but we didn’t need to solve that problem for the moment!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket