Inside the PDF File Format
PDF files are all over the internet — publishers use them almost exclusively, and if you try to download any academic papers, the links usually come with a "PDF warning", just in case you don't feel like downloading a few megabytes of document and potentially opening up a separate window just to read the content. A lot of applications don't even have a "print" option; they just export a PDF view which you can then print from Acrobat. So what are these PDFs? Why PDF rather than HTML?
The truth is that PDF, or
Portable Document Format, gets sort of a
bad rap from users who inevitably compare it to HTML, but this isn't entirely
fair, since PDF is optimized as a format for printing and concise document
specification. By design, an HTML document is supposed to render in whatever
format looks best for the user agent; PDF, on the other hand, is supposed to
look exactly the same whether it's viewed on screen, on paper,
on a mobile device, etc. How faithfully it does so is, of course, subject to
the limitations of the target device (printers have a much higher
resolution than any computer screen), but Adobe puts a lot of effort into
preserving fidelity across targets.
PDF has been around since the early 90's, having
evolved from an earlier format called
PostScript. Both were
conceived and controlled by Adobe, a company that was founded by two of the
engineers from Xerox who worked on the original desktop computer design.
PostScript is actually a fully-featured programming language. You can define procedures, conditional operators, variables, etc. PostScript is "Turing complete". However, PostScript is a programming language meant for printers to interpret, and PostScript "programs" ordinarily describe what a page or set of pages should look like. The PostScript commands are transmitted, in source code form, to the printer, which interprets/compiles the commands, updates the global state, and executes the commands which generally involve making physical marks on paper.
If you have access to a laser printer, it probably supports PostScript directly (I've had good luck with HP support for PostScript). Figure 1 is a complete PostScript program; you can send this directly as text, without any preprocessing, to a PostScript capable printer. For example, if you save figure 1 as "hello.ps" and your printer is at IP address 192.168.1.2, you could do this:
telnet 192.168.1.2 9100 < hello.ps
and the output should look like Figure 2.
/Times-Roman findfont
12 scalefont
setfont
newpath
50 700 moveto
(Hello, World) show
The point being that PostScript is a text format for printers to interpret directly. Now, programming in PostScript is sort of like programming in assembler: you have infinite flexibility, but infinite tedium as well. PostScript doesn't even figure out where the line breaks should go on the paper; you're responsible for determining when you've reached the end of the line/page and moving to a new one. (The technical term for this process is typesetting.) Even the most die-hard of command-line fanatics don't program directly in PostScript but instead use a preprocessor like troff to deal with the typesetting. Troff input looks like Figure 3 and can be typeset and fed to a networked (PostScript capable) printer via a command like:
troff -T ps < source.tr | telnet 192.168.1.2 9100
.ll 6i
.ps 12
.vs 16
The area is \(*p\fIr\fR\|\s8\u2\d\s0
With Troff, document authors could take advantage of some features that HTML authors or MS-Word users of today take for granted such as automatic computation of line breaks or justified alignment. Still, you can't say that Figure 3 is particularly readable — before proofreding [*] it, you'd need to convert it to PostScript and print it out(!). To save a few trees, on-screen PostScript readers like GhostScript were created.
Still, as a shared document format, PostScript had some problems. In order to view a PostScript file onscreen, the entire embedded program had to be interpreted and run. There was no possibility of random access, since the program itself maintained a global state; in order to show the user page 700, for example, the viewer program had to parse and interpret the first 699 pages so that the application would be in the correct state. In the early 1990's, Adobe started work on what they called the Portable Document Format, which aimed to unify printer-friendly and screen-friendly formatting. Like PostScript, PDF is a text format which describes what a printer ought to do in order to display it; however, general programming constructs like loops and variables were removed to make random access feasible.
Figure 4 is pretty much the smallest parseable PDF file you could put together.
%PDF-1.6
1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj

2 0 obj
<<
  /Type /Pages
  /Count 1
  /Kids [3 0 R]
>>
endobj

3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [0 0 614 794]
  /Contents 4 0 R
  /Resources 5 0 R
>>
endobj

4 0 obj
<< /Length 58 >>
stream
BT
/F0 1 Tf
12 0 0 12 10 750 Tm
(Hello, World) Tj
ET
endstream
endobj

5 0 obj
<<
  /ProcSet [/PDF]
  /Font <<
    /F0 6 0 R
  >>
>>
endobj

6 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
>>
endobj

xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
trailer
<</Root 1 0 R /ID [<01234567890ABCDEF> <01234567890ABCDEF>] /Size 7 >>
startxref
488
%%EOF
If you download this file and open it in Acrobat, you'll see a simple output similar to the one in figure 5; however, if you open the same file in a text editor like Notepad or vi, you'll see that it's identical to figure 4.
Download it, don't copy-paste it, because line-ending conventions matter here — I'll get to why below.
Figure 4 might seem a little opaque at first, but if you start to look at it, you can begin to see some regularity. First, there are regular blocks delimited by obj and endobj. There are 6 of these, and each is given a sequential number. The obj entries are followed by an xref entry, a trailer entry and a startxref entry. PDFs are actually designed to be read "backwards", starting at the end. The very last entry before the %%EOF delimiter is the startxref entry:
startxref
488
This is a pointer to the cross-reference table. In this example, the xref table starts at byte 488 of the file.
This is why line-ending conventions matter; on a Windows machine, a text editor would save CRLF pairs for line endings, which would change the locations of the object entries within the file.
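As a rough illustration of reading a file "backwards", here is a minimal Python sketch. The helper name find_startxref, the 1024-byte tail window and the tiny stand-in file (which has no real objects or /Root entry) are all assumptions made for the example, not something a real PDF reader is required to do.

```python
import re

def find_startxref(pdf_bytes: bytes) -> int:
    """Return the offset recorded after the last 'startxref' keyword."""
    tail = pdf_bytes[-1024:]  # assumption: the entry sits in the last 1 KB
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref entry found")
    return int(matches[-1])

# Build a tiny stand-in file whose startxref value is genuinely correct.
body = b"%PDF-1.6\n1 0 obj\n<< >>\nendobj\n"
xref_offset = len(body)
pdf = (body + b"xref\n0 1\n0000000000 65535 f \n"
       + b"trailer\n<< /Size 1 >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset)

offset = find_startxref(pdf)
print(pdf[offset:offset + 4])              # b'xref'

# Re-saving with CRLF line endings shifts every byte after the first line,
# so the recorded offset no longer lands on the xref keyword.
crlf = pdf.replace(b"\n", b"\r\n")
print(crlf[offset:offset + 4] == b"xref")  # False
```

The second print is the copy-paste problem in miniature: the text of the file is unchanged, but the number stored after startxref is now wrong.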
If you follow this backwards, you'll see that the offset points to the line that reads xref. This section, which starts from the xref token and runs to the trailer section, is a list of pointers to the other objects in the file:
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
The first line after the xref keyword declares the range of pointers listed in the cross-reference section: the number of the first object, followed by the count of entries. In this case, the seven entries for the objects numbered 0 through 6 are declared here. The remaining lines identify, one per line, the location of an object in the file as a byte offset from the start of the file. The first entry is for object #0, which, you'll notice, doesn't appear anywhere in the file. PDF requires that object #0 is declared as "free"; this is the meaning of the f at the end of the line (the n on the other lines marks an object as in use). What about the 65535 in between the 0000000000 and the f? That is the generation number. PDF allows documents to be revised and rolled back, with their revision history stored within the document itself rather than in an external revision control system. For this reason, every object in the file carries a generation number, which starts at 0 when the object is first created and goes up by one each time the object's number is freed and then reused by a later revision. You probably won't come across PDF files with non-zero generation numbers "in the wild" unless you deal with professional publishing software. Here, object #0 is at generation #65535 (the max), but all the others are at generation 0: brand new.
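To make the entry layout concrete, here is a small Python sketch that reads a cross-reference section into (object number, offset, generation, in-use) records. The names XrefEntry and parse_xref_section are made up for this example, and it assumes a section with a single subsection whose header is the first object number followed by the entry count (written here as 0 7 so that it matches the seven entries).

```python
from typing import List, NamedTuple

class XrefEntry(NamedTuple):
    number: int      # object number, implied by position in the table
    offset: int      # byte offset of the object from the start of the file
    generation: int
    in_use: bool     # True for 'n', False for 'f'

def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse a cross-reference section that has a single subsection."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    first, count = (int(tok) for tok in lines[1].split())
    entries = []
    for i, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        entries.append(
            XrefEntry(first + i, int(offset), int(generation), flag == "n"))
    return entries

table = """xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000125 00000 n
0000000239 00000 n
0000000343 00000 n
0000000412 00000 n
"""

for entry in parse_xref_section(table):
    print(entry)
```

A real reader would also have to cope with several subsections and with the fixed 20-byte entry width; this sketch ignores both.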
So, this cross-reference table locates 6 objects, in addition to the free entry. The number of each object is given by its position in the list, so the first entry is object 0, the second locates object 1, the third object 2, and so forth. It's worth noting also that PDF doesn't require the objects to appear in numerical order in the file, and in general, PDF document creator software outputs them in a fairly random order; hence the need for the cross-reference table at the end. For this simple example, I put them in order because I'm not dealing with too many objects. However, this is a toy example; even the one-page newsletter that my kids' elementary school sends out each week declares a few hundred objects.
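Going the other way, a document creator has to record where each object lands as it writes the file. The following Python sketch shows one way to do that bookkeeping; build_pdf is a hypothetical helper, the object bodies are transcribed by hand from the example (with the page's /Parent pointing at the pages object), and the trailer is trimmed to /Root and /Size.

```python
from typing import List

def build_pdf(objects: List[bytes]) -> bytes:
    """Write objects 1..n in order, then an xref table that points at them."""
    out = bytearray(b"%PDF-1.6\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))             # where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # object 0 is always free
    for off in offsets:
        out += b"%010d 00000 n \n" % off     # each entry is 20 bytes long
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)

content = b"BT\n/F0 1 Tf\n12 0 0 12 10 750 Tm\n(Hello, World) Tj\nET\n"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Count 1 /Kids [3 0 R] >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 614 794]"
    b" /Contents 4 0 R /Resources 5 0 R >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"endstream",
    b"<< /ProcSet [/PDF] /Font << /F0 6 0 R >> >>",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]

pdf = build_pdf(objects)
start = int(pdf.split(b"startxref\n")[1].split(b"\n")[0])
print(pdf[start:start + 4])  # b'xref'
```

Because the offsets are computed from the bytes actually written, the table stays correct however the objects are laid out, as long as the result is saved in binary mode.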
So, I've been talking a lot about "objects". What's a PDF object?
You can see from figure 4 that an object is delimited by obj and endobj tags. In this example, each of the objects is additionally enclosed in << and >> pairs, but this isn't strictly a requirement; PDF allows all sorts of types to occur as objects, but most of the time, you'll see that they're dictionaries.
PDF defines six basic types of objects: boolean (true/false), numeric, string, name, array and dictionary (plus streams, which come up below, and a null object). Numerics and booleans can be recognized by their contents, but the other four have special delimiters that identify them to the parser. Strings are delimited by parentheses (), names begin with a slash /, arrays are delimited by brackets [] and dictionaries by what are technically referred to as "guillemets", which is what the French use for quotation marks but what you and I (unless you're French) would probably just call "double angle brackets" << >>. (Note that this is specifically two "less-than" signs and two "greater-than" signs, not the more correct « » characters.)
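The delimiter rules can be sketched as a tiny classifier in Python. This is only an illustration under simplifying assumptions: classify is a made-up helper, it looks only at how an already-isolated token starts, and it treats the hexadecimal form in single angle brackets (as used in the /ID entry) as a string.

```python
import re

# Integers and reals without exponents, e.g. 614, -0.5, .25
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def classify(token: str) -> str:
    """Guess the PDF object type of a token from its leading delimiter."""
    if token in ("true", "false"):
        return "boolean"
    if token.startswith("<<"):   # checked before the single '<' case
        return "dictionary"
    if token.startswith("(") or token.startswith("<"):
        return "string"
    if token.startswith("/"):
        return "name"
    if token.startswith("["):
        return "array"
    if NUMBER.fullmatch(token):
        return "numeric"
    return "unknown"

for sample in ["true", "614", "-0.5", "(Hello, World)", "/Type",
               "[3 0 R]", "<< /Length 58 >>", "<01234567890ABCDEF>"]:
    print(f"{sample!r:24} {classify(sample)}")
```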
In this example, all of the top-level objects are dictionaries, which are name-value pairs where the names are name types and the values are any other type of object, including another dictionary. Also notice that, at the very end, after the xref section, there's a trailer which contains a dictionary object. The most important element of this dictionary is the Root entry, which is a pointer to (surprise) the "root" of the document.
As you can observe from figure 4, top-level objects are numbered; each one
must be given a unique number, and each should appear as an entry in the
cross-reference table. Once this is done, pointers (or references) can
be used in place of actual objects anywhere in the file. You see this
throughout the file — references are specified as
object_number generation_number R. So,
1 0 R is a pointer to object 1. Wherever a reference is encountered, the PDF
parser replaces the reference with the object being referenced. In fact, you can
think of figure 4 as being expanded as shown in figure 6:
%PDF-1.6
trailer
<</Root <<
    /Type /Catalog
    /Pages <<
      /Type /Pages
      /Count 1
      /Kids [
        <<
          /Type /Page
          /Parent XXXX
          /MediaBox [0 0 614 794]
          /Contents << /Length 58 >>
            stream
            BT
            /F0 1 Tf
            12 0 0 12 10 750 Tm
            (Hello, World) Tj
            ET
            endstream
          /Resources <<
            /ProcSet [/PDF]
            /Font <<
              /F0 <<
                /Type /Font
                /Subtype /Type1
                /BaseFont /Helvetica
              >>
            >>
          >>
        >>
      ]
    >>
  >>
  /ID [<01234567890ABCDEF> <01234567890ABCDEF>]
  /Size 7
>>
488
%%EOF
This isn't a valid PDF (content streams can't be embedded inside other objects this way, and the Parent element of the page dictionary must be a reference to a referenceable object, which it can't be in this case), but it demonstrates logically what the PDF parser does at display time.
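The same logical expansion can be sketched in Python by modelling dictionaries as dicts and references as a small Ref class. Everything here (Ref, resolve, the hand-transcribed objects table) is a stand-in invented for the illustration, and it assumes the page's /Parent points back at the pages object; a real parser resolves references lazily rather than building one big tree.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """Stand-in for an indirect reference such as '2 0 R'."""
    number: int
    generation: int = 0

# Hand-transcribed subset of the objects, keyed by object number.
objects = {
    1: {"Type": "/Catalog", "Pages": Ref(2)},
    2: {"Type": "/Pages", "Count": 1, "Kids": [Ref(3)]},
    3: {"Type": "/Page", "Parent": Ref(2), "MediaBox": [0, 0, 614, 794]},
}

def resolve(value, table, seen=()):
    """Replace every Ref with the object it names, stopping at cycles."""
    if isinstance(value, Ref):
        if value.number in seen:
            # /Parent points back up the tree, so it cannot be expanded.
            return f"<back-reference to object {value.number}>"
        return resolve(table[value.number], table, seen + (value.number,))
    if isinstance(value, dict):
        return {key: resolve(item, table, seen) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, table, seen) for item in value]
    return value

tree = resolve(Ref(1), objects)
print(tree["Pages"]["Kids"][0]["MediaBox"])  # [0, 0, 614, 794]
print(tree["Pages"]["Kids"][0]["Parent"])    # <back-reference to object 2>
```

The back-reference placeholder plays the same role as the XXXX in the expanded listing: the parent link is the one place where substitution has to stop.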
So, turning back to the trailer:
trailer
<</Root 1 0 R /ID [<01234567890ABCDEF> <01234567890ABCDEF>] /Size 7 >>
The most important entry in the trailer dictionary is the /Root declaration; this is a reference to the catalog object:
1 0 obj
<<
  /Type /Catalog
  /Pages 2 0 R
>>
endobj
which in turn points to the pages object:
2 0 obj
<<
  /Type /Pages
  /Count 1
  /Kids [3 0 R]
>>
endobj
which describes, at a high level, the structure of the document. In particular, this document has one page (the /Count entry) whose description can be found in /Kids. Notice also that the /Kids entry is a []-delimited array, not just a bare reference like the others. The single page is described in object 3:
3 0 obj
<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [0 0 614 794]
  /Contents 4 0 R
  /Resources 5 0 R
>>
endobj
Here, finally, we're starting to get to the meat of the document. First of all, the /MediaBox entry describes the actual size of the page in 1/72s of an inch: 614x794 ≈ 8.5"x11", the standard U.S. page size.
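The unit conversion is simple enough to check by hand, but as a quick sketch in Python (media_box_inches is a hypothetical helper name):

```python
POINTS_PER_INCH = 72  # MediaBox units are 1/72 of an inch

def media_box_inches(box):
    """Convert a [x0 y0 x1 y1] MediaBox into (width, height) in inches."""
    x0, y0, x1, y1 = box
    return ((x1 - x0) / POINTS_PER_INCH, (y1 - y0) / POINTS_PER_INCH)

width, height = media_box_inches([0, 0, 614, 794])
print(f"{width:.2f} x {height:.2f} inches")  # 8.53 x 11.03 inches
```

An exact 8.5"x11" page would be 612x792 units, so the box in this example is a hair larger than U.S. letter.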
Additionally, there's a reference to a resources object:
5 0 obj
<<
  /ProcSet [/PDF]
  /Font <<
    /F0 6 0 R
  >>
>>
endobj
The most important part of this is the /Font declaration, which is a list of fonts. In this case, the document has only one font, so there's a single reference to a font object:
6 0 obj
<<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
>>
endobj
This is the smallest font description you can legally put together in a PDF document; since PDF is specifically a printing language, you can imagine that it has a lot of support for font descriptions. In fact, you can completely specify the geometry of a font within a PDF document so that the printer can reproduce the font exactly as it was originally described. I'll leave that to PDF software and fontophiles, though, and turn finally to the actual content of the single page of this document:
4 0 obj
<< /Length 58 >>
stream
BT
/F0 1 Tf
12 0 0 12 10 750 Tm
(Hello, World) Tj
ET
endstream
endobj
Here, the object starts out as a dictionary, but is followed by a stream declaration. This stream contains a series of commands that the printer should execute to display the page. This format is somewhat reminiscent of PostScript, but deliberately scaled back so that the page is a standalone element. This is executed as:
BT                     <-- Begin text
/F0 1 Tf               <-- Select Font 0; named in the resources entry
12 0 0 12 10 750 Tm    <-- Set the text translation matrix
(Hello, World) Tj      <-- output the string "Hello, World"
ET                     <-- End text
Besides the text translation matrix on the third line, you should find this pretty self explanatory.
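The listing reads as operands followed by the command that consumes them. A minimal Python sketch of that postfix reading follows; parse_content is a hypothetical helper, it only knows the five operators used in this example, and its tokenizer assumes strings contain no nested or escaped parentheses.

```python
import re

# A parenthesized string (no nesting or escapes) or a run of non-whitespace.
TOKEN = re.compile(r"\([^)]*\)|\S+")
OPERATORS = {"BT", "ET", "Tf", "Tm", "Tj"}  # only the ones used here

def parse_content(stream: str):
    """Group a content stream into (operator, operands) pairs."""
    commands, operands = [], []
    for token in TOKEN.findall(stream):
        if token in OPERATORS:
            commands.append((token, operands))
            operands = []
        else:
            operands.append(token)  # operands pile up until an operator
    return commands

stream = """BT
/F0 1 Tf
12 0 0 12 10 750 Tm
(Hello, World) Tj
ET
"""

for operator, operands in parse_content(stream):
    print(operator, operands)
```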
One thing to note is that each line ends with a "command" which is preceded
by its arguments. So what is the
Tm command all about? Well, in
its most primal form, a document is just a collection of polygons
— arbitrary shapes made up of straight and curved lines, optionally
connected to one another and filled in. Letters on the page — glyphs, to be technical — are
just shapes as far as the printer is concerned. Fairly intricate ones, no
doubt, but still just shapes. So, from the printer's perspective, the whole
document can be specified as a series of points on a Cartesian coordinate
system which need to be connected together and filled in. This whole
coordinate system is subject to a transformation at any time; this transformation
is compactly specified as a matrix which will be applied to each point. I
won't go into the vagaries of matrix operations and affine transformations
here (I talked a bit about it last month),
but the net result of the matrix specified in this example is to scale
(enlarge) the shapes to 12 points and move them to position (10,750), measured
from the lower-left corner of the page.
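For the curious, the arithmetic behind that matrix is short. The six operands a b c d e f map a point (x, y) to (a*x + c*y + e, b*x + d*y + f); the Python sketch below applies the matrix from the example to a couple of points (apply_matrix is a made-up helper name).

```python
def apply_matrix(matrix, x, y):
    """Apply the six-number matrix [a b c d e f] to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

tm = (12, 0, 0, 12, 10, 750)  # operands of the Tm command in the example

print(apply_matrix(tm, 0, 0))  # (10, 750): the text origin on the page
print(apply_matrix(tm, 1, 1))  # (22, 762): one unit of text space is 12 points
```

The two 12s do the scaling and the trailing 10 and 750 do the moving; the two zeros would be non-zero only if the text were rotated or skewed.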
A full introduction to the PDF formatting language would take a book; PDF also has commands for drawing arbitrary lines in any color, specifying user interaction, downloading additional content from the internet, etc. However, all PDF documents follow this same format — page objects specify commands to be executed by the printer which should translate to a printed page.
Figure 4 is longer than it strictly needs to be; I've added a lot of formatting for readability. Since PDF isn't actually designed to be human-readable, documents are usually compressed by the removal of superfluous whitespace as illustrated in Figure 7.
%PDF-1.6
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 614 794]/Contents 4 0 R/Resources 5 0 R>>endobj
4 0 obj<</Length 49>>stream
BT /F0 1 Tf 12 0 0 12 10 750 Tm (Hello, World) Tj ET
endstream
endobj
5 0 obj<</ProcSet[/PDF]/Font<</F0 6 0 R>>>>endobj
6 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj
xref
0 0
0000000000 65535 f
trailer
<</Root 1 0 R/ID[<01234567890ABCDEF><01234567890ABCDEF>]/Size 0>>
startxref
450
%%EOF
The only required whitespace is after the endobj tokens and before numbers. Notice in particular the lack of whitespace before the name tokens and their value tokens as in:
/Type/Font/Subtype/Type1/BaseFont/Helvetica
This is three individual dictionary entries, all run together on a single line with no intervening whitespace. Since "/" is not a valid name token character, the parser will know upon encountering it that the name part of the token is complete and the value part begins.
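A short Python sketch shows how a parser can pull the run-together names apart. The regular expression is an assumption that covers only this example: a name is taken to be a slash followed by characters that are neither whitespace nor one of the delimiters / [ ] < > ( ).

```python
import re

# A name token: a slash, then anything up to whitespace or a delimiter.
NAME = re.compile(r"/[^\s/\[\]<>()]*")

def split_names(run: str):
    """Return the name tokens found in a run of dictionary text."""
    return NAME.findall(run)

tokens = split_names("/Type/Font/Subtype/Type1/BaseFont/Helvetica")
pairs = list(zip(tokens[0::2], tokens[1::2]))  # (key, value) pairs
print(pairs)
# [('/Type', '/Font'), ('/Subtype', '/Type1'), ('/BaseFont', '/Helvetica')]
```

Pairing the tokens two at a time only works here because every value happens to be a name as well; in general the values can be any object type.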
Since content streams are usually pretty long, PDF additionally allows (and virtually all applications take advantage of) the content stream for each page to be compressed within the document itself. PDF supports both Flate compression and LZW compression of content streams. Of course, images can also be embedded and can be compressed as well, including as JPEG streams. The tiny content stream in figure 4 is hardly worth compressing, but most PDFs include hundreds or thousands of typesetting commands, which are repetitive and lend themselves well to Lempel-Ziv style compression. In reality, the object declarations are pretty repetitive as well; however, if those were compressed, the whole document would need to undergo a decompression stage before the viewer could start rendering it, so PDFs are virtually always uncompressed except for their content streams and embedded graphics.
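Flate is the same zlib/deflate format that Python's standard zlib module produces, so the effect is easy to try. In this sketch the "long" stream is just the example's five commands repeated 200 times, which is an artificial stand-in for a real page, and the resulting stream object is only assembled in memory.

```python
import zlib

page = b"BT\n/F0 1 Tf\n12 0 0 12 10 750 Tm\n(Hello, World) Tj\nET\n"
content = page * 200  # stand-in for a long, repetitive content stream

compressed = zlib.compress(content)

# The stream dictionary names the filter and gives the compressed length.
stream_object = (
    b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream"
)

print(len(content), "bytes before,", len(compressed), "bytes after")
assert zlib.decompress(compressed) == content  # lossless round trip
```

The compressed bytes are arbitrary binary data, which is one reason such files can no longer be treated as plain text.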
If you open pretty much any other PDF file in a text editor, you'll notice that the top two lines probably look like this:
%PDF-1.6
%äãÏÒ
The meaning of the "%PDF-1.6" part is obvious enough; this tells the opening application that this is a PDF file conformant to version 1.6 of the specification, but what about the garble that follows it? This is a comment, so it's ignored by the PDF reader, but it serves as a warning that this document contains non-ASCII characters (which will invariably be the case if the document compresses its page content streams, which documents invariably do).
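A few lines of Python are enough to inspect those two header lines. The function name describe_header is made up for this sketch, and the test for "binary" is simply whether the second line is a comment containing any byte of 128 or above.

```python
def describe_header(data: bytes):
    """Return (version, has_binary_marker) for the start of a PDF file."""
    lines = data.splitlines()
    first = lines[0] if lines else b""
    second = lines[1] if len(lines) > 1 else b""
    version = first[5:].decode("ascii") if first.startswith(b"%PDF-") else None
    has_marker = second.startswith(b"%") and any(b >= 128 for b in second[1:])
    return version, has_marker

# The bytes E4 E3 CF D2 display as the characters shown above in Latin-1.
print(describe_header(b"%PDF-1.6\n%\xe4\xe3\xcf\xd2\n1 0 obj\n"))  # ('1.6', True)
print(describe_header(b"%PDF-1.6\n1 0 obj\n"))                     # ('1.6', False)
```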
Putting the cross-reference table at the end of the document simplifies the job for the document creator, but it creates a poorer user experience for the consumer of the document, since the whole document must be scanned before the rendering software can do anything with it. If a PDF is created once and printed in total on paper, this makes sense: since the only human interaction with the file is that of the author generating it, the process ought to be optimized for him. However, modern PDF usage has a PDF file being viewed far more often onscreen than in printed form, so Adobe came up with the "linearized" form to streamline the generation of a viewable PDF. One of the main differences between linearized and non-linearized files is that the cross-reference table comes at the front. This creates more work for the document authoring application, since the first byte of the file can't be output until the offsets of each object are known. However, this means that the user can jump to a page from the table of contents as soon as the first few kilobytes of the file have been processed; for a document containing many hundreds of pages, this can be a significant advantage.
*: Yes, that was a joke