Thanks to R&OS PDF and HTMLSax (and Adobe), phpPatterns articles are now available in PDF format, rendered with 100% PHP.
Widescreen PDF
At the bottom of every article you’ll now find a link to a PDF version of the content. To be cautious about bandwidth, you need to be logged in to get them (if you’ve forgotten your password, you can have it emailed to you).
They take a little time to render (I’m not caching them at the moment) and I’ve still got some work to do to include the inline images but hopefully you’ll agree the overall appearance is good.
The workhorses behind this are two class libraries:
R&OS PDF, an awesome piece of work which is, IMO, a credit to open source and deserves recognition from the php|arch Grant Program.
HTMLSax - a cunning set of classes which make it possible to parse HTML in a SAX-like manner, while allowing for badly formed documents.
But Why?
PDF isn’t everyone’s favorite, especially if you browse with Lynx, but the point here is really to demonstrate the concept of software layering and N-Tier.
eZ publish 2.2.x, with which the site is built, cleanly separates application logic from presentation logic. That makes it easy to add different media to the presentation tier while still using the same application logic.
In practical terms, the gain is that it took about 4 hours to add the PDF version and, apart from minor additions to the URL handling logic (controllers), no code had to be rewritten. With the classes written, the PDF “view” that displays the document is only about 30 lines of code.
I can easily add further alternative content types, perhaps XML (SOAP, XML-RPC, XUL whatever) or even Flash with help from Ming without needing to re-write all the data fetching / manipulation code.
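To make the layering idea concrete, here is a minimal sketch, not the actual eZ publish code. The names fetchArticle, ArticleView, HtmlArticleView, XmlArticleView and selectView are all hypothetical, the article data is hard-coded, and it is written for current PHP rather than the PHP 4 of the time. An XML view stands in for the PDF one so the example runs without R&OS PDF.

```php
<?php
// Application logic tier: hypothetical stand-in for the real data access
// code. It knows nothing about how the article will be displayed.
function fetchArticle(int $id): array
{
    return [
        'id'    => $id,
        'title' => 'Example article',
        'body'  => 'Body text of the article.',
    ];
}

// Presentation tier: each view renders the same data for a different medium.
interface ArticleView
{
    public function contentType(): string;
    public function render(array $article): string;
}

class HtmlArticleView implements ArticleView
{
    public function contentType(): string
    {
        return 'text/html';
    }

    public function render(array $article): string
    {
        return '<h1>' . htmlspecialchars($article['title']) . '</h1>'
             . '<p>' . htmlspecialchars($article['body']) . '</p>';
    }
}

class XmlArticleView implements ArticleView
{
    public function contentType(): string
    {
        return 'text/xml';
    }

    public function render(array $article): string
    {
        return '<article id="' . (int) $article['id'] . '">'
             . '<title>' . htmlspecialchars($article['title']) . '</title>'
             . '<body>' . htmlspecialchars($article['body']) . '</body>'
             . '</article>';
    }
}

// Controller: the only place that changes when a new medium is added.
function selectView(string $format): ArticleView
{
    switch ($format) {
        case 'xml':
            return new XmlArticleView();
        default:
            return new HtmlArticleView();
    }
}

$article = fetchArticle(42);      // same data fetching for every medium
$view    = selectView('xml');     // medium chosen from the request
$output  = $view->render($article);
```

Under these assumptions a PDF view would be one more class implementing ArticleView plus one more case in selectView; fetchArticle stays untouched.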
How it works
Avoiding an in-depth tutorial (R&OS PDF has excellent documentation and HTMLSax has an example which is all you need, assuming you’ve done something like Parse an RSS Feed with SAX), the basic problem to overcome was that the content in the database contains HTML-like markup: eZ article tags, in fact, which are a hybrid of HTML but (making things a little easier) well-formed XML.
Using HTMLSax, it’s easy to strip out the HTML markup while still being able to use the “knowledge” of how the content should be formatted.
Having built a class to produce the desired PDF layout (call it PdfArticle), I pass an instance of it to my own extension of HTMLSax, which builds the PDF document using the methods available in PdfArticle.
What’s nice about HTMLSax is that any tags it encounters which I haven’t specifically told it to “listen” for are simply discarded. Given some time, it would also make a handy tool for converting HTML to XHTML (or XSL-FO for that matter).
R&OS PDF, btw, is even capable of encrypting PDF documents, requiring a password to open them. If you’ve checked out php|architect’s delivery system, R&OS would allow you to build the same.
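The handler-plus-builder arrangement can be sketched roughly as follows. This is not the site's code: it swaps in PHP's built-in SAX parser (the xml extension) for HTMLSax, which is only workable because the article markup is well-formed XML. The PdfArticle stub records layout calls instead of drawing with R&OS PDF, and its method names (addHeading, addParagraph), the ArticleSaxHandler class and the tag names are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for PdfArticle: it records layout calls instead of
// drawing with R&OS PDF, so the sketch runs without the library.
class PdfArticle
{
    public $calls = [];

    public function addHeading(string $text): void
    {
        $this->calls[] = ['heading', $text];
    }

    public function addParagraph(string $text): void
    {
        $this->calls[] = ['paragraph', $text];
    }
}

// SAX handler: listens for a few tags and ignores every other tag.
class ArticleSaxHandler
{
    private $pdf;
    private $current = null; // listened-for tag currently being collected
    private $buffer  = '';
    // Tag name => PdfArticle method. The tag names are assumptions; they are
    // upper case because the parser folds names to upper case by default.
    private $listen = ['HEADER' => 'addHeading', 'P' => 'addParagraph'];

    public function __construct(PdfArticle $pdf)
    {
        $this->pdf = $pdf;
    }

    public function startElement($parser, string $name, array $attrs): void
    {
        if (isset($this->listen[$name])) {
            $this->current = $name;
            $this->buffer  = '';
        }
        // Any other tag is dropped; its text still reaches characterData().
    }

    public function characterData($parser, string $data): void
    {
        // Text can arrive in several chunks, so accumulate it.
        if ($this->current !== null) {
            $this->buffer .= $data;
        }
    }

    public function endElement($parser, string $name): void
    {
        if ($name === $this->current) {
            $method = $this->listen[$name];
            $this->pdf->$method(trim($this->buffer));
            $this->current = null;
        }
    }

    public function parse(string $source): void
    {
        $parser = xml_parser_create('UTF-8');
        xml_set_element_handler($parser, [$this, 'startElement'], [$this, 'endElement']);
        xml_set_character_data_handler($parser, [$this, 'characterData']);
        if (!xml_parse($parser, $source, true)) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
}

// Made-up sample markup; <bold> and <image> are not listened for.
$source = '<article><header>How it works</header>'
        . '<p>Some <bold>formatted</bold> text.</p><image id="1" /></article>';

$pdf     = new PdfArticle();
$handler = new ArticleSaxHandler($pdf);
$handler->parse($source);
```

The sketch keeps the text inside tags it does not listen for and drops only the tags themselves. It does not cope with one listened-for tag nested inside another, and unlike HTMLSax it will refuse badly formed input.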
About Internet Explorer
Having thought it was ready to roll, I tested IE (6) and guess what? It displays raw PDF rather than passing the file to Acrobat.
The first problem seems to be that IE doesn’t like inline PDF files which don’t have the extension .pdf in the URL (the file is generated by PHP on the fly). The workaround I found for R&OS PDF is in the stream() method in class.pdf.php, changing
header ('Content-Disposition: inline; filename='.$fileName);
to
header ('Content-Disposition: attachment; filename='.$fileName);
This means IE will launch Acrobat as a separate application rather than trying to display an inline version.
The second problem is that IE does something weird: it makes two requests when fetching the file, the first to download and cache it, then a second to trigger the display. If any headers are being sent which disable caching, such as
header( "Cache-Control: no-cache, must-revalidate" );
header( "Pragma: no-cache" );
this causes a problem. Sending the header
header( "Cache-Control: private" );
...fixes the problem.
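Putting the two workarounds together, a response for an on-the-fly PDF might be assembled like this. It is a sketch under assumptions, not what R&OS PDF ships: the function name pdfDownloadHeaders, the file name and the placeholder PDF bytes are made up, and the quoted filename with a semicolon separator simply follows the standard Content-Disposition syntax.

```php
<?php
// Build the response headers for a PDF generated on the fly.
// pdfDownloadHeaders() is a hypothetical helper, not part of R&OS PDF.
function pdfDownloadHeaders(string $fileName, int $length): array
{
    // Strip characters that would break the quoted filename parameter.
    $safeName = str_replace(['"', "\r", "\n"], '', $fileName);

    return [
        'Content-Type: application/pdf',
        'Content-Length: ' . $length,
        // "attachment" hands the file to the PDF viewer instead of inlining it.
        'Content-Disposition: attachment; filename="' . $safeName . '"',
        // Lets the browser keep its own copy for IE's second request.
        'Cache-Control: private',
    ];
}

$pdfData = '%PDF-1.3 ...'; // placeholder for the bytes the PDF library returns
$headers = pdfDownloadHeaders('article-42.pdf', strlen($pdfData));

// Only send anything when running under a web server with no output started.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    foreach ($headers as $line) {
        header($line);
    }
    echo $pdfData;
}
```

Returning the headers as an array keeps the function easy to check from the command line. If PHP sessions are in use, any no-cache headers they add would need to be dealt with separately.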
Internet Explorer still has one remaining problem in that it gets the filename wrong (this may be because it’s too long) but at that point I cease to care and recommend looking at a better browser.