http://qs321.pair.com?node_id=1067942

andreas1234567 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I find myself in charge of a large number of PDF documents produced by a document production system. There is currently no automated testing. For every new release of the (buggy) document production system itself, or of its (even more buggy) document templates, we face a time-consuming, error-prone, much-hated and pretty much useless "please look through this pile of documents and report any errors" nightmare of a manual testing process.

Preferably, I would like to add automated tests of both the PDFs' contents and visuals, and humbly ask for the Monks' advice.

PDF content testing

The strategy is to use Xpdf (pdftotext.exe) to convert each PDF into text, and then use Test::File::Contents to check the output. This works reasonably well, but alternative solutions or suggestions are welcome.
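A minimal sketch of that strategy, using only core Test::More for the checks (the file names, the pdftotext invocation, and the expected patterns below are invented examples, not part of any real document set):

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical helper: run pdftotext and return the extracted text.
# Assumes pdftotext.exe is on PATH; invoice.pdf is an example name.
sub pdf_to_text {
    my ($pdf) = @_;
    (my $txt = $pdf) =~ s/\.pdf$/.txt/i;
    system 'pdftotext.exe', $pdf, $txt;
    open my $fh, '<', $txt or die "open $txt: $!";
    local $/;    # slurp mode
    return <$fh>;
}

# Content checks against extracted text (stubbed here for the sketch;
# in practice: my $text = pdf_to_text('invoice.pdf');)
my $text = "Invoice No: 12345\nTotal: 99.00 EUR\n";
like $text, qr/^Invoice No: \d+/m,       'invoice number present';
like $text, qr/^Total: \d+\.\d{2} EUR/m, 'total amount formatted';
```

Test::File::Contents wraps the same idea (its file_contents_like takes a file name instead of a string), so either style fits into a normal prove run.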

PDF visuals testing

The strategy is ... non-existent. Any assistance or guidance is highly appreciated.

Happy holidays.

PS. OS is MSWin32 but alternatives are welcome. DS.
C:\test>perl -wle "print $^O"
MSWin32

--
No matter how great and destructive your problems may seem now, remember, you've probably only seen the tip of them. [1]

Replies are listed 'Best First'.
Re: PDF content and visuals testing best practices
by Corion (Patriarch) on Dec 20, 2013 at 12:01 UTC

    If you can find templates that "are not supposed to change", like the page for a book cover or something like that, maybe you can set up a special single-page document and render that to a bitmap using (yuck) Image::Magick (or maybe better direct Ghostscript). Then you can try to use Image::Compare or Image::SubImageFind to find the "not changing" parts again.

    Of course, maintaining such a library of image-based tests gets really ugly. Maybe you can use wraith by the BBC to manage and compare the "screenshots" whenever a change is detected.
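    The rendering and region-matching need external tools (Ghostscript, Image::Compare, Image::SubImageFind), but the fuzzy-compare step at the heart of this approach is simple. A core-Perl sketch, assuming both pages have already been rendered to equal-sized raw 8-bit grayscale buffers; the tolerance and the 1% threshold are arbitrary example values:

```perl
use strict;
use warnings;

# Count pixels whose grayscale values differ by more than $tolerance.
# $a_buf and $b_buf are equal-length strings of 8-bit pixel values,
# e.g. as produced by rendering a page with Ghostscript's gray devices.
sub diff_pixels {
    my ($a_buf, $b_buf, $tolerance) = @_;
    die "buffer size mismatch" if length($a_buf) != length($b_buf);
    my @a = unpack 'C*', $a_buf;
    my @b = unpack 'C*', $b_buf;
    my $diff = 0;
    for my $i (0 .. $#a) {
        $diff++ if abs($a[$i] - $b[$i]) > $tolerance;
    }
    return $diff;
}

# Pass if fewer than 1% of pixels changed beyond the tolerance,
# which absorbs small anti-aliasing differences between renders.
sub pages_match {
    my ($a_buf, $b_buf) = @_;
    return diff_pixels($a_buf, $b_buf, 16) < 0.01 * length $a_buf;
}
```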

Re: PDF content and visuals testing best practices
by ateague (Monk) on Dec 20, 2013 at 17:48 UTC
    I feel your pain. I have the (mis)fortune to have to deal with this on a daily basis as $WORK.
    The strategy is to use pdftotext.exe to convert PDF into text

    *yuck*

    If that works, more power to you. Whenever I tried that route, I ended up with inconsistently spaced blobs of text. My personal preference is to use pdftohtml.exe; I use the one included in Calibre Portable since it is actively updated.

    I use the following command line: pdftohtml.exe -xml -zoom 1.4 [PDF FILE]

    This will rip out all the text elements into an XML file with attributes for the font, x/y position on the page and text length. (-zoom 1.4 makes the positioning units 100 dpi).

    Here is an example I am currently working with:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
    <fontspec id="0" size="17" family="Times" color="#000000"/>
    <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
    <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
    <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
    <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
    <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
    <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
    <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
    <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
    </page>
    </pdf2xml>

    I can then use XML::Simple to slurp each <page> element into a hash and then use Test::More's eq_hash to compare my extracted data with my reference XML hash.
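    In practice XML::Simple does the parsing; as a self-contained illustration of the position-keyed comparison, here is a core-only sketch that uses a regex (good enough only because pdftohtml emits one <text> element per line) and keys each string by its "top,left" coordinates. The sample data is adapted from the output above:

```perl
use strict;
use warnings;

# Build a { "top,left" => text } hash from pdftohtml -xml output.
# XML::Simple would be the robust choice; a regex suffices for this
# sketch because pdftohtml writes one <text> element per line.
sub page_hash {
    my ($xml) = @_;
    my %h;
    while ($xml =~ m{<text top="(\d+)" left="(\d+)"[^>]*>([^<]*)</text>}g) {
        $h{"$1,$2"} = $3;
    }
    return \%h;
}

my $reference = page_hash(<<'XML');
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
XML
```

Two such hashes (reference vs. freshly generated page) can then go straight into Test::More's eq_hash or is_deeply.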

      Just reposting a PM from andreas1234567 for reference:
      I have trouble running pdftohtml.exe. It complains "freetype.dll" is missing (even though it *is* present in DLL dir)

      Depending which version of pdftohtml.exe (Dynamic vs Static) you run, you may need the following dlls:

      • freetype.dll
      • jpeg.dll
      • libpng12.dll
      • zlib1.dll

      These DLLs are found in the DLLs/ directory under Calibre Portable/Calibre/. You can do one of two things:

      1. Copy those DLLs into the same directory as pdftohtml.exe
      2. (Temporarily) add the path to the DLL directory to $ENV{PATH} in your script:
        {
            local $ENV{PATH} = $ENV{PATH} . ';<PATH TO DLLs>';
            system "pdftohtml.exe", "-xml", "<PDF FILE>";
        }

        This did the trick for me:
        set EXEPATH=C:\Users\%USERNAME%\Calibre Portable\Calibre
        set PATH=%PATH%;%EXEPATH%\DLLS

        Thanks!

Re: PDF content and visuals testing best practices
by sundialsvc4 (Abbot) on Dec 20, 2013 at 14:31 UTC

    Yes, you can use packages such as PDF::API3 to “dumpster-dive” quite a ways into the “guts” of a PDF file, but if you can identify defects from the text content of the file, your ugly approach might be the most cost-effective.   The content of a PDF can be very beastly unpredictable, making it difficult to write reliable logic to track down problems.

    And in appropriate status meetings, keep oh-so politely mentioning the ¢o$t of the fact that this system is still not working as the business should have reason to expect.   Every hour spent ... opportunity costs ...

Re: PDF content and visuals testing best practices
by kcott (Archbishop) on Dec 21, 2013 at 09:45 UTC

    G'day andreas1234567,

    Just a thought that may provide a partial solution. I haven't tried this myself and I don't know how applicable it might be to your situation.

    Several assumptions are implied, including the existence of reference documents, the absence of variable content (e.g. document-specific reference/ID numbers, date/time fields), and so on.

    If you took a Digest::* of reference documents, you could compare with digests of test-generated documents.

    While this won't identify specific issues, it might reduce the "pile of documents" to a "handful of documents" that require closer, subsequent scrutiny (whether by a manual or automated process).
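    The digest pass needs only core modules. A minimal sketch, assuming a directory of reference PDFs and a parallel directory of freshly generated ones (the directory layout and file names are invented for illustration):

```perl
use strict;
use warnings;
use Digest::MD5;

# MD5 of a file's raw bytes; any Digest::* module would do the same job.
sub file_digest {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

# Return the names of generated files whose digest differs from the
# reference copy, i.e. the "handful of documents" needing a closer look.
sub changed_documents {
    my ($ref_dir, $gen_dir, @names) = @_;
    return grep {
        file_digest("$ref_dir/$_") ne file_digest("$gen_dir/$_")
    } @names;
}
```

Note that PDFs often embed creation timestamps, so byte-identical output may require the production system to support reproducible metadata; otherwise the text- or XML-extraction approaches above sidestep that problem.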

    -- Ken