http://qs321.pair.com?node_id=11113858

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I used to use the following to get the XML structure of a PDF file:

my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);

#going through the structure
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                #... do my stuff
            }
        }
    }
    $page_index ++
}

I need to drop mudraw now! Is there an alternative to get $xml? At best a pure Perl solution? I couldn't find anything searching the web :(

Replies are listed 'Best First'.
Re: PDF alternative to mudrow to get XML structure
by marto (Cardinal) on Mar 05, 2020 at 18:31 UTC

      Thank you. Which pdftohtml do you mean? The only one I know is pdftohtml (https://www.xpdfreader.com/pdftohtml-man.html) from XpdfReader, but it has no -xml option.

        That's a different pdftohtml. Poppler was forked from Xpdf, so there are now popular PDF utilities with identical names but different capabilities.

        That pdftohtml doesn't report per-character coordinates, though, only the spatial extent of each span (usually a line). I may be mistaken, but presumably per-character coordinates are what you were after (the "my stuff" part above), or you wouldn't go to such lengths just to extract text. BTW, "mudraw" was (thankfully) renamed a few years ago and is now invoked as

        >mutool.exe draw -F stext document.pdf 2>nul
        <?xml version="1.0"?>
        <document name="(null)">
        <page width="841.836" height="595.238">
        <block bbox="50.172 35.47 805.83609 48.81">
        <line bbox="50.172 35.47 129.62201 48.81" wmode="0" dir="1 0">
        <font name="Times-Roman" size="10">
        <char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
        <char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
        <char quad="57.952 35.47 62.952 35.47 57.952 48.81 62.952 48.81" x="57.952" y="46" c="2"/>
        <char quad="62.952 35.47 67.951999 35.47 62.952 48.81 67.951999 48.81" x="62.952" y="46" c="7"/>
        <char quad="67.951999 35.47 70.731998 35.47 67.951999 48.81 70.731998 48.81" x="67.951999" y="46" c="/"/>
        <char quad="70.731998 35.47 75.731998 35.47 70.731998 48.81 75.731998 48.81" x="70.731998" y="46" c="2"/>
        <char quad="75.731998 35.47 80.731998 35.47 75.731998 48.81 80.731998 48.81" x="75.731998" y="46" c="0"/>
        <char quad="80.731998 35.47 85.731998 35.47 80.731998 48.81 85.731998 48.81" x="80.731998" y="46" c="2"/>
        <char quad="85.731998 35.47 90.731998 35.47 85.731998 48.81 90.731998 48.81" x="85.731998" y="46" c="0"/>
        <char quad="90.731998 35.47 93.232 35.47 90.731998 48.81 93.232 48.81" x="90.731998" y="46" c=" "/>
        <char quad="93.232 35.47 98.232 35.47 93.232 48.81 98.232 48.81" x="93.232" y="46" c="3"/>
        <char quad="98.232 35.47 101.012 35.47 98.232 48.81 101.012 48.81" x="98.232" y="46" c=":"/>
        <char quad="101.012 35.47 106.012 35.47 101.012 48.81 106.012 48.81" x="101.012" y="46" c="0"/>
        <char quad="106.012 35.47 111.012 35.47 106.012 48.81 111.012 48.81" x="106.012" y="46" c="7"/>
        <char quad="111.012 35.47 113.512 35.47 111.012 48.81 113.512 48.81" x="111.012" y="46" c=" "/>
        <char quad="113.512 35.47 120.732 35.47 113.512 48.81 120.732 48.81" x="113.512" y="46" c="A"/>
        <char quad="120.732 35.47 129.62201 35.47 120.732 48.81 129.62201 48.81" x="120.732" y="46" c="M"/>
        </font>
        </line>
        ...

        (Note that stderr output is suppressed; otherwise the XML will be interspersed with "doc this, page that" messages.)
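If the rest of the code only needs an axis-aligned box per character, the eight numbers of a "quad" attribute can be collapsed into one. The following is a minimal sketch, assuming the quad lists four corner points as x/y pairs, as the sample output suggests; quad_to_bbox is a hypothetical helper name, not part of mutool or any module. In the original loop, the "font" level would presumably take the place of "span", since that is what wraps the "char" elements in this output.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: collapse a stext "quad" attribute (assumed to be four
# corner points, i.e. eight numbers) into an axis-aligned [x0, y0, x1, y1] box.
sub quad_to_bbox {
    my ($quad) = @_;
    my @n = split ' ', $quad;
    die "expected 8 numbers in quad, got " . scalar(@n) . "\n" unless @n == 8;
    my @x = @n[ 0, 2, 4, 6 ];    # x of each corner
    my @y = @n[ 1, 3, 5, 7 ];    # y of each corner
    return [ min(@x), min(@y), max(@x), max(@y) ];
}

# Value copied from the first <char> element of the sample output.
my $bbox = quad_to_bbox('50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81');
print join( ' ', @$bbox ), "\n";
```

Taking min/max rather than picking fixed corners keeps the helper independent of the corner order, at the cost of returning a bounding box rather than the exact quad for rotated text.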

        The alternative is Ghostscript, of course:

        >gswin64c -q -sDEVICE=txtwrite -dTextFormat=1 -o - document.pdf
        <page>
        <block>
        <line>
        <span bbox="50 46 130 46" font="Times-Roman" size="10.0000">
        <char bbox="50 46 55 46" c="2"/>
        <char bbox="55 46 58 46" c="/"/>
        <char bbox="58 46 63 46" c="2"/>
        <char bbox="63 46 68 46" c="7"/>
        <char bbox="68 46 71 46" c="/"/>
        <char bbox="71 46 76 46" c="2"/>
        <char bbox="76 46 81 46" c="0"/>
        <char bbox="81 46 86 46" c="2"/>
        <char bbox="86 46 91 46" c="0"/>
        <char bbox="91 46 93 46" c=" "/>
        <char bbox="93 46 98 46" c="3"/>
        <char bbox="98 46 101 46" c=":"/>
        <char bbox="101 46 106 46" c="0"/>
        <char bbox="106 46 111 46" c="7"/>
        <char bbox="111 46 114 46" c=" "/>
        <char bbox="114 46 121 46" c="A"/>
        <char bbox="121 46 130 46" c="M"/>
        </span>
        </line>
        </block>
        ...

        (Note that bbox is not really a box here: both y values are the baseline, so take "size" into account to get the height.)
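One way to get a usable box out of this output is to grow the baseline segment by the font size. The sketch below is only an approximation under stated assumptions: gs_char_box is a hypothetical helper, Y is taken to grow downwards (the sample's y of 46 matches mutool's y="46"), and the 2-unit allowance below the baseline is a guess rather than real font metrics.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a txtwrite char bbox ("x0 y x1 y", where y is
# the baseline) plus the span's font size into an approximate
# [x0, top, x1, bottom] box. Y is assumed to grow downwards.
sub gs_char_box {
    my ( $bbox, $size, $pad ) = @_;
    $pad = 2 unless defined $pad;    # assumed allowance below the baseline
    my ( $x0, $baseline, $x1 ) = split ' ', $bbox;
    my $bottom = $baseline + $pad;             # step down from the baseline
    my $top    = $bottom - ( $size + $pad );   # text height plus the allowance
    return [ $x0, $top, $x1, $bottom ];
}

# Values copied from the first <char> and its <span> in the sample output.
my $box = gs_char_box( '50 46 55 46', '10.0000' );
print join( ' ', @$box ), "\n";    # prints "50 36 55 48"
```

For comparison, mutool reports 35.47 to 48.81 vertically for the same glyph, so the guess lands within about a unit on this sample.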

        ###################

        At best a pure Perl solution?

        Oh yes, it's possible, see CAM::PDF. Chris laid a beautiful foundation, a huge amount of work. Some aspects are not really finished, though nothing is impossible with due diligence. Let's take a file from a recent PDF question, then:

        use strict;
        use warnings;
        use CAM::PDF;

        my $d = CAM::PDF-> new( 'document.pdf' );
        my $t = $d-> getPageContentTree( 1 );
        $t-> render( 'CAM::PDF::Renderer::Dump' );

        __END__

        ( 50.17, 549.24) ( 50.17, 549.24) 2/27/2020
        ( 93.23, 549.24) ( 93.23, 549.24) 3:07
        ( 113.51, 549.24) ( 113.51, 549.24) AM
        ( 677.77, 549.24) ( 677.77, 549.24) Quotations
        ( 724.17, 549.24) ( 724.17, 549.24) Due
        ( 743.33, 549.24) ( 743.33, 549.24) By:
        ( 760.28, 549.24) ( 760.28, 549.24) 01/22/2020
        ( 288.40, 533.24) ( 288.40, 533.24) ABSTRA
        ( 344.41, 533.24) ( 344.41, 533.24) CT
        ( 367.36, 533.24) ( 367.36, 533.24) OF
        ( 390.30, 533.24) ( 390.30, 533.24) UNSTRAPPED
        ( 487.91, 533.24) ( 487.91, 533.24) (A
        ....

        Something close to what you wanted? This "content tree" can be an enormous structure and can easily eat 100+ MB for a complex page. It follows the drawing instructions as they flow during content interpretation, and each node has a "graphics state" attached that is updated as it all proceeds. See the source for an approximate idea; of course "The PDF Reference" is the ultimate authority, and you can't avoid it if you are serious about PDF.

        CAM::PDF can take different "plugins" (renderers) to traverse (render) this tree. CAM::PDF::Renderer::Dump is a primitive example. Now somewhat closer to the "per-character coordinates" goal:

        MyTestRenderer.pm:

        package MyTestRenderer;

        use strict;
        use warnings;
        use base 'CAM::PDF::GS';

        sub new {
            my ( $class, @args ) = @_;
            my $self = $class-> SUPER::new( @args );
            $self-> { mode } = 'c'; # split into characters
            return $self
        }

        sub renderText {
            my ( $self, $string, $width ) = @_;
            my $fontsize = $self-> { Tfs };
            my ( $xu, $yu ) = $self-> textToUser( 0, 0 );
            my ( $xd, $yd ) = $self-> userToDevice( $xu, $yu );
            printf "(x = %5.1f, y = %5.1f) (w = %.3f, h = %3.1f) %s\n",
                $xd, $yd, $width, $fontsize, $string;
            return;
        }

        1;

        use strict;
        use warnings;
        use CAM::PDF;
        use lib '.';

        my $d = CAM::PDF-> new( 'document.pdf' );
        my $t = $d-> getPageContentTree( 1 );
        $t-> render( 'MyTestRenderer' );

        __END__

        (x = 50.2, y = 549.2) (w = 0.500, h = 10.0) 2
        (x = 55.2, y = 549.2) (w = 0.278, h = 10.0) /
        (x = 58.0, y = 549.2) (w = 0.500, h = 10.0) 2
        (x = 63.0, y = 549.2) (w = 0.500, h = 10.0) 7
        (x = 68.0, y = 549.2) (w = 0.278, h = 10.0) /
        (x = 70.7, y = 549.2) (w = 0.500, h = 10.0) 2
        (x = 75.7, y = 549.2) (w = 0.500, h = 10.0) 0
        (x = 80.7, y = 549.2) (w = 0.500, h = 10.0) 2
        (x = 85.7, y = 549.2) (w = 0.500, h = 10.0) 0
        (x = 93.2, y = 549.2) (w = 0.500, h = 10.0) 3
        (x = 98.2, y = 549.2) (w = 0.278, h = 10.0) :
        (x = 101.0, y = 549.2) (w = 0.500, h = 10.0) 0
        (x = 106.0, y = 549.2) (w = 0.500, h = 10.0) 7
        (x = 113.5, y = 549.2) (w = 0.722, h = 10.0) A
        (x = 120.7, y = 549.2) (w = 0.889, h = 10.0) M
        (x = 677.8, y = 549.2) (w = 0.722, h = 10.0) Q
        (x = 685.0, y = 549.2) (w = 0.500, h = 10.0) u
        (x = 690.0, y = 549.2) (w = 0.500, h = 10.0) o
        (x = 695.0, y = 549.2) (w = 0.278, h = 10.0) t
        ....
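The listing has no records for spaces (x jumps from 85.7 to 93.2 after a glyph that is only 5 units wide), so getting strings back, as the original loop did with join, takes a little gap detection. Here is a rough sketch, assuming records are collected as hashes instead of being printed and that a glyph advances by w times h, which holds in this listing. join_chars and all three thresholds are hypothetical and would need tuning for real files.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF: join per-character records
# (hashes with x, y, w, h, c, as printed by the renderer) into strings.
sub join_chars {
    my ( $chars, %opt ) = @_;
    my $space_gap = $opt{space_gap} // 1;      # assumed: a larger gap means a space
    my $break_gap = $opt{break_gap} // 20;     # assumed: a larger gap starts a new string
    my $y_tol     = $opt{y_tol}     // 0.5;    # assumed: baseline tolerance

    my ( @strings, $prev );
    for my $c (@$chars) {
        if ( !$prev || abs( $c->{y} - $prev->{y} ) > $y_tol ) {
            push @strings, $c->{c};            # first glyph or new baseline
        }
        else {
            # Expected pen position after the previous glyph: x + w * h.
            my $gap = $c->{x} - ( $prev->{x} + $prev->{w} * $prev->{h} );
            if    ( $gap > $break_gap ) { push @strings, $c->{c} }
            elsif ( $gap > $space_gap ) { $strings[-1] .= ' ' . $c->{c} }
            else                        { $strings[-1] .= $c->{c} }
        }
        $prev = $c;
    }
    return @strings;
}

# Records copied from the listing above as [x, w, c]; y and h are constant.
my @chars = map { +{ x => $_->[0], y => 549.2, w => $_->[1], h => 10, c => $_->[2] } } (
    [ 50.2,  0.500, '2' ], [ 55.2,  0.278, '/' ], [ 58.0,  0.500, '2' ],
    [ 63.0,  0.500, '7' ], [ 68.0,  0.278, '/' ], [ 70.7,  0.500, '2' ],
    [ 75.7,  0.500, '0' ], [ 80.7,  0.500, '2' ], [ 85.7,  0.500, '0' ],
    [ 93.2,  0.500, '3' ], [ 98.2,  0.278, ':' ], [ 101.0, 0.500, '0' ],
    [ 106.0, 0.500, '7' ], [ 113.5, 0.722, 'A' ], [ 120.7, 0.889, 'M' ],
    [ 677.8, 0.722, 'Q' ], [ 685.0, 0.500, 'u' ],
);
print "$_\n" for join_chars( \@chars );
```

On these records the sketch yields "2/27/2020 3:07 AM" and "Qu" as two strings. Fixed thresholds are the simplest choice; scaling them by the font size would likely be more robust across documents.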

        Problem solved? Maybe. It depends on your input PDF files. If they are as primitive and consistent as the sample, and stay that way for years to come, then yes. Otherwise, much further work is required, like I said.

        The different Y-coordinates in the listings above are irrelevant; they just depend on which way the Y-axis runs. GS (and CAM::PDF) report the baseline position, while mutool gives a true per-glyph bbox. I don't think such precision is necessary: just step 1-2 units down from the baseline and add 1-2 units to the text height. That is good enough, and constant per span (line). (Not that we can't do a true glyph bbox in Perl; see Font::TTF, Font::FreeType.) "w" is the width in "unscaled text space", so multiply it by the text size. Both "w" and "h" need further adjustment if the general transformation matrix (cm) or text matrix (tm) specifies scaling other than 100%, or if horizontal scaling (Tz) is not 1.

        Much nastier issues arise when the text is not "single-byte ASCII, US-centric" encoded. See this patch to get string widths of double-byte encoded fonts. This patch may be of interest, too. As to actual text content extraction with non-ASCII and/or double-byte encodings, this patch does that, but it was applied in a different place for a different purpose (current at the time). CAM::PDF::PageText is only interested in text; it is independent of (orthogonal to) the concept of "tree rendering", though it uses such a tree. The patch can be examined and snapped into the appropriate place in our renderer, if you really want it done in "pure Perl".
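The width scaling and the Y-axis difference can be checked against the listings with a few lines of arithmetic. This is a minimal sketch under assumptions: to_top_down is a hypothetical helper, tz is treated as a plain factor defaulting to 1, the page height is taken from mutool's page element, and any scaling from cm or the text matrix is ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one renderer record into top-down page units.
#   w           width in unscaled text space, as printed by the renderer
#   size        font size (the renderer's "h")
#   tz          horizontal scaling as a factor (assumed default: 1)
#   page_height page height, e.g. from mutool's <page height="...">
# Scaling from cm or the text matrix is NOT handled here.
sub to_top_down {
    my (%a) = @_;
    my $tz = defined $a{tz} ? $a{tz} : 1;
    return {
        x     => $a{x},
        y     => $a{page_height} - $a{y},    # flip bottom-up to top-down
        width => $a{w} * $a{size} * $tz,     # unscaled width times text size
    };
}

# First record of the CAM::PDF listing, with the page height from mutool.
my $r = to_top_down(
    x => 50.17, y => 549.24, w => 0.500, size => 10,
    page_height => 595.238,
);
printf "x=%.2f y=%.2f width=%.2f\n", @$r{qw(x y width)};
```

For this record the flipped baseline comes out at about 46 and the width at 5, which agrees with the y="46" and the 50.172 to 55.172 extent that mutool reports for the same glyph.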

Re: PDF alternative to mudrow to get XML structure
by jcb (Parson) on Mar 06, 2020 at 00:22 UTC

    You are confused. PDF does not have an XML structure, aside from some metadata blocks in some PDF files. The PDF structure itself is not XML because (among other reasons) PDF is an older format than XML.

    I am unfamiliar with mudraw; perhaps it translates PDF structure into XML? Try searching CPAN for "PDF" and see what you find.

      "You are confused..."

      Looks like you are the one who is confused here. OP specifically shows what they are doing, tells us how they are generating XML from PDF.

      "I am unfamiliar with mudraw; perhaps it translates PDF structure into XML? Try searching CPAN for "PDF" and see what you find."

      It would have taken seconds to confirm what mudraw does.

        A PDF file does not have an XML structure. Our questioner is using a tool that produces XML output describing PDF structure and now needs to replace that tool. There is no standard translation from PDF to XML. There is no easy replacement for mudraw because the XML our questioner is using is a mudraw-specific format because there is no standard XML mapping for PDF. The best solution is to process the PDF directly.