http://qs321.pair.com?node_id=11113907


in reply to Re^2: PDF alternative to mudrow to get XML structure
in thread PDF alternative to mudrow to get XML structure

That's a different pdftohtml. Poppler was forked from Xpdf, so there are now popular PDF utilities with identical names but different capabilities.

That pdftohtml doesn't report per-character coordinates, though -- only the per-span (usually per-line) spatial extent. I may be mistaken, but presumably per-character coordinates are what you were after (that "my stuff" up there), or you wouldn't go to such lengths just to extract text. BTW, "mudraw" was (thankfully) renamed a few years ago and is now invoked as

>mutool.exe draw -F stext document.pdf 2>nul
<?xml version="1.0"?>
<document name="(null)">
<page width="841.836" height="595.238">
<block bbox="50.172 35.47 805.83609 48.81">
<line bbox="50.172 35.47 129.62201 48.81" wmode="0" dir="1 0">
<font name="Times-Roman" size="10">
<char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
<char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
<char quad="57.952 35.47 62.952 35.47 57.952 48.81 62.952 48.81" x="57.952" y="46" c="2"/>
<char quad="62.952 35.47 67.951999 35.47 62.952 48.81 67.951999 48.81" x="62.952" y="46" c="7"/>
<char quad="67.951999 35.47 70.731998 35.47 67.951999 48.81 70.731998 48.81" x="67.951999" y="46" c="/"/>
<char quad="70.731998 35.47 75.731998 35.47 70.731998 48.81 75.731998 48.81" x="70.731998" y="46" c="2"/>
<char quad="75.731998 35.47 80.731998 35.47 75.731998 48.81 80.731998 48.81" x="75.731998" y="46" c="0"/>
<char quad="80.731998 35.47 85.731998 35.47 80.731998 48.81 85.731998 48.81" x="80.731998" y="46" c="2"/>
<char quad="85.731998 35.47 90.731998 35.47 85.731998 48.81 90.731998 48.81" x="85.731998" y="46" c="0"/>
<char quad="90.731998 35.47 93.232 35.47 90.731998 48.81 93.232 48.81" x="90.731998" y="46" c=" "/>
<char quad="93.232 35.47 98.232 35.47 93.232 48.81 98.232 48.81" x="93.232" y="46" c="3"/>
<char quad="98.232 35.47 101.012 35.47 98.232 48.81 101.012 48.81" x="98.232" y="46" c=":"/>
<char quad="101.012 35.47 106.012 35.47 101.012 48.81 106.012 48.81" x="101.012" y="46" c="0"/>
<char quad="106.012 35.47 111.012 35.47 106.012 48.81 111.012 48.81" x="106.012" y="46" c="7"/>
<char quad="111.012 35.47 113.512 35.47 111.012 48.81 113.512 48.81" x="111.012" y="46" c=" "/>
<char quad="113.512 35.47 120.732 35.47 113.512 48.81 120.732 48.81" x="113.512" y="46" c="A"/>
<char quad="120.732 35.47 129.62201 35.47 120.732 48.81 129.62201 48.81" x="120.732" y="46" c="M"/>
</font>
</line>
...

(Note that stderr output is suppressed; otherwise the XML will be interspersed with "doc this, page that" messages.)
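If you want to pick those per-character boxes out of the stext XML from Perl, something like the following might do as a starting point. It is only a sketch: parse_stext_char is a name I made up, it uses a regex on one <char/> element at a time instead of a real XML parser, and it does not decode XML entities in the c attribute. It relies only on the attribute layout visible in the listing above (eight numbers in quad, i.e. four corner points).

```perl
use strict;
use warnings;

# Hypothetical helper: turn one stext <char .../> element into a hash.
# The quad attribute holds four corners as x/y pairs; the bounding box
# is the min/max over those corners.
# Assumption: one <char/> element per input string, attributes as shown
# in the mutool listing; XML entities in c="..." are left undecoded.
sub parse_stext_char {
    my ( $line ) = @_;
    my ( $quad ) = $line =~ /\bquad="([^"]*)"/ or return;
    my ( $c )    = $line =~ /\bc="([^"]*)"/;
    my @n  = split ' ', $quad;
    return unless @n == 8;
    my @xs = sort { $a <=> $b } @n[ 0, 2, 4, 6 ];
    my @ys = sort { $a <=> $b } @n[ 1, 3, 5, 7 ];
    return {
        char => $c,
        x0   => $xs[0],  y0 => $ys[0],
        x1   => $xs[-1], y1 => $ys[-1],
    };
}

# Usage with one element copied from the listing above.
my $sample = '<char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>';
my $glyph  = parse_stext_char( $sample );
printf "'%s': (%s, %s) - (%s, %s)\n", @$glyph{ qw( char x0 y0 x1 y1 ) };
```

For real documents a proper XML parser is the safer choice; the regex only keeps the sketch free of non-core modules.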

The alternative is Ghostscript, of course:

>gswin64c -q -sDEVICE=txtwrite -dTextFormat=1 -o - document.pdf
<page>
<block>
<line>
<span bbox="50 46 130 46" font="Times-Roman" size="10.0000">
<char bbox="50 46 55 46" c="2"/>
<char bbox="55 46 58 46" c="/"/>
<char bbox="58 46 63 46" c="2"/>
<char bbox="63 46 68 46" c="7"/>
<char bbox="68 46 71 46" c="/"/>
<char bbox="71 46 76 46" c="2"/>
<char bbox="76 46 81 46" c="0"/>
<char bbox="81 46 86 46" c="2"/>
<char bbox="86 46 91 46" c="0"/>
<char bbox="91 46 93 46" c=" "/>
<char bbox="93 46 98 46" c="3"/>
<char bbox="98 46 101 46" c=":"/>
<char bbox="101 46 106 46" c="0"/>
<char bbox="106 46 111 46" c="7"/>
<char bbox="111 46 114 46" c=" "/>
<char bbox="114 46 121 46" c="A"/>
<char bbox="121 46 130 46" c="M"/>
</span>
</line>
</block>
...

(Note that this bbox is not really a box: both y values are the same, so take "size" into account to get the height.)
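To make the "take size into account" remark concrete, here is a rough sketch of turning the Ghostscript output into usable per-character records. The sub name gs_chars and the record layout are my own invention. I am also assuming that y grows downward in this output (the value 46 agrees with the y attribute in the mutool listing), so the top of the box is estimated as baseline minus font size; that is an approximation, not a true glyph extent.

```perl
use strict;
use warnings;

# Hypothetical helper: walk txtwrite (-dTextFormat=1) output line by line,
# remember the font size of the enclosing <span>, and emit one record per
# <char>. Assumptions: one element per line, y grows downward, and the two
# y values in a char bbox are both the baseline.
sub gs_chars {
    my ( @lines ) = @_;
    my $size = 0;
    my @out;
    for my $line ( @lines ) {
        # A new span sets the font size for the chars that follow.
        if ( $line =~ /<span\b[^>]*\bsize="([\d.]+)"/ ) {
            $size = $1;
            next;
        }
        my ( $bbox, $c ) =
            $line =~ /<char\b[^>]*?\bbbox="([^"]*)"[^>]*?\bc="([^"]*)"/
            or next;
        my ( $x0, $y0, $x1 ) = split ' ', $bbox;
        push @out, {
            char     => $c,
            x0       => $x0,
            x1       => $x1,
            baseline => $y0,
            top      => $y0 - $size,   # estimated, see assumptions above
        };
    }
    return @out;
}

# Usage with a few lines copied from the listing above.
my @demo = gs_chars(
    '<span bbox="50 46 130 46" font="Times-Roman" size="10.0000">',
    '<char bbox="50 46 55 46" c="2"/>',
    '<char bbox="55 46 58 46" c="/"/>',
    '</span>',
);
printf "'%s' x: %s..%s  y: %s..%s\n", @$_{ qw( char x0 x1 top baseline ) }
    for @demo;
```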

###################

At best a pure Perl solution?

Oh yes, it's possible: see CAM::PDF. Chris laid a beautiful foundation, a huge amount of work. Some aspects are not really finished, though nothing is impossible with due diligence. Let's take the file from the recent PDF question, then:

use strict;
use warnings;
use CAM::PDF;

my $d = CAM::PDF-> new( 'document.pdf' );
my $t = $d-> getPageContentTree( 1 );
$t-> render( 'CAM::PDF::Renderer::Dump' );

__END__

( 50.17, 549.24) ( 50.17, 549.24) 2/27/2020
( 93.23, 549.24) ( 93.23, 549.24) 3:07
( 113.51, 549.24) ( 113.51, 549.24) AM
( 677.77, 549.24) ( 677.77, 549.24) Quotations
( 724.17, 549.24) ( 724.17, 549.24) Due
( 743.33, 549.24) ( 743.33, 549.24) By:
( 760.28, 549.24) ( 760.28, 549.24) 01/22/2020
( 288.40, 533.24) ( 288.40, 533.24) ABSTRA
( 344.41, 533.24) ( 344.41, 533.24) CT
( 367.36, 533.24) ( 367.36, 533.24) OF
( 390.30, 533.24) ( 390.30, 533.24) UNSTRAPPED
( 487.91, 533.24) ( 487.91, 533.24) (A
....

Something close to what you wanted? This "content tree" can be an enormous structure and can easily eat 100+ MB for a complex page. It follows the drawing instructions as they flow during content interpretation; each node has a "graphics state" attached, which is updated as it all proceeds. See the source for an approximate idea; of course, "The PDF Reference" is the ultimate authority, and you can't avoid it if you are serious about PDF.

CAM::PDF can take different "plugins" (renderers) to traverse (render) this tree. CAM::PDF::Renderer::Dump is a primitive example. Now, somewhat closer to the "per-character coordinates" goal:

MyTestRenderer.pm:

package MyTestRenderer;

use strict;
use warnings;
use base 'CAM::PDF::GS';

sub new {
    my ( $class, @args ) = @_;
    my $self = $class-> SUPER::new( @args );
    $self-> { mode } = 'c'; # split into characters
    return $self
}

sub renderText {
    my ( $self, $string, $width ) = @_;
    my $fontsize = $self-> { Tfs };
    my ( $xu, $yu ) = $self-> textToUser( 0, 0 );
    my ( $xd, $yd ) = $self-> userToDevice( $xu, $yu );
    printf "(x = %5.1f, y = %5.1f) (w = %.3f, h = %3.1f) %s\n",
        $xd, $yd, $width, $fontsize, $string;
    return;
}

1;

use strict;
use warnings;
use CAM::PDF;
use lib '.';

my $d = CAM::PDF-> new( 'document.pdf' );
my $t = $d-> getPageContentTree( 1 );
$t-> render( 'MyTestRenderer' );

__END__

(x = 50.2, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 55.2, y = 549.2) (w = 0.278, h = 10.0) /
(x = 58.0, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 63.0, y = 549.2) (w = 0.500, h = 10.0) 7
(x = 68.0, y = 549.2) (w = 0.278, h = 10.0) /
(x = 70.7, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 75.7, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 80.7, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 85.7, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 93.2, y = 549.2) (w = 0.500, h = 10.0) 3
(x = 98.2, y = 549.2) (w = 0.278, h = 10.0) :
(x = 101.0, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 106.0, y = 549.2) (w = 0.500, h = 10.0) 7
(x = 113.5, y = 549.2) (w = 0.722, h = 10.0) A
(x = 120.7, y = 549.2) (w = 0.889, h = 10.0) M
(x = 677.8, y = 549.2) (w = 0.722, h = 10.0) Q
(x = 685.0, y = 549.2) (w = 0.500, h = 10.0) u
(x = 690.0, y = 549.2) (w = 0.500, h = 10.0) o
(x = 695.0, y = 549.2) (w = 0.278, h = 10.0) t
....

Problem solved? Maybe. It depends on your input PDF files. If they are as primitive and consistent as the sample, and stay that way for years to come, then yes. Otherwise, much further work is required, as I said.

The different Y-coordinates in the listings above are irrelevant; they simply depend on which way the Y-axis is taken to run. GS (and CAM::PDF) report the baseline position, while mutool gives a true per-glyph bbox -- I don't think such precision is necessary. Just step 1-2 units down from the baseline and add 1-2 units to the text height. Good enough, and constant per span (line). (Not that we can't do a true glyph bbox in Perl. See Font::TTF, Font::FreeType.) "w" is the width in "unscaled text space" -- multiply it by the text size. Both "w" and "h" need further adjustment if the current transformation matrix (cm) or the text matrix (Tm) specifies scaling other than 100%, or if horizontal scaling (Tz) is not at its default of 100%.
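As a worked example of that arithmetic, here is a small sketch. The names approx_box and flip_y and the default 2-unit margins are my own choices, and it assumes the simple case only: no extra scaling from cm or Tm and Tz at 100%. Coordinates are bottom-up, as in the CAM::PDF listings.

```perl
use strict;
use warnings;

# Hypothetical helper: approximate box for one glyph from the renderer's
# numbers. $w is the width in unscaled text space, $size the font size.
# Assumes no additional scaling (cm, Tm, Tz) and a bottom-up Y axis.
sub approx_box {
    my ( $x, $baseline, $w, $size, %opt ) = @_;
    my $descent = $opt{ descent } // 2;   # units below baseline (assumed)
    my $extra   = $opt{ extra }   // 2;   # units added to height (assumed)
    my $bottom  = $baseline - $descent;
    return (
        $x,                               # left
        $bottom,                          # bottom
        $x + $w * $size,                  # right: scale width by text size
        $bottom + $size + $extra,         # top
    );
}

# Convert a bottom-up PDF y coordinate into a top-down one.
sub flip_y {
    my ( $y, $page_height ) = @_;
    return $page_height - $y;
}

# The "A" from the listing: x = 113.5, y = 549.2, w = 0.722, h = 10.0
my @box = approx_box( 113.5, 549.2, 0.722, 10 );
printf "left %.2f  bottom %.2f  right %.2f  top %.2f\n", @box;

# Page height 595.238 comes from the mutool listing.
printf "baseline, top-down: %.3f\n", flip_y( 549.24, 595.238 );
```

The right edge comes out at 113.5 + 0.722 * 10 = 120.72, which is where the following "M" starts, and flipping the baseline 549.24 against the page height 595.238 gives about 46, the y value shown by mutool and Ghostscript.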

Much nastier issues arise when text is not "single-byte ASCII, US-centric" encoded. See this patch to get string widths for double-byte encoded fonts. This patch may be of interest, too. As to actual text content extraction with non-ASCII and/or double-byte encodings, this patch does that, but it was applied in a different place for a different (current at the time) purpose. CAM::PDF::PageText is only interested in text; it's independent from (orthogonal to) the concept of "tree rendering", though it uses such a tree. The patch can be examined and snapped into the appropriate place in our renderer, if you really want it done in "pure Perl".