CAM::PDF extract text and their coordinates from pdf..

umesh_epub has asked for the wisdom of the Perl Monks concerning the following question:


Re: CAM::PDF extract text and their coordinates from pdf..
by snoopy (Curate) on Jan 09, 2013 at 22:08 UTC

package PDF::ToText;

use 5.006;
use warnings;
use strict;
use CAM::PDF;
use CAM::PDF::GS;
use base qw(CAM::PDF::GS);

=head1 NAME

PDF::ToText - CAM::PDF renderer to extract PDF Text and position information

=head1 VERSION

Version 0.01

=cut

our $VERSION = '0.01';

=head1 SYNOPSIS

    use CAM::PDF;
    use PDF::ToText;
    my $pdf = CAM::PDF->new($filename);
    my $contentTree = $pdf->getPageContentTree(1);
    $contentTree->render("PDF::ToText");

=head1 SUBROUTINES/METHODS

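=head2 renderText

Invoked for each text segment during rendering. Prints the segment's
bottom-left and top-right corners (device coordinates), then the text.

=cut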

sub _textToDevice {
    my $self = shift;

    my @t2u = $self->textToUser( @_ );
    my @t2d = $self->userToDevice( @t2u);

    return @t2d;
}

sub renderText {
   my $self = shift;
   my $string = shift;
   my $width = shift;

   # collect vertices of this text segment.

   my @bottom_left = $self->_textToDevice(0, 0);
   my @bottom_right = $self->_textToDevice($width, 0);
   my @top_left = $self->_textToDevice(0, $self->{Tfs});
   my @top_right = $self->_textToDevice($width, $self->{Tfs});

   printf "%7.2f %7.2f %7.2f %7.2f %s\n", @bottom_left, @top_right, $string;

   return;
}

1;  # a module file must end by returning a true value

In its current state, it dumps text coordinates to STDOUT, but it can easily be amended to collect them in a global variable or whatever (CAM::PDF doesn't currently support the passing of handles).
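As a rough sketch of the "collect them in a global variable" idea, the variant below pushes each fragment onto a package variable instead of printing it. The package name PDF::ToText::Collect, the @FRAGMENTS array, and the hash keys are made up for illustration; only textToUser, userToDevice, and Tfs are taken from the code above.

```perl
package PDF::ToText::Collect;   # hypothetical name, not a CPAN module

use strict;
use warnings;
use CAM::PDF::GS;
use base qw(CAM::PDF::GS);

# Hypothetical package-level store, one hash per text fragment.
our @FRAGMENTS;

# Map a point from text space to device space, as in the module above.
sub _textToDevice {
    my $self = shift;
    return $self->userToDevice( $self->textToUser(@_) );
}

sub renderText {
    my ( $self, $string, $width ) = @_;

    # Bottom-left and top-right corners of this text segment.
    my ( $x0, $y0 ) = $self->_textToDevice( 0,      0 );
    my ( $x1, $y1 ) = $self->_textToDevice( $width, $self->{Tfs} );

    # Store the fragment instead of printing it.
    push @FRAGMENTS,
        { text => $string, x0 => $x0, y0 => $y0, x1 => $x1, y1 => $y1 };

    return;
}

1;
```

Assuming this is saved where Perl can find it (for example as PDF/ToText/Collect.pm), rendering a page with $contentTree->render("PDF::ToText::Collect") should leave that page's fragments in @PDF::ToText::Collect::FRAGMENTS. Empty the array before rendering the next page, since a global accumulates across calls.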


Re^2: CAM::PDF extract text and their coordinates from pdf..

by umesh_epub (Novice) on Jan 10, 2013 at 05:39 UTC


Re^3: CAM::PDF extract text and their coordinates from pdf..

by snoopy (Curate) on Jan 10, 2013 at 05:58 UTC

Yes, that's the same point that I got to.

In practice, you end up with a lot of text fragments. Putting these back together into words and lines is a fair bit of work and can involve some heuristics.
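To give a flavor of the heuristics involved, here is a minimal sketch. It is not what pstotext or any other tool actually does. It assumes each fragment is a hash with text, x0, y0, x1 keys (hypothetical names), that y grows upward as in PDF user space, and that the two tolerance values are guesses you would tune per document.

```perl
use strict;
use warnings;

# Hypothetical helper: group fragments into lines by baseline, then join
# the fragments on each line, adding a space where there is a visible gap.
sub assemble_lines {
    my ( $fragments, %opt ) = @_;
    my $y_tol   = defined $opt{y_tol}   ? $opt{y_tol}   : 2;  # baseline drift allowed within a line
    my $gap_tol = defined $opt{gap_tol} ? $opt{gap_tol} : 1;  # horizontal gap that implies a space

    # Pass 1: walk fragments from top to bottom and cluster them into lines.
    my @lines;
    for my $f ( sort { $b->{y0} <=> $a->{y0} } @$fragments ) {
        if ( @lines && abs( $lines[-1]{y} - $f->{y0} ) <= $y_tol ) {
            push @{ $lines[-1]{frags} }, $f;
        }
        else {
            push @lines, { y => $f->{y0}, frags => [$f] };
        }
    }

    # Pass 2: within each line, order left to right and join.
    my @out;
    for my $line (@lines) {
        my $text = '';
        my $prev_x1;
        for my $f ( sort { $a->{x0} <=> $b->{x0} } @{ $line->{frags} } ) {
            $text .= ' ' if defined $prev_x1 && $f->{x0} - $prev_x1 > $gap_tol;
            $text .= $f->{text};
            $prev_x1 = $f->{x1};
        }
        push @out, $text;
    }
    return @out;
}

# Made-up fragments: "Hel" + "lo" touch, "world" follows after a gap.
my @demo = (
    { text => 'world', x0 => 40, y0 => 700,   x1 => 65 },
    { text => 'Hel',   x0 => 10, y0 => 700,   x1 => 25 },
    { text => 'lo',    x0 => 25, y0 => 700.5, x1 => 35 },
    { text => 'Next',  x0 => 10, y0 => 686,   x1 => 30 },
);
print "$_\n" for assemble_lines( \@demo );
```

Even this toy version shows where it gets hard: rotated text, multiple columns, superscripts, and fonts with unusual spacing all break the simple "same baseline, small gap" assumption.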

Rather than continuing to develop the above, I personally went with pstotext from the Ghostscript suite; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name it will work on PDF files.

Another program I looked at was pdfminer.

One of these, or something similar, might work. It's just a matter of how good a job they do.

- David


Re^4: CAM::PDF extract text and their coordinates from pdf..

by umesh_epub (Novice) on Jan 10, 2013 at 13:04 UTC

Re^5: CAM::PDF extract text and their coordinates from pdf..

by snoopy (Curate) on Jan 10, 2013 at 23:19 UTC

Re: CAM::PDF extract text and their coordinates from pdf..
by LanX (Saint) on Jan 10, 2013 at 06:34 UTC

pdftohtml -xml
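A small sketch of reading that XML from Perl, in case it helps. The output contains text elements carrying top, left, width, height, and font attributes; the sample below is made up, and the regex approach is only for illustration (a real XML parser would be more robust and would also decode entities). The function name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pull position attributes and text out of
# pdftohtml -xml output. Returns a list of hashes, one per <text> element.
sub parse_pdftohtml_xml {
    my ($xml) = @_;
    my @items;
    while ( $xml =~ m{<text\b([^>]*)>(.*?)</text>}gs ) {
        my ( $attrs, $body ) = ( $1, $2 );

        # Collect name="value" pairs such as top, left, width, height, font.
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;

        # Drop inline markup such as <b> or <i>; entities are left untouched.
        ( my $text = $body ) =~ s/<[^>]+>//g;

        push @items, { %attr, text => $text };
    }
    return @items;
}

# Made-up sample in the shape of pdftohtml -xml output.
my $sample = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="85" left="72" width="50" height="14" font="0">Hello</text>
<text top="85" left="128" width="48" height="14" font="0"><b>world</b></text>
</page>
XML

for my $t ( parse_pdftohtml_xml($sample) ) {
    printf "%5d %5d %s\n", $t->{left}, $t->{top}, $t->{text};
}
```

Note that top is measured from the top of the page here, the opposite of PDF's bottom-up y axis.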

For older discussions, see these search results:

2011-02-09 LanX Re: Need Help for Convert PDF to HTML Re:SoPW
2010-12-22 LanX Re^2: PDF File Merging Data Re:SoPW
2010-12-22 LanX Re: Convert PDF file into HTML file Re:SoPW
2010-03-28 LanX Re: How to invoke pdftotext and extract first line of text from PDF file? Re:SoPW
2010-03-26 LanX Parsing PDFs by text position? SoPW
2009-09-12 LanX Re: Convert PDF to HTML (or JPEG) (How?) Re:SoPW

Cheers Rolf
