CAM::PDF extract text and their coordinates from pdf..

by umesh_epub (Novice)
on Jan 09, 2013 at 07:46 UTC ( [id://1012402]=perlquestion )

umesh_epub has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I want to extract the page content and its coordinates from a PDF. Is there any other way to get this output? Please let me know.
    #! perl
    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;

    my $filename = 'test.pdf';
    my $pdf = CAM::PDF->new($filename);
    my $pageone_tree = $pdf->getPageContentTree(1);
    print CAM::PDF::PageText->render($pageone_tree);
Thanks

Replies are listed 'Best First'.
Re: CAM::PDF extract text and their coordinates from pdf..
by snoopy (Curate) on Jan 09, 2013 at 22:08 UTC
    I've previously written a rendering class that does just that:
        package PDF::ToText;

        use 5.006;
        use warnings;
        use strict;
        use CAM::PDF;
        use CAM::PDF::GS;
        use base qw(CAM::PDF::GS);

        =head1 NAME

        PDF::ToText - CAM::PDF renderer to extract PDF Text and position information

        =head1 VERSION

        Version 0.01

        =cut

        our $VERSION = '0.01';

        =head1 SYNOPSIS

            use CAM::PDF;
            use PDF::ToText;
            my $pdf = CAM::PDF->new($filename);
            my $contentTree = $pdf->getPageContentTree(1);
            $contentTree->render("PDF::ToText");

        =head1 SUBROUTINES/METHODS

        =head2 renderText

        =cut

        # Map a point from text space to device space, via user space.
        sub _textToDevice {
            my $self = shift;
            my @t2u = $self->textToUser( @_ );
            my @t2d = $self->userToDevice( @t2u );
            return @t2d;
        }

        sub renderText {
            my $self = shift;
            my $string = shift;
            my $width = shift;

            # collect vertices of this text segment.
            my @bottom_left  = $self->_textToDevice(0, 0);
            my @bottom_right = $self->_textToDevice($width, 0);
            my @top_left     = $self->_textToDevice(0, $self->{Tfs});
            my @top_right    = $self->_textToDevice($width, $self->{Tfs});

            # print the bounding box (bottom-left x/y, top-right x/y), then the text
            printf "%7.2f %7.2f %7.2f %7.2f %s\n", @bottom_left, @top_right, $string;
            return;
        }

        1;    # a module must return a true value
    It's a drop-in replacement for CAM::PDF::PageText.

    In its current state, it dumps text coordinates to STDOUT, but it can easily be amended to collect them in a global variable or whatever (CAM::PDF doesn't currently support the passing of handles).

      Hi Snoopy,
      Thanks for your kind reply. How can I tell where a line starts and where it ends?
      Which material should we study for working with PDFs?
      Thanks,
      Umesh
        Hi Umesh,

        Yes, that's the same point that I got to.

        In practice, you end up with a lot of text fragments that need to be reassembled into words and lines. That is a fair bit of work and can involve some heuristics.
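
A minimal sketch of one such heuristic, not the code from this thread: it assumes the fragments have already been collected as hash refs holding the bounding box and text (the same values renderText prints above). The function name `assemble_lines`, the field names `x0`, `y0`, `x1`, `y1`, `text`, and the tolerances `y_tol` and `gap_tol` are all hypothetical and would need tuning per document. Fragments whose baselines are within `y_tol` of each other are treated as one line, and a space is inserted wherever the horizontal gap exceeds `gap_tol`.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF.
# Input: array ref of fragments { x0, y0, x1, y1, text }, where (x0, y0)
# is the bottom-left and (x1, y1) the top-right corner in device space.
sub assemble_lines {
    my ($fragments, %opt) = @_;
    my $y_tol   = defined $opt{y_tol}   ? $opt{y_tol}   : 2;  # assumed baseline tolerance
    my $gap_tol = defined $opt{gap_tol} ? $opt{gap_tol} : 1;  # assumed word-gap threshold

    # PDF y coordinates grow upwards, so sort top of page first.
    my @sorted = sort { $b->{y0} <=> $a->{y0} || $a->{x0} <=> $b->{x0} }
                 @$fragments;

    # Group fragments whose baseline is close to the current line's baseline.
    my @lines;
    for my $f (@sorted) {
        my $line = $lines[-1];
        if ($line && abs($line->{y} - $f->{y0}) <= $y_tol) {
            push @{ $line->{frags} }, $f;
        }
        else {
            push @lines, { y => $f->{y0}, frags => [$f] };
        }
    }

    # Within each line, order left to right and join, adding a space
    # only where the gap between neighbouring fragments is large enough.
    my @out;
    for my $line (@lines) {
        my @frags = sort { $a->{x0} <=> $b->{x0} } @{ $line->{frags} };
        my $text = '';
        my $prev_x1;
        for my $f (@frags) {
            $text .= ' ' if defined $prev_x1 && $f->{x0} - $prev_x1 > $gap_tol;
            $text .= $f->{text};
            $prev_x1 = $f->{x1};
        }
        push @out, {
            y    => $line->{y},
            x0   => $frags[0]{x0},     # line start
            x1   => $frags[-1]{x1},    # line end
            text => $text,
        };
    }
    return @out;
}

# Usage with made-up fragments:
my @lines = assemble_lines([
    { x0 => 72,  y0 => 700,   x1 => 100, y1 => 712, text => 'Hello' },
    { x0 => 104, y0 => 700.5, x1 => 135, y1 => 712, text => 'world' },
    { x0 => 72,  y0 => 686,   x1 => 90,  y1 => 698, text => 'Sec'   },
    { x0 => 90,  y0 => 686,   x1 => 110, y1 => 698, text => 'ond'   },
]);
printf "%7.2f %7.2f %7.2f %s\n", @{$_}{qw(x0 x1 y)}, $_->{text} for @lines;
```

The `x0` and `x1` of each returned entry give the line start and line end. Fixed tolerances are the weak point of this approach: rotated text, multi-column layouts, and large font-size changes would all need extra handling.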

        Rather than continuing to develop the above, I personally went with pstotext from the Ghostscript suite; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name, it will work on PDF files.

        Another program I looked at was pdfminer.

        One of these, or something similar, might work. It's just a matter of how good a job they do.

        - David

Re: CAM::PDF extract text and their coordinates from pdf..
by LanX (Saint) on Jan 10, 2013 at 06:34 UTC
