
Re: PDF GetInfo(

by axelrose (Scribe)
on Apr 29, 2002 at 20:42 UTC ( [id://162920]=note )


in reply to PDF GetInfo(

With the help of Martin Hosken, author of the Text::PDF modules, I attacked the task like this:
#!perl -w
use strict;
use Text::PDF::File;

if (@ARGV) {
    for my $file (@ARGV) {
        if ( -r $file ) { print gettitle($file), "\n" }
    }
}
elsif ( $^O =~ /Mac/ ) {
    chomp( my $pwd = `pwd` );
    my $file = MacPerl::Ask( "Input file:", $pwd );
    if ( -r $file ) { print gettitle($file), "\n" }
}
else {
    die "no input, no output\n";
}

sub gettitle {
    my $pdffile = shift;
    my $pdf     = Text::PDF::File->open($pdffile) || die;
    my $info    = $pdf->{'Info'}->val;
    my $title   = $info->{'Title'}->val;
}
I will check whether manually going through all lines of the PDF file gives a speed boost.

Re: Re: PDF GetInfo(
by axelrose (Scribe) on May 13, 2002 at 11:13 UTC
    With the help of Alan Fry I managed to get a fast solution like this:
    sub gettitle {
        use Fcntl;
        my $file = shift;
        local *IN;
        sysopen( IN, $file, O_RDONLY, 0 ) or die "while reading: '$file'\n";
        binmode IN;    # PDF is binary data; avoid newline translation
        read IN, my ($str), -s $file;
        close IN;

        # Object number that the /Info entry refers to
        my ($info_block) = ( $str =~ /\/Info\s(\d+)\s0\sR/ )
          or die "cannot get /Info paragraph\n";

        # Find "<n> 0 obj" where it starts a line
        my $searchpos = -1;
        my $info_start;
        while (1) {
            $info_start = index( $str, "$info_block 0 obj", $searchpos + 1 );
            die "cannot get position of '$info_block 0 obj'\n"
              if $info_start < $searchpos + 1;
            last if ( substr( $str, $info_start - 1, 1 ) =~ /\015|\012/ );
            $searchpos = $info_start;
        }

        my $info_obj = substr( $str, $info_start,
            index( $str, ">>", $info_start ) - $info_start + 2 );

        # Title must sit on one line; the class excludes only CR and LF
        # (the original class also excluded '|' by accident)
        my ($title) = ( $info_obj =~ /\/Title\s*\( ([^\015\012]*) \) /x )
          or return 'undefined';
        return $title;
    }

    I furthermore compared the performance of the above solution with Text::PDF and PDF-111 from CPAN. The test set consisted of 36 PDF files summing up to 3.8 MB.

    runtime ratios of
    index-solution-from-above : Text::PDF methods : PDF-111
    were:
    1 : 6 : 12

    PDF-111 from CPAN has other flaws too. The author didn't respond to my questions. IMHO it should be dumped. It has a far too prominent place in the module hierarchy.

      I'll grant you PDF has some definite flaws, but unfortunately the above solution does as well.

      It has trouble with titles that were truncated due to length – it returns them as undefined. There also seem to be some problems with Asian languages.

      I like the speed and memory use compared to some of the others (about 4× faster than PDF->GetInfo). I need to muck about in the info section for some other stuff, so hopefully I'll figure out the format for long titles. Thanks.
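
One possible explanation for these failures, offered as an assumption rather than a diagnosis of the files in question: the PDF format allows a string to be written as a hex string (`<FEFF...>`), to continue across lines with a backslash before the line break, and to hold UTF-16BE text marked by a leading FE FF byte pair. A single-line `\( ... \)` match sees none of these. The sketch below uses a hypothetical helper, `decode_title`, which takes the text of the Info object (as `$info_obj` in the fast `gettitle` above) and handles those three cases. It does not handle unescaped nested parentheses, and it returns non-UTF-16 titles as raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Named escapes allowed inside a PDF literal string
my %ESC = ( n => "\n", r => "\r", t => "\t", b => "\b", f => "\f" );

# Turn the part after a backslash into the character it stands for
sub _unescape {
    my ($c) = @_;
    return ''                     if $c =~ /^[\r\n]/;    # line continuation
    return chr( oct($c) & 0xFF )  if $c =~ /^[0-7]/;     # \ddd octal code
    return exists $ESC{$c} ? $ESC{$c} : $c;              # \n etc., or \( \) \\
}

# Hypothetical helper: return the decoded /Title, or undef if none is found
sub decode_title {
    my ($info_obj) = @_;
    my $raw;

    if ( $info_obj =~ m{/Title\s*<([0-9A-Fa-f\s]*)>} ) {
        # Hex string: drop whitespace, pad an odd digit count with 0
        ( my $hex = $1 ) =~ s/\s+//g;
        $hex .= '0' if length($hex) % 2;
        $raw = pack 'H*', $hex;
    }
    elsif ( $info_obj =~ m{/Title\s*\(((?:\\.|[^\\()])*)\)}s ) {
        # Literal string, possibly spanning lines, with backslash escapes
        $raw = $1;
        $raw =~ s/\\(\r\n|\r|\n|[0-7]{1,3}|.)/_unescape($1)/gse;
    }
    else {
        return undef;
    }

    # A leading FE FF marks UTF-16BE text; anything else is returned as bytes
    if ( $raw =~ /^\xFE\xFF/ ) {
        my $bytes = substr( $raw, 2 );
        return decode( 'UTF-16BE', $bytes );
    }
    return $raw;
}

print decode_title('<< /Title (Hello \(World\)) >>'), "\n";   # prints: Hello (World)
```

Whether this covers the long and Asian-language titles mentioned above would need checking against the actual files.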
