Re: PDF GetInfo(

With the help of Martin Hosken, author of the Text::PDF modules, I attacked the task like this:

#!perl -w
use strict;
use Text::PDF::File;

if (@ARGV) {
    for my $file (@ARGV) {
        if ( -r $file ) { print gettitle($file), "\n" }
    }
}
elsif ( $^O =~ /Mac/ ) {
    chomp( my $pwd = `pwd` );
    my $file = MacPerl::Ask( "Input file:", $pwd );
    if ( -r $file ) { print gettitle($file), "\n" }
}
else {
    die "no input, no output\n";
}

sub gettitle {
    my $pdffile = shift;
    my $pdf     = Text::PDF::File->open($pdffile)
        or die "cannot open '$pdffile' as PDF\n";
    my $info    = $pdf->{'Info'}->val;    # the document's /Info dictionary
    return $info->{'Title'}->val;         # the /Title string
}

I will check if manually going through all lines of the PDF file will give a speed boost.


Re: Re: PDF GetInfo( by axelrose (Scribe) on May 13, 2002 at 11:13 UTC
With the help of Alan Fry I managed to get a fast solution like this:

sub gettitle {
    use Fcntl;
    my $file = shift;
    local *IN;
    sysopen( IN, $file, O_RDONLY, 0 ) or die "while reading: '$file'\n";
    read( IN, my $str, -s $file );    # slurp the whole file
    close IN;

    # the trailer names the Info object, e.g. "/Info 12 0 R"
    my ($info_block) = ( $str =~ /\/Info\s(\d+)\s0\sR/ )
        or die "cannot get /Info paragraph\n";

    # find "<n> 0 obj" where it starts a line
    my $searchpos = -1;
    my $info_start;
    while (1) {
        $info_start = index( $str, "$info_block 0 obj", $searchpos + 1 );
        die "cannot get position of '$info_block 0 obj'\n"
            if $info_start < $searchpos + 1;
        last if ( substr( $str, $info_start - 1, 1 ) =~ /\015|\012/ );
        $searchpos = $info_start;
    }
    my $info_obj = substr( $str, $info_start,
        index( $str, ">>", $info_start ) - $info_start + 2 );

    # title text up to the last ")" on the same line
    my ($title) = ( $info_obj =~ /\/Title\s\( ([^\015\012]*) \) /x )
        or return 'undefined';
    return $title;
}

I also compared the performance of the above solution with Text::PDF and PDF-111 from CPAN. The test set consisted of 36 PDF files totalling 3.8 MB. The runtime ratios of index solution from above : Text::PDF methods : PDF-111 were 1 : 6 : 12.

PDF-111 from CPAN has other flaws too, and the author didn't respond to my questions. IMHO it should be dumped. It has a far too prominent place in the module hierarchy.
Re(3): PDF GetInfo() by Arguile (Hermit) on Aug 16, 2002 at 18:59 UTC
I'll grant you PDF has some definite flaws, but unfortunately the above solution does as well. It has trouble with titles that were truncated due to length: it returns them as undefined. There also seem to be some problems with Asian languages. I like the speed and memory use compared to some of the others (about 4× faster than PDF->GetInfo). I need to muck about in the info section for some other stuff, so hopefully I'll figure out the format for long titles. Thanks.
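One plausible cause of both problems, offered as an assumption and not checked against the files in question: a PDF string need not be a single-line `(...)` literal. A literal string may be split across lines with a backslash at the end of the line and may contain escaped or nested parentheses, and a string may also be written in hex as `<...>`. Text that starts with the bytes FE FF is UTF-16BE, which is how non-Latin titles are commonly stored. A minimal sketch that handles these forms follows; `title_from_info` and `unescape_literal` are hypothetical helper names, and `$info_obj` is assumed to be the Info dictionary text that the index-based `gettitle` already cuts out.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helpers, not from the thread.

# Decode the body of a literal string; $s starts just after the opening "(".
sub unescape_literal {
    my ($s) = @_;
    my %esc = ( n => "\n", r => "\r", t => "\t", b => "\b", f => "\f" );
    my ( $out, $depth ) = ( '', 1 );
    while ( $s =~ /\G(?:\\([0-7]{1,3})|\\(\015\012|[\015\012])|\\(.)|([()])|([^\\()]+))/gcs ) {
        if    ( defined $1 ) { $out .= chr( oct($1) & 0xFF ) }    # \ddd octal
        elsif ( defined $2 ) { }    # backslash + line end: continuation, drop it
        elsif ( defined $3 ) { $out .= exists $esc{$3} ? $esc{$3} : $3 }
        elsif ( defined $4 ) {      # unescaped parentheses nest
            $depth += $4 eq '(' ? 1 : -1;
            last if $depth == 0;    # the ")" that closes the whole string
            $out .= $4;
        }
        else { $out .= $5 }
    }
    return $out;
}

# Return the decoded /Title from an Info dictionary string, or nothing.
sub title_from_info {
    my ($info_obj) = @_;
    my $raw;
    if ( $info_obj =~ m{/Title\s*<([0-9A-Fa-f\s]*)>} ) {
        ( my $hex = $1 ) =~ s/\s+//g;       # whitespace between digits is ignored
        $hex .= '0' if length($hex) % 2;    # an odd final digit is padded with 0
        $raw = pack 'H*', $hex;
    }
    elsif ( $info_obj =~ m{/Title\s*\(} ) {
        $raw = unescape_literal( substr( $info_obj, $+[0] ) );
    }
    else {
        return;
    }
    # A leading FE FF marks UTF-16BE text.
    return decode( 'UTF-16BE', $raw ) if $raw =~ s/\A\xFE\xFF//;
    # Assumption: Latin-1 as an approximation of PDFDocEncoding.
    return decode( 'ISO-8859-1', $raw );
}

# prints: A long title continued here
print title_from_info("<< /Title (A long title \\\ncontinued here) >>"), "\n";
```

This keeps the speed advantage of working on the raw string, since only the Info dictionary is scanned. It does not cope with encrypted files or with an Info dictionary stored inside a compressed object stream.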

