I did try both of those, without success.

I have a PDF that I created with OpenOffice. pdftotext is able to extract the text from it, whereas CAM::PDF (or File::Extract::PDF) gives me garbage characters:

[jerome@saab pdf]$ getpdftext.pl -v ~/faxTaxHabitation2005.pdf
       




 
  
                 
                      
    ! " #  $ 

%  # & '     (
"  ) *

        + + +
...

And pdftotext:

[jerome@saab pdf]$ pdftotext ~/faxTaxHabitation2005.pdf txt
[jerome@saab pdf]$ tail txt

Merci de bien vouloir me confirmer ces informations par retour de fax afin que je puisse proceder au paiment le plus rapidement possible au numero suivant : *************

Cordiales salutations.
...

The ideal would be a Perl module linked against the xpdf C code. :)
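
Until something like that exists, a stopgap might be to call pdftotext from Perl and capture its output. This is only a rough sketch: it assumes pdftotext (from xpdf or poppler) is on the PATH, and `pdf_to_text` is a name made up here, not a function from CAM::PDF or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any CPAN module): run pdftotext on a
# file and return the extracted text as one string.
# Assumption: pdftotext is installed; passing "-" as the output file
# name makes it write the text to standard output.
sub pdf_to_text {
    my ($pdf, $program) = @_;
    $program = 'pdftotext' unless defined $program;

    # List-form pipe open bypasses the shell, so file names with
    # spaces or quotes need no escaping.
    open(my $fh, '-|', $program, $pdf, '-')
        or die "Cannot run $program: $!";

    local $/;               # slurp mode: read all output at once
    my $text = <$fh>;

    # close() on a pipe returns false if the child exited non-zero.
    close($fh) or die "$program failed (exit status $?)";
    return $text;
}

# Usage: perl thisscript.pl some.pdf
print pdf_to_text($ARGV[0]) if @ARGV;
```

The second argument is only there so that a different path to the binary can be passed in; it does not give the layout control or page access that a real binding to the xpdf code would.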

-- Nice photos of naked perl sources here !

In reply to Re^2: extract text from pdf by jeteve
in thread extract text from pdf by jeteve
