Convert PDF to HTML (or JPEG)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Convert PDF to HTML (or JPEG) by almut (Canon) on Sep 12, 2009 at 12:31 UTC
For PDF to JPG (or any other raster image format such as PNG or TIFF), you could use Ghostscript to do the conversion:

`$ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r150 -sOutputFile=img%d.jpg input.pdf`

This creates as many images (`img1.jpg` to `imgN.jpg`) as there are pages in the PDF file. `-r` is the resolution in dpi (150dpi gives an image size of 1240x1754 pixels for A4 paper), and `-dJPEGQ` is the quality factor (up to 100).

Unfortunately, this doesn't do any anti-aliasing, so the fonts typically look rather ragged... You can work around that problem by doing the anti-aliasing yourself, which means oversampling while rendering from PDF to raster (e.g. by a factor of 4, i.e. 600dpi) and then downsampling with an appropriate filter. ImageMagick's `convert` can be used for the latter. The complete sequence of steps would be:

`$ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r600 -sOutputFile=img%d.jpg input.pdf`
`$ for img in img*.jpg ; do convert $img -filter Lanczos -resize 25% -quality 90 out_$img ; done`

The resulting anti-aliased images `out_img*.jpg` then have the pixel dimensions of a 150dpi rendering. In case you have the non-`/usr/bin`-namespace-polluting sister GraphicsMagick installed (instead of ImageMagick), the command would be `gm convert ...`

(Those who hold a degree in Signal Processing, or have come in contact with filter design in some other context, might want to take a look at the list of filters to choose from. In case of doubt, stick with Lanczos or Kaiser for somewhat sharper results, or Gaussian or Cubic for somewhat softer ones.)

Also, there's documentation, well hidden from daylight, under `/usr/share/doc/ghostscript/Devices.htm`, which explains what options are available with the individual Ghostscript output devices. You usually need to have another package installed (e.g. `ghostscript-doc` on Debian/Ubuntu) to have that file.
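The arithmetic behind the numbers above (150dpi giving 1240x1754 for A4, and a factor of 4 meaning 600dpi plus a 25% resize) can be sketched as two small shell functions. This is only an illustration: the function names `page_pixels` and `oversample_plan` are made up for this sketch, and the second function merely prints the two commands rather than running them, so it works even where `gs` and `convert` are not installed.

```shell
# Hypothetical helpers for illustration; the names are not part of any tool.

# Pixel dimensions of a page rendered at a given resolution.
# Usage: page_pixels WIDTH_MM HEIGHT_MM DPI
page_pixels() {
    # 1 inch = 25.4 mm. Scale by 10 and add 127 (half of 254) so that the
    # integer division rounds to the nearest pixel.
    w=$(( ($1 * $3 * 10 + 127) / 254 ))
    h=$(( ($2 * $3 * 10 + 127) / 254 ))
    printf '%s\n' "${w}x${h}"
}

# Print (not run) the render and downsample commands for an oversampling factor.
# Usage: oversample_plan TARGET_DPI FACTOR INPUT_PDF
oversample_plan() {
    render_dpi=$(( $1 * $2 ))   # e.g. 150 * 4 = 600
    percent=$(( 100 / $2 ))     # e.g. 100 / 4 = 25 (integer division)
    printf '%s\n' "gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r${render_dpi} -sOutputFile=img%d.jpg $3"
    printf '%s\n' "for img in img*.jpg; do convert \"\$img\" -filter Lanczos -resize ${percent}% -quality 90 \"out_\$img\"; done"
}
```

For example, `page_pixels 210 297 150` prints `1240x1754`, and `oversample_plan 150 4 input.pdf` prints the two commands from the post with `-r600` and `-resize 25%`. Piping the plan into `sh` would execute it, assuming the input filename contains no spaces or shell metacharacters.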
Re^2: Convert PDF to HTML (or JPEG) by LanX (Saint) on Sep 12, 2009 at 14:13 UTC
Almut, IIRC `convert` has a switch for antialiasing. I never had problems converting PDF to bitmaps (well ... years ago), so there should be no need for oversampling.

Cheers Rolf
Re^3: Convert PDF to HTML (or JPEG) by almut (Canon) on Sep 12, 2009 at 17:33 UTC
Yes, `convert` has an `-antialias` switch, but not Ghostscript, at least not the jpeg driver (there's an `x11alpha` screen driver, but I think that's the only one which does anti-aliasing by itself). And ImageMagick (i.e. `convert`) cannot render PDF/PS itself; it uses Ghostscript for that under the hood, anyway...

Personally, I prefer to use both tools separately, because then I have fine control over the parameters used during conversion, and so far I've always achieved better results (in less time) than when trying to convince `convert` alone to do what I want. For example, the naive approach (which I figure should be comparable to the conversions I posted above) when using `convert` directly would be something like this:

`$ convert input.pdf -density 150 -geometry 1240x1754 -antialias -quality 90 img%d.jpg`

But the results are much worse than when doing the steps separately... (example: test1.jpg, test2.jpg, where test1.jpg has been produced by using `gs` and `convert` separately, and test2.jpg by calling `gs` indirectly via `convert` with the command right above).

As I read the docs, `-density` is supposed to set the resolution ("set resolution of an image for rendering to devices"); however, for some reason this doesn't seem to be passed on to Ghostscript (as can be revealed using `strace`)...

In case you have the patience to figure out the correct incantation of options for `convert` that achieves the quality of test1.jpg, please let me know (input PDF here). IMHO, there's too much Magick going on :)
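One possible explanation for `-density` not reaching Ghostscript, offered here as an assumption and not as something verified against the test files in this thread: ImageMagick applies settings such as `-density` to images read after them on the command line, so for a PDF the option generally has to come before the input filename. The sketch below only builds such a command line; the function name `convert_plan` is made up for illustration, and the filter and quality values are simply carried over from the two-step approach earlier in the thread.

```shell
# Hypothetical helper: print (not run) a single convert command that renders
# at an oversampled density and then downsamples.
# Usage: convert_plan TARGET_DPI FACTOR INPUT_PDF
convert_plan() {
    render_dpi=$(( $1 * $2 ))   # density handed to the PDF renderer
    percent=$(( 100 / $2 ))     # resize factor back to the target size
    # -density is placed BEFORE the input file so it applies while reading it.
    printf '%s\n' "convert -density ${render_dpi} $3 -filter Lanczos -resize ${percent}% -quality 90 img%d.jpg"
}
```

`convert_plan 150 4 input.pdf` prints `convert -density 600 input.pdf -filter Lanczos -resize 25% -quality 90 img%d.jpg`. Whether this matches the quality of the separate `gs` plus `convert` run is untested here.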
Re: Convert PDF to HTML (or JPEG) (How?) by LanX (Saint) on Sep 12, 2009 at 10:25 UTC
What kind of conversion do you expect? PDF is a print format with fixed geometry and line breaks. Each character is positioned individually, and the bigger context is (by default) lost. (Normal) HTML defines text (lines and paragraphs) that is laid out and broken flexibly, depending on the user's display.

Cheers Rolf

UPDATE: you might want to look at solutions using xpdf-tools like pdf2html, which produces HTML files (plus massive CSS) with fixed-position text... is that what you want?
Re^2: Convert PDF to HTML (or JPEG) (How?) by Sewi (Friar) on Sep 12, 2009 at 12:20 UTC
Oh, sorry, I didn't see the `-c` switch, which does exactly what I need. Thanks!
Re: Convert PDF to HTML (or JPEG) by ww (Archbishop) on Sep 12, 2009 at 09:15 UTC
I don't know if this will help, but have you evaluated SWISH::Filters::Pdf2HTML? From CPAN:

"Perl extension for filtering PDF documents with Swish-e. This is a plug-in module that uses the xpdf package to convert PDF documents to html for indexing by Swish-e. Any info tags found in the PDF document are created as meta tags. This filter plug-in requires the xpdf package."
Re^2: Convert PDF to HTML (or JPEG) by Sewi (Friar) on Sep 12, 2009 at 09:25 UTC
I tried xpdf some time ago when I was working on the same problem, and it seems that xpdf ignores pictures entirely when converting :-(
Re^3: Convert PDF to HTML (or JPEG) by marto (Cardinal) on Sep 12, 2009 at 11:54 UTC
I'm not quite sure what you were expecting.

README: Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other utilities.

man pdftotext: Pdftotext converts Portable Document Format (PDF) files to plain text. Pdftotext reads the PDF file, PDF-file, and writes a text file, text-file. If text-file is not specified, pdftotext converts file.pdf to file.txt. If text-file is '-', the text is sent to stdout.

man pdfimages: Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files. Pdfimages reads the PDF file PDF-file, scans one or more pages, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

These utilities are not designed to output HTML with embedded images.

Martin
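Following the man page excerpts above, text and images can be pulled out of a PDF in two separate steps. The sketch below only prints the two command lines; the function name `extract_plan` and the output naming scheme (`NAME.txt`, `NAME-img`) are assumptions made for this illustration. The `-j` option asks `pdfimages` to write JPEG-encoded images as `.jpg` files rather than converting them to PPM.

```shell
# Hypothetical helper: print (not run) the xpdf commands that extract the
# text and the embedded images of one PDF.
# Usage: extract_plan INPUT_PDF
extract_plan() {
    base=${1%.pdf}   # strip a trailing ".pdf" to derive output names
    printf '%s\n' "pdftotext $1 ${base}.txt"
    printf '%s\n' "pdfimages -j $1 ${base}-img"
}
```

`extract_plan report.pdf` prints `pdftotext report.pdf report.txt` followed by `pdfimages -j report.pdf report-img`. Combining the extracted text and images into one HTML page would still be left to the caller, as the post points out.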

