Convert PDF file into HTML file

DEIVEEGARAJA has asked for the wisdom of the Perl Monks concerning the following question:

Re: Convert PDF file into HTML file by chrestomanci (Priest) on Dec 22, 2010 at 11:53 UTC
Converting PDF to HTML will never be easy, because PDF can contain a lot more than HTML can, while at the same time having a lot less structure.

HTML files usually have a linear structure that can easily be parsed, and there are lots of tools for rendering them on screen or on paper. Converting HTML to PDF is easy: you just 'print' the page to a PDF file, and there are plenty of tools to do that.

PDF files are not designed to have structure; they are more like a printout in electronic form. You can think of them as PostScript that is designed to be viewable on screen as well as on paper. A PDF does not contain blocks of text in order with formatting, just lines of text in particular fonts. It is up to the human who reads those lines to decide what is a heading, a column or a footnote.

Any tool that converts PDF to HTML (or Word, plain text, etc.) has to use heuristics to guess structure from this unstructured text on a page. Those tools tend to be expensive, proprietary, and inexact, especially when faced with unusual layouts such as multiple columns or embedded images. OCR tools face similar problems for the same reasons.

Having said that, if your input PDF files are simple, you could consider converting them to SVG (a form of XML) using pdf2svg (a small standalone tool built on Poppler and Cairo), and then converting that XML to HTML using standard CPAN modules and your own heuristics.
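To make "your own heuristics" a little more concrete, here is a minimal sketch of one such guess: treat the most common font size as body text and label clearly larger or smaller lines as headings or footnotes. The input format (a hash per line with a `size` key), the name `guess_roles`, and the 1.2 and 0.85 factors are all assumptions made for illustration; a real converter would have to extract the sizes from the PDF or SVG first and would need far more rules than this.

```perl
use strict;
use warnings;

# Hypothetical input: one hash ref per text line, with the font size
# already extracted by some other tool.
sub guess_roles {
    my @lines = @_;

    # Assume the most frequent font size is the body text size.
    my %count;
    $count{ $_->{size} }++ for @lines;
    my ($body) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;

    # Clearly larger text is guessed to be a heading, clearly smaller
    # text a footnote. Both factors are arbitrary.
    my @roles;
    for my $line (@lines) {
        my $size = $line->{size};
        push @roles, $size >= 1.2 * $body  ? 'heading'
                   : $size <= 0.85 * $body ? 'footnote'
                   :                         'body';
    }
    return @roles;
}

my @roles = guess_roles(
    { text => 'Annual Report',                    size => 18 },
    { text => 'Sales rose in the third quarter.', size => 10 },
    { text => 'Costs were flat.',                 size => 10 },
    { text => '1. Unaudited figures.',            size => 8  },
);
print join(', ', @roles), "\n";   # heading, body, body, footnote
```

A rule like this breaks as soon as a document uses large type for something other than headings, which is exactly why such conversions stay inexact.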
Re^2: Convert PDF file into HTML file by elef (Friar) on Dec 22, 2010 at 12:23 UTC
Well said. This probably won't be of any use, but here goes anyway: pdftotext (part of the xpdf PDF viewer) can programmatically convert PDF to "formatted" text. All it takes is `system("pdftotext -layout -enc UTF-8 \"$infile\" \"$outfile\"")`. It approximates the original layout by inserting spaces in the text. As you need HTML, you're probably better off with pdf2svg; this is just a note in case pdf2svg fails.
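The same call can also be written in the list form of `system`, which bypasses the shell so that file names containing spaces or quotes need no escaping. This is only a sketch: the helper name `pdftotext_args` and the command-line handling are made up for illustration, and it assumes pdftotext is installed and on the PATH.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. With a list, system() runs the
# program directly instead of going through the shell.
sub pdftotext_args {
    my ($infile, $outfile) = @_;
    return ('pdftotext', '-layout', '-enc', 'UTF-8', $infile, $outfile);
}

# Convert only when invoked as: perl script.pl input.pdf output.txt
if (@ARGV == 2) {
    my @cmd = pdftotext_args(@ARGV);
    system(@cmd) == 0
        or die "pdftotext failed: status $?\n";
}
```

Checking the return value matters here, because `system` returns the exit status rather than dying when the tool is missing or the PDF cannot be read.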
Re^2: Convert PDF file into HTML file by ajguitarmaniac (Sexton) on Dec 22, 2010 at 12:44 UTC
Hi chrestomanci, I do not have a solution to the topic under discussion, but I have another question for you, since you seem to have sound knowledge of the intricate structure of PDF files. The moment I saw this question, call it reflex, I googled it and found a bunch of websites that claim to convert PDF files to any desired format (including HTML). But these websites claim that they can convert 'online PDFs' to HTML. Is there a difference between a regular PDF file and these 'online PDFs'? Pardon me if my question is extremely silly, but I really wanted to know, because a number of the sites I bumped into claim they can do the conversion under discussion. Thanks.
Re^3: Convert PDF file into HTML file by chrestomanci (Priest) on Dec 22, 2010 at 13:14 UTC
I did not think I was much of an expert on the internals of PDF. I had the insight to think of PDF as similar to PostScript, and from that explained why perfect conversion is not possible. Online PDFs are no different from normal PDFs; those websites are simply referring to PDF files that are already downloadable on the web, which makes their conversion tools simpler. I had a look at a few online converters, and they mostly appear to be demos for paid apps that convert to other formats. You can't download a free executable to do the conversion on your own computer; you have to use the online tool and see their ads. I also suspect that if you tried writing a script to use those online tools for bulk conversion, you would quickly find something preventing you, such as a CAPTCHA or a robots exclusion policy. In any case, as I said before, the conversion will never be perfect. For an example of how far from perfect a PDF to HTML conversion can be, just click on "view as html" when Google finds PDF files in a web search.
Re^4: Convert PDF file into HTML file by Anonymous Monk on Feb 09, 2011 at 09:37 UTC
Re^2: Convert PDF file into HTML file by bart (Canon) on Feb 08, 2011 at 12:10 UTC
Oh, yeah, part of the fun of working with text from PDF is that, in order to position the text nicely on the page for kerning (putting letters closer together to fill visual gaps between them) or justification (making spaces wider so the right side lines up with the margin), the PDF writer software may have cut the text up into small substrings and placed each one on the page individually. It's up to you to puzzle the pieces back together again. The text in a PDF very rarely comes as one chunk.
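Puzzling the pieces back together might look roughly like the sketch below: group fragments whose vertical positions are close into lines, order each line left to right, and insert a space only where the horizontal gap is wide enough. The fragment format (`x`, `y`, `width`, `text`), the name `join_fragments`, and both tolerances are assumptions for illustration; real extractors report positions in their own formats and units.

```perl
use strict;
use warnings;

# Hypothetical fragment format: { x, y, width, text }, with y measured
# from the top of the page.
sub join_fragments {
    my ($frags, %opt) = @_;
    my $y_tol = $opt{y_tol} // 2;     # same line if y differs by at most this
    my $gap   = $opt{gap}   // 1.5;   # wider horizontal gaps become a space

    # Group fragments into lines, top to bottom.
    my @lines;
    for my $f (sort { $a->{y} <=> $b->{y} } @$frags) {
        if (@lines && abs($f->{y} - $lines[-1][0]{y}) <= $y_tol) {
            push @{ $lines[-1] }, $f;
        }
        else {
            push @lines, [$f];
        }
    }

    # Within each line, join fragments left to right.
    my @out;
    for my $line (@lines) {
        my ($text, $end) = ('', undef);
        for my $f (sort { $a->{x} <=> $b->{x} } @$line) {
            $text .= ' ' if defined $end && $f->{x} - $end > $gap;
            $text .= $f->{text};
            $end = $f->{x} + $f->{width};
        }
        push @out, $text;
    }
    return @out;
}

my @text = join_fragments([
    { x => 10,   y => 100,   width => 20, text => 'Conv'  },
    { x => 46,   y => 100,   width => 18, text => 'PDF'   },
    { x => 30.2, y => 100.4, width => 12, text => 'ert'   },
    { x => 10,   y => 114,   width => 25, text => 'files' },
]);
print "$_\n" for @text;   # "Convert PDF", then "files"
```

Fixed tolerances like these fail on multi-column pages and mixed font sizes, so treat the numbers as starting points to tune per document.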
Re: Convert PDF file into HTML file by ww (Archbishop) on Dec 22, 2010 at 13:28 UTC
There's another possible complication beyond those enumerated in the excellent Re: Convert PDF file into HTML file. Some .pdf files are created by scanning text on paper ¹. The intermediate is an image, not unlike a .png, .jpg or .bmp. The resultant .pdf contains a picture of the text, not the ASCII or UTF or Kanji characters per se. And that, TTBOMK, leaves only the OCR option for retrieving the text as text.

Update: Addition below, for clarity:

¹ This is typical, for example, of low-cost home "MFC" and "all-in-one" printer-scanner-copiers, and of offices with limited, low-level IT knowledge and support. It comes from using the multi-function copiers that are now commonly replacing single-purpose xerographic copiers.
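One rough way to spot such image-only files before attempting a conversion is to run a text extractor first and check whether it found almost nothing. The sketch below works on text that has already been extracted (for example by pdftotext). The name `looks_scanned` and the threshold of 20 characters per page are arbitrary assumptions, and the page count relies on the extractor ending each page with a form feed, as pdftotext does by default.

```perl
use strict;
use warnings;

# Guess whether a PDF is image-only from the text extracted from it.
sub looks_scanned {
    my ($text, $min_chars_per_page) = @_;
    $min_chars_per_page //= 20;      # arbitrary threshold

    # Assumption: one form feed per page; fall back to a single page.
    my $pages = ($text =~ tr/\f//) || 1;

    # Count non-whitespace characters (form feeds count as whitespace).
    my $chars = () = $text =~ /\S/g;

    return $chars / $pages < $min_chars_per_page ? 1 : 0;
}

print looks_scanned("\f\f\f") ? "probably scanned\n" : "has a text layer\n";
```

A scanned PDF that has already been through OCR carries a hidden text layer and will pass this check, so a negative result says nothing about the quality of that text.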
Re: Convert PDF file into HTML file by oko1 (Deacon) on Dec 22, 2010 at 15:06 UTC
As has been pointed out in a number of the excellent replies here, there's no reliable automatic way to do it, because the information structures of PDF and HTML are incompatible. However, with a little human interaction and intelligence plugged into the system, it can be made to work (although it's not scalable). 'pdftotext -layout' will extract the text, and 'pdfimages' will get the images. Once you have those, structuring either (or both) into a reasonable HTML approximation is relatively simple, but does require some thought and a little artistic judgement. In the (narrow, specialized) case where you know that your PDFs are going to be nothing more than plain text, the process could be automated with "pdftotext -layout -htmlmeta file.pdf". This will produce an HTML file with a reasonable header and the content surrounded by 'pre' tags.

-- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
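If you would rather control the wrapper yourself, the same idea takes only a few lines of Perl: escape the characters that are special in HTML and put the layout-preserving text inside 'pre' tags. This is a sketch only; the function names are made up, and the markup is not meant to match what -htmlmeta emits.

```perl
use strict;
use warnings;

# Escape the three characters that are special in HTML text content.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must come first, or it re-escapes the others
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Wrap plain text (e.g. pdftotext -layout output) in a minimal HTML page.
sub text_to_html {
    my ($text, $title) = @_;
    return "<html>\n<head><title>" . escape_html($title) . "</title></head>\n"
         . "<body>\n<pre>\n" . escape_html($text) . "</pre>\n</body>\n</html>\n";
}

print text_to_html("Total   < 100\nR&D     > 200\n", 'Converted file');
```

Without the escaping step, any '<' or '&' in the extracted text would be read as markup and could silently swallow part of the page.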
Re: Convert PDF file into HTML file by LanX (Saint) on Dec 22, 2010 at 13:47 UTC
The answer depends heavily on the nature of your PDFs and the result you want! There is no simple answer to this general question, because a pure print format and a flowing format are different by nature and (as already mentioned) converting between them needs heuristics. This post lists some possibilities (especially pdftohtml -xml) and other related discussions: Parsing PDFs by text position?

Cheers Rolf