PerlMonks  

M$ Word -->HTML-->PS--> PDF

by g00n (Hermit)
on Jan 23, 2004 at 11:00 UTC ( [id://323508]=note )


in reply to Re: Converting M$ Word --> PDF
in thread Converting M$ Word --> PDF

You can get a similar result by exporting the MS Word docs to HTML, using html2ps to convert the file to PostScript, and then converting the PostScript to PDF with ps2pdf. Something like ...
  • export file to html using MSWord/OO to say file.doc->file.html
  • using cygwin on windows (or copy file to *nix sys)
  • perl /usr/bin/html2ps file.html > file.ps
  • ps2pdf file.ps
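
The last two steps above can be wrapped up in a small shell function. This is only a sketch: it assumes html2ps and ps2pdf (from Ghostscript) are on the PATH, and the names `out_name` and `html_to_pdf` are made up for illustration.

```shell
#!/bin/sh
# Sketch of the html -> ps -> pdf steps from the list above.
# Assumption: html2ps and ps2pdf are installed and on PATH.

# out_name FILE EXT: swap FILE's final extension for EXT (hypothetical helper).
out_name() {
    printf '%s.%s\n' "${1%.*}" "$2"
}

# html_to_pdf FILE.html: writes FILE.ps and FILE.pdf next to the input.
html_to_pdf() {
    html=$1
    ps=$(out_name "$html" ps)
    pdf=$(out_name "$html" pdf)

    # Bail out early if either converter is missing.
    for tool in html2ps ps2pdf; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 127
        }
    done

    # html2ps writes PostScript to stdout; ps2pdf takes input and output names.
    html2ps "$html" > "$ps" || return 1
    ps2pdf "$ps" "$pdf"
}
```

Calling `html_to_pdf file.html` should leave file.ps and file.pdf beside the original. The intermediate .ps file is kept on purpose so you can look at it when the PDF comes out wrong.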

If you have cygwin on an MS system this works OK (especially if you don't have access to a *nix box). The above suggestion works a treat if you have an OO/*nix combo.

It works for text, but I have not tried text/graphics or plain graphics. Has anyone had experience with graphics using this approach?

Replies are listed 'Best First'.
Re: M$ Word -->HTML-->PS--> PDF
by peterr (Scribe) on Jan 24, 2004 at 00:09 UTC
    Hi g00n,

    Thanks for your tips on how to go

    M$ Word -->HTML-->PS--> PDF

    If you have cygwin on a MS system this works OK (especially if you dont have access to a *nix). The above suggestion works a treat if you have OO/*nix combo.

    I do have cygwin installed on the Win box, but I also have shell access to the Linux box at the website. The fewer steps and the fewer 'box changes' the better. As in my reply to "neuroball", the 3 steps are the ideal situation, but the current Word doc (the catalogue) has tables, graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

    Peter

      the problem

        but the current Word doc (the catalogue) has tables, graphics and was 'built' with Word templates, so I have no idea how well it would all convert.

      The site that got me interested in PDF was Stas Bekman's site, www.stason.org. He gave a talk to the Melbourne PM last year. During his talk on mod_perl 2 he showed the notes from his site in html, with pdf downloads of the site.

      So I tried to re-create this html->ps->pdf chain so that I too could have a printable version of a project I'm working on called Ratpile, using perl+DBI+TT2. (Ratpile makes a directory that has *stuff* stored in it searchable by stuffing information about it into a relational database. Data mining, some may call it.) The template I created is a *bare bones* html page sans images. This is the technique Stas is using with his docset.

      The point I guess I'm trying to make is that I've used text only and not images. I've done a bit of research and this is what I've come up with...

      • graphics are supported in PostScript (3?)
      • others better than I (ybiC) have hacked together html->PS->PDF code that appears to handle images via html2ps, but not html tables (Create PostScript and PDF versions of all HTML files in given directory)
      • one approach could be to use Matt Sergeant's PDFLib (load_image method), an OO wrapper around pdflib from www.pdflib.com, but I seem to remember pdflib has licensing restrictions (use has to be open source, private or research)
      • or use Alfred Reibenschuh's Text::PDF::API, from which I found (via an old page) PDF-API2-0, which has some image handling capabilities (jpg, png)
      • logreport has an interesting set of observations about html->PDF generation, namely problems with html formatting and tables
      building html->PDF with images and troublesome html tables

      Now, given what we have found above, I would suggest the following (unless anyone has a better idea):

      • extract word document to html
      • extract table data (from the word document via OLE, or from the html via HTML-TableExtract; I like the latter better)
      • remove html tables in html documents
      • reinsert data into a simple table using <pre> tags for layout and html tags for bolding, emphasis. Or find some other method by experimentation in html for representing tables (text)
      • PDF-API2 as the PDF renderer. This can all be done in code.
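
The "reinsert data into a simple table using <pre> tags" step could look something like the following. It is a rough sketch, not what anyone in this thread actually ran: it assumes the table data has already been extracted to tab-separated rows and that the cell text is already HTML-safe (no bare < or &). The function name `tsv_to_pre` is invented for the example.

```shell
#!/bin/sh
# tsv_to_pre: read tab-separated rows on stdin and print them as
# space-padded columns inside a <pre> block (hypothetical helper).
# Assumption: cells contain no characters that need HTML escaping.
tsv_to_pre() {
    awk -F '\t' '
        {
            # First pass: remember each row and track the widest cell per column.
            rows[NR] = $0
            for (i = 1; i <= NF; i++)
                if (length($i) > width[i]) width[i] = length($i)
        }
        END {
            print "<pre>"
            for (r = 1; r <= NR; r++) {
                n = split(rows[r], cell, "\t")
                line = ""
                for (i = 1; i <= n; i++) {
                    # Left-justify each cell to its column width.
                    fmt = "%-" width[i] "s"
                    line = line sprintf(fmt, cell[i])
                    if (i < n) line = line "  "
                }
                sub(/ +$/, "", line)   # drop trailing padding
                print line
            }
            print "</pre>"
        }'
}
```

For example, `printf 'Item\tPrice\nWidget\t9.50\n' | tsv_to_pre` gives a two-line table with the columns lined up, which html2ps should render in a fixed-width font without needing any html table support.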

      The real problem may be rendering the tables generated from word. A complicated layout in word (re-rendered to html) will have to be converted to PostScript and then rendered to PDF, so the problem comes down to converting the html tables to pdf.

      It is not rocket science to create a bit of code to extract the data from the table and re-create the table using PDF-API2 (and its child modules).

      update: Perl Graphics Programming has 3 chapters devoted to PDF and perl, 1 specifically on PDF-API2.

      but is there a shortcut?

      Of course you could forget all the above and take your chances with Michael Frankl's HTML-HTMLDOC and convert your html files directly to PDF :)
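
As far as I know HTML-HTMLDOC is a Perl wrapper around the htmldoc program, so the shortcut can also be tried straight from the shell. A minimal sketch, assuming htmldoc is installed and on the PATH; the function name `html_to_pdf_direct` is made up, and how well this copes with Word-generated tables is untested here.

```shell
#!/bin/sh
# html_to_pdf_direct IN.html OUT.pdf: one-step html -> pdf using htmldoc.
# Assumption: the htmldoc command-line tool is on PATH.
html_to_pdf_direct() {
    command -v htmldoc >/dev/null 2>&1 || {
        echo "htmldoc not found on PATH" >&2
        return 127
    }
    # --webpage treats the input as a plain web page rather than a structured book;
    # -f names the output file.
    htmldoc --webpage -f "$2" "$1"
}
```

Usage would be `html_to_pdf_direct file.html file.pdf`.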

      credits

      damn I love cpan.

        Hi g00n

        update: Perl Graphics Programming has 3 chapters devoted to PDF and perl, 1 specifically on PDF-API2.

        I see there are some examples from this book here (I guess you live in Melb, so do I)

        Thanks everyone for your replies, I'm still trying to digest it all, it may take me a few days though, being an "old cogger". :)

        Peter
