PerlMonks  

Converting HTML to DOC

by gokuraku (Monk)
on Apr 08, 2010 at 16:14 UTC ( [id://833548]=perlquestion )

gokuraku has asked for the wisdom of the Perl Monks concerning the following question:

I have a feeling I already know the answer, but thought I would check anyway: has anyone figured out a way to convert HTML source to a Doc using Perl? This would mean laying out the page in the Doc as it is in HTML, with images (if any). I don't want to use the word "render" here, since it's only for later review and not printing. I usually think of printing when I think render.
I've seen a few programs that do this, but I thought there might be a Perl Module that can do a conversion, maybe with Mechanize or something like that. If anyone has any good leads on this I'd appreciate it.

Replies are listed 'Best First'.
Re: Converting HTML to DOC
by bobr (Monk) on Apr 09, 2010 at 08:27 UTC
    If you have Word installed, you can utilize Win32::OLE to make Word convert it for you.
    use strict;
    use warnings;
    use Win32::OLE;
    use Win32::OLE::Const 'Microsoft Word';
    use File::Spec;

    my ($input_html, $output_doc) = @ARGV;
    die "Usage: $0 input.html output.doc\n"
        unless defined $input_html && defined $output_doc;

    # Start Word in the background, without showing a window
    my $word = Win32::OLE->CreateObject('Word.Application')
        or die "Cannot start Word: " . Win32::OLE->LastError();
    $word->{'Visible'} = 0;

    # Open the HTML file (Word needs an absolute path)
    my $file = $word->Documents->Open({
        FileName           => File::Spec->rel2abs($input_html),
        Format             => wdOpenFormatWebPages,
        ConfirmConversions => 0,
        AddToRecentFiles   => 0,
        Revert             => 0,
        ReadOnly           => 1,
        OpenAndRepair      => 0,
    }) or die "Cannot open $input_html: " . Win32::OLE->LastError();

    # Save As
    $word->ActiveDocument->SaveAs({
        FileName   => File::Spec->rel2abs($output_doc),
        FileFormat => wdFormatDocument,
    });

    # Quit/Close
    $file->Close({SaveChanges => wdDoNotSaveChanges});
    $word->Quit({SaveChanges => wdDoNotSaveChanges});
    Of course Word's sense of HTML is somewhat limited, but mostly that works quite well.

    -- Roman

      Roman, this is great! It's what I was looking for, since my experience with most of the tools I found with Google was lacking and I wanted a nice scriptable option that I could feed files to from a PowerShell script.
      Thanks a lot!
Re: Converting HTML to DOC
by MidLifeXis (Monsignor) on Apr 08, 2010 at 16:42 UTC

    Load the HTML file directly into Word, and then save as .doc?

    My wife is walking for a cure for MS. Please consider supporting her.

Re: Converting HTML to DOC
by wazoox (Prior) on Apr 09, 2010 at 10:51 UTC
    Alternatively you could convert it to RTF. Word opens RTF with a .DOC extension just fine. There is a module that does exactly this: HTML::FormatRTF, lucky boy :)
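    A minimal sketch of the RTF route, assuming HTML::FormatRTF (from the HTML-Formatter distribution) is installed. The helper names rtf_doc_name and html_file_to_rtf are made up for illustration; format_file is the module's documented class method. The output is RTF text written under a .doc name, so it only carries what the formatter emits.

```perl
use strict;
use warnings;

# Hypothetical helper: "page.html" -> "page.doc" (the content will be RTF).
sub rtf_doc_name {
    my ($html_path) = @_;
    (my $out = $html_path) =~ s/\.html?\z//i;
    return "$out.doc";
}

# Convert one HTML file to RTF text.
# The module is loaded lazily, only when a conversion is requested.
sub html_file_to_rtf {
    my ($html_path) = @_;
    require HTML::FormatRTF;
    return HTML::FormatRTF->format_file($html_path);
}

# Usage: perl html2rtf.pl page.html   (writes page.doc)
if (@ARGV) {
    my $in  = $ARGV[0];
    my $out = rtf_doc_name($in);
    open my $fh, '>', $out or die "Cannot write $out: $!";
    print {$fh} html_file_to_rtf($in);
    close $fh or die "Cannot close $out: $!";
}
```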
      RTF doesn't save the page images though, which is what I need to do.
Re: Converting HTML to DOC
by LanX (Saint) on Apr 08, 2010 at 16:39 UTC
    > convert HTML source to a Doc using perl?

    Do you mean MS-Word .doc?

    Cheers Rolf

Re: Converting HTML to DOC
by gokuraku (Monk) on Apr 08, 2010 at 17:23 UTC
    Yes, to MSWord Doc format. I'm trying to figure out a way to script it since we are talking hundreds of files across a web site.
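    For the "hundreds of files" part, a rough sketch of how the input list could be built with core modules only. The helper names doc_name_for and find_html_files are hypothetical, and the assumption is that each .doc should land next to its source file.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Hypothetical helper: map "dir/page.html" (or .htm) to "dir/page.doc".
# Paths without an HTML extension are returned unchanged.
sub doc_name_for {
    my ($html_path) = @_;
    (my $doc_path = $html_path) =~ s/\.html?\z/.doc/i;
    return $doc_path;
}

# Collect every .html/.htm file below $root as absolute paths, sorted.
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                push @found, File::Spec->rel2abs($_)
                    if -f $_ && /\.html?\z/i;
            },
        },
        $root
    );
    return sort @found;
}

# Usage: perl list_html.pl site_dir
# Prints one "input<TAB>output" pair per line.
if (@ARGV) {
    for my $html (find_html_files($ARGV[0])) {
        print join("\t", $html, doc_name_for($html)), "\n";
    }
}
```

    Each pair could then be handed to whatever converter is chosen. If Word is driven through OLE, starting it once and looping over the pairs would likely be faster than launching it for every file.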
      Did you try googling "html to word" perl to see what fits your needs?

      IMHO there is no trivial answer to such a general question!

      Word-doc is effectively a print format like PDF, while HTML is a multi-device format. That means you have to decide what the print version should look like: where the page breaks are, font sizes, and so on.

      Furthermore, word-doc is (was???) a closed proprietary format; converting to RTF is much easier and better supported.

      If you just want the default formatting MS-Word produces, you should simply use its API and script the load and save-as.

      Cheers Rolf

      UPDATE: IIRC word-doc can embed HTML objects. And I wouldn't be surprised if IE has a feature to export word-doc, so that may be another approach to script.
