PerlMonks  

Rendering HTML / capturing pixels

by SpaceAce (Beadle)
on Feb 27, 2003 at 08:39 UTC ( [id://239028]=perlquestion )

SpaceAce has asked for the wisdom of the Perl Monks concerning the following question:

This is a slightly oddball question. I am not familiar with any Perl code to render HTML pages (render as in "draw", not as in "create"), but with so comprehensive a language, it seems likely that such code already exists.

I am familiar with certain Perl libraries for the manipulation of image data (mostly ImageMagick/PerlMagick).

Now, the idea I have involves turning HTML into images. I am not dealing with especially fancy pages here, just basics like tables and images. I would like to process an HTML page and basically grab a "screenshot" of it. Is this ridiculously farfetched? I can already see hundreds of complications, but if anyone knows where I might even get a start at this, I'd appreciate it.

SpaceAce
s>>sp>;s>..|>\u$&ace>g;print;

Replies are listed 'Best First'.
Re: Rendering HTML / capturing pixels
by Corion (Patriarch) on Feb 27, 2003 at 08:54 UTC

    Rendering HTML is far from "easy", especially with the "simple" things like tables and images. You might find some inspiration in the converters that turn HTML into PostScript and/or (La)TeX. For the actual rendering, you will also have to consider CSS and the like.

    Under Win32, there are two relatively easy ways to capture the image of a webpage: either you automate Internet Explorer to display the HTML and then take a screenshot, or you automate Internet Explorer to print the page to a file and then postprocess that file.

    Under Unix, printing to a file is the only way I see, and there is no way of automating a browser as nice as the one under Win32. You might be able to write some XS glue to automate one of the rendering engines (KHTML, Gecko), but that's not "easy" per se (IMO).

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      Some (all? most?) versions of *nix Netscape allow remote control. You start netscape with the "-remote" option. You could probably generate PostScript as Corion suggests with the commands openURL() and saveAs(). I have not tried that particular combination. See this for more information.

      Another option would be to get the Mozilla source and modify it directly or see if something in the source allows what you want.

      Finally, building on what PodMaster said, there is a tkHTML widget here, but I do not know if there is a perl binding, yet. I have not played with it at all.

      HTH, --traveler
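The "-remote" approach could be sketched roughly as follows. This is an untested sketch, not a confirmed recipe: it assumes a Netscape instance is already running in an X session, and the "PostScript" type argument to saveAs() is an assumption that should be checked against the Netscape remote-control documentation. The function name page_to_ps and the NETSCAPE and LOAD_WAIT variables are hypothetical names made up for this example.

```sh
#!/bin/sh
# Sketch: ask a running Netscape to load a URL, then save it as PostScript.
# page_to_ps, NETSCAPE and LOAD_WAIT are hypothetical names for this example.

page_to_ps() {
    url=$1                        # page to load
    out=$2                        # PostScript file to write
    ns=${NETSCAPE:-netscape}      # browser command, overridable for testing

    # Tell the running browser to open the page.
    "$ns" -remote "openURL($url)" || return 1

    # Crude wait for the page to finish loading; -remote returns immediately.
    sleep "${LOAD_WAIT:-5}"

    # Ask the browser to save what it is showing (type argument is an assumption).
    "$ns" -remote "saveAs($out,PostScript)"
}
```

Calling page_to_ps http://www.perlmonks.org/ /tmp/pm.ps would then issue the two remote commands in order. The fixed sleep is the weak point: nothing here tells the script when the page has actually finished loading.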

        Thanks for the information and links. I will look into the Netscape angle.

        SpaceAce

         If it's just plain text formatting then 'links --dump' might be a way to go.

         I guess it depends upon what the motivation for this is. If it is supposed to be used as a CGI script, for example, there might not be an X session running for the graphical browser to use.

        Steve
        ---
        steve.org.uk
      I am not overly concerned with the task being "easy". After all, the easy ones are usually the least interesting :)

      I had already considered browser automation, but I would prefer to make the program as standalone as possible. If I have to depend on a browser to do it, I will probably try to work with a *nix version of Netscape as opposed to going for a Win32 solution.

      SpaceAce

Re: Rendering HTML / capturing pixels
by PodMaster (Abbot) on Feb 27, 2003 at 09:05 UTC
    Basics like tables and images? That's complex enough ;) You can do it (for the most part) using Wx and/or Tk. You'd be better off using OLE Automation if you can (if you're on win32).

    WxBrowser - a wxPerl HTML Browser
    Re: capture what's on the screen
    http://search.cpan.org/author/NI-S/Tk-HTML-3.002/

    Another idea that might work is to embed perl into mozilla (there was a recent node about it, something about XUL), and let mozilla render it, and then take a screenshot. ( probably won't work, at least not using XUL )


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      There is a renderer that I have used...subset of gtk

      use Gtk::XmHTML;

      Should work...(tested under Debian)


      When cryptography is outlawed...I will still be using it.
        Thank you, I will definitely have a peek at that.

        SpaceAce

      By "basics" I just meant that I won't be dealing with Javascript, CSS or any other extensions/additions/dynamic situations. I realize tables and images are not really "simple" :)

      Win32 is not out of the question, but I prefer Linux for any kind of development, and especially for Perl. Unfortunately, for the last several versions of wxWindows and wxPerl I have not been able to successfully complete an installation. Even if I jostle things around and get everything installed and operational, any wxPerl program I write tends to crash with a segfault, even "Hello, world". Perhaps it is time to try again.

      Thanks for the link and the ideas.

      SpaceAce

Re: Rendering HTML / capturing pixels
by hiseldl (Priest) on Feb 27, 2003 at 16:47 UTC

    You could convert your HTML to PostScript via HTMLDOC (GPL) and then use Ghostscript.pm (Perl API for Ghostscript) to convert it to a PPM. Then convert your PPM to a GIF, which can then be loaded into Image::Magick.

    Here is a shell script showing how Ghostscript converts a PostScript file to a PPM on the command line; you could probably simulate these actions using Ghostscript.pm:

    #! /bin/sh
    # pstogif
    #
    # Call it by putting the .ps file name as first argument
    # but without the ".ps" extension.
    # Ex: for "Intro_Tbl.ps" use "pstogif Intro_Tbl"
    #
    gs -r72x72 -sDEVICE=ppmraw -sOutputFile=$1.ppm << endinput
    ($1.ps) run
    endinput
    pnmcrop < $1.ppm | ppmtogif > $1.gif
    ...This requires both GhostScript and pbmplus to work.
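The whole chain described above (HTML to PostScript with HTMLDOC, PostScript to PPM with Ghostscript, PPM to GIF with pbmplus) might be strung together along these lines. This is a minimal sketch, assuming htmldoc, gs, pnmcrop and ppmtogif are all on the PATH; the function names run and html2gif and the DRY_RUN switch are hypothetical and exist only for this example. It passes the file to gs as an argument with -dNOPAUSE -dBATCH rather than feeding "run" on stdin as the script above does.

```sh
#!/bin/sh
# Sketch: HTML -> PostScript -> PPM -> GIF.
# run, html2gif and DRY_RUN are hypothetical names for this example.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

html2gif() {
    base=$1    # file name without the ".html" extension

    # Step 1: HTMLDOC renders the HTML page to PostScript.
    run htmldoc --webpage -t ps -f "$base.ps" "$base.html" || return 1

    # Step 2: Ghostscript rasterizes the PostScript at 72 dpi into a raw PPM.
    run gs -q -dNOPAUSE -dBATCH -r72x72 -sDEVICE=ppmraw \
        -sOutputFile="$base.ppm" "$base.ps" || return 1

    # Step 3: crop the surrounding border and convert the PPM to a GIF.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "pnmcrop < $base.ppm | ppmtogif > $base.gif"
    else
        pnmcrop < "$base.ppm" | ppmtogif > "$base.gif"
    fi
}
```

For a file named Intro_Tbl.html, calling html2gif Intro_Tbl would leave Intro_Tbl.gif next to it; setting DRY_RUN=1 first only prints the three commands. A page that spans several PostScript pages, or an image with more than 256 colors, would probably need extra handling that this sketch leaves out.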

    HTH. :-)

    --
    hiseldl
    What time is it? It's Camel Time!

      Thank you :) I like the sound of this and I'm going to check it out. This might be the right solution for what I have in mind.

      A general thanks to everybody for the helpful suggestions.

      SpaceAce

        If you are using KDE, you can use 'kwebdesktop' to capture an image of a website. For example:

        % kwebdesktop 800 600 perlmonks.png http://www.perlmonks.org/

Node Type: perlquestion [id://239028]
Approved by Corion
Front-paged by broquaint