PerlMonks  

Looking for a XPATH-like tool for HTML documents

by xern (Beadle)
on Aug 22, 2005 at 10:52 UTC ( [id://485655]=perlquestion )

xern has asked for the wisdom of the Perl Monks concerning the following question:

I wonder if there's an xpath-like tool that can locate and extract data from any HTML document. The tool, call it 'hpath', would act like xpath. Its interface would be almost the same as that of xpath, except for node information.

hpath some.html /html/body/p[1]

It would then return all of the data under the node /html/body/p[1].

I guess something like this already exists and I've just missed it.

Thanks

Re: Looking for a XPATH-like tool for HTML documents
by davorg (Chancellor) on Aug 22, 2005 at 11:59 UTC

    Use htmltidy to turn your HTML into XHTML. Then just use XPath.

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      That was the thing I forgot to mention in my node :)

      I just scribbled up an 'hpath' command that pipes tidy into xpath.

      #!/bin/sh
      # hpath foo.html /path
      # Convert the HTML to XHTML with tidy, then query the result with xpath.
      tidy -asxhtml -numeric "$1" 2>/dev/null | xpath "$2"

      Hope it helps. Thanks

Re: Looking for a XPATH-like tool for HTML documents
by inman (Curate) on Aug 22, 2005 at 11:17 UTC
    Look at HTML::Tree. I don't think this gives you an XPath approach out of the box, but it's not far off in its representation of an HTML page as a tree. You should be able to scan the documents pretty easily.
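
    As a rough illustration of the tree approach, here is a minimal sketch using HTML::TreeBuilder (part of the HTML::Tree distribution). The helper name first_paragraph_text and the sample markup are made up for this example. It walks the tree by hand to reach roughly what /html/body/p[1] would select, which is not real XPath evaluation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first <p> that is a direct
# child of <body>, a hand-rolled stand-in for /html/body/p[1].
sub first_paragraph_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down in scalar context returns the first matching element.
    my $body = $tree->look_down(_tag => 'body');

    # content_list mixes child elements and plain text strings;
    # keep only the element children whose tag is 'p'.
    my ($first_p) = grep { ref $_ && $_->tag eq 'p' } $body->content_list;

    my $text = defined $first_p ? $first_p->as_text : undef;
    $tree->delete;    # release the tree explicitly
    return $text;
}

# Sample input invented for illustration.
my $html = <<'HTML';
<html><head><title>demo</title></head>
<body><p>first paragraph</p><p>second <b>paragraph</b></p></body></html>
HTML

print first_paragraph_text($html), "\n";
```

    Longer paths need one such step per path segment, which is why a parser with real XPath support is more convenient for anything beyond simple lookups.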
Re: Looking for a XPATH-like tool for HTML documents
by tomhukins (Curate) on Aug 22, 2005 at 12:31 UTC
    Use XML::LibXML in recover mode as described in XML::LibXML::Parser's documentation. This can deal with anything from horribly malformed pseudo-HTML to valid HTML with a DTD.
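
    A minimal sketch of that idea, assuming a reasonably recent XML::LibXML that provides the load_html constructor (older releases set recover on a parser object instead). The helper name hpath_nodes and the sample markup are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly malformed HTML in recover mode and
# return the serialized nodes matching an XPath expression.
sub hpath_nodes {
    my ($html, $xpath) = @_;
    my $doc = XML::LibXML->load_html(
        string            => $html,
        recover           => 2,    # recover from errors without reporting them
        suppress_errors   => 1,
        suppress_warnings => 1,
    );
    return map { $_->toString } $doc->findnodes($xpath);
}

# Deliberately sloppy sample input: unclosed <p> and <b> tags.
my $html = '<html><body><p>first<p>second <b>bold</body></html>';

print "$_\n" for hpath_nodes($html, '/html/body/p[1]');
```

    To read from a file as in the 'hpath some.html /html/body/p[1]' usage from the question, pass location => $file instead of string => $html.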
Re: Looking for a XPATH-like tool for HTML documents
by neniro (Priest) on Aug 22, 2005 at 11:20 UTC
Re: Looking for a XPATH-like tool for HTML documents
by merlyn (Sage) on Aug 22, 2005 at 15:50 UTC
