PerlMonks  

HTML::LinkExtor idiosyncracy

by RandomWalk (Beadle)
on Apr 22, 2005 at 23:37 UTC ( [id://450610]=perlquestion )

RandomWalk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am feeding a Yahoo directory page to HTML::LinkExtor. This program extracts

<a href="http://rds.yahoo.com/S=10341:D2/CS=10341/SS=53744154/*http://www.beltbuckleshop.com/">
as expected. OTOH, it truncates <a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/> to "http://www.beltbuckleshop.com", even substituting ":" for "%3A"!

I'm following "Google Hack #44" slavishly.

I *think* this is the nub of the problem, so I've not included more info. Anyone have experience with this module? Could it be the lack of quotation marks about the "href" value?

Thanks.

Replies are listed 'Best First'.
Re: HTML::LinkExtor idiosyncracy
by ikegami (Patriarch) on Apr 22, 2005 at 23:46 UTC

    Substituting ":" for "%3A" is perfectly acceptable according to RFC 1738. Moreover, typing the URL into a browser demonstrates that Yahoo! can handle the escaped ":".
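
To make the equivalence concrete, here is a minimal sketch in core Perl. The helper name `percent_decode` is made up for illustration and is not part of any module discussed in this thread. It decodes the escaped form and compares it with the literal one:

```perl
use strict;
use warnings;

# Hypothetical helper: replace each %XX escape with the character it encodes.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $escaped = 'http%3A//www.beltbuckleshop.com/';
my $literal = 'http://www.beltbuckleshop.com/';

print percent_decode($escaped) eq $literal ? "same\n" : "different\n";
```

In real code the `uri_unescape` function from URI::Escape would be the usual choice; the regex is used here only to keep the sketch free of non-core dependencies.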

    The second snippet is not valid HTML (although Firefox and Internet Explorer DWIM). HTML does allow you to omit the quotes under some circumstances. However, this isn't one of those circumstances. The HTML4 specification states:

    By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

    In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
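
The rule quoted above can be checked mechanically. The following is only a sketch: the function name `may_be_unquoted` is invented for this example, and the character class is a direct transcription of the list in the HTML 4 text. It shows which characters in the Yahoo! href fall outside the permitted set:

```perl
use strict;
use warnings;

# HTML 4 lets a value go unquoted only if it consists solely of
# letters, digits, hyphens, periods, underscores, and colons.
sub may_be_unquoted {
    my ($value) = @_;
    return $value =~ /\A[A-Za-z0-9._:-]+\z/ ? 1 : 0;
}

my $href = 'http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/';

# Collect the distinct characters that fall outside the permitted set.
my %seen;
my @offenders = grep { !$seen{$_}++ } $href =~ /([^A-Za-z0-9._:-])/g;

print "may be unquoted: ", (may_be_unquoted($href) ? "yes" : "no"), "\n";
print "offending characters: @offenders\n";
```

Even a plain URL contains "/", so almost no href value qualifies for the unquoted form under this rule.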

    Furthermore, XML documents (including XHTML documents) require quotes around the values of every attribute, no exception. Maybe the problem is that you have an XHTML document, and the "/" is being interpreted as part of tag closer "/>"?

    I can't reproduce the second problem with HTML::LinkExtor 1.31 (the one that came with ActivePerl 5.6.1):

Re: HTML::LinkExtor idiosyncracy
by tlm (Prior) on Apr 22, 2005 at 23:56 UTC

    Seems to work fine for me:

      DB<1> use HTML::LinkExtor
      DB<2> p $HTML::LinkExtor::VERSION
    1.33
      DB<3> $s = '<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>'
      DB<4> $e = HTML::LinkExtor->new
      DB<5> $e->parse($s)
      DB<6> x $e->links
    0  ARRAY(0x84a3f78)
       0  'a'
       1  'href'
       2  'http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/'
    ...as expected.
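
For anyone who prefers a script to a debugger session, the same check might look roughly like this. It is a sketch, assuming HTML::LinkExtor (from the HTML-Parser distribution) is installed; the variable names are arbitrary. No base URL is passed, so the links come back as plain strings:

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # part of the HTML-Parser distribution, not core Perl

my $html = '<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>';

# With no callback, the parser accumulates links for later retrieval.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

# Each entry is an array ref: [ tag, attribute name => URL, ... ].
my @links = $extor->links;
for my $link (@links) {
    my ($tag, %attr) = @$link;
    print "$tag $_=$attr{$_}\n" for sort keys %attr;
}
```

If the printed href still contains "%3A", the truncation is happening somewhere other than in the link extraction step.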

    the lowliest monk

Re: HTML::LinkExtor idiosyncracy
by eibwen (Friar) on Apr 23, 2005 at 13:12 UTC
    As you've no doubt discovered, http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/* is a redirection page. It is conceivable that HTML::LinkExtor recognizes this and returns the referent, thereby transmuting the URL:

    http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/   # URL
    http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*                                   # Referer
    http://www.beltbuckleshop.com/                                                                         # Referent

    Additionally, note that the referent does not contain %3A, as the ":" following the protocol appears to be mandatory (at least in Firefox).
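
The split described above can be reproduced by hand in core Perl. This is a sketch of what some post-processing step might be doing, not a claim about HTML::LinkExtor's behavior; it assumes the redirect target is everything after the first "/*":

```perl
use strict;
use warnings;

my $url = 'http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/';

# Assumption: the redirect target is everything after the first "/*".
my ($redirector, $target) = $url =~ m{\A(.*?/\*)(.*)\z}s;

# Undo percent-escapes such as %3A in the target.
$target =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

print "redirector: $redirector\n";
print "target:     $target\n";
```

If the original program contains a similar regex or unescape step after the extraction (for instance, one copied from the book's example), that step would account for both the truncation and the ":" appearing in place of "%3A".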
