PerlMonks  

Re: HTML::LinkExtor idiosyncracy

by ikegami (Patriarch)
on Apr 22, 2005 at 23:46 UTC ( [id://450611]=note )


in reply to HTML::LinkExtor idiosyncracy

Substituting ":" for "%3A" is perfectly acceptable according to RFC1738. Not only that, but typing the URL in the browser demonstrates that Yahoo! can handle the escaped ":".
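As a rough illustration of why the two spellings are interchangeable here, the sketch below decodes the escape by hand. `percent_decode` is a hypothetical helper written for this example, not part of HTML::LinkExtor, and it only shows the decoding step, not every rule in the RFC.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): turn each %XX escape
# back into the character it stands for.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# The tail of the URL from the question, with and without the escape.
my $escaped   = 'http%3A//www.beltbuckleshop.com/';
my $unescaped = 'http://www.beltbuckleshop.com/';

print(percent_decode($escaped) eq $unescaped ? "same" : "different", "\n");
```

In real code, the `uri_unescape` function from URI::Escape would be the usual choice; the hand-rolled version just keeps the example free of non-core modules.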

The second snippet is not valid HTML (although Firefox and Internet Explorer DWIM). HTML does allow you to omit the quotes under some circumstances. However, this isn't one of those circumstances. The HTML4 specification states:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
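To make that rule concrete for the URL in question, here is a minimal sketch that tests a value against the character set the specification lists. `may_omit_quotes` is a hypothetical name chosen for this example; it is not part of HTML::LinkExtor or HTML::Parser, and treating an empty value as not allowed is my assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the value uses only the characters HTML4
# allows in an unquoted attribute value (letters, digits, hyphens,
# periods, underscores and colons).
sub may_omit_quotes {
    my ($value) = @_;
    return $value =~ /\A[A-Za-z0-9._:-]+\z/ ? 1 : 0;
}

my $url = 'http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/';

print "lang=en:  ", (may_omit_quotes('en') ? "quotes optional" : "quotes required"), "\n";
print "href=URL: ", (may_omit_quotes($url) ? "quotes optional" : "quotes required"), "\n";

# Collect the distinct characters in the URL that fall outside the allowed set.
my %offending;
$offending{$_}++ for $url =~ /[^A-Za-z0-9._:-]/g;
print "offending characters: ", join(' ', sort keys %offending), "\n";
```

The URL fails the check because of the "/", "=", "*" and "%" characters, which is why the href needs quotes.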

Furthermore, XML documents (including XHTML documents) require quotes around the value of every attribute, without exception. Maybe the problem is that you have an XHTML document, and the "/" is being interpreted as part of the tag closer "/>"?

I can't reproduce the second problem with HTML::LinkExtor 1.31 (the one that came with ActivePerl 5.6.1):

use HTML::LinkExtor ();

{
   my $p = HTML::LinkExtor->new();
   $p->parse(<<'__EOI__');
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>
</body>
</html>
__EOI__
   my @links = $p->links();
   print($links[0][2], $/);
}

{
   my $p = HTML::LinkExtor->new();
   $p->parse(<<'__EOI__');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>
</body>
</html>
__EOI__
   my @links = $p->links();
   print($links[0][2], $/);
}

__END__

output
======
http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/
http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/
