Substituting ":" for "%3A" is perfectly acceptable according to RFC 1738. Not only that, but typing the URL into a browser demonstrates that Yahoo! can handle the escaped ":".
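To make the escaping concrete: "%3A" is simply the percent-encoded form of ":" (hex 3A is ASCII decimal 58). The following is a minimal sketch using only core Perl; `unescape_url` is a hypothetical helper written for this illustration, not part of HTML::LinkExtor, and it assumes the input contains only well-formed %XX escapes.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for this example, not from any module):
# replace every %XX escape with the character whose code is hex XX.
sub unescape_url {
    my ($url) = @_;
    (my $decoded = $url) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $decoded;
}

my $escaped = 'http%3A//www.beltbuckleshop.com/';
my $plain   = 'http://www.beltbuckleshop.com/';

my $decoded = unescape_url($escaped);
print "$decoded\n";
print $decoded eq $plain ? "same URL\n" : "different URL\n";
```

If non-core modules are acceptable, `uri_unescape` from URI::Escape does the same job more robustly.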

The second snippet is not valid HTML (although Firefox and Internet Explorer do what you mean anyway). HTML does allow you to omit the quotes in some circumstances, but this isn't one of them. The HTML 4 specification states:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
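The rule quoted above can be turned into a quick check. This is only a sketch of that one sentence of the specification, not a validator; `may_omit_quotes` is a hypothetical name chosen for the example, and the href is the one from the snippet under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $value contains only the characters HTML 4
# permits in an unquoted attribute value (letters, digits, hyphens,
# periods, underscores and colons).
sub may_omit_quotes {
    my ($value) = @_;
    return $value =~ /\A[A-Za-z0-9._:-]+\z/ ? 1 : 0;
}

my $href = 'http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/'
         . 'SIG=112eblhep/*http%3A//www.beltbuckleshop.com/';

for my $value ('en', $href) {
    my $verdict = may_omit_quotes($value) ? 'quotes optional' : 'quotes required';
    print "$verdict: $value\n";
}

# Collect the distinct characters that force quoting of the href.
my %offending;
$offending{$_} = 1 for $href =~ /[^A-Za-z0-9._:-]/g;
print 'offending characters: ', join(' ', sort keys %offending), "\n";
```

The href fails the check because it contains "/", "=", "*" and "%", none of which are on the permitted list, so it must be quoted to be valid HTML.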

Furthermore, XML documents (including XHTML documents) require quotes around the values of every attribute, no exceptions. Maybe the problem is that you have an XHTML document, and the "/" is being interpreted as part of the tag closer "/>"?

I can't reproduce the second problem with HTML::LinkExtor 1.31 (the one that came with ActivePerl 5.6.1):

use HTML::LinkExtor ();

# Case 1: XHTML document with an unquoted href.
{
   my $p = HTML::LinkExtor->new();
   $p->parse(<<'__EOI__');
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>
</body>
</html>
__EOI__
   my @links = $p->links();
   # Each link is [ tag, attribute name, attribute value ]; print the value.
   print($links[0][2], $/);
}

# Case 2: HTML 4.0 document with the same unquoted href.
{
   my $p = HTML::LinkExtor->new();
   $p->parse(<<'__EOI__');
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<a href=http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/>
</body>
</html>
__EOI__
   my @links = $p->links();
   print($links[0][2], $/);
}

__END__
output
======
http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/
http://rds.yahoo.com/S=10341:D1/CS=10341/SS=53744154/SIG=112eblhep/*http%3A//www.beltbuckleshop.com/

In reply to Re: HTML::LinkExtor idiosyncracy by ikegami
in thread HTML::LinkExtor idiosyncracy by RandomWalk
