Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I've been struggling with HTML::LinkExtor for a while now, thanks to everyone who pointed me in its direction.

You can see the full documentation at http://search.cpan.org/author/GAAS/HTML-Parser-3.26/lib/HTML/LinkExtor.pm if you want to, but I'm just wondering what prompted the author to return the results as an array with a hash as the second item?

Here's the relevant section:

$p->links Returns a list of all links found in the document. The returned values will be anonymous arrays with the follwing [sic] elements: [$tag, $attr => $url1, $attr2 => $url2,...]

It's kind of confusing me. For a start, if it's a hash, shouldn't that be

[$tag, {$attr => $url1, $attr2 => $url2,...}]
instead?

And more to the point, I'm racking my not-inconsiderable knowledge of HTML to try and find a situation where a single tag could have two or more attributes which were links.

Apart from anything else, this structure leads to scary dereferencing being needed like this:

$p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
sub cb {
    my($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}

Maybe

@{[%links]}
isn't scary to you but it is to me...
--
($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Re: Why Would HTML::LinkExtor return a hash of attributes?
by PodMaster (Abbot) on Aug 19, 2002 at 05:24 UTC
    "what prompted the author to return the results as an array with a hash as the second item? "

    Where do you get that from?

    "It's kind of confusing me. For a start, if it's a hash, shouldn't that be ..."

    It's not a hash, where do you get hash from?

    And why is print "$tag @{[%links]}\n"; scary to you?

    More than anything it's kind of silly to me, cause all that sub needs to be is  print "@_\n";

    update:
    "And more to the point, I'm racking my not-inconsiderable knowledge of HTML to try and find a situation where a single tag could have two or more attributes which were links.".

    AFAIK, no attributes are ever "links". Duplicate SRC attributes wouldn't be valid HTML, and one of the 2 would be ignored. It's like this, if anyone writing HTML wants anybody to somewhat accurately interpret it, well, he's gotta write valid HTML, right? (right)

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      Isn't it a hash?

      Why does it have "key => val, key => val" if it isn't?

      Plus, if it isn't, why does the sub grab it as:

      my($tag, %links) = @_;
      if it isn't a hash?

      "all that sub needs to be is print "@_\n""

      Try it. You get both the "HREF" and the thing it's a link to. I don't want to extract "HREF" 5,000 times, do I?
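
      For readers following along, here is a minimal sketch of a callback that keeps only the URLs and drops the attribute names. The sub name urls_only is hypothetical, and the call at the bottom only simulates the arguments HTML::LinkExtor passes to a callback (the tag followed by attribute/URL pairs), so nothing here needs the module installed.

```perl
use strict;
use warnings;

# Hypothetical callback with the same signature as the one in the question.
# It returns only the URLs and drops the attribute names.
sub urls_only {
    my ($tag, %links) = @_;      # everything after $tag is read as key/value pairs
    return sort values %links;   # sorted, because hash order is not stable
}

# Simulate the arguments passed for <a href="http://www.perl.org/">
my @urls = urls_only('a', href => 'http://www.perl.org/');
print "@urls\n";
```

      Copying the pairs into %links is what makes values available; the arguments themselves are still just a flat list.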
        => is a fancy comma (,). It allows you to say
        print I => AM => NOT => QUOTING => WORDS => TIMES => 5;
        which would print IAMNOTQUOTINGWORDSTIMES5

        Using this fancy comma doesn't make a hash. A hash is a "data structure", and I => AM => NOT => QUOTING => WORDS => TIMES => 5 is a list.

        Now you can do many things with lists. You can create arrays ( also data structures )

        my @ARRAY = ( I => AM => NOT => QUOTING => WORDS => TIMES => 5 );
        and you can create hashes
        my %HASH = ( I => AM => NOT => QUOTING => WORDS => TIMES => 5 );
        Do you follow now?
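
        A small sketch of the same point, using attribute-style data instead of the word list. The values are made up for illustration. The same four-element list becomes an array of four items or a hash of two key/value pairs, depending only on what it is assigned to.

```perl
use strict;
use warnings;

# One list, written twice with the fancy comma, stored two different ways.
my @array = (href => 'a.html', src => 'b.png');   # four elements, order kept
my %hash  = (href => 'a.html', src => 'b.png');   # two key/value pairs

print scalar(@array), "\n";        # number of array elements
print scalar(keys %hash), "\n";    # number of hash keys
print "$array[0] $array[1]\n";     # positional access
print "$hash{href}\n";             # keyed access
```

        This is why my($tag, %links) = @_; works in the callback: the flat list is only treated as a hash at the moment it is assigned to one.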


Re: Why Would HTML::LinkExtor return a hash of attributes?
by Arien (Pilgrim) on Aug 19, 2002 at 05:50 UTC
    I'm racking my not-inconsiderable knowledge of HTML to try and find a situation where a single tag could have two or more attributes which were links.

    Well, <object> has the attributes classid, codebase, data, and archive (a space-separated list of URIs). And even <img> could have multiple attributes that link: src, longdesc, and usemap.

    — Arien
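
    To make the multi-attribute case concrete, here is a sketch that walks a hand-written structure in the shape the documentation gives for $p->links, that is [$tag, $attr => $url, ...]. The sample data and URLs are invented for illustration and are not output from the module.

```perl
use strict;
use warnings;

# Hand-written sample in the documented shape [$tag, $attr => $url, ...].
# The URLs are made up for illustration.
my @links = (
    [ a   => href => 'index.html' ],
    [ img => src => 'photo.png', longdesc => 'photo.html', usemap => '#map' ],
);

my @report;
for my $link (@links) {
    my ($tag, @pairs) = @$link;     # the first element is the tag name
    # Take the remaining elements two at a time: attribute name, then URL.
    while (my ($attr, $url) = splice(@pairs, 0, 2)) {
        push @report, "$tag $attr $url";
    }
}
print "$_\n" for @report;
```

    Walking the pairs with splice keeps the attributes in the order they were listed, which copying them into a hash would not guarantee.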

      <img> could have multiple attributes that link: src, longdesc, and usemap.
      Aha!

      Now that makes sense. I bet I could have figured it out if I'd thought a bit longer. I'm so lazy.

      Thanks for your help, podmaster, but surely the guy is intending it to be used as a hash?

      One useful attribute of it being a hash would be to clobber incorrect HTML where a link had two HREFs.
        "but surely the guy is intending it to be used as a hash?"

        I'm not a mindreader. HTML::LinkExtor is a pretty mature module, and I doubt the interface should/will change.

        You can certainly try to persuade the guy (perlsonally I'd rather just write HTML::LinkExtractor, which would do all the things you say here, but would also extract the link text, i.e. the stuff in between <a ..> </a> tags).

        " One useful attribute of it being a hash would be to clobber incorrect HTML where a link had two HREFs. "

        You don't have to worry about that (when in doubt, test).

        use HTML::LinkExtor;
        my $p = new HTML::LinkExtor( sub { print "@_\n" }, );
        $p->parse( q{
            <a href="BUTTER" href="SCOTCH">
            <img src="AND" src="PEANUTS">
        });
        __END__
        a href SCOTCH
        img src PEANUTS
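
        Separately from what the parser does with repeated attributes, the "clobbering" that a hash would give can be checked in plain Perl. This is a minimal sketch with a hand-written list that reuses the values from the test. It does not call HTML::LinkExtor at all.

```perl
use strict;
use warnings;

# A flat list with the same key twice, as a duplicated href would produce
# if both values were reported.
my @pairs = (href => 'BUTTER', href => 'SCOTCH');

# Assigning the list to a hash keeps only the last value for each key.
my %links = @pairs;

print scalar(@pairs), "\n";        # the list still has all four elements
print scalar(keys %links), "\n";   # the hash has a single key
print "$links{href}\n";            # the later value wins
```

        So a callback that copies its arguments into a hash would see one href either way, whether or not the parser had already discarded the first one.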
