hmerrill has asked for the wisdom of the Perl Monks concerning the following question:

I have a file containing a bunch of links like:
<A HREF=?ad=049>One</A> <A HREF=?ad=050>Two</A> etc.
and I'm trying to figure out how to get the "link" part (?ad=049) *AND* the content part (One). I found Recipe 20.3 in the Perl Cookbook for "Extracting URLs", and I've been able to use HTML::LinkExtor to get the *link*, but not the content. Can someone please clue me in ;-/

TIA.

Replies are listed 'Best First'.
Re: extracting link *and* tag content from "a href"
by davido (Cardinal) on Jul 19, 2004 at 19:08 UTC

    This example is from the documentation for HTML::TokeParser:

    use HTML::TokeParser;
    $p = HTML::TokeParser->new(shift||"index.html");

    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        print "$url\t$text\n";
    }

    And what it does is, "...extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <A>...</A> tags..."
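
    As a minimal sketch of the same approach applied to the markup from the question: the variant below reads from a string rather than a file and collects the pairs in an array. The variable names ($html, @links) are illustrative and not part of the module's documentation, and it assumes HTML::TokeParser is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample input copied from the original question (unquoted attribute values).
my $html = '<A HREF=?ad=049>One</A> <A HREF=?ad=050>Two</A>';

# Passing a reference to a scalar makes the parser read from that string
# instead of opening a file.
my $p = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my @links;    # each element is [href, text]
while (my $token = $p->get_tag("a")) {
    # The parser lowercases attribute names, so HREF is found under 'href'.
    my $url = $token->[1]{href};
    next unless defined $url;    # skip anchors that have no href
    my $text = $p->get_trimmed_text("/a");
    push @links, [$url, $text];
}

print "$_->[0]\t$_->[1]\n" for @links;
```

    Collecting the pairs in @links instead of printing them inside the loop keeps the extraction separate from whatever is done with the results afterwards.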


    Dave

      Thanks - that's exactly what I need.
Re: extracting link *and* tag content from "a href"
by Fletch (Bishop) on Jul 19, 2004 at 19:02 UTC
      Thank you very much!
Re: extracting link *and* tag content from "a href"
by iburrell (Chaplain) on Jul 19, 2004 at 20:39 UTC
    How about using something that is HTML? The snippet is not HTML. The href attribute must have quotes around it if it contains characters other than letters and numbers. How is the parser supposed to tell whether "ad=" is the start of a new attribute or part of the value?
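
    A rough way to check the quoting rule mentioned above, assuming the HTML 4.01 wording (unquoted values may contain only letters, digits, hyphens, periods, underscores and colons, which is slightly wider than "letters and numbers"). The helper name needs_quotes is hypothetical and the sketch uses only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: returns true when an attribute value has to be quoted
# under the HTML 4.01 rule (assumption: only letters, digits, '-', '.', '_'
# and ':' are allowed in an unquoted value).
sub needs_quotes {
    my ($value) = @_;
    return $value !~ /\A[A-Za-z0-9._:-]+\z/;
}

# Try a few values that appear in this thread.
for my $value ('bar', '?ad=049', '#') {
    printf "%-8s %s\n", $value,
        needs_quotes($value) ? 'must be quoted' : 'may be left unquoted';
}
```

    Under that assumption, the '?' and '=' in the question's href values are what make the quotes mandatory.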
Re: extracting link *and* tag content from "a href"
by bageler (Hermit) on Jul 19, 2004 at 19:17 UTC
    Here's a regexp that will do it, in case CPAN is not an option. If CPAN isn't an option then you have bigger problems ;) Edit: generalized it a bit.
    #!/usr/bin/perl
    while(<DATA>) {
        m#<A.*?HREF=(?:'|")?(.[^\'\"]+)(?:'|")?(?:\s.+)?>(.*?)</A>#ig;
        print "URL: $1\nName: $2\n\n";
    }
    __END__
    <A HREF=?ad=049>One</A>
    <A HREF=?ad=050>Two</A>
    <a target='_new' href='foo'>foo --> bar</a>
    <a target='_new' href='boo'>blah</a>
    <a target='_new' href=bar>troz</a>
    <a target='_new' href=bar2 onclick='somefunc'>troz</a>
    <a target='_new' href='bar3' onclick='somefunc'>troz</a>
      I downvoted this because it doesn't actually work on HTML. It's a good try, but there are several cases it just misses, for example:
      <a href='this>breaks>"'>maybe</a>
      <a href=#>test</a>
      <a href="/path/to/don't/use/this">omg</a>
      (The last two are credited to perlygatekeeper in #perl on Freenode)

      Your code produces:
      URL: this>breaks
      Name: "'>maybe

      URL: #>
      Name:

      URL: "/
      Name:
      I'm sure you could manage to fix these specific cases, but I seriously doubt you'll ever actually get to the point where it parses every type of valid HTML. And even if you do, what's the point? You just wasted X hours to do something that existing modules already do extremely well. This makes a decent learning exercise, but please do not suggest "home grown" regexen for such complicated tasks.
        Well, it worked on his examples :) What's the point? The point is to try and reinvent the wheel. Why would I want to reinvent the wheel? Why not, if I'm getting paid :) Then I learn things too, such as the mistakes you pointed out.

        Of course, I was working under the assumption that the links are valid HTML, which none of the examples that you or the thread author provided are. Anything not matching [a-zA-Z0-9], such as quotes, angle brackets, etc., should be URL-encoded if put in a URL.

        In any case, you're right, it's still broken for some cases. Downvote away :)
Re: extracting link *and* tag content from "a href"
by gellyfish (Monsignor) on Jul 20, 2004 at 08:09 UTC

    I have a simple example for the v1 HTML::Parser here

    /J\

•Re: extracting link *and* tag content from "a href"
by merlyn (Sage) on Jul 20, 2004 at 13:24 UTC
    <A HREF=?ad=049>One</A> <A HREF=?ad=050>Two</A>
    This isn't HTML, so you might have problems with standard HTML parsers. In fact, I'd hope to never run across a page that looks like that.

    Standard acceptable HTML will need to quote those attribute values.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.