in reply to extracting link *and* tag content from "a href"

Here's a regexp that will do it, in case CPAN is not an option. If CPAN isn't an option then you have bigger problems ;) Edit: generalized it a bit.
#!/usr/bin/perl
while (<DATA>) {
    m#<A.*?HREF=(?:'|")?(.[^\'\"]+)(?:'|")?(?:\s.+)?>(.*?)</A>#ig;
    print "URL: $1\nName: $2\n\n";
}
__END__
<A HREF=?ad=049>One</A>
<A HREF=?ad=050>Two</A>
<a target='_new' href='foo'>foo --> bar</a>
<a target='_new' href='boo'>blah</a>
<a target='_new' href=bar>troz</a>
<a target='_new' href=bar2 onclick='somefunc'>troz</a>
<a target='_new' href='bar3' onclick='somefunc'>troz</a>

Re^2: extracting link *and* tag content from "a href"
by BUU (Prior) on Jul 19, 2004 at 20:24 UTC
    I downvoted this because it doesn't actually work on html. It's a good try, but there are several cases it just misses, for example:
    <a href='this>breaks>"'>maybe</a>
    <a href=#>test</a>
    <a href="/path/to/don't/use/this">omg</a>
    (The second two are credited to perlygatekeeper in #perl on Freenode)

    Your code produces:
    URL: this>breaks
    Name: "'>maybe

    URL: #>
    Name:

    URL: "/
    Name:
    I'm sure you could manage to fix these specific cases, but I seriously doubt you'll ever actually get to the point where it parses every type of valid HTML. And even if you do, what's the point? You just wasted X hours to do something that existing modules already do extremely well. This makes a decent learning exercise, but please do not suggest "home grown" regexen for such complicated tasks.
      Well, it worked on his examples :) What's the point? The point is to try to reinvent the wheel. Why would I want to reinvent the wheel? Why not, if I'm getting paid :) Then I learn things too, such as the mistakes you pointed out.

      Of course, I was working under the assumption that the links are valid HTML, which none of the examples that you or the thread author provided are. Anything not matching [a-zA-Z0-9], such as quotes, angle brackets, etc., should be URL-encoded if put in a URL.
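
      A rough sketch of what that encoding rule would look like in plain Perl, for anyone following along. The helper name percent_encode is made up for illustration, and it applies the rule exactly as stated above (escape everything outside [a-zA-Z0-9]), which is stricter than what RFC 3986 requires. It assumes a byte string and is meant for a single value inside a URL, not a whole URL, since it would also escape "/" and "?". On CPAN, URI::Escape's uri_escape covers the same ground.

```perl
use strict;
use warnings;

# percent_encode is a hypothetical helper, not part of the original post.
# Assumption: $value is a byte string holding one URL component.
sub percent_encode {
    my ($value) = @_;
    # Replace every character outside [a-zA-Z0-9] with %XX (uppercase hex).
    $value =~ s/([^a-zA-Z0-9])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

print percent_encode(q{this>breaks>"}), "\n";   # this%3Ebreaks%3E%22
print percent_encode(q{don't}), "\n";           # don%27t
```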

      In any case, you're right, it's still broken for some cases. Downvote away :)
        That'll teach me to take the easy way out! Anyways, I'm glad we've agreed that it's broken =]. The second example *is* valid though, as far as I know. In the future if you'd just said "This is a learning exercise, please use one of the modules" I wouldn't have had any problems.
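
For comparison, here is a minimal sketch of the module-based route the replies recommend, using HTML::TokeParser from CPAN (part of the HTML-Parser distribution). The helper name extract_links and the sample input are assumptions for illustration; the input reuses the three test cases from the reply above. Treat it as one possible approach, not the only way to do it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # CPAN module, not in the Perl core

# extract_links is a hypothetical helper name.
# Returns a list of [url, link text] pairs, one per <a> tag that has an href.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag; element 1 of
    # the returned array ref is a hash of attributes with lowercased names.
    while (my $tag = $parser->get_tag('a')) {
        my $url = $tag->[1]{href};
        next unless defined $url;               # e.g. <a name="top">
        # Collect the text up to the closing </a>, whitespace-trimmed.
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$url, $text];
    }
    return @links;
}

my $html = <<'HTML';
<a href='this>breaks>"'>maybe</a>
<a href=#>test</a>
<a href="/path/to/don't/use/this">omg</a>
HTML

for my $link (extract_links($html)) {
    print "URL: $link->[0]\nName: $link->[1]\n\n";
}
```

The parser handles quoting and attribute order itself, so the loop only has to ask for the href and the link text.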