in reply to Re: extracting link *and* tag content from "a href"

I downvoted this because it doesn't actually work on html. It's a good try, but there are several cases it just misses, for example:
<a href='this>breaks>"'>maybe</a> <a href=#>test</a> <a href="/path/to/don't/use/this">omg</a>
(The second two are credited to perlygatekeeper in #perl on Freenode)

Your code produces:
URL: this>breaks Name: "'>maybe URL: #> Name: URL: "/ Name:
I'm sure you could manage to fix these specific cases, but I seriously doubt you'll ever actually get to the point where it parses every type of valid HTML. And even if you do, what's the point? You just wasted X hours to do something that existing modules already do extremely well. This makes a decent learning exercise, but please do not suggest "home grown" regexen for such complicated tasks.
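
For comparison, here is a minimal sketch of the module-based approach, assuming HTML::TokeParser from CPAN is installed. The helper name extract_links is made up for this illustration; it is not taken from the code under discussion, and other modules (HTML::LinkExtor, for instance) would serve as well.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return a list of [href, link text] pairs,
# one for every <a> element in $html that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};          # attribute hash of the start tag
        next unless defined $href;           # skip anchors without an href
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

# The three problem links from this post.
my $html = q{<a href='this>breaks>"'>maybe</a> <a href=#>test</a> <a href="/path/to/don't/use/this">omg</a>};
printf "URL: %s Name: %s\n", @$_ for extract_links($html);
```

The parser does the tokenizing, so quotes and angle brackets inside an attribute value are not the caller's problem.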

Re^3: extracting link *and* tag content from "a href"
by bageler (Hermit) on Jul 19, 2004 at 20:33 UTC
    Well, it worked on his examples :) What's the point? The point is to try and reinvent the wheel. Why would I want to reinvent the wheel? Why not, if I'm getting paid :) Then I learn things too, such as the mistakes you pointed out.

    Of course, I was working under the assumption that the links are valid HTML, which none of the examples that you or the thread author provided are. Anything not matching [a-zA-Z0-9], such as quotes, angle brackets, etc., should be URL-encoded if put in a URL.

    In any case, you're right, it's still broken for some cases. Downvote away :)
      That'll teach me to take the easy way out! Anyways, I'm glad we've agreed that it's broken =]. The second example *is* valid though, as far as I know. In the future if you'd just said "This is a learning exercise, please use one of the modules" I wouldn't have had any problems.
        The second example is not valid HTML. The quotes are optional only when the value contains nothing but letters, numbers, periods, and hyphens, [0-9A-Za-z.-] basically. Browsers and some parsers will work around broken markup, but many won't.
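
A rough sketch of that rule as a check, using only the character class given above. The function name may_be_unquoted is invented for the example, and the class is the simplified one from this post rather than a quotation from the HTML specification.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $value contains only the characters
# listed above, so it could be written without quotes under that rule.
sub may_be_unquoted {
    my ($value) = @_;
    return $value =~ /\A[0-9A-Za-z.-]+\z/ ? 1 : 0;
}

# The href values discussed in this thread, plus one that passes.
for my $value ('#', 'index.html', "/path/to/don't/use/this") {
    printf "%-25s %s\n", $value,
        may_be_unquoted($value) ? 'quotes optional' : 'needs quotes';
}
```

Under this check, the bare # in href=# needs quotes, which is why the second example fails.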