extracting link *and* tag content from "a href"

hmerrill has asked for the wisdom of the Perl Monks concerning the following question:

Re: extracting link and tag content from "a href" by davido (Cardinal) on Jul 19, 2004 at 19:08 UTC
This example is from the documentation for HTML::TokeParser:

```perl
use HTML::TokeParser;

$p = HTML::TokeParser->new(shift || "index.html");

while (my $token = $p->get_tag("a")) {
    my $url  = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}
```

And what it does is, "...extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the `<A>`...`</A>` tags..."

Dave
Re^2: extracting link and tag content from "a href" by hmerrill (Friar) on Jul 19, 2004 at 19:09 UTC
Thanks - that's exactly what I need.
Re: extracting link and tag content from "a href" by Fletch (Bishop) on Jul 19, 2004 at 19:02 UTC
HTML::TokeParser (or ::Simple) and/or HTML::TreeBuilder can do this easily.
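A minimal sketch of the HTML::TreeBuilder route might look like the following. It assumes HTML::TreeBuilder is installed from CPAN; the helper name `extract_links` and the sample markup are made up for illustration and are not code from this thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns a list of [href, text] pairs, one for
# every <a> element that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    for my $anchor ($tree->look_down(_tag => 'a')) {
        my $href = $anchor->attr('href');
        next unless defined $href;    # skip named anchors without href
        push @links, [ $href, $anchor->as_trimmed_text ];
    }
    $tree->delete;    # release the parse tree explicitly
    return @links;
}

# Sample markup is an assumption for this sketch.
my $html = q{<p><a href="?ad=049">One</a> <a href="?ad=050">Two</a></p>};
printf "%s\t%s\n", @$_ for extract_links($html);
```

Building a full tree costs more memory than token-by-token parsing, but it makes it straightforward to pull other attributes or nested markup out of each link later.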
Re^2: extracting link and tag content from "a href" by hmerrill (Friar) on Jul 19, 2004 at 19:03 UTC
Thank you very much!
Re: extracting link and tag content from "a href" by iburrell (Chaplain) on Jul 19, 2004 at 20:39 UTC
How about using something that is HTML? The snippet is not HTML. The href attribute must have quotes around it if it contains characters other than letters and numbers. How is the parser supposed to tell if "ad=" is the start of a new attribute or what?
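As a rough illustration of the quoting rule: HTML 4.01 is slightly more permissive than letters and digits alone, allowing an unquoted attribute value when it consists only of letters, digits, hyphens, periods, underscores, and colons. A small check along those lines could be sketched as below, where the helper name `may_be_unquoted` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $value could appear without quotes
# under the HTML 4.01 rule (letters, digits, '-', '.', '_', ':'), else 0.
sub may_be_unquoted {
    my ($value) = @_;
    return $value =~ /\A[A-Za-z0-9._:-]+\z/ ? 1 : 0;
}

# Values taken from the snippets discussed in this thread.
for my $value ('bar', 'bar2', '?ad=049', '#') {
    printf "%-8s %s\n", $value,
        may_be_unquoted($value) ? 'may be unquoted' : 'needs quotes';
}
```

By this rule `?ad=049` needs quotes because of the `?` and `=`, which is the ambiguity the question above points at.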
Re: extracting link and tag content from "a href" by bageler (Hermit) on Jul 19, 2004 at 19:17 UTC
Here's a regexp that will do it, in case CPAN is not an option. If CPAN isn't an option then you have bigger problems ;)

edit: generalized it a bit.

```perl
#!/usr/bin/perl
while (<DATA>) {
    m#<A.*?HREF=(?:'|")?(.[^\'\"]+)(?:'|")?(?:\s.+)?>(.*?)</A>#ig;
    print "URL: $1\nName: $2\n\n";
}
__END__
<A HREF=?ad=049>One</A>
<A HREF=?ad=050>Two</A>
<a target='_new' href='foo'>foo --> bar</a>
<a target='_new' href='boo'>blah</a>
<a target='_new' href=bar>troz</a>
<a target='_new' href=bar2 onclick='somefunc'>troz</a>
<a target='_new' href='bar3' onclick='somefunc'>troz</a>
```
Re^2: extracting link and tag content from "a href" by BUU (Prior) on Jul 19, 2004 at 20:24 UTC
I downvoted this because it doesn't actually work on HTML. It's a good try, but there are several cases it just misses, for example:

```html
<a href='this>breaks>"'>maybe</a>
<a href=#>test</a>
<a href="/path/to/don't/use/this">omg</a>
```

(The second two are credited to perlygatekeeper in #perl on Freenode.) Your code produces:

```
URL: this>breaks
Name: "'>maybe

URL: #>
Name:

URL: "/
Name:
```

I'm sure you could manage to fix these specific cases, but I seriously doubt you'll ever actually get to the point where it parses every type of valid HTML. And even if you do, what's the point? You just wasted X hours to do something that existing modules already do extremely well. This makes a decent learning exercise, but please do not suggest "home grown" regexen for such complicated tasks.
Re^3: extracting link and tag content from "a href" by bageler (Hermit) on Jul 19, 2004 at 20:33 UTC
Well, it worked on his examples :) What's the point? The point is to try to reinvent the wheel. Why would I want to reinvent the wheel? Why not, if I'm getting paid :) Then I learn things too, such as the mistakes you pointed out. Of course, I was working under the assumption that the links are valid HTML, which none of the examples from you or the thread author are. Anything not matching `[a-zA-Z0-9]`, such as quotes, angle brackets, etc., should be URL-encoded if put in a URL. In any case, you're right, it's still broken for some cases. Downvote away :)
Re^4: extracting link and tag content from "a href" by BUU (Prior) on Jul 19, 2004 at 20:41 UTC
Re^5: extracting link and tag content from "a href" by iburrell (Chaplain) on Jul 20, 2004 at 16:49 UTC
Re: extracting link and tag content from "a href" by gellyfish (Monsignor) on Jul 20, 2004 at 08:09 UTC
I have a simple example for the v1 HTML::Parser here /J\
•Re: extracting link and tag content from "a href" by merlyn (Sage) on Jul 20, 2004 at 13:24 UTC
```html
<A HREF=?ad=049>One</A>
<A HREF=?ad=050>Two</A>
```

This isn't HTML, so you might have problems with standard HTML parsers. In fact, I'd hope to never run across a page that looks like that. Standard acceptable HTML will need to quote those attribute values.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.