Re: Using HTTP::LinkExtor to get URL and description info
by crazyinsomniac (Prior) on Aug 08, 2002 at 04:39 UTC
You have to know your tools. HTML::LinkExtor was designed to only extract the links, not the text in between (whatever you call it, cdata or whatever).
Demo
use strict;
use Data::Dumper;
use HTML::LinkExtor;
my $base = 'http://perlmonks.org/';
my $stringy = q{
<tr><td><a HREF="/index.pl?node_id=188511">How does this code work (warnings.pm)?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td> <td>by <a HREF="/index.pl?node_id=961">Anonymous Monk</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188507">warnings::warnif etc. wise usage?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
<tr><td><a HREF="/index.pl?node_id=188505">52-bit numbers as floating point</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
};
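# Without a callback, the links accumulate and are returned by links()
my $p = HTML::LinkExtor->new(undef, $base);
$p->parse($stringy);
$p->eof;
print Dumper $p->links;
# With a callback, each link is handed over as ($tag, %attributes) as it is found
$p = HTML::LinkExtor->new( sub { print Dumper($_) for @_; } , $base);
$p->parse($stringy);
$p->eof;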
And now for the nudge: HTML::TokeParser tutorial
update: surprise, surprise, I've solved this one before
(crazyinsomniac) Re: Getting the Linking Text from a page
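For what it's worth, here is a minimal sketch of the HTML::TokeParser approach for getting both the link and its text. It follows the pattern in the module's own documentation; the sub name extract_links and the use of URI to make the links absolute are assumptions added for illustration, so check the PoD before relying on it.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Return a list of [absolute_url, link_text] pairs found in $html.
# extract_links is a hypothetical helper name, not part of any module.
sub extract_links {
    my ($html, $base) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag
    while (my $token = $p->get_tag('a')) {
        my $href = $token->[1]{href};    # attribute names are lowercased
        next unless defined $href;       # skip named anchors without href
        # collect the text up to the closing </a>, whitespace trimmed
        my $text = $p->get_trimmed_text('/a');
        push @links, [ URI->new_abs($href, $base)->as_string, $text ];
    }
    return @links;
}

my $html = q{<td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td>};
printf "%s\t%s\n", @$_ for extract_links($html, 'http://perlmonks.org/');
```

Deciding which of those links are headlines is still up to you; the parser only saves you from hand-rolling the tag matching.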
| [reply] [d/l] |
|
Thanks for that!
I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off newspaper sites. At the moment I'm using LWP::Simple with get(URL), dumping the result into an array, reading through to a certain pre-determined point, and then using a regex to get the info I want.
Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?
Thanks again!
Some people fall from grace. I prefer a running start...
| [reply] |
|
I would suggest the CPAN module HTML::Parser. It's pretty
straightforward:
use HTML::Parser;
my $in_a = 0; # true while we are inside an <a> element
my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&start, "tagname"],
    end_h => [\&end, "tagname"],
    default_h => [\&default, "text"]);
# Parse a string ...
$p->parse($some_html);
$p->eof;
# ... or a filehandle (the method is parse_file, not parsefile)
$p->parse_file(\*SOME_FH);
sub start {
    my ($tagname) = @_;
    $in_a = 1 if $tagname eq 'a';
}
sub end {
    my ($tagname) = @_;
    $in_a = 0 if $tagname eq 'a';
}
sub default {
    my ($text) = @_;
    # do something with text if $in_a
}
HTH. Off the top of my head. Check the HTML::Parser PoD for
absolute correctness.
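To fill in the "do something with text" part, here is one possible sketch that collects each link's href together with its text. The sub name links_with_text and the hash layout are made up for illustration, and it uses a text_h handler with the "dtext" argspec (decoded text) in place of the default handler above, so treat it as one way to do it and not the only one.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return a list of { href => ..., text => ... } hashes, one per <a href>.
# links_with_text is a hypothetical helper name.
sub links_with_text {
    my ($html) = @_;
    my (@links, $current);
    my $p = HTML::Parser->new(
        api_version => 3,
        # start tag: begin a new record when we see <a href=...>
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $current = { href => $attr->{href}, text => '' }
                if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # text may arrive in several chunks, so append
        text_h => [ sub {
            $current->{text} .= $_[0] if $current;
        }, 'dtext' ],
        # end tag: store the finished record on </a>
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'a' && $current) {
                push @links, $current;
                undef $current;
            }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;
    return @links;
}

my $html = q{<td>by <a HREF="/index.pl?node_id=961">Anonymous Monk</a></td>};
print "$_->{href}\t$_->{text}\n" for links_with_text($html);
```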
OT: FYI, SGML :)
by BorgCopyeditor (Friar) on Aug 08, 2002 at 05:05 UTC
I don't know exactly what you call the descriptive tag between the <a href> and the closing </a> unless it's descriptive tag : ).
Not that this matters, but it's the content of the anchor element. Here's more info about SGML, which is where terms like "tag" and "element" come from.
BCE --Your punctuation skills are insufficient!
Re: Using HTTP::LinkExtor to get URL and description info
by hacker (Priest) on Aug 08, 2002 at 19:48 UTC
| [reply] [d/l] |
|
I know that I got the module name wrong, but for what I want to do, HTML::LinkExtor doesn't go far enough.
My app filters out headlines but I want the active link and the text description. As far as I can tell from playing with HTML::LinkExtor, all I can get is the link, not the text and the link. I couldn't work out how you would pull the link text as it has no anchor tag.
HTML::TokeParser seems to be the best way for me to go for my project. However if I've missed something with HTML::LinkExtor, let me know and I'll take another look at that.
Some people fall from grace. I prefer a running start...
Re: Using HTTP::LinkExtor to get URL and description info
by OracleJedi (Initiate) on Aug 09, 2002 at 18:25 UTC
HTML::TreeBuilder might be overkill for what you need, but it's simple:
use HTML::TreeBuilder;
use strict; # examples aren't exempt!!!
my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof; # tell the parser the document is complete
my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
    my $href = $link->attr('href');
    my $descr = $link->content->[0]; # Assumes only simple text contents
}
$parser->delete();
| [reply] [d/l] |
|
I've been doing some web automation and I'm using HTML::TreeBuilder everywhere.
I was also afraid of overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.
Btw, I think your code could be improved in this way:
use HTML::TreeBuilder;
use strict; # examples aren't exempt!!!
my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof; # tell the parser the document is complete
my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
    my $href = $link->attr('href');
    my $descr = $link->as_text(); # all text inside the anchor, markup stripped
}
$parser->delete();
This removes the assumption about only simple text contents and only gets text from the anchor element. Your code would have gotten markup elements embedded in the anchor element, like:
<a href="..."><p class="big-and-bold">Winners!</p> for today</a>
Fetching $link->content->[0] on the above would get you an HTML::Element.
I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out, as extracting text from HTML appears to be the thing of interest.
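A small self-contained sketch of the difference, in case it helps. The sample markup and the sub name link_texts are made up for illustration (I used an inline b element so the tree builder has no reason to rearrange the nesting), and new_from_content is just a convenience constructor that parses a string in one step.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return [href, text] pairs for every <a> element in $html.
# link_texts is a hypothetical helper name.
sub link_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @pairs = map { [ $_->attr('href'), $_->as_text ] }
                $tree->look_down(_tag => 'a');
    $tree->delete;    # free the tree once we have plain strings
    return @pairs;
}

# Hypothetical sample: the anchor's first child is an element, not text
my $html = q{<a href="/winners"><b>Winners!</b> for today</a>};
print "$_->[0]\t$_->[1]\n" for link_texts($html);
```

With content->[0] you would have been handed the b element here, while as_text() flattens everything inside the anchor into one string.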
Re: Using HTTP::LinkExtor to get URL and description info
by jdavidboyd (Friar) on Aug 14, 2002 at 13:30 UTC
Hey, HTML::TokeParser comes with a sample script that works great. (In the EXAMPLES section)
I cut it out and, with a slight amount of rework, have been able to use it to recreate all my bookmarks as a clean, sorted file.
Dave