Getting link attributes from WWW::Mechanize?

artche has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm creating a web crawler and using WWW::Mechanize::Link to extract links from content.

Almost everything works fine, but I can't get the attributes of a link.

A fragment of the code:

use WWW::Mechanize;
use Data::Dumper;
use DBI;
...
my $mech = WWW::Mechanize->new(
        stack_depth => 0,
        autocheck   => 0,
        onerror => undef,
        );

$mech->timeout(30);
$mech->agent_alias( 'Windows IE 6' ); 
$mech->max_redirect( 0 );
$mech->get("http://".$placement_page_url);

...

my @links = $mech->links(); 
foreach $link (@links)
{
...                            
print $link->attrs." attrs\n";
%dump = Dumper $link->attrs;
print %dump;
...
}

I can parse almost all the information about links: I get url_abs, base, text, tag, etc. But for the attributes there is a hash ref. I can display it through Data::Dumper, but how do I extract values from it? I need attributes like "rel" and "title". A simple hash ref lookup, key->value, doesn't work.

Monks, I'm hoping that someone knows how to read the attributes and can help me.

artche


Replies are listed 'Best First'.

Re: Getting link attributes from WWW::Mechanize?
by Your Mother (Archbishop) on Dec 16, 2010 at 23:42 UTC

It's just a hash ref so dereference it with the key you want. E.g.,

use warnings;
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(
                               stack_depth => 0,
                               autocheck   => 0,
                               onerror => undef,
                              );
$mech->agent_alias("Windows IE 6");
$mech->get("http://cnn.com");

for my $link ( $mech->links )
{
    print "  URI: ", $link->url_abs, $/;
    print "Title: ", $link->attrs->{title} || "[n/a]", $/, $/;
}

__END__
  URI: http://www.cnn.com/
Title: [n/a]

  URI: http://edition.cnn.com/
Title: CNN INTERNATIONAL

  URI: http://www.cnnmexico.com/
Title: CNN M?XICO

  URI: javascript:cnn_initeditionhtml(3);
Title: [n/a]

As you can see from "M?XICO" (it should read "MÉXICO"), there may be encoding issues you'll need to address.
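For the "rel" and "title" attributes asked about, and for the encoding problem, here is a minimal sketch. It rests on two assumptions: the garbled character comes from printing a decoded string to a STDOUT handle that has no encoding layer, and the terminal expects UTF-8. The helper name attr_or and the sample hashes are made up for illustration; they only stand in for the hash refs that $link->attrs returns, so the snippet runs without a network connection.

```perl
use strict;
use warnings;
use utf8;    # this source file contains a literal non-ASCII string

# Encode everything printed to STDOUT as UTF-8 (assumes a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return the named attribute from a hash ref,
# or the fallback when the attribute is missing, undefined or empty.
sub attr_or {
    my ($attrs, $name, $fallback) = @_;
    return $fallback unless ref($attrs) eq 'HASH';
    my $value = $attrs->{$name};
    return (defined($value) && length($value)) ? $value : $fallback;
}

# Stand-in data shaped like the hash refs returned by $link->attrs.
my @attr_sets = (
    { href => 'http://www.cnnmexico.com/', title => 'CNN MÉXICO', rel => 'nofollow' },
    { href => 'http://www.cnn.com/' },
);

for my $attrs (@attr_sets) {
    print 'Title: ', attr_or($attrs, 'title', '[n/a]'), "\n";
    print '  Rel: ', attr_or($attrs, 'rel',   '[n/a]'), "\n";

    # List every attribute that is present on this link.
    print "  $_ = $attrs->{$_}\n" for sort keys %$attrs;
    print "\n";
}
```

In the loop over $mech->links, you would pass $link->attrs in place of the stand-in hashes. Unlike the || fallback, the helper keeps a value of "0" instead of treating it as missing.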
