costas has asked for the wisdom of the Perl Monks concerning the following question:

I am using HTML::LinkExtor to extract image links from a set of URLs captured using the Google API. I am able to successfully parse links from the first URL passed into the following subroutine, but then I get nothing from the remaining URLs I pass through. Does it have anything to do with creating a new object each time I loop through?
    sub parsedocument {
        my ($url) = @_;
        my $ua = LWP::UserAgent->new;

        # Set up a callback that collect image links
        my @imgs = ();
        sub callback {
            my($tag, %attr) = @_;
            return if $tag ne 'img';  # we only look closer at <img ...>
            push(@imgs, values %attr);
        }
        my $p = HTML::LinkExtor->new(\&callback);

        # Request document and parse it as it arrives
        my $res = $ua->request(HTTP::Request->new(GET => $url),
                               sub {$p->parse($_[0])});

        # Expand all image URLs to absolute ones
        my $base = $res->base;
        @imgs = map { $_ = url($_, $base)->abs; } @imgs;

        # Print them out
        print join("<br>", @imgs), "<br>";
    }
I've tried many things but cannot seem to get it working. Does anybody see anything that might be causing me a problem?

Thanks in advance

costas

Re: problems using HTML::LinkExtor? anyone have an idea?
by alien_life_form (Pilgrim) on May 20, 2002 at 14:27 UTC
    Greetings,

    Adding the infinitely advisable:

    use strict; use warnings;
    to your sample yields:
    Variable "@imgs" will not stay shared at par.pl line 17 (#1)
    Drilling further down:
    C:\TEMP>perl -Mdiagnostics par.pl
    Variable "@imgs" will not stay shared at par.pl line 17 (#1)
        (W closure) An inner (nested) named subroutine is referencing a lexical
        variable defined in an outer subroutine.

        When the inner subroutine is called, it will probably see the value of
        the outer subroutine's variable as it was before and during the *first*
        call to the outer subroutine; in this case, after the first call to the
        outer subroutine is complete, the inner and outer subroutines will no
        longer share a common value for the variable.  In other words, the
        variable will no longer be shared.

        Furthermore, if the outer subroutine is anonymous and references a
        lexical variable outside itself, then the outer and inner subroutines
        will never share the given variable.

        This problem can usually be solved by making the inner subroutine
        anonymous, using the sub {} syntax.  When inner anonymous subs that
        reference variables in outer subroutines are called or referenced, they
        are automatically rebound to the current values of such variables.
    I could not have said it better myself. :)
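To see the effect without any network access, here is a minimal sketch that reproduces it. The sub names (collect_named, remember_named, collect_anon) and the file names are made up for illustration and are not part of the original script; the 'closure' warning is switched off only so the deliberately broken version runs quietly.

```perl
use strict;
use warnings;
no warnings 'closure';   # remove this line to see "will not stay shared"

# Broken: the named inner sub binds to the @seen of the FIRST call only.
sub collect_named {
    my @items = @_;
    my @seen  = ();
    sub remember_named { push @seen, @_ }
    remember_named($_) for @items;
    return scalar @seen;          # what the outer sub sees in its own @seen
}

# Fixed: an anonymous sub is a true closure, rebound on every call.
sub collect_anon {
    my @items = @_;
    my @seen  = ();
    my $remember = sub { push @seen, @_ };
    $remember->($_) for @items;
    return scalar @seen;
}

my @named = (collect_named('a.png', 'b.png'), collect_named('c.png'));
my @anon  = (collect_anon('a.png', 'b.png'),  collect_anon('c.png'));

print "named: @named\n";   # named: 2 0
print "anon:  @anon\n";    # anon:  2 1
```

The second call to collect_named reports 0 because the inner sub is still pushing onto the array from the first call, which matches the "first URL works, the rest return nothing" symptom.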
    Hence the working version:
        use strict;
        use warnings;
        use HTML::LinkExtor;
        use LWP::UserAgent;
        use URI::URL;

        sub parsedocument {
            my ($url) = @_;
            my $ua = LWP::UserAgent->new;
            $ua->env_proxy();

            # Set up a callback that collect image links
            my @imgs = ();
            my $callback = sub {
                my($tag, %attr) = @_;
                return if $tag ne 'img';  # we only look closer at <img ...>
                push(@imgs, values %attr);
            };
            my $p = HTML::LinkExtor->new($callback);

            # Request document and parse it as it arrives
            my $res = $ua->request(HTTP::Request->new(GET => $url),
                                   sub {$p->parse($_[0])});

            # Expand all image URLs to absolute ones
            my $base = $res->base;
            @imgs = map { $_ = url($_, $base)->abs; } @imgs;

            # Print them out
            print join("<br>", @imgs), "<br>";
        }

        map {parsedocument($_) } @ARGV;
    Note that the original of your sample (from the HTML::LinkExtor documentation) works precisely because it runs at the top level of the program, where there is only ever one @imgs. Once you wrap it in a sub, you get the problem you described, which is well known, for instance, to people using mod_perl and Apache::Registry, since Apache::Registry wraps each script in a subroutine.
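For contrast, a rough sketch of the top-level case. The tag lists below are hypothetical stand-ins for two parsed documents, so nothing is fetched; the callback has the same shape as the one in the question. With @imgs declared at file scope there is a single array for the whole run, so a named sub keeps working across iterations and Perl raises no closure warning.

```perl
use strict;
use warnings;

# File scope: exactly one @imgs exists, shared by the named sub and the loop.
my @imgs;
sub callback {
    my ($tag, %attr) = @_;
    return if $tag ne 'img';
    push @imgs, values %attr;
}

# Hypothetical (tag, attribute) lists standing in for two fetched pages.
my @docs = (
    [ [ img => src => 'a.png' ], [ a   => href => 'x.html' ] ],
    [ [ img => src => 'b.png' ], [ img => src  => 'c.png'  ] ],
);

my @counts;
for my $doc (@docs) {
    @imgs = ();                      # empties the same array in place
    callback(@$_) for @$doc;
    push @counts, scalar @imgs;
}

print "images per document: @counts\n";   # images per document: 1 2
```

This only works while the code stays at file scope; moving it into a sub (or letting a framework do so) brings back the sharing problem.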
    Cheers,
    alf
    You can't have everything: where would you put it?