Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Why is this link parser only working the first time?

by dave_aiello (Pilgrim)
on Jan 19, 2001 at 06:03 UTC ( [id://52910]=perlquestion: print w/replies, xml ) Need Help??

dave_aiello has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a link parser that takes a URL and returns an array of links on the page that are within in the /attachments sub tree of a Web server's document tree. The subroutine parse_page works fine if it is only called one time. Subsequent invocations of the subroutine within the same program return nothing in the array, although I am sure that there are links that meet the criteria.

What I want to do is set up a main subroutine which grabs URLs from a database table, passes them one at a time to parse_page, and receives back a corresponding array of links. And to be perfectly clear, I want to do this serially.

I'm sorry if I have done something stupid. I have been sick all week, and I am working on less than my normal powers of abstract reasoning and concentration. I shamelessly leveraged the example code in the perldoc for HTML::LinkExtor and squinted at the screen for a while to get this far.

Dave Aiello
Chatham Township Data Corporation

#!/usr/local/bin/perl use LWP::UserAgent; use HTML::LinkExtor; use URI::URL; sub parse_page { my($url) = @_; my $ua = LWP::UserAgent->new; my @links = (); sub attachment_link_extractor { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); } my $p = HTML::LinkExtor->new(\&attachment_link_extractor); $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])}); my $base = $res->base; @links = map { if ($_ =~ /^http/) { $_ = "/". url($_, $base)->rel; } else { $_ = $_; } } @links; $p->links; return(@links); }

Replies are listed 'Best First'.
Re: Why is this link parser only working the first time?
by dkubb (Deacon) on Jan 19, 2001 at 10:34 UTC
    Here's a possible solution:
    #!/usr/bin/perl -w use strict; use LWP::UserAgent; use HTML::LinkExtor; use URI::URL; use HTTP::Request::Common qw(GET); my $links = parse_page('http://www.perl.com/'); sub parse_page { my $url = shift; #Get the base URL my $base = url($url)->abs->base; my @links; #Push any matching links onto the @link array #Only put a relative link, so we don't store #excessive amounts of data my $callback = sub { my $tag = shift; my %attr = @_; if($tag eq 'a' && $attr{href}[0] =~ $base) { push @links, url($attr{href}[0], $base)->abs->rel; } }; #Prepare the Link parser my $p = HTML::LinkExtor->new($callback, $base); #Fetch and Parse the web page my $ua = LWP::UserAgent->new; my $response = $ua->request(GET($url), sub {$p->parse($_[0])}); return \@links; }

    I'll explaination of what I did:

    I tried to limit the amount of data that is stored into the @links array. So instead of holding the entire url in the array, then shortening it later, I just made it relative inside the callback subroutine.

    Passing in the second argument to HTML::LinkExtor said to it "Add in this base url to any relative links automatically". This simplified the callback code, because every href link passed in was absolute, I didn't have to do any error checking similar to the map you used at the end of the original subroutine.

    Notice how I am accessing the first element inside $attr{href}? It's an arrayref. I think this is where you may have had problems, I know I did too, at first. But liberal use of Data::Dumper showed me the error in my ways.

    In order to simplify the LWP client block, I removed HTTP::Request. Instead I used a module called HTTP::Request::Common, which generally does the Right Thing, saving a few keystokes along the way.

    And finally, I returned an array reference instead of a real array. When you return an array from a subroutine, you are literally copying over each element to a new place in memory. Not to mention, that when @links goes out of scope, perl's garbage collector will be invoked, and could cause a performance penalty. By passing an array reference, I have just passed the location of the @links array back to the caller, not the actual information. It's much lighter than the entire @array, and quite speedy to pass around. Your milage may vary, but I believe it's a good habit to get into.

Re (tilly) 1: Why is this link parser only working the first time?
by tilly (Archbishop) on Jan 19, 2001 at 06:29 UTC
    If you had turned warnings on you would have receieved a message about "cannot stay shared". I tried to explain the error in RE (3): BrainPain-Help, I never got any feedback on it so I have no idea whether it is or isn't understandable.

    Anyways to fix your problem is simplicity, instead of writing:

    sub attachment_link_extractor { # ...
    write that as
    my $link_extractor = sub { # ...
    And then just pass the variable in to the constructor to LinkExtor.
      tilly:

      Thanks very much. This is an excellent solution because the code change is limited to a single line.

      What used to be:

      sub attachment_link_extractor { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); } my $p = HTML::LinkExtor->new(\&attachment_link_extractor);
      ... now becomes ...
      $attachment_link_extractor = sub { my ($tag, %attr) = @_; push(@links, values %attr) if (($tag eq 'a') && ($attr{href} =~ m/attachments/)); }; my $p = HTML::LinkExtor->new($attachment_link_extractor);
      The only part of the your suggested change that confused me was the need to put a semicolon after the bracket that closes the subroutine. I don't remember ever writing Perl that way before, but, I keep forgetting that the subroutine itself is part of an assignment statement.

      Dave Aiello
      Chatham Township Data Corporation

        Sorry, I should have remembered the semi-colon. That is the kind of detail I stopped thinking about, they just automatically go in when I write, and if I edit then run I expect to have Perl catch.

        You are right about why you need it. In the language used in perlsyn, the subroutine is now within a "simple statement" and you need to terminate the statement.

        As for writing Perl like that, I do it all of the time and recommend it whenever I get the chance. :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://52910]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (4)
As of 2024-03-29 09:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found