turbolofi has asked for the wisdom of the Perl Monks concerning the following question:

I am building a very simple HTML file parser that should retrieve the contents of a single, well-formatted file. It's not dynamic yet, as I'm still struggling with a (perhaps very simple) problem.
Specifically, I can't end a loop. Some code is provided below; the problematic part is the "while ($html)" part.
Parsing static files is not a problem (with "while (<>)"), but as soon as I loop over this retrieved HTML file, I get an endless loop.

Any help would be much appreciated.
Also: this is my first post to perlmonks, so be gentle, dear monks!
#!/usr/bin/perl -w
# Get urls from result page
use warnings;
use strict;
use LWP::Simple;

my ($html, $url);
my $count = 0;

$html = get("http://localhost:8080/html.htm") or die "Couldn't fetch page.";

while ($html)   # <- Problematic part..
{
    # match regexp and capture backreference to $2, or die with error
    $html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match";
    $url = $2;
    print "$url\n";
    $count++;
    print "$count\n";
}

Replies are listed 'Best First'.
Re: Ending a loop of content of LWP's get-function
by ikegami (Patriarch) on Mar 27, 2009 at 17:02 UTC

    You want to loop over the URLs, with the fetch inside the loop.

    my @urls = (
        "http://localhost:8080/html.htm",
    );

    for my $url (@urls) {
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ ...
        ...
    }

    Or if you plan on adding to @urls,

    my @urls = (
        "http://localhost:8080/html.htm",
    );

    while (@urls) {
        my $url = shift(@urls);
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ ...
        ...
        push @urls, $new_url;   # or @new_urls
        ...
    }

    Using push results in a breadth-first search.
    Using unshift results in a depth-first search instead.
    The former is almost surely the more desirable here.
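
    To make the push/unshift difference concrete, here is a minimal sketch that is not from the thread. It assumes a made-up link graph (%links) in place of real pages, so nothing is fetched; crawl_order is a hypothetical helper name. With shift taking work from the front, push behaves like a queue (breadth-first) and unshift like a stack (depth-first).

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for fetched pages (no network access).
my %links = (
    A => [qw(B C)],
    B => [qw(D)],
    C => [qw(E)],
    D => [],
    E => [],
);

# Walk the graph from $start; $add is either 'push' or 'unshift'.
sub crawl_order {
    my ($start, $add) = @_;
    my @todo = ($start);
    my (%seen, @order);
    while (@todo) {
        my $page = shift @todo;          # always take work from the front
        next if $seen{$page}++;          # skip pages already visited
        push @order, $page;
        my @new = @{ $links{$page} };
        if ($add eq 'push') { push @todo, @new }      # append: queue
        else                { unshift @todo, @new }   # prepend: stack
    }
    return "@order";
}

print crawl_order('A', 'push'), "\n";     # A B C D E
print crawl_order('A', 'unshift'), "\n";  # A B D C E
```

    The %seen hash is an extra safeguard that the replies above do not mention; without something like it, pages that link to each other would keep the list from ever emptying.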

      Thank you for your quick reply, and for the pointers to push and unshift.
      I'm still struggling to get it to work correctly, though. I've tried both of your suggestions, with two different results:
      #!/usr/bin/perl -w
      use warnings;
      use strict;
      use LWP::Simple;

      my ($html, $url);
      my $count = 0;
      my @urls = (
          "http://localhost:8080/html.htm",
      );

      for my $url (@urls) {
          my $html = get($url) or die "Couldn't fetch page.";
          # match regexp and capture backreference to $2, or die with error
          $html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match";
          $url = $2;
          print "$url\n";
          $count++;
          print "$count\n";
      }
      This gives only one line of content from the retrieved file. It runs until it has found one occurrence of the matched pattern, then leaves the loop. I'd like it to continue until the whole file has been matched. Is it possible to use "length" to achieve this?
      The other example gives a more serious error:
      #!/usr/bin/perl -w
      use warnings;
      use strict;
      use LWP::Simple;

      my ($html, $url);
      my $count = 0;
      my $new_url;
      my @urls = (
          "http://localhost:8080/html.htm",
      );

      while (@urls) {
          my $url = shift(@urls);
          my $html = get($url) or die "Couldn't fetch page.";
          # match regexp and capture backreference to $2, or die with error
          $html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match";
          $url = $2;
          print "$url\n";
          push @urls, $new_url;   # or @new_urls
      }
      This code gives, as in the case above, one matched result from the retrieved file, then quits with the error:
      Use of uninitialized value $url in pattern match (m//) at C:/Perl/lib/LWP/Simple.pm line 131.
      Couldn't fetch page. at retrieve.pl line 13.
      I should note that I use ActivePerl, though I doubt very much that this is the cause of the latter problem. Again, I appreciate any help!
        $url = $2;               <-- called $url here
        print "$url\n";
        push @urls, $new_url;    # or @new_urls <-- called $new_url here.

        Just rename one.

        Also, it seems you want to search for the pattern multiple times. You'll need the "g" modifier for that.

        while ($html =~ m{...}g) {
            my $new_url = $2;
            print "$new_url\n";
            push @urls, $new_url;
        }
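
        As a small illustration of why the "g" modifier ends the loop (a sketch, not code from the thread): the sample markup below is made up, and the pattern uses a single capture group, so the URL lands in $1 rather than $2. A plain "while ($html)" never ends because a non-empty string is always true and the match does not change it; in scalar context m//g instead continues from where the last match stopped and returns false when nothing is left.

```perl
use strict;
use warnings;

# Made-up sample markup standing in for the fetched page.
my $html = join "\n",
    '<a class="smallV110" href="/first.htm">one</a>',
    '<a class="smallV110" href="/second.htm">two</a>',
    '<p>no link here</p>';

my @found;
# In scalar context, m//g resumes where the previous match ended and
# returns false once no match is left, which ends the loop.
while ($html =~ m{<a class="smallV110" href="/(.*?)">}g) {
    push @found, $1;
}

print "$_\n" for @found;   # prints first.htm, then second.htm
```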
Re: Ending a loop of content of LWP's get-function
by zentara (Cardinal) on Mar 27, 2009 at 17:32 UTC
Re: Ending a loop of content of LWP's get-function
by toolic (Bishop) on Mar 27, 2009 at 17:09 UTC
    Welcome to the Monastery!

    The get function returns a scalar, and according to the documentation for LWP::Simple, you can check for success with defined instead of "while". Something like this (untested):

    use warnings;
    use strict;
    use LWP::Simple;

    my ($html, $url);
    my $count = 0;

    $html = get("http://localhost:8080/html.htm");
    die "Couldn't fetch page." unless defined $html;

    if ($html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} ) {
        $url = $2;
        print "$url\n";
        $count++;
        print "$count\n";
    }
    else {
        die "couldn't match";
    }
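
    A short sketch of why "defined" is the safer test than plain truthiness. It is an illustration only: fake_get and %pages are hypothetical stand-ins for get() and a web server, so nothing is fetched. The point is that a body of "0" or "" is false in Perl but is still a successful fetch, while only undef signals failure.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::Simple's get(): returns undef on failure,
# otherwise the page body (which may legitimately be "" or "0").
my %pages = ('/zero.htm' => '0', '/hello.htm' => '<p>hello</p>');
sub fake_get { my ($path) = @_; return $pages{$path} }

# Classify a fetch result by definedness first, truthiness second.
sub status {
    my ($path) = @_;
    my $body = fake_get($path);
    return 'failed' unless defined $body;   # only undef means failure
    return $body ? 'true body' : 'false but defined body';
}

print "$_: ", status($_), "\n" for qw(/hello.htm /zero.htm /missing.htm);
```

    With 'get(...) or die', the '/zero.htm' case would die even though the page was fetched.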
Re: Ending a loop of content of LWP's get-function
by turbolofi (Acolyte) on Mar 27, 2009 at 18:07 UTC
    We got it to work. Thanks, everyone, for the pointers to the documentation (RTFM, I know), and for the reminder of how regexps behave! Here's the code, just for future reference.
    #!/usr/bin/perl -w
    # Get urls from result page
    use warnings;
    use strict;
    use LWP::Simple;

    my ($html, $url, @urls);
    my $count = 0;

    $html = get("http://localhost:8080/html.htm") or die "Couldn't fetch page.";

    while ($html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">}g) {
        my $new_url = $2;
        print "$new_url\n";
        $count++;
        print "$count\n";
        push @urls, $new_url;
    }