Ending a loop of content of LWP's get-function

turbolofi has asked for the wisdom of the Perl Monks concerning the following question:

I am building a very simple HTML file parser that retrieves the contents of a single, well-formatted file. It's not dynamic yet, as I'm still struggling with a problem that is perhaps very simple.
Specifically, I can't end a loop. Some code is provided below. The problematic part is the "while ($html)" part.
Parsing static files is not a problem (with "while (<>)"), but as soon as I loop over this retrieved HTML file, I get an endless loop.

Any help would be much appreciated.
Also: this is my first post to perlmonks, so be gentle, dear monks!

#!/usr/bin/perl -w
# Get urls from result page
use warnings;
use strict;
use LWP::Simple;
my ($html, $url);
my $count = 0;
$html = get("http://localhost:8080/html.htm") or die "Couldn't fetch page.";
while($html) # <- Problematic part..
{
$html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match"; #match regexp and capture backreference to $2, or die with error
$url = $2;
print "$url\n";
$count++;
print "$count\n";
}

Re: Ending a loop of content of LWP's get-function by ikegami (Patriarch) on Mar 27, 2009 at 17:02 UTC
You want to loop over the URLs, with the fetch inside the loop.

    my @urls = (
        "http://localhost:8080/html.htm",
    );
    for my $url (@urls) {
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ ...
        ...
    }

Or, if you plan on adding to `@urls`:

    my @urls = (
        "http://localhost:8080/html.htm",
    );
    while (@urls) {
        my $url = shift(@urls);
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ ...
        ...
        push @urls, $new_url; # or @new_urls
        ...
    }

Using `push` results in a breadth-first search. Using `unshift` results in a depth-first search instead. The former is almost surely the more desirable here.
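A minimal sketch of how the choice between `push` and `unshift` changes the visiting order when URLs are taken from the front with `shift`. It assumes a made-up in-memory link table (`%links`) in place of real pages, and `crawl_order` is a hypothetical helper name, so nothing is fetched over the network:

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": page name => links found on that page.
my %links = (
    index => [ 'a', 'b' ],
    a     => [ 'a1', 'a2' ],
    b     => [ 'b1' ],
);

# Visit pages starting from 'index'; $add is either 'push' or 'unshift'.
sub crawl_order {
    my ($add) = @_;
    my @todo = ('index');
    my (@order, %seen);
    while (@todo) {
        my $page = shift @todo;
        next if $seen{$page}++;              # skip pages already visited
        push @order, $page;
        my @new = @{ $links{$page} || [] };
        if ($add eq 'push') { push    @todo, @new }   # queue: breadth-first
        else                { unshift @todo, @new }   # stack: depth-first
    }
    return @order;
}

print "push:    @{[ crawl_order('push') ]}\n";     # index a b a1 a2 b1
print "unshift: @{[ crawl_order('unshift') ]}\n";  # index a a1 a2 b b1
```

With `push`, both top-level pages are visited before any of their children. With `unshift`, the crawl follows the first link all the way down before returning to its siblings.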
Re^2: Ending a loop of content of LWP's get-function by turbolofi (Acolyte) on Mar 27, 2009 at 17:32 UTC
Thank you for your quick reply, and for the pointers to push and unshift. I'm still struggling to get it to work correctly, though. I've tried both of your suggestions, with two different results:

    #!/usr/bin/perl -w
    use warnings;
    use strict;
    use LWP::Simple;
    my ($html, $url);
    my $count = 0;
    my @urls = (
        "http://localhost:8080/html.htm",
    );
    for my $url (@urls) {
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match"; #match regexp and capture backreference to $2, or die with error
        $url = $2;
        print "$url\n";
        $count++;
        print "$count\n";
    }

This gives only one line of content from the retrieved file. It loops until it has found one occurrence of the matched pattern, then quits the loop. I'd like it to continue until the whole file has been matched. Is it possible to use "length" to achieve this?

The other example gives a more grave error:

    #!/usr/bin/perl -w
    use warnings;
    use strict;
    use LWP::Simple;
    my ($html, $url);
    my $count = 0;
    my $new_url;
    my @urls = (
        "http://localhost:8080/html.htm",
    );
    while (@urls) {
        my $url = shift(@urls);
        my $html = get($url) or die "Couldn't fetch page.";
        $html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} || die "couldn't match"; #match regexp and capture backreference to $2, or die with error
        $url = $2;
        print "$url\n";
        push @urls, $new_url; # or @new_urls
    }

This code gives, as in the case above, one matched result from the retrieved file, then quits with the error:

    Use of uninitialized value $url in pattern match (m//) at C:/Perl/lib/LWP/Simple.pm line 131.
    Couldn't fetch page. at retrieve.pl line 13.

I should note that I use ActivePerl, though I doubt very much that this is the cause of the latter problem. Again, I appreciate any help!
Re^3: Ending a loop of content of LWP's get-function by ikegami (Patriarch) on Mar 27, 2009 at 17:45 UTC
    $url = $2;                              <-- called $url here
    print "$url\n";
    push @urls, $new_url; # or @new_urls    <-- called $new_url here.

Just rename one. Also, it seems you want to search for the pattern multiple times. You'll need the "g" modifier for that.

    while ($html =~ m{...}g) {
        my $new_url = $2;
        print "$new_url\n";
        push @urls, $new_url;
    }
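For reference, the original `while ($html)` never ends because the condition only tests whether the string is true, and nothing in the loop body changes `$html`. Without "g", every match also starts again at the beginning of the string and finds the same first link. With "g" in scalar context, each match resumes where the previous one stopped (the position is available through `pos`), and the loop ends once no further match is found. A minimal sketch on a made-up sample string, so it runs without a server:

```perl
use strict;
use warnings;

# Made-up sample page standing in for the fetched HTML.
my $html = join "\n",
    '<a class="smallV110" href="/first.htm">one</a>',
    '<a class="smallV110" href="/second.htm">two</a>',
    '<a class="smallV110" href="/third.htm">three</a>';

my @found;
# With /g in scalar context, each iteration resumes after the previous match.
while ($html =~ m{<a class="smallV110" href="/(.*?)">}g) {
    push @found, $1;
    print "$1 (next search starts at offset ", pos($html), ")\n";
}
print scalar(@found), " links found\n";
```

Only the URL part is captured here, so it ends up in `$1` rather than `$2` as in the code above.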
Re: Ending a loop of content of LWP's get-function by zentara (Cardinal) on Mar 27, 2009 at 17:32 UTC
If you google for "LWP download progress", you will find an LWP callback that you can use to cancel the download at any point.
Re: Ending a loop of content of LWP's get-function by toolic (Bishop) on Mar 27, 2009 at 17:09 UTC
Welcome to the Monastery! The `get` function returns a scalar, and according to the documentation for LWP::Simple, you can check for success with defined instead of "while". Something like this (untested):

    use warnings;
    use strict;
    use LWP::Simple;
    my ($html, $url);
    my $count = 0;
    $html = get("http://localhost:8080/html.htm");
    die "Couldn't fetch page." unless defined $html;
    if ($html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">} ) {
        $url = $2;
        print "$url\n";
        $count++;
        print "$count\n";
    }
    else {
        die "couldn't match";
    }
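A small sketch of why `defined` is the safer test. `get` returns undef on failure, but a successfully fetched page whose whole body is `0` or the empty string is false in a plain truth test, so `get(...) or die` would report a failure that did not happen. The `fake_get` sub and the example URLs below are hypothetical stand-ins so the example runs offline:

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::Simple::get, so the example runs offline:
# returns the "page" body, or undef for an unknown URL.
my %pages = (
    'http://example.invalid/zero'  => '0',
    'http://example.invalid/empty' => '',
);
sub fake_get {
    my ($url) = @_;
    return $pages{$url};
}

for my $name (qw(zero empty missing)) {
    my $body    = fake_get("http://example.invalid/$name");
    my $truthy  = $body          ? 'true'    : 'false';
    my $defined = defined($body) ? 'defined' : 'undef';
    print "$name: truth test says $truthy, defined test says $defined\n";
}
```

Only the unknown URL comes back as undef; the two short pages are defined even though they are false.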
Re: Ending a loop of content of LWP's get-function by turbolofi (Acolyte) on Mar 27, 2009 at 18:07 UTC
We got it to work. Thanks, everyone, for the pointers to the documentation (RTFM, I know), and for the reminder of how regexps behave! Here's the code, just for future reference.

    #!/usr/bin/perl -w
    # Get urls from result page
    use warnings;
    use strict;
    use LWP::Simple;
    my ($html, $url, @urls);
    my $count = 0;
    $html = get("http://localhost:8080/html.htm") or die "Couldn't fetch page.";
    while($html =~ m{<(a class=\"smallV110\" href=\"/)(.*?)\">}g)
    {
        my $new_url = $2;
        print "$new_url\n";
        $count++;
        print "$count\n";
        push @urls, $new_url;
    }
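As a possible variation, and not what the thread settled on: in list context a "g" match returns all captures at once, so the URLs can be collected without an explicit loop. A minimal sketch on a made-up sample string, with a single capture group so that each match contributes exactly one element:

```perl
use strict;
use warnings;

# Made-up sample page standing in for the fetched HTML.
my $html = '<a class="smallV110" href="/x.htm">x</a> '
         . '<a class="smallV110" href="/y.htm">y</a>';

# In list context, m//g returns every capture from every match at once.
my @urls = $html =~ m{<a class="smallV110" href="/(.*?)">}g;

print "$_\n" for @urls;
print scalar(@urls), " URLs\n";
```

The explicit `while` loop remains the better fit when something has to happen per match, such as the counter above.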