Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good morning, dear monks!

I'm new to programming and am trying to learn the basics of Perl. At the moment I'm digging into LWP::UserAgent.
Note: the first code below runs and gives me back the content of the fetched page. What I want is to add a loop around the argument that fetches the URL. In other words, I want to iterate over some hundreds of targets...

#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);
# $ua->agent("Mozilla/8.0"); # pretend we are a very capable browser

my $req = HTTP::Request->new(GET => 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503');
$req->header('Accept' => 'text/html');

# send request
my $res = $ua->request($req);

# check the outcome
if ($res->is_success) {
    print $res->content;
}
else {
    print "Error: " . $res->status_line . "\n";
}


As mentioned above, the code runs well. I want to build in a loop to fetch more pages - namely the pages

from http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=01

to

http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=10000

The pages that have no results I want to drop (but that will be done later with some additional code). For the proof of concept, I want all the URLs that LWP::UserAgent fetches to be printed out...

The questions are:

1. How do I add the loop correctly?
2. How do I make the program print out all the URLs that are fetched? (Later on I want to parse the pages that have content, but that is a part I will design and code later.)

Here is the code with a built-in loop, to make the UserAgent iterate over a bunch of targets:

# first get a list of all schools
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); # pretending to be firefox on linux

for my $i (0..10000) {
    my $request = HTTP::Request->new(GET => sprintf("http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503,%d", $i));
    $request->header('Accept' => 'text/html');
    my $response = $ua->request($request);
    if ($response->is_success) {
        $pagecontent = $response->content;
    }
    # now we can do whatever with the $pagecontent
}

my $request = POST $url,

# check the outcome
if ($res->is_success) {
    print $res->content; # please print out all the URLs that were fetched! Thx my dear!
}
else {
    print "Error: " . $res->status_line . "\n";
}




Do you have any idea how to insert the loop correctly, and how to get the program to print out all the URLs (not the content)?
Please let me know if I need to be more descriptive!

many thanks! Perlbeginner1

Replies are listed 'Best First'.
Re: iterator variable in a LWP-UA-code snippet
by zentara (Cardinal) on Oct 31, 2010 at 12:14 UTC
    If I understand your question correctly, the easiest way to do it is to make the $url a separate variable, instead of building it as part of the GET. Also, I don't know what you are trying to accomplish if you don't want the content. Are you just trying to detect whether the server is up? There are easier ways to do that.
    for my $i (0..10000) {
        my $url = sprintf("http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503,%d", $i);
        my $request = HTTP::Request->new(GET => $url);
        $request->header('Accept' => 'text/html');
        my $response = $ua->request($request);

        # check the outcome
        if ($response->is_success) {
            my $pagecontent = $response->content;
            print "Success $url\n";   # print out all the URLs that were fetched
        }
        else {
            print "Error: $url " . $response->status_line . "\n";
        }
    } # end of for $i loop

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: iterator variable in a LWP-UA-code snippet
by kcott (Archbishop) on Oct 31, 2010 at 12:04 UTC

    You've shown start and end range as ...school=01 and ...school=10000 so you'll want to loop through (1 .. 10000), not (0 .. 10000).

    I'm not sure what's happening with your sprintf - you're going to end up with the range: ...school=5503,0 to ...school=5503,10000.

    You'll want to remove the 5503, part. Then you'll need %02d while $i < 10; after that I'm not sure: are 10-99 represented like that, or do they have leading zeros? Anyway, you can probably work it out from there.
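    A minimal sketch of that suggestion, assuming zero-padding is only needed below 10 (the 10-99 case is unverified):

    my $base = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=';
    for my $i (1 .. 10000) {
        my $id  = $i < 10 ? sprintf('%02d', $i) : $i;   # pad 1-9 to 01-09
        my $url = $base . $id;
        # ... fetch $url as shown in the other replies
    }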

    -- Ken

      Hello Ken, hello Anonymous Monk - thanks for the replies!

      @Ken: I guess you are right, I want to loop through (1 .. 10000), not (0 .. 10000).


      I am going to try it out with

      %02d while $i < 10

      I'll come back and report all my findings.

      Regards, beginner

      Update:

      String concatenation:

      #!/usr/bin/env perl
      use 5.12.0;
      use warnings;
      use LWP::Simple;
      use Data::Dumper;

      say Dumper({ map { $_ => eval { get 'http://example.com/' . $_ } } 0 .. 1000 });


      I'll try this out! Thanks.
      I'll come back and report.
Re: iterator variable in a LWP-UA-code snippet
by Anonymous Monk on Oct 31, 2010 at 12:02 UTC
    String concatenation:
    #!/usr/bin/env perl
    use 5.12.0;
    use warnings;
    use LWP::Simple;
    use Data::Dumper;

    say Dumper({ map { $_ => eval { get 'http://example.com/' . $_ } } 0 .. 1000 });
Duplicate: please delete Re: iterator variable in a LWP-UA-code snippet
by zentara (Cardinal) on Oct 31, 2010 at 12:14 UTC
    It was cold and I double clicked the create button, and it made 2 ! :-)


      Hee He

      "It was cold and I double clicked the create button, and it made 2 ! :-)"

      Only you could get away with that and not lose XP/face!

        I think I found the holy grail of how to double my xp per post, just by double clicking. :-)

        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh
      Hello zentara, hello all! Many thanks for the quick replies.

      Your answers are very, very helpful and inspiring! Really!


      Of course I want to have the content, but I have to prepare for this job first. I want to parse the content of all the pages; note that some have empty results, since we iterate over many pages.

      Note: I want to run over a bunch of pages... some are empty, some not.

      See the loop over Hessen:
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5504
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5505
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5514
      etc

      I look for the data that is in the


      With that information I want to drive the parser - probably I'll do it with HTML::TreeBuilder::XPath - to get the data out of the pages.

      And finally I want to store it in a database.
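      A minimal, hypothetical sketch of that plan - the XPath expression, the SQLite backend, and the table layout are all assumptions, since the page markup isn't shown here:

      use strict;
      use warnings;
      use HTML::TreeBuilder::XPath;
      use DBI;

      # assumption: SQLite via DBD::SQLite; any DBI backend would do
      my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '', { RaiseError => 1 });
      $dbh->do('CREATE TABLE IF NOT EXISTS schools (id INTEGER, info TEXT)');

      sub store_school {
          my ($id, $html) = @_;
          my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
          my $info = $tree->findvalue('//table//td');   # hypothetical XPath - adjust to the real markup
          $dbh->do('INSERT INTO schools (id, info) VALUES (?, ?)', undef, $id, $info);
          $tree->delete;   # free the parse tree
      }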

      But I'm also musing about the idea of using HTTP::Request::Common; what do you think? It could make things easier, couldn't it?
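      For what it's worth, a minimal sketch of what HTTP::Request::Common buys you - its exported GET helper builds the HTTP::Request object, headers included, in one call:

      use LWP::UserAgent;
      use HTTP::Request::Common qw(GET);

      my $ua  = LWP::UserAgent->new;
      my $url = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503';

      # GET replaces the HTTP::Request->new(...) / header(...) two-step
      my $response = $ua->request(GET $url, Accept => 'text/html');
      print $response->is_success ? "Success $url\n" : 'Error: ' . $response->status_line . "\n";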

      I look forward to hearing from you!

        Again you've chosen to ignore the formatting advice given when posting; I've mentioned this to you a couple of times. Honestly, it won't take long to read and learn.

        You've also been asking questions similar to this for quite some time, and have been provided several solutions and code to get you going. I understand you are trying to get a working solution for this task. Which parts exactly are you having problems with? Looping? If so, see Recursion: The Towers of Hanoi problem from the Subroutines subsection of the tutorials.

        You mention you want to run this for several sites, which I presume have different markup. Why not just call a different parsing subroutine for each site, e.g. via a dispatch table as sketched below? I'm sure you've mentioned at least a couple of different sites you wish to parse in your previous posts.
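        A minimal sketch of that idea, with hypothetical parser subs - one dispatch-table entry per site you scrape:

        use URI;

        my %parser_for = (
            'dms-schule.bildung.hessen.de' => \&parse_hessen,   # hypothetical sub
            'other-site.example'           => \&parse_other,    # hypothetical sub
        );

        # pick the parser by the URL's host; skip hosts we have no parser for
        my $host   = URI->new($url)->host;
        my $parser = $parser_for{$host};
        $parser->($pagecontent) if $parser;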

Re: iterator variable in a LWP-UA-code snippet
by aquarium (Curate) on Oct 31, 2010 at 22:56 UTC
    I'm fairly certain that the website uses AJAX/JSON behind the scenes to get the data into the HTML in the first place. Therefore it might be possible (if you ask nicely) to get from the website admin the URLs that give you the raw data, rather than constructing school-info URLs (some of which don't exist) and scraping the HTML.
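    If such raw-data URLs exist, fetching one would look roughly like this - the endpoint below is entirely hypothetical:

    use strict;
    use warnings;
    use LWP::Simple;
    use JSON::PP;   # assumption: any JSON parser would do

    # hypothetical endpoint - the real URL would have to come from the site admin
    my $json = get('http://dms-schule.bildung.hessen.de/suchen/schools.json');
    if (defined $json) {
        my $schools = decode_json($json);
        # ... work with the decoded data structure instead of scraping HTML
    }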
    the hardest line to type correctly is: stty erase ^H