Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good morning, dear monks!

I'm new to programming and am trying to learn the basics of Perl. At the moment I'm digging into LWP::UserAgent.
Note: the first code below runs and gives me back the content of the fetched page. What I want is to add a loop around the argument that fetches the URL. In other words, I want to iterate over some hundreds of targets...

#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);
# $ua->agent("Mozilla/8.0"); # pretend we are a very capable browser

my $req = HTTP::Request->new(GET => 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503');
$req->header('Accept' => 'text/html');

# send request
my $res = $ua->request($req);

# check the outcome
if ($res->is_success) {
    print $res->content;
}
else {
    print "Error: " . $res->status_line . "\n";
}


As mentioned above, the code runs well. I want to build in a loop to fetch more pages - namely the pages

from http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=01

to

http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=10000

The pages that have no results I want to drop (but that will be done later with some additional code). For the proof of concept, I want all the URLs that LWP::UserAgent fetches to be printed out...

The questions are:

1. How do I add the loop correctly?
2. How do I make the program print out all the URLs that are fetched? (Later on I want to parse the pages that have content, but that is a part I will design and code later.)

Here is the code with a built-in loop, to make the UserAgent iterate over a bunch of targets:

# first get a list of all schools
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); # pretending to be firefox on linux

for my $i (0..10000) {
    my $request = HTTP::Request->new(GET => sprintf("http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503,%d", $i));
    $request->header('Accept' => 'text/html');
    my $response = $ua->request($request);
    if ($response->is_success) {
        $pagecontent = $response->content;
    }
    # now we can do whatever with the $pagecontent
}

my $request = POST $url,

# check the outcome
if ($res->is_success) {
    print $res->content; # please print out all the URLs that were fetched! Thx my dear!
}
else {
    print "Error: " . $res->status_line . "\n";
}




Do you have any idea how to insert the loop correctly, and how to get the program to print out all the URLs (not the content)?
Please let me know if I need to be more descriptive!

many thanks! Perlbeginner1

Replies are listed 'Best First'.
Re: iterator variable in a LWP-UA-code snippet
by zentara (Cardinal) on Oct 31, 2010 at 12:14 UTC
    If I understand your question correctly, the easiest way to do it is to make the $url a separate variable, instead of building it as part of the GET. Also, I don't know what you are trying to accomplish if you don't want the content. Are you just trying to detect whether the server is up? There are easier ways to do that.
    for my $i (0..10000) {
        my $url = sprintf("http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503,%d", $i);
        my $request = HTTP::Request->new(GET => $url);
        $request->header('Accept' => 'text/html');
        my $response = $ua->request($request);

        # check the outcome
        if ($response->is_success) {
            my $pagecontent = $response->content;
            print "Success $url\n";   # print out all the URLs that were fetched
        }
        else {
            print "Error: $url " . $response->status_line . "\n";
        }
    } # end of for $i loop

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: iterator variable in a LWP-UA-code snippet
by kcott (Archbishop) on Oct 31, 2010 at 12:04 UTC

    You've shown start and end range as ...school=01 and ...school=10000 so you'll want to loop through (1 .. 10000), not (0 .. 10000).

    I'm not sure what's happening with your sprintf - you're going to end up with the range: ...school=5503,0 to ...school=5503,10000.

    You'll want to remove the 5503, part. Then you'll need %02d while $i < 10; after that I'm not sure: are 10-99 represented like that, or do they have leading zeros? Anyway, you can probably work it out from there.
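    A minimal sketch of that suggestion, assuming zero-padding is only needed below 10 (the 10-99 case is unverified):

    my $base = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=';
    for my $i (1 .. 10000) {
        my $id  = $i < 10 ? sprintf('%02d', $i) : $i;   # pad 1-9 to 01-09
        my $url = $base . $id;
        # ... fetch $url as shown in the other replies
    }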

    -- Ken

      Hello Ken, hello Anonymous Monk - thanks for the replies!

      @Ken: I guess you are right, I want to loop through (1 .. 10000), not (0 .. 10000).


      I am going to try it out with

      %02d while $i < 10

      I'll come back and report all my findings.

      Regards, beginner

      Update:

      String concatenation:

      #!/usr/bin/env perl
      use 5.12.0;
      use warnings;
      use LWP::Simple;
      use Data::Dumper;

      say Dumper({ map { $_ => eval { get 'http://example.com/' . $_ } } 0 .. 1000 });


      I'll try this out! Thanks.
      I'll come back and report.
Re: iterator variable in a LWP-UA-code snippet
by Anonymous Monk on Oct 31, 2010 at 12:02 UTC
    String concatenation:
    #!/usr/bin/env perl
    use 5.12.0;
    use warnings;
    use LWP::Simple;
    use Data::Dumper;

    say Dumper({ map { $_ => eval { get 'http://example.com/' . $_ } } 0 .. 1000 });
Duplicate: please delete Re: iterator variable in a LWP-UA-code snippet
by zentara (Cardinal) on Oct 31, 2010 at 12:14 UTC
    It was cold and I double clicked the create button, and it made 2 ! :-)


      Hee He

      "It was cold and I double clicked the create button, and it made 2 ! :-)"

      Only you could get away with that and not lose XP/face!

        I think I found the holy grail of how to double my xp per post, just by double clicking. :-)

        I'm not really a human, but I play one on earth.
        Old Perl Programmer Haiku ................... flash japh
      Hello zentara, hello all! Many thanks for the quick replies.

      Your answers are very, very helpful and inspiring! Really!


      Of course I want to have the content, but I have to prepare for this job first. I want to parse the content of all the pages; note that some have empty results, since we iterate over many pages.

      Note: I want to run over a bunch of pages... some are empty, some not.

      See the loop over Hessen:
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5504
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5505
      http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5514
      etc

      I look for the data that is in the


      With that information I want to drive the parser - probably I'll do it with HTML::TreeBuilder::XPath - to get the data out of the pages.

      And finally I want to store it in a database.
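      A minimal, hypothetical sketch of that plan - the XPath expression, the SQLite backend, and the table layout are all assumptions, since the page markup isn't shown here:

      use strict;
      use warnings;
      use HTML::TreeBuilder::XPath;
      use DBI;

      # assumption: SQLite via DBD::SQLite; any DBI backend would do
      my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '', { RaiseError => 1 });
      $dbh->do('CREATE TABLE IF NOT EXISTS schools (id INTEGER, info TEXT)');

      sub store_school {
          my ($id, $html) = @_;
          my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
          my $info = $tree->findvalue('//table//td');   # hypothetical XPath - adjust to the real markup
          $dbh->do('INSERT INTO schools (id, info) VALUES (?, ?)', undef, $id, $info);
          $tree->delete;   # free the parse tree
      }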

      But I'm also musing about the idea of using HTTP::Request::Common; what do you think? It could make things easier, couldn't it?
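      For what it's worth, a minimal sketch of what HTTP::Request::Common buys you - its exported GET helper builds the HTTP::Request object, headers included, in one call:

      use LWP::UserAgent;
      use HTTP::Request::Common qw(GET);

      my $ua  = LWP::UserAgent->new;
      my $url = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503';

      # GET replaces the HTTP::Request->new(...) / header(...) two-step
      my $response = $ua->request(GET $url, Accept => 'text/html');
      print $response->is_success ? "Success $url\n" : 'Error: ' . $response->status_line . "\n";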

      I look forward to hearing from you!

        Again you've chosen to ignore the formatting advice given when posting; I've mentioned this to you a couple of times. Honestly, it won't take long to read and learn.

        You've also been asking questions similar to this for quite some time, and have been provided several solutions and code to get you going. I understand you are trying to get a working solution for this task. Which parts exactly are you having problems with? Looping? If so, see Recursion: The Towers of Hanoi problem from the Subroutines subsection of the tutorials.

        You mention you want to run this for several sites, which I presume have different markup. Why not just call a different parsing subroutine for each site, e.g. via a dispatch table as sketched below? I'm sure you've mentioned at least a couple of different sites you wish to parse in your previous posts.
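        A minimal sketch of that idea, with hypothetical parser subs - one dispatch-table entry per site you scrape:

        use URI;

        my %parser_for = (
            'dms-schule.bildung.hessen.de' => \&parse_hessen,   # hypothetical sub
            'other-site.example'           => \&parse_other,    # hypothetical sub
        );

        # pick the parser by the URL's host; skip hosts we have no parser for
        my $host   = URI->new($url)->host;
        my $parser = $parser_for{$host};
        $parser->($pagecontent) if $parser;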

Re: iterator variable in a LWP-UA-code snippet
by aquarium (Curate) on Oct 31, 2010 at 22:56 UTC
    I'm fairly certain that the website uses AJAX/JSON behind the scenes to get the data into the HTML in the first place. Therefore it might be possible (if you ask nicely) to get from the website admin the URLs that give you the raw data, rather than constructing school-info URLs (some of which don't exist) and scraping the HTML.
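    If such raw-data URLs exist, fetching one would look roughly like this - the endpoint below is entirely hypothetical:

    use strict;
    use warnings;
    use LWP::Simple;
    use JSON::PP;   # assumption: any JSON parser would do

    # hypothetical endpoint - the real URL would have to come from the site admin
    my $json = get('http://dms-schule.bildung.hessen.de/suchen/schools.json');
    if (defined $json) {
        my $schools = decode_json($json);
        # ... work with the decoded data structure instead of scraping HTML
    }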
    the hardest line to type correctly is: stty erase ^H