Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

hello dear all


I have a nice script that works as an image scraper; for the first trials and tests everything went well.
Here is the list of URLs in urls.txt that I run the script against. Note that this is only a short list: I need to run against 2500 URLs, so it would be great if the script were a bit more robust and kept running even when some URLs are not available or take too long to fetch. I think the script runs into problems when some URLs are unavailable, respond too slowly, or block MozRepl and WWW::Mechanize::Firefox for too long.

Well, do you think my ideas and suggestions point to the actual cause of the issue or not? If so, how can we improve the script and make it stronger, more powerful, and robust enough that it does not stop too soon?

I would love to hear from you.

Greetings

See below the code and the list of URLs; note that this is only a very, very short list...

http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
http://www.bezmuri.ch
http://www.moehlin.ch/schulen/
http://www.schule-mewo.ch
http://www.bez-frick.ch
http://www.bezendingen.ch
http://www.bezbrugg.ch
http://www.schule-bremgarten.ch/content/view/20/37/
http://www.bez-balsthal.ch
http://www.schule-baden.ch
http://bezaarau.educanet2.ch/info/.ws_gen/index.htm
http://www.benedict-basel.ch
http://www.institut-beatenberg.ch/
http://www.schulewilchingen.ch
http://www.ksuo.ch
http://www.international-school.ch
http://www.vsgtaegerwilen.ch/
http://www.vgk.ch/
http://www.vstb.ch




Well, I guess I would be very happy if it were more robust than it is now.


Of course, it is driving a real browser, since it uses WWW::Mechanize::Firefox.

So it might be somewhat unstable, perhaps a bit more than other screen-scraping solutions. I sometimes get errors like the ones below. Note that I also had a closer look at the troubleshooting page http://search.cpan.org/~corion/WWW-Mechanize-Firefox-0.64/lib/WWW/Mechanize/Firefox/Troubleshooting.pod with its hints, tricks, and workarounds for various known problems.

see the code:

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize::Firefox;

my $mech = new WWW::Mechanize::Firefox();

open my $urls, '<', 'urls.txt' or die $!;

while (<$urls>) {
    chomp;
    next unless /^http/i;
    print "$_\n";

    $mech->get($_);
    my $png = $mech->content_as_png;

    my $name = $_;
    $name =~ s#^http://##i;
    $name =~ s#/##g;
    $name =~ s/\s+\z//;
    $name =~ s/\A\s+//;
    $name =~ s/^www\.//;
    $name .= ".png";

    open(my $out, '>', "/home/martin/images/$name") or die $!;
    binmode $out;
    print $out $png;
    close $out;

    sleep 5;
}
See the results and, yes, also the errors where it stops ("Datei oder Verzeichnis nicht gefunden" is the German system message for "No such file or directory"):
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
Datei oder Verzeichnis nicht gefunden at test_10.pl line 24, <$urls> line 3.
martin@linux-wyee:~/perl> perl test_10.pl
http://www.bez-zofingen.ch
http://www.schulesins.ch
http://www.schulen-turgi.ch/pages/bezirksschule/startseite.php
http://www.schinznach-dorf.ch
http://www.schule-seengen.ch
http://www.gilgenberg.ch/schule/bez/2005-06/
http://www.rheinfelden-schulen.ch/bezirksschule/
Not Found at test_10.pl line 15
martin@linux-wyee:~/perl>



What do you suggest? How can we make the script a bit more robust, so that it does not stop so early?

greetings

Replies are listed 'Best First'.
Re: WWW::Mechanize::Firefox runs well: some attempts to make the script a bit more robust
by Marshall (Canon) on Apr 02, 2012 at 01:50 UTC
    I don't have the right combination of stuff to run your code right now, but yes, http://www.rheinfelden-schulen.ch/bezirksschule/ will not be found. You aren't checking for success or failure of the page fetch. I'm guessing that the "not found" error happens when you try to use the $mech object for a page that didn't "work". So I would suggest checking whether the "get" worked before trying to do anything more with it.
    $mech->get($_);
    if (!$mech->success()) {
        # ... failed somehow ... do something
        print "get of $_ failed!\n";
        next;
    }
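    Since your output shows "Not Found at test_10.pl line 15", it looks like get() itself can die on a failed fetch, so checking success() afterwards may not be enough on its own. Here is a minimal sketch (untested against your Firefox/MozRepl setup, and the message text is only illustrative) that also traps a dying get():

    my $ok = eval { $mech->get($_); 1 };
    if (!$ok or !$mech->success()) {
        # either get() died or the response was not a success
        print "get of $_ failed: " . ($@ || 'no success status') . "\n";
        next;    # skip this URL and carry on with the rest
    }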
    PS: I find $name =~ s#^http://##i; a bit "hard on the eyes". I sometimes use the | character,  s|^http://||; but some folks object to that as the | normally means "or" in a regex. I think s[^http://][]; will work also?

    I don't know whether these "not found" errors are transient or not. You can use the redo function to go back to the top of the while() block without re-evaluating the condition (i.e. without reading the next URL); this is like next; except that the while conditional is not re-evaluated. Of course you will need to wrap the redo; in some appropriate max-retries counter so that you don't wind up in an infinite loop. But the first step would be to see if just skipping that URL as above allows the code to complete. Then we can talk about how to give it another chance; see the sketch below.
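    Just a rough sketch of that retry idea (the $tries counter and MAX_RETRIES constant are illustrative, not from your original code):

    use constant MAX_RETRIES => 3;

    my $tries = 0;
    while (<$urls>) {
        chomp;
        next unless /^http/i;

        my $ok = eval { $mech->get($_); 1 };
        if (!$ok or !$mech->success()) {
            if (++$tries < MAX_RETRIES) {
                sleep 5;
                redo;    # retry the same URL; the while condition is not re-evaluated
            }
            print "giving up on $_ after " . MAX_RETRIES . " attempts\n";
            $tries = 0;
            next;        # move on to the next URL
        }
        $tries = 0;

        # ... save the screenshot as before ...
    }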

    BTW: It's been some months since we talked about this project. What led you to go down the road of using Mechanize::Firefox? This adds an additional layer of complication to the whole thing; I, for example, am having some version issues with Firefox and MozRepl, so there are some "landmines" along this path.

    Update:
    If you add: $|=1; at the top of the code, this will un-buffer writes to STDOUT and make it easier to follow what the code is doing while it executes. If you don't do that, there is a long lag between the program printing and that output appearing on the screen because the typical buffer is ~4KB - many lines are "printed" by the program before they are "flushed" to the output. "flushing every print" has a performance impact, but in this case, it will make no difference at all.
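    For example (just illustrating the placement; nothing else changes):

    #!/usr/bin/perl
    use strict;
    use warnings;

    $| = 1;    # autoflush STDOUT so each print shows up immediately

    use WWW::Mechanize::Firefox;
    # ... rest of the script unchanged ...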