rizzy has asked for the wisdom of the Perl Monks concerning the following question:
This is probably a very simple thing to do, but I can't seem to find an answer. Here's the problem: I am downloading and parsing hundreds of thousands of HTML files (from a list), and for whatever reason, every once in a while, the Perl script is not able to access one of the files (even though it is there and the script usually "sees" it). When this happens, the code stops running.
What I would like to do is record the filename that couldn't load and continue looping through the rest of the list. That way, I don't have to babysit the thing and can come back later to retry the ones that didn't work. Here's the basic structure of my code:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    open ("output","> /output/results.txt") || die ("Could not open output file $!");
    open ("input", "< /input/urllist.txt") || die ("Could not open input file $!");
    $/=undef;
    my $urllist=<input>;
    while($urllist =~ m{(http://.+\.html)}g){
        my $url=$1;
        my $html='';
        $html = get("$url") or print "Couldn't fetch $url.";
        while($html=~ m{(find whatever I want)}gi){
            my $mysearch=$1;
            print output "$url|$mysearch\n";
        }
    }
    close ("output");
    close ("input");
Basically, I have a file stored locally that has a bunch of urls. I open this and for every url, I try to access it (using the get command) and then search for various things and save the results. So, it calls the "get" command for hundreds of thousands of urls. Just because of the nature of the web, some of these will not work when it tries, even though they are there. When it calls "get" and fails to find the file, how do I tell it to either keep going (by maybe replacing $html with a whitespace or something) or to move on to the next matched $url from the urllist? Thanks in advance.
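For illustration, here is a minimal sketch of the "record it and move on" behaviour described above. It relies on the fact that `get` from LWP::Simple returns `undef` when a fetch fails, so the loop can test for that and call `next`. The helper name `fetch_all`, the stub fetcher, and the example URLs are all made up for this sketch and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tries every URL and returns two references, one to
# a hash of the pages that loaded and one to a list of the URLs that did
# not. $fetch is any code ref that returns the page body, or undef on
# failure (LWP::Simple's get() behaves this way).
sub fetch_all {
    my ($urls, $fetch) = @_;
    my (%html_for, @failed);
    for my $url (@$urls) {
        # eval also covers a fetcher that dies instead of returning undef
        my $html = eval { $fetch->($url) };
        if (!defined $html) {
            push @failed, $url;   # remember it for a later retry
            next;                 # carry on with the next URL
        }
        $html_for{$url} = $html;
    }
    return (\%html_for, \@failed);
}

# Demo with a stub fetcher so the sketch runs without network access.
# The stub pretends that any URL containing "missing" cannot be loaded.
my $stub = sub { $_[0] =~ /missing/ ? undef : "<html>$_[0]</html>" };
my @demo_urls = (
    'http://example.com/a.html',
    'http://example.com/missing.html',
    'http://example.com/b.html',
);
my ($pages, $failed) = fetch_all(\@demo_urls, $stub);
print "Couldn't fetch $_\n" for @$failed;
```

For real downloads, the stub could be swapped for `sub { require LWP::Simple; LWP::Simple::get($_[0]) }`, and the failed list written to a file so those URLs can be retried in a later run.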
Replies are listed 'Best First'.
Re: How to allow loop to continue to run after a problem opening a file
by halfcountplus (Hermit) on Oct 20, 2010 at 03:08 UTC
by rizzy (Sexton) on Oct 20, 2010 at 03:23 UTC