Mosley has asked for the wisdom of the Perl Monks concerning the following question:

On Win32 I use LWP::Simple to fetch files from the web, but sometimes the file is too big and causes my script to run forever. If I cancel the script before it's finished, the temp file I am writing gets write-protected. If I run the script again on a smaller file, it uses the data from the last session plus the new page's data. Does anyone have a solution for checking the file size first, before grabbing it and putting it into a variable? Or, if the script is taking too long to process the data, a way to quit and unlink the temp file? I AM NEW TO PERL, so don't everybody laugh at one time. Here is my code:
use LWP::Simple;

if ($ENV{'QUERY_STRING'}) {
    $buffer = $ENV{'QUERY_STRING'};
} else {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $value =~ s/\n/ /g;
    $FORM{$name} = $value;
}
print "Content-type: text/html\n\n";
$URL = $FORM{'url'};
$page = get($URL);
$page =~ s/\s+/ /g;

# I do more with the $page variable later but I break it off here.
# Strip most of the HTML, <script>, <style> and punctuation.
# I think it's greedy. Any help? I prefer without perl module?
$break = $page;
$break =~ tr/A-Z/a-z/;
$break =~ s/&nbsp\;/ /g;
$break =~ s/<s.*?<\/s.*?>//igs;
$break =~ s/&lt\;//igs;
$break =~ s/&gt\;//igs;
if(!($break)) {
    &error;
    &print_footer;
    exit;
}
&print_main_header;
($text = $break) =~ s/<(\/|!)?[-.a-zA-Z0-9]*.*?>//g;
$text =~ s/[,.?':!"@#\$\%&*()_|\/\-=+\^~`\{\}\[\]\\]//g;
$text =~ s/\s+/ /g;
@text = split(/\s/, $text);

Re: Check remote file for size?
by pjf (Curate) on Oct 13, 2001 at 12:22 UTC
    G'day Mosley,

    Firstly, I'd strongly suggest you consider using the CGI perl module, which will do QUERY_STRING decoding for you. It comes standard with perl, provides many very useful features, and will do a better job than your home-grown decoding (particularly for things like parameters with multiple values).

    To answer your question about checking a file's size, you can use the -s operator, like this:

    $file = "/path/to/file";
    if (-s $file) {
        # File has non-zero size.
    } else {
        # File is empty or does not exist.
    }
    -s returns the file size in bytes if you need that much information.

    In your particular example, I would suggest opening your file for writing using open(FILE,">$filename"). The ">" will clobber the contents of the file if it exists, as opposed to ">>" which will append to the file. Clobbering the contents means you shouldn't need to care about the file size at all.
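
    A minimal sketch of the two points above, assuming a scratch file created with File::Temp. The helper name write_file is made up for illustration. It uses the three-argument form of open and returns what -s reports after each write, so the difference between ">" and ">>" is visible in the sizes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $data to $path. $mode is '>' (truncate first) or '>>' (append).
# Returns the file size in bytes after the write, as reported by -s.
sub write_file {
    my ($path, $mode, $data) = @_;
    open(my $fh, $mode, $path) or die "Cannot open $path: $!";
    binmode($fh);    # avoid newline translation on Win32
    print {$fh} $data;
    close($fh) or die "Cannot close $path: $!";
    return -s $path;
}

# Hypothetical scratch file, removed automatically when the script exits.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);      # close our handle before reopening the file by name

my $size_first    = write_file($tmp_path, '>',  'first page');  # 10 bytes
my $size_second   = write_file($tmp_path, '>',  'second');      # old data gone
my $size_appended = write_file($tmp_path, '>>', ' more');       # added to 'second'
```

    Closing the handle explicitly matters on Win32, where a file that is still open elsewhere may refuse to be overwritten or deleted.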

    Also, there's nothing wrong with using perl modules. In fact, it's very highly recommended. More often than not, re-inventing the wheel results in something which isn't quite as round.

    If you're looking at stripping out HTML tags, I'd suggest looking at HTML::Parser, which will do most of the work for you.
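
    As a rough sketch of what that could look like, assuming HTML::Parser version 3 is installed: the sub name html_to_words and the final ASCII-only filter are illustrative choices, not part of the original script, and the filter would drop accented letters.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the lower-cased words found in the text of $html,
# skipping the contents of <script> and <style> elements.
sub html_to_words {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded.
        text_h => [ sub { $text .= $_[0] . ' ' }, 'dtext' ],
    );
    $parser->unbroken_text(1);                  # do not split a text run into pieces
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;

    $text =~ tr/\xA0/ /;          # &nbsp; decodes to a non-breaking space
    $text = lc $text;
    $text =~ s/[^a-z0-9\s]//g;    # crude punctuation strip (ASCII only)
    return grep { length } split /\s+/, $text;
}
```

    Because each text run is followed by a space, words separated only by an inline tag come out as two words, which differs slightly from deleting the tags with a regex.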

    If you want your script to die automatically after a certain period of time, you can set an alarm. Unfortunately I don't know how portable this is across non-UNIX operating systems.

    $SIG{ALRM} = sub { die "Script took too long.\n" };
    alarm(60);    # Create alarm signal in 60 seconds.
    # ... do long-running stuff.
    alarm(0);     # Cancel the alarm.
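
    A common way to package the alarm idea is to wrap it in eval, so that a timeout is caught instead of killing the whole script. The sketch below assumes a helper named with_timeout (a made-up name). On Win32, alarm is emulated and may not interrupt a blocking system call such as a socket read, so treat this as a Unix-oriented example.

```perl
use strict;
use warnings;

# Run $code and give up after $seconds.
# Returns (1, $result) on success or (0, $error_message) on failure/timeout.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);
        $result = $code->();
        alarm(0);    # finished in time: cancel the alarm
        1;
    };
    my $err = $@;
    alarm(0);        # make sure no alarm stays pending after an error
    return $ok ? (1, $result) : (0, $err);
}
```

    A call such as with_timeout(60, sub { get($URL) }) would then return a failure flag rather than running forever, provided the platform can actually deliver the signal.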
    If you want your code to always clean up after itself, you can do so with an END block.
    END { unlink($tmpfile); }
    An END block runs just before the perl interpreter exits.
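
    One caveat: END blocks are skipped when the script is killed by a signal it does not trap. A sketch that covers the Ctrl-C case is below. The names register_temp and cleanup_temp are invented for illustration.

```perl
use strict;
use warnings;

my @temp_files;    # paths to delete when the script exits

# Remember one or more paths for later cleanup.
sub register_temp {
    push @temp_files, @_;
    return;
}

# Delete every registered file that still exists.
sub cleanup_temp {
    for my $path (@temp_files) {
        unlink($path) if -e $path;
    }
    @temp_files = ();
    return;
}

# Turn Ctrl-C into a normal exit so that the END block still runs.
$SIG{INT} = sub { exit 1 };

END { cleanup_temp() }
```

    Any open handle on the temp file should be closed before the script exits, since Win32 generally will not delete a file that is still open.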

    Hope that you find all of the above useful.

    Cheers,
    Paul

Re: Check remote file for size?
by thinker (Parson) on Oct 13, 2001 at 13:27 UTC
    Hi Mosley,

    You could try something like this to check for the size of the file before beginning download.
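    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $url      = "http://localhost/index.html";
    my $MAX_SIZE = 50_000;

    # head() returns an empty list on failure; a server may also omit the length.
    my ($content_type, $document_length, $modified_time, $expires, $server) = head($url);

    if (!defined $content_type) {
        print "HEAD request failed, Ignoring\n";
    } elsif (defined $document_length && $document_length > $MAX_SIZE) {
        print "TooBig, Ignoring\n";
    } else {
        my $dl = get($url);
        print $dl if defined $dl;
    }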
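
    Since a server is not obliged to send a Content-Length, an alternative sketch is to let LWP::UserAgent enforce the limit during the download itself. This assumes LWP::UserAgent is installed. The sub name fetch_limited is made up, and the check relies on the Client-Aborted header that LWP adds when max_size cuts a response short.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch $url, giving up on bodies larger than $max_bytes and on
# connections that stay silent for more than $timeout seconds.
# Returns the content, or undef if the fetch failed or was cut short.
sub fetch_limited {
    my ($url, $max_bytes, $timeout) = @_;
    my $ua = LWP::UserAgent->new(
        max_size => $max_bytes,
        timeout  => $timeout,
    );
    my $res = $ua->get($url);
    return unless $res->is_success;
    return if $res->header('Client-Aborted');    # truncated at max_size
    return $res->decoded_content;
}
```

    Note that timeout here limits inactivity on the connection, not the total running time, and max_size is checked per chunk, so slightly more than the limit may be read before the transfer stops.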

    To parse the HTML, I would suggest using HTML::Parser, but you don't want to use a module. Why constrain yourself so?
    Anyway, hope this helps,

    thinker
      Thanks Paul and Thinker. I will use the modules. I was using ">" to write with, but for some reason it wasn't clobbering the file. After thinking about the temp file being write-protected, I think Win32 still thought the script was running. Your help was just what I needed! This place is about 500% better than Usenet!