in reply to Re: Correct code for deleting old file?
in thread Correct code for deleting old file?

Thanks for your help. I know there is a bug in the overall code, as the taxonomy file never gets replaced (that's acknowledged by the software company), but as it is an older version they won't fix the bug. The getFile code is:
sub getFile {
    my $url                = shift;
    my $type               = shift;
    my $download_directory = shift;
    my($remoteName, $remotePath, $sysCall, $target, $rootName,
       $current_directory, $matchLineSave, $thisFileName, @listing,
       $template, $matchName, $matchLine, @history, $i, $key, $value);

    &checkPath($download_directory);

    # split download URL into directory and name
    if ($url =~ /(.*\/)(.+?)$/) {
        $remotePath = $1;
        $remoteName = $2;
    } else {
        &lastWords("Cannot parse filename from $url");
    }

    if ($url =~ /^http:/i) {
        if ($remoteName =~ /[\*\?]/) {
            &lastWords("Wild card filenames not allowed for http downloads");
        }
        $matchName = $remoteName;
    } else {
        # get a directory listing from FTP sites
        unlink("$download_directory/.listing");
        unlink("$download_directory/index.html");
        $sysCall = $wget;
        if ($ENV{'WINDIR'}) {
            $sysCall =~ s#/#\\#g;
        }
        while (($key, $value) = each(%wget_options)) {
            if (length($value)) {
                $sysCall .= " $key=$value";
            } else {
                $sysCall .= " $key";
            }
        }
        $sysCall .= " --dont-remove-listing";
        $sysCall .= " --directory-prefix=$download_directory";
        $sysCall .= " \"$remotePath\"";
        if (system($sysCall)) {
            &lastWords("Could not retrieve directory listing with $sysCall");
        }
        unless (-e "$download_directory/.listing" && -s "$download_directory/.listing") {
            &lastWords("Failed to retrieve directory listing with $sysCall");
        }

        # got a directory listing from remote site, slurp local download .history into array
        if (open(HISTORY, "<$download_directory/.history")) {
            @history = <HISTORY>;
            close HISTORY;
        }

        # make filename into regular expression
        $template = $remoteName;
        $template =~ s/(\W)/\\$1/g;
        $template =~ s/\\\?/\./g;
        $template =~ s/\\\*/\.\*/g;

        # work through directory .listing
        # exit with first match that isn't already in .history
        $matchName = "no match";
        if (open(LISTING, "$download_directory/.listing")) {
            @listing = <LISTING>;
            close LISTING;
            foreach (@listing) {
                if (/\s+($template)\s*$/ || /\s+($template)\s+->/) {
                    # got a match from .listing
                    $matchName = $1;
                    $matchLine = $_;
                    for ($i = 0; $i <= $#history; $i++) {
                        if ($history[$i] =~ /\s+($matchName)\s*$/ || $history[$i] =~ /\s+($matchName)\s+->/) {
                            if ($history[$i] eq $matchLine) {
                                # this file already downloaded
                                $matchName = "";
                            } else {
                                # new version of file available
                                $history[$i] = $matchLine;
                                $matchLineSave = $matchLine;
                                $matchLine = "";
                            }
                            last;
                        }
                    }
                    if ($matchName && $matchName ne "no match") {
                        last;
                    }
                }
            }
        }

        # if $matchName is empty, there is no new update
        # if $matchName is "no match", $remoteName was not found in the directory listing
        # if $matchLine is empty, no need to append to .history
        if ($matchName eq "no match") {
            &lastWords("No match to $url on remote server");
        }
        if ($matchName && $matchLine) {
            push @history, $matchLine;
            $matchLineSave = $matchLine;
        }
        if ($matchName) {
            # replace (potentially) wildcard $url with actual download path
            $url = "$remotePath$matchName";
        } else {
            # nothing to retrieve
            &message("No update available for $url\n");
            return 0;
        }
    }

    # download the file
    $sysCall = $wget;
    if ($ENV{'WINDIR'}) {
        $sysCall =~ s#/#\\#g;
    }
    while (($key, $value) = each(%wget_options)) {
        if (length($value)) {
            $sysCall .= " $key=$value";
        } else {
            $sysCall .= " $key";
        }
    }
    # seems like wget cannot cope with --retr-symlinks and --timestamping
    if ($matchLineSave =~ /\s+($matchName)\s*$/) {
        $sysCall .= " --timestamping";
    } else {
        $sysCall .= " --retr-symlinks";
    }
    $sysCall .= " --directory-prefix=$download_directory";
    # Only use --ignore-length option if desperate
    # if ($url =~ /^http:/i) {
    #     $sysCall .= " --ignore-length";
    # }
    $sysCall .= " \"$url\"";

    # Try each download twice.
    # The second attempt will normally not result in a download because timestamping is on.
    # The exception is when a file is updated on the remote server during downloading:
    # wget resumes the download, and ends up with a mixed file of the correct size,
    # but the file date is that of the original file.
    # Don't try to do this if the download file is a link
    if ($matchLineSave =~ /\s+($matchName)\s*$/) {
        if (system($sysCall)) {
            &lastWords("Error return from: $sysCall");
        }
    }
    if (system($sysCall)) {
        &lastWords("Error return from: $sysCall");
    } else {
        $logEntry .= "Downloaded $url to $download_directory\n";
        print "Resting for 10 seconds\n";
        sleep 10;
    }

    $thisFileName = "$download_directory/$matchName";
    # ensure file is writeable
    chmod 0644, $thisFileName;

    if ($thisFileName =~ /(.*)\.gz$/i || $thisFileName =~ /(.*)\.Z$/i) {
        $rootName = $1;
        # decompress with forced overwrite
        $sysCall = "$gzip -df $thisFileName";
        if ($ENV{'WINDIR'}) {
            $sysCall =~ s#/#\\#g;
        }
        # try up to 3 times before giving up
        if (system($sysCall) && system($sysCall) && system($sysCall)) {
            &lastWords("Error return from: $sysCall");
        } else {
            &message("Expanded $thisFileName\n");
            $thisFileName = $rootName;
        }
    }

    if ($type eq "taxonomy") {
        # taxonomy file may be a tar archive
        if ($thisFileName =~ /\.tar$/i) {
            # unpack if tar archive
            # have to cd to $local_taxonomy_directory because -C option of tar unreliable in DOS
            if ($ENV{'WINDIR'}) {
                $current_directory = `cd`;
                $current_directory =~ s#\\#/#g;
            } else {
                $current_directory = `pwd`;
            }
            chomp($current_directory);
            if ($local_taxonomy_directory =~ /^(\w:)(.*)/) {
                chdir($1);
                chdir($2);
            } else {
                chdir($local_taxonomy_directory);
            }
            $thisFileName =~ /.*\/(.+?)$/;
            $sysCall = "$tar -xf $1";
            if ($ENV{'WINDIR'}) {
                $sysCall =~ s#/#\\#g;
            }
            # try up to 3 times before giving up
            if (system($sysCall) && system($sysCall) && system($sysCall)) {
                &lastWords("Error return from: $sysCall");
            } else {
                &message("Unpacked $thisFileName\n");
            }
            unlink($thisFileName);
            # cd back to original directory (just in case)
            if ($current_directory =~ /^(\w:)(.*)/) {
                chdir($1);
                chdir($2);
            } else {
                chdir($current_directory);
            }
        }
    } elsif ($type eq "unigene") {
        # nothing more to do for UniGene
    } else {
        # sequence database files need to be renamed
        if ($type eq "fasta") {
            $target = "$local_incoming_directory/$db_name\_xyzzy.fasta";
        } elsif ($type eq "name") {
            $target = "$local_incoming_directory/$db_name\_xyzzy.nam";
        } elsif ($type eq "reference") {
            # use the existing extension, if any, otherwise use .ref
            # lose ".complete" from MSDB
            $thisFileName =~ s/msdb\.ref\.complete\./msdb\.ref\./;
            if ($thisFileName =~ /.*\.(.+?)$/) {
                $target = "$local_incoming_directory/$db_name\_xyzzy.".lc($1);
            } else {
                $target = "$local_incoming_directory/$db_name\_xyzzy.ref";
            }
        }
        if (move($thisFileName, $target)) {
            &message("Renamed $thisFileName to $target\n");
        } else {
            &lastWords("Error return from renaming $thisFileName to $target");
        }
    }

    # if we get here, download must have succeeded, so can now write out
    # the updated .history, to avoid getting the same file again
    if (@history) {
        open(HISTORY, ">$download_directory/.history");
        foreach (@history) {
            print HISTORY $_;
        }
        close HISTORY;
    }

    # all done
    return 1;
}

Re^3: Correct code for deleting old file?
by afoken (Chancellor) on Jun 15, 2012 at 20:18 UTC

    Your code contains a lot of bad practices:

    • &function(...). Correct in Perl 4, still working but with really unwanted side effects in Perl 5. Get rid of all ampersands in front of function names, except when you want to make a reference to a function.
    • Two-argument open. Use three-argument open.
    • Lack of error checks, especially after open. Unless you use autodie, add or die "Could not open $filename: $!" after open.
    • Single-argument system() with hand-crafted quoting. Always begs for trouble, because shells tend to have very different quoting rules. At least pass arguments for the invoked program as a list of arguments. See also Re^3: Perl Rename for a little background, http://www.in-ulm.de/~mascheck/various/ for more. (A short sketch of these points follows this list.)
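
    A minimal sketch of the points above combined (the file name, directory, URL, and the check_path() helper are made up for the example, not taken from your script):

        # make sure the download directory exists; plain call, no leading ampersand
        check_path('/tmp/downloads');

        # pass the command as a list, so no shell and no hand-crafted quoting is involved
        my @cmd = ('wget', '--timestamping',
                   '--directory-prefix=/tmp/downloads',          # hypothetical directory
                   'http://www.example.org/some_file.txt');      # hypothetical URL
        system(@cmd) == 0
            or die "system @cmd failed: $?";

        # three-argument open with an error check, instead of open(HISTORY, "<$file")
        my $history_file = '/tmp/downloads/.history';             # hypothetical path
        open(my $history_fh, '<', $history_file)
            or die "Could not open $history_file: $!";
        my @history = <$history_fh>;
        close $history_fh;

        sub check_path {
            my ($dir) = @_;
            -d $dir or mkdir $dir or die "Could not create $dir: $!";
        }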

    Also, you seem to go to great pains to handle URLs and to issue HTTP and FTP requests using wget. Perl has LWP, so you do not have to mess with the shell and external programs at all. Look first at LWP::Simple, which should nearly be sufficient to replace wget. If you need more control, look at LWP::UserAgent. As you seem to be writing a spider, also have a look at LWP::RobotUA.
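
    For instance, replacing one wget call could look roughly like this with LWP::Simple (the URL and local path are made up; mirror() only fetches when the remote file is newer than the local copy, much like wget --timestamping):

        use LWP::Simple;   # exports mirror() and is_success()

        my $url   = 'http://www.example.org/pub/databases/speclist.txt';  # hypothetical URL
        my $local = '/usr/local/mascot/incoming/speclist.txt';            # hypothetical path

        my $status = mirror($url, $local);
        if (is_success($status) || $status == 304) {   # 304 = not modified, nothing to fetch
            print "Got $url -> $local (HTTP $status)\n";
        } else {
            die "Download of $url failed with HTTP status $status\n";
        }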

    Perl also can decompress gzip files and unpack tar files, if you like, even both in one package, Archive::Tar, and of course without having to mess with the shell.
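
    As a rough sketch (archive and directory names made up), unpacking a downloaded .tar.gz without shelling out could look like this; for a plain .gz that is not a tar archive, IO::Uncompress::Gunzip would cover the decompression step:

        use Archive::Tar;

        my $archive = '/usr/local/mascot/incoming/taxdump.tar.gz';  # hypothetical archive
        my $destdir = '/usr/local/mascot/taxonomy';                 # hypothetical target

        # Archive::Tar reads gzip-compressed archives directly (via IO::Zlib)
        my $tar = Archive::Tar->new($archive, 1)
            or die "Could not read $archive: ", Archive::Tar->error;

        # extract() unpacks relative to the current directory
        chdir $destdir or die "Could not chdir to $destdir: $!";
        $tar->extract  or die "Could not extract $archive: ", $tar->error;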

    For URL handling, look at URI and its helpers.
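
    For example, splitting a download URL into its parts (URL made up) then takes one line per piece:

        use URI;

        my $uri  = URI->new('ftp://ftp.example.org/pub/databases/speclist.txt');
        my $host = $uri->host;                   # "ftp.example.org"
        my $path = $uri->path;                   # "/pub/databases/speclist.txt"
        my $name = ($uri->path_segments)[-1];    # "speclist.txt"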

    And for my personal taste, the getFile() function is about 200 lines longer than it should be. Too much code for one function, too deeply nested, too many variables. As a rule of thumb, a function should not exceed one screen, because otherwise, you will get lost in the code. At my university, a screen was defined as a VT420, i.e. 25 lines of no more than 80 characters, but in the real world, my editor shows about 40 lines of about 110 characters. Consider splitting the function into smaller parts.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Thanks a million. However, that sounds like a complete rewrite would be ideal.

      However, I should point out that the scripts aren't mine (my programming expertise is with REALbasic); they are commercial (which might explain why they are so poorly documented) and I have basically no knowledge of Perl. I'm just the poor sucker who got stuck with setting up our new Mascot PC (the guy who did it the last time has long since left).

      Our Mascot version is a few years old, so the database update script no longer works correctly with the new databases (there have been some changes) and needed updating.

      I managed to make most of the required changes, but there is an acknowledged bug: a certain file (speclist.txt) needs to be replaced, yet the download does not replace it, it just adds another copy to the folder (as speclist1, speclist2, etc.). Their suggested workaround is to delete the file manually before running the script - yeah, right, who is going to remember that when (a) the script only runs once a month, and (b) it is called automatically.

      So I tried to add a file deletion routine to the script:

      #
      # delete old speclist
      #
      if ( $matchName eq "speclist.txt" ) {
          if (-e "$local_taxonomy_directory/speclist.txt") {     # does the speclist file exist
              unlink("$local_taxonomy_directory/speclist.txt");  # delete it
          } else {
              &lastWords("Failed to delete original speclist");
          }
      }
      #

      but that doesn't seem to work and I don't know why ... yet
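
      One thing I might try next (just a sketch along the same lines, using the same variables) is to check what unlink() actually returns, so the script at least reports why the delete fails:

      # delete old speclist, and report the reason if the delete fails
      if ($matchName eq "speclist.txt") {
          my $old_speclist = "$local_taxonomy_directory/speclist.txt";
          if (-e $old_speclist) {
              unlink($old_speclist)
                  or &lastWords("Failed to delete $old_speclist: $!");
          }
      }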

      Thanks

      Markus

Re^3: Correct code for deleting old file?
by flexvault (Monsignor) on Jun 15, 2012 at 13:52 UTC

    markuswinter,

    I took a quick look at the code, and I think your original solution of deleting the file may be the better choice. There is still a lot of code that isn't defined here and is probably pulled in by a 'use' statement.

    Can you test the script in a non-destructive way?

    Usually on PM we try to give hints rather than working code. In your case, it sounds like production code that you want to modify. I strongly suggest you set up a 'sandbox' (testing environment) before you modify production code. If I'm misunderstanding your problem, please correct me.

    Thank you

    "Well done is better than well said." - Benjamin Franklin