learningperl01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am wondering if someone could please help me out; I am having the following problem with the code below. First, here is what the script does. I have about 200,000+ files from which I need to delete/remove the first four lines. These are logs from a server; long story short, for what we are doing we need to delete the first four lines from each one. The problem is that every time I run the script below I get an OUT OF MEMORY error after the fourth or fifth log file (for the most part the files are under 10 MB, but some are as large as 700 MB... but I am only trying to remove the first four lines from each log file). I am pretty new to Perl and programming in general, so I don't know if I am using Perl to its best capabilities or if there is a better way to do what I am doing. Anyway, any help would be appreciated. Thanks for the help in advance.
#!/usr/bin/perl
use File::Find;
use Tie::File;

my $DIRECTORY = $ARGV[0];
find( \&edits, $DIRECTORY );

sub edits() {
    if ( -f and /^33dc01\..*outer.log$/ ) {
        foreach ( $File::Find::name ) {
            print "$File::Find::name\n";
            tie my @file, 'Tie::File', $File::Find::name
                or die "Can't tie $File::Find::name $!";
            splice @file, 0, 4;
            untie @file;
        }
    }
}

Replies are listed 'Best First'.
Re: Removing lines from files
by markkawika (Monk) on May 09, 2008 at 22:33 UTC
    Perl is a poor choice for this operation. I would use a Bourne shell script with standard Unix utilities (assuming you're doing this on a Unix box):
    #!/bin/sh
    DIR=$1
    cd ${DIR}
    find . -name '33dc01.*outer?log' -print | \
    while read fn
    do
        tail -n +5 ${fn} > ${fn}.deleting
        mv ${fn}.deleting ${fn}
    done
Re: Removing lines from files
by GrandFather (Saint) on May 09, 2008 at 22:13 UTC

    Aside from your immediate problem there are a couple of style issues to consider that may help you in the future. First, always use strictures (use strict; use warnings;). Strictures catch a lot of silly typos and similar errors before they get the chance to cost you a few hours of hunting down subtle bugs.

    Don't use prototypes for subroutines. While there are a small number of situations where they are useful, generally they don't do what you think and often do what you don't expect. In the case of your sample code the prototype is ignored in any case, because it hasn't been seen before the sub is used!
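    To illustrate the point, here is a small sketch (not from your script; the sub name is made up) of what an empty prototype does and does not buy you:

    use strict;
    use warnings;

    sub greet() { print "hello\n" }   # the () prototype means "takes no arguments"

    greet();          # fine
    #greet('bob');    # compile-time error: "Too many arguments for main::greet"

    # Prototypes are ignored when the sub is called through a code reference,
    # which is exactly how File::Find calls \&edits:
    my $ref = \&greet;
    $ref->('bob');    # runs without complaint; prints "hello"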

    In the interests of showing you a little more Perl power consider:

    sub edits {
        return unless -f and /^33dc01\..*outer.log$/;

        # Set up for in place edit
        local @ARGV = ($_);
        local $^I   = '.bak';

        print "$File::Find::name\n";

        while (<>) {
            print if $. > 4;
        }
    }

    which uses Perl's in-place edit facility to rewrite the file, skipping the first four lines. Note that this will create backups of the original files with .bak appended to the file name.

    The special variable $. provides the current line number for the file handle most recently accessed. See $^I, @ARGV, $. and $ARGV (which I didn't use, but may be of interest).
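    For comparison, the same in-place edit can be expressed as a one-liner from the shell (the file glob here is just a placeholder for your real log names):

    perl -i.bak -ne 'print if $. > 4; close ARGV if eof' *.outer.log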


    Perl is environmentally friendly - it saves trees
Re: Removing lines from files
by pc88mxer (Vicar) on May 09, 2008 at 19:19 UTC
    Just replace this:
    tie my @file, 'Tie::File', $File::Find::name
        or die "Can't tie $File::Find::name $!";
    splice @file, 0, 4;
    untie @file;
    with:
    open(F, '<', $File::Find::name) or die "...";
    my $tmp = $File::Find::name . " - new";   # see comment below
    open(G, '>', $tmp) or die "...";
    for (1..4) { <F> }          # don't copy the first four lines
    while (<F>) { print G }
    close(G);
    close(F);
    rename($tmp, $File::Find::name)
        or warn "unable to replace $File::Find::name: $!\n";
    The only caveat is that you have to ensure that $tmp can never be the name of an existing log file.

    The tie method is inefficient because it reads the entire file into memory.

    One advantage that this approach has over an in-place re-write is that you won't have to worry about leaving yourself with a corrupted log file if the copy is interrupted.
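    If you want a stronger guarantee that the temporary name can never collide with a real log file (the caveat above), File::Temp can pick the name for you. A rough sketch, reusing the variable names from the code above (the template string is arbitrary):

    use File::Temp qw(tempfile);
    use File::Basename qw(dirname);

    # Create the temp file in the same directory as the original so the final
    # rename() stays on one filesystem; UNLINK => 0 keeps it around for the rename.
    my ( $G, $tmp ) = tempfile(
        'dellines-XXXXXXXX',
        DIR    => dirname($File::Find::name),
        UNLINK => 0,
    );

    The handle $G then takes the place of G above, and $tmp is used in the final rename as before.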

      In a PDF found here (part of the "Lightweight Database Techniques" tutorial materials that Dominus has freed), Dominus says Tie::File is for convenience, not performance, but he also says it's reasonably fast. The job you're doing here is removing the first 4 lines from each of 200,000+ files. Tie::File will have to rewrite every large file from the point of the change to the end. He says that since the module must perform reasonably well for many different types of applications, it's slower than code custom-written for a single application.
Re: Removing lines from files
by NetWallah (Canon) on May 09, 2008 at 19:27 UTC
    Have you tried doing
    shift @file for 1..4;
    instead of the "splice" ? Not sure if that would help, but it might. Update: It does not. See pc88mxer's reply below.

    Why do you use a "foreach" loop over a scalar (foreach ( $File::Find::name ))? It seems that $_ is not even referenced in that loop.

    Although "Tie::File" claims to be efficient, you are not using it for "random, array-style access", so the overhead may be too high for your case. Benchmark can help find more optimal mechanisms.

    ++ on using modules to reduce the amount you are coding! (Although this may appear contrary to what I said above about Tie::File's overhead.)

    Update 1: pc88mxer: The Tie::File docs claim that it does NOT read the file into memory. Also, your method (re-writing the relevant part of the file) is supposed to be LESS efficient than what Tie::File claims for itself. I will attempt to benchmark & post here.

         "How many times do I have to tell you again and again .. not to be repetitive?"

      A check of the source code reveals that the SHIFT method for Tie::File is implemented in terms of the SPLICE method:
      sub SHIFT {
          my $self = shift;
          scalar $self->SPLICE(0, 1);
      }
      Besides, the file is updated after every shift operation, which means you'd be re-writing the file four times(!)

      I don't think Tie::File is the right approach to manipulate 700 MB log files.

      Update: I have a feeling the OP is running into out of memory problems because Tie::File is keeping track of the start of each line even though it doesn't need to. For a multi-megabyte file this clearly would be a problem. However, this is currently just a conjecture.
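      For what it's worth, if Tie::File has to be used at all, it does document a memory option that caps its read cache; a sketch (the 20 MB figure is arbitrary):

      # Limits Tie::File's cache of record *contents* to roughly 20 MB. The table
      # of line offsets it builds is separate bookkeeping, so this may not help
      # if the conjecture above is right.
      tie my @file, 'Tie::File', $File::Find::name, memory => 20_000_000
          or die "Can't tie $File::Find::name: $!";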

Re: Removing lines from files
by jwkrahn (Abbot) on May 09, 2008 at 21:50 UTC
    #!/usr/bin/perl
    use warnings;
    use strict;

    use File::Find;

    @ARGV == 1 or die "usage: $0 directory_name\n";

    my @files;
    find sub {
        if ( -f and /^33dc01\..*outer\.log$/ ) {
            push @files, $File::Find::name;
            print "$File::Find::name\n";
        }
    }, $ARGV[ 0 ];

    ( $^I, @ARGV ) = ( '', @files );

    while ( <> ) {
        next if $. <= 4;
        print;
        close ARGV if eof;
    }
      Using the code below, how would I modify it so that only one file gets processed at a time? Meaning, make the script wait until file1 is done with the removal/backup/deletion before it starts with file2? Thanks again for the help.
      #!/usr/bin/perl
      use warnings;
      use strict;

      use File::Find;

      @ARGV == 1 or die "usage: $0 directory_name\n";

      my @files;
      find sub {
          if ( -f and /^33dc01\..*outer\.log$/ ) {
              push @files, $File::Find::name;
              print "$File::Find::name\n";
          }
      }, $ARGV[ 0 ];

      ( $^I, @ARGV ) = ( '', @files );

      while ( <> ) {
          next if $. <= 4;
          print;
          close ARGV if eof;
      }

        The way the code works, only one line at a time is ever in memory, and the files are processed one after another, in the order they appear in the @ARGV array.