JPaul has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I am currently doing some contract programming for a printing company which receives large (anywhere from 15 Meg to 150 Meg) text files containing thousands of text statements. My scripts perform various operations on these statement files, the simplest pulling names/addresses (at specific line/column positions per page) and outputting them to a different file.
The scripts are very straightforward (read until a page marker is found, take the name/addr from line #X, dump to the other file, carry on)... I installed a simple percentage counter to show the current progress (bytes read / total bytes in file).

My boggle is this:
As the programme progresses through the file it slows down, until by the time it reaches the very end it appears, compared with when it started the file, to really be crawling. The script does complete its task correctly, nothing is freezing, but I'm still interested in why it slows down.
I use read() to pull the file byte by byte (looking for the top-of-page hex marker, since pages are all different lengths).
Nothing is stored in memory but the temporary name/addr that it pulls from the master file, and then outputs.

Any ideas?

-- Alexander Widdlemouse undid his bellybutton and his bum dropped off --


Re: Perl scripts slowing down towards end of file processing
by tachyon (Chancellor) on Jul 14, 2001 at 02:03 UTC

    Maverick's suggestion to set the input record separator to 0x0c will allow you to read in one record at a time and avoid all that mucking around with single characters. It will no doubt speed your code up considerably as well as simplifying it quite a lot. Let Perl work for you.

    Here is a small optimisation of your last two subs that will save a bit of time too. By using the 'x' operator rather than a while loop we add our whitespace pad in one operation. By using a regex to strip leading and trailing whitespace we save a lot of mucking about and let the regex engine do the work. This too will be much faster.

    # Pad out with trailing whitespace as necessary
    sub pad {
        my ($data, $len) = @_;
        my $pad = $len - (length $data);
        $data .= " " x $pad unless $pad < 0;
        return $data;
    }

    # Get rid of leading/trailing whitespace
    sub normalise {
        my $value = shift;
        $value =~ s/^\s+|\s+$//gm;
        return $value;
    }
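
    If you want to measure the difference yourself, here is a minimal sketch using the core Benchmark module (the subroutine names and the 342-column width are just for illustration, not from the original code):

        use strict;
        use Benchmark qw(cmpthese);

        # Loop-based padding, one space per iteration (as in the original script)
        sub pad_loop {
            my ($data, $len) = @_;
            $data .= " " while length($data) < $len;
            return $data;
        }

        # Padding with a single 'x' repetition
        sub pad_x {
            my ($data, $len) = @_;
            my $pad = $len - length $data;
            $data .= " " x $pad unless $pad < 0;
            return $data;
        }

        # Run each version for about two CPU seconds and compare the rates
        cmpthese(-2, {
            loop => sub { pad_loop("J. Random Hacker", 342) },
            x_op => sub { pad_x("J. Random Hacker", 342) },
        });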

    Hope this helps

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Perl scripts slowing down towards end of file processing
by HyperZonk (Friar) on Jul 13, 2001 at 23:58 UTC
    Well, if you are running on a Windoze box, you could be slamming the memory ... we'd have to see the code to say for sure, though. If that is the case, you may be forcing 'doze to use virtual memory (you know, hard disk) for data structure storage. On a fragmented drive, that incurs an even greater performance penalty of course.

    Can you update us with details about the platform, data structures, etc.?

    Update from HZ: It has been noted to me (thank you tye) that other systems can also penalize you for slamming memory. In any case, we need more details to help you. Please, sir, can we see your code?
      The script is running on a Linux machine with Perl 5.6.0.
      There are no complex 'data structures' to speak of, except a bunch of vars containing integers for counting, and a few holding text pulled from the statement file using substr().
      None of the text removed is over 30 bytes in length.
      Each page is separated into an array (@page = split(/\n/, $page_contents)), the @page array contains 65 elements of less than 85 bytes (changes per page), and is overwritten each time we split() a new page.

      Did I miss anything?
      JP
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        That behavior does sound like what happens when you start using too much memory (a non-linear reduction in performance). So you should check the memory consumption of the process at different points in the run (for example, via the "ps" command).
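
        If it is easier than watching ps from a second terminal, here is a minimal sketch of an in-script check (Linux-specific, assuming /proc is mounted; report_mem is a hypothetical helper, not part of the original script) that could be called every few hundred pages:

            # Print this process's VmSize/VmRSS lines from /proc/$$/status
            sub report_mem {
                open(my $fh, '<', "/proc/$$/status") or return;
                while (my $line = <$fh>) {
                    print STDERR $line if $line =~ /^Vm(Size|RSS):/;
                }
                close($fh);
            }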

        Some file systems can also get much slower when they get nearly full so that is another possibility to check.

        Another possibility is that your total (virtual) memory consumption hasn't grown too high, but that you are jumping around too much in memory, such that your "working set" is too large to fit in the available RAM.

        But, yes, we are all just guessing.

        Nothing in your description rings alarm bells for me about where the memory usage would be growing.

                - tye (but my friends call me "Tye")
Re: Perl scripts slowing down towards end of file processing
by JPaul (Hermit) on Jul 14, 2001 at 00:31 UTC
    #!/usr/bin/perl -w
    use strict;

    my $records_out = 0;

    # Did they not tell us what to use?
    if (@ARGV != 2) {
        die("$0 - No file to process or output.\n");
    }

    # Get total size, in bytes, of file for percentage counter
    my @stat = stat($ARGV[0]);
    my $total_bytes = $stat[7];

    $| = 1;    # Autoflush output

    open(IN, $ARGV[0]) || die("$0 - Cannot open $ARGV[0].\n");
    open(OUT, ">$ARGV[1]");

    print "Dumping Account holder name and Account number from $ARGV[0]...\n";
    print "Outputting to $ARGV[1]\n";

    my $read_bytes = 0;
    my $page_contents = '';
    my $read_pages = 1;

    for (;;) {
        my $EOF = read(IN, my $char, 1);
        last if ($EOF == 0);
        if (substr($char, 0, 1) =~ /\x0c/) {
            my @temp = split(/\n/, $page_contents);    # Split lines up
            my $line_count = @temp;
            next if $line_count < 2;

            # Get name, address, city, state & zipcode from page
            my $name  = substr($temp[58], 5, 30);
            my $addr  = substr($temp[59], 5, 30);
            my $addr2 = substr($temp[60], 5, 30);
            my ($city, $state) = split(/\,/, substr($temp[61], 5, 20));
            my $zip   = substr($temp[61], 26, 5);

            $addr  = &normalise($addr);
            $addr2 = &normalise($addr2);
            $city  = &normalise($city);
            $state = &normalise($state);
            $zip   = &normalise($zip);

            &format($name, $addr, $addr2, $city, $state, $zip);

            $read_pages++;
            $page_contents = '';
        }
        $page_contents .= $char;    # Push char onto page

        # Do percentage display
        $read_bytes++;
        my $percent = $read_bytes / $total_bytes * 100;    # Work out percentage
        $percent = sprintf("%.2f", $percent);              # Set to 2 dp
        print "$percent%\r";
    }
    $read_pages++;

    close(IN);
    close(OUT);

    print "\n-> Read $read_pages pages, and dumped $records_out records out.\n";
    exit(0);

    # Format data into ANCOA import standard
    sub format {
        my ($name, $addr, $addr2, $city, $state, $zip) = @_;
        my $record = "1";    # Record Type
        $record .= &pad($name,  35);
        $record .= &pad($addr,  30);
        $record .= &pad($addr2, 30);
        $record .= &pad($city,  30);
        $record .= &pad($state, 2);
        $record .= &pad($zip,   5);
        $record .= &pad(" ", 342);    # Empty Optional Customer Info
        my $len = length($record);
        print OUT "$record";
        $records_out++;
    }

    # Pad out with trailing whitespace as necessary
    sub pad {
        my ($data, $len) = @_;
        while (length($data) < $len) {
            $data .= " ";
        }
        return $data;
    }

    # Get rid of leading/trailing whitespace
    sub normalise {
        my ($value) = @_;
        if (defined $value) {
            for (my $i = 0; $i <= length($value); $i++) {
                if (substr($value, $i, 1) ne " ") {
                    $value = substr($value, $i, length($value));
                    last;
                }
            }
            for (my $i = length($value); $i >= 0; $i--) {
                next if substr($value, $i, 1) eq "";
                if (substr($value, $i, 1) ne " ") {
                    $value = substr($value, 0, $i + 1);
                    last;
                }
            }
        }
        return $value;
    }

    JP

      You might want to consider changing your input loop to look something like this.
      # tell perl to make the input record separator the character with hex value '0c'
      $/ = pack('c', 0x0c);
      while (<IN>) {
          my @temp = split(/\n/, $_);
          # rest of processing
      }
      This would save you the byte-by-byte read and the string concatenation for every record. It should be quite a bit faster.
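
      For instance, the byte-by-byte loop in the script above might collapse to something like this (a sketch only, reusing the original variable names; each chunk read this way carries the 0x0c marker at its end, which chomp removes):

          $/ = "\x0c";    # each <IN> read now returns one whole page
          while (my $page = <IN>) {
              chomp $page;                     # strip the trailing 0x0c
              my @temp = split(/\n/, $page);
              next if @temp < 2;
              # ... pull the name/address fields out of @temp as before ...
              $read_pages++;
          }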

      /\/\averick

        Well hell,
        What took 11 minutes to process (byte-by-byte) now takes 16 seconds. I crap you not.

        I'd better remember that, thanks Mav.

        JP
        -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

Re: Perl scripts slowing down towards end of file processing
by cyocum (Curate) on Jul 14, 2001 at 00:06 UTC
    Are you using my variables that get garbage collected regularly? If you are storing data in something like a local variable, it stays alive while you are in a different scope and comes back into play when you return to that scope. The upshot is that you could be keeping things around that you don't mean to keep. Without seeing the code it is all idle speculation.
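
    To illustrate the scoping point, a minimal sketch (nothing here is from the original script; the sizes just mimic one page of data):

        use strict;

        sub process_page {
            my @page = ('x' x 84) x 65;    # a lexical array, about one page's worth
            # ... work with @page ...
        }    # @page's reference count drops to zero here; perl reclaims the memory

        # Calling this repeatedly should not grow the process: each call
        # reuses the space freed by the previous one.
        process_page() for 1 .. 1000;
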
      Would it matter?
      I'm using strict; there are a few small routines that would have the current vars stored, but anything stored is small, and the subs are very small and pointless too (padding fields out with spaces, removing leading/trailing spaces).

      Even so, why would having things stored out of scope make much of a difference?

      JP
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --