Hi, I have written a bit of Perl which is performing very slowly, so I'm hoping to get some advice here.
The script takes in any number of files, where every line of every file starts with a 10-hexit hexadecimal count, followed by anything. The count on each line is always greater than the count on the previous line. The task is to merge all the input files into one file, in order of the counts. The input files can be quite large, 3GB or so. After a bit of googling I decided to read each input file into an array, build the result in a new array, and finally write the new array out to a file. Mainly because I have access to machines with lots of RAM, I figured if it's all chucked into memory it'll be faster, and then I just dump the end result into a file.
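To illustrate the format, here are a couple of made-up example lines (the counts and payloads are hypothetical, but each line starts with a 10-hexit count and the counts always increase):

00000a1b2c some payload text
00000a1b40 more payload text
00000a1c05 even more payload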
It hasn't really worked out as I expected. The script gets to the point where the final array is complete and starts writing it out to the file after about an hour. However, just writing the array to the file is taking many hours!
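One thing I wasn't sure about: the write loop at the end of the script (full listing below) opens and closes mergedlogs.txt once per line. I assume a single-open version would look roughly like this (untested sketch), but I don't know whether that's the main cost or only part of it:

open my $out, ">", "mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
print $out "$_\n" for @mergedlogs;
close $out;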
Any suggestions as to how to improve my script? Thanks!
#!/bin/env perl
use strict;
use warnings;
use List::Util qw(min max);
use Math::BigInt;

my @filenames = @ARGV;

# Define empty hash. This will be a hash of all the filenames. Within the
# hash each filename points to an array containing the entire contents of
# the file, and an array of timestamps.
my %all_files = ();

# >32 hex to dec function
sub hex2dec {
    my $hex = shift;
    return Math::BigInt->from_hex("0x$hex");
}

# For each file on the command line, create a new hash entry indexed by the
# filename. Each entry is an array containing the contents of the file.
foreach my $filename (@filenames) {
    open(my $handle, "<", "$filename") or die "Failed to open file $filename: $!\n";
    while (<$handle>) {
        chomp;
        my $fullline = $_;
        if ($fullline =~ m/(\w+).*/) {
            # Store contents of line
            my $timestamp = $1;
            push @{$all_files{$filename}}, $fullline;
            push @{$all_files{"${filename}.timestamp"}}, $timestamp;
        }
        else {
            print "Unexpected line format: $fullline in $filename\n";
            exit;
        }
    }
    close $handle;
    $all_files{"${filename}.neof"} = 1;
}

my $neofs     = 1;
my @minarray  = ();
my $min       = 0;
my $storeline = "";
my @mergedlogs = ();
my $matchmin  = 0;
my $line      = 0;

while ($neofs == 1) {
    print "$line\n";
    $line++;
    $neofs = 0;
    # First find the lowest count
    foreach my $filename (@filenames) {
        print "@{$all_files{\"${filename}.timestamp\"}}[0]\n";
        my $tmpdec = hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
        print "$tmpdec\n";
        push @minarray, hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
    }
    $min = min @minarray;
    @minarray = ();
    # For each file matching the lowest count, shift out the current line
    foreach my $filename (@filenames) {
        print "$filename $min";
        $matchmin = 0;
        if (hex2dec(@{$all_files{"${filename}.timestamp"}}[0]) == $min
            && $all_files{"${filename}.neof"} == 1) {
            $matchmin  = 1;
            $storeline = shift @{$all_files{$filename}};
            shift @{$all_files{"${filename}.timestamp"}};
            # Check if array is empty (i.e. file completed)
            if (!@{$all_files{$filename}}) {
                # If so, set not end of file to 0
                $all_files{"${filename}.neof"} = 0;
                # Force count value to max so that it loses all future min battles
                push @{$all_files{"${filename}.timestamp"}}, "10000000000";
            }
            # Push the line to the merged file.
            push @mergedlogs, "$storeline $filename";
        }
        $neofs = $neofs || $all_files{"${filename}.neof"};
    }
}

unlink "mergedlogs.txt";
foreach (@mergedlogs) {
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    print FH "$_\n";
    close FH;
}
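I also wondered whether a streaming merge would sidestep the memory and write-out issues entirely. Below is a rough, untested sketch of what I mean: it keeps only the current line of each file in memory, writes through a single output handle, and uses plain hex() instead of Math::BigInt, on the assumption that a 10-hexit count (40 bits) fits safely in Perl's native numbers. The linear scan for the minimum is fine for a handful of files; a heap would scale better.

#!/usr/bin/env perl
use strict;
use warnings;

# One handle and one buffered "current line" per input file.
my @files;
foreach my $filename (@ARGV) {
    open my $fh, "<", $filename or die "Failed to open file $filename: $!\n";
    my $first = <$fh>;
    next unless defined $first;    # skip empty files
    chomp $first;
    push @files, {
        name => $filename,
        fh   => $fh,
        line => $first,
        ts   => hex(substr($first, 0, 10)),   # 40 bits, safe without BigInt
    };
}

open my $out, ">", "mergedlogs.txt" or die "can't open mergedlogs.txt: $!";

while (@files) {
    # Find the file whose current line has the smallest count.
    my $min_i = 0;
    for my $i (1 .. $#files) {
        $min_i = $i if $files[$i]{ts} < $files[$min_i]{ts};
    }
    my $f = $files[$min_i];
    print $out "$f->{line} $f->{name}\n";

    # Advance that file; drop it once it is exhausted.
    my $next = readline $f->{fh};
    if (defined $next) {
        chomp $next;
        $f->{line} = $next;
        $f->{ts}   = hex(substr($next, 0, 10));
    }
    else {
        close $f->{fh};
        splice @files, $min_i, 1;
    }
}
close $out;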