Hi, I have written a bit of Perl which is performing very slowly, so I'm hoping to get some advice here.
The script takes in any number of files, where every line of every file starts with a 10-hexit hexadecimal count, followed by anything. The count on each line is always greater than the count on the previous line. The task is to merge all the input files into one file, in order of the counts. The input files can be quite large, 3GB or so. After a bit of googling I decided to read each input file into an array, build the result in a new array, and finally write the new array out to a file. Mainly because I have access to machines with lots of RAM, I figured if it's all chucked into memory it'll be faster, and then I just dump the end result into a file.
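To illustrate the format, here are a couple of made-up example lines (the counts and payloads are hypothetical, but each line starts with a 10-hexit count and the counts always increase):

00000a1b2c some payload text
00000a1b40 more payload text
00000a1c05 even more payload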
It hasn't really worked out as I expected. The script gets to the point where the final array is complete and starts writing it out to the file after about an hour. However, just writing the array to the file is taking many hours!
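One thing I wasn't sure about: the write loop at the end of the script (full listing below) opens and closes mergedlogs.txt once per line. I assume a single-open version would look roughly like this (untested sketch), but I don't know whether that's the main cost or only part of it:

open my $out, ">", "mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
print $out "$_\n" for @mergedlogs;
close $out;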
Any suggestions as to how to improve my script? Thanks!
#!/bin/env perl
use strict;
use warnings;
use List::Util qw(min max);
use Math::BigInt;

my @filenames = @ARGV;

# Define empty hash. This will be a hash of all the filenames. Within the
# hash each filename points to an array containing the entire contents of
# the file, and an array of timestamps.
my %all_files = ();

# >32 hex to dec function
sub hex2dec {
    my $hex = shift;
    return Math::BigInt->from_hex("0x$hex");
}

# For each file on the command line, create a new hash entry indexed by the
# filename. Each entry is an array containing the contents of the file.
foreach my $filename (@filenames) {
    open(my $handle, "<", "$filename") or die "Failed to open file $filename: $!\n";
    while (<$handle>) {
        chomp;
        my $fullline = $_;
        if ($fullline =~ m/(\w+).*/) {
            # Store contents of line
            my $timestamp = $1;
            push @{$all_files{$filename}}, $fullline;
            push @{$all_files{"${filename}.timestamp"}}, $timestamp;
        }
        else {
            print "Unexpected line format: $fullline in $filename\n";
            exit;
        }
    }
    close $handle;
    $all_files{"${filename}.neof"} = 1;
}

my $neofs     = 1;
my @minarray  = ();
my $min       = 0;
my $storeline = "";
my @mergedlogs = ();
my $matchmin  = 0;
my $line      = 0;

while ($neofs == 1) {
    print "$line\n";
    $line++;
    $neofs = 0;
    # First find the lowest count
    foreach my $filename (@filenames) {
        print "@{$all_files{\"${filename}.timestamp\"}}[0]\n";
        my $tmpdec = hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
        print "$tmpdec\n";
        push @minarray, hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
    }
    $min = min @minarray;
    @minarray = ();
    # For each file matching the lowest count, shift out the current line
    foreach my $filename (@filenames) {
        print "$filename $min";
        $matchmin = 0;
        if (hex2dec(@{$all_files{"${filename}.timestamp"}}[0]) == $min
            && $all_files{"${filename}.neof"} == 1) {
            $matchmin  = 1;
            $storeline = shift @{$all_files{$filename}};
            shift @{$all_files{"${filename}.timestamp"}};
            # Check if array is empty (i.e. file completed)
            if (!@{$all_files{$filename}}) {
                # If so, set not end of file to 0
                $all_files{"${filename}.neof"} = 0;
                # Force count value to max so that it loses all future min battles
                push @{$all_files{"${filename}.timestamp"}}, "10000000000";
            }
            # Push the line to the merged file.
            push @mergedlogs, "$storeline $filename";
        }
        $neofs = $neofs || $all_files{"${filename}.neof"};
    }
}

unlink "mergedlogs.txt";
foreach (@mergedlogs) {
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    print FH "$_\n";
    close FH;
}
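I also wondered whether a streaming merge would sidestep the memory and write-out issues entirely. Below is a rough, untested sketch of what I mean: it keeps only the current line of each file in memory, writes through a single output handle, and uses plain hex() instead of Math::BigInt, on the assumption that a 10-hexit count (40 bits) fits safely in Perl's native numbers. The linear scan for the minimum is fine for a handful of files; a heap would scale better.

#!/usr/bin/env perl
use strict;
use warnings;

# One handle and one buffered "current line" per input file.
my @files;
foreach my $filename (@ARGV) {
    open my $fh, "<", $filename or die "Failed to open file $filename: $!\n";
    my $first = <$fh>;
    next unless defined $first;    # skip empty files
    chomp $first;
    push @files, {
        name => $filename,
        fh   => $fh,
        line => $first,
        ts   => hex(substr($first, 0, 10)),   # 40 bits, safe without BigInt
    };
}

open my $out, ">", "mergedlogs.txt" or die "can't open mergedlogs.txt: $!";

while (@files) {
    # Find the file whose current line has the smallest count.
    my $min_i = 0;
    for my $i (1 .. $#files) {
        $min_i = $i if $files[$i]{ts} < $files[$min_i]{ts};
    }
    my $f = $files[$min_i];
    print $out "$f->{line} $f->{name}\n";

    # Advance that file; drop it once it is exhausted.
    my $next = readline $f->{fh};
    if (defined $next) {
        chomp $next;
        $f->{line} = $next;
        $f->{ts}   = hex(substr($next, 0, 10));
    }
    else {
        close $f->{fh};
        splice @files, $min_i, 1;
    }
}
close $out;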