hperange has asked for the wisdom of the Perl Monks concerning the following question:
I have a program which runs slightly slower than I would like. I profiled it, and it spends almost 100% of the time in one subroutine, which is not surprising, as it is the only subroutine that reads text input. Below is the subroutine, and a demonstration of the size of the input.
use List::Util qw(sum min);    # provides sum() and min() used below

# returns the $limit largest files from the flist file,
# full path to file in $name
sub process_flist {
    my ($name, $limit) = @_;
    my ($nlines, $total, @lines, @size);

    open(my $fh, '<', $name)
        or die("Error opening file `$name': $!\n");

    while (<$fh>) {
        my @f = split / /;

        # skip files that have a space or other whitespace
        # characters in their name
        next if @f > 10;

        # store file size in bytes and the full path to the file
        push @lines, $f[4] . '/' . $f[1];
    }

    $nlines = scalar @lines;

    {
        # disable warnings because the array to be sorted has
        # the following format "12345/path/to/file"
        # Perl would complain this is not a number
        # but the <=> comparison operator will handle such
        # input properly
        # this is needed so the files can be sorted
        # with a single pass through the flist file
        no warnings 'numeric';

        $total = sum(@lines);
        $limit = min($limit, $nlines);
        @lines = (sort { $b <=> $a } @lines)[0 .. ($limit - 1)];
    }

    # returns the number of files, their cumulative size,
    # and the $limit largest files
    return ($nlines, $total, @lines);
}
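As a side note (my illustration, not part of the original post), the numeric sort trick the comments describe can be seen in isolation: with numeric warnings disabled, <=> only looks at the leading digits of each "size/path" string, so a plain numeric sort orders the entries by file size. The paths below are made up.

use strict;
use warnings;
no warnings 'numeric';    # <=> ignores everything after the leading digits

# entries in the same "size/path" format the subroutine builds
my @lines = (
    '290119680/opt/src/t.tar',
    '512/etc/hosts',
    '1048576/var/log/messages',
);

# numeric comparison uses only the size prefix, so this sorts by size, descending
my @by_size = sort { $b <=> $a } @lines;
print "$_\n" for @by_size;
# 290119680/opt/src/t.tar
# 1048576/var/log/messages
# 512/etc/hosts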
A demonstration of the size of the input:

find /tgt -type f -name input | xargs wc -l
  197898 .../input
  213267 .../input
  240331 .../input
  194063 .../input
  191862 .../input
  179495 .../input
  218041 .../input
 1434957 total

The format of a single input record is:

51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006

where the 2nd field is the name of a file, and the 5th is the file size in bytes.
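Purely as an illustration (my addition), this is how split maps such a record onto the fields the subroutine uses, $f[1] being the name and $f[4] the size:

my $record = '51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';
my @f = split / /, $record;
print "name = $f[1], size = $f[4]\n";    # name = opt/src/t.tar, size = 290119680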
The program runs for around 40 seconds with this input (7 input files), but with 100 input files of similar size it runs for around 25 minutes. I have two pictures of the profiler output: http://imgur.com/FUOqb,iWF5z#1
Can the runtime be reduced? I am not proficient in interpreting the profiler output, so I can't really tell whether it can be improved, or whether it is simply so I/O-intensive that not much can be done. The input files represent a list of files backed up from a particular client, and they are not ordered in any meaningful way.
EDIT: The code below is apparently faster (at the cost of some readability) than the split/next version:
...
while (<$fh>) {
    # capture the file name ($1) and its size in bytes ($2) in one match
    next unless m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
    push @lines, $2 . '/' . $1;
}
...
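For anyone wanting to reproduce the comparison, a minimal benchmark sketch along these lines can be used (my addition, not from the thread; the record is the sample shown earlier, and the labels split_if and regex are made up):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $record = '51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';

cmpthese(-2, {
    # original approach: split into all fields, then pick name and size
    split_if => sub {
        my @f = split / /, $record;
        return if @f > 10;
        my $line = $f[4] . '/' . $f[1];
    },
    # EDIT approach: capture name and size directly with a regex
    regex => sub {
        return unless $record =~ m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
        my $line = $2 . '/' . $1;
    },
});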
Replies are listed 'Best First'.
Re: Optimizing I/O intensive subroutine
by chromatic (Archbishop) on Oct 26, 2012 at 16:16 UTC
by Anonymous Monk on Oct 26, 2012 at 16:30 UTC
by chromatic (Archbishop) on Oct 26, 2012 at 17:15 UTC

Re: Optimizing I/O intensive subroutine
by BrowserUk (Patriarch) on Oct 26, 2012 at 16:43 UTC
by hperange (Beadle) on Oct 26, 2012 at 22:31 UTC
by BrowserUk (Patriarch) on Oct 27, 2012 at 00:07 UTC

Re: Optimizing I/O intensive subroutine
by tobyink (Canon) on Oct 26, 2012 at 14:49 UTC