hperange has asked for the wisdom of the Perl Monks concerning the following question:
I have a program which runs slightly slower than I would like. I profiled it, and it spends almost 100% of the time in one subroutine, which is not surprising, as it is the only subroutine that reads text input. Below is the subroutine, and a demonstration of the size of the input.
use List::Util qw(sum min);    # provides sum() and min() used below

# returns the $limit largest files from the flist file,
# full path to file in $name
sub process_flist {
    my ($name, $limit) = @_;
    my ($nlines, $total, @lines, @size);

    open(my $fh, '<', $name)
        or die("Error opening file `$name': $!\n");

    while (<$fh>) {
        my @f = split / /;

        # skip files that have a space or other whitespace
        # characters in their name
        next if @f > 10;

        # store file size in bytes and the full path to the file
        push @lines, $f[4] . '/' . $f[1];
    }

    $nlines = scalar @lines;

    {
        # disable warnings because the array to be sorted has
        # the following format "12345/path/to/file"
        # Perl would complain this is not a number
        # but the <=> comparison operator will handle such
        # input properly
        # this is needed so the files can be sorted
        # with a single pass through the flist file
        no warnings 'numeric';

        $total = sum(@lines);
        $limit = min($limit, $nlines);
        @lines = (sort { $b <=> $a } @lines)[0 .. ($limit - 1)];
    }

    # returns the number of files, their cumulative size,
    # and the $limit largest files
    return ($nlines, $total, @lines);
}
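As a side note (my illustration, not part of the original post), the numeric sort trick the comments describe can be seen in isolation: with numeric warnings disabled, <=> only looks at the leading digits of each "size/path" string, so a plain numeric sort orders the entries by file size. The paths below are made up.

use strict;
use warnings;
no warnings 'numeric';    # <=> ignores everything after the leading digits

# entries in the same "size/path" format the subroutine builds
my @lines = (
    '290119680/opt/src/t.tar',
    '512/etc/hosts',
    '1048576/var/log/messages',
);

# numeric comparison uses only the size prefix, so this sorts by size, descending
my @by_size = sort { $b <=> $a } @lines;
print "$_\n" for @by_size;
# 290119680/opt/src/t.tar
# 1048576/var/log/messages
# 512/etc/hosts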
A demonstration of the size of the input:

find /tgt -type f -name input | xargs wc -l
  197898 .../input
  213267 .../input
  240331 .../input
  194063 .../input
  191862 .../input
  179495 .../input
  218041 .../input
 1434957 total

The format of a single input record is:

51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006

where the 2nd field is the name of a file, and the 5th is the file size in bytes.
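Purely as an illustration (my addition), this is how split maps such a record onto the fields the subroutine uses, $f[1] being the name and $f[4] the size:

my $record = '51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';
my @f = split / /, $record;
print "name = $f[1], size = $f[4]\n";    # name = opt/src/t.tar, size = 290119680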
The program runs for around 40 seconds with this input (7 input files), but with 100 input files of similar size it runs for around 25 minutes. I have two pictures of the profiler output: http://imgur.com/FUOqb,iWF5z#1
Can the runtime be reduced? I am not proficient in interpreting the profiler output, so I can't really tell whether it can be improved, or whether it is simply so I/O-intensive that not much can be done. The input files represent a list of files backed up from a particular client, and they are not ordered in any meaningful way.
EDIT: The code below is apparently faster (at the cost of some readability) than the split/next version:
...
while (<$fh>) {
    # capture the file name ($1) and its size in bytes ($2) in one match
    next unless m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
    push @lines, $2 . '/' . $1;
}
...
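For anyone wanting to reproduce the comparison, a minimal benchmark sketch along these lines can be used (my addition, not from the thread; the record is the sample shown earlier, and the labels split_if and regex are made up):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $record = '51 opt/src/t.tar 100444 1247464676 290119680 283320 NA 1 0xbe2d 0x40000006';

cmpthese(-2, {
    # original approach: split into all fields, then pick name and size
    split_if => sub {
        my @f = split / /, $record;
        return if @f > 10;
        my $line = $f[4] . '/' . $f[1];
    },
    # EDIT approach: capture name and size directly with a regex
    regex => sub {
        return unless $record =~ m/^[0-9]+ ([^ ]+) [0-9]+ [0-9]+ ([0-9]+)/;
        my $line = $2 . '/' . $1;
    },
});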
Replies are listed 'Best First'.
Re: Optimizing I/O intensive subroutine
by chromatic (Archbishop) on Oct 26, 2012 at 16:16 UTC
by Anonymous Monk on Oct 26, 2012 at 16:30 UTC
by chromatic (Archbishop) on Oct 26, 2012 at 17:15 UTC

Re: Optimizing I/O intensive subroutine
by BrowserUk (Patriarch) on Oct 26, 2012 at 16:43 UTC
by hperange (Beadle) on Oct 26, 2012 at 22:31 UTC
by BrowserUk (Patriarch) on Oct 27, 2012 at 00:07 UTC

Re: Optimizing I/O intensive subroutine
by tobyink (Canon) on Oct 26, 2012 at 14:49 UTC