machinecraig has asked for the wisdom of the Perl Monks concerning the following question:

I recently needed a small script to report on the number of open files that matched a particular name (in the example, file_a). The platform is AIX, and I decided to make use of the lsof command.

I thought of two easy ways to do this: I could make a single call to system, as in system("lsof -c file_a | wc -l"), or I could capture the output of lsof and then use Perl's grep function to get the same sort of count.

I thought that using Perl's grep function would be quicker, but found that doing all of the work via external commands benchmarked much, much faster.

Can any enlightened monks out there tell me why the processing in this example via the external commands is faster than using the Perl functions? I'm still learning Perl, and I've never used the Benchmark module before - so it's entirely possible I made some kind of boneheaded error... Thanks!

    #!/usr/bin/perl -w
    use Benchmark qw(:all) ;
    use strict;

    my $lsofcmd1 = "/usr/local/bin/lsof -c file_a|wc -l";
    my $lsofcmd2 = "/usr/local/bin/lsof|grep file_a|wc -l";
    my $lsofcmd3 = "/usr/local/bin/lsof";
    my $count = "500";

    my $results = timethese($count,
        {
            'lsof_c' => sub {
                system($lsofcmd1);
            },
            'lsof_grep' => sub {
                system($lsofcmd2);
            },
            'lsof_perl' => sub {
                my @ret = `$lsofcmd3`;
                print scalar(my @openfiles = grep(/file_a/,@ret));
            },
        },
        'none'
    );
    cmpthese( $results ) ;

    ##### Results ####
    #             Rate lsof_perl lsof_grep lsof_c
    # lsof_perl 13.4/s        --      -96%   -96%
    # lsof_grep  321/s     2293%        --    -5%
    # lsof_c     338/s     2422%        5%     --

Update 1: removed redundant use statements... Update 2: replaced erroneous mentions of system calls - thanks to Errto's response.

Replies are listed 'Best First'.
Re: system calls vs perl functions: Which should be faster in this example?
by Celada (Monk) on Dec 05, 2005 at 20:07 UTC

    According to my lsof's manpage, that's not what the -c option does:

    This option selects the listing of files for processes executing the command that begins with the characters of c.

    That filters by the name of the command, not the name of the open file.

    But your lsof|grep example and the Perl one should have the same behavior and should be comparable. I am not sure why they are so different. One important difference is that you are capturing output with backticks in one case and not in the other.

    You should also be aware that lsof's reporting of filenames may not always be reliable. Once you open a file, the kernel forgets what name was used to open it and only remembers which device and inode were opened. If the file has multiple names (hard links), then it is in general impossible to know which one was opened. lsof gets around this by searching for the name in the kernel's directory name lookup cache, where the kernel remembers this information in case it needs it again. But it is a cache, and it might be flushed. Does your application permit finding exactly which files you are interested in and searching for those files in the lsof output by device and inode number?
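
To illustrate the point about hard links, here is a minimal sketch (not from the original post) showing that two names for one file share the same device and inode pair as reported by Perl's stat. The file names and the dev_ino helper are hypothetical, and the sketch assumes a Unix filesystem that supports hard links.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Create a file and a second name (hard link) for it in a scratch directory.
my $dir   = tempdir(CLEANUP => 1);
my $first = File::Spec->catfile($dir, 'file_a');        # hypothetical name
my $alias = File::Spec->catfile($dir, 'file_a_alias');  # hypothetical name

open(my $fh, '>', $first) or die "cannot create $first: $!";
print {$fh} "example\n";
close($fh) or die "cannot close $first: $!";
link($first, $alias) or die "cannot hard-link $first to $alias: $!";

# stat returns (dev, ino, ...); that pair identifies the file, whatever its name.
sub dev_ino {
    my ($path) = @_;
    my ($dev, $ino) = stat($path) or die "cannot stat $path: $!";
    return "$dev:$ino";
}

my $same = dev_ino($first) eq dev_ino($alias);
print "same file: ", ($same ? "yes" : "no"), "\n";
```

A device and inode pair obtained this way could then be compared against the device and inode columns of the lsof output instead of matching on the name.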

      Argh. You're exactly right about the lsof -c option. My actual requirement is to search for files opened by a particular process - I somehow mungled this up while writing up my question. I suppose that's what I get for skipping my coffee this morning. :-) Good catch - and interesting info re: lsof's reporting of filenames. Also - that's a good point about the use of backticks. I'll switch to backticks for my other system calls and see how the numbers change.
Re: system calls vs perl functions: Which should be faster in this example?
by Roy Johnson (Monsignor) on Dec 05, 2005 at 19:56 UTC
    I would expect the I/O of special-purpose utilities to be faster than Perl's processing. However, you're slinging the data around more than you need to. First you read the whole mess into an array, and then you grep it into another array before simply counting the number of elements and throwing it all away.

    Try redefining lsof_perl like so:

    'lsof_perl' => sub {
        print scalar(grep(/file_a/, `$lsofcmd3`));
    },

    You should get somewhat better results.
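
A further variation on the same idea, offered only as a sketch and not as the code suggested above: read the command's output from a pipe one line at a time and count matches, so that no array of lines is built at all. The count_matches helper is hypothetical, and a child perl ($^X) stands in for /usr/local/bin/lsof so the example runs on machines without lsof.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for "/usr/local/bin/lsof": a child perl that prints four lines.
# This is an assumption made only to keep the sketch runnable anywhere.
my @cmd = ($^X, '-e', 'print "$_\n" for qw(file_a other file_a_log misc)');

# count_matches is a hypothetical helper, not part of the original benchmark.
sub count_matches {
    my ($pattern, @command) = @_;
    # List-form pipe open: no shell is involved, and lines arrive one at a time.
    open(my $pipe, '-|', @command) or die "cannot run $command[0]: $!";
    my $count = 0;
    while (my $line = <$pipe>) {
        $count++ if $line =~ $pattern;
    }
    close($pipe) or die "command failed: status $?";
    return $count;
}

my $count = count_matches(qr/file_a/, @cmd);
print "$count\n";   # two of the four stand-in lines contain "file_a"
```

Whether this is measurably faster than the backtick version would need to be benchmarked on the real lsof output.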

    Caution: Contents may have been coded under pressure.
      What you say about specialized utilities makes sense - also, good point about needless moving around of the lsof output. Thanks!
Re: system calls vs perl functions: Which should be faster in this example?
by Errto (Vicar) on Dec 05, 2005 at 20:56 UTC

    <pedantry>

    A system call is a function within the operating system, typically implemented in the kernel, that userspace programs can invoke through a defined interface to perform low-level OS services. In Unix, system calls are documented in section 2 of the manual. Perl provides a number of builtin functions that correspond directly to Unix system calls such as fork, exec, sysopen, sysread, fcntl, etc.

    system is a Perl function that invokes an external program, possibly through a sub-shell, and (generally) waits for its result. Calling system is not the same thing as a system call :)

    </pedantry>
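
The distinction can be sketched in a few lines. This is an illustration under stated assumptions, not anything from the original post: perl itself ($^X) stands in for an arbitrary external program, and demo.txt is a hypothetical file name. system returns the child's wait status, backticks (qx) capture the child's output, and sysopen and syswrite are thin wrappers over the open(2) and write(2) system calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC);
use File::Temp qw(tempdir);
use File::Spec;

# 1. system(): runs an external program and returns its wait status.
#    The child here is perl itself ($^X), a stand-in for any command.
my $status    = system($^X, '-e', 'exit 3');
my $exit_code = $status >> 8;        # the exit code is in the high byte

# 2. Backticks/qx: also run an external program (via a shell here),
#    but capture what it writes to standard output.
my $cmd    = qq{"$^X" -e "print 42"};
my $output = `$cmd`;

# 3. sysopen/syswrite: builtins that map onto the open(2) and write(2)
#    system calls; no external program is started.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'demo.txt');   # hypothetical file name
sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_TRUNC) or die "sysopen: $!";
my $written = syswrite($fh, "hello\n");
defined $written or die "syswrite: $!";
close($fh) or die "close: $!";

print "exit code: $exit_code\n";
print "captured: $output\n";
print "bytes written: $written\n";
```

Both system and backticks pay the cost of creating a new process on every invocation, which is worth keeping in mind when benchmarking them against each other.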

      Ok - now I feel like an idiot. Seriously - thanks for setting me straight. I've been thinking of system() and backticks as "system calls" for a long time now. Argh. :-)

      Funny how you can ask one question - and learn a lot more than just the answer.
        I have been working on something similar, so here are some tips that may improve your performance:

        1. Improve the command execution time by using "lsof -Pn"

        From the lsof manpage:
        -P: This option inhibits the conversion of port numbers to port names for network files. Inhibiting the conversion may make lsof run a little faster.
        -n: This option inhibits the conversion of network numbers to host names for network files. Inhibiting conversion may make lsof run faster. It is also useful when host name lookup is not working properly.

        2. Reduce the overall output size and improve parsing by using "lsof -Pn -F" (the -F option is specifically intended for post-processing by scripts, such as those written in Perl)

        Although the format is more complex to read, the same information is presented in a terse representation. For some Perl examples, have a look in the lsof "scripts" directory.

        Here is the list_fields.pl example, which illustrates the parsing technique:

        $fhdr = 0;                                                      # fd hdr. flag
        $fdst = 0;                                                      # fd state
        $access = $devch = $devn = $fd = $inode = $lock = $name = "";   # | file descr.
        $offset = $proto = $size = $state = $stream = $type = "";       # | variables
        $pidst = 0;                                                     # process state
        $cmd = $login = $pgrp = $pid = $ppid = $uid = "";               # process var.

        # Process the ``lsof -F'' output a line at a time, gathering
        # the variables for a process together before printing them;
        # then gathering the variables for each file descriptor
        # together before printing them.
        while (<>) {
            chop;
            if (/^p(.*)/) {
                # A process set begins with a PID field whose ID character is `p'.
                $tpid = $1;
                if ($pidst) { &list_proc }
                $pidst = 1;
                $pid = $tpid;
                if ($fdst) { &list_fd; $fdst = 0; }
                next;
            }

            # Save process-related values.
            if (/^g(.*)/) { $pgrp = $1; next; }
            if (/^c(.*)/) { $cmd = $1; next; }
            if (/^u(.*)/) { $uid = $1; next; }
            if (/^L(.*)/) { $login = $1; next; }
            if (/^R(.*)/) { $ppid = $1; next; }

            # A file descriptor set begins with a file descriptor field whose ID
            # character is `f'.
            if (/^f(.*)/) {
                $tfd = $1;
                if ($pidst) { &list_proc }
                if ($fdst) { &list_fd }
                $fd = $tfd;
                $fdst = 1;
                next;
            }

            # Save file set information.
            if (/^a(.*)/) { $access = $1; next; }
            if (/^C(.*)/) { next; }
            if (/^d(.*)/) { $devch = $1; next; }
            if (/^D(.*)/) { $devn = $1; next; }
            if (/^F(.*)/) { next; }
            if (/^G(.*)/) { next; }
            if (/^i(.*)/) { $inode = $1; next; }
            if (/^k(.*)/) { next; }
            if (/^l(.*)/) { $lock = $1; next; }
            if (/^N(.*)/) { next; }
            if (/^o(.*)/) { $offset = $1; next; }
            if (/^P(.*)/) { $proto = $1; next; }
            if (/^s(.*)/) { $size = $1; next; }
            if (/^S(.*)/) { $stream = $1; next; }
            if (/^t(.*)/) { $type = $1; next; }
            if (/^T(.*)/) {
                if ($state eq "") { $state = "(" . $1; }
                else { $state = $state . " " . $1; }
                next;
            }
            if (/^n(.*)/) { $name = $1; next; }
            print "ERROR: unrecognized: \"$_\"\n";
        }

        # Flush any stored file or process output.
        if ($fdst) { &list_fd }
        if ($pidst) { &list_proc }
        exit(0);

        ## list_fd -- list file descriptor information
        # Values are stored inelegantly in global variables.
        sub list_fd {
            if ( ! $fhdr) {
                # Print header once.
                print " FD TYPE DEVICE SIZE/OFF INODE NAME\n";
                $fhdr = 1;
            }
            printf " %4s%1.1s%1.1s %4.4s", $fd, $access, $lock, $type;
            $tmp = $devn; if ($devch ne "") { $tmp = $devch }
            printf " %10.10s", $tmp;
            $tmp = $size; if ($offset ne "") { $tmp = $offset }
            printf " %10.10s", $tmp;
            $tmp = $inode; if ($proto ne "") { $tmp = $proto }
            printf " %10.10s", $tmp;
            $tmp = $stream; if ($name ne "") { $tmp = $name }
            print " ", $tmp;
            if ($state ne "") { printf " %s)\n", $state; } else { print "\n"; }

            # Clear variables.
            $access = $devch = $devn = $fd = $inode = $lock = $name = "";
            $offset = $proto = $size = $state = $stream = $type = "";
        }

        # list_proc -- list process information
        # Values are stored inelegantly in global variables.
        sub list_proc {
            print "COMMAND PID PGRP PPID USER\n";
            $tmp = $uid; if ($login ne "") { $tmp = $login }
            printf "%-9.9s %6d %6d %6d %s\n", $cmd, $pid, $pgrp, $ppid, $tmp;

            # Clear variables.
            $cmd = $login = $pgrp = $pid = $uid = "";
            $fhdr = $pidst = 0;
        }

        (The code was written by the lsof author, Victor A. Abell).
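
For the narrower task in this thread, counting the files opened by a particular command, a much smaller parser may be enough. The following is a sketch under assumptions rather than a tested lsof client: the sample lines are made-up data in the one-field-per-line shape handled by the script above (p = PID, c = command, f = file descriptor, n = name), and count_open_files is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample in "lsof -F" style: one field per line, where the
# first character identifies the field (p, c, f and n are used here).
my @lines = (
    "p1234", "cfile_a", "f3", "n/tmp/data.log", "f4", "n/var/run/file_a.pid",
    "p5678", "csshd",   "f3", "n/dev/null",
);

# count_open_files is a hypothetical helper: it tallies name (n) fields
# per command (c) and returns a hash of command => number of named files.
sub count_open_files {
    my (@fields) = @_;
    my (%count, $cmd);
    for my $field (@fields) {
        my ($id, $value) = $field =~ /^(.)(.*)$/ or next;
        if    ($id eq 'p') { undef $cmd }          # a new process set begins
        elsif ($id eq 'c') { $cmd = $value }       # remember its command name
        elsif ($id eq 'n' && defined $cmd) { $count{$cmd}++ }
    }
    return %count;
}

my %count = count_open_files(@lines);
printf "%-10s %d\n", $_, $count{$_} for sort keys %count;
```

Against real data the same helper could be fed lines read from an lsof pipe (with the trailing newline removed) instead of the sample array.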

        0xbeef