Hot Pastrami has asked for the wisdom of the Perl Monks concerning the following question:

Ah, hello everyone.

Hypothetical situation: I have a large number of smallish text files (~30-40k each) to search for a string. Because of their small size, it should be OK to create a loop where each file is slurped into a variable, searched, and then the loop moves on to the next file. Easy. Here's the question: let's say there are A LOT of small files, so read-in efficiency becomes important. Is there any significant performance difference between the following slurping methods?
# METHOD A: read file into array:
@fileContents = <TEXTFILE>;
...and...
# METHOD B: read file into string
local $/ = undef;
$fileContents = <TEXTFILE>;
Consider that once the file is read, I've got to search for multiple strings, either by looping through the array (Method A) or by searching one very LONG string (Method B). As you would expect, any and all help will be appreciated.
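
For context, the full loop I have in mind is roughly like this (the file list and search terms here are just placeholders), shown with Method B:
# Rough sketch of the search loop; file list and terms are placeholders.
my @files   = glob 'data/*.txt';
my @strings = ('first term', 'second term');

foreach my $file (@files) {
    open TEXTFILE, $file or die "Can't open $file: $!";
    local $/ = undef;                 # slurp mode for this block
    my $fileContents = <TEXTFILE>;
    close TEXTFILE;

    foreach my $string (@strings) {
        print "$file contains '$string'\n" if $fileContents =~ /\Q$string\E/;
    }
}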

Alan "Hot Pastrami" Bellows
-Sitting calmly with scissors-

Re: File slurping efficiency
by KM (Priest) on Aug 11, 2000 at 22:20 UTC
    I just did a quick test of opening a small file 500 times using Method A and grepping the resulting array, then doing the same with Method B and using a match against the string. Results of my benchmark:

    Method A:
    1 wallclock secs ( 0.16 usr + 0.02 sys = 0.18 CPU)

    Method B:
    0 wallclock secs ( 0.04 usr + 0.03 sys = 0.07 CPU)

    Results may vary; try it yourself to see what you get. Not very surprising, since Method A needs to populate an array, then loop through it and match, while Method B just does the match.

    UPDATE: Out of curiosity I tried Method A without building an array, simply grepping from the open filehandle. Small difference:
    0 wallclock secs ( 0.12 usr + 0.03 sys = 0.15 CPU)
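
    The test was roughly along these lines (the file name, count, and pattern here are arbitrary):

    use Benchmark;

    my $file   = 'small.txt';   # any smallish test file
    my $target = 'foo';         # string to search for

    timethese(500, {
        'A: array then grep'  => sub {
            open FILE, $file or die "Can't open $file: $!";
            my @lines = <FILE>;
            close FILE;
            my @hits = grep /$target/, @lines;
        },
        'A2: grep filehandle' => sub {
            open FILE, $file or die "Can't open $file: $!";
            my @hits = grep /$target/, <FILE>;
            close FILE;
        },
        'B: slurp then match' => sub {
            open FILE, $file or die "Can't open $file: $!";
            local $/ = undef;
            my $text = <FILE>;
            close FILE;
            my $hit  = $text =~ /$target/;
        },
    });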

    Cheers,
    KM

Re: File slurping efficiency
by chromatic (Archbishop) on Aug 11, 2000 at 22:23 UTC
    Arrays use more memory than scalars, and there will be some processing needed to split up a file into records (one record per slot in the array). On the other hand, it does take a bit of time to localize a magic variable and restore it. My suspicion is that Method B is faster, but I don't have any benchmarks.

    In cases like this, I usually go with whichever method makes parsing easier. If it's line-based, I loop over the array (or use while on the filehandle). If I'm dealing with something that can span lines, I'll try the second (or redefine $/).

    If you're only looking for the first instance of the string in the file, and you're only concerned with what else is on the line, I'd go with while. Without seeing your data or knowing more about it, it's hard to say more.
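
    A rough sketch of that approach (the filename and pattern here are just placeholders):

    open TEXTFILE, $file or die "Can't open $file: $!";
    while (<TEXTFILE>) {
        if (/$target/) {
            print "found on line $.: $_";
            last;                     # stop at the first match
        }
    }
    close TEXTFILE;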

Re: File slurping efficiency
by Hot Pastrami (Monk) on Aug 12, 2000 at 05:20 UTC
    Ovid:

    It was the "Running with Scissors" topic that inspired me to add "Sitting calmly with scissors"... I was just being a smartass about my cautious approach to "things unknown" when it comes to Perl (can I say "smartass" here? If not, imagine I typed "wise-crackin' guy" instead of "smartass").

    On the advice of you and gryng, I'll skip the study() for now and just match the string as is... your reasoning sounds pretty logical. Perhaps in the future I'll add it and try the Benchmark module ZZamboni suggested, and see if it does me any good.

    As far as the index file goes, I've used that in other scenarios, but it won't work here because these files aren't static... they're ever-changing. I think the overhead of constantly rebuilding the word indexes would outweigh the speedy search advantage. Thanks for tryin to help me out though, man... I appreciate it.

    And thanks to Ovid, gryng, and ZZamboni, you guys are the tops!!

    Alan "Hot Pastrami" Bellows
    -Sitting calmly with scissors-

    P.S. Hey Ovid, as a curiosity, the company I work for is called Ovid... I'm a Perl programmer there. Crazy stuff, eh?
Re: File slurping efficiency
by Hot Pastrami (Monk) on Aug 12, 2000 at 02:06 UTC
    Thanks for the tips, guys... this Perl efficiency stuff is where it's at. I have a related question now... If I went with Method B, and ran the ultra-long string through a pattern match, would it be to my advantage to study() the string first, or will the string's length make it take too blasted long? I have never actually used the study() function before, so I don't know much about its impact.

    If nobody replies to this, I'll take that to mean "get off your lazy butt, get a benchmark utility, and test it yourself, bozo!" That's perfectly reasonable, and very true, but I'll probably think the "bozo" was out of line.

    Alan "Hot Pastrami" Bellows
    -Sitting calmly with scissors-
      First, I want to say that I laughed my head off when I read your sig line (have you seen Running With Scissors?).

      You need to be very careful with study. To the best of my knowledge, it's always been very buggy. In fact, in later versions of Perl (not sure about 5.6), successful matches against $_ can fail if you're using study, even if the string you're matching against isn't what you studied. Apparently, the only way to get around this is to explicitly undef the studied string as soon as you are done with it (see Mastering Regular Expressions, second edition, page 289).

      If you're willing to risk the problems with study, you should go ahead and benchmark it, but I wouldn't bother with it, personally.
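
      If you do decide to experiment with it anyway, my understanding is that the pattern should be roughly this (the variable names are just placeholders):

      study $fileContents;                      # build study()'s lookup tables for this string
      my $found = $fileContents =~ /$target/;   # do the matching
      undef $fileContents;                      # explicitly undef the studied string when done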

      Cheers,
      Ovid

      Hi Hot Pastrami,

      study() (correct me if I'm wrong here, guys) is only useful if you are going to search for many keywords in the same string. Think of it as building an index of where all the a's and the b's, etc. are located in the string, so that if you need to see if 'airplane' is there, you can look really quickly (OK, so it doesn't do exactly that... but it's the same idea! :) ).

      So if you are only going to look for a few keywords, then don't bother with study, but if you are going to look for a few hundred keywords, then it might help a lot more! (The exact cutoff value depends on the length of the string, the length of the keywords, and the number of keywords -- yes, the length of the keywords affects the speed of a lookup, both ways.)
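
      In code, that usage looks something like this (the keyword list here is made up):

      my @keywords = ('airplane', 'boxcar', 'caboose');   # made-up keywords

      study $fileContents;                    # index the string once
      foreach my $word (@keywords) {          # then run many matches against it
          print "found $word\n" if $fileContents =~ /\Q$word\E/;
      }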

      Cheers,
      Gryn

      You may already know this, but in case you don't: if you want to do some benchmarking, make sure you check out the Benchmark module, which is part of the Perl standard distribution.

      --ZZamboni

Re: File slurping efficiency
by Anonymous Monk on Aug 12, 2000 at 04:15 UTC
    I had a similar problem with a smaller number (~7,000) of relatively static small text files, and adopted the strategy of building a flatfile index, then searching the index. Each file had one line in the index. The first token was the filename, followed by a sorted list of unique words (tokens, in this case) in the file. A snap to search, and quite simple to update whenever a file changed or a new file was added.
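
    In rough outline, the index-building pass looked something like this (the file list and the tokenizing rule are simplified placeholders):

    my @files = glob '*.txt';     # whatever set of files is being indexed

    open INDEX, "> index.txt" or die "Can't write index.txt: $!";
    foreach my $file (@files) {
        open TEXT, $file or die "Can't open $file: $!";
        local $/ = undef;
        my $contents = <TEXT>;
        close TEXT;

        my %seen;
        my @words = sort grep { length && !$seen{$_}++ } split /\W+/, $contents;
        print INDEX join(' ', $file, @words), "\n";   # filename, then sorted unique words
    }
    close INDEX;

    Searching is then just one pass over index.txt, and the first token on any matching line is the filename.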

    dws (who misplaced his password)