perlcat has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks,

I am writing to you to seek some advice on how to optimize a script I wrote. I will first describe the situation, and then present the possible solutions.

The input file is about 50 MB in size; each line contains a string built like this:

abcd\tabcd

where \t is a tab separating the two blocks.

Right now, I read the whole file line by line and search $_ for a given string. If it is found, I push the second block onto an array. To do so, I match every line against the regular expression ^(.*)\t(.*)$.

I am experiencing serious performance issues: it takes between 5 and 10 seconds to build the final array.

I have read a bit and found out that there are plenty of ways to program this script differently. Which would be the best solution?

a) read the whole file into an array and then parse the array?

b) read the file line by line, as I do now?

c) a friend also told me the regex was inefficient. He said I would gain some speed by replacing it with the index and substr functions. True?

d) create an index of the file? I'm not really sure how to tackle this.

So dear Perlmonks, thanks in advance for your advice.

Larry

Re: performance issues
by zentara (Cardinal) on Jan 27, 2009 at 18:27 UTC
    Yeah, your regex is pretty bad, and in this simple case you don't even need a regex.

    my @array;
    my $match = 'some_criteria';
    while (<FH>) {
        chomp;                           # thanks gwadej for reminding me
        my ($x, $y) = split /\t/, $_;
        if ( $x eq $match ) {
            push @array, $y;
        }
    }
    print "@array\n";

    I'm not really a human, but I play one on earth.
    Remember How Lucky You Are

      You might want to add a chomp before the split.

      G. Wade
      Thanks, that's indeed a neat way to do it!
Re: performance issues
by kyle (Abbot) on Jan 27, 2009 at 18:29 UTC

    If you want to know why it's so slow, profile it. Until you do that, you're not only wasting your time trying to speed up things that may not be slow, you're also wasting your time guessing. There's just no reason to sit and scratch your head when you can sit and watch the computer tell you the answer.
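
    For example, here is a minimal sketch of both steps (the script name yourscript.pl, the sample line, and the labels below are made up for illustration): profile a full run with Devel::NYTProf to see where the time actually goes, and use the core Benchmark module to compare the regex, split, and index/substr approaches from the original question directly.

    # Profile a whole run (assuming Devel::NYTProf is installed):
    #   perl -d:NYTProf yourscript.pl
    #   nytprofhtml                      # writes an HTML report under ./nytprof/

    # Compare the three candidate line-parsing approaches on one sample line:
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = "abcd\tefgh";             # made-up sample line

    cmpthese( -2, {                      # run each sub for at least 2 CPU seconds
        regex => sub { my ($x, $y) = $line =~ /^(.*)\t(.*)$/ },
        split => sub { my ($x, $y) = split /\t/, $line, 2 },
        index => sub {
            my $i = index $line, "\t";
            my ($x, $y) = ( substr($line, 0, $i), substr($line, $i + 1) );
        },
    } );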

      Thanks, I will do that.
Re: performance issues
by gone2015 (Deacon) on Jan 27, 2009 at 18:19 UTC

    I was almost with you up to the regex /^(.*)\t(.*)$/. Can you post the working code, from which one can unambiguously reverse-engineer the requirement?

    When you say you are searching for a "given string", what do you mean? Are you looking for lines in the file where the "block" before the tab exactly matches the given string? Or for the given string as a substring of one block, or of either or both blocks? Or something else?

      I will post some code ASAP. I'm indeed looking for a substring in the left block. I'm using m=~ to achieve this.

        OK. From what I've understood you to want... see if something along these lines works any better:

        use strict ;
        use warnings ;

        my $data ;
        { local $/ ; $data = <DATA> ; } ;

        my $s = 'bc' ;   # Substring to search for in left hand side

        my @r = ($data =~ m/$s[^\s]*[ \t]+(.*)/g) ;
        print "@r\n" ;

        __DATA__
        abcd    s1
        efgh    s2
        ijklm   s3
        nopq    s4
        bc      s5
        aqbc    s6
        rst     s7
        uvwx    s8
        yz      s9
        Note that the [ \t]+ can be changed to simply \t for use with your file (I've used [ \t]+ here only because it works even if the whitespace between the "blocks" has been converted to any mixture of tabs and spaces). Note also that the (.*) will stop at the end of the line.

        Note also that this assumes the right-hand side will not contain any tabs (or spaces).

        The regex can start matching the search substring in the right-hand side, but since that side ends in a newline, the [^\s]*[ \t]+ part will fail the match.

        The following:

        my $data = "\n" ;
        { local $/ ; $data .= <DATA> ; } ;

        my $s = 'bc' ;   # Substring to search for in left hand side

        my @r = ($data =~ m/\n.*?$s[^\s]*[ \t]+(.*)/g) ;
        avoids matching in the right-hand side, and the right-hand side can contain tabs (and spaces). I don't have enough data to know whether there's any performance advantage or disadvantage here.

        I'm using m=~ to achieve this.
        I guess that this is a typo—if you've got working code at all, then you must be using something like (if we must call it that) =~m//, not m=~.
Re: performance issues
by matija (Priest) on Jan 27, 2009 at 18:35 UTC

    Think some more about this file.

    Is the file only rarely changed? Or is it changed much more rarely than it is searched? Or, at least, are the changes, when they come, relatively small?

    If so, you could benefit enormously by importing the file into a database once, and then letting the database do the heavy lifting of indexing and searching.
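
    As a minimal sketch of that idea, assuming DBI and DBD::SQLite (the file, table, and column names are invented, and note that a plain index only speeds up exact or prefix lookups, not '%substring%' searches):

    use strict;
    use warnings;
    use DBI;

    # One-time import, repeated whenever the monthly file is regenerated
    my $dbh = DBI->connect( "dbi:SQLite:dbname=blocks.db", "", "",
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do( "CREATE TABLE IF NOT EXISTS blocks ( left_block TEXT, right_block TEXT )" );
    $dbh->do( "CREATE INDEX IF NOT EXISTS idx_left ON blocks ( left_block )" );

    my $ins = $dbh->prepare( "INSERT INTO blocks ( left_block, right_block ) VALUES ( ?, ? )" );
    open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $left, $right ) = split /\t/, $line, 2;
        $ins->execute( $left, $right );
    }
    close $fh;
    $dbh->commit;

    # Searching: an exact or prefix match can use the index; '%...%' still scans
    my $found = $dbh->selectcol_arrayref(
        "SELECT right_block FROM blocks WHERE left_block LIKE ?",
        undef, '%some_criteria%' );
    print "@$found\n";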

      Well, the data is rarely changed: it is regenerated once a month. I will think about the database solution.
Re: performance issues
by moritz (Cardinal) on Jan 27, 2009 at 18:52 UTC
    You haven't told us the most important thing: what do you need the data for? Depending on the usage there are many ways to speed up the program: if you do direct lookups, a hash might be a good idea. There are trees for other forms of lookup, and a database with an index can also help, if you don't change that file too often.
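
    As a minimal sketch of the hash idea, assuming the lookups are exact matches on the left block (the file name and search key are invented):

    use strict;
    use warnings;

    # Build the hash once: left block => array of right blocks
    my %right_for;
    open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $left, $right ) = split /\t/, $line, 2;
        push @{ $right_for{$left} }, $right;
    }
    close $fh;

    # Every exact lookup is then a single hash access
    my @matches = @{ $right_for{'some_criteria'} || [] };
    print "@matches\n";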

      I use the data to generate n-grams. Basically, whenever I find a given string in the first part of a line (before the \t), I push the second part onto an array. After going through all the lines, I generate n-grams from the array.

        Is this something where using DBM::Deep to store the resulting data structure for reader use would make sense?
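
        Something along these lines, perhaps, as a sketch only (file names invented; like an in-memory hash, this only helps if the lookups are exact keys rather than substrings):

        use strict;
        use warnings;
        use DBM::Deep;

        # Build once, whenever the monthly file is regenerated:
        # a disk-backed hash of left block => [ right blocks ]
        my $db = DBM::Deep->new( 'blocks.db' );

        open my $fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            my ( $left, $right ) = split /\t/, $line, 2;
            $db->{$left} = [] unless exists $db->{$left};
            push @{ $db->{$left} }, $right;
        }
        close $fh;

        # A reader process just opens the same file and looks keys up directly
        my $rights = $db->{'some_criteria'} || [];
        print "@$rights\n";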

        --MidLifeXis

Re: performance issues
by jh- (Scribe) on Jan 27, 2009 at 18:21 UTC

    I don't believe reading the file into an array would do much besides increase the memory usage of your script. Most of the time is probably spent reading from disk anyway.

    Don't quote me on this, but you could probably gain some speed by replacing the regexp with split(/\t/, $_). Other than that, I'm out of ideas.

Re: performance issues
by MidLifeXis (Monsignor) on Jan 27, 2009 at 18:22 UTC

    It may not just be the regular expression that is causing your perceived performance issues.

    Can you show some code to illustrate what you have tried?

    --MidLifeXis