baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Quick question: is there a faster alternative to the push and shift functions in the following code?

my @a = ();
my @b = ();
while (<IN>) {
    chomp;
    $cnt++;
    /\d+\t(\d+)/;
    if (@a < 7) { push(@a, $1); }
    push(@b, $1) if $1 > 3;
    if (@b > 7) { shift(@a); shift(@b); }
    print $a[0] - $b[-1] . "\n";
}
The reason I'm asking is that if I just iterate through file IN like this:
while (<IN>) { $cnt++ }
it takes 10 minutes, but if I complicate it with the two push and shift calls plus the regex, the runtime increases to 150 minutes.

cheers

baxy

Update:

OK, so here are more details, though I think they are irrelevant:

file size: 70GB
file format: tsv

1	34
2	6
3	78

simple while-loop time:
real    10m51.378s
user    9m22.610s
sys     0m13.120s

while loop with regex + push and shift + one subtraction (I forgot to mention that):
real    157m13.245s
user    152m8.520s
sys     1m26.660s

Re: Faster push and shift
by moritz (Cardinal) on Feb 16, 2012 at 10:05 UTC

    I could imagine that the regex match is much slower than the push and shift. Use Devel::NYTProf to find out.

    Also it is wrong to use $1 without checking if the regex succeeded.

    (And no, I don't think there are much faster options. shift and push are appropriately optimized. But if you tell us what your input data looks like, and what output you want, maybe there is a faster way to achieve the same thing with a different algorithm.)
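
    For example, a guarded version of the match might look like the sketch below (the open call and the warning are additions, not the OP's code); Devel::NYTProf is then typically run as perl -d:NYTProf script.pl followed by nytprofhtml:

    #!/usr/bin/perl
    # Sketch only: use $1 only after a successful match.
    use strict;
    use warnings;

    open my $in, '<', $ARGV[0] or die "open: $!";

    my ( @a, @b );
    while ( my $line = <$in> ) {
        if ( $line =~ /\d+\t(\d+)/ ) {    # capture succeeded, $1 is safe
            my $val = $1;
            push @a, $val if @a < 7;
            push @b, $val if $val > 3;
            if ( @b > 7 ) { shift @a; shift @b; }
        }
        else {
            warn "skipping unparsable line: $line";
        }
    }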

Re: Faster push and shift
by BrowserUk (Patriarch) on Feb 16, 2012 at 12:43 UTC

    Try this. It should complete much more quickly unless all the required records are at the beginning of the file:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use Time::HiRes qw[ time ];
    use File::ReadBackwards;

    tie *FH, 'File::ReadBackwards', $ARGV[0] or die $!;

    my $start = time;
    my( $last, @a, @b );

    while( <FH> ) {
        /\d+\t(\d+)/;
        if( $1 > 3 ) {
            unshift @b, $1;
            unshift @a, $last;
            last if @b == 7;
        }
        $last = $1;
    }

    print time() - $start;
    print join ' ', times;
    pp \@a, \@b;

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I don't think your solution is equivalent to the original program. Your idea is to stop reading the file as soon as @b "has enough elements", i.e. you are interested in the first set of occurrences of "suitable elements" in the file.

      The OP, however, throws out elements from @b (he is basically treating @b as a queue, pushing in at one end and shifting off the other), so he is interested in the last set of occurrences. That's why he always has to process the whole file.
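
      To illustrate that queue behaviour with a tiny self-contained sketch (made-up numbers, not the OP's data): after the whole stream has been consumed, @b holds only the last seven qualifying values, which is why a forward pass cannot stop early.

      use strict;
      use warnings;

      my @b;
      for my $val ( 1 .. 20 ) {        # stand-in for the second column of each record
          push @b, $val if $val > 3;   # same filter as the OP
          shift @b if @b > 7;          # keep only the most recent seven
      }
      print "@b\n";                    # prints: 14 15 16 17 18 19 20 -- the *last* matches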

      -- 
      Ronald Fischer <ynnor@mm.st>
        That's why: use File::ReadBackwards;

      What's the difference here, BrowserUK? (Really, that's quite the sincere question.) Kindly enlighten us all: what is the key change from the OP that will make it faster, and how much faster do you predict it now will be? ... And under what governing assumptions?

        how much faster do you predict it now will be?

        Doing it the OP's way on 1 million records takes 3.3 seconds:

        c:\test>junk91 junk.dat
        3.25699996948242
        ([1, 3, 4, 4, 1, 3, 2], [4, 4, 4, 4, 4, 4, 4])

        Same file, doing it my way, produces the same results in half a millisecond:

        c:\test>junk91-2 junk.dat
        0.000479936599731445
        ([1, 3, 4, 4, 1, 3, 2], [4, 4, 4, 4, 4, 4, 4])

        I make that 6,600 times faster. The OP's mileage may vary.



Re: Faster push and shift
by rovf (Priest) on Feb 16, 2012 at 11:24 UTC
    If reading the file WITHOUT doing any significant processing already takes 10 minutes, this must be a tremendously huge file, and maybe you should rethink your algorithm altogether. However, from your code it is not clear what you are trying to do: you populate @a and @b, but never use them anywhere.

    There is one more thing which puzzles me: usually, input/output time dominates processing time, unless you do a lot of processing per input record. You keep your arrays small (@a and @b never grow beyond seven elements or so), so the operations on the arrays shouldn't impact the run-time significantly. The regexp is also simple enough that I don't think matching it could take 15 times as long as reading a record. I would conclude that the processing time is not lost in your code, but to be sure, you could insert timer calls in your loop, for example as in the sketch below.
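
    One way to do that (a sketch only, using Time::HiRes; the split into a "regex" bucket and an "array ops" bucket, and the variable names, are my own choices) might be:

    use strict;
    use warnings;
    use Time::HiRes qw( time );

    open my $in, '<', $ARGV[0] or die "open: $!";

    my ( $t_regex, $t_arrays ) = ( 0, 0 );
    my ( @a, @b );
    while ( my $line = <$in> ) {
        my $t0 = time;
        my ($val) = $line =~ /\d+\t(\d+)/;
        my $t1 = time;
        $t_regex += $t1 - $t0;

        next unless defined $val;
        push @a, $val if @a < 7;
        push @b, $val if $val > 3;
        if ( @b > 7 ) { shift @a; shift @b; }
        $t_arrays += time() - $t1;
    }
    printf "regex: %.2fs  array ops: %.2fs\n", $t_regex, $t_arrays;
    # The timer calls themselves cost something, so over billions of records
    # you would only run this on a sample of the file.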


      Actually, each of those active lines costs substantially. This is just 1M records, matching the OP's data, processed as simply as possible:

      c:\test>junk91 junk.dat
      Bare loop                                           ## baseline
      0.322973012924194
      0.358 0.078 0 0

      c:\test>junk91 junk.dat                             ## +400%
      Add back: regex;
      1.40799999237061
      1.466 0.046 0 0

      c:\test>junk91 junk.dat                             ## +80%
      Add back: regex; first push;
      1.67599987983704
      1.731 0.046 0 0

      c:\test>junk91 junk.dat                             ## +400%
      Add back: regex; first push; second push;
      2.94299983978271
      2.932 0.124 0 0

      c:\test>junk91 junk.dat                             ## +150%
      Add back: regex; first push; second push; shifts
      3.2759997844696
      3.369 0.015 0 0

      The explanation is that no matter how little time something takes, if you do it a million times, it adds up.

      In the OP's case, where he must be processing somewhere in the region of 2 or 3 billion records, it adds up big.
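
      (junk91 itself isn't posted in the thread; purely as an illustration, a rough reconstruction of that kind of measurement, commenting the loop body back in one line at a time and printing the Time::HiRes elapsed time plus times, could look like this:)

      use strict;
      use warnings;
      use Time::HiRes qw( time );

      open my $in, '<', $ARGV[0] or die "open: $!";

      my $start = time;
      my ( @a, @b );
      while ( <$in> ) {
          # Comment these back in one at a time to see what each adds:
          /\d+\t(\d+)/ or next;
          push @a, $1 if @a < 7;
          push @b, $1 if $1 > 3;
          if ( @b > 7 ) { shift @a; shift @b; }
      }
      print time() - $start, "\n";          # wall-clock seconds
      print join( ' ', times ), "\n";       # user/sys CPU, as in the output above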



        Indeed ...

        And I find it interesting that reading the data does not take that much time compared to the other operations. I wouldn't have expected this, even taking buffering into account.

Re: Faster push and shift
by RichardK (Parson) on Feb 16, 2012 at 13:05 UTC

    You could drop the chomp, as you don't need it; your regex doesn't care about the line ending. As BrowserUk points out, doing anything millions of times adds up, so dropping it might reduce your runtime a little.

    Also, as you don't use the first value, you might be able to write your regex as just /\t(\d+)/, assuming that your file is completely regular and doesn't have any odd lines or comments in it ;)
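
    Putting those two suggestions together, the trimmed loop might look something like this (a sketch only, assuming every line really is two tab-separated integers; the guard before the print is an addition, since the OP prints unconditionally):

    use strict;
    use warnings;

    open my $in, '<', $ARGV[0] or die "open: $!";

    my ( @a, @b );
    while ( <$in> ) {
        /\t(\d+)/ or next;           # no chomp needed; \d+ stops before the newline
        push @a, $1 if @a < 7;
        push @b, $1 if $1 > 3;
        if ( @b > 7 ) { shift @a; shift @b; }
        print $a[0] - $b[-1], "\n" if @a && @b;   # guard added to avoid undef warnings
    }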