in reply to performance issues with grepping large sorted arrays

Is there anyway I can make this run faster?

Don't use a data structure that requires a linear search. Use a hash instead, or, if that would get too large to fit into memory, use an indexed database, e.g. Berkeley DB (you might want to take a look at this recent thread, where a similar problem had been discussed).

If you really need to use an array, search it in a short-circuit way, i.e. don't use grep, but rather a for loop in combination with last, to exit the loop as soon as you've found a match:

my $found = 0; for (@A) { if ($_ eq $key) { $found = 1; last } } if ($found) { ...

Replies are listed 'Best First'.
Re^2: performance issues with grepping large sorted arrays
by Trizor (Pilgrim) on Jul 23, 2007 at 23:34 UTC

    Arrays don't require a linear search, see my post about binary search. It requires a sorted array, but if you've already got that...

    In terms of algorithmic complexity short circuiting has no impact on efficiency because the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n), which is still greater than O(lg n)

      I absolutely agree that a binary search would be faster if you have a sorted array (though sorting the array in the first place would also take quite some time, and I was under the impression that the OP would only be doing that in an attempt to speed up things).

      As to "the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n)": I think in most real life situations, you're very unlikely to encounter the worst case scenario (in the OP's case, it would mean that every $key being looked up would match the last element of the array, or is not found at all). Sure, ultimately it's a statistical issue, but in the typical case there will be some improvement when short-circuiting the search. But that's a moot point anyway, as there are even better alternatives... (like hash tables, with constant-time lookup on average)

        The point is: grep is O(n) the short circuit for loop is O(n/2) and O(n) == O(n/2).
        -- gam3
        A picture is worth a thousand words, but takes 200K.