Re: performance issues with grepping large sorted arrays
by GrandFather (Saint) on Jul 23, 2007 at 23:07 UTC
my %lu = map { $_ => 1 } @array;
foreach my $key (keys %H) {
    if ($lu{$key}) {
        # process the hit
    }
}
or turn the loop inside out:
foreach my $elt (@A) {
    if (exists $hash{$elt}) {
        # process the hit
    }
}
In the second case you may need to keep a record of hits so that you don't perform your hit process more than once (assuming that to be what you want).
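For instance (just one way to do it, not code from the node above), a %seen hash can record which elements have already been handled, using the same @A and %hash as in the snippet above:

my %seen;
foreach my $elt (@A) {
    next unless exists $hash{$elt};
    next if $seen{$elt}++;    # already processed this hit
    # process the hit exactly once
}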
Note that sorting the array will make no difference at all. grep always processes all elements. You could make a small efficiency gain by using a for loop instead of the grep and exiting as soon as a matching element is found.
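For example (an illustration, not part of GrandFather's node), List::Util's first does exactly that short-circuit scan, assuming the thread's @A and $key:

use List::Util qw(first);

my $hit = first { $_ eq $key } @A;    # stops scanning at the first match
if (defined $hit) {
    # process the hit
}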
DWIM is Perl's answer to Gödel
Re: performance issues with grepping large sorted arrays
by mr_mischief (Monsignor) on Jul 24, 2007 at 01:16 UTC
You're using a data structure (the hash) that takes constant time to access a single element, yet you're walking through all of its keys and, for each one, searching a second data structure (the array) that takes time linear in the number of elements it holds, just to see whether the two have an element in common. As GrandFather suggests, this is backwards.
You can, as GrandFather noted, turn this inside out:
foreach ( @array ) {
    if ( exists $hash{$_} ) {
        # do something
    } else {
        # do something else
    }
}
You could also, if you don't have to build both structures (which we cannot tell without more context), do something more like this:
my %hash = ( ... );

while ( $_ = get_item() ) {
    if ( exists $hash{ $_ } ) {    ### update: fixed syntax error typo
        # do something
    } else {
        # do something else
    }
}
... which would keep you from ever building the huge array (where get_item() returns one value that would have been assigned to an element of @array in the other version). As I said, though, we can't tell if you need the array without more information.
The former solution assumes you needed to have the array anyway, while the second hopes to avoid it if possible. If there's nothing you can do about having such a large array, you'll at least want to do the lookups using the hash.
In your original solution, the running time is the product of the number of keys in the hash and the number of elements in the array: think O(m * n). You could instead scan linearly through the array and do a constant-time hash lookup for each element, i.e. O(n + c*n), which is trivially O(n); that is what GrandFather's suggestion gets you. The necessary time is cut roughly by a factor of the number of keys in the hash, which is a big amount when you're talking about 1000 keys. 1/1000 of the time is always a good speedup.
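If you want to see that difference on your own machine, here's a rough sketch (not from the original post; the sizes are made up and far smaller than 800k so the benchmark finishes quickly) using the core Benchmark module:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in data, much smaller than the real 800k-element array.
my @array = map { "item$_" } 1 .. 20_000;
my %hash  = map { ( "item" . ( $_ * 40 ) => 1 ) } 1 .. 500;

cmpthese( -2, {
    grep_per_key => sub {                 # original approach: scan the array for every key
        my $hits = 0;
        for my $key ( keys %hash ) {
            $hits++ if grep { $_ eq $key } @array;
        }
    },
    hash_per_elt => sub {                 # inverted approach: one hash lookup per element
        my $hits = 0;
        for ( @array ) {
            $hits++ if exists $hash{$_};
        }
    },
} );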
The solution that would let you not use the array (if possible) would likely be a bit faster still because it wouldn't require the memory for the array. An array with 800k elements is likely to get into a page file somewhere if the elements are of any size at all. So I guess the question here is, where is the array from, and how is it built up?
Re: performance issues with grepping large sorted arrays
by wind (Priest) on Jul 23, 2007 at 23:00 UTC
Re: performance issues with grepping large sorted arrays
by almut (Canon) on Jul 23, 2007 at 23:16 UTC
"Is there anyway I can make this run faster?"
Don't use a data structure that requires a linear search. Use a hash instead, or, if that would get too large to fit into memory, use an indexed database, e.g. Berkeley DB (you might want to take a look at this recent thread, where a similar problem had been discussed).
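For the Berkeley DB route, a minimal sketch (my illustration, not almut's code; the file name lookup.db and the stand-in data are made up) using the core DB_File module might look like this:

use strict;
use warnings;
use DB_File;
use Fcntl;

my @A   = ('apple', 'banana', 'cherry');    # stand-in for the real 800k-element list
my $key = 'banana';

# Tie a hash to an on-disk B-tree; lookups then work like a normal hash,
# but the data lives on disk rather than in RAM.
tie my %lookup, 'DB_File', 'lookup.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie lookup.db: $!";

$lookup{$_} = 1 for @A;                         # build the index once

print "found $key\n" if exists $lookup{$key};   # no linear scan needed

untie %lookup;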
If you really need to use an array, search it in a short-circuit way, i.e. don't use grep, but rather a for loop in combination with last, to exit the loop as soon as you've found a match:
my $found = 0;
for (@A) { if ($_ eq $key) { $found = 1; last } }
if ($found) {
...
Arrays don't require a linear search; see my post about binary search. It requires a sorted array, but if you've already got that...
In terms of algorithmic complexity, short-circuiting has no impact on efficiency: the worst case is still O(n) and the average case is O(n/2), which simplifies to O(n) and is still greater than O(lg n). (For an 800k-element array that's roughly 400,000 comparisons on average, versus about 20 for a binary search.)
I absolutely agree that a binary search would be faster if you have a sorted array (though sorting the array in the first place would also take quite some time, and I was under the impression that the OP would only be doing that in an attempt to speed things up).

As to "the worst case is still O(n) and the average case is O(n/2) which simplifies to O(n)": I think in most real-life situations you're very unlikely to encounter the worst-case scenario (in the OP's case, it would mean that every $key being looked up either matches the last element of the array or is not found at all). Sure, ultimately it's a statistical issue, but in the typical case there will be some improvement when short-circuiting the search. But that's a moot point anyway, as there are even better alternatives... (like hash tables, with constant-time lookup on average)
Re: performance issues with grepping large sorted arrays
by Trizor (Pilgrim) on Jul 23, 2007 at 23:17 UTC
If array @A is sorted using the string cmp operator, you can employ the binary search algorithm to determine whether $item is in the array.
The basic idea is to compare the item with the middle element of the array: if it's greater, binary search the top half of the array; if it's less, binary search the bottom half. You'll eventually narrow it down to a one-element array, or hit the item along the way at a midpoint. The runtime of this algorithm is O(lg n), versus the O(n) of using grep.
A recursive implementation follows:
use POSIX qw(floor);

# Takes an array ref (sorted with cmp) and a target item.
# Returns the index of the item in the array, or
# -1 if it's not there.
sub binsearch {
    my $arref  = shift;
    my $target = shift;

    my $len = scalar @$arref;
    return -1 if $len == 0;                        # nothing left to search

    my $midpoint = floor( $len / 2 );              # get the center, rounding down
    my $cmp = $target cmp $arref->[$midpoint];     # make the comparison once

    return $midpoint unless $cmp;                  # we've found it!

    # It's in the lower half: indices below the midpoint keep their positions.
    return binsearch( [ @$arref[ 0 .. $midpoint - 1 ] ], $target ) if $cmp == -1;

    # It's in the upper half: adjust the index returned for the sub-array
    # back to a position in the original array.
    my $sub = binsearch( [ @$arref[ $midpoint + 1 .. $len - 1 ] ], $target );
    return $sub == -1 ? -1 : $sub + $midpoint + 1;
}
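For what it's worth, an iterative variant (a sketch of mine, not Trizor's code) avoids copying an array slice on every call and returns the index into the original array directly:

# Iterative binary search over a cmp-sorted array.
# Returns the index of $target in @$arref, or -1 if it isn't there.
sub binsearch_iter {
    my ( $arref, $target ) = @_;
    my ( $lo, $hi ) = ( 0, $#$arref );

    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $target cmp $arref->[$mid];
        return $mid if $cmp == 0;          # found it
        if ( $cmp < 0 ) { $hi = $mid - 1 } # look in the lower half
        else            { $lo = $mid + 1 } # look in the upper half
    }
    return -1;                             # not present
}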
This will still take time with 800k elements. If speed is really an issue, consider PDL or doing it in C.
The problem still is that the OP said they sorted the array just to try to get the grep to be faster.
Yes, binary search is faster than a straight linear search.
No, it's not going to be as fast as eliminating 1000 comparisons altogether.
for ( @array ) {
    if ( exists $hash{$_} ) {
        # do something
    } else {
        # do something else
    }
}
Here, you have one hash lookup for each array item.
@array = sort @array;

foreach ( keys %hash ) {
    if ( is_in_array( $_ ) ) {
        # do something
    } else {
        # do something else
    }
}
Here, you're doing a sort (quicksort, if using Perl's sort(), which is O(n log(n)) and worst case O(n**2) ), then using binary search which is O(log2 n) for each key of the hash.
So your solution ends up being O( (n log(n)) + (m * log n) ) with a worst case of O(n**2 + m * log(n)). Just going through the loop and doing a hash lookup is predictably O(n).
In simplest terms, with an 800k element array you can do 800k comparisons the way GrandFather and I have been recommending. In order to do a quicksort then a binary search for each key of the hash, you're possibly talking about 800k * 800k + 1000 * log2(800k). log2(800,000) is roughly 20. So 640,000,020,000 comparisons in the worst case, even using fairly good sort and search algorithms.
The loop is simply inside out.