Re: searching through data
by almut (Canon) on Apr 16, 2009 at 12:17 UTC
Use a hash, with your numbers as keys. That way, grepping through the entire array would become a simple hash lookup.
#!/usr/bin/perl
use strict;
use warnings;

my @array = map { int rand 1e6 } 1..400000;

# create file with numbers to look up
open my $fh, ">", "in.txt" or die "$!";
for (1..1000000) {
    print $fh int rand 1e6, "\n";
}
close $fh;

# build the lookup table: number => count
my %lookup_table;
$lookup_table{$_}++ for @array;

open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    my ($num) = m/^(\d+)/;
    print "$num, " if $lookup_table{$num};
}
close $in;
__END__
$ time ./757954.pl >out
real 0m4.141s
user 0m4.004s
sys 0m0.132s
(Memory requirement approx. 100 M — or 80 M, if you get rid of the map for the @array initialisation)
Update: with 300_000_000 rows, it takes about 15 min., which includes creating the 2 Gig random data file "in.txt" plus writing a 760 M output file. (Memory requirement is the same.)
Instead of
my @array = map {int rand 1e6} 1..400000;
...
$lookup_table{$_}++ for @array;
one should use something like
my %lookup_table = map {int rand 1e6 => 1} 1..400000;
in order to reduce memory requirements.
There is no need to have the list of numbers both in an array and in the hash.
There is no need to have the list of numbers both in an array and in the hash.
Well, I just left in the @array to make it easier for the OP to see what is what... And if the idea is to reduce memory usage, you should definitely also get rid of the map, which would cut it down to 45 MB, as opposed to 108 MB with the map. (map creates all the elements on the stack before assigning them...)
my %lookup_table;
$lookup_table{int rand 1e6}++ for 1..400000; # 45 M
---
my %lookup_table = map {int rand 1e6 => 1} 1..400000; # 108 M
Also, using ++ instead of assigning 1 has the added benefit of detecting duplicate numbers, should this ever be of interest...
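For example, a quick way to list any duplicates afterwards (a minimal sketch, reusing %lookup_table from above):
my @dups = grep { $lookup_table{$_} > 1 } keys %lookup_table;
print "duplicates: @dups\n" if @dups;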
Re: searching through data
by citromatik (Curate) on Apr 16, 2009 at 12:43 UTC
What about getting the job done in less than half a second? (no DBMs, no hashes):
use strict;
use warnings;

my ($file1, $file2) = @ARGV;

my $ids = 1000000;    # last id
my $bin = 0;
# grow $bin into a string of $ids+1 zero bytes, one byte per possible id
substr($bin, $_, 1, pack("c", 0)) for 0..$ids;

# Create the index: set the byte at position $id to 1
open my $fh1, "<", $file1 or die $!;
while (my $id = <$fh1>) {
    chomp $id;
    substr($bin, $id, 1, pack("c", 1));
}
close $fh1;

# Search the ids from file 2
open my $fh2, "<", $file2 or die $!;
while (my $id2 = <$fh2>) {
    chomp $id2;
    print "$id2\n" if unpack("c", substr($bin, $id2, 1)) == 1;
}
close $fh2;
file1 contains 400000 numbers between 0 and 1000000. file2 contains 10000 numbers in the same range. Processing both takes less than half a second.
time perl sp.pl file1 file2 > /dev/null
real 0m0.448s
user 0m0.448s
sys 0m0.000s
A bit of explanation: the program builds a scalar ($bin) containing one zero byte for each possible id (0..1000000). While processing the index (the first list of ids), it sets the byte at position $id to 1.
Then it reads the second file of ids and checks whether the byte at position $id2 in $bin is 0 (not seen before) or 1 ($id2 already seen in the first file).
Hope this helps
getting the job done in less than half a second
While this is certainly a neat approach, and much better memory-wise (as long as you can keep the maximum possible number within limits...), it isn't actually faster than using a hash. The slightly modified code (to make it comparable with the hash version I suggested) takes about the same running time on my system. For example, with 1_000_000 values to look up:
$ time ./757954_bytevector.pl >out
real 0m4.421s
user 0m4.368s
sys 0m0.048s
----------
#!/usr/bin/perl
use strict;
use warnings;

# create file with numbers to look up
open my $fh, ">", "in.txt" or die "$!";
for (1..1000000) {
    print $fh int rand 1e6, "\n";
}
close $fh;

my $ids = 1000000;    # last id
my $bin = 0;
substr($bin, $_, 1, pack("c", 0)) for 0..$ids;

# Create the index from 400000 random numbers
for (1..400000) {
    my $id = int rand 1e6;
    substr($bin, $id, 1, pack("c", 1));
}

# Search the ids
open my $fh2, "<", "in.txt" or die $!;
while (<$fh2>) {
    my ($num) = m/^(\d+)/;
    print "$num, " if unpack("c", substr($bin, $num, 1)) == 1;
}
close $fh2;
Re: searching through data
by FunkyMonk (Bishop) on Apr 16, 2009 at 13:34 UTC
vec produced the fastest searching on my system:
Update:
Not any more!
- Added AnonyMonk's bit_vec2
- Added array_slice
- Added array_for since I expected array_slice to be faster than it was
- Generally tweaked stuff
- Made it more Perl 5.8 friendly
As you can see, bit_vec2 is miles faster.
              Rate    baxy bin_chop citromatic  hash array_slice bit_vec1 array_for   vec bit_vec2
baxy        0.193/s     --     -75%       -89%  -90%        -94%     -96%      -96%  -97%    -100%
bin_chop    0.770/s   300%       --       -56%  -62%        -77%     -83%      -84%  -86%     -99%
citromatic   1.74/s   806%     127%         --  -13%        -48%     -62%      -63%  -69%     -97%
hash         2.02/s   947%     162%        16%    --        -40%     -56%      -57%  -64%     -97%
array_slice  3.38/s  1656%     339%        94%   68%          --     -25%      -28%  -40%     -95%
bit_vec1     4.54/s  2256%     489%       160%  125%         34%       --       -4%  -20%     -93%
array_for    4.73/s  2355%     514%       171%  135%         40%       4%        --  -17%     -93%
vec          5.67/s  2846%     637%       225%  181%         68%      25%       20%    --     -91%
bit_vec2     63.5/s 32870%    8150%      3539% 3050%       1778%    1300%     1243% 1019%       --
Full benchmarks of all techniques so far below
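For reference, the plain vec technique being timed looks something like this (a minimal sketch in the same style as the benchmark subs, not the actual benchmark code; @searched and @to_find are the benchmark's data sets):
vec => sub {
    my $vec = '';
    vec($vec, $_, 1) = 1 for @searched;               # one bit per number seen
    return 0 + grep { vec($vec, $_, 1) } @to_find;    # count the hits
},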
FunkyMonk, that is great stuff. I rewrote the bit vector version to make it a lot faster.
bit_vector2 => sub {
    my $vec1 = Bit::Vector->new(1000001);
    $vec1->Index_List_Store(@searched);
    my $vec2 = Bit::Vector->new(1000001);
    $vec2->Index_List_Store(@to_find);
    my $vec3 = Bit::Vector->new(1000001);
    $vec3->And($vec1, $vec2);              # intersection in one C-level AND
    my @present = $vec3->Index_List_Read();
    return 0+@present;
},
Re: searching through data
by dreadpiratepeter (Priest) on Apr 16, 2009 at 12:19 UTC
I would use a hash, like so (untested code warning):
use strict;
use warnings;

my @array;    # contains numbers (1..1000000) ~ 400000 numbers

my %hash;
open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    $hash{$1} = 1 if m/^(\d+)/;
}
close $in;

foreach my $num (@array) {
    print "$num, " if exists $hash{$num};
}
-pete
"Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: searching through data
by wfsp (Abbot) on Apr 16, 2009 at 12:28 UTC
hashing it (@array) would take up too much RAM...
A hash is still your best bet though. You could consider a disk based hash like DBM::Deep.
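For example, a minimal sketch of moving the lookup table to disk with DBM::Deep (the file name and variable names are just for illustration):
use strict;
use warnings;
use DBM::Deep;

# tie the lookup table to a file instead of keeping it all in RAM
my $db = DBM::Deep->new("lookup.db");
$db->{$_} = 1 for @array;    # @array as in the OP's code

open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    my ($num) = m/^(\d+)/ or next;
    print "$num, " if $db->{$num};
}
close $in;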
Re: searching through data
by repellent (Priest) on Apr 16, 2009 at 16:12 UTC
@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }
Now check whether vec($read,$n,1) is true for some $n.
These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.
Re: searching through data
by wol (Hermit) on Apr 16, 2009 at 12:32 UTC
I found that sorting @array takes a significant amount of time when there are 400000 elements.
Given that your search loops don't depend on it being sorted, I'd suggest removing that "optimisation" and seeing how that affects your timings.
As an aside, how many lines are there in your input file?
--
use JAPH;
print JAPH::asString();
Re: searching through data
by linuxer (Curate) on Apr 16, 2009 at 12:12 UTC
Just an idea:
Instead of searching through @array1 for each line of your input file, you could store the values you read from the file in another array and then compare the two arrays... and print the matching values.
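For instance, a minimal sketch of that idea, with a hash doing the final comparison step (@array1 as in the OP's code; the other names are illustrative):
my @from_file;
open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    push @from_file, $1 if /^(\d+)/;
}
close $in;

my %in_file = map { $_ => 1 } @from_file;
print "$_, " for grep { $in_file{$_} } @array1;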
Re: searching through data
by StupidLarry (Initiate) on Apr 16, 2009 at 12:19 UTC
If a hash is not an option, you can try a binary search.
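A minimal sketch of that approach (assuming @array holds the OP's numbers and @from_file the values read from the file):
# classic binary search over a numerically sorted copy of the array
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($sorted->[$mid] < $target) { $lo = $mid + 1 }
        elsif ($sorted->[$mid] > $target) { $hi = $mid - 1 }
        else  { return 1 }    # found
    }
    return 0;                 # not found
}

my @sorted = sort { $a <=> $b } @array;
print "$_, " for grep { binary_search(\@sorted, $_) } @from_file;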
Maybe you could use Bit::Vector; it will certainly be fast.
Re: searching through data
by Anonymous Monk on Apr 16, 2009 at 12:57 UTC
#!/usr/bin/perl -w
use strict;
use warnings;
use Bit::Vector;

my @array;    # contains numbers (1..1000000) ~ 400000 numbers

my $vector = Bit::Vector->new(1000001);
# please replace this bit ---------
$vector->Bit_On($_) for @array;
# ---------------------------------

open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    my ($num) = m/^(\d+)/ or next;
    print "$num, " if $vector->bit_test($num);
}
close $in;
Sorry, I posted the Bit::Vector idea above; here is the untested code.
Re: searching through data
by NiJo (Friar) on Apr 16, 2009 at 20:11 UTC
Bloom filters might provide an intelligent and scaling approach by "reusing bits for multiple rows". If your real application can tolerate some false positives, Bloom::Faster might be the ticket.
Reversing source and target arrays, redoing the comparison and comparing results should get rid of the false positives.
The effectiveness of all initial work depends on the probability of hits.
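A minimal sketch of the first pass with Bloom::Faster (the capacity and error rate are illustrative, not tuned):
use strict;
use warnings;
use Bloom::Faster;

# sized for ~400000 keys with a 0.1% false-positive rate (illustrative)
my $bloom = Bloom::Faster->new({ n => 400000, e => 0.001 });
$bloom->add($_) for @array;    # @array as in the OP's code

open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    my ($num) = m/^(\d+)/ or next;
    print "$num, " if $bloom->check($num);    # may be a false positive
}
close $in;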