baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a problem where I'm using the grep function, but the process is far too slow.

I have a file that contains some numbers, like:

in.txt:

13454
34456
1342
5667
987
3354
145
I also have a set of numbers in @array (from 1..1000000). The numbers are in that range, but there aren't 1,000,000 of them, only about 400,000 (some are missing). What I wish to do is go through my file and, for each number in it, check whether it is in @array; if it is, it's enough to print it out.
use strict;
use warnings;

my @array;                    # contains numbers (1..1000000), ~ 400000 numbers
my @array1 = sort @array;     # sort it to speed up the process??

open (in, "<", "in.txt") || die "$!";
while (<in>) {
    m/^(\d+)/;
    my @a = grep { $_ == $1 } @array1;
    print "@a, " if (@a);
}
close in;
Running the search takes me ~150 min. Does anyone know a trick to speed this up (without using databases: SQLite, MySQL, ...)?

So basically what I'm asking for is a way to replace grep efficiently, for this case only (if that is possible).

thnx

Update:

Hashing it (@array) would take up too much RAM (what I stated here is a very simplified example, not the actual thing, which is too complex and unnecessary to describe).

Update 2:

The input file is too big to be stored in memory (on a machine with 512 MB of RAM) :). That is also an important fact that I neglected :), but the idea is OK!

Update 3:

The input file has 0.3 billion rows, i.e. 300,000,000 (all numbers, some repeating). (A billion in my language is 1000x more than in English, so just to be sure.)

I don't know where wol is going with the question, but there are other things living in memory alongside those numbers :)

Replies are listed 'Best First'.
Re: searching through data
by almut (Canon) on Apr 16, 2009 at 12:17 UTC

    Use a hash, with your numbers as keys. That way, grepping through the entire array would become a simple hash lookup.

    #!/usr/bin/perl

    my @array = map { int rand 1e6 } 1..400000;

    # create file with numbers to look up
    open my $fh, ">", "in.txt" or die "$!";
    for (1..1000000) {
        print $fh int rand 1e6, "\n";
    }
    close $fh;

    my %lookup_table;
    $lookup_table{$_}++ for @array;

    open (in, "<", "in.txt") || die "$!";
    while (<in>) {
        my ($num) = m/^(\d+)/;
        print "$num, " if $lookup_table{$num};
    }
    close in;

    __END__
    $ time ./757954.pl >out

    real    0m4.141s
    user    0m4.004s
    sys     0m0.132s

    (Memory requirement approx. 100 M — or 80 M, if you get rid of the map for the @array initialisation)

    Update: with 300_000_000 rows, it takes about 15 min., which includes creating the 2 Gig random data file "in.txt" plus writing a 760 M output file. (Memory requirement is the same.)

      Instead of

      my @array = map { int rand 1e6 } 1..400000;
      ...
      $lookup_table{$_}++ for @array;

      one should use something like

      my %lookup_table = map { int rand 1e6 => 1 } 1..400000;

      in order to save memory. There is no need to have the list of numbers both in an array and in the hash.
        There is no need to have the list of numbers both in an array and in the hash.

        Well, I just left in the @array to make it easier for the OP to see what is what...  And if the idea is to reduce memory usage, you should definitely also get rid of the map, which would cut it down to 45 MB, as opposed to 108 MB with the map. (map creates all the elements on the stack before assigning them...)

        my %lookup_table;
        $lookup_table{int rand 1e6}++ for 1..400000;              # 45 M

        ---

        my %lookup_table = map { int rand 1e6 => 1 } 1..400000;   # 108 M

        Also, using ++ instead of assigning 1 has the added benefit of detecting duplicate numbers, should this ever be of interest...
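
        For instance (a tiny sketch of that idea with made-up numbers, not the OP's real data), the counts can later be used to report the duplicates:

        my %lookup_table;
        $lookup_table{$_}++ for (3, 5, 3, 7, 5, 3);    # toy input instead of the real list

        for my $num (sort { $a <=> $b } keys %lookup_table) {
            print "$num seen $lookup_table{$num} times\n" if $lookup_table{$num} > 1;
        }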

Re: searching through data
by citromatik (Curate) on Apr 16, 2009 at 12:43 UTC

    What about getting the job done in less than half a second? (No DBMs, no hashes.)

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;

    my $ids = 1000000;    # last id

    my $bin = 0;
    substr($bin, $_, 1, pack("c", 0)) for (0..$ids);

    # Create the index
    open my $fh1, "<", $file1 or die $!;
    while (my $id = <$fh1>) {
        chomp $id;
        substr($bin, $id, 1, pack("c", 1));
    }
    close $fh1;

    # Search $ids from file 2
    open my $fh2, "<", $file2 or die $!;
    while (my $id2 = <$fh2>) {
        chomp $id2;
        print "$id2\n" if ((unpack "c", substr($bin, $id2, 1)) == 1);
    }
    close $fh2;

    file1 contains 400000 numbers between 0 and 1000000. file2 contains 10000 numbers in the same range. Processing both takes less than half a second.

    time perl sp.pl file1 file2 > /dev/null

    real    0m0.448s
    user    0m0.448s
    sys     0m0.000s

    A bit of explanation: the program constructs a scalar ($bin) containing one zero byte for each possible id (1,000,000 of them). While processing the index (the first list of ids), it replaces the 0 at position $id with a 1.

    Then, it reads the second file of ids and checks if at position $id2, the $bin scalar has a 0 (not seen previously) or a 1 ($id already seen in the first file).

    Hope this helps

    citromatik

      getting the job done in less than a half of second

      While this is certainly a neat approach, and much better memory-wise (as long as you can keep the maximum possible number within limits...), it isn't actually faster than using a hash. The slightly modified code (to make it comparable with the hash version I suggested) takes about the same running time on my system. For example, with 1_000_000 values to look up:

      $ time ./757954_bytevector.pl >out

      real    0m4.421s
      user    0m4.368s
      sys     0m0.048s

      ----------

      #!/usr/bin/perl

      # create file with numbers to look up
      open (my $fh, ">", "in.txt") || die "$!";
      for (1..1000000) {
          print $fh int rand 1e6, "\n";
      }
      close $fh;

      my $ids = 1000000;    # last id

      my $bin = 0;
      substr($bin, $_, 1, pack("c", 0)) for (0..$ids);

      # Create the index
      for (1..400000) {
          my $id = int rand 1e6;
          substr($bin, $id, 1, pack("c", 1));
      }

      # Search $ids
      open my $fh2, "<", "in.txt" or die $!;
      while (<$fh2>) {
          my ($num) = m/^(\d+)/;
          print "$num, " if ((unpack "c", substr($bin, $num, 1)) == 1);
      }
Re: searching through data
by FunkyMonk (Bishop) on Apr 16, 2009 at 13:34 UTC
    vec produced the fastest searching on my system:

    Update:

    Not any more!

    • Added AnonyMonk's bit_vec2
    • Added array_slice
    • Added array_for since I expected array_slice to be faster than it was
    • Generally tweaked stuff
    • Made it more Perl 5.8 friendly

    As you can see, bit_vec2 is miles faster.

                     Rate   baxy bin_chop citromatic  hash array_slice bit_vec1 array_for   vec bit_vec2
    baxy          0.193/s     --     -75%       -89%  -90%        -94%     -96%      -96%  -97%    -100%
    bin_chop      0.770/s   300%       --       -56%  -62%        -77%     -83%      -84%  -86%     -99%
    citromatic     1.74/s   806%     127%         --  -13%        -48%     -62%      -63%  -69%     -97%
    hash           2.02/s   947%     162%        16%    --        -40%     -56%      -57%  -64%     -97%
    array_slice    3.38/s  1656%     339%        94%   68%          --     -25%      -28%  -40%     -95%
    bit_vec1       4.54/s  2256%     489%       160%  125%         34%       --       -4%  -20%     -93%
    array_for      4.73/s  2355%     514%       171%  135%         40%       4%        --  -17%     -93%
    vec            5.67/s  2846%     637%       225%  181%         68%      25%       20%    --     -91%
    bit_vec2       63.5/s 32870%    8150%      3539% 3050%       1778%    1300%     1243% 1019%       --

    Full benchmarks of all techniques so far below
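
    The full benchmark code isn't reproduced here, but as a rough idea, a vec-based lookup applied to the OP's setup might look like the following (a minimal sketch, not FunkyMonk's actual benchmark sub; the file name and counts are taken from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build a bit string with one bit per possible number
    my $bits = '';
    for (1..400000) {
        vec($bits, int rand 1e6, 1) = 1;    # stand-in for the real @array values
    }

    # test every number in the file against the bit string
    open my $fh, "<", "in.txt" or die "$!";
    while (<$fh>) {
        my ($num) = m/^(\d+)/ or next;
        print "$num, " if vec($bits, $num, 1);
    }
    close $fh;

    At one bit per possible value, the bit string for the 0..1000000 range needs only about 125 KB.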

      FunkyMonk, that is great stuff. I rewrote the bit vector code to make it a lot faster.
      bit_vector2 => sub {
          my $vec1 = Bit::Vector->new(1000001);
          $vec1->Index_List_Store(@searched);

          my $vec2 = Bit::Vector->new(1000001);
          $vec2->Index_List_Store(@to_find);

          my $vec3 = Bit::Vector->new(1000001);
          $vec3->And($vec1, $vec2);

          my @present = $vec3->Index_List_Read();
          return 0+@present;
      },
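
      In case anyone wants to run that sub outside the benchmark harness: @searched and @to_find are the two lists of numbers being intersected (the names come from the snippet above). A stand-alone sketch with made-up example values might look like this:

      use strict;
      use warnings;
      use Bit::Vector;

      my @searched = (13454, 34456, 1342, 5667);    # e.g. numbers read from the file
      my @to_find  = (1342, 987, 5667, 999999);     # e.g. the numbers in @array

      my $vec1 = Bit::Vector->new(1000001);
      $vec1->Index_List_Store(@searched);

      my $vec2 = Bit::Vector->new(1000001);
      $vec2->Index_List_Store(@to_find);

      my $vec3 = Bit::Vector->new(1000001);
      $vec3->And($vec1, $vec2);                     # bitwise intersection of the two sets

      print join(", ", $vec3->Index_List_Read()), "\n";   # prints: 1342, 5667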
Re: searching through data
by dreadpiratepeter (Priest) on Apr 16, 2009 at 12:19 UTC
    I would use a hash, like so (untested code warning):
    use strict;
    use warnings;

    my @array;    # contains numbers (1..1000000), ~ 400000 numbers
    my %hash;

    open (in, "<", "in.txt") || die "$!";
    while (<in>) {
        m/^(\d+)/;
        $hash{$1} = 1;
    }
    close in;

    foreach my $num (@array) {
        print "$num, " if exists $hash{$num};
    }


    -pete
    "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: searching through data
by wfsp (Abbot) on Apr 16, 2009 at 12:28 UTC
    hashing it (@array) would take up too much RAM...
    A hash is still your best bet, though. You could consider a disk-based hash like DBM::Deep.
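
    For example (an untested sketch of that idea; it assumes DBM::Deep is installed, and the file names are made up), the lookup hash can live on disk instead of in RAM:

    use strict;
    use warnings;
    use DBM::Deep;

    # disk-backed hash: the keys live in lookup.db, not in memory
    my $db = DBM::Deep->new("lookup.db");

    # fill it once with the ~400000 known numbers (numbers.txt is a placeholder name)
    open my $nums, "<", "numbers.txt" or die "$!";
    while (my $n = <$nums>) {
        chomp $n;
        $db->{$n} = 1;
    }
    close $nums;

    # then scan the big input file as before
    open my $in, "<", "in.txt" or die "$!";
    while (<$in>) {
        my ($num) = m/^(\d+)/ or next;
        print "$num, " if $db->{$num};
    }
    close $in;

    This trades speed for memory: it will be slower than an in-memory hash, but the footprint stays small.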
Re: searching through data
by repellent (Priest) on Apr 16, 2009 at 16:12 UTC
    It's in perlfaq4: How can I tell whether a certain element is contained in a list or array?

      If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:
      @articles = ( 1..10, 150..2000, 2017 );
      undef $read;
      for (@articles) { vec($read, $_, 1) = 1 }
      Now check whether vec($read,$n,1) is true for some $n.

      These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.
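
      Continuing the faq example, the check itself might then look like this (a tiny sketch reusing $read from the snippet above):

      for my $n (5, 100, 2017, 3000) {
          print "$n is in the list\n" if vec($read, $n, 1);
      }

      Only 5 and 2017 are printed, since 100 and 3000 are not in @articles.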
Re: searching through data
by wol (Hermit) on Apr 16, 2009 at 12:32 UTC
    I found that sorting @array takes a significant amount of time when there are 400000 elements.

    Given that your search loops don't depend on it being sorted, I'd suggest removing that "optimisation" and seeing how that affects your timings.

    As an aside, how many lines are there in your input file?

    --
    use JAPH;
    print JAPH::asString();

Re: searching through data
by linuxer (Curate) on Apr 16, 2009 at 12:12 UTC

    Just an idea:

    Instead of searching through @array1 for each line of your input file, you could store the values you read from the file in another array and then compare the two arrays... and print the matching values.
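
    A rough sketch of that idea, comparing the two arrays with a sort-and-merge pass so that no hash is needed (the variable names and numbers are made up):

    use strict;
    use warnings;

    my @wanted    = sort { $a <=> $b } (145, 987, 5667, 999999);       # stand-in for @array
    my @from_file = sort { $a <=> $b } (13454, 987, 1342, 5667, 145);  # numbers read from in.txt

    # walk both sorted lists in parallel and print the common values
    my ($i, $j) = (0, 0);
    while ($i < @wanted && $j < @from_file) {
        if    ($wanted[$i] < $from_file[$j]) { $i++ }
        elsif ($wanted[$i] > $from_file[$j]) { $j++ }
        else {
            print "$wanted[$i], ";
            $i++;
            $j++;
        }
    }
    print "\n";

    If the input file contains repeated numbers, the last branch would need a small tweak to decide whether repeats should be printed once or every time.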

Re: searching through data
by StupidLarry (Initiate) on Apr 16, 2009 at 12:19 UTC
    If a hash is not an option, you can try a binary search (a sketch follows below).
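
    A minimal sketch of such a binary search (it assumes the list has been sorted numerically, i.e. with sort { $a <=> $b }, not the default string sort used in the original post):

    sub in_sorted_list {
        my ($target, $sorted) = @_;    # $sorted is a reference to the sorted array
        my ($lo, $hi) = (0, $#$sorted);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            if    ($sorted->[$mid] < $target) { $lo = $mid + 1 }
            elsif ($sorted->[$mid] > $target) { $hi = $mid - 1 }
            else  { return 1 }
        }
        return 0;
    }

    # e.g. inside the original while loop:
    #   print "$1, " if in_sorted_list($1, \@array1);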
      Maybe you could use Bit::Vector; it will certainly be fast.
Re: searching through data
by Anonymous Monk on Apr 16, 2009 at 12:57 UTC
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Bit::Vector;

    my @array;    # contains numbers (1..1000000), ~ 400000 numbers

    my $vector = Bit::Vector->new(1000001);

    # please replace this bit ---------
    map { $vector->Bit_On($_) } @array;
    # ---------------------------------

    open (IN, "<", "in.txt") or die "$!";
    while (<IN>) {
        my ($num) = m/^(\d+)/ or next;    # use the captured number, not the whole line
        print "$num, " if $vector->bit_test($num);
    }
    close IN;
    Sorry, I posted the Bit::Vector idea below; here is the (untested) code.
Re: searching through data
by NiJo (Friar) on Apr 16, 2009 at 20:11 UTC
    Bloom filters might provide an intelligent and scalable approach by "reusing bits for multiple rows". If your real application can tolerate some false positives, Bloom::Faster might be the ticket (a toy sketch follows below).

    Reversing the source and target arrays, redoing the comparison, and comparing the results should get rid of the false positives.

    The effectiveness of all initial work depends on the probability of hits.
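
    To illustrate the principle only, here is a toy Bloom filter built on vec with two ad-hoc hash functions (the constants and values are made up; Bloom::Faster does this properly, in C, with a tunable false-positive rate):

    use strict;
    use warnings;

    my $size   = 2_000_000;    # number of bits in the filter
    my $filter = '';

    # two cheap, made-up hash functions; a real filter would use better ones
    sub h1 { ($_[0] * 2654435761) % $size }
    sub h2 { ($_[0] * 40503 + 12345) % $size }

    # add the known numbers (stand-in values)
    for my $n (145, 987, 5667) {
        vec($filter, h1($n), 1) = 1;
        vec($filter, h2($n), 1) = 1;
    }

    # a number *may* be present only if all of its bits are set
    for my $candidate (145, 13454, 5667) {
        print "$candidate: maybe present\n"
            if vec($filter, h1($candidate), 1) && vec($filter, h2($candidate), 1);
    }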