baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a problem where I'm using the grep function, but the process is far too slow.

I have a file that contains some numbers, like:

in.txt:

13454
34456
1342
5667
987
3354
145
I also have a set of numbers in @array (from 1..1000000). The numbers are in that range, but there aren't 1,000,000 of them, only about 400,000 (some are missing). What I wish to do is go through my file and, for each number in it, check whether it is in @array; if it is, it's enough to print it out.
use strict;
use warnings;

my @array;                    # contains numbers (1..1000000), ~ 400000 numbers
my @array1 = sort @array;     # sort it to speed up the process??

open (in, "<", "in.txt") || die "$!";
while (<in>) {
    m/^(\d+)/;
    my @a = grep { $_ == $1 } @array1;
    print "@a, " if (@a);
}
close in;
Running the search takes me ~150 min. Does anyone know a trick to speed this up (without using databases: SQLite, MySQL, ...)?

So basically what I'm asking for is a way to replace grep efficiently, for this case only (if that is possible).

thnx

Update:

Hashing it (@array) would take up too much RAM (what I stated here is a very simplified example, not the actual thing, which is too complex and unnecessary to describe).

Update 2:

The input file is too big to be stored in memory (on a machine with 512 MB of RAM) :). That is also an important fact that I neglected :), but the idea is OK!

Update 3:

The input file has 0.3 billion rows, i.e. 300,000,000 (all numbers, some repeating). (A billion in my language is 1000x more than in English, so just to be sure.)

I don't know where wol is going with the question, but there are other things living in memory alongside those numbers :)

Replies are listed 'Best First'.
Re: searching through data
by almut (Canon) on Apr 16, 2009 at 12:17 UTC

    Use a hash, with your numbers as keys. That way, grepping through the entire array would become a simple hash lookup.

    #!/usr/bin/perl

    my @array = map { int rand 1e6 } 1..400000;

    # create file with numbers to look up
    open my $fh, ">", "in.txt" or die "$!";
    for (1..1000000) {
        print $fh int rand 1e6, "\n";
    }
    close $fh;

    my %lookup_table;
    $lookup_table{$_}++ for @array;

    open (in, "<", "in.txt") || die "$!";
    while (<in>) {
        my ($num) = m/^(\d+)/;
        print "$num, " if $lookup_table{$num};
    }
    close in;

    __END__
    $ time ./757954.pl >out

    real    0m4.141s
    user    0m4.004s
    sys     0m0.132s

    (Memory requirement approx. 100 M — or 80 M, if you get rid of the map for the @array initialisation)

    Update: with 300_000_000 rows, it takes about 15 min., which includes creating the 2 Gig random data file "in.txt" plus writing a 760 M output file. (Memory requirement is the same.)

      Instead of

      my @array = map { int rand 1e6 } 1..400000;
      ...
      $lookup_table{$_}++ for @array;

      one should use something like

      my %lookup_table = map { int rand 1e6 => 1 } 1..400000;

      in order to save memory. There is no need to have the list of numbers both in an array and in the hash.
        There is no need to have the list of numbers both in an array and in the hash.

        Well, I just left in the @array to make it easier for the OP to see what is what...  And if the idea is to reduce memory usage, you should definitely also get rid of the map, which would cut it down to 45 MB, as opposed to 108 MB with the map. (map creates all the elements on the stack before assigning them...)

        my %lookup_table;
        $lookup_table{int rand 1e6}++ for 1..400000;              # 45 M

        ---

        my %lookup_table = map { int rand 1e6 => 1 } 1..400000;   # 108 M

        Also, using ++ instead of assigning 1 has the added benefit of detecting duplicate numbers, should this ever be of interest...
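
        For instance (a tiny sketch of that idea with made-up numbers, not the OP's real data), the counts can later be used to report the duplicates:

        my %lookup_table;
        $lookup_table{$_}++ for (3, 5, 3, 7, 5, 3);    # toy input instead of the real list

        for my $num (sort { $a <=> $b } keys %lookup_table) {
            print "$num seen $lookup_table{$num} times\n" if $lookup_table{$num} > 1;
        }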

Re: searching through data
by citromatik (Curate) on Apr 16, 2009 at 12:43 UTC

    What about getting the job done in less than half a second? (No DBMs, no hashes.)

    use strict;
    use warnings;

    my ($file1, $file2) = @ARGV;

    my $ids = 1000000;    # last id

    my $bin = 0;
    substr($bin, $_, 1, pack("c", 0)) for (0..$ids);

    # Create the index
    open my $fh1, "<", $file1 or die $!;
    while (my $id = <$fh1>) {
        chomp $id;
        substr($bin, $id, 1, pack("c", 1));
    }
    close $fh1;

    # Search $ids from file 2
    open my $fh2, "<", $file2 or die $!;
    while (my $id2 = <$fh2>) {
        chomp $id2;
        print "$id2\n" if ((unpack "c", substr($bin, $id2, 1)) == 1);
    }
    close $fh2;

    file1 contains 400000 numbers between 0 and 1000000. file2 contains 10000 numbers in the same range. Processing both takes less than half a second.

    time perl sp.pl file1 file2 > /dev/null

    real    0m0.448s
    user    0m0.448s
    sys     0m0.000s

    A bit of explanation: the program constructs a scalar ($bin) containing one zero byte for each possible id (1,000,000 of them). While processing the index (the first list of ids), it replaces the 0 at position $id with a 1.

    Then, it reads the second file of ids and checks if at position $id2, the $bin scalar has a 0 (not seen previously) or a 1 ($id already seen in the first file).

    Hope this helps

    citromatik

      getting the job done in less than a half of second

      While this is certainly a neat approach, and much better memory-wise (as long as you can keep the maximum possible number within limits...), it isn't actually faster than using a hash. The slightly modified code (to make it comparable with the hash version I suggested) takes about the same running time on my system. For example, with 1_000_000 values to look up:

      $ time ./757954_bytevector.pl >out

      real    0m4.421s
      user    0m4.368s
      sys     0m0.048s

      ----------

      #!/usr/bin/perl

      # create file with numbers to look up
      open (my $fh, ">", "in.txt") || die "$!";
      for (1..1000000) {
          print $fh int rand 1e6, "\n";
      }
      close $fh;

      my $ids = 1000000;    # last id

      my $bin = 0;
      substr($bin, $_, 1, pack("c", 0)) for (0..$ids);

      # Create the index
      for (1..400000) {
          my $id = int rand 1e6;
          substr($bin, $id, 1, pack("c", 1));
      }

      # Search $ids
      open my $fh2, "<", "in.txt" or die $!;
      while (<$fh2>) {
          my ($num) = m/^(\d+)/;
          print "$num, " if ((unpack "c", substr($bin, $num, 1)) == 1);
      }
Re: searching through data
by FunkyMonk (Bishop) on Apr 16, 2009 at 13:34 UTC
    vec produced the fastest searching on my system:

    Update:

    Not any more!

    • Added AnonyMonk's bit_vec2
    • Added array_slice
    • Added array_for since I expected array_slice to be faster than it was
    • Generally tweaked stuff
    • Made it more Perl 5.8 friendly

    As you can see, bit_vec2 is miles faster.

                     Rate   baxy bin_chop citromatic  hash array_slice bit_vec1 array_for   vec bit_vec2
    baxy          0.193/s     --     -75%       -89%  -90%        -94%     -96%      -96%  -97%    -100%
    bin_chop      0.770/s   300%       --       -56%  -62%        -77%     -83%      -84%  -86%     -99%
    citromatic     1.74/s   806%     127%         --  -13%        -48%     -62%      -63%  -69%     -97%
    hash           2.02/s   947%     162%        16%    --        -40%     -56%      -57%  -64%     -97%
    array_slice    3.38/s  1656%     339%        94%   68%          --     -25%      -28%  -40%     -95%
    bit_vec1       4.54/s  2256%     489%       160%  125%         34%       --       -4%  -20%     -93%
    array_for      4.73/s  2355%     514%       171%  135%         40%       4%        --  -17%     -93%
    vec            5.67/s  2846%     637%       225%  181%         68%      25%       20%    --     -91%
    bit_vec2       63.5/s 32870%    8150%      3539% 3050%       1778%    1300%     1243% 1019%       --

    Full benchmarks of all techniques so far below
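
    The full benchmark code isn't reproduced here, but as a rough idea, a vec-based lookup applied to the OP's setup might look like the following (a minimal sketch, not FunkyMonk's actual benchmark sub; the file name and counts are taken from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build a bit string with one bit per possible number
    my $bits = '';
    for (1..400000) {
        vec($bits, int rand 1e6, 1) = 1;    # stand-in for the real @array values
    }

    # test every number in the file against the bit string
    open my $fh, "<", "in.txt" or die "$!";
    while (<$fh>) {
        my ($num) = m/^(\d+)/ or next;
        print "$num, " if vec($bits, $num, 1);
    }
    close $fh;

    At one bit per possible value, the bit string for the 0..1000000 range needs only about 125 KB.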

      FunkyMonk, that is great stuff. I rewrote the bit vector code to make it a lot faster.
      bit_vector2 => sub {
          my $vec1 = Bit::Vector->new(1000001);
          $vec1->Index_List_Store(@searched);

          my $vec2 = Bit::Vector->new(1000001);
          $vec2->Index_List_Store(@to_find);

          my $vec3 = Bit::Vector->new(1000001);
          $vec3->And($vec1, $vec2);

          my @present = $vec3->Index_List_Read();
          return 0+@present;
      },
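
      In case anyone wants to run that sub outside the benchmark harness: @searched and @to_find are the two lists of numbers being intersected (the names come from the snippet above). A stand-alone sketch with made-up example values might look like this:

      use strict;
      use warnings;
      use Bit::Vector;

      my @searched = (13454, 34456, 1342, 5667);    # e.g. numbers read from the file
      my @to_find  = (1342, 987, 5667, 999999);     # e.g. the numbers in @array

      my $vec1 = Bit::Vector->new(1000001);
      $vec1->Index_List_Store(@searched);

      my $vec2 = Bit::Vector->new(1000001);
      $vec2->Index_List_Store(@to_find);

      my $vec3 = Bit::Vector->new(1000001);
      $vec3->And($vec1, $vec2);                     # bitwise intersection of the two sets

      print join(", ", $vec3->Index_List_Read()), "\n";   # prints: 1342, 5667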
Re: searching through data
by dreadpiratepeter (Priest) on Apr 16, 2009 at 12:19 UTC
    I would use a hash, like so (untested code warning):
    use strict;
    use warnings;

    my @array;    # contains numbers (1..1000000), ~ 400000 numbers
    my %hash;

    open (in, "<", "in.txt") || die "$!";
    while (<in>) {
        m/^(\d+)/;
        $hash{$1} = 1;
    }
    close in;

    foreach my $num (@array) {
        print "$num, " if exists $hash{$num};
    }


    -pete
    "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: searching through data
by wfsp (Abbot) on Apr 16, 2009 at 12:28 UTC
    hashing it (@array) would take up too much RAM...
    A hash is still your best bet, though. You could consider a disk-based hash like DBM::Deep.
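
    For example (an untested sketch of that idea; it assumes DBM::Deep is installed, and the file names are made up), the lookup hash can live on disk instead of in RAM:

    use strict;
    use warnings;
    use DBM::Deep;

    # disk-backed hash: the keys live in lookup.db, not in memory
    my $db = DBM::Deep->new("lookup.db");

    # fill it once with the ~400000 known numbers (numbers.txt is a placeholder name)
    open my $nums, "<", "numbers.txt" or die "$!";
    while (my $n = <$nums>) {
        chomp $n;
        $db->{$n} = 1;
    }
    close $nums;

    # then scan the big input file as before
    open my $in, "<", "in.txt" or die "$!";
    while (<$in>) {
        my ($num) = m/^(\d+)/ or next;
        print "$num, " if $db->{$num};
    }
    close $in;

    This trades speed for memory: it will be slower than an in-memory hash, but the footprint stays small.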
Re: searching through data
by repellent (Priest) on Apr 16, 2009 at 16:12 UTC
    It's in perlfaq4: How can I tell whether a certain element is contained in a list or array?

      If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:
      @articles = ( 1..10, 150..2000, 2017 );
      undef $read;
      for (@articles) { vec($read, $_, 1) = 1 }
      Now check whether vec($read,$n,1) is true for some $n.

      These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.
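
      Continuing the faq example, the check itself might then look like this (a tiny sketch reusing $read from the snippet above):

      for my $n (5, 100, 2017, 3000) {
          print "$n is in the list\n" if vec($read, $n, 1);
      }

      Only 5 and 2017 are printed, since 100 and 3000 are not in @articles.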
Re: searching through data
by wol (Hermit) on Apr 16, 2009 at 12:32 UTC
    I found that sorting @array takes a significant amount of time when there are 400000 elements.

    Given that your search loops don't depend on it being sorted, I'd suggest removing that "optimisation" and seeing how that affects your timings.

    As an aside, how many lines are there in your input file?

    --
    use JAPH;
    print JAPH::asString();

Re: searching through data
by linuxer (Curate) on Apr 16, 2009 at 12:12 UTC

    Just an idea:

    Instead of searching through @array1 for each line of your input file, you could store the values you read from the file in another array and then compare the two arrays... and print the matching values.
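
    A rough sketch of that idea, comparing the two arrays with a sort-and-merge pass so that no hash is needed (the variable names and numbers are made up):

    use strict;
    use warnings;

    my @wanted    = sort { $a <=> $b } (145, 987, 5667, 999999);       # stand-in for @array
    my @from_file = sort { $a <=> $b } (13454, 987, 1342, 5667, 145);  # numbers read from in.txt

    # walk both sorted lists in parallel and print the common values
    my ($i, $j) = (0, 0);
    while ($i < @wanted && $j < @from_file) {
        if    ($wanted[$i] < $from_file[$j]) { $i++ }
        elsif ($wanted[$i] > $from_file[$j]) { $j++ }
        else {
            print "$wanted[$i], ";
            $i++;
            $j++;
        }
    }
    print "\n";

    If the input file contains repeated numbers, the last branch would need a small tweak to decide whether repeats should be printed once or every time.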

Re: searching through data
by StupidLarry (Initiate) on Apr 16, 2009 at 12:19 UTC
    If a hash is not an option, you can try a binary search (a sketch follows below).
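
    A minimal sketch of such a binary search (it assumes the list has been sorted numerically, i.e. with sort { $a <=> $b }, not the default string sort used in the original post):

    sub in_sorted_list {
        my ($target, $sorted) = @_;    # $sorted is a reference to the sorted array
        my ($lo, $hi) = (0, $#$sorted);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            if    ($sorted->[$mid] < $target) { $lo = $mid + 1 }
            elsif ($sorted->[$mid] > $target) { $hi = $mid - 1 }
            else  { return 1 }
        }
        return 0;
    }

    # e.g. inside the original while loop:
    #   print "$1, " if in_sorted_list($1, \@array1);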
      Maybe you could use Bit::Vector; it will certainly be fast.
Re: searching through data
by Anonymous Monk on Apr 16, 2009 at 12:57 UTC
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Bit::Vector;

    my @array;    # contains numbers (1..1000000), ~ 400000 numbers

    my $vector = Bit::Vector->new(1000001);

    # please replace this bit ---------
    map { $vector->Bit_On($_) } @array;
    # ---------------------------------

    open (IN, "<", "in.txt") or die "$!";
    while (<IN>) {
        my ($num) = m/^(\d+)/ or next;    # use the captured number, not the whole line
        print "$num, " if $vector->bit_test($num);
    }
    close IN;
    Sorry, I posted the Bit::Vector idea below; here is the (untested) code.
Re: searching through data
by NiJo (Friar) on Apr 16, 2009 at 20:11 UTC
    Bloom filters might provide an intelligent and scalable approach by "reusing bits for multiple rows". If your real application can tolerate some false positives, Bloom::Faster might be the ticket (a toy sketch follows below).

    Reversing the source and target arrays, redoing the comparison, and comparing the results should get rid of the false positives.

    The effectiveness of all initial work depends on the probability of hits.
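
    To illustrate the principle only, here is a toy Bloom filter built on vec with two ad-hoc hash functions (the constants and values are made up; Bloom::Faster does this properly, in C, with a tunable false-positive rate):

    use strict;
    use warnings;

    my $size   = 2_000_000;    # number of bits in the filter
    my $filter = '';

    # two cheap, made-up hash functions; a real filter would use better ones
    sub h1 { ($_[0] * 2654435761) % $size }
    sub h2 { ($_[0] * 40503 + 12345) % $size }

    # add the known numbers (stand-in values)
    for my $n (145, 987, 5667) {
        vec($filter, h1($n), 1) = 1;
        vec($filter, h2($n), 1) = 1;
    }

    # a number *may* be present only if all of its bits are set
    for my $candidate (145, 13454, 5667) {
        print "$candidate: maybe present\n"
            if vec($filter, h1($candidate), 1) && vec($filter, h2($candidate), 1);
    }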