in reply to searching through data

Use a hash with your numbers as keys. That way, grepping through the entire array becomes a simple hash lookup.
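As a minimal sketch of the idea (with a hypothetical @wanted list and query value, not the OP's data): a grep has to scan the whole list for every query, while a hash exists-check is a single lookup:

```perl
use strict;
use warnings;

my @wanted = (12, 345, 6789);    # numbers to search for (made-up values)

# linear scan: examines every element, O(n) per query
my $found_by_grep = grep { $_ == 345 } @wanted;

# hash lookup: O(1) per query after building the hash once
my %is_wanted = map { ($_ => 1) } @wanted;
my $found_by_hash = exists $is_wanted{345} ? 1 : 0;

print "grep: $found_by_grep, hash: $found_by_hash\n";    # prints "grep: 1, hash: 1"
```

For repeated lookups, the one-time cost of building the hash quickly pays for itself.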

#!/usr/bin/perl
my @array = map { int rand 1e6 } 1..400000;

# create file with numbers to look up
open my $fh, ">", "in.txt" or die "$!";
for (1..1000000) {
    print $fh int rand 1e6, "\n";
}
close $fh;

my %lookup_table;
$lookup_table{$_}++ for @array;

open my $in, "<", "in.txt" or die "$!";
while (<$in>) {
    my ($num) = m/^(\d+)/;
    print "$num, " if $lookup_table{$num};
}
close $in;

__END__
$ time ./757954.pl >out

real    0m4.141s
user    0m4.004s
sys     0m0.132s

(Memory requirement: approx. 100 MB, or 80 MB if you get rid of the map for the @array initialisation.)

Update: with 300_000_000 rows, it takes about 15 min., which includes creating the 2 GB random data file "in.txt" plus writing a 760 MB output file. (Memory requirement is the same.)

Replies are listed 'Best First'.
Re^2: searching through data
by evaluator (Monk) on Apr 17, 2009 at 08:59 UTC
    Instead of
    my @array = map { int rand 1e6 } 1..400000;
    ...
    $lookup_table{$_}++ for @array;
    one should use something like
    my %lookup_table = map {int rand 1e6 => 1} 1..400000;
    in order to save memory requirements. There is no need to have the list of numbers both in an array and in the hash.
      There is no need to have the list of numbers both in an array and in the hash.

      Well, I just left in the @array to make it easier for the OP to see what is what...  And if the idea is to reduce memory usage, you should definitely also get rid of the map, which would cut it down to 45 MB, as opposed to 108 MB with the map. (map creates all the elements on the stack before assigning them...)

      my %lookup_table;
      $lookup_table{int rand 1e6}++ for 1..400000;            # 45 M
      ---
      my %lookup_table = map {int rand 1e6 => 1} 1..400000;   # 108 M

      Also, using ++ instead of assigning 1 has the added benefit of detecting duplicate numbers, should this ever be of interest...
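A small illustration of that last point (with made-up numbers, not the OP's data): with ++, the hash values double as occurrence counts, so duplicates in the input can be spotted afterwards:

```perl
use strict;
use warnings;

# hypothetical input containing duplicates
my @numbers = (42, 7, 42, 99, 7, 7);

my %lookup_table;
$lookup_table{$_}++ for @numbers;

# any key whose count exceeds 1 appeared more than once
for my $num (sort { $a <=> $b } keys %lookup_table) {
    print "$num seen $lookup_table{$num} time(s)\n"
        if $lookup_table{$num} > 1;
}
# prints:
# 7 seen 3 time(s)
# 42 seen 2 time(s)
```

Assigning a constant 1 instead would still answer "is this number present?", but it throws the multiplicity information away.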