mdog has asked for the wisdom of the Perl Monks concerning the following question:

Brethern --

I want to find most effiecient / fastest way to compare an array of sorted data against a gzipped file of data that is also sorted against the same sort of key.

The array would look like:

@array = qw(item1 item2 item4 item5);

and the data in the file would look like:

blahv {TAB} ackg {TAB} item2 blahs {TAB} acka {TAB} item3 blaha {TAB} ackd {TAB} item4

I need to see if items in the array exist in the file and then do something with that line of data.

The rub is that the array has about 3 million values and the file has about 4 million lines so my usual hackish brute force foreach loops just won't cut it. I have looked at other nodes about comparing arrays but still can't get it right in my head.

I'm not even sure which way to process this...With the outer loop being the file or the array.

I know that once I am up to "item2" in the array I don't want start scanning the file from the beginning but I don't know the best way to accomplish that.

I know I could put the file in memory (seems like a bad idea...that's a lot of memory in use then) and have the array be the outer loop, I could keep a counter that tells where I left off each time in the file content array and jump to that starting spot...or have the counter keep jumping through the file until it hits that line number if I were to leave it gzipped but I'd like the pros opinion if you don't mind.

Many thanks,
mdog

  • Comment on Efficiently compare sorted array vs another sorted array and pattern match
  • Download Code

Replies are listed 'Best First'.
Re: Efficiently compare sorted array vs another sorted array and pattern match
by Roy Johnson (Monsignor) on Apr 08, 2005 at 22:35 UTC
    There's no "outer loop". This is like the merge in mergesort. You look at the first element of the array and of the file. If they're equal, you do your thing and advance them both. Otherwise, you advance the smaller one and repeat.

    Caution: Contents may have been coded under pressure.
Re: Efficiently compare sorted array vs another sorted array and pattern match
by tlm (Prior) on Apr 09, 2005 at 02:11 UTC

    I made the script below (click "Read more...") a lot more formal than necessary to do what it does, just to make it clearer and easier to modify. Think of all the little functions as hooks you can modify to suit your needs.

    There's one wrinkle that needs clarification, and it has to do with the issue of repeats. In the example you gave @array has no repeated elements, which seems reasonable, but I was not sure if it was fair to assume that the ordering of the lines in the file was "strict" (i.e. no two lines in the input file have the same entry in the final position), so the script below does not make this assumption. This is why the position in the array is advanced only if the word in the array is strictly less than the word coming from the file. If in fact it is possible to assume that the ordering of the lines in the file is strict, then you could make the code more symmetrical and also shave off a few nanoseconds from the running time by uncommenting the commented line; this will result in having both the position in the array and in the file advance whenever a match is found. I think it is better as it is, since it works equally well whether the input file is strictly ordered or not.

    % perl -le 'print join "\t", map 'item'.int(rand 500), 1..5 for 1..100 +0' \ | sort -k 5,5 | gzip > mongo.gz % perl 446183.pl mongo.gz item326 item362 item39 item84 item1 item378 item164 item305 item336 item1 item238 item431 item284 item452 item2 item115 item76 item230 item319 item4 item121 item479 item258 item301 item4 item3 item480 item187 item269 item4 item278 item327 item252 item459 item5
    HTH.

    the lowliest monk

    Update: Minor refactoring: tightened the scope of a few lexicals. Basic program structure remains unchanged.

Re: Efficiently compare sorted array vs another sorted array and pattern match
by eXile (Priest) on Apr 09, 2005 at 04:46 UTC
    why not use a hash instead of an array? untested:
    %hash = map { $_ => 1 } @array; open (FILE, $file) or die "blah"; while (<FILE>) { my $key = (split("\t",$_))[2]; next unless ($hash{$key}); # do processing of line here ... }
Re: Efficiently compare sorted array vs another sorted array and pattern match
by thekestrel (Friar) on Apr 09, 2005 at 15:02 UTC
    Perhaps this is an option...
    1) Loop all values of array putting them in hash with hash{val} -= 1
    2) Loop all values in the file putting them in hash with hash{val} += 1

    Then hash values of 0 exist in both, values of +1 exist only in the file and values of -1 exist only in the array.

    Regards Paul.