In reply to: many to many join on text files

The following was designed to optimize for memory usage. It stores only a single hash, whose keys are the values of the key field seen in the first file, and whose values are arrays of integers (byte offsets into that file). That means if the first file has a million lines, there will be a million numbers in memory, plus the overhead of the hash and however many arrays. Despite this memory optimization, I believe it is still pretty efficient in time; the cost is a second, random-access, read of the first file.
    # first, scan the first file, noting the file pos's on which each key occurs.
    my %key_pos_in_first_file; # key=key, val=array of file positions.
    open F1, "+< $first_file" or die "open $first_file for random read - $!\n";
    my $p1 = 0;
    while (<F1>) {
        chomp;
        my @l = split /\|/;
        push @{ $key_pos_in_first_file{ $l[0] } }, $p1;
        $p1 = tell F1;
    }

    # second, go through the second file, joining.
    open F2, "< $second_file" or die "open $second_file for read - $!\n";
    while (<F2>) {
        chomp;
        my @l2 = split /\|/;
        # go to each pos in the first file and use that line
        for my $p1 ( @{ $key_pos_in_first_file{ $l2[0] } } ) {
            seek F1, $p1, 0;
            my $l1 = <F1>;
            chomp $l1;
            my @l1 = split /\|/, $l1;
            # join
            print "@l2 - @l1\n";
        }
    }
    close F2;
    close F1;
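For illustration only (the data below is made up, and $first_file / $second_file are assumed to already hold the two filenames), here is roughly what a run looks like. Say the first file contains

    1|apple
    2|banana
    1|cherry

and the second file contains

    1|red
    2|yellow

After the first loop, the index holds one byte offset per line of the first file (offsets here assume Unix line endings), which you could inspect with the core Data::Dumper module:

    use Data::Dumper;
    print Dumper \%key_pos_in_first_file;
    # $VAR1 = {
    #           '1' => [ 0, 17 ],
    #           '2' => [ 8 ]
    #         };

The second loop then seeks back to those offsets and prints one line per matching pair:

    1 red - 1 apple
    1 red - 1 cherry
    2 yellow - 2 banana

The tradeoff is deliberate: only the key and a file position are kept in memory for each line of the first file, and the non-key fields are re-read with seek only when a key from the second file actually matches.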

jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.