Re: Indexing two large text files

A couple of questions on the structures of your files:

Are the source files sorted by the key value? If so, you may be able to use some sort of binary search algorithm.
Is the data static? Perhaps some other storage format for the file (database, DBM::Deep, etc) might be a 'better' way of storing the data. Additionally, if a text format is the 'correct' way of storing the data, perhaps pre-sorting it would provide the ability to use a more efficient search algorithm.
Are the records fixed size? If so, the binary search (above) is easier to implement; otherwise you may need to index your text files. You would do that by reading through one of the files and recording the offset (tell) at which each key is found, then using that offset to read only the single line needed back into memory (seek). If the data files have a very large number of lines, even the indexes may exhaust your available memory.
If both files are sorted by the keys, you can just step through them in tandem, skipping records that are missing from one or the other until you reach the end of the data. This would enable you to only have one line from each file in memory at a time, and make only a single pass through the data files.
If you are on unix, this could be accomplished with sort and join without the memory constraints, given sufficient disk space.
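
A rough sketch of the tandem walk over two sorted files, in case it helps. It assumes both files are sorted on the first field in the same order that Perl's lt uses (a byte-wise sort such as LC_ALL=C sort should match), that keys are unique in file2, and that fields are '*'-delimited as in your code. The names next_record and merge_sorted are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read one record and split it on '*'.
# Returns an array ref, or undef at end of file.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    return [ split /\*/, $line, -1 ];
}

# Walk two key-sorted files in tandem, holding one record from each in memory.
# Assumption: both files are sorted by string comparison on the first field.
sub merge_sorted {
    my ($file1, $file2, $out) = @_;
    open my $fh1, '<', $file1 or die "Can't open $file1: $!\n";
    open my $fh2, '<', $file2 or die "Can't open $file2: $!\n";

    my $rec2 = next_record($fh2);
    while ( my $rec1 = next_record($fh1) ) {
        my ($key1, $vf1, $vf2) = @$rec1;
        # Skip file2 records whose keys sort before the current file1 key.
        $rec2 = next_record($fh2) while $rec2 && $rec2->[0] lt $key1;
        # Take the value from file2 only when the keys match.
        $vf1 = $rec2->[1] if $rec2 && $rec2->[0] eq $key1;
        print {$out} join( '*', $key1, $vf1, $vf2 ), "\n";
    }
    close $fh1;
    close $fh2;
    return;
}
```

Called as merge_sorted($input_f1, $input_f2, \*STDOUT), this makes a single pass over each file. Keys present in only one file are either printed unchanged (file1) or skipped (file2).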

The way you currently have this implemented is approximately O(N**2) (assuming that the number of lines in each file is approximately equal), since file2 is reread from the start for every line of file1. For data files on disk, this is not a good situation. Sorted data files can reduce this to O(N), which is about as good as you are going to get.
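
If sorting turns out not to be practical, the tell/seek index mentioned above also avoids rescanning file2 for every line of file1. A minimal sketch, assuming '*'-delimited lines with the key in the first field and enough memory to hold every key plus one offset; build_index and fetch_record are illustrative names, not anything from your script.

```perl
use strict;
use warnings;

# Build a key => byte offset index for a '*'-delimited file.
# Only keys and offsets are held in memory, not the records themselves.
sub build_index {
    my ($fh) = @_;
    my %offset_of;
    seek $fh, 0, 0 or die "Can't seek: $!\n";
    while (1) {
        my $pos  = tell $fh;    # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        my ($key) = split /\*/, $line, 2;
        next unless defined $key && length $key;    # ignore blank lines
        $offset_of{$key} = $pos;    # a later duplicate key replaces an earlier one
    }
    return \%offset_of;
}

# Read back the single line for $key, or return undef if it was never indexed.
sub fetch_record {
    my ($fh, $index, $key) = @_;
    return unless exists $index->{$key};
    seek $fh, $index->{$key}, 0 or die "Can't seek: $!\n";
    my $line = <$fh>;
    chomp $line;
    return $line;
}
```

You would build the index once over file2, then call fetch_record for each key read from file1. Each lookup costs one seek and one line read.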

Update: Reread the original code, saw what it was actually doing rather than the apparent intent of what it should do:

open my $if1, '<', $input_f1 or die "Can't open $input_f1: $!\n";
open my $if2, '<', $input_f2 or die "Can't open $input_f2: $!\n";
while(<$if1>) {  # Read each line of file1
    my $line = $_;
    chomp($line);
    my ($key1, $vf1, $vf2)  = split(/\*/, $line);
    seek($if2, 0, 0); # Make sure file handle point to the beginning of the file
    while (<$if2>) {  # Read each line of file2
        my $line2 = $_;
        chomp($line2);
        my ($key2, $value) = split(/\*/, $line2);
        if ($key1 eq $key2) {
            $vf1 = $value;
############ <strike>
#        } else {
#            $vf1 = ' ';
############ </strike>
        }
    }
############ <add>
    print join( '*', $key1, $vf1, $vf2 ), "\n";
############ </add>
}

The inner loop does not quite do what you state you want to do. You will only get an updated value for the last key in file1, and only then if the last key in file2 is also the same. Otherwise, you are clearing each and every value for $vf1. Strike out the marked section, and I think your script's logic will be correct, although it may not work quickly on larger data sets.
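
If file2 is sorted by key and every record has the same length, the inner loop could be swapped for a binary search over record offsets. This is only a sketch under those assumptions: $rec_len is the record length in bytes including the newline, the keys are stored exactly as they are searched for (no padding to strip), and find_value is a made-up name.

```perl
use strict;
use warnings;

# Binary search a key-sorted file of fixed-size, '*'-delimited records.
# $rec_len is the record length in bytes INCLUDING the trailing newline
# (an assumption about the data, not something taken from the original script).
sub find_value {
    my ($fh, $rec_len, $key) = @_;
    my ($lo, $hi) = ( 0, int( ( -s $fh ) / $rec_len ) - 1 );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $fh, $mid * $rec_len, 0 or die "Can't seek: $!\n";
        my $got = read $fh, my $rec, $rec_len;
        die "Short read at record $mid\n"
            unless defined $got && $got == $rec_len;
        chomp $rec;
        my ($k, $value) = split /\*/, $rec, 2;
        if    ( $k lt $key ) { $lo = $mid + 1 }
        elsif ( $k gt $key ) { $hi = $mid - 1 }
        else                 { return $value }
    }
    return;    # key not present
}
```

Inside the outer loop, something like my $value = find_value($if2, $rec_len, $key1); followed by $vf1 = $value if defined $value; would take the place of the seek and the inner while. That is roughly log2(N) reads per key instead of N.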

--MidLifeXis


Replies are listed 'Best First'.

Re^2: Indexing two large text files
by never_more (Initiate) on Apr 09, 2012 at 13:41 UTC

Thanks MidLifeXis for your quick response. Here are the answers to your questions:

file2 is sorted by the key value, but I think I can sort file1 before processing it.
Depending on the date, the contents of file1 and file2 might change. As for DBM::Deep, this is an old Unix environment where the module is not available, and I don't have permission to install it. :(
Yes, these records are fixed size, I am gonna try binary search.
This sounds like a good idea, trying...
sort and join would be my last choice :)
