in reply to Indexing two large text files

A couple of questions on the structures of your files:

The way you currently have this implemented is approximately O(N**2) (assuming that the number of lines in each file are approximately equal). For data files on disk, this is not a good situation. Sorted data files can reduce this to O(N), which is about as good as you are going to get.

Update: Reread the original code, saw what it was actually doing rather than the apparent intent of what it should do:

open my $if1, '<', $input_f1 or die "Can't open $input_f1: $!\n"; open my $if2, '<', $input_f2 or die "Can't open $input_f2: $!\n"; while(<$if1>) { # Read each line of file1 my $line = $_; chomp($line); my ($key1, $vf1, $vf2) = split(/\*/, $line); seek($if2, 0, 0); # Make sure file handle point to the beginning o +f the file while (<$if2>) { # Read each line of file2 my $line2 = $_; chomp($line2); my ($key2, $value) = split(/\*/, $line2); if ($key1 eq $key2) { $vf1 = $value; ############ <strike> # } else { # $vf1 = ' '; ############ </strike> } } ############ <add> print join( '*', $key1, $vf1, $vf2 ), "\n"; ############ </add> }

The inner loop does not quite do what you state you want to do. You will only get an updated value for the last key in file1, and only then if the last key in file2 is also the same. Otherwise, you are clearing each and every value for $vf1. Strike out the marked section, and I think your script's logic will be correct, although it may not work quickly on larger data sets.

--MidLifeXis

Replies are listed 'Best First'.
Re^2: Indexing two large text files
by never_more (Initiate) on Apr 09, 2012 at 13:41 UTC

    Thanks MidLifeXis for your quick response. Here are the answers to your questions:

    • file2 is sorted by the key value, but I think I can sort file1 before processing it.
    • Depends on the date, the content of file1 and file2 might change. And for module DBM::Deep, because it is an old Unix environment, DBM::Deep is not available, and I don't have the permission to install it. :(
    • Yes, these records are fixed size, I am gonna try binary search.
    • This sounds a good idea, trying...
    • sort and join would be my last choice :)
    Thanks again