shawshankred has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl script that merges 3 files into one file. However, due to some issues with the algorithm, it currently takes a long time to finish. I need some suggestions/advice from the expert monks here. Any help is much appreciated.

Example of Input and Output files

Inputfile1 === Inputfile 2 === Inputfile3

A1 B1 ====== B3, C3, D3 ====== B4
A2 B2 ====== B7, C7, D7 ====== C5
A3 B3 ====== B10, C10, D10 ====== D1
A4 B4 ====== B1, C1, D1 ====== D20
….
….

Output file should look like this

A1 B1 ====== B1 C1 D1 ====== B1 C1 D1
A2 B2 ====== B2 C2 D2 ====== B2 C2 D2
A3 B3 ====== B3 C3 D3 ====== B3 C3 D3
A4 B4 ====== B4 C4 D4 ====== B4 C4 D4
A5 B5 ====== NOT_FNDinFile2 ====== B5 C5 D5
A6 B6 ====== B6 C6 D6 ====== NOT_FNDinFile3
A7 B7 ====== NOT_FNDinFile2 ====== NOT_FNDinFile3
NOT_FNDinFile1 ====== B8 C8 D8 ====== B8 C8 D8
…… ……

Here is my code.

First I merge File1 and File2 and create an output file, OUT1.
Next I merge that output file with File3.
open(OUT1, '>', $File4) or die("Can't create output file \"$File4\": $!\n");
open(OUT2, '>', $File5) or die("Can't create output file \"$File5\": $!\n");

%file2_data;
{
    open($fh_keys, '<', $File2) or die("Can't open key file \"$File2\": $!\n");
    while ($line1 = <$fh_keys>) {
        chomp($line1);
        $line1 =~ s/^\s+//;    # Remove the space from beginning of the line
        $line1 =~ s/\s+$//;    # Remove the space from end of the line
        $first = (split /\t/, $line1)[0];
        $file2_data{$first} = $line1;
    }
}
close($fh_keys);

%file1_data;
{
    open($fh_in, '<', $File1) or die("Can't open input file \"$File1\": $!\n");
    while (<$fh_in>) {
        chomp;
        $flag = 0;
        $sec = (split /\t/, $_)[1];
        $file1_data{$sec} = $_;
        if ($file2_data{$sec}) {
            print OUT1 "$_\tFOUND_Data\t$file2_data{$sec}\n";
        }
        else {
            print OUT1 "$_\tNOTFOUND_IN_FILE2\n";
        }
    }
}
close($fh_in);

while ( ($KEY_1, $VALUE_1) = each %file2_data ) {
    if ($file1_data{$KEY_1}) {
        print OUT1 "$file1_data{$KEY_1}\tFOUND_Data\t$VALUE_1\n";
    }
    else {
        print OUT1 "$VALUE_1\tNOTFOUND_IN_FILE1\n";
    }
}

open my $File3_IN1, q{<}, "$File3" or die qq{Can't open "$File3": $!\n};
my @numbers = ();
while (<$File3_IN1>) {
    chomp;
    $_ =~ s/^\s+//;
    $_ =~ s/\s+$//;
    # Sort the File3 entries by their leading letter
    if ($_ =~ /^B/) {
        push @numbers_B, $_;
    }
    elsif ($_ =~ /^C/) {
        push @numbers_C, $_;
    }
    elsif ($_ =~ /^D/) {
        push @numbers_D, $_;
    }
}
close($File3_IN1);
close(OUT1);

# One alternation regex per letter, built from every entry in File3
my $rxFindB = do{ local $" = q{|}; qr{(@numbers_B)}; };
my $rxFindC = do{ local $" = q{|}; qr{(@numbers_C)}; };
my $rxFindD = do{ local $" = q{|}; qr{(@numbers_D)}; };

open OUT1_IN2,"cat $File4 |" or die "Can't open $File4: $!\n";
while (<OUT1_IN2>) {
    chomp;
    if (m{$rxFindB}) {
        print OUT2 "$_\t$1\t";
    }
    else {
        print OUT2 "$_\tNOT_FOUND_B_InFile3\t";
    }
    if (m{$rxFindC}) {
        print OUT2 "$1\t";
    }
    else {
        print OUT2 "NOT_FOUND_C_InFile3\t";
    }
    if (m{$rxFindD}) {
        print OUT2 "$1\n";
    }
    else {
        print OUT2 "NOT_FOUND_D_InFile3\n";
    }
}
close(OUT1_IN2);

Re: Merge 3 files into one file
by ELISHEVA (Prior) on Mar 12, 2009 at 20:01 UTC

    You don't actually need two output files here. You can eliminate a rather large step by reading files 1, 2, and 3 into hashes and then doing the matching directly on the hashes. You will save a lot of time because you eliminate one whole file write and one whole file read.
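
    As a minimal sketch of the "read each file into a hash" step: the helper name load_by_column is hypothetical, and the assumption that the files are tab-separated with the key in a known column is taken from the code in the question. The demo reads from an in-memory string so it runs without real input files.

```perl
use strict;
use warnings;

# Hypothetical helper: read a tab-separated file and index each line by
# the field in column $col (0-based). Returns a reference to the hash.
sub load_by_column {
    my ($file, $col) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    my %lines;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;           # trim both ends
        next if $line eq '';
        my $key = (split /\t/, $line)[$col];
        next unless defined $key;          # line has too few columns
        $lines{$key} = $line;              # last line wins on duplicate keys
    }
    close $fh;
    return \%lines;
}

# Demo on an in-memory "file" (a reference to a string) instead of a path.
my $sample = "A1\tB1\nA2\tB2\n";
my $h = load_by_column(\$sample, 1);
print "$_ => $h->{$_}\n" for sort keys %$h;
```

    With real files this would be called once per input, for example with column 1 for the first file and column 0 for the second, mirroring the split indices used in the question.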

    The solution is not very different from what you learned in "Compare 2 files and create a new one if it matches". There you had two input files that you wanted to merge into one, and you learned you could do it by reading the files into hashes and matching them in memory. Now you have three.

    ++ for trying to apply what you learned. The only mistake you made was thinking that you needed one output file for every two files, but you don't. The key thing for merging the files is the hashes. Here's pseudo-code for comparing three hashes (rather than two):

    # %hFirst  stores lines loaded in from first file
    # %hSecond stores lines loaded in from second file
    # %hThird  stores lines loaded in from third file
    # OUT is a file handle to your one and only output file

    while (my ($sKey, $fld1) = each(%hFirst)) {
        my $fld2 = exists $hSecond{$sKey} ? $hSecond{$sKey} : "NOT DEFINED";
        my $fld3 = exists $hThird{$sKey}  ? $hThird{$sKey}  : "NOT DEFINED";
        print OUT "$fld1======$fld2======$fld3\n";
    }
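
    The pseudo-code above walks only the keys of the first hash, so a key that appears in file 2 or file 3 but not in file 1 (the NOT_FNDinFile1 row in the desired output) would never be printed. One possible way to cover that case, sketched here under assumptions, is to loop over the union of the keys of all three hashes. The helper name merge_three and the toy data are made up for illustration.

```perl
use strict;
use warnings;

# Toy data standing in for the three loaded files (assumed layout).
my %first  = (B1 => "A1 B1",    B2 => "A2 B2",    B5 => "A5 B5");
my %second = (B1 => "B1 C1 D1", B2 => "B2 C2 D2", B8 => "B8 C8 D8");
my %third  = (B1 => "B1 C1 D1", B5 => "B5 C5 D5", B8 => "B8 C8 D8");

# Hypothetical helper: one output row per key seen in any of the hashes.
sub merge_three {
    my ($h1, $h2, $h3) = @_;

    # Union of all keys, so rows missing from the first file still appear.
    my %seen;
    $seen{$_} = 1 for keys(%$h1), keys(%$h2), keys(%$h3);

    my @rows;
    for my $key (sort keys %seen) {
        my $f1 = exists $h1->{$key} ? $h1->{$key} : 'NOT_FNDinFile1';
        my $f2 = exists $h2->{$key} ? $h2->{$key} : 'NOT_FNDinFile2';
        my $f3 = exists $h3->{$key} ? $h3->{$key} : 'NOT_FNDinFile3';
        push @rows, join(' ====== ', $f1, $f2, $f3);
    }
    return @rows;
}

print "$_\n" for merge_three(\%first, \%second, \%third);
```

    Each key costs three hash lookups, so the work grows with the total number of distinct keys rather than with the product of the file sizes.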

    Best, beth

      ELISHEVA,
      Thanks a lot, as suggested I used hash and it worked fine and takes less than a minute.
Re: Merge 3 files into one file
by jethro (Monsignor) on Mar 12, 2009 at 20:12 UTC

    Obligatory plug for "use strict; use warnings;"
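
    A small illustration of what the two pragmas buy you, using made-up variable names: strict turns a misspelled variable into a compile-time error, and warnings flags a numeric comparison applied to strings. The string eval and the __WARN__ handler are there only so both diagnostics can be shown without stopping the script.

```perl
use strict;
use warnings;

my %file2_data = (B1 => "B1\tC1\tD1");

# strict: the misspelled hash name below fails to compile instead of
# silently creating a new, empty global hash.
my $ok = eval q{ my $n = scalar keys %file2_dta; 1 };
print "strict caught: $@" unless $ok;

# warnings: == compares numerically, so two different non-numeric
# strings both count as 0 and compare equal.
my @warnings;
my $same;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my ($x, $y) = ("B4", "C5");
    $same = ($x == $y) ? 1 : 0;
}
print "numeric == says equal: $same\n";
print "warning: $_" for @warnings;
```

    For string keys, eq (or a pattern match such as /^B/) is the comparison that behaves as expected.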

    Note that the regexes at the end are likely the cause of any excess runtime not caused by disk I/O. Like ELISHEVA said, hashes are much better here (although memory usage will be somewhat higher).
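
    To make the comparison concrete, here is a rough sketch with made-up keys and lines (not the poster's data). It finds the same matches twice: once with a large alternation regex built the way the question builds it, and once by splitting out the key field and checking a hash.

```perl
use strict;
use warnings;

my @keys  = map { "B$_" } 1 .. 1000;         # made-up keys
my @lines = map { "A$_\tB$_" } 1 .. 2000;    # made-up merged lines

# Approach 1: one big alternation, tried against every line.
my $rx = do { local $" = '|'; qr{\b(@keys)\b} };
my $by_regex = grep { /$rx/ } @lines;

# Approach 2: pull out the key field and do a single hash lookup.
my %want = map { $_ => 1 } @keys;
my $by_hash = grep { exists $want{ (split /\t/)[1] } } @lines;

print "regex: $by_regex, hash: $by_hash\n";
```

    Both approaches should report the same count. The hash version does a roughly constant amount of work per line no matter how many keys there are, which is the reason hashes are recommended here.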

      If the regexps are a problem, they'll go much faster in 5.10+ thanks to tries.
      jethro,
      Thanks a lot, as suggested I used hash and it worked fine and takes less than a minute.
Re: Merge 3 files into one file
by bellaire (Hermit) on Mar 12, 2009 at 19:50 UTC
    My advice is to profile your code to identify where the slowness originates. Then you'll know what you need to address. If your bottleneck is the disk I/O and you're dealing with large files, there may not be much you can do.
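
    A dedicated profiler such as Devel::NYTProf gives line-level detail, but even coarse timing of each phase can show where the time goes. The sketch below uses the core Time::HiRes module. The helper name timed and the three stand-in phases are assumptions for illustration, not the poster's actual steps.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code block, report its wall-clock time on
# STDERR, and pass its return values through.
sub timed {
    my ($label, $code) = @_;
    my $start  = time();
    my @result = $code->();
    printf STDERR "%-8s %.3f s\n", $label, time() - $start;
    return @result;
}

# Stand-ins for the real phases of the script.
my @data   = timed('load',  sub { map { "B$_" } 1 .. 50_000 });
my %index  = timed('index', sub { map { $_ => 1 } @data });
my ($hits) = timed('match', sub { scalar grep { exists $index{$_} } @data });

print "matched $hits keys\n";
```

    Wrapping the file reads, the hash building, and the matching loop separately makes it clear whether disk I/O or the matching itself dominates.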
      bellaire, Thanks a lot for the suggestion, I used hash and the script now is taking less than a minute.