shawshankred has asked for the wisdom of the Perl Monks concerning the following question:

I have 2 files and want to create another file by comparing these 2 source files.

file1 is subset of file2.

file1 - (60,000 lines)

123
234
345
678

file2 - (16 mill lines)
A|12|r|some|56|78|90
E|123|r|some|56|78|90
D|678|r|some|56|78|90
C|12|r|some|56|78|90
F|345|r|y|98|0|0

Output is this.

E|123|r|some|56|78|90
D|678|r|some|56|78|90
F|345|r|y|98|0|0


I have written a script but it's taking more than 2 weeks to execute on a Sun Solaris server. Here is the code; please suggest a faster way to do it.
#! /bin/perl

$DATA_DIR  = "/export/home/data";
$Dump_Data = "/export/home/data/Dump";
$HSS_Data  = "/export/home/HSS/data";
$File2     = "$Dump_Data/File2";

open IN1, "cat $File1|" or die "Can't open $File1: $!\n";
open (OUT1, ">>File3.txt") or die "Can't open File3.txt: $!\n";

while ($line1 = <IN1>) {
    chomp($line1);
    open IN2, "cat $File2 |" or die "Can't open $File2: $!\n";
    while ($line2 = <IN2>) {
        if ($line2 =~ /$line1/) {
            print OUT1 "$line2";
            last;
        }
    }
    close(IN2);
}
close(IN1);

Re: Compare 2 files and create a new one if it matches
by GrandFather (Saint) on Sep 20, 2008 at 02:35 UTC

    Rereading a file 60,000 times is likely to slow things down somewhat. The trick is to read each file only once. To realise the trick you need to get the information you need from the smaller file and store it in memory. In this case a hash is probably the best way to store it because you can test very quickly for the match condition (assuming you don't actually require a regex match). Consider:

    use strict;
    use warnings;

    # Sample data
    my $catFile = <<CAT;
    123
    234
    345
    678
    CAT

    my $dataFile = <<DATA;
    A|12|r|some|56|78|90
    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    C|12|r|some|56|78|90
    F|345|r|y|98|0|0
    DATA

    # Build the hash
    open my $catIn, '<', \$catFile;
    my %keys = map {chomp; $_ => 1} <$catIn>;
    close $catIn;

    open my $dataIn, '<', \$dataFile;
    while (<$dataIn>) {
        chomp;
        my @parts = split /\|/;
        next unless exists $keys{$parts[1]};
        print join ('|', @parts), "\n";
    }
    close($dataIn);

    Prints:

    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    F|345|r|y|98|0|0

    Perl reduces RSI - it saves typing
Re: Compare 2 files and create a new one if it matches
by ikegami (Patriarch) on Sep 20, 2008 at 02:30 UTC
    Load the first file into a hash instead of reading the file over and over again.

    And what's with using cat?!?!?!

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1 = '...';
    my $File2 = '...';
    my $File3 = '...';

    my %keep;
    {
        open(my $fh_keys, '<', $File1)
            or die("Can't open key file \"$File1\": $!\n");
        while (<$fh_keys>) {
            chomp;
            $keep{$_} = 1;
        }
    }

    {
        open(my $fh_in, '<', $File2)
            or die("Can't open input file \"$File2\": $!\n");
        open(my $fh_out, '>', $File3)
            or die("Can't create output file \"$File3\": $!\n");
        while (<$fh_in>) {
            my ($key) = /^[^|]*\|([^|]*)/;
            print $fh_out $_ if $keep{$key};
        }
    }

    Update: Added missing chomp as per reply.
    Update: Added missing "$" in "my $fh_in" and "my $fh_out".

      A chomp is needed for the hash keys.
      Thanks a lot ikegami. I'll try this out and see how long it takes.
Re: Compare 2 files and create a new one if it matches
by McDarren (Abbot) on Sep 20, 2008 at 02:47 UTC
    Here is what I would do (code snippets untested):

    • Open the first file
    • Iterate through the file, building a hash with each line as a key, eg:
      # assumes that duplicates in first file should be ignored.
      my %wanted;
      open my $in, '<', $file1 or die "$!\n";
      while (my $line = <$in>) {
          chomp $line;
          $wanted{$line}++;
      }
    • Open the second file
    • Iterate through it line by line
    • Extract the second "field" from each line using split
      my $foo = (split /\|/, $line)[1];
    • If this value exists as a key in the hash you built earlier, print the record to your third file
      print OUT $line if $wanted{$foo};

    This may not be the fastest approach, but it only opens and reads each of your input files once, as opposed to 60,000 x 16,000,000 = 960,000,000,000 times, which is what your current code does.
    So I'd expect it to be just a tad faster ;)

    Hope this helps,
    Darren :)

      Actually, your code is only about 60,000 times faster than the OP's code. In the OP's version the large file is opened and read once for each line of the smaller file (60,000 times, that is); the smaller file is opened and read once only. Since the hash version still has to read and test all 16 million lines of the large file once, the improvement works out to roughly a factor of 60,000, not 960 billion.

      In other respects your reply is pretty much the same as ikegami's and my replies ;).


      Perl reduces RSI - it saves typing
Re: Compare 2 files and create a new one if it matches
by swampyankee (Parson) on Sep 20, 2008 at 14:16 UTC

    Pure Perl solutions are no doubt best, but one could read the smaller file into an array, add appropriate markers (e.g. escaped pipe symbols at both ends of each element), eliminate duplicates, write the resulting array to a new file, and then use fgrep -f against the larger file, capturing its output with backticks (`).
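
    A rough, untested sketch of that fgrep -f idea (the file paths and the temporary pattern-file name below are just placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1    = '/export/home/HSS/data/File1';    # placeholder paths
    my $File2    = '/export/home/data/Dump/File2';
    my $patterns = '/tmp/patterns.txt';              # hypothetical temp file

    # Write each unique key surrounded by pipes so the fixed-string match
    # is anchored to a whole pipe-delimited field.  With fgrep (fixed
    # strings) the pipes need no escaping.
    my %seen;
    open my $keys, '<', $File1    or die "Can't open $File1: $!\n";
    open my $pat,  '>', $patterns or die "Can't create $patterns: $!\n";
    while (my $key = <$keys>) {
        chomp $key;
        print $pat "|$key|\n" unless $seen{$key}++;
    }
    close $pat;
    close $keys;

    # Let fgrep make the single pass over the big file and capture its output.
    my @matches = `fgrep -f $patterns $File2`;
    print @matches;

    Note that a fixed string like "|123|" would also match 123 turning up in a later pipe-delimited field, so this only gives the same answer as the hash solutions if a key value can never appear in any other field.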

    Being a brute-force-and-ignorance sort of guy, my first pure Perl attempt would be to read both files into arrays, generate a really, really long regex from the search criteria (smaller) file, and use grep. Given my Perl-mojo, this would not work, and I'd have to loop through the individual records of the larger file.
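
    For what that might look like, here is a minimal, untested sketch (the file names are placeholders) that joins the keys into one big alternation and greps the slurped records against it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1 = 'file1.txt';    # placeholder names
    my $File2 = 'file2.txt';

    # Slurp the key file and build one huge alternation;
    # quotemeta guards against regex metacharacters in the keys.
    open my $keys, '<', $File1 or die "Can't open $File1: $!\n";
    chomp(my @keys = <$keys>);
    close $keys;
    my $re = join '|', map { quotemeta } @keys;

    # Slurp the big file into an array and grep it in one go,
    # anchoring the alternation to the second pipe-delimited field.
    open my $in, '<', $File2 or die "Can't open $File2: $!\n";
    my @records = <$in>;
    close $in;
    print grep { /^[^|]*\|(?:$re)\|/ } @records;

    Whether the regex engine copes gracefully with a 60,000-way alternation (and whether 16 million slurped records fit in memory) is exactly the doubt raised below.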


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

      It would require tons of memory.
      #!/usr/bin/perl
      use strict;
      use warnings;

      use Regexp::List qw( );

      my $File1 = '...';
      my $File2 = '...';
      my $File3 = '...';

      my $keep_re;
      {
          open(my $fh_keys, '<', $File1)
              or die("Can't open key file \"$File1\": $!\n");
          $keep_re = Regexp::List
              ->new()
              ->list2re( map { my $s = $_; chomp($s); $s } <$fh_keys> );
      }

      {
          open(my $fh_in, '<', $File2)
              or die("Can't open input file \"$File2\": $!\n");
          open(my $fh_out, '>', $File3)
              or die("Can't create output file \"$File3\": $!\n");
          print $fh_out grep /^[^|]*\|$keep_re\|/, <$fh_in>;
      }

        In addition to the raw storage of a large and a not-so-large file, I've no idea how much memory processing the regex would take. I am also a bit doubtful of the likelihood of a regex with several thousand alternatives actually working. The Brute Force & Ignorance method does have its downsides.


        Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: Compare 2 files and create a new one if it matches
by lamp (Chaplain) on Sep 20, 2008 at 03:12 UTC
    How about using the 'Tie::File' module?
    use strict;
    use warnings;
    use Tie::File;

    my %seen;

    tie my @file1, 'Tie::File', 'file1.txt' or die;
    tie my @file2, 'Tie::File', 'file2.txt' or die;

    foreach (@file1) {
        chomp;
        $seen{$_}++;
    }

    for (@file2) {
        my $key = (split /\|/, $_)[1];
        print "$_\n" if $seen{$key};
    }

    untie(@file1);
    untie(@file2);
    Output:

    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    F|345|r|y|98|0|0

      Why complicate the problem by introducing tied files when you need to make only one pass through the file in any case? You are simply adding overhead for no gain and obfuscating the code into the bargain.


      Perl reduces RSI - it saves typing