in reply to Compare 2 files and create a new one if it matches

Pure Perl solutions are no doubt best, but one could also shell out: read the smaller file into an array, wrap each element in markers (e.g. escaped pipe symbols at both ends), eliminate duplicates, write the resulting array to a new file, then run fgrep -f against that file and capture its output with backticks (`).
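
A minimal sketch of that idea, under some assumptions: the helper name fgrep_matches is made up for illustration, grep -F stands in for fgrep, and the file names are assumed to be safe to interpolate into a shell command. Because -F treats every pattern as a fixed string, the pipes need no escaping inside the pattern file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use File::Temp qw( tempfile );

# Hypothetical helper: return the lines of $data_file that contain any
# key from $key_file as a field with a pipe on both sides.
sub fgrep_matches {
    my ($key_file, $data_file) = @_;

    open(my $fh_keys, '<', $key_file)
        or die("Can't open key file \"$key_file\": $!\n");

    # Strip newlines, skip empty keys, wrap in pipes, drop duplicates.
    my %seen;
    my @patterns =
        grep { !$seen{$_}++ }
        map  { "|$_|" }
        grep { length }
        map  { chomp(my $s = $_); $s }
        <$fh_keys>;
    close($fh_keys);

    # Write one fixed-string pattern per line to a temporary file.
    my ($fh_pat, $pat_file) = tempfile(UNLINK => 1);
    print $fh_pat "$_\n" for @patterns;
    close($fh_pat)
        or die("Can't write pattern file \"$pat_file\": $!\n");

    # Backticks in list context return one element per output line.
    # Assumption: neither file name contains shell metacharacters.
    my @lines = `grep -F -f "$pat_file" "$data_file"`;
    return @lines;
}
```

Calling fgrep_matches('keys.txt', 'data.txt') would return the matching lines, which can then be printed to the new file. Note that a key in the first or last field only matches if the line itself starts or ends with a pipe.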

Being a brute-force-and-ignorance sort of guy, my first pure Perl attempt would be to read both files into arrays, generate a really, really long regex from the search criteria (smaller) file, and use grep. Given my Perl-mojo, this would not work, and I'd have to loop through the individual records of the larger file.
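
The fallback of looping through the records of the larger file could look something like the sketch below. It swaps the giant regex for a hash lookup, which is not what the post describes but fits the same loop. The names load_keys and filter_by_key are hypothetical, and the key is assumed to be the second pipe-delimited field, followed by another pipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one key per line into a hash for fast lookup.
sub load_keys {
    my ($fh_keys) = @_;
    my %keep;
    while (my $key = <$fh_keys>) {
        chomp($key);
        $keep{$key} = 1 if length($key);
    }
    return \%keep;
}

# Hypothetical helper: copy to $fh_out every line of $fh_in whose second
# pipe-delimited field is one of the keys. Only one line of the larger
# file is held in memory at a time.
sub filter_by_key {
    my ($keep, $fh_in, $fh_out) = @_;
    while (my $line = <$fh_in>) {
        # Limit of 3: the second field is only newline-free when a pipe
        # follows it, so lines that end right after the key never match.
        my (undef, $key) = split(/\|/, $line, 3);
        print $fh_out $line
            if defined($key) && exists($keep->{$key});
    }
}
```

Only the smaller file is stored, so memory use is governed by the number of keys rather than by the size of the larger file.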


Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re^2: Compare 2 files and create a new one if it matches
by ikegami (Patriarch) on Sep 22, 2008 at 20:15 UTC
    It would require tons of memory.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Regexp::List qw( );

    my $File1 = '...';
    my $File2 = '...';
    my $File3 = '...';

    # Build one regex that matches any of the keys in $File1.
    my $keep_re;
    {
        open(my $fh_keys, '<', $File1)
            or die("Can't open key file \"$File1\": $!\n");

        $keep_re = Regexp::List
            ->new()
            ->list2re(
                map { my $s = $_; chomp($s); $s }
                    <$fh_keys>
            );
    }

    # Keep the lines of $File2 whose second pipe-delimited field is a key.
    {
        open(my $fh_in, '<', $File2)
            or die("Can't open input file \"$File2\": $!\n");
        open(my $fh_out, '>', $File3)
            or die("Can't create output file \"$File3\": $!\n");

        print $fh_out grep /^[^|]*\|$keep_re\|/, <$fh_in>;
    }

      In addition to the raw storage of a large and a not-so-large file, I've no idea how much memory processing the regex would take. I am also a bit doubtful of the likelihood of a regex with several thousand alternatives actually working. The Brute Force & Ignorance method does have its downsides.



        I am also a bit doubtful of the likelihood of a regex with several thousand alternatives actually working.

        I used Regexp::List, so there are at most {character set size} alternatives in each alternation.
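
For comparison, here is a sketch of the plain alternation the doubt was about, built with core Perl only. The name build_key_regex is hypothetical and the field layout is assumed to be the same as in the script above. Since perl 5.10 the regex engine applies a trie optimisation to alternations of literal strings, so a long list of this kind is less of a problem than it once was, though this is worth benchmarking on real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile a regex matching lines whose second
# pipe-delimited field is exactly one of @keys.
sub build_key_regex {
    my @keys = @_;

    # Drop empty keys and duplicates.
    my %seen;
    my @unique = grep { length($_) && !$seen{$_}++ } @keys;

    # With no keys, return a pattern that can never match.
    return qr/(?!)/ if !@unique;

    # quotemeta escapes characters such as "." and "|" so that every
    # key is matched literally.
    my $alternation = join('|', map { quotemeta($_) } @unique);

    return qr/^[^|]*\|(?:$alternation)\|/;
}
```

The compiled pattern can be tested against each line inside a while (my $line = <$fh_in>) loop, so the larger file never has to be held in memory in full.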