shawshankred has asked for the wisdom of the Perl Monks concerning the following question:

I have 2 files and want to create another file by comparing these 2 source files.

file1 is subset of file2.

file1 - (60,000 lines)

123
234
345
678

file2 - (16 mill lines)
A|12|r|some|56|78|90
E|123|r|some|56|78|90
D|678|r|some|56|78|90
C|12|r|some|56|78|90
F|345|r|y|98|0|0

Output is this.

E|123|r|some|56|78|90
D|678|r|some|56|78|90
F|345|r|y|98|0|0


I have written a script but it's taking more than 2 weeks to execute on a Sun Solaris server. Here is the code; please suggest a faster way to do it.
#! /bin/perl

$DATA_DIR  = "/export/home/data";
$Dump_Data = "/export/home/data/Dump";
$HSS_Data  = "/export/home/HSS/data";
$File2     = "$Dump_Data/File2";

open IN1, "cat $File1|" or die "Can't open $File1: $!\n";
open (OUT1, ">>File3.txt") or die "Can't open File3.txt: $!\n";

while ($line1 = <IN1>) {
    chomp($line1);
    open IN2, "cat $File2 |" or die "Can't open $File2: $!\n";
    while ($line2 = <IN2>) {
        if ($line2 =~ /$line1/) {
            print OUT1 "$line2";
            last;
        }
    }
    close(IN2);
}
close(IN1);

Re: Compare 2 files and create a new one if it matches
by GrandFather (Saint) on Sep 20, 2008 at 02:35 UTC

    Rereading a file 60,000 times is likely to slow things down somewhat. The trick is to read each file only once. To realise the trick you need to get the information you need from the smaller file and store it in memory. In this case a hash is probably the best way to store it because you can test very quickly for the match condition (assuming you don't actually require a regex match). Consider:

    use strict;
    use warnings;

    # Sample data
    my $catFile = <<CAT;
    123
    234
    345
    678
    CAT

    my $dataFile = <<DATA;
    A|12|r|some|56|78|90
    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    C|12|r|some|56|78|90
    F|345|r|y|98|0|0
    DATA

    # Build the hash
    open my $catIn, '<', \$catFile;
    my %keys = map {chomp; $_ => 1} <$catIn>;
    close $catIn;

    open my $dataIn, '<', \$dataFile;
    while (<$dataIn>) {
        chomp;
        my @parts = split /\|/;
        next unless exists $keys{$parts[1]};
        print join ('|', @parts), "\n";
    }
    close($dataIn);

    Prints:

    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    F|345|r|y|98|0|0

    Perl reduces RSI - it saves typing
Re: Compare 2 files and create a new one if it matches
by ikegami (Patriarch) on Sep 20, 2008 at 02:30 UTC
    Load the first file into a hash instead of reading the file over and over again.

    And what's with using cat?!?!?!

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1 = '...';
    my $File2 = '...';
    my $File3 = '...';

    my %keep;
    {
        open(my $fh_keys, '<', $File1)
            or die("Can't open key file \"$File1\": $!\n");
        while (<$fh_keys>) {
            chomp;
            $keep{$_} = 1;
        }
    }

    {
        open(my $fh_in, '<', $File2)
            or die("Can't open input file \"$File2\": $!\n");
        open(my $fh_out, '>', $File3)
            or die("Can't create output file \"$File3\": $!\n");
        while (<$fh_in>) {
            my ($key) = /^[^|]*\|([^|]*)/;
            print $fh_out $_ if $keep{$key};
        }
    }

    Update: Added missing chomp as per reply.
    Update: Added missing "$" in "my $fh_in" and "my $fh_out".

      A chomp is needed for the hash keys.
      Thanks a lot ikegami. I'll try this out and see how long it takes.
Re: Compare 2 files and create a new one if it matches
by McDarren (Abbot) on Sep 20, 2008 at 02:47 UTC
    Here is what I would do (code snippets untested):

    • Open the first file
    • Iterate through the file, building a hash with each line as a key, eg:
      # assumes that duplicates in first file should be ignored.
      my %wanted;
      open my $in, '<', $file1 or die "$!\n";
      while (my $line = <$in>) {
          chomp $line;
          $wanted{$line}++;
      }
    • Open the second file
    • Iterate through it line by line
    • Extract the second "field" from each line using split
      my $foo = (split /\|/, $line)[1];
    • If this value exists as a key in the hash you built earlier, print the record to your third file
      print OUT $line if $wanted{$foo};

    This may not be the fastest approach, but it only opens and reads each of your input files once, as opposed to 60,000 x 16,000,000 = 960,000,000,000 times, which is what your current code does.
    So I'd expect it to be just a tad faster ;)

    Hope this helps,
    Darren :)

      Actually, your code is only about 60,000 times faster than the OP's code. In the OP's version the large file is opened and read once for each line of the smaller file (60,000 times, that is); the smaller file is opened and read once only. Since the hash version still has to read and test all 16 million lines of the large file once, the improvement works out to roughly a factor of 60,000, not 960 billion.

      In other respects your reply is pretty much the same as ikegami's and my replies ;).


      Perl reduces RSI - it saves typing
Re: Compare 2 files and create a new one if it matches
by swampyankee (Parson) on Sep 20, 2008 at 14:16 UTC

    Pure Perl solutions are no doubt best, but one could read the smaller file into an array, add appropriate markers (e.g. escaped pipe symbols at both ends of each element), eliminate duplicates, write the resulting array to a new file, and then use fgrep -f against the larger file, capturing its output with backticks (`).
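
    A rough, untested sketch of that fgrep -f idea (the file paths and the temporary pattern-file name below are just placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1    = '/export/home/HSS/data/File1';    # placeholder paths
    my $File2    = '/export/home/data/Dump/File2';
    my $patterns = '/tmp/patterns.txt';              # hypothetical temp file

    # Write each unique key surrounded by pipes so the fixed-string match
    # is anchored to a whole pipe-delimited field.  With fgrep (fixed
    # strings) the pipes need no escaping.
    my %seen;
    open my $keys, '<', $File1    or die "Can't open $File1: $!\n";
    open my $pat,  '>', $patterns or die "Can't create $patterns: $!\n";
    while (my $key = <$keys>) {
        chomp $key;
        print $pat "|$key|\n" unless $seen{$key}++;
    }
    close $pat;
    close $keys;

    # Let fgrep make the single pass over the big file and capture its output.
    my @matches = `fgrep -f $patterns $File2`;
    print @matches;

    Note that a fixed string like "|123|" would also match 123 turning up in a later pipe-delimited field, so this only gives the same answer as the hash solutions if a key value can never appear in any other field.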

    Being a brute-force-and-ignorance sort of guy, my first pure Perl attempt would be to read both files into arrays, generate a really, really long regex from the search criteria (smaller) file, and use grep. Given my Perl-mojo, this would not work, and I'd have to loop through the individual records of the larger file.
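
    For what that might look like, here is a minimal, untested sketch (the file names are placeholders) that joins the keys into one big alternation and greps the slurped records against it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $File1 = 'file1.txt';    # placeholder names
    my $File2 = 'file2.txt';

    # Slurp the key file and build one huge alternation;
    # quotemeta guards against regex metacharacters in the keys.
    open my $keys, '<', $File1 or die "Can't open $File1: $!\n";
    chomp(my @keys = <$keys>);
    close $keys;
    my $re = join '|', map { quotemeta } @keys;

    # Slurp the big file into an array and grep it in one go,
    # anchoring the alternation to the second pipe-delimited field.
    open my $in, '<', $File2 or die "Can't open $File2: $!\n";
    my @records = <$in>;
    close $in;
    print grep { /^[^|]*\|(?:$re)\|/ } @records;

    Whether the regex engine copes gracefully with a 60,000-way alternation (and whether 16 million slurped records fit in memory) is exactly the doubt raised below.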


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

      It would require tons of memory.
      #!/usr/bin/perl
      use strict;
      use warnings;

      use Regexp::List qw( );

      my $File1 = '...';
      my $File2 = '...';
      my $File3 = '...';

      my $keep_re;
      {
          open(my $fh_keys, '<', $File1)
              or die("Can't open key file \"$File1\": $!\n");
          $keep_re = Regexp::List
              ->new()
              ->list2re( map { my $s = $_; chomp($s); $s } <$fh_keys> );
      }

      {
          open(my $fh_in, '<', $File2)
              or die("Can't open input file \"$File2\": $!\n");
          open(my $fh_out, '>', $File3)
              or die("Can't create output file \"$File3\": $!\n");
          print $fh_out grep /^[^|]*\|$keep_re\|/, <$fh_in>;
      }

        In addition to the raw storage of a large and a not-so-large file, I've no idea how much memory processing the regex would take. I am also a bit doubtful of the likelihood of a regex with several thousand alternatives actually working. The Brute Force & Ignorance method does have its downsides.


        Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: Compare 2 files and create a new one if it matches
by lamp (Chaplain) on Sep 20, 2008 at 03:12 UTC
    How about using the 'Tie::File' module?
    use strict;
    use warnings;
    use Tie::File;

    my %seen;

    tie my @file1, 'Tie::File', 'file1.txt' or die;
    tie my @file2, 'Tie::File', 'file2.txt' or die;

    foreach (@file1) {
        chomp;
        $seen{$_}++;
    }

    for (@file2) {
        my $key = (split /\|/, $_)[1];
        print "$_\n" if $seen{$key};
    }

    untie(@file1);
    untie(@file2);
    Output:

    E|123|r|some|56|78|90
    D|678|r|some|56|78|90
    F|345|r|y|98|0|0

      Why complicate the problem by introducing tied files when you need to make only one pass through the file in any case? You are simply adding overhead for no gain and obfuscating the code into the bargain.


      Perl reduces RSI - it saves typing