Re: Compare 2 files and create a new one if it matches
by GrandFather (Saint) on Sep 20, 2008 at 02:35 UTC
|
Rereading a file 60,000 times is likely to slow things down somewhat. The trick is to read each file only once. To do that, load the keys from the smaller file into memory. A hash is probably the best way to store them because you can test for a match very quickly (assuming you need an exact match rather than a regex match). Consider:
use strict;
use warnings;
# Sample data
my $catFile = <<CAT;
123
234
345
678
CAT
my $dataFile = <<DATA;
A|12|r|some|56|78|90
E|123|r|some|56|78|90
D|678|r|some|56|78|90
C|12|r|some|56|78|90
F|345|r|y|98|0|0
DATA
# Build the hash
open my $catIn, '<', \$catFile;
my %keys = map {chomp; $_ => 1} <$catIn>;
close $catIn;
open my $dataIn, '<', \$dataFile;
while (<$dataIn>) {
chomp;
my @parts = split /\|/;
next unless exists $keys{$parts[1]};
print join ('|', @parts), "\n";
}
close($dataIn);
Prints:
E|123|r|some|56|78|90
D|678|r|some|56|78|90
F|345|r|y|98|0|0
Perl reduces RSI - it saves typing
| [reply] [d/l] [select] |
Re: Compare 2 files and create a new one if it matches
by ikegami (Patriarch) on Sep 20, 2008 at 02:30 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
my $File1 = '...';
my $File2 = '...';
my $File3 = '...';
my %keep;
{
open(my $fh_keys, '<', $File1)
or die("Can't open key file \"$File1\": $!\n");
while (<$fh_keys>) {
chomp;
$keep{$_} = 1;
}
}
{
open(my $fh_in, '<', $File2)
or die("Can't open input file \"$File2\": $!\n");
open(my $fh_out, '>', $File3)
or die("Can't create output file \"$File3\": $!\n");
while (<$fh_in>) {
my ($key) = /^[^|]*\|([^|]*)/;
print $fh_out $_ if defined($key) && $keep{$key};
}
}
Update: Added missing chomp as per reply.
Update: Added missing "$" in "my $fh_in" and "my $fh_out".
| [reply] [d/l] [select] |
|
|
A chomp is needed for the hash keys.
| [reply] [d/l] |
|
|
Thanks a lot ikegami. I'll try this out and see how long it takes.
| [reply] |
Re: Compare 2 files and create a new one if it matches
by McDarren (Abbot) on Sep 20, 2008 at 02:47 UTC
|
Here is what I would do (code snippets untested):
- Open the first file
- Iterate through the file, building a hash making each line a key, eg:
# assumes that duplicates in first file should be ignored.
my %wanted;
open my $in, '<', $file1 or die "$!\n";
while (my $line = <$in>) {
chomp($line);
$wanted{$line}++;
}
- Open the second file
- Iterate through it line by line
- Extract the second "field" from each line using split
my $foo = (split /\|/, $line)[1];
- If this value exists as a key in the hash you built earlier, print the record to your third file
print OUT $line if $wanted{$foo};
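Putting the steps above together, a minimal sketch might look like the following. The helper name filter_by_key and the sample data are made up for illustration, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: copy to $out every record of $data whose second
# pipe-delimited field appears as a line in $keys.
sub filter_by_key {
    my ($keys, $data, $out) = @_;

    # Pass 1: one hash key per line of the small file.
    my %wanted;
    while (my $line = <$keys>) {
        chomp $line;
        $wanted{$line}++;
    }

    # Pass 2: a single scan of the large file.
    while (my $line = <$data>) {
        chomp(my $record = $line);
        my $field = (split /\|/, $record)[1];
        print {$out} $line if defined $field && $wanted{$field};
    }
    return;
}

# Made-up sample data in the same shape as the files discussed here.
my $key_text  = "123\n345\n";
my $data_text = "A|12|r\nE|123|r\nF|345|r\n";
my $result    = '';

open my $keys, '<', \$key_text  or die "Can't read keys: $!\n";
open my $data, '<', \$data_text or die "Can't read data: $!\n";
open my $out,  '>', \$result    or die "Can't write result: $!\n";

filter_by_key($keys, $data, $out);
close $out;

print $result;
```

With real files you would pass handles opened on the three file names instead of the in-memory strings.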
This may not be the fastest approach, but it does only open and read each of your input files once. As opposed to 60,000 X 16,000,000 = 960,000,000,000 times - which is what your current code does.
So I'd expect it to be just a tad faster ;)
Hope this helps,
Darren :)
| [reply] [d/l] [select] |
|
|
Actually your code is only about 60,000 times faster than the OP's code. The large file is opened and read once for each line (60,000 times that is) in the smaller file in the OP's version. The smaller file is opened and read once only.
In other respects your reply is pretty much the same as ikegami's and mine ;).
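To make the arithmetic concrete, here is a rough line-read count for both approaches. It assumes the figures quoted in this thread (about 60,000 lines in the small file and 16,000,000 in the large one) and ignores everything except the number of lines read.

```perl
use strict;
use warnings;

# Approximate sizes quoted in the thread.
my $small = 60_000;        # lines in the key file
my $large = 16_000_000;    # lines in the data file

# Nested approach: the key file is read once, and the large file is
# read in full once for every key.
my $nested = $small + $small * $large;

# Hash approach: each file is read exactly once.
my $hashed = $small + $large;

printf "nested: %.0f line reads\n", $nested;
printf "hashed: %.0f line reads\n", $hashed;
printf "ratio:  %.0f\n", $nested / $hashed;
```

The ratio comes out just under 60,000, which is where the "about 60,000 times faster" estimate comes from.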
Perl reduces RSI - it saves typing
| [reply] |
Re: Compare 2 files and create a new one if it matches
by swampyankee (Parson) on Sep 20, 2008 at 14:16 UTC
|
Pure Perl solutions are no doubt best, but one could read the smaller file into an array, add appropriate markers, e.g. escaped pipe symbols at both ends of each element, eliminate duplicates, and write the resulting array to a new file, and use fgrep -f, capturing its output by using backticks (`).
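A rough sketch of the fgrep idea, under some assumptions: a POSIX grep is on the PATH (grep -F is the modern spelling of fgrep), the sample data is made up, and a list-form pipe open is used here in place of backticks so no shell quoting is needed. With -F the patterns are literal strings, so the pipes need no escaping. Note that a pattern like |123| matches that value in any interior field, not only the second one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data standing in for the real files.
my @keys    = qw(123 345 123);
my @records = ("A|12|r|x\n", "E|123|r|x\n", "F|345|r|y\n");

# Wrap each key in pipes and drop duplicates.
my %seen;
my @patterns = grep { !$seen{$_}++ } map { "|$_|" } @keys;

# Write the patterns, one per line, to a temporary file.
my ($pat_fh, $pat_file) = tempfile(UNLINK => 1);
print {$pat_fh} "$_\n" for @patterns;
close $pat_fh or die "Can't write pattern file: $!\n";

# Write the sample records to a second temporary file.
my ($data_fh, $data_file) = tempfile(UNLINK => 1);
print {$data_fh} @records;
close $data_fh or die "Can't write data file: $!\n";

# -F: treat patterns as fixed strings; -f: read patterns from a file.
open my $grep, '-|', 'grep', '-F', '-f', $pat_file, $data_file
    or die "Can't run grep: $!\n";
my @matches = <$grep>;
close $grep;

print @matches;
```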
Being a brute-force-and-ignorance sort of guy, my first pure Perl attempt would be to read both files into arrays, generate a really, really long regex from the search criteria (smaller) file, and use grep. Given my Perl-mojo, this would not work, and I'd have to loop through the individual records of the larger file.
Information about American English usage here and here. Floating point issues? Please read this before posting. — emc
| [reply] |
|
|
It would require tons of memory.
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::List qw( );
my $File1 = '...';
my $File2 = '...';
my $File3 = '...';
my $keep_re;
{
open(my $fh_keys, '<', $File1)
or die("Can't open key file \"$File1\": $!\n");
$keep_re = Regexp::List
->new()
->list2re( map { my $s = $_;
chomp($s);
$s
}
<$fh_keys>
);
}
{
open(my $fh_in, '<', $File2)
or die("Can't open input file \"$File2\": $!\n");
open(my $fh_out, '>', $File3)
or die("Can't create output file \"$File3\": $!\n");
print $fh_out grep /^[^|]*\|$keep_re\|/, <$fh_in>;
}
| [reply] [d/l] |
Re: Compare 2 files and create a new one if it matches
by lamp (Chaplain) on Sep 20, 2008 at 03:12 UTC
|
How about using the Tie::File module?
use strict;
use warnings;
use Tie::File;
my %seen;
tie my @file1, 'Tie::File', 'file.txt' or die;
tie my @file2, 'Tie::File', 'file2.txt' or die;
foreach (@file1)
{ chomp; $seen{$_}++; }
for(@file2) {
my $key = (split /\|/,$_)[1];
print "$_\n" if $seen{$key};
}
untie(@file1);
untie(@file2);
output:
E|123|r|some|56|78|90
D|678|r|some|56|78|90
F|345|r|y|98|0|0
| [reply] [d/l] [select] |