Re: Filtering Output from two files
by LanX (Saint) on Feb 04, 2018 at 11:12 UTC
Overview:
- read file1 into a %hash
- inside a loop:
  - read file2 line by $line
  - split the $line to @fields at |
  - if the first field $fields[0] exists in the %hash, print the whole $line to file3
You'll need open, readline, chomp, split, exists, print and while loops for this.
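Roughly in code, that outline might look like this (a minimal sketch only; the literal file names file1/file2/file3 and the idea of writing matches to file3 are assumptions, and error handling is kept simple):
use strict;
use warnings;

# read file1 into a %hash (each line becomes a key, the value is just a "seen" flag)
my %hash;
open my $fh1, '<', 'file1' or die "file1: $!";
while ( my $line = <$fh1> ) {
    chomp $line;
    $hash{$line} = 1;
}
close $fh1;

# read file2 line by line, split each $line at |, print matching lines to file3
open my $fh2, '<', 'file2' or die "file2: $!";
open my $out, '>', 'file3' or die "file3: $!";
while ( my $line = <$fh2> ) {
    my @fields = split /\|/, $line;
    next unless @fields;    # skip blank/empty lines
    print {$out} $line if exists $hash{ $fields[0] };
}
close $fh2;
close $out;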
I am actually new to scripting languages.
I didn't quite follow what you meant in step 2.
# read file1 into a %hash
... code to do that here ...
# inside a loop
while (my $line = <$file2>) {
    # read file2 line by line
    ... this was done in the loop condition above ...
    # split the $line to @fields at |
    # if the first $fields[0] exists in the %hash,
    # print the whole $line to file 3
}
This is a relatively common question, so LanX gave you the outline of a good solution to the problem.
A frequent mistake is to try to read *both* files inside the loop, giving one of two bad outcomes:
- Either the first file is completely read in the first pass of the loop, so the code can only find a single match
if it happens to be the first line in the second file, or
- the code re-opens the first file each time, and therefore can find all the matches, but runs extremely slowly(1)
because it reads the first file completely for each line in the second file.
(1) Extremely slowly in the relative sense--for small files you may not notice it. But if your files get large
enough, you'll wonder why such a fast computer is so freakin' slow.
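For contrast, the second (slow) shape looks roughly like this (a sketch of the anti-pattern only, not something to copy): the inner open and scan of file1 run once for every line of file2, so the work grows with (lines in file1) x (lines in file2).
use strict;
use warnings;

# anti-pattern: re-open and re-scan file1 for every single line of file2
open my $fh2, '<', 'file2' or die $!;
while ( my $line = <$fh2> ) {
    my @fields = split /\|/, $line;
    open my $fh1, '<', 'file1' or die $!;   # re-opened on every pass
    while ( my $check = <$fh1> ) {          # full scan of file1 each time
        chomp $check;
        print $line if $check eq $fields[0];
    }
    close $fh1;
}
close $fh2;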
...roboticus
When your only tool is a hammer, all problems look like your thumb.
use strict;
use warnings;
use Data::Dumper;
my $file1 = 'file1';
my $file2 = 'file2';
#reading file1 into a hash
my %hash=();
open (my $fh,'<',$file2) or die $!;
while(my $line=<$fh>)
{
    chomp $line;
    $hash{line}=1;
    print Dumper %hash;
}
close $fh;
#reading file2 line by line
open (my $fh2,'<',$file1) or die $!;
while (my $row = <$fh2>) {
    chomp $row;
    my @fields = split(/\|/, $row);
    print $row if exists $hash{$fields[0]};
}
close $fh2;
You have at least one bug: you forgot the sigil $ on $line once ($hash{line} should be $hash{$line}), but yes, this was the basic idea.
And dumping inside the loop is costly.
Furthermore, you might want to use <code> tags next time. :)
Minor nitpick: when the stored value is 1, you don't need exists anymore. :)
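In other words, since every stored value is 1 (a true value), the plain hash lookup is already enough, for example:
print $row if $hash{ $fields[0] };   # true whenever the key was stored as 1; exists only matters if values could be false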
#!/nairvigv/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $file1 = 'BBIDs.txt';
my $file2 = 'fixedincomeTransparency.out.px.derived.updates';

#reading file1 into a hash
my %hash;
open (my $fh,'<',$file1) or die $!;
while(my $line=<$fh>)
{
    chomp $line;
    next if $line =~ /^\s*$/;
    $hash{$line}=1;
}
#print Dumper(\%hash);
close $fh;

#reading file2 line by line
open (my $fh2,'<',$file2) or die $!;
while (my $row = <$fh2>)
{
    chomp $row;
    print "$row\n";
    next if $row =~ /^\s*$/;
    my @fields = split(/\|/, $row);
    print "$fields[0]\n";
    if (exists $hash{$fields[0]})
    {
        print "$row\n";
    }
}
close $fh2;
Hello, this works for a small input file, i.e. about 10 lines. However, when I run it on a big file of about 700k lines, nothing happens. Any idea what could be causing this?
Re: Filtering Output from two files
by Marshall (Canon) on Feb 04, 2018 at 22:07 UTC
LanX++ gave a good algorithm.
I am not sure if this is homework or not; if it is, you should tell us.
However, I will give you some actual code.
I process text files frequently - Perl is great at this.
Skipping blank lines in the input is a normal "reflex reaction" for me, and I show a common way to do that.
#!/usr/bin/perl
use warnings;
use strict;
use Inline::Files;

my %File1Hash;

while (my $line = <FILE1>)
{
    next if $line =~ /^\s*$/;   # skip blank lines
    $line =~ s/\s*$//;          # remove all trailing space,
                                # including the line ending
    $File1Hash{$line}++;
}

while (my $line = <FILE2>)
{
    next if $line =~ /^\s*$/;       # skip blank lines
    my ($id) = split /\|/, $line;   # get the first field
    print $line if exists $File1Hash{$id};
}

=Prints
COA213345|a|b|c|
COA213345|a|b|c|
=cut

__FILE1__
COA213345
COA213345
COA213445
DOB213345
EOA213345
__FILE2__
COA213345|a|b|c|
COA213345|a|b|c|
LOA213345|a|b|c|
kOB213345|a|b|c|
LOA213345|a|b|c|
Update: I read more of the posts in this thread. If file 1 is 700K lines, this should work just fine on a modern computer. My ancient (now dead) XP laptop would have had some trouble with a hash of that size due to memory limits, but a modern 64-bit machine won't even blink. If there are issues, there are ways to reduce the memory footprint; let's not go there unless it is necessary.