Re: Filtering Output from two files
by LanX (Saint) on Feb 04, 2018 at 11:12 UTC
Overview:
- read file1 into a %hash
- inside a loop:
  - read file2 line by $line
  - split the $line to @fields at |
  - if the first field $fields[0] exists in the %hash, print the whole $line to file3
You'll need open, readline, chomp, split, exists, print and while loops for this.
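Roughly in code, that outline might look like this (a minimal sketch only; the literal file names file1/file2/file3 and the idea of writing matches to file3 are assumptions, and error handling is kept simple):
use strict;
use warnings;

# read file1 into a %hash (each line becomes a key, the value is just a "seen" flag)
my %hash;
open my $fh1, '<', 'file1' or die "file1: $!";
while ( my $line = <$fh1> ) {
    chomp $line;
    $hash{$line} = 1;
}
close $fh1;

# read file2 line by line, split each $line at |, print matching lines to file3
open my $fh2, '<', 'file2' or die "file2: $!";
open my $out, '>', 'file3' or die "file3: $!";
while ( my $line = <$fh2> ) {
    my @fields = split /\|/, $line;
    next unless @fields;    # skip blank/empty lines
    print {$out} $line if exists $hash{ $fields[0] };
}
close $fh2;
close $out;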
I am actually new to scripting languages.
I didn't quite follow what you meant in step 2.
# read file1 into a %hash
... code to do that here ...
# inside a loop
while (my $line = <$file2>) {
    # read file2 line by line
    ... this was done in the loop condition above ...
    # split the $line to @fields at |
    # if the first $fields[0] exists in the %hash,
    # print the whole $line to file 3
}
This is a relatively common question, so LanX gave you the outline of a good solution to the problem.
A frequent mistake is to try to read *both* files inside the loop, giving one of two bad outcomes:
- Either the first file is completely read in the first pass of the loop, so the code can only find a single match
if it happens to be the first line in the second file, or
- the code re-opens the first file each time, and therefore can find all the matches, but runs extremely slowly(1)
because it reads the first file completely for each line in the second file.
(1) Extremely slowly in the relative sense--for small files you may not notice it. But if your files get large
enough, you'll wonder why such a fast computer is so freakin' slow.
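For contrast, the second (slow) shape looks roughly like this (a sketch of the anti-pattern only, not something to copy): the inner open and scan of file1 run once for every line of file2, so the work grows with (lines in file1) x (lines in file2).
use strict;
use warnings;

# anti-pattern: re-open and re-scan file1 for every single line of file2
open my $fh2, '<', 'file2' or die $!;
while ( my $line = <$fh2> ) {
    my @fields = split /\|/, $line;
    open my $fh1, '<', 'file1' or die $!;   # re-opened on every pass
    while ( my $check = <$fh1> ) {          # full scan of file1 each time
        chomp $check;
        print $line if $check eq $fields[0];
    }
    close $fh1;
}
close $fh2;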
...roboticus
When your only tool is a hammer, all problems look like your thumb.
use strict;
use warnings;
use Data::Dumper;
my $file1 = 'file1';
my $file2 = 'file2';
#reading file1 into a hash
my %hash=();
open (my $fh,'<',$file2) or die $!;
while(my $line=<$fh>)
{
    chomp $line;
    $hash{line}=1;
    print Dumper %hash;
}
close $fh;
#reading file2 line by line
open (my $fh2,'<',$file1) or die $!;
while (my $row = <$fh2>) {
    chomp $row;
    my @fields = split(/\|/, $row);
    print $row if exists $hash{$fields[0]};
}
close $fh2;
You have at least one bug: you forgot the sigil $ on $line once ($hash{line} should be $hash{$line}), but yes, this was the basic idea.
And dumping inside the loop is costly.
Furthermore, you might want to use <code> tags next time. :)
Minor nitpick: when the stored value is 1, you don't need exists anymore. :)
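In other words, since every stored value is 1 (a true value), the plain hash lookup is already enough, for example:
print $row if $hash{ $fields[0] };   # true whenever the key was stored as 1; exists only matters if values could be false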
#!/nairvigv/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $file1 = 'BBIDs.txt';
my $file2 = 'fixedincomeTransparency.out.px.derived.updates';

#reading file1 into a hash
my %hash;
open (my $fh,'<',$file1) or die $!;
while(my $line=<$fh>)
{
    chomp $line;
    next if $line =~ /^\s*$/;
    $hash{$line}=1;
}
#print Dumper(\%hash);
close $fh;

#reading file2 line by line
open (my $fh2,'<',$file2) or die $!;
while (my $row = <$fh2>)
{
    chomp $row;
    print "$row\n";
    next if $row =~ /^\s*$/;
    my @fields = split(/\|/, $row);
    print "$fields[0]\n";
    if (exists $hash{$fields[0]})
    {
        print "$row\n";
    }
}
close $fh2;
Hello, this works for a small input file, i.e. about 10 lines. However, when I run it on a big file of about 700k lines, nothing happens. Any idea what could be causing this?
Re: Filtering Output from two files
by Marshall (Canon) on Feb 04, 2018 at 22:07 UTC
LanX++ gave a good algorithm.
I am not sure if this is homework or not; if it is, you should tell us.
However, I will give you some actual code.
I process text files frequently - Perl is great at this.
Skipping blank lines in the input is a normal "reflex reaction" for me, and I show a common way to do that.
#!/usr/bin/perl
use warnings;
use strict;
use Inline::Files;

my %File1Hash;

while (my $line = <FILE1>)
{
    next if $line =~ /^\s*$/;   # skip blank lines
    $line =~ s/\s*$//;          # remove all trailing space,
                                # including the line ending
    $File1Hash{$line}++;
}

while (my $line = <FILE2>)
{
    next if $line =~ /^\s*$/;       # skip blank lines
    my ($id) = split /\|/, $line;   # get the first field
    print $line if exists $File1Hash{$id};
}

=Prints
COA213345|a|b|c|
COA213345|a|b|c|
=cut

__FILE1__
COA213345
COA213345
COA213445
DOB213345
EOA213345
__FILE2__
COA213345|a|b|c|
COA213345|a|b|c|
LOA213345|a|b|c|
kOB213345|a|b|c|
LOA213345|a|b|c|
Update: I read more of the posts in this thread. If file 1 is 700K lines, this should work just fine on a modern computer. My ancient (now dead) XP laptop would have had some trouble with a hash of that size due to memory limits, but a modern 64-bit machine won't even blink. If there are issues, there are ways to reduce the memory footprint; let's not go there unless it is necessary.