grep of readline matching more lines than elements in array

bdorsey has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am a newbie and this is my first post so thanks in advance for any help. I am having trouble with what should be an easy part of a larger script. The result of the code below is a file with more lines than elements in the array (@isos20) used in the grep. I have done this multiple ways and can't figure out why grep is matching more than one line for some elements (if that's really the problem). The lines of the file to read are formatted like this:

comp1234_c0_seq1,1,5,3,8,0,6,...

where the string before the first comma is an ID (the numbers vary but the letters stay the same throughout) that may or may not be present in @isos20.

Here is the code I have now:


#open file to write filtered lines to
open (OUT2, ">$pos_two") or die "cannot open $pos_two ";
#open file to filter
open (IN2, "<$posfile") or die "cannot open $posfile ";

while (<IN2>) {
  chomp(my $line = $_);
  if ($line =~ m/(comp\d+_c\d+_seq\d+),.+/) {
    my $comp = $1;

    if (grep (/$comp/, @isos20)) {
              
      print OUT2 "$line\n";
    }
    
  }
}

close IN2;
close OUT2;

I can only guess that the grep is matching more than one line for some of the elements in @isos20 but I don't know why that would happen - unless the $comp variable is actually a regex and not a literal string. Thanks very much for any help. I know this must be a simple fix but I am at a loss.
Best, BD


Replies are listed 'Best First'.
Re: grep of readline matching more lines than elements in array by BrowserUk (Patriarch) on Dec 03, 2013 at 19:22 UTC
This: `if (grep ($comp, @isos20)) {` equates to `if (grep ( "comp1234_c0_seq1", @isos20 )) {`, which isn't a valid way to use grep. Try: `if( grep( /$comp/, @isos20 ) ) {`

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
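To make the difference concrete, here is a minimal sketch; the contents of `@isos20` are made up for the example. Perl does accept a bare string as grep's first argument, but it only evaluates that string for truth on each element, so every element passes the test.

```perl
use strict;
use warnings;

# Hypothetical array contents, for illustration only.
my @isos20 = qw(comp1234_c0_seq1 comp5678_c1_seq2 comp42_c0_seq3);
my $comp   = 'comp1234_c0_seq1';

# A non-empty string is always true, so every element is kept.
my @by_string = grep( $comp, @isos20 );

# A pattern match is tested against each element in turn.
my @by_regex = grep( /$comp/, @isos20 );

printf "string expression kept %d elements\n", scalar @by_string;
printf "regex match kept %d element(s)\n",     scalar @by_regex;
```

With the string form the `if` condition is true for every input line as long as the array is non-empty, so nothing is filtered out.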
Re^2: grep of readline matching more lines than elements in array by bdorsey (Initiate) on Dec 03, 2013 at 19:31 UTC
Thanks for catching that. I actually had the // construction in my original code. Fixed now in posted code.
Re: grep of readline matching more lines than elements in array by Kenosis (Priest) on Dec 03, 2013 at 19:50 UTC
If I may, I'd like to offer just a few suggestions which may assist your efforts:

- In case you haven't, always `use strict; use warnings;` at the top of your scripts
- Use the three-argument form of open
- Use `split` to get the ID from the file's lines
- Build a hash from `@isos20` and use that to check for an ID match when reading the file

Given the above items, consider the following refactoring:

    use strict;
    use warnings;

    my $posfile = 'posfile.txt';
    my $pos_two = 'pos_two.txt';
    my @isos20  = qw/these are the array elements/;

    my %isos20 = map { $_ => 1 } @isos20;

    #open file to write filtered lines to
    open my $OUT2, '>', $pos_two or die "cannot open $pos_two: $!";

    #open file to filter
    open my $IN2, '<', $posfile or die "cannot open $posfile: $!";

    while (<$IN2>) {
        my $comp = ( split /,/ )[0];
        print $OUT2 $_ if $isos20{$comp};
    }

    close $IN2;
    close $OUT2;

Notice that there are fewer instructions within the `while` loop. Instead of assigning the value of Perl's default scalar `$_` to `$line`, just operate on `$_`. The `split` gets the ID from the file's string by creating a list of string elements and taking the zeroth element of that list. The `$isos20{$comp}` notation 'looks' for that ID within the hash (constructed using `map` above), and prints the line to the output file if it's in the hash.

Why use a hash instead of `grep`ping the array for a match? Using a hash is a much faster, more efficient way of detecting a match in this case. With `grep`, the entire array is traversed for each line to find a possible match. A hash, however, has a very efficient look-up algorithm, so it will work better, significantly so if the array is quite large.

Hope this helps!
Re^2: grep of readline matching more lines than elements in array by Laurent_R (Canon) on Dec 03, 2013 at 21:41 UTC
I can only agree with the excellent advice offered by Kenosis; you should really follow it, as it will save you a lot of debugging time. There is just one point in the refactored code which could be improved, in my view: `my %isos20 = map { $_ => 1 } @isos20;`

I would give different names to the hash and the array. Granted, Perl can manage this without any problem, and it will work in the case in point. But giving the same name to two different entities can lead to difficult-to-track bugs with more complicated data structures. Sometimes you goof your data dereferencing, and the Perl compiler would be able to tell you about your error if each data structure had its own name. When two different entities share a name, it does not see the error, so you get the error at run time instead of compile time or, worse, you end up not using the data you thought you were using.
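A small sketch of the kind of slip this describes, using made-up data. Because `@isos20` and `%isos20` are both declared, `use strict` accepts either subscript form, so typing braces where brackets were meant compiles cleanly and quietly reads the wrong container. The name `%seen_iso` is a hypothetical alternative, not something from the thread.

```perl
use strict;
use warnings;

my @isos20 = qw(comp1_c0_seq1 comp2_c0_seq1);
my %isos20 = map { $_ => 1 } @isos20;

# Intended: the first array element. Brackets index the array.
my $first = $isos20[0];

# Slip: braces index the hash instead. This still compiles, because
# %isos20 exists, and it returns undef since there is no key '0'.
my $oops = $isos20{0};

print "array element: $first\n";
print "hash lookup is ", ( defined $oops ? "defined" : "undef" ), "\n";

# With a distinct name (hypothetical: %seen_iso) and no %isos20 declared,
# writing $isos20{0} by mistake would be rejected at compile time
# under strict.
my %seen_iso = map { $_ => 1 } @isos20;
print "lookup by ID: $seen_iso{'comp2_c0_seq1'}\n";
```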
Re^3: grep of readline matching more lines than elements in array by Kenosis (Priest) on Dec 04, 2013 at 01:19 UTC
Excellent hash-naming suggestion, Laurent_R. Thank you for adding this.
Re: grep of readline matching more lines than elements in array by Laurent_R (Canon) on Dec 03, 2013 at 19:49 UTC
"I can only guess that the grep is matching more than one line for some of the elements in @isos20"

Even assuming it did, you would still print $line only once, since the grep is used only as a conditional for deciding whether to print or not. Your error, if any, must be somewhere else. But I can't say where, not having seen your input.
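The point can be checked with a short sketch. The data is invented so that several array elements match one ID, and the `@printed` array stands in for the output file.

```perl
use strict;
use warnings;

# Hypothetical array with several elements that match the same ID.
my @isos20 = qw(comp7_c0_seq1 comp7_c0_seq1 comp7_c0_seq12);
my $line   = 'comp7_c0_seq1,1,5,3,8,0,6';
my $comp   = ( split /,/, $line )[0];

# In scalar context grep returns the number of matching elements.
my $matches = grep { /$comp/ } @isos20;

# Used as a condition, that count is only tested for truth, so the
# body runs once per input line however many elements matched.
my @printed;
if ( grep { /$comp/ } @isos20 ) {
    push @printed, $line;
}

print "matching elements: $matches\n";
print "lines printed: ", scalar @printed, "\n";
```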
Re: grep of readline matching more lines than elements in array by BrowserUk (Patriarch) on Dec 03, 2013 at 19:43 UTC
"I can only guess that the grep is matching more than one line for some of the elements in @isos20"

Why are you guessing? Don't you know? If you don't even know if the problem exists, how can we help you? And if the problem does exist, the way to verify it is to check the data, which we don't have.

"I don't know why that would happen"

There are two possibilities (assuming your actual code is correct. How can you "lose" a pair of //s when copy&pasting? You didn't C&P? Why not?):

1. The file contains more than one line with the same identifier. You are not anchoring your regex to the start of the line. Is it possible for a different identifier to appear someplace other than the beginning of the line?
2. The array contains two or more copies of the same identifier. Have you checked the contents of the array?
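On the anchoring point, a related effect is worth checking, although nothing in the thread confirms it is the cause here: an unanchored `/$comp/` also succeeds when the ID is merely a substring of an array element. The sketch below uses invented data in which the array holds `comp1234_c0_seq10` but not `comp1234_c0_seq1`, and compares the pattern match with an exact string comparison.

```perl
use strict;
use warnings;

# Hypothetical data: the array holds seq10 but NOT seq1.
my @isos20 = qw(comp1234_c0_seq10);
my @lines  = (
    'comp1234_c0_seq1,1,5,3',
    'comp1234_c0_seq10,2,0,7',
);

my ( @loose, @exact );
for my $line (@lines) {
    # Anchor to the start of the line so only the leading ID is captured.
    next unless $line =~ /^(comp\d+_c\d+_seq\d+),/;
    my $comp = $1;

    # Unanchored: 'comp1234_c0_seq1' is a substring of 'comp1234_c0_seq10'.
    push @loose, $comp if grep { /$comp/ } @isos20;

    # Exact string comparison only accepts identical IDs.
    push @exact, $comp if grep { $_ eq $comp } @isos20;
}

print "unanchored pattern kept: @loose\n";
print "exact comparison kept:   @exact\n";
```

If the real IDs can be prefixes of one another, the exact comparison (or a hash lookup) avoids such false positives.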
Re^2: grep of readline matching more lines than elements in array by AnomalousMonk (Archbishop) on Dec 03, 2013 at 20:02 UTC
"1. The [input] file contains more than one line with the same identifier."

This seems to me the only way to obtain the results suggested in the OP. If there were multiple matches of a sub-string from an input file with the contents of the `@isos20` array, I would still expect the input file line to be output only once: grep in a "boolean" context would return true for one or more matches. (Update: I overlooked the fact that Laurent_R had already made essentially the same point.)

bdorsey: Is there perhaps some further constraint on the content of the input file that you have neglected to mention?
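One way to test for repeated identifiers is to count the leading ID of every line. This is only a sketch: the in-memory `@lines` array is a stand-in for the real input file, and its contents are invented.

```perl
use strict;
use warnings;

# Hypothetical input lines; in the real script these would come from
# the file being filtered.
my @lines = (
    'comp1_c0_seq1,1,5,3',
    'comp2_c0_seq1,0,0,4',
    'comp1_c0_seq1,9,9,9',
);

# Count how many times each leading ID occurs.
my %count;
for my $line (@lines) {
    my $id = ( split /,/, $line )[0];
    $count{$id}++;
}

# Report only the IDs that occur more than once.
my @dups = grep { $count{$_} > 1 } sort keys %count;
print "$_ appears $count{$_} times\n" for @dups;
```

Note that this counts IDs, not whole lines, so it also catches lines that share an ID but differ in the values that follow.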
Re: grep of readline matching more lines than elements in array by bdorsey (Initiate) on Dec 05, 2013 at 22:39 UTC
Thanks to everyone for responding and for the helpful suggestions to improve my code. I didn't find any duplicate lines in the infile so I am examining the rest of the script to see where the error might lie. Best- BD