
The line of code you asked about says: if $columns[1] is true (holds any value Perl evaluates as true) and contains a string that begins with "HLA-A", then take the following actions. I included the first "does it have a true value" check because I assumed use warnings; would complain about any line that had no element at index 1 in @columns. I didn't actually try it without the check; I just assumed that would happen for at least the lines that are all "---".

As for the code changes you requested:

    #!/usr/bin/perl
    use warnings;
    use strict;

    open(my $in_fh, '<', 'input.txt') or die $!;

    my $output;
    my %seen_lines;

    while (<$in_fh>) {
        chomp;
        my @columns = split;
        if ($columns[1] and $columns[1] =~ /^HLA-A/) {
            my $HLA_Peptide = $columns[1] . $columns[2];
            $output .= "$_\n" if (!exists $seen_lines{$HLA_Peptide});
            $seen_lines{$HLA_Peptide} = 1;
        }
        else {
            $output .= "$_\n";
        }
    }

    close $in_fh;
    print $output;

EDIT: I did just try it without that first check and I was correct: it throws warnings without it. There may be a better way to avoid that warning (it does occur to me that false values like "0" or an empty string would also fail the truth check and be skipped), but I use this trick a lot to appease use warnings; or "-w". I wonder if there is something like exists, which I use a lot for hashes, that is meant for checking whether an array element exists?
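
One possible way to test for the element itself rather than its truth value, offered as a minimal sketch and not as the canonical answer, is to compare the number of fields that split produced. The helper name is_hla_line and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the line has a second
# whitespace-separated field that begins with "HLA-A", else 0.
sub is_hla_line {
    my ($line) = @_;
    my @columns = split ' ', $line;

    # An array in numeric context yields its element count, so the
    # regex only runs when index 1 actually exists.
    return 1 if @columns > 1 && $columns[1] =~ /^HLA-A/;
    return 0;
}

# Made-up sample lines: a separator, a matching line, and a line
# whose second field is the false value "0".
for my $line ('---', '1 HLA-A02:01 PEPTIDE', 'x 0') {
    printf "%-22s => %d\n", $line, is_hla_line($line);
}
```

The count check never touches an undefined element, so nothing here should trigger an "uninitialized value" warning.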

I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

Re^4: Removing partially duplicated lines from a file
by AnomalousMonk (Archbishop) on Jul 27, 2016 at 00:55 UTC
    ... that first check ... a better way to avoid that warning ... something like exists ...

    defined is the way I would typically finesse this problem:
        if (defined($columns[1]) && $columns[1] =~ /^HLA-A/) {
            ...
            }
    In your posted code, the empty string and '0' will not, as you say, be tested against the regex, but in this particular case that does not matter because neither can match /^HLA-A/ anyway. In the general case, I think it's better to use defined because you can better avoid the "It'll never happen... Oh, it does happen..." situations that lead to those wonderful 3 AM debug sessions.
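
    To make the general-case difference concrete, here is a small sketch with a pattern that a false value can match. The pattern /^\d$/ and the helper names truthy_match and defined_match are assumptions for illustration; they are not part of the posted script.

    ```perl
    use strict;
    use warnings;

    # Guard by truth value: skips undef, but also skips '0' and ''.
    sub truthy_match {
        my ($value) = @_;
        return 1 if $value and $value =~ /^\d$/;
        return 0;
    }

    # Guard by definedness: skips only undef.
    sub defined_match {
        my ($value) = @_;
        return 1 if defined($value) && $value =~ /^\d$/;
        return 0;
    }

    print "truthy  guard on '0': ", truthy_match('0'),  "\n";   # 0
    print "defined guard on '0': ", defined_match('0'), "\n";   # 1
    print "truthy  guard on undef: ", truthy_match(undef),  "\n";   # 0
    print "defined guard on undef: ", defined_match(undef), "\n";   # 0
    ```

    Both guards keep undef away from the regex, so neither warns; only the truth-value guard silently drops the legitimate '0'.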


    Give a man a fish:  <%-{-{-{-<

      Ah, thank you very much, defined sounds like exactly the type of thing I was looking for. I think I even skimmed over the perldoc for it before (I did learn about the existence of defined from a quick mention of it in "Learning Perl") but erroneously disregarded it in this case due to the perldoc bit that says "Use of defined on aggregates (hashes and arrays) is deprecated." Which is my fault for skimming rather than actually RTFM'ing, because the perldoc is pretty clear about what it actually meant by "use... on aggregates" via the examples it gives, and it doesn't mention anything being frowned upon for using defined in a scalar context on a single array element like the defined($columns[1]) you show.

      Small follow-up question. You chose to add parentheses and switch the and to an && instead. I understand the different order of precedence between and vs. &&, but is there a reason you elected to rewrite it that way? Or is it just a case of, "that's just the way I decided to write it"?

      EDIT: Sorry, my brain saw added parentheses where there were none.

        You ... switch the and to an && ... is there a reason you elected to rewrite it that way?

        Personally, I find the use of and, or, not, and even xor for enhanced readability in "simple" logical expressions extremely seductive. However, these operators were not introduced to improve readability, but for flow control. Their very low precedence is their raison d'être, and this low precedence introduces so many potential pitfalls when you try to use them for readability enhancement that it's just not worth all the headaches. (And simple expressions have been known to become more hairy.) My practice is to use them for their intended purpose, although I have to admit I do find myself backsliding from time to time.
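
        A minimal sketch of one such pitfall: and binds more loosely than =, so the assignment happens before the and is evaluated. The subs truthy and falsy and the two wrapper subs are hypothetical names used only for this example.

        ```perl
        use strict;
        use warnings;

        sub truthy { return 1 }
        sub falsy  { return 0 }

        sub assign_with_and {
            # Parsed as: (my $result = truthy()) and falsy();
            my $result = truthy() and falsy();
            return $result;
        }

        sub assign_with_amp {
            # Parsed as: my $result = (truthy() && falsy());
            my $result = truthy() && falsy();
            return $result;
        }

        print "and: ", assign_with_and(), "\n";   # 1
        print "&&:  ", assign_with_amp(), "\n";   # 0
        ```

        With and, the variable keeps the left-hand value and the right-hand call is evaluated and discarded, which is rarely what was intended in an assignment.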



Re^4: Removing partially duplicated lines from a file
by Sandy_Bio_Perl (Beadle) on Jul 26, 2016 at 21:35 UTC

    Thank you Perldigious, I am very very grateful