LeBran has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I'm struggling to get Perl to skip lines containing certain features, since I don't need to use them. Basically I'm just unsure of the syntax, though it appears to work in some places and not in others.

Take the following table:

chrM 2928 A G 17 0 0% A 46 105 69.54% R Somatic 1 1.23052720566294E-008 17 29 50 55 4 13 0 0
chr1 108310 T C 9 0 0% T 3 5 62.5% Y Somatic 1 0.0090497738 2 1 2 3 7 2 0 0
chr1 726958 T A 11 0 0% T 9 4 30.77% W Somatic 1 0.0672877847 6 3 1 3 5 6 0 0
chr1 1412720 C A 33 0 0% C 22 6 21.43% M Somatic 1 0.0067850063 10 12 3 3 14 19 0 0
chr1 1822396 C G 6 0 0% C 4 4 50% S Somatic 1 0.0699300699 3 1 2 2 4 2 0 0
chr1 1822457 C T 10 0 0% C 4 4 50% Y Somatic 1 0.022875817 3 1 2 2 6 4 0 0

I'm trying to skip the lines which are chrM rather than chr1. The following code is what I came up with (I'm also extracting different columns):

while (my $line = <FILE>) {
    next if ($. == 1);
    chomp $line;
    my @sepline = split ("\t", $line);
    my $chromosome = $sepline[0];
    my $chrpos = $sepline[1];
    my $nmreads = $sepline[8];
    my $mutants = $sepline[9];
    my $totalreads = $nmreads + $mutants;
    next if ($chromosome = /^chrM/);
    print ("$nmreads $mutants $totalreads\n");
}

So the script works fine if I comment out the lower "next if" statement towards the bottom of the script. I've also tried "next if ($chromosome = "chrM");" and "if ($chromosome = "chrM") { next; }", but neither works. Is there something incorrect about my syntax, or am I simply going about it completely the wrong way? Appreciate any help, cheers

Re: Trouble skipping lines using Perl
by haukex (Archbishop) on Nov 21, 2017 at 16:01 UTC
    next if ($chromosome =~ /^chrM/);

    You're very close, you just need to use the "binding operator" =~ instead of = and your code works. See also perlrequick and perlretut.

    Note that if you're using warnings, which you should, you should have gotten a warning "Use of uninitialized value $_ in pattern match (m//)". See also Use strict and warnings.

    BTW, please enclose your sample input data in <code> tags as well (not <p>).

      Hi haukex, Thanks very much, works perfectly now. I went with

       next if ($chromosome =~ "chrM");

      I was using both Strict and Warnings and using the quotes instead of the regex doesn't produce any warnings :D

      My apologies about the input data format

      Cheers

        I went with

        next if ($chromosome =~ "chrM");

        That works, but personally I wouldn't write it that way, because writing a regex like for example /chrM/ or m{chrM} makes it more visually clear what you want to do (and also allows you to add modifiers).

        I was using both Strict and Warnings and using the quotes instead of the regex doesn't produce any warnings

        Are you sure? next if ($chromosome = "chrM"); should have given you the warning "Found = in conditional, should be ==". Perhaps you're not enabling warnings correctly?

        Update:

        My apologies about the input data format

        You can edit your posts (please mark updates as such), see How do I change/delete my post?

        LeBran:

        It didn't give you any warnings because the expression $chromosome = /^chrM/ is perfectly fine. It just doesn't do what you want it to. Instead of checking whether $chromosome starts with "chrM", it instead checks whether $_ starts with "chrM", and then sets $chromosome to a true value if it does, and a false value otherwise. Since you're not using $_ while parsing your lines, it never starts with "chrM" and always returns a false value.

        It's a common enough mistake that I could see a case being made for "if ($var = /rex/)" generating a warning, as I expect that "if ($var = ($_ =~ /rex/))" is pretty uncommon (at least, when looking at my code).
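To make the difference concrete, here is a minimal sketch of assignment versus binding. The variable names and the warning handler are only illustrative, and it assumes $_ is undefined at that point, as it would be inside a while (my $line = <FILE>) loop:

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $chromosome = 'chrM';

# Assignment: the pattern is matched against $_ (undefined here),
# and the result of that match overwrites $chromosome.
my $assigned = ($chromosome = /^chrM/);

# Binding: the pattern is matched against the variable on the left.
my $label = 'chrM';
my $bound = ($label =~ /^chrM/);

printf "assignment result: %s\n", $assigned ? 'true' : 'false';
printf "binding result:    %s\n", $bound    ? 'true' : 'false';
print  "warning seen: $_" for @warnings;
```

With warnings enabled, the assignment form should report the uninitialized $_ at run time, while the binding form matches quietly.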

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        I went with

        next if ($chromosome =~ "chrM");

        Also note that  $chromosome =~ "chrM" matches if  "chrM" is found anywhere in the  $chromosome string:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $chromosome = 'foo xx chrM yy bar';
         print 'found a chrM' if $chromosome =~ 'chrM';
        "
        found a chrM
        This is because you no longer anchor the match to the start of the string (as you do in the code in the OP) with the  ^ match anchor regex operator. The match  /^chrM/ would IMHO be better for what you seem to want.

        Another point is that the data posted in the OP has a leading space or spaces in some cases. Leading whitespace will cause the  /^chrM/ match to fail. If leading whitespace may be present in real data, I would recommend something along the lines of  /^\s*chrM/ instead.
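The three variants discussed above can be compared side by side. This is just a sketch: match_report is a hypothetical helper and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the three patterns match a field.
sub match_report {
    my ($field) = @_;
    return {
        anywhere => ($field =~ /chrM/     ? 1 : 0),   # unanchored, like =~ "chrM"
        anchored => ($field =~ /^chrM/    ? 1 : 0),   # must start with chrM
        lenient  => ($field =~ /^\s*chrM/ ? 1 : 0),   # allows leading whitespace
    };
}

# Made-up sample values, including one with a leading space.
for my $field ('chrM', ' chrM', 'chr1', 'foo xx chrM yy bar') {
    my $r = match_report($field);
    printf "%-22s anywhere=%d anchored=%d lenient=%d\n",
        "'$field'", @{$r}{qw(anywhere anchored lenient)};
}
```

Only the lenient pattern accepts ' chrM' while still rejecting a string that merely contains chrM somewhere in the middle.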


        Give a man a fish:  <%-{-{-{-<

Re: Trouble skipping lines using Perl
by Eily (Monsignor) on Nov 21, 2017 at 16:16 UTC

    $chromosome =~ /^chrM/ may do what you need if you want $chromosome to start with chrM (that's what ^ means, it's the start of the string). Note that if you want to check if it is equal to "chrM" this is done with the eq operator, so $chromosome eq "chrM" (see Relational Operators and Equality Operators)

    Also FYI if you want to check that a string is contained in another, you can use index which is easier to use and less tricky than regular expressions.
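For reference, a small sketch of both alternatives. The helper names (is_exactly, contains, starts_with) are hypothetical and exist only for this example; eq and index are the core Perl operator and function.

```perl
use strict;
use warnings;

# Exact string equality with eq.
sub is_exactly  { my ($str, $want) = @_; return $str eq $want ? 1 : 0 }

# index returns the position of the first occurrence, or -1 if absent.
sub contains    { my ($str, $sub) = @_; return index($str, $sub) != -1 ? 1 : 0 }

# Position 0 means the string starts with the substring.
sub starts_with { my ($str, $sub) = @_; return index($str, $sub) == 0 ? 1 : 0 }

my $chromosome = 'chrM';
print "exactly chrM\n"     if is_exactly($chromosome, 'chrM');
print "contains chrM\n"    if contains('foo xx chrM yy bar', 'chrM');
print "starts with chrM\n" if starts_with($chromosome, 'chrM');
```

Note that eq is stricter than /^chrM/: a value such as 'chrM_extra' starts with chrM but is not equal to it.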

      Hi,

      Yeah that  eq operator actually seems really useful, and presumably would deal with certain warnings in relation to using a =

      Cheers

Re: Trouble skipping lines using Perl
by Laurent_R (Canon) on Nov 21, 2017 at 20:01 UTC
    You've been given complete answers to your question, but I would comment that:

    • You don't need to chomp your line, since you are not using its end
    • You're not using $chrpos anywhere, so don't need to process it
    • In the code below, there is also in principle no use for $chromosome, but I left it because maybe you actually want to use it
    • It is better to discard useless lines at the beginning of the loop, rather than splitting them, etc., and then not use the result of this work;

    So the code could be made significantly shorter (and probably slightly faster) as follows, without losing any clarity:

    while (my $line = <FILE>) {
        next if $. == 1;
        next if $line =~ /^\s*chrM/;
        my ($chromosome, $nmreads, $mutants) = (split "\t", $line)[0, 8, 9];
        # note: $chromosome is not used, this could be simplified further
        my $totalreads = $nmreads + $mutants;
        print ("$nmreads $mutants $totalreads\n");
    }
    I'm not saying that your code was bad, it wasn't, but I'm just trying to show some opportunities for improvement.
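To try the loop without the original file, here is a self-contained sketch that reads from an in-memory filehandle. It is an illustration under assumptions, not the OP's actual setup: the header row is a made-up placeholder, the rows are the first eleven columns of three lines from the question, and results are collected in @output before printing.

```perl
use strict;
use warnings;

# Build tab-separated sample data; the header row is a hypothetical placeholder.
my $data = join '', map { join("\t", @$_) . "\n" } (
    [ ('col') x 11 ],
    [ qw(chrM 2928   A G 17 0 0% A 46 105 69.54%) ],
    [ qw(chr1 108310 T C 9  0 0% T 3  5   62.5%)  ],
    [ qw(chr1 726958 T A 11 0 0% T 9  4   30.77%) ],
);

# Open the string as a file, using a lexical filehandle.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @output;
while (my $line = <$fh>) {
    next if $. == 1;                # skip the header line
    next if $line =~ /^\s*chrM/;    # discard chrM rows before splitting
    my ($nmreads, $mutants) = (split /\t/, $line)[8, 9];
    push @output, join ' ', $nmreads, $mutants, $nmreads + $mutants;
}
close $fh;

print "$_\n" for @output;
```

The chrM row is dropped before any splitting happens, so only the two chr1 rows contribute to the output.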

    Update: Fixed a typo (s/loosing/losing/), thanks to 1nickt for letting me know.

      Hi,

      Yeah, I can see the logic there, is certainly a lot neater and the time saving aspect (though probably minor in this case) is something I should try to remember

      I did actually use $chromosome and $chrpos in another print statement later, I just hadn't reached that part yet because the next if loop was bugging me haha :D

      Thanks for the help