biologistatsea has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a regex whose behavior doesn't match my expectations.

The input data looks like this:

. transcript_id "g29202.t1"; gene_id "g29202"; gene_name "G42051"; xloc "XLOC_053322"; cmp_ref "G42051.1"; class_code "c"; tss_id "TSS54758";

. transcript_id "g29205.t1"; gene_id "g29205"; xloc "XLOC_053323"; class_code "u"; tss_id "TSS54760";

. transcript_id "g29176.t1"; gene_id "g29176"; xloc "XLOC_053324"; class_code "u"; tss_id "TSS54761";

. transcript_id "g29178.t1"; gene_id "g29178"; gene_name "G42030"; xloc "XLOC_053326"; cmp_ref "G42030.1"; class_code "o"; tss_id "TSS54763";

The code below works fine:
    use warnings;
    use strict;

    my $usage = "perl select_bracker.pl [bracker gtf] [output id list]\n";
    my $gfin = shift or die $usage;
    my $output = shift or die $usage;

    open(IN, '<', $gfin);
    open(OUT, '>>', $output);
    while (my $record = <IN>){
        $record =~ s/\R//g;
        if ($record =~ /^.*transcript_id "([^"]*).*class_code "([^"]*)/){
            my $trans = $1;
            my $class = $2;
            if($class eq 's' | $class eq 'x' | $class eq 'u'){
                print OUT "$trans\n";
            }
        }
    }
    close IN;
    close OUT;

However, if I replace if($class eq 's' | $class eq 'x' | $class eq 'u') with if('sxu' =~ /$class/g), the script works for the first line it reads with a particular $class value, but if two adjacent lines have the same $class value, the regex doesn't match on the second one and the print statement doesn't run for it (e.g. line 3 of the example input).

I don't understand this at all, so any help would be much appreciated!

Alastair

Re: strange behavior of regex
by parv (Parson) on Feb 20, 2020 at 11:17 UTC

    WORKSFORME: output of adjusted code on Perl Banjo with perl 5.30 is ...

    g29205.t1
    g29176.t1

    NEVERMIND: I missed the /g flag when matching 'sux' against /$class/: I had typed the test instead of copying it from the OP. I did not think the flag was needed, as the single captured letter will match the string without it. Yes, the OP's problem persists if /g is kept.

    After a session of perl -Mre=debug ..., as I understand the behaviour: when the /g flag is used ('sux' =~ /$class/g), the last matched position in "sux" is remembered, and the next match starts after that position. So if "u" matched once, the next match will start at "x". That will fail if the value of "class" on the next line is also "u".
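For anyone following along, here is a minimal sketch of that behaviour. It assumes a variable $codes (my own name, standing in for the 'sxu' literal from the question) so that the remembered position can be inspected with pos(). The three identical class values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 'sxu' literal in the OP's test.
my $codes = 'sxu';

# Each entry records [matched?, pos($codes) after the attempt].
my @results;

for my $class (qw(u u u)) {
    # Scalar-context /g: the search starts at pos($codes), not at offset 0.
    my $matched = ($codes =~ /$class/g) ? 1 : 0;
    my $pos     = pos($codes);
    push @results, [$matched, $pos];
    printf "class=%s matched=%d pos=%s\n",
        $class, $matched, defined $pos ? $pos : 'undef';
}
```

The first attempt matches and leaves the position at the end of the string, the second starts from there and fails, and that failure resets the position, so the third attempt matches again. That alternating pattern fits the "second of two adjacent lines is skipped" symptom described in the question.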

    The correct test would be: $class =~ /[sux]/  # /g is not needed; does not hurt either.

      'sux' =~ $class

      would also work. It was the /g that was the problem.
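The suggested test can be tried against the sample records from the question with a sketch like the one below. The @records and @selected arrays are my own scaffolding in place of the file handles, and anchoring the character class with \A and \z is an extra assumption (that class codes are always a single character), slightly stricter than the /[sux]/ shown above.

```perl
use strict;
use warnings;

# Sample records copied from the question.
my @records = (
    '. transcript_id "g29202.t1"; gene_id "g29202"; gene_name "G42051"; xloc "XLOC_053322"; cmp_ref "G42051.1"; class_code "c"; tss_id "TSS54758";',
    '. transcript_id "g29205.t1"; gene_id "g29205"; xloc "XLOC_053323"; class_code "u"; tss_id "TSS54760";',
    '. transcript_id "g29176.t1"; gene_id "g29176"; xloc "XLOC_053324"; class_code "u"; tss_id "TSS54761";',
    '. transcript_id "g29178.t1"; gene_id "g29178"; gene_name "G42030"; xloc "XLOC_053326"; cmp_ref "G42030.1"; class_code "o"; tss_id "TSS54763";',
);

my @selected;
for my $record (@records) {
    # Capture the transcript id and the class code; skip lines without both.
    next unless $record =~ /transcript_id "([^"]*)".*class_code "([^"]*)"/;
    my ($trans, $class) = ($1, $2);

    # No /g here, so no match position is carried over between records.
    push @selected, $trans if $class =~ /\A[sxu]\z/;
}

print "$_\n" for @selected;
```

With these four records it prints g29205.t1 and g29176.t1, i.e. both adjacent "u" lines are kept.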

      Perfect - that makes total sense now. Thanks very much!
Re: strange behavior of regex
by johngg (Canon) on Feb 20, 2020 at 15:17 UTC
    if($class eq 's' | $class eq 'x' | $class eq 'u'){

    Note that | is a bit-wise OR; you probably meant ||, which is a logical OR.

    Cheers,

    JohnGG
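One visible difference between the two operators is short-circuiting. The sketch below uses a hypothetical helper, is_code (not from the thread), that counts how often it is called. With | every comparison is evaluated, while || stops at the first true one.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical helper: compares a class code and counts invocations.
sub is_code {
    my ($class, $code) = @_;
    $calls++;
    return $class eq $code ? 1 : 0;
}

my $class = 's';

# Bitwise OR: all three operands are evaluated, then OR-ed as numbers.
$calls = 0;
my $bitwise       = is_code($class, 's') | is_code($class, 'x') | is_code($class, 'u');
my $bitwise_calls = $calls;

# Logical OR: evaluation stops as soon as one operand is true.
$calls = 0;
my $logical       = is_code($class, 's') || is_code($class, 'x') || is_code($class, 'u');
my $logical_calls = $calls;

print "bitwise: result=$bitwise calls=$bitwise_calls\n";
print "logical: result=$logical calls=$logical_calls\n";
```

Both forms give a true result here, which is consistent with the OP's report that the original test works, but || is the idiomatic choice for combining conditions.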