halak77 has asked for the wisdom of the Perl Monks concerning the following question:

I am doing something that seems very simple to me, yet the regex is not behaving as I thought it would. The file I am parsing is an fstab. I am trying to ignore commented lines (those beginning with "#"). I also want to ignore any comment lines where the "#" is preceded by whitespace. My example file:

    # This file is a comment and should not be used
     # This comment has one leading space
       # This comment has three leading spaces
    /dev/mapper/centos-var_log /var/log xfs defaults 1 2
    /dev/donstest /expert none 1 1 -rw,noglob # this is a fs
    #/dev/donstest22 /expert none 1 1 -rw,noglob # this is a fS

and code:

    #!/usr/bin/perl
    use strict;

    open (MYFILE, "./file.txt" ) or die "Could not open file: $! \n";
    while ( my $line = <MYFILE> ) {
        chomp $line;
        if ( $line =~ /^\s*[^#].*/ ) {
            print "NOT Comment: $line \n";
        }
        else {
            print "IS Comment: $line \n";
        }
    }

My output:

    IS Comment: # This file is a comment and should not be used
    NOT Comment:  # This comment has one leading space
    NOT Comment:    # This comment has three leading spaces
    NOT Comment: /dev/mapper/centos-var_log /var/log xfs defaults 1 2
    NOT Comment: /dev/donstest /expert none 1 1 -rw,noglob # this is a fs
    NOT Comment: #/dev/donstest22 /expert none 1 1 -rw,noglob # this is a fS

What am I missing? As I understand it, /^\s*[^#]/ means "match any string that has zero or more whitespace characters from the beginning of the line, followed by any character that is not a '#'". The \s* should be greedy up to the first non-whitespace character. If I replace my regex with /^\s*[^#\s]... it works as desired.

Replies are listed 'Best First'.
Re: regex whitespace quantifiers
by Athanasius (Archbishop) on Apr 24, 2015 at 12:33 UTC

    Hello halak77, and welcome to the Monastery!

    Yes, the \s* is greedy, but if no match is found then the regex engine backtracks. So it tries to match the line:

    # This comment has one leading space

    with one space character, but fails because the following character (#) doesn’t match [^#]. So the engine backtracks and tries zero whitespace characters, followed by [^#], which succeeds because the first character (a space) is not a hash, and so matches [^#].
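The backtracking described above can be made visible by capturing what each part of the pattern consumed. This is only a minimal sketch: the capture groups, the sample lines, and the @report array are illustrative additions, not part of the original code.

```perl
use strict;
use warnings;

# Sample lines; the leading whitespace is significant.
my @lines = (
    '# no leading space',
    ' # one leading space',
    '   # three leading spaces',
    '/dev/donstest /expert none 1 1',
);

# Same pattern as in the question, with capture groups added (an
# illustrative change) to show how much \s* kept after backtracking.
my @report;
for my $line (@lines) {
    if ( $line =~ /^(\s*)([^#])/ ) {
        push @report,
            sprintf( "match: \\s* took %d, [^#] took '%s'", length($1), $2 );
    }
    else {
        push @report, 'no match';
    }
}

print "$_\n" for @report;
```

For the one-space line, \s* ends up holding zero characters; for the three-space line it gives back one space, which [^#] then accepts. Only the line that starts directly with '#' fails to match.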

    If I replace my regex with /^\s*[^#\s]... it works as desired.

    Yes, because now the case where a space matches the no-hash character class is explicitly excluded.
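A small sketch of the [^#\s] variant, assuming the goal is to keep only real fstab entries. The helper name is_data_line and the sample lines are hypothetical and chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when a first non-whitespace
# character exists and it is not '#'.
sub is_data_line {
    my ($line) = @_;
    return $line =~ /^\s*[^#\s]/ ? 1 : 0;
}

my @samples = (
    '# comment',
    ' # indented comment',
    '/dev/donstest /expert none 1 1',
    '  /dev/donstest /expert none 1 1',
    '',       # empty line
    '   ',    # whitespace only
);

for my $line (@samples) {
    printf "%-4s '%s'\n", ( is_data_line($line) ? 'DATA' : 'SKIP' ), $line;
}
```

One thing to keep in mind: empty and whitespace-only lines do not match this pattern either, so in the original if/else they would be reported as "IS Comment". That may or may not be what is wanted.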

    On the working of the Perl regex engine, see the section “The Little Engine That /Could(n't)?/” in The Camel Book (4th Edition, 2012), pages 241–246.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: regex whitespace quantifiers
by QM (Parson) on Apr 24, 2015 at 11:22 UTC
    Edit: Spoke too soon.

    Seems you should turn your logic around, and match the comment:

    /^\s*#/

    Original suggestion, struck through in the original post: your regex should have a plus instead of a star after \s:

    /^\s+[^#].*/

    And apparently <strike> doesn't span code blocks.
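A quick check of the \s+ variant, as a sketch with illustrative sample lines, suggests why it does not solve the problem: it still backtracks within runs of two or more spaces, and it rejects entries that have no leading whitespace at all.

```perl
use strict;
use warnings;

# Record whether each sample line matches the \s+ variant.
my %result;
for my $line (
    ' # one leading space',
    '   # three leading spaces',
    '/dev/donstest /expert none 1 1',
) {
    $result{$line} = ( $line =~ /^\s+[^#].*/ ) ? 'match' : 'no match';
    printf "%-8s '%s'\n", $result{$line}, $line;
}
```

The one-space comment is rejected, because \s+ cannot give up its only space. The three-space comment still matches, since \s+ keeps two spaces and [^#] takes the third. The unindented entry fails because \s+ needs at least one whitespace character.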

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: regex whitespace quantifiers
by kcott (Archbishop) on Apr 24, 2015 at 15:13 UTC

    G'day halak77,

    Welcome to the Monastery.

    "/^\s*[^#].*/"

    [^#] is a character class containing all characters except '#'. What you really want is a character class containing only '#'.

    The first line doesn't match: (^) - line start OK; (\s*) - match zero whitespace OK; ([^#]) - '#' is a '#' FAIL

    The second line does match: (^) - line start OK; (\s*) - match zero whitespace OK; ([^#]) - ' ' (a space) is NOT a '#' OK

    I'll leave you to continue through the remaining lines of data.

    I'll also point out that the '.*' (at the end of the regex) is superfluous. It will always match, because it accepts zero or more of any character matched by '.'. You're only interested in the front of the string, anyway.
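The point about the trailing '.*' can be checked with a short sketch. The sample lines and the $differences counter are illustrative; the code compares the yes/no result of the pattern with and without the '.*'.

```perl
use strict;
use warnings;

my @lines = (
    '# comment',
    ' # indented',
    '/dev/sda1 / ext4 defaults 1 1',
    '',
);

# Count the lines on which the two patterns disagree.
my $differences = 0;
for my $line (@lines) {
    my $with    = $line =~ /^\s*[^#].*/ ? 1 : 0;
    my $without = $line =~ /^\s*[^#]/   ? 1 : 0;
    $differences++ if $with != $without;
}

print "differences: $differences\n";
```

Since '.*' can always match zero characters, it never changes whether the overall match succeeds; it only changes how much of the string the match covers.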

    Here's my test (pm_1124511_comment_removal.pl):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $comment_line_re = qr{ \A \s* [#] }x;

    while (<DATA>) {
        print "@{[/$comment_line_re/ ? 'IS' : 'NOT']} Comment: $_";
    }

    __DATA__
    # This file is a comment and should not be used
     # This comment has one leading space
       # This comment has three leading spaces
    /dev/mapper/centos-var_log /var/log xfs defaults 1 2
    /dev/donstest /expert none 1 1 -rw,noglob # this is a fs
    #/dev/donstest22 /expert none 1 1 -rw,noglob # this is a fS

    Output:

    $ pm_1124511_comment_removal.pl
    IS Comment: # This file is a comment and should not be used
    IS Comment:  # This comment has one leading space
    IS Comment:    # This comment has three leading spaces
    NOT Comment: /dev/mapper/centos-var_log /var/log xfs defaults 1 2
    NOT Comment: /dev/donstest /expert none 1 1 -rw,noglob # this is a fs
    IS Comment: #/dev/donstest22 /expert none 1 1 -rw,noglob # this is a fS

    -- Ken