http://qs1969.pair.com?node_id=484238

Yesterday, I had just finished reading Perl Best Practices and I felt electrified. So many things that you could have done and you haven't bothered for so long!

The book is so full of useful advice that I felt ashamed for every good piece that I could have found on my own and instead, due to too much laziness, I had not. Thus, I decided, from now on, I will put into my personal practice all the advice that I liked (most of them, actually) and were not already consolidated in my day-by-day programming.

Then, I updated some of my templates for subs, class creation, and so on, and I started the next project in my agenda fully armed with new knowledge. (Update Please notice: I said "next project". Truly to the if-it-works-don't-change-it principle, I leave my existing code as it is until it's in need of maintenance.)

One of the useful pieces of advice that struck me as simple and very easy to adopt was one about regular expressions. The book says "always use the /x modifier" (and the /m and /s as well, for reasons that I leave to the reader to find out in the book), so that your regex are easier to read. No big deal, I thought. I occasionally use the /x modifier, and why not making a habit of it? So I set off with the new course of action, and I changed one of my templates for parsing a simple data file. Before, I used to write things like this:

#!/usr/bin/perl use strict; use warnings; PARSE: while (<DATA>) { chomp; next PARSE if /^\s*$/ ; # skip blank lines next PARSE if /^\s*#/ ; # skip comments # do something useful with the data print "<$_>\n"; } __DATA__ some data here # a comment # another comment, followed by a blank line more data # another comment final data

Since I use this kind of thing very often, I have a template for it in my editor. I updated it so it became:

#!/usr/bin/perl use strict; use warnings; PARSE: while ( my $line = <DATA> ) { chomp $line; next PARSE if $line =~ m{ ^ \s* $ }xsm; next PARSE if $line =~ m{ ^ \s* # }xsm; # do something useful with the data print "<$line>\n"; } __DATA__ some data here # a comment # another comment, followed by a blank line more data # another comment final data

It looks better, doesn't it?

Unfortunately, this code is not the same as the previous one. When I ran my program for the first time, I got an empty result set. No lines were parsed at all.

I spent several minutes scratching my head, until I realized that the idiom I had used countless times in the past was failing me now.

The problem is, /^\s*#/ is not the same as /^\s*#/x, because the /x modifier allows not only whitespace, but also comments and therefore the "#" character is not a literal any more!

Of course, I should have adopted yet another best practice piece of advice, i.e. the one saying to put a character to escape into a character class.   (*)

Now,  m{ ^ \s* [#] }xms works as advertised, but I wonder if it was worth the trouble of deviating from the less readable but consolidated m/^\s*#/.   (**)

Lesson learned: cargo cult coding is always a risk, even for experienced programmers. Think before refactoring!

(*) Provided that I realized first that there was a character to escape!

(**) Yes, of course it was. I just have to remember to connect fingers and brain before starting a coding session!

 _  _ _  _  
(_|| | |(_|><
 _|   

Replies are listed 'Best First'.
Re: A refactoring trap
by jhourcle (Prior) on Aug 16, 2005 at 21:57 UTC
    Think before refactoring!

    Immediately after reading Perl Best Practices, and before, after reading Perl Medic, and countless other other books, I've wanted to go back and rewrite my code.

    Luckily, I have version control, and was able to roll back everything that I broke. Don't change things just for change's sake. I know, I'm a bad one to preach on this one, as I've done 'optimizations' and 'clarifications' and such to the projects that I've worked on way too many times. Yes, it might be a one line fix, but it might introduce other subtle errors, as you found, and it may be something that you might not catch for days, weeks, or even months down the road. At that point, you can end up losing significant time from trying to pretty something up.

    So, I'd say that the two rules that you should learn from Perl Best Practices apply to more than just Perl -- version control, and a good library of tests.

    Now, there might be times when it's worth fixing up the documentation (I'm going to be handing off this project to someone else, and I want to make sure everything's good), or optimization (the processes is taking more than 2 seconds, and that's unacceptable).

    I still cringe every time I look at some of my old code. Instead of risking breaking it, by trying to clean it up, I prefer to just go in, and make sure there's a comment at the top reflecting how long ago it was written, so if anyone starts giving me crap for it, I can point to that.

    I've started working in a few of the tips from the book into my work, but if I went back and tried fixing every bad script I've written, it might take me a year, as I deal with cleaning up new messes that I'd make.

    Update: Ovid's right -- I've been changing more than one thing at a time, and I haven't been running tests after every change. I guess this is one of the times when having too many tests can be a problem ... I don't mind running 5 min of tests for an install, but it's damned annoying for every 10 sec modification.

      I've also been reading and enjoying the book. However, I have been going through and making sweeping changes in the codebase I am currently working on. Why isn't my stuff breaking? Because I am only changing one thing at a time and I'm running the tests after every change.

      I can feel confident that I am not hurting anything and I'm very confident that some of the tricky bits in my code are much easier to understand. The poor programmers who will have to go in and maintain my code may not notice it, but they would certainly have noticed some of the things that were in there before.

      In fact, I've committed my .perltidy.rc file to the code base to ensure that programmers who follow me will find it easy to at least maintain consistent formatting.

      Cheers,
      Ovid

      New address of my CGI Course.

        In fact, I've committed my .perltidy.rc file to the code base to ensure that programmers who follow me will find it easy to at least maintain consistent formatting.

        Thats a very interesting idea. Id almost like to see your perltidy.rc posted here, along with any others that folks might like to contribute. I think it would be an interesting addition to the site.

        ---
        $world=~s/war/peace/g

Re: A refactoring trap
by Albannach (Monsignor) on Aug 16, 2005 at 21:22 UTC
    One of my personal rules: Always question a rule that starts with always.

    While I agree that /x can be mighty handy, I think that your case is so simple as to be made less clear by the change. Perhaps this is because I am used to certain regex idioms, or I just don't like your choice of braces, but I would have thought your first version was better, apart from the explict use of $line which is an improvement.

    Well the best thing about opinions is that you can always get another.

    --
    I'd like to be able to assign to an luser

Re: A refactoring trap
by Nkuvu (Priest) on Aug 16, 2005 at 21:32 UTC
    Actually? In this case I prefer the regexen without the x modifier. The regex /^\s*$/ is not long enough to require internal comments, the comment after the regex explains it fully. The addition of whitespace just spreads it out.

    As opposed, of course, to potentially confusing regexen such as /\[\s*([^\]]+)/ which, I think, gained a lot of readability with the x:

    /\[\s* # literal [ followed by optional whitespace ( # begin capture [^\]]+ # get one or more chars that are not ] ) # end capture /x # end regex
    (note that this regex is part of a script at work where the future maintainers will probably not know regexen well, which explains the over commenting)
      Does it really gain that much? I think commenting out a regex like that is just as silly and pointless as writing:
      my $tmp = 0; # Make $tmp 0. $i ++; # Increment $i. print "hello"; # Print 'hello' to the screen.
      If I want to comment the regex, I'd write:
      # Capture what's between brackets, # omitting leading white space. /\[\s*([^\]]+)/;
      Because that documents the intent of the regex, not the mechanics. Given the context, I'd probably even document what's between the brackets, and not just the fact I'm capturing between brackets, for instance:
      # Capture the wiki-link. /\[\s*([^\]]+)/;
      Now, that's useful, and you immediately know what to change if the syntax of the wiki-links changes. Where as your example carefully documents the tiny steps it takes, and not the overall picture.

      You might want to read Brian W. Kernighan, Rob Pike: The Practice of Programming, which discusses correct commenting style as well. And there's no reason to assume you should use a commenting style for regexes that's considered bad style for other code.

        Does it gain that much in this instance? Probably not. Commenting on intent is actually a nice way to put it, and one habit that I'm trying to break is commenting on mechanics. I've written a number of scripts for people who are not Perl-savvy, and they come back with questions on how this or that works. So for a long time I started commenting on every single regex on how it works -- because that's what people were asking about. These people are general C/C++ programmers, so I didn't get into the habit of commenting code like your first example, thank dog.

        So you have a very good point, and I'll add your reading suggestion to my list of books. I have some good habits, some not so good. I knew when I posted this that I might have comments about the style -- which is really what I want. Nothing like critiques of bad practices to break the habit.

      On the off chance that you really use your "potentialy confusing" regex, I will comment that it seems a little inconsistent to me. Can you tell me what it does (including what becomes $1) for each of: "[ x]", "[ ]", and "[]"? Is it what you wanted it to do?
        I should also have mentioned that this was a known-fragile regex to parse out some very specific files. The files in question are Matlab files that always have something in the brackets, but not always a leading space. I do have this commented in the actual script, I just didn't add the comment here.
Re: A refactoring trap
by bart (Canon) on Aug 17, 2005 at 11:27 UTC
    You don't have to use /[#]/x, you can use /\#/x just as well; just like you more or less have to use /\ /x in order to be able to match a space.
      /[ ]/x matches a space as well.
        But the regular expression engine is more likely to be able to optimize a literal character as opposed to a character class.
Re: A refactoring trap
by redlemon (Hermit) on Aug 17, 2005 at 07:34 UTC

    There's another problem with your refactored code:

    while (<DATA>) { ... }
    is not the same as
    while (my $line = <DATA>) { ... }

    The first while will stop as soon as <DATA> returns undef, so at the end of the data. The second while however, will stop as soon as it finds an empty line. The correct refactoring should have been:

    while (defined (my $line = <DATA>)) { ... }

    --
    Lyon

      No. This is magic in Perl5:
      wolverian@noesis:~$ perl -MO=Deparse -e'while (my $foo = <DATA>) { }' while (defined(my $foo = <DATA>)) { (); } __DATA__ -e syntax OK
      The second while however, will stop as soon as it finds an empty line

      Just to test your assertion:

      while (my $line = <DATA>) {print $line}; __DATA__ one two three four

      results in

      one
      
      two
      three
      
      four
      
      

      It seems that empty lines don't break the loop at all.

        Hmmm, you're right. It doesn't break.

        The reason is that blank lines are not truely blank, of course. You're saved by the fact that a blank line still contains the "\n".

        --
        Lyon

      The problem with
      while (my $line = <HANDLE>) {...}
      is that it stops the loop prematurely if $line is false, but defined. There are only two string values that are defined, but false: the empty string, and the string containing the character 0, and nothing else. The first cannot happen - even an "empty line" ends with a newline. The latter can happen in two situations: for a file ending in "$/0", that is, the last line contains 0, and does not end with the line terminator. But regular text files always do (otherwise, they aren't proper text files anyway). The second case is if you read in the file character by character, by setting $/ = \1;, and the file contains a 0.

      If you have opened the file using a PerlIO filter, anything can happen (or if you have tied the file handle). But then, an undefined value may be returned as well, even if the file hasn't completely read.

      Dispite that many has pointed out that

      while (my $line = <DATA>) { ... } while (defined (my $line = <DATA>)) { ... }
      are identical, your first remark is true. The problem lies elsewhere:
      while (<DATA>) { ... }
      doesn't localize $_ and thus leaves $_ undefined after the loop.
      while (my $line = <DATA>) { ... }
      doesn't suffer from this as it doesn't touch $_.

      If put in a subroutine this innocent little bug cause a serious headache while trying to find out why $_ all of a sudden is undefined.

      ihb

      See perltoc if you don't know which perldoc to read!

Re: A refactoring trap
by Roy Johnson (Monsignor) on Aug 17, 2005 at 19:00 UTC
    You could also join the two expressions into one:
    next PARSE if /^\s*(?:#|$)/;
    That might be complex enough to refactor using /x.

    Caution: Contents may have been coded under pressure.
Re: A refactoring trap ( regex /x modifier activates new metacharacters)
by LanX (Saint) on Feb 18, 2015 at 01:39 UTC
    I just ran into the same problem and it took me a while to debug it.

    Wanted write a warning thread ... but of course PM had it already, so I just need pushing it up again. =)

    Shortly stated: When composing regexes from smaller parts be aware that " " and "#" are real new metacharacters under /x and not only syntactic sugar on the top level.

    DB<135> $a='#x' => "#x" DB<136> "x#x#x" =~ s/$a$a/-/r => "x-" DB<137> "x#x#x" =~ s/$a$a/-/xr #oops => "-x#x#x"

    "$a$a" becomes "#x#x" which is an empty regex under /x since it starts with a comment.

    One solution¹ is to pre-compile the smaller parts w/o x-flag

    DB<138> $a=qr/#x/ => qr/#x/ DB<139> "x#x#x" =~ s/$a$a/-/r => "x-" DB<140> "x#x#x" =~ s/$a$a/-/xr => "x-"

    another one escaping or using a character class

    DB<144> $a='\#x' # or ='[#]x' => "\\#x" DB<145> "x#x#x" =~ s/$a$a/-/r => "x-" DB<146> "x#x#x" =~ s/$a$a/-/xr => "x-"

    simply using quotemeta might bite you again when you wanted to use other metacharacters.

    NB: same problem with whitespace.

    Cheers Rolf

    PS: Je suis Charlie!

    ¹) IMHO the cleanest and still unmentioned in this thread :)