in reply to parsing comments in newline-delimited files as lists

Why make your life more difficult than it needs to be? I prefer the straightforward approach: First deal with comments by making them whitespace, then deal with extraneous whitespace:

while (<DATA>) { chomp; s/#.*$/ /; # first replace any comments with space # skip this line if it consists only of whitespace (or nothing) next if /^\s*$/; # at this point we KNOW there is at least some non-blank on the line # which is also non-comment. # Just grab everything sans leading/trailing whitespace my ($non_blank) = /^\s*(.+)\s*$/; push @enabled_lines, $non_blank if $non_blank; # 'if' may not be ne +cessary }

Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader).

dmm

You can give a man a fish and feed him for a day ...
Or, you can
teach him to fish and feed him for a lifetime

Replies are listed 'Best First'.
Re: Re: parsing comments in newline-delimited files as lists
by innerfire (Novice) on Dec 27, 2001 at 10:54 UTC

    You don't need to turn comments into whitespace, and you can get more exact. I've already built a reputation on pedantry--just ask footpad...

    while (<DATA>) { chomp; s/#.*$//; next if /^\s*$/; /([^\s]+)/; push(@enabled_lines, $1); # the if was not necessary :) }

    Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader).

    The original poster specified that the URLs are URL-encoded, in which '#' appears as '%23', so there's nothing more to do.

    http://www.nodewarrior.org/chris/

      You pushed my "$1 used outside the context of a conditional" button here. That'd be failed in a code review if I were running the show. And yes, after I stared at the code for a minute or so, I can see that the assertion from the previous line ensures that there's always a match. But in that case, why not make the match, the match!
      while (<DATA>) { chomp; s/#.*$//; next unless /([^\s]+)/; push(@enabled_lines, $1); }
      There: it's now clear to me that we can't get to the push unless the match succeeds. I'd let this stand in a code review, but if I was looking for further optimization, I'd just keep pressing forward for more clarity:
      while (<DATA>) { chomp; s/#.*$//; push(@enabled_lines, $1) if /([^\s]+)/; }
      Nicer. Tighter. Dare I say, "faster" as well? But I see some equivalances that are down in the "nice" category (first was "must", second was "want", now "nice"):
      while (<DATA>) { chomp; s/#.*$//; push @enabled_lines, $1 if /(\S+)/; }
      There. Clean, maintainable, pretty. I don't know if this does what the original poster wanted, but I didn't change the meaning at all from the node to which I'm replying.

      -- Randal L. Schwartz, Perl hacker

        Without taking this code twisting too far, I think this is even nicer:
        while (<DATA>) { s/\s*#.*//; push @enabled_lines, /\s*(.+)/; }

      True, comments can be eliminated completely (along with any preceding whitespace), to wit:

      s/\s*#.*$//; # first remove any comments

      But the expression,

      /([^\s]+)/;
      is incorrect. Even the shorter equivalent,

      /(\S+)/;

      is incorrect: if the line contains more than one word, this will only match the first one; you are explicitly disallowing embedded whitespace. We need to match from the first non-whitespace character to the last non-whitespace character and should include all intervening characters (including embedded whitespace).

      Perhaps your point regarding the final if has merit, but assuming we are dealing with files of up to several thousand lines (not, say millions), the performance hit should be nearly negligible.

      dmm

      You can give a man a fish and feed him for a day ...
      Or, you can
      teach him to fish and feed him for a lifetime

        You don't need to remove whitespace preceding the comment, because it will be removed from the final result by the regex matching only non-whitespace characters. I originally had the first regex the way you have it.

        is incorrect: if the line contains more than one word, this will only match the first one; you are explicitly disallowing embedded whitespace. We need to match from the first non-whitespace character to the last non-whitespace character and should include all intervening characters (including embedded whitespace).

        No, we don't. Re-read the original post: we need to match URL-encoded URLs. Encoded URLs have no whitespace characters. My code is correct.

        http://www.nodewarrior.org/chris/