PerlMonks
parsing comments in newline-delimited files as lists

by Amoe (Friar)
on Dec 27, 2001 at 02:55 UTC ( [id://134504] )

Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Hey.

I'm rewriting some horrible void-context mapping, and have stumbled upon a few... stumbling blocks :) I'm trying to parse a newline-delimited config file and populate an array and a hash with its contents. The file is basically a flat newline-delimited list of URLs. This makes parsing slightly easier, as we can ignore whitespace (the URLs are encoded).

The problem is with comments. I want to ignore everything after a comment character, and ignore lines that are left blank as a result. (Basically, I want to strip comments the same way the Perl parser does, only with the added bonus of being able to ignore whitespace.) The array should contain every enabled line in the file (lines not commented out), and the hash should be keyed on every line in the file, whether or not it's commented out (effectively stripping the comment character from the line); the values can be anything, since I plan to use it as a lookup hash. I came up with this draft:

#!/usr/bin/perl -w

use Data::Dumper;

my %alllines;       # hash: line => (arbitrary value)
my @enabled_lines;  # array: flat list

while (<DATA>) {
    chomp;
    # not testing with defined as '0' couldn't be a valid url.
    my ($data) = /([^\s#]+)/;                    # this
    $alllines{$data} = 1 if $data;               # works
    my ($enabled) = /([^\s#]+)#/;                # this
    push @enabled_lines, $enabled if $enabled;   # doesn't work
}

print Dumper(\%alllines);
print Dumper(\@enabled_lines);

__DATA__
astandardline
#acommentedoutline
   # alinewithspacesbeforeandafterthecommentcharacter
therealdata # a subordinate comment, this contains whitespace as it should be ignored
	linewherethetabsshouldbeignored
alinewith # multiple # comment # characters

%alllines gets populated fine, and just as I expected (here I used 1 as the value for each key). @enabled_lines doesn't work, though; it's always an empty list when it's printed. I expect I'm missing something simple here, probably in my regex, although it looks okay to my eyes. Aid?



--
my one true love

Replies are listed 'Best First'.
(Ovid) Re: parsing comments in newline-delimited files as lists
by Ovid (Cardinal) on Dec 27, 2001 at 03:15 UTC

    Quick hack to fix it:

    #!/usr/bin/perl -w

    use strict;
    use Data::Dumper;

    my %alllines;       # hash: line => (arbitrary value)
    my @enabled_lines;  # array: flat list

    while (<DATA>) {
        next if /^\s*#/;   # may as well skip commented out lines
        chomp;
        # not testing with defined as '0' couldn't be a valid url.
        my ($data) = /([^\s#]+)/;                          # this
        $alllines{$data} = 1 if $data;                     # works
        my ($enabled) = /([^#]+)#/;                        # this
        push @enabled_lines, trim($enabled) if $enabled;   # doesn't work
    }

    print Dumper(\%alllines);
    print Dumper(\@enabled_lines);

    sub trim {
        $_ = shift;
        s/^\s+//;
        s/\s+$//;
        return $_;
    }

    __DATA__
    astandardline
    #acommentedoutline
       # alinewithspacesbeforeandafterthecommentcharacter
    therealdata # a subordinate comment, this contains whitespace as it should be ignored
    	linewherethetabsshouldbeignored
    alinewith # multiple # comment # characters

    Basically, I skip lines that start with comments and contain no data. I also assume that any sharps (#) in the URIs will be encoded as %23; that's the only reason the $enabled regex works. Also, rather than complicate that regex further, I created an easy-to-read trim() function to deal with the excess whitespace.
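
    As a quick check (an untested sketch, reusing the regex and trim() from above on one of the lines from the __DATA__ section):

    my $line = "therealdata # a subordinate comment";
    my ($enabled) = $line =~ /([^#]+)#/;   # captures "therealdata "
    print trim($enabled), "\n";            # prints "therealdata"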

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: parsing comments in newline-delimited files as lists
by frag (Hermit) on Dec 27, 2001 at 03:45 UTC
    Your regular expression isn't correct, in a few ways:
    1. my ($enabled) = /([^\s#]+)#/; doesn't work, because it specifies that there can't be any spaces between the valid text and the #. Add \s*.
    2. To catch lines that aren't commented at all, change # to #?
    3. Now you'll wind up matching "acommentedoutline" in "#acommentedoutline". Block that by adding ^\s* to the start of the regexp; now the match has to be between the start of the line and the first #, if there is one.
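
    Putting those three changes together, the relevant lines of the original loop would look something like this (a sketch, untested):

    # optional leading whitespace, the URL itself, optional whitespace,
    # then an optional comment character
    my ($enabled) = /^\s*([^\s#]+)\s*#?/;
    push @enabled_lines, $enabled if $enabled;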

    -- Frag.
    --
    "Just remember what ol' Jack Burton does when the earth quakes, the poison arrows fall from the sky, and the pillars of Heaven shake. Yeah, Jack Burton just looks that big old storm right in the eye and says, "Give me your best shot. I can take it."

Re: parsing comments in newline-delimited files as lists
by dmmiller2k (Chaplain) on Dec 27, 2001 at 04:03 UTC

    Why make your life more difficult than it needs to be? I prefer the straightforward approach: first deal with comments by turning them into whitespace, then deal with the extraneous whitespace:

    while (<DATA>) {
        chomp;

        s/#.*$/ /;  # first replace any comments with space

        # skip this line if it consists only of whitespace (or nothing)
        next if /^\s*$/;

        # at this point we KNOW there is at least some non-blank on the line
        # which is also non-comment.
        # Just grab everything sans leading/trailing whitespace
        my ($non_blank) = /^\s*(.+)\s*$/;

        push @enabled_lines, $non_blank if $non_blank;  # 'if' may not be necessary
    }

    Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader).

    dmm

    You can give a man a fish and feed him for a day ...
    Or, you can
    teach him to fish and feed him for a lifetime

      You don't need to turn comments into whitespace, and you can get more exact. I've already built a reputation on pedantry--just ask footpad...

      while (<DATA>) {
          chomp;
          s/#.*$//;
          next if /^\s*$/;
          /([^\s]+)/;
          push(@enabled_lines, $1);  # the if was not necessary :)
      }

      "Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader)."

      The original poster specified that the URLs are URL-encoded, in which '#' appears as '%23', so there's nothing more to do.

      http://www.nodewarrior.org/chris/

        You pushed my "$1 used outside the context of a conditional" button here. That'd be failed in a code review if I were running the show. And yes, after I stared at the code for a minute or so, I can see that the assertion from the previous line ensures that there's always a match. But in that case, why not make the match, the match!
        while (<DATA>) {
            chomp;
            s/#.*$//;
            next unless /([^\s]+)/;
            push(@enabled_lines, $1);
        }
        There: it's now clear to me that we can't get to the push unless the match succeeds. I'd let this stand in a code review, but if I were looking for further optimization, I'd just keep pressing forward for more clarity:
        while (<DATA>) {
            chomp;
            s/#.*$//;
            push(@enabled_lines, $1) if /([^\s]+)/;
        }
        Nicer. Tighter. Dare I say, "faster" as well? But I see some equivalences that are down in the "nice" category (first was "must", second was "want", now "nice"):
        while (<DATA>) {
            chomp;
            s/#.*$//;
            push @enabled_lines, $1 if /(\S+)/;
        }
        There. Clean, maintainable, pretty. I don't know if this does what the original poster wanted, but I didn't change the meaning at all from the node to which I'm replying.

        -- Randal L. Schwartz, Perl hacker

        True, comments can be eliminated completely (along with any preceding whitespace), to wit:

        s/\s*#.*$//; # first remove any comments

        But the expression,

        /([^\s]+)/;
        is incorrect. Even the shorter equivalent,

        /(\S+)/;

        is incorrect: if the line contains more than one word, this will only match the first one; you are explicitly disallowing embedded whitespace. We need to match from the first non-whitespace character to the last non-whitespace character and should include all intervening characters (including embedded whitespace).
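
        For instance, a capture running from the first to the last non-whitespace character on the line might look like this (a sketch, not from the node above):

        # \s* skips leading whitespace; the greedy .* backtracks so the
        # capture ends at the last non-whitespace character
        my ($non_blank) = /^\s*(.*\S)/;
        push @enabled_lines, $non_blank if defined $non_blank;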

        Perhaps your point regarding the final if has merit, but assuming we are dealing with files of up to several thousand lines (not, say, millions), the performance hit should be nearly negligible.

        dmm

        You can give a man a fish and feed him for a day ...
        Or, you can
        teach him to fish and feed him for a lifetime
