PerlMonks
parsing comments in newline-delimited files as lists

by Amoe (Friar)
on Dec 27, 2001 at 02:55 UTC ( [id://134504] )

Amoe has asked for the wisdom of the Perl Monks concerning the following question:

Hey.

I'm rewriting some horrible void-context mapping, and have stumbled upon a few... stumbling blocks :) I'm trying to parse a newline-delimited config file and populate an array and a hash with its contents. The file is basically a flat newline-delimited list of URLs. This makes parsing slightly easier, as we can ignore whitespace (the URLs are encoded).

The problem is with comments. I want to ignore everything after a comment character, and ignore lines that are left blank as a result. (Basically, I want to strip comments the same way the Perl parser does, only with the added bonus of being able to ignore whitespace.) The array should contain every enabled line in the file (lines not commented out), and the hash should be keyed on every line in the file, whether or not it's commented out (effectively stripping the comment character from the line); the values can be anything, since I plan to use it as a lookup hash. I came up with this draft:

#!/usr/bin/perl -w

use Data::Dumper;

my %alllines;       # hash: line => (arbitrary value)
my @enabled_lines;  # array: flat list

while (<DATA>) {
    chomp;
    # not testing with defined as '0' couldn't be a valid url.
    my ($data) = /([^\s#]+)/;                    # this
    $alllines{$data} = 1 if $data;               # works
    my ($enabled) = /([^\s#]+)#/;                # this
    push @enabled_lines, $enabled if $enabled;   # doesn't work
}

print Dumper(\%alllines);
print Dumper(\@enabled_lines);

__DATA__
astandardline
#acommentedoutline
   # alinewithspacesbeforeandafterthecommentcharacter
therealdata # a subordinate comment, this contains whitespace as it should be ignored
	linewherethetabsshouldbeignored
alinewith # multiple # comment # characters

%alllines gets populated fine, and just as I expected (here I used 1 as the value for each key). @enabled_lines doesn't work, though; it's always an empty list when it's printed. I expect I'm missing something simple here, probably in my regex, although it looks okay to my eyes. Aid?



--
my one true love

Replies are listed 'Best First'.
(Ovid) Re: parsing comments in newline-delimited files as lists
by Ovid (Cardinal) on Dec 27, 2001 at 03:15 UTC

    Quick hack to fix it:

    #!/usr/bin/perl -w

    use strict;
    use Data::Dumper;

    my %alllines;       # hash: line => (arbitrary value)
    my @enabled_lines;  # array: flat list

    while (<DATA>) {
        next if /^\s*#/;   # may as well skip commented out lines
        chomp;
        # not testing with defined as '0' couldn't be a valid url.
        my ($data) = /([^\s#]+)/;                          # this
        $alllines{$data} = 1 if $data;                     # works
        my ($enabled) = /([^#]+)#/;                        # this
        push @enabled_lines, trim($enabled) if $enabled;   # doesn't work
    }

    print Dumper(\%alllines);
    print Dumper(\@enabled_lines);

    sub trim {
        $_ = shift;
        s/^\s+//;
        s/\s+$//;
        return $_;
    }

    __DATA__
    astandardline
    #acommentedoutline
       # alinewithspacesbeforeandafterthecommentcharacter
    therealdata # a subordinate comment, this contains whitespace as it should be ignored
    	linewherethetabsshouldbeignored
    alinewith # multiple # comment # characters

    Basically, I skip lines that start with comments and contain no data. I also assume that any sharps (#) in the URIs will be encoded as %23; that's the only reason the $enabled regex works. Also, rather than complicate that regex further, I created an easy-to-read trim() function to deal with the excess whitespace.
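
    As a quick check (an untested sketch, reusing the regex and trim() from above on one of the lines from the __DATA__ section):

    my $line = "therealdata # a subordinate comment";
    my ($enabled) = $line =~ /([^#]+)#/;   # captures "therealdata "
    print trim($enabled), "\n";            # prints "therealdata"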

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: parsing comments in newline-delimited files as lists
by frag (Hermit) on Dec 27, 2001 at 03:45 UTC
    Your regular expression isn't correct, in a few ways:
    1. my ($enabled) = /([^\s#]+)#/; doesn't work, because it specifies that there can't be any spaces between the valid text and the #. Add \s*.
    2. To catch lines that aren't commented at all, change # to #?
    3. Now you'll wind up matching "acommentedoutline" in "#acommentedoutline". Block that by adding ^\s* to the start of the regexp; now the match has to be between the start of the line and the first #, if there is one.
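
    Putting those three changes together, the relevant lines of the original loop would look something like this (a sketch, untested):

    # optional leading whitespace, the URL itself, optional whitespace,
    # then an optional comment character
    my ($enabled) = /^\s*([^\s#]+)\s*#?/;
    push @enabled_lines, $enabled if $enabled;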

    -- Frag.
    --
    "Just remember what ol' Jack Burton does when the earth quakes, the poison arrows fall from the sky, and the pillars of Heaven shake. Yeah, Jack Burton just looks that big old storm right in the eye and says, "Give me your best shot. I can take it."

Re: parsing comments in newline-delimited files as lists
by dmmiller2k (Chaplain) on Dec 27, 2001 at 04:03 UTC

    Why make your life more difficult than it needs to be? I prefer the straightforward approach: first deal with comments by turning them into whitespace, then deal with the extraneous whitespace:

    while (<DATA>) {
        chomp;

        s/#.*$/ /;  # first replace any comments with space

        # skip this line if it consists only of whitespace (or nothing)
        next if /^\s*$/;

        # at this point we KNOW there is at least some non-blank on the line
        # which is also non-comment.
        # Just grab everything sans leading/trailing whitespace
        my ($non_blank) = /^\s*(.+)\s*$/;

        push @enabled_lines, $non_blank if $non_blank;  # 'if' may not be necessary
    }

    Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader).

    dmm

    You can give a man a fish and feed him for a day ...
    Or, you can
    teach him to fish and feed him for a lifetime

      You don't need to turn comments into whitespace, and you can get more exact. I've already built a reputation on pedantry--just ask footpad...

      while (<DATA>) {
          chomp;
          s/#.*$//;
          next if /^\s*$/;
          /([^\s]+)/;
          push(@enabled_lines, $1);  # the if was not necessary :)
      }

      "Of course this does not account for escaped comment characters, since I thought it would obscure the simplicity (I leave it as an exercise for the reader)."

      The original poster specified that the URLs are URL-encoded, in which '#' appears as '%23', so there's nothing more to do.

      http://www.nodewarrior.org/chris/

        You pushed my "$1 used outside the context of a conditional" button here. That'd be failed in a code review if I were running the show. And yes, after I stared at the code for a minute or so, I can see that the assertion from the previous line ensures that there's always a match. But in that case, why not make the match, the match!
        while (<DATA>) {
            chomp;
            s/#.*$//;
            next unless /([^\s]+)/;
            push(@enabled_lines, $1);
        }
        There: it's now clear to me that we can't get to the push unless the match succeeds. I'd let this stand in a code review, but if I were looking for further optimization, I'd just keep pressing forward for more clarity:
        while (<DATA>) {
            chomp;
            s/#.*$//;
            push(@enabled_lines, $1) if /([^\s]+)/;
        }
        Nicer. Tighter. Dare I say, "faster" as well? But I see some equivalences that are down in the "nice" category (first was "must", second was "want", now "nice"):
        while (<DATA>) {
            chomp;
            s/#.*$//;
            push @enabled_lines, $1 if /(\S+)/;
        }
        There. Clean, maintainable, pretty. I don't know if this does what the original poster wanted, but I didn't change the meaning at all from the node to which I'm replying.

        -- Randal L. Schwartz, Perl hacker

        True, comments can be eliminated completely (along with any preceding whitespace), to wit:

        s/\s*#.*$//; # first remove any comments

        But the expression,

        /([^\s]+)/;
        is incorrect. Even the shorter equivalent,

        /(\S+)/;

        is incorrect: if the line contains more than one word, this will only match the first one; you are explicitly disallowing embedded whitespace. We need to match from the first non-whitespace character to the last non-whitespace character and should include all intervening characters (including embedded whitespace).
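
        For instance, a capture running from the first to the last non-whitespace character on the line might look like this (a sketch, not from the node above):

        # \s* skips leading whitespace; the greedy .* backtracks so the
        # capture ends at the last non-whitespace character
        my ($non_blank) = /^\s*(.*\S)/;
        push @enabled_lines, $non_blank if defined $non_blank;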

        Perhaps your point regarding the final if has merit, but assuming we are dealing with files of up to several thousand lines (not, say, millions), the performance hit should be nearly negligible.

        dmm

        You can give a man a fish and feed him for a day ...
        Or, you can
        teach him to fish and feed him for a lifetime
