smackdab has asked for the wisdom of the Perl Monks concerning the following question:

Hola Monks,

I wrote a "c-like" routine for parsing a "here doc"...and thought someone probably has a regex that does this hanging around. I am finding that my parsing code is hard to understand after not working with it for a while ;-)

My ultimate goal is to support 2 types of config items in my config file, either: "end of line" or "here doc".

For example:

param1 = yes
param2 = <<END_STRING
    this would be good for beginning and trailing spaces
    and where multi-line things are needed...
END_STRING
param 3 = maybe

Thanks for any pointers or regexes!

And if you have a regex could you explain how it works? I can do the basics 'ok', but not anything really advanced ;-)

Re: regex for here doc?
by jonadab (Parson) on Oct 14, 2003 at 02:25 UTC
    while (<CONFIG>) {
        if (/^\s*#/) {
            # ignore comment line
        } elsif (/^\s*$/) {
            # ignore blank line
        } elsif (/(\w+)\s*=\s*[<]{2}(\w+)/) { # heredoc
            (my $name, local $/) = ($1, "\n$2"); # ++ysth
            $config{$name} = <CONFIG>;
            chomp $config{$name}; # as etcshadow points out.
        } elsif (/(\w+)\s*=\s*(.*?)\s*$/) { # regular pair
            $config{$1}=$2;
        } else {
            warn "Ptooey: Could not parse config line: $_\n";
        }
    }

    This does not handle the sorts of heredocs where the type of quoting is specified (e.g., <<'HEREDOC'), however. That could be a future improvement, if you need it.
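As a rough sketch of that possible improvement, the pattern could accept an optional quote character around the end marker and require the same character to close it. The sample lines and the %seen hash are hypothetical, and since a config file has no interpolation, the quote type is only recorded here, not acted on.

```perl
use strict;
use warnings;

# Hypothetical sample lines covering the bare, single-quoted
# and double-quoted forms of the end marker.
my @samples = ('a = <<END', q{b = <<'END'}, q{c = <<"END"});

my %seen;
for my $line (@samples) {
    # (['"]?) captures an optional quote; \2 demands the same
    # character (or nothing) after the marker.
    if ($line =~ /(\w+)\s*=\s*<<(['"]?)(\w+)\2/) {
        my ($name, $quote, $marker) = ($1, $2, $3);
        $seen{$name} = { quote => $quote, marker => $marker };
    }
}

for my $name (sort keys %seen) {
    print "$name: marker=$seen{$name}{marker} quote=[$seen{$name}{quote}]\n";
}
```

The captured marker can then be used for $/ exactly as in the unquoted case.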

    if you have a regex could you explain how it works?

    The first couple are pretty basic, assuming you know that \s matches whitespace (spaces, tabs, and so forth), so I'll let you figure those out on your own. The other two bear more explaining... I'll start with the last one:

    /(\w+)\s*=\s*(.*?)\s*$/

    \w matches a word character (letters, numbers, underscore, ...). + means one or more, and the parens capture those word characters to $1. Then you have an equal sign (possibly surrounded by zero or more whitespace characters). After that, this variation slurps forward, taking as few characters as possible (that's what the ? is for, to make it non-greedy) for $2, until it encounters the whitespace at the end of the line.
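To see the captures in action, here is a small self-contained sketch that runs this pattern over two made-up lines (the sample strings are assumptions, not from a real config file). It shows that the non-greedy (.*?) leaves the trailing whitespace for the final \s* to absorb.

```perl
use strict;
use warnings;

# Hypothetical sample lines, one of them with extra whitespace.
my @lines = ("param1 = yes\n", "  name=  some value with spaces   \n");

my %config;
for my $line (@lines) {
    if ($line =~ /(\w+)\s*=\s*(.*?)\s*$/) {
        # $1 is the key; $2 is the value without surrounding whitespace
        $config{$1} = $2;
    }
}

# Brackets make any stray whitespace in the values visible.
print "$_ => [$config{$_}]\n" for sort keys %config;
```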

    The one you're probably most interested in is the one that does the here document:

    } elsif (/(\w+)\s*=\s*[<]{2}(\w+)/) { # heredoc
        (my $name, local $/) = ($1, $2);
        $config{$name} = <CONFIG>;

    The first part is the same, matching the name of the config option and the equal sign, with any surrounding whitespace. I put the less-than symbol in a character class because I couldn't remember whether it's a special character in the main part of a regular expression. (I don't think so, but I wanted to be safe and give you code I knew would work.) The {2} is just a quantifier, telling how many times we want to match the preceding atom, so altogether that matches two less-than symbols in a row. Then, as before, it matches a series of one or more word characters.

    Now, the trick is that I didn't use the regex to match the rest of the here document: I grabbed the key from the regex and also the string used to mark the end of the here document, then I set the input record separator ($/) to that string, which causes the next read on the filehandle to go forward until it hits that point. This does have a weakness, in that a true here document can have that string in the document as long as it's not on a line by itself, but for config file purposes I figured I'd take the shortcut. The local qualifier on the assignment to $/ ensures that when the elsif block is exited the input record separator returns to its normal state, so that subsequent reads on the filehandle work as per normal.
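The scoping behaviour of local $/ can be checked in isolation. The following is a minimal sketch that reads from an in-memory filehandle; the sample text and the END marker are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input, read through an in-memory filehandle.
my $text = "first\nsecond\nEND\nthird\nfourth\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $line1 = <$fh>;          # normal line read: "first\n"

my $chunk;
{
    local $/ = "END";       # separator changed for this block only
    $chunk = <$fh>;         # reads up to and including "END"
}

# Outside the block $/ is "\n" again, so reads are line-based.
my $rest  = <$fh>;          # just the newline that followed "END"
my $line3 = <$fh>;          # "third\n"
close $fh;

print "chunk: [$chunk]\n";
```

Note that the chunk still ends with the marker itself, and that the newline after the marker comes back as a separate, empty-looking line on the next read.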


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
      Good, but, in the heredoc block of code, after reading <CONFIG>:
      chomp $config{$name};
      Or else you've got the string terminator in the param.
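      A short sketch of why that works: chomp removes a trailing copy of whatever $/ currently holds, so it strips the end marker as long as it runs while $/ is still localized. The string and marker below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical value as it would come back from a read with $/ = "END_STRING".
my $raw = "line one\nline two\nEND_STRING";

my $chomped = $raw;
{
    local $/ = "END_STRING";
    chomp $chomped;         # strips the trailing "END_STRING", nothing else
}

print "before: [$raw]\n";
print "after:  [$chomped]\n";
```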

      ------------
      :Wq
      Not an editor command: Wq
      BIG thanks, this is so much simpler than my code!

      I think the idea of $/ makes a huge difference. Even though I had working code, I am glad to replace it with something more manageable and something that adds to my experience.

        You probably want to set $/ to "\n$2", not just $2. At least, that's how heredocs usually work.
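        To illustrate the difference, here is a small sketch with a made-up heredoc body that mentions the marker in the middle of a line. The read_until helper is hypothetical and exists only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical heredoc body: "END" appears mid-line before the real terminator.
my $body = "the word END appears here\n"
         . "but the real terminator is below\n"
         . "END\n"
         . "next = value\n";

# Hypothetical helper: read one record from $body using the given separator.
sub read_until {
    my ($separator) = @_;
    open my $fh, '<', \$body or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my $chunk = <$fh>;
    chomp $chunk;           # strip the separator itself
    close $fh;
    return $chunk;
}

my $loose  = read_until("END");    # stops at the first "END" anywhere
my $strict = read_until("\nEND");  # stops only where "END" starts a line

print "loose:  [$loose]\n";
print "strict: [$strict]\n";
```

        Even with the leading newline, a line that merely starts with the marker (such as ENDING) would still end the record, so this remains a shortcut and not a full heredoc parser.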