JStrom has asked for the wisdom of the Perl Monks concerning the following question:

I have a language that supports a heredoc syntax similar to Perl's:
var = [[END . "other stuff";
heredoc data
END

I'm trying to find a way to parse this without rolling my own tokenizer, but I'm running into problems with the standard tools: the newline changes its meaning, and the rest of the expression sits between the heredoc marker and the heredoc content. Has anyone tackled this problem before?

Edit -- found a solution using Eyapp. The language I'm using doesn't have a << operator so munging the lexer works:

sub _Lexer {
    for( $input ) {
        if( @heredoc ) {
            /\A(.*?)\n$heredoc[0][0]/s or die "Unterminated heredoc";
            $strings[ $heredoc[0][1] ] = $1;
            shift @heredoc;
        }
        s/^\s*//;
        return ($1,$1) if s/^([;.])//;
        return ('IDENT',$1) if s/^(\w+)//;
        if( s/^<<(\w+)// ) {
            push @heredoc, [ $1, $id ];
            return ( 'HEREDOC', $id++ );
        }
    }
    return ('',undef);
}
(There should be a flag in the whitespace eater in the above code that switches on the heredoc parsing. I'll upload the corrected code later.)
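
A minimal sketch of what such a flag could look like, written as a standalone tokenizer rather than as the Eyapp lexer above. Everything here is an assumption for illustration: the name tokenize, the @pending queue, the token hash layout, and the rule that the terminator must sit alone on its own line. It uses the [[NAME introducer from the question. The whitespace eater skips only spaces and tabs, so a newline stays visible; when a newline is reached while heredocs are pending, the tokenizer switches to reading their bodies.

```perl
use strict;
use warnings;

# Hypothetical standalone tokenizer (not the Eyapp lexer from the post).
# Returns a list of hashrefs: { type => ..., value => ... }.
sub tokenize {
    my ($src) = @_;
    my (@tokens, @pending);    # @pending: HEREDOC tokens still waiting for a body

    for ($src) {
        while (1) {
            # Eat spaces and tabs only, so newlines are not swallowed here.
            s/\A[^\S\n]+//;

            if (s/\A\n//) {
                # The "flag": a newline with heredocs pending means the
                # following lines are heredoc bodies, in declaration order.
                while (my $tok = shift @pending) {
                    my $term = $tok->{term};
                    s/\A(.*?)^\Q$term\E(?:\n|\z)//ms
                        or die "Unterminated heredoc '$term'\n";
                    $tok->{value} = $1;
                }
                next;
            }

            last if !length;

            if (s/\A\[\[(\w+)//) {
                # Emit the token now; its value is filled in at the next newline.
                push @tokens, { type => 'HEREDOC', term => $1, value => undef };
                push @pending, $tokens[-1];
            }
            elsif (s/\A(\w+)//) {
                push @tokens, { type => 'IDENT', value => $1 };
            }
            elsif (s/\A"([^"]*)"//) {
                push @tokens, { type => 'STRING', value => $1 };
            }
            elsif (s/\A([=.;])//) {
                push @tokens, { type => $1, value => $1 };
            }
            else {
                die "Unexpected input near: " . substr($_, 0, 20) . "\n";
            }
        }
    }

    die "Unterminated heredoc '$pending[0]{term}'\n" if @pending;
    return @tokens;
}
```

Calling tokenize on the example from the question should yield the HEREDOC token in its original position in the expression, with its value filled in only after the rest of the line has been tokenized. Because the body is attached to a token that was already emitted, this shape suits a tokenize-everything-first design; a pull-style lexer like the one above would need to hand the parser an index or reference and resolve it later, as the $id approach does.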

Replies are listed 'Best First'.
Re: Parsing HereDocs
by GrandFather (Saint) on Jun 11, 2008 at 00:32 UTC

    What are you using to parse the source at present? Are you looking for a chunk of code using regexen and loops, or something you can plug into a Parse::RecDescent rule set?

    The following sketch code for handling the problem using regexen may help:

    use strict;
    use warnings;

    while (<DATA>) {
        s/\[\[(\w+)/parseHereDoc ("$1")/e if /\[\[\w+/;
        print;
    }

    sub parseHereDoc {
        my $id = shift;
        my $str = '"';

        while (<DATA>) {
            last if /^$id$/;
            $str .= $_;
        }

        return $str . '"';
    }

    __DATA__
    var = [[END . "other stuff";
    heredoc data
    END

    Prints:

    var = "heredoc data
    " . "other stuff";

    Perl is environmentally friendly - it saves trees
      I've been working with Parse::RecDescent, but I have no problem switching to YAPP or one of the others as I've invested little time in the parsing code so far.
Re: Parsing HereDocs
by jethro (Monsignor) on Jun 11, 2008 at 00:24 UTC
    Hopefully others understand better what you mean by standard tools.

    Shouldn't it be possible to use a different (simple) parser for the heredoc? Parse::RecDescent, for example, has no problem switching between different parsers.