Hi all,
Q1: What's the policy on StackOverflow cross-posts? I've been away from PM for a number of years, so I'm a little out of date on things :)
SO xpost link: http://stackoverflow.com/questions/34889351/whitespace-important-parsing-with-parserecdescent-eg-haml-python
Actual Q:
I'm trying to parse HAML (haml.info) with Parse::RecDescent. If you don't know HAML, the problem in question is the same as parsing Python: blocks of syntax are grouped by indentation level.
Starting with a very simple subset, I've tried a few approaches, but I think I don't quite understand either the greediness or the recursion order of P::RD. Given the HAML:
%p
  %span foo
The simplest grammar I have that I think should work is below (it includes some rules that the above snippet doesn't need):
<autotree>
startrule : <skip:''> block(s?)
non_space : /[^ ]/
space : ' '
indent : space(s?)
indented_line : indent line
indented_lines : indented_line(s) <reject: do { Perl6::Junction::any(map { $_->level } @{$item[1]}) != $item[1][0]->level }>
block : indented_line block <reject: do { $item[2]->level <= $item[1]->level }>
| indented_lines
line : single_line | multiple_lines
single_line : line_head space line_body newline | line_head space(s?) newline | plain_text newline
# ALL subsequent lines ending in | are consumed
multiple_lines : line_head space line_body continuation_marker newline continuation_line(s)
continuation_marker : space(s) '|' space(s?)
continuation_line : space(s?) line_body continuation_marker
newline : "\n"
line_head : haml_comment | html_element
haml_comment : '-#'
html_element : '%' tag
# TODO: xhtml tags technically allow unicode
tag_start_char : /[:_a-z]/i
tag_char : /[-:_a-z.0-9]/i
tag : tag_start_char tag_char(s?)
line_body : /.*/
plain_text : backslash ('%' | '!' | '.' | '#' | '-' | '/' | '=' | '&' | ':' | '~') /.*/ | /.*/
backslash : '\\'
The problem is in the block definition. As written, it captures none of the text from the snippet above, though it does correctly capture the following:
-# haml comment
%p a paragraph
If I remove the second <reject> directive from the above (the one on the first block production), then it does capture everything, but of course incorrectly grouped, since the first block will slurp all lines irrespective of indentation.
I've also tried using lookahead actions to inspect $text and a few other approaches with no luck.
Can anyone (a) explain why the above doesn't work and/or (b) suggest an approach that doesn't use Perl actions/rejects? I tried capturing the number of spaces in the indent and then using that count in an interpolated lookahead condition for the indent of the next line, but I could never quite get the interpolation syntax right (since it requires an arrow operator).
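To make the goal concrete, here is a rough sketch in plain Perl (no P::RD) of the grouping I'm after. It only counts leading spaces and keeps a stack of open blocks. The build_tree name and the level/text/children hash layout are made up for illustration, and it assumes space-only indentation with no tabs or continuation lines:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parse::RecDescent: group lines into a
# tree based purely on the number of leading spaces.
sub build_tree {
    my ($text) = @_;

    # Sentinel root sits at level -1 so every real line nests under it.
    my $root  = { level => -1, text => '', children => [] };
    my @stack = ($root);    # innermost open block is the last element

    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;    # ignore blank lines

        my ($indent, $body) = $line =~ /^( *)(.*)$/;
        my $node = {
            level    => length($indent),
            text     => $body,
            children => [],
        };

        # Close every open block that is not strictly shallower than
        # this line; the root (level -1) is never popped.
        pop @stack while $stack[-1]{level} >= $node->{level};

        push @{ $stack[-1]{children} }, $node;    # attach to its parent
        push @stack, $node;                       # it may open a new block
    }

    return $root->{children};
}

my $tree = build_tree("%p\n  %span foo\n-# haml comment\n");
```

With that input I'd expect two top-level nodes, with the %span line nested under %p and the comment as a sibling of %p. What I can't work out is how to get the same nesting out of the block rule in the grammar.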