I noticed a bug in:
    # Consume matching text, if found next:

    sub eat {
        my(
            $re,        # Regex or string to match against next part of doc.
            $keep_ws,   # Don't skip over whitespace first?
            $sv,        # Append matched text to referenced scalar (if any).
            $to_end,    # Allow match to hit end of $Buf?
            $peek,      # Keep pos() unchanged?
        ) = @_;
        $Buf =~ /\G\s+/gc
            if  ! $keep_ws;
        my $start = pos( $Buf );
        $re = _compile( $re );  # Prepends \G also
        do {
            return 0
                if  $Buf !~ /$re/gc;
        } while(  ! $to_end  &&  _hit_end(\$start)  );
        $$sv .= substr( $Buf, $start, pos($Buf)-$start )
            if  $sv;
        pos($Buf) = $start
            if  $peek;
        return 1;
    }
In trying to minimize how often $Buf gets extended, and thinking mostly about the cases I ran into in my examples, I neglected the more obvious case: looking for the next token when you are right up against the end of $Buf.
There is no perfect solution to this problem. My preferred solution is to just avoid matching potentially long strings as single tokens. Then I can pick an arbitrary threshold like "1024 more chars must be present in the buffer" that is tiny in comparison to how much I read into the buffer each time (which reduces the percentage of document text that I end up copying as I slide the window along) and yet far too long to worry about a single token not fitting into it.
You can see that approach in parse_string().
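As a rough illustration of that "keep at least 1024 chars ahead" idea (not the actual parse_string() code), here is a minimal sketch. The helper name ensure_lookahead(), the 64KB chunk size, and passing the buffer by reference are all assumptions made for the example:

```perl
use strict;
use warnings;

my $MIN_AHEAD = 1024;        # assumed: chars that must remain ahead of pos()
my $CHUNK     = 64 * 1024;   # assumed: how much to read per refill

# Hypothetical helper: make sure at least $MIN_AHEAD chars follow pos().
# Returns the number of chars available ahead of pos() afterward.
sub ensure_lookahead {
    my( $fh, $buf_ref ) = @_;
    my $pos   = pos( $$buf_ref ) // 0;
    my $avail = length( $$buf_ref ) - $pos;
    return $avail
        if  $avail >= $MIN_AHEAD;
    # Slide the window: drop the consumed text, so only the short
    # unconsumed tail (under $MIN_AHEAD chars) ever gets copied.
    substr( $$buf_ref, 0, $pos, '' );
    # Append the next chunk after the tail.
    read( $fh, $$buf_ref, $CHUNK, length $$buf_ref );
    pos( $$buf_ref ) = 0;    # unconsumed text now starts the buffer
    return length $$buf_ref;
}
```

With these numbers, each refill copies fewer than 1024 chars for every 65536 chars read, which is the small copying overhead described above.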
The consequence of such an approach is that, for example, when parsing XML you couldn't match a whole XML tag as a single token. Even something as simple as "</b>" you'd end up matching in at least two steps, first '</' then '\w+\s*>', for example. More likely you'd match '<', '/', '\w+', then '>' (which works nicely because you get skipping of whitespace automatically and you can easily pull out the tag name without using capturing parens).
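The four-step match of a closing tag might look roughly like this. It is only a sketch: eat_simple() is a hypothetical cut-down stand-in for eat() that assumes the whole document is already in $Buf (so no buffer extension), and close_tag() is a made-up name:

```perl
use strict;
use warnings;

our $Buf = '';

# Hypothetical, simplified eat(): no buffer extension, no $to_end/$peek.
sub eat_simple {
    my( $re, $sv ) = @_;
    $Buf =~ /\G\s+/gc;                  # skip leading whitespace
    my $start = pos( $Buf ) // 0;
    return 0
        unless  $Buf =~ /\G(?:$re)/gc;  # /c keeps pos() on failure
    $$sv .= substr( $Buf, $start, pos($Buf) - $start )
        if  $sv;
    return 1;
}

# Match a closing tag as four small tokens; returns the tag name or undef.
sub close_tag {
    my $name = '';
    return undef
        unless  eat_simple( '<' )
            &&  eat_simple( '/' )
            &&  eat_simple( '\w+', \$name )
            &&  eat_simple( '>' );
    return $name;
}
```

Because whitespace is skipped before every token, this sketch also accepts input such as "</b >", and the tag name lands in $name without any capturing parens.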
But that still potentially imposes limitations, like not allowing tag names that are longer than 1024 characters. Luckily, the '\w+' case is easily handled by the version of eat() above: not having the full tag name in the buffer will not prevent '\w+' from matching the first part of the tag name, so the code will extend the buffer and run the match again, eventually reading in the whole 8GB tag name. ;)
So a better version would be more like:
    # Consume matching text, if found next:

    sub eat {
        my(
            $re,        # Regex or string to match against next part of doc.
            $keep_ws,   # Don't skip over whitespace first?
            $sv,        # Append matched text to referenced scalar (if any).
            $to_end,    # Allow match to hit end of $Buf?
            $peek,      # Keep pos() unchanged?
        ) = @_;
        0 while  ! $keep_ws  &&  eat( '\s+', 1, '', 1 );
        my $start = pos( $Buf );
        _hit_end( \$start );
        $re = _compile( $re );  # Prepends \G also
        do {
            return 0
                if  $Buf !~ /$re/gc;
        } while(  ! $to_end  &&  _hit_end(\$start)  );
        $$sv .= substr( $Buf, $start, pos($Buf)-$start )
            if  $sv;
        pos($Buf) = $start
            if  $peek;
        return 1;
    }
Also, _hit_end() surely needs to keep track of line numbers for the sake of reporting syntax errors and should track when end-of-file is hit so it can short-circuit and just return false.
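Here is one way those two additions could look. This is a guess at the shape, not the real _hit_end(): the names _hit_end_sketch(), $Fh, $Line, $Eof, and $Chunk are hypothetical, and it assumes the function returns true only when it extended the buffer (so the caller should rerun the match):

```perl
use strict;
use warnings;

our $Buf   = '';
our $Fh;                  # assumed: input handle
our $Line  = 1;           # assumed: line number of the first char in $Buf
our $Eof   = 0;           # assumed: set once a read returns nothing
our $Chunk = 64 * 1024;   # assumed: read size

# Returns 1 if pos() was at the end of $Buf and more text got read in
# (so the caller should match again from $$start_ref), else 0.
sub _hit_end_sketch {
    my( $start_ref ) = @_;
    my $pos = pos( $Buf ) // 0;
    return 0
        if  $Eof  ||  $pos < length $Buf;   # short-circuit at end-of-file
    my $got = read( $Fh, $Buf, $Chunk, length $Buf );
    if(  ! $got  ) {
        $Eof = 1;
        pos( $Buf ) = $pos;     # leave the caller's match intact
        return 0;
    }
    # Slide the window, counting the newlines that are being discarded.
    my $gone = substr( $Buf, 0, $$start_ref, '' );
    $Line += ( $gone =~ tr/\n// );
    $$start_ref = 0;
    pos( $Buf ) = 0;            # the rerun match starts at the new $start
    return 1;
}
```

A syntax-error report could then add the newlines between the start of $Buf and pos() to $Line to get the current line.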
- tye
In reply to Re^2: How would you parse this? (bug)
by tye
in thread How would you parse this?
by BrowserUk