I noticed a bug in:

# Consume matching text, if found next:
sub eat {
    my( $re,            # Regex or string to match against next part of doc.
        $keep_ws,       # Don't skip over whitespace first?
        $sv,            # Append matched text to referenced scalar (if any).
        $to_end,        # Allow match to hit end of $Buf?
        $peek,          # Keep pos() unchanged?
    ) = @_;
    $Buf =~ /\G\s+/gc
        if  ! $keep_ws;
    my $start = pos( $Buf );
    $re = _compile( $re );  # Prepends \G also
    do {
        return 0
            if  $Buf !~ /$re/gc;
    } while( ! $to_end && _hit_end(\$start) );
    $$sv .= substr( $Buf, $start, pos($Buf)-$start )
        if  $sv;
    pos($Buf) = $start
        if  $peek;
    return 1;
}

In trying to minimize how often $Buf gets extended, and thinking mostly about the cases I ran into in my examples, I neglected the more obvious case: looking for the next token when you are right up against the end of $Buf.

There is no perfect solution to this problem. My preferred solution is simply to avoid matching potentially long strings as single tokens. Then I can pick an arbitrary threshold like "1024 more chars must be present in the buffer". That is tiny in comparison to how much I read into the buffer each time (which reduces the percentage of document text that I end up copying as I slide the window along) and yet far too long to worry about a single token not fitting into it.

You can see that approach in parse_string().
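parse_string() is not reproduced in this post, so here is only a rough sketch of the "keep at least 1024 chars ahead" idea, not that code. The name ensure_lookahead(), the 64K chunk size, and the choice to return the number of chars available are all assumptions made for illustration:

```perl
use strict;
use warnings;

my $MIN_AHEAD = 1024;       # chars that must remain past pos()
my $CHUNK     = 64 * 1024;  # chars read per refill (assumed size)

# Hypothetical helper: make sure at least $MIN_AHEAD chars follow
# pos($$buf), dropping already-consumed text so that the buffer
# does not grow without bound.  Returns the number of chars that
# are available past pos() afterward.
sub ensure_lookahead {
    my( $fh, $buf ) = @_;
    my $pos = pos( $$buf ) || 0;
    if(  length( $$buf ) - $pos < $MIN_AHEAD  ) {
        substr( $$buf, 0, $pos, '' );   # slide the window
        read( $fh, $$buf, $CHUNK, length $$buf );
        $pos = 0;
    }
    pos( $$buf ) = $pos;                # set explicitly after any change
    return length( $$buf ) - $pos;
}
```

Because the chunk is so much larger than the threshold, only a small tail of each chunk gets copied when the window slides.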

The consequence of such an approach is that, for example, when parsing XML you can't match a whole XML tag as a single token. Even something as simple as "</b>" you'd end up matching in at least two steps, first '</' then '\w+\s*>', for example. More likely you'd match '<', '/', '\w+', then '>' (which works nicely because you get skipping of whitespace automatically and you can easily pull out the tag name without using capturing parens).
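To make the four-step version concrete, here is a minimal sketch. eat_simple() is a cut-down stand-in for eat() that I made up for this example: it has no buffer refilling and no $keep_ws, $to_end or $peek, and close_tag() is likewise hypothetical:

```perl
use strict;
use warnings;

my $Buf = "</ b  > trailing text";
pos( $Buf ) = 0;

# Simplified stand-in for eat(): skip whitespace, then try to match
# $re at the current position, appending the match to $$sv if given.
sub eat_simple {
    my( $re, $sv ) = @_;
    $Buf =~ /\G\s+/gc;                  # skip leading whitespace
    my $start = pos( $Buf ) || 0;
    return 0
        unless  $Buf =~ /\G(?:$re)/gc;
    $$sv .= substr( $Buf, $start, pos($Buf) - $start )
        if  $sv;
    return 1;
}

# Match a closing tag as four small tokens instead of one long one.
# Returns the tag name, or nothing if the tokens are not all there.
sub close_tag {
    my $name = '';
    return
        unless  eat_simple( '<' )
            &&  eat_simple( '/' )
            &&  eat_simple( '\w+', \$name )
            &&  eat_simple( '>' );
    return $name;
}
```

Note that $name picks up just the tag name with no capturing parens, and the stray whitespace inside the tag is skipped for free. A failed close_tag() leaves pos() wherever the last successful token ended; a real parser would have to decide what to do about that.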

But that still potentially imposes limitations, like not allowing tag names that are longer than 1024 characters. Luckily, the '\w+' case is easily handled by the version of eat() above: not having the full tag name in the buffer will not prevent '\w+' from matching the first part of the tag name, so the code will extend the buffer and run the match again, eventually reading in the whole 8GB tag name. ;)
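Here is a toy illustration of that extend-and-retry behavior. The chunked @Chunks input, hit_end_simple() and eat_word() are all invented for the example and only approximate what _hit_end() might do:

```perl
use strict;
use warnings;

# Hypothetical input source that hands out text a few chars at a time.
my @Chunks = ( 'very', 'long', 'name', '> rest' );
my $Buf = '';

# Stand-in for _hit_end(): if pos() is at the end of $Buf and more
# input exists, append it, rewind to $$start, and ask for a retry.
sub hit_end_simple {
    my( $start ) = @_;
    return 0
        if  ( pos( $Buf ) || 0 ) < length( $Buf )  ||  ! @Chunks;
    $Buf .= shift @Chunks;
    pos( $Buf ) = $$start;              # rewind explicitly
    return 1;
}

# Match \w+, retrying for as long as the match runs into the buffer end.
sub eat_word {
    my( $sv ) = @_;
    my $start = pos( $Buf ) || 0;
    hit_end_simple( \$start );          # top up an exhausted buffer first
    do {
        return 0
            unless  $Buf =~ /\G\w+/gc;
    } while( hit_end_simple( \$start ) );
    $$sv .= substr( $Buf, $start, pos($Buf) - $start );
    return 1;
}
```

Each partial match ('very', then 'verylong', ...) touches the end of the buffer, so another chunk is appended and the match is rerun from the same starting position. The loop ends once the match stops short of the end of the buffer or the input runs dry.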

So a better version would be more like:

# Consume matching text, if found next:
sub eat {
    my( $re,            # Regex or string to match against next part of doc.
        $keep_ws,       # Don't skip over whitespace first?
        $sv,            # Append matched text to referenced scalar (if any).
        $to_end,        # Allow match to hit end of $Buf?
        $peek,          # Keep pos() unchanged?
    ) = @_;
    0   while   ! $keep_ws
            &&  eat( '\s+', 1, '', 1 );
    my $start = pos( $Buf );
    _hit_end( \$start );
    $re = _compile( $re );  # Prepends \G also
    do {
        return 0
            if  $Buf !~ /$re/gc;
    } while( ! $to_end && _hit_end(\$start) );
    $$sv .= substr( $Buf, $start, pos($Buf)-$start )
        if  $sv;
    pos($Buf) = $start
        if  $peek;
    return 1;
}

Also, _hit_end() surely needs to keep track of line numbers for the sake of reporting syntax errors and should track when end-of-file is hit so it can short-circuit and just return false.
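Something along these lines might do for the bookkeeping. This is an untested-in-anger sketch, not the real _hit_end(): refill(), line_at(), $Line, $Eof and the 4096-char read size are names and values I made up here:

```perl
use strict;
use warnings;

my $Buf  = '';
my $Line = 1;       # line number of the first char in $Buf
my $Eof  = 0;       # set once the input is exhausted

# Hypothetical refill: drop text before $$start (counting its newlines
# first), then read more.  Returns false at, and ever after, end-of-file.
sub refill {
    my( $fh, $start ) = @_;
    return 0
        if  $Eof;                       # short-circuit after end-of-file
    my $gone = substr( $Buf, 0, $$start, '' );
    $Line += $gone =~ tr/\n//;
    $$start = 0;
    my $got = read( $fh, $Buf, 4096, length $Buf );
    $Eof = 1
        if  ! $got;
    pos( $Buf ) = 0;
    return $got ? 1 : 0;
}

# Line number of an offset within the current buffer, for error messages.
sub line_at {
    my( $offset ) = @_;
    my $before = substr( $Buf, 0, $offset );
    return $Line + ( $before =~ tr/\n// );
}
```

Counting newlines only in the text being discarded keeps the cost proportional to what is thrown away, and line_at() only has to scan the current buffer when an error actually needs reporting.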

- tye

In reply to Re^2: How would you parse this? (bug) by tye
in thread How would you parse this? by BrowserUk
