Re^2: How would you parse this? (bug)

I noticed a bug in:

# Consume matching text, if found next:
sub eat {
    my( $re,            # Regex or string to match against next part of doc.
        $keep_ws,       # Don't skip over whitespace first?
        $sv,            # Append matched text to referenced scalar (if any).
        $to_end,        # Allow match to hit end of $Buf?
        $peek,          # Keep pos() unchanged?
    ) = @_;
    $Buf =~ /\G\s+/gc
        if  ! $keep_ws;
    my $start = pos( $Buf );
    $re = _compile( $re );  # Prepends \G also
    do {
        return 0
            if  $Buf !~ /$re/gc;
    } while( ! $to_end && _hit_end(\$start) );
    $$sv .= substr( $Buf, $start, pos($Buf)-$start )
        if  $sv;
    pos($Buf) = $start
        if  $peek;
    return 1;
}

In trying to minimize how often $Buf gets extended, and thinking mostly about the cases I ran into in my examples, I neglected the more obvious case: looking for the next token when you are right up against the end of $Buf.

There is no perfect solution to this problem. My preferred solution is to avoid matching potentially long strings as single tokens. Then I can pick an arbitrary threshold, like "1024 more chars must be present in buffer", that is tiny compared to how much I read into the buffer each time (which reduces the percentage of document text that I end up copying as I slide the window along) and yet far too long to worry about a single token not fitting into it.

You can see that approach in parse_string().
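For readers without parse_string() at hand, here is a minimal sketch of the "N more chars must be present" idea. It is not the code from parse_string(); the names ensure_margin(), $MARGIN and $CHUNK are made up for illustration, and the 64KB read size is only an assumption.

```perl
use strict;
use warnings;

my $MARGIN = 1024;        # lookahead that must be available (figure from the text)
my $CHUNK  = 64 * 1024;   # assumed read size, much larger than $MARGIN

# Hypothetical helper: make sure at least $MARGIN characters follow
# pos($$buf_ref), reading more input if needed. Returns false once
# fewer than $MARGIN characters remain because the input ran out.
sub ensure_margin {
    my( $fh, $buf_ref ) = @_;
    my $pos = pos( $$buf_ref ) // 0;
    return 1
        if  length( $$buf_ref ) - $pos >= $MARGIN;
    substr( $$buf_ref, 0, $pos, '' );   # Slide the window: drop consumed text
    while( length( $$buf_ref ) < $MARGIN ) {
        my $got = read( $fh, $$buf_ref, $CHUNK, length $$buf_ref );
        die "read failed: $!"
            if  ! defined $got;
        last
            if  ! $got;                 # End of file
    }
    pos( $$buf_ref ) = 0;               # Unconsumed text now starts the buffer
    return length( $$buf_ref ) >= $MARGIN ? 1 : 0;
}
```

Because only the unconsumed tail (under $MARGIN chars) is copied on each refill, the copying cost stays small relative to $CHUNK.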

The consequence of such an approach is that, for example, when parsing XML you couldn't match a whole XML tag as a single token. Even something as simple as "</b>" you'd end up matching in at least two steps, first '</' then '\w+\s*>', for example. More likely you'd match '<', '/', '\w+', then '>' (which works nicely because you get skipping of whitespace automatically and you can easily pull out the tag name without using capturing parens).
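As a rough illustration of matching a closing tag in four small steps, here is a self-contained sketch. eat_token() is a simplified, hypothetical stand-in for eat() (no buffer extension, and it returns the matched text instead of appending to a scalar), and close_tag_name() is likewise invented for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for eat(): skip whitespace, then try to match
# $re at the current position. Returns the matched text or undef.
sub eat_token {
    my( $buf_ref, $re ) = @_;
    $$buf_ref =~ /\G\s+/gc;             # Skip leading whitespace
    return $$buf_ref =~ /\G($re)/gc ? $1 : undef;
}

# Match a closing tag as four tokens; returns the tag name or undef.
sub close_tag_name {
    my( $buf_ref ) = @_;
    my $start = pos( $$buf_ref ) // 0;
    my $name;
    if(     defined eat_token( $buf_ref, qr{<} )
        &&  defined eat_token( $buf_ref, qr{/} )
        &&  defined( $name = eat_token( $buf_ref, qr{\w+} ) )
        &&  defined eat_token( $buf_ref, qr{>} )
    ) {
        return $name;
    }
    pos( $$buf_ref ) = $start;          # Rewind if any step failed
    return undef;
}

my $doc  = "  </b >rest";
my $name = close_tag_name( \$doc );
print "closed: $name\n"
    if  defined $name;
```

The tag name comes back from the '\w+' step by itself, so the caller never has to pick it out of a larger match.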

But that still potentially imposes limitations, such as not allowing tag names longer than 1024 characters. Luckily, the '\w+' case is easily handled by the version of eat() above: not having the full tag name in the buffer will not prevent '\w+' from matching the first part of the tag name, so the code will extend the buffer and run the match again, eventually reading in the whole 8GB tag name. ;)
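The retry behaviour described here can be sketched in isolation. This is not eat() itself: match_growing() and its $more callback (which returns the next chunk of input, or undef when there is none) are hypothetical names used only to show the "match hit the end, so extend and re-run" loop.

```perl
use strict;
use warnings;

# Re-run the match from the same start each time it runs into the end
# of the buffer, appending more input in between.
sub match_growing {
    my( $buf_ref, $re, $more ) = @_;
    my $start = pos( $$buf_ref ) // 0;
    while( 1 ) {
        pos( $$buf_ref ) = $start;      # Always retry from the token start
        return undef
            if  $$buf_ref !~ /\G$re/gc;
        last                            # Match ended before end of buffer
            if  pos( $$buf_ref ) < length $$buf_ref;
        my $chunk = $more->();
        last                            # No more input: accept what we have
            if  ! defined $chunk;
        $$buf_ref .= $chunk;
    }
    return substr( $$buf_ref, $start, pos( $$buf_ref ) - $start );
}

my @chunks = ( 'me', '>' );
my $buf    = '<longna';
pos( $buf ) = 1;                        # Just past the '<'
my $tag = match_growing( \$buf, qr{\w+}, sub { shift @chunks } );
```

This only works for patterns like '\w+' where a truncated buffer still yields a partial match; a pattern that needs its final character present would simply fail on the short buffer.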

So a better version would be more like:

# Consume matching text, if found next:
sub eat {
    my( $re,            # Regex or string to match against next part of doc.
        $keep_ws,       # Don't skip over whitespace first?
        $sv,            # Append matched text to referenced scalar (if any).
        $to_end,        # Allow match to hit end of $Buf?
        $peek,          # Keep pos() unchanged?
    ) = @_;
    0   while   ! $keep_ws
            &&  eat( '\s+', 1, '', 1 );
    my $start = pos( $Buf );
    _hit_end( \$start );
    $re = _compile( $re );  # Prepends \G also
    do {
        return 0
            if  $Buf !~ /$re/gc;
    } while( ! $to_end && _hit_end(\$start) );
    $$sv .= substr( $Buf, $start, pos($Buf)-$start )
        if  $sv;
    pos($Buf) = $start
        if  $peek;
    return 1;
}

Also, _hit_end() surely needs to keep track of line numbers for the sake of reporting syntax errors and should track when end-of-file is hit so it can short-circuit and just return false.
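A sketch of what such a _hit_end() might look like follows. The real _hit_end() is not shown in this thread, so everything here is an assumption: the globals $Fh, $Eof, $Line and $Chunk are invented, and so is the convention that a true return means "the buffer was extended, pos() was reset to the token start, re-run the match".

```perl
use strict;
use warnings;

our $Buf   = '';          # Parse buffer, as used by eat()
our $Fh;                  # Assumed: input handle opened by the caller
our $Eof   = 0;           # Set once the input is exhausted
our $Line  = 1;           # Line number of the first character in $Buf
our $Chunk = 64 * 1024;   # Assumed read size

sub _hit_end {
    my( $start_ref ) = @_;
    my $pos = pos( $Buf ) // 0;
    return 0                            # Short-circuit after end-of-file
        if  $Eof  ||  $pos < length $Buf;
    my $got = read( $Fh, $Buf, $Chunk, length $Buf );
    die "read failed: $!"
        if  ! defined $got;
    if( ! $got ) {
        $Eof = 1;
        pos( $Buf ) = $pos;             # Leave the match where it ended
        return 0;
    }
    # Drop text before the current token, counting the lines it held:
    my $gone = substr( $Buf, 0, $$start_ref, '' );
    $Line += ( $gone =~ tr/\n// );
    $$start_ref = 0;
    pos( $Buf ) = 0;                    # Caller re-runs its match from here
    return 1;
}
```

To report a syntax error, the line of the current position would be $Line plus the number of newlines in $Buf before pos($Buf).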

- tye


Replies are listed 'Best First'.

Re^3: How would you parse this? (bug)
by BrowserUk (Patriarch) on Oct 26, 2013 at 17:18 UTC

I noticed a bug in:

Just a heads up. I won't be parsing documents that will come even close to being too big for memory; so don't expend further effort in this area on my behalf.

An update on my thinking so far is that I'm caught in the dilemma of hand-coding the parser to just do what I need, whilst seeing the potential for abstracting out some of the common components and writing code to generate it.

And then I come full circle and realise that I'm just as likely to make all the same mistakes as those modules I've decried :)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.
