in reply to XML::SAX::PurePerl Performance

If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. Write them as function calls and you will get a significant speedup. Since the common path when you call next should be that the next character is in the buffer, you can get a further speedup by moving the buffer check into nextchar and skipping the call entirely. Adding this logic should be a speedup even if the logic is left alone in the original function because of the relative cost of an if and a function call.
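A minimal sketch of that idea, not the module's actual code: the names next_char and refill, the slot layout, and the string-backed SOURCE are all hypothetical stand-ins. The common case (a character is already buffered) never leaves next_char; only an empty buffer pays for the extra call.

```perl
use strict;
use warnings;

# Hypothetical slot layout for an array-based reader object.
use constant { BUFFER => 0, CURRENT => 1, SOURCE => 2 };

# Slow path: pull up to 1024 more characters from the remaining input.
# Returns the number of characters now in the buffer (0 at end of input).
sub refill {
    my ($self) = @_;
    $self->[BUFFER] = substr($self->[SOURCE], 0, 1024, '');
    return length $self->[BUFFER];
}

# Fast path: a plain function with the buffer check inlined, so the
# refill call only happens when the buffer is actually empty.
sub next_char {
    my ($self) = @_;
    if (!length $self->[BUFFER]) {
        refill($self) or return $self->[CURRENT] = undef;
    }
    return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, '');
}

my $reader = [ '', undef, '<a>hi</a>' ];
my $seen   = '';
while (defined(my $c = next_char($reader))) {
    $seen .= $c;
}
print "$seen\n";
```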

As for the internals of the next method, I haven't run a benchmark, but the basic approaches I would consider are the substr approach, splitting the string into an internal array you shift off, and using a /(.)/gs pattern match. (A warning on the last: use of $' etc. anywhere in the program will slow that down. This may be a good reason to avoid it no matter what benchmarking says.) I assume you have tried all of them? (Probably, but it doesn't hurt to check.)
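For reference, the three approaches look roughly like this. The function names are made up for illustration, and each one simply returns the characters of its argument so the results can be compared; wrap them in Benchmark to get timings for your own perl.

```perl
use strict;
use warnings;

my $input = "<doc>\n</doc>";

# 1. Four-argument substr: remove and return the first character.
sub chars_substr {
    my ($buf) = @_;
    my @out;
    push @out, substr($buf, 0, 1, '') while length $buf;
    return @out;
}

# 2. Split once into an array, then shift characters off the front.
sub chars_split {
    my ($buf) = @_;
    my @queue = split //, $buf;
    my @out;
    push @out, shift @queue while @queue;
    return @out;
}

# 3. Global match in scalar context; /s lets . match newlines too.
sub chars_regex {
    my ($buf) = @_;
    my @out;
    push @out, $1 while $buf =~ /(.)/gs;
    return @out;
}

for my $impl (\&chars_substr, \&chars_split, \&chars_regex) {
    my @chars = $impl->($input);
    print scalar(@chars), " characters\n";
}
```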

An incidental note. Your goto can be removed from next with no perceivable performance change. Make the if condition be empty, put an else around throwing the exception, and then move the BUFFERED_READ section after the decision logic. This should also be marginally faster because Perl doesn't have to spend time figuring out where the goto goes. (That shouldn't be the common path though, so the change should be marginal.)
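The shape I have in mind is something like the following. This is a guess at the structure rather than your real code: next_char, the slot constants, the string-backed SOURCE and the plain die (standing in for the thrown parse exception) are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical slot layout for an array-based reader object.
use constant { BUFFER => 0, CURRENT => 1, SOURCE => 2 };

sub next_char {
    my ($self) = @_;

    # Decision logic: the branch that used to hold the goto is now empty.
    if (length $self->[BUFFER]) {
        # Data is already buffered; fall through to the buffered read.
    }
    elsif (length $self->[SOURCE]) {
        # Refill the buffer, then fall through as well.
        $self->[BUFFER] = substr($self->[SOURCE], 0, 1024, '');
    }
    else {
        die "Unexpected end of input\n";    # stands in for the exception
    }

    # The former BUFFERED_READ section, reached without any goto.
    return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, '');
}

my $reader = [ '', undef, 'ab' ];
print next_char($reader), next_char($reader), "\n";
```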

I am betting that 5.005 performance is not a priority of yours. But if you are allowing the logic to skip going to next anyways, you can replace a lot of the 5.005 logic with something like the following (untested):

    my $len = $n >= 0xFC ? 5 :
              $n >= 0xF8 ? 4 :
              $n >= 0xF0 ? 3 :
              $n >= 0xE0 ? 2 :
              $n >= 0xC0 ? 1 :
              throw XML::SAX::Exception::Parse(
                  Message      => sprintf("Invalid character 0x%x", $n),
                  ColumnNumber => $self->column,
                  LineNumber   => $self->line,
                  PublicId     => $self->public_id,
                  SystemId     => $self->system_id,
              );

    if ($len <= length($self->[BUFFER])) {
        $current .= substr($self->[BUFFER], 0, $len, '');
        $self->[CURRENT] = substr($self->[BUFFER], -1);
    }
    else {
        $len -= length($self->[BUFFER]);
        $current .= $self->[BUFFER];
        $self->[BUFFER] = '';
        while (-1 < --$len) {
            next($self);
            $current .= $self->[CURRENT];
        }
    }

Again, it is more complex, but the fact you avoid a whole series of function calls should be a significant speedup.

Re: Re (tilly) 1: XML::SAX::PurePerl Performance
by Matts (Deacon) on Feb 05, 2002 at 13:36 UTC
    If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. Write them as function calls and you will get a significant speedup.

    Actually that's more of a myth than a truism. I did try it, but the speedup wasn't significant, though it varies from perl to perl.

    However I think there may be some value in the buffer check being moved to nextchar. I was originally thinking it needed to be in next() because sometimes next() is called on its own (for the encoding detection routines which need a byte-by-byte view), but that read()s in character by character anyway, so it might be a reasonable optimisation.

    You're right, I have tried all of the various "give me a character" methods, and substr() comes out on top. Which is a pain in the ass really - it's one point where Perl loses out to Python, where you can do string[0] to get the first character, just like you can in C.

    I'll leave the goto as is for now. It's not as bad as people make out - it's only bad when it's used for all flow control, and I think its intention is quite clear here. Plus given a 1024-byte buffer, it's only part of the path 70 times in the parse of this 70K XML file.

    I do like that last optimisation though - that coupled with moving the buffer test into nextchar() might make a big difference (though maybe not as I don't think the particular test file in question has any UTF-8 characters in it). I'll try it and come back and let you know.

      YMMV, but when I tested it on my machine the popular myth was correct: a method call was massively slower than a function call. (Of course if you do anything interesting in your function...)

      Speaking of functions, I am kind of wondering what the purpose of next is. It seems from the code I see in it that it was intended to keep track of things like line numbers. But I don't see the rest of the code that would be needed to do that. (Sign of a change in design?) If that is the case, then what next is really providing is buffering.

      But isn't buffering exactly what read is supposed to do for you? OK, its speed is highly platform (and compilation option) dependent. But it seems to me that either you are better off using read, or else next should remove the additional buffering by using sysread.
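      To make the sysread option concrete, here is a rough sketch of doing your own buffering on top of unbuffered reads. The next_byte function, the state hash and the temporary file are hypothetical; the 1024-byte block size is just the buffer size mentioned above.

      ```perl
      use strict;
      use warnings;
      use File::Temp qw(tempfile);

      # Write a small sample document to a temporary file, since sysread
      # needs a real file descriptor to read from.
      my ($out, $path) = tempfile(UNLINK => 1);
      print {$out} "<doc>hello</doc>";
      close $out or die "close: $!";

      open my $in, '<', $path or die "open: $!";
      binmode $in;

      # Hypothetical reader state: the handle plus our own buffer.
      my $state = { fh => $in, buffer => '' };

      # Return the next byte, refilling the buffer with one sysread per
      # 1024-byte block. Returns undef at end of file.
      sub next_byte {
          my ($s) = @_;
          if (!length $s->{buffer}) {
              my $got = sysread($s->{fh}, $s->{buffer}, 1024);
              die "sysread: $!" unless defined $got;
              return undef unless $got;
          }
          return substr($s->{buffer}, 0, 1, '');
      }

      my $text = '';
      while (defined(my $byte = next_byte($state))) {
          $text .= $byte;
      }
      close $in;
      print "$text\n";
      ```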

        Points answered in order:

        The methods vs functions thing looks bad in a benchmark, but generally only when you make something like 100_000 calls. And usually only with empty function bodies, like you suggest. So the call itself may appear to be twice as slow, but it doesn't really show up that badly in real-life applications. Plus in 5.7.2, Doug MacEachern of mod_perl fame has made it so that sometimes method calls can be faster than function calls (don't ask me how - his voodoo is way beyond mine).
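        For anyone who wants to measure this on their own perl, a rough sketch using the core Benchmark module follows. The Reader class and its peek method are invented for the test, and the body does a small amount of real work rather than being empty; results will vary from perl to perl.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Hypothetical class with one small, non-empty method.
        package Reader;

        sub new {
            my ($class, $text) = @_;
            return bless { text => $text, pos => 0 }, $class;
        }

        sub peek {
            my ($self) = @_;
            return substr($self->{text}, $self->{pos}, 1);
        }

        package main;

        my $reader = Reader->new('<doc/>');

        # Same sub body, reached once through method dispatch and once
        # through a fully qualified function call.
        cmpthese(500_000, {
            method   => sub { my $c = $reader->peek },
            function => sub { my $c = Reader::peek($reader) },
        });
        ```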

        The purpose of next, I admit, has been lost in a series of refactorings. I think it's either time to stop refactoring for speed and try to clean things up, or stop refactoring for speed and fix the remaining compliance issues instead ;-)

        I thought read was supposed to buffer too. But I was surprised to see a speedup when I did some buffering of my own. Maybe it only buffers if you ask for a significant number of characters? I have no idea what the internals are of it all, I only know to not believe everything you read, even from gurus ;-)