If you are aiming for performance, my first question would be whether you need next and nextchar to be overridable. If not, write them as plain function calls rather than method calls and you will get a significant speedup. Since the common path when you call next should be that the next character is already in the buffer, you can get a further speedup by moving the buffer check into nextchar and skipping the call to next entirely in that case. Adding this check should be a speedup even if the same logic is left alone in the original function, because an if is much cheaper than a function call.
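To make that concrete, here is a minimal sketch of the inlined buffer check. The package name My::Reader, the SOURCE slot, the 4096-character refill size and the helper next_char_slow are all assumptions made up for illustration; they are not taken from XML::SAX::PurePerl. The helper is deliberately not called next, so it can be called as a plain function without colliding with Perl's loop-control keyword.

```perl
use strict;
use warnings;

package My::Reader;    # hypothetical stand-in for the real reader class

# Array-based object with constant indices (SOURCE is an assumption).
use constant { BUFFER => 0, CURRENT => 1, SOURCE => 2 };

sub new {
    my ($class, $text) = @_;
    return bless ['', undef, $text], $class;
}

# Slow path: refill the buffer from the source, then take one character.
sub next_char_slow {
    my $self = shift;
    $self->[BUFFER]  = substr($self->[SOURCE], 0, 4096, '');
    $self->[CURRENT] = length($self->[BUFFER])
        ? substr($self->[BUFFER], 0, 1, '')
        : undef;                      # end of input
    return $self->[CURRENT];
}

# Fast path: test the buffer inline and only pay for a call when it is empty.
sub nextchar {
    my $self = $_[0];
    if (length $self->[BUFFER]) {
        return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, '');
    }
    return next_char_slow($self);     # plain function call, no method lookup
}

package main;

my $reader = My::Reader->new("abc");
my $out    = '';
while (defined(my $c = My::Reader::nextchar($reader))) {
    $out .= $c;
}
print "$out\n";
```

The trade-off is the one already mentioned: a subclass can no longer override these routines, because nothing goes through method dispatch.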

As for the internals of the next method, I haven't run a benchmark, but the basic approaches I would consider are the substr approach, splitting the string into an internal array that you shift characters off, and using a /(.)/gs pattern match. (A warning on the last one: use of $' etc. anywhere in the program will slow that down. This may be a good reason to avoid it no matter what the benchmarks say.) I assume you have tried all of them? (Probably, but it doesn't hurt to check.)
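For anyone who wants to measure rather than guess, the three approaches could be compared along these lines. This is only a sketch: the sub names and the sample text are made up, and the relative results will depend on the Perl version and the input. It uses cmpthese from the core Benchmark module, and only runs the timing when "bench" is passed as the first argument.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $text = "some <xml>text</xml>\n" x 50;    # made-up sample input

# 1. substr: remove one character at a time from the front of a copy.
sub by_substr {
    my $buf = shift;
    my @out;
    push @out, substr($buf, 0, 1, '') while length $buf;
    return \@out;
}

# 2. split once into an array, then shift characters off it.
sub by_split {
    my @chars = split //, shift;
    my @out;
    while (@chars) { push @out, shift @chars }
    return \@out;
}

# 3. /(.)/gs: walk the string with a global match in scalar context.
sub by_regex {
    my $buf = shift;
    my @out;
    push @out, $1 while $buf =~ /(.)/gs;
    return \@out;
}

# Pass "bench" as the first argument to time each approach
# for at least one CPU second.
if (@ARGV and $ARGV[0] eq 'bench') {
    cmpthese(-1, {
        substr => sub { by_substr($text) },
        split  => sub { by_split($text) },
        regex  => sub { by_regex($text) },
    });
}
```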

An incidental note: your goto can be removed from next without any perceivable performance penalty. Make the if condition be empty, put an else around throwing the exception, and then move the BUFFERED_READ section after the decision logic. This should also be marginally faster because Perl doesn't have to spend time figuring out where the goto goes. (That shouldn't be the common path though, so the change should be marginal.)
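The shape of that restructuring might look like the sketch below. It is not the real next method: the sub names, the $encoding_ok flag and the die message are placeholders, and it reads "make the if empty" as leaving the if branch with nothing in it.

```perl
use strict;
use warnings;

# Hypothetical "before" shape: jump over the error with goto.
sub read_with_goto {
    my ($buffer_ref, $encoding_ok) = @_;
    if ($encoding_ok) {
        goto BUFFERED_READ;
    }
    die "unsupported encoding\n";    # placeholder for the real exception

  BUFFERED_READ:
    return substr($$buffer_ref, 0, 1, '');
}

# Same behaviour without goto: empty if branch, exception in the else,
# and the buffered read placed after the decision logic.
sub read_without_goto {
    my ($buffer_ref, $encoding_ok) = @_;
    if ($encoding_ok) {
        # nothing to do: fall through to the buffered read
    }
    else {
        die "unsupported encoding\n";
    }

    # formerly the BUFFERED_READ section
    return substr($$buffer_ref, 0, 1, '');
}
```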

I am betting that 5.005 performance is not a priority of yours. But if you are allowing the logic to skip going to next anyways, you can replace a lot of the 5.005 logic with something like the following (untested):

    # Number of bytes still to read after lead byte $n.
    my $len = $n >= 0xFC ? 5
            : $n >= 0xF8 ? 4
            : $n >= 0xF0 ? 3
            : $n >= 0xE0 ? 2
            : $n >= 0xC0 ? 1
            : throw XML::SAX::Exception::Parse(
                  Message      => sprintf("Invalid character 0x%x", $n),
                  ColumnNumber => $self->column,
                  LineNumber   => $self->line,
                  PublicId     => $self->public_id,
                  SystemId     => $self->system_id,
              );

    if ($len <= length($self->[BUFFER])) {
        # Everything needed is already buffered: take it in one go.
        $current .= substr($self->[BUFFER], 0, $len, '');
        $self->[CURRENT] = substr($self->[BUFFER], -1);
    }
    else {
        # Use up what is buffered, then fetch the rest one character at a time.
        $len -= length($self->[BUFFER]);
        $current .= $self->[BUFFER];
        $self->[BUFFER] = '';
        while (-1 < --$len) {
            # The & is needed: a bare next(...) parses as the loop-control keyword.
            &next($self);
            $current .= $self->[CURRENT];
        }
    }
Again, it is more complex, but avoiding a whole series of function calls should give a significant speedup.
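The length calculation on its own is easy to check in isolation. The sketch below pulls the same thresholds into a standalone sub; the name continuation_bytes is made up, and it returns undef rather than throwing so that it runs without XML::SAX installed. It assumes, as the snippet above seems to, that ASCII bytes are handled before this point.

```perl
use strict;
use warnings;

# Number of bytes that follow lead byte $n, using the same thresholds
# as the snippet above; undef for anything below 0xC0.
sub continuation_bytes {
    my $n = shift;
    return $n >= 0xFC ? 5
         : $n >= 0xF8 ? 4
         : $n >= 0xF0 ? 3
         : $n >= 0xE0 ? 2
         : $n >= 0xC0 ? 1
         :              undef;
}

# Lead bytes of a 2-, 3- and 4-byte sequence respectively.
printf "0x%02X -> %d\n", $_, continuation_bytes($_) for 0xC3, 0xE2, 0xF0;
```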

In reply to Re (tilly) 1: XML::SAX::PurePerl Performance by tilly
in thread XML::SAX::PurePerl Performance by Matts
