comment on

This is a hardcore performance question, hope you like it.

I'm the author of XML::SAX::PurePerl, a pure perl XML parser. It's close enough to complete that I've decided lately to examine it's performance issues. It's a very slow parser, especially compared to C based parsers (expat, libxml2), mostly because Perl itself is very slow at this task. So I've decided to post a breakdown of where I'm at, what I've done and hopefully get some ideas for further improvements.

First of all, this parser is character based. This means it parses based on seeing a single character in a stream at a time, and backtracks when it needs to. This makes things slow in Perl, because ironically while Perl deals with strings very quickly, it sucks at dealing with characters. I'm not going to change this architectural decision because it would require a complete re-write of the parser (which is already 3000 lines of code in total).

Anyway, down to the nitty gritty. I'm going to simply post what dprofpp tells me are the most time consuming functions, and then post those functions so you can take a look.

dprofpp output from dprofpp -O 6:

Total Elapsed Time = 9.712727 Seconds
  User+System Time = 9.672727 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 26.2   2.538  4.018  86883   0.0000 0.0000  XML::SAX::PurePerl::Reade
+r::nextchar
 24.5   2.379  4.518  60628   0.0000 0.0001  XML::SAX::PurePerl::Reade
+r::match_not
 23.3   2.259  1.999  86884   0.0000 0.0000  XML::SAX::PurePerl::Reade
+r::Stream::next
 11.3   1.099  5.442   4826   0.0002 0.0011  XML::SAX::PurePerl::Reade
+r::consume_not
 9.72   0.940  1.415  53003   0.0000 0.0000  XML::SAX::PurePerl::Reade
+r::match_char
 9.60   0.929  1.438  19261   0.0000 0.0001  XML::SAX::PurePerl::Reade
+r::match_re
[download]

Note that the percentages are a little confusing, since some of those call each other, so there's probably some overlap - I don't know how to get dprofpp to make that a non-issue (i.e. only time spent *actually* in the subroutine, not time spent calling other subs. I don't even think its possible to calculate that).

Anyway, here's the code for those six. Go for your life - shoot me if I've done something stupid. Don't post without running Benchmark.pm on things though - I've already spent a few days on this!

# get the next utf-8 character from the input stream
# and encode as UTF8 if possible/necessary
sub nextchar {
    my $self = shift;
    $self->next;

    return unless defined $self->[CURRENT];

    if ($self->[CURRENT] eq "\x0D") {
        $self->next;
        return unless defined($self->[CURRENT]);
        if ($self->[CURRENT] ne "\x0A") {
            $self->buffer("\x0A");
        }
    }

    return unless $self->[ENCODING];
    my $n = ord($self->[CURRENT]);
    # warn(sprintf("ch: 0x%x ($self->[CURRENT])\n", $n));
    if (($] < 5.007002) && ($n > 0x7F)) {
        # utf8 surrogate
        my $current = $self->[CURRENT];
        if    ($n >= 0xFC) {
            # read 5 chars
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
        }
        elsif ($n >= 0xF8) {
            # read 4 chars
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
        }
        elsif ($n >= 0xF0) {
            # read 3 chars
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
        }
        elsif ($n >= 0xE0) {
            # read 2 chars
            $self->next; $current .= $self->[CURRENT];
            $self->next; $current .= $self->[CURRENT];
        }
        elsif ($n >= 0xC0) {
            # read 1 char
            $self->next; $current .= $self->[CURRENT];
        }
        else {
            throw XML::SAX::Exception::Parse(
                Message => sprintf("Invalid character 0x%x", $n),
                ColumnNumber => $self->column,
                LineNumber => $self->line,
                PublicId => $self->public_id,
                SystemId => $self->system_id,
            );
        }
        if ($] >= 5.006001) {
            $self->[CURRENT] = pack("U0A*", $current);
        }
        else {
            $self->[CURRENT] = $current;
        }
    }
}

######################################################

# match anything *not* in the list of characters given
sub match_not {
    my $self = shift;

    my $current = $self->[CURRENT];
    return 0 unless defined $current;

    for my $m (@_) {
        if ($current eq $m) {
            $self->[MATCHED] = '';
            return 0;
        }
    }
    $self->[MATCHED] = $current;
    $self->nextchar;
    return 1;
}

######################################################

# get the next byte or character (character on 5.7.2+
# or byte on 5.5 or 5.6) from a fh, or from the buffer
# if one exists.
sub next {
    my $self = shift;

    # check for chars in buffer first.
    if (length($self->[BUFFER])) {
        return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, ''); #
+ last param truncates buffer
    }


    if (length($self->[INTERNAL_BUFFER])) {
BUFFERED_READ:
        $self->[CURRENT] = substr($self->[INTERNAL_BUFFER], 0, 1, '');
        if ($self->[CURRENT] eq "\x0A") {
            $self->[LINE]++;
            $self->[COLUMN] = 1;
        }
        else { $self->[COLUMN]++ }
        return;
    }

    my $bytesread = read($self->[FH], $self->[INTERNAL_BUFFER], $self-
+>[BUFFER_SIZE]);
    if ($bytesread) {
        # yes, goto. If you have a faster way feel free!
        goto BUFFERED_READ;
    }
    elsif (defined($bytesread)) {
        $self->[EOF]++;
        return $self->[CURRENT] = undef;
    }
    throw XML::SAX::Exception::Parse(
        Message => "Error reading from filehandle: $!",
    );
}

##########################################################

# consume everything not matching chars listed in @_
sub consume_not {
    my $self = shift;

    my $consumed = '';

    while(!$self->[EOF] && $self->match_not(@_)) {
        $consumed .= $self->[MATCHED];
    }
    return length($self->[CONSUMED] = $consumed);
}

########################################################

# Does the current token match a single character?
sub match_char {
    my $self = shift;

    if (defined($self->[CURRENT]) && $self->[CURRENT] eq $_[0]) {
        $self->[MATCHED] = $_[0];
        $self->nextchar;
        return 1;
    }
    $self->[MATCHED] = '';
    return 0;
}

#########################################################

# Does the current token match a single regexp?
sub match_re {
    my $self = shift;

    if ($self->[CURRENT] =~ $_[0]) {
        $self->[MATCHED] = $self->[CURRENT];
        $self->nextchar;
        return 1;
    }
    $self->[MATCHED] = '';
    return 0;
}
[download]

The original code is on CPAN, but not with these performance tweaks. If you feel you need the live code (i.e. the code you see above along with the rest of it) to look at it's available from anon-cvs via: cvs -z3 -d :pserver:anonymous@axkit.org:/home/cvs login/co XML-SAX

Edit Petruchio Wed Feb 6 08:16:30 UTC 2002 - Added READMORE Tag

In reply to XML::SAX::PurePerl Performance by Matts

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.