This is a hardcore performance question, hope you like it.

I'm the author of XML::SAX::PurePerl, a pure perl XML parser. It's close enough to complete that I've decided lately to examine it's performance issues. It's a very slow parser, especially compared to C based parsers (expat, libxml2), mostly because Perl itself is very slow at this task. So I've decided to post a breakdown of where I'm at, what I've done and hopefully get some ideas for further improvements.

First of all, this parser is character based. This means it parses based on seeing a single character in a stream at a time, and backtracks when it needs to. This makes things slow in Perl, because ironically while Perl deals with strings very quickly, it sucks at dealing with characters. I'm not going to change this architectural decision because it would require a complete re-write of the parser (which is already 3000 lines of code in total).

Anyway, down to the nitty gritty. I'm going to simply post what dprofpp tells me are the most time consuming functions, and then post those functions so you can take a look.

dprofpp output from dprofpp -O 6:

Total Elapsed Time = 9.712727 Seconds User+System Time = 9.672727 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 26.2 2.538 4.018 86883 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::nextchar 24.5 2.379 4.518 60628 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_not 23.3 2.259 1.999 86884 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::Stream::next 11.3 1.099 5.442 4826 0.0002 0.0011 XML::SAX::PurePerl::Reade +r::consume_not 9.72 0.940 1.415 53003 0.0000 0.0000 XML::SAX::PurePerl::Reade +r::match_char 9.60 0.929 1.438 19261 0.0000 0.0001 XML::SAX::PurePerl::Reade +r::match_re
Note that the percentages are a little confusing, since some of those call each other, so there's probably some overlap - I don't know how to get dprofpp to make that a non-issue (i.e. only time spent *actually* in the subroutine, not time spent calling other subs. I don't even think its possible to calculate that).

Anyway, here's the code for those six. Go for your life - shoot me if I've done something stupid. Don't post without running Benchmark.pm on things though - I've already spent a few days on this!

# get the next utf-8 character from the input stream # and encode as UTF8 if possible/necessary sub nextchar { my $self = shift; $self->next; return unless defined $self->[CURRENT]; if ($self->[CURRENT] eq "\x0D") { $self->next; return unless defined($self->[CURRENT]); if ($self->[CURRENT] ne "\x0A") { $self->buffer("\x0A"); } } return unless $self->[ENCODING]; my $n = ord($self->[CURRENT]); # warn(sprintf("ch: 0x%x ($self->[CURRENT])\n", $n)); if (($] < 5.007002) && ($n > 0x7F)) { # utf8 surrogate my $current = $self->[CURRENT]; if ($n >= 0xFC) { # read 5 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xF8) { # read 4 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xF0) { # read 3 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xE0) { # read 2 chars $self->next; $current .= $self->[CURRENT]; $self->next; $current .= $self->[CURRENT]; } elsif ($n >= 0xC0) { # read 1 char $self->next; $current .= $self->[CURRENT]; } else { throw XML::SAX::Exception::Parse( Message => sprintf("Invalid character 0x%x", $n), ColumnNumber => $self->column, LineNumber => $self->line, PublicId => $self->public_id, SystemId => $self->system_id, ); } if ($] >= 5.006001) { $self->[CURRENT] = pack("U0A*", $current); } else { $self->[CURRENT] = $current; } } } ###################################################### # match anything *not* in the list of characters given sub match_not { my $self = shift; my $current = $self->[CURRENT]; return 0 unless defined $current; for my $m (@_) { if ($current eq $m) { $self->[MATCHED] = ''; return 0; } } $self->[MATCHED] = $current; $self->nextchar; return 1; } ###################################################### # get the next byte or character (character on 5.7.2+ # or byte on 5.5 or 5.6) from a fh, or from the buffer # if one exists. sub next { my $self = shift; # check for chars in buffer first. if (length($self->[BUFFER])) { return $self->[CURRENT] = substr($self->[BUFFER], 0, 1, ''); # + last param truncates buffer } if (length($self->[INTERNAL_BUFFER])) { BUFFERED_READ: $self->[CURRENT] = substr($self->[INTERNAL_BUFFER], 0, 1, ''); if ($self->[CURRENT] eq "\x0A") { $self->[LINE]++; $self->[COLUMN] = 1; } else { $self->[COLUMN]++ } return; } my $bytesread = read($self->[FH], $self->[INTERNAL_BUFFER], $self- +>[BUFFER_SIZE]); if ($bytesread) { # yes, goto. If you have a faster way feel free! goto BUFFERED_READ; } elsif (defined($bytesread)) { $self->[EOF]++; return $self->[CURRENT] = undef; } throw XML::SAX::Exception::Parse( Message => "Error reading from filehandle: $!", ); } ########################################################## # consume everything not matching chars listed in @_ sub consume_not { my $self = shift; my $consumed = ''; while(!$self->[EOF] && $self->match_not(@_)) { $consumed .= $self->[MATCHED]; } return length($self->[CONSUMED] = $consumed); } ######################################################## # Does the current token match a single character? sub match_char { my $self = shift; if (defined($self->[CURRENT]) && $self->[CURRENT] eq $_[0]) { $self->[MATCHED] = $_[0]; $self->nextchar; return 1; } $self->[MATCHED] = ''; return 0; } ######################################################### # Does the current token match a single regexp? sub match_re { my $self = shift; if ($self->[CURRENT] =~ $_[0]) { $self->[MATCHED] = $self->[CURRENT]; $self->nextchar; return 1; } $self->[MATCHED] = ''; return 0; }
The original code is on CPAN, but not with these performance tweaks. If you feel you need the live code (i.e. the code you see above along with the rest of it) to look at it's available from anon-cvs via: cvs -z3 -d :pserver:anonymous@axkit.org:/home/cvs login/co XML-SAX

Edit Petruchio Wed Feb 6 08:16:30 UTC 2002 - Added READMORE Tag


In reply to XML::SAX::PurePerl Performance by Matts

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.