Hi Monks,
An XML::SAX reader need to use one or more "consumers" that will receive the data parsed the XML::SAX::Reader. It can be a predefined accumulator based on a string, array, code and you can define your own.

While working on my SAX-based xml splitter I need to use my own consumer (descendant of ConsumerInterface) to store some data, with adding the possibility to query the current size or reset the data. My first implementation used .= operator to concatenate and length() to return the length of the data and it appeared to me after testing on file greater than several KB that the perfs were exponentially linerarly (as stated by readers) regressive.

I hacked another implementation based on a file to store the temp data, with print() to store and tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.

On my Win7 box, I noticed that the perl.exe process was heavily querying the Perl_utf8_length function. And after some tests, I could confirm that it is rather the calls to length in utf-8 context rather than .= that are to blame.
Here is my test program that mimicks my SAX parser custom data store. It is showing the time to concatenate an utf-8 string in a loop and getting its size by chunks of 1000 iterations.
I have implemented 2 objects: It seems to me that bigger is the string, longer is the time to get its size, (exponentialy? linearly). Here are the times I get:
with length() 1000 L= 256000 t=0.510602 2000 L= 512000 t=1.544809 3000 L= 768000 t=2.558511 4000 L= 1024000 t=3.598220 5000 L= 1280000 t=4.622924 6000 L= 1536000 t=5.666133 7000 L= 1792000 t=6.634827 8000 L= 2048000 t=7.653030 9000 L= 2304000 t=8.687737 10000 L= 2560000 t=9.727445 11000 L= 2816000 t=10.728646 12000 L= 3072000 t=11.764352 13000 L= 3328000 t=12.804560 14000 L= 3584000 t=13.783257 15000 L= 3840000 t=14.836966 with a scalar 1000 L= 256000 t=0.003712 2000 L= 512000 t=0.003200 3000 L= 768000 t=0.003433 4000 L= 1024000 t=0.003232 5000 L= 1280000 t=0.003398 6000 L= 1536000 t=0.003669 7000 L= 1792000 t=0.004407 8000 L= 2048000 t=0.002218 9000 L= 2304000 t=0.004507 10000 L= 2560000 t=0.002192 11000 L= 2816000 t=0.005269 12000 L= 3072000 t=0.002203 13000 L= 3328000 t=0.006128 14000 L= 3584000 t=0.002311 15000 L= 3840000 t=0.002231
In my real use case, the file-based or stored length based code can process a 25MB xml file in 60s while the same code just using the naive length() based code is spending about 25 minutes on the same data!
Can you confirm my analysis, and tell if my workaround is suitable?

In the production code, I will probably keep the file storage to please my boss and limit the memory charge (it may store until 200MB of data, or more depending on the settings, temporarily during the process), but it may be a false good idea...

Here is my test code:
use strict; use warnings; use feature 'say'; use feature 'state'; use utf8; use Time::HiRes; $|++; my $chunk = '€' x 256; my $td = Time::HiRes::time; my $tf; my $l; say "with length()"; my $str = new LenTestA; for my $n (1..15_000){ state $count = 0; $str->add($chunk); $l = $str->len; $count++; if ($count % 1000 == 0){ $tf = Time::HiRes::time; say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td; $td = $tf; } } $td = Time::HiRes::time; say "\nwith a scalar"; $str = new LenTestB; for my $n (1..15_000){ state $count = 0; $str->add($chunk); $l = $str->len; $count++; if ($count % 1000 == 0){ $tf = Time::HiRes::time; say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td; $td = $tf; } } { package LenTestA; sub new { my $class = shift; my $self = ''; return bless \$self, $class; } sub add { my ($self, $data) = @_; $$self .= $data; } sub len { my $self = shift; return length $$self; } } { package LenTestB; my $len; sub new { my $class = shift; my $self = ''; return bless \$self, $class; } sub add { my ($self, $data) = @_; $$self .= $data; $len += length($data); } sub len { my $self = shift; return $len; } }
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to performance of length() in utf-8 by seki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.