Hi Monks,
An XML::SAX reader needs one or more "consumers" that receive the data parsed by the XML::SAX::Reader. A consumer can be a predefined accumulator based on a string, an array or a code reference, or you can define your own.

While working on my SAX-based XML splitter, I needed my own consumer (a descendant of ConsumerInterface) to store some data, with the added ability to query the current size or reset the data. My first implementation used the .= operator to concatenate and length() to return the size of the data. After testing on files larger than a few KB, it appeared that the performance degraded ~~exponentially~~ linearly (as pointed out by readers) with the size of the data.

I hacked another implementation based on a file to store the temp data, with print() to store and tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.
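
For readers who want to try the file-based approach, here is a minimal sketch of such a store. The class name TempFileStore and its methods (add, size, clear) are hypothetical and not part of XML::SAX, and the data is encoded to UTF-8 explicitly, which is an assumption about how the real consumer writes. Note that tell() reports a size in bytes, not in characters.

```perl
use strict;
use warnings;

{
    # Hypothetical class name; it is not part of XML::SAX.
    package TempFileStore;
    use File::Temp ();

    sub new {
        my ($class) = @_;
        my $fh = File::Temp->new;      # temp file, removed when the object goes away
        binmode $fh;                   # raw bytes: we encode explicitly in add()
        return bless { fh => $fh }, $class;
    }

    sub add {
        my ($self, $data) = @_;
        utf8::encode($data);           # $data is a copy; it now holds UTF-8 bytes
        print { $self->{fh} } $data or die "write failed: $!";
        return;
    }

    # Size in BYTES (not characters): tell() reports the file position.
    sub size { return tell $_[0]{fh} }

    sub clear {
        my ($self) = @_;
        seek($self->{fh}, 0, 0)   or die "seek failed: $!";
        truncate($self->{fh}, 0)  or die "truncate failed: $!";
        return;
    }
}

my $store = TempFileStore->new;
$store->add("\x{20AC}" x 256);         # 256 euro signs, 3 bytes each in UTF-8
print $store->size, "\n";              # 768 (bytes)
$store->clear;
print $store->size, "\n";              # 0
```

Getting the size this way costs the same however much data has been written, because the file position is tracked as a plain number.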

On my Win7 box, I noticed that the perl.exe process was calling the Perl_utf8_length function heavily. After some tests, I could confirm that the calls to length() on a UTF-8 string, rather than .=, are to blame.
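
To illustrate why length() can be costly here, the following small sketch compares the same data held as a character string and as a byte string. On a character string stored internally as UTF-8, length() has to count variable-width characters, whereas on a byte string it can return the stored byte count. The variable names are only illustrative, and the euro sign is written as "\x{20AC}" so that the snippet does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

my $chars = "\x{20AC}" x 256;   # character string, flagged as UTF-8 internally
my $bytes = $chars;
utf8::encode($bytes);           # same data as a plain string of UTF-8 bytes

printf "chars: flagged=%d length=%d\n",
    (utf8::is_utf8($chars) ? 1 : 0), length $chars;   # flagged=1 length=256
printf "bytes: flagged=%d length=%d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;   # flagged=0 length=768
```
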
Here is my test program, which mimics the custom data store of my SAX parser. It shows the time taken to concatenate a UTF-8 string in a loop and get its size, reported for each chunk of 1000 iterations.
I have implemented 2 objects:

the first uses length() on the resulting string to return its size
the second takes the length of the added data and keeps the running size in a scalar that is returned on demand

It seems to me that the bigger the string, the longer it takes to get its size (~~exponentially?~~ linearly). Here are the times I get:

with length()
        1000 L=    256000 t=0.510602
        2000 L=    512000 t=1.544809
        3000 L=    768000 t=2.558511
        4000 L=   1024000 t=3.598220
        5000 L=   1280000 t=4.622924
        6000 L=   1536000 t=5.666133
        7000 L=   1792000 t=6.634827
        8000 L=   2048000 t=7.653030
        9000 L=   2304000 t=8.687737
       10000 L=   2560000 t=9.727445
       11000 L=   2816000 t=10.728646
       12000 L=   3072000 t=11.764352
       13000 L=   3328000 t=12.804560
       14000 L=   3584000 t=13.783257
       15000 L=   3840000 t=14.836966

with a scalar
        1000 L=    256000 t=0.003712
        2000 L=    512000 t=0.003200
        3000 L=    768000 t=0.003433
        4000 L=   1024000 t=0.003232
        5000 L=   1280000 t=0.003398
        6000 L=   1536000 t=0.003669
        7000 L=   1792000 t=0.004407
        8000 L=   2048000 t=0.002218
        9000 L=   2304000 t=0.004507
       10000 L=   2560000 t=0.002192
       11000 L=   2816000 t=0.005269
       12000 L=   3072000 t=0.002203
       13000 L=   3328000 t=0.006128
       14000 L=   3584000 t=0.002311
       15000 L=   3840000 t=0.002231

In my real use case, the file-based or stored-length-based code can process a 25MB XML file in 60s, while the same code using the naive length()-based approach spends about 25 minutes on the same data!
Can you confirm my analysis, and tell if my workaround is suitable?
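
As a possible variant of the stored-length workaround, here is a sketch that keeps the running length inside each object rather than in a variable shared by the whole package, and adds the reset operation mentioned above. The class name CountingStore and the method name clear are hypothetical; this is only one way to arrange it, not a drop-in ConsumerInterface implementation.

```perl
use strict;
use warnings;

{
    # Hypothetical class name, not part of XML::SAX.
    package CountingStore;

    sub new {
        my ($class) = @_;
        return bless { data => '', len => 0 }, $class;
    }

    sub add {
        my ($self, $data) = @_;
        $self->{data} .= $data;
        $self->{len}  += length $data;   # only the small new piece is measured
        return;
    }

    # Returns the stored counter; the accumulated string is never rescanned.
    sub len { return $_[0]{len} }

    sub clear {
        my ($self) = @_;
        @{$self}{qw(data len)} = ('', 0);
        return;
    }
}

my $first  = CountingStore->new;
my $second = CountingStore->new;
$first->add("\x{20AC}" x 256) for 1 .. 3;
$second->add('abc');
printf "first=%d second=%d\n", $first->len, $second->len;   # first=768 second=3
$first->clear;
printf "after clear: first=%d\n", $first->len;              # first=0
```

Because the counter lives in the object, several stores can be used at the same time without their sizes getting mixed up.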

In the production code, I will probably keep the file storage to please my boss and limit the memory footprint (it may temporarily hold up to 200MB of data, or more depending on the settings, during the process), but it may only look like a good idea...

Here is my test code:

use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;

$|++;
my $chunk = '€' x 256;
my $td = Time::HiRes::time;
my $tf;
my $l;

say "with length()";
my $str = LenTestA->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;

    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

$td = Time::HiRes::time;
say "\nwith a scalar";
$str = LenTestB->new;
for my $n (1..15_000){
    state $count = 0;
    $str->add($chunk);
    $l = $str->len;

    $count++;
    if ($count % 1000 == 0){
        $tf = Time::HiRes::time;
        say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
        $td = $tf;
    }
}

{
    package LenTestA;
    sub new {
        my $class = shift;
        my $self = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
    }
    sub len {
        my $self = shift;
        return length $$self;
    }
}
{
    package LenTestB;
    my $len;
    sub new {
        my $class = shift;
        my $self = '';
        return bless \$self, $class;
    }
    sub add {
        my ($self, $data) = @_;
        $$self .= $data;
        $len += length($data);
    }
    sub len {
        my $self = shift;
        return $len;
    }
}

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

In reply to performance of length() in utf-8 by seki
