Hi Monks,
An XML::SAX reader need to use one or more "consumers" that will receive the data parsed the XML::SAX::Reader. It can be a predefined accumulator based on a string, array, code and you can define your own.
While working on my SAX-based xml splitter I need to use my own consumer (descendant of ConsumerInterface) to store some data, with adding the possibility to query the current size or reset the data. My first implementation used
.= operator to concatenate and
length() to return the length of the data and it appeared to me after testing on file greater than several KB that the perfs were
exponentially linerarly (as stated by readers) regressive.
I hacked another implementation based on a file to store the temp data, with
print() to store and
tell() to get the size. It radically improved the performance, but I was wondering about the origin of the problem.
On my Win7 box, I noticed that the perl.exe process was heavily querying the
Perl_utf8_length function. And after some tests, I could confirm that it is rather the calls to
length in utf-8 context rather than
.= that are to blame.
Here is my test program that mimicks my SAX parser custom data store. It is showing the time to concatenate an utf-8 string in a loop and getting its size by chunks of 1000 iterations.
I have implemented 2 objects:
- one is using length on the resulting string to return its size
- the second gets the length of the added data and keeps the size in a scalar that is returned on demand
It seems to me that bigger is the string, longer is the time to get its size, (
exponentialy? linearly). Here are the times I get:
with length()
1000 L= 256000 t=0.510602
2000 L= 512000 t=1.544809
3000 L= 768000 t=2.558511
4000 L= 1024000 t=3.598220
5000 L= 1280000 t=4.622924
6000 L= 1536000 t=5.666133
7000 L= 1792000 t=6.634827
8000 L= 2048000 t=7.653030
9000 L= 2304000 t=8.687737
10000 L= 2560000 t=9.727445
11000 L= 2816000 t=10.728646
12000 L= 3072000 t=11.764352
13000 L= 3328000 t=12.804560
14000 L= 3584000 t=13.783257
15000 L= 3840000 t=14.836966
with a scalar
1000 L= 256000 t=0.003712
2000 L= 512000 t=0.003200
3000 L= 768000 t=0.003433
4000 L= 1024000 t=0.003232
5000 L= 1280000 t=0.003398
6000 L= 1536000 t=0.003669
7000 L= 1792000 t=0.004407
8000 L= 2048000 t=0.002218
9000 L= 2304000 t=0.004507
10000 L= 2560000 t=0.002192
11000 L= 2816000 t=0.005269
12000 L= 3072000 t=0.002203
13000 L= 3328000 t=0.006128
14000 L= 3584000 t=0.002311
15000 L= 3840000 t=0.002231
In my real use case, the file-based or stored length based code can process a 25MB xml file in 60s while the same code just using the naive
length() based code is spending about 25 minutes on the same data!
Can you confirm my analysis, and tell if my workaround is suitable?
In the production code, I will probably keep the file storage to please my boss and limit the memory charge (it may store until 200MB of data, or more depending on the settings, temporarily during the process), but it may be a false good idea...
Here is my test code:
use strict;
use warnings;
use feature 'say';
use feature 'state';
use utf8;
use Time::HiRes;
$|++;
my $chunk = '€' x 256;
my $td = Time::HiRes::time;
my $tf;
my $l;
say "with length()";
my $str = new LenTestA;
for my $n (1..15_000){
state $count = 0;
$str->add($chunk);
$l = $str->len;
$count++;
if ($count % 1000 == 0){
$tf = Time::HiRes::time;
say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
$td = $tf;
}
}
$td = Time::HiRes::time;
say "\nwith a scalar";
$str = new LenTestB;
for my $n (1..15_000){
state $count = 0;
$str->add($chunk);
$l = $str->len;
$count++;
if ($count % 1000 == 0){
$tf = Time::HiRes::time;
say sprintf "%12d L=%10d t=%f", $n, $l, $tf-$td;
$td = $tf;
}
}
{
package LenTestA;
sub new {
my $class = shift;
my $self = '';
return bless \$self, $class;
}
sub add {
my ($self, $data) = @_;
$$self .= $data;
}
sub len {
my $self = shift;
return length $$self;
}
}
{
package LenTestB;
my $len;
sub new {
my $class = shift;
my $self = '';
return bless \$self, $class;
}
sub add {
my ($self, $data) = @_;
$$self .= $data;
$len += length($data);
}
sub len {
my $self = shift;
return $len;
}
}
The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.