
XML::Parser ( or Perl internals ) speed mysticism

by zagzag (Novice)
on Mar 25, 2006 at 18:58 UTC (#539223=perlquestion)

zagzag has asked for the wisdom of the Perl Monks concerning the following question:

Hi!
Problem: XML::Parser is very slow when it parses a long run of non-markup character data.


1. XML file test.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <dump>
    tMAllxGCUoOtVNrbex5jlgM1e2HoW+VgtBGZaN8cYmi+bMZDOUxQht44sFX+j57S4FvYoy +2Y16kD +uPwMpt+FSjfJww4gXKXpGVkMmMA3AFduCM9K8wVPyVy8fI8F +I+7pIBQt0/Hz5PErMhqJ2ngyt49 +75WQpmP9a1n3wCRE1vBSGAs4jr4 +UcJtkLbZ/07SR3RLDWqByDDjxrsGQYuqxoy4+XrXM01ZTfGp
    !skip about 18400 base64 lines !
    JMCZng== ====
    </dump>

2. The Perl script used to parse it:

    #!/usr/bin/perl
    use XML::Parser;

    our $str;
    my @arr;
    my $file = shift;

    my $parser = XML::Parser->new(
        Handlers => {
            Char => sub {
                $str .= $_[1];
                # push @arr, $_[1];
            },
        },
    );
    $parser->parsefile($file);

3. Run the script:

    $ time ./ test.xml
    35.984u 0.557s 0:41.20 88.6% 10+9137k 0+0io 0pf+0w

4. Now uncomment the line with push and run it again:

    $ time ./ test.xml
    0.119u 0.031s 0:00.15 93.3% 10+4276k 0+0io 0pf+0w

The first run takes 0:41.20; the second takes 0:00.15! Why? Any ideas?

perl -V

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=freebsd, osvers=6.1-prerelease, archname=i386-freebsd-64int
    uname='freebsd home.zag 6.1-prerelease freebsd 6.1-prerelease #0: sun feb 26 13:15:04 msk 2006 root@home.zag:usrobjusrsrcsysmykernel i386 '
    config_args='-sde -Dprefix=/usr/local -Darchlib=/usr/local/lib/perl5/5.8.8/mach -Dprivlib=/usr/local/lib/perl5/5.8.8 -Dman3dir=/usr/local/lib/perl5/5.8.8/perl/man/man3 -Dman1dir=/usr/local/man/man1 -Dsitearch=/usr/local/lib/perl5/site_perl/5.8.8/mach -Dsitelib=/usr/local/lib/perl5/site_perl/5.8.8 -Dscriptdir=/usr/local/bin -Dsiteman3dir=/usr/local/lib/perl5/5.8.8/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Ui_malloc -Ui_iconv -Uinstallusrbinperl -Dcc=cc -Duseshrplib -Dccflags=-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.8/BSDPAN" -Doptimize=-O2 -pipe -march=pentiumpro -Ud_dosuid -Ui_gdbm -Dusethreads=n -Dusemymalloc=y -Duse64bitint'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.8/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include',
    optimize='-O2 -pipe -march=pentiumpro',
    cppflags='-DAPPLLIB_EXP="/usr/local/lib/perl5/5.8.8/BSDPAN" -DHAS_FPSETMASK -DHAS_FLOATINGPOINT_H -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include'
    ccversion='', gccversion='3.4.4 [FreeBSD] 20050518', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -Wl,-E -L/usr/local/lib'
    libpth=/usr/lib /usr/local/lib
    libs=-lgdbm -lm -lcrypt -lutil
    perllibs=-lm -lcrypt -lutil
    libc=, so=so, useshrplib=true, gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' -Wl,-R/usr/local/lib/perl5/5.8.8/mach/CORE'
    cccdlflags='-DPIC -fPIC', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_MALLOC_WRAP USE_64_BIT_INT USE_LARGE_FILES USE_PERLIO
  Locally applied patches:
    defined-or
  Built under freebsd
  Compiled at Mar 15 2006 23:07:53
  %ENV:
  @INC:
    /usr/local/lib/perl5/5.8.8/BSDPAN
    /usr/local/lib/perl5/site_perl/5.8.8/mach
    /usr/local/lib/perl5/site_perl/5.8.8
    /usr/local/lib/perl5/site_perl/5.8.7
    /usr/local/lib/perl5/site_perl/5.8.6
    /usr/local/lib/perl5/site_perl/5.8.5
    /usr/local/lib/perl5/site_perl/5.8.2
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.005
    /usr/local/lib/perl5/site_perl
    /usr/local/lib/perl5/5.8.8/mach
    /usr/local/lib/perl5/5.8.8


Replies are listed 'Best First'.
Re: XML::Parser ( or Perl internals ) speed mysticism
by sfink (Deacon) on Mar 26, 2006 at 05:46 UTC
    With the line commented out, the previous line (the concatenation) is used as the return value of the handler. That return value is passed around and used for who knows what. It gets very large. If you replace the commented out line with "1;", it will be even faster.

    It sure is surprising until you figure out what's going on, though!
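
    A minimal sketch of that effect, using plain subs rather than XML::Parser itself (the names append_implicit and append_explicit are made up for illustration). A Perl sub with no explicit return yields the value of its last evaluated expression, so a handler that ends in a concatenation hands the whole accumulated string back to its caller on every call:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = '';

# The last evaluated expression is the concatenation, so this sub
# implicitly returns the entire string built up so far.
sub append_implicit { $str .= $_[0] }

# A bare return yields undef in scalar context (an empty list in
# list context), so nothing large is handed back.
sub append_explicit { $str .= $_[0]; return }

my $r1 = append_implicit('abc');
my $r2 = append_implicit('def');
my $r3 = append_explicit('ghi');

print "first implicit return:  $r1\n";
print "second implicit return: $r2\n";
print "explicit return is ", (defined $r3 ? "defined" : "undef"), "\n";
print "accumulated string:     $str\n";
```

    How much the implicit return costs depends on what the caller does with the value; the timings in the question suggest that in this case it was a lot.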

      sfink, you're right! This code works fine:

          #!/usr/bin/perl
          use XML::Parser;

          our $str;
          my @arr;
          my $file = shift;

          my $parser = XML::Parser->new(
              Handlers => {
                  Char => sub {
                      $str .= $_[1];
                      return;    # clear the return value
                  },
              },
          );
          $parser->parsefile($file);

        The original code should work fine as well (you've found a bug :)). Since the return value isn't being used for anything, that callback (among others) should be called in void context. The patch is easy, modify Expat.xs, and replace all instances of
        I've tested this, and it doesn't seem to break anything.

        I've also filed a bug report.
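
        To illustrate what "void context" means at the Perl level (this is only a sketch, not the Expat.xs change itself; report_context is a hypothetical helper), wantarray lets a sub see the context it was called in. It returns undef in void context, false in scalar context, and true in list context:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @seen;

# Record the calling context. wantarray is undef for void context,
# false (but defined) for scalar context, and true for list context.
sub report_context {
    my $want = wantarray;
    push @seen, !defined $want ? 'void' : $want ? 'list' : 'scalar';
    return;
}

report_context();                 # result discarded: void context
my $scalar = report_context();    # scalar context
my @list   = report_context();    # list context

print "@seen\n";    # prints "void scalar list"
```

        A callback invoked in void context never has its return value collected, so whatever its last expression evaluates to makes no difference.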


Re: XML::Parser ( or Perl internals ) speed mysticism
by perrin (Chancellor) on Mar 25, 2006 at 20:10 UTC
    Your OS is caching the file in memory so it doesn't have to read it from disk again.
      I think that if caching had any effect here at all, it would show up in the difference between 0.557s (first run) and 0.031s (second run), which is the system time, i.e. the CPU time spent in the kernel on behalf of the process. In other words, it is negligible.

      If the OP's data is 18400 lines like the first few shown, that's well under 2 MB total, and reading from disk vs. cache memory for that amount of data could not account for a difference of over 40 sec in run times.
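
      A rough check of that size estimate, assuming each base64 line is 76 characters plus a newline (the usual MIME line length; the real line length in test.xml is not known from the post):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $lines          = 18_400;    # line count given in the question
my $bytes_per_line = 76 + 1;    # ASSUMPTION: 76 base64 chars + newline

my $total = $lines * $bytes_per_line;

printf "%d bytes (about %.2f MiB)\n", $total, $total / (1024 * 1024);
```

      Under that assumption the file comes to roughly 1.4 million bytes, which is consistent with "well under 2 MB".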

      sfink's reply looks good...

      (update: I had mistakenly put in the wrong monk's name when linking to the reply that follows -- sorry about the confusion, and thanks to the monks who msg'd me about it.)
