Xenofur has asked for the wisdom of the Perl Monks concerning the following question:

I'm converting CSV files, which are luckily formatted in such a manner that I can take them apart with split(). Currently I'm running the code like this in a CGI::App:
use autodie;

$c->do_update();

sub do_update {
    my ($c) = @_;
    for my $file ( @files ) {
        $c->process_dump_file( $file );
    }
}

sub process_dump_file {
    my ($c, $file) = @_;
    my @orders;
    open my $csv, "<", $file;
    push @orders, [ split ' , ', $_ ] while ( <$csv> );
    close $csv;
    shift @orders;
    return;
}
The empty return is intentional for now.
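To illustrate, here is a minimal example of what that split pattern does (the sample line is made up, not taken from the real files). Note that ' , ' is treated as the literal pattern / , /, and that the last field keeps its trailing newline:

    my $line   = "1001 , Alice , 42.50\n";
    my @fields = split ' , ', $line;    # ("1001", "Alice", "42.50\n")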

The problem is this: the very first run is extremely fast, taking maybe 10 seconds. After that, each file, even one of the same size, takes 1-2 minutes. (Each one takes roughly the same time though, so it's not exponential.)

Each file has about 300_000 - 400_000 lines with 14 fields per line, and each file is about 50-80 MB. The application climbs to around 300 MB of RAM usage, but that still leaves plenty of free RAM; the HDD is not very active during processing, and the entire load appears to be CPU activity.

For a run of four files, Benchmark gives this result:
Extracted 308272 orders. CSV time: 4 wallclock secs ( 4.05 usr + 0.28 sys = 4.33 CPU)
Extracted 301468 orders. CSV time: 127 wallclock secs (123.47 usr + 0.44 sys = 123.91 CPU)
Extracted 316912 orders. CSV time: 136 wallclock secs (131.77 usr + 0.42 sys = 132.19 CPU)
Extracted 426854 orders. CSV time: 145 wallclock secs (139.91 usr + 0.66 sys = 140.56 CPU)
Duration: 432 wallclock secs (412.98 usr + 3.31 sys = 416.30 CPU)
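For reference, a minimal sketch of how per-file timings like the above can be produced with the core Benchmark module; this is an assumed harness, not the actual application code, and it assumes process_dump_file returns the order count:

    use Benchmark qw( timediff timestr );

    my $t0    = Benchmark->new;                    # timestamp before parsing
    my $count = $c->process_dump_file( $file );    # hypothetical: returns the number of orders
    my $t1    = Benchmark->new;                    # timestamp after parsing
    print "Extracted $count orders. CSV time: ",
          timestr( timediff( $t1, $t0 ) ), "\n";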


I'm looking for any sort of idea as to how this could happen and what I can do about it.

Replies are listed 'Best First'.
Re: Performance oddity when splitting a huge file into an AoA
by roubi (Hermit) on May 03, 2009 at 14:18 UTC
    You're certain that the first file is actually opened? I see that you don't check the status of 'open' with something like this:
    open(my $csv, "<", $file) or die "Unable to open $file: $!";

    Update: the OP's original post did not include "use autodie".
      Yeah, I'm running under "use autodie;" and it behaves the same no matter which file is the first one.
        Okay. Since you are not getting much help so far, here is a second half-baked idea: maybe what you are seeing is related to the deallocation of the @orders array created during the previous run. You could test that theory by keeping those arrays around, as sketched below, and seeing whether your timings change at all.
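        A minimal sketch of that test, adapted from the OP's posted sub (the @keep_alive array is hypothetical): stash each run's @orders in a package-level array so the data is never freed between files, then compare the per-file timings.

            our @keep_alive;    # persists across calls, so nothing is deallocated

            sub process_dump_file {
                my ($c, $file) = @_;
                my @orders;
                open my $csv, "<", $file;    # autodie reports failures
                push @orders, [ split ' , ', $_ ] while ( <$csv> );
                close $csv;
                shift @orders;
                push @keep_alive, \@orders;    # keep the AoA alive between runs
                return;
            }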
Re: Performance oddity when splitting a huge file into an AoA
by BrowserUk (Patriarch) on May 04, 2009 at 09:00 UTC

    You're running this under mod_perl or FastCGI? Cos I can't reproduce your findings using straight perl.

    #! perl -sw
    use 5.010;
    use strict;
    use Time::HiRes qw[ time ];

    sub x {
        open my $fh, '<', shift or die $!;
        my @AoA;
        push @AoA, [ split ',' ] while <$fh>;
        close $fh;
        return scalar @AoA;
    }

    for ( 1 .. 5 ) {
        my $start = time;
        printf "Records: %d in %.3f seconds\n",
            x( sprintf 'junk%d.dat', 1 + ( $_ & 1 ) ),
            time() - $start;
    }
    __END__
    c:\test>junk
    Records: 400000 in 5.884 seconds
    Records: 300000 in 4.752 seconds
    Records: 400000 in 4.599 seconds
    Records: 300000 in 3.473 seconds
    Records: 400000 in 4.569 seconds

    c:\test>junk
    Records: 400000 in 4.826 seconds
    Records: 300000 in 3.408 seconds
    Records: 400000 in 4.613 seconds
    Records: 300000 in 3.481 seconds
    Records: 400000 in 4.557 seconds

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You're right. It is dependent on which version of Perl is used. I'm utterly confused now:

      ActivePerl:
      d:\Web-Dev\arrays>perl -v
      This is perl, v5.10.0 built for MSWin32-x86-multi-thread
      (with 5 registered patches, see perl -V for more detail)
      Copyright 1987-2007, Larry Wall
      Binary build 1004 [287188] provided by ActiveState http://www.ActiveState.com
      Built Sep 3 2008 13:16:37
      [snip]
      D:\Web-Dev\arrays>perl test.pl
      Records: 308273 in 5.641 seconds
      Records: 279997 in 98.281 seconds
      Records: 308273 in 128.656 seconds
      Records: 279997 in 96.953 seconds
      Records: 308273 in 129.188 seconds
      Cygwin:
      bash-3.2$ /bin/perl -v
      This is perl, v5.10.0 built for cygwin-thread-multi-64int
      (with 6 registered patches, see perl -V for more detail)
      Copyright 1987-2007, Larry Wall
      [snip]
      bash-3.2$ /bin/perl test.pl
      Records: 308273 in 6.719 seconds
      Records: 279997 in 5.875 seconds
      Records: 308273 in 6.484 seconds
      Records: 279997 in 5.906 seconds
      Records: 308273 in 6.515 seconds

        Even stranger, cos I'm using AS1004 also. The only difference is that I'm using the 64-bit version:

        c:\test>perl -V
        Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
          Platform:
            osname=MSWin32, osvers=5.2, archname=MSWin32-x64-multi-thread
            ...
          Characteristics of this binary (from libperl):
            Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                                  PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS
                                  PERL_MALLOC_WRAP PL_OP_SLAB_ALLOC
                                  USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                                  USE_PERLIO USE_SITECUSTOMIZE
            Locally applied patches:
                ActivePerl Build 1004 [287188]
                33741 avoids segfaults invoking S_raise_signal() (on Linux)
                33763 Win32 process ids can have more than 16 bits
                32809 Load 'loadable object' with non-default file extension
                32728 64-bit fix for Time::Local
            Built under MSWin32
            Compiled at Sep 3 2008 12:22:07
            @INC:
                C:/Perl64/site/lib
                C:/Perl64/lib
                .

        And I don't see the problem with 5.8.9/32-bit either:

        c:\test>\perl32\bin\perl5.8.9.exe junk.pl
        Records: 400000 in 6.732 seconds
        Records: 300000 in 4.329 seconds
        Records: 400000 in 4.537 seconds
        Records: 300000 in 3.365 seconds
        Records: 400000 in 4.495 seconds

        I think you should look closer at what is going on there. (I'll try to grab a copy of AS1004 32-bit and run it for comparison purposes.)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Performance oddity when splitting a huge file into an AoA
by aufflick (Deacon) on May 04, 2009 at 06:54 UTC
    On an unrelated note, the last column in your array will have a newline at the end, which may or may not be what you want.
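    If the newline is unwanted, a chomp before splitting removes it. A sketch against the loop from the original post (not the OP's actual code):

        while ( my $line = <$csv> ) {
            chomp $line;                            # strip the trailing record separator
            push @orders, [ split ' , ', $line ];
        }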