PerlMonks
handler performance question

by magawake (Novice)
on Feb 04, 2009 at 00:39 UTC ( [id://741153]=perlquestion )

magawake has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse the contents of car.csv and place them neatly into a directory tree. My code works perfectly; it's just that performance lags, since our car database has over 5 million lines. I suspect the bottleneck is in the open() and close() calls.
$ cat car.csv
make,model,color,year
Honda,Civic,Red,2008
Toyota,Camry,Blue,2002
Honda,Accord,Red,1992
Nissan,Sentra,Blue,2009
Ford,Focus,Green,2009
Honda,Civic,Red,2003
Toyota,Corolla,Green,2002
Honda,Civic,Red,1992
Honda,Civic,Green,2008
Toyota,Camry,Orange,2002
Honda,Accord,Black,1992
Nissan,Sentra,White,2009
Ford,Focus,Green,2007
#!/usr/bin/perl -w
use strict;

# Run the script like: tail -n +2 car.csv | ./foo.pl
# I want to ignore the header

my $make;  # Car Make
my $model; # Car Model
my $out_dir = "/var/tmp/cars"; # Output directory

while (<>) {
    ($make, $model) = split(/,/, $_, 4);
    system("mkdir -p $out_dir/$make");
    open FILE, ">>$out_dir/$make/$model" or die $!;
    print FILE $_;
    close FILE;
}
The output will be:
$ cat /var/tmp/cars/Honda/Civic
Honda,Civic,Red,2008
Honda,Civic,Red,2003
Honda,Civic,Red,1992
Honda,Civic,Green,2008
$ cat /var/tmp/cars/Honda/Accord
Honda,Accord,Red,1992
Honda,Accord,Black,1992
$ cat /var/tmp/cars/Ford/Focus
Ford,Focus,Green,2009
Ford,Focus,Green,2007
Any thoughts? TIA

Replies are listed 'Best First'.
Re: handler performance question
by Joost (Canon) on Feb 04, 2009 at 01:19 UTC
    I'd suspect the main slowdown is the system("mkdir ...") call. Assuming you have significantly less than 5 million car manufacturers you'd probably be better off

    a) testing if the directory exists before attempting to create it

    b) using the perl built-in mkdir command

    c) cache this info. IOW, don't try to recreate a directory you've already created before. The difference between system and mkdir and stat is quite vast, but in this situation it is completely dwarfed by the sheer number of unnecessary calls you'd make. Update: this is the important suggestion; you can ignore the rest for this particular problem.
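A minimal sketch combining these three suggestions. The `%made` cache, the sample `@lines`, and the temp directory are illustrative stand-ins for the original script's `while (<>)` loop and `/var/tmp/cars`:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Illustrative input; the real script would read while (<>)
my @lines = (
    "Honda,Civic,Red,2008\n",
    "Toyota,Camry,Blue,2002\n",
    "Honda,Civic,Red,2003\n",
);

my $out_dir = tempdir(CLEANUP => 1);   # stand-in for /var/tmp/cars
my %made;                              # makes we've already created dirs for

for (@lines) {
    my ($make, $model) = split /,/, $_, 4;
    unless ($made{$make}++) {
        # built-in mkdir instead of system("mkdir -p ..."),
        # called at most once per make thanks to the cache
        mkdir "$out_dir/$make" or die "mkdir: $!" unless -d "$out_dir/$make";
    }
    open my $fh, '>>', "$out_dir/$make/$model" or die $!;
    print $fh $_;
    close $fh;
}
```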

      using the perl built-in mkdir command

      The Perl equivalent of mkdir -p is mkpath if you don't want to do the chain of mkdir commands. On the other hand, I don't see why -p is needed on any but the first call.
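For reference, a pure-Perl equivalent of `mkdir -p` from the core File::Path module (`make_path` is the newer spelling; `mkpath` is the older one). The base directory here is an illustrative temp dir:

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);  # illustrative base directory

# Creates intermediate directories as needed, like mkdir -p
make_path("$base/cars/Honda");

# Calling it again on an existing path is not an error
make_path("$base/cars/Honda");
```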

Re: handler performance question
by ikegami (Patriarch) on Feb 04, 2009 at 03:12 UTC

    By sorting, you can minimize the calls to mkdir, to open and to close. You can even sort such that the order is preserved in each file if so desired.

    #!/usr/bin/perl -w
    use strict;

    # Run the script like:
    #   tail -n +2 car.csv | sort | ./foo.pl
    #
    # Or if you want to preserve order:
    #   tail -n +2 car.csv | sort -t, -k1,2 -s | ./foo.pl

    use File::Path qw( mkpath );

    my $out_dir = "/var/tmp/cars";
    mkpath($out_dir, 0, 0777) or die;

    my $last_make  = '---';
    my $last_model = '---';
    my $fh;

    while (<>) {
        my ($make, $model) = split(/,/, $_, 4);
        if ($make ne $last_make) {
            mkdir("$out_dir/$make", 0777) or die;
            $last_make  = $make;
            $last_model = '---';
        }
        if ($model ne $last_model) {
            $fh = undef;   # close the previous file, if any
            open($fh, '>', "$out_dir/$make/$model") or die;
            $last_model = $model;
        }
        print $fh $_;
    }
      ++ obvious-in-retrospect bright idea :)
Re: handler performance question
by hobbs (Monk) on Feb 04, 2009 at 01:56 UTC
    If the number of make/model pairs isn't likely to get too out-of-hand, you can also cache handles:
    {
        my (%handles, %did_mkdir);
        sub get_out_file {
            my ($make, $model) = @_;
            if (!defined $handles{$make}{$model}) {
                # $out_dir as in the original script
                if (!$did_mkdir{$make} && !-d "$out_dir/$make") {
                    system "mkdir -p $out_dir/$make";
                    $did_mkdir{$make} = 1;
                }
                open $handles{$make}{$model}, '>>', "$out_dir/$make/$model"
                    or die $!;
            }
            return $handles{$make}{$model};
        }
    }
    Then your main loop is reduced to:
    while (<>) {
        my ($make, $model) = split /,/, $_, 4;
        my $out = get_out_file($make, $model);
        print $out $_;
    }
    and all of the handles are closed at the end of execution. If this doesn't run you up against your filehandle limit, it will save you any number of unnecessary opens and mkdirs. If it does, well then you're going to have to start worrying about discarding entries from the cache, but you also have to ask yourself whether performance trumps complexity in that case. :)
Re: handler performance question
by toolic (Bishop) on Feb 04, 2009 at 02:19 UTC
    If you do not need to preserve the order in which you read the lines of your input for each make/model, then you could stuff all the data into a hash-of-hashes structure. This may be more of a memory hog than you can tolerate, but it certainly is faster on my machine than all those mkdir's and open/closes (I created an input file of 13 million lines):
    use strict;
    use warnings;

    # Read all input into data structure
    my %cars;
    my $i = 0; # unique tag to keep duplicate lines distinct
    while (<>) {
        chomp;
        my ($make, $model, $rest) = split /,/, $_, 3;
        $cars{$make}{$model}{"$rest,$i"}++;
        $i++;
    }

    # Create each directory once; open/close each file once
    my $out_dir = "/var/tmp/cars"; # Output directory
    for my $make (keys %cars) {
        my $dir = "$out_dir/$make";
        system "mkdir -p $dir";
        for my $model (keys %{ $cars{$make} }) {
            open my $fh, '>', "$out_dir/$make/$model" or die $!;
            for my $rest (keys %{ $cars{$make}{$model} }) {
                my @specs = split /,/, $rest;
                pop @specs;    # drop the unique tag
                print $fh join(',', $make, $model, @specs), "\n";
            }
            close $fh;
        }
    }

    Update: On second thought, I think you can preserve input order by using an HoHoA instead of the HoHoH above. That should get rid of the smelly tag I introduced, too.
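A sketch of that HoHoA variant: pushing whole lines onto an array preserves input order per make/model and needs no tag. The sample `@lines` and temp directory are illustrative:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Illustrative input; the real script would read while (<>)
my @lines = (
    "Honda,Civic,Red,2008\n",
    "Honda,Accord,Red,1992\n",
    "Honda,Civic,Red,2003\n",
);

# Hash-of-hashes-of-arrays: push preserves input order per make/model
my %cars;
for (@lines) {
    my ($make, $model) = split /,/, $_, 3;
    push @{ $cars{$make}{$model} }, $_;   # keep the whole line
}

# Create each directory once; open/close each file once
my $out_dir = tempdir(CLEANUP => 1);      # stand-in for /var/tmp/cars
for my $make (keys %cars) {
    mkdir "$out_dir/$make" or die $! unless -d "$out_dir/$make";
    for my $model (keys %{ $cars{$make} }) {
        open my $fh, '>', "$out_dir/$make/$model" or die $!;
        print $fh @{ $cars{$make}{$model} };
        close $fh;
    }
}
```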

Re: handler performance question
by hbm (Hermit) on Feb 04, 2009 at 02:37 UTC

    Can you pipe it in sorted order? Then you could hash your data until you get a new make; then write it out and start a new hash.

    And are you locked in to that directory structure? Or could you create a Honda.Civic file instead of ../Honda/Civic?

    And maybe you can do some of that in parallel: Read in a make; write it to file in a child process; and read in the next make.
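The first of those suggestions might look like this: buffer lines per model until the make changes, then flush. This assumes input pre-sorted by make; the sample `@sorted`, temp directory, and `flush_make` helper are illustrative, not from the reply:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumes input pre-sorted by make, e.g.: sort -t, -k1,1 car.csv
my @sorted = (
    "Ford,Focus,Green,2009\n",
    "Ford,Focus,Green,2007\n",
    "Honda,Civic,Red,2008\n",
);

my $out_dir = tempdir(CLEANUP => 1);  # stand-in for /var/tmp/cars

my %buf;        # model => lines, for the current make only
my $cur = '';   # make currently being buffered

# Write out the buffered lines for one make, then empty the buffer
sub flush_make {
    my ($make) = @_;
    return unless $make;
    mkdir "$out_dir/$make" or die $! unless -d "$out_dir/$make";
    for my $model (keys %buf) {
        open my $fh, '>>', "$out_dir/$make/$model" or die $!;
        print $fh @{ $buf{$model} };
        close $fh;
    }
    %buf = ();
}

for (@sorted) {
    my ($make, $model) = split /,/, $_, 4;
    if ($make ne $cur) {
        flush_make($cur);   # make changed: write out the previous one
        $cur = $make;
    }
    push @{ $buf{$model} }, $_;
}
flush_make($cur);   # don't forget the last make
```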

    In the end, it depends on where your performance is lagging.

Node Type: perlquestion [id://741153]
Approved by Joost