PerlMonks
handler performance question

by magawake (Novice)
on Feb 04, 2009 at 00:39 UTC ( [id://741153]=perlquestion )

magawake has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse the contents of car.csv and place them neatly into a directory tree. My code works perfectly; it's just that performance lags, since our car database has over 5 million lines. I suspect the bottleneck is in the open() and close() calls.
$ cat car.csv
make,model,color,year
Honda,Civic,Red,2008
Toyota,Camry,Blue,2002
Honda,Accord,Red,1992
Nissan,Sentra,Blue,2009
Ford,Focus,Green,2009
Honda,Civic,Red,2003
Toyota,Corolla,Green,2002
Honda,Civic,Red,1992
Honda,Civic,Green,2008
Toyota,Camry,Orange,2002
Honda,Accord,Black,1992
Nissan,Sentra,White,2009
Ford,Focus,Green,2007
#!/usr/bin/perl -w
use strict;

# Run the script like: tail -n +2 car.csv | ./foo.pl
# I want to ignore the header

my $make;  # Car Make
my $model; # Car Model
my $out_dir = "/var/tmp/cars"; # Output directory

while (<>) {
    ($make, $model) = split(/,/, $_, 4);
    system("mkdir -p $out_dir/$make");
    open FILE, ">>$out_dir/$make/$model" or die $!;
    print FILE $_;
    close FILE;
}
The output will be:
$ cat /var/tmp/cars/Honda/Civic
Honda,Civic,Red,2008
Honda,Civic,Red,2003
Honda,Civic,Red,1992
Honda,Civic,Green,2008
$ cat /var/tmp/cars/Honda/Accord
Honda,Accord,Red,1992
Honda,Accord,Black,1992
$ cat /var/tmp/cars/Ford/Focus
Ford,Focus,Green,2009
Ford,Focus,Green,2007
Any thoughts? TIA

Replies are listed 'Best First'.
Re: handler performance question
by Joost (Canon) on Feb 04, 2009 at 01:19 UTC
    I'd suspect the main slowdown is the system("mkdir ...") call. Assuming you have significantly less than 5 million car manufacturers you'd probably be better off

    a) testing if the directory exists before attempting to create it

    b) using the perl built-in mkdir command

    c) cache this info. IOW, don't try to recreate a directory you've already created before. The difference between system and mkdir and stat is quite vast, but in this situation it is completely dwarfed by the sheer number of unnecessary calls you'd make. Update: this is the important suggestion; you can ignore the rest for this particular problem.
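A minimal sketch combining these three suggestions. The `%made` cache, the sample `@lines`, and the temp directory are illustrative stand-ins for the original script's `while (<>)` loop and `/var/tmp/cars`:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Illustrative input; the real script would read while (<>)
my @lines = (
    "Honda,Civic,Red,2008\n",
    "Toyota,Camry,Blue,2002\n",
    "Honda,Civic,Red,2003\n",
);

my $out_dir = tempdir(CLEANUP => 1);   # stand-in for /var/tmp/cars
my %made;                              # makes we've already created dirs for

for (@lines) {
    my ($make, $model) = split /,/, $_, 4;
    unless ($made{$make}++) {
        # built-in mkdir instead of system("mkdir -p ..."),
        # called at most once per make thanks to the cache
        mkdir "$out_dir/$make" or die "mkdir: $!" unless -d "$out_dir/$make";
    }
    open my $fh, '>>', "$out_dir/$make/$model" or die $!;
    print $fh $_;
    close $fh;
}
```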

      using the perl built-in mkdir command

      The Perl equivalent of mkdir -p is mkpath if you don't want to do the chain of mkdir commands. On the other hand, I don't see why -p is needed on any but the first call.
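For reference, a pure-Perl equivalent of `mkdir -p` from the core File::Path module (`make_path` is the newer spelling; `mkpath` is the older one). The base directory here is an illustrative temp dir:

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);  # illustrative base directory

# Creates intermediate directories as needed, like mkdir -p
make_path("$base/cars/Honda");

# Calling it again on an existing path is not an error
make_path("$base/cars/Honda");
```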

Re: handler performance question
by ikegami (Patriarch) on Feb 04, 2009 at 03:12 UTC

    By sorting, you can minimize the calls to mkdir, to open and to close. You can even sort such that the order is preserved in each file if so desired.

    #!/usr/bin/perl -w
    use strict;

    # Run the script like:
    #   tail -n +2 car.csv | sort | ./foo.pl
    #
    # Or if you want to preserve order:
    #   tail -n +2 car.csv | sort -t, -k1,2 -s | ./foo.pl

    use File::Path qw( mkpath );

    my $out_dir = "/var/tmp/cars";
    mkpath($out_dir, 0, 0777) or die;

    my $last_make  = '---';
    my $last_model = '---';
    my $fh;

    while (<>) {
        my ($make, $model) = split(/,/, $_, 4);
        if ($make ne $last_make) {
            mkdir("$out_dir/$make", 0777) or die;
            $last_make  = $make;
            $last_model = '---';
        }
        if ($model ne $last_model) {
            $fh = undef;   # close the previous file, if any
            open($fh, '>', "$out_dir/$make/$model") or die;
            $last_model = $model;
        }
        print $fh $_;
    }
      ++ obvious-in-retrospect bright idea :)
Re: handler performance question
by hobbs (Monk) on Feb 04, 2009 at 01:56 UTC
    If the number of make/model pairs isn't likely to get too out-of-hand, you can also cache handles:
    {
        my (%handles, %did_mkdir);
        sub get_out_file {
            my ($make, $model) = @_;
            if (!defined $handles{$make}{$model}) {
                # $out_dir as in the original script
                if (!$did_mkdir{$make} && !-d "$out_dir/$make") {
                    system "mkdir -p $out_dir/$make";
                    $did_mkdir{$make} = 1;
                }
                open $handles{$make}{$model}, '>>', "$out_dir/$make/$model"
                    or die $!;
            }
            return $handles{$make}{$model};
        }
    }
    Then your main loop is reduced to:
    while (<>) {
        my ($make, $model) = split /,/, $_, 4;
        my $out = get_out_file($make, $model);
        print $out $_;
    }
    and all of the handles are closed at the end of execution. If this doesn't run you up against your filehandle limit, it will save you any number of unnecessary opens and mkdirs. If it does, well then you're going to have to start worrying about discarding entries from the cache, but you also have to ask yourself whether performance trumps complexity in that case. :)
Re: handler performance question
by toolic (Bishop) on Feb 04, 2009 at 02:19 UTC
    If you do not need to preserve the order in which you read the lines of your input for each make/model, then you could stuff all the data into a hash-of-hashes structure. This may be more of a memory hog than you can tolerate, but it certainly is faster on my machine than all those mkdir's and open/closes (I created an input file of 13 million lines):
    use strict;
    use warnings;

    # Read all input into data structure
    my %cars;
    my $i = 0; # unique tag to keep duplicate lines distinct
    while (<>) {
        chomp;
        my ($make, $model, $rest) = split /,/, $_, 3;
        $cars{$make}{$model}{"$rest,$i"}++;
        $i++;
    }

    # Create each directory once; open/close each file once
    my $out_dir = "/var/tmp/cars"; # Output directory
    for my $make (keys %cars) {
        my $dir = "$out_dir/$make";
        system "mkdir -p $dir";
        for my $model (keys %{ $cars{$make} }) {
            open my $fh, '>', "$out_dir/$make/$model" or die $!;
            for my $rest (keys %{ $cars{$make}{$model} }) {
                my @specs = split /,/, $rest;
                pop @specs;    # drop the unique tag
                print $fh join(',', $make, $model, @specs), "\n";
            }
            close $fh;
        }
    }

    Update: On second thought, I think you can preserve input order by using an HoHoA instead of the HoHoH above. That should get rid of the smelly tag I introduced, too.
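A sketch of that HoHoA variant: pushing whole lines onto an array preserves input order per make/model and needs no tag. The sample `@lines` and temp directory are illustrative:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Illustrative input; the real script would read while (<>)
my @lines = (
    "Honda,Civic,Red,2008\n",
    "Honda,Accord,Red,1992\n",
    "Honda,Civic,Red,2003\n",
);

# Hash-of-hashes-of-arrays: push preserves input order per make/model
my %cars;
for (@lines) {
    my ($make, $model) = split /,/, $_, 3;
    push @{ $cars{$make}{$model} }, $_;   # keep the whole line
}

# Create each directory once; open/close each file once
my $out_dir = tempdir(CLEANUP => 1);      # stand-in for /var/tmp/cars
for my $make (keys %cars) {
    mkdir "$out_dir/$make" or die $! unless -d "$out_dir/$make";
    for my $model (keys %{ $cars{$make} }) {
        open my $fh, '>', "$out_dir/$make/$model" or die $!;
        print $fh @{ $cars{$make}{$model} };
        close $fh;
    }
}
```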

Re: handler performance question
by hbm (Hermit) on Feb 04, 2009 at 02:37 UTC

    Can you pipe it in sorted order? Then you could hash your data until you get a new make; then write it out and start a new hash.

    And are you locked in to that directory structure? Or could you create a Honda.Civic file instead of ../Honda/Civic?

    And maybe you can do some of that in parallel: Read in a make; write it to file in a child process; and read in the next make.
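The first of those suggestions might look like this: buffer lines per model until the make changes, then flush. This assumes input pre-sorted by make; the sample `@sorted`, temp directory, and `flush_make` helper are illustrative, not from the reply:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumes input pre-sorted by make, e.g.: sort -t, -k1,1 car.csv
my @sorted = (
    "Ford,Focus,Green,2009\n",
    "Ford,Focus,Green,2007\n",
    "Honda,Civic,Red,2008\n",
);

my $out_dir = tempdir(CLEANUP => 1);  # stand-in for /var/tmp/cars

my %buf;        # model => lines, for the current make only
my $cur = '';   # make currently being buffered

# Write out the buffered lines for one make, then empty the buffer
sub flush_make {
    my ($make) = @_;
    return unless $make;
    mkdir "$out_dir/$make" or die $! unless -d "$out_dir/$make";
    for my $model (keys %buf) {
        open my $fh, '>>', "$out_dir/$make/$model" or die $!;
        print $fh @{ $buf{$model} };
        close $fh;
    }
    %buf = ();
}

for (@sorted) {
    my ($make, $model) = split /,/, $_, 4;
    if ($make ne $cur) {
        flush_make($cur);   # make changed: write out the previous one
        $cur = $make;
    }
    push @{ $buf{$model} }, $_;
}
flush_make($cur);   # don't forget the last make
```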

    In the end, it depends on where your performance is lagging.

Node Type: perlquestion [id://741153]
Approved by Joost