As others have mentioned, you're a bit short on details. But I had a bit of time this morning, so I looked over the code briefly. I noticed that in the first subroutine you have a nested loop in which you make a pass through the file for each pattern. This is normally less efficient than making a single pass through the file and checking each line against every pattern, since I/O time is generally "expensive" compared to scanning a string. (I swapped the inner and outer loops in getCsvHash2, shown in the code listing below.)
Also, since you're looping through a small set of patterns, you may be spending a fair amount of time recompiling the regular expression on each pass through the loop. If you're only using a small number of patterns, it may be worth it (performance-wise) to let Perl compile the regular expressions only once. I rearranged the code a bit and came up with the function getCsvHash3. (Note: I may have error(s) in this, so you'll want to test it to ensure that you get the same results.)
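getCsvHash3 gets part of its speed by hard-coding the three patterns. If you need to keep the pattern list data-driven, another option (a rough sketch, untested against your data) is to precompile each pattern with qr// before the read loop and still make only one pass over the file:

    # Rough sketch, untested: precompile each pattern once with qr//,
    # then reuse the compiled regexes during a single pass over the file.
    my @Patterns = ("xxx", "SSS", "s:S");
    my (%compiled, %master);

    for my $wlp (@Patterns) {
        (my $key = $wlp) =~ s/[\s+|:]/_/g;   # same key-munging as your code
        $compiled{$key} = qr/"\Q$wlp\E/;     # compiled once; \Q escapes any metachars
    }

    open my $fh, '<', 'csv_file.csv' or die $!;
    while (my $line = <$fh>) {
        my @csv = split ",", $line;
        for my $key (keys %compiled) {
            push @{ $master{$key} }, $line if $csv[1] =~ $compiled{$key};
        }
    }
    close $fh;

I haven't benchmarked that variant, but since each qr// is compiled only once, I'd expect it to land closer to getCsvHash3 than to the string-interpolated versions.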
Of course, the only way to be really sure about the performance of changes is to measure them. So I whipped up a test file and coded up a benchmark to compare the three versions:
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark qw(cmpthese);

    my @Patterns = ("xxx", "SSS", "s:S");
    my %master;

    cmpthese(50, {
        orig  => \&getCsvHash1,
        swap  => \&getCsvHash2,
        regex => \&getCsvHash3,
    });

    sub getCsvHash1 {
        %master = ();                    # clear the hash
        foreach my $wlp (@Patterns) {
            my $key = $wlp;
            $key =~ s/[\s+|:]/_/g;
            open FILE, "csv_file.csv" or die $!;
            while (<FILE>) {
                my $line = $_;
                my @csv = split(",", $line);
                if ($csv[1] =~ /"$wlp/) {
                    push @{$master{$key}}, $line;   # push as value of a hash
                }
            }                            # while (<FILE>) ends here
            close FILE;
        }                                # foreach $wlp ends here
    }                                    # your original getWhiteListCsvArrays

    sub getCsvHash2 {
        %master = ();                    # clear the hash
        open FILE, "csv_file.csv" or die $!;
        while (<FILE>) {
            my $line = $_;
            foreach my $wlp (@Patterns) {
                my $key = $wlp;
                $key =~ s/[\s+|:]/_/g;
                my @csv = split(",", $line);
                if ($csv[1] =~ /"$wlp/) {
                    push @{$master{$key}}, $line;
                }
            }
        }
        close FILE;
    }

    sub getCsvHash3 {
        %master = ();                    # clear the hash
        open FILE, "csv_file.csv" or die $!;
        while (<FILE>) {
            my $line = $_;
            my @csv = split(",", $line);
            # keys match the munged keys the other two subs produce
            if ($csv[1] =~ /"xxx/) { push @{$master{'xxx'}}, $line; }
            if ($csv[1] =~ /"SSS/) { push @{$master{'SSS'}}, $line; }
            if ($csv[1] =~ /"s:S/) { push @{$master{'s_S'}}, $line; }
        }
        close FILE;
    }
So if all of your time is spent in the getCsvHash routine, then making these changes will help you out: on my machine the third version is nearly three times faster than your original, and more than five times faster than the loop-swapped version. (Interestingly, in this run swapping the loops alone actually came out slower than the original.) But if most of your execution time is spent in other routines, then you'll want to profile them and make improvements as indicated.
    $ time perl pm923771.pl
            Rate  swap  orig regex
    swap  0.544/s    --  -47%  -81%
    orig   1.03/s   89%    --  -64%
    regex  2.87/s  427%  178%    --

    real    2m38.179s
    user    2m34.503s
    sys     0m3.463s
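If most of the time turns out to be spent outside this routine, Devel::NYTProf from CPAN makes it easy to see where it's actually going (the script name below is just a placeholder):

    $ perl -d:NYTProf your_script.pl   # writes profile data to nytprof.out
    $ nytprofhtml                      # turns nytprof.out into an HTML report

Then you can drill into the report to see which subroutines and lines are eating the time.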
...roboticus
When your only tool is a hammer, all problems look like your thumb.