Re^2: how to split huge file reading into multiple threads

Re^3: how to split huge file reading into multiple threads by AR (Friar) on Aug 30, 2011 at 12:30 UTC
Please show some code. We can help you best if you show a stripped down, but working, version of your code with sample data. Maybe your problem is that you're opening files over and over again when you should be keeping them open. I can't tell from your description of the code.
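To illustrate the point about keeping files open, here is a minimal sketch, assuming the script appends records to a small number of output files chosen per record. The helper name out_handle, the routing key and the temporary directory are made up for illustration and do not come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my %out_fh;    # cache: file path => already-open filehandle

# Return a cached append-mode handle, opening the file only on first use.
# (out_handle is a hypothetical helper name.)
sub out_handle {
    my ($path) = @_;
    unless ($out_fh{$path}) {
        open my $fh, '>>', $path or die "Cannot open $path: $!";
        $out_fh{$path} = $fh;
    }
    return $out_fh{$path};
}

my $dir = tempdir(CLEANUP => 1);    # stand-in for a real output directory

for my $i (1 .. 1000) {
    my $key = $i % 2 ? 'odd' : 'even';    # hypothetical routing key
    print { out_handle("$dir/$key") } "record $i\n";
}

# Close every handle once, at the end of the run.
for my $fh (values %out_fh) {
    close($fh) or die "close failed: $!";
}
```

With this shape each output file is opened once per run rather than once per matching line. Whether that matters for your script depends on how often the open/print/close sequence actually runs, so measure before and after.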
Re^4: how to split huge file reading into multiple threads by sagarika (Novice) on Sep 02, 2011 at 06:36 UTC
Alright. Here is a snippet of the code. The script mainly spends its time in the for loop where the keys of the hash of arrays are referenced.

    @Patterns=("xxx", "SSS", "s:S");

    sub getCsvHash
    {
        %master=(); # unset the HASH.
        foreach my $wlp (@Patterns)
        {
            my $key=$wlp;
            $key =~ s/[\s+|:]/_/g;
            open FILE, "csv_file.csv" or die $!;
            while(<FILE>)
            {
                my $line=$_;
                {
                    my @csv = split(",", $line);
                    if ($csv[1] =~ /"$wlp/)
                    {
                        push (@{$master{$key}}, $line); # push as value of a hash
                    }
                }
            } #while(<FILE>) ends here
            close FILE;
        } #foreach $wlp ends here
    } #Function getWhiteListCsvArrays ends here.

    sub Processfiles
    {
        open DFH, "$log_file";
        while(<DFH>)
        {
            my $line=$_;
            if ($line =~/(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)/)
            {
                my $rc=$3;
                my $ct=$10;
                my $cl=$6;
                my $retval=applyList($line,$rc,$cl,$ct)
            }
        }
    }

    sub applyList
    {
        foreach my $row (@{$master{$key}})
        {
            my $param_op1_flag="nc"; #Set the flag to indicate that the optional parameter 1 is nc(not-checked).
            my $param_op2_flag="nc"; #Set the flag to indicate that the optional parameter 2 is nc(not-checked).
            my @row_csv = split(",", $row);
            $row_csv[3] =~s/"(.*?)"/$1/g; #Get the mandatory part 1. This is the domain-name.
            $row_csv[4] =~s/"(.*?)"/$1/g; #Get the mandatory part 2. This is part after domain-name.
            my $param_man = $row_csv[3] . $row_csv[4]; #combine the mandatory parts.
            my $param_op1=$row_csv[5]; #Get the optional parameter 1.
            my $param_op2=$row_csv[6]; #Get the optional parameter 2.
            $param_op1=~ s/\n//g; #Remove the new-lines if any.
            $param_op2=~ s/\n//g; #Remove the new-lines if any.
            if(length($param_op1)) #check if optional parameter 1 has something to check or not.
            {
                $param_op1 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                $param_op1 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                if($url =~/$param_op1/) #check if optional parameter 1 is present in URL or not.
                {
                    $param_op1_flag="cf"; #Set the optional parameter 1 flag to cf (checked-found).
                }
                else
                {
                    $param_op1_flag="cnf"; #Set the optional parameter 1 flag to cnf (checked-not-found).
                }
            }
            if(length($param_op2) > 1 )
            {
                $param_op2 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                $param_op2 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                if($url =~ /$param_op2/) #check if optional parameter 2 is present in URL or not.
                {
                    $param_op2_flag="cf"; #Set the optional parameter 2 flag to cf (checked-found).
                }
                else
                {
                    $param_op2_flag="cnf"; #Set the optional parameter 2 flag to cnf (checked-not-found).
                }
            }
            if($url=~/$param_man/ && ($param_op1_flag eq "cf" || $param_op1_flag eq "nc") && ($param_op2_flag eq "cf" || $param_op2_flag eq "nc"))
            {
                if (($cl < 5000 || $rc == 206) && $key =~/^AS_D/)
                {
                    open OPF, ">>out/$key_cont";
                    print OPF $line;
                    close OPF;
                    applyBlackList($line,$key_cont,$ct);
                    $retval_def=1;
                    return $retval_def;
                }
                open OPF, ">>out/$key";
                print OPF $line;
                close OPF;
                if($key !~/^AD/)
                {
                    applyBlackList($line,$key,$ct);
                }
                $retval_def=1;
                return $retval_def
            }
        }
    }
    return $retval_def;
    }
    }
    }

Please suggest.
Re^5: how to split huge file reading into multiple threads by roboticus (Chancellor) on Sep 02, 2011 at 12:17 UTC
sagarika:

As others have mentioned, you're a bit short on details. But I had a bit of time this morning, so I looked over the code briefly.

I noticed that in the first subroutine, you have a nested loop in which you make a pass through the file for each pattern. This is normally less efficient than making a single pass through the file and checking each pattern against every line, as I/O time is generally "expensive" compared to scanning a string. (I swapped the inner and outer loops in getCsvHash2, shown in the code listing below.)

Also, since you're looping through a small set of patterns, you may be paying too much for recompiling the regular expression each time through the loop. If you're only using a small number of patterns, it may be worth it (performance-wise) to let perl compile the regular expressions only once. I rearranged the code a bit and came up with the function getCsvHash3. (Note: I may have error(s) in this, so you'll want to test it to ensure that you get the same results.)

Of course, the only way to really be sure about the performance of changes is to measure it. So I whipped up a test file and coded up a benchmark to compare them:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @Patterns=("xxx", "SSS", "s:S");
    my %master;

    cmpthese(50, {
        orig=>\&getCsvHash1,
        swap=>\&getCsvHash2,
        regex=>\&getCsvHash3,
    });

    sub getCsvHash1
    {
        %master=(); # unset the HASH.
        foreach my $wlp (@Patterns)
        {
            my $key=$wlp;
            $key =~ s/[\s+|:]/_/g;
            open FILE, "csv_file.csv" or die $!;
            while(<FILE>)
            {
                my $line=$_;
                {
                    my @csv = split(",", $line);
                    if ($csv[1] =~ /"$wlp/)
                    {
                        push (@{$master{$key}}, $line); # push as value of a hash
                    }
                }
            } #while(<FILE>) ends here
            close FILE;
        } #foreach $wlp ends here
    } #Function getWhiteListCsvArrays ends here.

    sub getCsvHash2
    {
        %master=(); # unset the HASH.
        open FILE, "csv_file.csv" or die $!;
        while(<FILE>)
        {
            my $line=$_;
            foreach my $wlp (@Patterns)
            {
                my $key=$wlp;
                $key =~ s/[\s+|:]/_/g;
                {
                    my @csv = split(",", $line);
                    if ($csv[1] =~ /"$wlp/)
                    {
                        push (@{$master{$key}}, $line); # push as value of a hash
                    }
                }
            }
        }
        close FILE;
    }

    sub getCsvHash3
    {
        %master=(); # unset the HASH.
        open FILE, "csv_file.csv" or die $!;
        while(<FILE>)
        {
            my $line=$_;
            my @csv = split(",", $line);
            # Each pattern is a literal regex, so perl compiles it only once.
            # The hash key is the pattern name with ':' replaced by '_'.
            if ($csv[1] =~ /xxx/)
            {
                push @{$master{xxx}}, $line;
            }
            if ($csv[1] =~ /SSS/)
            {
                push @{$master{SSS}}, $line;
            }
            if ($csv[1] =~ /s:S/)
            {
                push @{$master{s_S}}, $line;
            }
        }
        close FILE;
    }

So if all of your time is spent in the getCsvHash routine, then making these changes will help you out--on my machine the third version is a shade over 4 times faster than the swapped-loop version and runs at about 2.8 times the rate of the original. But if most of your execution time is spent in other routines, then you'll want to profile them and make improvements as indicated.

    $ time perl pm923771.pl
             Rate  swap  orig regex
    swap  0.544/s    --  -47%  -81%
    orig   1.03/s   89%    --  -64%
    regex  2.87/s  427%  178%    --

    real    2m38.179s
    user    2m34.503s
    sys     0m3.463s

...roboticus

When your only tool is a hammer, all problems look like your thumb.
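For readers who want to keep the pattern list as data rather than hard-coding three if statements, here is a rough sketch of the same single-pass, compile-once idea using qr//. It is not roboticus's code. The sample rows, the get_csv_hash name and the use of \Q...\E (which assumes the patterns are meant as literal strings) are assumptions made for illustration, and the data is read from an in-memory string so the example runs without csv_file.csv.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample rows standing in for csv_file.csv.
my $csv_data = <<'END';
1,"xxx-alpha","a.example","/p"
2,"SSS-beta","b.example","/q"
3,"s:S-gamma","c.example","/r"
4,"other","d.example","/s"
END

my @Patterns = ("xxx", "SSS", "s:S");

# Compile each pattern once, paired with its sanitized hash key.
# \Q...\E treats the pattern as a literal string (an assumption).
my @compiled = map {
    my $key = $_;
    $key =~ s/[\s+|:]/_/g;
    +{ key => $key, re => qr/"\Q$_\E/ };
} @Patterns;

# One pass over the input; split each line once, then try every pattern.
sub get_csv_hash {
    my ($fh) = @_;
    my %master;
    while (my $line = <$fh>) {
        my @csv = split /,/, $line;
        next unless defined $csv[1];
        for my $p (@compiled) {
            push @{ $master{ $p->{key} } }, $line if $csv[1] =~ $p->{re};
        }
    }
    return \%master;
}

open my $fh, '<', \$csv_data or die "Cannot open in-memory data: $!";
my $master = get_csv_hash($fh);
close $fh;

printf "%s: %d line(s)\n", $_, scalar @{ $master->{$_} } for sort keys %$master;
```

Unlike getCsvHash2, this splits each line only once, outside the pattern loop. As with the versions above, benchmark it against your real data before relying on it.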
Re^6: how to split huge file reading into multiple threads by sagarika (Novice) on Sep 07, 2011 at 06:03 UTC
Re^3: how to split huge file reading into multiple threads by GrandFather (Saint) on Aug 30, 2011 at 10:00 UTC
As Corion suggests, it is hard to offer much by way of constructive advice without something concrete to play with. However, it may be that you can leverage regular expressions in some fashion to speed up the matching phase of the process. I can't provide much more focused advice without some information about the nature of the matching.

True laziness is hard work
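One way regular expressions might be leveraged here, sketched under assumptions since the real matching rules are not known: join the patterns into a single alternation that is compiled once, and let the capture group report which pattern matched. The pattern list is the one posted elsewhere in this thread; the sample fields are invented. Note that an alternation reports only the first pattern that matches at the leftmost position, so it is not equivalent to testing every pattern separately when a field could match more than one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @patterns = ("xxx", "SSS", "s:S");    # patterns posted in this thread

# Build one alternation, compiled once; the capture tells us which one hit.
# quotemeta assumes the patterns are literal strings.
my $alt = join '|', map { quotemeta($_) } @patterns;
my $re  = qr/"($alt)/;

# Hypothetical sample values for the second CSV field.
my @fields = ('"xxx-alpha"', '"s:S-gamma"', '"other"');

my %hits;
for my $field (@fields) {
    if ($field =~ $re) {
        (my $key = $1) =~ s/[\s+|:]/_/g;    # same key clean-up as getCsvHash
        push @{ $hits{$key} }, $field;
    }
}

print "$_ => @{ $hits{$_} }\n" for sort keys %hits;
```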
Re^3: how to split huge file reading into multiple threads by Corion (Patriarch) on Aug 30, 2011 at 09:43 UTC
This is not code we can download and run for ourselves. Please reduce your problem to a program of about 20 lines, and also post about 20 lines of representative data.