in reply to Which is more faster? While or tr///

I have to calculate number of all letters in files and accordingly increase count
while ($string=~m{(W)}g){
Your code does not match your description. Your code counts the number of Ws.

Re^2: Which is more faster? While or tr///
by tej (Scribe) on Feb 01, 2011 at 09:37 UTC
    This was just an example... Whole code looks like:
    while ($string=~m{<(ths|193)>|\.}g) { $count=$count+($tbsz * 0.25); }
    while ($string=~m{<(ens|194)>}g) { $count=$count+($tbsz * 0.5); }
    while ($string=~m{<(ems|195)>|\s}g) { $count=$count+$tbsz; }
    $string=~s/<[A-z]+>//g;
    if($ftype==1){
        while ($string=~m{(W|\s|%)}g){ ### % is added temporarily for some testing purpose
            $count=$count+($tbsz*1);
        }
        while ($string=~m{(w|\)|\()}g){ $count=$count+($tbsz*0.84375); }
        while ($string=~m{(M|m)}g){ $count=$count+($tbsz*0.8125); }
        while ($string=~m{(N|Q)}g){ $count=$count+($tbsz*0.7188); }
        while ($string=~m{(O|Y)}g){ $count=$count+($tbsz*0.6875); }
        while ($string=~m{(A|D|G|H|K|U|V|X)}g){ $count=$count+($tbsz*0.6562); }
        while ($string=~m{(R)}g){ $count=$count+($tbsz*0.625); }
        while ($string=~m{(B|C|P|T|Z|a|b|d|h|k|n|p|q|u|v|x)}g){ $count=$count+($tbsz*0.5625); }
        while ($string=~m{(6)}g){ $count=$count+($tbsz*0.55); }
        while ($string=~m{(0)}g){ $count=$count+($tbsz*0.5375); }
        while ($string=~m{(g|y)}g){ $count=$count+($tbsz*0.5313); }
        while ($string=~m{(4)}g){ $count=$count+($tbsz*0.5281); }
        while ($string=~m{(7|8)}g){ $count=$count+($tbsz*0.5156); }
        while ($string=~m{(o|2|3)}g){ $count=$count+($tbsz*0.5); }
        while ($string=~m{(5)}g){ $count=$count+($tbsz*0.4938); }
        while ($string=~m{(9)}g){ $count=$count+($tbsz*0.4813); }
        while ($string=~m{(E|L)}g){ $count=$count+($tbsz*0.46875); }
        while ($string=~m{(F|c|e|z)}g){ $count=$count+($tbsz*0.4375); }
        while ($string=~m{(J|S|f)}g){ $count=$count+($tbsz*0.4063); }
        while ($string=~m{(1)}g){ $count=$count+($tbsz*0.3625); }
        while ($string=~m{(r)}g){ $count=$count+($tbsz*0.35); }
        while ($string=~m{(s)}g){ $count=$count+($tbsz*0.3188); }
        while ($string=~m{(l|t)}g){ $count=$count+($tbsz*0.285); }
        while ($string=~m{(l)}g){ $count=$count+($tbsz*0.25); }
        while ($string=~m{(i|j)}g){ $count=$count+($tbsz*0.2345); }
    }else{
        while ($string=~m{(W)}g){ $count=$count+($tbsz*0.7844); }
        while ($string=~m{(w)}g){ $count=$count+($tbsz*0.6989); }
        while ($string=~m{(A)}g){ $count=$count+($tbsz*0.5656); }
        while ($string=~m{(X)}g){ $count=$count+($tbsz*0.55); }
        while ($string=~m{(Q|O)}g){ $count=$count+($tbsz*0.5469); }
        while ($string=~m{(R|K|Y)}g){ $count=$count+($tbsz*0.5375); }
        while ($string=~m{(C|V)}g){ $count=$count+($tbsz*0.5313); }
        while ($string=~m{(N)}g){ $count=$count+($tbsz*0.5283); }
        while ($string=~m{(D|G|T)}g){ $count=$count+($tbsz*0.525); }
        while ($string=~m{(S|H)}g){ $count=$count+($tbsz*0.5125); }
        while ($string=~m{(B)}g){ $count=$count+($tbsz*0.5); }
        while ($string=~m{(4|U|Z)}g){ $count=$count+($tbsz*0.4875); }
        while ($string=~m{(8|9|P|3|6|7)}g){ $count=$count+($tbsz*0.475); }
        while ($string=~m{(0|5|a|2)}g){ $count=$count+($tbsz*0.4688); }
        while ($string=~m{(x|y)}g){ $count=$count+($tbsz*0.4594); }
        while ($string=~m{(L|b|g|o|p|q|v)}g){ $count=$count+($tbsz*0.4469); }
        while ($string=~m{(E|F|c|d|e)}g){ $count=$count+($tbsz*0.4438); }
        while ($string=~m{(h)}g){ $count=$count+($tbsz*0.4313); }
        while ($string=~m{(n|u)}g){ $count=$count+($tbsz*0.4219); }
        while ($string=~m{(z|J|k)}g){ $count=$count+($tbsz*0.4031); }
        while ($string=~m{(s|r)}g){ $count=$count+($tbsz*0.3969); }
        while ($string=~m{(t)}g){ $count=$count+($tbsz*0.3219); }
        while ($string=~m{(f)}g){ $count=$count+($tbsz*0.3188); }
        while ($string=~m{(1)}g){ $count=$count+($tbsz*0.3031); }
        while ($string=~m{(j)}g){ $count=$count+($tbsz*0.2438); }
        while ($string=~m{(I|i|l)}g){ $count=$count+($tbsz*0.1438); }
    }

      I can see four problems with the posted code that would make it run slower:

      1. You are using alternation when a character class would be faster.
      2. You are using capturing parentheses when you don't use the results of those captures.
      3. You are using the "+" Additive operator instead of the more efficient "+=" assignment operator.
      4. You are looping over the same string 28 or 29 times, depending on the value of $ftype, when you probably should only have to loop over the string twice.

      For example:

      while ($string=~m{(B|C|P|T|Z|a|b|d|h|k|n|p|q|u|v|x)}g){ $count=$count+($tbsz*0.5625); }

      Would be more efficient as:

      while ($string=~m{[BCPTZabdhknpquvx]}g){ $count+=($tbsz*0.5625); }
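
      The two forms above can be checked against each other with the core Benchmark module. This is only a minimal sketch, not the original poster's program: the sample string, the value of $tbsz, and the sub names are assumptions made for illustration. It also includes a tr/// count, which is possible for this particular loop because only single characters are involved.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample data; the real $string and $tbsz come from the original program.
my $string = 'The quick brown fox jumps over the lazy dog. ' x 200;
my $tbsz   = 10;

# Original form: alternation inside capturing parentheses.
sub by_alternation {
    my $count = 0;
    while ( $string =~ m{(B|C|P|T|Z|a|b|d|h|k|n|p|q|u|v|x)}g ) {
        $count = $count + ( $tbsz * 0.5625 );
    }
    return $count;
}

# Character class, no capture, compound assignment.
sub by_char_class {
    my $count = 0;
    while ( $string =~ m{[BCPTZabdhknpquvx]}g ) {
        $count += $tbsz * 0.5625;
    }
    return $count;
}

# tr/// with an empty replacement list counts the matching characters
# without modifying the string.
sub by_tr {
    my $n = ( $string =~ tr/BCPTZabdhknpquvx// );
    return $n * $tbsz * 0.5625;
}

# Compare the three variants; timings depend on the machine and the data.
cmpthese( -1, {
    alternation => \&by_alternation,
    char_class  => \&by_char_class,
    tr          => \&by_tr,
} );
```

      All three subs should return the same total, so any speed difference can be judged on your own data before changing the real code.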

      That would cover points 1, 2 and 3. For point 4 you could use hash tables for the calculations, something like:

      # Keys are the literal strings that end up in $1, so the hash lookup works.
      # '\s' cannot be such a key; list the whitespace characters you expect.
      my %start_table = (
          ' '     => 1,
          "\t"    => 1,
          "\n"    => 1,
          '<ems>' => 1,
          '<195>' => 1,
          '.'     => 0.25,
          '<ths>' => 0.25,
          '<193>' => 0.25,
          '<ens>' => 0.5,
          '<194>' => 0.5,
      );
      # quotemeta escapes the meta-characters when the regex is built
      my $start_lookup = join '|', map { quotemeta($_) } keys %start_table;

      my %ftype_table = (
          W    => 1,
          ' '  => 1,
          "\t" => 1,
          "\n" => 1,
          '%'  => 1,    ### % is added temporarily for some testing purpose
          w    => 0.84375,
          ')'  => 0.84375,
          '('  => 0.84375,
          M    => 0.8125,
          m    => 0.8125,
          N    => 0.7188,
          Q    => 0.7188,
          # etc,
      );
      my $ftype_lookup = join '', map { quotemeta($_) } keys %ftype_table;

      my %non_ftype_table = (
          W => 0.7844,
          w => 0.6989,
          A => 0.5656,
          X => 0.55,
          Q => 0.5469,
          O => 0.5469,
          R => 0.5375,
          K => 0.5375,
          Y => 0.5375,
          # etc.
      );
      my $non_ftype_lookup = join '', map { quotemeta($_) } keys %non_ftype_table;

      while ( $string =~ /($start_lookup)/og ) {
          $count += $tbsz * $start_table{ $1 };
      }

      $string =~ s/<[A-Z\[\\\]\^_`a-z]+>//g;

      if ( $ftype == 1 ) {
          while ( $string =~ /([$ftype_lookup])/og ) {
              $count += $tbsz * $ftype_table{ $1 };
          }
      }
      else {
          while ( $string =~ /([$non_ftype_lookup])/og ) {
              $count += $tbsz * $non_ftype_table{ $1 };
          }
      }

      Couldn't you just put all the weights in a lookup table, and then iterate once over the characters of the string?  Something like this:

      my %weight = (
          A => 0.6562,
          B => 0.3571,
          #...
          z => 0.42,
      );
      for my $ch (split //, $string) {
          $count += $tbsz * $weight{$ch};
      }

      Or (if the string is huge)

      ...
      while ($string =~ /(.)/gs) {
          $count += $tbsz * $weight{$1};
      }

      (And if ord($ch) of the characters is within a narrow range (such as ASCII), you could also use an array, and store the weights under $array[ord($ch)] — which might be a tad faster than a hash.)
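
      The array variant could look roughly like the sketch below. The weight table is a small, assumed subset, string_width is a hypothetical helper name, and the code assumes the string holds only single-byte (ASCII) characters. Characters without an entry contribute nothing.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Assumed, partial weight table for illustration; the real table would
# list every character that should contribute to the width.
my %weight = ( W => 1, w => 0.84375, M => 0.8125, m => 0.8125, i => 0.2345 );

# Index the weights by character code; unlisted ASCII characters weigh 0.
my @weight_by_ord = (0) x 128;
$weight_by_ord[ ord($_) ] = $weight{$_} for keys %weight;

# Hypothetical helper; assumes $string contains only ASCII characters.
sub string_width {
    my ( $string, $tbsz ) = @_;
    my $sum = 0;
    # unpack 'C*' yields one character code per byte of the string.
    $sum += $weight_by_ord[$_] // 0 for unpack 'C*', $string;
    return $sum * $tbsz;    # multiply by $tbsz once, after the loop
}

print string_width( 'WwMm', 16 ), "\n";
```

      Whether the array really beats the hash is something to measure on your own data; the gain, if any, is likely to be small.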

      Whole code looks like
      while ($string=~m{<(ths|193)>|\.}g) { $count=$count+($tbsz * 0.25); }
      That makes a tr/// solution a non-candidate, doesn't it?

      I'd probably go for something like:

      my %factor;
      $factor{ths} = $factor{193} = 0.25;
      $factor{ens} = $factor{194} = 0.5;
      $factor{ems} = $factor{195} = 1;
      $factor{W} = $factor{' '} = $factor{"\n"} = $factor{"\t"} = ... = 1;
      $factor{w} = $factor{'('} = $factor{')'} = 0.84375;
      ....
      my $count;
      while (/(?|<([A-Za-z0-9]+)>|(.))/gs) {    # /s so that . also matches "\n"
          no warnings 'uninitialized';
          $count += $factor{$1};
      }
      $count *= $tbsz;
      You may want to consider replacing the (.) with ([$charclass]), where:
      my $charclass = join "", map { quotemeta($_) } grep {1 == length} keys %factor;
      Whether that makes a difference depends on your data set.
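
      Put together, the single-pass idea might look like the following sketch. The factor table is only a partial copy of a few entries from the original loops, weighted_width is a hypothetical helper name, and the branch-reset pattern (?|...) needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;    # branch reset (?|...) and // need Perl 5.10 or later

# Partial factor table, copied from a few entries of the original loops.
my %factor;
$factor{ths} = $factor{193} = 0.25;
$factor{ens} = $factor{194} = 0.5;
$factor{ems} = $factor{195} = 1;
$factor{W}   = $factor{' '} = 1;
$factor{w}   = $factor{'('} = $factor{')'} = 0.84375;
$factor{'.'} = 0.25;

# Single-character keys become one escaped character class.
my $charclass = join '', map { quotemeta($_) } grep { 1 == length } keys %factor;

# Hypothetical helper: one pass over the string, tags first, then characters.
sub weighted_width {
    my ( $string, $tbsz ) = @_;
    my $count = 0;
    while ( $string =~ /(?|<([A-Za-z0-9]+)>|([$charclass]))/g ) {
        $count += $factor{$1} // 0;    # unknown tags contribute nothing
    }
    return $count * $tbsz;
}

say weighted_width( 'W<ens>w.', 16 );
```

      Note that this sketch counts each character once, whereas the original loops count whitespace in two separate passes, so totals may differ from the original program's.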