Please consider the following two programs (both of which work fine; suggestions for better function names are welcome). The first replaces each run of redundant white space characters with a single white space character.

#!/usr/bin/perl -d:NYTProf

use warnings;
use strict;
use String::Util 'trim';
use Benchmark qw(cmpthese timethese);

cmpthese(
    -30,
    {
    compress_1  => q|compress_1(' Mary      had a   little     lamb. ');|,
    compress_2  => q|compress_2(' Mary      had a   little     lamb. ');|,
    compress_3  => q|compress_3(' Mary      had a   little     lamb. ');|,
    squash  => q|squash(' Mary      had a   little     lamb. ');|,
    split_join  => q|split_join(' Mary      had a   little     lamb. ');|,
    }
);
print "'compress_1' => '",compress_1(' Mary      had a   little     lamb. '),"'\n";
print "'compress_2' => '",compress_2(' Mary      had a   little     lamb. '),"'\n";
print "'compress_3' => '",compress_3(' Mary      had a   little     lamb. '),"'\n";
print "'squash' => '",squash(' Mary      had a   little     lamb. '),"'\n";
print "'split_join' => '",split_join(' Mary      had a   little     lamb. '),"'\n";

exit;

sub compress_1 {
  my $string = shift;
  $string =~ s/ +/ /g;
  return $string;
}

sub compress_2 {
  my $string = shift;
  $string =~ s/\h+/ /g;
  return $string;
}

sub compress_3 {
  my $string = shift;
  $string =~ s/ {1,}/ /g;
  return $string;
}

sub squash {
  my $string = shift;
  $string =~ tr/ //s;
  return $string;
}

sub split_join {
  my $string = shift;
  $string = join ' ', split ' ', $string;
  return $string;
}

The next trims leading and trailing whitespace.

#!/usr/bin/perl -d:NYTProf

use warnings;
use strict;
use String::Util 'trim';
use Benchmark qw(cmpthese timethese);

cmpthese(
    -30,
    {
        'double_star' => q|double_star(' Mary had a little lamb. ');|,
        'double_plus' => q|double_plus(' Mary had a little lamb. ');|,
        'double_plus2' => q|double_plus(' Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb. ');|,
        'replace' => q|replace( ' Mary had a little lamb. ');|,
        'for_star' => q|for_star( ' Mary had a little lamb. ');|,
        'for_plus' => q|for_plus( ' Mary had a little lamb. ');|,
        'regex_or' => q|regex_or( ' Mary had a little lamb. ');|,
        'one_liner' => q|one_liner( ' Mary had a little lamb. ');|,
        'trim' => q|trim( ' Mary had a little lamb. ');|,
    }
);

print "'trim' => '",trim(' Mary had a little lamb. '),"'\n";
print "'double_star' => '",double_star(' Mary had a little lamb. '),"'\n";
print "'double_plus' => '",double_plus(' Mary had a little lamb. '),"'\n";
print "'double_plus2' => '",double_plus(' Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb. '),"'\n";
print "'replace' => '",replace( ' Mary had a little lamb. '),"'\n";
print "'for_star' => '",for_star( ' Mary had a little lamb. '),"'\n";
print "'for_plus' => '",for_plus( ' Mary had a little lamb. '),"'\n";
print "'regex_or' => '",regex_or( ' Mary had a little lamb. '),"'\n";
print "'one_liner' => '",one_liner( ' Mary had a little lamb. '),"'\n";

exit;

sub one_liner {
  my $string = shift;
#  $string  =~ s/^\ *([A-Z,a-z,0-9]*)\ *$/$1/g;
  $string =~ s/^\s+|\s+$//g ;
  return $string;
}

sub double_star {
  my $string = shift;
  $string =~ s/^\s*//;
  $string =~ s/\s*$//;
  return $string;
}

sub double_plus {
    my $string = shift;
    $string =~ s/^\s+//; #remove leading spaces
    $string =~ s/\s+$//; #remove trailing spaces
    return $string;
}

sub replace {
    my $string = shift;
    $string =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;
    return $string;
}
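sub for_star {
    # NOTE: despite the name, this version uses \s+ (for_plus uses \s*).
    my $string = shift;
    for ($string) { s/^\s+//; s/\s+$//; }
    return $string;
}
sub for_plus {
    # NOTE: despite the name, this version uses \s* (for_star uses \s+).
    my $string = shift;
    for ($string) { s/^\s*//; s/\s*$//; }
    return $string;
}
sub regex_or {
    my $string = shift;
    # NOTE: '||' puts an empty alternative between the two branches, so the
    # pattern can also match the empty string at every position.
    $string =~ s/(?:^ +)||(?: +$)//g;
    return $string;
}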

And here is what I get when I execute these programs.

ted@linux-jp04:~/Work/Projects/misc.tests> ./compress.multiple.spaces.to.single.space.pl
               Rate compress_3 compress_1 compress_2 split_join     squash
compress_3 135174/s         --        -2%        -6%       -34%       -45%
compress_1 137798/s         2%         --        -4%       -33%       -44%
compress_2 143178/s         6%         4%         --       -30%       -42%
split_join 205421/s        52%        49%        43%         --       -17%
squash     247547/s        83%        80%        73%        21%         --
'compress_1' => ' Mary had a little lamb. '
'compress_2' => ' Mary had a little lamb. '
'compress_3' => ' Mary had a little lamb. '
'squash' => ' Mary had a little lamb. '
'split_join' => 'Mary had a little lamb.'
ted@linux-jp04:~/Work/Projects/misc.tests> ./trim.ws.pl
                 Rate double_plus2 regex_or trim for_plus for_star double_star one_liner double_plus replace
double_plus2  69971/s           --      -5% -21%     -28%     -36%        -37%      -43%        -46%    -46%
regex_or      73562/s           5%       -- -17%     -24%     -33%        -34%      -40%        -43%    -44%
trim          88942/s          27%      21%   --      -8%     -19%        -20%      -27%        -32%    -32%
for_plus      96591/s          38%      31%   9%       --     -12%        -13%      -21%        -26%    -26%
for_star     109941/s          57%      49%  24%      14%       --         -1%      -10%        -16%    -16%
double_star  111060/s          59%      51%  25%      15%       1%          --       -9%        -15%    -15%
one_liner    122651/s          75%      67%  38%      27%      12%         10%        --         -6%     -6%
double_plus  130149/s          86%      77%  46%      35%      18%         17%        6%          --     -0%
replace      130236/s          86%      77%  46%      35%      18%         17%        6%          0%      --
'trim' => 'Mary had a little lamb.'
'double_star' => 'Mary had a little lamb.'
'double_plus' => 'Mary had a little lamb.'
'double_plus2' => 'Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.  Mary had a little lamb.'
'replace' => 'Mary had a little lamb.'
'for_star' => 'Mary had a little lamb.'
'for_plus' => 'Mary had a little lamb.'
'regex_or' => 'Mary had a little lamb.'
'one_liner' => 'Mary had a little lamb.'
ted@linux-jp04:~/Work/Projects/misc.tests>

First, I would like to understand the differences in performance among these regular expressions. I realize that the specific numbers depend on the hardware being used and its load, and that applying the functions to a much longer string will change them, but I am interested in the ranking. Related to this: do the functions scale differently, or will I get the same ranking regardless of the length of the string?

Second, I am curious as to why the split/join approach is so much faster than the fastest regular expression.

Third, I can see that if I want to both trim leading and trailing white space AND replace each sequence of white space characters with a single space, I can use the split/join algorithm, but what about combining the regular expressions? I have included ONLY those regular expressions and algorithms that I found on the web, plus one or two I came up with, all tested to work as advertised. Are there other regular expressions and/or functions that would serve one or the other or both functional requirements and be faster still?
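To make the question about combining the regular expressions concrete, here is a minimal sketch of what a combined trim-and-compress function might look like. The names trim_squash and trim_compress are hypothetical, chosen only for illustration, and none of this has been benchmarked here. The pieces themselves are taken from the programs above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical name: trim both ends, then squeeze runs of spaces with tr///.
# Like squash() above, the squeeze step only handles the space character.
sub trim_squash {
    my $string = shift;
    $string =~ s/^\s+//;    # remove leading whitespace
    $string =~ s/\s+$//;    # remove trailing whitespace
    $string =~ tr/ //s;     # squeeze each run of spaces to one space
    return $string;
}

# Hypothetical name: the same idea using substitutions only.
# Here every run of any whitespace becomes a single space.
sub trim_compress {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    $string =~ s/\s+/ /g;
    return $string;
}

# For comparison: split ' ' skips leading whitespace, splits on runs of
# whitespace, and drops trailing empty fields.
sub split_join {
    my $string = shift;
    return join ' ', split ' ', $string;
}

for my $f (\&trim_squash, \&trim_compress, \&split_join) {
    print "'", $f->(' Mary      had a   little     lamb. '), "'\n";
}
```

On the test string all three should print 'Mary had a little lamb.'. They are not interchangeable on other input, though: trim_squash leaves embedded tabs and newlines alone, while the other two turn them into single spaces.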

Thanks

ted

Why do I get this ridiculous splitting of my lines of code, so that the code begins at the far left of my screen and stops only a quarter of the way across my screen, and is there a way to stop that?

In reply to Question about regex performance by ted.byers
