I benchmarked your code.
Here is my implementation:

use strict;
use warnings;

# substitute whole word only
my %w1 = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
  made make
};
# substitute on prefix
my %w2 = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %w3 = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;  # see discussion
#my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;   # as originally posted

#print "$re3\n";  #for debugging

my $out='out-perl.dat';
open my $OUT, '>', $out or die "unable to open $out $!";

my $start = time();
my $finish;

open my $IN, '<', "nightfall.txt" or die " $!";  #75 MB file

while (<$IN>)
{
    tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d; # no punct no digits
                                                 # other formulations possible
    s/w(as|ere)/be/gi;
    s{$re1}{ $w1{lc $1} }g;  #this ~2-3 sec
    s{$re2}{ $w2{lc $1} }g;  #this ~3 sec
    s{$re3}{ $w3{lc $1} }g;  #this ~6 (best) - 14 sec
    print $OUT $_;
}

$finish = time();

my $total_seconds = $finish-$start;
my $minutes = int ($total_seconds/60);
my $seconds = $total_seconds - ($minutes*60);

print "minutes: $minutes  seconds: $seconds\n";

__END__
Time to completion with \b added to the beginning of $re3
minutes: 0  seconds: 12

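As expected, $re1 is the fastest. $re2 has only a third as many terms (5 versus 15) but takes a bit longer than $re1. $re3 as you posted took a LOT longer: about 14 seconds.
$re3 is the one where the target can sit in the middle of other word characters, and that is "expensive". I added a \b to the front of $re3, which I don't think changes the meaning of what it does, but it cuts about 8 seconds off the execution time! Presumably that is because, without the \b, the engine retries the match from every character position inside a word that contains no target, whereas with the \b it only tries once per word.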
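Here is a small sanity check of that "doesn't change the meaning" claim. It is only a sketch: the three-entry table and the sample sentence are made up for illustration, and agreeing on one string is evidence, not proof.

```perl
use strict;
use warnings;

# Hypothetical three-entry subset of %w3, just for illustration.
my %w3  = qw{ work work creat create paid pay };
my $alt = join '|', reverse sort keys %w3;

my $re_with_b    = qr{\b\w*?($alt)\w*}i;  # with the leading \b
my $re_without_b = qr{\w*?($alt)\w*}i;    # as originally posted

# Made-up sample text with targets at the start, middle and end of words.
my $text = "She reworked the Creation and was overpaid repeatedly";

(my $with_b    = $text) =~ s{$re_with_b}{ $w3{lc $1} }g;
(my $without_b = $text) =~ s{$re_without_b}{ $w3{lc $1} }g;

print "with \\b:    $with_b\n";
print "without \\b: $without_b\n";
print $with_b eq $without_b ? "same result\n" : "DIFFERENT result\n";
```

Both versions consume the whole word once a target is found, so the leftmost match always starts at the beginning of a word either way. The \b just lets the engine give up sooner on words that contain no target.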

I did the substitutions on a per line basis. In other testing, I found that to be faster than running "one shot" on the input as a single string. I suspect that is because less stuff needs to be moved around when doing a substitution into the much smaller line string.
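For anyone who wants to try the per-line versus "one shot" comparison themselves, something along these lines should do. This is a rough sketch on synthetic data (the repeated sentence and the three-word table are my assumptions, not the real 75 MB file), so the numbers may well come out differently from what I saw on nightfall.txt.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption, not the real nightfall.txt).
my @lines = ("he kept going to the cities while the men tried\n") x 20_000;
my $whole = join '', @lines;

# Small whole-word table in the style of %w1.
my %w  = qw{ going go men man tried try };
my $re = qr{\b(@{[ join '|', reverse sort keys %w ]})\b}i;

# Substitute into each short line separately, then join the results.
sub per_line {
    my @out = @lines;                 # copy so the input stays untouched
    s{$re}{ $w{lc $1} }g for @out;
    return join '', @out;
}

# Substitute into the whole text as one big string.
sub one_shot {
    my $out = $whole;
    $out =~ s{$re}{ $w{lc $1} }g;
    return $out;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, { per_line => \&per_line, one_shot => \&one_shot });
```

Note that per_line here also pays for copying the array and joining it again, which the file-reading loop above does not, so treat the output as a starting point for your own measurements rather than a verdict.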

With a 12 second run time, this is getting into the range of the sed solution. I am not at all confident that the 5 second number can be equaled, much less bested. However, this is a lot closer to the goal.

In reply to Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list by Marshall
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
