
Thanks for the additional detail.

Is the intention that each of these substitutions replaces one word with another word? If so, the use of .* in many of the patterns means that is not what actually happens. For example, it looks like the intention is to replace the text "one two coworker three four" with "one two work three four", but it will actually be replaced with "one work ", because the pattern \s.*work.* matches from the first space to the end of the line.
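
To make the greedy match concrete, here is a small sketch. The pattern \s.*work.* is the one discussed above; the replacement " work " and both sub names are my assumptions, chosen only to reproduce the described result.

```perl
use strict;
use warnings;

# Assumed shape of one of the original substitutions: the pattern is from
# the discussion above, the replacement " work " is a guess.
sub greedy_sub {
    my ($text) = @_;
    $text =~ s/\s.*work.*/ work /;
    return $text;
}

# Word-scoped alternative: \w* cannot cross a space, so only the word
# containing "work" is replaced.
sub word_scoped_sub {
    my ($text) = @_;
    $text =~ s/\w*work\w*/work/g;
    return $text;
}

print "[", greedy_sub("one two coworker three four"), "]\n";
print "[", word_scoped_sub("one two coworker three four"), "]\n";
```

The first call swallows everything from the first space onwards; the second touches only "coworker".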

Assuming that the intention is to replace one word with another word, that could look something like this:

# substitute whole word only
my %w1 = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
  made make
};
# substitute on prefix
my %w2 = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %w3 = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;

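# then in the loop
  s/[[:punct:]]/ /g;    # each punctuation character becomes a space
  tr/0-9//d;            # delete digits (inside tr, [ and ] are literal characters)
  s/w(as|ere)/be/gi;    # substring match, not whole word
  s{$re1}{ $w1{lc $1} }g;
  s{$re2}{ $w2{lc $1} }g;
  s{$re3}{ $w3{lc $1} }g;
  print $OUT "$_\n";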
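
For anyone who wants to try the three kinds of match in isolation, here is a reduced, self-contained sketch. The tables hold only a few entries copied from the hashes above; the names (%whole, %prefix, %inside, alternation, normalise) are mine, and unlike the code above it uses /e and does not pad the replacement with spaces.

```perl
use strict;
use warnings;

# Reduced tables for illustration; entries are copied from the hashes above.
my %whole  = qw{ going go  took take };
my %prefix = qw{ tak take  using use };
my %inside = qw{ work work  destr destroy };

# Join the keys into one alternation, using the same ordering as above.
sub alternation {
    my ($table) = @_;
    return join '|', reverse sort keys %$table;
}

my $whole_alt  = alternation(\%whole);
my $prefix_alt = alternation(\%prefix);
my $inside_alt = alternation(\%inside);

my $whole_re  = qr{\b($whole_alt)\b}i;       # key must be the entire word
my $prefix_re = qr{\b($prefix_alt)\w*}i;     # key must start the word
my $inside_re = qr{\w*?($inside_alt)\w*}i;   # key may sit anywhere in the word

sub normalise {
    my ($line) = @_;
    # /e evaluates the replacement as code; no padding spaces are added.
    $line =~ s{$whole_re}{$whole{lc $1}}ge;
    $line =~ s{$prefix_re}{$prefix{lc $1}}ge;
    $line =~ s{$inside_re}{$inside{lc $1}}ge;
    return $line;
}

print normalise("She took the coworkers going destructive"), "\n";
```

Because each pattern is built from \b and \w rather than .*, a replacement never reaches past the word it matched in.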

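If the input is always ASCII, the initial cleanup for punctuation and digits could potentially be something like s/[^a-z ]/ /gi or, nearly equivalently, tr/a-zA-Z / /cs (the /s also squeezes each run of replaced characters into a single space), unless you specifically wanted to replace "ABC123D" with the single word "ABCD" rather than the two words "ABC D". However, if the input may be Unicode, you would instead need something like s/[^\w ]/ /g, which has no tr equivalent; note that \w also matches digits and underscore, so s/[^\pL ]/ /g may be closer if digits should be dropped too.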
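
A quick way to compare the cleanup variants side by side; this is only a sketch, and the sub names are made up for the illustration.

```perl
use strict;
use warnings;

# Per-character replacement: every non-letter, non-space becomes one space.
sub clean_subst {
    my ($text) = @_;
    $text =~ s/[^a-z ]/ /gi;
    return $text;
}

# tr with /c (complement) and /s (squeeze): a run of replaced characters
# collapses to a single space.
sub clean_tr {
    my ($text) = @_;
    $text =~ tr/a-zA-Z / /cs;
    return $text;
}

# Unicode-aware variant; \w keeps letters, digits and underscore.
sub clean_word_chars {
    my ($text) = @_;
    $text =~ s/[^\w ]/ /g;
    return $text;
}

# Unicode letters only, assuming digits and underscore should go as well.
sub clean_letters {
    my ($text) = @_;
    $text =~ s/[^\pL ]/ /g;
    return $text;
}

printf "[%s]\n", clean_subst("ABC123D");
printf "[%s]\n", clean_tr("ABC123D");
printf "[%s]\n", clean_word_chars("ABC123D");
printf "[%s]\n", clean_letters("ABC123D");
```

The first and last calls leave three spaces where the digits were, the tr version leaves one, and the \w version leaves the digits in place.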

The standalone substitution for w(as|ere) should probably be two additional entries in one of the existing hashes: currently this substitution is unique in replacing a substring with another substring, so for example it will change "showered" into "shobed".
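
Here is a sketch of the difference, assuming the two entries go into a whole-word table. The %be hash and the sub names are hypothetical; in the real code the entries would simply be added to one of the existing hashes.

```perl
use strict;
use warnings;

# The standalone substitution as posted: it also matches inside longer words.
sub be_substring {
    my ($text) = @_;
    $text =~ s/w(as|ere)/be/gi;
    return $text;
}

# Hypothetical whole-word table holding the two extra entries.
my %be     = qw{ was be  were be };
my $be_alt = join '|', reverse sort keys %be;
my $be_re  = qr{\b($be_alt)\b}i;

sub be_whole_word {
    my ($text) = @_;
    $text =~ s{$be_re}{$be{lc $1}}ge;
    return $text;
}

print be_substring("they were showered"), "\n";
print be_whole_word("they were showered"), "\n";
```

The first line printed contains "shobed"; the second leaves "showered" alone.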

It will also help a bit to move the close $IN out of the loop (though it doesn't actually seem to cause a noticeable slowdown).
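
As a rough sketch of that shape, with in-memory filehandles standing in for the real files: process_lines and the tr cleanup are placeholders of mine, not your actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real script: reads from and writes to
# in-memory filehandles so the example is self-contained.
sub process_lines {
    my ($input) = @_;
    my $output = '';
    open my $IN,  '<', \$input  or die "cannot open input: $!";
    open my $OUT, '>', \$output or die "cannot open output: $!";
    while (<$IN>) {
        chomp;
        tr/a-zA-Z / /cs;        # placeholder for the real substitutions
        print {$OUT} "$_\n";
    }
    close $IN;                  # closed once, after the loop has finished
    close $OUT;
    return $output;
}

print process_lines("abc123d\nx-y\n");
```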

The above code runs for me about five times faster than your example Perl code, though as described it behaves quite differently.

In reply to Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list by hv
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
