comment on

Thank you all for your replies, here are the scripts, data and other pertinent information as requested. My Perl version is 5.36.0 on Linux and the sample text file I used in the following "benchmarks" was created with:

wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
for i in {1..1000}; do cat nightfall.txt >>in.txt; done
[download]

Now, in.txt is 77MB large. Bash/sed script:

#!/bin/bash

cat *.txt | \
    tr -d '[:punct:]' | \
    sed 's/[0-9]//g' | \
    sed 's/w\(as\|ere\)/be/gi' | \
    sed 's/ need.* / need /gi' | \
    sed 's/ .*meant.* / mean /gi' | \
    sed 's/ .*work.* / work /gi' | \
    sed 's/ .*read.* / read /gi' | \
    sed 's/ .*allow.* / allow /gi' | \
    sed 's/ .*gave.* / give /gi' | \
    sed 's/ .*bought.* / buy /gi' | \
    sed 's/ .*want.* / want /gi' | \
    sed 's/ .*hear.* / hear /gi' | \
    sed 's/ .*came.* / come /gi' | \
    sed 's/ .*destr.* / destroy /gi' | \
    sed 's/ .*paid.* / pay /gi' | \
    sed 's/ .*selve.* / self /gi' | \
    sed 's/ .*self.* / self /gi' | \
    sed 's/ .*cities.* / city /gi' | \
    sed 's/ .*fight.* / fight /gi' | \
    sed 's/ .*creat.* / create /gi' | \
    sed 's/ .*makin.* / make /gi' | \
    sed 's/ .*includ.* / include /gi' | \
    sed 's/ .*mean.* / mean /gi' | \
    sed 's/ talk.* / talk /gi' | \
    sed 's/ going / go /gi' | \
    sed 's/ getting / get /gi' | \
    sed 's/ start.* / start /gi' | \
    sed 's/ goes / go /gi' | \
    sed 's/ knew / know /gi' | \
    sed 's/ trying / try /gi' | \
    sed 's/ tried / try /gi' | \
    sed 's/ told / tell /gi' | \
    sed 's/ coming / come /gi' | \
    sed 's/ saying / say /gi' | \
    sed 's/ men / man /gi' | \
    sed 's/ women / woman /gi' | \
    sed 's/ took / take /gi' | \
    sed 's/ tak.* / take /gi' | \
    sed 's/ lying / lie /gi' | \
    sed 's/ dying / die /gi' | \
    sed 's/ made /make /gi' | \
    sed 's/ used.* / use /gi' | \
    sed 's/ using.* / use /gi' \
>|out-sed.dat
[download]

This script executes in around 5 seconds:

% time ./re.sh 
real    0m5,201s
user    0m43,394s
sys    0m1,302s
[download]

First Perl script, slurping input file at once and processing line-by-line:

#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;

my $BLOCKSIZE = 1024 * 1024 * 128;
my $data;
my $IN;
my $out='out-perl.dat';
truncate $out, 0;
open my $OUT, '>>', $out;

my @text = glob("*.txt");
foreach my $t (@text) {
  open($IN, '<', $t) or next;
  read($IN, $data, $BLOCKSIZE);
  my @line = split /\n/, $data;
  foreach (@line) {
    s/[[:punct:]]/ /g;
    tr/[0-9]//d;
    s/w(as|ere)/be/gi;
    s/\sneed.*/ need /gi;
    s/\s.*meant.*/ mean /gi;
    s/\s.*work.*/ work /gi;
    s/\s.*read.*/ read /gi;
    s/\s.*allow.*/ allow /gi;
    s/\s.*gave.*/ give /gi;
    s/\s.*bought.*/ buy /gi;
    s/\s.*want.*/ want /gi;
    s/\s.*hear.*/ hear /gi;
    s/\s.*came.*/ come /gi;
    s/\s.*destr.*/ destroy /gi;
    s/\s.*paid.*/ pay /gi;
    s/\s.*selve.*/ self /gi;
    s/\s.*self.*/ self /gi;
    s/\s.*cities.*/ city /gi;
    s/\s.*fight.*/ fight /gi;
    s/\s.*creat.*/ create /gi;
    s/\s.*makin.*/ make /gi;
    s/\s.*includ.*/ include /gi;
    s/\s.*mean.*/ mean /gi;
    s/\stalk.*/ talk /gi;
    s/\sgoing / go /gi;
    s/\sgetting / get /gi;
    s/\sstart.*/ start /gi;
    s/\sgoes / go /gi;
    s/\sknew / know /gi;
    s/\strying / try /gi;
    s/\stried / try /gi;
    s/\stold / tell /gi;
    s/\scoming / come /gi;
    s/\ssaying / say /gi;
    s/\smen / man /gi;
    s/\swomen / woman /gi;
    s/\stook / take /gi;
    s/\stak.*/ take /gi;
    s/\slying / lie /gi;
    s/\sdying / die /gi;
    s/\smade /make /gi;
    s/\sused.*/ use /gi;
    s/\susing.*/ use /gi;

    close $IN;
    print $OUT "$_\n";
  }
}
[download]

Please, ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. Time to slurp the input and split it in lines < 1 second.

% time ./re1.pl 
real    4m1,655s
user    4m29,242s
sys    0m0,380s
[download]

If I split by /\s/ instead, it consumes 5 seconds at it, but the substitutions take 1 minute, i.e. 12 times slower than bash/sed:

% time ./re2.pl 
real    1m5,096s
user    1m11,889s
sys    0m0,524s
[download]

Final test, I created 1000 copies of nightfall.txt (77KB) with % for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done. All scripts took roughly the same amount of time to complete. So, it would seem that my initial estimation of "60-70% slower Perl" was very optimistic, as the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.

For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.

An idea that just occured to me is that when doing matches in word-splits, most regexes can apparently terminate the loop (and next;) as no further matches are expected below. Still, I'd like to exhaust all possibilities before admitting defeat.

In reply to Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.