Thank you all for your replies, here are the scripts, data and other pertinent information as requested. My Perl version is 5.36.0 on Linux and the sample text file I used in the following "benchmarks" was created with:
wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm html2text-cpp Nightfall.htm >nightfall.txt for i in {1..1000}; do cat nightfall.txt >>in.txt; done
Now, in.txt is 77MB large. Bash/sed script:
#!/bin/bash cat *.txt | \ tr -d '[:punct:]' | \ sed 's/[0-9]//g' | \ sed 's/w\(as\|ere\)/be/gi' | \ sed 's/ need.* / need /gi' | \ sed 's/ .*meant.* / mean /gi' | \ sed 's/ .*work.* / work /gi' | \ sed 's/ .*read.* / read /gi' | \ sed 's/ .*allow.* / allow /gi' | \ sed 's/ .*gave.* / give /gi' | \ sed 's/ .*bought.* / buy /gi' | \ sed 's/ .*want.* / want /gi' | \ sed 's/ .*hear.* / hear /gi' | \ sed 's/ .*came.* / come /gi' | \ sed 's/ .*destr.* / destroy /gi' | \ sed 's/ .*paid.* / pay /gi' | \ sed 's/ .*selve.* / self /gi' | \ sed 's/ .*self.* / self /gi' | \ sed 's/ .*cities.* / city /gi' | \ sed 's/ .*fight.* / fight /gi' | \ sed 's/ .*creat.* / create /gi' | \ sed 's/ .*makin.* / make /gi' | \ sed 's/ .*includ.* / include /gi' | \ sed 's/ .*mean.* / mean /gi' | \ sed 's/ talk.* / talk /gi' | \ sed 's/ going / go /gi' | \ sed 's/ getting / get /gi' | \ sed 's/ start.* / start /gi' | \ sed 's/ goes / go /gi' | \ sed 's/ knew / know /gi' | \ sed 's/ trying / try /gi' | \ sed 's/ tried / try /gi' | \ sed 's/ told / tell /gi' | \ sed 's/ coming / come /gi' | \ sed 's/ saying / say /gi' | \ sed 's/ men / man /gi' | \ sed 's/ women / woman /gi' | \ sed 's/ took / take /gi' | \ sed 's/ tak.* / take /gi' | \ sed 's/ lying / lie /gi' | \ sed 's/ dying / die /gi' | \ sed 's/ made /make /gi' | \ sed 's/ used.* / use /gi' | \ sed 's/ using.* / use /gi' \ >|out-sed.dat
This script executes in around 5 seconds:
% time ./re.sh real 0m5,201s user 0m43,394s sys 0m1,302s
First Perl script, slurping input file at once and processing line-by-line:
#!/usr/bin/perl use strict; use warnings; use 5.36.0; my $BLOCKSIZE = 1024 * 1024 * 128; my $data; my $IN; my $out='out-perl.dat'; truncate $out, 0; open my $OUT, '>>', $out; my @text = glob("*.txt"); foreach my $t (@text) { open($IN, '<', $t) or next; read($IN, $data, $BLOCKSIZE); my @line = split /\n/, $data; foreach (@line) { s/[[:punct:]]/ /g; tr/[0-9]//d; s/w(as|ere)/be/gi; s/\sneed.*/ need /gi; s/\s.*meant.*/ mean /gi; s/\s.*work.*/ work /gi; s/\s.*read.*/ read /gi; s/\s.*allow.*/ allow /gi; s/\s.*gave.*/ give /gi; s/\s.*bought.*/ buy /gi; s/\s.*want.*/ want /gi; s/\s.*hear.*/ hear /gi; s/\s.*came.*/ come /gi; s/\s.*destr.*/ destroy /gi; s/\s.*paid.*/ pay /gi; s/\s.*selve.*/ self /gi; s/\s.*self.*/ self /gi; s/\s.*cities.*/ city /gi; s/\s.*fight.*/ fight /gi; s/\s.*creat.*/ create /gi; s/\s.*makin.*/ make /gi; s/\s.*includ.*/ include /gi; s/\s.*mean.*/ mean /gi; s/\stalk.*/ talk /gi; s/\sgoing / go /gi; s/\sgetting / get /gi; s/\sstart.*/ start /gi; s/\sgoes / go /gi; s/\sknew / know /gi; s/\strying / try /gi; s/\stried / try /gi; s/\stold / tell /gi; s/\scoming / come /gi; s/\ssaying / say /gi; s/\smen / man /gi; s/\swomen / woman /gi; s/\stook / take /gi; s/\stak.*/ take /gi; s/\slying / lie /gi; s/\sdying / die /gi; s/\smade /make /gi; s/\sused.*/ use /gi; s/\susing.*/ use /gi; close $IN; print $OUT "$_\n"; } }
Please, ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. Time to slurp the input and split it in lines < 1 second.
% time ./re1.pl real 4m1,655s user 4m29,242s sys 0m0,380s
If I split by /\s/ instead, it consumes 5 seconds at it, but the substitutions take 1 minute, i.e. 12 times slower than bash/sed:
% time ./re2.pl real 1m5,096s user 1m11,889s sys 0m0,524s
Final test, I created 1000 copies of nightfall.txt (77KB) with % for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done. All scripts took roughly the same amount of time to complete. So, it would seem that my initial estimation of "60-70% slower Perl" was very optimistic, as the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.

For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.

An idea that just occured to me is that when doing matches in word-splits, most regexes can apparently terminate the loop (and next;) as no further matches are expected below. Still, I'd like to exhaust all possibilities before admitting defeat.


In reply to Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.