in reply to Need to speed up many regex substitutions and somehow make them a here-doc list

G'day xnous,

Welcome to the Monastery.

Please follow the advice already given by ++hippo. In addition to representative input data, show the expected output data. Also, tell us what Perl version you're using: features in more recent Perls may improve performance.

Here's some very general advice:

When you provide the requested code and data, more specific advice will likely be possible.

— Ken

Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list
by xnous (Sexton) on Oct 02, 2022 at 13:00 UTC
    Thank you all for your replies, here are the scripts, data and other pertinent information as requested. My Perl version is 5.36.0 on Linux and the sample text file I used in the following "benchmarks" was created with:
    wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
    html2text-cpp Nightfall.htm >nightfall.txt
    for i in {1..1000}; do cat nightfall.txt >>in.txt; done
    Now, in.txt is 77MB large. Bash/sed script:
    #!/bin/bash
    cat *.txt | \
    tr -d '[:punct:]' | \
    sed 's/[0-9]//g' | \
    sed 's/w\(as\|ere\)/be/gi' | \
    sed 's/ need.* / need /gi' | \
    sed 's/ .*meant.* / mean /gi' | \
    sed 's/ .*work.* / work /gi' | \
    sed 's/ .*read.* / read /gi' | \
    sed 's/ .*allow.* / allow /gi' | \
    sed 's/ .*gave.* / give /gi' | \
    sed 's/ .*bought.* / buy /gi' | \
    sed 's/ .*want.* / want /gi' | \
    sed 's/ .*hear.* / hear /gi' | \
    sed 's/ .*came.* / come /gi' | \
    sed 's/ .*destr.* / destroy /gi' | \
    sed 's/ .*paid.* / pay /gi' | \
    sed 's/ .*selve.* / self /gi' | \
    sed 's/ .*self.* / self /gi' | \
    sed 's/ .*cities.* / city /gi' | \
    sed 's/ .*fight.* / fight /gi' | \
    sed 's/ .*creat.* / create /gi' | \
    sed 's/ .*makin.* / make /gi' | \
    sed 's/ .*includ.* / include /gi' | \
    sed 's/ .*mean.* / mean /gi' | \
    sed 's/ talk.* / talk /gi' | \
    sed 's/ going / go /gi' | \
    sed 's/ getting / get /gi' | \
    sed 's/ start.* / start /gi' | \
    sed 's/ goes / go /gi' | \
    sed 's/ knew / know /gi' | \
    sed 's/ trying / try /gi' | \
    sed 's/ tried / try /gi' | \
    sed 's/ told / tell /gi' | \
    sed 's/ coming / come /gi' | \
    sed 's/ saying / say /gi' | \
    sed 's/ men / man /gi' | \
    sed 's/ women / woman /gi' | \
    sed 's/ took / take /gi' | \
    sed 's/ tak.* / take /gi' | \
    sed 's/ lying / lie /gi' | \
    sed 's/ dying / die /gi' | \
    sed 's/ made /make /gi' | \
    sed 's/ used.* / use /gi' | \
    sed 's/ using.* / use /gi' \
    >|out-sed.dat
    This script executes in around 5 seconds:
    % time ./re.sh
    real    0m5,201s
    user    0m43,394s
    sys     0m1,302s
    First Perl script, slurping input file at once and processing line-by-line:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.36.0;

    my $BLOCKSIZE = 1024 * 1024 * 128;
    my $data;
    my $IN;
    my $out = 'out-perl.dat';
    truncate $out, 0;
    open my $OUT, '>>', $out;
    my @text = glob("*.txt");
    foreach my $t (@text) {
        open($IN, '<', $t) or next;
        read($IN, $data, $BLOCKSIZE);
        my @line = split /\n/, $data;
        foreach (@line) {
            s/[[:punct:]]/ /g;
            tr/[0-9]//d;
            s/w(as|ere)/be/gi;
            s/\sneed.*/ need /gi;
            s/\s.*meant.*/ mean /gi;
            s/\s.*work.*/ work /gi;
            s/\s.*read.*/ read /gi;
            s/\s.*allow.*/ allow /gi;
            s/\s.*gave.*/ give /gi;
            s/\s.*bought.*/ buy /gi;
            s/\s.*want.*/ want /gi;
            s/\s.*hear.*/ hear /gi;
            s/\s.*came.*/ come /gi;
            s/\s.*destr.*/ destroy /gi;
            s/\s.*paid.*/ pay /gi;
            s/\s.*selve.*/ self /gi;
            s/\s.*self.*/ self /gi;
            s/\s.*cities.*/ city /gi;
            s/\s.*fight.*/ fight /gi;
            s/\s.*creat.*/ create /gi;
            s/\s.*makin.*/ make /gi;
            s/\s.*includ.*/ include /gi;
            s/\s.*mean.*/ mean /gi;
            s/\stalk.*/ talk /gi;
            s/\sgoing / go /gi;
            s/\sgetting / get /gi;
            s/\sstart.*/ start /gi;
            s/\sgoes / go /gi;
            s/\sknew / know /gi;
            s/\strying / try /gi;
            s/\stried / try /gi;
            s/\stold / tell /gi;
            s/\scoming / come /gi;
            s/\ssaying / say /gi;
            s/\smen / man /gi;
            s/\swomen / woman /gi;
            s/\stook / take /gi;
            s/\stak.*/ take /gi;
            s/\slying / lie /gi;
            s/\sdying / die /gi;
            s/\smade /make /gi;
            s/\sused.*/ use /gi;
            s/\susing.*/ use /gi;
            close $IN;
            print $OUT "$_\n";
        }
    }
    Please ignore the technicality of failed matches before/after a newline, as this line-by-line implementation is uselessly slow anyway at over 4 minutes. Slurping the input and splitting it into lines takes under 1 second.
    % time ./re1.pl
    real    4m1,655s
    user    4m29,242s
    sys     0m0,380s
    If I split by /\s/ instead, the split itself takes 5 seconds, but the substitutions take 1 minute, i.e. 12 times slower than bash/sed:
    % time ./re2.pl
    real    1m5,096s
    user    1m11,889s
    sys     0m0,524s
    Final test, I created 1000 copies of nightfall.txt (77KB) with % for i in {1..1000}; do cp nightfall.txt nightfall-$i.txt; done. All scripts took roughly the same amount of time to complete. So, it would seem that my initial estimation of "60-70% slower Perl" was very optimistic, as the full scripts perform other tasks too, where Perl's operators and conditionals obviously blow Bash's out of the water.

    For the record, I do all file read/write operations on tmpfs (ramdisk), so disk I/O isn't an issue. I'll implement AnomalousMonk's solution with hash lookup and report back soonest.

    An idea that just occurred to me: when matching against individual words, most regexes could terminate the loop early (with next;) once a substitution has fired, since no further matches are expected below. Still, I'd like to exhaust all possibilities before admitting defeat.
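That early-exit idea could be sketched like this; the rule table below is a made-up three-entry subset (anchored to whole words), not the real list:

```perl
use strict;
use warnings;

# Toy version of the early-exit idea: once one rule has fired for a
# word, no later rule is expected to match it, so we can bail out of
# the inner rule loop. The rules are illustrative stand-ins.
my @rules = (
    [ qr/^going$/i, 'go'   ],
    [ qr/^knew$/i,  'know' ],
    [ qr/^took$/i,  'take' ],
);

my @words = qw(He knew going took nothing);
WORD: for my $word (@words) {    # $word aliases the array element
    for my $rule (@rules) {
        if ( $word =~ s/$rule->[0]/$rule->[1]/ ) {
            next WORD;    # rule fired: skip the remaining rules
        }
    }
}
print "@words\n";    # "He know go take nothing"
```

The saving grows with the number of rules, since most words match an early rule or none at all.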

      Thanks for providing all this; it gives us a lot more to work on. I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

      $ echo Washington werewolves are wasteful | perl -pe 's/w(as|ere)/be/gi;'
      behington bewolves are beteful
      $ echo No work was carried out on Thursday as that as a day of rest | perl -pe 's/\s.*work.*/ work /gi;'
      No work
      $ echo Did you swallow all that bacon | perl -pe 's/\s.*allow.*/ allow /gi;'
      Did allow
      $

      As there's no point optimising code which doesn't do what you want, it would be good to clear this sort of thing up first.


      🦛

        hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

        Yes, you're right, the actual match/subs are non-greedy. I just wanted to provide a simpler, beautified version of my ugly script, but the code structure is exactly the same.

        Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

        I don't need (yet) a full stemming solution, which might not be the ideal tool as I'd have to override numerous substitutions.

        hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

        AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

        I tested your solution last but unfortunately it took 2'23" to complete. I'll be doing more tests in the following days and report back with any progress. Thank you all for your wisdom.

      Thanks for the additional detail.

      Is it the intention that each of these substitutions replaces one word with another word? Because the use of .* in many of the patterns means that's not what is actually happening. For example it looks like the intention is to replace the text "one two coworker three four" with the text "one two work three four", but it will actually be replaced with "one work " because the pattern \s.*work.* will match from the first space to the end of the line.

      Assuming that the intention is to replace one word with another word, that could look something like this:

      # substitute whole word only
      my %w1 = qw{
          going go getting get goes go knew know trying try tried try
          told tell coming come saying say men man women woman took take
          lying lie dying die made make
      };
      # substitute on prefix
      my %w2 = qw{ need need talk talk tak take used use using use };
      # substitute on substring
      my %w3 = qw{
          mean mean work work read read allow allow gave give bought buy
          want want hear hear came come destr destroy paid pay selve self
          cities city fight fight creat create makin make includ include
      };
      my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
      my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
      my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;

      # then in the loop
      s/[[:punct:]]/ /g;
      tr/[0-9]//d;
      s/w(as|ere)/be/gi;
      s{$re1}{ $w1{lc $1} }g;
      s{$re2}{ $w2{lc $1} }g;
      s{$re3}{ $w3{lc $1} }g;
      print $OUT "$_\n";
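For what it's worth, here is a small usage sketch of the prefix table ($re2) from that approach; the sample sentence is invented, and note that the spaces inside the replacement string mirror the sed versions (" need " and friends), so whitespace can double up and is squashed afterwards:

```perl
use strict;
use warnings;

# Demo of the prefix-table substitution: each word starting with a key
# of %w2 is replaced by the corresponding value. Sample text invented.
my %w2  = qw{ need need talk talk tak take used use using use };
my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;

my $s = "They talked while taking what they needed";
$s =~ s{$re2}{ $w2{lc $1} }g;    # replacement adds surrounding spaces
$s = join ' ', split ' ', $s;    # squash the doubled whitespace
print "$s\n";                    # "They talk while take what they need"
```

The reverse sort puts longer shared-prefix alternatives (e.g. "talk") ahead of shorter ones (e.g. "tak") in the alternation, so the intended key captures first.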

      If the input is always ASCII, the initial cleanup for punctuation and digits could potentially be something like s/[^a-z ]/ /gi or equivalently tr/a-zA-Z / /cs, unless you specifically wanted to replace "ABC123D" with the single word "ABCD" rather than the two words "ABC D". However if it may be Unicode, you would instead need something like s/[^\w ]/ /g, with no tr equivalent.
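A quick check of the difference between those two cleanup idioms, on an invented sample string: the s/// form leaves one space per stripped character, while tr///cs squashes each run of stripped characters into a single space.

```perl
use strict;
use warnings;

# Compare the two ASCII cleanup idioms on an invented sample.
my $s = "ABC123D ef-gh";

( my $subst = $s ) =~ s/[^a-z ]/ /gi;     # one space per stripped char
( my $trans = $s ) =~ tr/a-zA-Z / /cs;    # /c complements, /s squashes runs

print "$subst\n";    # "ABC   D ef gh"
print "$trans\n";    # "ABC D ef gh"
```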

      The standalone substitution for w(as|ere) should probably become two additional entries in one of the existing hashes: currently this substitution is unique in replacing a substring with another substring, so for example it will change "showered" into "shobed".

      It will also help a bit to move the close $IN out of the loop (though it doesn't actually seem to cause a noticeable slowdown).

      The above code runs for me about five times faster than your example perl code, though as described it behaves quite differently.

        I benchmarked your code.
        Here is my implementation:
        use strict;
        use warnings;

        # substitute whole word only
        my %w1 = qw{
            going go getting get goes go knew know trying try tried try
            told tell coming come saying say men man women woman took take
            lying lie dying die made make
        };
        # substitute on prefix
        my %w2 = qw{ need need talk talk tak take used use using use };
        # substitute on substring
        my %w3 = qw{
            mean mean work work read read allow allow gave give bought buy
            want want hear hear came come destr destroy paid pay selve self
            cities city fight fight creat create makin make includ include
        };
        my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;
        my $re2 = qr{\b(@{[ join '|', reverse sort keys %w2 ]})\w*}i;
        my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i; # see discussion
        #my $re3 = qr{\w*?(@{[ join '|', reverse sort keys %w3 ]})\w*}i;
        #print "$re3\n"; # for debugging

        my $out = 'out-perl.dat';
        open my $OUT, '>', $out or die "unable to open $out $!";
        my $start = time();
        my $finish;
        open my $IN, '<', "nightfall.txt" or die " $!"; # 75 MB file
        while (<$IN>) {
            tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d; # no punct, no digits
                                                         # other formulations possible
            s/w(as|ere)/be/gi;
            s{$re1}{ $w1{lc $1} }g; # this ~2-3 sec
            s{$re2}{ $w2{lc $1} }g; # this ~3 sec
            s{$re3}{ $w3{lc $1} }g; # this ~6 (best) - 14 sec
            print $OUT "$_";
        }
        $finish = time();
        my $total_seconds = $finish - $start;
        my $minutes = int($total_seconds / 60);
        my $seconds = $total_seconds - ($minutes * 60);
        print "minutes: $minutes seconds: $seconds\n";
        __END__
        Time to completion with \b added to begin of $re3
        minutes: 0 seconds: 12
        As expected, $re1 is the fastest; $re2 has half the terms but takes a bit longer than $re1. $re3 as you posted took a LOT longer - 14 secs.
        $re3 is the one where the target can sit in the middle of other characters, and that is "expensive". I added a \b to $re3, which I don't think changes the meaning of what it does, but it cuts about 8 seconds off the execution time!

        I did the substitutions on a per line basis. In other testing, I found that to be faster than running "one shot" on the input as a single string. I suspect that is because less stuff needs to be moved around when doing a substitution into the much smaller line string.
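That per-line-versus-one-shot observation can be checked with the core Benchmark module; the sample text, iteration count, and single substitution below are stand-ins for the real data and rule set:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare one substitution applied to the whole string versus applied
# line by line. Invented sample data; only relative rates matter.
my $big = ("the men took what the women wanted\n") x 2000;

cmpthese( 50, {
    one_shot => sub {
        ( my $t = $big ) =~ s/\btook\b/take/g;
    },
    per_line => sub {
        my $out = '';
        for my $line ( split /\n/, $big ) {
            $line =~ s/\btook\b/take/g;
            $out .= "$line\n";
        }
    },
} );
```

cmpthese prints a rate table, so the relative cost of the two strategies on your own data is easy to read off.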

        With a 12 second run time, this is getting into the range of the sed solution. I am not at all confident that the 5 second number can be equaled, much less bested. However, this is a lot closer to the goal.

      sed can take several substitution expressions in one invocation instead of piping each substitution's result to the next: sed 's/ need.* / need /gi' | sed 's/ .*meant.* / mean /gi' can become sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi'. This may speed up IO.

      For both Perl and bash/sed: IO can be improved by creating a ramdisk and placing input and output files there if you intend to process them multiple times. Better still, if the files are created by other processes, you can write them straight into the ramdisk, process them, and then move them to more permanent storage. In Linux this is as easy as: mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk1

      If all the files already live on a single physical hard disk, then parallelising their processing (which implies parallelising the IO) will show little improvement or, most likely, degradation. However, you can see some improvement by implementing a pipeline: one process copies files into the ramdisk sequentially while other processes, in parallel, work on whatever files appear there. I assume that memory IO can be parallelised better than hard-disk IO (but I am a lot behind in what modern OSes and CPUs can do, and it could be that MCE can work some magic with IO, so take this advice with some salt).

      Also, in a recent discussion here, the issue came up that a threaded perl interpreter can be 10-30% slower than a non-threaded one. So, if you do not need threads, that's a possible way to speed things up (check your perl interpreter's compilation flags with perl -V and look for useithreads=define).

      This is an interesting problem to optimise because even small optimisations can lead to huge savings over your 1,000's to 1,000,000's of files. So, I would start by benchmarking a few options with, say, 20 files: sed, sed+ramdisk, perl+ramdisk, pipeline, etc. Then you will be more confident about where to place your programming efforts, or whether to invest in learning new skills like MCE.

      bw, bliako

        In addition to the above good tips, there is a talk by Nicholas Clark (from 2005), "When Perl is not quite fast enough", that explores some other things you can do to make your code a bit faster.

        But in general, the best optimizations come from improving the algorithm, and from first finding out exactly where the slow parts are before you start optimizing.
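One low-tech way to find the slow parts is to time each substitution rule separately with the core Time::HiRes module; the sample text and the two rules below are placeholders, not the real rule set:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Time each rule on its own copy of the same sample text, so the
# expensive rules stand out. Text and rules are invented placeholders.
my $text = ("the women were working in the cities\n") x 5000;

my @rules = (
    [ 'cleanup'  => sub { $_[0] =~ s/[^a-zA-Z \n]/ /g } ],
    [ 'was/were' => sub { $_[0] =~ s/w(as|ere)/be/gi  } ],
);

for my $r (@rules) {
    my ( $name, $code ) = @$r;
    my $copy = $text;                  # fresh copy for each rule
    my $t0   = [gettimeofday];
    $code->($copy);                    # $_[0] aliases $copy
    printf "%-10s %.4f s\n", $name, tv_interval($t0);
}
```

For a fuller picture, a profiler such as Devel::NYTProf (from CPAN) will attribute time line by line across the whole script.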

      I ran your code on my Windows machine. It took 1 minute 34 seconds.
      My implementation is shown below.
      I don't think the huge BLOCKSIZE and read() gained you anything, because you immediately read all the data back out of memory only to create a very large array of lines, then read each line again in a loop. A 128MB buffer won't have much effect on disk reading time: the data is typically organized in 4-Kbyte hunks, and on a physical drive there is often a mechanically induced delay after each hunk is read. I have a physical drive, and even with it the total read time for the whole 75 MB file, line by line, is << 1 sec. An SSD will of course be faster, but raw I/O speed doesn't appear to be the limit.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Time::Local;

      my $out = 'out-perl.dat';
      open my $OUT, '>', $out or die "unable to open $out !";
      my $start;
      my $finish;
      foreach my $text_file (<*.txt>) {
          print STDOUT "working on file $text_file\n";
          $start = time();
          open(my $IN, '<', $text_file) or die "invalid file: $text_file !";
          # reading entire file line by line << 1 second overhead
          while (<$IN>) {
              tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
              s/w(as|ere)/be/gi;
              s/\sneed.*/ need /gi;
              s/\s.*meant.*/ mean /gi;
              s/\s.*work.*/ work /gi;
              s/\s.*read.*/ read /gi;
              s/\s.*allow.*/ allow /gi;
              s/\s.*gave.*/ give /gi;
              s/\s.*bought.*/ buy /gi;
              s/\s.*want.*/ want /gi;
              s/\s.*hear.*/ hear /gi;
              s/\s.*came.*/ come /gi;
              s/\s.*destr.*/ destroy /gi;
              s/\s.*paid.*/ pay /gi;
              s/\s.*selve.*/ self /gi;
              s/\s.*self.*/ self /gi;
              s/\s.*cities.*/ city /gi;
              s/\s.*fight.*/ fight /gi;
              s/\s.*creat.*/ create /gi;
              s/\s.*makin.*/ make /gi;
              s/\s.*includ.*/ include /gi;
              s/\s.*mean.*/ mean /gi;
              s/\stalk.*/ talk /gi;
              s/\sgoing / go /gi;
              s/\sgetting / get /gi;
              s/\sstart.*/ start /gi;
              s/\sgoes / go /gi;
              s/\sknew / know /gi;
              s/\strying / try /gi;
              s/\stried / try /gi;
              s/\stold / tell /gi;
              s/\scoming / come /gi;
              s/\ssaying / say /gi;
              s/\smen / man /gi;
              s/\swomen / woman /gi;
              s/\stook / take /gi;
              s/\stak.*/ take /gi;
              s/\slying / lie /gi;
              s/\sdying / die /gi;
              s/\smade /make /gi;
              s/\sused.*/ use /gi;
              s/\susing.*/ use /gi;
              print $OUT "$_";
          }
      }
      $finish = time();
      my $total_seconds = $finish - $start;
      my $minutes = int($total_seconds / 60);
      my $seconds = $total_seconds - ($minutes * 60);
      print "minutes: $minutes seconds: $seconds\n";
      __END__
      working on file nightfall.txt
      minutes: 1 seconds: 34