in reply to Need to speed up many regex substitutions and somehow make them a here-doc list

See haukex's article Building Regex Alternations Dynamically:

Win8 Strawberry 5.8.9.5 (32) Sat 10/01/2022 17:18:27
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);  # for debug

my $text = <<'TEXT';
Regular expressions have the undeserved reputation of being abstract and difficult to understand.
TEXT
print "before ---$text--- \n";

my @regexlist = split /\n/, <<'REGEX';
a A
i I
e E
REGEX

my %replace = map split, @regexlist;
# dd \%replace;  # for debug

my ($rx_search) =
    map qr{ $_ }xms,
    join ' | ',
    map quotemeta,
    reverse sort
    keys %replace
    ;
# dd $rx_search;  # for debug

$text =~ s{ ($rx_search) }{$replace{$1}}xmsg;
print "after  +++$text+++ \n";
^Z
before ---Regular expressions have the undeserved reputation of being abstract and difficult to understand.
---
after  +++REgulAr ExprEssIons hAvE thE undEsErvEd rEputAtIon of bEIng AbstrAct And dIffIcult to undErstAnd.
+++

Update: This approach assumes each text file can be slurped to memory; 2-100 MB should be no problem. It also assumes the number of substitutions is "reasonable"; 150-1000 should be no problem. Care must be exercised in building the $rx_search regex if it is more complex than shown in the example; see haukex's article for tips on this. I have no idea how fast this approach is versus the one you're using now. Good luck :)
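
As an illustration of the care mentioned above, here is a minimal sketch of one common precaution when search strings overlap: ordering the alternatives longest-first so that a longer key is tried before any shorter key it contains. The replacement table and sample text are made up for the example and are not taken from the OP's data.

```perl
use strict;
use warnings;

# Hypothetical replacement table with overlapping keys.
my %replace = (
    'cat'     => 'dog',
    'catalog' => 'index',
    'a.b'     => 'A-B',
);

# Longest keys first, so 'catalog' is tried before 'cat'.
# quotemeta keeps the '.' in 'a.b' from acting as a wildcard.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a or $a cmp $b }
    keys %replace;
my $rx_search = qr/($alternation)/;

my $text = "catalog cat a.b axb";
(my $result = $text) =~ s/$rx_search/$replace{$1}/g;
print "$result\n";    # index dog A-B axb
```

Without the length-based sort, the order of the alternatives would depend on how the keys happen to be sorted, and 'cat' could match inside 'catalog' before the longer key gets a chance.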


Give a man a fish:  <%-{-{-{-<

Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list
by LanX (Saint) on Oct 02, 2022 at 10:02 UTC
    > This approach assumes each text file can be slurped to memory; 2-100 MB should be no problem

    The OP could slice the input into big chunks separated at newline boundaries.

    If that's not possible he could alternatively use a sliding window which always continues at the pos where the last replacement ended.
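
A rough sketch of the chunking idea, assuming that no search string contains a newline (so no match can straddle a chunk boundary). The in-memory filehandles, the tiny chunk size, and the replacement table are placeholders for the demonstration; real code would open actual files and use a much larger chunk size.

```perl
use strict;
use warnings;

# Hypothetical replacement table and search regex, built as in the parent post.
my %replace = (a => 'A', i => 'I', e => 'E');
my $alternation = join '|', map { quotemeta($_) } reverse sort keys %replace;
my $rx_search = qr/($alternation)/;

# In-memory filehandles stand in for real input and output files.
my $input  = join '', map { "line $_ has vowels: a e i\n" } 1 .. 5;
my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";

my $chunk_size = 32;    # tiny for the demo; assume many megabytes in practice

while (read($in, my $chunk, $chunk_size)) {
    # Extend the chunk to the end of the current line,
    # so that every chunk ends at a newline boundary.
    unless ($chunk =~ /\n\z/) {
        my $rest = <$in>;
        $chunk .= $rest if defined $rest;
    }
    $chunk =~ s/$rx_search/$replace{$1}/g;
    print {$out} $chunk;
}
close $out or die "Cannot close output: $!";
close $in;

print $output;
```

Because read and readline both go through Perl's buffered I/O layer, they can be mixed on the same handle here; sysread would not be safe to mix this way.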

    On a side note, your map qr{...} join ... irritated me a bit, because the processed list has only one element. Not sure if that's the clearest style.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      ... your map qr{...} join ... irritated me a bit, because the processed list has only one element.

      Yeah, that gets to me a bit too, whenever I use it. But that syntax is used in haukex's original article, so I'm willing to consider it an "idiom." :)

      The important point is that the regex elements be somehow converted into a regex object. It's at this stage that any necessary boundary assertions are added. The only reasonable alternative I can see is something like

      my $rx_search = join ' | ', map quotemeta, reverse sort keys %replace ;
      $rx_search = qr{ ... $rx_search ... }xms;
      That's slightly more irritating to me and doesn't seem to clarify anything either.
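
To make the point about boundary assertions concrete, here is a small sketch in which word boundaries are added at the moment the joined alternation is turned into a regex object. The keys and the sample sentence are invented for the illustration; which assertions are "necessary" depends entirely on the real search strings.

```perl
use strict;
use warnings;

# Hypothetical replacement table of whole words.
my %replace = (cat => 'dog', is => 'was');

# Step 1: build the plain alternation string.
my $joined = join '|', map { quotemeta($_) } reverse sort keys %replace;

# Step 2: convert it to a regex object, adding \b on both sides
# so that only whole words match.
my $rx_search = qr{ \b (?: $joined ) \b }xms;

my $text = "This cat is in the catalog.";
(my $result = $text) =~ s{ ($rx_search) }{$replace{$1}}xmsg;
print "$result\n";    # This dog was in the catalog.
```

The "is" inside "This" and the "cat" inside "catalog" are left alone because neither is delimited by word boundaries on both sides.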



        > $rx_search = qr{ ... $rx_search ... }xms;

        Ok it's somehow "wasting" a variable, but

        my $rx_search = qr{$joined_search}xms;

        wouldn't really irritate me.

        Cheers Rolf